Here I would like to share my experience of implementing Elasticsearch with Logstash. We carried out this project, a trend analysis application for beverages, for a company with a large user base.
The customer approached us with a performance issue on their website. All of their end users access this website for the daily statistics the company publishes. The statistics span two years and as many as 10 million records. The final statistics shown on the page are calculated values, a few thousand in number, generated from these 10 million records. The query execution time alone is around 20 to 25 seconds, and the client foresees the data growing to 20 million records. The client wants the response time to be below 5 seconds.
Analysis – We first tried to fine-tune the SQL queries, but the execution time did not come down below 20 seconds. We then considered a separate database to store the result sets, but that introduces additional work. Hence we decided to use Elasticsearch. Running in-house infrastructure for Elasticsearch could be costly, so we used the AWS Elasticsearch Service. The steps were:
- Create an AWS Elasticsearch domain
- Generate the target datasets in CSV format from the database
- Create the index templates in the AWS Elasticsearch domain using the Sense extension or shell scripts
- Create the Logstash configuration files that load the CSV files into the Elasticsearch index
- Run the Logstash commands to load the index with the CSV data
- Create a Lambda function that takes filter conditions and formats the final output
- Create a trigger for the Lambda function with AWS API Gateway; this provides a simple API that takes a simply formatted request and returns a formatted response
- Make sure the application exposes an API that returns the latest data
- Use that API to update the Elasticsearch index regularly
- Create an event scheduler on the Lambda function that calls the API and updates the index at predefined times, according to the schedule expression set
The AWS Elasticsearch domain was created first. We then exported the last two years of data from the database; the result was a CSV file of around 800 MB with 5 million records. Next we moved this data from the CSV file into the Elasticsearch index: we created a template for the index in the Elasticsearch domain and then used Logstash to load the CSV data into the index. The Logstash configuration file is shown below.
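# Read the CSV file from the beginning
input { file {
    path => 'c:\data\logdata.csv'
    start_position => 'beginning'
} }
# Split each line into named columns and drop the fields we do not index
filter { csv {
    columns => ['date', 'message', 'details']
    separator => ','
    remove_field => ['message', 'path', 'host', '@version', 'type', '@timestamp']
} }
# Index every event into Elasticsearch
output { elasticsearch {
    hosts => ['http://localhost:9200']
    action => 'index'
    index => 'myindexname'
    document_type => 'log'
} }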
Logstash is available as a zip download; unzip it to a path of your choice to use it. Write your configuration file (it uses Logstash's own configuration syntax rather than JSON, although I named mine config.json) and then execute the commands below to populate the index.
From the command prompt, go to the Logstash folder and then into the bin folder to run the commands. You can place the config file anywhere; I kept mine in the bin folder itself.
First check that the configuration file is valid using the command below:
logstash -t -f config.json
Now run one of the commands below to load the data into the index. The second form starts 10 pipeline worker threads, each processing events in batches of 1000.
logstash -f config.json
logstash -w 10 -b 1000 -f config.json
Running the above command loaded all the records into the AWS Elasticsearch domain.
If you are inside an office network, make sure Logstash can reach the Elasticsearch domain.
Why Logstash? Because of the large data size we tried Logstash, as it is a more reliable way to populate an index in less time. We could write our own code to parse the CSV and populate the index by calling the POST API, but this takes a considerable amount of time and is not advisable if your files are larger than 100 MB. We tried Logstash with around 800 MB (5 million records) and it worked well.
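For comparison, here is a minimal sketch of the "write our own code" route mentioned above, using only the Python standard library. It only builds request bodies in the newline-delimited format that the Elasticsearch bulk API expects; sending them is left out. The function name `bulk_chunks` and the batch size are my own assumptions, the column names are taken from the Logstash configuration, and the `_type` entry is only needed on older Elasticsearch versions that still use document types.

```python
import csv
import io
import json
from typing import Iterable, Iterator, List, Optional

# Column names taken from the Logstash csv filter in this article.
COLUMNS = ["date", "message", "details"]


def bulk_chunks(
    lines: Iterable[str],
    index: str,
    doc_type: Optional[str] = None,
    batch_size: int = 1000,
) -> Iterator[str]:
    """Yield bulk API request bodies with at most `batch_size` documents each."""
    meta = {"_index": index}
    if doc_type is not None:
        meta["_type"] = doc_type  # assumption: only for versions with types
    action_line = json.dumps({"index": meta})

    buffer: List[str] = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        # Each document is an action line followed by its source line.
        buffer.append(action_line)
        buffer.append(json.dumps(dict(zip(COLUMNS, row))))
        if len(buffer) >= 2 * batch_size:
            yield "\n".join(buffer) + "\n"  # the body must end with a newline
            buffer = []
    if buffer:
        yield "\n".join(buffer) + "\n"


if __name__ == "__main__":
    sample = io.StringIO("2016-01-01,first,a\n2016-01-02,second,b\n")
    for chunk in bulk_chunks(sample, "myindexname", doc_type="log"):
        print(chunk, end="")
```

Each yielded string would be sent as the body of a POST to the domain's `_bulk` endpoint. Even with batching, this route leaves retries, back-pressure and restart handling to you, which is what Logstash already takes care of.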
What are the advantages? If you already have an application, its data grows over time and the application starts to slow down. In that situation Elasticsearch is a straightforward way to get better performance, and Logstash makes it easy to load millions of records.
What are the challenges? While loading data, most of the problems I faced were in parsing a few field types such as dates and JSON data. Make sure the proper format is applied in the mapping created in Elasticsearch; otherwise Logstash falls back to its own default template and every field is treated as a string.
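As an illustration of such a mapping, the sketch below builds an index template body with an explicit date field. It is written under several assumptions that are not from the project itself: the legacy template layout of Elasticsearch 5.x (a `template` pattern key and a mapping nested under the document type), a `yyyy-MM-dd` date format, and a plain text type for the `details` field. Newer Elasticsearch versions use a different template layout, so treat this as a starting point only.

```python
import json
from typing import Any, Dict


def build_index_template(pattern: str, doc_type: str, date_format: str) -> Dict[str, Any]:
    """Return a legacy-style (5.x, assumed) index template body."""
    return {
        "template": pattern,  # index name pattern the template applies to
        "mappings": {
            doc_type: {
                "properties": {
                    # Explicit date mapping so the value is not stored as a string.
                    "date": {"type": "date", "format": date_format},
                    # Assumption: details is free text; adjust to your data.
                    "details": {"type": "text"},
                }
            }
        },
    }


template = build_index_template("myindexname*", "log", "yyyy-MM-dd")
print(json.dumps(template, indent=2))
```

The printed JSON could be pasted into Sense, or sent from a shell script, as the body of a PUT request to the `_template` endpoint before Logstash is run.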
While populating, we sometimes deleted the index and ran the Logstash command again, but no data was loaded. The reason is the sincedb file in the Logstash folder "D:\logstash-5.0.1\data\plugins\inputs\file": it records how far the CSV file has already been read, so that only new changes to the CSV are pushed to the index. If you delete the index and want to repopulate it, you need to delete this file each time.
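If you repopulate often, this clean-up can be scripted. The helper below is a small sketch with a hypothetical name, `clear_sincedb`; it assumes the sincedb files carry the default `.sincedb` name prefix that the file input uses, and that Logstash is stopped while it runs.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def clear_sincedb(directory: Union[str, Path]) -> List[str]:
    """Delete Logstash sincedb files in `directory` and return their names."""
    removed = []
    # Collect the matches first so we are not deleting while scanning.
    for entry in list(Path(directory).glob(".sincedb*")):
        if entry.is_file():
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)


if __name__ == "__main__":
    # Demonstration on a throwaway directory rather than a real install.
    demo = Path(tempfile.mkdtemp())
    (demo / ".sincedb_0123abcd").write_text("")
    (demo / "other.txt").write_text("")
    print(clear_sincedb(demo))
```

In the setup described here, the function would be called with the folder named above before rerunning the load command.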
What are the alternatives? An existing application that has been running for a long time may have accumulated a very large amount of data. To bring Elasticsearch into such an application, the options include:
- River, a plug-in that connects to the database and populates the index. This plug-in is now obsolete.
- Writing your own code to move data into the Elasticsearch index, for example in Node.js: connect to your database, query the data you want to index, and then make POST calls to the Elasticsearch domain with that data.
- Using AOP (aspect-oriented programming) to make an API call to the Elasticsearch index on every CRUD operation on your data, so that the index stays up to date.
AWS Elasticsearch Service
Creating the service in AWS is a simple process, and the Elasticsearch domain is available in a short time. Make sure proper security policies are configured. Querying the Elasticsearch index was then as easy as querying any other database, and calculated field results came back faster than from the other databases. The query syntax is not straightforward, though, and needs some practice.
AWS Lambda Function
These functions are simple to write; the input to and the output from the Elasticsearch index can be customised inside the function. To trigger the function we can use AWS API Gateway.
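To make the idea concrete, here is a hedged sketch of such a function in Python. It is not the project's actual code: the endpoint URL, the `from` and `to` filter names, the result size and the response shape are all assumptions, and request signing for a secured domain is left out. It assumes the API Gateway proxy integration, where the request body arrives as a JSON string in `event["body"]`, and that the filtered fields are indexed as exact-match values.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

ES_ENDPOINT = "https://example-domain.es.amazonaws.com"  # hypothetical endpoint
INDEX = "myindexname"


def build_query(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Turn simple filter conditions into an Elasticsearch bool/filter query."""
    clauses = []
    date_range = {}
    for field, value in sorted(filters.items()):
        if field == "from":  # assumed name for the start of the date range
            date_range["gte"] = value
        elif field == "to":  # assumed name for the end of the date range
            date_range["lte"] = value
        else:
            clauses.append({"term": {field: value}})
    if date_range:
        clauses.append({"range": {"date": date_range}})
    return {"size": 100, "query": {"bool": {"filter": clauses}}}


def format_hits(es_response: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce the raw search response to the shape the website needs."""
    hits = es_response.get("hits", {}).get("hits", [])
    return {"count": len(hits), "items": [hit["_source"] for hit in hits]}


def http_search(query: Dict[str, Any]) -> Dict[str, Any]:
    """POST the query to the _search endpoint (no request signing here)."""
    request = urllib.request.Request(
        f"{ES_ENDPOINT}/{INDEX}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def handler(event: Dict[str, Any], context: Any,
            search: Callable[[Dict[str, Any]], Dict[str, Any]] = http_search) -> Dict[str, Any]:
    """Lambda entry point: parse filters, query the index, format the reply."""
    filters = json.loads(event.get("body") or "{}")
    result = format_hits(search(build_query(filters)))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```

Passing the search function in as a parameter keeps the query building and response formatting testable without a live domain. Lambda itself only supplies `event` and `context`, so the default HTTP call is used in production.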
AWS API Gateway
This is a simple AWS resource used to trigger AWS Lambda functions on demand with a defined request input.
AWS Event Scheduler
This is a cron-style scheduled task that can be set up on a Lambda function to run it at predefined times.
Conclusion – The performance target was achieved without touching any of the existing source code, using a few APIs that replace their existing APIs. We could have created the Elasticsearch index and simply given them the search URLs, but building Elasticsearch queries and handling the response format would have required more changes on the application side. Hence we used AWS Lambda functions and AWS API Gateway to handle formatted requests and responses, which kept the project running smoothly.