Lessons learned from building a large scale historical data analysis system using Azure Data Explorer - Part 1
notes about some lesson learned from building a large scale historical data analysis system that has hundreds of terabytes data using Microsoft Azure Data Explorer - Part I
Recently I had an opportunity to participate in another data project that also uses Azure Data Explorer(ADX) as the core data process and store engine. In this project we used ADX to ingest and process more than half of petabytes data. Like most projects we were under some time and resource constraints, and also encountered a few unexpected technical challenges due to the constraints. Though we couldn’t implement the system using the best-optimized architecture (it will take too much time than the project was allowed), we still managed to achieve the project goal. It’s an exciting and fun journey and here are a few lessons we learned.
Lesson 1 Select proper SKU
There are couples different SKU options for ADX, some are more CPU optimized like D-Series (D-series, Ds-series) VM, they have more powerful CPU; some are more storage optimized like Ls-Serious VM, they are equipped with larger SSD to achieve more I/O performance. Have a testing plan to test the key user query patterns on these different type of VMs and check which one is best for your query workload can benefit the project in the long run.
Lesson 2 Check different ingestion options.
In Azure Data Explorer, it supports several different ingestion solutions, the decision will depend on the purpose and stage of your development. You can check here for detail information of these solutions.
It’s better to read through the official document, understand their differences before making a decision. Meanwhile, here are a few thumb rules:
- For query testings, verifying scripts, tables, you can use Inline ingestion (push)
- For ad-hoc feature engineering, data cleaning, you can use Ingest from query
- For ingestion testing, create some volumes of data, you can use Ingest from storage (pull)
- For production ingestion pipeline testing, I normally will use Queued ingestion.
In addition to the above basic ingestion options, you can also check the following options based on your scenario and environment.
And there is a new ingestion option One-Click Ingestion which are just been announced in Microsoft Build 2020.
Eeek, a lot of choices. :p
Lesson 3 Have a clean ingestion pipeline
Normally when trying to ingest data into a data repository, we might need to do some data pre-process such as check file format, clean dirty data, do a few data transformation, etc. Azure Data Explorer provides some of these capabilities through User-defined function and update policy. You can use these mechanisms to quickly perform some data pre-process tasks within ADX without setup extra computing services to handle it.
While these are convenient ways to massage data, it will occupy ADX’s resources and potentially introduce more data fragmentation. There are complex mechanisms within ADX to optimize the resources it has, maintains and organize data in its storage and keep the system in a healthy status.
Under the condition that data ingestion volume is big, these data pipeline activities could impact the resource available for ADX to handle queries or do internal housekeeping tasks. You might need to carefully monitor ADX status and do a few fine tunes on its configuration. While I did use these mechanisms for other smaller-scale projects and love it, it is still a better practice just to keep the data pre-process tasks outside of ADX in large scale project, at least before you are very familiar each internal mechanisms within ADX.)
Data Factory, Databricks, Azure Functions/App services, AKS, HDInsight provide good foundational capabilities to pre-process data.
In my git project I shared some codes that I used Azure Functions to do Queued ingestion.
Lesson 4 Evaluate the horizontal scale (# of servers) needed
When planning system roll-out, a solid estimation of the number of servers needed and a well-estimated expansion plan for the future is important. It can also help save costs by preventing under-utilization and provide valuable information for system design. System scale out/ horizontal scaling is one of the core capabilities that we can make sure the system can be adaptive to the workload and provide just enough resource to the users. In our test, one of the key ADX features that users love is it can provide almost linear performance growth when scaling out. ADX also provides non-destructive services when scaling out.
It’s suggested that you should run a few workload simulations and test about how the scale-out can increase your system capabilities.
ACI and AKS can be helpful if you want to simulate the system workload.
Lesson 5 Understand and test KQL query performance
One of the key strengths of ADX is its powerful query language Kusto Query Language (KQL). Like most other data query languages, some query operators in KQL are similar, give you the same result but could have a huge difference in performance.
It’s always good the validate what’s your key query scenarios and try how different query syntax impact the query performance.
You can review the query performance using Kusto.Explorer tool or ADX Web UI. You can also use .show queries operator to review the performance of historical queries.
.. to be continued