Agile Delivery for Large Scale Data Ingestion using Azure Data Explorer


When it comes to data, one of the most common issues we see is companies trying to join data from a variety of sources into a “single source of truth.” Databricks and Customer Insights were built to help people take data from disparate sources and combine it, but they carry significant overhead in configuration maintenance, cost, and a host of other factors.

How do we get started quickly, support streaming intake, and have persisted data storage, all at low cost? How do we show value to our stakeholders by surfacing data in reports as we consume it? Enter Azure Data Explorer (ADX). Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis of large volumes of data streaming from applications, websites, IoT devices, and more. You can use Azure Data Explorer to collect, store, and analyze diverse data to improve products, enhance customer experiences, monitor devices, and boost operations. ADX offers a different paradigm from standard aggregation systems: instead of combining all data in your lake, it seamlessly handles ingestion and allows you to query raw data.

Architecture Under the Hood

Azure Architecture Diagram

With many built-in connectors and ingestion pipelines, you can ingest your data with zero code and start exploring. With the right SKU for the right workload, you can “start small” and grow with your data and its adoption.

A Real-World Use Case

Recently the Cognizant Microsoft Business Group delivered an ADX solution for a major managed general underwriting insurance company. For simplicity’s sake, their data sources are reflected below:

  1. Policy Management system
  2. On-premises invoicing system that also has legacy policy data
  3. Flat file uploads of claims information

ADX diagram

For on-premises data, we use an Azure Data Factory integration runtime to pull information from the on-premises store and write to ADX, populating working tables in the ADX database.

For API data ingestion, the team created a basic Azure Function to query the API of the policy management system. Unfortunately, this API used legacy SOAP protocols with certificate authentication, so we needed the .NET Core web services library to manage the request effectively; that was the sole reason we used Azure Functions, as this could otherwise have been an Event Grid subscription or a Logic App. The Function App writes Parquet files to an ADLS Gen2 data lake, and using Event Grid/Event Hubs we push the file data to ADX through its data connection support.
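On the ADX side, this kind of Event Grid data connection needs a target table and an ingestion mapping. As a sketch only: the table name `api_stage_PolicyInfo` appears later in this post, but the columns beyond `PolicyNumber` are illustrative assumptions, not the client's actual schema:

```kql
// Illustrative staging table for the Parquet files landed by the Function App
.create table api_stage_PolicyInfo (PolicyNumber: string, EffectiveDate: datetime, Premium: decimal)

// Ingestion mapping the Event Grid data connection references when reading Parquet
.create table api_stage_PolicyInfo ingestion parquet mapping 'PolicyInfo_Mapping'
    '[{"column":"PolicyNumber","path":"$.PolicyNumber"},{"column":"EffectiveDate","path":"$.EffectiveDate"},{"column":"Premium","path":"$.Premium"}]'
```

With the table and mapping in place, the data connection picks up new blobs automatically, which is what makes the pipeline "zero code" after the Function App.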

Now that third-party data and on-premises data are under one platform (ADX), we have data insight. This is “starting small”: we can create exports of the data sets that we believe will inform data models used in future processes.

At this point, you might be wondering: how do I unify data sets and create models?

Enter KQL. Kusto Query Language (KQL) is the native ADX query language. It offers a large set of operators for filtering rows, joining datasets, grouping and aggregating, and much more, letting you query the data stored in the ADX cluster without modifying data or metadata. KQL is also the language used by many other analytics and security products such as Application Insights, Log Analytics, and Azure Sentinel. It is the type of language that bridges the gap between set-based thinking and PowerShell development, and for that reason it is much beloved by data wonks. Using KQL we can shape some of our datasets into a canonical model to start basic reporting and providing data insight.
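As a minimal illustration of those operators, a query over the on-premises working table (using column names that appear later in this post; the 90-day window is an arbitrary example) can filter, aggregate, and sort in a few lines:

```kql
// Filter to recent transactions, then aggregate premium by transaction code
rz_OnPrem_Database_Master_Dataset
| where TRANDATE > ago(90d)
| summarize TotalPremium = sum(Premium), Policies = dcount(POLICYNUM) by TRANCODE
| order by TotalPremium desc
```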

KQL is similar to SQL, but keep in mind that the schema and query engine are far more performant; with the right setup, ADX can process petabytes of data in minutes using the same technology that runs Application Insights. SQL is also a supported language for querying data stored in ADX.

At this point, you have some data shaping going on. This allows an engineer to get it in front of business owners and make sure that the sprint activities of data collection are on track. Did we validate our use case? Are we missing data? Do we need to extract from more sources or create reference tables to give the data a “better” shape?

```kql
rz_OnPrem_Database_Master_Dataset
| project INVOICENUM, TRANCODE, POLICYNUM, TRANDATE, EFFDATE, POLEFFECTIVE, POLEXPIRE, Fee, Premium, Tax
| join kind=inner (api_stage_PolicyInfo | extend POLICYNUM = PolicyNumber) on POLICYNUM
| join kind=inner (rz_OnPrem_stage_Claim | extend POLICYNUM = PolicyNumber) on POLICYNUM
```

During the shaping process, we can apply business rules to the data, adding data quality rules that standardize values to given formats. The KQL function library is very robust and is constantly evolving.
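As a hedged sketch of such a rule, standardizing a policy number and coercing a date column during staging might look like the following; the transformations and the `EffectiveDate` column are assumptions for illustration, not the client's actual rules:

```kql
// Illustrative data-quality rules: normalize casing/whitespace and coerce types
api_stage_PolicyInfo
| extend PolicyNumber = toupper(trim(@"\s+", PolicyNumber))
| extend EffectiveDate = todatetime(EffectiveDate)
```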

This lends itself to the Agile approach: we can show continuous value by adding to the canonical model, introducing new data models, and starting data-shaping conversations.

As we shape this data, we store it in a refined zone and begin thinking about what deployment to a traditional SQL data warehouse would look like.
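Persisting shaped results into a refined-zone table can be done with an ingest-from-query command; the target table name `rz_Refined_PolicyMaster` and the projected columns here are hypothetical examples:

```kql
// Materialize the shaped join into a refined-zone table (created if it does not exist)
.set-or-append rz_Refined_PolicyMaster <|
    rz_OnPrem_Database_Master_Dataset
    | project INVOICENUM, POLICYNUM, TRANDATE, Premium
    | join kind=inner (api_stage_PolicyInfo | extend POLICYNUM = PolicyNumber) on POLICYNUM
```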

Things to Consider

They say that with great power comes great responsibility. What we have found as data sets get larger and larger is that you cannot simply throw more power at the problem. With traditional ADF or Synapse you could do that; you would pay more, and the net may seem negligible, but you don't want three engineers trying to performance-tune the solution.

Though KQL offers many optimizations out of the gate, some query-time optimization may be needed to solve the problem without spending extra money. As a data engineer, a good guideline is to check your query performance early and follow query best practices. Ensure that all nodes in the cluster share the workload correctly: are you using the right join with the right key and shuffle hints? Ensure that data is spread evenly.
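For example, when the join key has high cardinality, a shuffle strategy hint distributes the join across cluster nodes instead of funneling it through one. A sketch using the tables from the earlier query:

```kql
// Shuffle the inner join across cluster nodes for a high-cardinality key
rz_OnPrem_Database_Master_Dataset
| join kind=inner hint.strategy=shuffle (
      rz_OnPrem_stage_Claim
      | extend POLICYNUM = PolicyNumber
  ) on POLICYNUM
```

Whether a shuffle helps depends on key cardinality and data distribution, so it is worth comparing query statistics with and without the hint.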

The ADX product team has been phenomenal in helping us understand query optimization and the Kusto language in general. I was impressed how quickly the client and my team were able to navigate Kusto, build more insight into the data model and therefore drive more discussion.

Overall, the benefits of using Azure Data Explorer included a quick solution port, a straightforward shift from SQL to KQL, and a unified data platform that helped our group offer large and sophisticated reporting with greater performance at a reduced cost to our client.
