Generate proper schema files for your data tables in Azure Data Factory
In this post I'll try to fill in the gaps on how to use Azure Data Factory for daily data transfers between Microsoft Sentinel and Azure Data Explorer, and more specifically how to properly generate your data table schemas using a simple Kusto query in Azure Monitor, so you can avoid wrong data mapping or losing data on a daily basis.
SUMMARY
If you are a Microsoft Sentinel user, or you are at least considering it as an option for your business, you should know that the included retention period (without additional fee) is up to 90 days. After that period it may become a little more costly than you'd expect, and Microsoft knows that, so they've created a whole chapter in their Microsoft Sentinel How-tos section dedicated to addressing that small inconvenience using Azure Data Explorer (ADX). To be completely honest with you, it is pretty well crafted material, providing you with different options for execution and a lot of flexibility, which is typical for Microsoft.
That said, I was recently tasked with integrating Microsoft Sentinel with ADX for a client, and due to various specifics of their environment and the limitations of the different scenarios, the designated integration method was to utilize an Azure Storage Account and Azure Data Factory (ADF) to complete the task. This is where I noticed a serious gap in Microsoft's documentation about the practical aspects of using ADF for data transfers in Microsoft Sentinel's context and specifics.
THE PROBLEM
The process in ADF is relatively easy, and following the existing guides, creating a dataset out of an existing file in your Azure Blob Storage account is nothing more than a few clicks:
After navigating to the target location and selecting the designated file, ADF gives you a few options for creating the dataset schema:
Use the already selected file from the Storage Account (the pre-selected option)
Use a manually crafted schema file
Do not import any schema
As it's supposed to be a simple process, the most obvious choice is to let the system create the schema based on the existing data in the selected file. And here's where the pain comes from!
I learned the hard way that using a single file from the Storage Account to generate the source schema is not effective at all when you're working with logs. One of the reasons is that the files in the Storage Account are generated every 5 minutes and, depending on the data table, they may or may not contain every single event type, which leads me to the main problem I was able to identify. At the risk of oversimplifying the issue, I'll show the initial and the final states of my SecurityEvent table schema files and the number of fields they were generated with. For better distinction I'll call them as follows:
Auto-generated schema: The initial schema that was generated using the file in the Storage Account (described above)
Final schema: The schema generated by a refined method I came up with while trying to find a golden schema standard for my data tables.
|               | Auto-generated schema | Final schema |
| Columns       | 40                    | 225          |
| Type.string   | 40                    | 214          |
| Type.int      | 0                     | 9            |
| Type.datetime | 0                     | 2            |
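If you're curious where those "Final schema" numbers come from, a quick aggregation on top of getschema in the Log Analytics Workspace gives you the same kind of breakdown (a minimal sketch against the SecurityEvent table; the exact counts will of course depend on your own workspace):
//Count the Columns per type, straight from the Log Analytics Workspace
SecurityEvent
| getschema
| summarize Columns = count() by ColumnType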
Let's be clear, I'm not stating that all 225 fields will be in use on a daily basis, because they relate to different events, and it's not necessary that the same events are generated every day. But it's really important to keep in mind that these events may occur at any time, and if the schema is not complete, there is a chance of either triggering a failure in the Data Factory Copy data activity or scrambling the data in the ADX table due to the lack of proper mapping between the source and destination Columns. Not to mention the rare case, which I was lucky enough to experience myself, of having a table with a Column of type int32 that once every week or two gets a couple of int64 values (e.g. SentBytes).
Oh yes, you guessed it! The result is a big fat error that basically says it can't fit an int64 value into an int32 ColumnType.
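If you suspect the same thing is happening to you, a quick check in the Log Analytics Workspace can show whether a numeric Column occasionally goes beyond the int32 range. This is just a rough sketch - SentBytes comes from my case, but the CommonSecurityLog table name here is only an example, so swap in your own table and Column:
//Count values that would not fit into an int32 Column (max 2147483647)
CommonSecurityLog
| where tolong(SentBytes) > 2147483647
| summarize OversizedValues = count()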
Yet another potential problem that may occur as a result of that difference in the schema files (of course I got that one too) is wrong Column value mapping in the destination data table in ADX.
And here is where the rabbit hole starts. I had a lot of discussions with Microsoft engineers trying to find the best way to fix that problem, but all the solutions eventually ended up at the obvious one - to drop the current progress and start over. Believe it or not, I tried everything in my attempts to fix the problems in the targeted ADX table. I manually corrected the differences in the table schema, but then the mapping in the ADF Copy data activity started complaining about wrong ColumnTypes. Adjusting the ADF mapping and rerunning the task produced an error about inconsistencies in the source schema, so I started to correct that as well, and even when something started working, something else suddenly stopped, like a vicious circle.
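To give an idea of what that manual patching looks like in practice, here is roughly the kind of ADX management command involved - inspecting the destination table's schema and changing a single Column type by hand (a sketch with placeholder names, so adjust MyTable and SentBytes to your own case):
//Inspect the current schema of the destination table in ADX
.show table MyTable cslschema
//Change the type of a single Column by hand
.alter column ['MyTable'].['SentBytes'] type=long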
In the end I gave up! The easiest way really was to drop everything and start over, but that's actually the moment the breakthrough came.
THE SOLUTION
During that madness of fixing issues here and there, I decided to go back to the source of the data. Not the Storage Account data, but the Log Analytics Workspace I was getting the data from. If there was a schema that could be considered a golden standard for any of my data tables in ADX, it had to be the one that's already in the Log Analytics Workspace.
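Pulling that schema out of the workspace is a one-liner with getschema (shown here for SecurityEvent, but it works the same for any table):
//Get the raw schema of the table from the Log Analytics Workspace
SecurityEvent
| getschema
| project ColumnName, ColumnType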
Getting the schema from the Log Analytics Workspace was the easy part, but then I realized it had to be in .JSON format so I could import it. As that's not an out-of-the-box feature, I had to export the data to .CSV, copy the ColumnNames and, based on the ColumnType, set the proper values. It took me a while to finalize and polish the schema, but it was worth it. I imported the schema into the Source dataset, created a new table in ADX using the freshly developed schema file and fixed the mapping in the ADF Copy data activity so it would use the new schema. After a while the job finished successfully and the results were more than satisfying:
All the Columns were properly recognized and the data was populated as expected, and then it hit me - I have to do that for all the tables I want to store in ADX, and they are many, 30 to start with!
OPTIMIZATION IS THE KEY
To speed up the process of generating the schema for all the tables, I wanted to use the native tooling provided by Microsoft, so I decided to create a Kusto query that generates a .JSON-ready result I can use. Here is what I came up with:
//GET Schema in JSON, ready for ADX Integration
SecurityEvent
| getschema
//Replace each ColumnType with a sample value of the matching JSON type
| extend Export = strcat(
    '{',
    '"', ColumnName, '"',
    ':',
    iff(ColumnType == "string", "\"string\"",
     iff(ColumnType == "int", "11",
      iff(ColumnType == "long", "1111111111111111",
       iff(ColumnType == "real", "11.1",
        iff(ColumnType == "datetime", "\"2001-01-01T01:01:01.101Z\"",
         iff(ColumnType == "bool", "true",
          iff(ColumnType == "dynamic", "\"string\"", ColumnType))))))),
    '}')
| summarize make_bag(todynamic(Export))
I separated the Export expression onto multiple lines to make it easier to read, but you can always fit it on a single line and it will work exactly the same way.
To clarify the need for the iff() function: I'm using it to replace the different ColumnTypes with the respective sample values I would expect them to hold, so the result works properly when imported as a schema file. For a string that would be "string", for an integer a value of 11, and so on. The rest of the expression should be pretty clear by now - it's a simulation of a .JSON structure:
{ "ColumnName": }
Once the query is ready comes the easy part: you can either export the result to a .CSV file and get the value out of it, or you can copy it directly from the query results in Azure Monitor and create your new .JSON file with it:
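The end result is a single property bag covering every Column in the table. Heavily truncated, the .JSON file for my SecurityEvent table looks roughly like this (only a handful of the 225 Columns shown):
{
  "TenantId": "string",
  "TimeGenerated": "2001-01-01T01:01:01.101Z",
  "Computer": "string",
  "EventID": 11,
  "Activity": "string"
}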
Using this method saved me hours of mapping various Columns and values, and most importantly it provided me with sustainable schema files that not only fixed my initial problems with the data transfer, but also helped me speed up the onboarding process for every new data table that needs to be kept in ADX for a longer period.