Data Processing Made Simple - Your First Data Process with Maestro
Last week I wrote about why I built Neudesic Maestro - today I want to show it in action. Maestro was built to be a better execution configuration system, and in this post I will highlight how it does much of the work for you - leaving your data processes simple and easy to understand.
Let's create a common data process - extracting data from a SQL database and saving it as Parquet files. I will be connecting to an AdventureWorks database I have created, hosted on Azure SQL Database, and I will be using Microsoft Fabric pipelines to perform the actual data movement.
Creating our Maestro Module
Metadata-driven processes always start with execution configuration data. With Maestro, we can build modules (aligns to data processes), module properties (aligns to process parameters), and module items (aligns to process executions). The application gives us pages to create modules and module properties.
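As a rough mental model of that hierarchy (the names and shapes here are my own illustration, not Maestro's actual schema), you could sketch it like this:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a module defines a data process, its properties
# define the process parameters, and each module item is one execution
# of that process with concrete values.

@dataclass
class ModuleProperty:
    name: str          # e.g. "SchemaName"
    default: str = ""  # optional default value

@dataclass
class ModuleItem:
    name: str               # e.g. "s2pl.adv.SalesLT.Customer"
    values: dict[str, str]  # property name -> value for this execution

@dataclass
class Module:
    name: str
    properties: list[ModuleProperty] = field(default_factory=list)
    items: list[ModuleItem] = field(default_factory=list)

extract = Module(
    name="SqlToParquet",
    properties=[ModuleProperty("SchemaName"), ModuleProperty("TableName")],
)
extract.items.append(
    ModuleItem("s2pl.adv.SalesLT.Customer",
               {"SchemaName": "SalesLT", "TableName": "Customer"})
)
```

One module, many items: the process logic is written once, and each item supplies only the values that differ per execution.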
First, I add a new module to perform the SQL extraction to parquet:
Next, I add a module property for each process parameter that I will use in my pipeline:
Maestro provides us with some options as we define these module properties:
I find it very common to start with an idea of the needed parameters and then change them as I build out my process. Maestro makes it easy to rename, remove, or add module properties. What if our pipeline needs to change in the future? It's as easy as updating the module properties in Maestro to match the pipeline change.
Creating a Module Item
Now that we have defined our module, we need to define one or more module items. At runtime, these items are passed to our dynamic pipeline - extracting data from tables and placing that data into files. Let's start by creating a single module item in Maestro for the SalesLT.Customer table in the AdventureWorks database. First, a review of our module properties:
And here are the values for our SalesLT.Customer module item:
Looking at these values, you are probably surprised by the query, workspace, and lakehouse values. What's with the curly braces? Those are Maestro tokens. Let's dig deeper into tokens and building expressions with Maestro.
Maestro Tokenization
A core principle of Maestro is that it should make things easier for our developers. I found that data engineers would usually keep values in execution configuration systems very basic - "SELECT * FROM table" is a good example. And it's for good reason: when a system executes thousands of activities nightly, complex statements increase the maintenance burden. Do you want to update thousands of queries when you find an issue with your pattern?
That's where Maestro comes in. It is a smarter execution configuration system. It allows us to "push left" - the system should be able to do most of the thinking so our processes get those values back and run with them. This reduces process complexity tremendously. One way that Maestro is smarter is through its tokenization system. Within our module items, you can use tokens - text surrounded by curly braces - to indicate that you want to perform some kind of find and replace. Token types include:
Let's take a look at our SQL extraction query. It is stored as an environment variable named DefaultSqlServerQuery.
DECLARE @Watermark DATETIME2;
SET @Watermark = TRY_CONVERT(DATETIME2, NULLIF('{watermark}', ''));

DECLARE @WatermarkColumnName SYSNAME;
SET @WatermarkColumnName = NULLIF('{watermarkColumnName}', '');

IF @Watermark IS NULL
BEGIN
    SELECT {SelectColumnList} FROM {SchemaName}.{TableName};
END
ELSE IF @WatermarkColumnName IS NOT NULL
BEGIN
    DECLARE @sql NVARCHAR(MAX);
    SET @sql = N'SELECT {SelectColumnList} FROM {SchemaName}.{TableName} WHERE '
             + QUOTENAME(@WatermarkColumnName)
             + N' >= DATEADD(HOUR, {DefaultIncrementalHours}, @Watermark);';
    EXEC sp_executesql @sql, N'@Watermark DATETIME2', @Watermark;
END
ELSE
BEGIN
    RAISERROR('No valid option found. Please ensure the watermark column name is set properly.', 16, 1);
END;
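Conceptually, each {token} in the query above is just a find-and-replace against values Maestro already knows, applied repeatedly so that tokens inside a resolved value get resolved too. A minimal sketch in Python (my own illustration, not Maestro's actual implementation):

```python
import re

TOKEN = re.compile(r"\{([^{}]+)\}")

def resolve(text: str, values: dict[str, str], max_depth: int = 10) -> str:
    """Replace {token} markers with their values; a resolved value may
    itself contain tokens, so repeat up to max_depth passes."""
    for _ in range(max_depth):
        # unknown tokens are left untouched (m.group(0) keeps the braces)
        resolved = TOKEN.sub(lambda m: values.get(m.group(1), m.group(0)), text)
        if resolved == text:  # nothing left to replace
            return resolved
        text = resolved
    return text

values = {
    "SchemaName": "SalesLT",
    "TableName": "Customer",
    "SelectColumnList": "*",
    # a nested token: this template references other tokens
    "query": "SELECT {SelectColumnList} FROM {SchemaName}.{TableName};",
}
print(resolve("{query}", values))
# SELECT * FROM SalesLT.Customer;
```

The max_depth guard is the kind of safety net any nested-token scheme needs, since a value that references itself would otherwise loop forever.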
A few things you may notice:
This is very empowering for data engineers - far more so than a simple execution configuration system such as a CSV file:
This is very powerful, but can also be tricky. Once you go a few levels deep, it may be difficult to know what will be returned to our pipeline. To help with this, Maestro has a resolve feature on each module item. This shows the values that are returned by the uspGetModuleItem procedure. It looks like this:
Now that we have entered and validated what Maestro will return, let's build out our pipeline.
Building our Data Extraction Pipeline
The big question many of you are probably asking is - how do I get the data out of Maestro and use it in my pipeline? Maestro provides stored procedures that act as our interface into the system. Users do not need to, and should not, write custom queries to retrieve data from the system.
To get the module item information from Maestro, we run the below procedure:
EXECUTE [NeudesicMaestroRuntime].[uspGetModuleItem]
@ModuleItemName = 's2pl.adv.SalesLT.Customer',
@EnvironmentName = 'Dev';
This procedure returns the exact same properties as the previous image - the module item resolution feature within Maestro. This makes it easy to know exactly what will be returned to whatever data processing tool I use.
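Because the interface is just a stored procedure, any client can call it, not only a Fabric pipeline. Here is a hedged Python sketch against a DB-API 2.0 connection (pyodbc would work; the connection string and column names are placeholders, not Maestro's documented output):

```python
def get_module_item(conn, module_item_name: str, environment_name: str) -> dict:
    """Call uspGetModuleItem and return the first result row as a dict.
    `conn` is any DB-API 2.0 connection (e.g. one created by pyodbc)."""
    cursor = conn.cursor()
    cursor.execute(
        "EXECUTE [NeudesicMaestroRuntime].[uspGetModuleItem] "
        "@ModuleItemName = ?, @EnvironmentName = ?;",
        (module_item_name, environment_name),
    )
    row = cursor.fetchone()
    columns = [col[0] for col in cursor.description]
    return dict(zip(columns, row))

# Usage sketch (connection details are placeholders):
# import pyodbc
# conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...")
# item = get_module_item(conn, "s2pl.adv.SalesLT.Customer", "Dev")
# print(item["TargetFileName"])
```

Returning the row as a dict keyed by column name mirrors how the pipeline consumes the lookup output: by property name, not by position.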
Let's take a look at the Fabric pipeline.
Maestro allows our data processes to follow simple templates. This pipeline does the following:
You may be wondering - how do we reference the module item properties? It looks like this:
@activity('MaestroModuleItem').output.firstRow.TargetFileName
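For intuition, that expression is simply a path into the lookup activity's output JSON. In Python terms (a rough analogy, not Fabric's actual runtime; the field values are examples):

```python
# Hypothetical shape of the lookup activity's output, mirroring the
# properties returned by uspGetModuleItem:
activity_output = {
    "MaestroModuleItem": {
        "output": {
            "firstRow": {
                "TargetFileName": "Customer.parquet",
                "SchemaName": "SalesLT",
                "TableName": "Customer",
            }
        }
    }
}

# @activity('MaestroModuleItem').output.firstRow.TargetFileName
target_file = activity_output["MaestroModuleItem"]["output"]["firstRow"]["TargetFileName"]
print(target_file)  # Customer.parquet
```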
Very clean and very simple. Adding new module properties is just as easy: add the property in Maestro, then use it in the pipeline. There is no need to create new pipeline objects like variables or parameters.
The pipeline itself is also easy to understand. No extra steps for limited find and replace. No branching logic for full load vs incremental loads. No coming up with a custom solution for handling watermark values. Maestro makes it easy to build data processes.
Wrap Up
As a developer, I've found that building data processes with Maestro is fast and easy. Processes that would have taken days can be done in hours. I don't have to spend any time building out my execution configuration system, and Maestro's templates make getting started simple: I just copy the template and update the activities as needed. I spend my time focused on building out the data platform - not on the execution configuration system.
We have just scratched the surface of what an advanced execution configuration system can do. In the upcoming weeks we will explore scaling our data platform with Maestro: bulk updates, environment promotion, ADO integration, and automated module item creation. We will also look at how we can leverage a copilot to maintain data in our Maestro system.
Next week we will dig deeper into scaling this pipeline. We will add logging to the pipeline and leverage Maestro's bulk update system to create dozens of module items. We will wrap it up with promoting our module items to the production environment. See you next week!