Data Processing Made Simple - Your First Data Process with Maestro

Last week I wrote about why I built Neudesic Maestro - today I want to show it in action. Maestro was built to be a better execution configuration system. I will highlight how Maestro does a lot of the work for you, leaving your data processes simple and easy to understand.

Let's create a common data process - extracting data from a SQL database and saving the data as parquet files. I will be connecting to an AdventureWorks database I have created, hosted on an Azure SQL Database. And I will be using Microsoft Fabric pipelines to perform the actual data movement.

Creating our Maestro Module

Metadata-driven processes always start with execution configuration data. With Maestro, we can build modules (aligns to data processes), module properties (aligns to process parameters), and module items (aligns to process executions). The application gives us pages to create modules and module properties.
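
To make the hierarchy concrete, here is a simplified Python model of modules, module properties, and module items. This is purely my own illustration - the real Maestro schema lives in SQL Server, and these class and field names are assumptions, not Maestro's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ModuleProperty:
    """A process parameter: name, type, default value, ordering, and category."""
    name: str
    prop_type: str = "string"        # string | integer | decimal | boolean | code
    default_value: str = ""
    order: int = 0
    category: str = "configuration"  # configuration | runtime

@dataclass
class Module:
    """A data process: owns its properties and its items (executions)."""
    name: str
    properties: list = field(default_factory=list)
    items: list = field(default_factory=list)

    def add_item(self, item_name):
        # New items start from each property's default value.
        values = {p.name: p.default_value for p in self.properties}
        self.items.append({"name": item_name, "values": values})
        return self.items[-1]
```

The key relationship: properties are defined once on the module, and every module item gets a value slot for each property, seeded from the default.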

First, I add a new module to perform the SQL extraction to parquet:

[Image: The module to power our SQL extraction to parquet files data process.]

Next, I add a module property for each process parameter that I will use in my pipeline:

[Image: The module properties that will be passed to our dynamic pipeline.]

Maestro provides us with some options as we define these module properties:

  • Type: Maestro supports strings, integers, decimals, booleans, and code. The code type allows users to edit the value in a large code screen.
  • Default Value: This allows us to specify an initial value for the property for existing and new module items.
  • Order: Execution configuration data is easiest to think of and read in tabular format. This lets us put each property in an order that is easy to understand.
  • Category: Supports configuration and runtime. Most properties are configuration, meaning the user sets and maintains the value. Runtime is useful for properties like a watermark, which automated processes typically control.
  • Actions: If we add a property and no longer need it, we can delete it. Note that everything in Maestro is a soft delete. We can get the property back easily if needed.

I find it very common to start with an idea of the needed parameters and then change them as I build out my process. Maestro makes it easy to rename, remove, or add module properties. What if our pipeline needs to change in the future? It's as easy as updating the module properties in Maestro to match the pipeline change.

Creating a Module Item

Now that we have defined our module, we need to define one or more module items. At runtime, these items are passed to our dynamic pipeline - extracting data from tables and placing that data into files. Let's create a single module item in Maestro for the SalesLT.Customer table in the AdventureWorks database, starting with a review of our module properties:

  • SourceQuery - This is the query that we will execute on the AdventureWorks database
  • FabricWorkspaceId - Our pipeline will allow us to define our lakehouse target dynamically, so we need to give the pipeline a workspace ID.
  • TargetLakehouse - This will be the GUID of the lakehouse we want to store our parquet file in. Note that Fabric does not yet support names when supplying a value dynamically.
  • TargetDirectory - This is the folder structure we will use to store the file.
  • TargetFileName - The name of the parquet file we are creating.
  • SchemaName - The source schema name. This is used to build our query expression.
  • TableName - The source table name. This is used to build our query expression.
  • SelectColumnList - The columns we wish to select. This is used to build our query expression.
  • WatermarkColumnName - The column name used in our incremental query. This is used to build our query expression.
  • Watermark - The watermark value. This is used to build our query expression.

And here are the values for our SalesLT.Customer module item:

  • SourceQuery - {DefaultSqlServerQuery}
  • FabricWorkspaceId - {FabricWorkspaceId}
  • TargetLakehouse - {BronzeLakehouseGuid}
  • TargetDirectory - AdventureWorks/SalesLT/
  • TargetFileName - Customer.parquet
  • SchemaName - SalesLT
  • TableName - Customer
  • SelectColumnList - *
  • WatermarkColumnName - ModifiedDate
  • Watermark - (not set as it has not yet executed)

Looking at these values, you are probably surprised by the query, workspace, and lakehouse values. What's with the curly braces? Those are Maestro tokens. Let's dig deeper into tokens and building expressions with Maestro.

Maestro Tokenization

A core principle of Maestro is that it should make things easier for our developers. I found that data engineers would usually keep values in execution configuration systems very basic - "SELECT * FROM table" is a good example. And it's for good reason. We have a system that will execute thousands of activities nightly. If we use complex statements, maintenance may increase. Do you want to update thousands of queries when you find an issue with your pattern?

That's where Maestro comes in. It is a smarter execution configuration system. It allows us to "push left" - the system should be able to do most of the thinking so our processes get those values back and run with them. This reduces process complexity tremendously. One way that Maestro is smarter is through its tokenization system. Within our module items, you can use tokens - text surrounded by curly braces - to indicate that you want to perform some kind of find and replace. Token types include:

  • Environment Variables - Do you have a value that may need to change environment-to-environment? Create an environment variable and reference it in your module item. Now the value remains the same in each environment (i.e., easy promotion), but the runtime value will be what it should be, depending on the environment you are in.
  • Constants - Do you need to build a file path that includes date parts, like years, months, etc.? Maestro will replace these tokens with the proper value using the current execution time.
  • Other Module Properties - Do you want to reference another module property of the same module? Maestro can use other properties in its token system as well. This allows us to build expressions. This is a very powerful feature.
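
Conceptually, token resolution is a recursive find-and-replace across those three sources. Here is a rough Python sketch of the idea - my own simplification for illustration; the real resolution happens inside Maestro's stored procedures, and the function and constant names here are assumptions:

```python
import re
from datetime import datetime

TOKEN = re.compile(r"\{([^{}]+)\}")

def resolve(value, env_vars, properties, now=None):
    """Repeatedly replace {Token} markers using environment variables,
    date-part constants, and sibling module properties."""
    now = now or datetime.now()
    constants = {"Year": f"{now.year:04d}",
                 "Month": f"{now.month:02d}",
                 "Day": f"{now.day:02d}"}

    def lookup(match):
        name = match.group(1)
        for source in (env_vars, constants, properties):
            if name in source:
                return str(source[name])
        return match.group(0)  # unknown tokens pass through unchanged

    for _ in range(10):  # depth cap guards against circular references
        resolved = TOKEN.sub(lookup, value)
        if resolved == value:
            break
        value = resolved
    return value
```

The loop is the important part: an environment variable can itself contain tokens that reference module properties, so the sketch keeps substituting until nothing changes. That is what "evaluating tokens at many levels" looks like.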

Let's take a look at our SQL extraction query. It is stored as an environment variable named DefaultSqlServerQuery.

DECLARE @Watermark DATETIME2; SET @Watermark = TRY_CONVERT(DATETIME2, NULLIF('{watermark}', ''));
DECLARE @WatermarkColumnName SYSNAME; SET @WatermarkColumnName = NULLIF('{watermarkColumnName}', '');

IF @Watermark IS NULL
BEGIN
    SELECT {SelectColumnList} FROM {SchemaName}.{TableName};
END
ELSE IF (@WatermarkColumnName IS NOT NULL)
BEGIN
    DECLARE @sql NVARCHAR(MAX);
    SET @sql = N'SELECT {SelectColumnList} FROM {SchemaName}.{TableName} WHERE ' + QUOTENAME(@WatermarkColumnName) + ' >= DATEADD(HOUR, {DefaultIncrementalHours}, @Watermark);';
    EXEC sp_executesql @sql, N'@Watermark DATETIME2', @Watermark;
END
ELSE
BEGIN
    RAISERROR('No valid option found. Please ensure the watermark column name is set properly.', 16, 1);
END;

A few things you may notice:

  • This query is much more advanced than a simple "SELECT * FROM SalesLT.Customer". It performs both full and incremental loads. It even throws an error with a descriptive message if conditions aren't right.
  • This environment variable has tokens in it! Yes, this is what makes Maestro very powerful. It will loop and evaluate tokens at many levels.
  • When this is applied to our module item, the module item has properties such as SchemaName, TableName, etc. Those will get evaluated, and the resulting query will use "SalesLT", "Customer", etc.

This is far more enabling for data engineers than simple execution configuration systems like CSV files:

  • This single script allows for both incremental and full loads. Once the full load is completed, the incremental load will automatically be selected (provided a watermark column has been set).
  • I've seen data engineers have to come up with their own solutions for full load vs incremental loads. (1) Sometimes they have 2 properties - a query for full load, and a query for incremental load. (2) Or they may create more conditions in their pipelines to determine what the query should be.
  • Maestro "pushed left" and gave us the answer. Now we don't need to manage additional processing in our pipelines and notebooks.

This is very powerful, but can also be tricky. Once you go a few levels deep, it may be difficult to know what will be returned to our pipeline. To help with this, Maestro has a resolve feature on each module item. This shows the values that are returned by the uspGetModuleItem procedure. It looks like this:

[Image: The resolved module item - this is what will be available to our processing tool.]

Now that we have entered and validated what Maestro will return, let's build out our pipeline.

Building our Data Extraction Pipeline

The big question many of you are probably asking is - how do I get the data out of Maestro and leverage it in my pipeline? Maestro provides stored procedures that act as our interface into the system. Users do not need to - and should not - write custom queries to retrieve data from the system.

To get the module item information from Maestro, we run the below procedure:

EXECUTE [NeudesicMaestroRuntime].[uspGetModuleItem]
    @ModuleItemName = 's2pl.adv.SalesLT.Customer',
    @EnvironmentName = 'Dev';        

This procedure will return the exact same properties as the previous image - the module item resolution feature within Maestro. This makes it easy to know exactly what will be returned to whatever data processing tool I use.

Let's take a look at the Fabric pipeline.

[Image: Fabric pipeline for copying data from a SQL source to a Fabric Lakehouse.]

Maestro allows our data processes to follow simple templates. This pipeline does the following:

  • Retrieve the Module Item Data - First, we need to get the runtime values so we can use them in the pipeline. Our pipeline has the ModuleItemName and EnvironmentName parameters, which we will pass to our lookup activity.
  • Perform the Extraction / Copy - Next, we use the module item's properties - query, lakehouse ID, filename, etc. - to perform the data copy activity.
  • Update the Watermark - Last, we update the watermark value of the module item. This ensures ongoing incremental data extraction.
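
The three steps above can be sketched end-to-end with an in-memory simulation. This is purely illustrative - the real pipeline calls Maestro's stored procedures and a Fabric copy activity, and the names below are my stand-ins:

```python
# In-memory stand-in for Maestro's module item store; in reality this
# lives in SQL Server behind stored procedures.
module_items = {"s2pl.adv.SalesLT.Customer": {"Watermark": ""}}

def run_pipeline(item_name, source_rows):
    """Sketch of the pipeline template: look up the module item, copy the
    rows the watermark admits, then advance the watermark for the next run."""
    item = module_items[item_name]                   # 1. retrieve module item data
    watermark = item["Watermark"]
    copied = [r for r in source_rows                 # 2. perform the extraction / copy
              if not watermark or r["ModifiedDate"] > watermark]
    if copied:                                       # 3. update the watermark
        item["Watermark"] = max(r["ModifiedDate"] for r in copied)
    return copied
```

Running this twice against the same source rows shows the incremental behavior: the first run copies everything and records the newest ModifiedDate, and the second run copies nothing new.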

You may be wondering - how do we reference the module item properties? It looks like this:

@activity('MaestroModuleItem').output.firstRow.TargetFileName        

Very clean, and very simple. Adding new module properties is just as easy: add the property in Maestro, and then use it in the pipeline. No need to create new pipeline objects like variables, parameters, etc.

The pipeline itself is also easy to understand. No extra steps for limited find and replace. No branching logic for full load vs incremental loads. No coming up with a custom solution for handling watermark values. Maestro makes it easy to build data processes.

Wrap Up

As a developer, I've found that building data processes with Maestro is fast and easy. Processes that would have taken days can be done in hours. I don't have to spend any time building out my execution configuration system. Maestro's templates also make it incredibly easy. I just copy the template and update the activities as needed. I spend my time focused on building out the data platform - not on the execution configuration system.

We have just scratched the surface of what an advanced execution configuration system can do. In the upcoming weeks we will explore scaling our data platform with Maestro. Bulk updates, environment promotion, ADO integration, automated module item creation. We will also look at how we can leverage a copilot to maintain data in our Maestro system.

Next week we will dig deeper into scaling this pipeline. We will add logging to the pipeline and leverage Maestro's bulk update system to create dozens of module items. We will wrap it up with promoting our module items to the production environment. See you next week!
