Microsoft Fabric! A loader app created using the Spark Copilot, PySpark and the Fabric Lakehouses... (Part 3: The Silver Layer!)

Welcome back to part three of this series. Let's first recap what happened so far:

Part 1: We used Copilot to create functions and methods in a PySpark Notebook in Microsoft Fabric to get data from files into the Bronze Layer of our setup.

Part 2: We created a Fabric Data Factory Pipeline to get parameters from a .csv file (this could also be a SQL DB or another source) and start the Notebook from part 1 with the parameters read from that file injected at runtime.

And we have only just begun.

In this part we are going to examine two different topics:

  1. We want to create shortcuts from the Bronze layer into the Silver layer programmatically.
  2. The Silver layer will hold a Data Vault structure for the data that we load. I want to achieve an insert-only core analytical layer that allows easy maintenance and provides standard methods to keep the data consistent and clean even when loads fail.

Off we go. Let's start with a new notebook to create the shortcuts as a bridge between the layers, bringing in the Bronze layer tables so they are available to work on. (Sure, this is not strictly necessary, but we also wanted to experiment and build that muscle for moments when it will be, like creating a workspace for a team and providing them with a bunch of shortcuts to tables to give them a head start.)

But back to the shortcuts and how to create them. There is a growing collection of REST APIs that you can use to programmatically manipulate Fabric infrastructure like

  • Automating workspace management
  • Managing data pipelines
  • Automating security and access control
  • Resource management, monitoring and optimization
  • Automated deployment

You'll find the API reference here: Microsoft Fabric REST API references - Microsoft Fabric REST APIs | Microsoft Learn.

Another (newer) method to interact with and manage Fabric programmatically is the Command Line Interface (CLI): Fabric command line interface - Microsoft Fabric REST APIs | Microsoft Learn, which also helps admins and developers perform tasks in the Fabric realm.

First, as in the Bronze loader notebook, we'll set up some parameters in an explicit cell to be able to catch values from the calling job, such as my Fabric Data Factory Pipeline from part 2.

[Screenshot: parameter cell of the shortcut notebook]
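Since the notebook itself isn't provided for download, here is a minimal sketch of what such a parameter cell could look like. All parameter names except p_workspaceName (which appears later in this walkthrough) are my own assumptions:

```python
# Parameter cell ("Toggle parameter cell" in Fabric). The Data Factory
# pipeline from part 2 overrides these default values at runtime.
# Names below are illustrative assumptions, not the author's exact ones.
p_workspaceName = "MyFabricWorkspace"   # workspace holding the Lakehouses
p_sourceLakehouse = "Bronze"            # Lakehouse the shortcut will point at
p_targetLakehouse = "Silver"            # Lakehouse where the shortcut is created
p_tableName = "customers"               # Bronze table to surface in Silver
```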

The second step in this notebook is a function that receives the return value from a REST call and extracts the ID for the object name that we're looking for. Let's ask Copilot again for this:

[Screenshot: Getting the item id for the Fabric item by name.]

Nice work again, buddy. With this function we can now just pass the name of an item that we are searching for and get back the ID that we need for the REST API call to create the shortcut.
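A minimal sketch of such a lookup helper, assuming the usual shape of Fabric "list" REST responses, which wrap their results in a "value" array of objects carrying "id" and "displayName" (the function name is my own choice, not necessarily what Copilot produced):

```python
def get_item_id(response_json, item_name):
    """Return the id of the Fabric item whose displayName equals item_name.

    response_json is the parsed JSON payload of a Fabric 'list' REST call
    (workspaces, items, ...); returns None when no item matches.
    """
    for item in response_json.get("value", []):
        if item.get("displayName") == item_name:
            return item["id"]
    return None
```

With this in place, `get_item_id(resp.json(), "Silver")` hands back the ID we need for the shortcut call.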

This is the thought sequence of what we need to go through:

  1. Get the WorkspaceID for the workspace name that has been provided via the parameter p_workspaceName
  2. Get the Lakehouses that are present in the given workspace represented by the target_workspaceId
  3. Get the itemID of the Lakehouse given by the target Lakehouse name (in Fabric speak this is the target that the shortcut points at, where the data that the shortcut displays resides)
  4. Create the URL for the post
  5. Create the JSON body for the post
  6. Create the shortcut

With the function that we have created above this would look something like:

[Screenshot: Receiving all the necessary item ids to prepare the REST API call.]

And with those IDs we create the POST URL as described (OneLake Shortcuts - Create Shortcut - REST API (Core) | Microsoft Learn):

[Screenshot: The POST URL that we will run against the Fabric workspace.]

Finally run the POST:

[Screenshot: Running the POST against the Fabric REST API.]

And there we are in the Silver Lakehouse:

[Screenshot: New shortcut in the Silver Lakehouse.]

Again, I don't provide the notebook for download. Have some fun and try Copilot yourself. You'll be surprised how fast you will have the solution.

But let's move on to the second step of our plan, create the Data Vault structure from the source data, and start a new notebook.

Remember Data Vault? Dan Linstedt introduced the Data Vault methodology in the 1990s, with formal publications and adoption gaining traction in the early 2000s. It was designed to address scalability, auditability, and agility in enterprise data warehousing.

Main Building Blocks:

  • Hubs: Contain unique business keys (e.g., CustomerID) with minimal attributes.
  • Links: Represent relationships between hubs (e.g., Customer-Product).
  • Satellites: Store descriptive and historical data related to hubs or links, including timestamps and source tracking.

Why am I targeting this methodology? In one of my bigger Data Warehouse projects we faced the challenge that the nightly load window was filling up, and it was obvious it would soon no longer be enough. The other challenge we ran into was the data quality of the source. The old warehouse was pretty messed up, and cleaning up after yet another failed job was a lengthy and ugly task with lots of manual SQL to be run by the maintenance team.

With Data Vault we succeeded in building an insert-only core warehouse structure that also benefited from the great performance SQL Server delivered. We went from sequenced pipelines to parallel loading of the hubs, the satellites and the links. With history of any depth manifesting in the satellites and the canonical structure of the tables, we were also able to create easy-to-understand SQL scripts as tools for the maintenance teams in case of failure or inconsistency. And yes, of course the different table structures led to more complex SQL and views from the presentation layer to the marts, but the warehouse server was able to digest those quite well.

Let's go and examine how we can build a generic Spark notebook that takes the needed parameters and forms a Data Vault structure in our Silver Lakehouse from the sources that we load in the Bronze layer.

The parameters that are needed so far:

[Screenshot: Parameters for the Data Vault structures.]

We get the table to work on, the primary key, all the attributes that we want to see in the hub and the satellite and a source identifier to be able to audit the content later.
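A minimal sketch of that parameter cell; the names are my own assumptions derived from the description above, not necessarily the author's:

```python
# Parameter cell for the Data Vault loader notebook.
# All names are illustrative assumptions matching the description above.
p_tableName = "customers"                   # Silver shortcut table to work on
p_businessKey = ["CustomerID"]              # primary key column(s)
p_attributes = ["Name", "City", "Segment"]  # attributes for hub and satellite
p_sourceId = "crm_export"                   # audit: where the content came from
```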

And then my friend and co-coder comes to the rescue again:

[Screenshot: Prompt used to create the Data Vault Hub table function.]

And the result:

[Screenshot: Function to create a Data Vault Hub table from the parameters.]

Not sure how you feel about this, and for many people it is of course already daily routine, but I am still quite excited about these helpers.

The finish line is right in front of us now. Another instruction to the Copilot to create a function for the Data Vault satellite table that will complete the hub from above:

[Screenshot: Prompt to create the Data Vault Satellite table function.]

And the resulting function:

[Screenshot: Function to create the Data Vault Satellite table from the parameters.]

Well, Hub and Satellite functions are available now. Data Vault, though, uses one more table type. Let's create a function for a link table to complete the loader and to be able to reflect the facts of an analytical model.

This time I'll do a learning check and see what I have learned from the other functions I created so far, and, showing some sportsmanship, go without the Copilot.

To be able to call this one separately I'll create another notebook for it. With a separate notebook we can provide separate parameters and also run it in parallel with the notebook from above.

With all the typical cells like before we'll have the following parameters:

[Screenshot: Parameters to create the Link structures of the Data Vault.]

First we'll create the link table function. My parameters are:

  • A data frame with the data that should finally be loaded to the link table and the according satellite.
  • The source table name from the silver layer. This is one of the shortcuts that we have created above. 
  • The business key that defines the row in the link table. This might be one or more columns.
  • A source id that points back to where the data is coming from.
  • An array of table names and columns that will be used to create the hash keys that link to other tables. The columns should be present in the actual list of attributes so that we can create all the needed information straight away. These can again be one or more columns.

The function will:

  • Select all the attributes needed into a new data frame.
  • Create a hash key from the column(s) that are provided in the business_key parameter and add it to the data frame.
  • Create a column with the load timestamp and add it to the data frame.
  • Create a column for the source_id and add it to the data frame.
  • Iterate through the list of key columns and for each provided table create a hash key from the columns provided and add it to the data frame.
  • Append the new data frame to the table that will be generated from the source table parameter and the postfix "_link".
  • Finally return the new data frame.

[Screenshot: Function to create Data Vault Link tables from parameters.]

And now finally create the Link-Satellite:

[Screenshot: Function to create Data Vault Link Satellite tables from parameters.]

We need to create the business key as we did for the link table, add the attributes to the data frame, and again add the audit information.

The work that we then finally need to do is as follows:

[Screenshot: Using the functions and doing the work.]

Adding these notebooks to the Data Factory pipeline works exactly as in Part 2. I trust you to be able to repeat those steps, add a new parameter file or table, and run these two notebooks.

Now, with the work that we have done above, we are able to create shortcuts from our tables in the Bronze layer and build simple Data Vault structures from these tables as our central analytics model. We can of course adjust the basic process to reflect more sophisticated requirements as needed.

In the next article it will be time to create the presentation layer, or Gold layer. We will examine how to control the PySpark job and even inject transformation function strings from our metadata repository and run them to create the presentation layer model. Since we are using Fabric in this example, we will create a Star-Join schema to optimize the semantic model for use with Power BI.

We will examine two alternative ways of providing the transformations and the models:

  1. Using SQL strings from the metadata repository and running them against the Silver Layer.
  2. Using PySpark code.

Until then, dear reader, maybe try this yourself. As already mentioned, I don't provide the material, not because I want to provoke, but to encourage you to use the Fabric Copilot in the Spark experience and see the magic you can do with this feature.

Thx for reading!

Here are the other parts of the series:

Part 1: The Bronze Layer, the fundament

Part 2: The control flow, orchestrating the load

More articles by Patrik Borosch
