Microsoft Fabric! A loader app created using the Spark Copilot, PySpark and the Fabric Lakehouses... (Part 3: The Silver Layer!)
Welcome back to part three of this series. Let's first recap what happened so far:
Part 1: We used Copilot to create functions and methods in a PySpark Notebook in Microsoft Fabric to get data from files into the Bronze Layer of our setup.
Part 2: We created a Fabric Data Factory Pipeline to get parameters from a .csv file (this could also be a SQL DB or other source) and start the Notebook from part 1 with the parameters injected at runtime.
And we have only just begun.
In this part we are going to examine two different topics:
1. Creating shortcuts programmatically via the Fabric REST API to bring the Bronze layer tables into the Silver Lakehouse.
2. Building a Data Vault structure in the Silver layer from those source tables.
Off we go. Let's start with a new notebook to create the shortcuts as a bridge between the layers, bringing in the Bronze layer tables so they are available to work on. (Sure, this is not strictly necessary, but we also wanted to experiment and build that muscle for the moments when it will be, like creating a workspace for a team and providing them with a bunch of shortcuts to tables to give them a head start.)
But back to the shortcuts and how to create them. There is a growing collection of REST APIs that you can use to programmatically manipulate Fabric infrastructure, like workspaces, lakehouses, notebooks and, of course, the OneLake shortcuts we need here.
You'll find the API reference here: Microsoft Fabric REST API references - Microsoft Fabric REST APIs | Microsoft Learn.
Another (newer) way to interact with and manage Fabric programmatically is the command line interface (CLI): Fabric command line interface - Microsoft Fabric REST APIs | Microsoft Learn. It will also help admins and developers perform tasks in the Fabric realm.
First, like in the Bronze loader notebook, we'll set up some parameters in a dedicated cell so we can catch the values injected by the calling job, in my case the Fabric Data Factory Pipeline from part 2.
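To make this concrete, here is a minimal sketch of such a parameter cell. The variable names are my own placeholders, not the ones from the original notebook:

```python
# Parameter cell (marked as a parameter cell in Fabric) so the Data Factory
# pipeline can overwrite these values at runtime. Names are illustrative.
workspace_name = "ws_medallion"        # workspace holding both lakehouses
source_lakehouse_name = "LH_Bronze"    # Bronze lakehouse the shortcut will point to
target_lakehouse_name = "LH_Silver"    # Silver lakehouse that receives the shortcut
table_name = "customers"               # Bronze table to expose via the shortcut
```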
The second step in this notebook is a function that receives the return value from a REST call and gets the ID for an object name that we're looking for. Let's ask Copilot again for this:
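Your Copilot output will of course differ, but a minimal sketch of such a helper could look like this, assuming the usual "value"/"displayName" shape of the Fabric REST list responses:

```python
def get_id_by_name(response_json: dict, display_name: str):
    """Return the id of the entry whose displayName matches, or None if not found.

    Expects the typical Fabric REST list response:
    {"value": [{"id": "...", "displayName": "..."}, ...]}
    """
    for entry in response_json.get("value", []):
        if entry.get("displayName") == display_name:
            return entry.get("id")
    return None
```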
Nice work again, buddy. With this function we can now just pass the name of an item that we are searching for and get back the ID that we need for the REST API call to create the shortcut.
This is the thought sequence we need to go through: resolve the workspace ID from its name, resolve the IDs of the Bronze (source) and Silver (target) lakehouses, build the Create Shortcut POST URL from those IDs, and finally run the POST.
With the function that we have created above this would look something like:
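A sketch of how that could look, assuming the token is fetched with notebookutils inside the Fabric Spark session (adjust to your own authentication setup) and reusing the get_id_by_name helper from above:

```python
import requests

# notebookutils is available in Fabric notebooks; the audience URL is my assumption.
token = notebookutils.credentials.getToken("https://api.fabric.microsoft.com")
headers = {"Authorization": f"Bearer {token}"}
base = "https://api.fabric.microsoft.com/v1"

# Resolve the workspace ID from its display name
workspaces = requests.get(f"{base}/workspaces", headers=headers).json()
workspace_id = get_id_by_name(workspaces, workspace_name)

# Resolve the IDs of the Bronze (source) and Silver (target) lakehouses
items = requests.get(f"{base}/workspaces/{workspace_id}/items", headers=headers).json()
source_lakehouse_id = get_id_by_name(items, source_lakehouse_name)
target_lakehouse_id = get_id_by_name(items, target_lakehouse_name)
```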
And with those IDs we create the POST URL like described (OneLake Shortcuts - Create Shortcut - REST API (Core) | Microsoft Learn):
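In code, again just as a sketch of the documented request shape: the shortcut is created on the Silver lakehouse and points back at the table in the Bronze lakehouse.

```python
# Create Shortcut endpoint: POST .../workspaces/{workspaceId}/items/{itemId}/shortcuts
shortcut_url = f"{base}/workspaces/{workspace_id}/items/{target_lakehouse_id}/shortcuts"

payload = {
    "path": "Tables",                       # folder in the Silver lakehouse
    "name": table_name,                     # shortcut name, here same as the table
    "target": {
        "oneLake": {
            "workspaceId": workspace_id,
            "itemId": source_lakehouse_id,  # the Bronze lakehouse
            "path": f"Tables/{table_name}",
        }
    },
}
```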
Finally run the POST:
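A minimal version of that call could be:

```python
# 201 Created means the shortcut now exists in the Silver lakehouse
response = requests.post(shortcut_url, headers=headers, json=payload)
print(response.status_code, response.text)
response.raise_for_status()
```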
And there we are in the Silver Lakehouse:
Again, I don't provide the notebook for download. Have some fun and try Copilot yourself. You'll be surprised how fast you will have the solution.
But let's move to the second step of our plan and create the Data Vault structure from the source data and start a new notebook.
Remember Data Vault? Dan Linstedt introduced the Data Vault methodology in the 1990s, with formal publications and adoption gaining traction in the early 2000s. It was designed to address scalability, auditability, and agility in enterprise data warehousing.
Main building blocks:
- Hubs: the unique business keys of a business object
- Links: the relationships between hubs
- Satellites: the descriptive attributes and their history
Why am I targeting this methodology? In one of my bigger data warehouse projects we had the challenge that the nightly load window was filling up, and it was obvious it would soon no longer be enough. The other challenge we ran into was the data quality of the source. The old warehouse was pretty messed up, and cleaning up after another failed job was a lengthy and ugly task with lots of manual SQL to be run by the maintenance team.
With Data Vault we succeeded in building an insert-only core warehouse structure that was even able to benefit from the great performance SQL Server delivered. We went from sequenced pipelines to parallel loading of the hubs, the satellites and the links. With history of any depth manifesting in the satellites, and with the canonical structure of the tables, we were also able to create easy-to-understand SQL scripts as tools for the maintenance teams in case of failure or inconsistency. And yes, of course, the different table structures led to more complex SQL / views from the presentation layer to the marts, but the warehouse server was able to digest those quite well.
Let's examine how we can build a generic Spark notebook that takes the needed parameters and forms a Data Vault structure in our Silver Lakehouse from the sources that we load in the Bronze layer.
The parameters that are needed so far:
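As a sketch, the parameter cell of this Data Vault notebook could look like this (the names and values are placeholders of mine):

```python
# Parameter cell of the Data Vault loader notebook, overwritten by the pipeline.
source_table = "customers"                         # shortcut table created above
primary_key = "customer_id"                        # business key for the hub
attributes = ["first_name", "last_name", "city"]   # columns for the satellite
record_source = "crm_export"                       # source identifier for auditing
hub_table = "hub_customer"
sat_table = "sat_customer"
```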
We get the table to work on, the primary key, all the attributes that we want to see in the hub and the satellite, and a source identifier so we can audit the content later.
And then my friend and co-coder comes to the rescue again:
And the result:
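Copilot's answer will differ from run to run; a minimal, insert-only sketch of what such a hub loader can look like (assuming the built-in spark session of the Fabric notebook) is:

```python
from pyspark.sql import DataFrame, functions as F

def load_hub(df: DataFrame, hub_name: str, business_key: str, record_source: str) -> None:
    """Insert-only load of a Data Vault hub: business key, hash key and audit columns."""
    hub = (
        df.select(business_key)
          .dropDuplicates()
          .withColumn("hub_hash_key", F.sha2(F.col(business_key).cast("string"), 256))
          .withColumn("load_ts", F.current_timestamp())
          .withColumn("record_source", F.lit(record_source))
    )
    if spark.catalog.tableExists(hub_name):
        # append only the business keys we have not seen yet
        new_rows = hub.join(spark.table(hub_name).select("hub_hash_key"),
                            on="hub_hash_key", how="left_anti")
        new_rows.write.mode("append").saveAsTable(hub_name)
    else:
        hub.write.mode("overwrite").saveAsTable(hub_name)
```

Called with the parameters from above, this would be something like load_hub(spark.table(source_table), hub_table, primary_key, record_source).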
Not sure how you feel about this and for many people this is of course already daily routine. I still am quite excited about these helpers.
The finish line is right in front of us now. Another instruction to the Copilot to create a function for the Data Vault satellite table that will complete the hub from above:
And the resulting function:
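Again only a sketch, mirroring the hub loader and adding a hash diff so unchanged rows are not inserted twice:

```python
from pyspark.sql import DataFrame, functions as F

def load_satellite(df: DataFrame, sat_name: str, business_key: str,
                   attributes: list, record_source: str) -> None:
    """Insert-only load of a Data Vault satellite holding the descriptive attributes."""
    sat = (
        df.select([business_key] + attributes)
          .withColumn("hub_hash_key", F.sha2(F.col(business_key).cast("string"), 256))
          .withColumn("hash_diff", F.sha2(
              F.concat_ws("||", *[F.col(a).cast("string") for a in attributes]), 256))
          .withColumn("load_ts", F.current_timestamp())
          .withColumn("record_source", F.lit(record_source))
          .drop(business_key)
    )
    if spark.catalog.tableExists(sat_name):
        existing = spark.table(sat_name).select("hub_hash_key", "hash_diff")
        sat = sat.join(existing, on=["hub_hash_key", "hash_diff"], how="left_anti")
        sat.write.mode("append").saveAsTable(sat_name)
    else:
        sat.write.mode("overwrite").saveAsTable(sat_name)
```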
Well, hub and satellite functions are available now. The Data Vault, though, uses yet another type of table. Let's create a function for a link table to complete the loader and be able to reflect the facts of an analytical model.
This time I'll do a learning check: I'll try to apply what I have learned from the other functions I have created so far and, showing some sportsmanship, go without the Copilot.
To be able to call this one separately I'll create another notebook for it. With a separate notebook we can provide separate parameters and also run it in parallel with the notebook from above.
With all the typical cells like before we'll have the following parameters:
First we'll create the link table function. My parameters are:
The function will combine the business keys of the referenced hubs into one hash key, add the individual hub hash keys and the audit information, and insert only the link rows that don't exist yet:
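A possible shape of such a function, sketched along the same lines as the hub loader (the parameter names are hypothetical):

```python
from pyspark.sql import DataFrame, functions as F

def load_link(df: DataFrame, link_name: str, business_keys: list, record_source: str) -> None:
    """Insert-only load of a Data Vault link connecting two (or more) hubs."""
    link = df.select(business_keys).dropDuplicates()
    # one hash key per referenced hub, so the link can be joined back to its hubs
    for key in business_keys:
        link = link.withColumn(f"hub_hash_key_{key}", F.sha2(F.col(key).cast("string"), 256))
    link = (
        link.withColumn("link_hash_key", F.sha2(
                 F.concat_ws("||", *[F.col(k).cast("string") for k in business_keys]), 256))
            .withColumn("load_ts", F.current_timestamp())
            .withColumn("record_source", F.lit(record_source))
    )
    if spark.catalog.tableExists(link_name):
        existing = spark.table(link_name).select("link_hash_key")
        link = link.join(existing, on="link_hash_key", how="left_anti")
        link.write.mode("append").saveAsTable(link_name)
    else:
        link.write.mode("overwrite").saveAsTable(link_name)
```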
And now finally create the Link-Satellite:
We need to create the business key like we did for the link table, add the attributes to the data frame and also again add the audit information.
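Sketched the same way: build the combined hash key from the business keys, hash the attributes, and add the audit columns. Again, names are placeholders of mine:

```python
from pyspark.sql import DataFrame, functions as F

def load_link_satellite(df: DataFrame, lsat_name: str, business_keys: list,
                        attributes: list, record_source: str) -> None:
    """Insert-only load of a link satellite: link hash key, attributes and audit columns."""
    lsat = (
        df.select(business_keys + attributes)
          .withColumn("link_hash_key", F.sha2(
              F.concat_ws("||", *[F.col(k).cast("string") for k in business_keys]), 256))
          .withColumn("hash_diff", F.sha2(
              F.concat_ws("||", *[F.col(a).cast("string") for a in attributes]), 256))
          .withColumn("load_ts", F.current_timestamp())
          .withColumn("record_source", F.lit(record_source))
          .drop(*business_keys)
    )
    if spark.catalog.tableExists(lsat_name):
        existing = spark.table(lsat_name).select("link_hash_key", "hash_diff")
        lsat = lsat.join(existing, on=["link_hash_key", "hash_diff"], how="left_anti")
        lsat.write.mode("append").saveAsTable(lsat_name)
    else:
        lsat.write.mode("overwrite").saveAsTable(lsat_name)
```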
The work that we then finally need to do is as follows:
Adding these notebooks to the Data Factory pipeline works exactly like in Part 2. I trust you to be able to repeat those steps, add a new parameter file or table, and run these two notebooks.
Now with the work that we have done above we are able to create shortcuts from our tables in the Bronze layer and create simple Data Vault structures from these tables as our central analytics model. We can of course adjust the basic process to reflect more sophisticated requirements as needed.
In the next article it will be time to create the presentation layer, or Gold layer. We will examine how to control the PySpark job and even inject transformation function strings from our metadata repository and run them to create the presentation layer model. Since we are using Fabric in this example, we will create a Star-Join schema to optimize the semantic model for use with Power BI.
We will examine two alternative ways of providing the transformations and the models:
Until then, dear reader, maybe try this yourself. As already mentioned, I don't provide the material, not because I want to provoke, but to encourage you to use the Fabric Copilot in the Spark experience and to see the magic you will be doing with this feature.
Thx for reading!
Here are the other parts of the series:
Part 1: The Bronze Layer, the fundament
Part 2: The control flow, orchestrating the load