Building Scalable Data Engineering Solution Using Microsoft Fabric

I wanted to share our journey of using Microsoft Fabric for our Data Engineering solution—starting from simple shortcuts to data sources to developing a full-fledged DE framework that others can easily adopt, deploy and use with minimal onboarding effort.

Through this process, we addressed various data engineering challenges, and I want to share our key learnings at each stage.

Source Data Access

One of the biggest expenses in any data engineering solution is bringing source data into local systems. The volume and complexity of the data impact costs, maintenance efforts, and ingestion strategies.

By leveraging shortcuts, we eliminated significant overhead and streamlined data access without worrying about ingestion complexities. As of today, Fabric supports shortcuts for:

  • OneLake
  • Amazon S3
  • Azure Data Lake Gen2
  • Azure Blob Storage
  • Google Cloud Storage
  • Dataverse

However, our project deals with data from 40+ different sources, including Azure Data Lake, Dynamics CRM, SQL Server, and Cosmos DB. While Fabric doesn't provide direct shortcuts for some of these systems, we devised workarounds to access data without writing ingestion pipelines:

  • Azure Data Lake: Fabric provides a direct shortcut option, allowing seamless access.
  • Dynamics CRM: Since data is not directly accessible through shortcuts, we used Fabric Link to publish CRM data into Fabric OneLake via a simple Power Apps setup—a 10-minute process that makes CRM data immediately available in the Fabric Lakehouse.
  • SQL Server: For this common OLTP data store, we used Mirrored Azure SQL Database to replicate data into Fabric. Once mirrored, tables are accessible using the standard path: /<database_name>.mountedrelationaldatabase/Tables/<Schema>/<Table_name>
  • Cosmos DB: Similar to SQL Server, we utilized Mirrored Azure Cosmos DB, enabling effortless replication into Fabric.
  • Other Systems (Oracle, Snowflake, etc.): Similar mirroring strategies can be applied to access these sources efficiently.
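Once a shortcut or mirrored database is in place, the data is simply another path in OneLake. Below is a minimal sketch of reading it from a Fabric notebook; the database, schema, table, and folder names are placeholders, and depending on how the Lakehouse is attached you may need the full abfss:// OneLake URL rather than a relative path.

# Fabric PySpark notebook: read data surfaced through mirroring and shortcuts.
# "SalesDB", "dbo", "Orders" and "landing_zone" are placeholder names.

# Table replicated via Mirrored Azure SQL Database, following the
# /<database_name>.mountedrelationaldatabase/Tables/<Schema>/<Table_name> convention.
orders_df = spark.read.format("delta").load(
    "/SalesDB.mountedrelationaldatabase/Tables/dbo/Orders"
)

# Files exposed through an ADLS Gen2 shortcut attached to the default Lakehouse.
raw_df = spark.read.parquet("Files/landing_zone/customers/")

display(orders_df.limit(10))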

Compute & Transformations

One of the toughest challenges in ETL processing is orchestrating table dependencies. Traditional DE solutions involve complex pipelines, metadata management, or code-based dependency tracking—leading to high maintenance overhead when dependencies evolve.

Instead, Fabric introduces a smarter approach using:

  • mssparkutils.notebook.runMultiple() – allowing dynamic, parallel execution of notebooks without rigid pipeline configurations. Learn more
  • Automated DAG generation – by scanning notebook code and dynamically constructing dependency graphs at runtime. Since DAGs are built dynamically, there's no need for extensive metadata maintenance.
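As an illustration, runMultiple() accepts a DAG object that describes the notebooks, their parameters, and their dependencies. A minimal sketch follows; the notebook names, arguments, and timeouts are placeholders.

# Fabric notebook: orchestrate dependent notebooks in parallel from a DAG object.
dag = {
    "activities": [
        {
            "name": "stage_orders",              # unique activity name
            "path": "stage_orders",              # notebook to run
            "timeoutPerCellInSeconds": 600,
            "args": {"load_date": "2024-01-01"},
        },
        {
            "name": "transform_orders",
            "path": "transform_orders",
            "timeoutPerCellInSeconds": 600,
            "dependencies": ["stage_orders"],    # runs only after stage_orders succeeds
        },
    ],
    "timeoutInSeconds": 43200,                   # overall timeout for the whole run
    "concurrency": 20,                           # max notebooks running in parallel
}

mssparkutils.notebook.runMultiple(dag)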

We can also build the DAG objects at commit time using DevOps/Git build pipelines, eliminating the cost of constructing the DAG on every run and reducing redundancy; a rough sketch of this follows below.
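One way to do this is a small script in the build pipeline that scans notebook sources for a dependency annotation and writes the DAG out as a build artifact. The sketch below assumes an illustrative # DEPENDS_ON: comment convention and notebooks exported as .py files; neither is a built-in Fabric feature.

import json
import pathlib
import re

# Illustrative convention: a notebook declares its upstream notebooks with a
# comment such as "# DEPENDS_ON: stage_customers, stage_orders".
DEPENDS = re.compile(r"#\s*DEPENDS_ON:\s*(.+)")

def build_dag(notebook_dir: str) -> dict:
    activities = []
    for nb in sorted(pathlib.Path(notebook_dir).glob("*.py")):
        match = DEPENDS.search(nb.read_text())
        deps = [d.strip() for d in match.group(1).split(",")] if match else []
        activities.append({
            "name": nb.stem,
            "path": nb.stem,
            "timeoutPerCellInSeconds": 600,
            "dependencies": deps,
        })
    return {"activities": activities, "concurrency": 10}

if __name__ == "__main__":
    # The orchestrator notebook loads this artifact at runtime instead of
    # rebuilding the dependency graph on every execution.
    print(json.dumps(build_dag("notebooks"), indent=2))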

The DAG also lets us assign a batch of notebooks to a specific Spark pool. By identifying and grouping notebooks into batches, we took advantage of this feature to reduce Azure spend.

Data Validations

Data validation is a key component of data processing: it ensures the data is accurate and of high quality.

The traditional approach is to create a specific set of validations per table and execute them after the corresponding table data is processed.

A better option is to make this metadata-driven: create generic validation rules and configure, per table, which rules to apply.
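As an example of the metadata-driven pattern, generic checks can be written once and driven by per-table configuration. The rule names and the config layout below are illustrative only.

from pyspark.sql import DataFrame, functions as F

# Generic, reusable checks; each returns the number of offending rows.
RULES = {
    "not_null": lambda df, col, _: df.filter(F.col(col).isNull()).count(),
    "unique":   lambda df, col, _: df.count() - df.select(col).distinct().count(),
    "in_range": lambda df, col, p: df.filter(~F.col(col).between(p["min"], p["max"])).count(),
}

# Per-table metadata; in practice this would live in a config table or JSON file.
VALIDATIONS = {
    "orders": [
        {"rule": "not_null", "column": "order_id", "params": None},
        {"rule": "unique",   "column": "order_id", "params": None},
        {"rule": "in_range", "column": "quantity", "params": {"min": 1, "max": 1000}},
    ],
}

def validate(table: str, df: DataFrame) -> list[dict]:
    results = []
    for cfg in VALIDATIONS.get(table, []):
        failed = RULES[cfg["rule"]](df, cfg["column"], cfg["params"])
        results.append({"table": table, **cfg, "failed_rows": failed})
    return results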

One limitation of both approaches is that the set of validation rules is fixed: if a table demands a new type of validation, we have to write new, often complex, code to implement it.

 AI-Powered Data Validations

Traditional validation methods rely on predefined rule sets, limiting flexibility.

A more scalable approach is AI-driven validation using prompt-based validation rules. Instead of maintaining static lists, we allow Fabric AI functions to interpret prompts and dynamically generate validation scenarios. Prompts can act as metadata.

This technique makes validation far more adaptable than manually coding rule sets. Check out this article for more details: Validating Data with Natural Language in Microsoft Fabric
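A minimal sketch of the idea is shown below, assuming the pandas .ai accessor and the generate_response function described in the Fabric AI functions documentation; verify the exact package and method names against your runtime version before relying on it.

import pandas as pd

# AI functions are available on recent Fabric Spark runtimes; earlier runtimes
# may require installing the preview packages first.
customers = pd.DataFrame({
    "email":   ["a@contoso.com", "not-an-email"],
    "country": ["US", "Atlantis"],
})

# The prompt is the only "metadata": adding a new validation means adding a
# new prompt string, not writing new rule code.
validation_prompt = (
    "Look at the email and country values in this row. "
    "Reply VALID if both look plausible, otherwise briefly describe the issue."
)

customers["validation_result"] = customers.ai.generate_response(validation_prompt)
print(customers)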

 

Security

With the combination of lakehouse permissions and OneLake data access roles, we can secure data at the entity level. By creating a role per entity and assigning security groups to those entity-level roles, we achieved granular entity-level security.

Data Masking and Data Redaction

Beyond securing entities, it is also crucial to mask and redact sensitive data where applicable. Masking and redaction ensure sensitive values are not readable while the table structure is retained.

Fabric provides options to redact data using Fabric AI functions and the Presidio libraries. Using these, we could redact personally identifiable information (PII) from entities. Learn more
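As an example of the Presidio route, the analyzer/anonymizer pair can be wrapped in a small helper and applied to sensitive columns; the sample text and the <REDACTED> placeholder are illustrative.

# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str) -> str:
    """Replace detected PII (names, emails, phone numbers, ...) with a placeholder."""
    findings = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(
        text=text,
        analyzer_results=findings,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
    ).text

# Could be wrapped in a Spark UDF to redact a column of a Lakehouse table.
print(redact("Contact John Smith at john.smith@contoso.com or 425-555-0100."))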


Semantic Model Support & Validation

Semantic model validation is a crucial step to make sure end users see valid metrics. Performing validations on semantic models is generally challenging, but the Great Expectations (GX) framework with Fabric makes it straightforward to validate semantic models at runtime.

Using Great Expectations, we can validate semantic models at various levels, including datasets, metrics, and tables.

With GX, we can separate context (rule) creation from execution: the context can be created during development and executed at runtime. Check out this tutorial
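A minimal sketch of the pattern is below, using semantic link (sempy) to pull semantic model data into pandas and GX's legacy gx.from_pandas validator API (newer GX releases use the fluent context API instead); the dataset, table, and measure names are placeholders.

# Requires the semantic-link (sempy) and great_expectations packages on the runtime.
import sempy.fabric as fabric
import great_expectations as gx

# Pull a semantic model table and an aggregated measure into pandas.
sales = fabric.read_table("Sales Model", "FactSales")
revenue = fabric.evaluate_measure("Sales Model", measure="Total Revenue", groupby_columns=["Region"])

# Table-level expectations (the "context" authored at development time).
sales_validator = gx.from_pandas(sales)
sales_validator.expect_column_values_to_not_be_null("OrderKey")
sales_validator.expect_column_values_to_be_between("Quantity", min_value=1, max_value=10_000)

# Metric-level expectation on the evaluated measure.
revenue_validator = gx.from_pandas(revenue)
revenue_validator.expect_column_values_to_be_between("Total Revenue", min_value=0)

print(sales_validator.validate().success, revenue_validator.validate().success)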

 Logging & Monitoring

To ensure system health and pipeline performance, we integrated Azure Log Analytics, combining system logs and application execution insights in one place.
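As an illustration, per-notebook execution records can be pushed into a Log Analytics custom table with the azure-monitor-ingestion client; the data collection endpoint, rule ID, stream name, and log fields below are placeholders.

# pip install azure-monitor-ingestion azure-identity
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.ingestion import LogsIngestionClient

client = LogsIngestionClient(
    endpoint="https://<data-collection-endpoint>.ingest.monitor.azure.com",
    credential=DefaultAzureCredential(),
)

# One record per notebook run; the schema must match the stream declared in the DCR.
client.upload(
    rule_id="<data-collection-rule-immutable-id>",
    stream_name="Custom-PipelineRuns_CL",
    logs=[{
        "TimeGenerated": datetime.now(timezone.utc).isoformat(),
        "Notebook": "transform_orders",
        "Status": "Succeeded",
        "DurationSeconds": 312,
    }],
)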

For real-time monitoring, we explored:

  1. Azure Managed Grafana – A powerful dashboarding tool with direct KQL-based queries against Log Analytics. Beyond dashboarding, Grafana provides Teams and email integrations for notifications. It also provides ICM integration (via ADX), so ICM tickets can be created automatically when errors occur during pipeline execution.
  2. Fabric Data Activator – Enables automated alerts when failures occur, using real-time monitoring and triggers within Fabric. Learn more


We will discuss each of these items in detail, along with how we bundled all of these features into a deployable solution, in upcoming posts. Stay tuned!!


Conclusion

Our journey with Microsoft Fabric transformed our data engineering solution—from simple shortcuts to a scalable, automated DE framework.

By adopting Fabric's shortcuts, dynamic transformations, AI-driven validations, and real-time monitoring, we built a cost-efficient, adaptable data pipeline while reducing development overhead.

If you are working with Microsoft Fabric, I’d love to hear your insights and experiences! Let’s connect and discuss innovative ways to optimize data engineering workflows.


