Data Flow in Azure Data Factory V2 - Best Practices
Most of you have used a data flow inside your data factory to perform some transformation while building your workflow (pipeline).
I want to share some best practices for working with the source inside your data flow.
You have probably seen a few checkboxes while configuring a source inside your data flow:
1. Allow schema drift
2. Infer drifted column types
3. Validate schema
In this article, I will discuss these three options.
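Behind the visual canvas, these three checkboxes correspond to properties in the data flow script that ADF generates. A minimal sketch of a source definition, assuming a stream named EmployeeSource (the stream name is illustrative):

```
source(allowSchemaDrift: true,
    inferDriftedColumnTypes: true,
    validateSchema: false) ~> EmployeeSource
```

You can view this script from the "Script" button in the data flow designer; toggling the checkboxes in the UI flips these properties.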
Suppose we have a sample dataset like the one below, which contains EmployeeInfo.
I am going to create a workflow that loads this data into a destination. Once the data is loaded, it will look like this in the destination:
Suppose that, in the future, the business analysts change the schema of the file and start sending files with only the id, name, and dept columns (see the screenshot below).
Now, if you look at the destination side, the data will look like this. Have you noticed anything wrong? Check the salary and dept columns carefully: the data from the dept column has shifted into the salary column, which is not correct.
To handle this kind of situation and ensure your workflow keeps working even when the schema is modified at the source, you need to select the "Allow schema drift" option.
Once this checkbox is selected and you load your data one more time, you will see output like this:
Look at the salary column: since the incoming file no longer contains salary data, the column is populated with null.
In this way, the "Allow schema drift" option helps you protect your data quality.
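Keep in mind that drift has to be allowed end to end: the sink should also have schema drift enabled so that drifted columns flow through to the destination. A hedged sketch in data flow script, reusing the assumed stream names EmployeeSource and EmployeeSink:

```
EmployeeSource sink(allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> EmployeeSink
```

If the sink does not allow drift (or uses a fixed mapping), drifted columns from the source are simply dropped before they reach the destination.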
I hope we are good with the first option, "Allow schema drift". Now let's discuss the other two options: "Infer drifted column types" and "Validate schema".
Suppose you have a new column in the source file. Since the source file is a CSV, the column's data type will default to string. To handle the data types of drifted columns, select the "Infer drifted column types" checkbox, which enables auto-detection of drifted column types.
In the example below, I have added a new column, "managerid", to the source file.
Now, if you look at the same information loaded on the destination side, you will observe that the data type for this column is "string" (if "Infer drifted column types" is not checked). If it is checked, you will see the actual data type of the "managerid" column, i.e. "integer".
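If you prefer to cast a drifted column explicitly instead of relying on inference, you can reference it with the byName() expression in a derived column transformation. A sketch in data flow script, assuming the stream names used earlier; byName() and toInteger() are standard data flow expression functions:

```
EmployeeSource derive(managerid = toInteger(byName('managerid'))) ~> CastDriftedColumns
```

This pins the column to a known type even if inference is turned off, which can be useful when you want deterministic typing for downstream transformations.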
Now, coming to the last option, "Validate schema". As the name suggests, once this option is checked it enables validation on the source and sink side: if the incoming data does not match the columns and types defined in the Projection tab, the data flow will fail.
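In data flow script terms, this strict mode is the opposite of the drift-friendly setup: with validateSchema set to true, any file that deviates from the defined projection fails the data flow run. A sketch, again with an assumed stream name:

```
source(allowSchemaDrift: false,
    validateSchema: true) ~> EmployeeSource
```

Choose this when you would rather fail fast on an unexpected schema change than silently load drifted or missing columns.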
Thanks for reading my article :)
Amit Kumar