Data Flow in Azure Data Factory V2 - Best Practices


Most of you have used a data flow inside your data factory to perform transformations while building your workflow (pipeline).

I want to share some best practices for working with the source inside your data flow.

You have probably seen a few check-boxes while configuring a source inside your data flow:

1. Allow schema drift

2. Infer drifted column types

3. Validate schema

[Screenshot: source settings showing the three check-boxes]
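Under the hood, these check-boxes map to properties on the source in the data flow script that the designer generates. A rough sketch, using the EmployeeInfo columns from this article (the stream name EmployeeSource is my own, and the exact script your designer produces may differ):

```
source(output(
		id as integer,
		name as string,
		dept as string,
		salary as integer
	),
	allowSchemaDrift: true,
	inferDriftedColumnTypes: true,
	validateSchema: false) ~> EmployeeSource
```

The output() block is the defined projection; the three flags correspond one-to-one with the check-boxes discussed below.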

In this article I will discuss these three options.

Suppose we have a sample dataset like the one below, which contains EmployeeInfo.

[Screenshot: sample EmployeeInfo source data]

I am going to create a workflow that loads this data into a destination. Once the data is loaded, it will look like this in the destination:

[Screenshot: data loaded at the destination]

Suppose that, in the future, the business analysts change the schema of the file and start sending files with only the id, name and dept columns (see the screenshot below).

[Screenshot: new source file containing only the id, name and dept columns]

Now, if you look at the destination side, the data will look like this. Have you noticed anything wrong? Check the salary and dept columns carefully: the data from the dept column has shifted into the salary column, which is not correct.

[Screenshot: destination data with dept values shifted into the salary column]

To handle this kind of situation and ensure your workflow keeps working even when the schema is modified at the source, you need to select the "Allow schema drift" option.

Once this check-box is selected, load your data one more time and you will see output like this:

[Screenshot: destination data after enabling schema drift, with salary populated as null]

Look at the salary column: since the new files no longer contain salary data, the salary column is populated with null instead of mis-shifted values.

In this way, the "Allow schema drift" option helps you protect your data quality.
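One detail worth noting: for drifted columns to flow all the way through, schema drift is typically allowed on the sink as well as the source. A sketch of how the sink portion of the data flow script usually looks (the stream names EmployeeSource and EmployeeSink are my own):

```
EmployeeSource sink(allowSchemaDrift: true,
	validateSchema: false) ~> EmployeeSink
```

With auto-mapping enabled on the sink, columns that drift in at the source are written to the destination without you having to map them by hand.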

I hope we are good with the first option, "Allow schema drift". Now let's discuss the other two options: "Infer drifted column types" and "Validate schema".

Suppose a new column appears in the source file. Since the source file is a CSV, that column's data type will default to string. To handle the data types of drifted columns, tick the "Infer drifted column types" check-box, which enables automatic detection of each drifted column's type.

In the example below, I have added a new column, "managerid", to the source file.

[Screenshot: source file with the new managerid column]

Now, if you inspect the same data after loading, you will see that the data type of this column is "string" (if "Infer drifted column types" is not checked). If it is checked, you will see the actual data type of the "managerid" column, i.e. "integer".
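Because drifted columns are not part of the defined projection, to use one in later transformations you typically reference it explicitly. A sketch using a derived column with the byName() expression function (stream and column names follow the example above, and are my own naming):

```
EmployeeSource derive(managerid = toInteger(byName('managerid'))) ~> MapDriftedColumns
```

The toInteger() cast is a safeguard for when the drifted column arrives as a string; if "Infer drifted column types" is already checked and the type was detected as integer, the cast is effectively a no-op.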

Now, coming to the last option, "Validate schema". As the name suggests, once this option is checked, it enables validation on the source and sink sides: if the incoming data does not match the columns and types defined in the Projection tab, the data flow fails.

[Screenshot: the "Validate schema" option in the source settings]
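In the data flow script, this corresponds to validateSchema: true on the source, together with the schema defined in the Projection tab. A sketch under the same assumptions as the earlier snippets (EmployeeSource is my own stream name):

```
source(output(
		id as integer,
		name as string,
		dept as string,
		salary as integer
	),
	allowSchemaDrift: false,
	validateSchema: true) ~> EmployeeSource
```

With this setting, a file that is missing any of these columns, or whose types do not match, causes the data flow run to fail rather than loading bad data.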

Thanks for reading my article :)

Amit Kumar
