Data Flow in Azure Data Factory V2 - Best Practices
Most of you have used a data flow inside your data factory to perform some transformation while building your workflow (pipeline).
I want to share some best practices for working with the source inside your data flow.
You have probably seen a few checkboxes while configuring a source inside your data flow:
1. Allow schema drift
2. Infer drifted column types
3. Validate schema
In this article, I will discuss these three options.
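Behind the visual canvas, these three checkboxes correspond to properties in the data flow script that ADF generates. A minimal sketch of a source definition, assuming a stream named EmployeeSource (the stream name is illustrative):

```
source(allowSchemaDrift: true,
    inferDriftedColumnTypes: true,
    validateSchema: false) ~> EmployeeSource
```

You can view this script from the "Script" button in the data flow designer; toggling the checkboxes in the UI flips these properties.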
Suppose we have a sample dataset like the one below, which contains EmployeeInfo.
I am going to create a workflow that loads this data into a destination. Once the data is loaded, it will look like this in the destination:
Suppose that, in the future, the business analysts change the schema of the file and start sending files with only the id, name, and dept columns (see the screenshot below).
Now, if you look at the destination side, the data will look like this. Have you noticed anything wrong? Check the salary and dept columns carefully: the data from the dept column has shifted into the salary column, which is not correct.
To handle this kind of situation and ensure your workflow keeps working even when the schema is modified at the source, you need to select the "Allow schema drift" option.
Once this checkbox is selected and you load your data one more time, you will see output like this:
Look at the salary column: since the incoming file no longer contains salary data, the column is populated with null.
In this way, the "Allow schema drift" option helps you protect your data quality.
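Keep in mind that drift has to be allowed end to end: the sink should also have schema drift enabled so that drifted columns flow through to the destination. A hedged sketch in data flow script, reusing the assumed stream names EmployeeSource and EmployeeSink:

```
EmployeeSource sink(allowSchemaDrift: true,
    validateSchema: false,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> EmployeeSink
```

If the sink does not allow drift (or uses a fixed mapping), drifted columns from the source are simply dropped before they reach the destination.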
I hope we are good with the first option, "Allow schema drift". Now let's discuss the other two options: "Infer drifted column types" and "Validate schema".
Suppose you have a new column in the source file. Since the source file is a CSV, the column's data type will default to string. To handle the data types of drifted columns, select the "Infer drifted column types" checkbox, which enables auto-detection of drifted column types.
In the example below, I have added a new column, "managerid", to the source file.
Now, if you look at the same information loaded on the destination side, you will observe that the data type for this column is "string" (if "Infer drifted column types" is not checked). If it is checked, you will see the actual data type of the "managerid" column, i.e. "integer".
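If you prefer to cast a drifted column explicitly instead of relying on inference, you can reference it with the byName() expression in a derived column transformation. A sketch in data flow script, assuming the stream names used earlier; byName() and toInteger() are standard data flow expression functions:

```
EmployeeSource derive(managerid = toInteger(byName('managerid'))) ~> CastDriftedColumns
```

This pins the column to a known type even if inference is turned off, which can be useful when you want deterministic typing for downstream transformations.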
Now, coming to the last option, "Validate schema". As the name suggests, once this option is checked it enables validation on the source and sink side: if the incoming data does not match the columns and types defined in the Projection tab, the data flow will fail.
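In data flow script terms, this strict mode is the opposite of the drift-friendly setup: with validateSchema set to true, any file that deviates from the defined projection fails the data flow run. A sketch, again with an assumed stream name:

```
source(allowSchemaDrift: false,
    validateSchema: true) ~> EmployeeSource
```

Choose this when you would rather fail fast on an unexpected schema change than silently load drifted or missing columns.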
Thanks for reading my article :)
Amit Kumar