Data Integration Tips

As data integration continues to evolve and mature, it is an exciting time for data professionals to see organizations adopt best practices and processes that focus on the value of transforming and repurposing data for analytical and operational needs. While there are many techniques for handling the common issues of data integration across different applications and business contexts, we should also look for ways to make workflow management and orchestration simpler so that results arrive faster. In that context, I wanted to share some of my thoughts, tips and tricks on common data integration approaches that have helped me and the teams I work with build better, more efficient data analytics systems. Below is a quick compilation of my learnings.

  • Spending a reasonable amount of time on data profiling before building the source-to-target lineage can help the team avoid rework or redesign, which becomes costlier in the later phases of the project. Usually a data or business analyst is responsible for this deliverable. A quick rundown of the profiled data gives the development team good insight into the shape of the source data and helps them design the transformations that need to be applied during integration efficiently. Knowing what to expect from the source system also helps them build correct test cases and explain exception scenarios within the data sets. (See the profiling sketch after this list.)


  • Often I find there is an overwhelming number of data integration layers for data to pass through before it reaches its destination in the warehouse or the marts. Process orchestration takes precedence over the data reaching its destination, because orchestration designs like to isolate data in each layer for troubleshooting. At times this creates overhead once such systems run in production, since each processing layer takes its time. Unless it is truly necessary, I feel we should restrict ourselves to a minimum number of layers. Not every project needs every layer, so it is beneficial to customize the layers to the needs of the project rather than standardize them. (See the config-driven sketch after this list.)


  • Avoiding very wide tables helps the integration process run more efficiently. It is important to spend a good amount of time finalizing this aspect of the design so that tables are not over-engineered. We may occasionally have to resort to wider tables, but unless it is necessary we should limit ourselves to lean, mean and mighty tables with just enough columns to satisfy the business functionality. (See the lean-table sketch after this list.)


  • Getting a subset of the data to the business in the test environments is a quick way of mitigating data quality risks. An example would be to deliver the data for one business area, such as auto policies, first, and follow with other areas such as property and commercial policies. Sometimes the business will not want to look at the data in isolation, in which case this is a difficult sell. Most of the time, however, seeing a partial set of the data bearing fruit for analytics builds confidence for the development and requirements teams. If the business is too busy to validate a partial data set, having the data analysts validate it before the full set goes to the testing and business teams is a good idea. (See the subset-delivery sketch after this list.)


  • The naming convention for date columns often leads to confusion about which column should be used when querying the data. The logical and physical models state the purpose of each column, but if time constraints keep the development team from populating the date columns per the data design decisions, a disconnect appears between the data in the tables and its dictionary meaning. To avoid this, development teams should work through the purpose of each column with the data architects and then, based on the functionality and the needs, either populate the columns or remove them if they will not be used. Leaving such columns for a future state also causes confusion until history is restated for them; it is better to provision future-state columns only when those enhancements are prioritized for development. (See the date-column audit sketch after this list.)


  • Failing to communicate data decisions made by the business stewards and data governance to the development team often leads to confusion. It is important to communicate these decisions as early as possible so that the development team can analyze their impact on the code. When decisions are communicated late, the development team has very little time to react, which hurts data quality and project deliverables.
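
To make the profiling tip concrete, here is a minimal sketch in Python with pandas. The table and column names are assumptions for illustration; a real profiling exercise would go further, but even a summary at this level gives the development team a feel for the source data before lineage design begins.

```python
import pandas as pd

# Hypothetical source extract; in practice, read the real feed, e.g.
# df = pd.read_csv("source_extract.csv")
df = pd.DataFrame({
    "policy_id": [101, 102, 103, 104],
    "holder_name": ["A. Smith", "B. Jones", None, "D. Lee"],
    "premium": [480.0, 910.0, 520.0, None],
})

# Per-column profile: type, null rate, distinct count, numeric range.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct": df.nunique(),
    "min": df.min(numeric_only=True),
    "max": df.max(numeric_only=True),
})
print(profile.sort_values("null_pct", ascending=False))
```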
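
For the tip on layers, here is one rough way a config-driven pipeline could look: each layer is a plain function, and a per-project configuration decides which layers actually run. The layer names and transformations are hypothetical, not a prescription for any particular tool.

```python
import pandas as pd

# Illustrative layer functions; each takes and returns a DataFrame.
def land(df):    return df                             # raw landing
def stage(df):   return df.drop_duplicates()           # staging / dedup
def conform(df): return df.rename(columns=str.lower)   # conform names
def publish(df): return df                             # final mart load

ALL_LAYERS = {"land": land, "stage": stage, "conform": conform, "publish": publish}

# Per-project config: a small project runs only the layers it needs.
project_layers = ["land", "conform", "publish"]

df = pd.DataFrame({"Policy_ID": [1, 1, 2], "Premium": [100.0, 100.0, 250.0]})
for name in project_layers:
    df = ALL_LAYERS[name](df)
print(df)
```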
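
For lean tables, one pattern worth considering is a narrow core table holding only the columns the business queries routinely, with rarely used attributes in a 1:1 extension table. A sketch using SQLite, with hypothetical schemas:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Lean core table: only the columns the business queries every day.
conn.execute("""
    CREATE TABLE policy (
        policy_id      INTEGER PRIMARY KEY,
        holder_name    TEXT NOT NULL,
        effective_date TEXT NOT NULL,
        premium        REAL NOT NULL
    )
""")

# Rarely used attributes live in a 1:1 extension table instead of
# widening the core table for every row.
conn.execute("""
    CREATE TABLE policy_ext (
        policy_id   INTEGER PRIMARY KEY REFERENCES policy(policy_id),
        notes       TEXT,
        legacy_code TEXT
    )
""")
conn.commit()
```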
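
For subset delivery, the sketch below filters a hypothetical policy feed down to one business area (auto) so that slice can be validated first. The column names, values, and output file are assumptions.

```python
import pandas as pd

# Hypothetical full feed; in practice this comes from the source system.
policies = pd.DataFrame({
    "policy_id": [101, 102, 103, 104],
    "line_of_business": ["auto", "property", "auto", "commercial"],
    "premium": [480.0, 910.0, 520.0, 2300.0],
})

# Deliver the auto slice to the test environment first.
auto_subset = policies[policies["line_of_business"] == "auto"]
auto_subset.to_csv("test_env_auto_policies.csv", index=False)
print(f"Delivered {len(auto_subset)} auto policies for early validation")
```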
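
Finally, for the date-column tip, a quick audit like the one below can flag date-named columns that were provisioned but never populated, so the team can take the populate-or-drop decision to the architects. The table and column names are illustrative.

```python
import pandas as pd

# Hypothetical table extract with several date columns.
df = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "effective_date": ["2024-01-01", "2024-02-01", "2024-03-01"],
    "expiry_date": [None, None, None],    # provisioned, never populated
    "restated_date": [None, None, None],  # "future state" column
})

date_cols = [c for c in df.columns if c.endswith("_date")]
for col in date_cols:
    pct_null = df[col].isna().mean() * 100
    if pct_null == 100:
        print(f"{col}: never populated; populate per design, or drop it")
    else:
        print(f"{col}: {pct_null:.0f}% null")
```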


I hope some of these learnings help you set higher goals and aspirations for your data integration work and its outcomes.


