Problems with scalable data systems need creative approaches.
Maybe ChatGPT will help write the code, but not the solutions, which still require human intelligence. (😊 Soon the solutions may be implemented by chatbots too.)
The following draft is my own, not generated by ChatGPT.
Data is becoming a product
Data is becoming a product, and it is the main vein powering business intelligence and reporting applications, which remain very important. It also powers all sorts of other use cases, from machine learning and artificial intelligence, to being plumbed back into traditional go-to-market tools via ETL and reverse-ETL, to powering product experiences.
To deal with cloud data-related issues, we are building data lakehouses: homogeneous (centralized) systems that store all data in one location or application.
As a result we see more use cases: more dashboards, and more data shared with various stakeholders. Data lakehouses are not subsets of applications; they are the organization's primary data sources.
There is one drawback: as we acquire more and more data, the amount of maintenance required keeps growing. In traditional data warehouse systems we might have hundreds of tables and tens of dashboards, and even with that limited set of data and components we still face drawbacks and ad hoc challenges, such as pipeline failures and data quality issues.
Data lakehouse preparation
We are in 2024, and companies are building individual data lakehouses to their own standards, owing to a lack of outside support and of sufficient SME design. There is no compass design yet: this is a new era of cloud models, and new data-model design standards are not yet available. There is a risk in dealing with thousands of tables and hundreds of dashboards: breakages become more frequent, which leads to a loss of trust among stakeholders.
I'm not attempting to solve all of these problems with one magic punch; instead, I'm attempting to deal with data pipeline break issues.
Before we go into the solution, let's look at a typical production failure.
Problem statement:
A credit risk report dashboard fails. As usual, the stakeholders do no basic verification and mail the development team a note studded with priority keywords. The product owner becomes cranky and forwards the same email to the developers asking for the root cause, but the smart development team is very busy and tries to push back, or to toss the issue to other teams such as DevOps, infra, and upstream, to buy some time for coffee. At the end of this journey, as a development team, we still need to reply to the stakeholders with answers to the key questions below.
What is going on right now with the issue?
What is the impact?
How do I fix this issue?
After reading the problem statement: to answer these questions we need to do a little investigation of the logs and data. Normally a developer responds to them and prioritises the proposed solution. But we are in the data lakehouse era, so dealing with all of the stakeholders at once, across a hundred dashboard failures, requires a significant amount of development resources. This is where the operating cost really lies.
Fantasy view of a solution
Create a Model Status Dashboard with two filters: model name and business date. When a model is selected, the dashboard shows the key charts below.
Data lineage graph chart: shows the impacted objects along with their upstream and downstream dependencies. (If we use dbt, we can export the lineage as a JSON artifact and render it in Power BI; if not, we need to derive it from code-level metadata.)
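As a rough sketch: if the models are built with dbt, the lineage is already in the `parent_map` of its `manifest.json` artifact, and a few lines of Python with `networkx` can list everything upstream and downstream of the failing model (the node id below is a hypothetical example):

```python
import json
import networkx as nx

# Load the dbt artifact produced by `dbt compile` / `dbt run`.
with open("target/manifest.json") as f:
    manifest = json.load(f)

# parent_map maps each node id to the list of node ids it depends on.
graph = nx.DiGraph()
for node, parents in manifest["parent_map"].items():
    for parent in parents:
        graph.add_edge(parent, node)

# Hypothetical node id for the failing report's model.
model = "model.my_project.credit_risk_report"
print("Upstream:  ", sorted(nx.ancestors(graph, model)))
print("Downstream:", sorted(nx.descendants(graph, model)))
```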
Heat map: the last 100 days of model execution history, broken down by date. It helps us understand the trend of failures over that window.
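A minimal sketch of the heat map, assuming an execution-history extract with `model_name`, `business_date`, and `status` columns (the file name and column names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical extract: one row per model run.
runs = pd.read_csv("model_runs.csv", parse_dates=["business_date"])
cutoff = runs["business_date"].max() - pd.Timedelta(days=100)
recent = runs[runs["business_date"] >= cutoff]

# Pivot to a models-by-dates grid: 1 = success, 0 = failure.
grid = (
    recent.assign(ok=(recent["status"] == "SUCCESS").astype(int))
          .pivot_table(index="model_name", columns="business_date", values="ok")
)

plt.figure(figsize=(12, 4))
plt.imshow(grid, aspect="auto", cmap="RdYlGn", vmin=0, vmax=1)
plt.yticks(range(len(grid.index)), grid.index)
plt.title("Model execution status, last 100 days")
plt.colorbar(label="1 = success, 0 = failure")
plt.tight_layout()
plt.show()
```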
Gantt chart: for the selected model and business date, shows how many tasks are completed, how many are ongoing, and which specific task failed, along with how long each task typically takes. It makes the task-level details of a failure, the task durations, and the execution impact easy to see.
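A minimal Gantt sketch using matplotlib; the task names, timestamps, and statuses below are illustrative stand-ins for the scheduler's task log:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative task rows for one model run.
tasks = pd.DataFrame({
    "task":   ["extract", "stage", "transform", "publish"],
    "start":  pd.to_datetime(["2024-05-01 01:00", "2024-05-01 01:20",
                              "2024-05-01 01:45", "2024-05-01 02:30"]),
    "end":    pd.to_datetime(["2024-05-01 01:20", "2024-05-01 01:45",
                              "2024-05-01 02:30", "2024-05-01 02:40"]),
    "status": ["done", "done", "failed", "pending"],
})

t0 = tasks["start"].min()
offset = (tasks["start"] - t0).dt.total_seconds() / 60        # minutes into the run
duration = (tasks["end"] - tasks["start"]).dt.total_seconds() / 60

colors = {"done": "tab:green", "failed": "tab:red", "pending": "tab:gray"}
fig, ax = plt.subplots(figsize=(8, 3))
ax.barh(tasks["task"], duration, left=offset,
        color=[colors[s] for s in tasks["status"]])
ax.invert_yaxis()                       # first task at the top
ax.set_xlabel("minutes since run start")
ax.set_title("Task timeline for the selected model and business date")
plt.tight_layout()
plt.show()
```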
Table chart: row-count information for the model's dependent source and target tables. Seeing how many tables are loaded, pending, dependent, and so on helps to validate the data integration checks.
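A minimal sketch of how the row counts behind the table chart could be gathered, assuming a SQLAlchemy connection; the connection string and table names are hypothetical, and in practice the table mapping would come from the model's metadata or lineage:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host/db")  # hypothetical
tables = {
    "source": ["raw.loans", "raw.customers"],       # illustrative names
    "target": ["mart.credit_risk_report"],
}

rows = []
with engine.connect() as conn:
    for role, names in tables.items():
        for name in names:
            count = conn.execute(text(f"SELECT COUNT(*) FROM {name}")).scalar()
            rows.append({"role": role, "table": name, "row_count": count})

print(pd.DataFrame(rows))   # this frame backs the dashboard's table chart
```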
Text select chart: upstream and downstream contact details, so dependent teams know whom to reach on either side of the pipeline.
Entities required to prepare the charts
This method is independent of any external tools; for the fantasy view I've used Power BI. We could equally use Python and a charting library to generate these model execution details, since basic charts, heat maps, Gantt charts, and lineage graphs are all available in Python libraries.
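For example, a minimal sketch of the dashboard's two filters applied in Python; the column names and the `model_runs.csv` extract are the same hypothetical ones used in the sketches above:

```python
import pandas as pd

def model_status(runs: pd.DataFrame, model_name: str, business_date: str) -> pd.DataFrame:
    """Apply the dashboard's two filters and return the rows behind the charts."""
    mask = (
        (runs["model_name"] == model_name)
        & (runs["business_date"] == pd.Timestamp(business_date))
    )
    return runs[mask].sort_values("start")

# Example selection: the failing credit risk report on one business date.
runs = pd.read_csv("model_runs.csv",
                   parse_dates=["business_date", "start", "end"])
print(model_status(runs, "credit_risk_report", "2024-05-01"))
```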
Thank you!