Data Democratization
Data democratization. It's a phrase used a lot nowadays to indicate a direction companies should pursue in order to become better. For me, democratization means that you wrest power from the hands of a concentrated few and distribute that power to a wider audience. Sometimes the data we talk about has to be converted first, much like you'd convert currency or liquidate assets (or the other way around), before you can use it or capitalize on it.
Airbnb is, for me at least, an exemplary company in how they pioneer data engineering combined with data visualization, and in the tools they provide their employees to wade through terabytes of information in the blink of an eye. It is actually a strategic objective at Airbnb to continuously improve on that. No wonder I'm following Airbnb's engineering blog; it's one of the best ways to learn how to become more effective with data.
One of the coolest data tools I've seen come out of this innovative process is Data Portal. You can find Airbnb's post about it on Medium here: Democratizing data. There's also a video from the same talk. Basically, Data Portal is all about developing transparency on what data is available, how (much) it is used, who uses it, and how to get in touch with the communities and people using or producing it. It's sort of a Facebook for data instead of people. Unfortunately, it doesn't look like it's going to be open-sourced; the main reason is that it's heavily tied in with Airbnb's internal infrastructure. This makes sense: it's an "umbrella tool" for their organization, which means it doesn't solve a local issue, so it's bound to involve more integration work.
When you combine Data Portal with Apache Airflow (capabilities for moving and processing data) and Superset (capabilities for visualizing and transforming data), you can see how the roles of some of the people they employ differ from the typical 'data engineer' or 'BI developer'. This is what I think data engineering is truly about: putting together a platform and capabilities so that people can be self-serving. Don't station a cashier or someone flipping omelets at the counter: prepare the buffet in such a way that the people visiting can help themselves in an orderly fashion; that way you can serve more customers in a restaurant with fewer people.
A very interesting philosophical exercise is to read Maxime Beauchemin's post called "The rise of the data engineer". Go ahead and read it :). The title is positive, and the tone of this particular post is also rather positive when it comes to this newfound profession called "Data Engineer". The important point to take away is how the data warehouse has become more of a public institution instead of a shielded, locked treasure chest that you are only allowed to peek into when you have the permissions to do so.
The role of data engineer from this perspective is different from that of "BI developer". A BI developer uses data, constructs reports and looks at data from a functional perspective, without necessarily looking too much at how that is accomplished with the underlying technology; it's about the achievement and its functional implications. Many BI tools can be pretty complex to use, which is why you want a group of experts dealing with them. A data engineer looks at data from a more abstract perspective and tries to offer tools and processes that allow the people who need the data to become self-serving, even if that sometimes leads to (temporarily) messier work being executed. The idea is that with smarter effort and better abstractions all over the place, you eventually get exponentially more done than just that single report.
It's a bit surprising, then, to find a follow-up article called "The downfall of the data engineer". What is highlighted there is how efforts to provide more tooling also lead to exponential pressures from all sides: tool misuse, false expectations, and misalignment in how people build up what is intended to become a consistent data warehouse, one that increases in value the more dimensions and measures are added to it. Instead, although data engineers provide tools or put data somewhere, we see teams form "micro-groups" that primarily seek to solve local problems and forget to make broader contributions or to re-model the data so that it becomes integration-ready with other sources. Great; so by wresting the power away from a central location, we now contribute to shared misery? I thought we were a democracy instead of a communist state? :)
This is why the Data Portal is such an important initiative in data-driven organizations. The ability to see how data is used, and to derive context from that, allows people to talk locally about this data and find ways to integrate it together, instead of (as usually happens) going to a single authority on that data and expecting them to resolve the problem.
The other key element is "data proficiency": knowing how to work with data is something that employees with access to data will have to learn better for the whole organization to grow. This means more effective use of SQL combined with better efforts in data modeling. It also means being well-versed in:
- data security requirements and obligations (like GDPR);
- security management around data, such as sending encrypted data files over the internet;
- expectations and regulations in third-party data interactions;
- agreements and SLAs on surfacing data with those parties;
- API/interface requirements in new service agreements with third parties;
- sharing data wrangling knowledge within the team (or abstracting the work away by providing better tooling);
- knowing when to request data engineering skills versus general SQL and programming skills.
The picture at the top of this page is from a very recent open-source software project I started. It's built on the descriptions of Airbnb's Data Portal (basically: neo4j, elasticsearch, python+flask). The way it works is that it looks at log files from "data surfaces" (SSAS, Tableau, etc.) and extracts user IDs along with the measures or charts those users look at. It also looks at the data dictionaries in databases to understand which views take data from which location. The ETL tool we plan on using should record how data flows from table to table. When you put all of this together in a graph database, you get a lot of relationships between people and data. From there, you can answer a huge number of questions, like:
- Is anyone looking at data in this table?
- This table contains personal data and GDPR is coming up. How many people are potentially affected if we remove these fields from this table?
- What other marketing data related resources do we have at the company and is marketing aware of that?
- I am new to this finance group and I don't know the reports these people look at. If I join their group, though, I can explore this for myself.
- There is a useful report that others are using, but I want to adapt it for my needs. I just don't know which data sources feed into this report. If only I had a tool that could tell me...
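All of these questions boil down to traversals over the same graph. As a minimal sketch of the idea, here is an in-memory version in plain Python; the node and relation names (VIEWS, FEEDS, LOADS) are illustrative assumptions, since the actual project would store these relationships in neo4j and answer them with Cypher queries:

```python
from collections import defaultdict, deque

class LineageGraph:
    """Tiny directed graph of typed edges, e.g. ('alice', 'VIEWS', 'revenue_report')."""

    def __init__(self):
        self.fwd = defaultdict(set)  # (node, relation) -> downstream nodes
        self.rev = defaultdict(set)  # (node, relation) -> upstream nodes

    def add(self, src, relation, dst):
        self.fwd[(src, relation)].add(dst)
        self.rev[(dst, relation)].add(src)

    def viewers_of(self, table):
        """Is anyone looking at data in this table? (Also: who is affected
        if we remove fields from it, e.g. for GDPR?)"""
        users = set()
        for report in self.fwd[(table, "FEEDS")]:
            users |= self.rev[(report, "VIEWS")]
        return users

    def upstream_of(self, report):
        """Which tables feed this report, directly or via ETL steps?"""
        seen, queue = set(), deque([report])
        while queue:
            node = queue.popleft()
            for relation in ("FEEDS", "LOADS"):  # LOADS = ETL table-to-table flow
                for src in self.rev[(node, relation)]:
                    if src not in seen:
                        seen.add(src)
                        queue.append(src)
        return seen

g = LineageGraph()
g.add("raw_bookings", "LOADS", "bookings")    # recorded by the ETL tool
g.add("bookings", "FEEDS", "revenue_report")  # from the data dictionary
g.add("alice", "VIEWS", "revenue_report")     # extracted from surface logs
g.add("bob", "VIEWS", "revenue_report")

print(sorted(g.viewers_of("bookings")))        # ['alice', 'bob']
print(sorted(g.upstream_of("revenue_report"))) # ['bookings', 'raw_bookings']
```

The point is not the data structure but the uniformity: once viewing logs, data dictionaries, and ETL runs all land in one graph, each of the questions above is just another short traversal.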
I truly think that data governance and management tools are going to become critically important for organizations to steer their growth and remain effective in how they put their employees' available effort to work. Should your "data engineers" really be implementing that dataset or report someone wanted (the role of a BI developer), or should they be more transformative in how your organization deals with data? Would you invest their time into building better tooling instead?