Metadata: The queen of data.
About a decade ago, the buzz around big data became pervasive. Many players jumped into the gold rush, hyper-motivated by the fear of missing out. It wasn't unusual at the time to see conference presentations with wild claims and inflated expectations: from super-duper clear data lakes, one sip from which could cure any corporate ill, to turnkey solutions so good they would give the corporate leadership a massage (with a horse).
But like any hype cycle, big data moved from these over-the-top claims to today's realistic and highly productive data environments. Survival of the fittest eventually left a few major players, such as Snowflake, Databricks, and the large cloud providers (AWS, Azure, etc.). Yet despite reaching maturity, the big data space is riper than ever for innovation, and one of the key areas of potential is metadata.

But what exactly is metadata? Metadata is typically defined as data that describes other data. To illustrate, let's say you run a restaurant with a database containing data about ingredients, dishes, reservations, and so on. All information about the database itself, such as its size, transaction and backup logs, and partitions, is considered metadata.
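The distinction is easy to see in code. Here is a minimal sketch using an in-memory SQLite database for the hypothetical restaurant: the rows in a table are the data, while the table's schema, held in the database's own catalog, is metadata.

```python
import sqlite3

# An in-memory database for the hypothetical restaurant.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingredients (name TEXT, quantity REAL, unit TEXT)")
conn.execute("INSERT INTO ingredients VALUES ('tomato', 5.0, 'kg')")

# The data: the ingredients themselves.
data = conn.execute("SELECT * FROM ingredients").fetchall()

# The metadata: information *about* the data -- here, the table's
# column names and types, retrieved from SQLite's internal catalog.
schema = conn.execute("PRAGMA table_info(ingredients)").fetchall()

print(data)                                  # [('tomato', 5.0, 'kg')]
print([(col[1], col[2]) for col in schema])  # [('name', 'TEXT'), ('quantity', 'REAL'), ('unit', 'TEXT')]
```

Logs, partitions, and backup histories are metadata in exactly the same sense: they describe the database rather than the dishes.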
The utility of such metadata is wide-ranging. To illustrate, let's say our restaurant software is prone to errors and its database intermittently suffers from data corruption. To fix the problem, the company that sold the software analyzes the metadata in the logs, then trains an AI model on that metadata to predict and diagnose future software issues.
At the heart of metadata management lies a tool called the data catalog: software that helps manage and leverage the power of metadata. Can metadata be managed without a data catalog? Sure… But can one build a house with a hammer and a shovel? Sure, too, but why would anybody do that in the age of bulldozers?
There are many data catalog vendors out there (Alation, Glue, OvalEdge, etc.), some better than others, but reviewing them is not the goal here. Our goal is to shed some light on the importance of metadata in general and the centrality of the data catalog, with reflections on best practices for adoption in organizations. Most catalogs include three key features.
The first allows the construction of a lineage for all enterprise data. The importance and usefulness of lineage range from compliance requirements to data quality measurement, with many other benefits in between. If lineage were a kitchen tool, it would be the one that helps the chef trace the sources of all the ingredients and ensure a delicious, high-quality dish. With bad ingredients, no matter how good the chef (i.e., the data scientist), the end result would be an inedible dish or worse.
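Under the hood, lineage is essentially a directed graph: each dataset points to the upstream datasets it was derived from, and "tracing the ingredients" is a walk up that graph. A toy sketch, with made-up table names:

```python
# Each dataset maps to the upstream datasets it was derived from.
# These table names are hypothetical, purely for illustration.
lineage = {
    "quarterly_report": ["sales_agg", "inventory"],
    "sales_agg": ["raw_orders"],
    "inventory": ["raw_deliveries", "raw_orders"],
}

def upstream_sources(dataset, graph):
    """Return every transitive upstream dependency of a dataset."""
    sources = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in sources:
                sources.add(parent)
                stack.append(parent)
    return sources

print(sorted(upstream_sources("quarterly_report", lineage)))
# ['inventory', 'raw_deliveries', 'raw_orders', 'sales_agg']
```

A real catalog builds this graph automatically by parsing query logs and ETL jobs, but the compliance and quality questions it answers ("where did this number come from?") reduce to exactly this traversal.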
The second key feature allows the measurement of data quality, both via empirical measurements, such as the ratio of empty or erroneous fields in a table, and via crowdsourcing, which helps socialize the information garnered by other team members about a dataset.
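The empirical side of this check is simple to sketch. Here is a minimal example, with made-up rows, computing the ratio of empty values per column, one of the basic quality metrics a catalog would track:

```python
# Hypothetical rows from the restaurant's "dishes" table.
rows = [
    {"dish": "risotto", "price": 14.5},
    {"dish": "", "price": None},
    {"dish": "carbonara", "price": 12.0},
]

def empty_ratio(rows, column):
    """Fraction of rows whose value for `column` is missing or blank."""
    empties = sum(1 for r in rows if r.get(column) in (None, ""))
    return empties / len(rows)

print(round(empty_ratio(rows, "dish"), 2))   # 0.33
print(round(empty_ratio(rows, "price"), 2))  # 0.33
```

The crowdsourced side (ratings, comments, endorsements on a dataset) complements these numbers with context no formula can capture.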
The third is the ability to manage an enterprise-wide glossary of terms and taxonomy. Although this feature sounds tame, it is, in my humble opinion, crucial: managed properly, it can enable enterprise-wide semantic alignment.
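At its simplest, a glossary entry pairs a canonical term with a definition and the aliases different teams use for it; semantic alignment is largely the act of resolving those aliases to one shared term. A toy sketch, with a hypothetical entry:

```python
# A made-up glossary entry: one canonical term, its definition,
# and the team-specific aliases that should resolve to it.
glossary = {
    "customer": {
        "definition": "A party that has placed at least one order.",
        "aliases": {"client", "account", "patron"},
    },
}

def canonical_term(word, glossary):
    """Map a team-specific alias to its canonical glossary term, if any."""
    for term, entry in glossary.items():
        if word == term or word in entry["aliases"]:
            return term
    return None

print(canonical_term("client", glossary))  # customer
```

A real catalog adds ownership, approval workflows, and links from glossary terms to the physical columns that implement them, but the alias-to-canonical mapping is the core.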
More often than not, disasters happen because of misunderstanding and/or bad communication. To illustrate, let's say a few business leaders in an aerospace company are sitting in a war room trying to manage a rocket launch, but some speak French, others English or Italian, and so on; some use the metric system, others the Westeros yard system. Our rocket's chances of landing on Mars will then be about the same as its chances of landing in Northern Ireland.
To be fair, the catalog alone will not be enough to achieve this elusive semantic alignment; doing so requires a lot of stakeholder engagement and additional tools, which may be bespoke. These tools may include automated services and ontology-oriented analysis. Ontologies are a huge topic that we may cover in another article. Unfortunately, they are not a very popular topic in the corporate world, for many reasons too broad to cover here. Suffice it to say that when used judiciously in conjunction with a data catalog, they can provide a huge competitive edge to data-oriented companies (which is pretty much everybody).
Still, for the catalog to become the center and fulcrum of an organization's data management and governance, additional steps are needed, because the catalog is just a tool. As with making a bonfire, one needs fuel, oxygen, and a spark. Similarly, to “ignite” the potential of metadata, an organization needs the classic triad: people, process, and technology.
The technology is represented by the metadata, the catalog, and the ecosystem of custom services, but all of these together are just one leg of the stool.
Without commitment and understanding from people (the second leg of the stool) at all levels of the organization, it will be hard to adopt the technology and use it to its full potential. Here, the role of program management is crucial: it is responsible for the evangelizing phase of adoption, and it should also help foster a top-down mandate and bottom-up goodwill.
The third leg of the triad, process, is the glue. Through automation and steps baked into every project, “respect” for metadata becomes less of a burden thrown out the window whenever the schedule hurts (which is often), and more of a seamless step that becomes muscle memory for teams and data leaders.
In today’s world, where uncertainty and information overload dominate, decision making becomes a matter of survival. A data- and fact-driven decision process is often superior to instinct, but its value depends heavily on the quality of the data. To ensure that quality, the queen of data is vital, and any leader who ignores her majesty does so at their own peril.