A data catalog: it's all about context
As a business or analytical user, do you want to know that there are hundreds of customer tables in the organization? Or do you want to know which one to use?
There is a lot of market confusion about whether a data catalog solves this problem. Even more broadly, there is confusion as to what precisely a data catalog is.
To me, an enterprise-wide repository representing all your technical metadata (systems, tables, columns, schemas, ...) is a data dictionary, not yet a data catalog. Only when you add context to the technical metadata do you create a true catalog, and with it value for the organization.
Don't get me wrong, a data dictionary is essential to an organization. However, when technical people need this information, they typically just run a 'select' against the system tables or look directly at the ETL tools.
Ultimately, organizations that want to become data-driven have to enable not only their technical people but the whole organization (including business users). And to do so, they need context.
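To make the 'select' on the system tables concrete, here is a minimal sketch in Python. It assumes SQLite as a stand-in for whatever database you actually run (most other engines expose similar metadata through their own system views), and the two customer tables are made up for illustration. Note what comes back: table and column names, and nothing that tells you which table to use.

```python
import sqlite3


def data_dictionary(conn):
    """Return {table: [(column, declared_type), ...]} from SQLite's system catalog."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    dictionary = {}
    for table in tables:
        # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        dictionary[table] = [(col[1], col[2]) for col in columns]
    return dictionary


# Hypothetical example schema: two tables that both look like "customer".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE cust_master (cust_no INTEGER, segment TEXT)")

dictionary = data_dictionary(conn)
for table, columns in dictionary.items():
    print(table, columns)
```

This is technical metadata only: a dictionary, not yet a catalog.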
Context
Context, that is the hard part. If you look at a customer table in System A, are those retail customers? Are they suppliers? Or maybe partners? You would only know if you have the context. And remember that context can go multiple levels deep. A partner for the marketing side of the organization might be completely different from a partner for the finance side.
The answer to building context is often a business glossary, and exactly that business glossary should be the core of your catalog: a single point to which you can connect all other information assets. If I search for 'customer', I want to know what reports we run, what business processes we have, what source systems to use, and what information is available in those source systems.
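As a rough illustration of a glossary acting as the hub, here is a small Python sketch. Everything in it is an assumption made for the example: the `GlossaryTerm` and `Glossary` names, the fields, and the sample entries are hypothetical and do not come from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlossaryTerm:
    """A business term plus the information assets linked to it (hypothetical model)."""
    name: str
    definition: str
    context: str  # e.g. the line of business that owns this meaning
    reports: List[str] = field(default_factory=list)
    processes: List[str] = field(default_factory=list)
    tables: List[str] = field(default_factory=list)  # "system.table" identifiers


class Glossary:
    """Single point that connects terms to reports, processes, and source tables."""

    def __init__(self):
        self.terms = []

    def add(self, term):
        self.terms.append(term)

    def search(self, keyword):
        # Case-insensitive match on the term name only, to keep the sketch small.
        needle = keyword.lower()
        return [t for t in self.terms if needle in t.name.lower()]


glossary = Glossary()
glossary.add(GlossaryTerm(
    name="Retail customer",
    definition="A person who buys through our stores or web shop.",
    context="Marketing",
    reports=["Monthly retail sales"],
    processes=["Order to cash"],
    tables=["system_a.customer"],
))
glossary.add(GlossaryTerm(
    name="Partner",
    definition="A reseller under contract.",
    context="Finance",
    tables=["system_b.partner"],
))

hits = glossary.search("customer")
for term in hits:
    print(term.name, term.context, term.reports, term.processes, term.tables)
```

The same term name can be added more than once with a different `context`, which is one simple way to represent a 'partner' that means one thing to marketing and another to finance.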
Building context is what makes a catalog useful and powerful, but it is also what takes time and effort. Part of it can be automated and suggested via machine learning and other methods, but ultimately it is often the subject matter expert for a system or a line of business who knows the answer.
What to prioritize? The same method as in my last blog post can be used. Prioritize the high-visibility, high-impact data: the data that you want to be sure is used correctly and is easy to find.
Bottom line: to enable a data-driven organization, you have to make sure everyone in the organization, not just the technical people, can find and get access to data in the context they work in. And to achieve this, you need a context-driven catalog.
Good article. The risk is that you build an object model, a business glossary, an information assets register, and a spatial register, and end up with a catalogue of catalogues. Moving the focus from the data to the custodian will help simplify the approach.