Dude, Where's my Data?
Lessons From Postgres Vision 2018
Some months back, I gave a presentation to my co-workers covering what I took away from Postgres Vision 2018. This series covers the changes I have made and the actions I have taken since the conference.
Today, I will share what I took away from the talk by Rob Thomas, General Manager of IBM Analytics, at Postgres Vision 2018: KEYNOTE | AI Needs IA.
“There is no AI without IA.” - Rob Thomas
Mr. Thomas opened his talk with an Organizational Data Maturity curve. The presentation was something of a sales pitch (which I normally wouldn’t write about); however, it raised a number of points I think many people don’t consider.
Companies often want to gain additional value from their data, but decades ago they could not have designed for the ways they want to leverage that data today. Artificial Intelligence (AI) may have existed as a field of research since 1956, but it wasn’t really applicable until advances in Machine Learning were made. Most people didn’t have these technologies as a part of their lives as consumers until 2010, so the application of these technologies is still new to most companies.
Moreover, many have not invested in improving their Information Architecture (IA). IA can be defined as the structural design of shared information systems. Getting the most from the data you do have requires some level of organization and intentional design. IBM has been talking about the concept of a Data Dictionary for your entire organization: a centralized repository of metadata about the data systems within your organization. With current regulations regarding data, this should be a necessity for all organizations wishing to be compliant.
If you are struggling to sell this to a product owner or leadership, the additional benefits often include leveraging the data dictionary to automate tasks and administration, helping teams focus refactoring efforts, training algorithms to gain valuable insight, establishing good governance, ensuring data security measures are sufficient, and providing a starting point for any data science or heavy data analytics efforts. Many people throughout the organization need to understand the entire data ecosystem.
A data dictionary should include:
- Information about all data systems within your organization.
- What domain the data belongs to (is this customer data, organizational data, data specific to a particular application / function)?
- How does this data relate to other data in your organization?
- Where did this data come from?
- What data type / format is the data in?
- What consumes / uses the data and are there any contracts between the owner of the data and the consumers?
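The checklist above can be sketched as a single record type. This is only an illustration, not a format IBM or the talk prescribed: the class name DictionaryEntry, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DictionaryEntry:
    """One data element in a hypothetical organization-wide data dictionary."""
    system: str       # which data system holds the element
    name: str         # for example "schema.table.column"
    domain: str       # customer, organizational, application-specific, ...
    data_type: str    # type / format the data is in
    source: str       # where the data came from
    related_to: list[str] = field(default_factory=list)  # other elements it relates to
    consumers: list[str] = field(default_factory=list)   # what consumes / uses the data
    contracts: list[str] = field(default_factory=list)   # agreements between owner and consumers


# Sample values are made up for illustration only.
entry = DictionaryEntry(
    system="crm-postgres",
    name="public.customer.email",
    domain="customer",
    data_type="text",
    source="signup form",
    related_to=["public.orders.customer_id"],
    consumers=["billing-service"],
)
print(asdict(entry))
```

Keeping every element in one flat shape like this makes it easy to load the entries into a table later and to see at a glance which questions are still unanswered for a given element.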
The good news, if you don’t have a data dictionary, is that you can often bootstrap the effort by pulling a good deal of this information directly from whatever database system(s) your company is using. Additionally, there are tools (such as the ones IBM is offering) to help. I think the sweet spot for such tools is a very complex environment with multiple data systems running on several different kinds of data management system (relational, NoSQL, and Data Lake), which many companies seem to have these days.
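As a rough sketch of that bootstrapping step, the snippet below reads table and column metadata straight out of a database's own catalog. It uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; it is not the script I used. The function name harvest_columns and the sample tables are hypothetical. On PostgreSQL, a similar starting point would be a query against information_schema.columns.

```python
import sqlite3


def harvest_columns(conn):
    """Return one dict per column, read from the database's own catalog."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        for _cid, name, col_type, _notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append({
                "table": table,
                "column": name,
                "data_type": col_type,
                "primary_key": bool(pk),
            })
    return rows


# Throwaway in-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

catalog = harvest_columns(conn)
for row in catalog:
    print(row)
```

What the catalog cannot tell you is the domain, the origin, or the consumers of each element. Those still have to be added by people who know the data.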
Most Companies do not Understand the Data Required for AI
Rob made a good point in his presentation: data science efforts require knowledge of all of your data (regardless of where it lives) to properly train your models and ensure they are effective. The first step in gaining value from your data, identification, is often one of the largest pain points! Finding the data you need and ensuring you have the correct data shouldn’t be a scavenger hunt. Yet I have both personally witnessed, and heard from other professionals about, cases where identifying data literally took months or whole teams of people (detracting from other organizational priorities). This is something of a productivity anti-pattern, and it can be addressed with better organization.
These days, every department wants a data mart or some way of doing analytics. Leaders all seem to understand the value data can bring to decision making. Collection and storage of data can usually be accomplished at a reasonable price. The value comes from unifying and correlating this data to empower informed decisions that can have a positive impact on the business.
Optimal solutions tend to involve three elements: people, processes, and technology. The solution to the lack of IA could be broken down as follows:
People
- Educate people throughout the organization on where to find information on the data that is stored.
- Educate on data security, protocol and governance.
- Empower people with the data and tools they need to do their jobs and make decisions.
Process
- Update development processes to ensure the data dictionary stays current. This could be as simple as a pre-release step or as complex as updating your data pipeline, creating a publish / subscribe solution to automate the updates, and / or creating an audit log that needs to be reviewed by an employee.
- Formalize Information Architecture and implement within your organization. Know what problems you are addressing and why.
- Continuous process improvement cycle. Let’s face it, you don’t get what you expect. You get what you inspect. Schedule periodic reviews to ensure measures are meeting organizational needs and are working as intended.
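The simple end of that spectrum, a pre-release step, could look something like the sketch below. It assumes you can produce two sets of "table.column" names, one from the data dictionary and one from the live schema. The function name dictionary_drift and the sample data are hypothetical.

```python
def dictionary_drift(documented, actual):
    """Compare documented elements with what the live schema reports.

    Both arguments are sets of "table.column" strings.
    """
    return {
        # present in the schema but missing from the dictionary
        "undocumented": sorted(actual - documented),
        # present in the dictionary but gone from the schema
        "stale": sorted(documented - actual),
    }


# Made-up example: a new column was added without updating the dictionary.
documented = {"customer.id", "customer.email", "orders.id"}
actual = {"customer.id", "customer.email", "orders.id", "orders.total"}

drift = dictionary_drift(documented, actual)
print(drift)

# A release script could refuse to continue while any drift remains.
release_ok = not drift["undocumented"] and not drift["stale"]
print("release ok:", release_ok)
```

A check like this also gives the periodic reviews something concrete to inspect.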
Technology
- This doesn’t have to be an expensive solution. With a little research, you can architect a robust metadata store to address your needs.
- This can be extended with a way to tag each element, along with some thought about who in the organization can view what data.
- Work smart: a good deal of the metadata you need is probably already stored by your data management system. Normalize, add what is missing to address your needs, create views, implement proper access controls, and put the system to work for you.
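To make the technology points concrete, here is a minimal sketch of a metadata store with tagging and a restricted view. It again uses SQLite only so the example is self-contained. The table names, the sensitive flag, and the function find_by_tag are all hypothetical. SQLite has no per-user permissions, so the view only illustrates the idea; on a server database you would grant most users access to the view rather than to the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (
    id        INTEGER PRIMARY KEY,
    system    TEXT NOT NULL,
    name      TEXT NOT NULL,
    data_type TEXT,
    sensitive INTEGER NOT NULL DEFAULT 0   -- 1 = hide from the general view
);
CREATE TABLE tag (
    element_id INTEGER NOT NULL REFERENCES element(id),
    label      TEXT NOT NULL,
    PRIMARY KEY (element_id, label)
);
-- View for a general audience: leaves out elements flagged as sensitive
CREATE VIEW public_element AS
    SELECT id, system, name, data_type FROM element WHERE sensitive = 0;
""")

# Made-up sample data.
conn.executemany(
    "INSERT INTO element (id, system, name, data_type, sensitive) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "crm", "customer.email", "text", 1),
        (2, "crm", "orders.total", "numeric", 0),
        (3, "crm", "orders.id", "integer", 0),
    ],
)
conn.executemany(
    "INSERT INTO tag (element_id, label) VALUES (?, ?)",
    [(1, "pii"), (1, "billing"), (2, "billing"), (3, "billing")],
)


def find_by_tag(conn, label):
    """Return names of non-sensitive elements carrying the given tag."""
    rows = conn.execute(
        "SELECT p.name FROM public_element p "
        "JOIN tag t ON t.element_id = p.id "
        "WHERE t.label = ? ORDER BY p.name",
        (label,),
    )
    return [name for (name,) in rows]


print(find_by_tag(conn, "billing"))
print(find_by_tag(conn, "pii"))
```

Searching through the view means a tag lookup never surfaces an element the audience should not see, even when the tag itself is shared with sensitive elements.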
Actions taken:
- Created a data dictionary using information from several RDBMSs. In less than two hours I had details on hundreds of databases from many servers running a variety of data management systems.
Actions planned:
- Empower people throughout the organization to extend the data dictionary to maximize value to their positions through tagging.
- Add an internal website where people at all levels of the organization can leverage the data dictionary.
- Work with Business Intelligence, Software Teams, Product Owners, and management to further extend and improve the resource to address other needs.
What actions are you taking?
What pains do you have finding and using your organization's data?
How do you plan on addressing your personal or your organization's needs regarding Information Architecture?
How can I improve future articles to bring more value to you?
Thanks for reading!
Please feel free to share this article. I always welcome criticism and differing opinions. I will be posting more lessons learned from this conference and throughout my career going forward. I enjoy learning and would like to hear from you.