The Accelerated Importance of Data Strategy
Lately I have noticed many people creating content, including comments, articles, and images, around the concept that AI requires quality data. And thankfully so, considering many companies are struggling to get started with basic AI use cases due to questionable data or, even worse, basic inaccessibility of data. Questionable data does not necessarily mean the data is in a bad state; it may mean that two or more people or teams cannot agree on which data source(s) to use for concepts such as patients or clinics in healthcare, clients/matters/attorneys in legal, or factories/machines/vendors in manufacturing.
Even with a fairly sound organization-wide data strategy, companies can still run into such issues. With no data strategy, they are definitely encountering them. So you have this incredible technology, generative AI, with all its promise to help differentiate your organization from your competitors, but you cannot get off the ground with it. Unfortunately, this is nothing new. Organizations still struggle to create the most basic reporting, sinking days to weeks into one report that should take minutes to hours. They have bloated billing departments because it takes 20 people to do the work of 3-5, as processes that other organizations (those with a solid organizational data strategy) automated years ago are still done mostly by hand. Application upgrades and migrations take 18 months instead of six because three to six months is spent analyzing everything people are doing with the data from said applications. What does 12 extra months of consulting from the experts on your application cost?
So, what do you do about it? Hire a Chief Data Officer? Ask your Data Architect to figure it out? Either way, you need to get started NOW. What about return on investment, you say? Does anybody question what the return on investment is when purchasing hardware to run your applications? This is decades-old thinking around data management that needs to change before another five years passes by. If you hire a Chief Data Officer, it is certainly reasonable to ask them to come up with the algorithm to report ROI, but that should not stop your data management efforts from starting tomorrow. So, where do you begin?
Data Management Patterns
Every company is already getting data from disparate databases, files, and external sources to use for at least the most basic business reporting and to push data to other software that needs it. For the latter, think of pushing a list of patients to billing and scheduling systems from what you’ve determined is the main source for that data. The diagram below shows this pattern but adds one improvement: a layer of governance. In this case I use Microsoft Purview, the tool of choice for this in the Microsoft ecosystem.
At least by adding a Data Governance tool to the mix, you’ve given the organization the ability to classify and define data coming from your application databases and maybe allow people to answer “what is the primary source for patient data?” That is, if somebody is taking the time to add that metadata to Purview for each application/database/table. The upsides and downsides of this pattern include:
Upsides
· Using technologies any on-premises practitioner knows well
· No additional hard costs to the organization, unless you use something like an ETL tool that comes with additional cost either directly or via how you configure it (for instance, SQL Server Integration Services (SSIS) running on its own server and therefore generating an additional SQL Server license cost)
Downsides
· Managing hardware and software
· Running heavier workloads against important application databases
o Power BI picking up its data directly from the database, either through import mode refreshes or live queries via DirectQuery mode
o Every process that pushes a “Patients” set of data out to another application puts load on the source. If you are copying this data to 15 other applications/databases, that means running the same process a minimum of 15 times per day (see the sketch after this list).
· If the “Patients” data set should consist of data coming from multiple sources, handling that gets complicated. The logic ends up buried in a technical process, and it is hard to represent that business logic in your Purview environment even though you have it.
· Without documentation, there is little information on the lineage of these processes beyond the brains of the technicians who put them together.
· Focusing on AI: if you wanted to build a process allowing people to ask questions about your billing data sitting in a database, that would mean putting even more load directly on your application database.
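To make the load concern concrete, here is a minimal sketch of one of those point-to-point push processes, written in Python with pyodbc. The server, database, and table names and connection strings are hypothetical placeholders; the point is that every additional target application repeats the same full read against the source.

```python
# Minimal sketch of a point-to-point "Patients" push process (pattern one).
# Each target application repeats this full read against the critical source
# database, so 15 targets means at least 15 source-side queries per day.
# Server, database, and table names are hypothetical placeholders.
import pyodbc

SOURCE_CONN = ("DRIVER={ODBC Driver 18 for SQL Server};"
               "SERVER=ehr-db01;DATABASE=EHR;Trusted_Connection=yes;")
TARGET_CONN = ("DRIVER={ODBC Driver 18 for SQL Server};"
               "SERVER=billing-db01;DATABASE=Billing;Trusted_Connection=yes;")

def push_patients():
    # Read the full patient list straight from the application database.
    src = pyodbc.connect(SOURCE_CONN)
    rows = src.cursor().execute(
        "SELECT PatientId, FirstName, LastName, DateOfBirth FROM dbo.Patients"
    ).fetchall()
    src.close()

    # Land it in the downstream application's staging table.
    tgt = pyodbc.connect(TARGET_CONN)
    cur = tgt.cursor()
    cur.execute("TRUNCATE TABLE staging.Patients")
    cur.executemany(
        "INSERT INTO staging.Patients (PatientId, FirstName, LastName, DateOfBirth) "
        "VALUES (?, ?, ?, ?)",
        [tuple(r) for r in rows],
    )
    tgt.commit()
    tgt.close()

if __name__ == "__main__":
    push_patients()  # scheduled once per day, per target application
```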
An improved on-premises pattern that many have used for decades:
With this pattern you have added a centralized data repository to your architecture. Call it a data warehouse or any label you want. The data warehouse server could be a relational database such as SQL Server, Oracle, or Postgres. It may be another “data warehousing product” you bought that runs on Windows or Linux servers. A major improvement is that, if done correctly, you put much less load on your critical application servers. Data gets copied once, and even incrementally, to your Data Warehouse. Power BI gets its data from the Data Warehouse server, and you pull operational data from this server as well. The concept of “Patient Data” can now be a newly formulated table consisting of data from many sources/tables (think Gold layer, if you have heard of the medallion architecture), or it can be a view built on top of those tables. It is now much easier to represent this in Purview.
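Before the trade-offs, here is a minimal sketch of the “copied once, and even incrementally” idea using a high-watermark column. As before, this is Python with pyodbc, and every server, table, and column name is a hypothetical placeholder.

```python
# Minimal sketch of an incremental (high-watermark) load into the warehouse.
# The source database is touched once per load cycle, and only changed rows
# move. All names here are hypothetical placeholders.
import pyodbc

src = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};"
                     "SERVER=ehr-db01;DATABASE=EHR;Trusted_Connection=yes;")
dwh = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};"
                     "SERVER=dwh01;DATABASE=DW;Trusted_Connection=yes;")

# 1. Find how far the warehouse has already loaded.
watermark = dwh.cursor().execute(
    "SELECT COALESCE(MAX(ModifiedAt), '1900-01-01') FROM staging.Patients"
).fetchone()[0]

# 2. Pull only rows changed since then; this is the single touch on the source.
changed = src.cursor().execute(
    "SELECT PatientId, FirstName, LastName, ModifiedAt "
    "FROM dbo.Patients WHERE ModifiedAt > ?",
    watermark,
).fetchall()

# 3. Land the delta in staging. A real process would MERGE/upsert rather than
#    insert blindly; Gold tables or views (the unified "Patient Data") are then
#    rebuilt from staging, never from the source.
cur = dwh.cursor()
cur.executemany(
    "INSERT INTO staging.Patients (PatientId, FirstName, LastName, ModifiedAt) "
    "VALUES (?, ?, ?, ?)",
    [tuple(r) for r in changed],
)
dwh.commit()
src.close()
dwh.close()
```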
Upsides and Downsides to this pattern:
Upsides
· Using technologies any on-premises practitioner knows well
· Potentially little additional cost; maybe as little as one additional server plus the cost of the database or software product of choice.
Downsides
· Managing hardware and software
· Without documentation, there is little information on lineage beyond the brains of the technicians who put the data movement processes together.
· When another high-usage team like data scientists wants to use the data you have centralized, this often leads to building multiple copies of the data on additional servers, often called Data Marts. That means more ETL processes, more things that can fail, and generally more to manage. At least they are not hitting your app servers directly, but it is still more assets to manage and more expense.
· Managing the data loading processes for all use cases calls for expertise in managing those processes. Since more people will now share this data, it is more critical that data lands in the Data Warehouse reliably. All this said, you now need the talents that fall under the headers of Data Engineer and Data Architect. This has historically led to siloed thinking in data management, putting different people into each role, which can lead to communication issues between the various teams.
· Since you have this one data warehouse server (or set of servers), it leads to not wanting too many hands in the cookie jar. In pattern one, an application manager may have built their own data export process to another application. That work will probably now become more centralized, with fewer people doing it, unless you put some thought into how to democratize the activity.
I personally do not consider the last two points to be downsides, but rather things that come with a centralized data strategy. With good management these items do not lead to actual issues. Without good strategy they can become roadblocks to innovation, mainly due to things like people waiting on a data engineer to build an ETL process. You can head this off with some simple rules like “our team will be constructed so that we can add newly needed data to the Data Warehouse within three days”. These types of rules and strategy require leadership.
I do point out these potential issues, though, as they have led to a concept called Data Mesh as a solution. Do not get hung up on what your definition of Data Mesh may be. I refer to the concept that instead of building a Data Warehouse/central data repository, you implement software that connects to any database you want and makes it look like you have a data warehouse. You end up with an interface where it appears you have a consolidated “Patient Data” table. What you really have is a structure behind the scenes that is running live queries against your critical application servers, as with pattern one. Only worse: you are running more complicated queries and putting even more load on your applications than before. I have heard enough feedback from great data practitioners forced into this pattern that I can safely say it is not my own bias when I state this may be great in certain scenarios, but it is not a data management strategy. Those scenarios could include not having the budget for a data management team, or using this pattern for a handful of use cases where the data size is not very large. Just know that this pattern can lead to additional work for your DBAs troubleshooting unexpected performance issues on application databases when the software generates poor quality database queries, or new types of queries where the indexing strategy does not meet the need.
The final pattern here will appear much like the second, but it changes many of the details for the better. This pattern puts a cloud-based solution like Microsoft Fabric in place of the on-premises Data Warehouse server.
I have purposely tweaked this diagram only in a way that makes it appear Fabric is simply a place to centrally store your data and that little else changes about the strategy. The latter part is true in that you still need to invest in great data practitioners to make your strategy work. If you do not, it can lead to the same pitfalls as doing everything on premises, including people waiting too long on requests and stalled innovation. But Microsoft did not just build a bigger, faster database with Fabric.
Instead, they built many storage choices into Fabric, most of which come with significantly less management. Fabric has numerous choices for importing/exporting data, which again means less management once you understand how those choices work. Power BI is a part of Fabric, so it integrates more seamlessly with the overall platform. And those various teams that want to use “your” data? No need to duplicate the data, so once again, less management and building. It is hard to represent all of this on a diagram, but simply put, less management means quicker time to market with ideas and more confidence in the data products being built. The latter happens as your data practitioners spend more time on data products than on building and managing ETL ingestion processes. The technical details behind concepts like less management and no data duplication can be found in other places, including other articles I have written.
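As one small illustration of the no-duplication point, here is what the other team’s side might look like in a Fabric notebook: querying a shared lakehouse table in place (for example, one surfaced through a OneLake shortcut) instead of standing up its own Data Mart. The lakehouse, table, and column names are hypothetical.

```python
# Sketch of a Fabric notebook cell: a data science team querying a shared
# lakehouse table in place rather than building its own Data Mart copy.
# Lakehouse, table, and column names are hypothetical; `spark` is the
# session that Fabric notebooks provide automatically.
patients = spark.read.table("gold_lakehouse.patient_data")

# The analysis runs against the one shared copy; nothing is duplicated to
# another server and no new ETL process is created.
readmissions_by_clinic = (
    patients
    .filter(patients.ReadmittedWithin30Days == True)
    .groupBy("Clinic")
    .count()
)
readmissions_by_clinic.show()
```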
On the surface this sounds like product talk rather than strategy. That is natural when strategy is built into the product. Microsoft, as stated, decided it is no longer good enough to build a bigger, faster database. Instead, decades-old data management issues need to be solved by the platform to bring true value, including time to market, confidence in the data, an easy-to-understand cost model, and the ability for non-technicians to participate. On the last point, everything in Fabric is built around the Workspace concept that came with Power BI. This means people who are not data experts can have their own working areas, and should they build something great, it can easily be incorporated into the overall organizational strategy.
And Purview, when pointed to Fabric, reads information about all of these assets from one place, including your storage, reports, semantic models, data ingestion processes, and the formulations of your data from raw to ready for prime time (think Bronze to Gold layers from the medallion architecture). This allows people to see a holistic view of your data strategy and assets:
Free features of Purview allow you to scan your Fabric environment and then perform data cataloging activities like classifying an object, such as a table in a lakehouse, and giving the asset a longer, paragraph-style description. That is more useful than ever in the age of generative AI, where language context can be searched rather than keywords only.
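That cataloging step can also be automated. Below is a hedged sketch using the open-source pyapacheatlas library against Purview’s Atlas-compatible API; the account name, credentials, entity type, and qualified name are all hypothetical placeholders, and the exact type name depends on how the asset was scanned.

```python
# Sketch: programmatically adding a description to a cataloged asset via
# Purview's Atlas-compatible API using the pyapacheatlas library.
# Account name, credentials, type name, and qualified name are hypothetical.
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity

auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
client = PurviewClient(account_name="contoso-purview", authentication=auth)

# Describe the Gold-layer patient table so search (keyword or semantic)
# can surface it as the primary source for patient data.
entity = AtlasEntity(
    name="patient_data",
    typeName="azure_sql_table",  # depends on how the asset was scanned
    qualified_name="mssql://dwh01/DW/gold/patient_data",
    guid=-1,
    attributes={
        "description": "Unified patient list consolidated from the EHR and "
                       "scheduling systems; the primary source for patient data."
    },
)
results = client.upload_entities(batch=[entity])
print(results)
```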
If you are wondering why a cloud strategy for data rather than an on-premises one, it is pretty simple: the big investments around analytics and data science are being made in cloud-based platforms. If the cloud offering is just a database offering little more than storage, maybe that is not a huge benefit over on premises. But if a platform offers less time for setup, less time to ingest data, a strategy where duplication of data is unnecessary, an easier path to report building, and many built-in AI-related components, that is significant value at both ends of the data management workflow: less work to get results, and faster, better results for the consumer.
That is not to say that Microsoft is not making heavy investments in its standalone database offerings as well. Both SQL Server and Postgres on Azure (and on premises in the case of SQL Server) have an abundance of features, including vector indexes and the ability to make calls to large language models, that make them compelling environments for data management and AI solutioning should budget dictate you use them over an all-encompassing product like Fabric. You can go pretty far with the cloud data management strategy above by using a toolset such as SSIS or Data Factory to move data from on premises to the cloud and process that data, SQL Server/Postgres to store your data, and Purview for data cataloging. The major difference between this strategy and one built around a product like Fabric is that the components are not self-contained under one umbrella, leading to more time managing the overall environment.
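To make the vector-index point concrete, here is a minimal sketch against Postgres using the pgvector extension, which is available in Azure Database for PostgreSQL. The connection details, table name, and embedding values are hypothetical; in practice the embeddings would come from a language model.

```python
# Sketch: vector similarity search in Postgres via the pgvector extension,
# available in Azure Database for PostgreSQL.
# Connection details, table name, and embedding values are hypothetical;
# real embeddings would come from a language model API.
import psycopg2

conn = psycopg2.connect("host=pg01 dbname=dw user=app password=secret")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS billing_notes (
        id        bigserial PRIMARY KEY,
        note      text,
        embedding vector(1536)  -- dimension of the chosen embedding model
    );
""")
# An HNSW index keeps nearest-neighbor search fast as the table grows.
cur.execute("""
    CREATE INDEX IF NOT EXISTS billing_notes_embedding_idx
    ON billing_notes USING hnsw (embedding vector_cosine_ops);
""")
conn.commit()

# Find the five notes closest to a question's embedding (cosine distance).
question_embedding = [0.1] * 1536  # placeholder vector
vector_literal = "[" + ",".join(str(x) for x in question_embedding) + "]"
cur.execute(
    "SELECT note FROM billing_notes ORDER BY embedding <=> %s::vector LIMIT 5;",
    (vector_literal,),
)
for (note,) in cur.fetchall():
    print(note)

cur.close()
conn.close()
```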
Data Team of the Future
In a world where all parts of the data management process become easier, it will be time to assess your team structure. Is there a need to silo people into roles like Analyst, Data Engineer, Data Architect, Business Intelligence Architect, and Database Administrator any longer? Or is it time for people to become more general in their skills? And what about the need to build AI-driven processes? Considering most are very data heavy, does it not make sense that people with data skills could, and maybe should, be involved in building such processes? And should they be faced with writing a little C# or Python code for a process, AI is leading us to a place where three years of experience doing so is no longer needed. For more complicated processes, yes, you have seasoned developers to pitch in, but many AI use cases do not require senior-level coding skills.
I argue that a siloed team today could be transformed into a team of “do it alls” or at least “multi-rolers” (yes, I am making up some new words to go with our new view of the world).
Some on the team may seamlessly play all roles. These become your senior people. Some may play many roles, and some just one or two, for a while. The expectation would be for everybody to be learn-it-alls and grow into taking on more roles. An alternative for companies without relative armies of people in each role today could be to keep part of the team in fewer “data” roles and turn a few team members into your “just AI” builders. Either of these models requires some changes in thinking, and the first, which I personally think is superior having operated this way at the most fun stops in my career, requires very hands-on management. Not micromanaging, but knowing what each employee’s capabilities are, tracking progress, knowing what people are working on, and working on individual growth plans. Knowing what people are working on at a detailed level may feel like micromanaging, but the thought process behind it is something that will be needed for AI success: the ability to track new types of data so you can measure ROI. How will you answer upper management when they ask you to prove the time you are spending on AI is worth it? Maybe a Vice President who was all in on AI moves on and their replacement is a doubter. Whatever the situation, it is always good to have data to help tell your story. For your team, keeping track of this type of data will help you effectively manage a high-performing team doing complicated things. As a manager, your role would be to help them become even higher performing and to determine which requests coming from the business are the highest priority.
A side note for anybody having a slight heart palpitation thinking I may be diminishing the DBA role: not at all. Keeping the relational and NoSQL databases that back your applications humming will be more important than ever to ensure a successful data and AI strategy. But like all roles, the time needed to troubleshoot performance issues or manage hardware will change as AI gets applied to these activities. Just like the traditional data management roles, being a DBA will become easier as well. Why should they be left out of the multi-role fun? And if you really want to make it easier for your DBAs? Consider hosting your databases and servers in Azure, which can decrease your management time on its own. But that is a conversation for another day.
Managing data is more critical than ever to fulfill the potential of AI. The organizations I see that have had solid strategies in place for years or decades are off and running, getting feedback from their employees and customers that AI is having an impact. They may not even know it is AI in some cases; they just see better working conditions and better customer service. Given that I work in the healthcare space, I would love to see every hospital in the world provide better patient care, and even bring care to those who currently cannot receive it, through the use of AI. I would love to see every insurance company process claims and authorizations more efficiently so there are no holdups to care. And I would love every researcher out there to have an efficient pipeline of data to innovate in both treatments and devices, and for their corporate entities to operate as efficiently as possible.
The time is now to invest in a modern data strategy and toolset and to rethink the workforce that will utilize this strategy for the good of humanity. That is not hyperbole; that is the promise of AI once you unblock its implementation in your organization. Alongside security considerations and a strategy for responsible usage and testing of AI, data strategy is one of the major places where that begins.