Dark Data
Frequently in my talks I use this reflection: “When I started out in the industry, most of the data I needed for my work didn’t exist, had never been captured. Today, when interpreters begin pulling information together for their projects, most of the data they are looking for exists, but it is very difficult to find it.” Thus begins the tale of “dark data.”
“Dark data” isn’t just a problem for subsurface characterization but for analysts and data scientists tackling just about any oilfield problem. “Dark data” is a consequence of our traditional ways of managing data, in functional or asset-specific silos, and of the wide adoption of shadow IT systems (spreadsheets). Sometimes dark data happens unintentionally, as the growing influx of data overwhelms the capabilities of existing data management teams, and data ends up in file cabinets, warehouses (physical, not the digital kind) and data closets (often shared drives). Even with the new data lake technologies being deployed, dark data (data without catalog references or master data) sinks to the bottom when it is streamed into the data lake. I read one estimate that suggests less than 5% of the data captured offshore ever makes it back to the home office, although a lot of it is used offshore for “first use” operational decisions.
“Dark data” is data that cannot be found using your search engine, or by talking to your most experienced DBA or data steward, or by pestering the last analyst who worked that area. It is data that has been collected but was never properly cataloged, squirreled away in a local, personal data mart, and is unavailable to anyone trying to use it beyond that point. Most of the time this behavior is not malicious, but an oversight when one hand doesn’t know that the other hand might want that data in a later stage of interpretation. So it becomes a low priority to catalog data that you don’t need immediately. It becomes one of those “I will do it when I get time” priorities, and everyone knows what happens to low priorities in today’s busy world. They never get done.
The economic consequences of dark data are obvious. Companies and individuals want to make the best data-driven decisions possible. They want to have scorecards representing the true state of the business. They want to build accurate data-driven models and simulations for predictions of future events. All of these aspirations will be dashed if the potential contributions of all relevant data are not included because the modeler or the data analyst is unable to locate that data.
So how can we shine a light on dark data in our companies? Ideally, data is considered a corporate asset and as soon as it enters the company it goes through a well-defined data ingestion process, which includes verification, mastering, data quality checks, application of standard definitions and loading into the official systems of record. That requires a fully staffed data management group with the appropriate technology and domain expertise. Does your company have such a process? If so, a lot of your dark data is turned into useful information from the start and data access later isn’t such a pain.
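As a sketch only, the ingestion gate described above (verification, mastering, quality checks and cataloging) can be illustrated in a few lines of Python. Every field name and rule here is a hypothetical stand-in, not a reference to any real corporate system:

```python
# Hypothetical sketch of a data-ingestion gate: verify, master, catalog.
# Field names and rules are illustrative assumptions only.

REQUIRED_FIELDS = {"well_id", "source", "acquired_on"}

def ingest(record, catalog):
    """Verify, normalize, and catalog one incoming record; reject otherwise."""
    # Verification: a record missing its master-data keys is unfindable
    # later -- exactly the dark data problem.
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return ("rejected", sorted(missing))
    # Mastering: normalize the identifier to a standard form.
    record["well_id"] = record["well_id"].strip().upper()
    # Cataloging: register the record so search can find it later.
    catalog.append(record)
    return ("accepted", [])

catalog = []
status, missing = ingest(
    {"well_id": " w-101 ", "source": "offshore feed", "acquired_on": "2018-01-05"},
    catalog,
)
```

The point of the sketch is that cataloging happens at the moment of entry, not as an "I will do it when I get time" afterthought.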
Given my experience, I don’t think that happens all that often. Data enters the corporation from many channels, probably too many to count and more than are understood by the data management group. The velocity and volume are just too great for a small group to manage, so to keep the data flowing, it bypasses any governance process and enters the shadow IT world of the engineers and analysts. Quick analyses, updates and model versioning happen to gain quick value from the new data before the next batch hits the desktop. But most of the data rarely reaches the “official” target systems where later analyses could access it.
Data gets darker the longer it takes to reach official target systems. Given the volume, velocity and variety of today’s digital oilfield, if a good data ingestion process doesn’t exist, it just takes too much effort to go back and correct incoming data. So the data goes dark, remains undiscovered, and never gets used.
Experienced hands will usually recognize bad data or missing data during that “first use” moment, and they will take time to go find missing data or to fix incorrect data for their analysis. But given the “Big Crew Change” that has just about passed through the industry, it is getting harder to find the “old hand” who has this gift. We can apply business and technical rules to data to help weed out the bad stuff, but increasingly, less experienced employees (not less smart ones) won’t be able to know when something is out of place, or when something is missing.
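The business and technical rules mentioned above can be as simple as range and completeness checks that encode what the "old hand" would spot by eye. A minimal sketch, with illustrative thresholds and field names that are assumptions, not industry standards:

```python
# Hypothetical validation rules encoding what an experienced reviewer
# would catch by eye. Thresholds and field names are illustrative only.

def check_record(rec):
    """Return a list of rule violations for one record (empty = clean)."""
    issues = []
    # Technical rule: porosity is a fraction, so it must lie in [0, 1].
    if not (0.0 <= rec.get("porosity", -1.0) <= 1.0):
        issues.append("porosity out of range")
    # Completeness rule: a depth value without units is ambiguous downstream.
    if "depth" in rec and "depth_units" not in rec:
        issues.append("depth has no units")
    return issues
```

Rules like these don’t replace experience, but they let a less seasoned analyst catch the obvious problems before the data goes dark.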
In our effort to become more data driven, I believe we need to pay more attention to the inefficiency in our data management processes that creates so much dark data. The value is not in cutting back on the costs of data storage; it is in regaining the potential to make better decisions with all the relevant data available. We are spending a lot of money and time collecting this data, so we need to approach the data management lifecycle with an eye toward turning more dark data into valuable information.
Reminded me of this article. https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics?utm_campaign=crowdfire&utm_content=crowdfire&utm_medium=social&utm_source=linkedin#OyGYU-5Zg1-li#1512712475438
well said Pete
Great article indeed! Even if the incoming data are "officially" recognized as a corporate asset and are supposed to go through a well-defined ingestion process, there will be ways some of them will follow a different route. The percentage of data concerned will largely depend on the efficiency of the ingestion process and its responsiveness to business needs.
There is an old saying in the oil business, "one can find a lot of oil in the files".
Great stuff. I would also say there's another category of 'Dark Data' in most companies. It's the stuff they have, that they shouldn't. Stuff from expired data licenses, old farm-outs, asset sales and stuff that's walked in the door with new employees, from previous employers. And some of this data does get catalogued and even makes the 'official' archive! I'm constantly amazed by the industry's attitude to this - there seems to be a code of complicit silence around the whole thing.