Dark Data or Data in Darkness
Dark data can be defined as all the data that exists within an organization but is currently useless or unusable: data that is redundant, forgotten, ignored, hidden, or simply unknown to the organization at a given point in time, or that is difficult to find, manage, or exploit for valuable insights.
Where Do These Data Come From, and Why?
These data are either collected by the organization for a purpose that was never clearly defined or implemented, produced by people or IT systems during normal operations but no longer used or useful, or even produced by IT systems without the organization’s knowledge. Other reasons can be:
Examples of “Dark Data”
We often overlook, or even forget, data we have collected for a project, a report, a presentation, etc. However, all this data is still in our operational systems. Some places where dark data may come from or be stored:
Why Should We Care About Them? – Individual Use
At a personal level, I suppose I am not the only one who feels annoyed when searching for an old document created some time ago and trying to figure out which version is the latest. To be sure, I check my sent emails, where I find even more versions than anticipated. I assume I am also one of many who have seen a message warning that there is not enough space and that some files should be cleaned up.
That is the best-case scenario; the worst is when the system crashes or runs very slowly due to a lack of disk space.
Some of us are aware of the need for periodic cleanup, but we are still surprised when we see the number of temporary or large files we have not accessed for years.
Of course, this includes the pictures we took on various occasions, events, and trips, or those of our children, which are never too many and never protected enough. Therefore, just in case, we save them in several locations: on our phones, on our computers, in one or more clouds, and sent (and stored) in emails or messaging applications.
The most curious of us look at system or application logs and history, and are amazed at the volume and the timeframes we have information about. Information we never use.
Those who are aware of technological risks make periodic backups of their computers, phones, or accounts. This is a good thing, but less so if we keep every backup from the past 10 years.
The list of data we store could go on: data that are not only useless but also harm our effectiveness or efficiency. These are the so-called “dark data”.
At An Organizational Level
Unfortunately, all of the above applies just as much at an organizational level, but at a higher order of magnitude in terms of data sources, volume, and potential use.
Moreover, organizations have additional reasons to care about them: many reasons that could be irrelevant for an individual are critical for a business.
Some of them are:
The reasons look obvious, so one could ask why they have not already been addressed if they can cause such trouble for an organization. The answer is that decision-makers do not know about these data, or at least not about the magnitude of their volume or impact. This is the reason we call them “dark data”.
What Can Or Should We Do With Them?
All people in an organization, all the systems, and all the processes collect or generate masses of data, and some of it will inevitably become “dark data”. We cannot ignore it, but we should not even aim to eliminate all of the “dark data”, as the cost of doing so could be higher than the cost of living with them. However, we should try to minimize them, keeping them at a level that does not impact business effectiveness and efficiency.
An approach to treat “dark data”, or any data which looks like “dark data”, would contain several logical steps, such as the following:
“Dark Data” Identification
As the name suggests, these are not always obvious, at least if we are not looking for them. So where could we start? Based on the examples proposed here, or the classification proposed later in this material, we should try to look for them, source by source, process by process, system by system. Of course, using automated tools would make our life easier, and there are a lot of them on the market, including many free ones. Your IT team or admin could also help. In fact, without IT help, this initiative is almost doomed, as aside from their access rights, they have the knowledge of where to look and how to do it.
The beginning may be difficult, but as you start to discover them, it becomes easier to apply similar search patterns and logic to other processes or IT systems.
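As an illustration of what such a search pattern might look like, here is a minimal sketch that walks a folder and lists files that have not been modified for a long time. The function name `find_stale_files`, the one-year default, and the choice of modification time as the staleness signal are all assumptions made for this example (last-access times are often unreliable or disabled); a real scan would be agreed with your IT team.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days=365, now=None):
    """Return (path, size_bytes, age_days) for files not modified in max_age_days.

    Hypothetical helper: "stale" is judged by modification time only.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable or vanished file: skip it
            if info.st_mtime < cutoff:
                age_days = int((now - info.st_mtime) // 86400)
                stale.append((str(path), info.st_size, age_days))
    # Largest files first, so the biggest candidates are reviewed first.
    stale.sort(key=lambda item: item[1], reverse=True)
    return stale
```

Calling `find_stale_files("/some/shared/folder")` (the path is a placeholder) returns a list that can be reviewed before anything is deleted; the sketch only reports and never removes files.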
Quantitative Evaluation
As we mentioned earlier, we should not aim to get rid of all “dark data”, as the cost could easily be higher than the negative impact. So we should aim to remove the most critical ones, which could have a real impact.
So, the first criterion would be the volume of data. The more data that we do not need or use, the worse it is. However, the overall volume is not the only criterion. For example, 10,000 small (10 KB) files could have a larger negative impact than a single 1 GB file, even though their overall volume is ten times lower.
Moreover, the quantitative aspect is not the only criterion. Sometimes one single small file, in the wrong place and containing sensitive information, is reason enough for a clean-up.
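The arithmetic behind the file example can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 GB = 1,000,000,000 bytes), and the file names are made up for illustration.

```python
def summarize(files):
    """Return (file_count, total_bytes) for an iterable of (name, size_bytes)."""
    sizes = [size for _name, size in files]
    return len(sizes), sum(sizes)


KB = 1000       # decimal kilobyte (assumption)
GB = 1000 ** 3  # decimal gigabyte (assumption)

small_files = [(f"note_{i}.txt", 10 * KB) for i in range(10_000)]
one_big_file = [("archive.bin", 1 * GB)]

small_count, small_total = summarize(small_files)  # 10000 files, 100000000 bytes
big_count, big_total = summarize(one_big_file)     # 1 file, 1000000000 bytes
ratio = big_total / small_total                    # 10.0

print(f"{small_count} small files: {small_total} bytes")
print(f"{big_count} big file: {big_total} bytes ({ratio:.0f}x the volume)")
```

The 10,000 small files add up to 100 MB, one tenth of the single 1 GB file, yet they represent 10,000 items to find, index, back up, and review. This is why both the count and the total volume are worth recording.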
Classification – The “Classical” Way
The first attempt to classify these useless or unused data was the concept of ROT data.
The term “ROT data” is derived from the acronym “ROT”, which stands for “redundant, obsolete, or trivial” data. This term has been used for many years in the information management industry to refer to data that is no longer useful or relevant but is still being stored and managed by an organization.
All of these concepts have been included in the scope of this article as they overlap with the concept of ‘dark data’.
Classification means assigning the data to different categories and subcategories, based on different criteria. The criteria could vary from the perception of data, to the source of data, and the impact of data, depending on the specific job requirements of the person doing the classification.
An example could be the following:
However, this is not an easy task, as the classification depends on the purpose, the source, the organization, and the knowledge of the people and functions involved in processing and analysis. Data usually crosses many functions, from generation and collection to processing, storage, analysis, etc.
A More Effective Way of Classifying Dark Data
Taking the same sample of “dark data” and asking different people from different functions is likely to give completely different results. Most likely, IT, Marketing, Legal, Finance, Operational, Compliance, etc., would have different classifications. Even within the same department, like IT, a system or application admin would have a different classification than a cybersecurity specialist, a project manager, a business analyst, or IT management.
The first step in classifying “dark data” would be to separate those that are clearly useless from those that are merely unused but potentially useful. The question is “useful for whom?” The person responsible for collection is rarely the main beneficiary of the data.
Thus, a better alternative to pure classification is to use keywords that describe the characteristics relevant to classification. By collecting as many keywords as we feel relevant, from all relevant stakeholders, we do not need to force the data into a single category, and the keywords even give some hints about the perceived issues with these dark data and the possible ways to address them.
Some possible keywords could be (not exhaustive and in alphabetical order):
However, we will see that perception is different, and what is useless to one could be useful for other stakeholders, and what is too complex for one could be trivial for another. But all this information together would give important hints for the next steps.
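A minimal sketch of how keywords from several stakeholders could be gathered per dataset is shown here. The function name `collect_keywords`, the dataset name, and the stakeholder labels are hypothetical and only illustrate the idea of keeping every perception side by side instead of forcing a single category.

```python
from collections import defaultdict


def collect_keywords(tags):
    """Merge (dataset, stakeholder, keyword) triples into
    {dataset: {keyword: [stakeholders who used it]}}."""
    merged = defaultdict(lambda: defaultdict(set))
    for dataset, stakeholder, keyword in tags:
        # Lowercase so "Abandoned" and "abandoned" count as one keyword.
        merged[dataset][keyword.strip().lower()].add(stakeholder)
    return {
        dataset: {kw: sorted(who) for kw, who in keywords.items()}
        for dataset, keywords in merged.items()
    }


# Hypothetical input: three functions describing the same export.
tags = [
    ("crm_export_2019", "IT", "Abandoned"),
    ("crm_export_2019", "Marketing", "abandoned"),
    ("crm_export_2019", "Legal", "hidden"),
]
summary = collect_keywords(tags)
```

Keeping the stakeholder next to each keyword preserves the differences in perception, which is the information the later steps rely on.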
Impact Estimation
Once we have classified the data, or even before (the order is not of paramount importance here), the next step is to estimate the impact. The estimate consists of two pieces of information: where the impact is highest (compliance, legal, operational, finance, and so on) and the magnitude of that impact.
For each impact area, thresholds should be defined to decide what is worth addressing and what is not. This is iterative work, and it becomes easier with each pass.
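One possible way to express such thresholds is sketched here. The 1-to-5 scoring scale, the threshold values, and the function name `areas_worth_addressing` are assumptions for illustration only; real thresholds would be agreed with the stakeholders of each impact area and revised on each iteration.

```python
# Hypothetical thresholds on a 1 (low) to 5 (high) impact scale.
THRESHOLDS = {"compliance": 2, "legal": 2, "operational": 3, "finance": 4}


def areas_worth_addressing(impact_scores, thresholds=THRESHOLDS):
    """Return (area, score) pairs whose score reaches the area's threshold,
    highest score first. Areas without a defined threshold are skipped."""
    flagged = [
        (area, score)
        for area, score in impact_scores.items()
        if area in thresholds and score >= thresholds[area]
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical scores for one group of dark data.
scores = {"compliance": 4, "legal": 1, "operational": 3, "finance": 2}
to_address = areas_worth_addressing(scores)
# to_address == [("compliance", 4), ("operational", 3)]
```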
Identification of (Probable) Cause
In order to decide what to do with them, we should understand why these “dark data” are there. Are they a normal artefact of a business process, which we should delete after a while? Are they due to a misconfiguration of an IT system or a broken process (especially a cross-functional one)? Or are they a result of changes in the IT systems or the business process? This is important to know, in order to prevent their generation after we clean up the mess.
Decide On “Dark Data” Treatment
After we have a quantitative estimation of these data, impact estimation, probable cause, and classification, we can decide what to do with them. Generally speaking, we have to choose from the following alternatives:
Of course, the decision is not always obvious, but the keywords used in classification could give a hint about the needed actions.
For example:
Abandoned, awaiting, disregarded, ignored – ask the intended usage owners to decide: delete/organize for use
Broken, fragmentary, incomplete, inconsistent, insignificant, counterproductive, impractical, irrelevant – delete
Negligible – delete or let them be
Disorganized, irreconcilable, incongruent – label, analyze, then decide
Hidden, masked, meaningless, omitted, too complex – analyze possible usage, then decide
Inaccessible – get access and analyze or delete/archive
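The keyword examples above can be turned into a simple lookup, sketched here. The keyword groups and actions are transcribed from the list; the names `TREATMENT_GROUPS` and `suggest_treatments` are hypothetical, and the output is only a hint for discussion, not an automatic decision.

```python
# Keyword groups and suggested actions, transcribed from the examples above.
TREATMENT_GROUPS = [
    (("abandoned", "awaiting", "disregarded", "ignored"),
     "ask the intended usage owners to decide: delete or organize for use"),
    (("broken", "fragmentary", "incomplete", "inconsistent", "insignificant",
      "counterproductive", "impractical", "irrelevant"),
     "delete"),
    (("negligible",), "delete or let them be"),
    (("disorganized", "irreconcilable", "incongruent"),
     "label, analyze, then decide"),
    (("hidden", "masked", "meaningless", "omitted", "too complex"),
     "analyze possible usage, then decide"),
    (("inaccessible",), "get access and analyze, or delete/archive"),
]

# Flatten the groups into a keyword -> action dictionary.
TREATMENT_BY_KEYWORD = {
    keyword: action
    for keywords, action in TREATMENT_GROUPS
    for keyword in keywords
}


def suggest_treatments(keywords):
    """Return the distinct suggested actions for the given keywords,
    in the order they first appear. Unknown keywords are ignored."""
    suggestions = []
    for keyword in keywords:
        action = TREATMENT_BY_KEYWORD.get(keyword.strip().lower())
        if action is not None and action not in suggestions:
            suggestions.append(action)
    return suggestions
```

When stakeholders have used keywords from different groups for the same data, the function returns several actions, which signals that the decision needs a conversation rather than a rule.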
Implement Measures to Avoid or Minimize Dark Data Volume and Impact
After we have treated the data, it would be more efficient to implement the needed changes into the business processes or IT systems configuration in order to minimize these “dark data”, or at least to identify them in due time, for proper treatment.
Implementing a data management and governance framework, if one is not already in place, would definitely make this process easier; some categories of dark data would not even appear, as they would be discovered and treated in due time.
If you already have one, it would be useful after this effort to update your business glossary, data dictionary, and catalog.
Continuous Monitoring Of “Dark Data” Categories
It seems we have done everything that needed to be done. Not quite. There still remains one activity to be performed: monitoring the “dark data” present in our business environment. IT systems are continuously changing, and business processes change as well, not to mention people. As a result, what was under control yesterday could become a problem tomorrow. Monitoring the dark data we already know about, or overall categories such as application logs, and implementing similar controls for new systems or initiatives as soon as possible, could save us a lot of time and money.
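A minimal monitoring sketch could compare two snapshots of the volume per category and flag what has grown or newly appeared. The category names, the 20% growth limit, and the function name `flag_growth` are assumptions for illustration; how the snapshots are produced depends on the tools in use.

```python
def flag_growth(previous, current, max_growth=0.20):
    """Compare two {category: size} snapshots and return the categories
    that are new or grew by more than max_growth (0.20 means 20%)."""
    flagged = {}
    for category, size_now in current.items():
        size_before = previous.get(category)
        if size_before is None:
            flagged[category] = "new category"
        elif size_before > 0:
            growth = (size_now - size_before) / size_before
            if growth > max_growth:
                flagged[category] = f"grew {growth:.0%}"
    return flagged


# Hypothetical snapshots, sizes in GB.
last_month = {"application_logs": 100, "old_backups": 200}
this_month = {"application_logs": 150, "old_backups": 210, "temp_exports": 5}
alerts = flag_growth(last_month, this_month)
# alerts == {"application_logs": "grew 50%", "temp_exports": "new category"}
```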
Data Protection/ Management / Governance & AI dude | Engineer by background, Entrepreneur/Intrapreneur by choice | Proud dad of 2 daughters | Human Being by design and by default
2yThanks for sharing my article, but citation of the external source... and author would be required. BTW, mentioning my name twice in the beginning is not attribution and PECB Magazine has quite strict rules for quotation, including for me, as author