What is 'DeNISTing'?
DeNISTing is a process that identifies and removes unusable files in eDiscovery data sets. It compares your files with a central list and removes the ones deemed unnecessary. It can be handy for speeding up review, but it isn’t the only option for effective culling; in fact, it stems from a type of pricing that could actually be harmful to your eDiscovery. Read on to find out why, and whether there are alternatives that will stand you in better stead!
As eDiscovery matures, service providers are constantly developing new features for it.
A significant chunk of small law firms’ expenses comes from eDiscovery, so service providers commonly add features that help speed up the process. These include automated Optical Character Recognition (which makes scanned documents machine-readable and searchable), ‘advanced’ searches (which let you look for very specific text and ‘metadata’, i.e. information about files), and intelligent ‘tagging’ (to label and quickly retrieve related information).
And to speed up the process, attorneys reduce the volume of data they have to review by culling files that aren’t relevant to the case.
Many of the files stored on computers are required to run applications and software, and aren’t reviewable; when opened, most of them just display code that would make no sense to the average user. Hardware like scanners and keyboards, for example, comes with ‘driver’ files that enable the computer to correctly recognize and interact with it. And your operating system (e.g. Windows or macOS) needs many thousands of ‘system’ files to run. The vast majority of these files are not needed for your case, but they bloat your data set, increasing costs and clutter, until they are culled.
The idea behind DeNISTing is to quickly and efficiently get rid of this excess baggage.
The process involves comparing files in your data set against a central repository of identified and cataloged files that are deemed ‘unnecessary.’ Any matches are then deleted. The ‘NIST’ in deNISTing stands for the National Institute of Standards and Technology, a federal agency under the U.S. Department of Commerce that promotes industrial competitiveness through innovation. As part of this effort, it maintains a list of files that ‘do not contain evidentiary value’ (published through its National Software Reference Library), and service providers configure the eDiscovery software they offer to compare your files against the NIST list to determine which ones are unnecessary.
But what actually happens during deNISTing? It functions based on the principle of ‘hashing,’ which generates a unique ‘fingerprint’ for each of your files.
Hashing is the process of creating a digital ‘fingerprint’ for a file, known as a hash value, which consists of a long string of numbers and letters (e.g. 4875dcb6dd62d110320c5a7a961685fe). Matching on hash values (or ‘hashes’) is extremely accurate, so eDiscovery software can just compare the hashes of your files against those of files in the NIST list without needing to ‘read’ the information in each file. That is a big advantage, since the NIST list has over 28 million files to compare against! And hashes are so precise that changing even a single punctuation mark in a file will change its hash value, so in practice only files that are on the list will be caught and deleted.
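To make the comparison concrete, here is a minimal sketch in Python. It assumes MD5 hashes (the 32-character example above has that shape) and uses a tiny, made-up set of ‘known’ hashes as a stand-in for the NIST list; the file names, file contents, and the `denist` helper are all hypothetical, and real eDiscovery software will work differently under the hood.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of some bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def denist(files: dict[str, bytes], known_hashes: set[str]) -> tuple[list[str], list[str]]:
    """Split file names into (kept, removed) based on exact hash matches."""
    kept, removed = [], []
    for name, content in files.items():
        # Only the fingerprint is compared; the content is never 'read' for meaning.
        if md5_hex(content) in known_hashes:
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


# Hypothetical stand-in for the NIST list: the hash of one 'system' file.
system_file = b"pretend this is a printer driver"
known_hashes = {md5_hex(system_file)}

# Hypothetical data set: one system file and one reviewable document.
files = {
    "driver.sys": system_file,
    "contract.txt": b"The parties agree to the terms below.",
}

kept, removed = denist(files, known_hashes)
print(kept, removed)  # ['contract.txt'] ['driver.sys']
```

Storing the known hashes in a set keeps each lookup fast no matter how long the list is, which matters when the list runs to tens of millions of entries.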
Sound too good to be true? Unfortunately, DeNISTing has a number of limitations that prevent it from being a real priority when it comes to eDiscovery features.
It can be helpful during the very early stages of data collection, but it is quickly overtaken by other concerns that it can’t address:
- eDiscovery data sets often have ‘system’ files already omitted. For example, when you need to source an email inbox from a client, they will likely send you a .PST or .MBOX file containing the emails, not a copy of their entire computer hard disk. And it’s common for computers to have multiple ‘partitions’ on their hard drives with one dedicated to system files, keeping other partitions for documents and files that users actively interact with, allowing those to be easily separated and shared.
- It’s not just unreviewable files that have to be culled. Frequently there are thousands, or even millions, of emails, PDFs, and Word documents that need to be culled as well.
- The NIST list is nowhere near comprehensive. Most software these days receives frequent updates, and new file types are constantly created, so a lot of newer files aren’t covered.
- The hashing comparison is inherently limited. Even though 28 million hashes sounds like a lot, the comparison only catches exact matches. Here the accuracy of hashing limits its usefulness, since even a single-character difference in a file completely changes its hash value. So configuration files for operating systems and applications, for example, which differ based on your personal settings, would be completely missed.
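The exact-match limitation is easy to see in a short Python sketch. The two configuration files below are invented for illustration, MD5 is assumed as the hash, and the one-entry `known_hashes` set is a hypothetical stand-in for the NIST list.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical config file as shipped, and the same file after one setting was changed.
default_config = b"theme=light\nlanguage=en\n"
customised_config = b"theme=dark\nlanguage=en\n"

# Assume only the shipped version made it onto the list.
known_hashes = {md5_hex(default_config)}


def is_on_list(content: bytes) -> bool:
    """True only when the content's hash exactly matches a listed hash."""
    return md5_hex(content) in known_hashes


print(is_on_list(default_config))     # True
print(is_on_list(customised_config))  # False
```

Both files are equally useless for review, but only the untouched one would be removed.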
There’s an alternative to deNISTing that’s more effective in the long run, and is more readily available - look for an eDiscovery service that finds ‘unreviewable’ file formats.
No one repository can cover every system file and stay up to date, given how frequently software is created and updated. eDiscovery providers frequently adopt a more efficient approach: removing files in formats that are not supported for review, like EXEs and DLLs, which are needed to run software but often have no value for eDiscovery review. Not only does this cast a wider net, catching a larger variety of unusable files; it also avoids the time it takes to compare the hashes of all your files against the NIST list. If, however, you do choose to look for a service that has deNISTing, be sure to check that the provider actually makes use of the NIST list, and not a private hashing system (which some services use, but which isn’t as effective).
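As a rough illustration of format-based culling, the Python sketch below drops files by extension. The extension set and file names are hypothetical, and going by extension alone is an assumption made for brevity; a real service may identify formats in other ways.

```python
from pathlib import PurePath

# Hypothetical set of formats treated as unreviewable (EXEs and DLLs, as above).
UNREVIEWABLE_EXTENSIONS = {".exe", ".dll"}


def is_unreviewable(filename: str) -> bool:
    """True when the file's extension is in the unreviewable set (case-insensitive)."""
    return PurePath(filename).suffix.lower() in UNREVIEWABLE_EXTENSIONS


def cull_by_format(filenames: list[str]) -> list[str]:
    """Return only the files that are worth passing on to review."""
    return [name for name in filenames if not is_unreviewable(name)]


names = ["setup.EXE", "helper.dll", "memo.docx", "inbox.pst"]
print(cull_by_format(names))  # ['memo.docx', 'inbox.pst']
```

Unlike the hash comparison, this check doesn’t care which version of a program a file came from, so newly released or updated software is caught without waiting for a list to be refreshed.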
Sometimes the urge to quickly cull files can actually work against eDiscovery though, and it’s worth taking a moment to look at the bigger picture.
Don’t get me wrong, in some situations deNISTing can be helpful in early culling. However, its creation and use stem from a wider trend of trying to quickly eliminate as many documents as possible, which can lead to potentially useful information being unintentionally discarded as well. This is often caused by a particular eDiscovery pricing model: one in which you are charged per GB of data that you upload. Culling files before they are even reviewed means that you won’t have to pay for the uploads or data processing, but it forces a compromise on the quality of your review. What if we were to change the payment model and uncouple pricing from the amount of data you upload? The priority would shift away from getting rid of as much data as fast as possible, letting you focus more on review than on culling. One such approach lets you upload your data for free and charges only for the volume of data stored, prorated. This means that you can upload data as and when needed, review it, and delete it, paying only for the volume that’s actively stored by the provider.
Interested in checking out this kind of pricing model? It’s what we use at GoldFynch, and it’s made eDiscovery less daunting for hundreds of firms that otherwise struggled with upload costs. You can find out more about it here!