Dig "Big" Data
Data Classification
Data is the lifeblood of any business, and it is therefore always surrounded by potential threats. Organisations are on constant alert against the loss or abuse of their data. So, whenever an organisation prepares to collect and secure data, it first needs to understand how to classify that data.
Data can broadly be divided into Structured and Unstructured. Each type presents different challenges, especially when it comes to data security. As the name suggests, "Structured Data" is usually stored in relational databases and displayed in columns and rows, which allows algorithms and machine-learning tools to access and analyse it easily. That same ease makes it practical to monitor and secure data stored across billions of rows and columns, and many tools are available to collect and analyse structured data in support of business decisions. Inventory management systems and customer relationship management (CRM) systems are typical examples of structured data sources.
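To make the point concrete, here is a minimal sketch (the inventory table and its column names are hypothetical) of why structured data is easy to analyse and monitor: every value sits in a known row and column, so a tool can query or restrict specific fields directly.

```python
import sqlite3

# Structured data: a relational table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT, item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("A-100", "keyboard", 40), ("A-101", "monitor", 15)],
)

# Because every value lives in a known column, monitoring rules can
# target specific fields, e.g. items whose stock fell below a threshold.
low_stock = conn.execute(
    "SELECT item FROM inventory WHERE quantity < 20"
).fetchall()
print(low_stock)  # [('monitor',)]
```

The same query-by-column access is exactly what is missing from a free-form document, which is why unstructured data is harder to secure.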
"Unstructured Data", by contrast, is raw and unorganised: free-form content that moves around in files, often as binary, proprietary data with no identifiable internal structure. The most reliable way to secure unstructured data is to convert it into structured data, but this is costly and time consuming, and not every type of unstructured data can easily be converted into a structured model. Unstructured data comes from two sources: it is either human generated or machine generated.
Typical human-generated unstructured data includes:
- Text files: Word processing, spreadsheets, presentations, PDFs, logs.
- Social Media: LinkedIn, Facebook, Twitter data.
- Website: YouTube, Instagram, photo sharing sites.
- Mobile data: Text messages, locations.
- Communications: Chat, IM.
- Media: Audio/Video files and Digital Photos.
Typical machine-generated unstructured data includes:
- Satellite imagery: Weather data, land forms, military movements.
- Scientific data: Space exploration, atmospheric data.
- Digital surveillance: Surveillance photos and video.
- Sensor data: Traffic, weather and various IoT sensors.
The files listed above can be stored and managed without the system understanding their format; because their contents are unorganised, they are stored in an unstructured fashion. Analysts commonly estimate that around 80% of business-relevant information originates in unstructured form, primarily text.
Apart from these two major types, there is a third category that sits on the thin border between structured and unstructured data: "Semi-Structured Data". The classic example is email. Although email is often classified as unstructured, it carries some internal structure thanks to its metadata (sender, recipient, subject, timestamps); its message body, however, remains unstructured and cannot be parsed by traditional analytics tools, which is what makes email semi-structured. NoSQL documents and JSON are other examples of semi-structured data.
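The email example above can be sketched with Python's standard `email` parser (the message itself is invented for illustration): the headers come back as structured key/value pairs, while the body stays free-form text.

```python
from email.parser import Parser

raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers

The body is free-form prose that no parser can break into fields."""

msg = Parser().parsestr(raw)

# The metadata is structured: each header is an addressable field.
print(msg["From"])     # alice@example.com
print(msg["Subject"])  # Quarterly numbers

# The payload is unstructured: one opaque block of text.
print(msg.get_payload())
```

This split, addressable metadata wrapped around an opaque body, is what "semi-structured" means in practice.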
Data Monitoring
Data is monitored mainly through policies, and these policies are specific to the data types described above. The preferred approach is to index the data. For structured data, indexing is performed cell by cell and column by column; for unstructured data, it is mainly done on the exact or partial contents of files. Once indexing is complete, it no longer matters whether a user saves the data in a file, pastes it into an email body, or attaches it as a file: the hashed index contents are used to scan and compare data at rest as well as data in motion. Apart from unclean data sources, another challenge for indexing is dispersed data, where different types of data are stored in different locations.
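A minimal sketch of the hashed-index idea, with invented sample values and a deliberately naive tokeniser: sensitive cells are fingerprinted once, and any outgoing content is scanned against those fingerprints regardless of where the data travels.

```python
import hashlib

def fingerprint(value: str) -> str:
    # Normalise, then hash, so the same cell value matches whether it
    # appears in a file, an email body, or an attachment.
    return hashlib.sha256(value.strip().lower().encode()).hexdigest()

# Index sensitive cells taken from a (hypothetical) structured source.
sensitive_cells = ["4111-1111-1111-1111", "jane.doe@example.com"]
index = {fingerprint(v) for v in sensitive_cells}

# Scan data in motion: flag any token whose fingerprint is indexed.
outgoing = "please send card 4111-1111-1111-1111 to accounting"
hits = [t for t in outgoing.split() if fingerprint(t) in index]
print(hits)  # ['4111-1111-1111-1111']
```

Real products tokenise and normalise far more aggressively, but the comparison step, hash lookup rather than plaintext search, is the same.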
For the non-indexing approach, policies are based on keywords, data identifiers, regular expressions, and rules on user and file properties. The breadth of a data identifier (wide, medium, or narrow, e.g. a pattern match alone versus a pattern match combined with a Luhn check) can be configured in the policy. For email monitoring, user-identity rules apply filters such as sender/receiver match, domain, user name, email address, intranet or internet origin, ranges of IP addresses, and beginning/ending characters. Proximity between data elements can also be used to improve monitoring accuracy. Most data-monitoring software applies a two-tier methodology, so that these policies can be combined with the protocol in use and the endpoints through which data is exposed.
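The wide-versus-narrow breadth distinction can be illustrated with a card-number identifier (the sample text and pattern are illustrative): a regular expression alone matches any plausible digit run, while adding a Luhn checksum validation discards numbers that cannot be real card numbers.

```python
import re

def luhn_valid(number: str) -> bool:
    # Luhn checksum: double every second digit from the right,
    # subtract 9 from any double above 9, and sum all digits;
    # the number is valid when the total is divisible by 10.
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Wide breadth: the regex alone flags any 13-16 digit run.
candidate_re = re.compile(r"\b(?:\d[ -]?){13,16}\b")
text = "order 1234 5678 9012 3456 vs card 4111 1111 1111 1111"
candidates = candidate_re.findall(text)

# Narrow breadth: the run must also pass the Luhn check.
confirmed = [c for c in candidates if luhn_valid(c)]
print(confirmed)  # ['4111 1111 1111 1111']
```

Narrowing the breadth this way trades a little recall for far fewer false-positive incidents, which is usually the right trade for card numbers.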
Natural Language Processor (NLP)
Monitoring data against policies and scanning by keywords, data identifiers, and regular expressions is straightforward for whitespace languages such as English, where word boundaries are clearly marked. Non-whitespace languages such as Chinese, Japanese, and Korean (CJK), however, have word boundaries that are not apparent, which can lead to false-positive incident generation. Third-party Natural Language Processing tools therefore play a very important role in improving linguistic accuracy while protecting data.
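The boundary problem is easy to demonstrate (the Chinese sample phrase is illustrative): whitespace tokenisation splits the English sentence into words, but returns the entire Chinese phrase as one token, so per-word keyword policies cannot anchor on it.

```python
english = "credit card number leaked"
chinese = "信用卡号码泄露"  # roughly: "credit card number leaked"

# Whitespace tokenisation works for English...
print(english.split())  # ['credit', 'card', 'number', 'leaked']

# ...but yields a single undivided token for Chinese.
print(chinese.split())  # ['信用卡号码泄露']

# A raw substring scan still fires, but with no word boundaries it
# cannot distinguish a genuine term from the same characters occurring
# inside an unrelated compound -- the gap that third-party NLP
# segmentation tools are meant to close.
print("信用卡" in chinese)  # True
```

This is why CJK policies that rely on plain keyword or regex matching tend to over-report until a segmenter supplies real word boundaries.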