Data Storing

Background of Data Storage (from SMP to MPP)

Since the birth of the computer, storing data reliably and cost-efficiently has been a major challenge. Until the late 1990s, the only viable solution was high-powered, pricey storage systems. But as the Big Data era opened, existing solutions began to show critical shortcomings under exploding data volumes and increasingly complex data demands.

The biggest problem was cost. As more and more applications were implemented, the associated digital data grew exponentially. Naturally, the cost of additional or new storage systems took up too large a portion of the IT budget. Once-promoted thin provisioning based on storage virtualization was only a stop-gap measure.

The second issue was performance, especially the high throughput required for massive data processing. Solutions like increasing network bandwidth between the storage and server layers, or raising disk I/O in the storage layer, could not solve the fundamental structural problem: the bottleneck at the centralized storage.

An architecture where multiple application servers are connected to a single storage system is called SMP (Symmetric Multi-Processing). SMP relies mainly on 'scale-up' when capacity or performance needs to be upgraded. But 'scale-up' is costly and still does not solve the bottleneck problem.

Things began to change in the late 1990s as new dot-com companies emerged. Under heavy pressure to store and process massive data with limited budgets, they needed an alternative that overcame the fundamental drawbacks of SMP. Their answer was to distribute centralized data across multiple servers in a cluster and let each server process its locally stored slice of the data in parallel. For massive data storing and processing, this architecture beat SMP in both cost and performance. It could also easily adopt a 'scale-out' approach using commodity hardware to grow capacity and performance, providing flexibility and cost efficiency. This new distributed parallel architecture is called MPP (Massively Parallel Processing).

MPP Prime Time

MPP has become the de facto standard for data systems in the Big Data era. Not only its cost-efficiency in data storing, but also its query performance on high-volume data is far superior to SMP. Why? Say the task is to count every occurrence of the word 'data' in a 10-volume encyclopedia. If one Harvard graduate who majored in applied mathematics competes against ten fifth graders to see who finishes first, who is likely to win? The former must go through all ten volumes alone, doing simple counting for which his extensive knowledge of math and statistics is useless. Each member of the latter group, on the other hand, needs to check only one volume and do the simple math. Naturally, the latter wins despite the former's high-level education. In the same sense, MPP is far superior to SMP for massive data processing.
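The analogy maps directly onto a scatter-gather pattern: split the data, count locally, then sum the partial results. Below is a minimal sketch using Python's multiprocessing module; the "volumes" are made-up stand-ins for the encyclopedia, not part of the original example.

```python
# Minimal MPP-style sketch: each worker counts a word in its own "volume"
# (like one node scanning its local data), then the partial counts are summed.
from multiprocessing import Pool

def count_word(volume: str, word: str = "data") -> int:
    """Each worker scans only its local volume, like a single MPP node."""
    return volume.lower().split().count(word)

if __name__ == "__main__":
    volumes = [
        "data systems store data reliably",
        "parallel processing splits data across nodes",
        "each node counts its own data locally",
    ]  # hypothetical stand-ins for the encyclopedia volumes
    with Pool(processes=len(volumes)) as pool:
        partial_counts = pool.map(count_word, volumes)  # scatter the work
    print(sum(partial_counts))  # gather the partial results
```

Real MPP engines follow the same shape at scale: the query planner pushes the counting down to the nodes that hold the data, and only the small partial results travel over the network.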

MPP has another benefit. Many MPP systems run on commodity hardware, enabling easy scale-out to flexibly increase capacity and performance. Thanks to this easy availability, many MPP systems keep multiple copies of operating data as a backup. HDFS, for example, stores three replicas of each block by default (the original plus two copies), and Vertica provides a 'K-safety' feature that copies each node's data to adjacent nodes. Of course, backup is not exclusive to MPP. But building a backup for SMP entails heavy investment in either scaling up or an additional backup system, and the price climbs steeply as data grows.
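For a concrete feel of how replication is controlled, here is a small sketch that drives the standard `hdfs dfs` command line from Python. It assumes a configured Hadoop client on the machine; the HDFS path is a hypothetical placeholder.

```python
# Sketch: inspect and adjust the replication factor of an HDFS path via the
# standard `hdfs dfs` CLI (assumes a working Hadoop client configuration).
import subprocess

path = "/data/warehouse/events"  # hypothetical HDFS path

# List the path; for files, the replication factor appears in the listing.
subprocess.run(["hdfs", "dfs", "-ls", path], check=True)

# Set replication to 3 (the HDFS default: original block plus two copies).
# The -w flag waits until re-replication actually completes.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", path], check=True)
```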

SSOT and MVOT

MPP is not the whole story. Having data storage that is cost-efficient and high-performing is only a starting point. The real challenge is designing it to minimize data isolation and duplication, to maintain data consistency, and to meet downstream users' requirements. These concerns touch not only data storage but all other data engineering activities as well, but I would like to address them here.

The basic strategy of data storage design is to find the balance between two contradicting views: SSOT (Single Source of Truth) and MVOT (Multiple Versions of Truth). SSOT is about maintaining consistency; it prevents incorrect data from circulating by keeping the original data in one place. MVOT is the opposite approach, where data is stored in multiple locations to meet the requirements of each user group. In a traditional data system, the data warehouse is the SSOT, and the data marts are the MVOTs.

A healthy balance looks like this: one data warehouse serves as the SSOT, and the data marts, all sourcing their data from that warehouse, serve as MVOTs. In reality, however, you can easily find cases where the boundary between SSOT and MVOT collapses: data marts fed from sources other than the warehouse, or a warehouse taking data marts as its source. When data pipelines are intertwined like that, it is almost impossible to figure out which version is the ground truth. So all data must be easily traceable from SSOT to MVOT and vice versa. To make that possible, keeping a history of data transformations is essential; this history is called 'data lineage'.
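To make the pattern concrete, here is a minimal sketch in Python with pandas: one authoritative table plays the SSOT, each mart (MVOT) is derived only from it, and every derivation is recorded as a lineage entry. The table, columns, and mart names are hypothetical illustrations, not a prescribed design.

```python
# Sketch of the SSOT/MVOT pattern with a simple lineage log (pandas).
import pandas as pd

# SSOT: the single authoritative table (think: data warehouse).
warehouse = pd.DataFrame({
    "region": ["KR", "KR", "US", "US"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 200, 50],
})

lineage = []  # history of transformations from the SSOT to each MVOT

def derive_mart(name, transform):
    """Build an MVOT from the SSOT and record how it was derived."""
    mart = transform(warehouse)
    lineage.append({"mart": name, "source": "warehouse",
                    "transform": transform.__name__})
    return mart

def sales_by_region(df):  # hypothetical mart for the sales team
    return df.groupby("region", as_index=False)["sales"].sum()

def kr_only(df):          # hypothetical mart for the Korea office
    return df[df["region"] == "KR"]

marts = {
    "sales_by_region": derive_mart("sales_by_region", sales_by_region),
    "kr_only": derive_mart("kr_only", kr_only),
}
print(lineage)  # traceable both ways: mart -> source and source -> marts
```

Because every mart declares its source and transformation, you can always walk from an MVOT back to the SSOT, which is exactly what collapses when marts start feeding each other ad hoc.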

This is where the EDW (Enterprise Data Warehouse) and the data lake come in. Both are, to put it simply, ultimate SSOTs. As the boundary between SSOT and MVOT blurred and many existing data warehouses failed to play the SSOT role, the alternative was to build a new layer: the first destination of all data. The remaining data storages are treated as MVOTs generated from that single SSOT layer. This is the basic concept behind many big data platforms, including those of commercial cloud providers like Amazon and Google.

What I want to emphasize here is that striking a balance between SSOT and MVOT is a tall order that requires high-level expertise in data system design. SSOT and MVOT are nothing new; finding the balance between them has been a major task since the dawn of data systems. Yet many seem to believe they can thrive simply by having new technologies. For instance, quite a number of organizations have mindlessly adopted Hadoop, hoping to use it for everything from SSOT and MVOT to advanced analytics. It is not impossible, if you have enough internal capability and experience to design, implement, and operate the system properly. But the reality is that the lion's share of them do not. More importantly, the number of talented, seasoned Hadoop engineers is very limited. Ergo, the money you save by adopting open-source technology can easily be dwarfed by talent acquisition and retention costs if the system is poorly designed and implemented. Priority must be given to assigning the right people to the right job, not to technology. Then let them choose the right tech and product.

Compatibility

As a central piece of a bigger platform, every data storage must be compatible with the other parts of the platform: upstream data ingestion and preparation tools, downstream analysis and visualization tools, and its fellow data storages. If not, it becomes an isolated data silo, which organizations must avoid, because data becomes a strategic asset only when it is well-circulated and well-enriched.

Compatibility now carries greater weight, since today's data systems combine many more technologies and tools than before. Accordingly, many data storage technologies and tools of the Big Data era promise a wide range of compatibility. One caveat: when you select a data storage, do not take the vendor's word at face value. Test and verify the required compatibilities yourself through a PoC.
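One of the simplest PoC checks is also one of the most telling: can the platform components actually reach each other? Below is a minimal connectivity smoke test in Python using only the standard library; the hosts and ports are hypothetical placeholders for your own environment, and a real PoC would go well beyond this to exercise drivers, formats, and queries.

```python
# Sketch of a PoC-style smoke test: verify that the storage endpoints a
# platform depends on are reachable before trusting vendor compatibility claims.
import socket

ENDPOINTS = {  # hypothetical hosts/ports; replace with your environment's
    "warehouse": ("warehouse.internal", 5432),
    "object_store": ("objectstore.internal", 9000),
    "hdfs_namenode": ("namenode.internal", 8020),
}

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, (host, port) in ENDPOINTS.items():
    status = "OK" if is_reachable(host, port) else "FAIL"
    print(f"{name:15} {host}:{port} -> {status}")
```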

