ZeroCopy Data Virtualization

In order to derive any kind of enterprise-wide intelligence, data “needed” to be consolidated in one place. ETL pipelines funneling data into data warehouses faithfully did this job, until data got much more complicated. Let’s see what kinds of complications we are talking about, and which assumptions behind the ETL-led data warehouse, assumptions that worked in the past, no longer hold. The proliferation of Machine Learning and IoT, often for good reason, brings new demands to consolidate data in new ways. Increasing complexity in data also brings increasing complexity in how we protect that data.

Let’s see some of the prior techniques and assumptions that no longer remain useful.

  1. Batch Transformation - problems with both the transformation itself and the fact that it’s done in batches
  2. Managing Security policies at a central location where the consolidated data is loaded 
  3. One huge monolithic consolidation of all data you can get

The problems with technique number one - “Batch Transformation” - are the following:

  1. Transformation changes the original data in order to unify it with other data, which leads to losing some of the original meaning of the data itself. Names always carry some meaning - losing a name means losing the meaning associated with it too. We will address later in this article how data virtualization deals with this issue.
  2. Batch transformation is a lengthy process, and transformed data can only be used after the transformation job is complete. Some decision-making processes, especially in IoT, banking, and telco, need the current state of the data to make the right decision - historical piles of data do not help much in those areas.
  3. Loss of Single Source of Truth - transformation is not migration; it just transforms one datasource into another and combines it with others. This technique creates multiple copies of the facts conveying the truth. The problem is that data changes - not all data is immutable time-series data. When it changes, the copies get out of sync, and the question begins: “Who is telling me the truth?”

Problems with technique number two - “Managing security policies at a central location …”. This one really is a huge mess.

  1. Duplicated security policies - when data is collected into and held at a single central location, that central dataset needs to be protected with its own access control policies: who can access what. This duplicates the authorization policies already enforced at the source datasets. Beside duplicated authorization policies, other data protection mechanisms need to be duplicated too. Often some of the protection mechanics used at the source are never duplicated at all, which leaves the central copy less protected than the originals. That’s bad. But an even more complicated situation arises almost always, and it lies in the duplication itself. Once duplicated - even when faithfully duplicated - the problem becomes maintaining the same policies in multiple places, which is impossibly hard, if not outright impossible. Imagine a scenario where the administrators of the original data sources change their access control policies frequently, as part of an ongoing effort to strengthen security and defend against recently observed threats. The centrally collected data deserves to be protected with exactly the same effect. In practice that does not work - it gets out of sync too quickly to be useful, and the effort, even without full success, is cost prohibitive. ZeroCopy Data Virtualization solves this issue once and for all - we will discuss how in a while.
  2. Yet another hole, and a really big one, punched into your walled garden - where there is data, there is a security liability and an attack surface. Data collected at the central location is essentially a hard copy of the data - modified in many ways, but a real hard copy nonetheless. It needs rigorous protection in its own right. The same data now appears in more places, and the protection has to be explicitly present at all of those places. ZeroCopy Data Virtualization again solves this in a simple way: if you don’t copy data to a central location, or anywhere at all, you are free from the responsibility of protecting that copy and from the liability of any breach of it. ZeroCopy Data Virtualization does not endorse permanently copying anything anywhere - that frees the whole system involved from “implied” data protection liability.

Problems with technique number three - “One huge monolithic consolidation of all data you can get”:

  1. One single consolidation of all enterprise data is a cookie cutter. Despite rapidly increasing computational power, current state-of-the-art analysis engines and pattern recognition models still cannot handle the massive and noisy corporate data warehouses that have become today’s business reality. Different kinds of analysis, and different kinds of machine learning, need different condensations of the data; it is not only about de-noising data but about preparing specialized inputs for specific applications and specific expectations. Data marts tried to achieve this in the past, partially, by dividing data into different business domains. But expectations do not always map neatly onto highly partitioned business domains - they are expectation driven and span the entire enterprise’s data, each with its own need for condensation. With the traditional approach it is virtually impossible to create unlimited copies of datasets to serve the varied expectations of the various applications running on top of them: copying is expensive, in both storage and transformation, and copies are hard data - once changed, even in error, they are difficult to correct. As machine learning expertise grows, as extremely useful ML algorithms become more available, and as hardware (GPUs, FPGAs, etc.) makes machine learning computationally practical, an enterprise needs virtually unlimited input datasets - all different, all geared towards a specific consuming application, and yet all encompassing the entire spectrum of enterprise data with varied condensation. ZeroCopy Data Virtualization lets you create unlimited input datasets at zero storage cost (since there are no real copies), covering the whole spectrum of enterprise data with varied condensation, making it very efficient in terms of both speed of computation and accuracy.
  2. Real data, copied to one single location and consolidated from different data-ownership domains with varying degrees of security sensitivity, greatly increases the attack surface. As discussed under technique number two, “Managing security policies at a central location where the consolidated data is loaded”, the security policies at the underlying datasources and at the central location very quickly go out of sync. This problem can be eliminated once and for all by eliminating the need to copy data to a central location altogether.

Let’s now talk about the mechanisms involved in ZeroCopy Data Virtualization that solve the problems mentioned above.

Let’s first see some of the characteristics, or behavioral objectives, of ZeroCopy Data Virtualization.

  1. Data consolidation, along with sanitization, can be done without copying data over - a resulting consolidated dataset can be defined virtually, with the required degrees of sanitization and condensation. In our case, it is a set of standard SQL commands, or a script, defining a sophisticated view (a minimal sketch follows this list).
  2. A view is virtual and hence inexpensive - therefore you can create as many variations of it as you need.
  3. Fast distributed query execution - as the name says, modern systems involved in data analysis and machine learning need to be extremely responsive, and that demands a very sophisticated scale-up and scale-out architecture.
  4. For certain applications, data locality is important and needs to be provided.
  5. Strictly follow ZeroTrust throughout the Data Virtualization system boundary and ownership domain - this minimizes, if not eliminates, the liabilities involved with data protection, data privacy, and any other data exposure that may be associated with it.
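
To make characteristic 1 concrete, here is a minimal PySpark sketch of defining a consolidated, sanitized dataset purely as a view. The two JDBC sources (a CRM database and a billing database), their schemas, and the availability of the JDBC drivers are all assumptions for illustration; this is a sketch of the idea, not the actual implementation.

```python
# Minimal sketch: a consolidated, sanitized dataset defined as a view,
# with no physical copy of the underlying data. Source names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zerocopy-view-sketch").getOrCreate()

# Expose the underlying sources as-is; nothing is copied or renamed here.
crm = (spark.read.format("jdbc")
       .option("url", "jdbc:postgresql://crm-host:5432/crm")
       .option("dbtable", "customers")
       .load())
billing = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://billing-host:5432/billing")
           .option("dbtable", "invoices")
           .load())
crm.createOrReplaceTempView("crm_customers")
billing.createOrReplaceTempView("billing_invoices")

# The "consolidated dataset" is just SQL: sanitized (PII hashed) and
# condensed (aggregated) on the fly, without any permanent transformation.
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW customer_spend AS
    SELECT c.customer_id,
           sha2(c.email, 256) AS email_hash,   -- sanitization
           sum(i.amount)      AS total_spend   -- condensation
    FROM crm_customers c
    JOIN billing_invoices i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id, c.email
""")
```

Because the view is virtual, many variants with different sanitization or condensation can be defined at essentially zero storage cost, which is exactly characteristic 2.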

The anatomical decomposition of a system aspiring to achieve the above objectives looks like the following:

  1. Decouple the Query Execution Engine from the Data Store
  2. Make the Data Store distributed - it can reside anywhere with reasonable latency
  3. Make the distributed Data Store non-proprietary and heterogeneous - data can be stored anywhere and in any format as long as the query engine has the required codec to understand the data (see the sketch after this list)
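
A rough sketch of what these three points can look like with Apache Spark as the decoupled engine: the data stays where it is (an object store, HDFS, a relational database) in its native format, and the engine only needs the corresponding reader. All paths, schemas, and connection details below are hypothetical, and the necessary connectors (e.g. hadoop-aws, a JDBC driver) are assumed to be available.

```python
# Rough sketch: one logical query spanning three stores of different formats,
# with Spark as the decoupled engine. Paths and schemas are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zerocopy-decoupled-sketch").getOrCreate()

# Each source keeps its own location and format; the engine only needs a reader.
events   = spark.read.parquet("s3a://iot-bucket/events/")       # object store
devices  = spark.read.json("hdfs:///registry/devices.json")     # HDFS
accounts = (spark.read.format("jdbc")                           # relational DB
            .option("url", "jdbc:postgresql://ops-host:5432/ops")
            .option("dbtable", "accounts")
            .load())

# No data is consolidated anywhere permanently; the joined result exists
# only for the lifetime of this job.
report = (events.join(devices, "device_id")
                .join(accounts, "account_id")
                .groupBy("account_id")
                .count())
report.show()
```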

If we get this far, all the objectives are met and all the problems we’ve discussed above are solved - seriously, that’s right. But as we know, the devil is in the details - at a high level the above three points cover everything, but let’s dig deeper and look at some challenges associated with implementing them.

Query pushdown 

Problem number one - a massive pileup of in-memory data for the query execution engine to handle. It involves temporarily bringing an enormous amount of data into one single low-latency memory space, and it gets more complicated when multiple queries are being executed by many users. A tough problem to solve - but we solved it by adopting the mechanism called query pushdown. What does that mean? It means that instead of executing queries within its own engine, the query execution engine can push the relevant part of a query down to the underlying system for execution - in other words, it delegates the query to the underlying system whenever that system is capable of executing it in one way or another. Query pushdown involves complex relational algebra and the mathematical translation of one query into another. The immediate next challenge is distributing the pushed-down dispatch and collecting the pushed-up results, parts of which come from different systems. We use Apache Spark as the engine for this distribution and distributed execution.
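
Below is a hedged PySpark sketch of the idea, using Spark’s built-in JDBC source as the underlying system; table names and connection details are hypothetical. The point is that a simple literal filter is compiled into a WHERE clause executed by the source database, and heavier work can be delegated by handing the source an entire subquery.

```python
# Hedged sketch of query pushdown via Spark's JDBC source.
# Connection details and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

calls = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://telco-host:5432/cdr")
         .option("dbtable", "call_records")
         .load())

# Spark compiles this simple filter into a WHERE clause run by PostgreSQL,
# so only matching rows ever leave the source system.
one_cell = calls.filter("cell_id = 'CELL-042'")
one_cell.show()

# For heavier delegation, the "table" can itself be a query that the source
# database executes, returning only a small aggregate to the engine.
per_cell = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://telco-host:5432/cdr")
            .option("dbtable",
                    "(SELECT cell_id, count(*) AS calls "
                    " FROM call_records GROUP BY cell_id) AS per_cell")
            .load())
per_cell.show()
```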

Data Locality and User Space Cache

Another problem is achieving the data locality needed for heavy-lifting applications like machine learning - this is done by employing an efficient “user space cache”. What is a “user space cache”? It’s a cache that lives and dies entirely within a specific user’s controlled environment - meaning it is not permanent, its TTL is defined by the user, it is not shared with any other user, and it is not persisted in any serializable format that could be stolen.
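
Here is a minimal sketch of the idea, assuming the hypothetical customer_spend view from the earlier sketch exists in the user’s Spark session; the TTL handling shown is a simple illustration, not the actual mechanism.

```python
# Minimal sketch of a "user space cache": the working set is pinned inside the
# user's own Spark session and never written out as a shared, durable copy.
# The view name and the TTL handling are illustrative assumptions.
import time
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-space-cache-sketch").getOrCreate()

features = spark.table("customer_spend")        # the virtual view from earlier

# Pin the working set close to the computation for low-latency repeated access.
features.persist(StorageLevel.MEMORY_AND_DISK)
features.count()                                # materialize the cache

expires_at = time.time() + 30 * 60              # user-chosen TTL: 30 minutes

# ... run training or repeated scans against `features` here ...

# When the TTL elapses (or the session ends), the cache disappears with it.
if time.time() >= expires_at:
    features.unpersist()
```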


With the above direct objectives met, we also get extraordinarily useful side-effects:

  1. ZeroTrust - making ZeroTrust as easy as pie in the data world
  2. Single Source of Truth - if data is not transformed and copied over, the single source of truth is faithfully preserved, once and for all.
  3. Reflection of the current state of affairs in the underlying data sources - copied-over data goes out of sync as soon as there is a change in the underlying datasource. ZeroCopy eliminates this problem.

Let us rehash and define ZeroCopy Data Virtualization in the wee hours of this article.

ZeroCopy Data Virtualization: A mechanism, and an apparatus, to make disparate and independently owned datasources (and the systems that exhibit the behavior of a datasource) participate in constructing a single virtual datasource with 

  • unified access, 
  • using a unified dialect (e.g. SQL), 
  • without the need for permanent transformation,

while preserving

  • single source of truth, 
  • their individual access control and other data security policies, 

while having the traits of 

  • scalable performance (performance can be increased with added resources and system tuning - different from performance at scale), 
  • performance at scale (performance with increased number of operations), 
  • acceptable throughput for the intended consuming applications (IoT, ML, Analytics loads),

while providing users the ability to use 

  • Data-Locality for applications (e.g. ML) needing low-latency random access to huge volumes of data,
  • incremental access to massive results without losing integrity (a short sketch follows this list)
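
As a small illustration of the last point, again assuming the hypothetical customer_spend view from the earlier sketches, PySpark’s toLocalIterator() streams a large result partition by partition instead of collecting it all at once.

```python
# Small sketch of incremental access: results are streamed to the client one
# partition at a time instead of being collected wholesale. The view name is
# the hypothetical one used in the earlier sketches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-access-sketch").getOrCreate()

result = spark.table("customer_spend").orderBy("customer_id")

seen = 0
for row in result.toLocalIterator():    # pulls one partition at a time
    seen += 1                           # stand-in for the real consumer
    if seen % 100_000 == 0:
        print(f"fetched {seen} rows so far")
```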

Areas of application:

  1. Machine learning involving multi-dimensional knowledge tucked away in different systems 
  2. IoT - where local processing is a necessity, but a holistic view is also needed and data changes frequently 
  3. SD-WAN - which demands monitoring and management data in a single place 
  4. Analytics - without expensive transformation or loss of the single source of truth 


About the author: Chiradip is a technologist, currently working as the CTO at Gemini Data Inc. In the past, he successfully ideated and delivered innovative products in the areas of distributed computing, IoT, security, networking, telecom, and data management. Besides his primary profession, he is also into radio communication, holds an amateur radio Extra class license, and is an avid sailor. 

 


