Why Virtualize?
Looking at data virtualization like a computer scientist
Virtualization in computer science has led to the ubiquity of technology today, from your Instagram selfie to your bank statement. Cloud computing and virtualization have made the average Jane believe in CRM ops in the cloud, reinforced her belief in device backups, and made that important document accessible through every intelligent device anywhere on the planet.
At its core, virtualization is rooted in abstraction. Abstraction, or information hiding, in computer science is best explained by the analogy of a car's check engine light.
CarMD's most common reasons for a glowing check engine light include:
1. (tie) Replace ignition coil(s) and spark plug(s) ($391.42)
1. (tie) Replace oxygen sensor(s) ($244.04)
3. Replace catalytic converter(s) with a new OEM catalytic converter(s) ($1,371.17)
4. Inspect for loose gas cap and tighten or replace as necessary ($25.86)
5. Replace ignition coil(s) ($217.91)
6. Replace evaporative emissions purge control valve ($149.52)
7. Replace mass airflow sensor ($340.58)
8. Replace evaporative emissions purge solenoid ($153.70)
9. Replace fuel injector(s) ($449.73)
10. Replace thermostat ($244.61)
For someone like me, who doesn't know or care about these details, the check engine light just means going to the mechanic.
Virtualization has enabled us to harness resources from distributed machines, devices, and components, making them appear as one powerful and robust resource. Since virtualization abstracts complexity, the internal details remain a welcome mystery to the end user; all that really matters is making that glorious idea work in order to achieve the end result. For the users enabling the virtual setup, the real game changers are the absence of a single point of failure, failover measures, and a logical layer through which to examine the system for error resolution.
Popularly virtualized compute includes [3]:
- Computer Hardware (Amazon Web Services)
- Computer Operating Systems (Containerization)
- Computer Networks
You get the picture: virtualized entities don't replace their underlying components, they leverage them, and they do so in a manner that is automated, failsafe, agile, and scalable.
Let's talk about my favourite kind: data virtualization.
Simply put, data virtualization provides one virtual view of the data for an analytics application. Like my take on the check engine light, most analysts almost always care exclusively about a singular view of their underlying data. Its format, physical location, underlying encoding, and so on are of little importance to them. These details are often an overhead that needs to be dealt with in order to get that data model or that 360-degree view of the organization.
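To make that single-view idea concrete, here is a minimal sketch of what the analyst's side of the conversation could look like, assuming a hypothetical ODBC DSN (EnterpriseVirtualLayer) and a hypothetical virtual view (customer_360_view) exposed by a data virtualization layer. The names, columns, and credentials are illustrative, not any specific product's API; the point is simply that one query hits one logical view, however many physical sources sit behind it.

using System;
using System.Data.Odbc;

class VirtualViewDemo
{
    static void Main()
    {
        // Hypothetical DSN pointing at a data virtualization layer;
        // the analyst sees one logical view, not the physical sources.
        var connectionString = "DSN=EnterpriseVirtualLayer;UID=analyst;PWD=*****";

        using (var connection = new OdbcConnection(connectionString))
        {
            connection.Open();

            // One query against a single virtual view; behind the scenes the
            // layer federates the CRM database, the warehouse, and the flat-file
            // exports without the analyst knowing or caring.
            var sql = "SELECT customer_id, lifetime_value, last_ticket_date " +
                      "FROM customer_360_view WHERE region = ?";

            using (var command = new OdbcCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@region", "EMEA");

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine($"{reader["customer_id"]}: {reader["lifetime_value"]}");
                    }
                }
            }
        }
    }
}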
What does Data Virtualization NOT mean?
- Data virtualization is not an added layer of complexity.
- It does not replace your existing persistence systems.
- It is not another layer of persistence.
Why do I need this?
Let's look at this through the lens of Big O notation.
In algorithmic analysis, Big O notation describes the worst-case scenario when judging your algorithm's performance. Performance is gauged by the time your instructions take to execute and the memory they consume in the process. Simplifying this concept to measure complexity, I am going to evaluate O(scenario) below. (To keep this short, I have considered just one scenario; data virtualization has several uses, and several scenarios could be evaluated here.)
Scenario - Distributed data
In any digital organization, scattered data sources are a common sight. This is a reasonable state of affairs and does not imply sloppiness. Given the variety of data sources (IoT/sensor data, disparate databases, flat files, data lakes, and warehouses), it takes time and effort to create dedicated views with reasonable permissions. Given the time frame needed for such an exercise, the prepared data often lacks the most recent updates. Ad-hoc requests take time to append to existing views. Self-service access here is not an option for the average analyst because of the lack of necessary permissions.
IT, in addition to all its responsibilities, now needs to provide error-free data to all stakeholders with customized requests. This is often a bottleneck when operating at scale because, at a bare minimum, it involves data prep, data analysis, a search for the required data sources, adherence to global data definitions, and enterprise security.
The recurring theme here is nested iterations within a subset of the data space to make sure that the end result meets a basic set of standards. I could compare this to:
O(N^M)
O(N^M) represents an algorithm whose performance is directly proportional to the Mth power of the size of the input data set. This is common with algorithms that involve nested iterations over the data set; deeper nested iterations result in O(N^3), O(N^4), and so on. [5]
using System.Collections.Generic;

bool ContainsDuplicates(IList<string> elements)
{
    // Nested iteration over the same data set: O(N^2)
    for (var outer = 0; outer < elements.Count; outer++)
    {
        for (var inner = 0; inner < elements.Count; inner++)
        {
            // Don't compare an element with itself
            if (outer == inner) continue;

            if (elements[outer] == elements[inner]) return true;
        }
    }
    return false;
}
In such a scenario, enabling virtualized data is the most logical approach to sustainably scale in a data economy. It reduces the iterations involved in all steps along the way, right from data prep to publishing virtual views for individual stakeholders.
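To stretch the analogy a little further, here is a minimal sketch of the same duplicate check rewritten as a single pass over one consolidated collection, the way an analyst works against one virtual view instead of iterating across every source. The HashSet-based rewrite is my own illustration of cutting out nested iterations, not a description of how any particular virtualization product works internally.

using System.Collections.Generic;

bool ContainsDuplicatesSinglePass(IList<string> elements)
{
    // One pass over a single, consolidated view of the data: O(N)
    var seen = new HashSet<string>();
    foreach (var element in elements)
    {
        // HashSet<T>.Add returns false if the element is already present
        if (!seen.Add(element)) return true;
    }
    return false;
}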
Is this stuff new?
No! And that's great, because it is reliable and relentlessly tested as a product and a concept here at TIBCO. To achieve that logical governance, your trust needs to be in the right place. It sounds groundbreaking because of today's data economy and the emphasis organizations are putting on making data-driven decisions. This aged idea (much like statistical models), combined with your many data sources, data lakes, and long-term data projects, is simply a tool that will enable you to go to market with that game-changing idea stronger and faster.
Resources:
1. Virtualization in Amazon Machine Images - https://cloudacademy.com/blog/aws-ami-hvm-vs-pv-paravirtual-amazon/
2. Abstraction in Computer Science - https://www.d.umn.edu/~tcolburn/papers/Abstraction.pdf
3. Virtualization timeline - https://en.wikipedia.org/wiki/Timeline_of_virtualization_development
4. Virtualization types, see section "Other types" - https://en.wikipedia.org/wiki/Virtualization
5. Explaining O(N^M) - https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/