Next Generation Data Architecture for Advanced Analytics
Here is a simplified sketch of the next generation data analytics architecture.
1. Data Generators:
On the left side, there are different data generators. These include web applications, mobile apps, point-of-sale systems in retail stores, smart IoT devices, and tablets. They generate operational data on a day-to-day basis.
2. Transactional Systems (aka OLTP):
All the operational data gets stored in transactional (OLTP) data stores. These data stores are either relational (examples: MySQL, Oracle, Postgres, etc.), document-oriented (examples: MongoDB, CouchDB, etc.), or key-value stores (examples: DynamoDB, Redis, etc.).
Transactional data stores are designed to support write-heavy use cases, since many applications write to them at any point in time. I/O is the most expensive operation when it comes to database read/write performance. To support high write throughput (QPS), the schema in transactional systems is normalized: eliminating redundancy avoids multiple disk writes per request.
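To make the normalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customers, orders, city) are made up for illustration and do not come from any specific system. Because the customer's city is stored exactly once, changing it is a single-row write no matter how many orders that customer has.

```python
import sqlite3

# In-memory database with a normalized (hypothetical) schema:
# each customer attribute lives in exactly one row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    city        TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(100, 1, 25.0), (101, 1, 40.0)],
)

# The customer moves: only one row is rewritten, even though
# two orders reference this customer.
cursor = conn.execute(
    "UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1"
)
rows_touched = cursor.rowcount
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"rows rewritten: {rows_touched}, orders referencing customer: {order_count}")
```

The tradeoff shows up at read time: answering "revenue by city" from this schema requires a join between the two tables.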
3. Analytical Systems (aka OLAP):
All the operational data from transactional systems is eventually ETLed into the analytical systems. The purpose is to centralize the operational data in one place so that it can be cross-referenced to derive meaningful insights geared towards improving operational efficiency and business KPIs.
The paradigm in analytical systems is very different from that in transactional systems. The use cases are read-heavy. The schema is denormalized or flattened; a storage tradeoff is made by introducing redundancy in exchange for faster read performance.
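The flattening tradeoff can be sketched the same way. The example below again uses Python's sqlite3 with hypothetical table names. It plays the role of a tiny ETL step, joining two normalized tables once and storing the result as a flat table so that analytical reads need no join. It is only an illustration of the idea, not a recommendation for how to build a real pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized source tables (hypothetical names and data).
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha', 'Pune'), (2, 'Ravi', 'Delhi');
INSERT INTO orders VALUES (100, 1, 25.0), (101, 1, 40.0), (102, 2, 10.0);

-- "Transform and load": pay for the join once and store the flat result.
CREATE TABLE orders_flat AS
SELECT o.order_id, o.amount, c.customer_id, c.name, c.city
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Read-heavy query: a single-table scan, no join at query time.
revenue_by_city = dict(
    conn.execute("SELECT city, SUM(amount) FROM orders_flat GROUP BY city")
)

# The price of faster reads: 'Pune' is now stored once per order.
redundant_copies = conn.execute(
    "SELECT COUNT(*) FROM orders_flat WHERE city = 'Pune'"
).fetchone()[0]
print(revenue_by_city, redundant_copies)
```

Here the city value is duplicated on every order row, which costs storage but lets the aggregation run over one table.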
Analytical systems are powered by three powerful technologies, as described in the following section.
You will find these three technologies in most, if not all, next-generation data analytics architectures; each powering different sets of use cases. Having said that, these technologies are converging and the decision to choose one for your specific use case should be made based on the Analytics KPIs described to the far right.
Data Warehouses:
Data Warehouses have been around for many decades.
Data Lakes (Object Stores):
With the advent of newer data formats, which traditional data warehouses could not support natively, data lake technology came to life. Object-store data lakes have compute and storage completely decoupled. The data is stored in object stores such as Amazon S3, where storage is very cheap compared to block storage attached to a VM. Big data compute engines like Hive, Spark, and Presto are used for serving advanced analytics.
Analytical Data Stores:
With the advent of internal user-facing applications that serve insights and metrics to end users, sub-second latency became essential, and that's where analytical data stores came to life. The data in these systems is stored in columnar format; richer indexes are added to meet millisecond query latency SLAs.
In addition, external user-facing analytics use cases arose that demanded putting metrics directly in front of external users. The scale and throughput needed for these external user-facing applications were far higher than for internal user-facing analytics use cases.
And lastly, with the advent of real-time streaming sources (Kafka, Kinesis, Pub/Sub, etc.), data freshness became another important KPI for businesses to chase.
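The columnar layout and richer indexes mentioned in this section can be illustrated with a small pure-Python sketch. It is a toy model with made-up field names (viewer_id, country, views), not how any particular analytical data store is implemented. The same records are pivoted from rows into columns, and an inverted index on one column lets a filtered aggregation touch only the matching positions.

```python
from collections import defaultdict

# Row-oriented records, as a transactional store might hold them.
rows = [
    {"viewer_id": 1, "country": "US", "views": 3},
    {"viewer_id": 2, "country": "IN", "views": 5},
    {"viewer_id": 3, "country": "US", "views": 2},
]

# Column-oriented layout: one contiguous list per column, so an
# aggregation over "views" never reads "viewer_id" at all.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Inverted index on "country": value -> positions of matching records.
inverted_index = defaultdict(list)
for doc_id, value in enumerate(columns["country"]):
    inverted_index[value].append(doc_id)


def total_views(country):
    """Sum views for one country using the index instead of a full scan."""
    doc_ids = inverted_index.get(country, [])
    return sum(columns["views"][doc_id] for doc_id in doc_ids)


print(total_views("US"))
```

Real systems add compression, encoding, and several other index types on top of this basic idea.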
Introducing Apache Pinot:
Apache Pinot is an analytical data store designed to support a wide spectrum of analytical use cases. Marquee use cases powered by Apache Pinot in production today are LinkedIn's "Who Viewed My Profile" and UberEats "Restaurant Manager."
These use cases require support for:
Apache Pinot provides ultra-low latency for analytical queries under high read throughput, along with near real-time data freshness. Additionally, Apache Pinot natively supports upserts, which help use cases such as UberEats Restaurant Manager keep track of changing order status in real time.
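To show what upsert semantics mean for a use case like changing order status, here is a conceptual sketch in plain Python. It is not Apache Pinot's API or its internal implementation. The function apply_upserts and the fields order_id, status, and updated_at are hypothetical names chosen for illustration. The idea is that queries see only the latest record per primary key, even when events arrive late or out of order.

```python
def apply_upserts(events, key="order_id", comparison="updated_at"):
    """Keep one record per key: the one with the highest comparison value.

    Hypothetical helper for illustration only.
    """
    latest = {}
    for event in events:
        current = latest.get(event[key])
        # Replace only if this event is at least as new as what we hold.
        if current is None or event[comparison] >= current[comparison]:
            latest[event[key]] = event
    return latest


# A stream of order-status events, including one late, stale duplicate.
events = [
    {"order_id": "A1", "status": "PLACED", "updated_at": 1},
    {"order_id": "A2", "status": "PLACED", "updated_at": 2},
    {"order_id": "A1", "status": "PREPARING", "updated_at": 3},
    {"order_id": "A1", "status": "PLACED", "updated_at": 1},  # arrives late
    {"order_id": "A1", "status": "DELIVERED", "updated_at": 7},
]

current_state = apply_upserts(events)
for order_id, record in sorted(current_state.items()):
    print(order_id, record["status"])
```

Without upserts, a status dashboard would have to deduplicate these five events at query time; with them, each order appears once with its current status.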
Stay tuned to learn more about exciting new updates on Apache Pinot and its related ecosystem! :-)