Next Generation Data Architecture for Advanced Analytics

Here is a simplified sketch of the next generation data analytics architecture.

1. Data Generators:

On the left side are the data generators: web applications, mobile apps, point-of-sale systems in retail stores, smart IoT devices, and tablets. These systems generate operational data on a day-to-day basis.

2. Transactional Systems (aka OLTP):

All the operational data gets stored in transactional (OLTP) data stores. These data stores are either relational (examples: MySQL, Oracle, Postgres, etc.), document-driven (examples: MongoDB, CouchDB, etc.), or key-value stores (examples: DynamoDB, Redis, etc.).

Transactional data stores are designed to support write-heavy use cases, as many applications write to them at any point in time. I/O is the most expensive operation when it comes to database read/write performance. To support high write throughput (QPS), the schema in transactional systems is normalized to eliminate redundancy and avoid multiple disk writes per request.
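To make the normalization point concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for an OLTP store (the table names and columns are hypothetical). Because customer details live in exactly one row, recording a new order writes a single narrow row rather than repeating the customer's data on every order.

```python
import sqlite3

# In-memory database standing in for an OLTP store (hypothetical schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized schema: customer attributes are stored once; each order row
# carries only a foreign key, keeping per-request disk writes small.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), amount REAL)"
)

cur.execute("INSERT INTO customers VALUES (1, 'Ada', 'London')")
cur.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 9.99), (2, 24.50)])
conn.commit()
```

Reading the customer's city for an order now requires a join, which is the price a normalized OLTP schema pays in exchange for cheap, redundancy-free writes.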

3. Analytical Systems (aka OLAP):

All the operational data from transactional systems is eventually ETLed into analytical systems. The purpose is to centralize all operational data in one place, making it possible to cross-reference it and derive meaningful insights geared toward improving operational efficiencies and business KPIs.

The paradigm in analytical systems is very different from that in transactional systems. The use cases are read-heavy. The schema is denormalized, or flattened: a storage tradeoff is made, accepting redundancy in exchange for faster read performance.
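The flattening tradeoff can be sketched with the same hypothetical orders data: the denormalized "wide" table repeats customer attributes on every order row (redundant storage), so an analytical read needs no joins at all.

```python
import sqlite3

# In-memory database standing in for an OLAP store (hypothetical schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized wide table: customer name and city are repeated on every
# order row, trading extra storage for join-free analytical reads.
cur.execute(
    "CREATE TABLE orders_flat "
    "(order_id INTEGER, customer_name TEXT, city TEXT, amount REAL)"
)
cur.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Ada", "London", 9.99), (2, "Ada", "London", 24.50)],
)

# A typical analytical read: a single scan, no joins.
total = cur.execute(
    "SELECT SUM(amount) FROM orders_flat WHERE city = 'London'"
).fetchone()[0]
```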

Analytical systems are powered by three powerful technologies, as described in the following section.

You will find these three technologies in most, if not all, next-generation data analytics architectures, each powering a different set of use cases. Having said that, these technologies are converging, and the decision to choose one for your specific use case should be based on the Analytics KPIs described for each technology below.

Data Warehouses:

Data Warehouses have been around for many decades.

  1. Data Format: Data Warehouses are relational in nature; data is structured.
  2. Schema: In Data Warehouses, schema is fixed and predefined; they support schema-on-write.
  3. Use Cases: Data Warehouses are designed for internal analytics use cases such as dashboarding and reporting. In Data Warehouses, compute and storage are tied together, which helps achieve faster query SLAs, in the range of seconds to minutes.
  4. Analytics KPIs: 

[Image: Analytics KPIs for data warehouses]

  • Query latency range is in seconds to minutes.
  • Query throughput is fairly low (under 1 QPS). We are talking about a few internal users running queries spread throughout the day.
  • Data Freshness is in hours to days. The data is ETLed from transactional systems using periodic jobs - typically once nightly, or at best hourly throughout the day during low-activity windows.
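Schema-on-write, the defining warehouse property above, can be illustrated with a minimal sketch (the column names and the `load_row` helper are hypothetical): every record is validated against a fixed, predefined schema at load time, and anything that does not conform is rejected before it is stored.

```python
# Hypothetical fixed warehouse schema: column name -> required type.
SCHEMA = {"user_id": int, "event": str, "amount": float}

def load_row(row: dict) -> dict:
    """Validate a record against the fixed schema before loading it."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {sorted(row)} do not match schema")
    for col, typ in SCHEMA.items():
        if not isinstance(row[col], typ):
            raise TypeError(f"{col} must be {typ.__name__}")
    return row  # only schema-conforming rows reach storage
```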

Data Lakes (Object Stores):

With the advent of newer data formats that traditional data warehouses could not support natively, data lake technology came to life. Object-store data lakes decouple compute and storage completely. The data is stored in object stores such as Amazon S3, where storage is very cheap (compared to block storage attached to a VM). Big data compute engines like Hive, Spark, and Presto are used for serving advanced analytics.

  1. Data Format: Designed for storing data in its purest form; structured, semi-structured or unstructured.
  2. Schema: Flexible schema, schema-on-read.
  3. Use Cases: Internal analytics; data science, machine learning and advanced analytics use cases.
  4. Analytics KPIs: 

[Image: Analytics KPIs for data lakes]

  • For analytical queries, data lakes provide query latencies in minutes to hours. Compared to Data Warehouses, Data Lakes trade query performance for lower cost and the flexibility to support different data formats.
  • Query throughput is low (typically under 10 QPS).
  • Data Freshness is in hours to days. The data is ETLed from transactional systems using periodic jobs - typically once nightly, or at best hourly throughout the day during low-activity windows.
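The schema-on-read property of data lakes can be sketched as follows (the event fields and the `query_total` helper are hypothetical): raw events land in the lake exactly as produced, and structure is imposed only at query time, so missing or extra fields never block ingestion.

```python
import json

# Raw JSON-lines events as they would land in an object store: no schema
# is enforced at write time, so shapes can vary freely.
raw = "\n".join([
    '{"user": "a", "amount": 5}',
    '{"user": "b"}',                           # missing field: fine at write time
    '{"user": "a", "amount": 7, "tag": "x"}',  # extra field: also fine
])

def query_total(lines: str, user: str) -> int:
    """Schema-on-read: expectations (integer amount, default 0) are applied here."""
    return sum(
        json.loads(line).get("amount", 0)
        for line in lines.splitlines()
        if json.loads(line).get("user") == user
    )
```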

Analytical Data Stores:

With the advent of internal user-facing applications that serve insights and metrics to end users, sub-second latency became critically important, and that's where analytical data stores came to life. The data in these systems is stored in columnar format; richer indexes are added to meet millisecond query latency SLAs.

In addition, external user-facing analytics use cases that demanded putting metrics directly in front of the external users arose. The scale and throughput support needed for these external user-facing applications was unprecedentedly high as compared to internal user-facing analytics use cases.

And lastly, with the advent of real-time streaming sources (Kafka, Kinesis, Pub/Sub, etc.), data freshness became another important KPI for businesses to chase.

  1. Data Format: Columnar storage; analytical data stores typically support structured data. Apache Pinot is one data store technology that natively supports storing and indexing semi-structured data.
  2. Schema: In these data store technologies, schema is fixed (schema-on-write); Apache Pinot, with its native support for semi-structured datasets (JSON), supports flexible schema.
  3. Use Cases: Designed for user-facing analytics use cases that demand ultra-low latency (sub-second) and data freshness of near real-time.
  4. Analytics KPIs:

[Image: Analytics KPIs for external user-facing analytics data stores]

  • Query latency range is in milliseconds
  • Query throughput is low (internal user facing) to high (external user facing);
  • Data Freshness is in near real-time. The data needs to be made available for querying as soon as possible after arriving on the real-time stream.
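The freshness KPI can be sketched with a toy in-memory store (this is an illustration of the concept, not Apache Pinot's actual API): each event consumed from the stream updates a pre-aggregated structure and is queryable immediately, with no batch ETL delay in between.

```python
from collections import defaultdict

class RealtimeStore:
    """Toy real-time analytical store: events are queryable on arrival."""

    def __init__(self) -> None:
        # Pre-aggregated counts per key, standing in for a columnar index.
        self.counts: defaultdict[str, int] = defaultdict(int)

    def ingest(self, event: dict) -> None:
        # Consumed straight off the stream; visible to queries right away.
        self.counts[event["page"]] += 1

    def views(self, page: str) -> int:
        return self.counts[page]

store = RealtimeStore()
store.ingest({"page": "profile/alice"})
store.ingest({"page": "profile/alice"})
```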

Introducing Apache Pinot: 

Apache Pinot is an analytical data store that is designed to support a wide spectrum of analytical use cases; marquee use cases powered in production today by Apache Pinot are LinkedIn's "Who Viewed My Profile" and UberEats "Restaurant Manager."

These use cases require support for:

  • Latency: Ultra-low query latency (in milliseconds) as there is an end user waiting for a response on the other side,
  • Throughput: an extremely high read throughput (hundreds of thousands of Read QPS) and
  • Data Freshness: data freshness of near real-time (in a few seconds).

Apache Pinot provides that ultra-low latency for analytical queries under high read throughput and near real-time data freshness. Additionally, Apache Pinot natively supports upserts, which helps use cases such as UberEats' Restaurant Manager keep track of changing order status in real time.
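The upsert semantics can be sketched in a few lines (illustrative only, not Pinot's API): events carrying the same primary key overwrite the previous row, so a query always sees the latest order status rather than the full event history.

```python
# Toy upsert table keyed by order_id: the latest event for a key wins.
orders: dict[int, dict] = {}

def upsert(event: dict) -> None:
    # Replace the existing row on a matching primary key.
    orders[event["order_id"]] = event

upsert({"order_id": 42, "status": "placed"})
upsert({"order_id": 42, "status": "delivered"})  # overwrites "placed"
```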

Stay tuned to learn more about the exciting new updates about Apache Pinot and related ecosystem! :-)

More articles by Sandeep Dabade
