Big Data Systems : Tech Stack and Data Processing architecture styles

This article describes the technology stack and layers involved in a big data system's architecture, along with the real-world components used in each layer, to give an end-to-end view of the functionality of a big data system.

I have also provided some information about the Lambda and Kappa architectures, two data processing patterns used in the industry to shape big data solutions around specific use cases, and described a unified architecture pattern (a combination of the Lambda and Kappa styles) with its process flow for an IoT-based real-time analytics scenario.

Contents:

1.   Big Data

2.   Big Data Stack – Layers

3.   Lambda and Kappa architecture

4.   Use Case: Real-time IoT Analytics

Big Data:

Gartner defines big data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation”.

In simple words, big data is data whose size or type cannot be processed by traditional database and data warehouse systems. This creates the need for a special architecture and a new stack of technologies to gain insights from the data in an agile environment.

Big Data Architecture:

A well-defined big data architecture, with appropriate components for storage, governance, the analytics engine, and so on, makes the solution reliable and future-proof while serving its core purpose: managing varied, huge volumes of data generated at high velocity to extract vital business information from otherwise ambiguous data.

A solution as complex as a big data platform needs to be divided into layers, so that the right component can be defined for the right data strategy. These architecture layers divide the broad, end-to-end functionality of big data management into logical parts, which helps identify and remove tight coupling between components.

The layers of a big data architecture include:

1.   Data Sources layer

a.    Data Ingestion Layer

b.    Data Collection layer

2.   Data Storage layer

3.   Data Processing layer

a.    Data Transformation

b.    Data Query layer

4.   Data Consumption layer

a.    Data Exploration using Visualization

b.    Application Programming Interfaces

5.   Data Governance layer

a.    Data Security

b.    Data Monitoring

Data Sources Layer:

The data sources layer comprises varied types of web, machine, and human-generated data, coming either from a streaming device or from a storage device.

The types of data are broadly classified into:

Structured data (data organized in rows and columns),

Semi-structured data (XML, JSON, etc.), and

Unstructured data (images, videos, etc.)

  • Data Ingestion Layer:

Data ingestion refers to the process of inserting varied types of data from varied sources into the big data analytics platform, so that insights can be gained and business decisions made from it in real time, near real time, or batch mode, depending on the requirements.

To handle this huge volume of data arriving at high velocity, a system with efficient data structures such as queues, running with distributed processing capability, is needed.

The widely used tools include NiFi, Kafka, etc.
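As an illustration of queue-based ingestion, here is a minimal sketch using the kafka-python client; the broker address, topic name, and payload shape are assumptions for the example.

```python
# Minimal ingestion sketch with kafka-python; broker address, topic
# name, and the payload shape are illustrative assumptions.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one reading onto the ingestion topic; the collection and
# processing layers consume from this queue downstream.
reading = {"sensor_id": "s-101", "temperature": 72.4, "ts": time.time()}
producer.send("sensor-readings", value=reading)
producer.flush()   # block until the message is acknowledged
```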

  • Data Collection Layer:

This layer transports data from the ingestion layer to the rest of the data pipeline and decouples the source components from downstream systems, so that analytics can begin. The widely used tools include Flume, Sqoop, Beats, etc.

Data Storage Layer:

This layer deals with storing varied and huge volumes of data in the appropriate data store. As the volume of data generated and stored increases, distributed systems are needed to store and process it, which led to new file systems designed for this kind of storage. Frequently used tools in the market include Hadoop (HDFS), HDInsight, etc.
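As a small sketch of this layer, the PySpark snippet below persists raw ingested records to HDFS as partitioned Parquet; the namenode URL and directory layout are assumptions.

```python
# Storage sketch: land raw JSON in HDFS as partitioned Parquet via
# PySpark. The namenode URL and directory layout are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

raw = spark.read.json("hdfs://namenode:8020/landing/sensor-readings/")

# Partitioning by sensor_id keeps per-device scans cheap on HDFS.
(raw.write
    .mode("append")
    .partitionBy("sensor_id")
    .parquet("hdfs://namenode:8020/warehouse/sensor_readings/"))
```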

Data Processing Layer:

It is the primary layer of the architecture stack, since its focus is the data pipeline's processing system. The data collected or stored by the previous layers is processed here: the data flow is classified and records are routed to different destinations, and it is the first point where analytics occurs.

  • Data Transformation Layer:

This layer deals with changing the format and structure of the data, and with aggregating or correlating data from the queue or storage, as required for further analytics or for data scientists to train ML models. This is an important part of data integration and involves ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform). Analytical engines like Spark, Hadoop, etc. are used in this layer.
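As an illustration of a transformation step, the PySpark sketch below cleanses, standardizes, and aggregates the stored readings; the column names and paths are assumptions carried over from the earlier sketches.

```python
# Transformation sketch (the "T" of ETL) in PySpark: cleanse, derive a
# time column, and aggregate. Column names and paths are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

raw = spark.read.parquet("hdfs://namenode:8020/warehouse/sensor_readings/")

hourly = (
    raw.filter(F.col("temperature").isNotNull())        # drop bad records
       .withColumn("hour", F.date_trunc(
           "hour", F.from_unixtime(F.col("ts")).cast("timestamp")))
       .groupBy("sensor_id", "hour")
       .agg(F.avg("temperature").alias("avg_temp"))     # hourly aggregate
)

hourly.write.mode("overwrite").parquet(
    "hdfs://namenode:8020/curated/hourly_temperature/")
```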

  • Data Query Layer:

The components in this layer provide an interface for finding data, as the transformations require, within the huge volume of data spread across distributed systems. Big data systems that use HDFS usually need complex MapReduce logic to retrieve data; components like Hive enable SQL-like queries against HDFS by converting the SQL into MapReduce jobs internally. Other components used in this layer include Spark SQL, etc.
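As a small illustration, the Spark SQL sketch below registers the curated Parquet data as a temporary view and queries it with plain SQL, which Spark compiles into distributed jobs; the table and column names carry over from the transformation sketch and are assumptions.

```python
# Query-layer sketch: Spark SQL compiles the SQL below into distributed
# jobs, so no hand-written MapReduce logic is needed. Table and column
# names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-demo").getOrCreate()

spark.read.parquet(
    "hdfs://namenode:8020/curated/hourly_temperature/"
).createOrReplaceTempView("hourly_temperature")

spark.sql("""
    SELECT sensor_id, MAX(avg_temp) AS peak_temp
    FROM hourly_temperature
    GROUP BY sensor_id
    ORDER BY peak_temp DESC
    LIMIT 10
""").show()
```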

Data Consumption layer:

  • Data Visualization:

This layer helps analyze and find the value within the data stored in the big data environment. Data visualization is an interdisciplinary field that deals with the graphic representation of data, and it is a particularly efficient way of communicating when the data is high in volume.

Data visualization tools such as Power BI, Tableau, QlikView, etc. provide an accessible way to connect data sources and understand trends, outliers, and patterns in data.

Based on the use case, visualization dashboards can be updated in real time or in batch mode. Real-time dashboards are mostly used for time-series tasks such as monitoring.

Exploratory data analysis is another way of analyzing data that can reveal insights beyond formal modelling or hypothesis testing; statistical models and statistical graphics are used to carry it out.

  • Application Programming Interface:

Any data platform should provide access to the data in its data store through APIs, which in turn invoke the data query layer components to retrieve the appropriate data for the applications built on the platform.

This avoids moving huge volumes of processed data between systems, by enabling application users to select and filter the required data from the big data platform through APIs, as in the sketch below.
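Below is a minimal sketch of such an API using Flask; the endpoint shape is an assumption, and run_query is a hypothetical helper standing in for a real connection to the query engine.

```python
# API sketch: a thin Flask read endpoint in front of the query layer.
# The route shape is an assumption, and run_query is a hypothetical
# helper standing in for a real query-engine connection.
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_query(sql, params):
    """Hypothetical helper: submit SQL to the query layer (e.g. via a
    Spark Thrift Server or Hive connection) and return rows as dicts."""
    raise NotImplementedError("wire this to your query engine")

@app.route("/sensors/<sensor_id>/hourly")
def hourly_temperature(sensor_id):
    limit = int(request.args.get("limit", 24))
    rows = run_query(
        "SELECT hour, avg_temp FROM hourly_temperature "
        "WHERE sensor_id = %s ORDER BY hour DESC LIMIT %s",
        (sensor_id, limit),
    )
    return jsonify(rows)   # clients pull filtered data, not bulk exports
```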

Data Governance layer:

Data governance comprises the processes and procedures organizations use to manage, utilize, and protect their data. The two important aspects are data security and data monitoring.

  • Data Security

Security is crucial for any kind of data and is an essential aspect of big data architecture. It should be implemented at all layers of the data platform, from ingestion through storage, analytics, and discovery, all the way to consumption. Tools used to implement it include Apache Ranger, Apache Atlas, Apache Knox, etc. The security aspects to consider are:

1.   User / System Authentication

2.   Access Control for data

3.   Encryption and data masking for sensitive data at rest and in motion (a masking sketch follows this list)
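As a minimal sketch of point 3, the PySpark snippet below hashes a direct identifier and redacts a sensitive column before the data lands in the shared store; the column names and salt value are illustrative assumptions.

```python
# Masking-at-rest sketch in PySpark: hash a direct identifier and
# redact a sensitive column before data reaches the shared store.
# Column names and the salt value are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("masking-demo").getOrCreate()

events = spark.read.parquet("hdfs://namenode:8020/landing/events/")

masked = (
    events
    # Salted SHA-256 keeps the column joinable without exposing the id.
    .withColumn("user_id", F.sha2(
        F.concat(F.col("user_id"), F.lit("per-env-salt")), 256))
    # Keep only the last four digits of the card number.
    .withColumn("card_number",
                F.concat(F.lit("****"), F.col("card_number").substr(-4, 4)))
)

masked.write.mode("overwrite").parquet(
    "hdfs://namenode:8020/curated/events_masked/")
```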

  •  Data Monitoring

This layer covers monitoring, auditing, testing, managing, and controlling the data, all of which are important parts of the governance mechanism.

Auditing data access through access-log analysis, backed by a compliance monitoring solution, is a critical part of data governance. The monitoring solution can be supported by anomaly detection ML models that classify the access and traffic log data of the various components, as in the sketch below.
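As a small illustration of such a model, the sketch below fits scikit-learn's IsolationForest on toy per-user access-log features; the feature set and values are assumptions.

```python
# Hedged sketch of access-log anomaly detection with scikit-learn's
# IsolationForest; the per-user features and values are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [requests_per_min, distinct_tables_touched, bytes_read],
# aggregated per user from the platform's access logs (toy values).
X = np.array([
    [12,  3, 2.1e6],
    [15,  4, 2.4e6],
    [11,  3, 1.9e6],
    [14,  2, 2.2e6],
    [480, 35, 9.8e8],   # bulk-export-like behaviour worth auditing
])

model = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(model.predict(X))   # -1 flags the outlier row; 1 marks normal rows
```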

Data lineage is another process to consider in this layer; it comprises techniques for tracing the data's lifecycle through its various phases.

To decide which layers to use, and when, according to the use case, the Lambda, Kappa, and unified architectures provide generic blueprints for designing the system.

Types of Big Data Architecture:

  • Lambda architecture
Lambda architecture is a data-processing architecture designed to handle high volumes of data by using both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive views of batch data, while simultaneously using real-time stream processing to provide views of online data.

Data from the batch layer is more accurate and enables high-accuracy computation across large data sets, at the cost of high latency, whereas the stream layer serves data with low latency at the cost of lower accuracy, because the streaming layer only sees data from a small window of time.

Eventually, the batch and stream paths converge at the analytics client application. If the client needs to display timely, yet potentially less accurate data in real time, it will acquire its result from the stream path. Otherwise, it will select results from the batch path to display less timely but more accurate data.
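To make this serving-layer merge concrete, here is a minimal Python sketch under stated assumptions: the batch and real-time views are plain dictionaries of event counts, standing in for the precomputed views a real system would keep in a serving store.

```python
# Minimal sketch of the lambda serving-layer merge: an accurate but
# stale batch view is topped up with a fresh speed-layer delta.
# The view contents are illustrative assumptions.

# Batch layer output: counts recomputed nightly over the full dataset.
batch_view = {"page_a": 10_240, "page_b": 7_311}

# Speed layer output: counts for events seen since the last batch run.
realtime_view = {"page_a": 57, "page_c": 12}

def merged_count(key: str) -> int:
    """Serve a query by combining the batch and speed-layer views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(merged_count("page_a"))   # 10297: mostly batch, topped up by stream
print(merged_count("page_c"))   # 12: only seen since the last batch run
```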

Use cases: log analytics, real-time ad recommendation based on social media data

This type of architecture closes the latency gap of batch processing, while retaining the ability to rerun analytics over the whole dataset when necessary.

The underlying disadvantage is its complexity: the processing logic appears in two different places, using two different frameworks, which leads to duplicated computation logic and the overhead of managing both paths.

Key Takeaway: This architecture is reliable for keeping the data lake up to date and efficient for ML models that predict upcoming events, as it reaps the benefits of both the batch layer and the speed layer to ensure fewer errors along with speed.

Best fit: Near-real-time applications

  • Kappa Architecture
Kappa architecture is a streaming-first architecture pattern. It has the same basic goals as the lambda architecture, but with an important distinction: All data flows through a single path, using a stream processing system.

Kappa architecture is not a substitute for Lambda architecture. It is an alternative approach to data management for cases where a batch layer is not needed to meet the quality of service, and where complex transformations and data quality rules can be applied in the streaming layer.

This type of architecture allows complete re-computation of data spanning a long time interval, or of a huge volume of data, with the help of a raw data reservoir that stores the streaming data for further analytics.

Kappa architecture also allows the messaging system to retain raw historical streaming data for a longer duration, which supports reprocessing directly from the log rather than from a raw data store / data lake, as sketched below.
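A hedged sketch of this replay-based reprocessing, using the kafka-python client, follows; the broker address, topic name, and the process step are assumptions for illustration.

```python
# Kappa-style reprocessing sketch with kafka-python: rewind to the
# earliest retained offset and replay the whole stream through the
# same code path used for live data. Broker and topic are assumptions.
from kafka import KafkaConsumer, TopicPartition

def process(payload: bytes) -> None:
    """Hypothetical processing step shared by live and replay runs."""
    print(payload)

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,          # offsets are managed manually
)
partition = TopicPartition("sensor-readings", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)  # replay the retained log

for message in consumer:
    process(message.value)   # one stream job handles old and new data
```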

Key Takeaway: Choose the Kappa architecture for your real-time data processing needs when less expensive hardware is preferred and the system has to deal effectively with unique events occurring at runtime.

Reference architecture for real-time IoT analytics system:

This reference architecture is based on a combination of the Lambda and Kappa architectures, which is described here as a “Unified Architecture”.

The proposed reference architecture can be used for IoT scenarios that require real-time data analytics, combining data processing, analytics, and monitoring. A code-level sketch of steps 4 to 6 follows the numbered steps below.

1.   The edge gateway collects data from all the equipment connected in the field and pushes it into a topic of a message queue for further analytics in a datacenter or the cloud.

2.   The message queue receives the various types of data, performs the light transformations required by the processing engine, and pushes the data into the data store and the stream processing engine.

3.   The data store provides the whole data history for training an ML model, which can be developed with Spark MLlib, etc.

4.   The real-time data is sent to the stream processing engine for cleaning, transformation, type conversion, etc., to make it ready for analytics.

5.   The ML model is applied to the processed data to find predictions or patterns, for example specific anomalies.

6.   The result data from the ML model is stored in the data store for further analytics or reporting systems.

7.   Data-consuming services are configured to pull the data for reporting, or a push mechanism with pub/sub enables the store/processing engine to push data to a real-time dashboard.

8.   The result data is used to retrain the ML models.
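To make steps 4 to 6 concrete, below is a hedged sketch using Spark Structured Streaming; the broker, topic, schema, model path, and output path are all illustrative assumptions.

```python
# Hedged sketch of steps 4-6 with Spark Structured Streaming: read the
# sensor topic, clean and type-convert the records, score each
# micro-batch with a pre-trained Spark ML model, and persist results.
# Broker, topic, schema, and paths are illustrative assumptions.
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("iot-analytics").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", DoubleType()),
])

# Step 4: consume and clean the real-time feed.
readings = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "sensor-readings")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
         .select("r.*")
         .filter(F.col("temperature").isNotNull())
)

# Step 5 model: assumed to be trained offline on the data store's history.
model = PipelineModel.load("hdfs://namenode:8020/models/anomaly")

def score_batch(batch_df, batch_id):
    # Steps 5-6: score the micro-batch and persist the results for
    # reporting systems and real-time dashboards.
    model.transform(batch_df).write.mode("append").parquet(
        "hdfs://namenode:8020/curated/scored_readings/")

readings.writeStream.foreachBatch(score_batch).start().awaitTermination()
```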

Open for Suggestions and feedback ...

Thank you 
