Cloud Native Data Streaming Platform
Overview and Core Components
This platform is designed to handle high-volume data ingestion, real-time processing, storage, analytics, and low-latency delivery to customers. The core components are:
1. Data Ingestion Layer: Ingests data from multiple sources like S3, Kafka, and Kinesis.
2. Processing Layer: Applies transformations, filtering, and schema validation.
3. Storage and Analytics Layer: Stores raw and processed data, enabling advanced analytics.
4. Customer Delivery Layer: Streams processed data to customers with minimal latency.
High-Level Architecture
Detailed Architecture
Data Ingestion Layer
S3 Ingestion
1. An SNS Notification is triggered for each new file.
2. An AWS Lambda Function:
• Parses each file and validates its schema.
• Publishes valid events to the messaging queue (Kafka/Kinesis).
• Writes invalid events to an Error Data Bucket.
Each data source has its own dedicated SNS Topic and Lambda Function.
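The validate-then-route logic of that Lambda can be sketched as plain Python. This is a minimal illustration, not the original implementation: the schema fields are invented, and `publish` / `write_error` are injected callables standing in for the Kafka/Kinesis producer and the error-bucket writer.

```python
import json

# Hypothetical schema for illustration: every event must carry these fields.
REQUIRED_FIELDS = {"event_id", "timestamp", "payload"}


def validate_event(event: dict) -> bool:
    """Return True if the event matches the expected schema."""
    return REQUIRED_FIELDS.issubset(event)


def handle_s3_record(raw_body: str, publish, write_error) -> None:
    """Route one record: valid events go to the messaging queue,
    invalid or unparseable ones to the Error Data Bucket."""
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        write_error(raw_body)
        return
    if validate_event(event):
        publish(event)
    else:
        write_error(raw_body)
```

Keeping validation separate from routing makes the schema check unit-testable without any AWS dependency.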
Kafka/Kinesis Data Sources
Data Validation
Processing Layer
Streaming Processing
• Data Transformation: Format data as required.
• Deduplication: Eliminate duplicate events using event IDs.
• Filtering: Apply custom filtering logic.
• Aggregation: Aggregate data as needed.
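The deduplication and filtering steps above can be sketched as generator stages over the event stream. This is a simplified single-process illustration (in the actual platform these would run inside Spark Streaming or Flink); the bounded-cache eviction policy is an assumption.

```python
from collections import OrderedDict
from typing import Callable, Iterable, Iterator


def deduplicate(events: Iterable[dict], max_cache: int = 100_000) -> Iterator[dict]:
    """Drop events whose event_id was already seen, keeping a bounded
    cache of ids so memory stays constant on an unbounded stream."""
    seen: "OrderedDict[str, None]" = OrderedDict()
    for event in events:
        eid = event["event_id"]
        if eid in seen:
            continue  # duplicate: skip
        seen[eid] = None
        if len(seen) > max_cache:
            seen.popitem(last=False)  # evict the oldest id
        yield event


def apply_filter(events: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Keep only events that satisfy the customer-specific predicate."""
    return (e for e in events if predicate(e))
```

In a real engine the "seen ids" state would be a keyed, checkpointed state store with a watermark, not an in-memory dict.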
Data Writer Job
• KStream
• Kinesis Data Firehose
• Spark Streaming or Flink
Latency Control
Storage and Analytics Layer
Raw Data Storage
Error Data Storage
Processed Data Storage
Advanced Analytics Framework
Customer Delivery Layer
Data Delivery Mechanism
Access Control
Scalability and Fault Tolerance
Kafka/Kinesis
Spark/Flink with EMR
Industrialisation (Automation and Monitoring)
Deployment Automation
Continuous Deployment of Streaming Applications
Monitoring and Alerting
a) Infrastructure Monitoring
• Use AWS CloudWatch for real-time monitoring of infrastructure metrics, logs, and latency.
b) Application Monitoring
• Publish streaming job metrics to Prometheus.
• Create dashboards in Grafana for:
  • Job resource utilisation
  • Business metrics
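Publishing job metrics to Prometheus ultimately means exposing samples in its text exposition format. A minimal sketch of that rendering step, assuming illustrative metric names (the real jobs would typically use a Prometheus client library instead):

```python
def prometheus_lines(metrics: dict, job: str) -> list:
    """Render one metrics snapshot in the Prometheus text exposition
    format, labelling each sample with the streaming job name.
    Metric names in `metrics` are illustrative, not from the design."""
    return [
        f'{name}{{job="{job}"}} {value}'
        for name, value in sorted(metrics.items())
    ]
```

Grafana then queries these series from Prometheus to build the resource-utilisation and business-metric dashboards.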
c) Critical Metrics and Alerts
• Metrics:
  • Number of events consumed
  • Kafka lag
  • Micro-batch execution time
  • Number of events sent to the error bucket
• Alerts:
  • No-data alerts
  • Zero events count alerts
  • Kafka lag alerts
  • Error bucket events alerts
  • Streaming job failure alerts
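The mapping from the metrics above to the listed alerts can be sketched as a single evaluation over one metrics snapshot. The threshold defaults and field names here are illustrative assumptions, not values from the original design:

```python
def evaluate_alerts(metrics: dict, max_lag: int = 10_000,
                    max_error_events: int = 0) -> list:
    """Return the alert conditions raised by one metrics snapshot.
    Thresholds are hypothetical defaults for illustration."""
    alerts = []
    if metrics.get("events_consumed") == 0:
        alerts.append("zero-events-count")
    if metrics.get("kafka_lag", 0) > max_lag:
        alerts.append("kafka-lag")
    if metrics.get("error_bucket_events", 0) > max_error_events:
        alerts.append("error-bucket-events")
    if not metrics.get("job_running", True):
        alerts.append("streaming-job-failure")
    return alerts
```

In practice these rules would live in Prometheus Alertmanager or CloudWatch Alarms rather than application code, but the conditions are the same.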
Cost Monitoring