Introduction to Kinesis Streams and Lambda Consumers
Data Streams
Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data from various sources. This data needs to be processed sequentially and incrementally, on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Information derived from such analysis gives companies visibility into many aspects of their business and customer activity.
Before dealing with streaming data, it is worth comparing and contrasting stream processing and batch processing. Batch processing can be used to compute arbitrary queries over different sets of data. It usually computes results that are derived from all the data it encompasses, and enables deep analysis of big data sets. MapReduce-based systems, like Amazon EMR, are examples of platforms that support batch jobs. In contrast, stream processing requires ingesting a sequence of data, and incrementally updating metrics, reports, and summary statistics in response to each arriving data record. It is better suited for real-time monitoring and response functions.
Streaming data processing requires two layers: a storage layer and a processing layer. The storage layer needs to support record ordering and strong consistency to enable fast, inexpensive, and replay-able reads and writes of large streams of data. The processing layer is responsible for consuming data from the storage layer, running computations on that data, and then notifying the storage layer to delete data that is no longer needed. We also have to plan for scalability, data durability, and fault tolerance in both the storage and processing layers.
Amazon Web Services (AWS) provides a number of options for working with streaming data. We can take advantage of the managed streaming data services offered by AWS Kinesis; the Kinesis streaming data platform comprises Kinesis Data Streams along with Kinesis Data Firehose, Kinesis Video Streams, and Kinesis Data Analytics. AWS also allows us to deploy and manage our own streaming data solution in the cloud on Amazon EC2.
Kinesis Data Stream
Kinesis Data Streams is part of the AWS Kinesis platform; it ingests and processes streams of data records in real time. It allows us to create Kinesis Data Streams applications that consume data for processing. These applications use the Kinesis Client Library and run on EC2 instances. The processing they perform is typically lightweight, because data intake and processing happen in real time.
Advantages of Kinesis Data Streams
- Elastic and durable: data is stored durably so it is not lost, and the stream can be scaled up or down easily.
- Put-to-get delay (the delay between the time a record is put into the stream and the time it can be retrieved) is typically less than 1 second.
- Multiple Kinesis Data Streams applications can consume data from a stream, so that multiple actions, like archiving and processing, can take place concurrently and independently.
- The Kinesis Client Library enables fault-tolerant consumption of data from streams and provides scaling support for Kinesis Data Streams applications.
Kinesis Terminology
- Data Record - The unit of data stored by Kinesis Data Streams is a data record. A record is a data structure that contains the data to be processed in the form of a data blob. Kinesis Data Streams does not inspect, interpret, or change the data in the blob in any way. A data blob can be up to 1 MB. For billing purposes, each record's size is rounded up to the nearest 25 KB (a PUT payload unit).
- Shards - A shard is a sequence of data records in a stream. When we create a stream, we specify the number of shards for the stream; the total capacity of a stream is the sum of the capacities of its shards. Each data record in a shard has a sequence number that is assigned by Kinesis Data Streams. Each shard supports up to 5 read transactions per second, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys).
- Data Stream - A data stream is made of one or more shards.
- Producer - A producer puts data records into Amazon Kinesis data streams. To put data into the stream, we must specify the name of the stream, a partition key, and the data blob to be added to the stream.
- Consumer - A consumer, known as an Amazon Kinesis Data Streams application, is an application that we build to read and process data records from Kinesis data streams. A consumer can be an enhanced fan-out consumer, where each consumer gets a dedicated 2 MB/sec/shard of read throughput irrespective of other consumers, or a shared fan-out consumer, where all consumers share the 2 MB/sec/shard read throughput.
- Retention Period - The retention period is the length of time that data records are accessible after they are added to the stream. It can be set from a minimum of 24 hours to a maximum of 365 days.
- Partition Key - A partition key is used to group data by shard within a stream. Kinesis Data Streams segregates the data records belonging to a stream into multiple shards. It uses the partition key that is associated with each data record to determine which shard a given data record belongs to.
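To make the partition-key-to-shard mapping concrete, here is a minimal sketch of the documented behaviour: Kinesis MD5-hashes the partition key into a 128-bit integer and routes the record to the shard whose hash key range contains it. The shard IDs and hash key ranges below are hypothetical; in a real stream they come from the DescribeStream API.

```python
import hashlib

def shard_for_key(partition_key, shard_ranges):
    """Map a partition key to a shard the way Kinesis does: MD5-hash the
    key into a 128-bit integer, then find the shard whose hash key range
    contains that integer."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for shard_id, (start, end) in shard_ranges.items():
        if start <= hashed <= end:
            return shard_id
    raise ValueError("no shard covers this hash value")

# Hypothetical two-shard stream splitting the 128-bit hash space in half.
HALF = 2 ** 127
RANGES = {
    "shardId-000000000000": (0, HALF - 1),
    "shardId-000000000001": (HALF, 2 ** 128 - 1),
}

print(shard_for_key("user-42", RANGES))
```

Because the hash of a given key is deterministic, all records sharing a partition key land on the same shard, which is what preserves per-key ordering.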
AWS Lambda
AWS Lambda is a compute service that lets us run code without provisioning or managing servers. Lambda runs our code only when needed and scales automatically, from a few requests per day to thousands per second. We pay only for the compute time that we consume—there is no charge when our code is not running. With Lambda, we can run code for virtually any type of application or backend service, all with zero administration. Lambda runs our code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring and logging. All we need to do is supply our code in one of the languages that Lambda supports.
AWS Lambda can be triggered in response to events such as changes to data in an Amazon Simple Storage Service (Amazon S3) bucket or an Amazon DynamoDB table; it can run our code in response to HTTP requests using Amazon API Gateway, or be invoked through API calls made with the AWS SDKs. This helps us build serverless applications composed of functions that are triggered by events and fully managed by AWS. The trade-off is flexibility: we cannot log in to the compute instances or customise the operating system on the provided runtimes.
Lambda with Kinesis
We can map a Lambda function to a shared-throughput consumer (standard iterator) or to a dedicated-throughput consumer with enhanced fan-out. For standard iterators, Lambda polls each shard in our Kinesis stream for records over HTTP. The event source mapping shares read throughput with the shard's other consumers.
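The throughput difference between the two consumer types is simple arithmetic; the sketch below spells it out under the per-shard read limit described above.

```python
def per_consumer_read_mbps(consumers, enhanced=False):
    """Per-shard read throughput each consumer sees, in MB/s.
    Shared (standard iterator): the shard's 2 MB/s is split across consumers.
    Enhanced fan-out: each consumer gets a dedicated 2 MB/s per shard."""
    SHARD_READ_LIMIT_MBPS = 2
    if enhanced:
        return SHARD_READ_LIMIT_MBPS
    return SHARD_READ_LIMIT_MBPS / consumers

print(per_consumer_read_mbps(4))                 # shared: 2/4 = 0.5 MB/s each
print(per_consumer_read_mbps(4, enhanced=True))  # enhanced: 2 MB/s each
```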
Lambda reads records from the data stream in batches and invokes our function synchronously with an event that contains the stream records. (When we invoke a function synchronously, Lambda runs the function and waits for a response; when the function completes, Lambda returns the response from the function's code along with additional data, such as the version of the function that was invoked.)
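A minimal Python handler for such an event might look like the sketch below. Lambda delivers Kinesis record payloads base64-encoded under each record's `kinesis.data` field; the sample event, its partition key, and sequence number here are made up for illustration.

```python
import base64
import json

def handler(event, context):
    """Minimal Kinesis-triggered Lambda handler: the event carries a batch
    of records whose data blobs arrive base64-encoded."""
    results = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        results.append(json.loads(payload))
    return {"batchSize": len(results), "records": results}

# Hypothetical one-record event shaped like Lambda's Kinesis event.
sample_event = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "user-42",
                "sequenceNumber": "49590338271490256608559692538361571095921575989136588898",
                "data": base64.b64encode(json.dumps({"temp": 21}).encode()).decode(),
            }
        }
    ]
}

print(handler(sample_event, None))
```

Because the invocation is synchronous, raising an exception from the handler signals a batch failure back to the event source mapping, which then retries according to its configuration.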
Lambda Terminology
- Event - The batch of records (from the source) that is passed as an argument when the handler is invoked.
- Context - When Lambda runs our function, it passes a context object to the handler. This object provides methods and properties that provide information about the invocation, function, and execution environment.
- Batch Size - The number of records to send to the function in each batch, up to 10,000. Lambda passes all of the records in the batch to the function in a single call, as long as the total size of the events doesn't exceed the payload limit for synchronous invocation (6 MB). Lambda picks up a batch of records from each shard.
- Batch Window - The maximum amount of time, in seconds, to gather records before invoking the function. Lambda reads records from a stream at a fixed cadence (e.g. once per second for Kinesis data streams) and invokes the function with a batch of records. The Batch Window allows a wait of up to 300 seconds to build a batch before invoking the function. A function is invoked when one of the following conditions is met: the payload size reaches 6 MB, the Batch Window reaches its maximum value, or the Batch Size reaches its maximum value. With a Batch Window, we can increase the average number of records passed to the function with each invocation, which is helpful when we want to reduce the number of invocations and optimize cost.
- On-Failure Destination - An SQS queue or SNS topic for records that can't be processed. When Lambda discards a batch of records because it's too old or has exhausted all retries, it sends details about the batch to the queue or topic. The maximum number of retries can be configured, along with the option of splitting the batch on error before retrying.
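The three flush conditions above can be sketched as a single predicate. This is a simplified model, not Lambda's actual implementation; the `batch_size` and `batch_window_s` arguments stand in for the configurable Batch Size and Batch Window settings.

```python
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024   # synchronous-invocation payload limit
MAX_BATCH_SIZE = 10_000               # Batch Size ceiling for Kinesis sources
MAX_BATCH_WINDOW_S = 300              # Batch Window ceiling

def should_invoke(batch_bytes, batch_count, window_elapsed_s,
                  batch_size=MAX_BATCH_SIZE, batch_window_s=MAX_BATCH_WINDOW_S):
    """Return True when any flush condition is met: the payload limit is
    reached, the batch window has expired, or the batch is full."""
    return (batch_bytes >= MAX_PAYLOAD_BYTES
            or window_elapsed_s >= batch_window_s
            or batch_count >= batch_size)

# Hypothetical mapping configured with Batch Size 400 and a 60 s window.
print(should_invoke(1024, 10, 5, batch_size=400, batch_window_s=60))   # keep buffering
print(should_invoke(1024, 400, 5, batch_size=400, batch_window_s=60))  # batch full: invoke
```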
Sources of Knowledge
- https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html#shard
- https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html
- https://docs.aws.amazon.com/streams/latest/dev/introduction.html