Kinesis and streaming stack
AWS Streaming Stack (Kinesis & Friends)
What it is: AWS's managed services for real-time data streaming and analytics. The stack has three main parts:
Kinesis Data Streams (KDS): Ingests the data (The Pipe).
Amazon Data Firehose (formerly Kinesis Firehose): Loads data directly into S3, Redshift, or OpenSearch (The Delivery Truck).
Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): Processes and transforms data in real-time (The Brain).
Data Engineer Note: This is your "Kafka" ecosystem alternative.
Equivalents:
GCP: Pub/Sub (Ingestion) and Dataflow (Processing/Loading).
Azure: Azure Event Hubs (Ingestion) and Stream Analytics (Processing).
1. The "Storage & Ingestion" Layer: Kinesis Data Streams (KDS)
Think of this as the AWS equivalent of Apache Kafka. It is the raw piping that ingests data at massive scale.
Core Concept: The stream is divided into Shards. A shard is a unit of throughput (1 MB/sec or 1,000 records/sec ingest, 2 MB/sec egress). In provisioned mode you scale by adding or removing shards.
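The shard arithmetic can be sketched in a few lines. This is a rough sizing helper, not an AWS API: the function name and the example workload are made up for illustration. It uses the per-shard throughput figures quoted above, and a limit of 1,000 records/sec per shard on writes also applies.

```python
import math

# Per-shard limits in provisioned mode.
SHARD_INGEST_MB_PER_S = 1.0
SHARD_INGEST_RECORDS_PER_S = 1000
SHARD_EGRESS_MB_PER_S = 2.0


def required_shards(ingest_mb_per_s: float, records_per_s: float, egress_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies all three per-shard limits."""
    by_bytes_in = math.ceil(ingest_mb_per_s / SHARD_INGEST_MB_PER_S)
    by_records_in = math.ceil(records_per_s / SHARD_INGEST_RECORDS_PER_S)
    by_bytes_out = math.ceil(egress_mb_per_s / SHARD_EGRESS_MB_PER_S)
    return max(1, by_bytes_in, by_records_in, by_bytes_out)


# Hypothetical workload: 5 MB/s in, 12,000 records/s, 10 MB/s total read by consumers.
print(required_shards(5.0, 12_000, 10.0))  # prints 12
```

In this example the record rate, not the byte rate, decides the shard count, which is typical for streams of many small events.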
Retention: It stores data temporarily (default 24 hours, extendable to 365 days). This allows different consumers to read the same data at their own speed (replayability).
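To see why retention gives you replayability, here is a toy in-memory model of a single shard. It is not the Kinesis API; the class and method names are invented for this sketch. The point is that reading never deletes data, so each consumer keeps its own position and can start over from the oldest retained record.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy stand-in for one shard: an append-only log with time-based retention."""
    retention_seconds: float = 24 * 3600  # default retention from the text
    records: list = field(default_factory=list)  # (sequence_number, timestamp, data)
    next_seq: int = 0

    def append(self, timestamp: float, data: str) -> int:
        seq = self.next_seq
        self.records.append((seq, timestamp, data))
        self.next_seq += 1
        return seq

    def expire(self, now: float) -> None:
        # Drop records older than the retention window.
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]

    def read(self, after_seq: int, limit: int) -> list:
        # Each consumer passes its own position; reading never removes data.
        return [r for r in self.records if r[0] > after_seq][:limit]


log = RetainedLog()
for i in range(5):
    log.append(timestamp=float(i), data=f"event-{i}")

fast = log.read(after_seq=-1, limit=5)    # a fast consumer reads everything at once
slow = log.read(after_seq=-1, limit=2)    # a slow consumer reads the same data in smaller steps
replay = log.read(after_seq=-1, limit=5)  # replay: start again from the oldest retained record
```

Only `expire` removes records, which mirrors how data ages out of a stream once the retention period has passed.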
Use Case: When you need low latency (milliseconds), need to replay data, or have several different applications consuming the same data stream.
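On the producer side, a minimal sketch of pushing events into a stream might look like the following. It assumes `client` behaves like `boto3.client("kinesis")` and uses its `put_records` call. The function name `put_events`, the `key_field` parameter, and the retry count are choices made for this example.

```python
import json


def put_events(client, stream_name: str, events: list, key_field: str, max_attempts: int = 3) -> int:
    """Send events with PutRecords, resending only the entries that failed.

    `client` is expected to behave like boto3.client("kinesis").
    Returns the number of events that could not be delivered.
    """
    pending = [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e[key_field])}
        for e in events
    ]
    for _ in range(max_attempts):
        if not pending:
            break
        still_failed = []
        # PutRecords accepts at most 500 records per call.
        for start in range(0, len(pending), 500):
            batch = pending[start:start + 500]
            response = client.put_records(StreamName=stream_name, Records=batch)
            # Results come back in request order; failed entries carry an ErrorCode.
            for entry, result in zip(batch, response["Records"]):
                if "ErrorCode" in result:
                    still_failed.append(entry)
        pending = still_failed
    return len(pending)
```

A PutRecords call can succeed overall while individual records are rejected (for example when a shard is throttled), which is why the sketch inspects each result entry. A production producer would also wait with backoff between attempts.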
2. The "Delivery" Layer: Amazon Data Firehose
Think of this as a managed consumer that captures, transforms, and loads data.
Core Concept: "Load and Forget." You do not write code for consumers; you just configure the source and the destination.
Key Feature - Buffering: Firehose buffers incoming data by size (e.g., 128 MB) or time (e.g., 60 seconds) and writes a batch as soon as either threshold is reached. This makes it "near real-time" (seconds/minutes latency), not "real-time" (sub-second).
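The size-or-time rule is easy to misread, so here is a toy model of it. This is plain Python written for illustration, not Firehose code. The class name and the tiny thresholds are invented so the example stays readable.

```python
class SizeOrTimeBuffer:
    """Toy model of a size-or-time buffer: flush when either threshold is reached."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.items = []
        self.size = 0
        self.opened_at = None  # arrival time of the first record in the current batch

    def add(self, record: bytes, now: float):
        """Buffer a record; return a flushed batch if a threshold was hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.items.append(record)
        self.size += len(record)
        return self.tick(now)

    def tick(self, now: float):
        """Check both thresholds; a real service would run this check on a timer."""
        if not self.items:
            return None
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            batch, self.items, self.size, self.opened_at = self.items, [], 0, None
            return batch
        return None


buf = SizeOrTimeBuffer(max_bytes=10, max_seconds=60)
flushed_by_size = [buf.add(b"abcd", now=t) for t in (0, 1, 2)]  # third record crosses 10 bytes
buf.add(b"x", now=10)
flushed_by_time = buf.tick(now=70)  # 60 s after this batch opened
```

A busy stream flushes on size and a quiet one flushes on time, so the time threshold is effectively the upper bound on delivery latency.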
Format Conversion: It can automatically convert JSON data into Parquet or ORC before saving it to S3. Both are columnar formats, so queries over the Data Lake scan less data and run faster.
Destinations: S3, Redshift, OpenSearch, Splunk, Datadog, Snowflake.
What it is: The "Delivery Truck." It is the easiest way to load streaming data into S3, Redshift, or OpenSearch.
Key Feature: Zero Code. You just click "Source: Stream A" and "Destination: S3 Bucket," and it handles the buffering, compression, and delivery automatically.
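The same "source plus destination" setup can also be expressed as configuration instead of console clicks. The sketch below only builds the request as a dictionary. The function name and all ARNs are placeholders, and the field names follow the Firehose CreateDeliveryStream API as I understand it, so check them against the current API reference before relying on them.

```python
def s3_delivery_stream_config(name: str, source_stream_arn: str, bucket_arn: str, role_arn: str,
                              size_mb: int = 128, interval_seconds: int = 60) -> dict:
    """Build keyword arguments for the Firehose CreateDeliveryStream call."""
    return {
        "DeliveryStreamName": name,
        # Read from an existing Kinesis Data Stream instead of direct PUTs.
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": source_stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            # Flush when either the size or the time threshold is reached.
            "BufferingHints": {"SizeInMBs": size_mb, "IntervalInSeconds": interval_seconds},
            "CompressionFormat": "GZIP",
        },
    }


# All ARNs below are placeholders.
config = s3_delivery_stream_config(
    name="stream-a-to-s3",
    source_stream_arn="arn:aws:kinesis:us-east-1:111122223333:stream/stream-a",
    bucket_arn="arn:aws:s3:::example-data-lake",
    role_arn="arn:aws:iam::111122223333:role/example-firehose-role",
)
# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("firehose").create_delivery_stream(**config)
```

Note that there is no consumer loop anywhere: once the stream exists, reading, batching, compressing, and writing are handled by the service.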
Equivalents:
GCP: Pub/Sub is the closest match, but it is primarily a messaging service; Firehose stands out because buffering, format conversion, and the write to storage are built in.
3. The "Processing" Layer: Managed Service for Apache Flink
This is the "Brain." If you need to manipulate the data while it is in flight (before it hits the database), you use this.
Core Concept: It runs Apache Flink applications on managed infrastructure. Flink is a stateful processing engine that can join streams, aggregate data over time windows (e.g., "count clicks in the last 5 minutes"), or filter bad events.
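To make "aggregate over time windows" concrete, here is the click-count example as plain Python over a finished list of events. It illustrates fixed (tumbling) 5-minute windows only; it is not Flink code, and it ignores what Flink adds on top, such as continuous input, managed state, and late-arriving events. The event data is invented.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute windows


def count_per_window(events, window_seconds=WINDOW_SECONDS):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    """
    counts = Counter()
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


clicks = [(10, "home"), (120, "home"), (299, "cart"), (300, "home"), (610, "cart")]
print(count_per_window(clicks))
```

The click at second 299 and the click at second 300 fall into different windows, which is the defining behavior of a tumbling window.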
Input/Output: It usually reads from a Kinesis Data Stream and outputs to another Stream or Firehose.
Languages: You can write applications in Java, Scala, Python, or Flink SQL.
What it is: The "Complex Processor." If you need to do math on the stream in real-time (e.g., "Calculate the 5-minute rolling average of stock prices"), you use this. It uses Apache Flink, a widely adopted open-source engine for stateful stream processing.
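The rolling-average example shows what "stateful" means in practice: the processor has to remember recent values between events. A minimal single-key sketch in plain Python (not Flink code; the class name and prices are invented, and timestamps are assumed to arrive in order):

```python
from collections import deque


class RollingAverage:
    """Average of the values whose timestamps lie in (now - window_seconds, now]."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Record a value and return the average over the current window."""
        self.samples.append((timestamp, value))
        self.total += value
        # Evict samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total / len(self.samples)


avg = RollingAverage(window_seconds=300)
prices = [(0, 100.0), (120, 110.0), (240, 120.0), (360, 130.0)]
averages = [avg.add(t, p) for t, p in prices]  # [100.0, 105.0, 110.0, 120.0]
```

The `samples` deque and `total` are the state. In a managed Flink application that state is kept per key and checkpointed by the engine, so it survives restarts.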
Equivalents:
GCP: Cloud Dataflow.
Azure: Azure Stream Analytics.
The Typical "Modern Data Stack" Pipeline
Here is how they usually fit together in a Data Engineering architecture:
Ingest (KDS): IoT sensors or Web Servers push raw events into Kinesis Data Streams.
Process (Flink): Managed Flink reads from KDS, calculates a rolling average (e.g., "Average temperature per minute"), and filters out noise.
Sink (Firehose): Flink sends the clean, aggregated data to Amazon Data Firehose.
Store (S3): Firehose buffers the data for 60 seconds, converts it to Parquet, and dumps it into an S3 Data Lake.
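The four steps above can be sketched as one small offline simulation. Nothing here calls AWS: `process` stands in for the Flink job, `deliver` stands in for Firehose, and newline-delimited JSON stands in for Parquet so the example needs only the standard library. The function names, field names, noise thresholds, and readings are all invented.

```python
import json
from collections import defaultdict


def process(readings, low=-50.0, high=60.0):
    """Flink-like step: drop out-of-range readings, then average per sensor per minute."""
    sums = defaultdict(lambda: [0.0, 0])  # (sensor, minute_start) -> [sum, count]
    for r in readings:
        if low <= r["temp"] <= high:
            bucket = sums[(r["sensor"], int(r["ts"] // 60) * 60)]
            bucket[0] += r["temp"]
            bucket[1] += 1
    return [
        {"sensor": s, "minute": m, "avg_temp": total / n}
        for (s, m), (total, n) in sorted(sums.items())
    ]


def deliver(rows):
    """Firehose-like step: serialize one buffered batch as newline-delimited JSON."""
    return "\n".join(json.dumps(row) for row in rows)


raw = [  # Ingest: events as a producer might push them into the stream
    {"sensor": "a", "ts": 5, "temp": 20.0},
    {"sensor": "a", "ts": 30, "temp": 22.0},
    {"sensor": "a", "ts": 40, "temp": 999.0},  # noise, filtered out
    {"sensor": "a", "ts": 65, "temp": 23.0},
]
batch = deliver(process(raw))  # Store: this string is what would land in the S3 object
```

Four raw events shrink to two aggregated rows, which is the usual reason to put the processing step before delivery rather than after it.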

Summary Comparison
| Feature | Kinesis Data Streams (KDS) | Amazon Data Firehose | Managed Service for Apache Flink |
| --- | --- | --- | --- |
| Role | Ingestion & Storage | Delivery & Loading | Transformation & Analysis |
| Latency | Real-time (~200 ms) | Near real-time (seconds to minutes, set by buffering) | Real-time (sub-second) |
| Code Required? | Yes (Producers/Consumers) | No (Config only) | Yes (SQL/Java/Scala/Python) |
| Replay? | Yes (Data is stored) | No (Data is transient) | N/A (Processing engine) |
| Analogy | The Pipe | The Delivery Truck | The Brain |