Content from the book "Streaming Systems"
This is an educational breakdown of stream processing concepts based on "Streaming Systems: The What, Where, When, and How" by Tyler Akidau, Slava Chernyak, and Reuven Lax. All core ideas and frameworks are credited to the original authors. I merely tried to explain (with some minor additions) the material in a more concise way to help myself (and possibly others) better understand and remember these fundamental concepts. I highly recommend reading the original book for the complete treatment. While most of the content from this book covers Apache Beam and Google Cloud's Dataflow model, the concepts apply to other streaming systems as well.
Alternatively, you can read the two-part blog version: 1, 2
Intro
The following is a foundational diagram in stream processing that illustrates the relationship between event time (when events actually occur) and processing time (when the system processes those events). Let's break down each component:

The Axes
Event Time (x-axis): This represents when events actually happened in the real world. For example, if a user clicked a button at 2:00 PM, that's the event time, regardless of when your system processes it.
Processing Time (y-axis): This is the wall-clock time as your streaming system observes it while processing events. It's when your system actually gets around to handling the data.
Figure: Event Time (when it happened) vs. Processing Time (when we saw it)
The Ideal (The dashed 45° Line): In a perfect world, as soon as an event happens (12:00), you process it (12:00). This creates a perfect diagonal line. This represents perfect stream processing where there's no delay—events are processed instantly as they occur. In this ideal world, event time equals processing time. This never happens in reality.
The Reality (the solid line): This shows what actually happens in real systems. The curve deviates from the ideal line, and the gap between the two is the skew. That deviation reveals several challenges:
Processing-Time Lag: The vertical distance between the reality curve and the ideal line. This represents how far behind your system is in processing events. If an event occurred at event time X but you're processing it at processing time Y, the lag is (Y - X). This happens due to network delays, system load, batching, or downstream bottlenecks.
Event-Time Skew: The horizontal distance between the reality curve and the ideal line at any given processing time. This shows the spread or disorder in event timestamps as they arrive. Events don't always arrive in order—an event from 5 minutes ago might arrive after events from 1 minute ago due to network issues, distributed systems, mobile devices coming online, etc.
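As a rough illustration (not from the book; the timestamps are made up), a few lines of Python show how both quantities are measured. At any given point on the reality curve, lag and skew are the same gap, just viewed along different axes:

```python
from datetime import datetime

# Hypothetical (event_time, processing_time) pairs for a few observed events.
observed = [
    (datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 4)),
    (datetime(2024, 1, 1, 12, 0, 5), datetime(2024, 1, 1, 12, 0, 6)),
    (datetime(2024, 1, 1, 12, 0, 1), datetime(2024, 1, 1, 12, 0, 9)),  # arrived out of order
]

for event_time, processing_time in observed:
    # Vertical gap on the diagram: how far behind the ideal line the system is.
    lag = processing_time - event_time
    # The horizontal gap (event-time skew) at this point is the same duration,
    # measured along the event-time axis instead.
    print(f"event {event_time:%H:%M:%S} -> processed {processing_time:%H:%M:%S}, lag/skew = {lag}")
```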
The Lesson: If your system relies on the 45-degree line (assuming data arrives instantly and in order), your results will be wrong.
Why This Matters
This diagram illustrates why streaming systems need sophisticated mechanisms like watermarks, windowing, and late-data handling. You can't simply assume events arrive in order or without delay, so you need strategies to determine when you have "enough" data to produce accurate results.
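To illustrate the idea, here is a toy Python sketch (not Beam or Flink code; the window size and skew bound are made up) of event-time windowing with a heuristic watermark that decides when a window has "enough" data:

```python
from collections import defaultdict

WINDOW = 60        # fixed event-time windows of 60 seconds
ALLOWED_SKEW = 10  # heuristic: assume events arrive at most 10s out of order

def window_start(ts):
    return ts - ts % WINDOW

counts = defaultdict(int)   # window_start -> event count
watermark = float("-inf")   # estimate of "event time up to which we have seen everything"
emitted = set()

def process(event_ts):
    global watermark
    if event_ts < watermark:
        # A real system would route this to a late-data policy instead of dropping it.
        print(f"late event at t={event_ts}, dropped")
        return
    counts[window_start(event_ts)] += 1
    # Heuristic watermark: trail the max event time seen by the allowed skew.
    watermark = max(watermark, event_ts - ALLOWED_SKEW)
    for w in sorted(counts):
        if w + WINDOW <= watermark and w not in emitted:
            print(f"window [{w}, {w + WINDOW}) complete: {counts[w]} events")
            emitted.add(w)

for ts in [5, 20, 61, 55, 130, 3, 190]:  # note 55 and 3 arrive out of order
    process(ts)
```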
Ingestion time can be considered a middle ground in time semantics. Ingestion Time is the timestamp assigned to a data record the moment it enters your data system, before any actual processing logic runs. In the book's "Event Time vs. Processing Time" graphs, Ingestion Time essentially acts as a "Processing Time" that is locked in place once recorded.

About Ingestion Time
Ingestion Time is a critical concept in practical engineering (especially in tools like Apache Flink and Kafka).
Here is where it fits:
1. The Three Times
To understand Ingestion Time, you have to see it in the sequence of the data's journey.
Event Time: The user clicks a button (timestamp created on the phone). Property: correctness.
Ingestion Time: The event reaches your message broker, e.g., Kafka/PubSub (timestamp created by the broker). Property: stability and order.
Processing Time: The event is read by your worker node for computation (timestamp created by the Flink/Beam worker). Property: low latency / load balancing.
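As a minimal sketch (all names here are hypothetical), one record conceptually carries all three timestamps as it moves through the pipeline:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ClickEvent:
    user_id: str
    event_time: datetime       # stamped on the phone, travels with the record
    ingestion_time: datetime   # stamped by the broker on arrival
    processing_time: datetime  # stamped by the worker when computed

# Hypothetical record: created at 12:00, reached the broker at 12:02,
# picked up by a worker at 12:02:30.
click = ClickEvent(
    user_id="u42",
    event_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    ingestion_time=datetime(2024, 1, 1, 12, 2, tzinfo=timezone.utc),
    processing_time=datetime(2024, 1, 1, 12, 2, 30, tzinfo=timezone.utc),
)
```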
2. What is Ingestion Time specifically?
Ingestion Time is the timestamp assigned to a data record the moment it enters your data system, before any actual processing logic runs.
In Kafka: This is LogAppendTime. It is the timestamp the Kafka broker attaches to the message when it writes it to disk.
In Flink: Historically, IngestionTime was a specific setting. It assigns timestamps at the Source Operator (the very first step of the pipeline).
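For illustration, here is how a topic could be configured for broker-assigned timestamps using the confluent-kafka Python client. The client library choice, broker address, and topic name are assumptions; message.timestamp.type is a real Kafka topic-level config whose values are CreateTime and LogAppendTime:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder broker address.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "clicks",  # hypothetical topic name
    num_partitions=3,
    replication_factor=1,
    # Tell the broker to stamp each record with its own clock on append,
    # instead of trusting the producer-supplied (event-time) timestamp.
    config={"message.timestamp.type": "LogAppendTime"},
)

futures = admin.create_topics([topic])
futures["clicks"].result()  # raises if topic creation failed
```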
3. Why use Ingestion Time?
You use Ingestion Time when you want a "Hybrid" approach. It offers a compromise between the chaos of Event Time and the randomness of Processing Time.
Benefit 1: Determinism (Unlike Processing Time) If you re-run your pipeline tomorrow on the same data:
Processing Time results will change (because the worker clock is different tomorrow).
Ingestion Time results will stay the same (because the timestamp was stamped when the data hit the broker and saved to disk).
Verdict: Ingestion Time allows for reproducible results, whereas Processing Time does not; see the sketch after these benefits.
Benefit 2: No "Late Data" (Unlike Event Time) Since the timestamp is assigned by the system as data arrives, timestamps are monotonically increasing (per partition, in Kafka's case), so the data is, by definition, in order. You don't need complex Watermarks or Late Triggers.
Verdict: It is much cheaper and simpler to compute than Event Time.
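To make the reproducibility point concrete, here is a minimal Python sketch (all record data and bucket sizes are made up): bucketing on persisted ingestion timestamps yields identical results on every run, while bucketing on the worker clock does not.

```python
import time
from collections import Counter

# Hypothetical stored records: each carries the ingestion timestamp (in seconds)
# that the broker stamped when the data first arrived.
records = [("buy", 100), ("buy", 119), ("view", 130), ("buy", 165)]

def count_by_minute(timestamps):
    # Bucket timestamps into 60-second windows and count them.
    return Counter(ts - ts % 60 for ts in timestamps)

# Ingestion time: the stamp is persisted with the record, so every re-run
# of the pipeline buckets the data identically.
print(count_by_minute(ing for _, ing in records))

# Processing time: the stamp is taken from the worker clock at run time,
# so re-running tomorrow lands in entirely different buckets.
print(count_by_minute(int(time.time()) for _ in records))
```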
4. The Downside (The "Fake Reality")
The problem with Ingestion Time is that it is not real.
If a user clicks "Buy" at 12:00, but the network is down and the signal reaches Kafka at 14:00:
Ingestion Time says the purchase happened at 14:00.
Analytics Result: You will report that sales spiked at 2:00 PM, when they actually spiked at 12:00 PM.
When to use which?

| Time Domain | Question it answers | Best for... |
| --- | --- | --- |
| Event Time | "When did it actually happen?" | Accurate analytics, billing, user behavior. |
| Ingestion Time | "When did we receive it?" | Auditing, log analysis, stable/reproducible pipelines where exact user timing doesn't matter. |
| Processing Time | "When did we calculate it?" | System monitoring, calculating latency (Processing − Ingestion), debugging lag. |