Transformations
What: Transformations
The Core Question: What results are we calculating?
In the context of unbounded data processing, "Transformations" refers to the specific operations you apply to your data stream. While there are infinite specific functions (parsing JSON, training models, summing revenue), they all fall into two architectural categories based on how they handle relationships between records.
1. Element-Wise Transformations
Definition: Operations that process a single element at a time, independent of any other elements.
Mechanism: One record enters the function → One (or zero/many) records exit.
State: Stateless. The system does not need to remember previous events to process the current one.
Common Types:
Filter: "Drop this record if
status != 200." (1 in → 0 or 1 out)Map: "Convert this CSV line into a JSON object." (1 in → 1 out)
FlatMap: "Take this sentence and split it into individual words." (1 in → Many out)
Implication for Streaming:
These are easy. Because they don't depend on other data, you don't need to worry about time, order, or completeness. You just process events as they fly by.
Skew: Doesn't matter.
Watermarks: Not needed.
2. Aggregations (Grouping)
Definition: Operations that compute a result from multiple input elements.
Mechanism: Many records enter One result exits.
State: Stateful. The system must buffer data until it has "enough" to calculate the result.
Common Types:
Sum: "Calculate total revenue."
Count: "Count distinct users."
Min/Max: "Find the highest temperature."
Join: "Combine the click stream with the user profile stream."
Implication for Streaming:
These are hard. To sum numbers, you need to answer a critical question: "Which numbers am I summing?"
Since the data is infinite, you cannot sum "everything." You must define boundaries.
This necessitates Windowing. You cannot have an Aggregation on unbounded data without defining a Window (the "Where").
3. Composite Transformations
In real-world pipelines, you rarely do just one. You combine them into a Directed Acyclic Graph (DAG).
Example:
Filter (Element-wise): Keep only "Purchase" events.
Map (Element-wise): Extract the "Price" field.
Sum (Aggregation): Calculate total revenue per hour.
Since most valuable insights come from Aggregations (counting, summing, trending), we immediately run into the problem of "How do we slice this infinite stream into finite groups so we can sum them?"
This leads directly to the next question: Where: Windowing.
Last updated