Data Ingestion and Data Integration
Data Ingestion
(The information below comes from Chapter 7: Ingestion, in Fundamentals of Data Engineering)
What Is Data Ingestion?
Ingestion is the process of moving data from source systems into storage. It is the first step in the data engineering lifecycle after data is generated, focusing on reliable transport rather than transformation.
Key Engineering Considerations for the Ingestion Phase
Bounded Versus Unbounded Data: Bounded data has a discrete start and end (often associated with batch processing), while unbounded data is an infinite flow (often associated with streaming).
Frequency: This refers to how often data is moved—batch (scheduled intervals), micro-batch, or real-time streaming.
Synchronous Versus Asynchronous Ingestion: Synchronous ingestion waits for a confirmation that the data has been written before proceeding, while asynchronous fires the data off without blocking the process.
Serialization and Deserialization: This involves encoding data into a format for transfer (serialization) and decoding it at the destination (deserialization) to ensure the data is usable after transport.
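A minimal illustration using Python's built-in json module (the record's field names are made up): serialization encodes an in-memory record into bytes for transport, and deserialization restores it at the destination.

```python
import json

# A record as it exists in the source application's memory.
record = {"order_id": 1042, "item": "widget", "quantity": 3}

# Serialization: encode the record into a byte string suitable for transport.
payload = json.dumps(record).encode("utf-8")

# ... payload travels over the network or lands in a queue or file ...

# Deserialization: decode the bytes back into a usable in-memory structure.
restored = json.loads(payload.decode("utf-8"))
assert restored == record
```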
Throughput and Scalability: Systems must be designed to handle spikes in data volume and backlogs, often using buffering or managed services to scale elastically.
Reliability and Durability: This covers ensuring data isn't lost during transit and persists effectively at the destination.
Payload: The actual content of the data being moved, including its shape (dimensions), kind (audio, text, tabular), and volume.
Push Versus Pull Versus Poll Patterns:
Push: The source sends data to the pipeline.
Pull: The system fetches data from the source.
Poll: The system periodically checks for updates and then pulls if new data exists.
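A rough sketch of the poll-then-pull pattern in Python (the endpoint URL, cursor field, and `requests` client are assumptions, not a real API): the system periodically asks the source whether anything is new and pulls only when it is.

```python
import time
import requests  # assumed available; any HTTP client works

last_seen_id = 0  # hypothetical cursor tracking the newest record already ingested

while True:
    # Poll: ask the source whether records newer than our cursor exist.
    resp = requests.get(
        "https://source.example.com/api/records",  # hypothetical endpoint
        params={"since_id": last_seen_id},
    )
    resp.raise_for_status()
    new_records = resp.json()

    if new_records:
        # Pull: hand the new records to the pipeline and advance the cursor.
        for rec in new_records:
            print("ingesting", rec)  # placeholder for the real ingestion step
        last_seen_id = max(r["id"] for r in new_records)

    time.sleep(60)  # wait before polling again
```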
Batch Ingestion Considerations
Snapshot or Differential Extraction:
Snapshot: Captures the full state of the source system every time (simpler but storage-heavy).
Differential: Ingests only the changes (deltas) since the last read (more efficient).
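A minimal sketch of differential extraction using a high-water mark, with sqlite3 standing in for the source database (the table, columns, and file name are invented):

```python
import sqlite3

def extract_delta(conn, last_watermark):
    """Pull only rows modified since the previous run (differential extraction)."""
    cur = conn.execute(
        "SELECT id, customer, amount, updated_at "
        "FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # The new watermark is the latest timestamp seen; persist it for the next run.
    new_watermark = rows[-1][3] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect("source.db")  # hypothetical source database
rows, watermark = extract_delta(conn, "2024-01-01T00:00:00")
```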
File-Based Export and Ingestion: Source systems export data as files (like CSV or Parquet) to be pushed to a target, avoiding direct database connections.
ETL Versus ELT:
ETL (Extract-Transform-Load): Data is transformed before it lands in the data warehouse.
ELT (Extract-Load-Transform): Data is loaded raw into the destination first, and transformations happen within the destination storage system.
Inserts, Updates, and Batch Size: Engineers must optimize write performance by using bulk operations rather than many small inserts, as some databases (like columnar stores) perform poorly with frequent single-row inserts.
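A small sketch of the batching idea with sqlite3 (any database client with a bulk API behaves similarly): one bulk operation replaces thousands of per-row round trips.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"event-{i}") for i in range(10_000)]

# Slow pattern: one INSERT per row, each with its own round-trip/commit overhead.
# for row in rows:
#     conn.execute("INSERT INTO events VALUES (?, ?)", row)

# Preferred pattern: a single bulk operation over the whole batch.
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()
```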
Data Migration: Moving large volumes of historical data (often TBs+) involves different challenges than daily ingestion, such as schema compatibility and moving pipeline connections.
Message and Stream Ingestion Considerations
Schema Evolution: Handling changes in data structure (e.g., new fields) without breaking downstream pipelines.
Late-Arriving Data: Handling events that arrive out of order or after a processing window has closed.
Ordering and Multiple Delivery: Ensuring messages are processed in the correct sequence and handling "at-least-once" vs. "exactly-once" delivery guarantees.
Replay: The ability to re-process historical data streams to correct errors or backfill new models.
Time to Live (TTL): How long a message persists in the stream/queue before it is discarded.
Message Size: The limits on how large a single event payload can be.
Error Handling and Dead-Letter Queues: Routing bad or unprocessable messages to a separate queue for analysis so they don't block the main pipeline.
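A toy sketch of the dead-letter pattern in Python (the queues here are plain lists; in practice they would be topics or queues in your messaging system): messages that fail parsing are diverted rather than allowed to halt the consumer.

```python
import json

incoming = ['{"user": "a", "value": 1}', 'not-json-at-all', '{"user": "b", "value": 2}']

processed = []
dead_letter_queue = []  # stands in for a real DLQ topic/queue

for raw_message in incoming:
    try:
        event = json.loads(raw_message)  # attempt normal processing
        processed.append(event)
    except json.JSONDecodeError as exc:
        # Route the bad message (plus the error) to the DLQ for later inspection,
        # instead of letting it block or crash the main pipeline.
        dead_letter_queue.append({"message": raw_message, "error": str(exc)})

print(len(processed), "processed,", len(dead_letter_queue), "dead-lettered")
```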
Consumer Pull and Push:
Pull: Consumers fetch data from a topic (e.g., Kafka); see the sketch after this list.
Push: The stream pushes data to a listener (e.g., Pub/Sub).
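A sketch of the consumer-pull side, assuming the kafka-python client and a local broker (the topic and group names are placeholders): the consumer connects to the broker and fetches records itself rather than having them pushed to it.

```python
from kafka import KafkaConsumer  # kafka-python; assumed installed

# Consumer pull: the consumer subscribes and fetches records from the broker.
consumer = KafkaConsumer(
    "orders",                          # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="ingestion-service",
    auto_offset_reset="earliest",
)

for message in consumer:               # each iteration pulls records from the broker
    print(message.topic, message.offset, message.value)
```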
Location: Deciding where ingestion happens (cloud vs. edge) to balance latency and bandwidth costs.
Ways to Ingest Data
Direct Database Connection: Connecting directly to a source database to query data (can impact source performance).
Change Data Capture (CDC): Reading the database transaction logs to capture changes (inserts, updates, deletes) in real-time.
APIs: Fetching data from external systems via REST or other API standards.
Message Queues and Event-Streaming Platforms: Using systems like Kafka, Kinesis, or RabbitMQ to ingest real-time events.
Managed Data Connectors: Using third-party tools (like Fivetran or Airbyte) that come with pre-built connectors for common SaaS apps.
Moving Data with Object Storage: Using cloud buckets (S3, GCS) as a landing zone for data before processing.
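A short sketch of landing a file in object storage with boto3 (the bucket, key, and local path are placeholders; AWS credentials are assumed to be configured in the environment):

```python
import boto3  # AWS SDK for Python; assumed installed and configured with credentials

s3 = boto3.client("s3")

# Drop a daily export into a raw "landing zone" prefix; downstream jobs pick it up from there.
s3.upload_file(
    Filename="exports/orders_2024-05-01.csv",    # local file produced by the source export
    Bucket="my-company-data-lake",               # placeholder bucket
    Key="landing/orders/2024-05-01/orders.csv",  # partition-style key layout
)
```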
EDI (Electronic Data Interchange): Handling older, standardized formats for business-to-business document exchange.
Databases and File Export: Exporting database contents to flat files for ingestion elsewhere.
Practical Issues with Common File Formats
Delimiters & Escaping: In formats like CSV, issues arise when the data itself contains the delimiter (e.g., a comma inside a "Name" column). You need robust handling for quotes and escape characters.
Encoding: Mismatches between UTF-8, ASCII, or Windows-1252 can corrupt data during ingestion.
Header Changes: If a source system adds, removes, or renames a column header, it can break the ingestion pipeline if the schema isn't flexible.
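A hedged sketch of defensive CSV handling with Python's csv module, covering the three issues above (the file name and expected column names are invented): proper quoting handles embedded delimiters, an explicit encoding avoids corrupted characters, and a header check fails fast instead of silently mis-mapping columns.

```python
import csv

EXPECTED_COLUMNS = {"id", "name", "amount"}  # hypothetical expected schema

with open("export.csv", newline="", encoding="utf-8") as f:  # be explicit about encoding
    reader = csv.DictReader(f)  # the csv module handles quoted delimiters like "Smith, John"

    # Header check: fail loudly if the source added, removed, or renamed columns.
    actual = set(reader.fieldnames or [])
    if actual != EXPECTED_COLUMNS:
        raise ValueError(f"Schema drift detected: expected {EXPECTED_COLUMNS}, got {actual}")

    for row in reader:
        pass  # hand each row to the rest of the pipeline
```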
Shell: Using command-line scripts for moving data.
Usage: Simple, effective for moving files between local directories or triggering scripts.
Pros/Cons: It's universal and powerful (using pipes | to chain commands), but can be brittle if error handling isn't robust.
SSH (Secure Shell): A protocol for operating network services securely over an unsecured network.
Ingestion Role: Data engineers often use SSH to log into remote servers to manually trigger batch jobs, debug pipelines, or set up secure tunnels for data transfer.
SFTP and SCP: Secure methods for transferring files over a network.
SFTP (SSH File Transfer Protocol): A secure version of FTP that runs over SSH. It allows you to traverse the file system and transfer files.
SCP (Secure Copy Protocol): A simpler, faster way to copy files between hosts on a network, also running over SSH.
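A minimal SFTP pull sketch using the paramiko library (assumed installed; the host, credentials, and paths are placeholders):

```python
import paramiko  # a common Python SSH/SFTP client; assumed installed

# Connect to the remote host over SSH, then open an SFTP session on top of it.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # fine for a sketch; pin host keys in production
client.connect("sftp.example.com", username="ingest", password="***")  # placeholder credentials

sftp = client.open_sftp()
sftp.get("/exports/orders_2024-05-01.csv", "landing/orders_2024-05-01.csv")  # remote -> local
sftp.close()
client.close()
```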
Webhooks: A "reverse API" where a source pushes data to your endpoint when an event occurs.
How it works: Instead of you asking an API "Do you have new data?" (Polling), the source system sends an HTTP POST request to your system immediately when an event occurs (e.g., a payment is processed).
Ingestion Role: Crucial for near real-time ingestion without the overhead of constant polling.
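A bare-bones webhook receiver sketch using Flask (assumed installed; the route path and payload shape are made up): the source system POSTs to this endpoint whenever an event fires, and the handler acknowledges quickly before handing the payload to the pipeline.

```python
from flask import Flask, request  # Flask assumed installed; any web framework works

app = Flask(__name__)

@app.route("/webhooks/payments", methods=["POST"])  # hypothetical endpoint path
def receive_payment_event():
    event = request.get_json(force=True)  # the source pushes JSON to this endpoint
    # In a real pipeline, write the event to a queue or object storage rather than
    # processing inline, so a slow downstream step never makes the source's call time out.
    print("received event:", event)
    return "", 204  # acknowledge quickly

if __name__ == "__main__":
    app.run(port=8080)
```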
Web Interface: Manually or programmatically interacting with web portals to retrieve data.
Usage: Sometimes data is only available by logging into a website and clicking "Export CSV."
Engineering view: This is generally discouraged for automation because it is hard to replicate reliably, but sometimes necessary for ad-hoc ingestion.
Web Scraping: Extracting data from websites that do not offer a formal API.
Usage: Used when there is no official API or file export available.
Risk: It is extremely brittle; if the website owner changes the HTML structure (e.g., renames a <div> class), your ingestion pipeline breaks. It is considered a "last resort."
Transfer Appliances for Data Migration: Physical hardware used to physically ship massive amounts of data to a cloud provider when network transfer is too slow.
Examples: AWS Snowball, Google Transfer Appliance, Azure Data Box.
Concept: You essentially "ship the hard drives" to the cloud provider. It is a high-bandwidth, high-latency method of ingestion.
Data Sharing: Methods for securely sharing data between organizations or business units. A modern capability (popularized by platforms like Snowflake and Databricks) that avoids moving data altogether.
Concept: Instead of copying data from Vendor A's bucket to Company B's bucket (ETL), Vendor A grants Company B secure access to read the data directly where it sits.
Benefit: Zero-copy ingestion; the data is instantly available without pipelines or storage duplication.
Data Integration
What is Data Integration?
Data integration is the process of combining data from multiple, disparate sources into a unified and consistent view. While individual systems (like a sales app or a shipping log) hold fragments of information, data integration brings these fragments together to create a "single source of truth".
Unlike simple movement, integration focuses on making data useful and actionable. It typically involves cleaning, transforming, and harmonizing the data—such as fixing errors, removing duplicates, and resolving conflicts (e.g., matching "J. Smith" in one system to "John Smith" in another).
The ultimate goal is to provide a comprehensive dataset that is ready for analytics, business intelligence (BI), and decision-making.
Comparison: Data Integration vs. Data Ingestion
While both are critical parts of a data pipeline, they serve distinct purposes and operate at different levels of complexity.
| Feature | Data Ingestion | Data Integration |
| --- | --- | --- |
| Primary Goal | Transport: Getting data from Point A to Point B. | Unification: Making data from Point A and Point B work together. |
| Data State | Raw: The data is usually moved "as-is" without changing its format or content. | Refined: The data is cleaned, standardized, and transformed into a new, consistent format. |
| Complexity | Logistical: Focuses on connectivity, throughput, and reliable delivery. | Logical: Focuses on business logic, schema mapping, and resolving data conflicts. |
| Timing | First Step: Usually occurs at the very beginning of the pipeline (e.g., getting data into a Data Lake). | Subsequent Step: Typically happens after ingestion, once the raw data is available to be processed. |
| Number of Sources | Can deal with a single source at a time (e.g., dumping one database into storage). | Inherently deals with multiple sources to combine them. |
In short:
Ingestion is like the delivery truck that brings raw ingredients (vegetables, meat, spices) from different farms to the restaurant's kitchen.
Integration is like the chef who chops, seasons, and cooks those ingredients together to create a finished meal.