Serving Data for Machine Learning (ML) and Analytics

While Machine Learning Engineering (MLE) has emerged as a specialized discipline, the data engineer remains the backbone of the ML lifecycle. The boundary between these roles is often porous; in some organizations, data engineers handle everything up to model training, including feature engineering, while in others, they simply provide raw pipelines.

Regardless of the organizational chart, the serving requirements for ML often mirror those of advanced analytics. The goal is to provide reliable, high-quality data to the algorithms and scientists that need it.

Core Mechanisms for Serving Data

Data engineers have five primary mechanisms for serving data to analytics and ML consumers, ranging from simple file exchange to virtualized queries and live streams.

1. File Exchange

This is the most ubiquitous method, handling everything from structured CSVs for financial analysis to unstructured images for computer vision models.

  • Scalability: Emailing Excel files is common, but it creates version-control chaos and security risks. Modern file exchange instead relies on object storage (the foundation of the data lake) to serve data at scale; a minimal upload sketch follows this list.

  • Flexibility: Object storage is particularly vital for ML because it can natively store the unstructured "blobs" (audio, video, text) required for deep learning.
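
As a rough illustration of file exchange at scale, the sketch below writes a DataFrame to Parquet and uploads it to object storage with boto3. The bucket, key, and dataset are hypothetical placeholders, and a GCS or Azure client would follow the same pattern.

  # Minimal sketch: serving a dataset as a Parquet file in object storage.
  # Assumes pandas (with pyarrow) and boto3 are installed and AWS credentials
  # are already configured; bucket and key names are hypothetical.
  import boto3
  import pandas as pd

  df = pd.DataFrame({"customer_id": [1, 2, 3], "lifetime_value": [120.0, 340.5, 87.25]})

  # Parquet preserves schema and compresses well, unlike ad hoc CSV or Excel exchange.
  local_path = "/tmp/customer_ltv.parquet"
  df.to_parquet(local_path, index=False)

  s3 = boto3.client("s3")
  s3.upload_file(local_path, "analytics-serving-bucket", "exports/customer_ltv.parquet")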

2. Databases (OLAP)

The Data Warehouse or Data Lakehouse is the standard serving layer for structured analytics.

  • Compute/Storage Separation: Modern platforms (like Snowflake or Databricks) let engineers isolate workloads. You can spin up one cluster for heavy ETL and a separate, isolated cluster where data scientists run training-scale queries without degrading performance for business dashboard users; see the connection sketch after this list.

  • Pushdown vs. Extract: BI tools interact with this layer differently. Some (like Tableau) extract data into their own local storage engine for speed, while others (like Looker) push SQL queries down to the warehouse and lean on its compute.
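
As a sketch of compute isolation, the snippet below connects to a warehouse through a virtual warehouse reserved for data-science work, so training-scale scans never compete with BI dashboards. The connector usage follows snowflake-connector-python, but the account variables, warehouse, and table names are assumptions for illustration.

  # Minimal sketch: running data-science queries on an isolated compute cluster.
  # Assumes snowflake-connector-python; credentials come from the environment,
  # and the warehouse, database, and table names are hypothetical.
  import os
  import snowflake.connector

  conn = snowflake.connector.connect(
      account=os.environ["SNOWFLAKE_ACCOUNT"],
      user=os.environ["SNOWFLAKE_USER"],
      password=os.environ["SNOWFLAKE_PASSWORD"],
      warehouse="DS_TRAINING_WH",  # separate cluster from the ETL and BI warehouses
      database="ANALYTICS",
      schema="SERVING",
  )
  cur = conn.cursor()
  cur.execute("SELECT customer_id, churn_features FROM customer_features LIMIT 1000")
  rows = cur.fetchall()
  conn.close()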

3. Query Federation

Query federation (or virtualization) allows users to query data across multiple disparate sources (RDBMS, APIs, Object Storage) without first centralizing it into a single warehouse.

  • Use Case: This is ideal for ad hoc exploration or strict compliance scenarios where data cannot be moved.

  • Trade-off: It offers great flexibility but introduces performance variability and potential strain on source systems.
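
As a hedged example of federation, the query below joins a Postgres table with Parquet-backed events through a Trino coordinator, without copying either source into a warehouse first. It assumes the trino Python client and catalogs named "postgresql" and "hive" that an administrator has already configured; hosts and table names are invented.

  # Minimal sketch: one federated query across two source systems via Trino.
  # Assumes the trino package and pre-configured catalogs; all names are hypothetical.
  from trino.dbapi import connect

  conn = connect(host="trino.internal", port=8080, user="analyst")
  cur = conn.cursor()
  cur.execute("""
      SELECT o.order_id, o.amount, c.campaign
      FROM postgresql.public.orders AS o
      JOIN hive.marketing.click_events AS c
        ON o.order_id = c.order_id
  """)
  results = cur.fetchall()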

4. Data Sharing

Cloud-native data sharing allows for "zero-copy" data exchange. Instead of moving data via FTP or APIs, organizations grant secure access to live tables. This turns data serving into an access control management task rather than a pipeline engineering task.
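
On the recipient side, zero-copy sharing can be as simple as reading a table the provider has granted. The sketch below uses the open Delta Sharing client as one example; the profile file, share, and table names are placeholders, and warehouse-native sharing (e.g., Snowflake shares) would instead be plain SQL grants.

  # Minimal sketch: consuming a shared live table without moving or copying data.
  # Assumes the delta-sharing package and a profile file issued by the provider;
  # the share, schema, and table names are hypothetical.
  import delta_sharing

  table_url = "provider_profile.share#sales_share.public.daily_revenue"
  df = delta_sharing.load_as_pandas(table_url)
  print(df.head())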

5. Streaming Systems

As operational analytics grows, serving data via streams (e.g., Kafka) or real-time OLAP databases is becoming essential. This enables use cases where the value of data decays rapidly, such as fraud detection or live inventory management.
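
As a minimal streaming sketch, the producer below publishes enriched order events to a Kafka topic for downstream consumers such as a fraud model or a real-time OLAP database. It assumes the confluent-kafka client; the broker address and topic name are hypothetical.

  # Minimal sketch: serving events to downstream consumers through a Kafka topic.
  # Assumes the confluent-kafka package and a reachable broker; broker address
  # and topic name are hypothetical.
  import json
  from confluent_kafka import Producer

  producer = Producer({"bootstrap.servers": "kafka.internal:9092"})

  event = {"order_id": 1234, "amount": 59.90, "flagged": False}
  producer.produce("orders.enriched", value=json.dumps(event).encode("utf-8"))
  producer.flush()  # block until the broker acknowledges delivery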

The Semantic and Metrics Layer

A common failure mode in serving is metric drift, where different departments calculate "Gross Revenue" differently because each writes its own SQL logic.

  • The Solution: A Semantic or Metrics Layer (using tools like dbt or Looker) centralizes business logic.

  • Write Once, Use Everywhere: Definitions are coded once in this layer. Whether the data is requested by a BI dashboard, a Python notebook, or an external app, the metrics layer compiles the request into the correct SQL, ensuring a single source of truth; a toy compilation sketch follows this list.
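
The toy sketch below is not how dbt or Looker work internally; it only illustrates the "write once, use everywhere" idea, with a single registered definition of gross revenue compiled into SQL for whichever consumer asks.

  # Toy illustration of a metrics layer: one canonical definition, many consumers.
  # The metric, table, and column names are hypothetical.
  METRICS = {
      "gross_revenue": {"expr": "SUM(order_amount)", "table": "analytics.orders"},
  }

  def compile_metric(name: str, group_by: str) -> str:
      """Render the canonical SQL for a registered metric."""
      m = METRICS[name]
      return (
          f"SELECT {group_by}, {m['expr']} AS {name} "
          f"FROM {m['table']} GROUP BY {group_by}"
      )

  # A dashboard, a notebook, and an external app all receive the same logic.
  print(compile_metric("gross_revenue", group_by="region"))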

Serving Data in Notebooks

For Data Scientists, the Jupyter Notebook is the primary IDE. Serving data here requires specific attention to developer experience and security.

  • The "Local" Trap: Scientists often start by loading entire datasets into local memory with pandas, which breaks down as data outgrows a single machine's RAM. Engineers must help migrate these workflows to cloud-based notebooks or distributed compute frameworks (Spark, Dask) that can handle production-scale data.

  • Credential Management: A critical security risk in notebooks is hard-coded database passwords. Data engineers must enforce the use of environment variables or CLI-based credential managers to keep secrets from leaking into version control; a minimal sketch follows this list.
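
A minimal sketch of the credential pattern: the notebook reads secrets from environment variables (set by a .env file or the platform's secret store) and never embeds them in code. The variable names, connection URL, and table are assumptions.

  # Minimal sketch: connecting from a notebook without hard-coded passwords.
  # Assumes SQLAlchemy with a Postgres driver; variable and table names are
  # hypothetical.
  import os
  import pandas as pd
  from sqlalchemy import create_engine

  user = os.environ["WAREHOUSE_USER"]          # set outside the notebook
  password = os.environ["WAREHOUSE_PASSWORD"]  # never committed to version control
  host = os.environ["WAREHOUSE_HOST"]

  engine = create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:5432/analytics")
  df = pd.read_sql("SELECT * FROM serving.customer_features LIMIT 1000", engine)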

Reverse ETL: Closing the Loop

"Reverse ETL" refers to the process of moving data from the Data Warehouse back into operational systems of record (like CRMs, Advertising platforms, or ERPs).

  • Operationalizing Insights: Instead of trapping a "Lead Score" in a dashboard where a salesperson might miss it, Reverse ETL pushes that score directly into Salesforce. This reduces friction and puts data into the tools users already inhabit.

  • Feedback Loops: A danger of Reverse ETL is the creation of unintended feedback loops. For example, an algorithm that increases ad bids based on performance data fed back into the ad platform could accidentally spiral into excessive spending if not carefully monitored.
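
As a hedged sketch of the lead-score example, the snippet below pushes warehouse-computed scores into Salesforce records. It uses the simple-salesforce client; the credentials, record IDs, and the custom field name (Lead_Score__c) are invented, and a real pipeline would batch updates and respect API rate limits.

  # Minimal sketch: Reverse ETL of a lead score from the warehouse into a CRM.
  # Assumes the simple-salesforce package; credentials, record IDs, and the
  # custom field name are hypothetical.
  import os
  from simple_salesforce import Salesforce

  sf = Salesforce(
      username=os.environ["SF_USER"],
      password=os.environ["SF_PASSWORD"],
      security_token=os.environ["SF_TOKEN"],
  )

  # Scores would normally come from a warehouse query, batched and monitored.
  scores = {"00Q5e00000AbCdE": 87, "00Q5e00000FgHiJ": 42}
  for lead_id, score in scores.items():
      sf.Lead.update(lead_id, {"Lead_Score__c": score})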

