What you should know about Machine Learning as a Data Engineer


Books:


What a Data Engineer should know about Machine Learning

Key ML Concepts for Data Engineers

While a Data Engineer does not need to be an expert in algorithmic theory, possessing a functional understanding of Machine Learning is crucial for collaboration, resource planning, and building scalable data products. The required knowledge generally falls into four categories:

1. Core ML Paradigms & Techniques

Understanding the "what" and "why" of model building helps engineers anticipate data requirements.

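  • Learning Types: The distinctions between supervised learning (training on labeled examples), unsupervised learning (finding structure in unlabeled data), and semi-supervised learning (combining a small labeled set with a larger unlabeled one).

  • Problem Types: The difference between classification (predicting categories) and regression (predicting continuous numeric values), as well as specific techniques for time-series analysis and forecasting.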

  • Model Complexity: Knowing when to use "classical" techniques (like logistic regression or decision trees) versus Deep Learning. Data scientists often default to Deep Learning when it may be overkill; a data engineer should be able to spot when a simpler model might scale better.

  • AutoML vs. Handcrafted: Understanding the trade-offs between using automated ML tools versus custom-built models.

2. Data Preparation & Feature Engineering

Data Engineers often own the pipeline up to the point of training.

  • Data Types: Strategies for wrangling both structured (tabular) and unstructured (images, text) data.

  • Featurization: ML models ultimately operate on numerical input. Engineers must understand how raw data is converted into numbers, including encoding categorical data and creating embeddings.

  • Data Cascades: Being aware of how upstream data issues can compound and negatively impact downstream ML models.
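The categorical encoding mentioned under Featurization can be sketched with one-hot encoding. This is a minimal illustration only: the function names `fit_one_hot` and `one_hot` and the sample `colors` column are hypothetical, and mapping unseen categories to an all-zero vector is just one possible design choice.

```python
def fit_one_hot(values):
    """Learn a stable category -> column index mapping from training data."""
    # Sorting makes the column order reproducible across runs.
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def one_hot(value, vocab):
    """Encode one value as a 0/1 vector; unseen categories become all zeros."""
    vec = [0] * len(vocab)
    idx = vocab.get(value)
    if idx is not None:
        vec[idx] = 1
    return vec


# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
vocab = fit_one_hot(colors)
encoded = [one_hot(c, vocab) for c in colors]
```

The vocabulary is learned once from training data and reused at serving time, so the pipeline must store it alongside the model; otherwise the column order can silently change between training and inference.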

3. Infrastructure & Hardware

Decisions about where and how models run are often engineering responsibilities.

  • Compute Resources: Knowing when to use CPUs versus GPUs, and deciding whether training should happen locally, on a cluster, or at the edge.

  • Training Patterns: The difference between Batch Learning (training offline on historical data) and Online Learning (updating the model incrementally as new streaming data arrives).
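The contrast between the two training patterns can be sketched on a one-feature linear model. Everything here is an illustrative assumption: the names `fit_batch` and `OnlineLinearModel`, the learning rate, and the synthetic noise-free stream `y = 2x + 1`. The batch version needs the full dataset at once; the online version sees one record at a time and keeps no history.

```python
def fit_batch(xs, ys):
    """Batch learning: closed-form least squares over the full historical dataset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov / var
    return w, mean_y - w * mean_x


class OnlineLinearModel:
    """Online learning: one stochastic gradient step per arriving record."""

    def __init__(self, lr=0.2):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # Gradient of squared error (up to a constant factor) for one example.
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err


# Synthetic, noise-free stream (assumption for illustration): y = 2x + 1.
xs = [(i % 10) / 10 for i in range(2000)]
ys = [2 * x + 1 for x in xs]

batch_w, batch_b = fit_batch(xs, ys)

online = OnlineLinearModel()
for x, y in zip(xs, ys):  # records arrive one at a time
    online.update(x, y)
```

On this toy stream both approaches end up near the same parameters. The engineering difference is in the data access pattern: batch training reads a stored snapshot, while online training consumes events and must tolerate ordering, late data, and restarts.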

4. Operationalization & Lifecycle

Integrating ML into the broader data ecosystem.

  • Serving Latency: Determining whether results must be returned in real time (e.g., product recommendations) or can be produced in batch (e.g., nightly speech transcription).

  • Lifecycle Intersection: How the Data Engineering lifecycle connects with MLOps, including the support of specific technologies like Feature Stores and ML Observability tools.


Data Drift

Data Sampling

Data Labeling

Class imbalance

Data Augmentation

Feature Engineering
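For the class imbalance topic listed above, one common mitigation is to weight rare classes more heavily during training. The sketch below uses a simple inverse-frequency heuristic; the function name `balanced_class_weights` and the `ok`/`fraud` labels are hypothetical, and resampling is an equally valid alternative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label column with a 90/10 split.
labels = ["ok"] * 90 + ["fraud"] * 10
weights = balanced_class_weights(labels)
```

With these weights, each class contributes the same total weight to the loss, so the minority class is not drowned out by the majority.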

Common challenges:

  • Data Quality, Reproducibility, Data Drift, Scale, Multiple objectives
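Data drift, one of the challenges listed above, can be monitored by comparing the distribution of a feature at training time with its distribution in production. The sketch below computes a Population Stability Index (PSI) over fixed bins. The names `bin_shares` and `psi`, the cut points, the smoothing floor, and the sample data are all assumptions for illustration; alert thresholds should be tuned per feature.

```python
import bisect
import math


def bin_shares(values, cuts, floor=1e-6):
    """Share of values per bin; `cuts` are interior cut points, ends are open."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect.bisect_right(cuts, v)] += 1
    total = len(values)
    # Floor empty bins so the logarithm below stays defined.
    return [max(c / total, floor) for c in counts]


def psi(reference, current, cuts):
    """Population Stability Index between a reference and a current sample."""
    ref = bin_shares(reference, cuts)
    cur = bin_shares(current, cuts)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values: training snapshot vs. shifted production data.
reference = list(range(100))
shifted = [v + 30 for v in reference]
cuts = [25, 50, 75]

no_drift = psi(reference, reference, cuts)
drift = psi(reference, shifted, cuts)
```

Identical distributions give a PSI of zero, and the value grows as the current sample moves away from the reference. A check like this fits naturally as a scheduled data quality task in the pipeline that feeds the model.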

Data representation design patterns:

  • Simple ones: numerical and categorical

  • Hashed feature

  • Embeddings

  • Feature-cross

  • Multimodal
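Two of the patterns above, the hashed feature and the feature cross, can be sketched in a few lines. The helper names `hash_bucket` and `feature_cross`, the bucket count, and the example values are hypothetical. A cryptographic digest is used instead of Python's built-in `hash()` because the built-in is salted per process for strings and would not give reproducible buckets.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Hashed feature: map a high-cardinality string to a fixed bucket index."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def feature_cross(*values):
    """Feature cross: combine several categorical values into one joint category."""
    return "_x_".join(values)


# Hypothetical inputs: an airport code, and a day-of-week / hour pair.
airport_bucket = hash_bucket("JFK", num_buckets=10)

crossed = feature_cross("mon", "9")
crossed_bucket = hash_bucket(crossed, num_buckets=50)
```

Hashing trades accuracy for a bounded vocabulary: distinct values can collide in the same bucket, but new, unseen values never break the pipeline. Crossing features multiplies cardinality, which is why crosses are often hashed, as in the last line.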

