What you should know about Machine Learning as a DE
Books:
Designing Machine Learning Systems by Chip Huyen
Apache Spark for Machine Learning
What a Data Engineer should know about Machine Learning
Key ML Concepts for Data Engineers
While a Data Engineer does not need to be an expert in algorithmic theory, possessing a functional understanding of Machine Learning is crucial for collaboration, resource planning, and building scalable data products. The required knowledge generally falls into four categories:
1. Core ML Paradigms & Techniques
Understanding the "what" and "why" of model building helps engineers anticipate data requirements.
Learning Types: The distinctions between supervised, unsupervised, and semi-supervised learning.
Problem Types: The difference between classification (predicting categories) and regression (predicting values), as well as specific techniques for time-series analysis and forecasting.
Model Complexity: Knowing when to use "classical" techniques (like logistic regression or decision trees) versus Deep Learning. Deep Learning is sometimes chosen by default even when it is overkill for the problem; a data engineer should be able to spot when a simpler model would be cheaper to train and serve and would scale better.
AutoML vs. Handcrafted: Understanding the trade-offs between using automated ML tools versus custom-built models.
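To make the classification versus regression distinction concrete, here is a minimal sketch using a tiny k-nearest-neighbours predictor written from scratch. The dataset, the function names, and the choice of k-NN are all assumptions made for illustration; the point is only that the same features can feed a categorical target (majority vote) or a numeric target (average).

```python
from collections import Counter


def nearest_targets(query, rows, k):
    """Return the targets of the k rows whose feature is closest to the query."""
    ranked = sorted(rows, key=lambda row: abs(row[0] - query))
    return [target for _, target in ranked[:k]]


def predict_class(query, rows, k=3):
    """Classification: the most common label among the nearest neighbours."""
    return Counter(nearest_targets(query, rows, k)).most_common(1)[0][0]


def predict_value(query, rows, k=3):
    """Regression: the mean numeric target among the nearest neighbours."""
    targets = nearest_targets(query, rows, k)
    return sum(targets) / len(targets)


# Hypothetical data: (size in square metres, target)
labelled = [(40, "flat"), (55, "flat"), (70, "flat"),
            (120, "house"), (150, "house"), (180, "house")]
priced = [(40, 100.0), (55, 130.0), (70, 160.0),
          (120, 300.0), (150, 360.0), (180, 420.0)]

predicted_label = predict_class(60, labelled)  # a category
predicted_price = predict_value(60, priced)    # a number
```

From a pipeline point of view, the difference shows up in the target column: a classification target needs a stable, well-defined set of labels, while a regression target needs consistent units and sensible handling of outliers.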
2. Data Preparation & Feature Engineering
Data Engineers often own the pipeline up to the point of training.
Data Types: Strategies for wrangling both structured (tabular) and unstructured (images, text) data.
Featurization: All ML models require numerical input. Engineers must understand how data is converted into numbers, including encoding categorical data and creating embeddings.
Data Cascades: Being aware of how upstream data issues can compound and negatively impact downstream ML models.
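As an illustration of featurization, the sketch below one-hot encodes a categorical column using only the standard library. The column values and helper names are hypothetical; in practice a library encoder would usually be used, but the mechanics are the same: learn a vocabulary once, then reuse it for every record.

```python
def fit_vocabulary(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Return a one-hot vector; categories not in the vocabulary become all zeros."""
    vector = [0] * len(vocabulary)
    index = vocabulary.get(value)
    if index is not None:
        vector[index] = 1
    return vector


colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
vocabulary = fit_vocabulary(colors)
encoded = [one_hot(color, vocabulary) for color in colors]
```

The vocabulary is itself an artifact that has to be versioned and shipped with the model: if training and serving use different vocabularies, the same category ends up in a different position, which is one way the data cascades mentioned above get started.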
3. Infrastructure & Hardware
Decisions about where and how models run are often engineering responsibilities.
Compute Resources: Knowing when to use CPUs versus GPUs, and deciding whether training should happen locally, on a cluster, or at the edge.
Training Patterns: The difference between Batch Learning (training offline on a fixed set of historical data) and Online Learning (updating the model incrementally as new streaming data arrives).
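The two training patterns can be contrasted on a deliberately tiny model, y = w * x with a single weight. This is a sketch under assumed data and an assumed learning rate, not a recipe: the batch version sees all history at once, while the online version only ever sees one record at a time.

```python
def fit_batch(xs, ys):
    """Batch learning: closed-form least-squares slope computed over all history."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def update_online(w, x, y, learning_rate=0.05):
    """Online learning: one gradient step on squared error for a single record."""
    error = w * x - y
    return w - learning_rate * error * x


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # hypothetical data following y = 2x

# Batch: the whole dataset must be available before training starts.
w_batch = fit_batch(xs, ys)

# Online: a simulated stream of records, each used once and then discarded.
stream = list(zip(xs, ys)) * 50
w_online = 0.0
for x, y in stream:
    w_online = update_online(w_online, x, y)
```

The engineering consequences differ: batch training needs storage and scheduled compute over historical data, whereas online training needs a reliable stream and somewhere to persist the evolving model state between records.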
4. Operationalization & Lifecycle
Integrating ML into the broader data ecosystem.
Serving Latency: Determining whether results must be returned in real time (e.g., product recommendations) or can be produced in batch (e.g., nightly speech transcription).
Lifecycle Intersection: How the Data Engineering lifecycle connects with MLOps, including support for technologies such as feature stores and ML observability tools.
Data topics to be familiar with:
Data Drift
Data Sampling
Data Labeling
Class imbalance
Data Augmentation
Feature Engineering
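Of the topics listed above, data drift is probably the easiest to show in code. The sketch below computes a Population Stability Index (PSI), one common way to compare the distribution of a feature at training time with its distribution at serving time. The bin edges, sample values, and function names are assumptions for illustration, and PSI is only one of several possible drift measures.

```python
import bisect
import math


def bin_fractions(values, edges):
    """Share of values per bin; out-of-range values are clamped into the edge bins."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for value in values:
        index = bisect.bisect_right(edges, value) - 1
        index = min(max(index, 0), n_bins - 1)
        counts[index] += 1
    return [count / len(values) for count in counts]


def population_stability_index(reference, current, edges, floor=1e-6):
    """Sum of (current - reference) * ln(current / reference) over all bins."""
    total = 0.0
    for ref, cur in zip(bin_fractions(reference, edges), bin_fractions(current, edges)):
        # Floor empty bins so the logarithm stays defined.
        ref, cur = max(ref, floor), max(cur, floor)
        total += (cur - ref) * math.log(cur / ref)
    return total


training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]           # hypothetical training feature
serving_same = list(training)                          # no drift
serving_shifted = [6, 7, 8, 9, 10, 6, 7, 8, 9, 10]     # values moved upward
edges = [0, 5, 10]

psi_same = population_stability_index(training, serving_same, edges)
psi_shifted = population_stability_index(training, serving_shifted, edges)
```

A PSI of zero means the binned distributions match; larger values mean more drift. Any alerting threshold is a judgment call and should be tuned per feature rather than copied from a rule of thumb.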
Common challenges:
Data Quality, Reproducibility, Data Drift, Scale, Multiple objectives
Data representation design patterns:
Simple ones: numerical and categorical
Hashed feature
Embeddings
Feature cross
Multimodal
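Two of the patterns above, the hashed feature and the feature cross, are small enough to sketch with the standard library. The helper names, the separator string, and the bucket counts are assumptions for illustration. Note that Python's built-in `hash()` is randomized per process for strings, so a stable digest such as SHA-256 is used here to keep bucket assignments reproducible between training and serving.

```python
import hashlib


def hashed_bucket(value, num_buckets):
    """Deterministically map a string to a bucket index in [0, num_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def feature_cross(*values):
    """Combine several categorical values into one joint category."""
    return "_x_".join(values)


# Hashed feature: a high-cardinality category squeezed into a fixed number of buckets.
airport_bucket = hashed_bucket("JFK", num_buckets=10)

# Feature cross: two categories combined so a model can learn their interaction,
# then hashed because the crossed vocabulary can grow very large.
crossed = feature_cross("monday", "evening")
crossed_bucket = hashed_bucket(crossed, num_buckets=100)
```

The trade-off with hashing is collisions: different categories can share a bucket, so the bucket count balances memory against lost information.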