Data Architecture
Various data architecture concepts
Overview of data pipeline architectures and the tools involved: https://www.montecarlodata.com/blog-data-pipeline-architecture-explained/
Data Lake
What is a Data Lake? A data lake is a storage system that holds vast amounts of raw data in its original, unprocessed format, much as a natural lake holds water and the life within it without altering either. Unlike data warehouses, which store structured, processed data (like neatly organized warehouse goods), data lakes accommodate all data types: structured, semi-structured, and unstructured. The raw data still requires cleaning, joining, and aggregation by compute processing before it becomes useful for analysis.
Key Benefits of Using Data Lakes
Quick Data Access and Flexibility
Enables immediate data storage without upfront structuring (schema-on-read approach; a short sketch follows this section)
Allows faster access for power users and data scientists
Facilitates rapid data investigation before committing to full data warehouse integration
Cost and Performance Advantages
Lake compute options are significantly cheaper than data warehouse compute resources
Offers multiple compute options that can run transformations in parallel
Provides access to advanced code libraries for complex transformations impossible with SQL alone
Eliminates data warehouse maintenance windows, enabling 24/7 availability
Storage Benefits
Provides unlimited, inexpensive cloud storage across multiple cost tiers
Supports "just in case" data collection without performance concerns
Rarely requires data deletion except for regulatory compliance
Serves as an online archive for older data (keeping recent data in the warehouse, older data in the lake)
Centralization and Integration
Acts as the "single version of truth" when all data flows through it first
Supports streaming data from IoT devices
Integrates diverse file types: CSV, JSON, Word documents, media files
Enables SQL querying of lake data through most data warehouse technologies
Data Protection and Recovery
Maintains long-term history of raw data for ETL reruns without impacting source systems
Provides backup capabilities for data warehouse restoration
Protects against data loss from accidental deletion or corruption
Data lakes complement data warehouses by handling the storage and initial processing of diverse data types while offering superior cost efficiency, flexibility, and scalability for modern data analytics needs.
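To make the schema-on-read benefit above concrete, here is a minimal PySpark sketch. The bucket path and the event_type field are illustrative assumptions, not references to any real system:

```python
# Schema-on-read sketch (PySpark): raw files land in the lake untouched, and a
# schema is only inferred at the moment the data is read for analysis.
# The path and field names below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The JSON events were copied into the lake as-is; no table definition or
# upfront modeling was required before storing them.
raw_events = spark.read.json("s3://example-lake/raw/app-events/2024-01-15/")

# The structure is discovered at read time rather than enforced at write time.
raw_events.printSchema()

# Analysts can explore immediately and decide later what is worth promoting
# into the data warehouse.
raw_events.groupBy("event_type").count().show()
```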
Take Time to Design Your Data Lake Right
Don't rush into building a data lake—many companies skip proper planning and end up rebuilding later. Before you start, carefully map out your current and future data sources, considering their size, type, and velocity. Research data lake design patterns thoroughly to choose what fits your specific needs.
Unlike traditional databases that require upfront structure, data lakes accept raw data in any format—this flexibility is their superpower. However, without proper organization, your data lake can quickly become a chaotic "data swamp." The key is implementing smart governance practices from day one.
Start by organizing your data lake into logical layers or zones, each representing progressively higher data quality levels. This layered approach keeps your data both accessible and manageable as it matures.
Understanding Data Lake Layers: From Raw to Ready
Data lakes organize information through hierarchical layers, each improving data quality and usability. Picture it as a refinery process—starting with crude materials and ending with polished products.
Raw Layer (Bronze/Landing Zone) Your permanent archive of untouched data in its original format; think of it as the crude oil reservoir. This immutable foundation preserves everything exactly as received, serving as your historical record. It may also be called the staging zone.
Conformed Layer (Standardized) Here, mixed file formats (CSV, JSON, etc.) get unified into a single format, typically Parquet. It's like converting different currencies into one standard—same value, consistent format for easier processing.
Cleansed Layer (Silver/Transformed) The filtration stage where data gets cleaned, integrated, and enriched. Inconsistencies disappear, schemas align, and data types standardize. Your information emerges uniform and reliable.
Presentation Layer (Gold/Consumption) Business logic transforms clean data into user-ready formats. Think aggregated reports, dashboards, and analytics-friendly structures complete with metadata. This is where data becomes actionable intelligence.
Sandbox Layer (Exploration) An optional playground, usually for data scientists. It's a modifiable copy of raw data where experimentation happens without affecting production systems—your safe space for discovery and development.
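As a rough sketch of the raw-to-conformed step, the snippet below reads mixed CSV and JSON drops and rewrites them as Parquet. The paths, column names, and the choice of PySpark are assumptions for illustration, not a prescribed implementation:

```python
# Raw-to-conformed sketch (PySpark): unify mixed input formats into Parquet.
# All paths and column names are illustrative assumptions about the lake layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("conform-raw-orders").getOrCreate()

# Raw zone: files kept exactly as received from the source systems.
csv_orders = spark.read.option("header", "true").csv("s3://example-lake/raw/orders/csv/")
json_orders = spark.read.json("s3://example-lake/raw/orders/json/")

# Align both feeds on a shared set of columns and consistent types before combining.
shared_cols = ["order_id", "customer_id", "amount", "order_ts"]
csv_orders = csv_orders.select(*shared_cols).withColumn("amount", F.col("amount").cast("double"))
json_orders = json_orders.select(*shared_cols).withColumn("amount", F.col("amount").cast("double"))
orders = csv_orders.unionByName(json_orders)

# Conformed zone: every feed lands in one consistent columnar format (Parquet).
orders.write.mode("overwrite").parquet("s3://example-lake/conformed/orders/")
```

The specific engine matters less than the outcome: downstream layers only ever see one consistent, columnar format.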
Based on the above layer descriptions, let’s look at an example:
Food Delivery Company Data Lake Example
Consider a food delivery service like DoorDash that collects data from mobile apps, restaurant partners, delivery drivers, and customer reviews. The four layers would organize this data as follows:
Raw Layer
Order data from the mobile app including customer ID, restaurant, items ordered, quantities, prices, order time, and delivery address
Driver location data with GPS coordinates, timestamps, driver ID, and delivery status updates
Restaurant data including menu items, prices, availability, preparation times, and restaurant ratings
Customer reviews and ratings for restaurants, food quality, and delivery experience
Conformed Layer
All order data, GPS tracking, restaurant information, and reviews are converted to Parquet format for consistent processing across the platform
Cleansed Layer
Order data with corrected addresses, standardized restaurant and menu item names, fixed pricing errors, and unified time zones
Driver data with accurate GPS coordinates, removed duplicate location pings, and standardized delivery status codes
Restaurant data with consistent menu categorization, standardized cuisine types, and aligned restaurant IDs across all systems
Review data cleaned for spam removal, standardized rating scales, and matched to correct orders and restaurants
Presentation Layer
Delivery performance dashboards showing average delivery times, driver efficiency, and on-time delivery rates by area
Restaurant analytics displaying popular menu items, peak ordering hours, and restaurant performance rankings
Customer insights showing ordering patterns, favorite cuisines, customer lifetime value, and satisfaction trends
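A minimal sketch of the kind of cleansing described for the driver location feed, assuming hypothetical column names and status values, might look like this in PySpark:

```python
# Cleansed-layer sketch (PySpark) for the hypothetical driver location feed:
# drop duplicate GPS pings and standardize the delivery status codes.
# Paths, columns, and status values are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleanse-driver-pings").getOrCreate()

pings = spark.read.parquet("s3://example-lake/conformed/driver-pings/")

# Remove duplicate location pings: the same driver reporting the same position
# at the same timestamp more than once.
deduped = pings.dropDuplicates(["driver_id", "ping_ts", "lat", "lon"])

# Standardize free-form status values into a small, known set of codes.
cleansed = deduped.withColumn(
    "delivery_status",
    F.when(F.lower(F.col("status")).isin("picked up", "pickup", "picked_up"), "PICKED_UP")
     .when(F.lower(F.col("status")).isin("delivered", "dropped off"), "DELIVERED")
     .otherwise("IN_TRANSIT"),
)

cleansed.write.mode("overwrite").parquet("s3://example-lake/cleansed/driver-pings/")
```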
Essential Practice: Strategic Folder Organization
Smart folder structuring within each data lake layer serves multiple purposes, and folders can be organized in various ways depending on your needs:
Data Segregation Group data logically by origin, department, or type for easy discovery. Example: /sales-data/, /marketing-data/, /customer-service-data/
Access Control Structure folders by user permissions to enforce security boundaries. Example: /public/, /internal/, /confidential/, /restricted/
Performance Optimization Arrange data to accelerate common queries and processing tasks. Example: /daily-transactions/, /monthly-summaries/, /yearly-archives/
Data Lifecycle Management Separate folders based on data age and retention policies. Example: /active/, /archived/, /scheduled-for-deletion/
Metadata Management Keep data descriptions and schemas separate from actual data files. Example: /datasets/, /schemas/, /documentation/
Compliance Requirements Isolate regulated data types to meet industry-specific standards. Example: /pii-data/, /financial-records/, /medical-data/
Backup and Recovery Organize by backup frequency and recovery priorities. Example: /critical-daily-backup/, /standard-weekly-backup/, /archive-monthly/
Data Versioning Maintain different versions of the same dataset separately. Example: /customer-data-v1/, /customer-data-v2/, /customer-data-latest/
Data Partitioning Split data by key attributes for faster query performance. Example: /year=2024/month=01/, /year=2024/month=02/
Processing Pipeline Management Group data requiring similar transformation workflows. Example: /batch-processing/, /real-time-streams/, /manual-uploads/
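Several of these patterns, data partitioning in particular, can fall straight out of how the data is written. A minimal PySpark sketch, assuming a hypothetical orders dataset with an order_ts timestamp column:

```python
# Partitioned-write sketch (PySpark): derive year/month columns and let the
# writer create the /year=YYYY/month=MM/ folder structure automatically.
# Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

orders = spark.read.parquet("s3://example-lake/cleansed/orders/")

orders = (
    orders
    .withColumn("year", F.year(F.to_timestamp("order_ts")))
    .withColumn("month", F.month(F.to_timestamp("order_ts")))
)

# Produces folders such as .../orders/year=2024/month=1/ that query engines
# can use to skip irrelevant data (partition pruning).
orders.write.partitionBy("year", "month").mode("overwrite").parquet(
    "s3://example-lake/presentation/orders/"
)
```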
Quality Control Checkpoints Between Layers
Set up quality control checkpoints to verify data integrity as it flows between layers. Using financial data as an example, you might develop reconciliation queries that aggregate key metrics (transaction volume and dollar amounts) and cross-reference these totals between your source system and each processed layer. Discrepancies indicate transformation pipeline failures that require immediate investigation, correction, and reprocessing to restore data accuracy.
The Goal: Ensure no data gets lost, corrupted, or incorrectly transformed as it moves through your data lake layers.
The Method: Create "control totals" or checksums that should remain consistent across all layers, then compare them.
Practical Example:
Let's say you're processing daily sales data through your layers:
Raw Layer:
1,000 transaction records
Total sales: $50,000
Conformed Layer (after converting to Parquet):
Should still have: 1,000 records and $50,000 total
Cleansed Layer (after removing duplicates, fixing errors):
Might have: 995 records (5 duplicate rows removed)
The total should drop by exactly the value of the removed duplicates; any other change, or the removal of rows that were in fact legitimate sales, means the cleansing logic needs review
Presentation Layer (aggregated for reporting):
Daily summary: 995 transactions, with a total that exactly matches the cleansed layer
The Integrity Check Query:
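A minimal version of such a check, sketched in PySpark under the assumption that sales land as CSV in the raw zone and as Parquet afterwards, with a hypothetical amount column:

```python
# Control-total reconciliation sketch (PySpark): compare record counts and
# dollar totals for one day of sales data as it moves between layers.
# All paths and the "amount" column name are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layer-reconciliation").getOrCreate()

layers = {
    "raw": spark.read.option("header", "true").option("inferSchema", "true")
                     .csv("s3://example-lake/raw/sales/2024-01-15/"),
    "conformed": spark.read.parquet("s3://example-lake/conformed/sales/date=2024-01-15/"),
    "cleansed": spark.read.parquet("s3://example-lake/cleansed/sales/date=2024-01-15/"),
}

for name, df in layers.items():
    row = df.agg(
        F.count(F.lit(1)).alias("records"),
        F.sum("amount").alias("total"),
    ).collect()[0]
    print(f"{name:>9}: {row['records']} records, total {row['total']}")

# The presentation layer's daily summary should reproduce the cleansed totals.
# Any difference in counts or totals between layers must be explainable (for
# example, the duplicates removed in the cleansed layer); anything else points
# to a pipeline defect that needs investigation and reprocessing.
```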
What It Catches:
Pipeline failures: Records accidentally dropped during transformation
Logic errors: Incorrect joins that duplicate or lose data
Calculation mistakes: Wrong aggregations or currency conversions
Data corruption: Values changed unexpectedly during processing
Advantages of Using Object Storage for a Data Lake
Data Warehouse
Data Lakehouse
What exactly is a data lakehouse?