Data Architecture

Various data architecture concepts

Overview of data pipeline architectures and the tools involved: https://www.montecarlodata.com/blog-data-pipeline-architecture-explained/


Data Lake

What is a Data Lake? A data lake is a storage system that holds vast amounts of raw data in its natural, unprocessed format—like a natural lake that contains water and various life forms without altering their nature. Unlike data warehouses, which store structured, processed data (like organized warehouse goods), data lakes accommodate all data types: structured, semi-structured, and unstructured. Raw data in the lake typically needs cleaning, joining, and aggregating by a compute engine before it becomes useful for analysis.

Key Benefits of Using Data Lakes

Quick Data Access and Flexibility

  • Enables immediate data storage without upfront structuring (the schema-on-read approach; see the sketch after this list)

  • Allows faster access for power users and data scientists

  • Facilitates rapid data investigation before committing to full data warehouse integration
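
To make schema-on-read concrete, here is a minimal sketch, assuming PySpark and a hypothetical s3://my-lake bucket with made-up column names: raw JSON lands in the lake untouched, and its structure is inferred only when someone reads it.

```python
# Minimal schema-on-read sketch (PySpark assumed; paths and columns are hypothetical).
# No table or schema is declared before loading; structure is inferred at read time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Read raw, never-modelled JSON straight from the landing zone.
orders = spark.read.json("s3://my-lake/raw/orders/2024/01/")

orders.printSchema()                          # schema discovered at read time
orders.select("order_id", "total").show(5)    # columns assumed for illustration
```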

Cost and Performance Advantages

  • Significantly cheaper than data warehouse compute resources

  • Offers multiple compute options that can run transformations in parallel

  • Provides access to advanced code libraries for complex transformations impossible with SQL alone

  • Eliminates data warehouse maintenance windows, enabling 24/7 availability

Storage Benefits

  • Provides unlimited, inexpensive cloud storage across multiple cost tiers

  • Supports "just in case" data collection without performance concerns

  • Rarely requires data deletion except for regulatory compliance

  • Serves as an online archive for older data (keeping recent data in the warehouse, older data in the lake)

Centralization and Integration

  • Acts as the "single version of truth" when all data flows through it first

  • Supports streaming data from IoT devices

  • Integrates diverse file types: CSV, JSON, Word documents, media files

  • Enables SQL querying of lake data through most data warehouse technologies

Data Protection and Recovery

  • Maintains long-term history of raw data for ETL reruns without impacting source systems

  • Provides backup capabilities for data warehouse restoration

  • Protects against data loss from accidental deletion or corruption

Data lakes complement data warehouses by handling the storage and initial processing of diverse data types while offering superior cost efficiency, flexibility, and scalability for modern data analytics needs.

Take Time to Design Your Data Lake Right

Don't rush into building a data lake—many companies skip proper planning and end up rebuilding later. Before you start, carefully map out your current and future data sources, considering their size, type, and velocity. Research data lake design patterns thoroughly to choose what fits your specific needs.

Unlike traditional databases that require upfront structure, data lakes accept raw data in any format—this flexibility is their superpower. However, without proper organization, your data lake can quickly become a chaotic "data swamp." The key is implementing smart governance practices from day one.

Start by organizing your data lake into logical layers or zones, each representing progressively higher data quality levels. This layered approach keeps your data both accessible and manageable as it matures.


Understanding Data Lake Layers: From Raw to Ready

Data lakes organize information through hierarchical layers, each improving data quality and usability. Picture it as a refinery process—starting with crude materials and ending with polished products.

Raw Layer (Bronze/Landing Zone) Your permanent, immutable archive of untouched data in its original format—think of it as the crude oil reservoir. This foundation preserves everything exactly as received and serves as your historical record. May also be called the staging zone.

Conformed Layer (Standardized) Here, mixed file formats (CSV, JSON, etc.) get unified into a single format, typically Parquet. It's like converting different currencies into one standard—same value, consistent format for easier processing.

Cleansed Layer (Silver/Transformed) The filtration stage where data gets cleaned, integrated, and enriched. Inconsistencies disappear, schemas align, and data types standardize. Your information emerges uniform and reliable.

Presentation Layer (Gold/Consumption) Business logic transforms clean data into user-ready formats. Think aggregated reports, dashboards, and analytics-friendly structures complete with metadata. This is where data becomes actionable intelligence.

Sandbox Layer (Exploration) An optional playground, usually for data scientists. It's a modifiable copy of raw data where experimentation happens without affecting production systems—your safe space for discovery and development.
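
As a minimal sketch of how data might move through these layers (assuming PySpark; the bucket, paths, and column names below are hypothetical): raw JSON is landed untouched, rewritten as Parquet in the conformed layer, then deduplicated and standardized in the cleansed layer. The presentation layer would then aggregate the cleansed output into reporting tables, which the food delivery example below makes concrete.

```python
# Hypothetical walk through the raw -> conformed -> cleansed layers (PySpark).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-layers-demo").getOrCreate()

# Raw layer: source files landed untouched, original format preserved.
raw_orders = spark.read.json("s3://my-lake/raw/orders/2024/01/")

# Conformed layer: same data, unified into Parquet for consistent processing.
raw_orders.write.mode("overwrite").parquet("s3://my-lake/conformed/orders/2024/01/")

# Cleansed layer: deduplicate, standardize types, align schemas.
conformed = spark.read.parquet("s3://my-lake/conformed/orders/2024/01/")
cleansed = (
    conformed
    .dropDuplicates(["order_id"])                           # remove duplicate orders
    .withColumn("order_ts", F.to_timestamp("order_time"))   # standardize data types
    .withColumnRenamed("rest_id", "restaurant_id")          # align names across sources
)
cleansed.write.mode("overwrite").parquet("s3://my-lake/cleansed/orders/2024/01/")
```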

Based on the above layer descriptions, let’s look at an example:


Food Delivery Company Data Lake Example

Consider a food delivery service like DoorDash that collects data from mobile apps, restaurant partners, delivery drivers, and customer reviews. The four layers would organize this data as follows:

Raw Layer

  • Order data from the mobile app including customer ID, restaurant, items ordered, quantities, prices, order time, and delivery address

  • Driver location data with GPS coordinates, timestamps, driver ID, and delivery status updates

  • Restaurant data including menu items, prices, availability, preparation times, and restaurant ratings

  • Customer reviews and ratings for restaurants, food quality, and delivery experience

Conformed Layer

  • All order data, GPS tracking, restaurant information, and reviews are converted to Parquet format for consistent processing across the platform

Cleansed Layer

  • Order data with corrected addresses, standardized restaurant and menu item names, fixed pricing errors, and unified time zones

  • Driver data with accurate GPS coordinates, removed duplicate location pings, and standardized delivery status codes

  • Restaurant data with consistent menu categorization, standardized cuisine types, and aligned restaurant IDs across all systems

  • Review data cleaned for spam removal, standardized rating scales, and matched to correct orders and restaurants

Presentation Layer

  • Delivery performance dashboards showing average delivery times, driver efficiency, and on-time delivery rates by area (see the sketch after this list)

  • Restaurant analytics displaying popular menu items, peak ordering hours, and restaurant performance rankings

  • Customer insights showing ordering patterns, favorite cuisines, customer lifetime value, and satisfaction trends
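
A minimal sketch of how such a presentation-layer summary might be built, assuming PySpark; the paths and column names are hypothetical.

```python
# Hypothetical presentation-layer build: aggregate cleansed delivery data
# into a dashboard-ready summary (PySpark; paths and columns are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("presentation-layer-demo").getOrCreate()

deliveries = spark.read.parquet("s3://food-lake/cleansed/deliveries/")

delivery_kpis = (
    deliveries
    .groupBy("delivery_area")
    .agg(
        F.avg("delivery_minutes").alias("avg_delivery_minutes"),
        F.avg(F.when(F.col("delivered_on_time"), 1).otherwise(0)).alias("on_time_rate"),
        F.countDistinct("order_id").alias("orders"),
    )
)

# Persist in the presentation layer for BI tools and dashboards to consume.
delivery_kpis.write.mode("overwrite").parquet("s3://food-lake/presentation/delivery_kpis/")
```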


Essential Practice: Strategic Folder Organization

Smart folder structuring within each data lake layer serves multiple purposes; folders can be organized in various ways depending on your needs:

Data Segregation Group data logically by origin, department, or type for easy discovery. Example: /sales-data/, /marketing-data/, /customer-service-data/

Access Control Structure folders by user permissions to enforce security boundaries. Example: /public/, /internal/, /confidential/, /restricted/

Performance Optimization Arrange data to accelerate common queries and processing tasks. Example: /daily-transactions/, /monthly-summaries/, /yearly-archives/

Data Lifecycle Management Separate folders based on data age and retention policies. Example: /active/, /archived/, /scheduled-for-deletion/

Metadata Management Keep data descriptions and schemas separate from actual data files. Example: /datasets/, /schemas/, /documentation/

Compliance Requirements Isolate regulated data types to meet industry-specific standards. Example: /pii-data/, /financial-records/, /medical-data/

Backup and Recovery Organize by backup frequency and recovery priorities. Example: /critical-daily-backup/, /standard-weekly-backup/, /archive-monthly/

Data Versioning Maintain different versions of the same dataset separately. Example: /customer-data-v1/, /customer-data-v2/, /customer-data-latest/

Data Partitioning Split data by key attributes for faster query performance. Example: /year=2024/month=01/, /year=2024/month=02/ (see the sketch after this list)

Processing Pipeline Management Group data requiring similar transformation workflows. Example: /batch-processing/, /real-time-streams/, /manual-uploads/
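
The partition folders shown above are usually produced by the write itself rather than created by hand. A minimal sketch, assuming PySpark and hypothetical paths and columns:

```python
# Hive-style partitioning on write (PySpark; paths and columns are assumptions).
# partitionBy() produces the /year=2024/month=01/ folder layout, letting query
# engines prune partitions and scan only the folders a query actually needs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

transactions = spark.read.parquet("s3://my-lake/cleansed/transactions/")

(
    transactions
    .withColumn("year", F.year("transaction_ts"))
    .withColumn("month", F.month("transaction_ts"))
    .write.mode("overwrite")
    .partitionBy("year", "month")        # creates year=YYYY/month=MM subfolders
    .parquet("s3://my-lake/cleansed/transactions_partitioned/")
)
```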


Quality Control Checkpoints Between Layers

Set up quality control checkpoints to verify data integrity as it flows between layers. Using financial data as an example, you might develop reconciliation queries that aggregate key metrics (transaction volume and dollar amounts) and cross-reference these totals between your source system and each processed layer. Discrepancies indicate transformation pipeline failures that require immediate investigation, correction, and reprocessing to restore data accuracy.

The Goal: Ensure no data gets lost, corrupted, or incorrectly transformed as it moves through your data lake layers.

The Method: Create "control totals" or checksums that should remain consistent across all layers, then compare them.

Practical Example:

Let's say you're processing daily sales data through your layers:

Raw Layer:

  • 1,000 transaction records

  • Total sales: $50,000

Conformed Layer (after converting to Parquet):

  • Should still have: 1,000 records and $50,000 total

Cleansed Layer (after removing duplicates and fixing errors):

  • Might have: 995 records (5 duplicates removed)

  • Should still total: $50,000 (if the total drops, confirm the difference matches the removed duplicates and that no legitimate sales were deleted)

Presentation Layer (aggregated for reporting):

  • Daily summary: 995 transactions, $50,000 total sales

The Integrity Check Query:
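
Below is a minimal reconciliation sketch, assuming PySpark; every path, format, and column name is hypothetical. It computes the same control totals (record count and dollar sum) in each layer and flags any layer whose total cannot be explained.

```python
# Reconciliation across layers (PySpark; paths/columns are illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("layer-reconciliation").getOrCreate()

def control_totals(path: str, fmt: str, amount_col: str):
    """Return (record count, summed amount) for one layer of the lake."""
    df = spark.read.format(fmt).load(path)
    row = df.agg(F.count("*").alias("records"),
                 F.sum(amount_col).alias("total")).collect()[0]
    return row["records"], float(row["total"] or 0.0)

layers = {
    "raw":       ("s3://my-lake/raw/sales/2024-06-01/",       "json",    "amount"),
    "conformed": ("s3://my-lake/conformed/sales/2024-06-01/", "parquet", "amount"),
    "cleansed":  ("s3://my-lake/cleansed/sales/2024-06-01/",  "parquet", "amount"),
}

_, raw_total = control_totals(*layers["raw"])
for name, spec in layers.items():
    records, total = control_totals(*spec)
    status = "OK" if abs(total - raw_total) < 0.01 else "INVESTIGATE"
    print(f"{name:>9}: {records:>6} records  ${total:,.2f}  [{status}]")
```

Record counts may legitimately differ between layers (for example, after deduplication), so the check treats the dollar total as the primary control figure and any count change as something to document.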

What It Catches:

  • Pipeline failures: Records accidentally dropped during transformation

  • Logic errors: Incorrect joins that duplicate or lose data

  • Calculation mistakes: Wrong aggregations or currency conversions

  • Data corruption: Values changed unexpectedly during processing


Advantages of Using Object Storage for a Data Lake


Data Warehouse


Data Lakehouse

What exactly is a data lakehouse?


