Whom you'll work with (stakeholders)


The following information is from the book "Fundamentals of Data Engineering" by Joe Reis and Matt Housley.

The book explicitly structures the field around the "Data Engineering Lifecycle," which consists of five specific stages: Generation, Storage, Ingestion, Transformation, and Serving.


Upstream and downstream stakeholders

The easiest way to visualize your stakeholders is to imagine data as a flowing river.

As a Data Engineer, you are standing in the middle of this river.

Who are upstream stakeholders?

Aka the Producers.

These are the people "up the river" from you. They stand at the source. They create, own, or control the systems that generate the raw data you need.

  • Software Engineers / App Developers: They build the backend systems (websites, apps, microservices) that generate the data.

  • Product Managers: They decide what features get built, which indirectly decides what data gets generated.

  • Third-Party Vendors: Companies like Salesforce, Stripe, or Google Ads that generate data you have to fetch via API.

The Key Dynamic: You are usually dependent on them. If they change their systems (e.g., rename a database column), your pipelines break. You need to convince them to care about data quality.

Who are downstream stakeholders?

Aka the Consumers.

These are the people "down the river" from you. They are waiting for the water (data) you provide so they can drink, farm, or swim. They consume the data you have processed.

  • Data Analysts: They need clean, organized tables to build dashboards and reports.

  • Data Scientists: They need vast amounts of raw or structured data to train machine learning models.

  • Business Users / Executives: They don't look at the database; they look at the results (dashboards, metrics) to make decisions.

  • The "Machine" (Reverse ETL): Sometimes the "consumer" isn't a person, but another piece of software (like sending data back into a CRM for sales teams).

The Key Dynamic: They are dependent on you. If your pipeline fails or the data is wrong, they cannot do their jobs. You are their supplier.


Stakeholders across Data Engineering Lifecycle stages

Data Generation in Source Systems Stage

1. The Two Main Stakeholder Categories

The book distinguishes between those who build the systems and those who control the gates.

  • Systems Stakeholders: The builders. These are Software Engineers and App Developers who create and maintain the applications generating the data.

  • Data Stakeholders: The gatekeepers. These are IT departments, Data Governance teams, or Security teams who control access rights and permissions.

Note: Sometimes these are the same team (in small startups), but in large enterprises, you often have to convince the Systems team to give you the data, and then beg the Data Stakeholder to open the firewall port.

2. The Core Challenge: "At The Mercy" of Upstream

The book highlights a power imbalance. Data Engineers are downstream consumers, meaning you are often "at the mercy" of upstream engineering practices.

  • If they use poor database practices, your ingestion breaks.

  • If they drop a table without telling you, your pipeline fails.

The Solution: The Feedback Loop. You must create a bidirectional relationship. Source system owners often do not realize that their "small change" breaks your dashboard. You must make them aware of how their data is being consumed so they feel a sense of ownership over the downstream impact.

3. Formalizing the Relationship

To manage this fragility, the book suggests three levels of formal agreement:

A. The Data Contract

A written agreement between the Source Owner (Producer) and the Ingestion Team (Consumer).

  • What it covers: What data is extracted, the method (full vs. incremental), frequency, and contact points.

  • Best Practice: Store this as code (e.g., in a GitHub repo) so it is version-controlled and visible.
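Stored "as code," such a contract can be as simple as a version-controlled data structure that pipelines and checks can read. The sketch below is illustrative only; the dataset, owners, and fields are invented, not from the book:

```python
# A minimal, illustrative data contract, kept in a Git repo alongside pipeline code.
# All names (dataset, teams, columns) are hypothetical examples.
DATA_CONTRACT = {
    "dataset": "orders",
    "owner": "checkout-team@example.com",        # the Producer
    "consumer": "data-engineering@example.com",  # the Ingestion team
    "extraction": {
        "method": "incremental",  # full vs. incremental
        "frequency": "hourly",
    },
    "schema": {
        "order_id": "string",
        "amount_usd": "float",
        "created_at": "timestamp",
    },
}

def contract_requires(column: str) -> bool:
    """Return True if downstream consumers depend on this column."""
    return column in DATA_CONTRACT["schema"]
```

Because the contract is plain data under version control, any change to it shows up in a diff and a pull-request review, which is exactly the visibility the book recommends.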

B. Service Level Agreements (SLAs)

The "Promise." A formal statement of expectations regarding availability and quality.

  • Example: "The source database will be available 24/7 with minimal downtime."

C. Service Level Objectives (SLOs)

The "Measurement." The specific metric used to track if the SLA is being met.

  • Example: "99.9% uptime per month."
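Checking the "99.9% uptime per month" objective is simple arithmetic over downtime minutes. A minimal sketch, using the threshold from the example above:

```python
def uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Percentage of the month the source system was available."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def slo_met(downtime_minutes: float, objective: float = 99.9) -> bool:
    """Did the month's measured uptime satisfy the SLO?"""
    return uptime_percent(downtime_minutes) >= objective
```

Note how tight the budget is: 99.9% over a 30-day month allows only about 43 minutes of total downtime.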


Additions for Context

To give you a fuller picture, here are a few modern nuances regarding these concepts:

1. The "Data Mesh" Influence

The book touches on a concept central to Data Mesh: Domain-Oriented Ownership. In a Data Mesh architecture, the "Systems Stakeholders" (e.g., the team handling the Checkout Microservice) are expected to treat data as a product, not a byproduct. This shifts the dynamic from you "begging" for data to them being "responsible" for serving it to you cleanly.

2. Automated Data Contracts

The book mentions formatting contracts so they can be "integrated into the development process." In modern engineering (CI/CD), this means:

  • If a Software Engineer changes a schema in their code, the deployment pipeline runs a check against the Data Contract.

  • If the change violates the contract (e.g., deleting a column needed by Data Engineering), the deployment is blocked automatically. This moves the protection from a verbal agreement to a hard code constraint.
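A CI gate of that kind can be sketched in a few lines. The required columns and the proposed schema below are hypothetical; in a real pipeline the check would run on every deployment and a non-empty violation list would fail the build:

```python
# Columns downstream consumers depend on, as declared in the data contract (invented names).
REQUIRED_COLUMNS = {"order_id", "amount_usd", "created_at"}

def check_schema_against_contract(proposed_schema: set[str]) -> list[str]:
    """Return contract violations; an empty list means the deploy may proceed."""
    missing = REQUIRED_COLUMNS - proposed_schema
    return [f"column '{c}' required by the data contract was removed"
            for c in sorted(missing)]

# Example: the software engineer's new schema drops amount_usd.
violations = check_schema_against_contract({"order_id", "created_at", "discount"})
```

In CI, `violations` being non-empty is what turns a silent breaking change into a blocked deployment.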

3. The "Social" Engineering

While SLAs and Contracts are great, the book hints at a "verbal" approach if things are too formal. In practice, buying coffee for the upstream Lead Engineer is often as effective as a Jira ticket. Building social capital is a hidden requirement for effective Data Engineering.


Storage Stage

1. The Infrastructure "Landlords"

In the Storage stage, you are dealing with the people who hold the keys to the cloud or the server room.

  • Primary Stakeholders: DevOps Engineers, Cloud Architects, and Security/InfoSec teams.

  • The Dynamic: These teams own the "house" (the AWS/Azure/GCP account), and you are the "tenant" trying to store furniture (data) in it. Their priority is security and stability; yours is access and flexibility.

2. The Core Friction: Autonomy vs. Control

The book highlights a critical question: "Who has the authority to deploy?"

  • The Conflict: Data Engineers often need to spin up new storage buckets or resize data warehouse clusters quickly. However, Central IT/DevOps often wants to restrict permissions to prevent security leaks or runaway costs.

  • The Solution: You must work with these teams to define "streamlined processes." In modern terms, this usually means adopting Infrastructure as Code (IaC). Instead of asking for permission to click a button in the AWS console, you submit a Pull Request (using tools like Terraform) that DevOps reviews and approves.

3. The Maturity Factor

Your role changes drastically depending on the size of the company:

  • Early Maturity (Startups): You are likely the "Storage Admin." You set up the S3 buckets, you configure the Snowflake/Redshift instance, and you manage the security roles. You are the DevOps team for data.

  • Late Maturity (Enterprises): You manage a specific "slice" of the storage. There is likely a dedicated Data Platform Team that handles the underlying infrastructure (backups, encryption, networking). You focus on the content and logic within that storage (tables, schemas, partitions).

4. Responsibilities to Downstream Users

Even if you don't own the physical server, you are responsible for the storage experience for the users (Analysts, Scientists):

  • Performance: Can they query the storage fast? (Partitioning, Indexing).

  • Security: Is the data safe? (RBAC - Role Based Access Control).

  • Availability: Is the system up when they need it?


Additions for Context (Modern Reality)

To fully understand the Storage stage today, consider these two additional "Silent Stakeholders":

1. The "FinOps" Factor (Cost Management)

Storage in the cloud is cheap, but querying that storage is expensive.

  • The Addition: A major stakeholder in the Storage stage is often the Finance Department or a technical manager holding the budget.

  • The Reality: You aren't just managing "ample storage capacity" (space); you are managing "cost-effective storage." You are responsible for lifecycle policies (e.g., moving old data to "Cold Storage" like Amazon S3 Glacier) to stop the cloud bill from exploding.
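A lifecycle policy of that kind is just a small policy document. This sketch builds a rule in the shape S3 expects (the rule ID, prefix, and day thresholds are invented for illustration), which would then be applied with boto3:

```python
# Hypothetical lifecycle rule: move objects under events/ to Glacier after 90 days,
# delete them after a year. Prefix and thresholds are illustrative, not prescriptive.
lifecycle_rule = {
    "ID": "archive-old-events",
    "Filter": {"Prefix": "events/"},
    "Status": "Enabled",
    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 365},
}

# Applying it would look roughly like this (needs AWS credentials, so not run here):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",  # hypothetical bucket name
#     LifecycleConfiguration={"Rules": [lifecycle_rule]},
# )
```

A rule like this, checked into an IaC repo, is often the single cheapest lever you have against an exploding storage bill.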

2. Governance & Compliance (GDPR/CCPA)

The book mentions "securely available," but today this is a massive legal requirement.

  • The Addition: You will interact heavily with Legal/Compliance teams regarding where data is stored (Data Residency laws—e.g., "European user data cannot leave EU servers"). You must ensure your storage layer respects the "Right to be Forgotten" (if a user deletes their account, can you find and delete their data in your storage?).
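Operationally, a "Right to be Forgotten" request reduces to "find every copy keyed by this user and delete it." A toy sketch, using an in-memory stand-in for storage (table names and data are invented; real systems span warehouses, lakes, and backups):

```python
# Toy stand-in for the storage layer; each key is a "table" of row dicts.
storage = {
    "users":  [{"user_id": 1, "email": "a@example.com"},
               {"user_id": 2, "email": "b@example.com"}],
    "orders": [{"user_id": 1, "total": 40.0},
               {"user_id": 2, "total": 15.0}],
}

def forget_user(user_id: int) -> int:
    """Delete every record belonging to user_id; return how many rows were removed."""
    removed = 0
    for table, rows in storage.items():
        kept = [r for r in rows if r["user_id"] != user_id]
        removed += len(rows) - len(kept)
        storage[table] = kept
    return removed
```

The hard part in practice is not this loop but the inventory behind it: knowing every table, bucket, and backup where a given user's data can live.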

Summary Table: Storage Stakeholders

| Stakeholder | Their Concern | Your Interaction |
| --- | --- | --- |
| DevOps / Cloud Architects | Security, Governance, Infrastructure stability. | Negotiating permissions (IAM roles) and deployment processes (IaC). |
| Security / Compliance | "Is this S3 bucket public?" / "Is PII encrypted?" | Implementing encryption and access controls. |
| Finance / Leadership | "Why is the Snowflake bill $10k this month?" | Implementing data lifecycle policies and cost controls. |


Ingestion Stage

The Ingestion stage is defined as a "Boundary" function. You are the diplomat operating between two very different worlds: the technical producers and the business consumers.

1. Upstream Stakeholders: The "Data Exhaust" Problem

The primary friction here is between Software Engineers (who generate data) and Data Engineers (who collect it).

  • The Disconnect: Software Engineers often view data as "exhaust"—a byproduct of their application that they don't care about once it leaves their system. They are incentivized to ship features, not to maintain clean data for analytics.

  • The Opportunity: You must convert Software Engineers from passive producers into active stakeholders.

  • The Tactic (Involve the Product Manager): Software Engineers rarely have time to help you unless their Product Manager (PM) prioritizes it. You must convince the PM that "Data Quality" is a feature of their product. If the PM sees value, they will allocate engineering time to help you build better ingestion pipelines.

  • The Goal: Move from "fixing their mess" to "collaborating on design." Ideally, Software Engineers act as an extension of your team, building features (like event-driven architectures) that make ingestion native to the app.

2. Downstream Stakeholders: The "Shiny Object" Trap

The book warns against a common Data Engineering sin: building complex, resume-padding technology while ignoring simple business needs.

  • The Trap: Building a massive, real-time streaming architecture (Kafka, Flink) when the business doesn't actually need it yet.

  • The Reality: A Marketing Director manually downloading CSVs from Google Ads is a high-priority "customer."

  • The Strategy:

    • Focus on Revenue Centers: Marketing and Supply Chain control the money. If you automate their boring, manual ingestion tasks, you unlock budget and trust.

    • Low Tech, High Value: A simple batch script that saves the Marketing team 10 hours a week is more valuable to the business than a complex streaming system that no one uses.
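That "simple batch script" can genuinely be this small. The sketch below loads a marketing CSV into a SQLite table; the file layout and table name are invented, and a real script would pull from the ad platform's export and target the warehouse:

```python
import csv
import io
import sqlite3

# Stand-in for the CSV the Marketing team used to download by hand every week.
raw_csv = ("campaign,clicks,spend_usd\n"
           "spring_sale,1200,300.50\n"
           "brand_ads,450,99.99\n")

conn = sqlite3.connect(":memory:")  # a real script would connect to the warehouse
conn.execute("CREATE TABLE ad_spend (campaign TEXT, clicks INTEGER, spend_usd REAL)")

rows = [(r["campaign"], int(r["clicks"]), float(r["spend_usd"]))
        for r in csv.DictReader(io.StringIO(raw_csv))]
conn.executemany("INSERT INTO ad_spend VALUES (?, ?, ?)", rows)

total_spend = conn.execute("SELECT SUM(spend_usd) FROM ad_spend").fetchone()[0]
```

Scheduled daily, twenty lines like these replace hours of manual downloading, and that saved time is what buys you stakeholder trust.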

3. Executive Stakeholders: Breaking the Silos

Executives want a "data-driven culture," but they often don't know how to build the organizational structure to support it.

  • Your Role: You must guide executives to change the incentives. If Software Engineers are only rewarded for uptime and feature velocity, they will never care about data quality. Executives need to set top-down goals that reward cross-functional data collaboration.


Additions for Context (Modern Nuances)

To flesh out the Ingestion picture, here are two concepts that have gained massive traction since the foundational ideas of this book were established:

1. The "Buy vs. Build" Stakeholder (Vendors)

In modern Ingestion, you are often not writing code to ingest data from Salesforce or Google Ads. You are buying a tool like Fivetran or Airbyte.

  • The New Stakeholder: You are now managing Vendors and Procurement/Finance.

  • The Shift: The conversation shifts from "How do I write this API connector?" to "Is this SaaS connector costing us too much money?" You become an integrator of third-party tools rather than just a writer of Python scripts.

2. Reverse ETL (Closing the Loop)

The book mentions helping the Marketing Manager who is downloading reports. The modern solution to this is Reverse ETL (tools like Hightouch or Census).

  • The Concept: Instead of just ingesting data from apps into a Warehouse, you push clean data back into the apps (e.g., pushing "Customer Lifetime Value" scores back into Salesforce so sales reps can see them).

  • The Impact: This creates a much tighter bond with business stakeholders because you are directly enhancing the tools they live in every day.
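Mechanically, the "push CLV back into the CRM" step amounts to mapping warehouse rows onto the CRM's update format. A sketch, where the row data, field names, and custom-field suffix are all hypothetical:

```python
# Rows as they might come out of a warehouse customer-LTV model (invented data).
warehouse_rows = [
    {"crm_id": "0015x001", "lifetime_value": 1840.0},
    {"crm_id": "0015x002", "lifetime_value": 95.5},
]

def to_crm_updates(rows: list[dict]) -> list[dict]:
    """Shape warehouse rows into a (hypothetical) CRM bulk-update payload."""
    return [{"Id": r["crm_id"], "Lifetime_Value__c": r["lifetime_value"]}
            for r in rows]

updates = to_crm_updates(warehouse_rows)
# A Reverse ETL tool (or a small script) would then POST `updates`
# to the CRM's bulk API on a schedule.
```

Tools like Hightouch or Census exist precisely to handle the scheduling, batching, and retry logic around this mapping so you don't maintain it by hand.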

Summary Table: Ingestion Stakeholders

| Stakeholder | Typical Mindset | Your Goal |
| --- | --- | --- |
| Software Engineers | "Data is just exhaust from my app." | Make them view data as a "Product." |
| Product Managers | "I need to ship features, not fix logs." | Convince them that data quality is a feature. |
| Business Ops (Marketing/Sales) | "I'm drowning in manual spreadsheets." | Automate the boring stuff to win their trust. |
| Executives | "We need to be data-driven." | Guide them to align incentives across teams. |


Queries, Modeling, and Transformation stage

The Transformation and Modeling stage is identified as the most "full-contact" phase of the lifecycle. You are essentially the bridge between data creation and data consumption.

The Core Responsibility

At this stage, your technical goal is to design, build, and maintain systems that query and transform data. However, your value is defined by how well you manage relationships to ensure that data is:

  • Functional: The systems work reliably.

  • Trustworthy: The data is accurate and complete.

  • Performant: Queries are fast and cost-effective.


Upstream Stakeholders (The Source)

These are the people "generating" the data or defining what it means. You must understand where the data comes from and how the business defines it.

A. The Business Logic Owners (Product Managers, Domain Experts)

  • Role: They control business definitions (e.g., "What defines a 'churned' user?").

  • Interaction: You must collaborate with them to design data models.

  • The Challenge: Business logic changes frequently. You must be involved early in these changes to update your data models.

  • Expanded Insight: If you don't align here, you risk building technically perfect pipelines that deliver numbers the business cannot use.

B. The Source System Owners (Software/Backend Engineers)

  • Role: They control the apps and databases generating the raw data.

  • Interaction: You need to understand their schema and ensure your extraction queries don't crash their production databases.

  • The Critical Risk (Schema Drift): If they change a column name, delete a field, or change a data type without telling you, your pipeline breaks.

  • Expanded Insight: This friction point often leads to the implementation of "Data Contracts"—formal agreements between software engineers and data engineers on data structure and quality, preventing silent failures.
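Schema drift of this kind can be caught before the pipeline runs by diffing the live source schema against the last known-good snapshot. A minimal sketch; the column names and types are invented:

```python
def detect_schema_drift(expected: dict, actual: dict) -> list[str]:
    """Report columns that were dropped or changed type since the last snapshot."""
    problems = []
    for col, dtype in expected.items():
        if col not in actual:
            problems.append(f"dropped column: {col}")
        elif actual[col] != dtype:
            problems.append(f"type change on {col}: {dtype} -> {actual[col]}")
    return problems

# Example: the source team widened user_id and renamed signup_date.
drift = detect_schema_drift(
    expected={"user_id": "int", "signup_date": "timestamp"},
    actual={"user_id": "bigint", "signup_ts": "timestamp"},
)
```

Run as a pre-flight check, a non-empty `drift` list lets you alert the source team instead of discovering the change as a 3 a.m. pipeline failure.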


Downstream Stakeholders (The Consumers)

These are the people who rely on your transformations to do their jobs. If Upstream is about "Stability," Downstream is about "Utility."

A. Technical Consumers (Data Scientists, ML Engineers)

  • Needs: High-quality, complete data to integrate into workflows and products.

  • Goal: They need confidence that the data feeding their models is not garbage.

B. Analytical Consumers (Data Analysts)

  • Needs: Performant data models.

  • Goal: They need to write queries that return results quickly without incurring massive cloud costs.

C. "The Business" (Executives, Ops, Non-technical users)

  • Needs: Actionable insights.

  • Goal: They need to trust that the dashboard numbers are accurate so they can make decisions.

Summary Table: The Data Engineer's Position

| Stakeholder Group | Who are they? | Primary Concern | Your Key Responsibility |
| --- | --- | --- | --- |
| Upstream | Software Engineers, Product Managers | "Don't slow down my app." / "Reflect my business logic." | Monitor schema changes; minimize impact on source systems. |
| Downstream | Analysts, Data Scientists, Executives | "Give me clean data fast." / "Can I trust this?" | Ensure data quality, query performance, and cost-efficiency. |


Serving

The Serving stage is the "Last Mile" of data engineering. This is where your work is finally consumed, and your value is measured by how useful your data product is to others.

1. The Core Stakeholders (The Consumers)

The book lists the usual suspects who rely on your data:

  • Data Analysts: Need clean, queryable tables for reporting and dashboards.

  • Data Scientists: Need raw or feature-engineered data for experimentation.

  • ML Engineers / MLOps: Need reliable, low-latency data feeds for production models.

  • "The Business": Executives and managers who don't care about the tech, only the insights.

2. The Responsibility Boundary: "Supplier" vs. "Interpreter"

A critical distinction is made here: You deliver the ingredients, you don't cook the meal.

  • Your Role: Producing high-quality, reliable data products.

  • Their Role: Interpreting that data to make business decisions.

  • Why this matters: You should not be held responsible if an analyst misinterprets a correct number. However, you are responsible if the number was wrong in the database to begin with.

3. The Feedback Loop (Cyclic Nature)

Data serving is not a one-way street. The book emphasizes that the "outside world" influences the data.

  • Example: A Data Scientist uses your served data to build a churn prediction model. The predictions from that model (e.g., "User X is high risk") need to be re-ingested and served back to the marketing team so they can send an email.

  • The implication: Serving often triggers new Ingestion requirements.

4. Maturity & Organization: "Wearing Many Hats"

  • Early Stage (Startups): The Data Engineer is often also the Data Scientist and the Analyst. The book warns this is "not sustainable." It leads to burnout and technical debt.

  • Growth Stage: You must establish clear divisions of labor. You build the pipeline; they build the model.

  • Data Mesh (Advanced): In a decentralized "Data Mesh" organization, the responsibility for serving shifts. Instead of one central team serving everyone, each "Domain Team" (e.g., the Checkout Team) becomes responsible for serving their own data products to the rest of the company.


Additions for Context (Modern Serving Patterns)

To complete the picture, here are two modern concepts that heavily influence the Serving stage today:

1. The "Semantic Layer" (The Translator)

The book mentions serving data to analysts. In modern stacks, a Semantic Layer (like dbt metrics or Cube) sits between your warehouse and the consumers.

  • What it does: It defines metrics (like "Revenue" or "Active Users") in code once.

  • The Benefit: It ensures the Analyst, the CEO, and the Data Scientist all see the exact same number for "Revenue," preventing the chaos of different people calculating it differently.
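Defining a metric "in code once" can be as simple as a single shared function that every consumer calls instead of re-deriving the number. The revenue definition below is an invented example, not a standard:

```python
# One shared definition of "Revenue": completed orders only, refunds subtracted.
# Every dashboard, notebook, and report calls this instead of writing its own SQL.
def revenue(orders: list[dict]) -> float:
    return sum(o["amount"] - o.get("refund", 0.0)
               for o in orders
               if o["status"] == "completed")

orders = [
    {"amount": 100.0, "refund": 10.0, "status": "completed"},
    {"amount": 50.0, "status": "completed"},
    {"amount": 75.0, "status": "cancelled"},  # excluded by the shared definition
]
```

Semantic-layer tools (dbt metrics, Cube) do the same thing at warehouse scale: the definition lives in one reviewed place, so the Analyst, the CEO, and the Data Scientist all get the same answer.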

2. Reverse ETL as "Operational Serving"

Traditionally, "Serving" meant powering a Dashboard (BI). Today, "Operational Analytics" is huge.

  • The Shift: Serving data directly into operational tools (Salesforce, Zendesk, Slack).

  • Stakeholder Impact: Your stakeholders are no longer just people analyzing history; they are Salespeople and Support Agents acting in real-time based on your data.

Summary Table: Serving Stakeholders

| Stakeholder | What they need | Your Responsibility |
| --- | --- | --- |
| Data Analyst | Pre-aggregated, clean tables for BI tools (Tableau, Looker). | Performance optimization (clustering, partitioning) so dashboards load fast. |
| Data Scientist | Granular, raw data or feature stores. | Consistency and completeness; ensuring training data matches production data. |
| The Business | Trustworthy numbers to make decisions. | Data Quality (SLAs); ensuring the "Semantic" definition is consistent. |


Final Wrap-Up: The Stakeholder Journey

Across all the stages, a clear narrative emerges for the Data Engineer:

  1. Upstream (Generation/Ingestion): You are a Diplomat, convincing software engineers to care about data quality.

  2. Midstream (Storage/Transformation): You are an Architect, balancing cost, security, and performance with IT and DevOps.

  3. Downstream (Serving): You are a Product Owner, delivering reliable data products that empower analysts and the business to drive value.
