Framework for thinking like a data engineer


Source of the information


The framework

Stage 1: Identify Business Goals & Stakeholder Needs

Image source

The primary objective of this initial phase is to align the project with the organization's strategic vision before writing any code or selecting tools.

  • Clarify Business Goals: Start by understanding the high-level objectives of the company. Your project must serve these broader goals, as illustrated by the "Business Goals" icon showing growth and value.

  • Identify & Map Stakeholders: Determine who is impacted by the project, then connect each stakeholder's specific needs directly back to the high-level business goals.

  • Discovery & Assessment: Engage in direct conversations to understand the current state.

    • What systems are currently in place?

    • Are you building on top of them or replacing them?

    • What are the gaps between what they have and what they need?

  • Focus on Actionable Outcomes: As the thought bubble in the image asks, "What do you plan to do with the data?" Ask stakeholders what specific actions they will take with the data product. Working backward from those actions reveals the actual functional requirements, so that you build something useful rather than merely something technical.

Stage 2: Define System Requirements

Image source

The main goal of this stage is to translate the needs gathered in Stage 1 into concrete technical specifications.

  • Convert Needs to Functional Requirements: You must define exactly what the system will do. These are the functional requirements that directly address the stakeholder needs identified previously.

  • Establish Non-Functional Requirements: You must also define how the system will perform its tasks. This covers technical specifications such as speed, security, reliability, and scalability.

  • Documentation & Verification: The final step in this stage is to document these requirements clearly. Then circle back to the stakeholders to confirm that the system, if built as described, will actually solve their problem. This corresponds to the "Confirm with Stakeholders" visual in the image.
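
One way to keep the documented requirements traceable is to store each one next to the stakeholder need it serves and a sign-off flag. The following is only a minimal sketch: the `Requirement` class, its fields, and the example entries are hypothetical and do not come from the source material.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """One documented requirement, traced back to the stakeholder who needs it."""
    description: str
    kind: str                # "functional" (what it does) or "non-functional" (how well)
    stakeholder: str         # hypothetical stakeholder label
    confirmed: bool = False  # set to True once the stakeholder signs off


def unconfirmed(requirements):
    """Return the requirements still waiting for stakeholder confirmation."""
    return [r for r in requirements if not r.confirmed]


# Illustrative entries only; real requirements come out of Stage 1 conversations.
requirements = [
    Requirement("Deliver a daily sales summary table", "functional",
                "marketing", confirmed=True),
    Requirement("Refresh the summary within 1 hour of day end", "non-functional",
                "marketing"),
]

pending = unconfirmed(requirements)
for r in pending:
    print(f"Needs sign-off from {r.stakeholder}: {r.description}")
```

Anything left in `pending` is a requirement that has been written down but not yet confirmed with the person who asked for it.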

Stage 3: Choose Tools & Technologies

Image source

In this stage, the focus shifts from defining what the system does to deciding what tools will be used to build it.

  • Match Tools to Requirements: The process begins by identifying specific tools and technologies capable of meeting the non-functional requirements defined in the previous stage.

  • Conduct Cost-Benefit Analysis: Since multiple tools often solve the same problem, you must weigh the trade-offs. As shown in the graph within the image, you aim to maximize benefits while managing costs. Factors to consider include:

    • Licensing fees.

    • Cloud resource estimates.

    • Maintenance and engineering resources required.

  • Prototype & Test: Finally, you should build prototypes to test your chosen components, ensuring they align with the original stakeholder needs before full-scale implementation.
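
The cost-benefit comparison above can be made explicit with a simple weighted score. This is a rough sketch under stated assumptions: the tool names, the 0 to 10 scores, and the weights are all invented for illustration, and a real analysis would use actual licensing quotes, cloud estimates, and staffing figures.

```python
def net_score(benefit, costs, weights):
    """Benefit minus the weighted sum of the cost factors (all on a 0-10 scale)."""
    total_cost = sum(weights[name] * value for name, value in costs.items())
    return benefit - total_cost


# Hypothetical weights for the three cost factors listed above.
weights = {"licensing": 0.3, "cloud": 0.3, "maintenance": 0.4}

# Hypothetical candidates; the scores are placeholders, not measurements.
candidates = {
    "tool_a": {"benefit": 8, "costs": {"licensing": 6, "cloud": 4, "maintenance": 3}},
    "tool_b": {"benefit": 7, "costs": {"licensing": 0, "cloud": 5, "maintenance": 6}},
}

scores = {
    name: net_score(c["benefit"], c["costs"], weights)
    for name, c in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

A score like this only narrows the shortlist; prototyping is still what shows whether the chosen tool meets stakeholder needs.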

Stage 4: Build, Evaluate, Iterate & Evolve

Image source

The final stage focuses on moving from theory to reality, ensuring the system delivers actual value before full-scale commitment.

  • Prototype & Validate: Before investing heavily in building the complete system, you must deploy a prototype. The goal is to test if the design genuinely meets stakeholder needs and expectations.

  • Iterative Feedback Loop: Stakeholders must evaluate the prototype to confirm it delivers value. Spend as much time as necessary iterating on the prototype until they confirm it meets their needs, and only then move to production.

  • Production Deployment: Once the prototype is validated, the final step is to build and deploy the full production data system.

  • Continuous Evolution: Even after deployment, the work isn't finished. You must monitor the system and continue to evolve it based on changing stakeholder needs over time.
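
Monitoring after deployment can start with a check of observed metrics against the non-functional targets agreed in Stage 2. The sketch below assumes hypothetical metric names and thresholds; it is not tied to any particular monitoring tool.

```python
def find_violations(observed, targets):
    """Return the metrics whose observed value exceeds the agreed maximum."""
    return {
        name: value
        for name, value in observed.items()
        if name in targets and value > targets[name]
    }


# Hypothetical targets taken from the non-functional requirements.
targets = {"pipeline_latency_minutes": 60, "failed_runs_per_week": 1}

# Hypothetical measurements from the running system.
observed = {"pipeline_latency_minutes": 75, "failed_runs_per_week": 0}

violations = find_violations(observed, targets)
for name, value in violations.items():
    print(f"{name}: observed {value}, target at most {targets[name]}")
```

When stakeholder needs change, the `targets` change with them, which is what sends the work back to the earlier stages.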


Conclusion: A Cyclical Process

Image source

Although the framework is presented as a four-stage linear sequence, data engineering in practice is a continuous cycle.

  • Continuous Monitoring & Improvement: Once a system is live, the work shifts to monitoring performance and iterating to make improvements where possible.

  • Drivers of Evolution: Systems must evolve over time due to two main factors:

    • Changing Needs: Business goals and stakeholder requirements will inevitably shift.

    • New Technologies: Emerging tools may offer better performance, lower costs, or new advantages.

  • The Ongoing Cycle: As shown in the circular diagram, this is an infinite loop rather than a straight line. You are constantly communicating with stakeholders, re-evaluating requirements, and updating systems to align with new business realities.

