Fast is slow, slow is fast - Rethinking Our Data Engineering Process

7 November 2024

Robin van Schaik

When we set out to build a new data science team in April 2023, our goal was to revolutionize how we do forecasting in logistics at bol. What started as an opportunity to leverage our existing tech stack quickly revealed deeper challenges in our data processes and architecture. Rather than rushing into quick fixes, we chose to lay down a robust data engineering foundation guided by the philosophy of "fast is slow, slow is fast." In this article, we take you behind the scenes of our journey, sharing how we restructured our workflows in Apache Airflow, embraced the clarity of monotasking, and aligned our codebase with our evolving data needs. We discuss the obstacles we encountered, the strategic shifts we made, and the lessons learned as we continue to refine our approach. Whether you’re looking to optimize your own data practices or curious about the evolution of our team’s strategy, this story offers valuable insights into building a sustainable data engineering framework.

Rethinking Our Data Engineering Process

When you're starting a new team, you're often faced with a crucial dilemma: Do you stick with your existing way of working to get up and running quickly, promising yourself to do the refactoring later? Or do you take the time to rethink your approach from the ground up?

We encountered this dilemma in April 2023 when we launched a new data science team focused on forecasting within bol’s capacity steering product team. Within the team, we often joked that "there's nothing as permanent as a temporary solution," because rushed implementations often lead to long-term headaches. These quick fixes tend to become permanent because fixing them later requires significant effort, and there are always more immediate issues demanding attention. This time, we were determined to do things properly from the start.

Recognising the potential pitfalls of sticking to our established way of working, we decided to rethink our approach. Initially we saw an opportunity to leverage our existing technology stack. However, it quickly became clear that our processes, architecture, and overall approach needed an overhaul.

To navigate this transition effectively, we recognised the importance of laying a strong groundwork before diving into immediate solutions. Our focus was not just on quick wins but on ensuring that our data engineering practices could sustainably support our data science team's long-term goals and that we could ramp up effectively. This strategic approach allowed us to address underlying issues and create a more resilient and scalable infrastructure. As we shifted our attention from rapid implementation to building a solid foundation, we could better leverage our technology stack and optimize our processes for future success.

We followed the mantra "fast is slow, slow is fast": rushing into solutions without addressing underlying issues can hinder long-term progress. So we prioritised building a solid foundation for our data engineering practices, which in turn benefits our data science workflows.

Our Journey: Rethinking and Restructuring

In the following sections, I’m going to take you along our journey of rethinking and restructuring our data engineering processes. We’ll explore how we:

Leveraged Apache Airflow to orchestrate and manage our data workflows, simplifying complex processes and ensuring smooth operations.
Learned from past experiences to identify and eliminate inefficiencies and redundancies that were holding us back.
Adopted a layered approach to data engineering, which streamlined our operations and significantly enhanced our ability to iterate quickly.
Embraced monotasking in our workflows, improving clarity, maintainability, and reusability of our processes.
Aligned our code structure with our data structure, creating a more cohesive and efficient system that mirrored the way our data flows.

By the end of this journey, you’ll see how our commitment to doing things the right way from the start has set us up for long-term success. Whether you’re facing similar challenges or looking to refine your own data engineering practices, I hope our experiences and insights will provide valuable lessons and inspiration.

Go with the flow

We rely heavily on Apache Airflow for job orchestration. In Airflow, workflows are represented as Directed Acyclic Graphs (DAGs), with steps progressing in one direction. When explaining Airflow to non-technical stakeholders, we often use the analogy of cooking recipes.

Imagine a DAG as a recipe for baking bread. If we need the bread ready by 9:00 every morning and the process takes 2 hours, we start preparing at 7:00. Each task in the recipe, like gathering ingredients, mixing, and letting the dough proof, depends on the previous one.

Similarly, in our team, we gather data, combine sources, and run modelling pipelines to forecast on a schedule to support Logistics Operations.
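To make the recipe analogy concrete, here is a minimal sketch in plain Python. It is not Airflow code: it uses the standard library's graphlib to order the steps, and the task names and the date are illustrative assumptions rather than anything from our actual pipelines.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter

# Hypothetical recipe steps: each task maps to the set of tasks it depends on.
recipe = {
    "gather_ingredients": set(),
    "mix_dough": {"gather_ingredients"},
    "proof_dough": {"mix_dough"},
    "bake_bread": {"proof_dough"},
}

# A DAG has no cycles, so a valid execution order always exists.
execution_order = list(TopologicalSorter(recipe).static_order())

# Work backwards from the deadline to find the latest possible start time.
deadline = datetime(2024, 11, 7, 9, 0)  # bread must be ready at 9:00
total_duration = timedelta(hours=2)     # the whole recipe takes 2 hours
start_time = deadline - total_duration

print(execution_order)
print(start_time.strftime("%H:%M"))
```

An orchestrator such as Airflow does the same two things at a larger scale: it runs each task only after its upstream tasks have finished, and it triggers the whole graph on a schedule.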

Airflow helps us schedule, develop, and monitor increasingly complex batch data pipelines. It’s not just a scheduling tool but a critical component in building machine learning models. We use it to retrain models, run experiments, and backtest. This allows for rapid iteration and continuous improvement in forecasting.

The momentum created by Airflow is essential for adapting to the evolving needs of our logistics services department.

Learning from Experience

Despite initially finding ourselves in a favourable position with our existing tech stack, a closer examination of our past projects revealed significant challenges. Each project had developed its own set of data pipelines, often re-engineering the same data sources independently. This redundancy not only resulted in unnecessary repetition of effort but also led to inconsistencies in naming conventions and data handling practices.

The next two images provide an example of two projects with unnecessary repetition.

These issues complicated the maintenance of our data pipelines and made debugging more difficult due to a lack of clear data lineage. The complexity of tracing issues and understanding the inputs and processes behind our models became a significant obstacle.

The next image shows how extra data processing is hidden within the model’s code. This means you need to use a debugger to carefully examine the code to figure out what inputs the model is actually using.

The insights gained from these challenges underscored the need for a more systematic and disciplined approach. By addressing these underlying issues, we aimed to streamline our processes, enhance clarity, and build a more robust data infrastructure. These lessons were instrumental in guiding the development of our current strategies, setting the stage for a more cohesive and efficient data engineering practice.

Building the thing right

In response to these challenges, we embraced a set of core philosophies designed to create a more consistent, maintainable, and scalable data infrastructure:

Adopting a layered approach to data engineering
Monotasking in our DAGs
Shifting left and right in our workflows
Mirroring our code structure with our data structure

These principles now serve as the foundation for our data practices, and in the following sections, we will delve into how each of these strategies contributes to our overall goals.

A Layered Approach

When setting up our data engineering process, we were inspired by Joel Schwarzmann’s article on the importance of layered thinking in data engineering. To summarise, Schwarzmann highlights how one can structure the data engineering process for a data science project according to a number of layers:

Raw & Intermediate Layers:
- Raw Data: The initial entry point of the pipeline, containing immutable data from various sources, serving as our single source of truth.
- Intermediate Data: Includes minimal transformations like cleaning field names or combining files, preparing data for more intensive processing.
Primary Layer: Acts as a workspace for preparing and transforming data to fit specific problem domains, such as aggregating shipment details into daily counts.
Feature Store Layer: Dedicated to storing independent and target variables for machine learning models, enabling rapid experimentation across various domains and granularities.
Model Input Layer: Contains repositories for model-specific logic, simplifying machine learning workflows. Running experiments is now as straightforward as pulling additional features from the feature store, allowing us to focus on analysis and rapid iteration.
Model Output Layer: Stores results generated by models, including production runs and backtesting, facilitating structured analysis of model performance.
Reporting Layer: Prepares model outputs for evaluation, dashboarding, and cross-sectional analysis to support decision-making in logistics operations.
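As an illustration of the kind of transformation that lives in the primary layer, the sketch below aggregates shipment details into daily counts. The record fields and the function name are hypothetical, and plain Python stands in for what would normally be a query against the data warehouse.

```python
from collections import Counter
from datetime import datetime

# Hypothetical intermediate-layer rows: one record per shipment.
shipments = [
    {"shipment_id": "A1", "shipped_at": "2024-11-04T08:15:00"},
    {"shipment_id": "A2", "shipped_at": "2024-11-04T17:40:00"},
    {"shipment_id": "A3", "shipped_at": "2024-11-05T09:05:00"},
]


def daily_shipment_counts(rows):
    """Aggregate shipment-level rows into a count per calendar day."""
    counts = Counter(
        datetime.fromisoformat(row["shipped_at"]).date().isoformat()
        for row in rows
    )
    # Sort by day so the resulting table has a stable order.
    return dict(sorted(counts.items()))


daily_counts = daily_shipment_counts(shipments)
print(daily_counts)
```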

Rather than taking a layered approach per project, we implemented a layered approach across our projects by creating an Airflow DAG per layer.

Within each DAG we are preparing data from various domains that can serve one or multiple models. To illustrate the benefit of a layered approach, we will dive into the feature store layer.

An example of tasks within the feature store DAG

Our feature stores are created based on time-granularity and the semantic domain of the data.

These different features are picked & combined by the data scientist in the model pipeline to serve as training & prediction inputs.
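The idea of picking and combining features can be sketched as follows. This is a simplified, assumed model of a feature store: an in-memory dictionary keyed by time granularity and semantic domain, with hypothetical table contents and a hypothetical build_model_input helper. In practice these would be tables in the data warehouse.

```python
# Hypothetical feature tables, keyed by (time granularity, semantic domain).
feature_store = {
    ("daily", "calendar"): {
        "2024-11-04": {"day_of_week": 0, "is_weekend": False},
        "2024-11-05": {"day_of_week": 1, "is_weekend": False},
    },
    ("daily", "shipments"): {
        "2024-11-04": {"shipment_count": 2},
        "2024-11-05": {"shipment_count": 1},
    },
}


def build_model_input(store, granularity, domains):
    """Join the selected feature tables on their shared date key."""
    tables = [store[(granularity, domain)] for domain in domains]
    # Keep only the dates present in every selected table (an inner join).
    shared_dates = set(tables[0]).intersection(*tables[1:])
    model_input = {}
    for date in sorted(shared_dates):
        row = {}
        for table in tables:
            row.update(table[date])
        model_input[date] = row
    return model_input


model_input = build_model_input(feature_store, "daily", ["calendar", "shipments"])
print(model_input["2024-11-04"])
```

Adding a feature to an experiment then comes down to adding one more domain to the list, rather than re-engineering the feature inside the model code.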

We've seen a significant boost in our iteration speed due to the ability to easily discover and reuse features across projects. For example, date features, once engineered, can be applied to every forecasting model we have. Previously, developing and testing new models could take several hours due to the repetitive process of feature engineering. However, with our new feature store, this has transformed entirely.

By centralising and organising features into a reusable repository, we can now mix and match features effortlessly. This shift has slashed our iteration time from hours to mere minutes. Data scientists can rapidly experiment with different feature combinations, test new models, and refine forecasts without the tedious overhead of manual feature engineering. This efficiency not only accelerates our model development cycle but also enhances our ability to adapt quickly to evolving business needs, all while maintaining high-quality data outputs.

This acceleration in iteration speed highlights the importance of efficiency, not just in feature engineering, but across our entire data pipeline. As we streamline the discovery, reuse, and experimentation of features, it's crucial that the preceding workflows are equally optimised.

Embracing Monotasking in our DAGs

We strongly believe that monotasking can further optimize our Airflow DAGs by simplifying and streamlining our workflows. In monotasking - as opposed to multitasking - each step in the DAG focuses on doing only one thing at a time.

We observed that steps like data loading, cleaning, transformation, and metric calculation are often conflated within a single task in Airflow. This type of multitasking allows one to achieve a desired goal with minimal effort.

Multitasking

However, in our previous teams, debugging such a task often meant dealing with long and complex SQL queries or code. These first had to be dissected before the root cause of the issue could be discovered, costing valuable time during an incident.

The same process as monotasks

As such, we advocate monotasking in the team: in essence, we take the software engineering principle of separation of concerns and apply it to data engineering.

By ensuring that each task within a DAG has a single, focused purpose, we reduce cognitive load, making our pipelines easier to manage and understand. We do this by separating data loading, source combining, and re-aggregation into distinct tasks, and thus clarifying the role of each step. Furthermore, we store our intermediate steps in our data warehouse.

This approach not only enhances code maintainability by isolating specific logic, but also facilitates the reuse of processed data, allowing subsequent tasks to branch off to create data for other use cases.
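A minimal sketch of what this separation could look like is shown below. A dictionary stands in for the data warehouse, each function plays the role of one task, and all table names, fields, and values are illustrative assumptions.

```python
# A dict stands in for the data warehouse that stores intermediate results.
warehouse = {}


def load_orders():
    """Task: load raw order lines, nothing else."""
    warehouse["orders"] = [
        {"date": "2024-11-04", "sku": "X", "units": 3},
        {"date": "2024-11-04", "sku": "Y", "units": 1},
        {"date": "2024-11-05", "sku": "X", "units": 2},
    ]


def load_products():
    """Task: load the product reference data, nothing else."""
    warehouse["products"] = {"X": "books", "Y": "toys"}


def combine_sources():
    """Task: combine sources by enriching each order with its category."""
    products = warehouse["products"]
    warehouse["orders_enriched"] = [
        {**order, "category": products[order["sku"]]}
        for order in warehouse["orders"]
    ]


def aggregate_daily_units():
    """Task: re-aggregate the enriched orders to total units per day."""
    totals = {}
    for row in warehouse["orders_enriched"]:
        totals[row["date"]] = totals.get(row["date"], 0) + row["units"]
    warehouse["daily_units"] = totals


def aggregate_units_per_category():
    """A new use case that branches off the stored intermediate table."""
    totals = {}
    for row in warehouse["orders_enriched"]:
        totals[row["category"]] = totals.get(row["category"], 0) + row["units"]
    warehouse["units_per_category"] = totals


# Run the tasks in dependency order, as an orchestrator would.
for task in (load_orders, load_products, combine_sources,
             aggregate_daily_units, aggregate_units_per_category):
    task()

print(warehouse["daily_units"])
print(warehouse["units_per_category"])
```

Because the enriched table is persisted, the second aggregation reuses it without repeating the loading and combining steps, and a failure can be traced to the single task that produced the faulty table.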

Monotasks allow for quick development of new use cases.

While monotasking offers significant benefits in terms of clarity, maintainability, and reusability, it's important to recognize the potential trade-offs. Breaking down processes into highly granular tasks means that data must be transmitted or persisted after each step. This can introduce overhead in terms of storage and processing time, especially if the number of tasks becomes excessive.

Therefore, while monotasking enhances modularity, it’s crucial to strike a balance to avoid unnecessary complexity in the data pipeline.

With the advantages of monotasking in mind, we extend this focus on clarity and reliability to our overall data workflow.

Shifting left and right

Within bol we are in the process of shifting left and right, and monotasking can help with the process.

First, it helps immensely with testing efficiency: monotasking enables us to write targeted unit tests that ensure the reliability of our workflows before we go into production. Second, monotasking allows us to easily incorporate data quality tests, through write-audit-publish workflows, at every step of the process.

This structured workflow enhances the integrity of our pipelines and reduces the time needed to identify issues. By pinpointing the specific step where problems occur and determining the nature of these issues, we streamline the troubleshooting process.

For instance, one model that we run also uses an Excel file uploaded by stakeholders as input. We encountered a case where the data quality checks immediately identified negative values in a critical data field, preventing the model run from executing on faulty data. This proactive error detection saved us several hours that would have been spent troubleshooting and correcting issues caused by bad data. Moreover, it significantly reduced resource consumption, as fewer resources were wasted on training an invalid model.
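The write-audit-publish pattern behind such a check can be sketched as follows. This is a simplified illustration using an in-memory SQLite database from the Python standard library, not our production setup. The table names, the capacity column, and the write_audit_publish function are hypothetical.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Write rows to staging, audit them, and publish only if the audit passes."""
    # Write: land the new data in a staging table that consumers never read.
    conn.execute("DROP TABLE IF EXISTS capacity_staging")
    conn.execute("CREATE TABLE capacity_staging (day TEXT, capacity INTEGER)")
    conn.executemany("INSERT INTO capacity_staging VALUES (?, ?)", rows)

    # Audit: count the rows that violate the quality rule (no negative values).
    (bad_rows,) = conn.execute(
        "SELECT COUNT(*) FROM capacity_staging WHERE capacity < 0"
    ).fetchone()
    if bad_rows:
        raise ValueError(f"audit failed: {bad_rows} row(s) with negative capacity")

    # Publish: only audited data reaches the table that downstream tasks read.
    conn.execute("INSERT INTO capacity SELECT * FROM capacity_staging")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE capacity (day TEXT, capacity INTEGER)")

# A valid upload passes the audit and is published.
write_audit_publish(conn, [("2024-11-04", 120), ("2024-11-05", 95)])

# An upload with a negative value is stopped before it is published.
try:
    write_audit_publish(conn, [("2024-11-06", -40)])
    audit_blocked_bad_data = False
except ValueError:
    audit_blocked_bad_data = True

published_rows = conn.execute("SELECT COUNT(*) FROM capacity").fetchone()[0]
print(audit_blocked_bad_data, published_rows)
```

The failing upload never reaches the published table, so a downstream model run would either use the last good data or not start at all, depending on how the pipeline is configured.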

As such, building on the principles of monotasking allows us to deliver high-quality, trustworthy data outputs to consumers, whether they are end-users viewing dashboards or downstream systems relying on the data for further processing.

This focus on clarity and precision in our data processes naturally extends to our codebase.

Aligning Code and Data Structure

We are committed to the principle that our code structure should mirror our data structure.

This philosophy is implemented through a monorepo setup that aligns closely with our data engineering processes, reflecting the hierarchical arrangement of our data layers. Each directory in our monorepo corresponds to a specific data layer and is further subdivided into directories for individual tables. Each table directory holds all relevant components (queries, schemas, and Python scripts), which keeps our code and data in close alignment and minimises boilerplate code across different DAGs.

This structure facilitates navigation and clarity, making it easy to locate and understand the code related to any task, and thus to a specific table. As an additional benefit, Airflow’s DAGs provide a visual representation of our workflows. This visualisation allows us to easily identify how changes to one part of the pipeline will impact downstream processes. As a result, our ability to understand and manage dependencies has significantly improved.
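One way to picture the mapping from a table to its code is the small helper below. The directory names, file names, and the table_code_paths function are assumptions made for illustration, and the real repository layout may differ.

```python
from pathlib import PurePosixPath

# Hypothetical layer directories, listed in pipeline order.
LAYERS = (
    "raw", "intermediate", "primary", "feature_store",
    "model_input", "model_output", "reporting",
)


def table_code_paths(layer, table):
    """Return where the query, schema, and script for a table would live."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    table_dir = PurePosixPath(layer) / table
    return {
        "query": table_dir / "query.sql",
        "schema": table_dir / "schema.json",
        "script": table_dir / "transform.py",
    }


paths = table_code_paths("feature_store", "daily_calendar")
print(paths["query"])
```

With a convention like this, knowing the layer and the table name is enough to find every piece of code that produces the table.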

By aligning our code structure with our data structure, we have created a robust, transparent, and efficient data engineering infrastructure. This approach not only supports our current goals but also ensures the long-term success of our data science initiatives.

Closing Thoughts: The Road Ahead

In rethinking our data engineering process, we've made significant strides by focusing on the fundamentals—building a solid foundation, embracing monotasking to create clarity and reliability, and aligning our code with our data structure. These changes have not only improved the efficiency and reliability of our workflows but also set us up for scalable growth as our data science team continues to innovate.

However, this is just the beginning. As our logistics services evolve and the complexity of our data pipelines increases, continuous iteration will be crucial. The principles we've adopted are not static; they are dynamic guidelines that we will refine as we encounter new challenges and opportunities. Our commitment to a systematic and disciplined approach ensures that we remain adaptable, responsive, and always ready to meet the needs of our business.

It's important to note that these concepts are not an all-or-nothing proposition. Each principle—whether it's monotasking, a layered approach, or aligning code with data structure—can be applied individually and incrementally, depending on your specific needs and pain points. This flexibility allows you to tailor your approach to your unique situation, implementing changes where they will have the most impact first.

By adopting this mindset, you can gradually introduce improvements without overwhelming your existing workflows. This step-by-step evolution helps ensure that changes are sustainable and aligned with your overall goals.

In the end, the mantra "fast is slow, slow is fast" has proven to be more than just a guiding philosophy—it's a practical approach to data engineering that balances immediate needs with long-term success. By taking the time to build robust systems now, we are positioning ourselves to move faster and more efficiently in the future, driving innovation and delivering value across our organisation.
