BOL Techlab Feed

From Lawyer to Engineer

Tue, 27 Jan 2026 11:00:00 GMT

• The “why” behind a radical career change
• Legal vs. engineering—how contracts are like code
• GDPR, but make it fun—the 6 rules every engineer should know
• Practical advice—how to test if your career dream is serious, and when to “just be brave”

Reinventing Yourself

Tue, 6 Jan 2026 11:00:00 GMT

We explore the trap of turning hobbies into side hustles, the surprising power of boredom (backed by Harvard research), and the moment a late-night Slack message became a wake-up call. Vasileios shares his personal journey from growing up in a culture of "doctor or lawyer" to rediscovering joy in broken motorcycles and un-posted moments. This is a conversation about breaking free from the "grindification" of life and learning to listen to yourself again.

From Engineer to Engineering Manager

Tue, 21 Oct 2025 08:59:00 GMT

• The moment you have to stop coding and start coaching

• The surprising amount of administration (so many spreadsheets! 📊)

• How to build trust when your former peers are now your direct reports

• The #1 tip for new managers: walking meetings

• What no one tells you: office politics and the art of influence

Enterprise Architects at Bol

Fri, 3 Oct 2025 13:26:00 GMT

• What Enterprise Architecture actually is at bol.com

• From chaos to clarity: How ADRs (Architecture Decision Records) stop the "Who decided this?!" debates

• Measuring architecture fitness: Using data to score encapsulation and reduce team coupling

• The RAPID model: How we make big, enterprise-level decisions in weeks, not months

• AI as the 6th team member: How we use AI to automate documentation and create podcasts

Organising our internal conference Spaces Summit

Fri, 8 Aug 2025 08:49:00 GMT

The story behind Spaces Summit: From auditorium talks to a 1,100-person conference.
By colleagues, for colleagues: How a volunteer crew of 10 pulls off a pro-level conference.
80+ talk submissions: From Tesla hacks to personal burnout stories.
The secret sauce: How Spaces Summit builds community, trust, and even career-defining moments.

Meet the Social Media Team (Gerda) from Bol

Thu, 10 Jul 2025 10:00:00 GMT

Meet Gerda, the alter ego from bol
A day in the life of a community manager
Lots of fun stories and anecdotes
How to stay relevant in a fast-paced environment

Young Professionals at Bol

Wed, 18 Jun 2025 14:00:00 GMT

The YP Program structure: From onboarding to graduation
Work-life balance in your first tech job
How Bol.com supports relocation and housing
The application process decoded (no leetcode nightmares!)
Transitioning from student to professional mindset
Leadership opportunities within the program

Everything about Design Sprints

Tue, 27 May 2025 07:58:00 GMT

The sprint mindset: Why constraining time unlocks creativity
Day-by-day breakdown: How to scope, sketch, prototype, and test in 4 days
Real Bol.com case studies: From the "Branded Shelves" success to ideas that flopped (and why that’s good)
Pro tips: Facilitator tricks, stakeholder buy-in, and why candy is non-negotiable
Tools & resources: Where to start your own sprint

All about Product Analytics

Tue, 6 May 2025 10:00:00 GMT

🔍 What you’ll learn:
✅ Why "good enough" data beats perfect (but late) analysis
✅ How a fake door test saved months of wasted engineering time
✅ The 4 key skills every product analyst needs
✅ Why failed experiments are secretly wins

Queer Bol

Tue, 15 Apr 2025 10:00:00 GMT

• How Queer Bol started – from a simple question to a beautiful community

• Why workplace inclusion matters – for employees, companies, and society

• The impact of visibility – pins, posters, and safe spaces

• Personal stories – overcoming challenges and fostering allyship

• What’s next? – Pride collaborations, education, and expanding outreach

Chaos Engineering Day

Tue, 25 Mar 2025 11:00:00 GMT

What is Chaos Engineering and why it matters
How Bol.com runs Chaos Days twice a year
Real-world examples of chaos experiments and their outcomes
Tips for starting your own Chaos Engineering initiatives
The importance of preparation and community involvement

How Bol Adopted GraphQL

Tue, 11 Mar 2025 11:00:00 GMT

Why Bol.com embarked on the GraphQL journey
Overcoming challenges in a large-scale organization
Building a GraphQL stewardship program
Tools and strategies for successful adoption
Future plans and community contributions

Discovering Cloud FinOps

Tue, 28 Jan 2025 11:00:49 GMT

Get ready to explore the essentials of Cloud FinOps! This episode dives into why managing cloud expenses is a must and reveals tips for monitoring and scaling effectively. Packed with insights and humor, it’s a perfect blend of knowledge and fun. Tune in and join the conversation!

Design & Engineering

Tue, 14 Jan 2025 11:00:00 GMT

In this episode, we’re joined by our talented friends from the design community, Cansu and Wendy! The dynamic between engineers and designers is endlessly fascinating—and sometimes, a little chaotic. How can these two worlds collaborate more effectively? What are the biggest challenges (and funniest moments) that arise? We dig into the details, share some laughs, and uncover insights to help bridge the gap. Tune in and enjoy the ride!

100 Days as Director of Technology

Tue, 10 Dec 2024 13:12:34 GMT

Join us for an inspiring and in-depth conversation with Ronald van Rijn, Senior Director of Engineering at bol.com (http://bol.com/), as he reflects on his first 100 days in the role. Ronald shares insights from his journey—starting as a passionate programmer with a Commodore 64 to becoming a leader of over 900 engineers. Discover how his experience as a former CTO, volleyball athlete, and Brazilian Jiu-Jitsu enthusiast shapes his leadership style and problem-solving approach. Learn how he tackles challenges like scaling technology teams, fostering impactful innovation, and creating an environment where engineers thrive. Tune in to hear about his plans for bol.com (http://bol.com/)'s future and valuable lessons in navigating discomfort and change.

Canary Testing | Road to Pro

Tue, 26 Nov 2024 11:00:04 GMT

In this episode, we dive into the world of canary releases—a modern deployment strategy that minimizes risk while delivering new features to users. Joined by Sonja Nesic and Diego Lira, we explore how tools like Argo Rollouts empower teams to test in production safely, offering seamless integration with Kubernetes environments. Learn how automatic rollbacks can safeguard your deployments and why progressive delivery is becoming a must-have for agile teams. Whether you're new to canary releases or looking to refine your strategy, this conversation is packed with insights and best practices to elevate your DevOps game.

Accessibility in Tech

Tue, 12 Nov 2024 15:29:00 GMT

Fast is slow, slow is fast - rethinking Our Data Engineering Process

Thu, 7 Nov 2024 12:08:17 GMT

Rethinking Our Data Engineering Process

When you're starting a new team, you're often faced with a crucial dilemma: Do you stick with your existing way of working to get up and running quickly, promising yourself to do the refactoring later? Or do you take the time to rethink your approach from the ground up?

We encountered this dilemma in April 2023 when we launched a new data science team focused on forecasting within bol’s capacity steering product team. Within the team, we often joked that "there's nothing as permanent as a temporary solution," because rushed implementations often lead to long-term headaches. These quick fixes tend to become permanent because fixing them later requires significant effort, and there are always more immediate issues demanding attention. This time, we were determined to do things properly from the start.

Recognising the potential pitfalls of sticking to our established way of working, we decided to rethink our approach. Initially we saw an opportunity to leverage our existing technology stack. However, it quickly became clear that our processes, architecture, and overall approach needed an overhaul.

To navigate this transition effectively, we recognised the importance of laying a strong groundwork before diving into immediate solutions. Our focus was not just on quick wins but on ensuring that our data engineering practices could sustainably support our data science team's long-term goals and that we could ramp up effectively. This strategic approach allowed us to address underlying issues and create a more resilient and scalable infrastructure. As we shifted our attention from rapid implementation to building a solid foundation, we could better leverage our technology stack and optimize our processes for future success.

We followed the mantra "Fast is slow, slow is fast": rushing into solutions without addressing underlying issues can hinder long-term progress. So, we prioritised building a solid foundation for our data engineering practices, which in turn benefits our data science workflows.

Our Journey: Rethinking and Restructuring

In the following sections, I’m going to take you along our journey of rethinking and restructuring our data engineering processes. We’ll explore how we:

Leveraged Apache Airflow to orchestrate and manage our data workflows, simplifying complex processes and ensuring smooth operations.
Learned from past experiences to identify and eliminate inefficiencies and redundancies that were holding us back.
Adopted a layered approach to data engineering, which streamlined our operations and significantly enhanced our ability to iterate quickly.
Embraced monotasking in our workflows, improving clarity, maintainability, and reusability of our processes.
Aligned our code structure with our data structure, creating a more cohesive and efficient system that mirrored the way our data flows.

By the end of this journey, you’ll see how our commitment to doing things the right way from the start has set us up for long-term success. Whether you’re facing similar challenges or looking to refine your own data engineering practices, I hope our experiences and insights will provide valuable lessons and inspiration.

Go with the flow

We rely heavily on Apache Airflow for job orchestration. In Airflow, workflows are represented as Directed Acyclic Graphs (DAGs), with steps progressing in one direction. When explaining Airflow to non-technical stakeholders, we often use the analogy of cooking recipes.

Imagine a DAG as a recipe for baking bread. If we need the bread ready by 9:00 every morning and the process takes 2 hours, we start preparing at 7:00. Each task in the recipe, like gathering ingredients, mixing, and letting the dough proof, depends on the previous one.

Similarly, in our team, we gather data, combine sources, and run modelling pipelines to forecast on a schedule to support Logistics Operations.
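The recipe analogy can be sketched in a few lines of plain Python. This is not Airflow code: it uses the standard library's graphlib to show how a DAG fixes the order of tasks, and simple date arithmetic to show how a deadline determines the start time. The task names and the date are made up for illustration.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter

# Hypothetical recipe: each task maps to the set of tasks it depends on.
recipe = {
    "gather_ingredients": set(),
    "mix_dough": {"gather_ingredients"},
    "proof_dough": {"mix_dough"},
    "bake_bread": {"proof_dough"},
}

# A topological sort yields an order in which every task runs after its dependencies.
run_order = list(TopologicalSorter(recipe).static_order())

# Work backwards from the deadline to find the latest possible start time.
deadline = datetime(2024, 1, 1, 9, 0)
total_duration = timedelta(hours=2)
start_time = deadline - total_duration

print(run_order)
print(start_time.strftime("%H:%M"))
```

In Airflow itself, the dependencies and the schedule are declared on the DAG, and the scheduler takes care of the ordering.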

Airflow helps us schedule, develop, and monitor increasingly complex batch data pipelines. It’s not just a scheduling tool but a critical component in building machine learning models. We use it to retrain models, run experiments, and backtest. This allows for rapid iteration and continuous improvement in forecasting.

The momentum created by Airflow is essential for adapting to the evolving needs of our logistics services department.

Learning from Experience

Despite initially finding ourselves in a favourable position with our existing tech stack, a closer examination of our past projects revealed significant challenges. Each project had developed its own set of data pipelines, often re-engineering the same data sources independently. This redundancy not only resulted in unnecessary repetition of effort but also led to inconsistencies in naming conventions and data handling practices.

The next two images provide an example of two projects with unnecessary repetition.

These issues complicated the maintenance of our data pipelines and made debugging more difficult due to a lack of clear data lineage. The complexity of tracing issues and understanding the inputs and processes behind our models became a significant obstacle.

The next image shows how extra data processing is hidden within the model’s code. This means you need to use a debugger to carefully examine the code to figure out what inputs the model is actually using.

The insights gained from these challenges underscored the need for a more systematic and disciplined approach. By addressing these underlying issues, we aimed to streamline our processes, enhance clarity, and build a more robust data infrastructure. These lessons were instrumental in guiding the development of our current strategies, setting the stage for a more cohesive and efficient data engineering practice.

Building the thing right

In response to these challenges, we embraced a set of core philosophies designed to create a more consistent, maintainable, and scalable data infrastructure:

Adopting a layered approach to data engineering
Monotasking in our DAGs
Shifting left and right in our workflows
Mirroring our code structure with our data structure

These principles now serve as the foundation for our data practices, and in the following sections, we will delve into how each of these strategies contributes to our overall goals.

A Layered Approach

When setting up our data engineering process, we were inspired by Joel Schwartzman’s article on the importance of layered thinking in data engineering. To summarise, Schwartzman highlights how one can structure the data engineering process for a data science project according to a number of layers:

Raw & Intermediate Layers:
- Raw Data: The initial entry point of the pipeline, containing immutable data from various sources, serving as our single source of truth.
- Intermediate Data: Includes minimal transformations like cleaning field names or combining files, preparing data for more intensive processing.
Primary Layer: Acts as a workspace for preparing and transforming data to fit specific problem domains, such as aggregating shipment details into daily counts.
Feature Store Layer: Dedicated to storing independent and target variables for machine learning models, enabling rapid experimentation across various domains and granularities.
Model Input Layer: Contains repositories for model-specific logic, simplifying machine learning workflows. Running experiments is now as straightforward as pulling additional features from the feature store, allowing us to focus on analysis and rapid iteration.
Model Output Layer: Stores results generated by models, including production runs and backtesting, facilitating structured analysis of model performance.
Reporting Layer: Prepares model outputs for evaluation, dashboarding, and cross-sectional analysis to support decision-making in logistics operations.

Rather than taking a layered approach per project, we implemented a layered approach across our projects by creating an Airflow DAG per layer.

Within each DAG we are preparing data from various domains that can serve one or multiple models. To illustrate the benefit of a layered approach, we will dive into the feature store layer.

An example of tasks within the feature store DAG

Our feature stores are created based on time-granularity and the semantic domain of the data.

These different features are picked & combined by the data scientist in the model pipeline to serve as training & prediction inputs.
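As a rough sketch of what picking and combining features could look like, assume a feature store that holds one table per time granularity and semantic domain, keyed by date. The in-memory dictionaries, feature names, and the build_model_input helper below are hypothetical stand-ins for warehouse tables, not our actual implementation.

```python
# Hypothetical feature store: one table per (time granularity, semantic domain).
feature_store = {
    ("daily", "calendar"): {
        "2024-01-01": {"is_holiday": 1, "weekday": 0},
        "2024-01-02": {"is_holiday": 0, "weekday": 1},
    },
    ("daily", "shipments"): {
        "2024-01-01": {"shipment_count": 120},
        "2024-01-02": {"shipment_count": 340},
    },
}


def build_model_input(granularity, domains):
    """Join the selected feature tables on their shared date key."""
    tables = [feature_store[(granularity, domain)] for domain in domains]
    shared_dates = sorted(set.intersection(*(set(table) for table in tables)))
    rows = []
    for date in shared_dates:
        row = {"date": date}
        for table in tables:
            row.update(table[date])
        rows.append(row)
    return rows


# A data scientist picks the domains a model needs; no feature engineering is repeated.
model_input = build_model_input("daily", ["calendar", "shipments"])
print(model_input)
```

Adding a feature to an experiment then comes down to adding one more domain to the list.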

We've seen a significant boost in our iteration speed due to the ability to easily discover and reuse features across projects. For example, date features, once engineered, can be applied to every forecasting model we have. Previously, developing and testing new models could take several hours due to the repetitive process of feature engineering. However, with our new feature store, this has transformed entirely.

By centralising and organising features into a reusable repository, we can now mix and match features effortlessly. This shift has slashed our iteration time from hours to mere minutes. Data scientists can rapidly experiment with different feature combinations, test new models, and refine forecasts without the tedious overhead of manual feature engineering. This efficiency not only accelerates our model development cycle but also enhances our ability to adapt quickly to evolving business needs, all while maintaining high-quality data outputs.

This acceleration in iteration speed highlights the importance of efficiency, not just in feature engineering, but across our entire data pipeline. As we streamline the discovery, reuse, and experimentation of features, it's crucial that the preceding workflows are equally optimised.

Embracing Monotasking in our DAGs

We strongly believe that monotasking can further optimize our Airflow DAGs by simplifying and streamlining our workflows. In monotasking - as opposed to multitasking - each step in the DAG focuses on doing only one thing at a time.

We observed that steps like data loading, cleaning, transformation, and metric calculation are often conflated within a single task in Airflow. This type of multitasking allows one to achieve a desired goal with minimal effort.

Multitasking

However, in our previous teams, debugging such a task often meant dealing with long and complex SQL queries or code. These had to be dissected before the root cause of an issue could be found, costing valuable time during an incident.

The same process as monotasks

As such, we advocate for monotasking in the team, essentially applying the software engineering principle of separation of concerns to data engineering.

By ensuring that each task within a DAG has a single, focused purpose, we reduce cognitive load, making our pipelines easier to manage and understand. We do this by separating data loading, source combining, and re-aggregation into distinct tasks, and thus clarifying the role of each step. Furthermore, we store our intermediate steps in our data warehouse.

This approach not only enhances code maintainability by isolating specific logic, but also facilitates the reuse of processed data, allowing subsequent tasks to branch off to create data for other use cases.
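A minimal sketch of the idea, with plain Python functions standing in for Airflow tasks. The table contents and function names are invented for illustration. Each function does one thing, and the combined result is reused by two different aggregations.

```python
from collections import defaultdict


def load_shipments():
    """Task 1: load only. Hypothetical rows standing in for a source table."""
    return [
        {"shipment_id": 1, "day": "2024-01-01", "warehouse_id": "A"},
        {"shipment_id": 2, "day": "2024-01-01", "warehouse_id": "B"},
        {"shipment_id": 3, "day": "2024-01-02", "warehouse_id": "A"},
    ]


def load_warehouses():
    """Task 1b: load a second source, again without transforming it."""
    return {"A": "north", "B": "south"}


def combine_sources(shipments, warehouses):
    """Task 2: combine only. Enrich each shipment with its warehouse region."""
    return [{**s, "region": warehouses[s["warehouse_id"]]} for s in shipments]


def aggregate_daily(rows, key):
    """Task 3: re-aggregate only. Count rows per (day, key)."""
    counts = defaultdict(int)
    for row in rows:
        counts[(row["day"], row[key])] += 1
    return dict(counts)


combined = combine_sources(load_shipments(), load_warehouses())
per_region = aggregate_daily(combined, "region")
# A new use case branches off the same intermediate result instead of rebuilding it.
per_warehouse = aggregate_daily(combined, "warehouse_id")
print(per_region)
print(per_warehouse)
```

Because each function has a single purpose, a failure points directly at the step that caused it, and each step can be unit tested in isolation.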

Monotasks allow for quick development of new use cases.

While monotasking offers significant benefits in terms of clarity, maintainability, and reusability, it's important to recognize the potential trade-offs. Breaking down processes into highly granular tasks means that data must be transmitted or persisted after each step. This can introduce overhead in terms of storage and processing time, especially if the number of tasks becomes excessive.

Therefore, while monotasking enhances modularity, it’s crucial to strike a balance to avoid unnecessary complexity in the data pipeline.

With the advantages of monotasking in mind, we extend this focus on clarity and reliability to our overall data workflow.

Shifting left and right

Within bol we are in the process of shifting left and right, and monotasking can help with the process.

First, monotasking helps with shifting left: because each task does only one thing, we can write targeted unit tests that verify the reliability of our workflows before they go into production. Second, it helps with shifting right: we can easily incorporate data quality tests, through write-audit-publish workflows, at every step of the process.

This structured workflow enhances the integrity of our pipelines and reduces the time needed to identify issues. By pinpointing the specific step where problems occur and determining the nature of these issues, we streamline the troubleshooting process.

For instance, one model that we run also uses an Excel file uploaded by stakeholders as input. We encountered a case where the data quality checks immediately identified negative values in a critical data field, preventing the model run from executing on faulty data. This proactive error detection saved us several hours that would have been spent troubleshooting and correcting issues caused by bad data. Moreover, it significantly reduced resource consumption, as fewer resources were wasted on training an invalid model.
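The negative-value case can be illustrated with a small write-audit-publish sketch. The field name expected_volume and the in-memory published list are assumptions made for this example; in a real pipeline the staging and publishing steps would target tables in the data warehouse.

```python
def audit_non_negative(rows, field):
    """Return the rows that violate the rule `field >= 0`."""
    return [row for row in rows if row[field] < 0]


def write_audit_publish(rows, field, published):
    """Stage the rows, audit them, and publish only if the audit passes."""
    staged = list(rows)  # write: stage the data without exposing it to consumers
    violations = audit_non_negative(staged, field)  # audit: run the quality check
    if violations:
        raise ValueError(f"{len(violations)} row(s) have a negative {field}")
    published.extend(staged)  # publish: only reached when the audit passed


published = []
write_audit_publish(
    [{"day": "2024-01-01", "expected_volume": 100}], "expected_volume", published
)

try:
    write_audit_publish(
        [{"day": "2024-01-02", "expected_volume": -5}], "expected_volume", published
    )
    rejected = None
except ValueError as error:
    rejected = str(error)

print(published)
print(rejected)
```

The faulty upload never reaches the published data, so a downstream model run has nothing invalid to train on.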

As such, building on the principles of monotasking allows us to deliver high-quality, trustworthy data outputs to consumers, whether they are end-users viewing dashboards or downstream systems relying on the data for further processing.

This focus on clarity and precision in our data processes naturally extends to our codebase.

Aligning Code and Data Structure

We are committed to the principle that our code structure should mirror our data structure.

This philosophy is implemented through a monorepo setup that aligns closely with our data engineering processes, reflecting the hierarchical arrangement of our data layers. Each directory in our monorepo corresponds to a specific data layer and is further subdivided into directories for individual tables, which also minimises boilerplate code across different DAGs. This layout includes all relevant components (queries, schemas, and Python scripts), ensuring that our code and data remain in close alignment.

This structure facilitates navigation and clarity, making it easy to locate and understand the code related to any task and thus a specific table. As an additional benefit, Airflow’s DAGs provide a visual representation of our workflows. This visualisation allows us to easily identify how changes to one part of the pipeline will impact downstream processes. As a result, our ability to understand and manage dependencies has significantly improved.

By aligning our code structure with our data structure, we have created a robust, transparent, and efficient data engineering infrastructure. This approach not only supports our current goals but also ensures the long-term success of our data science initiatives.

Closing Thoughts: The Road Ahead

In rethinking our data engineering process, we've made significant strides by focusing on the fundamentals—building a solid foundation, embracing monotasking to create clarity and reliability, and aligning our code with our data structure. These changes have not only improved the efficiency and reliability of our workflows but also set us up for scalable growth as our data science team continues to innovate.

However, this is just the beginning. As our logistics services evolve and the complexity of our data pipelines increases, continuous iteration will be crucial. The principles we've adopted are not static; they are dynamic guidelines that we will refine as we encounter new challenges and opportunities. Our commitment to a systematic and disciplined approach ensures that we remain adaptable, responsive, and always ready to meet the needs of our business.

It's important to note that these concepts are not an all-or-nothing proposition. Each principle—whether it's monotasking, a layered approach, or aligning code with data structure—can be applied individually and incrementally, depending on your specific needs and pain points. This flexibility allows you to tailor your approach to your unique situation, implementing changes where they will have the most impact first.

By adopting this mindset, you can gradually introduce improvements without overwhelming your existing workflows. This step-by-step evolution helps ensure that changes are sustainable and aligned with your overall goals.

In the end, the mantra "fast is slow, slow is fast" has proven to be more than just a guiding philosophy—it's a practical approach to data engineering that balances immediate needs with long-term success. By taking the time to build robust systems now, we are positioning ourselves to move faster and more efficiently in the future, driving innovation and delivering value across our organisation.

Splitting the data of the monolith – Because who needs to sleep anyway…

Thu, 7 Nov 2024 12:08:03 GMT

In this article, I would like to share our twisted journey of migrating data from our old monolith to the new “micro” databases. I will highlight the specific challenges we encountered during the process, present potential solutions for them, and outline our data migration strategy.

Background: summary and the necessity of the project
How to migrate the data into the new applications: the options and strategies we considered, and how we actually did the migration
Implementation
- Setting up a test project
- Transforming the data: difficulties and solutions
- Restoring the database: how to manage long-running SQL scripts with an application
- Finalising the migration and preparing for go-live
- DMS job hiccup
Going live
Learnings

If you find yourself knee-deep in technical jargon, or a chapter is too long, feel free to skip to the next one; we won't judge.

Background

Over the last two years, our goal was to replace our old monolithic application with microservices. Its responsibility was to create customer-related financial fulfilments. It ran between 2017 and 2024, so it collected extensive information about logistical events, shop orders, customers, and VAT.

Financial fulfilment is a grouping around transactions and connects trigger events, like a delivery with billing.

The data:

Why do we need the data at all?

Having the old data is crucial, including the full history of the shop orders, such as logistical events and VAT calculations. Without it, our new applications cannot correctly process new events for old orders. Consider the following situation:

You order a PS5 and it is shipped: the old application stores the data and sends a fulfilment.
The new applications go live.
You send back the PS5, so the new applications need the previous data to be able to create a credit.

The size of the data:

Since the old application was started, it had collected 4 terabytes of data, of which we still wanted to handle 3 TB in two different microservices (in a new format):

shop order, customer data and VAT: ~2 TB
logistical events: ~1 TB

Handle history during development:

To manage historical data during development, we created a small service, which reads directly from the old app database and provides information through REST endpoints. This way we can see what has already been processed by the old system.

How to migrate the data into the new applications?

We worked on a new system and by early February, we had a functional distributed system running in parallel with the old monolith. At that point, we considered three different plans:

Run the mediator app until the end of the Fiscal Period (2031):
PRO: it is already done
CON: we would have one extra "unnecessary" application to maintain.
Create a scheduled job to push data to the new applications:
PRO: We can program the data migration logic in the applications and avoid the need for any unfamiliar technology.
CON: Increased cloud costs. The exact duration required for this process is uncertain.
Replay ALL logistical events and test the new applications:
PRO: We can thoroughly retest all features in the new applications.
CON(S): Even higher cloud costs. More time-consuming. Data-related issues, including the need to manually fix past data discrepancies.

Conclusion:

Because the trade-offs were too big in all three cases, I asked the company's development community for help and opinions. After some back and forth, we set up a meeting with a couple of experts from specific fields.

The new plan with the collaboration:

Current state of the system(s): Setting the scene

Before we could go ahead, we needed a clear picture of where we stood:

Old application runs in the datacenter
Old database already migrated to the cloud
Mediator application is running to serve the old data
Working microservices in the cloud

The big plan:

After the discussion (and a few cups of strong coffee), we forged a totally new plan.

Use an off-the-shelf solution to migrate/copy the database: Google’s Database Migration Service (DMS).
Promote the new database: Once migrated, this new database would be promoted to serve our new applications.
Transform the data with Flyway: Utilising Flyway and a series of SQL scripts, we would transform the data to the schemas of the new applications.
Start the new applications: Finally, with the data in place and transformed, we’d start the new applications and process the piled-up messages.

The last point is extremely important and sensitive. While the migration is running, the new applications keep collecting messages. Once the migration scripts finish, we must stop the old application, so that every message is processed at least once, either by the old or by the new solution.

Difficulties - the roadblocks ahead:

Of course, no plan is without its hurdles. Here’s what we were up against:

Single DMS job limitation: The two database migration jobs must run sequentially
Time-consuming jobs:
- Each job took around 19-23 hours to complete
- Transformation time: the exact duration was unknown
Daily fulfilment obligations: Despite the migration, we had to ensure that all fulfilments were sent out daily - no exceptions.
Uncharted territory: To top it off, nobody in the company had ever tackled something quite like this before, making it a pioneering effort. Also, the team consists mainly of Java/Kotlin developers who normally write only basic SQL scripts.
Go-live date: we had promised a go-live date to other dependent projects in the company.

Conclusion:

With our new plan in hand and the help of our colleagues, we could start working on the details: building up the script execution and the scripts themselves. We also created a dedicated Slack channel to keep everybody informed.

Implementation:

We needed a controlled environment to test our approach: a sandbox where we could play out our plan and develop the migration scripts themselves.

Setting up a test project

To kick things off, I forked one of the target applications and added some adjustments to fit our testing needs:

Disabling the tests: we disabled all existing tests except for the context loading of the Spring application. This was enough to verify the structure, the integration points, and the Flyway scripts.
New Google project: ensuring that our test environment was separate from our production resources.
No communication: we disabled all inter-service communication - no messaging, no REST calls, and no BigQuery storage.
One instance: to avoid concurrency issues with the database migrations and transformations.
No alerts: we removed all alerts to skip the heart attacks.
Database setup: Instead of creating a new database on production, we promoted a “migrated” database created by DMS.

Transforming data: Learning from failures

Our journey through data transformation was anything but smooth. Each iteration of our SQL scripts brought new challenges and lessons. Here’s a closer look at how we iterated through the process, learning from each failure to eventually get it right.

Step 1: SQL stored functions

Our initial approach involved using SQL stored functions to handle the data transformation. Each stored function took two parameters - a start index and an end index. The function would process rows between these indices, transforming the data as needed.

We planned to invoke these functions through separate Flyway scripts, which would handle the migration in batches.

PROBLEM:

Managing the invocation of these stored functions via Flyway scripts turned into a chaotic mess.

Step 2: State table

We needed a method that offered more control and visibility than our Flyway scripts, so we created a state table, which stored the last processed id for the main/leading table of the transformation. This table acted as a checkpoint, allowing us to resume processing from where we left off in case of interruptions or failures.

The transformation scripts were triggered by the application, and each run executed in a single transaction that also updated the state table.
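The checkpoint idea can be sketched as follows. This is an illustration only: it uses Python and an in-memory SQLite database instead of our Java/Kotlin application and production database, and the table names, columns, and the cents-to-euros transformation are all hypothetical. The point it shows is that the transformed batch and the new checkpoint are committed together, so an interrupted run can resume from the last committed id.

```python
import sqlite3

BATCH_SIZE = 2  # tiny on purpose, so the demo needs several batches

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_orders (id INTEGER PRIMARY KEY, amount_cents INTEGER);
    CREATE TABLE new_orders (id INTEGER PRIMARY KEY, amount_eur REAL);
    CREATE TABLE migration_state (job TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
    INSERT INTO old_orders VALUES (1, 1000), (2, 2550), (3, 399), (4, 120), (5, 75);
    INSERT INTO migration_state VALUES ('orders', 0);
""")


def migrate_next_batch(conn, job="orders"):
    """Transform one batch and move the checkpoint in the same transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        (last_id,) = conn.execute(
            "SELECT last_id FROM migration_state WHERE job = ?", (job,)
        ).fetchone()
        # Highest id of the next batch, or NULL when nothing is left to migrate.
        (max_id,) = conn.execute(
            "SELECT MAX(id) FROM "
            "(SELECT id FROM old_orders WHERE id > ? ORDER BY id LIMIT ?)",
            (last_id, BATCH_SIZE),
        ).fetchone()
        if max_id is None:
            return False
        conn.execute(
            "INSERT INTO new_orders (id, amount_eur) "
            "SELECT id, amount_cents / 100.0 FROM old_orders WHERE id > ? AND id <= ?",
            (last_id, max_id),
        )
        conn.execute(
            "UPDATE migration_state SET last_id = ? WHERE job = ?", (max_id, job)
        )
    return True


while migrate_next_batch(conn):
    pass

migrated = conn.execute("SELECT COUNT(*) FROM new_orders").fetchone()[0]
checkpoint = conn.execute("SELECT last_id FROM migration_state").fetchone()[0]
print(migrated, checkpoint)
```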

PROBLEM:

As we monitored our progress, we noticed a critical issue: our database CPU was being underutilised, operating at only around 4% capacity.

Step 3: Parallel processing

To solve the problem of the underutilised CPU, we introduced the concept of job lists: each list contained migration jobs that had to be executed sequentially.

Two separate lists of jobs have nothing to do with each other, so they can be executed concurrently.

By submitting these lists to a simple Java ExecutorService, we could run multiple job lists in parallel.

Keep in mind that every job calls a stored function in the database and updates its own row in the migration state table. It is extremely important to run only one instance of the application, to avoid concurrency problems with the same jobs.
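A minimal sketch of the job-list idea, not the original implementation. The class name JobListRunner, the fixed thread pool, and the logging jobs are assumptions for illustration; in the real migration each job would call a stored function and update its row in the state table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class JobListRunner {

    /** Runs every job list on its own pool thread; jobs inside one list run strictly in order. */
    static void runAll(List<List<Runnable>> jobLists, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<Runnable> jobList : jobLists) {
                futures.add(pool.submit(() -> {
                    for (Runnable job : jobList) {
                        job.run(); // sequential within the list
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for completion and surface failures
            }
        } catch (ExecutionException e) {
            throw new IllegalStateException("A job list failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for job lists", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> log = Collections.synchronizedList(new ArrayList<String>());
        List<Runnable> listA = Arrays.<Runnable>asList(() -> log.add("a1"), () -> log.add("a2"));
        List<Runnable> listB = Arrays.<Runnable>asList(() -> log.add("b1"), () -> log.add("b2"));
        runAll(Arrays.asList(listA, listB), 2);
        // The interleaving of the two lists may differ between runs,
        // but a1 always precedes a2 and b1 always precedes b2.
        System.out.println(log);
    }
}
```

Submitting one task per list, rather than one task per job, is what keeps the ordering guarantee inside a list while still letting independent lists overlap.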

This setup increased CPU usage from the previous 4% to around 15%, a huge improvement. Interestingly, the parallel execution didn’t significantly increase the time it took to migrate individual tables. For example, a migration that took 6 hours when run on its own took about 7 hours when executed alongside another parallel thread, an acceptable trade-off for the overall efficiency gain.

PROBLEM(S):

One table encountered a major issue during migration, taking an unexpectedly long time—over three days—before we ultimately had to stop it without completion.

Step 4: Optimising the long-running script(s)

To speed this process up, we needed extra permissions on the database, and our database specialists stepped in to help us with the investigation.

Together we discovered that the root of the problem lay in how the script was filling a temporary table. Specifically, a subselect in the script was inadvertently creating an O(N²) problem. Given our batch size of 10,000, this inefficiency was causing the processing time to skyrocket.

Figure 1: Example analyse script

To address the issue, we rewrote the script with an updated approach:

Instead of relying on the subselect, we created a temporary table that performed a join between the two necessary tables upfront. This way, the heavy lifting was done once, reducing the need to repeatedly scan the data.
After creating the temporary table, we inserted its rows into the real table. This change effectively transformed the operation from O(N²) to O(2N), which is linear in the number of rows.
Finally, we started to merge similar jobs so that they were handled in one execution and the temporary tables were created only once.
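The difference in cost can be illustrated with an in-memory analogy. This is not the SQL script and not how PostgreSQL executes a query; the arrays, method names, and step counting are assumptions used only to show why a per-row scan grows quadratically while joining once upfront stays at roughly 2N steps.

```java
import java.util.HashMap;
import java.util.Map;

final class JoinCostSketch {

    /** Per-row lookup by scanning: analogous to a subselect re-evaluated for every row. */
    static long nestedLookupComparisons(int[] parentIds, int[] childParentIds) {
        long comparisons = 0;
        for (int childParentId : childParentIds) {
            for (int parentId : parentIds) {
                comparisons++;
                if (parentId == childParentId) {
                    break; // match found, stop scanning for this row
                }
            }
        }
        return comparisons;
    }

    /** Join upfront: one pass to build a lookup structure, one pass to use it. */
    static long joinUpfrontSteps(int[] parentIds, int[] childParentIds) {
        long steps = 0;
        Map<Integer, Integer> positionById = new HashMap<>();
        for (int i = 0; i < parentIds.length; i++) {
            positionById.put(parentIds[i], i);
            steps++;
        }
        for (int childParentId : childParentIds) {
            positionById.get(childParentId);
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        int n = 10_000; // the batch size mentioned in the post
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) {
            ids[i] = i;
        }
        System.out.println("scan per row: " + nestedLookupComparisons(ids, ids));
        System.out.println("join upfront: " + joinUpfrontSteps(ids, ids));
    }
}
```

For a batch of 10,000 rows the first variant performs N(N+1)/2 = 50,005,000 comparisons, while the second needs 2N = 20,000 steps.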

RESULT:

The results were immediate and impressive. With the new approach, the previously unmanageable table transformation now completed in 10-12 hours: a significant improvement and, most importantly, a predictable and stable time frame.

Restoring database constraints and indexes:

Once the data was transformed and migrated successfully, our next task was to restore the database constraints and indexes. However, as with many things in the world of data migration, it wasn’t as straightforward as one might hope.

Time-consuming index creation

Creating certain indexes took more than an hour. Because we planned to create them via Flyway, the application failed to start and the Flyway transaction was rolled back. On the next application start, Flyway would try to create the index again from the beginning, resulting in an endless loop.

A Practical solution: concurrent index creation

Using PostgreSQL’s CREATE INDEX CONCURRENTLY allows the database to build the index without blocking writes to the table.

However, there’s an important consideration: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so it operates outside of the usual transaction mechanisms. This meant that if the application failed to start due to a timeout, the creation process continued in the database. Once the index was finally built, the next time the application attempted to start, the IF NOT EXISTS clause in our script gracefully avoided any further attempt to create the index, allowing the Flyway migration to proceed smoothly.

Ensuring continuity

This solution, while not the most traditional, was highly effective. By ensuring that the index creation process continued even if the application startup failed, we allowed the migration to complete successfully. Flyway’s version history was updated once the indexes were in place and the application started, ensuring that our database was in a consistent state.

The key takeaway here is that sometimes, a practical and flexible approach is the best way to overcome challenges.

Finalising the migration and preparing for go-live

After the initial phases of data transformation and index management, our last major task was to restore the database to its original state. This involved carefully reverting any temporary changes, cleaning up our Flyway migration scripts, and preparing for a full-scale test run to ensure everything was ready for production.

The three merge requests

The entire process was divided into three key merge requests/releases, each playing a critical role in the migration:

Preparation (1-2 hours)
- scaled down to 1 instance
- dropped unnecessary indexes and constraints
- created the newly required indexes
- added stored functions/migration scripts (but without executing them)
Execution of data transformation (1-2 days)
- added the execution code
- ran the table transformations
Restoring the database state (3-6 hours)
- reinstated all original indexes and constraints
- cleaned up any temporary changes and classes

The DMS job hiccup: The case of the missing foreign key

As with any complex migration project, unexpected issues can arise, and in our case, the data migration service (DMS) threw us a curveball. After about 18 hours of running smoothly, our DMS job suddenly failed, multiple times, because of a missing foreign key.

Investigating the issue

The error message indicated that a foreign key was missing, causing the DMS job to fail. However, when we inspected the database, the foreign key was present. The confusion deepened as we realised that the row in question had been created around the time the migration started. It seemed that the DMS job’s syncing process (where it catches up with ongoing changes) had somehow missed this update.

Our database team took the lead in investigating the issue and also reached out for external help, escalating the problem through multiple channels.

Resolution: DMS parallelism option

After about two weeks of investigation and back-and-forth with support, we finally got the solution from the Bol/Google shared Slack channel. It turned out that DMS has (or had recently introduced) a parallelism option. Once we set this option to "minimal", the job started working correctly and the foreign key issue was resolved, so we were able to continue with the go-live plan.

Going live: The final countdown

At this point, the bulk of the heavy lifting was behind us. We had meticulously planned, tested, and overcome numerous challenges in preparation for this moment. The final step was to execute the go-live process, bringing all our efforts to fruition.

Rather than detailing every step once again, it's safe to say that everything we've covered so far led us to this moment. The process is straightforward, although it is critical to keep the dates and sequence in order.

Learnings:

It was quite a journey, so I would like to close by highlighting a couple of learnings:

Talk through everything multiple times with different people, and reach out for help if you have concerns.
When you explain something, be as clear and specific as possible, and ask for feedback to make sure the other person understands the situation the way you intended. Illustrations are always welcome!
Be humble and patient when you ask for help. Priorities can differ, and you may not know the other party’s priorities.
Track your process in a project management tool so that you don’t forget any detail or small task that could kill your project.
For us the full process took 2-3 months. A lot of unforeseen issues came up during the migration, any of which could have derailed the project. On an unfamiliar journey, always try to take small steps and tackle small problems: each small success gives you confidence and positive feedback.

Thank you for reading this post. I hope you enjoyed it and learnt a bit from it, because it is always better, and faster, to learn from others’ mistakes.

Tips for Data Engineering

Thu, 31 Oct 2024 13:28:19 GMT

In this episode of TechLab, we're diving deep into the world of data engineering! Join us as we chat with Robin, a data scientist from bol.com, who shares how his team tackled complex data pipeline challenges by using layered approaches and innovative solutions like Apache Airflow. Robin makes it easy to understand by comparing data workflows to baking bread (yes, really!) and highlights how his team built a robust, reliable system to handle critical logistics data for bol.com. Tune in for insights, laughs, and some surprising analogies that bring data science to life!