Bol’s journey in shifting left* and shifting right**: our Vision
We took a serious look at where we are in our journey of shifting left and shifting right and realised that besides a clear vision, we also need to focus on providing building blocks to our teams to realise it.
Where we were:
*) in full isolation, relying on stubs and stub containers
**) fully integrated pre-production environment
Where we wanted to be:
*) in full isolation, relying on stubs and stub containers
**) fully integrated pre-production environment
***) experiment with new cloud components, network or permission changes
In this post we'll describe how that vision looks and why we believe in it, and in subsequent posts we'll share more about its key elements.
The Vision
In 2021, many of our teams were still relying on a fully integrated pre-production (STG in further text) environment to validate their changes. They expected dependencies to be up and running, and production-like data to be present across the environment. They expected the chain to be available and reliable.
However, technological changes, together with data, privacy and access constraints imposed by ever-expanding regulations, meant that guaranteeing a stable STG environment with consistent data across applications was no longer a reasonable expectation. It was time to change. Time to evolve.
We realised that the first key component of our future vision is TESTING IN ISOLATION for both functional and non-functional tests. We truly believe that by making a serious push for this shift left, teams will be able to deliver faster and with confidence.
However, this does not come without costs. Prerequisites for successful testing in isolation are:
- Creating stubs is easy.
- Stubs are reliable.
This made us realise that we can’t have 170+ teams each writing their own stub implementations for the dependencies their application relies on, often as many as 10. It also became clear that the responsibility for providing reliable and trustworthy stubs should lie with the producers. We needed a way to have automation take over these manual and error-prone steps while making sure the stubs are a trustworthy representation of an application.
Adopting an API-FIRST approach to development, where APIs are considered first-class citizens rather than a technology to integrate sub-systems, was an important step in realising this. API design-first enables teams to innovate faster by using CODE GENERATION to produce client/server code, mocks, and stubs. The quality of the generated code depends on the quality of the API, which is where API LINTING plays an important role. API linting helps teams create high-quality APIs that can then be a solid base for code generation. This way the error-prone manual work is automated away, allowing our engineers to focus on delivering value for our customers.
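To make the idea of API linting concrete, here is a minimal sketch in Python. It is not bol's tooling: the rules (kebab-case path segments, an operationId and at least one response per operation) and the function name lint_api are assumptions chosen for illustration, applied to an OpenAPI-style dictionary.

```python
import re

# Hypothetical rule: every path segment is lowercase kebab-case or a {parameter}.
KEBAB_PATH = re.compile(r"^(/([a-z0-9]+(-[a-z0-9]+)*|\{[a-zA-Z0-9_]+\}))+$")
HTTP_METHODS = {"get", "put", "post", "delete", "patch"}


def lint_api(spec: dict) -> list[str]:
    """Return human-readable violations for an OpenAPI-style dictionary."""
    violations = []
    for path, operations in spec.get("paths", {}).items():
        if not KEBAB_PATH.match(path):
            violations.append(f"{path}: path segments should be lowercase kebab-case")
        for method, operation in operations.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            # Code generators typically derive method names from operationId.
            if "operationId" not in operation:
                violations.append(f"{method.upper()} {path}: missing operationId")
            # Without declared responses, a generated stub has nothing to return.
            if not operation.get("responses"):
                violations.append(f"{method.upper()} {path}: no responses defined")
    return violations


spec = {
    "paths": {
        "/orders/{orderId}": {
            "get": {"operationId": "getOrder", "responses": {"200": {}}}
        },
        "/Order_Items": {"post": {}},
    }
}
for violation in lint_api(spec):
    print(violation)
```

Running such a check before code generation means a weak API description is rejected early, rather than surfacing later as a poorly named client or an unusable stub.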
These three components represent the steps we’re taking to shift left.
One had to wonder, by shifting left, what were we “leaving behind”? Was there a gap being created? In our case - there definitely was. If all of us move to testing in isolation, no one would be using the functionalities on STG prior to release. Would all those bugs, functional or non-functional, the unknown unknowns that previously have been found on STG now land on production (PRO in further text) right at the feet of our customers? That didn’t sound good. The goal of shifting left was to deliver value to our customers faster, not to deliver value and issues faster. This meant that we needed to rethink the strategies of releasing and monitoring our software. We began our exploration of what shifting right meant in our case.
It was great to realise that a lot was already happening in the company when it comes to shifting right. Bol’s experimentation platform, feature toggles and shadow runs were already part of our engineers’ toolbox. We had also come a long way in using Site Reliability Engineering to balance product innovation and reliability.
There were only a couple more gaps to tackle prior to feeling confident about our overall vision.
The first gap identified was around observability. The more observable a system is, the more quickly and accurately engineers can navigate from an identified problem to its root cause. Logs and metrics were well known to our engineers but still not enough to easily troubleshoot issues in a highly distributed system like ours. It became clear that we needed to invest in DISTRIBUTED TRACING to make our systems truly observable.
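The core mechanism behind distributed tracing is that one trace identifier travels with a request across every service it touches, while each hop gets its own span identifier. The sketch below illustrates this in Python, assuming the W3C Trace Context "traceparent" header layout (version, trace id, parent span id, flags). The post does not say which tracing standard or tooling bol uses, and the function names are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 32 hex characters, shared by the whole request chain
    span_id = secrets.token_hex(8)    # 16 hex characters, unique to this hop
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Build the header a service sends on its own outgoing calls."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    # Keep the trace id so all spans can be stitched together; mint a new span id.
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

Because every log line and span can carry the same trace id, an engineer can follow a single failing request through the whole chain instead of correlating timestamps by hand.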
Once we felt confident about the observability of our systems on PRO, we were left with another challenge - limiting the blast radius of a potential unforeseen issue. Feature toggles are a great option with many advantages, but they come with the drawbacks of complexity and overhead in the code and should be used tactically. We believe that adding (automated) CANARY RELEASES to our engineers’ toolbox will help with delivering value faster with confidence.
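An automated canary release boils down to a repeated decision: send a small share of traffic to the new version, compare its behaviour with the current version, then promote or roll back. Below is a minimal Python sketch of that decision step. The thresholds, the choice of error rate as the only signal, and all names are assumptions for illustration, not a description of bol's setup.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Metrics,
    canary: Metrics,
    max_increase: float = 0.01,  # hypothetical tolerated rise in error rate
    min_requests: int = 100,     # hypothetical minimum sample size
) -> str:
    """Return 'wait', 'rollback' or 'promote' for the current canary step."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge the canary
    if canary.error_rate > baseline.error_rate + max_increase:
        return "rollback"  # canary is clearly worse than the stable version
    return "promote"


baseline = Metrics(requests=10_000, errors=50)
print(canary_verdict(baseline, Metrics(requests=20, errors=0)))
print(canary_verdict(baseline, Metrics(requests=500, errors=3)))
print(canary_verdict(baseline, Metrics(requests=500, errors=20)))
```

Since only the canary's share of traffic is exposed while this check runs, an unforeseen issue affects a small group of users, and no toggle logic has to live in the application code.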
Both feature toggles and automated canary releases cater for the needs of a single application, but high-impact, high-complexity innovations often involve changes that spread across several applications. For this, we need to be able to roll out changes with CUSTOM ACCESS CONTROL, where we enable chain tests and acceptance tests to be performed prior to releasing the functionality to consumers.
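One way to picture a rollout with custom access control is a shared rollout definition that every application in the chain evaluates in the same way: the new behaviour is visible only to callers in an allowed group until the rollout is made public. The Python sketch below is an illustration under that assumption. The Rollout type, the group names and is_enabled are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    feature: str
    allowed_groups: frozenset = frozenset()  # groups that may see the feature early
    public: bool = False                     # True once released to all consumers


def is_enabled(rollout: Rollout, caller_groups: set[str]) -> bool:
    """Decide whether the caller should get the new behaviour."""
    return rollout.public or bool(rollout.allowed_groups & caller_groups)


new_checkout = Rollout(
    feature="new-checkout",
    allowed_groups=frozenset({"chain-testers", "acceptance-testers"}),
)

print(is_enabled(new_checkout, {"chain-testers"}))  # tester sees the change
print(is_enabled(new_checkout, {"customers"}))      # regular consumer does not
```

Because each application asks the same question about the same caller, a chain test exercises the new behaviour end to end while regular consumers keep seeing the existing functionality.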
With these building blocks in place, we believe that bol will strike a balance between shifting left and right and be able to deliver value to our customers fast, without compromising on quality.
Stay tuned
In follow-up posts, we’ll share more about these building blocks and explain in detail how we went about realising them.
*) Shift-left testing aims to execute quick, automated, repetitive tests to identify bugs and possible risks at critical phases of software development.
**) Shift-right testing, also known as “testing in production,” focuses on collecting real-time data and conducting tests in a live environment. It offers valuable insights into how the software performs in real-world scenarios and enables the detection of potential issues that may arise when the software is deployed and actively used by end-users.