At bol.com we believe Site Reliability Engineering (SRE) is the best way to balance product innovation and reliability. It has been described as what system operations looks like when you let a software engineer design it, and the basics are expertly explained in this Google Cloud community blog. If some of the terms in this post are unknown to you, please take five minutes to read that before you come back here.
In essence, SRE is a particular set of standards, tools and practices that govern how you balance Dev and Ops in a DevOps software engineering team.
A mature DevOps organisation
At bol.com, we’ve officially been doing DevOps since 2015. Since then, we have developed an expert group of platform engineering teams. They build and run the infrastructure layers our 170+ engineering teams need to efficiently develop and run their software systems.
Therefore, when we started up a dedicated SRE team in 2020, we stayed away from infrastructure problems other SRE teams often focus on. The platform teams had this one covered.
We focussed on process instead: how can we make it as easy as possible for our teams to apply SRE and find the optimal balance between innovation and reliability?
In online retail the competition is fierce, and the marketplace is global. All our teams need to innovate to the best of their ability for us to stay ahead as a company.
Our SRE team’s stated mission is to enable products to balance reliability and innovation to maximize customer value through data-driven decisions.
We want to give every team the ability to innovate as fast as possible while safeguarding enough reliability to maximally delight users.
When will we be successful?
So what does life look like in a team that’s set up to reap all the benefits SRE promises?
Every team has three to five critical error budgets they’re always aware of. If they are threatened, they limit risk. Until then, they innovate with confidence. All alerting is based on SLOs and every alert received results in a change, whether that is in resiliency, alerting coverage or something else.
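To make the idea of "limiting risk when a budget is threatened" concrete, here is a minimal sketch of how an error budget could be tracked for a request-based SLO. The class name, the field names and the 25% threshold are assumptions made up for illustration; they are not bol.com's actual tooling or policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for a request-based SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched, 0.0 means spent, negative means overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def should_limit_risk(budget: ErrorBudget, threshold: float = 0.25) -> bool:
    # Assumed policy: slow down risky changes once less than 25% is left.
    return budget.remaining_fraction < threshold


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
threatened = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=900)
print(should_limit_risk(healthy), should_limit_risk(threatened))
```

In this sketch the first team still has roughly 60% of its budget and can keep innovating with confidence, while the second has about 10% left and would limit risk.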
Product management is in the lead for setting the SLO targets. They understand that higher reliability targets are an investment that comes with slower innovation. They use this knowledge to judge these reliability targets against innovation requirements.
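The trade-off behind an SLO target becomes tangible when you translate it into allowed downtime. The sketch below does that for a time-based availability SLO; the function name and the 30-day window are assumptions chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much of the window may be 'down' without breaking the SLO."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    # The error budget is simply the unreliable share of the window.
    return window * (1.0 - slo_target)


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime(target)}")
```

Over 30 days, a 99.9% target leaves about 43 minutes of budget, while 99.99% leaves a little over 4 minutes. Each extra nine cuts the room for risky changes by a factor of ten, which is the investment product management weighs against innovation.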
When someone comes knocking on the team’s door about a service interruption, the conversation can be about improving the SLIs and SLOs instead of firefighting. This provides a positive feedback cycle that maintains the active balance between reliability and innovation.
All this enables engineers to make changes with confidence and invest in resiliency when necessary, and only when necessary.
The road ahead
That is where we’re headed, but we still have a long road ahead of us.
There are a few products and teams where we see SRE applied to such a level that the rewards are clear, but adoption has been slower than we had originally hoped.
One thing we experience is what Google DORA’s State of DevOps report calls “The J-Curve” and what we refer to as “The hockey-stick effect”. This means that it takes sustained effort and a couple of iterations before you start to see the benefits of SRE. With the pressure on teams to deliver a broad array of changes, it can be hard to make enough time to get to the point where SRE works for the teams.
We address this with a two-pronged strategy. On the one hand, we collaborate closely with engineers to enable SRE practices through a bottom-up approach; on the other hand, we turn everything we learn from these collaborations into self-service tooling that makes it easier to skip ahead on the J-curve. However, it remains vital to plan SRE adoption on product roadmaps because there will always be some effort required to get results.
Because as that ancient proverb goes: “The best time to implement SRE is 20 years ago, the next best time is now”.