DevOps is not a goal - releasing often is

Due to the influence of agile working and the increasing dynamics of the market, releasing is becoming a bottleneck. The question is: how can we fundamentally accelerate the release process? Rini van Solingen explains which measures work in practice and illustrates this based on interviews at bol.com.

Based on past experiences, many organizations have developed a policy of strict separation between development and operations, or, as they are often called, between Dev (development) and Ops (operations). This separation makes logical sense, as it illustrates who is responsible for innovation (Dev) and who is responsible for the operational systems landscape (Ops).

However, with the adoption of agile, this separation often comes under pressure. The iterative character of agile means Dev teams deliver more new versions than Ops teams can release. The consequence is a growing mountain of software that cannot go live. This is bad for both teams.

This leads many people to evangelize DevOps: agile teams in which both development (Dev) and maintenance (Ops) are combined. ‘Eat your own dogfood’ is the mantra here, with the idea that if teams are made responsible for both Dev and Ops they will look at quality and availability in a whole new light. This is only partially true.

The label DevOps is not the centerpiece around which everything revolves

People implicitly assume that Dev teams tend to produce poor-quality software because they do not have to perform the Ops tasks themselves. This is nonsense. The quality of the product is determined not by their professionalism but by the practical means they have to manage and control quality. It is entirely logical that errors occur on a large scale if too many teams produce too many different versions which all go live just once a quarter. This, then, is the root of the problem, not the absence of DevOps teams.

The only way to improve releasing is to do it more often. Since 2014, bol.com has concentrated on increasing release speed. They learned that the implementation of concrete measures can drastically increase the speed of the release process.

Breaking up releases is the only correct solution

Practice demonstrates that it is much easier to make small releases on a regular basis. This approach makes problems visible at an earlier stage, which means they can be resolved faster. This sounds simpler than it is, however.

In practice, the mutual interdependence between systems and components is so great that it is impossible to release components separately.

This is the fundamental source of the problem. However, breaking up releases is the only direction in which a solution can be found, even if that appears difficult at first glance. It can be done step by step and delivers directly measurable advantages. Moreover, each improvement that increases release speed frees up time and space to implement the following steps.

Challenges

In short, the solution lies in a rigorous acceleration of the release process. Increasing release speed is an investment that costs time and money. It is also often challenging, especially in the beginning. However, there are no valid alternatives.

For this reason, making a fast start is recommended. Focus on removing dependencies between the separate components of a system, so that they can be released independently of each other. Ideally, this should be done by the agile teams themselves, with significant autonomy and without a strict separation of powers between Dev and Ops. The label DevOps can eventually be placed on this, but that’s not the purpose of the process. The purpose is to increase the release speed!

Six measures to increase release speed

1 Put everything at the service of the release process

The best way to ensure that a bottleneck causes few problems is to make the release process the most important focus point. All other activities should therefore support the release process, which must be managed with military precision. This can be a (temporary) solution, but it does not solve the underlying problems and often proves insufficient.

2 Break up monoliths

The most fundamental measure is to break large monolithic systems into components that support each other with the required services and can be released separately. Exactly how this break-up is best achieved is an architectural question, in which SOA and microservices play an important role. An additional advantage of breaking up monoliths is that the total system becomes much more reliable. If an individual component fails, this won’t result in a system-wide outage.
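The isolation described above can be sketched in a few lines. This is a minimal illustration, not bol.com's implementation: the function names, the stand-in services and the page structure are all hypothetical. It shows one common way to keep a failing component from taking down the rest of a page, by catching the failure at the component boundary and substituting a default, such as a generic offer in place of a personalized one.

```python
from typing import Callable, Dict

GENERIC_OFFER = "generic offer"  # hypothetical default content


def with_fallback(primary: Callable[[], str], fallback: str) -> str:
    """Return the component's result, or the fallback if the component fails."""
    try:
        return primary()
    except Exception:
        # The failure stays inside this component; the caller still gets a value.
        return fallback


def personalized_offer_up() -> str:
    """Stand-in for a healthy personalization service (hypothetical)."""
    return "personalized offer"


def personalized_offer_down() -> str:
    """Stand-in for a personalization service that is unavailable (hypothetical)."""
    raise ConnectionError("personalization service unavailable")


def render_page(offer_service: Callable[[], str]) -> Dict[str, str]:
    """Assemble a page from independent components."""
    return {
        "offer": with_fallback(offer_service, GENERIC_OFFER),
        "checkout": "available",  # unaffected by the offer component
    }


if __name__ == "__main__":
    print(render_page(personalized_offer_up))
    print(render_page(personalized_offer_down))
```

In this sketch the checkout part of the page is assembled regardless of whether the offer component succeeds, which is the property the paragraph above describes.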

3 Regular and small releases

The point is not so much to release more often as to release ‘smaller’. Focus on small releases that make it easy to spot whether something works well or whether there is a problem somewhere in the newly released software. ‘Release small, fail fast’ is the creed here. A logical consequence is that releases will be made more often, simply because this is now possible. Releasing individual components separately and implementing automatic testing are important preconditions.

4 Automate all release steps

Agile teams should be provided with the tools needed to execute all activities that an Ops engineer must perform to release. The role of the Ops engineer therefore changes completely: instead of executing release steps, Ops engineers develop release tools, hand them over to the agile teams and train them. In practice, this means that Ops teams ‘automate themselves away’ and can concentrate on avoiding problems rather than solving them after they occur. This is more akin to fire prevention than firefighting.
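As a rough sketch of what such a release tool might look like at its core, the example below runs a list of release steps in order and stops at the first one that fails. The step names and the lambda stand-ins are assumptions made for illustration; a real tool would call actual build, test and deployment commands.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_release(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failing step.

    Returns (success, names of the steps that completed).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    # Hypothetical steps; each lambda stands in for a real command.
    steps: List[Step] = [
        ("build", lambda: True),
        ("automated tests", lambda: True),
        ("deploy", lambda: True),
    ]
    print(run_release(steps))
```

Because every step is a plain function, an agile team can run the whole sequence itself without handing work over to an Ops engineer.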

5 Make teams autonomous and let them release themselves

To ensure that teams take full ownership of their responsibilities, it is important to make them autonomous. Each component needs to be owned by one team, which is responsible for it end to end. This means that all Dev tasks AND Ops tasks are handed over to an agile team, using tools that are made available by Ops teams. This removes the Ops teams from the critical path, so they are no longer the bottleneck.

6 Make the advantages of a high release speed visible

Each improvement requires an investment in terms of attention, energy, time, money and patience. If everyone involved knows what the advantages of regular releases are, the difficulties and costs are more likely to be accepted. This applies especially to the business departments for whom the agile teams work. Creating enthusiasm among the business teams serves as an important boost for all other measures.

Read more about independent services & release autonomy in this techlab article: Services & Autonomy: the one can’t live without the other.

Practical advantages in figures

The application of all these measures has led to many measurable advantages at bol.com:

  • Modularity of the website has increased fivefold

The entire website had to be broken down into individual components that could be released separately. This was a major assignment but it has been successfully executed. By the end of 2017 the number of services will have grown to more than 250, and each can be released separately. These components also function independently of each other. If, for example, the search function disappears, the rest of the website still works and customers who are already in the payment process can continue their transactions. Also, if a personalized offer is no longer visible, a generic offer will appear instead.

  • Extreme growth in the number of releases

The number of releases has grown from 1 per month in 2014 to 550 per month in 2017. This is a direct result of breaking up monoliths into components that can be released separately. In addition, going live can now be done by the agile teams themselves, and not by Ops teams. All Ops engineers were transferred to a technical platform team or an SRT (Site Reliability Team). These teams produce infrastructure and tools for the agile teams, enabling the agile teams to carry out automatic testing, monitoring, load testing and releases themselves.

  • Critical incidents reduced by more than 95%

The number of critical incidents (with a significant impact on reputation, customer satisfaction and revenue) has been reduced from an average of 25 per month in 2014 to fewer than 1 per month in 2017.

  • Scheduled downtime reduced to zero

In 2014, the website was offline for an average of 4 hours per month to implement releases. Since 2016 the website has no longer gone offline during a release. Only one type of update, that of a supporting platform, cannot yet be carried out without downtime. The downtime for this type of release has been reduced to less than 18 minutes per year.

  • Unscheduled downtime reduced to practically zero

Alongside planned downtime the website can also go offline due to technical problems. This has been reduced by over 98% to less than 10 minutes of unscheduled downtime per year.
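The percentages above can be checked against the raw figures with a short calculation. The sketch below uses only numbers quoted in this article and takes the upper bounds ("fewer than 1", "less than 18 minutes") as the 2017 values, so the real reductions are at least as large as the results. Converting the 2014 downtime of 4 hours per month into a yearly figure assumes that average held for all 12 months.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) * 100 / before


# Critical incidents: 25 per month (2014) to fewer than 1 per month (2017).
critical_incidents = reduction_pct(25, 1)

# Scheduled downtime: 4 hours per month (2014), assumed to hold for 12 months,
# to less than 18 minutes per year.
minutes_per_year_2014 = 4 * 60 * 12
scheduled_downtime = reduction_pct(minutes_per_year_2014, 18)

if __name__ == "__main__":
    print(critical_incidents)     # lower bound on the incident reduction
    print(minutes_per_year_2014)  # scheduled downtime in 2014, minutes per year
    print(scheduled_downtime)     # lower bound on the downtime reduction
```

With these inputs the incident reduction comes out at 96%, consistent with the "more than 95%" stated above, and the scheduled downtime drops from 2,880 minutes per year to under 18, a reduction of more than 99%.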

Research that appears in this article was carried out as a collaboration between Frederieke Ubels, Thomas de Jong and Nick Tinnemeier of bol.com.

AUTHOR - Rini van Solingen

This article is translated from the original blog of Rini van Solingen. Rini is a part-time teacher at the Technical University of Delft and CTO at Prowareness. Among other works, he is the author of The Power of Scrum, The Beekeeper and The Responsive Enterprise.

Frederieke Ubels
