“It didn’t look like a big issue at that moment.” Everybody in the room listened to what the Engineer on Duty (EoD) had to tell about the first moments of what, later on, turned out to be a partial outage of the webshop in peak season.
It’s Monday morning, 10.00h, and a team of engineers, service delivery managers and the managers on duty involved has gathered in the redroom. Within bol.com this room is known as the place used to respond to emergencies, or to evaluate them. The latter is happening this Monday morning. The team is reconstructing the partial outage of the Friday evening before, with the sole purpose of understanding what happened, learning from it and improving. We call this the retro, or blameless post-mortem, on emergencies.
Going back in time to that Friday evening. Customers of bol.com use this moment to buy their Christmas presents, and the traffic on the website and apps is huge. It’s 21.35h when a team manager from the business department reached out to the Manager on Duty (MoD) in IT. He informed him that he experienced slowness on the bol.com site, that the apps had become less responsive and, even worse, that customers were reporting issues on sites like ‘allestoringen.nl’, a widely used Dutch website that tracks all kinds of issues with websites. The first reaction of the MoD was the adrenaline rush. The second: starting up his laptop and reaching out to the Engineer on Duty to kick off the emergency response plan.
Within bol.com we use a collaborative chat tool, and in emergency situations like these we gather in the room called “Production”. The great thing about this tool is that people who join in later can read back the whole story so far. In this room the EoD mentioned the first alert he had received from one of the services an hour before, but also the notification that it had recovered. With this new information, the investigation started again. The fact that we missed a full hour of this partial outage was something to dive into later. First, we had to know what was going on, and what to fix to enable customers to continue their online Christmas shopping.
In the first moments after escalation a couple of things happen. If it is clear that we face an emergency like this, higher management is informed by the MoD, while the EoD reaches out to his peers in the WhatsApp group. An emergency notification is sent out and updated at a specific pace. And maybe most importantly, the first responders determine which people should be on board to investigate along. This investigation takes place in a separate room in our chat tool, so it’s clear to everybody where to find the information. After these first steps, we make sure that our Customer Support organization is informed so it can update our customers, and that our Logistics department is informed so it can understand the lower number of orders in the order stream and act accordingly.
The first signals of the investigation pointed in the direction of two specific services, so engineers who knew these systems, as well as the database engineer on duty, were called to join in for this emergency. At the same time, other people who had been informed joined in and were able to verify the customer impact by looking at the graphs of our external monitoring systems. These external monitoring tools are in place to give extra insight into system uptime and performance, on top of the internal monitoring systems. Client-side response times are measured, but only after checking the 90th and 99th percentile lines twice was it possible to see the impact.
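As an aside, why those two percentile lines mattered can be sketched in a few lines of code. This is a minimal, hypothetical example, not bol.com’s actual tooling: with a made-up sample of client-side response times, the mean and even the 90th percentile can look healthy while the 99th percentile exposes the customer impact.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at 1-based position ceil(pct% * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

# Hypothetical sample: 90 fast requests and 10 very slow ones (milliseconds).
times = [120] * 90 + [4000] * 10

mean = sum(times) / len(times)   # 508.0 ms — looks tolerable
p90 = percentile(times, 90)      # 120 ms  — looks perfectly healthy
p99 = percentile(times, 99)      # 4000 ms — the impact only shows up here
```

With a tail like this, a dashboard showing only averages, or only the p90 line, stays green while one in ten customers waits four seconds; hence the value of checking both percentile lines.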
Back to the redroom this Monday morning. In retrospect it was clear that we had to fix the alerting and monitoring part. Being unaware of something happening for a full hour is unacceptable and should not happen in the future. Actions were defined to address this.
The second thing done in the retro was the creation of the timeline. Knowing the timeline makes it possible to pinpoint what improvements we can make to the processes and systems involved. This is how we found out that we had overlooked a very important pointer to the root cause. All the systems, services and databases showed green dashboards, but the customers were experiencing something very different. How come?
In the retro, a face-palm moment: that evening the clue was overlooked, as the pointers were to those two services. But after almost two hours of investigation we experienced the breakthrough moment: the problem was not in the services themselves, but in an unhealthy component of the network setup they depended on. Directly after the traffic was routed via the healthy one, customers were happy again.
While discussing the timeline, we found out that there had been pointers to these unhealthy spots in our network all along. It was not only the two services complaining; another domain, one that the shop and apps use a lot, was also unreachable for customers. Adding those two observations together should have brought the solution earlier. Tunnel vision, and overlooking something because of it, is one of the worst nightmares in an emergency. So again, we learned: add monitoring and alerting from the customer perspective. Remove specific dependencies that were misleading as pointers, and act like a customer so that all network components are included.
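The lesson “monitor from the customer perspective” can be illustrated with a small sketch. This is a hypothetical example, not bol.com’s actual setup; all component names and the threshold are invented. The idea is that a check is only green when every component a real shopping session touches is reachable and fast, including the kind of extra domain that was overlooked that evening.

```python
SLOW_MS = 1000  # hypothetical alert threshold for a single component

def evaluate_probes(probes):
    """probes: mapping of component name -> (reachable, response_ms).
    Returns the components a customer would experience as broken."""
    unhealthy = []
    for name, (reachable, response_ms) in probes.items():
        if not reachable or response_ms > SLOW_MS:
            unhealthy.append(name)
    return unhealthy

# A session is only healthy if *every* component on the customer's path is,
# including the extra domain used by the shop and apps (all names invented).
session = {
    "www.example-shop.test": (True, 180),
    "api.example-shop.test": (True, 2400),    # slow service
    "static.example-shop.test": (False, 0),   # unreachable extra domain
}
```

Per-component dashboards can each look acceptable in isolation; combining the probes into one customer-path check is what surfaces the pattern earlier.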
After solving the three-hour partial outage, all our customers and sellers were happy again, but the team wasn’t finished. Everyone in IT knows that having a single point of failure can be very dangerous. Hence, we had to get the initial three-component setup back and add alerting to survive the weekend. It took another hour to recover the unhealthy components and add the alerting. We even raised a ticket with our hardware supplier to check the logs of this outage.
And that’s where the timeline ended, after four hours of firefighting.
As a wrap-up, the summary of this one-hour retro was posted on our Confluence page in the overview of Incident Reports, and the defined actions were logged in our Jira ticketing system.
Luckily, all people involved had the whole weekend to themselves to relax before coming back in on Monday.