Who should be on-call for incident response

Paging the right engineers for faster time to resolution, lower error rates, and a culture of ownership and collaboration. 

If a software engineer pushes code to production, writes code for a production application, or makes changes to production systems, they should be on-call. If the system breaks, has a high rate of errors, or shows some other actionable issue impacting customers, the on-call engineer can be paged to fix it or start the incident response process. Which engineer, exactly? The one closest to the problem.

Doing so ensures product quality for your customers. You’ll see faster time to resolution and lower error rates, and you’ll foster a culture of ownership, accountability, and collaboration.


💡
Substrate: The Right Way to AWS
Substrate is a CLI tool that helps teams build and operate secure, compliant, isolated AWS infrastructure. From developers who have been there.

What is incident response?

Incident response is an organization’s process for handling a problem with its application: how it is alerted to the problem, how it responds to that alert, how it fixes the problem, and how it communicates about the issue. The response depends on the specific problem and the severity of the issue. For example, a simple fix may be handled by the on-call engineer alone, or the response may involve several other teams and include customer and executive communication.

Your on-call process is part of your larger incident management process. On-call typically means each engineering team has a rotation, often with 7-day shifts, in which one engineer is alerted when a production system has an issue. The alerts are usually automated, but an engineer might also be paged by another team for help with an issue that spans multiple services.
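
As a rough illustration of how a 7-day rotation works, here is a minimal Python sketch that maps the current time to the engineer whose shift it is. The team list and start date are invented for the example; in practice a paging tool usually manages this for you.

```python
from datetime import datetime, timezone

# Hypothetical team and rotation start, purely for illustration.
ENGINEERS = ["alice", "bob", "carol", "dave"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # start of shift 0
SHIFT_DAYS = 7

def current_on_call(now: datetime) -> str:
    """Return the engineer whose 7-day shift covers `now`."""
    shifts_elapsed = (now - ROTATION_START).days // SHIFT_DAYS
    return ENGINEERS[shifts_elapsed % len(ENGINEERS)]

print(current_on_call(datetime.now(timezone.utc)))
```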

Who needs to be on-call?

Leadership may want Ops teams like SRE (Site Reliability Engineering), DevOps, or even a NOC (Network Operations Center) to be on-call. These Ops teams may be tasked with triaging issues or trying to fix them themselves before involving the software engineers who wrote the application. After all, these teams are the experts at incident response and infrastructure, so shouldn’t they be the first responders on-call?

No!

Ops teams should have an on-call rotation for systems they directly manage and be paged when those systems have issues. But they should not be first responders for the rest of engineering, for several reasons.

They probably don’t have detailed knowledge of every application’s code base, so they can’t fix the issue without involving those engineers anyway, making time to resolution that much longer. Ops teams can also get into the habit of quick fixes: restarting an application instead of fixing its memory leak, or working around a problem instead of addressing the underlying bug.

It’s a recipe for building resentment toward software engineering teams and triggering a lot of pages. It also puts a first-responder team in the position of having to convince other teams to change their priorities and fix the issues that are paging the first responders. Teams are now at odds with each other, and the team feeling the pain of the pages is not empowered to fix the problem and make things better for itself.

If you’ve read the Google SRE Book, you might be thinking you can give every team SLIs (Service Level Indicators) and SLOs (Service Level Objectives) and require them to stop their feature roadmap and fix the issues whenever they miss their SLOs. Yes, this is a good idea! But it still insulates the software engineers from the problems in their systems, reducing their ownership and putting the burden on a team that is not empowered to fix it.
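
For reference, the SLO mechanism itself is simple arithmetic. Here is a small sketch of an availability error budget; the target, window, and downtime figures are purely illustrative.

```python
# Illustrative SLO arithmetic: a 99.9% availability target over a 30-day window.
SLO_TARGET = 0.999                    # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60

# Error budget: the downtime the SLO allows within the window.
error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES   # ~43.2 minutes

# Suppose the service was down for 30 minutes this window (made-up figure).
observed_downtime_minutes = 30
budget_burned = observed_downtime_minutes / error_budget_minutes   # ~69%

print(f"Error budget: {error_budget_minutes:.1f} min, burned: {budget_burned:.0%}")

# A team that burns more than 100% of its budget has missed its SLO and, under
# the policy described above, would pause feature work to fix reliability first.
```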

Part of the reason this setup worked for Google is that the SRE teams had the political capital to make other teams fix their issues, or to hand operations back to the software engineering team when an application was behaving particularly badly. That kind of leverage is rare in most organizations, so instead the Ops teams just suffer, and so does product quality.

Be careful when adopting engineering practices from big companies like Google or Amazon: you don’t have the full context about why they arrived at that solution, and it may not apply to your organization.

There is a simpler solution. Put the software engineering team on-call for their own services.

When any engineers with code or systems in production are part of an on-call rotation, they are empowered to fix their own issues. They feel a greater sense of ownership and responsibility over their work and are less likely to blame others for problems.

In more traditional organizations, you often still see Ops teams deploying other engineers’ code. These Ops teams are also on-call to support production systems that they didn’t necessarily build, and they may be responsible for monitoring, alerting, and observability for those systems. They coordinate releases with the software engineering teams, and releases are often big and infrequent in these setups. The Dev team is mostly responsible for features, while the Ops team is responsible for infrastructure, reliability, and deployments.

This way of working often generates animosity between the Dev and Ops teams. Each blames the other when something goes wrong, and neither is fully empowered to own a system and truly fix it. The Ops team often doesn’t really know how the applications work and isn’t very familiar with the code base, though it can be quite good at monitoring, deploying, and maintaining the underlying infrastructure. The Dev team knows how the application works but often doesn’t know much at all about the infrastructure.

The term DevOps was coined to bring these teams together and stop the “us against them” mentality that crops up when neither team is fully empowered to own their parts of the system. It’s this lack of complete and clear ownership and responsibility that puts these teams at odds with each other.

The solution is to make responsibilities clear and have every member of the team take part in operational work. Engineers who write the code also test it, deploy it, monitor it, observe it, alert on it, and get paged when it breaks.

On-call and incident response best practices

Every team with a production service needs an on-call rotation that includes everyone responsible for the code or operations of that service. That team is also responsible for making sure its system is monitored and observable, so it is alerted to issues and can see how the system is performing, both to plan capacity and to diagnose problems.
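
What “alerted to issues” looks like varies by tool, but at its core it is a condition on a metric the team owns. Here is a minimal, tool-agnostic sketch; the threshold and request counts are invented for the example.

```python
# Decide whether to page the on-call engineer based on the recent error rate.
ERROR_RATE_THRESHOLD = 0.05   # example policy: page if more than 5% of requests fail

def should_page(total_requests: int, failed_requests: int) -> bool:
    """Return True when the observed error rate exceeds the alert threshold."""
    if total_requests == 0:
        return False
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD

# Example: 10,000 requests in the last 5 minutes, 800 of them failed -> page.
print(should_page(10_000, 800))   # True
```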

Any given team may depend on other teams’ services, and it should have agreements about the availability and performance of every service it depends on, because its reliability depends on the reliability of all of its downstream systems. Any team you depend on will also have an on-call rotation, monitoring, alerting, and observability. This does mean an engineer may be paged when one of those downstream systems is having an issue.
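
The math behind that last point is worth seeing once. Assuming roughly independent failures (a simplification), your best-case availability is bounded by the product of your own availability and that of every hard dependency; the figures below are illustrative.

```python
# Why downstream reliability matters: if your service needs two dependencies to
# be up at the same time, your effective availability is roughly the product
# of everyone's availability.
your_service = 0.999   # 99.9%
dependency_a = 0.999
dependency_b = 0.995

combined = your_service * dependency_a * dependency_b
print(f"Effective availability: {combined:.4f}")   # ~0.9930, i.e. about 99.3%
```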

Each team is ultimately responsible for keeping its service up for its customers. This often means working with the teams whose services you depend on when there are problems, or even calling the cloud provider when it is having an outage. The on-call engineers take responsibility for their service even when the cause of the outage is someone else’s system.

Improved product quality

With all of engineering on-call, you can often fix problems faster, but the real product quality improvements come when engineers start thinking about how to build applications that are easier to support. When engineers personally feel the challenge of supporting their software in production, and are literally woken up at night when it breaks, they have a strong incentive to design it to be reliable.

You’ll see engineering teams investing in better observability, better processes and tools for deploying software, and better tools for managing their systems. When engineering teams take ownership of their services like this, the result is a more reliable product.

To achieve this, engineers need to be directly responsible for the end-to-end reliability of their services and for the improvements and tooling required to support them. This means leaving room on the roadmap for operational improvements.

Engineering teams also need to have time in their product roadmaps for on-call duties and operational needs. An on-call engineer cannot be expected to contribute to a feature roadmap during their time on-call. When they are not actively handling incidents, they can spend that time helping customer teams or improving tooling to make operations and on-call better.

How Substrate can help with incident response

If you want more secure and reliable infrastructure, Substrate is an essential tool that will make your job easier on multiple levels. It sets you up with an AWS account architecture that cleanly isolates applications and environments so you can correctly scope access.

In the event of an incident, your team can use Substrate to access all of your AWS accounts, while also ensuring any changes are scoped to only the parts of your application that engineers intend to change. It’s important that whatever changes you make fix the problem rather than making it worse, and isolation is part of ensuring that.

“You should have lots of AWS accounts” explains in detail why and how this architecture helps.

You can create new accounts for each environment easily with Substrate and then have neatly isolated Terraform state, as described in “Terraform best practices for reliability at any scale”.

💡
Substrate provides you with a scalable and proven account and identity architecture so you can focus on building your product. Sounds like what you’re looking for? Get Started with Substrate. The Right Way to AWS.