One of the downsides of web-hosted software is that it’s always supposed to be available. Unfortunately, outages happen – maybe it’s because of a bug in the code, an infrastructure outage, or because an army of squirrels chewed through the power lines that feed into us-east-1.
For a lot of companies, that means having someone on-call to take care of issues quickly. How they handle on-call varies wildly from company to company.
What hours is someone on-call? Do you have one unlucky person on-call 24×7? Do you have people who work in multiple regions around the world and use a Follow The Sun pattern (and if you do, is on-call only active on weekends, with weekday issues handled as they arise)? Something else?
How long is someone on-call? Are they just on call once every X weekends because you’re able to Follow The Sun? Are they on call for an entire week? How many people are in your rotation?
What triggers an event? Are the triggers different during business hours vs after hours? Are the alerts so broad that an endpoint returning a 400-series error causes a page to go out? Do you trigger on disk space or processor load warnings? Do you trigger on the system going down?
What is the SLA expectation on response time? Do the SLA expectations change based on whether it’s during or outside of business hours?
Do the people on-call do production work during their time on call?
If on-call gets activated after hours, do you give the person time off to compensate them?
How often does on call get activated?
What happens after an on-call issue is resolved?
What SHOULD (Probably) Be Happening
If you are able to, you should probably Follow The Sun. This means you can have on-call coverage without someone being woken up in the middle of the night to figure out what the heck is going on at 2am. Unfortunately, this isn’t always possible – especially in smaller companies.
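To make that concrete, here’s a minimal sketch of what a follow-the-sun handoff amounts to: the pager simply belongs to whichever regional team is awake. The region names and handoff hours below are made up for illustration – adjust them to wherever your teams actually sit.

```python
from datetime import datetime, timezone

# Hypothetical handoff boundaries (UTC hours) for three regional teams.
HANDOFFS = [
    (0, 8, "APAC"),       # 00:00-07:59 UTC
    (8, 16, "EMEA"),      # 08:00-15:59 UTC
    (16, 24, "Americas"), # 16:00-23:59 UTC
]

def team_on_call(now: datetime | None = None) -> str:
    """Return which regional team holds the pager at the given moment."""
    now = now or datetime.now(timezone.utc)
    for start, end, team in HANDOFFS:
        if start <= now.hour < end:
            return team
    raise RuntimeError("handoff table does not cover this hour")

if __name__ == "__main__":
    print(team_on_call())  # e.g. "EMEA" at 10:00 UTC
```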
At worst, your on call rotation shouldn’t last more than a week.
What triggers an event should probably vary based on whether your on-call person is currently working business hours or it’s an after-hours page. Actual production emergencies should always page. After hours, the only things that should page are emergencies and possibly suspicious-activity alerts.
If you’re running cloud native apps, a lot of your issues should be taken care of via automated processes, so you’re likely going to be down to emergency issues, alerts for suspicious traffic patterns, etc.
If you’re running things on actual physical infrastructure (yes, those still exist – especially inside large orgs with legacy tech or when dealing with sensitive data), things get a little trickier. You’re also going to have to deal with disk space and processor usage alerts and all of the headaches that come with managing hardware.
I was once on a team that sent an alert email during business hours whenever any of the web apps it ran logged a 400-series error. Don’t do this. For the love of your team’s sanity, don’t do this. There are a lot of reasons a 4xx return code might be sent back, and most of them aren’t a problem with your system (this is especially true if you’re running APIs that are exposed to consumers other than yourself).
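If you want a picture of the alternative, here’s a rough sketch of the kind of triage logic I’d rather see: only a sustained rate of 5xx errors pages anyone, and a flood of 4xx errors is, at most, a daytime warning. The thresholds are placeholders, not recommendations.

```python
from collections import Counter

# Hypothetical thresholds -- illustrative, not gospel.
PAGE_5XX_RATE = 0.05   # page if more than 5% of requests are server errors
WARN_4XX_RATE = 0.50   # a flood of client errors is worth a daytime warning at most

def triage(status_codes: list[int]) -> str:
    """Classify a window of HTTP status codes as 'page', 'warn', or 'ok'."""
    if not status_codes:
        return "ok"
    counts = Counter(code // 100 for code in status_codes)
    total = len(status_codes)
    if counts[5] / total > PAGE_5XX_RATE:
        return "page"  # the service itself is failing
    if counts[4] / total > WARN_4XX_RATE:
        return "warn"  # probably bad client input; look at it during the day
    return "ok"

# A lone 404 in otherwise healthy traffic should never wake anyone up.
assert triage([200] * 99 + [404]) == "ok"
assert triage([200] * 90 + [500] * 10) == "page"
```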
Your SLA on response time with your team should also probably vary based on the severity of the issue and when it happens. Part of this is going to depend on how it impacts your business/org. If it’s after hours and an outage isn’t really going to cost you much in terms of either money or reputation, is it really going to matter if it’s 20 minutes before your person is able to log in and resolve the issue?
During the day (provided that’s when most of your business load happens), yes, the expected turnaround time should be pretty quick. Don’t expect your on-call person to respond within 5 minutes on the weekend unless lives are at stake, though (and, yes, I’ve been on a team where they expected you to respond within 5 minutes. Everyone hated being on-call on that team).
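One way to keep those expectations honest is to write them down somewhere unambiguous. The sketch below is a hypothetical lookup table of response-time targets by severity and time of day – the numbers are examples, not a recommendation.

```python
from datetime import datetime

# Hypothetical response-time targets in minutes, by severity and time of day.
# Pick values that reflect what an outage actually costs your business.
RESPONSE_SLA_MINUTES = {
    ("critical", "business_hours"): 15,
    ("critical", "after_hours"):    30,
    ("minor",    "business_hours"): 60,
    ("minor",    "after_hours"):    240,  # it can wait until morning
}

def response_sla(severity: str, when: datetime) -> int:
    """Look up how quickly someone is expected to acknowledge an alert."""
    # Weekdays 9-5 count as business hours in this sketch; adjust to taste.
    is_business = when.weekday() < 5 and 9 <= when.hour < 17
    window = "business_hours" if is_business else "after_hours"
    return RESPONSE_SLA_MINUTES[(severity, window)]

print(response_sla("minor", datetime(2024, 6, 8, 3)))  # Saturday 3am -> 240
```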
If you have a dedicated person handling on-call pages during business hours each week, that person should NOT be doing production work. Don’t have them working on upcoming features. Instead, leverage the fact that they’re already in a troubleshooting mindset by having them focus on finding and improving things that already exist in the system.
In the long run, this will result in a more stable environment that (hopefully) will lead to even fewer alerts. That’s a good thing.
If you page someone after hours, make it up to them. If you page them at 3am, you’ve ruined their sleep. If you page them on the weekend, you’ve eaten into time they need to live their lives. At the very least, give them a comp day in return.
It may not make up for the lost night’s sleep, but it will help your people recover. It will also serve as a disincentive to page people at all and an incentive to fix the actual root cause instead. I’ve seen too many companies (many of them large orgs) that just shrugged as their systems never improved and their people stayed exhausted, because in their view it didn’t cost them anything (who counts people leaving in those calculations, after all?).
After hours on call pages should be rare. Ideally it should never happen, but if it happens more than once or twice a quarter (at MOST), you have some serious problems. Either you’re paging people for every “error” in a system or your system is so unstable that you shouldn’t be doing any new feature work until you sort your crap out. Full stop.
Sometimes you have the worst of both worlds. The team I mentioned before would send at least one on-call page almost every weekend (no joke), and roughly 3 in 4 pages were for non-issues or things that should only have been a warning in the system. Real emergencies also occurred too often, but the root causes were never fixed “because reasons”. People literally got anxious in the week leading up to their on-call shift.
After an on-call issue is dealt with, you need to resolve the root cause of the issue. As a general rule, the person taking care of the issue at the time is only expected to do enough to get the system back into a functional state. The underlying problem is still going to be there.
Do a root cause analysis of the issue during normal business hours. The emergency is over and this is now an operational issue. If there’s an issue that you can pinpoint, fix it. No excuses, unless fixing it will cause your business to go under because you’re on the ragged edge funding-wise, or you’re literally decommissioning the service in question tomorrow.
If you can’t find the root cause of the issue (which happens sometimes), are there things that you can monitor which will alert you to the likelihood that it’s going to happen again? The error may not occur exactly the same way again, but even intermittent issues generally follow a pattern if you can figure out what it is.
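Even something as simple as counting how often the suspect error signature shows up in a sliding window can act as an early-warning signal. Here’s a minimal sketch of that idea; the window size and threshold are guesses you’d tune as you learn the pattern.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

# A sliding-window counter for an error signature you suspect is a precursor
# to the incident. WINDOW and THRESHOLD are placeholders to tune over time.
WINDOW = timedelta(hours=6)
THRESHOLD = 10

class RecurrenceWatch:
    def __init__(self) -> None:
        self.timestamps: deque[datetime] = deque()

    def record(self, when: datetime) -> bool:
        """Record one occurrence; return True if the pattern looks like it's building."""
        self.timestamps.append(when)
        # Drop anything that has fallen out of the window.
        while self.timestamps and when - self.timestamps[0] > WINDOW:
            self.timestamps.popleft()
        return len(self.timestamps) >= THRESHOLD

watch = RecurrenceWatch()
now = datetime.now(timezone.utc)
hits = [watch.record(now + timedelta(minutes=i)) for i in range(12)]
print(hits.index(True))  # the 10th occurrence (index 9) trips the warning
```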
Nobody likes being on call, but with some planning and work, it doesn’t have to be terribly stressful to be the one carrying the pager for the week. Doing things well will not only help make your systems more stable, but will decrease the chance that your people walk out the door because they’re too exhausted and burned out to function.