The world depends on always-on services more than ever before. An outage can affect millions of people, with real impact: they can’t pay their bills, they can’t book their flights, they can’t video call with their friends.
And whether you’re dealing with a major bug, facing capacity issues, or down completely, customers who depend on your services expect an immediate response. (The same is true for internal teams.)
Incidents can have a real impact not only in dollar terms — they cost businesses $700 billion per year in North America alone — but also on the reputation of your company, your product, and your team.
With so much at stake, organizations have turned to putting IT and developer teams on call to make sure the right people are available to address a problem during an incident, no matter when one occurs.
A fair on-call schedule, coupled with an on-call compensation plan, can even foster a culture of shared responsibility and help your teams learn more about what it takes to make resilient software and services, making for a better overall product and fewer outages.
On call is the practice of designating specific people to be available at specific times to respond in the event of an urgent service issue, even though they are not formally on duty.
On call is a critical responsibility inside many IT, developer, support, and operations teams that run services where customers expect 24/7 availability. Team members take turns staffing an on-call rotation, either providing coverage around the clock or only outside of normal business hours. Supported by automated monitoring and alerting solutions, the on-call engineer is empowered to respond immediately to any interruptions to service availability.
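As a rough illustration of what "taking turns" can look like, here is a minimal Python sketch of a weekly round-robin rotation. The function name `weekly_rotation`, the team names, and the one-week shift length are all assumptions made for the example; real schedules usually also handle time zones, overrides, and business-hours-only coverage.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers, start, weeks):
    """Assign one engineer per week, cycling through the team in order.

    Hypothetical helper: returns a list of (shift_start_date, engineer) pairs.
    """
    schedule = []
    for week, engineer in zip(range(weeks), cycle(engineers)):
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, engineer))
    return schedule


# Illustrative team and start date (not taken from the article).
schedule = weekly_rotation(["Ana", "Ben", "Chloe"], date(2024, 1, 1), weeks=4)
for shift_start, engineer in schedule:
    print(shift_start.isoformat(), engineer)
```

With three people and four weeks, the first engineer comes up again in week four, which is the kind of spacing a team can inspect when judging whether a rotation feels fair.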
Sometimes on-call work gets a bad rap. Some veteran IT workers have horror stories about working on teams that were stretched too thin and didn't get the support they needed to properly respond to incidents.
A lot of that anxiety can be alleviated if on-call support is done right. With an effective on-call plan, you can ensure your team scales to match expanding services, provides consistent coverage for critical IT functions, and responds promptly to incidents.
There are more benefits to a good on-call management plan than just getting through downtime. With each failure, teams get the opportunity to learn new skills, like understanding a critical service a little better, seeing how it responds to failure, and knowing how to design for fewer failures or improve the incident response plan.
And having a good on-call program built on a culture of shared responsibility can also lead to improved camaraderie and less burnout, which in turn can mean higher employee retention.
In organizations that practice DevOps, software teams are taking a lot of the responsibility for the reliability and availability of the services they build, a job that used to be the exclusive domain of operations teams. For many of these teams “you build it, you run it” is the new motto. Being most familiar with the code, developers are often the ones who can best troubleshoot issues in the shortest amount of time.
And, through this process, developers build better software that is actually less likely to fail. With this shift in responsibility, they test their code more rigorously, since they may in fact be the ones brought in during off hours if the service has issues.
The result is more resilient systems and, with more people available and able to take on incidents, fewer burned-out workers.
Without a good on-call program, organizations will fail to realize all the cultural benefits of DevOps or meet the demands of a scaling infrastructure. If one team bears more of the burden of responding to incidents than another, it won’t have the capacity to do its day job well. Developers won’t get to implement the feedback that comes from incidents, and incident responders won’t have the capacity to fortify their systems.
If the responsibilities are lopsided, those people slated for the on-call schedule are never really able to detach from work and can easily succumb to burnout.
But a plan that takes into consideration the org’s true coverage requirements, balances the time burden across the developer and IT ops teams, and captures data for continuous improvement can lead to benefits all around. It will not only lead to a better service for customers, it can also help employees improve their skills and their product and actually look forward to putting in on-call hours.
“I can’t wait to spend my evening overseeing this deployment and responding to potential outages!” —said no engineer, ever.
With more developers taking on the role of maintaining the services they build, it’s important to make sure they are prepared for their on-call responsibilities, and the best time to assess this is during the hiring process.
Now, it’s no secret that there is intense competition for top engineering talent. And not everyone is motivated by money alone, so throwing more pay at devs for after-hours work may not close the deal (more about on-call compensation later on). Software engineers in the interview process will naturally have questions about how often they’ll need to take time out of their personal lives to be on the on-call schedule.
Demonstrating that you have a documented on-call plan that spreads responsibilities out fairly across a competent team of developers and SREs can go a long way in reassuring new recruits that your organization has its on-call management under control. With a documented plan you can be completely transparent in the interview process and make sure candidates are ready for the commitment to on-call work.
It isn’t just developers spending more time on call. Increasingly for IT support and IT service teams, around-the-clock support is critical to helping the business function.
These teams face a lot of the same challenges as developers on call: stress, burnout, unclear roles and responsibilities, access to tooling.
IT teams also have the added stress of often being in the same building as their customers, who can slow things down with a flood of interruptions (email, Slack, even in-person visits) about the incident.
Here are a few tactics to help keep IT incidents manageable:
A good on-call compensation plan rewards your employees for their expertise and time spent working after hours. If employees feel well cared for, they will, in turn, care about the business and contribute to its success.
According to the U.S. Fair Labor Standards Act (FLSA), a federal law that sets minimum wage, overtime, and minimum age requirements for employers and employees, an employee who is on call but free to do as they wish with their time is considered “waiting to be engaged” and therefore isn’t working.
If someone has their free time restricted and can’t do as they wish during their off hours, then according to the FLSA this on-call time may be considered “hours worked” and be eligible for compensation.
Your local laws may vary, so be sure to consult an expert. From there, aim for an on-call compensation plan that’s competitive and fair, and that supports a culture of shared responsibility.
Incentivized on-call compensation plans reward employees who raise their hands to work on-call hours with extra days off, flexible hours, higher base salaries, or some combination of these things.
The advantage to this approach to on-call compensation is an increased sense of ownership over the services, which can lead to more resilient systems.
And giving ample time off and paying competitively also lets employees know their work is valued and appreciated, preventing burnout and reducing turnover.
Paid on-call compensation means employees are directly compensated for the time they spend on call or scheduled to work, even if no issues arise during their shift.
The obvious advantage of this model of on-call compensation is the tangible incentive. Knowing you are getting paid for carrying a pager (or, more likely, a laptop and a cell phone) makes it easier to justify the burden of being on call and available, even if no issues arise.
Another approach to on-call compensation is paying employees only when they work on an incident. Some ways to calculate this are:
The advantage to this model is that employees are paid for the extra work they put in outside of normal business hours. A potential drawback is that there is a financial disincentive to reducing alerts and issues, which could compromise the overall integrity of the systems.
This is a combination of the two previous models. Some companies pay both for being on the on-call schedule and an additional amount for alerts received and issues worked. The upside to this on-call compensation model is that employees feel well compensated for the extra time and effort that the organization asks of them. Additionally, if someone gets stuck with a particularly difficult issue that eats into their personal time, they’re financially compensated for their sacrifice. But again, consider whether it makes sense in your company culture to create an indirect reward for having bugs in the software.
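To make the difference between the paid, per-incident, and combined models concrete, here is a small Python sketch that computes pay for one shift under each. The class `OnCallShift`, the function names, the hourly rates, and the shift figures are hypothetical values chosen for illustration, not recommendations or figures from any real compensation plan.

```python
from dataclasses import dataclass


@dataclass
class OnCallShift:
    """One engineer's on-call shift (illustrative fields only)."""
    hours_on_call: float   # hours spent on the schedule, carrying the pager
    incident_hours: float  # hours spent actively working incidents


def paid_on_call(shift, standby_rate):
    """Pay for every scheduled on-call hour, even if no issue arises."""
    return shift.hours_on_call * standby_rate


def paid_per_incident(shift, incident_rate):
    """Pay only for the time spent working incidents."""
    return shift.incident_hours * incident_rate


def combined(shift, standby_rate, incident_rate):
    """Pay for being on the schedule plus extra for incident work."""
    return paid_on_call(shift, standby_rate) + paid_per_incident(shift, incident_rate)


# Assumed example: 64 on-call hours, 3 of them spent on incidents,
# with made-up rates of 5.0 per standby hour and 60.0 per incident hour.
shift = OnCallShift(hours_on_call=64, incident_hours=3)
print(paid_on_call(shift, standby_rate=5.0))         # 320.0
print(paid_per_incident(shift, incident_rate=60.0))  # 180.0
print(combined(shift, 5.0, 60.0))                    # 500.0
```

Note that a quiet shift pays nothing under the per-incident model, which is exactly the trade-off described above: only the models with a standby component reward availability on its own.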
These are the typical models for on-call compensation plans. Some other things to consider, as appropriate, are:
This number is critical to determine if you need on-call schedule coverage after business hours, or a special on-call team during business hours.
The complexity and importance of your organization’s incidents can vary. An on-call engineer may spend a couple of minutes on an issue or could spend the entire night firefighting an incident. The amount of time and effort put in during a typical on-call shift should be measured and taken into consideration so that compensation is fair.
Enforced by escalation policies, time to acknowledge is critical for fast resolution. Measuring the mean time to acknowledge and resolve over a period of time helps managers decide on additional incentives.
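As a minimal sketch of how these averages might be computed, the Python below derives mean time to acknowledge and mean time to resolve from a list of incident records. The record fields (`triggered`, `acknowledged`, `resolved`), the helper `mean_minutes`, and the sample timestamps are assumptions for illustration; in practice the data would come from an incident management tool's export.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Made-up sample incidents for the example.
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 2, 0),
        "acknowledged": datetime(2024, 1, 1, 2, 4),
        "resolved": datetime(2024, 1, 1, 2, 50),
    },
    {
        "triggered": datetime(2024, 1, 3, 14, 0),
        "acknowledged": datetime(2024, 1, 3, 14, 6),
        "resolved": datetime(2024, 1, 3, 14, 30),
    },
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")  # (4 + 6) / 2 = 5.0
mttr = mean_minutes(incidents, "triggered", "resolved")      # (50 + 30) / 2 = 40.0
print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```

Here resolution time is measured from the moment the alert triggered; some teams measure it from acknowledgement instead, so it is worth stating which definition is used before comparing numbers across teams.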
With the right tools, navigating on-call policies can be a smoother process. Managing on-call schedules, monitoring alerts, and maintaining employee satisfaction and health are all easier with better incident management solutions. Jira Service Management's alerting capabilities enable teams to centralize and filter alerts across all monitoring, logging, and CI/CD tools to ensure quick response while avoiding alert fatigue.