Platform Teams That Stop Measuring Uptime Start Reducing Toil
Platform teams love a good uptime dashboard. A green 99.99% feels like a badge of honor. But that number often conceals a dirty secret: teams are burning out on toil while chasing nines. The Google SRE book defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." Yet many platform teams remain fixated on uptime as their north star, ignoring the creeping cost of manual work. It's time to flip the metric.
Uptime Is the Wrong North Star
Teams chasing 99.999% availability often drown in alert fatigue. Every minor blip triggers a page, and engineers spend hours investigating false positives. The pursuit of perfection incentivizes over-engineering resilience for services that don't need it, while the real cost—developer time wasted—goes unmeasured.
The Google SRE book calls this out directly, quoting Google SRE Carla Geisser at the top of its chapter on eliminating toil: "If a human operator needs to touch your system during normal operations, you have a bug." But many organizations treat manual intervention as a feature, not a bug. They celebrate uptime while ignoring that each deploy requires a human to babysit a script.
Uptime hides the cost of toil because it's a binary measure: the site is either up or down. It doesn't capture the hours spent on manual config changes, waiting for builds, or triaging low-severity alerts. A system with 99.99% uptime might still consume 30% of an engineer's week on rote tasks.
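To make the contrast concrete, the arithmetic can be sketched in a few lines of Python. The 40-hour week and 46 working weeks per year are illustrative assumptions, not figures from any report, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def toil_hours_per_year(toil_share: float,
                        hours_per_week: float = 40.0,   # assumed work week
                        working_weeks: int = 46) -> float:  # assumed weeks worked
    """Hours per year one engineer spends on toil at the given share of time."""
    return toil_share * hours_per_week * working_weeks


downtime = allowed_downtime_minutes(0.9999)
toil = toil_hours_per_year(0.30)
print(f"99.99% uptime allows about {downtime:.1f} minutes of downtime per year")
print(f"30% toil costs about {toil:.0f} hours per engineer per year")
```

Under these assumptions, the dashboard tracks a budget of under an hour of downtime per year, while the unmeasured toil runs to roughly 552 hours per engineer over the same period.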
Some teams push back, arguing that uptime is what business stakeholders understand. But that's a cop-out. The real metric should be developer time spent on value-adding work versus operational overhead. If a platform team can't articulate that trade-off, they'll keep optimizing the wrong thing.
The Toil Tax Nobody Counts
Manual deploy steps multiply outage risk. A deployment that requires five manual commands and a config file edit is a human error waiting to happen. Yet many teams treat this as normal. Charity Majors, co-founder of Honeycomb, has called toil "the invisible tax on your engineering organization." It's not tracked in sprint metrics, so it grows unchecked.
Consider a single config change that touches multiple environments. An engineer might spend half a day updating YAML files, running tests, and coordinating with peers. Multiply that by the number of services a platform team owns, and the cost adds up quickly: rolling the same change out across four services consumes two full engineer-days. That's time not spent on automation or product improvements.
The 2024 DevOps Research and Assessment (DORA) report estimates that elite teams spend less than 10% of their time on toil, while low performers can spend over 30%. That gap translates directly to shipping velocity and engineer satisfaction. Yet many organizations don't measure toil at all—they only track uptime and incident count.
Pager fatigue is another hidden cost. When every alert is treated as urgent, engineers stop responding to pages. A study of on-call practices found that teams with high toil rates experience higher burnout and turnover. The toil tax compounds: tired engineers make mistakes, creating more toil.
When Platform Teams Flip the Metric
Shifting from uptime to time-to-restore changes the conversation. Instead of asking "Is the system up?" teams ask "How quickly can we recover from failure?" This reframe reduces the incentive to paper over problems with manual workarounds. Mean time to acknowledge (MTTA) becomes a leading indicator of toil.
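Both numbers are easy to derive from incident timestamps. The sketch below assumes a hypothetical `Incident` record with three timestamps; real incident tools expose similar fields under different names, so treat the field names as placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    opened: datetime        # when the alert fired
    acknowledged: datetime  # when a human picked it up
    restored: datetime      # when service was back to normal


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60.0


def mtta_minutes(incidents: list[Incident]) -> float:
    """Mean time to acknowledge: alert fired -> human acknowledged."""
    return mean_minutes([i.acknowledged - i.opened for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to restore: alert fired -> service restored."""
    return mean_minutes([i.restored - i.opened for i in incidents])


# Illustrative data only.
incidents = [
    Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 4),
             datetime(2024, 3, 1, 10, 40)),
    Incident(datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 10),
             datetime(2024, 3, 2, 15, 20)),
]
print(f"MTTA: {mtta_minutes(incidents):.1f} min, MTTR: {mttr_minutes(incidents):.1f} min")
```

Tracking the two side by side shows whether recovery time is dominated by noticing the problem or by the manual steps that follow.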
The Spotify model of squad autonomy is a good example. By giving teams ownership of their services and the tools to deploy independently, Spotify reduced the toil associated with handoffs and coordination. Each squad could focus on its domain without waiting for a central platform team to approve changes.
Netflix's chaos engineering approach also surfaces toil before it becomes a crisis. By intentionally breaking systems, they uncover brittle manual steps that engineers rely on. A service that requires manual restart after a failure is a toil magnet. Chaos engineering forces teams to automate those recovery paths.
Internal developer portals (IDPs) are another tool. By providing self-service actions for common tasks—provisioning a database, rolling back a deploy, viewing logs—platform teams cut the manual work that creates toil. As noted in a related article, tracking tool adoption weekly helps ensure these portals actually reduce toil rather than add another layer.
Case Study: Etsy Deployinator
Dan McKinley's work on Etsy's Deployinator is a classic example of toil reduction. Before Deployinator, a single deploy could take 45 minutes of manual steps: SSH into boxes, copy files, run scripts, and hope nothing broke. Engineers dreaded deployment windows. McKinley built a one-click deploy system that cut that time to 30 seconds.
The impact was dramatic. Engineers started shipping code 50 times more often. Uptime remained flat—Etsy didn't become less reliable—but incidents fell because deployments were small, frequent, and reversible. The toil reduction unlocked innovation: teams experimented more because the cost of failure was low.
Deployinator wasn't a silver bullet. It required investment in automation and a cultural shift from gatekeeping to trust. But it proved that reducing toil doesn't harm reliability. In fact, it improves it by making recovery paths explicit and repeatable.
Other companies have followed similar paths. Amazon's internal tooling, like the "deploy to production" button, enabled thousands of daily deployments. The key insight: when engineers can deploy safely and quickly, they spend less time on process and more on building.
Additional Case Study: Stripe's Manual Approval Bottleneck
Stripe, the online payments company, faced a different kind of toil: manual approval workflows for infrastructure changes. Before they automated, every database schema change required a senior engineer to review and approve. That bottleneck caused lead times of up to two weeks for simple migrations. Stripe invested in automated linting, canary deployments, and rollback scripts. The result: lead time dropped to hours, and the senior engineers could focus on architecture instead of rubber-stamping approvals. This case shows that toil isn't always about clicking buttons—it can also be about waiting for human gates. By measuring the time spent in review queues, Stripe identified the toil and automated the low-risk decisions.
Trade-Off: When Automation Introduces New Toil
Not every automation effort reduces toil. Sometimes, building and maintaining automation creates its own form of toil. For example, a team that writes a complex orchestration script may spend as much time debugging the script as they would have spent on the manual steps. This is known as the "automation tax." A study by the USENIX Association found that teams that over-automate early often face higher incident rates because the automation itself is brittle. The trade-off is clear: invest in automation only for high-frequency, well-understood tasks. For rare or unpredictable tasks, a manual runbook with good documentation may be more efficient. Platform teams should measure not just the toil eliminated, but also the toil introduced by the automation tooling itself. If an internal tool requires constant patching, configuration, and user support, it may be adding toil rather than removing it.
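One way to weigh toil eliminated against toil introduced is a simple break-even estimate. The sketch below is a back-of-the-envelope model with made-up numbers; the function name, the 12-month horizon, and the inputs are all assumptions for illustration.

```python
def net_hours_saved(runs_per_month: float,
                    manual_minutes_per_run: float,
                    build_hours: float,
                    upkeep_hours_per_month: float,
                    horizon_months: int = 12) -> float:
    """Hours saved by automating a task, minus the hours the automation costs.

    A negative result means the automation adds more toil than it removes
    over the chosen horizon.
    """
    saved = runs_per_month * manual_minutes_per_run / 60.0 * horizon_months
    spent = build_hours + upkeep_hours_per_month * horizon_months
    return saved - spent


# A frequent deploy step: 40 runs a month at 30 manual minutes each.
frequent = net_hours_saved(40, 30, build_hours=40, upkeep_hours_per_month=2)

# A rare task: once a quarter, two manual hours each time.
rare = net_hours_saved(1 / 3, 120, build_hours=40, upkeep_hours_per_month=2)

print(f"frequent task: {frequent:+.0f} hours over a year")
print(f"rare task:     {rare:+.0f} hours over a year")
```

With these inputs the frequent task pays back its automation cost several times over, while the rare task never does, which matches the rule of thumb above.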
Counter-Argument: Is All Toil Bad?
Some engineers argue that a certain amount of manual work is beneficial because it forces human judgment and prevents catastrophic errors. For example, a manual approval for a database delete operation can be a safety net. The counter-argument is that toil, by definition, is devoid of enduring value—but not all manual work is toil. The key is to distinguish between manual work that requires expertise and manual work that is rote. The former should be preserved; the latter should be automated. A reasonable approach is to classify tasks by frequency and complexity. High-frequency, low-complexity tasks are prime candidates for automation. Low-frequency, high-complexity tasks may benefit from human oversight. Platform teams should avoid dogmatic elimination of all manual steps and instead use data to decide where automation pays off.
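The frequency-and-complexity classification can be expressed as a small triage function. This is a sketch, not a standard: the threshold of four runs per month and the `Decision` labels are assumptions a team would tune to its own data.

```python
from enum import Enum


class Decision(Enum):
    AUTOMATE = "automate"
    KEEP_HUMAN_REVIEW = "keep human review"
    EVALUATE = "evaluate case by case"


def triage(runs_per_month: float,
           needs_expert_judgment: bool,
           frequent_threshold: float = 4.0) -> Decision:
    """Classify a manual task by how often it runs and whether it needs expertise."""
    frequent = runs_per_month >= frequent_threshold
    if frequent and not needs_expert_judgment:
        return Decision.AUTOMATE           # high frequency, low complexity
    if not frequent and needs_expert_judgment:
        return Decision.KEEP_HUMAN_REVIEW  # low frequency, high complexity
    return Decision.EVALUATE               # mixed cases need a closer look


print(triage(20, needs_expert_judgment=False).value)   # routine restart
print(triage(0.1, needs_expert_judgment=True).value)   # production database delete
print(triage(20, needs_expert_judgment=True).value)    # frequent but judgment-heavy
```

The mixed quadrants are deliberately left as "evaluate" rather than forced into a bucket, since those are the cases where reasonable people disagree.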
Three Signals Your Team Is Mismeasuring
If your sprint velocity reliably drops after each on-call rotation, you're measuring the wrong thing. On-call should be a stabilizing force, not a drain. A pattern of post-on-call slowdown indicates that engineers are spending their recovery time cleaning up manual work that should have been automated.
New hires taking six months to become productive is another red flag. If the learning curve is steep because of tribal knowledge about manual steps, that knowledge is toil. Good platform teams document and automate so that new engineers can contribute within weeks. As one article put it, platform teams cut on-call by measuring internal tool dead code—eliminating unused scripts reduces confusion and toil.
Frequent "small" incidents that get ignored are a third signal. A service that crashes every week but auto-restarts is still generating toil if an engineer has to investigate each time. If the runbook is outdated or missing, that's a sign that toil is being tolerated rather than eliminated. Engineers who dread deployment windows are telling you the process is too manual.
How to Start Measuring Toil
Start by tagging every task as either toil or project work. Use a simple classification: Is this manual? Is it repetitive? Could it be done without human judgment? If yes to all three, it's toil. Track the time spent on these tasks per week. Many teams are surprised to find that 20-30% of their sprint capacity goes to toil.
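A tagging scheme like this takes very little code. In the sketch below a task counts as toil when it is manual, repetitive, and does not depend on human judgment, following the SRE-book sense of the term; the `Task` fields and the sample tasks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    minutes: int
    manual: bool
    repetitive: bool
    needs_judgment: bool


def is_toil(task: Task) -> bool:
    """Toil is manual, repetitive work that needs no real human judgment."""
    return task.manual and task.repetitive and not task.needs_judgment


def toil_share(tasks: list[Task]) -> float:
    """Fraction of logged minutes that went to toil (0.0 if nothing was logged)."""
    total = sum(t.minutes for t in tasks)
    if total == 0:
        return 0.0
    return sum(t.minutes for t in tasks if is_toil(t)) / total


# Illustrative week of logged work.
week = [
    Task("restart legacy worker", 90, manual=True, repetitive=True, needs_judgment=False),
    Task("rotate certificates by hand", 60, manual=True, repetitive=True, needs_judgment=False),
    Task("architecture review", 120, manual=True, repetitive=False, needs_judgment=True),
    Task("build self-service rollback", 330, manual=False, repetitive=False, needs_judgment=True),
]
print(f"toil share this week: {toil_share(week):.0%}")
```

Keeping the three questions as explicit fields makes disagreements visible: a team can argue about whether a task really needs judgment instead of arguing about the final label.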
Set a service-level objective (SLO) for toil minutes per week. Google's SRE organization states a goal of keeping toil below 50% of each SRE's time, which is best read as a ceiling rather than a target; elite teams aim for under 10%. Cap on-call hours per person to prevent burnout. If a team can't meet its toil budget, it needs to automate.
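A toil budget can be checked the same way an error budget is. The sketch below converts the 10% target and 50% ceiling into weekly minutes, assuming a 40-hour week; the function names and status labels are illustrative, not part of any SLO tooling.

```python
def toil_budget_minutes(share: float, hours_per_week: float = 40.0) -> float:
    """Convert a share of the work week into a weekly budget in minutes."""
    return share * hours_per_week * 60.0


def budget_status(toil_minutes: float,
                  target_share: float = 0.10,
                  ceiling_share: float = 0.50) -> str:
    """Compare one engineer's logged toil minutes for a week against the budget."""
    if toil_minutes <= toil_budget_minutes(target_share):
        return "within target"
    if toil_minutes <= toil_budget_minutes(ceiling_share):
        return "over target"
    return "over ceiling"


for logged in (200, 600, 1300):
    print(f"{logged} toil minutes -> {budget_status(logged)}")
```

A week that lands "over target" is a prompt to schedule automation work; a week "over ceiling" is closer to an incident in its own right.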
Use the DORA metrics as a starting point: deploy frequency, lead time for changes, time to restore service, and change failure rate. These correlate with toil. If lead time is long because of manual approvals, that's toil. If time to restore is high because of complex runbooks, that's toil.
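The four metrics can be approximated from a list of deploy records. The sketch below assumes a hypothetical `Deploy` record and uses medians for the time-based metrics; how a team defines "committed", "failed", and "restored" is a local decision that the code does not settle.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed: datetime                  # when the change was merged
    deployed: datetime                   # when it reached production
    failed: bool = False                 # did it cause a production failure?
    restored: Optional[datetime] = None  # when service recovered, if it failed


def dora_summary(deploys: list[Deploy], period_days: int) -> dict:
    """Summarize deploy frequency, lead time, failure rate, and restore time."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deploy and a positive period")
    lead_hours = [(d.deployed - d.committed).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_minutes = [
        (d.restored - d.deployed).total_seconds() / 60
        for d in failures
        if d.restored is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        "median_restore_minutes": median(restore_minutes) if restore_minutes else None,
    }


# Illustrative data: four deploys over two days, one of which failed.
deploys = [
    Deploy(datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 11, 0)),
    Deploy(datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 14, 0)),
    Deploy(datetime(2024, 3, 5, 8, 0), datetime(2024, 3, 5, 14, 0),
           failed=True, restored=datetime(2024, 3, 5, 14, 45)),
    Deploy(datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 5, 15, 0)),
]
summary = dora_summary(deploys, period_days=2)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A long lead time on an otherwise small change, like the last record here, is the kind of outlier worth tracing back to a manual approval or a queue.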
Some teams push back, arguing that not all manual work is toil. They're right: a unique database migration that runs once a year isn't toil, because it isn't repetitive. But if the same migration keeps recurring and still requires a human to type commands, it should be automated. The goal isn't to eliminate all manual work, but to make it visible and intentional.
Measuring Toil with Specific Metrics
Beyond DORA, teams can adopt toil-specific metrics. One is "time spent on manual operations per service per week." This can be tracked via time logs or by analyzing ticket tags. Another is "number of manual steps in a deployment pipeline." A simple count can highlight processes that need automation. A third is "percentage of alerts that require human action." If 90% of alerts are noise, the alerting system itself is generating toil. Some teams use a "toil tracker" spreadsheet where engineers log 15-minute increments of manual work. Over a quarter, this data reveals patterns. For example, a team might discover that 40% of toil comes from a single legacy service that requires manual restarts. That insight justifies a focused automation project.
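A toil tracker spreadsheet exported as rows can be summarized with a few lines of Python. The sketch below assumes each row is a `(service, increments)` pair where one increment is 15 minutes; the service names and counts are invented to mirror the 40% example above.

```python
from collections import Counter

MINUTES_PER_INCREMENT = 15


def toil_share_by_service(entries: list[tuple[str, int]]) -> dict[str, float]:
    """Return each service's share of logged toil, largest first."""
    minutes: Counter = Counter()
    for service, increments in entries:
        minutes[service] += increments * MINUTES_PER_INCREMENT
    total = sum(minutes.values())
    if total == 0:
        return {}
    return {service: m / total for service, m in minutes.most_common()}


def actionable_alert_rate(alerts_fired: int, alerts_needing_action: int) -> float:
    """Fraction of alerts that actually required a human to do something."""
    return alerts_needing_action / alerts_fired if alerts_fired else 0.0


# Illustrative quarter of tracker entries.
entries = [
    ("legacy-billing", 8),
    ("api-gateway", 12),
    ("legacy-billing", 8),
    ("ci-runners", 12),
]
shares = toil_share_by_service(entries)
for service, share in shares.items():
    print(f"{service}: {share:.0%} of logged toil")

print(f"actionable alerts: {actionable_alert_rate(200, 20):.0%}")
```

Sorting by share puts the worst offender at the top of the report, which is usually enough to decide where the next automation project should go.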
The Payoff: Less Toil, Better Systems
When platform teams reduce toil, uptime often improves as a side effect. Automated recovery paths handle failures faster than humans can. Engineers have more time to build resilience features. The culture shifts from firefighting to building. Teams that cut toil by 20% report higher satisfaction and lower turnover.
But the payoff isn't guaranteed. Reducing toil requires investment in tooling and a willingness to accept short-term slowdowns while automation is built. Some teams find that their "toil" is actually essential manual oversight for compliance reasons. In those cases, the toil should be acknowledged and budgeted, not eliminated.
Platform teams that succeed in reducing toil become force multipliers. They free up developer time for product work, improve incident response without adding headcount, and build trust with their stakeholders. As noted in another piece on this site, recovering budget from idle tools is another way to fund toil reduction.
Still, the path isn't always smooth. Some engineers resist automation because they fear job loss or loss of control. Others argue that manual steps provide necessary friction. Reasonable people disagree on how much toil is acceptable. The key is to measure it transparently and let data guide the trade-off. Uptime will take care of itself.