Platform Teams That Stop Measuring Uptime Start Reducing Toil

May 21, 2026 By Sara Park

Platform teams love a good uptime dashboard. A green 99.99% feels like a badge of honor. But that number often conceals a dirty secret: teams are burning out on toil while chasing nines. The Google SRE book defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." Yet many platform teams remain fixated on uptime as their north star, ignoring the creeping cost of manual work. It's time to flip the metric.

Uptime Is the Wrong North Star

Teams chasing 99.999% availability often drown in alert fatigue. Every minor blip triggers a page, and engineers spend hours investigating false positives. The pursuit of perfection incentivizes over-engineering resilience for services that don't need it, while the real cost—developer time wasted—goes unmeasured.

The Google SRE book opens its chapter on eliminating toil with a blunt line from Google SRE Carla Geisser: "If a human operator needs to touch your system during normal operations, you have a bug." But many organizations treat manual intervention as a feature, not a bug. They celebrate uptime while ignoring that each deploy requires a human to babysit a script.

Uptime hides the cost of toil because it records only one thing: whether the service was reachable. It doesn't capture the hours spent on manual config changes, waiting for builds, or triaging low-severity alerts. A system with 99.99% uptime might still consume 30% of an engineer's week on rote tasks.
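To put those two numbers side by side, here is a back-of-the-envelope sketch in Python. The 40-hour week and 48 working weeks per year are assumptions chosen for illustration, not figures from any report, and the function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def yearly_toil_hours(toil_share: float, hours_per_week: float = 40.0,
                      working_weeks: int = 48) -> float:
    """Hours per year one engineer spends on toil (assumed work calendar)."""
    return toil_share * hours_per_week * working_weeks


if __name__ == "__main__":
    downtime = allowed_downtime_minutes(99.99)
    toil = yearly_toil_hours(0.30)
    print(f"99.99% uptime allows about {downtime:.1f} minutes of downtime a year")
    print(f"30% toil costs about {toil:.0f} engineer-hours a year")
```

Under those assumptions, the dashboard is guarding roughly 52.6 minutes of downtime a year while each engineer loses about 576 hours to rote work, and only the first number is on a dashboard.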

Some teams push back, arguing that uptime is what business stakeholders understand. But that's a cop-out. The real metric should be developer time spent on value-adding work versus operational overhead. If a platform team can't articulate that trade-off, they'll keep optimizing the wrong thing.

The Toil Tax Nobody Counts

Manual deploy steps multiply outage risk. A deployment that requires five manual commands and a config file edit is a human error waiting to happen. Yet many teams treat this as normal. Charity Majors, co-founder of Honeycomb, has called toil "the invisible tax on your engineering organization." It's not tracked in sprint metrics, so it grows unchecked.

Consider a single config change that touches multiple environments. An engineer might spend half a day updating YAML files, running tests, and coordinating with peers. Multiply that by the number of services a platform team owns, and a seemingly small change can cost several engineer-days. That's time not spent on automation or product improvements.

The 2024 DevOps Research and Assessment (DORA) report estimates that elite teams spend less than 10% of their time on toil, while low performers can spend over 30%. That gap translates directly to shipping velocity and engineer satisfaction. Yet many organizations don't measure toil at all—they only track uptime and incident count.

Pager fatigue is another hidden cost. When every alert is treated as urgent, engineers stop responding to pages. A study of on-call practices found that teams with high toil rates experience higher burnout and turnover. The toil tax compounds: tired engineers make mistakes, creating more toil.

When Platform Teams Flip the Metric

Shifting from uptime to time-to-restore changes the conversation. Instead of asking "Is the system up?" teams ask "How quickly can we recover from failure?" This reframe reduces the incentive to paper over problems with manual workarounds. Mean time to acknowledge (MTTA) becomes a leading indicator of toil: when acknowledgment times creep up, it usually means on-call engineers are already buried in manual work and noisy pages.
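Both numbers fall out of timestamps most incident trackers already record. The sketch below assumes a hypothetical `Incident` record with three timestamps; the field names are illustrative and would need mapping to whatever your paging tool exports.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    opened: datetime        # alert fired
    acknowledged: datetime  # a human picked it up
    restored: datetime      # service back to normal


def mtta_minutes(incidents: list[Incident]) -> float:
    """Mean time to acknowledge, in minutes."""
    return mean((i.acknowledged - i.opened).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to restore, in minutes, measured from when the alert fired."""
    return mean((i.restored - i.opened).total_seconds() for i in incidents) / 60
```

Tracking the two side by side week over week is the point: a flat time-to-restore with a rising MTTA suggests the recovery steps are fine but the people are overloaded.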

The Spotify model of squad autonomy is a good example. By giving teams ownership of their services and the tools to deploy independently, Spotify reduced the toil associated with handoffs and coordination. Each squad could focus on its domain without waiting for a central platform team to approve changes.

Netflix's chaos engineering approach also surfaces toil before it becomes a crisis. By intentionally breaking systems, they uncover brittle manual steps that engineers rely on. A service that requires manual restart after a failure is a toil magnet. Chaos engineering forces teams to automate those recovery paths.

Internal developer portals (IDPs) are another tool. By providing self-service actions for common tasks—provisioning a database, rolling back a deploy, viewing logs—platform teams cut the manual work that creates toil. As noted in a related article, tracking tool adoption weekly helps ensure these portals actually reduce toil rather than add another layer.

Case Study: Etsy Deployinator

Etsy's Deployinator is a classic example of toil reduction. Before Deployinator, a single deploy could take 45 minutes of manual steps: SSH into boxes, copy files, run scripts, and hope nothing broke. Engineers dreaded deployment windows. Etsy's engineers built a one-click deploy system that cut that time to 30 seconds.

The impact was dramatic. Engineers started shipping code 50 times more often. Uptime remained flat—Etsy didn't become less reliable—but incidents fell because deployments were small, frequent, and reversible. The toil reduction unlocked innovation: teams experimented more because the cost of failure was low.

Deployinator wasn't a silver bullet. It required investment in automation and a cultural shift from gatekeeping to trust. But it proved that reducing toil doesn't harm reliability. In fact, it improves it by making recovery paths explicit and repeatable.

Other companies have followed similar paths. Amazon's internal tooling, like the "deploy to production" button, enabled thousands of daily deployments. The key insight: when engineers can deploy safely and quickly, they spend less time on process and more on building.

Additional Case Study: Stripe's Manual Approval Bottleneck

Stripe, the online payments company, faced a different kind of toil: manual approval workflows for infrastructure changes. Before they automated, every database schema change required a senior engineer to review and approve. That bottleneck caused lead times of up to two weeks for simple migrations. Stripe invested in automated linting, canary deployments, and rollback scripts. The result: lead time dropped to hours, and the senior engineers could focus on architecture instead of rubber-stamping approvals. This case shows that toil isn't always about clicking buttons—it can also be about waiting for human gates. By measuring the time spent in review queues, Stripe identified the toil and automated the low-risk decisions.

Trade-Off: When Automation Introduces New Toil

Not every automation effort reduces toil. Sometimes, building and maintaining automation creates its own form of toil. For example, a team that writes a complex orchestration script may spend as much time debugging the script as they would have spent on the manual steps. This is sometimes called the "automation tax." A study by the USENIX Association found that teams that over-automate early often face higher incident rates because the automation itself is brittle.

The trade-off is clear: invest in automation only for high-frequency, well-understood tasks. For rare or unpredictable tasks, a manual runbook with good documentation may be more efficient. Platform teams should measure not just the toil eliminated, but also the toil introduced by the automation tooling itself. If an internal tool requires constant patching, configuration, and user support, it may be adding toil rather than removing it.
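One way to keep that trade-off honest is to net the hours an automation removes against the hours it adds. The function below is a minimal sketch; its name, parameters, and the example figures are all hypothetical.

```python
def net_hours_saved(manual_minutes_per_run: float, runs_per_year: float,
                    build_hours: float, upkeep_hours_per_year: float,
                    years: float = 1.0) -> float:
    """Hours of manual work removed, minus hours spent building and
    maintaining the automation, over the given number of years."""
    toil_removed = manual_minutes_per_run * runs_per_year * years / 60
    toil_added = build_hours + upkeep_hours_per_year * years
    return toil_removed - toil_added


if __name__ == "__main__":
    # Frequent, well-understood task: 30 minutes, 200 times a year.
    print(net_hours_saved(30, 200, build_hours=40, upkeep_hours_per_year=20))  # 40.0
    # Rare task: 2 hours, twice a year.
    print(net_hours_saved(120, 2, build_hours=40, upkeep_hours_per_year=10))   # -46.0
```

With these made-up inputs, the frequent task pays back within the first year, while the rare one stays in the red and is better served by a documented runbook.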

Counter-Argument: Is All Toil Bad?

Some engineers argue that a certain amount of manual work is beneficial because it forces human judgment and prevents catastrophic errors. For example, a manual approval for a database delete operation can be a safety net. The response is that toil, by definition, is devoid of enduring value, but not all manual work is toil. The key is to distinguish between manual work that requires expertise and manual work that is rote. The former should be preserved; the latter should be automated.

A reasonable approach is to classify tasks by frequency and complexity. High-frequency, low-complexity tasks are prime candidates for automation. Low-frequency, high-complexity tasks may benefit from human oversight. Platform teams should avoid dogmatic elimination of all manual steps and instead use data to decide where automation pays off.
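The frequency-and-complexity split can be written down as a small decision rule. This is a sketch only: the threshold of four runs a month is an arbitrary assumption, and "complexity" is reduced to a single yes/no on whether expert judgment is needed.

```python
from enum import Enum


class Recommendation(Enum):
    AUTOMATE = "automate"
    KEEP_HUMAN = "keep human oversight"
    CASE_BY_CASE = "decide case by case"


def recommend(runs_per_month: float, needs_expert_judgment: bool,
              frequent_threshold: float = 4.0) -> Recommendation:
    """Place a task on a frequency/complexity grid.

    frequent_threshold is a hypothetical cutoff; tune it to your team.
    """
    frequent = runs_per_month >= frequent_threshold
    if frequent and not needs_expert_judgment:
        return Recommendation.AUTOMATE
    if not frequent and needs_expert_judgment:
        return Recommendation.KEEP_HUMAN
    # Frequent-but-complex and rare-but-simple tasks need a closer look.
    return Recommendation.CASE_BY_CASE
```

The two off-diagonal cases are deliberately left undecided, since those are the ones where the data should drive the call.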

Three Signals Your Team Is Mismeasuring

If your sprint velocity reliably drops after each on-call rotation, you're measuring the wrong thing. On-call should be a stabilizing force, not a drain. A pattern of post-on-call slowdown indicates that engineers are spending their recovery time cleaning up manual work that should have been automated.

New hires taking six months to become productive is another red flag. If the learning curve is steep because of tribal knowledge about manual steps, that knowledge is toil. Good platform teams document and automate so that new engineers can contribute within weeks. As one article put it, platform teams cut on-call by measuring internal tool dead code—eliminating unused scripts reduces confusion and toil.

Frequent "small" incidents that get ignored are a third signal. A service that crashes every week but auto-restarts is still generating toil if an engineer has to investigate each time. If the runbook is outdated or missing, that's a sign that toil is being tolerated rather than eliminated. Engineers who dread deployment windows are telling you the process is too manual.

How to Start Measuring Toil

Start by tagging every task as either toil or project work. Use a simple classification: Is this manual? Is it repetitive? Could it be done without human judgment? If yes to all three, it's toil. Work that genuinely needs judgment is not toil, even when it is manual. Track the time spent on these tasks per week. Many teams are surprised to find that 20-30% of their sprint capacity goes to toil.
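A tagging scheme like this is easy to prototype before buying a tool. The sketch below assumes the SRE-book reading, in which work that needs human judgment does not count as toil; the `Task` fields are hypothetical and would map onto ticket labels or time-log columns.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    hours: float
    manual: bool
    repetitive: bool
    needs_judgment: bool


def is_toil(task: Task) -> bool:
    """Manual, repetitive work that needs no human judgment counts as toil."""
    return task.manual and task.repetitive and not task.needs_judgment


def toil_share(tasks: list[Task]) -> float:
    """Fraction of logged hours spent on toil (0.0 when nothing is logged)."""
    total = sum(t.hours for t in tasks)
    toil = sum(t.hours for t in tasks if is_toil(t))
    return toil / total if total else 0.0
```

Run weekly over a team's logged tasks, `toil_share` gives the single number that the rest of this section builds on.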

Set a service-level objective (SLO) for toil minutes per week. Google's SRE organization aims to keep toil below 50% of each engineer's time; treat that as a ceiling rather than a goal, since elite teams target under 10%. Cap on-call hours per person to prevent burnout. If a team can't meet its toil budget, that is the signal to prioritize automation.
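A toil budget can be checked the same way an error budget is. In this sketch the 50% ceiling and 10% target come from the paragraph above, while the function name and status labels are invented for illustration.

```python
def budget_status(toil_hours: float, worked_hours: float,
                  ceiling: float = 0.50, target: float = 0.10) -> str:
    """Compare one engineer's weekly toil share against the budget."""
    if worked_hours <= 0:
        raise ValueError("worked_hours must be positive")
    share = toil_hours / worked_hours
    if share > ceiling:
        return "over ceiling"   # stop and automate
    if share > target:
        return "over target"    # plan automation work
    return "within target"
```

For a 40-hour week, 24 hours of toil lands over the ceiling, 8 hours is over the target, and 3 hours is within it.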

Use the DORA metrics as a starting point: deploy frequency, lead time for changes, time to restore service, and change failure rate. These correlate with toil. If lead time is long because of manual approvals, that's toil. If time to restore is high because of complex runbooks, that's toil.
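If your pipeline already logs commit and deploy times, the four metrics can be approximated in a few lines. This is a sketch under simplifying assumptions: the `Deploy` record and its fields are hypothetical, lead time is measured from first commit to production, and time to restore is measured from the failed deploy.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed: datetime                  # first commit in the change
    deployed: datetime                   # change reached production
    failed: bool = False                 # caused a degradation in production
    restored: Optional[datetime] = None  # when service recovered, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Four DORA-style metrics over a window; expects at least one deploy."""
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed - d.committed for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used rather than means so that one slow approval or one long outage does not dominate the picture; that is a design choice, not part of the DORA definitions.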

Some teams push back, arguing that not all manual work is toil. They're right: a one-off database migration that runs once a year isn't toil. But if that migration comes down to a human typing a known sequence of commands, those steps can still be scripted. The goal isn't to eliminate all manual work, but to make it visible and intentional.

Measuring Toil with Specific Metrics

Beyond DORA, teams can adopt toil-specific metrics. One is "time spent on manual operations per service per week." This can be tracked via time logs or by analyzing ticket tags. Another is "number of manual steps in a deployment pipeline." A simple count can highlight processes that need automation. A third is "percentage of alerts that require human action." If 90% of alerts are noise, the alerting system itself is generating toil.

Some teams use a "toil tracker" spreadsheet where engineers log 15-minute increments of manual work. Over a quarter, this data reveals patterns. For example, a team might discover that 40% of toil comes from a single legacy service that requires manual restarts. That insight justifies a focused automation project.
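A spreadsheet export is enough to surface that kind of pattern. The sketch below assumes each row is a `(service, increments)` pair in 15-minute units; the function names and the service names in the usage note are made up.

```python
from collections import Counter


def toil_minutes_by_service(entries: list[tuple[str, int]],
                            increment_minutes: int = 15) -> Counter:
    """Sum logged increments per service; each entry is (service, increments)."""
    totals: Counter = Counter()
    for service, increments in entries:
        totals[service] += increments * increment_minutes
    return totals


def top_toil_source(entries: list[tuple[str, int]]) -> tuple[str, float]:
    """Service with the most logged toil and its share of the total.

    Expects at least one entry with a non-zero increment count.
    """
    totals = toil_minutes_by_service(entries)
    service, minutes = totals.most_common(1)[0]
    return service, minutes / sum(totals.values())


def actionable_alert_pct(total_alerts: int, actionable_alerts: int) -> float:
    """Percentage of alerts that needed a human to do something."""
    return 100.0 * actionable_alerts / total_alerts if total_alerts else 0.0
```

Given a log such as `[("legacy-billing", 10), ("api-gateway", 10), ("legacy-billing", 6), ("search", 14)]`, `top_toil_source` returns `("legacy-billing", 0.4)`, which is the kind of finding that justifies a targeted automation project.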

The Payoff: Less Toil, Better Systems

When platform teams reduce toil, uptime often improves as a side effect. Automated recovery paths handle failures faster than humans can. Engineers have more time to build resilience features. The culture shifts from firefighting to building. Teams that cut toil by 20% report higher satisfaction and lower turnover.

But the payoff isn't guaranteed. Reducing toil requires investment in tooling and a willingness to accept short-term slowdowns while automation is built. Some teams find that their "toil" is actually essential manual oversight for compliance reasons. In those cases, the toil should be acknowledged and budgeted, not eliminated.

Platform teams that succeed in reducing toil become force multipliers. They free up developer time for product work, improve incident response without adding headcount, and build trust with their stakeholders. As noted in another piece on this site, recovering budget from idle tools is another way to fund toil reduction.

Still, the path isn't always smooth. Some engineers resist automation because they fear job loss or loss of control. Others argue that manual steps provide necessary friction. Reasonable people disagree on how much toil is acceptable. The key is to measure it transparently and let data guide the trade-off. Uptime will take care of itself.
