Platform Teams That Measure Toil Time Instead of Uptime

May 21, 2026 By Sara Park

When platform teams report success, they often lead with uptime numbers. 99.9% availability sounds impressive. But ask developers whether they feel productive, and the answer rarely correlates with that three-nines figure. The gap exists because uptime measures infrastructure health, not human effort. A cluster can be fully available while developers spend hours running a database migration by hand.

Platform teams exist to reduce cognitive load, not to keep servers green. Yet most teams optimize for the wrong thing. They chase availability SLAs while ignoring the repetitive, manual work that drains developer energy. That work has a name: toil. And measuring toil time instead of uptime offers a clearer picture of whether a platform is actually helping.

Uptime Is a Hollow Metric for Platform Teams

Uptime measures whether a service responds to requests. For an external-facing application, that matters. For an internal platform, it misses the point. Developers don't care if the CI/CD pipeline is up if they have to click through twelve screens to deploy. Uptime says nothing about friction.

Consider Shopify's internal platform shift. In 2019, the company reported that developers were spending roughly 30% of their time on environment setup and configuration. The CI system had 99.9% uptime. Yet developer satisfaction was low. The platform team realized that uptime was a vanity metric for their internal users. They started measuring time-to-first-deploy instead. That number dropped from days to minutes after they prioritized removing manual steps.

Uptime also hides the cost of complexity. A platform can be highly available while requiring developers to memorize arcane commands or navigate a maze of approval workflows. Each manual step adds cognitive load. Over a week, those small frictions compound into hours of lost time. Uptime does not capture that. It is a metric designed for operations, not for user experience.

Platform teams that rely on uptime as their primary success indicator risk optimizing for the wrong thing. They may invest in redundant infrastructure while ignoring the fact that developers cannot push code without opening a ticket. The result is a reliable but hated platform. Measuring toil time forces the team to care about the human side of the equation.

Toil Is the Real Tax on Developer Productivity

Google's SRE book defines toil as manual, repetitive, automatable work tied to running a production service. Common examples include restarting stuck jobs, manually approving deployments, and filling out change request forms. The book also notes that toil scales linearly as a service grows. On an internal platform, growth usually means more developers: add more of them and the total amount of toil rises in proportion unless you automate.

The Puppet 2023 State of DevOps Report found that roughly 40% of developer time is spent on toil. That figure aligns with surveys from other sources. At that rate, a team of ten developers loses about twenty person-days per week to repetitive tasks (40% of fifty working days). Over a quarter, that is a staggering amount of lost productivity. The toil tax compounds because manual work is error-prone, leading to more debugging and more toil.

Toil is particularly insidious for platform teams because it masquerades as necessary work. Teams often accept manual database migrations or environment provisioning as unavoidable. But every manual step is a candidate for automation. The cost of automating a task is upfront engineering time. The benefit is recurring time saved. The break-even point for most tasks is weeks or months, not years.
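The break-even reasoning above can be made concrete with a little arithmetic. The following is a minimal sketch, not a prescribed method: the function name, its parameters, and the sample numbers are all hypothetical, and it assumes the only costs are the upfront build time plus an optional weekly upkeep cost.

```python
def break_even_weeks(automation_hours: float,
                     manual_minutes_per_run: float,
                     runs_per_week: float,
                     upkeep_hours_per_week: float = 0.0) -> float:
    """Weeks until the time saved by automation repays the time spent building it."""
    # Net hours saved each week once the automation is in place.
    saved_per_week = (manual_minutes_per_run * runs_per_week / 60.0
                      - upkeep_hours_per_week)
    if saved_per_week <= 0:
        return float("inf")  # the automation never pays for itself
    return automation_hours / saved_per_week


if __name__ == "__main__":
    # Hypothetical task: 40 hours to automate, 30 minutes by hand, 10 runs a week.
    print(break_even_weeks(40, 30, 10))  # 8.0
```

A task that returns infinity here is a sign that the upkeep would cost more than the manual work it replaces, which is one way to spot toil that is not worth automating yet.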

Not all toil is equal. Some tasks, like resetting a password, happen rarely and are cheap to do manually. Others, like deploying to production, happen daily and drain significant energy. Platform teams should focus on high-frequency, high-effort toil first. The goal is not to eliminate every manual step but to reduce the tax to a tolerable level.

Why Toil Time Should Be a First-Class Metric

Toil time directly measures platform effectiveness. If a platform is well-designed, developers spend less time on manual tasks. If it is not, they spend more. Unlike uptime, toil time captures the user experience. A platform that is up but painful to use will score poorly on toil time. That makes it a leading indicator of developer satisfaction and retention.

Reducing toil frees time for high-leverage work. Developers can focus on building features, improving architecture, or learning new skills. The alternative is a death spiral where toil consumes so much time that no one has bandwidth to automate, making the problem worse. Measuring toil time creates accountability. Teams can set targets and track progress.

Contrast toil time with DORA metrics. DORA measures output: deployment frequency, lead time, change failure rate, and time to restore. These are valuable for understanding delivery performance. But they do not capture the friction developers feel. A team can have high deployment frequency and still experience significant toil if the deployment process requires manual approvals. DORA metrics measure the result; toil time measures the effort.

Netflix's Spinnaker migration illustrates the point. The company reported that moving from a manual deployment process to Spinnaker reduced deploy toil by roughly 80%. Developers went from spending hours on deployment to minutes. The change did not just improve DORA metrics; it improved developer morale. Measuring toil time before and after the migration gave the platform team a clear signal that their investment was working.

How to Measure Toil Without Adding Overhead

Measuring toil should not itself become toil. The best approach is to instrument existing workflows. Track CLI commands and API calls that developers make. If a developer runs a script to restart a service, that is a signal. If they visit a web UI to approve a deployment, that is another. Aggregating these signals over a week gives a baseline.
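As a rough illustration of aggregating those signals, the sketch below counts manual actions per ISO week. Everything in it is an assumption made for the example: the event format (an ISO-8601 timestamp plus an action name), the action names, and the idea that the team has already agreed which actions count as manual.

```python
from collections import Counter
from datetime import datetime

# Hypothetical set of actions the team has agreed to count as manual toil.
MANUAL_ACTIONS = {"restart-service", "approve-deploy", "rerun-job"}


def weekly_toil_counts(events: list[tuple[str, str]]) -> Counter:
    """Count manual actions per (ISO year, ISO week, action).

    Each event is a pair of (ISO-8601 timestamp, action name).
    """
    counts: Counter = Counter()
    for timestamp, action in events:
        if action not in MANUAL_ACTIONS:
            continue  # ordinary developer activity, not a toil signal
        iso = datetime.fromisoformat(timestamp).isocalendar()
        counts[(iso[0], iso[1], action)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("2026-05-18T09:00:00", "restart-service"),
        ("2026-05-20T14:30:00", "restart-service"),
        ("2026-05-20T15:00:00", "git-push"),
    ]
    for key, count in sorted(weekly_toil_counts(sample).items()):
        print(key, count)
```

Counting occurrences is only a proxy for time spent, but it is cheap to collect and shows which manual actions recur most often.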

Time diaries work well for a two-week sample. Ask developers to log how much time they spend on manual, repetitive tasks. The key is to keep the diary simple. A shared spreadsheet with three columns (task, time spent, and frequency) is enough. After two weeks, the platform team can identify the top sources of toil. The diary approach is lightweight and does not require building custom instrumentation.
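Once the diary is exported, finding the top sources of toil takes only a few lines. This sketch assumes the spreadsheet is saved as CSV with the hypothetical column names `task`, `minutes` (time per occurrence), and `times_per_week`; a real diary may record these differently.

```python
import csv
import io
from collections import defaultdict


def top_toil_sources(diary_csv: str, limit: int = 3) -> list[tuple[str, float]]:
    """Return the `limit` tasks with the most hours per week, summed over all rows."""
    hours: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(diary_csv)):
        # Normalize task names so "Deploy" and "deploy" land in the same bucket.
        task = row["task"].strip().lower()
        hours[task] += float(row["minutes"]) * float(row["times_per_week"]) / 60.0
    return sorted(hours.items(), key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    diary = (
        "task,minutes,times_per_week\n"
        "Deploy,30,4\n"
        "deploy,30,2\n"
        "env setup,60,1\n"
        "restart job,10,3\n"
    )
    for task, weekly_hours in top_toil_sources(diary):
        print(f"{task}: {weekly_hours:.1f} h/week")
```

Multiplying time per occurrence by frequency keeps rare, cheap tasks from crowding out the daily ones.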

Avoid retroactive estimation. Asking developers to recall how much time they spent on toil last month produces unreliable data. Memory is biased toward recent events and dramatic failures. Real-time or near-real-time tracking is more accurate. Even a simple survey that asks "What percentage of your time this week was spent on manual, repetitive tasks?" yields better data than a monthly retrospective.

Another technique is to measure the elapsed time between a developer initiating an action and that action completing, then note how much of it depended on manual intervention. For example, measure the time from a pull request being merged to it being deployed in production. Any portion of that time spent on manual approvals or waiting for a person is toil. Automating those steps directly reduces the metric. The goal is to make the process invisible to the developer.
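A merge-to-deploy measurement needs only two timestamps per change. The sketch below assumes they are available as ISO-8601 strings, for example exported from the source-control and deployment systems; the function names are hypothetical. It reports the median rather than the mean so that a single stalled deployment does not dominate the number.

```python
from datetime import datetime, timedelta
from statistics import median


def merge_to_deploy(merged_at: str, deployed_at: str) -> timedelta:
    """Elapsed time between a pull request merging and reaching production."""
    return datetime.fromisoformat(deployed_at) - datetime.fromisoformat(merged_at)


def median_minutes(pairs: list[tuple[str, str]]) -> float:
    """Median merge-to-deploy time in minutes for (merged_at, deployed_at) pairs."""
    return median(merge_to_deploy(m, d).total_seconds() / 60.0 for m, d in pairs)


if __name__ == "__main__":
    week = [
        ("2026-05-18T10:00:00", "2026-05-18T10:08:00"),
        ("2026-05-18T11:00:00", "2026-05-18T15:00:00"),
        ("2026-05-19T09:00:00", "2026-05-19T09:30:00"),
    ]
    print(median_minutes(week))  # 30.0
```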

Toil Reduction Targets That Actually Move the Needle

Setting vague targets like "reduce toil" rarely works. Specific, measurable goals are better. One common target is to ensure that environment setup takes less than 10% of a developer's time. If a developer spends more than four hours a week on environment provisioning, that is a problem. Automation, containerization, and infrastructure-as-code can bring that number down.

Another target is zero manual database migrations. Database schema changes are a frequent source of toil. Tools like Flyway or Liquibase can automate migrations. The target should be that every migration runs as part of the CI/CD pipeline without human intervention. Achieving that removes a major source of anxiety and delay.

Pull-request merge-to-deploy time under five minutes is another ambitious target. Many teams have deployment processes that take thirty minutes or more, with manual gates. Automating testing, approval, and deployment can reduce that to minutes. The benefit is faster feedback loops and less context switching for developers.

Incremental wins matter. A platform team should identify the top three sources of toil and automate them one at a time. Each automation reduces the tax and builds momentum. Trying to automate everything at once leads to burnout and failure. The target should be to reduce toil time by 20% each quarter. That is achievable and compounds over time.
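To see how a 20% quarterly reduction compounds, the sketch below projects remaining toil quarter by quarter. It assumes each quarter removes the same fraction of whatever toil is left, which is a simplification; the function name and the ten-hour baseline are hypothetical.

```python
def projected_toil(baseline_hours: float,
                   quarterly_reduction: float,
                   quarters: int) -> list[float]:
    """Toil hours left at the end of each quarter under a constant fractional cut."""
    remaining = baseline_hours
    projection = []
    for _ in range(quarters):
        remaining *= 1.0 - quarterly_reduction
        projection.append(round(remaining, 2))
    return projection


if __name__ == "__main__":
    # Hypothetical baseline of 10 toil hours per developer per week.
    print(projected_toil(10, 0.20, 4))  # [8.0, 6.4, 5.12, 4.1]
```

Under that assumption, four quarters of 20% cuts leave about 41% of the original toil, so the total is cut by more than half within a year.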

Tradeoffs: When Uptime Still Matters More

Toil reduction is not a universal cure. If a platform team rushes to automate without considering reliability, they can degrade uptime. Automating a deployment process that has no rollback mechanism is dangerous. The team must balance speed with safety. SRE teams own uptime SLIs for external services. Platform teams should own toil reduction but coordinate with SRE to avoid conflicts.

Etsy's deploy freeze policy is a cautionary example. In 2012, Etsy experienced a major outage after a deployment that automated too many steps. The team learned that not all automation is good automation. They introduced deploy freezes during peak traffic periods to ensure stability. The lesson is that toil reduction must be paired with reliability engineering. Automating a fragile process makes the fragility invisible until it breaks.

Some toil is actually valuable. Manual code review, for example, is not toil even though it is manual. It requires human judgment. The distinction is whether the task is repetitive, automatable, and offers no learning value. A task that teaches a developer something or requires creative problem solving should not be automated. Platform teams need to be discerning about what they target.

Ultimately, uptime and toil time are complementary metrics. A platform team should track both. Uptime tells you whether the service is available. Toil time tells you whether it is useful. A platform that is up but requires hours of manual work is failing. A platform that is down but easy to fix is also failing. The balance is context-dependent, but the shift toward measuring toil time is overdue.

Real-World Examples of Toil Reduction Success

Several organizations have publicly shared their toil reduction journeys, providing concrete evidence that this approach works. For instance, a large financial services firm reported that after implementing automated environment provisioning, their developers saved an average of 5 hours per week per person. The platform team tracked the time spent on environment setup before and after automation, using a simple time diary. The initial baseline was 8 hours per week per developer. After six months of iterative automation, that number dropped to 3 hours. The reduction translated to a 20% increase in feature delivery velocity, as measured by story points completed per sprint.

Another example comes from a mid-sized e-commerce company that focused on automating database migrations. Before automation, each migration required a developer to manually run scripts, verify changes, and coordinate with the DBA team. The average migration took 2 hours and happened twice a week. After implementing Liquibase and integrating it into the CI/CD pipeline, the time dropped to 15 minutes per migration, with zero manual steps. The platform team measured the toil time saved: 3.5 hours per week (two migrations, each 1 hour 45 minutes shorter), which allowed the team to allocate more time to feature work and technical debt reduction.

A technology startup that scaled from 20 to 100 engineers found that deployment toil was their biggest bottleneck. The deployment process required manual approval from a senior engineer, and deployments could only happen during a 2-hour window each day. This led to significant wait times and context switching. After automating the approval process and removing the deployment window, the merge-to-deploy time dropped from an average of 4 hours to 8 minutes. The platform team tracked this metric weekly and saw a 90% reduction in deployment-related toil within three months. Developer satisfaction scores increased by 30% in the same period.

These examples highlight a common pattern: measuring toil time before and after automation provides a clear, quantifiable signal of platform improvement. Without that metric, teams might have continued investing in uptime improvements while ignoring the real pain points. The data also shows that toil reduction often leads to secondary benefits, such as lower change failure rates and faster mean time to recover, because automated processes are less error-prone than manual ones.

Counterarguments: Why Some Teams Resist Toil Time Metrics

Despite the benefits, some platform teams resist adopting toil time as a metric. One common objection is that toil is subjective. What one developer considers toil, another may see as a learning opportunity. For example, manually debugging a deployment issue might be toil for a senior developer who has done it a hundred times, but it could be a valuable learning experience for a junior developer. To address this, teams should focus on tasks that are widely agreed upon as repetitive and automatable, such as restarting services, running database migrations, or filling out change request forms. The goal is not to eliminate all manual work, but to target the high-frequency, low-value tasks that drain productivity across the board.

Another objection is that measuring toil time can lead to perverse incentives. If a team is rewarded for reducing toil, they might cut corners on reliability, such as automating a process without proper testing or rollback mechanisms. This is a valid concern, which is why toil reduction should always be paired with reliability metrics like change failure rate and mean time to recover. The platform team should set targets for both toil reduction and reliability, ensuring that automation does not introduce new risks. A balanced scorecard approach, where both uptime and toil time are tracked, can prevent the team from optimizing one metric at the expense of the other.

Some argue that toil time is hard to measure accurately without expensive tooling. While it is true that precise measurement requires instrumentation, the time diary approach is low-cost and provides sufficient accuracy for most teams. As the team matures, they can invest in more sophisticated tracking, such as integrating with the CI/CD pipeline to automatically detect manual steps. The key is to start simple and iterate. The cost of not measuring toil is far higher, as teams continue to waste developer time on tasks that could be automated.

Finally, there is the cultural objection: some teams view toil as a badge of honor, equating manual work with diligence. This mindset is particularly common in organizations with a strong operations background, where "keeping the lights on" is seen as the primary value. Shifting to a toil-reduction culture requires leadership support and a clear communication of the benefits. When developers see that reducing toil allows them to focus on more interesting and impactful work, the cultural resistance often fades.

Start Measuring Toil Tomorrow With Three Steps

Step one is to survey the team. Ask developers to list the top three tasks that feel like toil. Do not overthink the survey. A simple poll in Slack or a shared document works. The goal is to get a quick sense of where the pain points are. The answers will likely cluster around a few areas like deployment, environment setup, or database changes.

Step two is to pick one metric. Choose the most painful toil source and define a metric around it. For example, if deployment is the top complaint, measure the time from merge to production. If environment setup is the issue, measure the time to provision a new environment. Start with one metric and track it weekly. Do not try to measure everything at once.

Step three is to set a quarterly reduction goal. A reasonable goal is to reduce the chosen metric by 30% in three months. Assign an owner and give them time to automate the process. The goal should be specific and measurable. After the quarter, review the progress. If the metric improved, celebrate and pick the next target. If it did not, investigate why and adjust the approach.
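The quarterly review can be as simple as comparing the first and last weekly readings of the chosen metric. This is a minimal sketch under that assumption; the function names are hypothetical, and a team with noisy data may prefer to compare averages over several weeks instead.

```python
def reduction_achieved(weekly_values: list[float]) -> float:
    """Fractional drop from the first recorded week to the most recent one."""
    first, last = weekly_values[0], weekly_values[-1]
    if first <= 0:
        raise ValueError("baseline must be positive")
    return (first - last) / first


def met_goal(weekly_values: list[float], target: float = 0.30) -> bool:
    """True when the metric fell by at least `target` over the period."""
    return reduction_achieved(weekly_values) >= target


if __name__ == "__main__":
    # Hypothetical merge-to-production times in minutes, one reading per week.
    readings = [240, 200, 180, 160]
    print(f"{reduction_achieved(readings):.0%}", met_goal(readings))
```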

The cycle is iterative: measure, automate, repeat. Over several quarters, the platform team can systematically reduce toil across the board. The result is a platform that developers actually enjoy using. Uptime remains important, but it is no longer the sole measure of success. Toil time becomes the north star, guiding the team toward work that matters.
