
Case study

Building a Culture of Reliability: The Journey to 99.99% Uptime

How the department I led shaped practices and ownership to achieve and sustain 99.99% uptime.

The Client

Global e-learning platform serving 400,000 daily users across 120 countries, with over 1 million courses completed annually.

The Challenge

  • Availability at risk: System availability frequently fell to 98-99% or lower, putting customer SLAs in jeopardy
  • Unsustainable on-call: Only 2 engineers were on call for the entire ecosystem, including services they didn't build, which led to burnout and slow response
  • Deployment fear: Every deployment caused downtime and was restricted to a Thursday window; database scripts had to be carefully scheduled to minimize customer impact
  • No visibility: No systematic monitoring, incident coordination, or improvement process
  • Constant severe incidents: ~30 per year, disrupting learning for hundreds of thousands of users

What We Did

Restructured On-Call for Ownership

  • Created dedicated on-call teams per business service (4-5 engineers each) so teams owned what they built
  • Reduced cognitive load and gave engineers proper rest between on-call shifts
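
To illustrate why team size matters for rest, here is a minimal sketch of a round-robin rotation. It assumes week-long shifts with a single primary engineer, which the case study does not specify; the function names and engineer names are hypothetical.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one primary on-call engineer per week, round-robin.

    Assumption: shifts last one week and start on `start`.
    """
    if not engineers:
        raise ValueError("a rotation needs at least one engineer")
    return [
        (start + timedelta(weeks=week), engineers[week % len(engineers)])
        for week in range(weeks)
    ]


def weeks_off_between_shifts(team_size: int) -> int:
    """Full weeks of rest each engineer gets between two on-call weeks."""
    return team_size - 1


team = ["engineer-a", "engineer-b", "engineer-c", "engineer-d"]  # hypothetical names
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=8)

for shift_start, engineer in schedule:
    print(shift_start.isoformat(), engineer)
```

Under these assumptions, a team of four gives each engineer three weeks off between shifts, whereas a two-person rotation leaves only one.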

Built Incident Management Discipline

  • Introduced Incident Commanders to coordinate complex incidents
  • Established post-incident review process with mandatory follow-through on action items
  • Began tracking MTTA, MTTR, and other reliability KPIs to drive continuous improvement
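
As a rough sketch of how such KPIs can be derived from incident records (not the platform's actual tooling), the example below computes MTTA and MTTR from timestamps. The `Incident` fields are illustrative assumptions, and MTTR is taken here to mean mean time to resolve, measured from when the incident opened.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """One incident record; the field names are illustrative assumptions."""
    opened_at: datetime        # alert fired or incident was reported
    acknowledged_at: datetime  # an on-call engineer took ownership
    resolved_at: datetime      # service was restored


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: average of (acknowledged - opened)."""
    seconds = mean(
        (i.acknowledged_at - i.opened_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: average of (resolved - opened)."""
    seconds = mean(
        (i.resolved_at - i.opened_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 2, 10), datetime(2024, 3, 1, 3, 0)),
    Incident(datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 20), datetime(2024, 3, 9, 14, 40)),
]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
```

Reviewing these figures per team and per quarter is one way to see whether post-incident action items are actually shortening response times.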

Created Visibility Through Monitoring

  • Defined internal Service Level Objectives (SLOs) for each service
  • Set up monitoring and alerting around SLOs to catch issues before customers noticed
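
One common way to alert around an SLO is to watch how fast the error budget is being consumed. The sketch below assumes a request-based availability SLO and a burn-rate threshold of 14.4 (a value often used for a one-hour window against a 30-day budget). The function names, the threshold, and the sample numbers are assumptions for illustration, not details from this project.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 -> about 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly as fast as
    the SLO allows; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    return (failed / total) / error_budget(slo_target)


def should_alert(failed: int, total: int, slo_target: float,
                 threshold: float = 14.4) -> bool:
    """Page when the budget burns faster than the assumed threshold."""
    return burn_rate(failed, total, slo_target) >= threshold


# Hypothetical one-hour window for a service with a 99.9% SLO.
print(should_alert(failed=20, total=1000, slo_target=0.999))  # 2% errors
print(should_alert(failed=1, total=1000, slo_target=0.999))   # 0.1% errors
```

Alerting on burn rate rather than on raw error counts ties pages to customer impact, which is what lets a team react before the SLO itself is breached.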

Enabled Cultural Shift

  • Deep ownership meant teams fixed root causes, not just symptoms, especially after out-of-hours incidents
  • Continuous improvement cycles became routine, steadily strengthening reliability posture

The Results

  • 99.99% uptime achieved and maintained, eliminating SLA violation risk
  • MTTA improved 12x: mean time to acknowledge fell from 4 hours to consistently under 20 minutes
  • Severe incidents dropped 90%: from 30/year to just 3/year
  • Deploy on demand: Moved from scheduled Thursday-only deployments with downtime to zero-downtime deployments at any time
  • Sustainable on-call: Healthy rotations with 4-5 engineers per team, ending the burnout cycle
  • Proactive reliability culture: Teams routinely implement structural improvements rather than fighting fires
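
To put the availability figures in perspective, the short calculation below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year and treats availability as a simple share of wall-clock time; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime permitted at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (98.0, 99.0, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.1f} h)")
```

On this basis, 99.99% leaves roughly 52.6 minutes of downtime per year, compared with about 87.6 hours at 99% and about 175 hours at 98%.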

The Transformation

What started as a reliability crisis became a competitive advantage. By combining ownership, observability, and disciplined incident management, the platform now delivers consistent learning experiences to hundreds of thousands of daily users, and its engineering teams sleep at night.