Case study
Building a Culture of Reliability: The Journey to 99.99% Uptime
How the department I led shaped practices and ownership to achieve and sustain 99.99% uptime.
The Client
A global e-learning platform serving 400,000 daily users across 120 countries, with over 1 million courses completed annually.
The Challenge
- Availability at risk: System availability frequently fell to 98-99% or lower, putting customer SLAs in jeopardy
- Unsustainable on-call: Only 2 engineers were on call for the entire ecosystem, including services they didn't build, which led to burnout and slow response
- Deployment fear: All deployments caused downtime and were restricted to Thursday windows; database scripts had to be carefully scheduled to minimize customer impact
- No visibility: No systematic monitoring, incident coordination, or improvement process
- Constant severe incidents: ~30 per year, disrupting learning for hundreds of thousands of users
What We Did
Restructured On-Call for Ownership
- Created dedicated on-call teams per business service (4-5 engineers each) so teams owned what they built
- Reduced cognitive load and enabled proper rest rotation
Built Incident Management Discipline
- Introduced Incident Commanders to coordinate complex incidents
- Established post-incident review process with mandatory follow-through on action items
- Began tracking MTTA, MTTR, and other reliability KPIs to drive continuous improvement
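As a rough illustration of the KPI tracking above, MTTA and MTTR can be derived from three timestamps per incident. The sketch below is a minimal Python example, not the team's actual tooling: the `Incident` record, its field names, and the sample data are all assumptions made for illustration, and MTTR is measured here from detection to resolution (definitions vary between organizations).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; the field names are assumptions.
    opened: datetime        # alert fired or incident detected
    acknowledged: datetime  # an on-call engineer acknowledged the page
    resolved: datetime      # service restored


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: average of (acknowledged - opened)."""
    seconds = mean([(i.acknowledged - i.opened).total_seconds() for i in incidents])
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: average of (resolved - opened)."""
    seconds = mean([(i.resolved - i.opened).total_seconds() for i in incidents])
    return timedelta(seconds=seconds)


# Made-up sample data, for illustration only.
SAMPLE = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 10), datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 20), datetime(2024, 2, 9, 14, 40)),
]

if __name__ == "__main__":
    print("MTTA:", mtta(SAMPLE))
    print("MTTR:", mttr(SAMPLE))
```

Tracking these per service rather than only globally makes it easier to see which on-call team's numbers are moving.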
Created Visibility Through Monitoring
- Defined internal Service Level Objectives (SLOs) for each service
- Set up monitoring and alerting around SLOs to catch issues before customers noticed
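One common way to alert around an SLO is to track how quickly a service is spending its error budget. The following is a minimal sketch under stated assumptions, not a description of the platform's actual monitoring: it assumes a request-based availability SLO, and the function names, the SLO target, and the alert threshold are hypothetical values chosen for the example.

```python
def _check_target(slo_target: float) -> None:
    # An SLO of exactly 1.0 leaves no error budget, so it is rejected here.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    _check_target(slo_target)
    if total == 0:
        return 1.0  # no traffic in the window, so nothing was spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on schedule.
    """
    _check_target(slo_target)
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def should_alert(rate: float, threshold: float) -> bool:
    """Alert when the budget burns at or above the chosen threshold."""
    return rate >= threshold


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    print(error_budget_remaining(0.999, 1_000_000, 250))
    rate = burn_rate(0.999, 1_000_000, 2_000)
    print(rate, should_alert(rate, threshold=1.5))
```

Alerting on burn rate rather than on raw error counts ties each page to the SLO itself, which is what lets issues surface before customers notice them.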
Enabled Cultural Shift
- Deep ownership meant teams fixed root causes, not just symptoms, especially after out-of-hours incidents
- Continuous improvement cycles became routine, steadily strengthening reliability posture
The Results
- 99.99% uptime achieved and maintained, eliminating SLA violation risk
- MTTA improved 12x: mean time to acknowledge fell from 4 hours to consistently under 20 minutes
- Severe incidents dropped 90%: from 30/year to just 3/year
- On-demand deployment: Moved from scheduled Thursday-only deployments with downtime to zero-downtime deployments at any time
- Sustainable on-call: Healthy rotations with 4-5 engineers per team, ending the burnout cycle
- Proactive reliability culture: Teams routinely implementing structural improvements rather than fighting fires
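To put the headline figures in perspective, the arithmetic behind them can be checked in a few lines. This is an illustrative calculation only: it assumes a 365-day year, uses the before and after values quoted in this case study, and the function names are made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime permitted by a given availability percentage."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


def improvement_factor(before: float, after: float) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


def reduction_pct(before: float, after: float) -> float:
    """Percentage drop from 'before' to 'after'."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for pct in (98.0, 99.0, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per year")
    # MTTA: 4 hours (240 minutes) before, 20 minutes after.
    print("MTTA improvement factor:", improvement_factor(4 * 60, 20))
    # Severe incidents: 30 per year before, 3 per year after.
    print("Severe incident reduction (%):", reduction_pct(30, 3))
```

At 99.99%, the yearly downtime allowance is roughly 52.6 minutes, compared with about 3.65 days at 99% and 7.3 days at 98%.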
The Transformation
What started as a reliability crisis became a competitive advantage. By combining ownership, observability, and disciplined incident management, the platform now delivers consistent learning experiences to hundreds of thousands of daily users, with engineering teams that sleep at night.