Stan Pogrebnyak · Engineering Leadership & Consultancy

How the department I led shaped practices and ownership to achieve and sustain 99.99% uptime.

The Client

Global e-learning platform serving 400,000 daily users across 120 countries, with over 1 million courses completed annually.

The Challenge

Availability at risk: System availability frequently dropped to 98-99% or lower, putting customer SLAs in jeopardy
Unsustainable on-call: Only 2 engineers were on call for the entire ecosystem, including services they didn't build, which led to burnout and slow response
Deployment fear: Every deployment caused downtime and was restricted to a Thursday window; database scripts had to be carefully scheduled to minimize customer impact
No visibility: No systematic monitoring, incident coordination, or improvement process
Constant severe incidents: ~30 severe incidents per year, disrupting learning for hundreds of thousands of users

What We Did

Restructured On-Call for Ownership

Created dedicated on-call teams per business service (4-5 engineers each) so teams owned what they built
Reduced cognitive load and enabled proper rest rotation

Built Incident Management Discipline

Introduced Incident Commanders to coordinate complex incidents
Established post-incident review process with mandatory follow-through on action items
Began tracking MTTA, MTTR, and other reliability KPIs to drive continuous improvement
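
As a rough illustration of what tracking MTTA and MTTR involves, the sketch below averages acknowledgement and resolution times over a set of incident records. The `Incident` class, its field names, and the sample data are hypothetical, and measuring MTTR from the moment the alert opened is one common convention, assumed here rather than taken from the engagement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    opened: datetime        # when the alert fired
    acknowledged: datetime  # when a responder took ownership
    resolved: datetime      # when service was restored


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: average of (acknowledged - opened)."""
    seconds = [(i.acknowledged - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve, measured here from open to resolution."""
    seconds = [(i.resolved - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up sample data, not taken from the engagement.
sample = [
    Incident(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 10), datetime(2024, 1, 1, 3, 0)),
    Incident(datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 20), datetime(2024, 2, 1, 16, 0)),
]

print("MTTA:", mtta(sample))
print("MTTR:", mttr(sample))
```

Computing both figures from the same records each review cycle makes trends visible, such as a drop in acknowledgement time after an on-call restructuring.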

Created Visibility Through Monitoring

Defined internal Service Level Objectives (SLOs) for each service
Set up monitoring and alerting around SLOs to catch issues before customers noticed
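
One common way to alert around an SLO is to watch how quickly the error budget is being consumed (the "burn rate") rather than alerting on individual errors. The sketch below is a minimal illustration of that idea. The function names, the request counts, and the paging threshold of 10 are assumptions chosen for the example, not the values used on this platform.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """How fast the error budget is being spent in an observation window.

    A value of 1.0 means the budget would last exactly the SLO period;
    higher values mean it would run out early.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    budget = 1.0 - slo  # allowed fraction of bad events
    return error_rate / budget


def should_page(bad_events: int, total_events: int, slo: float,
                threshold: float = 10.0) -> bool:
    """Page only when the budget is burning much faster than planned.

    The default threshold is an illustrative assumption.
    """
    return burn_rate(bad_events, total_events, slo) >= threshold


# Hypothetical window: 10,000 requests against a 99.9% SLO.
print(should_page(bad_events=200, total_events=10_000, slo=0.999))
print(should_page(bad_events=5, total_events=10_000, slo=0.999))
```

Alerting on budget consumption rather than raw error counts ties each page to customer-facing impact, which helps keep on-call noise down.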

Enabled Cultural Shift

Deep ownership meant teams fixed root causes, not just symptoms—especially after out-of-hours incidents
Continuous improvement cycles became routine, steadily strengthening reliability posture

The Results

99.99% uptime achieved and maintained, eliminating SLA violation risk
MTTA improved 12x: mean time to acknowledge fell from 4 hours to consistently under 20 minutes
Severe incidents dropped 90%: from 30/year to just 3/year
On-demand deployments: Moved from Thursday-only deployments with downtime to zero-downtime deployments at any time
Sustainable on-call: Healthy rotations with 4-5 engineers per team, ending the burnout cycle
Proactive reliability culture: Teams routinely implement structural improvements rather than fighting fires
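
To put the availability figures in perspective, the short calculation below converts an availability percentage into the downtime it permits over a year. It is plain arithmetic, assuming a 365-day year and treating availability as a simple fraction of elapsed time; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year, assumed for round numbers


def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year permitted at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


for level in (0.98, 0.99, 0.9999):
    hours = annual_downtime_hours(level)
    print(f"{level:.2%} availability -> {hours:.2f} hours "
          f"({hours * 60:.0f} minutes) of downtime per year")
```

Under these assumptions, 98% availability allows roughly 175 hours of downtime a year and 99% about 88 hours, while 99.99% allows under 53 minutes.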

The Transformation

What started as a reliability crisis became a competitive advantage. By combining ownership, observability, and disciplined incident management, the platform now delivers consistent learning experiences to hundreds of thousands of daily users—with engineering teams that sleep at night.

Building a Culture of Reliability: The Journey to 99.99% Uptime
