Designing Escalation Chains That Actually Work at 3 AM

2026-01-05

Every engineering organization has an escalation policy. The primary on-call gets paged first. If they do not acknowledge within five minutes, the alert goes to the secondary. If the secondary also misses it, the alert moves on to the team lead, then the engineering manager, until eventually someone's phone rings at 3 AM.

The problem is that this linear chain assumes each person is reachable, awake, and able to act. In practice, the primary might be in a dead zone, the secondary might have silenced notifications after a noisy false positive the previous week, and the team lead might be traveling internationally on a different phone number.

ATT's escalation engine addresses these realities with three design choices. First, we support parallel fan-out at every tier: instead of a single primary, you can define a primary group whose members are all paged simultaneously, and any single acknowledgement satisfies the tier. Second, we offer multi-channel delivery: each tier can specify phone call, SMS, email, Slack, and push notification, all sent at the same time to maximize the chance of reaching a human. Third, we track acknowledgement latency by individual and time of day, and surface reports that help teams tune their policies based on actual response patterns rather than assumptions.
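
To make the fan-out and latency-tracking ideas concrete, here is a minimal Python sketch. It is not ATT's actual engine or API: the names Tier, notifications, escalate, and median_ack_by_hour are hypothetical, the responders are made up, and the 300-second timeout is simply borrowed from the five-minute example earlier in this post. It only simulates who would satisfy a tier, assuming we already know how long each person takes to acknowledge.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


# Hypothetical model for illustration only, not ATT's real API.
@dataclass(frozen=True)
class Tier:
    name: str
    members: tuple[str, ...]   # everyone listed is paged at the same time
    channels: tuple[str, ...]  # every channel fires at once for each member
    timeout_s: float = 300.0   # assumed five-minute window before escalating


def notifications(tier: Tier) -> list[tuple[str, str]]:
    """Fan out one (member, channel) notification per combination."""
    return [(m, c) for m in tier.members for c in tier.channels]


def escalate(
    policy: list[Tier], ack_delay_s: dict[str, float]
) -> Optional[tuple[str, str, float]]:
    """Simulate a page walking down the policy.

    ack_delay_s maps a member to how many seconds they take to acknowledge
    once their tier is paged; unreachable members are simply absent.
    Returns (tier name, member, seconds since the first page), or None if
    every tier times out.
    """
    elapsed = 0.0
    for tier in policy:
        in_time = [
            (ack_delay_s[m], m)
            for m in tier.members
            if m in ack_delay_s and ack_delay_s[m] <= tier.timeout_s
        ]
        if in_time:
            delay, member = min(in_time)  # any single ack satisfies the tier
            return tier.name, member, elapsed + delay
        elapsed += tier.timeout_s  # nobody answered in time, so escalate
    return None


def median_ack_by_hour(
    records: list[tuple[str, datetime, float]]
) -> dict[tuple[str, int], float]:
    """Median ack latency keyed by (member, hour of day the page was sent)."""
    buckets: dict[tuple[str, int], list[float]] = defaultdict(list)
    for member, paged_at, delay_s in records:
        buckets[(member, paged_at.hour)].append(delay_s)
    return {key: median(delays) for key, delays in buckets.items()}


policy = [
    Tier("primary", ("ana", "ben"), ("phone", "sms", "push")),
    Tier("secondary", ("cy",), ("phone", "sms")),
]
# ana is unreachable, but ben's ack alone closes out the primary tier.
print(escalate(policy, {"ben": 42.0}))  # ('primary', 'ben', 42.0)
```

In a strictly linear chain, an unreachable primary costs the full timeout before anyone else is even tried. With a group at the top tier, the same page is resolved by whichever member answers first. Bucketing latencies by (member, hour) is one simple way to see where after-hours response actually degrades.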

Since rolling out these features, our clients have reduced mean time to acknowledge for critical after-hours alerts from 11 minutes to under 3 minutes, a difference that has directly kept several near misses from turning into production incidents.