SRE On-Call Quick Reference
A quick-reference guide for on-call engineers covering triage, investigation, escalation, communication, and post-incident review.
Triage Framework
Use the SIREN model to structure your initial response to any alert or incident report:
- Scope -- Determine what's affected. Is it one endpoint, one service, one region, or everything? Check your service dashboard and status page for correlated alerts.
- Impact -- Quantify user impact. How many users are affected? Is functionality degraded or completely unavailable? Check error rates and support ticket volume.
- Root cause hypothesis -- Form an initial hypothesis based on available data. Recent deployment? Infrastructure change? Dependency failure? Don't investigate yet -- just hypothesize.
- Escalate (if needed) -- If scope is broad (multiple services) or impact is high (>10% of users), escalate immediately. Don't wait to confirm the root cause before escalating.
- Notify -- Post an initial status update in the incident channel. Even "investigating" is better than silence.
Severity Classification
| Severity | Criteria | Response Time | Action |
|---|---|---|---|
| P1 - Critical | Service down or data loss affecting >50% of users | Immediate | Page on-call, open incident bridge |
| P2 - Major | Significant degradation or partial outage | 15 minutes | Notify on-call, start investigation |
| P3 - Minor | Minor degradation, workaround available | 1 hour | Create ticket, investigate during business hours |
| P4 - Low | Cosmetic or non-user-facing issue | Next business day | Create ticket, prioritize in backlog |
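The classification rules above can be sketched as a small helper. This is illustrative only: the function name and input signals are hypothetical, and the thresholds simply mirror the table's criteria.

```python
# Hypothetical triage helper; thresholds mirror the severity table above.
def classify_severity(pct_users_affected: float, service_down: bool,
                      data_loss: bool, workaround_available: bool) -> str:
    """Return a P1-P4 label from coarse impact signals."""
    if (service_down or data_loss) and pct_users_affected > 50:
        return "P1"  # page on-call, open incident bridge
    if pct_users_affected > 0 and not workaround_available:
        return "P2"  # notify on-call, start investigation within 15 minutes
    if pct_users_affected > 0:
        return "P3"  # create ticket, investigate during business hours
    return "P4"      # cosmetic or non-user-facing
```

Encoding the table this way keeps paging decisions consistent across responders, though in practice a human still overrides the label when the signals are ambiguous.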
Common Investigation Patterns
Follow this systematic approach for the three most common alert types.
High Latency Investigation
- Open the service dashboard -- check if latency increase is across all endpoints or specific ones
- Check recent deployments -- run `git log --since="2 hours ago" --oneline` in the affected service
- Find a slow trace -- filter traces by duration >P99 and examine the waterfall for the slow span
- Check dependency latency -- database query time, external API response time, cache hit rate
- Check resource utilization -- CPU throttling, memory pressure (GC pauses), connection pool exhaustion
Error Spike Investigation
- Identify the error type -- are errors 4xx (client) or 5xx (server)? Group by error message or status code
- Check if errors correlate with a specific endpoint, user segment, or geographic region
- Search logs for the error message -- a `trace_id` in the log lets you pull the full distributed trace
- Check dependency health -- connection refused, timeouts, and authentication errors point to downstream failures
- Check recent config changes -- feature flags, environment variables, secret rotations
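The first step, splitting 4xx from 5xx and grouping by code, can be sketched over a window of request logs. The log field name (`status`) is an assumption; substitute whatever your log schema uses.

```python
# Sketch: group a window of request logs by error class. Field name assumed.
from collections import Counter

def group_errors(logs: list[dict]) -> dict:
    """Count 4xx vs 5xx responses and break them down by status code."""
    by_code = Counter(e["status"] for e in logs if e["status"] >= 400)
    client = sum(n for code, n in by_code.items() if code < 500)
    server = sum(n for code, n in by_code.items() if code >= 500)
    return {"by_code": dict(by_code), "client_4xx": client, "server_5xx": server}
```

A spike dominated by 4xx usually points at a client or caller change (bad request payloads, expired tokens), while 5xx points at the service itself or its dependencies.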
Memory Leak Investigation
- Confirm the pattern -- heap size should be monotonically increasing across GC cycles, not a healthy sawtooth
- Check when it started -- correlate with deployments; `git diff` between the last stable and current version
- Take a heap profile -- Go: `go tool pprof http://service:6060/debug/pprof/heap`. Node.js: `--inspect` flag + Chrome DevTools
- Look for unbounded caches, leaked goroutines/connections, or event listeners that are never removed
Escalation Procedures
Escalation is not a failure -- it's a tool. Escalate early and often when the situation warrants it.
When to Escalate
- Impact affects more than one service or team's domain
- You've been investigating for 15 minutes without a clear root cause
- The fix requires access or expertise you don't have
- Customer-facing impact is growing (error rate increasing, not stable)
- The incident involves data integrity or security
How to Escalate
Page the relevant team's on-call with a structured handoff message:
INCIDENT HANDOFF
Service: [affected service name]
Started: [timestamp in UTC]
Impact: [quantified -- X% error rate, Y users affected]
What I've checked:
- [investigation step 1 and result]
- [investigation step 2 and result]
Hypothesis: [your best guess at root cause]
Ask: [specific help needed -- access, expertise, approval]
Escalation Contacts
Maintain a living document (in your wiki or runbook) with escalation contacts for each service. Include: primary on-call, secondary on-call, team lead, and relevant Slack channel. Review and update quarterly.
Communication Templates
Clear, timely communication reduces user anxiety and internal confusion during incidents.
Internal Slack Update (Every 30 Minutes)
INCIDENT UPDATE - [Service Name] - [Severity]
Status: [Investigating | Identified | Monitoring | Resolved]
Impact: [Current user impact in plain language]
Root cause: [Known | Suspected: brief description | Unknown]
Next action: [What we're doing next]
ETA: [Estimated resolution time or "unknown"]
Incident lead: @[name]
External Status Page Update
[Service Name] - [Degraded Performance | Partial Outage | Major Outage]
We are aware of issues affecting [specific functionality].
[X]% of requests are experiencing [errors/increased latency].
Our team is actively investigating.
Next update in 30 minutes or sooner if we have new information.
Customer Notification (for P1/P2)
Subject: [Service Name] Service Disruption - [Date]
We're experiencing an issue affecting [specific functionality].
Impact: [Clear description of what users can't do]
Workaround: [If available, describe it]
Status: Our engineering team is actively working on a fix.
We'll provide an update within [30 minutes/1 hour].
For urgent needs, contact support@[company].com.
Update Cadence
- P1: Update every 15-30 minutes until resolved
- P2: Update every 30-60 minutes until resolved
- P3/P4: Update when status changes (investigating -> identified -> resolved)
Post-Incident Checklist
Complete these steps within 48 hours of incident resolution. The goal is learning, not blame.
Immediately After Resolution
- Post a final status update confirming resolution in all channels (Slack, status page, customer notification)
- Save relevant dashboards, traces, and log queries -- retention windows mean this evidence disappears over time
- Write a brief timeline while events are fresh: when detected, when investigated, when resolved
Within 24 Hours
- Schedule a blameless post-incident review with all participants (30-60 minutes)
- Prepare the review document with: timeline, root cause, impact metrics, detection method, resolution steps
- Identify contributing factors beyond the immediate root cause (missing monitoring, unclear runbook, slow escalation)
During the Review
- Walk through the timeline together -- fill in gaps from different perspectives
- Ask "what surprised us?" and "what would have helped us detect/resolve this faster?"
- Generate action items with specific owners and deadlines (not "improve monitoring" but "add P99 latency alert on checkout service by March 20")
After the Review
- File action items as tracked tickets in your project management tool
- Update runbooks with any new investigation steps learned during this incident
- Track recurrence -- if the same root cause appears twice, escalate priority of the fix
- Share the post-incident report with the broader engineering organization for collective learning