SRE On-Call Quick Reference
A quick-reference guide for on-call engineers covering triage, investigation, escalation, communication, and post-incident review.
Triage Framework
Use the SIREN model to structure your initial response to any alert or incident report:
- Scope -- Determine what's affected. Is it one endpoint, one service, one region, or everything? Check your service dashboard and status page for correlated alerts.
- Impact -- Quantify user impact. How many users are affected? Is functionality degraded or completely unavailable? Check error rates and support ticket volume.
- Root cause hypothesis -- Form an initial hypothesis based on available data. Recent deployment? Infrastructure change? Dependency failure? Don't investigate yet -- just hypothesize.
- Escalate (if needed) -- If scope is broad (multiple services) or impact is high (>10% of users), escalate immediately. Don't wait to confirm the root cause before escalating.
- Notify -- Post an initial status update in the incident channel. Even "investigating" is better than silence.
Severity Classification
| Severity | Criteria | Response Time | Action |
|---|---|---|---|
| P1 - Critical | Service down or data loss affecting >50% of users | Immediate | Page on-call, open incident bridge |
| P2 - Major | Significant degradation or partial outage | 15 minutes | Notify on-call, start investigation |
| P3 - Minor | Minor degradation, workaround available | 1 hour | Create ticket, investigate during business hours |
| P4 - Low | Cosmetic or non-user-facing issue | Next business day | Create ticket, prioritize in backlog |
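The classification rules above can be sketched as a small helper. This is illustrative only: the function name and input signals are hypothetical, and the thresholds simply mirror the table's criteria.

```python
# Hypothetical triage helper; thresholds mirror the severity table above.
def classify_severity(pct_users_affected: float, service_down: bool,
                      data_loss: bool, workaround_available: bool) -> str:
    """Return a P1-P4 label from coarse impact signals."""
    if (service_down or data_loss) and pct_users_affected > 50:
        return "P1"  # page on-call, open incident bridge
    if pct_users_affected > 0 and not workaround_available:
        return "P2"  # notify on-call, start investigation within 15 minutes
    if pct_users_affected > 0:
        return "P3"  # create ticket, investigate during business hours
    return "P4"      # cosmetic or non-user-facing
```

Encoding the table this way keeps paging decisions consistent across responders, though in practice a human still overrides the label when the signals are ambiguous.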
Common Investigation Patterns
Follow this systematic approach for the three most common alert types.
High Latency Investigation
- Open the service dashboard -- check if latency increase is across all endpoints or specific ones
- Check recent deployments -- run `git log --since="2 hours ago" --oneline` in the affected service
- Find a slow trace -- filter traces by duration >P99 and examine the waterfall for the slow span
- Check dependency latency -- database query time, external API response time, cache hit rate
- Check resource utilization -- CPU throttling, memory pressure (GC pauses), connection pool exhaustion
Error Spike Investigation
- Identify the error type -- are errors 4xx (client) or 5xx (server)? Group by error message or status code
- Check if errors correlate with a specific endpoint, user segment, or geographic region
- Search logs for the error message -- a `trace_id` in the log lets you pull the full distributed trace
- Check dependency health -- connection refused, timeouts, and authentication errors point to downstream failures
- Check recent config changes -- feature flags, environment variables, secret rotations
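The first step, splitting 4xx from 5xx and grouping by code, can be sketched over a window of request logs. The log field name (`status`) is an assumption; substitute whatever your log schema uses.

```python
# Sketch: group a window of request logs by error class. Field name assumed.
from collections import Counter

def group_errors(logs: list[dict]) -> dict:
    """Count 4xx vs 5xx responses and break them down by status code."""
    by_code = Counter(e["status"] for e in logs if e["status"] >= 400)
    client = sum(n for code, n in by_code.items() if code < 500)
    server = sum(n for code, n in by_code.items() if code >= 500)
    return {"by_code": dict(by_code), "client_4xx": client, "server_5xx": server}
```

A spike dominated by 4xx usually points at a client or caller change (bad request payloads, expired tokens), while 5xx points at the service itself or its dependencies.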
Memory Leak Investigation
- Confirm the pattern -- heap size should be monotonically increasing across GC cycles, not a healthy sawtooth
- Check when it started -- correlate with deployments; `git diff` between the last stable and current version
- Take a heap profile -- Go: `go tool pprof http://service:6060/debug/pprof/heap`. Node.js: `--inspect` flag + Chrome DevTools
- Look for unbounded caches, leaked goroutines/connections, or event listeners that are never removed
Escalation Procedures
Escalation is not a failure -- it's a tool. Escalate early and often when the situation warrants it.
When to Escalate
- Impact affects more than one service or team's domain
- You've been investigating for 15 minutes without a clear root cause
- The fix requires access or expertise you don't have
- Customer-facing impact is growing (error rate increasing, not stable)
- The incident involves data integrity or security
How to Escalate
Page the relevant team's on-call with a structured handoff message:
INCIDENT HANDOFF
Service: [affected service name]
Started: [timestamp in UTC]
Impact: [quantified -- X% error rate, Y users affected]
What I've checked:
- [investigation step 1 and result]
- [investigation step 2 and result]
Hypothesis: [your best guess at root cause]
Ask: [specific help needed -- access, expertise, approval]
Escalation Contacts
Maintain a living document (in your wiki or runbook) with escalation contacts for each service. Include: primary on-call, secondary on-call, team lead, and relevant Slack channel. Review and update quarterly.
Communication Templates
Clear, timely communication reduces user anxiety and internal confusion during incidents.
Internal Slack Update (Every 30 Minutes)
INCIDENT UPDATE - [Service Name] - [Severity]
Status: [Investigating | Identified | Monitoring | Resolved]
Impact: [Current user impact in plain language]
Root cause: [Known | Suspected: brief description | Unknown]
Next action: [What we're doing next]
ETA: [Estimated resolution time or "unknown"]
Incident lead: @[name]
External Status Page Update
[Service Name] - [Degraded Performance | Partial Outage | Major Outage]
We are aware of issues affecting [specific functionality].
[X]% of requests are experiencing [errors/increased latency].
Our team is actively investigating.
Next update in 30 minutes or sooner if we have new information.
Customer Notification (for P1/P2)
Subject: [Service Name] Service Disruption - [Date]
We're experiencing an issue affecting [specific functionality].
Impact: [Clear description of what users can't do]
Workaround: [If available, describe it]
Status: Our engineering team is actively working on a fix.
We'll provide an update within [30 minutes/1 hour].
For urgent needs, contact support@[company].com.
Update Cadence
- P1: Update every 15-30 minutes until resolved
- P2: Update every 30-60 minutes until resolved
- P3/P4: Update when status changes (investigating -> identified -> resolved)
Post-Incident Checklist
Complete these steps within 48 hours of incident resolution. The goal is learning, not blame.
Immediately After Resolution
- Post a final status update confirming resolution in all channels (Slack, status page, customer notification)
- Save relevant dashboards, traces, and log queries -- retention windows mean this evidence disappears over time
- Write a brief timeline while events are fresh: when detected, when investigated, when resolved
Within 24 Hours
- Schedule a blameless post-incident review with all participants (30-60 minutes)
- Prepare the review document with: timeline, root cause, impact metrics, detection method, resolution steps
- Identify contributing factors beyond the immediate root cause (missing monitoring, unclear runbook, slow escalation)
During the Review
- Walk through the timeline together -- fill in gaps from different perspectives
- Ask "what surprised us?" and "what would have helped us detect/resolve this faster?"
- Generate action items with specific owners and deadlines (not "improve monitoring" but "add P99 latency alert on checkout service by March 20")
After the Review
- File action items as tracked tickets in your project management tool
- Update runbooks with any new investigation steps learned during this incident
- Track recurrence -- if the same root cause appears twice, escalate priority of the fix
- Share the post-incident report with the broader engineering organization for collective learning