Production Monitoring Checklist

Ensure your production environment is fully observable with this checklist covering infrastructure, applications, tracing, logs, and incident response.

Operations · Checklist · Intermediate · 12 min | Updated March 2026

Infrastructure Monitoring

CPU/memory/disk alerts per host

Set alerts at 80% CPU sustained for 5 minutes, 85% memory, and 90% disk. Per-host granularity catches hotspots that aggregate metrics hide.

Container orchestrator health

Monitor pod restart counts (>3 in 10 minutes indicates crash loops), OOM kills, and pending pods. In Kubernetes, track node NotReady events and eviction pressure.

Network latency between availability zones

Baseline cross-AZ latency (typically 0.5-2ms) and alert on 3x deviations. Increased latency between zones often precedes cascading failures in distributed systems.

SSL certificate expiry monitoring

Alert at 30, 14, and 7 days before expiry. Automate renewal with cert-manager or ACME. Expired certificates cause total service outages with no graceful degradation.

DNS resolution monitoring

Monitor DNS lookup time and failure rate from multiple vantage points. DNS failures cause connection errors that look like application bugs in metrics and logs.

Application Health

HTTP error rate by endpoint

Track 4xx and 5xx rates per endpoint, not just aggregate. A healthy aggregate can hide a completely broken endpoint that serves 10% of traffic.
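The aggregate-hides-a-broken-endpoint effect is easy to demonstrate with a toy request log (endpoints and counts below are made up for illustration):

```python
from collections import defaultdict

# Toy request log: a 1% aggregate error rate hides that /export
# fails on every single call.
requests = [("/home", 200)] * 820 + [("/search", 200)] * 170 + [("/export", 500)] * 10

by_endpoint = defaultdict(lambda: [0, 0])  # endpoint -> [errors, total]
for endpoint, status in requests:
    by_endpoint[endpoint][1] += 1
    if status >= 500:
        by_endpoint[endpoint][0] += 1

total_errors = sum(errors for errors, _ in by_endpoint.values())
print(f"aggregate: {total_errors / len(requests):.1%}")   # aggregate: 1.0%
for endpoint, (errors, total) in by_endpoint.items():
    print(f"{endpoint}: {errors / total:.1%}")            # /export: 100.0%
```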

Latency percentiles (P50/P95/P99)

Always track P95 and P99, never just averages. An average of 100ms can hide a P99 of 5 seconds, meaning one in every hundred requests is painfully slow -- and your heaviest users, who make the most requests, hit that tail most often.
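The gap between the mean and the tail can be shown with the stdlib on synthetic latencies (the distribution below is contrived to make the point):

```python
import statistics

# 99 fast requests plus one 5-second outlier per hundred: the mean
# looks healthy while P99 exposes the tail.
latencies_ms = [50.0] * 99 + [5000.0]

mean = statistics.mean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"mean={mean:.1f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```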

Request throughput with anomaly detection

Baseline normal request rates by time of day and day of week. A sudden 50% drop in throughput often indicates an upstream failure or routing issue, not reduced demand.
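A minimal sketch of the drop check, assuming you already maintain a per-hour baseline (how you build that baseline is out of scope here):

```python
def throughput_anomaly(current_rps: float, baseline_rps: float,
                       drop_threshold: float = 0.5) -> bool:
    """Flag when throughput falls more than drop_threshold below baseline."""
    if baseline_rps <= 0:
        return False  # no baseline yet for this time bucket
    return (baseline_rps - current_rps) / baseline_rps > drop_threshold

# Baseline for this hour is 1200 rps; 500 rps now is a ~58% drop.
print(throughput_anomaly(500, 1200))  # True
```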

Dependency health checks

Implement active health checks for databases, caches, queues, and external APIs. Health check endpoints should verify actual connectivity (run a simple query), not just return 200 OK.
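A sketch of a check that verifies real connectivity by running a query; sqlite3 stands in for your production database here, and the response shape is an assumption:

```python
import sqlite3

def check_database(conn) -> dict:
    """Run a trivial query so the check proves actual connectivity."""
    try:
        conn.execute("SELECT 1").fetchone()
        return {"status": "ok"}
    except Exception as exc:
        return {"status": "down", "error": str(exc)}

conn = sqlite3.connect(":memory:")
print(check_database(conn))  # {'status': 'ok'}
```

A static `return 200` would pass even with the connection pool exhausted; the query is the whole point of the check.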

Memory leak detection

Track heap size over time and alert on monotonic growth across garbage collection cycles. A service that grows 10MB/hour will OOM in days -- catch it before it pages you at 3am.
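The detection rule itself is simple: alert when post-GC heap samples only ever grow over a window. The sample values below are illustrative, not real measurements:

```python
def monotonic_growth(samples_mb: list[float], window: int = 6) -> bool:
    """True if the last `window` post-GC heap samples strictly increased."""
    recent = samples_mb[-window:]
    if len(recent) < window:
        return False  # not enough data yet
    return all(b > a for a, b in zip(recent, recent[1:]))

healthy = [412, 430, 405, 441, 409, 428]  # sawtooth: GC reclaims memory
leaking = [410, 422, 431, 445, 452, 467]  # monotonic: suspect a leak
print(monotonic_growth(healthy), monotonic_growth(leaking))  # False True
```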

Distributed Tracing

Verify trace propagation across all service boundaries

Send a test request through your full call chain and verify every service appears as a span. Missing spans mean broken context propagation -- usually a missing middleware or uninstrumented client.

Set up trace-based alerting for critical paths

Create alerts on end-to-end latency for critical user journeys (checkout, login, search). Trace-based alerts catch issues that per-service metrics miss because they span multiple services.

Monitor span error rates per service

Track the percentage of spans marked as errors for each service. A service with a 5% span error rate may look healthy in HTTP metrics but is silently failing on specific operations.

Track trace completeness

Monitor for orphan spans (spans with a parent_id that doesn't exist in the trace). Orphan spans indicate broken context propagation at service boundaries, often in async workers or queue consumers.
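Orphan detection reduces to a set-membership check over one trace's spans. The `span_id`/`parent_id` field names follow common tracing conventions but are assumptions, not a specific SDK's schema:

```python
def find_orphans(spans: list[dict]) -> list[str]:
    """Return span_ids whose parent_id is set but missing from the trace."""
    span_ids = {s["span_id"] for s in spans}
    return [s["span_id"] for s in spans
            if s["parent_id"] is not None and s["parent_id"] not in span_ids]

trace = [
    {"span_id": "a", "parent_id": None},   # root span
    {"span_id": "b", "parent_id": "a"},
    {"span_id": "c", "parent_id": "zzz"},  # parent never arrived: orphan
]
print(find_orphans(trace))  # ['c']
```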

Sample rate producing sufficient data for debugging

Verify your sampling rate captures enough traces to debug issues. At 1% sampling, a bug affecting 0.1% of requests may produce zero sampled traces. Always sample errors at 100%.
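The "always sample errors at 100%" rule can be sketched as a head-sampling decision with an error override; this is an illustration of the policy, not any tracing library's sampler API:

```python
import random

def should_sample(is_error: bool, base_rate: float = 0.01,
                  rng: random.Random = random) -> bool:
    """Keep every error trace; keep successes at the base rate."""
    if is_error:
        return True  # errors are always sampled
    return rng.random() < base_rate

print(should_sample(is_error=True))  # True, regardless of base rate
```

Tail-based sampling (deciding after the trace completes) achieves the same goal more precisely, at the cost of buffering spans.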

Log Aggregation

Structured logging (JSON) in all services

Every service should emit JSON logs with consistent field names. Structured logs are searchable, parseable, and enable automated analysis. Unstructured logs are nearly useless at scale.
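A minimal JSON formatter using only the stdlib; the field names (`ts`, `level`, `service`, `msg`) and the service name are a suggested convention, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("payment gateway slow")  # one parseable JSON line
```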

Correlation IDs in every log line

Include trace_id and span_id in every log entry. This lets you jump from a log line to the full distributed trace, connecting application-level events to system-level performance data.
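One stdlib-only way to do this is a `contextvars` variable stamped onto every record by a logging filter; a real setup would read the ID from incoming trace headers in middleware, which is elided here:

```python
import contextvars
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default=None)

class CorrelationFilter(logging.Filter):
    """Attach the current trace_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True

logging.basicConfig(format="%(message)s trace_id=%(trace_id)s")
log = logging.getLogger("demo")
log.addFilter(CorrelationFilter())

trace_id_var.set(uuid.uuid4().hex)  # set once per request, e.g. in middleware
log.warning("charging card")        # every line now carries the trace_id
```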

Log retention policy matching compliance requirements

Define retention based on compliance needs: PCI DSS requires 1 year, HIPAA requires 6 years, SOC 2 varies. Archive to cold storage after 30 days for cost efficiency.

Log-based alerts for critical errors

Alert on specific error patterns that indicate system failures: database connection refused, authentication service unavailable, payment gateway timeout. These are often faster than metric-based alerts.

No PII in log messages

Audit log output quarterly for personally identifiable information. Redact or hash email addresses, phone numbers, IP addresses, and any field that could identify a user. PII in logs creates compliance liability.
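A sketch of pattern-based redaction; the regexes below are illustrative and deliberately incomplete -- production redaction needs field-level allowlists, not just pattern matching on free text:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(message: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    message = EMAIL.sub("[email]", message)
    return IPV4.sub("[ip]", message)

print(redact("login failed for jane@example.com from 203.0.113.7"))
# login failed for [email] from [ip]
```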

Incident Response Readiness

Runbook for each alert

Every alert must have a linked runbook with investigation steps, likely root causes, and remediation actions. Engineers should never see an alert and wonder what to do next.

Escalation path documented and tested

Define who to escalate to, when to escalate, and how (page vs Slack vs email). Test the escalation path monthly -- stale escalation policies cause delayed response during real incidents.

Status page integrated with monitoring

Connect your status page (Statuspage, Instatus, etc.) to monitoring so it updates automatically when services degrade. Manual status updates are always late and often forgotten during incidents.

Post-incident review template ready

Have a blameless post-incident review template covering: timeline, root cause, impact, detection time, resolution time, and action items with owners and deadlines.

Chaos engineering schedule

Run monthly failure injection exercises: kill a service, introduce latency, revoke database credentials. Chaos engineering validates that your monitoring detects failures before users report them.

Ready to implement?

TraceKit helps you implement these practices with live breakpoints, distributed tracing, and production debugging.