Production Monitoring Checklist
Ensure your production environment is fully observable with this checklist covering infrastructure, applications, tracing, logs, and incident response.
Infrastructure Monitoring
Set alerts at 80% CPU sustained for 5 minutes, 85% memory, and 90% disk. Per-host granularity catches hotspots that aggregate metrics hide.
Monitor pod restart counts (>3 in 10 minutes indicates crash loops), OOM kills, and pending pods. In Kubernetes, track node NotReady events and eviction pressure.
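The >3-restarts-in-10-minutes rule can be sketched as a sliding-window check. This is an illustrative helper, not a Kubernetes API -- assume `restart_times` comes from your metrics store:

```python
from datetime import datetime, timedelta

def is_crash_looping(restart_times, window_minutes=10, threshold=3):
    """True if more than `threshold` restarts fall inside any
    `window_minutes` sliding window (the >3-in-10-minutes rule)."""
    times = sorted(restart_times)
    window = timedelta(minutes=window_minutes)
    for i, start in enumerate(times):
        # Count restarts in [start, start + window].
        if sum(1 for t in times[i:] if t - start <= window) > threshold:
            return True
    return False
```

Four restarts two minutes apart trips the alert; four restarts twenty minutes apart is an ordinary rolling deploy.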
Baseline cross-AZ latency (typically 0.5-2ms) and alert on 3x deviations. Increased latency between zones often precedes cascading failures in distributed systems.
Alert on TLS certificate expiry at 30, 14, and 7 days out. Automate renewal with cert-manager or another ACME client. Expired certificates cause total service outages with no graceful degradation.
Monitor DNS lookup time and failure rate from multiple vantage points. DNS failures cause connection errors that look like application bugs in metrics and logs.
Application Health
Track 4xx and 5xx rates per endpoint, not just aggregate. A healthy aggregate can hide a completely broken endpoint that serves 10% of traffic.
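A toy illustration of why the breakdown matters, assuming request logs reduced to (endpoint, status) pairs:

```python
from collections import defaultdict

def error_rates_by_endpoint(requests):
    """requests: iterable of (endpoint, status_code) pairs.
    Returns {endpoint: fraction of 5xx responses}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for endpoint, status in requests:
        totals[endpoint] += 1
        if status >= 500:
            errors[endpoint] += 1
    return {ep: errors[ep] / totals[ep] for ep in totals}
```

With 90 healthy requests to one endpoint and 10 failing requests to another, the aggregate error rate is a tolerable-looking 10% while the second endpoint is 100% broken.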
Always track P95 and P99, never just averages. An average of 100ms can hide a P99 of 5 seconds -- a slowdown that hits one in every hundred requests, often your heaviest and most valuable users.
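To make the average-vs-tail gap concrete, here is the nearest-rank percentile (one common definition -- production systems usually use streaming estimators such as histograms or t-digest):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least
    p% of samples at or below it."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 98 fast requests and 2 pathological ones: the mean looks healthy,
# but P99 surfaces the 5-second tail.
latencies_ms = [50] * 98 + [5000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)  # 149.0 ms
p99_ms = percentile(latencies_ms, 99)            # 5000 ms
```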
Baseline normal request rates by time of day and day of week. A sudden 50% drop in throughput often indicates an upstream failure or routing issue, not reduced demand.
Implement active health checks for databases, caches, queues, and external APIs. Health check endpoints should verify actual connectivity (run a simple query), not just return 200 OK.
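A minimal sketch of a deep health check, framework-agnostic and with hypothetical dependency names -- each probe does real work (run `SELECT 1`, PING the cache) rather than returning a static 200:

```python
def deep_health_check(checks):
    """checks maps dependency name -> zero-arg probe that raises on
    failure. Returns (http_status, per-dependency detail)."""
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any failing probe degrades the check
            results[name] = f"failed: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

Returning per-dependency detail means the on-call engineer sees *which* dependency is down, not just that the service is unhealthy.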
Track heap size over time and alert on monotonic growth across garbage collection cycles. A service that grows 10MB/hour will OOM in days -- catch it before it pages you at 3am.
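One way to turn that into an alert condition, assuming heap readings sampled at a fixed interval and taken right after a full GC cycle (the function and threshold are illustrative):

```python
def leak_suspected(heap_mb_after_gc, min_growth_mb_per_sample=5.0):
    """Flags a likely leak when every post-GC reading exceeds the last
    AND average growth per sample crosses a threshold. Monotonic
    post-GC growth is the signature of retained objects, whereas a
    sawtooth is normal allocation churn."""
    diffs = [b - a for a, b in zip(heap_mb_after_gc, heap_mb_after_gc[1:])]
    if not diffs:
        return False  # need at least two samples to see a trend
    return all(d > 0 for d in diffs) and sum(diffs) / len(diffs) >= min_growth_mb_per_sample
```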
Distributed Tracing
Send a test request through your full call chain and verify every service appears as a span. Missing spans mean broken context propagation -- usually a missing middleware or uninstrumented client.
Create alerts on end-to-end latency for critical user journeys (checkout, login, search). Trace-based alerts catch issues that per-service metrics miss because they span multiple services.
Track the percentage of spans marked as errors for each service. A service with a 5% span error rate may look healthy in HTTP metrics but is silently failing on specific operations.
Monitor for orphan spans (spans with a parent_id that doesn't exist in the trace). Orphan spans indicate broken context propagation at service boundaries, often in async workers or queue consumers.
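Orphan detection is a set-membership check over the assembled trace; a sketch, assuming spans as plain dicts with `span_id` and `parent_id` fields:

```python
def find_orphan_spans(spans):
    """spans: dicts with 'span_id' and 'parent_id' (None for the root).
    Returns spans referencing a parent that never arrived -- the
    fingerprint of context propagation breaking at a boundary."""
    seen = {s["span_id"] for s in spans}
    return [s for s in spans
            if s["parent_id"] is not None and s["parent_id"] not in seen]
```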
Verify your sampling rate captures enough traces to debug issues. At 1% sampling, a bug affecting 0.1% of requests may produce zero sampled traces. Always sample errors at 100%.
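The arithmetic is worth running for your own traffic. Modeling the count of sampled buggy requests as Poisson gives the chance that uniform head sampling captures nothing at all:

```python
import math

def p_zero_sampled(requests, bug_rate, sample_rate):
    """Probability that head sampling captures zero traces of a bug,
    treating sampled-buggy-request count as Poisson with
    mean = requests * bug_rate * sample_rate."""
    return math.exp(-requests * bug_rate * sample_rate)
```

At 100,000 requests with a 0.1% bug and 1% sampling, the expected catch is a single trace -- and there is roughly a 37% chance of capturing none.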
Log Aggregation
Every service should emit JSON logs with consistent field names. Structured logs are searchable, parseable, and enable automated analysis. Unstructured logs are nearly useless at scale.
Include trace_id and span_id in every log entry. This lets you jump from a log line to the full distributed trace, connecting application-level events to system-level performance data.
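Both items above can be sketched with the standard library; field names here are illustrative, and in practice a tracing integration (e.g. OpenTelemetry's logging support) injects the ids from the active span rather than the caller passing them by hand:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """One JSON object per line, with consistent field names and the
    trace/span ids that link each line to its distributed trace."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })
```

Callers attach ids via `log.info("charge failed", extra={"trace_id": tid, "span_id": sid})`, and every emitted line becomes a pivot point into the trace.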
Define retention based on compliance needs: PCI DSS requires 1 year, HIPAA requires 6 years, SOC 2 varies. Archive to cold storage after 30 days for cost efficiency.
Alert on specific error patterns that indicate system failures: database connection refused, authentication service unavailable, payment gateway timeout. These are often faster than metric-based alerts.
Audit log output quarterly for personally identifiable information. Redact or hash email addresses, phone numbers, IP addresses, and any field that could identify a user. PII in logs creates compliance liability.
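A minimal redaction pass, covering only emails and IPv4 addresses -- a real deployment would extend this to phone numbers and any app-specific identifiers, and the patterns here are deliberately simple:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(line):
    """Mask emails and IPv4 addresses before a line is shipped."""
    return IPV4.sub("[ip]", EMAIL.sub("[email]", line))
```

Running redaction in the log pipeline, rather than trusting every service to do it, gives you a single enforcement point to audit.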
Incident Response Readiness
Every alert must have a linked runbook with investigation steps, likely root causes, and remediation actions. Engineers should never see an alert and wonder what to do next.
Define who to escalate to, when to escalate, and how (page vs Slack vs email). Test the escalation path monthly -- stale escalation policies cause delayed response during real incidents.
Connect your status page (Statuspage, Instatus, etc.) to monitoring so it updates automatically when services degrade. Manual status updates are always late and often forgotten during incidents.
Have a blameless post-incident review template covering: timeline, root cause, impact, detection time, resolution time, and action items with owners and deadlines.
Run monthly failure injection exercises: kill a service, introduce latency, revoke database credentials. Chaos engineering validates that your monitoring detects failures before users report them.
Ready to implement?
TraceKit helps you implement these practices with live breakpoints, distributed tracing, and production debugging.
Related Resources
Learn distributed tracing patterns and best practices for Go
Calculate SLA uptime and error budgets for your services
AI-powered enterprise observability without enterprise prices. See how TraceKit delivers core APM without the complexity.
Next.js blurs the line between server and client -- React Server Components, ISR, and streaming SSR create invisible boundaries where traces break. TraceKit gives you full visibility across the RSC boundary, from server render to client hydration.
Step-by-step guide to migrate from Datadog to TraceKit. Replace dd-trace with TraceKit SDK, map environment variables, and verify traces in minutes.