APM Implementation Checklist
A comprehensive checklist for implementing application performance monitoring from scratch, covering prerequisites through production rollout.
Prerequisites
Establish target latency (e.g., P99 < 500ms) and availability (e.g., 99.9%) for each service before instrumenting. SLOs determine which metrics and alerts matter.
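As a quick sanity check when picking an availability target, the error budget it implies can be computed directly. A minimal sketch (the 30-day period is an assumption; adjust to your reporting window):

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    return (1 - slo) * period_days * 24 * 60

# 99.9% over 30 days allows roughly 43.2 minutes of downtime
print(round(error_budget_minutes(0.999), 1))
```

This is useful context when setting alert thresholds later: a 99.9% target leaves very little room, so paging on sustained SLO-threatening latency matters more than paging on every blip.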
Map every service, database, cache, and external API in your architecture. Include async workers and scheduled jobs -- these are frequently missed during instrumentation.
Auto-instrumentation covers HTTP, gRPC, and database calls with zero code changes. Manual instrumentation adds custom spans for business logic. Most teams need both.
Deploy an OTel Collector as a sidecar or gateway to receive, process, and export telemetry. This decouples your application from the backend vendor.
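A minimal Collector gateway configuration looks like the sketch below. The backend endpoint is a placeholder; the OTLP receiver ports (4317 gRPC, 4318 HTTP) are the spec defaults:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:              # batch spans to reduce export overhead
    timeout: 5s

exporters:
  otlphttp:
    endpoint: https://backend.example.com  # placeholder backend URL

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Swapping vendors later means changing the exporter block, not redeploying every service.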
Record current P50/P95/P99 latency, error rates, and throughput for each service before adding instrumentation. This lets you measure tracing overhead accurately.
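If you don't already have percentiles from a load balancer or metrics system, a nearest-rank calculation over a latency sample is enough for a baseline snapshot. A sketch (the sample data is illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for a baseline snapshot."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

latencies_ms = [12, 15, 18, 22, 30, 45, 80, 120, 250, 900]  # illustrative sample
baseline = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Record the same three percentiles again after instrumenting to quantify tracing overhead.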
SDK Installation
Add the OTel SDK and relevant auto-instrumentation packages. For Go: go.opentelemetry.io/otel. For Node.js: @opentelemetry/sdk-node. For Python: opentelemetry-sdk.
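The corresponding install commands, using the package names above (add auto-instrumentation packages for your frameworks separately, and pin exact versions in your dependency file):

```shell
# Go
go get go.opentelemetry.io/otel

# Node.js
npm install @opentelemetry/sdk-node

# Python
pip install opentelemetry-sdk
```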
Point the OTLP exporter to your Collector or backend. Set OTEL_EXPORTER_OTLP_ENDPOINT environment variable or configure programmatically. Use gRPC for lower overhead.
Every service must set a unique service.name via OTEL_SERVICE_NAME or in code. This is the primary grouping key in every observability backend -- get it right early.
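Both the endpoint and the service name can be set via the standard OTel environment variables, so no code change is needed. A sketch (the service name and localhost Collector address are placeholders):

```shell
# Identify this service in every backend view
export OTEL_SERVICE_NAME=checkout-service            # placeholder name

# Export over gRPC to a local Collector (4317 is the OTLP gRPC default)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```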
Send a test request and confirm the trace appears in your backend within 30 seconds. If missing, check collector logs for export errors or dropped spans.
Pin the OTel SDK version in your dependency file and ensure CI builds pass. Core OTel SDKs follow semver, but many instrumentation packages are still pre-1.0 -- pin minor versions to avoid breaking changes.
Instrumentation
Add middleware or interceptors for all inbound requests. This creates root spans with HTTP method, route, status code, and duration automatically.
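In practice the framework instrumentation package does this for you; conceptually, the middleware wraps each handler and records the fields a root span carries. A stdlib-only sketch of that shape (not the OTel API -- `SPANS` stands in for an exporter, and the attribute keys follow OTel HTTP semantic conventions):

```python
import time

SPANS = []  # stand-in for a span exporter

def trace_middleware(handler):
    """Wrap a handler, recording the fields a root span would carry."""
    def wrapped(method: str, route: str):
        start = time.monotonic()
        status = 500  # assume failure until the handler returns
        try:
            status, body = handler(method, route)
            return status, body
        finally:
            SPANS.append({
                "name": f"{method} {route}",
                "http.request.method": method,
                "http.route": route,
                "http.response.status_code": status,
                "duration_ms": (time.monotonic() - start) * 1000,
            })
    return wrapped

@trace_middleware
def hello(method, route):
    return 200, "ok"
```

The `finally` block matters: the span is recorded with a failure status even when the handler raises, which is exactly the traffic you most want to see.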
Instrument database drivers to capture query spans with db.system, db.statement (sanitized), and duration. This identifies slow queries in trace waterfalls.
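Sanitizing `db.statement` means stripping literal values so query text never leaks user data into your backend. Driver instrumentations typically do this for you; a simplified sketch of the idea (a production sanitizer handles more SQL syntax than these two patterns):

```python
import re

def sanitize_statement(sql: str) -> str:
    """Replace literal values so db.statement carries no user data."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)       # quoted string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)    # numeric literals
    return sql
```

The sanitized statement still groups identical query shapes together, which is what you need to rank slow queries.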
Wrap HTTP clients to create spans for outbound calls. Include the target URL, response status, and retry count. Context propagation headers (W3C traceparent) are injected automatically.
Create spans around critical business operations (payment processing, order fulfillment, ML inference). Use semantic naming: verb.noun format like process.payment.
Pass trace context explicitly through goroutines, thread pools, and message queues. Without this, async work creates orphan spans that break end-to-end traces.
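In Python, trace context rides on `contextvars`, and new threads do not inherit the caller's context -- which is exactly how orphan spans happen. A minimal sketch of the failure and the fix (the trace-id variable is a stand-in for the SDK's real context):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the trace context the OTel SDK stores in a ContextVar
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def do_async_work():
    return current_trace_id.get()  # None here means an orphan span

current_trace_id.set("abc123")

with ThreadPoolExecutor() as pool:
    # Worker threads start with an empty context: the trace id is lost
    orphan = pool.submit(do_async_work).result()
    # Explicitly copying the context carries it across the thread boundary
    ctx = contextvars.copy_context()
    linked = pool.submit(ctx.run, do_async_work).result()
```

The same principle applies to message queues: serialize the context (e.g., as a `traceparent` header on the message) at the producer and restore it in the consumer.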
Alerting and Dashboards
Build a dashboard showing request rate, error rate, and latency percentiles (RED metrics) per service. Include a service map if your backend supports it.
Alert when P95 or P99 latency exceeds your SLO threshold for 5+ minutes. Avoid alerting on P50 -- it hides tail latency issues that affect your most important requests.
Alert when error rate exceeds baseline by 2x or crosses an absolute threshold (e.g., 1%). Use a sliding window of 5-10 minutes to avoid noise from single request failures.
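The absolute-threshold rule above can be sketched as a sliding-window evaluator (timestamps in seconds; the 5-minute window and 1% threshold are the example values from the checklist):

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error rate over a sliding window crosses a threshold."""

    def __init__(self, window_s: int = 300, threshold: float = 0.01):
        self.window_s = window_s
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, ts: float, is_error: bool) -> bool:
        """Record one request; return True if the alert should fire."""
        self.events.append((ts, is_error))
        while self.events and self.events[0][0] <= ts - self.window_s:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        return errors / len(self.events) > self.threshold
```

Because the rate is computed over the whole window, one failed request among hundreds of successes stays below threshold -- the noise-avoidance property the checklist asks for.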
Route critical alerts to PagerDuty, Opsgenie, or your on-call tool. Set severity levels: P1 pages immediately, P2 notifies within 15 minutes, P3 creates a ticket.
Every alert must link to a runbook describing investigation steps, likely causes, and remediation actions. Alerts without runbooks slow down incident response.
Validation and Rollout
Send a request through your full service chain and confirm every hop appears as a span in a single trace. Missing spans indicate broken context propagation.
Run a load test matching production traffic patterns and measure CPU/memory overhead from tracing. Overhead should be under 3% -- if higher, enable sampling.
Run with full tracing in staging to catch issues before production. Verify trace data quality, check for missing spans, and confirm alert thresholds are reasonable.
Roll out tracing gradually: 10% of traffic first, monitor for 24 hours, then 50%, then 100%. Use feature flags or deployment percentage to control the rollout.
High-traffic services (>1000 RPS) should use head-based sampling at 1-10%. Always keep errors and slow requests at 100% -- but note that a head-based decision is made before the outcome is known, so guaranteeing this requires tail-based sampling, which is ideal but needs Collector support.
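The decision logic combines both rules: keep every error and slow request, and sample the remainder deterministically on the trace id so all spans of a trace share one decision. A sketch (the 5% rate and 1000 ms slow threshold are example values; as noted above, the error/slow branch strictly requires a tail-based setup):

```python
def should_sample(trace_id: int, is_error: bool, duration_ms: float,
                  rate: float = 0.05, slow_ms: float = 1000.0) -> bool:
    """Keep all errors and slow requests; ratio-sample everything else."""
    if is_error or duration_ms >= slow_ms:
        return True
    # Deterministic decision on the trace id keeps whole traces together
    return (trace_id % 10_000) < rate * 10_000
```

Hashing the trace id rather than rolling a fresh random number per span is what prevents half-sampled, broken traces.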
Ready to implement?
TraceKit helps you implement these practices with live breakpoints, distributed tracing, and production debugging.
Related Resources
Learn distributed tracing patterns and best practices for Go
Calculate SLA uptime and error budgets for your services