APM Implementation Checklist

A comprehensive checklist for implementing application performance monitoring from scratch, covering prerequisites through production rollout.

Checklist · Intermediate · 15 min | Updated March 2026

Prerequisites

Define service-level objectives (SLOs)

Establish target latency (e.g., P99 < 500ms) and availability (e.g., 99.9%) for each service before instrumenting. SLOs determine which metrics and alerts matter.
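An availability SLO translates directly into an error budget, which is useful to compute up front. A minimal sketch, assuming the 99.9% target from this checklist and an illustrative 30-day window:

```python
# Sketch: translate an availability SLO into a monthly error budget.
# The 99.9% target is the example from this checklist; the 30-day
# window is an assumption for illustration.

def error_budget_minutes(availability_slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime per window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_slo)

budget = error_budget_minutes(0.999)  # 99.9% over 30 days -> ~43.2 minutes
```

A budget this small makes clear why latency and error alerts need tight thresholds.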

Inventory all services and dependencies

Map every service, database, cache, and external API in your architecture. Include async workers and scheduled jobs -- these are frequently missed during instrumentation.

Choose instrumentation approach (auto vs manual)

Auto-instrumentation covers HTTP, gRPC, and database calls with zero code changes. Manual instrumentation adds custom spans for business logic. Most teams need both.

Set up OpenTelemetry Collector or backend

Deploy an OTel Collector as a sidecar or gateway to receive, process, and export telemetry. This decouples your application from the backend vendor.
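A minimal gateway-mode Collector pipeline might look like the sketch below; the backend endpoint is a placeholder to replace with your vendor's:

```yaml
# Minimal OTel Collector config (sketch). Backend endpoint is a
# placeholder -- substitute your vendor's OTLP endpoint and any
# required auth headers.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Keeping the exporter section as the only vendor-specific piece is what makes swapping backends a config change rather than a code change.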

Establish baseline performance metrics

Record current P50/P95/P99 latency, error rates, and throughput for each service before adding instrumentation. This lets you measure tracing overhead accurately.
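If you have raw latency samples but no percentile tooling yet, the baseline can be computed with the standard library alone. A sketch, using a synthetic 1-200ms sample range as a stand-in for real measurements:

```python
# Sketch: compute baseline latency percentiles from a sample of
# request durations in milliseconds. Standard library only; the
# sample data is synthetic, standing in for real measurements.
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return (P50, P95, P99) from raw latency samples."""
    cuts = quantiles(sorted(samples_ms), n=100)  # 99 cut points
    return cuts[49], cuts[94], cuts[98]

durations = list(range(1, 201))  # stand-in: 200 requests, 1-200 ms
p50, p95, p99 = latency_percentiles(durations)
```

Record these three numbers per service; they are the comparison point for measuring tracing overhead later.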

SDK Installation

Install OpenTelemetry SDK for primary language

Add the OTel SDK and relevant auto-instrumentation packages. For Go: go.opentelemetry.io/otel. For Node.js: @opentelemetry/sdk-node. For Python: opentelemetry-sdk.

Configure exporter endpoint

Point the OTLP exporter to your Collector or backend. Set OTEL_EXPORTER_OTLP_ENDPOINT environment variable or configure programmatically. Use gRPC for lower overhead.
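The environment-variable route might look like this sketch, assuming a Collector sidecar on the default gRPC port and a hypothetical service name:

```shell
# Sketch: point the SDK at a local Collector via standard OTel
# environment variables. Endpoint assumes a sidecar on the default
# gRPC port; the service name is a hypothetical example.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="checkout-service"
```

Environment-variable configuration keeps endpoints out of code, so the same build runs in staging and production with different Collectors.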

Set service.name resource attribute

Every service must set a unique service.name via OTEL_SERVICE_NAME or in code. This is the primary grouping key in every observability backend -- get it right early.

Verify first span reaches backend

Send a test request and confirm the trace appears in your backend within 30 seconds. If missing, check collector logs for export errors or dropped spans.

Add SDK dependency to CI/CD pipeline

Pin the OTel SDK version in your dependency file and ensure CI builds pass. OTel SDKs follow semver -- pin minor versions to avoid breaking changes.

Instrumentation

Instrument HTTP/gRPC entry points

Add middleware or interceptors for all inbound requests. This creates root spans with HTTP method, route, status code, and duration automatically.
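To make concrete what that middleware records, here is a stdlib-only sketch of the pattern; real auto-instrumentation does this for you, and the span dict and `spans` list are stand-ins for the SDK's span objects and exporter:

```python
# Sketch of what HTTP entry-point instrumentation records per request.
# Real auto-instrumentation does this automatically; the span dict and
# `spans` list are stand-ins for the SDK's span objects and exporter.
import time

spans = []  # stand-in for the SDK's exporter

def traced_handler(handler, method, route):
    """Wrap a request handler so each call produces a root-span record."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = 500  # assume failure until the handler returns
        try:
            status, body = handler(*args, **kwargs)
            return status, body
        finally:
            spans.append({
                "name": f"{method} {route}",
                "http.request.method": method,
                "http.route": route,
                "http.response.status_code": status,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper

get_user = traced_handler(lambda user_id: (200, {"id": user_id}), "GET", "/users/{id}")
get_user("42")
```

Note the span name uses the route template, not the concrete URL, so spans group correctly across different IDs.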

Add database query tracing

Instrument database drivers to capture query spans with db.system, db.statement (sanitized), and duration. This identifies slow queries in trace waterfalls.
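Sanitization matters because raw statements can leak user data into spans. A minimal sketch of literal stripping (driver instrumentation typically handles this, with more robust parsing than these two regexes):

```python
# Sketch: sanitize a SQL statement before attaching it as db.statement,
# replacing literals with placeholders so spans never carry user data.
# Real driver instrumentation uses more robust parsing than this.
import re

def sanitize_sql(statement: str) -> str:
    statement = re.sub(r"'[^']*'", "?", statement)          # string literals
    statement = re.sub(r"\b\d+(\.\d+)?\b", "?", statement)  # numeric literals
    return statement

sanitize_sql("SELECT * FROM users WHERE email = 'a@b.com' AND age > 30")
# -> "SELECT * FROM users WHERE email = ? AND age > ?"
```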

Trace external API calls

Wrap HTTP clients to create spans for outbound calls. Include the target URL, response status, and retry count. Context propagation headers (W3C traceparent) are injected automatically.
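To make the propagation format concrete, here is a sketch of what an injected traceparent header looks like; instrumented clients generate and inject this for you:

```python
# Sketch: build a W3C traceparent header for an outbound call.
# Instrumented HTTP clients inject this automatically; shown only
# to make the propagation format concrete.
import secrets

def make_traceparent(sampled: bool = True) -> str:
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"  # version 00

header = make_traceparent()
```

The downstream service parses this header to attach its spans to the same trace, which is what stitches the waterfall together.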

Add custom spans for business logic

Create spans around critical business operations (payment processing, order fulfillment, ML inference). Use semantic naming: verb.noun format like process.payment.

Propagate context across async boundaries

Pass trace context explicitly through goroutines, thread pools, and message queues. Without this, async work creates orphan spans that break end-to-end traces.
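Within a single Python process, contextvars handle the asyncio case; the sketch below shows a task inheriting the caller's trace ID (a hardcoded example value). Crossing a message queue still requires serializing the context into message headers explicitly:

```python
# Sketch: carrying trace context across an asyncio boundary with
# contextvars. asyncio.create_task copies the current context, so
# work in the task stays attached to the caller's trace. The trace
# ID here is a hardcoded example value.
import asyncio
import contextvars

current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

async def async_worker():
    # Inherits the caller's context via the task's context copy.
    return current_trace_id.get()

async def handle_request():
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    task = asyncio.create_task(async_worker())
    return await task

trace_id = asyncio.run(handle_request())
```

For queues and thread pools without this automatic copying, inject the trace context into the message or job payload and restore it on the consumer side.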

Alerting and Dashboards

Create service health dashboard

Build a dashboard showing request rate, error rate, and latency percentiles (RED metrics) per service. Include a service map if your backend supports it.

Set latency P95/P99 alerts

Alert when P95 or P99 latency exceeds your SLO threshold for 5+ minutes. Avoid alerting on P50 -- it hides tail latency issues that affect your most important requests.

Set error rate alerts

Alert when error rate exceeds baseline by 2x or crosses an absolute threshold (e.g., 1%). Use a sliding window of 5-10 minutes to avoid noise from single request failures.
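The logic above can be sketched as a sliding-window monitor; the baseline error rate and window length below are illustrative defaults:

```python
# Sketch: sliding-window error-rate check matching the thresholds
# above (2x baseline or an absolute 1%). The 0.2% baseline and
# 5-minute window are illustrative assumptions. Timestamps in seconds.
from collections import deque

class ErrorRateMonitor:
    def __init__(self, window_s=300, baseline=0.002, abs_threshold=0.01):
        self.window_s = window_s
        self.baseline = baseline
        self.abs_threshold = abs_threshold
        self.events = deque()  # (timestamp, is_error)

    def record(self, ts, is_error):
        self.events.append((ts, is_error))
        # Evict events that have aged out of the window.
        while self.events and self.events[0][0] < ts - self.window_s:
            self.events.popleft()

    def should_alert(self):
        if not self.events:
            return False
        errors = sum(1 for _, is_err in self.events if is_err)
        rate = errors / len(self.events)
        return rate >= self.abs_threshold or rate >= 2 * self.baseline

monitor = ErrorRateMonitor()
for t in range(100):
    monitor.record(t, is_error=(t % 50 == 0))  # 2% error rate -> alerts
```

Because the window evicts old events, one isolated failure ages out instead of paging anyone.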

Configure on-call notification channel

Route critical alerts to PagerDuty, Opsgenie, or your on-call tool. Set severity levels: P1 pages immediately, P2 notifies within 15 minutes, P3 creates a ticket.

Document runbook links in every alert

Every alert must link to a runbook describing investigation steps, likely causes, and remediation actions. Alerts without runbooks slow down incident response.

Validation and Rollout

Verify end-to-end trace connectivity

Send a request through your full service chain and confirm every hop appears as a span in a single trace. Missing spans indicate broken context propagation.

Load test with tracing enabled

Run a load test matching production traffic patterns and measure CPU/memory overhead from tracing. Overhead should be under 3% -- if higher, enable sampling.

Enable in staging for 1 week

Run with full tracing in staging to catch issues before production. Verify trace data quality, check for missing spans, and confirm alert thresholds are reasonable.

Progressive rollout to production

Roll out tracing gradually: 10% of traffic first, monitor for 24 hours, then 50%, then 100%. Use feature flags or deployment percentage to control the rollout.
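One common way to implement the percentage gate is deterministic hash bucketing, sketched below; the bucketing key and function are illustrative, not a specific feature-flag product's API:

```python
# Sketch: deterministic percentage rollout by hashing a stable key
# into 100 buckets, so the same user stays enabled as the percentage
# ramps 10 -> 50 -> 100. Key and function names are illustrative.
import hashlib

def tracing_enabled(key: str, rollout_pct: int) -> bool:
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct

tracing_enabled("user-1234", 10)
```

Hashing beats random sampling here because it is sticky: a user enabled at 10% remains enabled at 50%, so you never lose trace continuity mid-ramp.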

Confirm sampling rates are appropriate for traffic

High-traffic services (>1000 RPS) should use head-based sampling at 1-10%. Always sample errors and slow requests at 100%. Tail-based sampling is ideal but requires collector support.
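The policy above can be sketched as a single decision function; note that because errors and durations are only known when a span ends, applying it faithfully in practice requires tail-based sampling in the Collector. The 500ms slow threshold and 5% base rate below are illustrative:

```python
# Sketch of the sampling policy above: keep 100% of errors and slow
# requests, sample the rest at a fixed rate. Knowing error/duration
# implies a tail-based (end-of-span) decision; the 500ms threshold
# and 5% rate are illustrative assumptions.
import random

def should_sample(is_error: bool, duration_ms: float,
                  slow_threshold_ms: float = 500, rate: float = 0.05) -> bool:
    if is_error or duration_ms >= slow_threshold_ms:
        return True  # always keep errors and slow requests
    return random.random() < rate  # probabilistic for the healthy majority

should_sample(is_error=True, duration_ms=20)  # errors are always kept
```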

Ready to implement?

TraceKit helps you implement these practices with live breakpoints, distributed tracing, and production debugging.