Distributed Tracing Best Practices

Six proven practices for getting the most value from distributed tracing, from span naming conventions to connecting traces with business outcomes.

Strategy · Best Practices · Intermediate · 18 min read | Updated March 2026

Name Spans Meaningfully

Span names are the primary grouping key in every observability backend. Poor names make traces unsearchable and aggregations useless.

Good Span Names (Operation-Based)

  • HTTP GET /api/users/{id} -- includes method and parameterized route
  • DB SELECT users -- includes operation type and table name
  • gRPC OrderService/CreateOrder -- includes service and method
  • Redis GET session:{id} -- includes operation and key pattern

Bad Span Names (Avoid These)

  • HTTP request -- too generic, all HTTP spans collapse into one group
  • GET /api/users/12345 -- includes specific ID, creating infinite cardinality (one group per user)
  • handler -- meaningless without context
  • database call -- loses which table and operation type

Why This Matters

Backends aggregate spans by name to show latency distributions, error rates, and throughput. If your API has 50 endpoints but all spans are named "HTTP request", you see one aggregated metric instead of 50. If each span includes a unique user ID, you get millions of groups that overwhelm the backend and provide no analytical value.
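The naming rule above can be captured in a small helper: build the span name from the HTTP method and the parameterized route template, never the concrete URL. This is an illustrative stdlib-only sketch (`spanName` is a hypothetical helper, not part of any OTel SDK):

```go
package main

import "fmt"

// spanName builds an operation-based span name from an HTTP method and a
// parameterized route template, keeping cardinality bounded. The route
// template ("/api/users/{id}"), not the concrete URL ("/api/users/12345"),
// goes into the name.
func spanName(method, routeTemplate string) string {
	return fmt.Sprintf("HTTP %s %s", method, routeTemplate)
}

func main() {
	fmt.Println(spanName("GET", "/api/users/{id}")) // HTTP GET /api/users/{id}
}
```

In practice, framework auto-instrumentation usually does this for you when the router exposes the matched route template; the helper matters when you name spans manually.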

Propagate Context Everywhere

Trace context propagation is the mechanism that connects spans across service boundaries into a single trace. When it breaks, you get disconnected fragments instead of end-to-end visibility.

Standard Propagation (W3C Trace Context)

Use W3C Trace Context headers (traceparent/tracestate) for all HTTP communication. This is the industry standard and supported by all major observability vendors and SDKs.
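The traceparent header itself has a simple shape: four dash-separated hex fields (version, 128-bit trace ID, 64-bit parent span ID, flags). A minimal sketch of building and parsing it, using the example IDs from the W3C spec (in real code the OTel propagator handles this for you):

```go
package main

import (
	"fmt"
	"strings"
)

// buildTraceparent assembles a W3C traceparent header:
// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex).
func buildTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01" // the sampled flag propagates the sampling decision
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// parseTraceparent extracts the trace ID so a downstream service can
// continue the same trace rather than starting a new one.
func parseTraceparent(header string) (traceID string, ok bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

func main() {
	h := buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(h) // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
	id, _ := parseTraceparent(h)
	fmt.Println(id) // 4bf92f3577b34da6a3ce929d0e0e4736
}
```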

Propagation Across Non-HTTP Boundaries

Auto-instrumentation handles HTTP and gRPC. You need manual propagation for:

  • Message queues -- Inject trace context into message headers (Kafka headers, RabbitMQ properties, SQS message attributes). Extract on the consumer side before processing.
  • Thread pools and goroutines -- Pass the context.Context (Go) or Context (Java) explicitly to spawned work. Example in Go:
// Wrong: goroutine has no trace context
go processOrder(order)

// Right: pass context explicitly
go processOrder(ctx, order)
  • Serverless invocations -- When one Lambda invokes another, inject trace context into the invocation payload or environment. AWS X-Ray does this automatically; OTel requires manual injection into the Lambda event.
  • Scheduled jobs and cron -- Start a new trace for each job execution, but link it to the triggering trace (if any) using span links rather than parent-child relationships.

Set Strategic Attributes

Span attributes (tags) are the fields you filter, group, and search by in your observability backend. Choose them strategically -- too few and you can't debug, too many and you blow up storage costs.

High-Value Attributes to Set

  • user_id or account_id -- Enables filtering traces by specific user for support debugging. Use a hashed or pseudonymized ID if PII is a concern.
  • tenant_id -- Essential for multi-tenant systems. Quickly isolate whether an issue affects one tenant or all tenants.
  • feature_flag -- When using feature flags, record which variant the request received. Correlating errors with a specific flag variant instantly identifies bad rollouts.
  • deployment_version -- Set the git SHA or semantic version. Enables filtering traces by deployment to compare before/after behavior.
  • environment -- Distinguish staging from production traces when using a shared backend.

Cardinality Management

Every unique value of an indexed attribute creates a new group (and often a new time series) in your backend. Attributes with bounded cardinality (deployment_version, environment, feature_flag) are safe. Attributes with unbounded cardinality (request_id, full URL path with IDs, raw SQL query text) will exhaust your backend's storage and slow down queries.

Rule of thumb: if an attribute can have more than 10,000 unique values per hour, it's too high-cardinality to be an indexed attribute. Store it in span events or logs instead.
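That rule of thumb can be enforced mechanically: count distinct values per attribute and stop indexing once an attribute crosses the limit. A minimal sketch (a real pipeline would do this in the Collector or backend, with per-hour windows rather than an unbounded map):

```go
package main

import "fmt"

// cardinalityGuard counts distinct values seen per attribute and flags any
// attribute that crosses a limit (the article's rule of thumb: ~10,000
// unique values per hour). Illustrative only; no time windowing here.
type cardinalityGuard struct {
	limit int
	seen  map[string]map[string]struct{}
}

func newGuard(limit int) *cardinalityGuard {
	return &cardinalityGuard{limit: limit, seen: map[string]map[string]struct{}{}}
}

// observe records a value and returns false once the attribute's distinct
// value count exceeds the limit, signalling it should be demoted to a
// span event or log field instead of an indexed attribute.
func (g *cardinalityGuard) observe(attr, value string) bool {
	vals, ok := g.seen[attr]
	if !ok {
		vals = map[string]struct{}{}
		g.seen[attr] = vals
	}
	vals[value] = struct{}{}
	return len(vals) <= g.limit
}

func main() {
	g := newGuard(3)
	fmt.Println(g.observe("environment", "prod")) // true: bounded attribute
	for i := 0; i < 5; i++ {
		g.observe("request_id", fmt.Sprintf("req-%d", i)) // unbounded: trips the guard
	}
	fmt.Println(g.observe("request_id", "req-99")) // false: over the limit
}
```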

PII in Span Attributes

Never store email addresses, phone numbers, IP addresses, or full names in span attributes. These are indexed and searchable, making them a compliance risk. Use hashed identifiers (SHA-256 of email) or internal IDs instead. Configure the OTel Collector's attributes processor to strip PII before export.
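The hashed-identifier approach is straightforward: normalize the value, then hash it, so the same user always maps to the same searchable token without the raw email ever reaching the backend. A sketch (`hashPII` is an illustrative helper name):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashPII turns an email address into a stable, non-reversible identifier
// that is safe to index as a span attribute. Normalizing first keeps
// "User@Example.com" and "user@example.com" on the same hash, so support
// lookups by hashed email stay consistent.
func hashPII(email string) string {
	normalized := strings.ToLower(strings.TrimSpace(email))
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashPII("User@Example.com") == hashPII("user@example.com")) // true
	fmt.Println(len(hashPII("user@example.com")))                           // 64
}
```

Note that unsalted hashes of low-entropy values can still be reversed by brute force; for stricter compliance regimes, prefer internal account IDs over hashed emails.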

Choose Sampling Wisely

Sampling controls how many traces you collect. Too aggressive and you miss rare bugs. Too conservative and you drown in storage costs. The right strategy depends on your traffic volume and debugging needs.

Head-Based Sampling

Decision made at the start of the trace (first service). Simple to implement: keep 10% of traces, drop 90%. The problem: you don't know if a trace will be interesting until it's complete. A 10% sample rate means you miss 90% of errors.
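Head-based samplers typically derive the decision from the trace ID itself, so every service that sees the same trace ID makes the same keep/drop choice without coordination. A simplified sketch, similar in spirit to OTel's TraceIDRatioBased sampler (the real sampler's bit layout differs in detail):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample makes a head-based decision from the trace ID alone: interpret
// the low 8 bytes as a number and keep the trace if it falls below the
// ratio's share of the value space. Deterministic per trace ID, so all
// participating services agree without exchanging state.
func shouldSample(traceID [16]byte, ratio float64) bool {
	x := binary.BigEndian.Uint64(traceID[8:16])
	bound := uint64(ratio * float64(^uint64(0)))
	return x < bound
}

func main() {
	var low, high [16]byte
	high[8] = 0xFF // large value in the bytes the decision is based on
	fmt.Println(shouldSample(low, 0.10))  // true: 0 is below any nonzero bound
	fmt.Println(shouldSample(high, 0.10)) // false: far above a 10% bound
}
```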

Tail-Based Sampling (Recommended)

Decision made after the trace is complete, based on its characteristics. Configure the OTel Collector's tail_sampling processor:

processors:
  tail_sampling:
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

This keeps 100% of errors, 100% of slow requests, and a 5% random sample of everything else. You capture every interesting trace while controlling costs.

Sampling Best Practices

  • Always sample errors and slow requests at 100% -- these are what you'll actually investigate
  • For services under 100 RPS, consider keeping 100% of traces (storage cost is minimal)
  • For services over 1000 RPS, use tail-based sampling or head-based at 1-10%
  • Make sampling decisions at the trace level (not span level) to avoid incomplete traces
  • Monitor your sampling pipeline -- dropped traces during sampling are lost forever

Monitor Your Monitoring

Your observability pipeline is itself a distributed system that can fail silently. If your tracing pipeline drops data, you lose visibility exactly when you need it most.

Collector Health Metrics

The OTel Collector exposes Prometheus metrics about its own health. Monitor these:

  • otelcol_exporter_queue_size -- If the export queue grows steadily, the backend can't keep up. Scale the Collector or reduce sampling.
  • otelcol_exporter_send_failed_spans -- Export failures mean data loss. Alert if this exceeds 0 for more than 5 minutes.
  • otelcol_receiver_refused_spans -- The Collector is rejecting incoming data (usually due to memory pressure). Increase memory limits or add Collector replicas.
  • otelcol_processor_dropped_spans -- Sampling or processing is dropping more spans than expected. Review processor configuration.
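Assuming Prometheus already scrapes the Collector's metrics endpoint, the export-failure alert above might look like this (rule and label names are illustrative):

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: CollectorExportFailures
        # send_failed_spans is a counter, so alert on its rate
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "OTel Collector is failing to export spans (data loss in progress)"
```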

Detect Instrumentation Gaps

Run a weekly check for services that should be producing traces but aren't. Compare your service registry (Kubernetes services, ECS tasks, etc.) against the list of service.name values in your trace backend. Missing services indicate broken instrumentation or failed deployments of the OTel SDK.
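The weekly check reduces to a set difference between the two lists. A minimal sketch (fetching the registry and the backend's service.name values is left out; the inputs here are plain string slices):

```go
package main

import (
	"fmt"
	"sort"
)

// missingServices returns registry entries (e.g. Kubernetes service names)
// that never appear as a service.name in the trace backend -- candidates
// for broken instrumentation or a failed OTel SDK rollout.
func missingServices(registry, traced []string) []string {
	seen := map[string]bool{}
	for _, s := range traced {
		seen[s] = true
	}
	var missing []string
	for _, s := range registry {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	sort.Strings(missing) // stable output for reports and diffs
	return missing
}

func main() {
	registry := []string{"checkout", "payments", "inventory"}
	traced := []string{"checkout", "inventory"}
	fmt.Println(missingServices(registry, traced)) // [payments]
}
```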

Verify Trace Completeness

Sample 50-100 traces per week from critical paths and verify they contain spans from every expected service. Automated checks can flag traces with fewer spans than the expected minimum for a given entry point.

Connect Traces to Business Impact

The ultimate value of distributed tracing is connecting technical performance to business outcomes. Without this connection, tracing is an engineering tool. With it, tracing becomes a business intelligence platform.

Add Business Attributes

  • order_value -- Tag checkout traces with the order amount. When latency spikes, you can calculate revenue impact: "P99 latency increased to 5s, affecting $12,000 in orders during the incident."
  • plan_tier -- Tag traces with the customer's plan tier (free, pro, enterprise). Prioritize performance issues affecting enterprise customers differently than free-tier issues.
  • experiment_id -- Tag traces with A/B test variant IDs. Compare latency and error rates between experiment groups to detect performance regressions from new features.
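With order_value on checkout traces, the revenue-impact number quoted earlier is a simple aggregation: sum order_value over traces that breached the latency threshold. A sketch with illustrative field names (a real query would run in the trace backend, not in application code):

```go
package main

import "fmt"

// checkoutTrace is a trimmed-down trace record carrying the business
// attributes recommended above; field names are illustrative.
type checkoutTrace struct {
	latencyMS  int
	orderValue float64
}

// revenueAtRisk sums order_value across traces whose end-to-end latency
// exceeded the SLO threshold, turning a latency spike into a dollar figure.
func revenueAtRisk(traces []checkoutTrace, thresholdMS int) float64 {
	var total float64
	for _, t := range traces {
		if t.latencyMS > thresholdMS {
			total += t.orderValue
		}
	}
	return total
}

func main() {
	traces := []checkoutTrace{
		{latencyMS: 5200, orderValue: 8000},
		{latencyMS: 4900, orderValue: 4000},
		{latencyMS: 300, orderValue: 250}, // within SLO, not counted
	}
	fmt.Println(revenueAtRisk(traces, 2000)) // 12000
}
```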

SLO Reporting from Traces

Use trace data as the source of truth for SLO reporting. Traces capture actual user-experienced latency (end-to-end, not per-service), making them more accurate than synthetic probes or server-side metrics alone. Define SLO burn rate alerts on trace-derived metrics.

Customer Support Integration

When a customer reports an issue, look up their traces by user_id or account_id. The trace shows exactly what happened: which service was slow, what error occurred, and when. This reduces resolution time from hours of log searching to minutes of trace inspection. Include a trace link in support ticket responses for engineering handoffs.

Ready to implement?

TraceKit helps you implement these practices with live breakpoints, distributed tracing, and production debugging.