Distributed Tracing Best Practices
Six proven practices for getting the most value from distributed tracing, from span naming conventions to connecting traces with business outcomes.
Name Spans Meaningfully
Span names are the primary grouping key in every observability backend. Poor names make traces unsearchable and aggregations useless.
Good Span Names (Operation-Based)
- HTTP GET /api/users/{id} -- includes method and parameterized route
- DB SELECT users -- includes operation type and table name
- gRPC OrderService/CreateOrder -- includes service and method
- Redis GET session:{id} -- includes operation and key pattern
Bad Span Names (Avoid These)
- HTTP request -- too generic; all HTTP spans collapse into one group
- GET /api/users/12345 -- includes a specific ID, creating unbounded cardinality (one group per user)
- handler -- meaningless without context
- database call -- loses which table and operation type
Why This Matters
Backends aggregate spans by name to show latency distributions, error rates, and throughput. If your API has 50 endpoints but all spans are named "HTTP request", you see one aggregated metric instead of 50. If each span includes a unique user ID, you get millions of groups that overwhelm the backend and provide no analytical value.
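As a minimal sketch of the naming rule, a small helper (hypothetical, not part of any SDK) can build span names from the route template rather than the concrete URL, so cardinality stays bounded:

```go
package main

import "fmt"

// spanName builds an operation-based span name from an HTTP method and a
// parameterized route pattern. The route template ("/api/users/{id}") is
// used, never the concrete request path, so all requests to the same
// endpoint aggregate into one group.
func spanName(method, routePattern string) string {
	return fmt.Sprintf("HTTP %s %s", method, routePattern)
}

// dbSpanName does the same for database spans: operation plus table,
// never the raw SQL text.
func dbSpanName(operation, table string) string {
	return fmt.Sprintf("DB %s %s", operation, table)
}
```

In practice this means wiring your router so handlers can see the matched route template, not just the resolved URL.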
Propagate Context Everywhere
Trace context propagation is the mechanism that connects spans across service boundaries into a single trace. When it breaks, you get disconnected fragments instead of end-to-end visibility.
Standard Propagation (W3C Trace Context)
Use W3C Trace Context headers (traceparent/tracestate) for all HTTP communication. This is the industry standard and supported by all major observability vendors and SDKs.
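To make the header format concrete, here is a standard-library-only sketch of building and validating a traceparent value. The field layout (version, trace-id, parent-id, flags) follows the W3C spec; the helper names are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
)

// traceparentRe matches the W3C traceparent header shape:
// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex).
var traceparentRe = regexp.MustCompile(`^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$`)

// validTraceparent reports whether a header value is well-formed. The spec
// additionally defines all-zero trace-id and parent-id as invalid.
func validTraceparent(v string) bool {
	if !traceparentRe.MatchString(v) {
		return false
	}
	traceID := v[3:35]
	parentID := v[36:52]
	return traceID != "00000000000000000000000000000000" &&
		parentID != "0000000000000000"
}

// buildTraceparent assembles a header value; flags bit 0 is the
// sampled flag.
func buildTraceparent(traceID, parentID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, parentID, flags)
}
```

In real services the OTel SDK's propagator handles this for you; the sketch is only to show what travels on the wire.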
Propagation Across Non-HTTP Boundaries
Auto-instrumentation handles HTTP and gRPC. You need manual propagation for:
- Message queues -- Inject trace context into message headers (Kafka headers, RabbitMQ properties, SQS message attributes). Extract on the consumer side before processing.
- Thread pools and goroutines -- Pass the context.Context (Go) or Context (Java) explicitly to spawned work. Example in Go:

```go
// Wrong: goroutine has no trace context
go processOrder(order)

// Right: pass context explicitly
go processOrder(ctx, order)
```
- Serverless invocations -- When one Lambda invokes another, inject trace context into the invocation payload or environment. AWS X-Ray does this automatically; OTel requires manual injection into the Lambda event.
- Scheduled jobs and cron -- Start a new trace for each job execution, but link it to the triggering trace (if any) using span links rather than parent-child relationships.
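The message-queue case above can be sketched with a plain map standing in for the client's header type (Kafka record headers, SQS message attributes, and so on). The Message type and helper names here are illustrative; with the OTel SDK you would call otel.GetTextMapPropagator().Inject and .Extract against a carrier wrapping your client's headers:

```go
package main

// Message is a stand-in for a queue message; real clients expose
// headers or attributes with their own types.
type Message struct {
	Headers map[string]string
	Body    []byte
}

// injectTraceContext copies the producer's traceparent into the outgoing
// message headers before publishing.
func injectTraceContext(traceparent string, m *Message) {
	if m.Headers == nil {
		m.Headers = map[string]string{}
	}
	m.Headers["traceparent"] = traceparent
}

// extractTraceContext reads the traceparent on the consumer side, before
// any processing spans start, so the consumer span joins the producer's
// trace instead of starting a disconnected one.
func extractTraceContext(m *Message) (string, bool) {
	v, ok := m.Headers["traceparent"]
	return v, ok
}
```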
Set Strategic Attributes
Span attributes (tags) are the fields you filter, group, and search by in your observability backend. Choose them strategically -- too few and you can't debug, too many and you blow up storage costs.
High-Value Attributes to Set
- user_id or account_id -- Enables filtering traces by specific user for support debugging. Use a hashed or pseudonymized ID if PII is a concern.
- tenant_id -- Essential for multi-tenant systems. Quickly isolate whether an issue affects one tenant or all tenants.
- feature_flag -- When using feature flags, record which variant the request received. Correlating errors with a specific flag variant instantly identifies bad rollouts.
- deployment_version -- Set the git SHA or semantic version. Enables filtering traces by deployment to compare before/after behavior.
- environment -- Distinguish staging from production traces when using a shared backend.
Cardinality Management
Every unique attribute value creates a time series in your backend. Attributes with bounded cardinality (deployment_version, environment, feature_flag) are safe. Attributes with unbounded cardinality (request_id, full URL path with IDs, raw SQL query) will exhaust your backend's storage and slow down queries.
Rule of thumb: if an attribute can have more than 10,000 unique values per hour, it's too high-cardinality to be an indexed attribute. Store it in span events or logs instead.
PII in Span Attributes
Never store email addresses, phone numbers, IP addresses, or full names in span attributes. These are indexed and searchable, making them a compliance risk. Use hashed identifiers (SHA-256 of email) or internal IDs instead. Configure the OTel Collector's attributes processor to strip PII before export.
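A minimal sketch of the hashed-identifier approach, using only the standard library (the function name is illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// pseudonymize returns a stable SHA-256 digest of a PII value such as an
// email address. The digest is deterministic, so all traces for the same
// user remain joinable without the raw identifier ever reaching the
// backend. Add a salt if dictionary attacks on known emails are a concern.
func pseudonymize(pii string) string {
	sum := sha256.Sum256([]byte(pii))
	return hex.EncodeToString(sum[:])
}
```

Support tooling can then hash the customer's email at lookup time to find their traces, without the backend ever indexing the address itself.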
Choose Sampling Wisely
Sampling controls how many traces you collect. Too aggressive and you miss rare bugs. Too conservative and you drown in storage costs. The right strategy depends on your traffic volume and debugging needs.
Head-Based Sampling
Decision made at the start of the trace (first service). Simple to implement: keep 10% of traces, drop 90%. The problem: you don't know if a trace will be interesting until it's complete. A 10% sample rate means you miss 90% of errors.
Tail-Based Sampling (Recommended)
Decision made after the trace is complete, based on its characteristics. Configure the OTel Collector's tail_sampling processor:
```yaml
processors:
  tail_sampling:
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
This keeps 100% of errors, 100% of slow requests, and a 5% random sample of everything else. You capture every interesting trace while controlling costs.
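The policy semantics can be sketched as a single decision function: policies are OR-ed, so a trace is kept if any one of them matches. This is an illustrative model of the behavior, not the Collector's actual code:

```go
package main

// keepTrace mirrors the tail_sampling policies configured above.
// hasError and latencyMs describe the completed trace; rnd is a uniform
// [0,1) draw for the probabilistic baseline policy.
func keepTrace(hasError bool, latencyMs int, rnd float64) bool {
	if hasError { // errors-always: keep 100% of error traces
		return true
	}
	if latencyMs >= 2000 { // slow-requests: keep everything over 2s
		return true
	}
	return rnd < 0.05 // baseline-sample: 5% of everything else
}
```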
Sampling Best Practices
- Always sample errors and slow requests at 100% -- these are what you'll actually investigate
- For services under 100 RPS, consider keeping 100% of traces (storage cost is minimal)
- For services over 1000 RPS, use tail-based sampling or head-based at 1-10%
- Make sampling decisions at the trace level (not span level) to avoid incomplete traces
- Monitor your sampling pipeline -- dropped traces during sampling are lost forever
Monitor Your Monitoring
Your observability pipeline is itself a distributed system that can fail silently. If your tracing pipeline drops data, you lose visibility exactly when you need it most.
Collector Health Metrics
The OTel Collector exposes Prometheus metrics about its own health. Monitor these:
- otelcol_exporter_queue_size -- If the export queue grows steadily, the backend can't keep up. Scale the Collector or reduce sampling.
- otelcol_exporter_send_failed_spans -- Export failures mean data loss. Alert if this exceeds 0 for more than 5 minutes.
- otelcol_receiver_refused_spans -- The Collector is rejecting incoming data (usually due to memory pressure). Increase memory limits or add Collector replicas.
- otelcol_processor_dropped_spans -- Sampling or processing is dropping more spans than expected. Review processor configuration.
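As one way to wire these metrics into alerting, here is a sketch of Prometheus alerting rules; the thresholds, durations, and severity labels are placeholders to tune for your traffic, not recommended values:

```yaml
groups:
  - name: otel-collector-health
    rules:
      - alert: CollectorExportFailures
        # Export failures mean permanent data loss.
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "OTel Collector is failing to export spans"
      - alert: CollectorQueueBackingUp
        # A steadily growing queue means the backend can't keep up.
        expr: otelcol_exporter_queue_size > 5000
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "OTel Collector export queue is backing up"
```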
Detect Instrumentation Gaps
Run a weekly check for services that should be producing traces but aren't. Compare your service registry (Kubernetes services, ECS tasks, etc.) against the list of service.name values in your trace backend. Missing services indicate broken instrumentation or failed deployments of the OTel SDK.
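The weekly gap check reduces to a set difference. A sketch, with illustrative function and variable names, comparing the deployed-service registry against the service.name values seen in the trace backend:

```go
package main

import "sort"

// missingServices returns services that are deployed (per the registry)
// but have produced no traces, sorted for stable reporting. Each hit is
// a candidate for broken instrumentation or a failed SDK rollout.
func missingServices(registry, traced []string) []string {
	seen := make(map[string]bool, len(traced))
	for _, s := range traced {
		seen[s] = true
	}
	var missing []string
	for _, s := range registry {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	sort.Strings(missing)
	return missing
}
```

The registry side would come from your orchestrator's API (Kubernetes services, ECS task definitions); the traced side from a service-name query against your backend.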
Verify Trace Completeness
Sample 50-100 traces per week from critical paths and verify they contain spans from every expected service. Automated checks can flag traces with fewer spans than the expected minimum for a given entry point.
Connect Traces to Business Impact
The ultimate value of distributed tracing is connecting technical performance to business outcomes. Without this connection, tracing is an engineering tool. With it, tracing becomes a business intelligence platform.
Add Business Attributes
- order_value -- Tag checkout traces with the order amount. When latency spikes, you can calculate revenue impact: "P99 latency increased to 5s, affecting $12,000 in orders during the incident."
- plan_tier -- Tag traces with the customer's plan tier (free, pro, enterprise). Prioritize performance issues affecting enterprise customers differently than free-tier issues.
- experiment_id -- Tag traces with A/B test variant IDs. Compare latency and error rates between experiment groups to detect performance regressions from new features.
SLO Reporting from Traces
Use trace data as the source of truth for SLO reporting. Traces capture actual user-experienced latency (end-to-end, not per-service), making them more accurate than synthetic probes or server-side metrics alone. Define SLO burn rate alerts on trace-derived metrics.
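A burn-rate alert on trace-derived counts can be sketched as follows; the function name is illustrative, and the paging threshold mentioned in the comment is the commonly cited multi-window starting point, not a universal rule:

```go
package main

// burnRate computes how fast the error budget is being consumed:
// the observed bad-event ratio divided by the budget (1 - SLO target).
// A burn rate of 1.0 exactly exhausts the budget over the SLO window;
// fast-burn alerts typically page well above that (e.g. ~14x for a
// 1-hour window on a 30-day SLO).
func burnRate(badEvents, totalEvents int, sloTarget float64) float64 {
	if totalEvents == 0 {
		return 0
	}
	budget := 1 - sloTarget
	return (float64(badEvents) / float64(totalEvents)) / budget
}
```

Here badEvents and totalEvents would come from counting traces against your SLO's error and latency criteria.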
Customer Support Integration
When a customer reports an issue, look up their traces by user_id or account_id. The trace shows exactly what happened: which service was slow, what error occurred, and when. This reduces resolution time from hours of log searching to minutes of trace inspection. Include a trace link in support ticket responses for engineering handoffs.