Alert Rules

Get notified when your services need attention. Set up intelligent alerts based on error rates, latency, throughput, and health scores.

Quick Start

  1. Set up notification channels (Slack or Telegram)
  2. Create an alert rule with conditions
  3. Get notified when thresholds are breached
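
If you manage rules programmatically, the Quick Start maps to two calls: register a channel, then create a rule that references it. The sketch below is illustrative only; the base URL, endpoint paths, and field names are assumptions, not TraceKit's documented API.

  # Minimal sketch of the Quick Start flow over a hypothetical REST API.
  # Endpoint paths and field names are illustrative assumptions.
  import requests

  BASE = "https://tracekit.example.com/api/v1"        # hypothetical base URL
  HEADERS = {"Authorization": "Bearer <API_KEY>"}

  # 1. Set up a notification channel (Slack webhook, for example)
  channel = requests.post(f"{BASE}/channels", headers=HEADERS, json={
      "type": "slack",
      "webhook_url": "https://hooks.slack.com/services/...",
  }).json()

  # 2. Create an alert rule with conditions
  requests.post(f"{BASE}/alert-rules", headers=HEADERS, json={
      "alert_type": "error_rate",
      "scope": {"type": "service", "service": "auth-service"},
      "condition": {"operator": ">", "threshold": 5.0},  # percent
      "time_window_minutes": 5,
      "severity": "critical",
      "channels": [channel["id"]],
  })

  # 3. Notifications arrive on the channel once the threshold is breached.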

Alert Types

Error Rate

Monitor the percentage of failed requests. Ideal for detecting when your service starts experiencing issues.

Example Use Case:

Alert when authentication service error rate exceeds 5% over 5 minutes

Alert Type: Error Rate
Scope: Service → auth-service
Condition: error_rate > 5%
Time Window: 5 minutes
Severity: Critical
Best Practice: Set thresholds based on your baseline. 5-10% is typical for warning, 15%+ for critical.
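
In plain terms, the condition means: of all requests auth-service handled in the trailing 5-minute window, more than 5% failed. A minimal sketch of that evaluation, assuming each span carries a timestamp and an error flag (both field names are placeholders):

  # Sketch: evaluate an error-rate condition over a sliding window.
  from datetime import datetime, timedelta

  def error_rate_breached(spans, window_minutes=5, threshold_pct=5.0):
      cutoff = datetime.utcnow() - timedelta(minutes=window_minutes)
      recent = [s for s in spans if s["timestamp"] >= cutoff]
      if not recent:
          return False                      # no traffic, nothing to evaluate
      failed = sum(1 for s in recent if s["is_error"])
      error_rate = 100.0 * failed / len(recent)
      return error_rate > threshold_pct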

Latency

Track response times and get alerted when requests are too slow. Choose from average, P50, P95, or P99 metrics.

Example Use Case:

Alert when P95 latency exceeds 1000ms (1 second) for API endpoints

Alert Type: Latency
Metric: P95
Scope: Service → api-gateway
Condition: p95 > 1000ms
Time Window: 5 minutes
Metric Guide:
  • Average: Good for overall trends
  • P95: Recommended for user experience (the value 95% of requests stay under)
  • P99: Catch worst-case scenarios
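
The percentiles above come straight from the distribution of observed response times. A quick sketch of how the four metrics relate, using the standard library:

  # Compute the latency metrics from a sample of response times (ms).
  from statistics import mean, quantiles

  latencies_ms = [120, 95, 480, 210, 1500, 180, 230, 90, 310, 2600]

  # quantiles(..., n=100) returns the 1st-99th percentile cut points.
  pct = quantiles(latencies_ms, n=100, method="inclusive")
  print("avg:", round(mean(latencies_ms)))   # overall trend
  print("p50:", round(pct[49]))              # the typical request
  print("p95:", round(pct[94]))              # user-experience threshold
  print("p99:", round(pct[98]))              # worst-case tail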

Throughput

Monitor requests per minute. Perfect for detecting when services stop processing traffic or get overwhelmed.

Service Down Detection:

Alert when a service stops sending traces

Condition: req_per_min < 1
Time Window: 10 minutes
Severity: Critical

Traffic Spike Detection:

Alert when traffic exceeds capacity

Condition: req_per_min > 1000
Time Window: 5 minutes
Severity: Warning
Use Case: Throughput alerts work well for detecting service outages (low threshold) or DDoS attacks (high threshold).
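
Both examples watch the same metric with opposite thresholds. A rough sketch of the evaluation, assuming requests per minute are derived from trace timestamps (field names are placeholders):

  # Sketch: derive requests-per-minute from trace timestamps and check
  # both directions, mirroring the two examples above.
  from datetime import datetime, timedelta

  def requests_per_minute(spans, window_minutes):
      cutoff = datetime.utcnow() - timedelta(minutes=window_minutes)
      count = sum(1 for s in spans if s["timestamp"] >= cutoff)
      return count / window_minutes

  def service_down(spans):      # outage: traffic has dried up
      return requests_per_minute(spans, window_minutes=10) < 1

  def traffic_spike(spans):     # overload or possible DDoS
      return requests_per_minute(spans, window_minutes=5) > 1000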

Health Score

Composite metric combining error rate and latency into a single health score (0-100). Higher is better.

Example Use Case:

Alert when overall service health drops below 70

Alert Type: Health Score
Scope: Global (All Services)
Condition: health_score < 70
Time Window: 15 minutes
Formula: Health Score = (Error Rate Score × 50) + (Latency Score × 50). Score 100 = perfect, 0 = complete failure.
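
A worked reading of the formula, under the assumption (for illustration only) that each component score is normalized to 0-1 with 1 being ideal, so each half contributes up to 50 points:

  # Worked example of the health score formula. Normalizing the component
  # scores to 0-1 is an assumption made here for illustration.
  def health_score(error_rate_score, latency_score):
      # error_rate_score, latency_score: 0.0 (worst) .. 1.0 (ideal)
      return error_rate_score * 50 + latency_score * 50

  print(health_score(1.0, 1.0))   # 100.0 -> perfectly healthy
  print(health_score(0.9, 0.5))   # 70.0  -> right at the example threshold
  print(health_score(0.0, 0.0))   # 0.0   -> complete failure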

Scope Types

Global

Monitor all services together. Good for overall system health.

Service

Monitor a specific service. Most common use case.

Endpoint

Monitor a specific endpoint like "POST /api/users".
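
In evaluation terms, the scope simply decides which traces a rule looks at. A small sketch of the filtering, with placeholder field names:

  # Sketch of scope filtering. Span fields (service, endpoint) are placeholders.
  def in_scope(span, scope):
      if scope["type"] == "global":
          return True                                   # every service
      if scope["type"] == "service":
          return span["service"] == scope["service"]    # e.g. auth-service
      if scope["type"] == "endpoint":
          return (span["service"] == scope["service"]
                  and span["endpoint"] == scope["endpoint"])   # e.g. POST /api/users
      return False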

Best Practices

Set Appropriate Time Windows

Short windows (1-5 min) detect issues quickly but may cause false positives. Longer windows (15-30 min) are more stable but slower to alert.

Use Cooldowns to Prevent Spam

Set cooldown periods (15-60 min) to avoid getting flooded with notifications for the same issue. You'll be notified periodically until the issue is resolved.
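
The effect of a cooldown is simply that repeat notifications for the same firing rule are suppressed until the cooldown elapses, then sent again if the rule is still firing. A simplified sketch of that logic:

  # Sketch of cooldown handling: notify at most once per cooldown period
  # while the rule keeps firing. State handling is deliberately simplified.
  from datetime import datetime, timedelta

  last_notified = {}   # rule_id -> time of last notification

  def maybe_notify(rule_id, firing, cooldown_minutes=30):
      if not firing:
          last_notified.pop(rule_id, None)   # issue resolved, reset
          return False
      now = datetime.utcnow()
      last = last_notified.get(rule_id)
      if last and now - last < timedelta(minutes=cooldown_minutes):
          return False                       # still in cooldown, stay quiet
      last_notified[rule_id] = now
      return True                            # send the notification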

Layer Your Alerts

Combine multiple alert types: Error rate alerts catch failures, latency alerts catch slowdowns, and throughput alerts catch outages.

Start with Baselines

Monitor your services for a few days to understand normal behavior before setting alert thresholds. Use your P95 latency as a starting point.
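
One way to turn a few days of observed latencies into a starting threshold is to take the baseline P95 and add some headroom. A sketch, where the 20% headroom factor is an arbitrary illustrative choice:

  # Sketch: derive an initial latency threshold from a baseline period.
  from statistics import quantiles

  def suggested_latency_threshold_ms(baseline_latencies_ms, headroom=1.2):
      p95 = quantiles(baseline_latencies_ms, n=100, method="inclusive")[94]
      return round(p95 * headroom)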

SDK Setup Guides

Alerts work with trace data sent by any TraceKit SDK. Set up your SDK to start sending traces, then create alert rules.

Next Steps

Ready to set up your first alert?
