🚨 Alert Rules

Get notified when your services need attention. Set up intelligent alerts based on error rates, latency, throughput, and health scores.

Quick Start

  1. Set up notification channels (Slack or Telegram)
  2. Create an alert rule with conditions
  3. Get notified when thresholds are breached

Alert Types

📈 Error Rate

Monitor the percentage of failed requests. Ideal for detecting when your service starts experiencing issues.

Example Use Case:

Alert when authentication service error rate exceeds 5% over 5 minutes

Alert Type: Error Rate
Scope: Service → auth-service
Condition: error_rate > 5%
Time Window: 5 minutes
Severity: 🔴 Critical
💡 Best Practice: Set thresholds based on your baseline. 5-10% is typical for warning, 15%+ for critical.
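The error-rate condition above can be sketched in a few lines. This is an illustrative example, not the product's internal implementation; the `Request` type and function names are made up for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Request:
    failed: bool

def error_rate(requests):
    """Percentage of failed requests in the time window (0-100)."""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if r.failed)
    return 100.0 * failures / len(requests)

def breaches_threshold(requests, threshold_pct=5.0):
    """True when the window's error rate exceeds the configured threshold."""
    return error_rate(requests) > threshold_pct

# 3 failures out of 40 requests in the window -> 7.5%, above the 5% threshold
window = [Request(failed=True)] * 3 + [Request(failed=False)] * 37
```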

⏱️ Latency

Track response times and get alerted when requests are too slow. Choose from average, P50, P95, or P99 metrics.

Example Use Case:

Alert when P95 latency exceeds 1000ms (1 second) for API endpoints

Alert Type: Latency
Metric: P95
Scope: Service → api-gateway
Condition: p95 > 1000ms
Time Window: 5 minutes
📊 Metric Guide:
  • Average: Good for overall trends
  • P95: Recommended for user experience (the time 95% of requests complete within)
  • P99: Catch worst-case scenarios
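The difference between these metrics is easiest to see with numbers. The sketch below uses a simple nearest-rank percentile (the product may compute percentiles differently) and illustrative latency data to show why P95 reacts to tail latency while the average stays deceptively low.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of latencies (ms)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# 90 fast requests and 10 slow ones: the average hides the tail,
# but P95 lands squarely on the slow requests.
latencies = [100] * 90 + [1200] * 10
average = sum(latencies) / len(latencies)   # 210.0 ms -- looks fine
p95 = percentile(latencies, 95)             # 1200 ms -- reveals the tail
```

In this window a 1000ms P95 alert fires even though the average is well under the threshold, which is exactly the behavior you want for user-facing endpoints.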

🚀 Throughput

Monitor requests per minute. Perfect for detecting when services stop processing traffic or get overwhelmed.

Service Down Detection:

Alert when service stops sending traces

Condition: req_per_min < 1
Time Window: 10 minutes
Severity: 🔴 Critical

Traffic Spike Detection:

Alert when traffic exceeds capacity

Condition: req_per_min > 1000
Time Window: 5 minutes
Severity: 🟡 Warning
✅ Use Case: Throughput alerts are perfect for detecting service outages (low threshold) or DDoS attacks (high threshold).
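Both throughput conditions above can be combined into one check. A minimal sketch, assuming the example thresholds from the two rules (tune `low` and `high` to your own capacity); the function name is illustrative.

```python
def throughput_alert(req_per_min, low=1, high=1000):
    """Return a severity label, or None when throughput is normal.

    low: below this, the service has likely stopped processing traffic.
    high: above this, traffic exceeds expected capacity.
    """
    if req_per_min < low:
        return "critical"   # service down: almost no traces arriving
    if req_per_min > high:
        return "warning"    # traffic spike or possible DDoS
    return None             # within normal operating range
```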

💚 Health Score

Composite metric combining error rate and latency into a single health score (0-100). Higher is better.

Example Use Case:

Alert when overall service health drops below 70

Alert Type: Health Score
Scope: Global (All Services)
Condition: health_score < 70
Time Window: 15 minutes
📐 Formula: Health Score = (Error Rate Score × 50) + (Latency Score × 50), where each sub-score is normalized to the 0-1 range. Score 100 = perfect, 0 = complete failure.
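The formula can be sketched as follows. How each sub-score is normalized to 0-1 is an assumption here (linear against a budget); the product's actual normalization and budget values may differ.

```python
def health_score(error_rate_pct, p95_ms,
                 error_budget_pct=10.0, latency_budget_ms=2000.0):
    """Composite 0-100 health score; 100 = perfect, 0 = complete failure.

    Assumed normalization: each sub-score falls linearly from 1.0 (metric
    at zero) to 0.0 (metric at its budget), then the documented formula
    weights each sub-score by 50.
    """
    error_score = max(0.0, 1.0 - error_rate_pct / error_budget_pct)
    latency_score = max(0.0, 1.0 - p95_ms / latency_budget_ms)
    return error_score * 50 + latency_score * 50

# Healthy service: 0% errors and 200ms P95 scores close to 100,
# while a service at both budgets scores 0 and trips the < 70 alert.
```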

Scope Types

  • 🌍 Global: Monitor all services together. Good for overall system health.
  • 🎯 Service: Monitor a specific service. The most common use case.
  • 🔗 Endpoint: Monitor a specific endpoint, such as "POST /api/users".

Best Practices

Set Appropriate Time Windows

Short windows (1-5 min) detect issues quickly but may cause false positives. Longer windows (15-30 min) are more stable but slower to alert.

🔕 Use Cooldowns to Prevent Spam

Set cooldown periods (15-60 min) to avoid getting flooded with notifications for the same issue. You'll be notified periodically until the issue is resolved.
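The cooldown behavior described above can be sketched as a simple gate that suppresses repeat notifications for the same rule until the cooldown elapses. The class and method names are illustrative, not the product's API.

```python
import time

class CooldownGate:
    """Suppress duplicate notifications for a rule during its cooldown."""

    def __init__(self, cooldown_seconds):
        self.cooldown = cooldown_seconds
        self.last_sent = {}   # rule id -> timestamp of last notification

    def should_notify(self, rule_id, now=None):
        """True if the cooldown has elapsed since the last notification.

        While the condition keeps breaching, this re-fires once per
        cooldown period until the issue is resolved.
        """
        now = time.time() if now is None else now
        last = self.last_sent.get(rule_id)
        if last is None or now - last >= self.cooldown:
            self.last_sent[rule_id] = now
            return True
        return False
```

For example, with a 15-minute (900s) cooldown, the first breach notifies immediately, a repeat breach 10 minutes later is suppressed, and a breach at the 15-minute mark notifies again.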

🎨 Layer Your Alerts

Combine multiple alert types: Error rate alerts catch failures, latency alerts catch slowdowns, and throughput alerts catch outages.

📊 Start with Baselines

Monitor your services for a few days to understand normal behavior before setting alert thresholds. Use your P95 latency as a starting point.

Next Steps

Ready to set up your first alert?