Alert Threshold Recommender

Eliminate alert fatigue with data-driven thresholds. Choose your golden signal, input your baseline metrics, and get recommended warning and critical thresholds with ready-to-use config snippets.

Configuration Snippets

Prometheus Alerting Rule
Grafana Alert Configuration
Datadog Monitor Definition
PagerDuty Integration

The Four Golden Signals

Latency

The time it takes to serve a request. Track successful and error response latencies separately: errors are often served quickly (e.g., an immediate 500 response), and mixing them into one distribution drags the aggregate down and masks slow successful requests.
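As a minimal sketch of why the split matters, consider a hypothetical request log and a hand-rolled percentile helper (both illustrative, not part of any specific monitoring API):

```python
def p99(values):
    """Return the sample at or above the 99th percentile of latency samples (ms)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[idx]

# Hypothetical request log: (latency_ms, http_status)
requests = [(120, 200), (95, 200), (2400, 200), (8, 500), (11, 500)]

ok_latencies = [ms for ms, status in requests if status < 500]
err_latencies = [ms for ms, status in requests if status >= 500]

print("p99 success:", p99(ok_latencies))  # the slow success stays visible
print("p99 errors:", p99(err_latencies))  # fast errors tracked on their own
```

Computed over the combined log, the fast 500s would pull the percentiles down; split by status, the 2400 ms success stands out.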

Error Rate

The rate of requests that fail, either explicitly (5xx responses) or implicitly (200 with wrong content). Even a small increase in error rate can indicate a serious issue affecting a subset of users.

Traffic

The amount of demand placed on your system, measured in requests per second. Sudden drops in traffic can indicate an upstream failure, while spikes may predict capacity issues.

Saturation

How full your most constrained resource is (CPU, memory, disk, network). Most systems degrade in performance before they hit 100% utilization. Alert well below capacity.

Reducing Alert Fatigue

Alert fatigue occurs when teams receive so many notifications that they start ignoring them. Studies show that over 30% of monitoring alerts are never investigated, and teams with high alert volumes have slower incident response times. The root cause is usually static thresholds that do not account for normal variation in system behavior.

Sigma-based thresholds solve this by deriving alert boundaries from your actual baseline data. A 2-sigma warning threshold means the value is outside 95.4% of normal observations -- statistically significant but not rare. A 3-sigma critical threshold triggers only for observations beyond 99.7% of normal -- genuinely anomalous. This approach adapts naturally to your system's behavior rather than relying on guesswork.
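The sigma-based derivation fits in a few lines of Python; the baseline values below are illustrative p95 latency samples, not real data:

```python
import statistics

def sigma_thresholds(baseline, warning_sigma=2.0, critical_sigma=3.0):
    """Derive warning and critical thresholds from baseline observations."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    return mean + warning_sigma * stdev, mean + critical_sigma * stdev

# Hypothetical baseline: p95 latency samples (ms) from a quiet week
baseline = [180, 195, 210, 188, 202, 191, 199, 185, 207, 193]
warn, crit = sigma_thresholds(baseline)
print(f"warn above {warn:.0f} ms, critical above {crit:.0f} ms")
```

Because the thresholds come from the observed mean and spread, a noisy service automatically gets wider bands than a stable one.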

Multi-window alerting adds another layer of noise reduction. Instead of alerting on a single 5-minute window, combine a short window (5m) for acute issues with a longer window (1h) for sustained degradation. Google SRE recommends this approach: alert only when both windows are in violation, dramatically reducing false positives while still catching real incidents.
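The two-window AND can be sketched like this, assuming per-minute metric samples; the window sizes and threshold are illustrative:

```python
def should_alert(samples, threshold, short_n=5, long_n=60):
    """Fire only when BOTH the short and long windows breach the threshold.

    samples: per-minute metric values, most recent last.
    """
    def window_mean(n):
        window = samples[-n:]
        return sum(window) / len(window)

    return window_mean(short_n) > threshold and window_mean(long_n) > threshold

# A brief 5-minute spike breaches the short window but not the long one...
spike = [100] * 55 + [900] * 5
print(should_alert(spike, threshold=300))

# ...while a sustained breach trips both windows and pages.
sustained = [900] * 60
print(should_alert(sustained, threshold=300))
```

The short window keeps detection fast for acute incidents; the long window suppresses pages for blips that recover on their own.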

Monitor your services with intelligent alerting -- Start free with TraceKit