Anomaly Detection and Alerting System Designer

Design a practical anomaly detection and alerting system for your business metrics — including threshold logic, statistical methods, alert routing, and false positive reduction strategies.

Prompt Template

You are a data engineering and analytics expert. Help me design an anomaly detection and alerting system for my key business metrics.

Business context:
- Industry: [industry]
- Metrics to monitor: [list 3–6 key metrics, e.g., daily revenue, signup rate, API error rate, page load time]
- Data update frequency: [real-time / hourly / daily]
- Current monitoring: [describe what you have now, e.g., manual dashboards, basic threshold alerts]
- Data stack: [tools you use, e.g., BigQuery, Snowflake, Grafana, Datadog, Looker]
- Team receiving alerts: [who gets them — data team, engineering, ops, leadership]
- Pain point: [e.g., too many false positives, missed real anomalies, alert fatigue]

Please design:

1. **Metric Classification**
   - Categorize each metric by type: count, rate, latency, ratio
   - Define normal patterns: seasonal, day-of-week, trend, stationary
   - Identify which metrics need one-sided vs two-sided detection

2. **Detection Methods** (for each metric)
   - Static threshold (for simple metrics)
   - Z-score (for roughly normal data) / modified Z-score (median and MAD based, for skewed or outlier-prone data)
   - Moving average + standard deviation bands
   - Seasonal decomposition (for metrics with weekly/monthly patterns)
   - Percentage change from previous period
   - Recommend the best method for each metric with reasoning

3. **Alert Severity Framework**
   - P1 (critical) / P2 (warning) / P3 (info) definitions
   - Threshold multipliers for each severity
   - Escalation paths

4. **False Positive Reduction**
   - Minimum duration before alerting (avoid spike noise)
   - Composite alert rules (e.g., alert only if metric A AND metric B are anomalous)
   - Day-of-week and holiday awareness
   - Cool-down periods between repeat alerts

5. **Implementation Plan**
   - SQL or Python pseudocode for 2 detection methods
   - Alert routing setup (Slack, PagerDuty, email)
   - Dashboard design for anomaly visualization

6. **Maintenance & Tuning**
   - How to review and tune thresholds monthly
   - Alert quality scorecard (precision, recall, noise ratio)

Example Output

Anomaly Detection System: E-Commerce SaaS Platform

1. Metric Classification

| Metric | Type | Pattern | Detection Side |
|--------|------|---------|----------------|

2. Detection Methods

Daily Revenue — Seasonal Decomposition + Residual Z-Score

Why: Revenue has strong day-of-week patterns (weekends lower). A simple threshold would fire every Saturday.

    -- BigQuery: 28-day seasonal baseline, one mean and standard deviation per day of week
    WITH baseline AS (
      SELECT
        EXTRACT(DAYOFWEEK FROM date) AS dow,
        AVG(revenue) AS avg_rev,
        STDDEV(revenue) AS std_rev  -- sample standard deviation
      FROM metrics.daily_revenue
      WHERE date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
                     AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      GROUP BY dow
    )
    -- Score today's revenue against the baseline for the same day of week
    SELECT
      t.date,
      t.revenue,
      b.avg_rev,
      -- NULLIF avoids division by zero; z_score is NULL when the baseline has no variance
      (t.revenue - b.avg_rev) / NULLIF(b.std_rev, 0) AS z_score
    FROM metrics.daily_revenue t
    JOIN baseline b ON EXTRACT(DAYOFWEEK FROM t.date) = b.dow
    WHERE t.date = CURRENT_DATE()
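
For teams that prefer Python, the same day-of-week baseline can be sketched with the standard library alone. This is an illustration rather than part of the original output: the function name `dow_z_score`, the `history` dict shape, and the sample revenue figures are all made up for the example.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def dow_z_score(history, today, value):
    """Z-score of `value` against earlier observations on the same weekday.

    history: dict mapping date -> revenue (hypothetical shape).
    Returns None when fewer than two same-weekday samples exist
    or when those samples have zero spread.
    """
    same_dow = [v for d, v in history.items()
                if d < today and d.weekday() == today.weekday()]
    if len(same_dow) < 2:
        return None
    spread = stdev(same_dow)  # sample standard deviation, like STDDEV in the query
    if spread == 0:
        return None
    return (value - mean(same_dow)) / spread


# Made-up 28-day history: weekdays near 1000, weekends near 600.
today = date(2024, 6, 3)  # a Monday
history = {}
for i in range(1, 29):
    d = today - timedelta(days=i)
    base = 600 if d.weekday() >= 5 else 1000
    history[d] = base + (20 if i % 2 else -20)

print(dow_z_score(history, today, 1000))              # typical Monday
print(dow_z_score(history, today, 700))               # strongly negative
print(dow_z_score(history, date(2024, 6, 8), 600))    # typical Saturday
```

Because each weekday is compared only with its own history, a Saturday value of 600 scores as normal even though it sits far below the weekday level. With a 28-day window there are only four samples per weekday, so the standard deviation is a rough estimate.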

API Error Rate — Static Threshold + Rate of Change

Why: Error rate should be near-zero; any significant increase is meaningful.

- P3 (info): Error rate > 1% for 5+ minutes

- P2 (warning): Error rate > 3% for 5+ minutes

- P1 (critical): Error rate > 5% for 2+ minutes OR >10% instantaneous
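
The three tiers above could be evaluated along these lines. This is a minimal sketch that assumes one error-rate sample per minute, expressed as a fraction, with the newest sample last; the function name `classify_error_rate` is hypothetical.

```python
def classify_error_rate(rates):
    """Return 'P1', 'P2', 'P3', or None for per-minute error rates.

    rates: list of fractions (0.03 == 3%), oldest first.
    Assumes exactly one sample per minute.
    """
    def sustained(threshold, minutes):
        # True only if the last `minutes` samples all exceed the threshold.
        window = rates[-minutes:]
        return len(window) == minutes and all(r > threshold for r in window)

    if not rates:
        return None
    # P1: above 10% right now, or above 5% for 2+ minutes
    if rates[-1] > 0.10 or sustained(0.05, 2):
        return "P1"
    # P2: above 3% for 5+ minutes
    if sustained(0.03, 5):
        return "P2"
    # P3: above 1% for 5+ minutes
    if sustained(0.01, 5):
        return "P3"
    return None


print(classify_error_rate([0.02] * 5))           # P3
print(classify_error_rate([0.04] * 5))           # P2
print(classify_error_rate([0.001, 0.06, 0.06]))  # P1
print(classify_error_rate([0.04] * 4))           # None (only 4 minutes so far)
```

Checking the highest severity first means a single evaluation returns the most urgent tier that applies.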

4. False Positive Reduction

1. **Minimum duration:** No alert fires unless anomaly persists for ≥2 consecutive data points

2. **Composite rules:** Revenue drop + signup drop = P1. Revenue drop alone = P2 (could be a payment processor issue)

3. **Holiday suppression:** Maintain a holiday calendar; widen thresholds by 2× on known anomalous days

4. **Cool-down:** After a P1 alert, suppress same-metric alerts for 30 min unless severity escalates
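
One way these four rules might fit together is sketched below. The class and function names are hypothetical, the base Z-score threshold of 3.0 is an assumption, and the cool-down is generalized to apply after any alert (not only P1) so that the "unless severity escalates" clause has an effect.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

SEVERITY_RANK = {"P3": 1, "P2": 2, "P1": 3}


def composite_severity(revenue_drop, signup_drop):
    """Rule 2: both metrics dropping is P1, revenue alone is P2."""
    if revenue_drop and signup_drop:
        return "P1"
    if revenue_drop:
        return "P2"
    return None


def z_threshold(day, holidays, base=3.0):
    """Rule 3: widen the threshold 2x on known anomalous days (base is assumed)."""
    return base * 2 if day in holidays else base


@dataclass
class AlertGate:
    """Rules 1 and 4: persistence check plus cool-down for a single metric."""
    min_points: int = 2
    cooldown: timedelta = timedelta(minutes=30)
    streak: int = 0
    last_fired_at: Optional[datetime] = None
    last_severity: Optional[str] = None

    def observe(self, is_anomalous, severity, now):
        """Return True if an alert should be sent for this data point."""
        if not is_anomalous:
            self.streak = 0
            return False
        self.streak += 1
        if self.streak < self.min_points:
            return False  # not persistent yet
        in_cooldown = (self.last_fired_at is not None
                       and now - self.last_fired_at < self.cooldown)
        escalated = (self.last_severity is not None
                     and SEVERITY_RANK[severity] > SEVERITY_RANK[self.last_severity])
        if in_cooldown and not escalated:
            return False  # repeat alert suppressed
        self.last_fired_at = now
        self.last_severity = severity
        return True


gate = AlertGate()
t0 = datetime(2024, 1, 1, 12, 0)
print(gate.observe(True, "P2", t0))                          # False: first point only
print(gate.observe(True, "P2", t0 + timedelta(minutes=5)))   # True: persisted
print(gate.observe(True, "P2", t0 + timedelta(minutes=10)))  # False: in cool-down
print(gate.observe(True, "P1", t0 + timedelta(minutes=15)))  # True: escalated
```

Keeping one gate per metric keeps the state small and makes a suppressed alert easy to explain when reviewing the scorecard.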

6. Monthly Tuning Scorecard

| Metric | Alerts Fired | True Positives | False Positives | Precision | Action |
|--------|-------------|-----------------|-----------------|-----------|--------|
| Revenue | 8 | 6 | 2 | 75% | Widen weekend threshold |
| Error rate | 3 | 3 | 0 | 100% | No change |
| Target | - | - | - | >80% | - |
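
The precision column can be reproduced with a few lines of Python. This sketch assumes the numeric columns are, in order, alerts fired, true positives, and false positives, with precision defined as true positives divided by alerts fired; the names `precision`, `scorecard`, and `TARGET` are illustrative.

```python
def precision(true_positives, alerts_fired):
    """Share of fired alerts that were real anomalies; None if nothing fired."""
    if alerts_fired == 0:
        return None
    return true_positives / alerts_fired


# (alerts fired, true positives) taken from the scorecard rows
scorecard = {"Revenue": (8, 6), "Error rate": (3, 3)}
TARGET = 0.80  # the ">80%" target row

for metric, (fired, true_pos) in scorecard.items():
    p = precision(true_pos, fired)
    status = "ok" if p > TARGET else "needs tuning"
    print(f"{metric}: precision {p:.0%} ({status})")
# Revenue: precision 75% (needs tuning)
# Error rate: precision 100% (ok)
```

Recall cannot be derived from these columns alone, because it also needs a count of real anomalies that never triggered an alert.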

Tips for Best Results

💡 Start with simple methods (static thresholds, Z-scores) before jumping to ML-based detection; they're easier to debug and explain.
💡 The #1 killer of alerting systems is false positives. Optimize for precision first, then improve recall.
💡 Always include day-of-week and holiday awareness: many false positives come from predictable calendar patterns.
💡 Review alert quality monthly. If a metric has <70% precision, either widen its threshold or switch detection methods.
