Anomaly Detection and Alerting System Designer
Design a practical anomaly detection and alerting system for your business metrics — including threshold logic, statistical methods, alert routing, and false positive reduction strategies.
Prompt Template
You are a data engineering and analytics expert. Help me design an anomaly detection and alerting system for my key business metrics.

Business context:
- Industry: [industry]
- Metrics to monitor: [list 3–6 key metrics, e.g., daily revenue, signup rate, API error rate, page load time]
- Data update frequency: [real-time / hourly / daily]
- Current monitoring: [describe what you have now, e.g., manual dashboards, basic threshold alerts]
- Data stack: [tools you use, e.g., BigQuery, Snowflake, Grafana, Datadog, Looker]
- Team receiving alerts: [who gets them: data team, engineering, ops, leadership]
- Pain point: [e.g., too many false positives, missed real anomalies, alert fatigue]

Please design:

1. **Metric Classification**
   - Categorize each metric by type: count, rate, latency, ratio
   - Define normal patterns: seasonal, day-of-week, trend, stationary
   - Identify which metrics need one-sided vs two-sided detection

2. **Detection Methods** (for each metric)
   - Static threshold (for simple metrics)
   - Z-score (for roughly normal data) or modified Z-score (for skewed or outlier-prone data)
   - Moving average + standard deviation bands
   - Seasonal decomposition (for metrics with weekly/monthly patterns)
   - Percentage change from previous period
   - Recommend the best method for each metric with reasoning

3. **Alert Severity Framework**
   - P1 (critical) / P2 (warning) / P3 (info) definitions
   - Threshold multipliers for each severity
   - Escalation paths

4. **False Positive Reduction**
   - Minimum duration before alerting (avoid spike noise)
   - Composite alert rules (e.g., alert only if metric A AND metric B are anomalous)
   - Day-of-week and holiday awareness
   - Cool-down periods between repeat alerts

5. **Implementation Plan**
   - SQL or Python pseudocode for 2 detection methods
   - Alert routing setup (Slack, PagerDuty, email)
   - Dashboard design for anomaly visualization

6. **Maintenance & Tuning**
   - How to review and tune thresholds monthly
   - Alert quality scorecard (precision, recall, noise ratio)
Example Output
Anomaly Detection System: E-Commerce SaaS Platform
1. Metric Classification
| Metric | Type | Pattern | Detection Side |
|--------|------|---------|----------------|
| Daily revenue | Count ($) | Weekly seasonal + trend | Two-sided (drops AND unexpected spikes) |
| Signup rate | Rate (per hour) | Day-of-week pattern | One-sided (drops only) |
| API error rate | Ratio (%) | Stationary (~0.5%) | One-sided (increases only) |
| P95 page load time | Latency (ms) | Stationary with spikes | One-sided (increases only) |
| Cart abandonment rate | Ratio (%) | Weekly seasonal | Two-sided |
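The "Detection Side" column decides how a deviation score is compared against a threshold. The following is a minimal Python sketch of that check, assuming a precomputed z-score; the function name and the `side` labels (`"two_sided"`, `"drops"`, `"increases"`) are illustrative and not part of any library.

```python
def is_anomalous(z_score: float, threshold: float, side: str) -> bool:
    """Return True when z_score breaches threshold in the monitored direction.

    side is a hypothetical label: "two_sided", "drops", or "increases".
    """
    if side == "two_sided":
        return abs(z_score) > threshold
    if side == "drops":
        return z_score < -threshold
    if side == "increases":
        return z_score > threshold
    raise ValueError(f"unknown side: {side!r}")
```

With this shape, a signup-rate spike never alerts when the metric is configured as `"drops"`, while daily revenue configured as `"two_sided"` alerts in either direction.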
2. Detection Methods
Daily Revenue — Seasonal Decomposition + Residual Z-Score
Why: Revenue has strong day-of-week patterns (weekends lower). A simple threshold would fire every Saturday.
-- BigQuery: day-of-week baseline over the trailing 28 days
-- (a simple stand-in for full seasonal decomposition; ~4 samples per weekday)
WITH baseline AS (
  SELECT
    EXTRACT(DAYOFWEEK FROM date) AS dow,
    AVG(revenue) AS avg_rev,
    STDDEV(revenue) AS std_rev  -- sample standard deviation
  FROM metrics.daily_revenue
  WHERE date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
                 AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY dow
)
SELECT
  t.date,
  t.revenue,
  b.avg_rev,
  -- NULLIF avoids division by zero when a weekday has no variance
  (t.revenue - b.avg_rev) / NULLIF(b.std_rev, 0) AS z_score
FROM metrics.daily_revenue t
JOIN baseline b ON EXTRACT(DAYOFWEEK FROM t.date) = b.dow
-- Assumes today's row is complete when the query runs
WHERE t.date = CURRENT_DATE();
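For teams that score metrics outside the warehouse, the same day-of-week comparison can be sketched in Python. This is an illustration under assumptions, not a drop-in replacement: `dow_z_score` and its `history` mapping (date to revenue) are hypothetical names, and the sample standard deviation is used to match BigQuery's `STDDEV`.

```python
from datetime import date, timedelta
from statistics import mean, stdev
from typing import Dict, Optional


def dow_z_score(history: Dict[date, float], target: date,
                window_days: int = 28) -> Optional[float]:
    """Z-score of history[target] against same-weekday values in the prior window.

    Returns None when fewer than two same-weekday points exist or when the
    standard deviation is zero (mirroring NULLIF(std_rev, 0) in the SQL).
    """
    same_weekday = [
        history[d]
        for d in (target - timedelta(days=k) for k in range(1, window_days + 1))
        if d in history and d.weekday() == target.weekday()
    ]
    if len(same_weekday) < 2:
        return None
    spread = stdev(same_weekday)  # sample standard deviation
    if spread == 0:
        return None
    return (history[target] - mean(same_weekday)) / spread
```

With only about four same-weekday points in a 28-day window, the standard deviation is noisy, so a longer window may be worth testing before relying on tight thresholds.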
API Error Rate — Static Threshold + Rate of Change
Why: Error rate normally sits near its ~0.5% baseline, so any sustained increase is meaningful.
- P3 (info): Error rate > 1% for 5+ minutes
- P2 (warning): Error rate > 3% for 5+ minutes
- P1 (critical): Error rate > 5% for 2+ minutes OR >10% instantaneous
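The three tiers above can be expressed as a small classifier. This sketch assumes one error-rate sample per minute, expressed in percent and ordered oldest first, so "5+ minutes" becomes "the last 5 samples"; both function names are hypothetical.

```python
from typing import List, Optional


def sustained(samples: List[float], threshold: float, minutes: int) -> bool:
    """True when the most recent `minutes` samples all exceed threshold."""
    return len(samples) >= minutes and all(s > threshold for s in samples[-minutes:])


def error_rate_severity(samples: List[float]) -> Optional[str]:
    """Map per-minute error rates (percent, oldest first) to P1/P2/P3 or None."""
    if not samples:
        return None
    # P1: >10% on the latest sample, or >5% for 2+ minutes
    if samples[-1] > 10 or sustained(samples, 5, 2):
        return "P1"
    # P2: >3% for 5+ minutes
    if sustained(samples, 3, 5):
        return "P2"
    # P3: >1% for 5+ minutes
    if sustained(samples, 1, 5):
        return "P3"
    return None
```

Checking the highest severity first ensures a sample window that satisfies several tiers reports only the most urgent one.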
4. False Positive Reduction
1. **Minimum duration:** No alert fires unless anomaly persists for ≥2 consecutive data points
2. **Composite rules:** Revenue drop + signup drop = P1. Revenue drop alone = P2 (could be a payment processor issue)
3. **Holiday suppression:** Maintain a holiday calendar; widen thresholds by 2× on known anomalous days
4. **Cool-down:** After a P1 alert, suppress same-metric alerts for 30 min unless severity escalates
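The four rules above could be combined roughly as follows. This is a sketch under stated assumptions: all names (`AlertGate`, `composite_severity`, `effective_threshold`) are hypothetical, the cool-down is applied after any alert rather than only after a P1, and a signup drop on its own is left unclassified because the rules above do not define it.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Set

SEVERITY_RANK = {"P3": 1, "P2": 2, "P1": 3}


class AlertGate:
    """Per-metric gate applying the minimum-duration and cool-down rules."""

    def __init__(self, min_points: int = 2,
                 cooldown: timedelta = timedelta(minutes=30)):
        self.min_points = min_points
        self.cooldown = cooldown
        self.streak = 0        # consecutive anomalous data points seen so far
        self.last_sent = None  # (timestamp, severity) of the last alert sent

    def should_alert(self, now: datetime, severity: Optional[str]) -> bool:
        """severity is "P1"/"P2"/"P3" for an anomalous point, None otherwise."""
        if severity is None:
            self.streak = 0
            return False
        self.streak += 1
        if self.streak < self.min_points:
            return False  # rule 1: anomaly has not persisted long enough
        if self.last_sent is not None:
            sent_at, sent_severity = self.last_sent
            in_cooldown = now - sent_at < self.cooldown
            escalated = SEVERITY_RANK[severity] > SEVERITY_RANK[sent_severity]
            if in_cooldown and not escalated:
                return False  # rule 4: suppress repeats unless severity rises
        self.last_sent = (now, severity)
        return True


def composite_severity(revenue_drop: bool, signup_drop: bool) -> Optional[str]:
    """Rule 2: both drops together are P1, a revenue drop alone is P2."""
    if revenue_drop and signup_drop:
        return "P1"
    if revenue_drop:
        return "P2"
    return None  # signup drop alone is not covered by the composite rule


def effective_threshold(base: float, day: date, holidays: Set[date]) -> float:
    """Rule 3: widen the threshold by 2x on days in the holiday calendar."""
    return base * 2 if day in holidays else base
```

Keeping one `AlertGate` instance per metric keeps the streak and cool-down state separate, so suppression on one metric never hides alerts on another.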
6. Monthly Tuning Scorecard
| Metric | Alerts Fired | True Positives | False Positives | Precision | Action |
|--------|-------------|-----------------|-----------------|-----------|--------|
| Revenue | 8 | 6 | 2 | 75% | Widen weekend threshold |
| Error rate | 3 | 3 | 0 | 100% | No change |
| Target | — | — | — | >80% | — |
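The Precision column is true positives divided by alerts fired. A short sketch that recomputes it from the two rows above and compares each metric with the 80% target; the `scorecard` dictionary layout and the status labels are assumptions made for illustration.

```python
from typing import Optional


def precision(true_positives: int, alerts_fired: int) -> Optional[float]:
    """Share of fired alerts that were real anomalies; None if nothing fired."""
    if alerts_fired == 0:
        return None
    return true_positives / alerts_fired


# metric -> (alerts fired, true positives), copied from the scorecard rows
scorecard = {"Revenue": (8, 6), "Error rate": (3, 3)}
TARGET = 0.80

for metric, (fired, tp) in scorecard.items():
    p = precision(tp, fired)
    status = "meets target" if p is not None and p > TARGET else "needs tuning"
    shown = "n/a" if p is None else f"{p:.0%}"
    print(f"{metric}: precision {shown} ({status})")
```

Recall is not computed here because it needs a count of real anomalies that never triggered an alert, which the table does not record.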
Tips for Best Results
- 💡Start with simple methods (static thresholds, Z-scores) before jumping to ML-based detection — they're easier to debug and explain.
- 💡The #1 killer of alerting systems is false positives. Optimize for precision first, then improve recall.
- 💡Always include day-of-week and holiday awareness; predictable calendar patterns are a common source of false positives.
- 💡Review alert quality monthly. If a metric has <70% precision, either widen its threshold or switch detection methods.