OpenTelemetry Instrumentation Plan Builder

Plan OpenTelemetry traces, metrics, logs, and dashboards for a service so teams can debug latency, errors, and dependency failures faster.

Prompt Template

You are a senior observability engineer. Create an OpenTelemetry instrumentation plan for this system.

Application/service: [name and purpose]
Architecture: [monolith / microservices / serverless / queue workers]
Tech stack: [language, framework, database, message queue, cloud provider]
Critical user journeys: [checkout, login, report generation, API request, background job]
Current pain points: [slow requests, unknown failures, queue delays, third-party API issues]
Telemetry backend: [Datadog / Grafana Tempo / Honeycomb / New Relic / self-hosted / unknown]
Compliance/security constraints: [PII redaction, sampling rules, retention limits]
Operational goals: [reduce MTTR, improve SLOs, diagnose latency, cost control]

Provide:
1. Instrumentation goals and success metrics
2. Trace spans to add for each critical journey
3. Metrics to capture with names, labels, and units
4. Log enrichment strategy and correlation IDs
5. Error and latency attributes to standardize
6. Sampling strategy for normal and incident periods
7. Dashboard and alert recommendations
8. Implementation checklist with code-level guidance

Example Output

OpenTelemetry Plan — Checkout API

Goals

- Cut checkout incident diagnosis time from 45 minutes to under 15 minutes

- Identify whether latency comes from payment provider, inventory DB, or queue publishing

Trace Design

| Journey | Span | Attributes |

|---|---|---|

| POST /checkout | checkout.request | user_tier, cart_size, region, request_id |

| Payment call | payment.authorize | provider, status_code, retry_count, timeout_ms |

| Inventory reserve | inventory.reserve | sku_count, db_statement_type, lock_wait_ms |

| Queue publish | order.event.publish | queue_name, message_size_bytes |

Metrics

- checkout_requests_total{status,region}

- checkout_latency_ms histogram{region,user_tier}

- payment_provider_latency_ms histogram{provider}

- order_queue_publish_failures_total{queue_name}

Log Enrichment

Add request_id, trace_id, span_id, customer_tier, and sanitized order_id to every checkout log. Never log card details, email addresses, or full addresses.

Sampling

- 10% baseline head sampling for successful requests

- 100% sampling for errors, requests over 2 seconds, and payment retries

- Temporary 50% sampling during release windows

Dashboard

Panels: p95 checkout latency, error rate by provider, queue publish failures, top slow spans, dependency latency heatmap.

Tips for Best Results

  • 💡List your most important user journeys; generic instrumentation plans become noisy fast.
  • 💡Ask for attribute names and cardinality warnings so metrics do not explode in cost.
  • 💡Include privacy constraints because observability tools can accidentally capture sensitive customer data.