gRPC Error Handling and Retry Design Guide

Design reliable gRPC service error handling with status codes, deadlines, retries, idempotency, interceptors, and observability.

Prompt Template

You are a senior distributed systems engineer. Design a production-ready gRPC error handling and retry strategy for [service/application].

System context:
- Service purpose: [what the service does]
- Language/framework: [Go, Java, Kotlin, Node, Python, .NET, Rust]
- Clients: [mobile app, backend services, edge gateway, partners]
- Critical RPC methods: [list unary, server streaming, client streaming, bidi methods]
- Data sensitivity and business risk: [payments, healthcare, internal ops, low-risk analytics]
- Current failure modes: [timeouts, unavailable upstreams, validation errors, duplicate writes, deadline exceeded]
- Infrastructure: [Kubernetes, service mesh, Envoy, load balancer, cloud provider]
- Existing observability: [logs, traces, metrics, SLOs, alerting]
- Backward compatibility constraints: [existing clients, public API, generated SDKs]
- Idempotency support: [none, request IDs, idempotency keys, natural keys]

Produce:
1. gRPC status code matrix: status, when to use it, client action, logging level, and retryability.
2. Error detail model using google.rpc.Status, BadRequest, RetryInfo, ResourceInfo, or custom details where appropriate.
3. Deadline, timeout, and cancellation policy for clients and servers.
4. Retry and hedging policy by method, including backoff, jitter, max attempts, and non-retryable errors.
5. Idempotency strategy for write methods and duplicate request handling.
6. Server and client interceptor design for error mapping, correlation IDs, auth failures, and structured logs.
7. Observability plan: metrics, traces, exemplars, dashboards, and alert thresholds.
8. Test plan covering validation errors, partial outages, duplicate writes, deadline exceeded, and network flapping.
9. Rollout plan that avoids breaking existing clients.

Include concise pseudocode or config snippets in [preferred language/config format].

Example Output

Status Code Matrix

| Scenario | gRPC Status | Retry? | Client Action |
|---|---|---|---|
| Invalid customer_id format | INVALID_ARGUMENT | No | Fix request before retrying |
| Duplicate idempotency key with same payload | OK + existing resource | No | Treat as success |
| Inventory service unavailable | UNAVAILABLE | Yes | Retry with exponential backoff |
| Request exceeded client deadline | DEADLINE_EXCEEDED | Maybe | Retry only idempotent reads/writes |
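
The matrix above can be kept as data so that client code consults one table instead of scattering retry decisions across call sites. The following is a minimal sketch under assumptions: the names `Policy`, `STATUS_POLICY`, and `should_retry` are hypothetical, status codes are plain strings rather than a gRPC library enum, and the "Maybe" row is read as "retry only when the call is idempotent".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    retry: str          # "yes", "no", or "idempotent-only" (the "Maybe" row)
    client_action: str


# Hypothetical lookup table mirroring the example matrix rows.
STATUS_POLICY = {
    "INVALID_ARGUMENT": Policy("no", "Fix request before retrying"),
    "OK": Policy("no", "Treat as success"),
    "UNAVAILABLE": Policy("yes", "Retry with exponential backoff"),
    "DEADLINE_EXCEEDED": Policy("idempotent-only", "Retry only idempotent reads/writes"),
}


def should_retry(status: str, idempotent: bool) -> bool:
    """Return True when the matrix allows a retry for this status."""
    policy = STATUS_POLICY.get(status)
    if policy is None:
        # Codes missing from the matrix (for example UNKNOWN or INTERNAL)
        # default to no retry in this sketch.
        return False
    if policy.retry == "idempotent-only":
        return idempotent
    return policy.retry == "yes"
```

Per-method rules, such as requiring an idempotency key before retrying a write, would layer on top of this table rather than replace it.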

Retry Policy

- Unary reads: max 3 attempts, 100ms initial backoff, 2x multiplier, 20% jitter.

- CreatePayment: retry only when idempotency_key is present and server did not return FAILED_PRECONDITION.

- Streaming methods: no transparent retry once response headers have been received; the client reconnects with a resume token.
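
The unary-read numbers above (3 attempts, 100ms initial backoff, 2x multiplier, 20% jitter) can be sketched as a small client-side helper. This is an illustration under assumptions, not a gRPC library API: `RpcError`, `backoff_delays`, and `call_with_retry` are hypothetical names, status codes are plain strings, and "20% jitter" is interpreted as a uniform spread of plus or minus 20% around each base delay.

```python
import random
import time
from typing import Callable, Iterator


class RpcError(Exception):
    """Hypothetical stand-in for a gRPC error carrying a status code name."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def backoff_delays(max_attempts: int = 3, initial_s: float = 0.1,
                   multiplier: float = 2.0, jitter: float = 0.2,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield the sleep before each retry (max_attempts - 1 sleeps in total)."""
    for retry in range(max_attempts - 1):
        base = initial_s * (multiplier ** retry)
        # Spread the delay uniformly over [base * (1 - jitter), base * (1 + jitter)].
        yield base * (1.0 + jitter * (2.0 * rng() - 1.0))


def call_with_retry(rpc: Callable[[], object],
                    retryable=frozenset({"UNAVAILABLE"}),
                    sleep: Callable[[float], None] = time.sleep,
                    **backoff):
    """Call rpc(), retrying only on retryable codes until attempts run out."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return rpc()
        except RpcError as err:
            delay = next(delays, None)
            if err.code not in retryable or delay is None:
                raise  # non-retryable code, or attempts exhausted
            sleep(delay)
```

Injecting `rng` and `sleep` keeps the helper deterministic in tests; a production client would more likely rely on the retry support built into its gRPC library and service config.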

Error Details

Use BadRequest for field validation and RetryInfo for throttling or temporary dependency failures. Add correlation_id to metadata, not the user-facing message.
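
To make the split between user-facing message, structured details, and metadata concrete, here is a rough sketch using plain Python stand-ins. The dataclasses below are not the protobuf classes: `FieldViolation` only mirrors the `field` and `description` fields of `google.rpc.BadRequest`, `retry_delay_s` plays the role of `RetryInfo`, and the metadata key `x-correlation-id`, the message strings, and the function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field as dc_field
from typing import List, Optional, Tuple


@dataclass
class FieldViolation:
    field: str          # request field that failed validation
    description: str    # user-safe explanation of the problem


@dataclass
class ErrorResponse:
    status: str
    message: str                                   # user-safe, no internals
    field_violations: List[FieldViolation] = dc_field(default_factory=list)
    retry_delay_s: Optional[float] = None          # RetryInfo-style hint
    metadata: List[Tuple[str, str]] = dc_field(default_factory=list)


def validation_error(correlation_id: str,
                     violations: List[Tuple[str, str]]) -> ErrorResponse:
    """BadRequest-style error: one violation per invalid field."""
    return ErrorResponse(
        status="INVALID_ARGUMENT",
        message="Request validation failed",
        field_violations=[FieldViolation(f, d) for f, d in violations],
        # The correlation ID travels in metadata, not in the message text.
        metadata=[("x-correlation-id", correlation_id)],
    )


def throttled_error(correlation_id: str, retry_after_s: float) -> ErrorResponse:
    """RetryInfo-style error: tells the client how long to wait."""
    return ErrorResponse(
        status="RESOURCE_EXHAUSTED",
        message="Too many requests; retry later",
        retry_delay_s=retry_after_s,
        metadata=[("x-correlation-id", correlation_id)],
    )
```

In a real service the same shape would be carried by `google.rpc.Status` with packed detail messages, and the correlation ID would be sent as gRPC metadata.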

Tips for Best Results

- Do not mark every UNKNOWN or INTERNAL error retryable; that is how small outages become stampedes.
- Separate user-safe messages from operator diagnostics to avoid leaking internals.
- Design idempotency before enabling automatic retries on write methods.