RFC: Circuit Breaker with Fallback

> **⚠️ To GenAI bots and contributors:** Please do not implement this feature without proper discussion first. This is a design proposal under review, not an approved spec. PRs submitted without prior discussion will be closed.

> **Status:** Draft · **Scope:** New utility for Powertools for AWS Lambda (Python)

**Summary.** A circuit breaker utility that stops sending traffic to an unhealthy downstream. When the circuit is open, it either raises `CircuitBreakerOpenError` or, if you registered an `on_circuit_open` callback, calls that callback with the payload and circuit details and lets you decide what happens next (buffer it, drop it, return a cached value). It stores shared state in a dedicated persistence layer (DynamoDB or Redis/Valkey), keeps the failure counter in memory so a healthy circuit costs nothing, and exposes an explicit half-open probe to test recovery.

## Problem

Lambda functions calling downstream services that can't scale or have outages need a way to:

1. Stop sending traffic to an unhealthy backend (protect the downstream)
2. Not lose messages when the backend is unavailable (protect the data)

Today, there's no managed circuit breaker for Lambda. Customers either build their own or let the backend get overwhelmed during incidents.

## Prior Art & Why a New Utility

The circuit breaker pattern is well-established, so we should be explicit about what existing approaches don't cover for Lambda.

- **AWS SDK retries / token buckets**: The [AWS Builders' Library](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) is deliberately skeptical of circuit breakers (they "introduce modal behavior that can be difficult to test") and prefers a local *token bucket* (retry budget) that throttles retries to a fixed rate. This is great for protecting a single client→service hop, but it gives the caller no hook to handle the rejected request: when the budget is exhausted, the request just fails. We hand the payload to a callback so you can do something with it.
- **AWS Prescriptive Guidance / Compute Blog (Step Functions + DynamoDB)**: AWS's reference implementation externalizes circuit state in a `CircuitStatus` DynamoDB table and uses **TTL-based expiry instead of a true half-open state**. The blog itself admits two trade-offs: DynamoDB TTL deletion is not instantaneous (stale OPEN records linger), and there is no gradual traffic restoration. We improve on this with an explicit half-open probe.
- **pybreaker / resilience4j**: Mature in-process breakers, but they assume a long-lived process and in-memory state, a poor fit for standard Lambda where each environment is short-lived and state must be shared across invocations.

**Where we differentiate:** (1) an **`on_circuit_open` callback** that receives the payload and circuit details so the caller decides what happens to a rejected request (buffer, drop, return cached), and (2) explicit **half-open** probing rather than blind TTL expiry. We don't ship managed buffering (S3/SQS sinks); we hand you the payload and stay out of the way.

## Developer Experience

The common case is a decorator, following the same shape as `@idempotent`: an explicit `persistence_store`, an optional `config`, and the circuit-specific bits (`name`, `on_circuit_open`) as decorator arguments. You wrap the function that calls the downstream, not the handler (see "Where to put the decorator" below).

The smallest useful setup is a persistence store and a name. Everything else has a default:

```python
from aws_lambda_powertools.utilities.circuit_breaker import circuit_breaker
from aws_lambda_powertools.utilities.circuit_breaker.persistence import CircuitBreakerDynamoDBPersistence

persistence = CircuitBreakerDynamoDBPersistence(table_name="CircuitBreakerState")


@circuit_breaker(name="payment-backend", persistence_store=persistence)
def charge(order: dict) -> dict:
    return payment_api.charge(order)   # the protected call


def handler(event, context):
    # No callback registered, so an open circuit raises CircuitBreakerOpenError.
    return charge(event)
```

With no `config`, the circuit uses `CircuitBreakerConfig()` defaults (same pattern as `config = config or IdempotencyConfig()` in `@idempotent`): open after 5 consecutive failures, probe after 30s, close after 3 probe successes, count any `Exception` as a failure. When you want to tune it, pass a `config`, and register an `on_circuit_open` callback to decide what happens to a rejected payload:

```python
import json
from uuid import uuid4

import boto3

from aws_lambda_powertools.utilities.circuit_breaker import circuit_breaker, CircuitBreakerConfig, CircuitInfo
from aws_lambda_powertools.utilities.circuit_breaker.persistence import CircuitBreakerDynamoDBPersistence

s3 = boto3.client("s3")   # used by the callback below
persistence = CircuitBreakerDynamoDBPersistence(table_name="CircuitBreakerState")

config = CircuitBreakerConfig(
    failure_threshold=5,          # consecutive failures before opening
    recovery_timeout=30,          # seconds in OPEN before a half-open probe
    success_threshold=3,          # consecutive probe successes before closing
    # handled_exceptions defaults to (Exception,): any error counts as a failure.
    # Narrow it when only some errors signal an unhealthy downstream:
    handled_exceptions=(TimeoutError, ConnectionError),
)


def buffer_payload(payload: dict, circuit: CircuitInfo):
    # Circuit is OPEN. The protected call never ran; the payload is yours.
    # Do whatever you want: stash it in S3, push to SQS, drop it, return a cached value.
    s3.put_object(Bucket="payment-overflow", Key=f"{circuit.name}/{uuid4()}", Body=json.dumps(payload))


@circuit_breaker(
    name="payment-backend",
    persistence_store=persistence,
    on_circuit_open=buffer_payload,
    config=config,
)
def charge(order: dict) -> dict:
    return payment_api.charge(order)   # the protected call


def handler(event, context):
    # Circuit CLOSED  → charge() returns the backend's response.
    # Circuit OPEN    → charge() never runs; buffer_payload(order, circuit) runs,
    #                   and charge() returns whatever buffer_payload returns.
    return charge(event)
```

### What `charge()` returns

There is no wrapper type and nothing to inspect. The contract is:

- **Circuit closed** → returns the protected function's result.
- **Circuit open, `on_circuit_open` set** → returns whatever the callback returns. You wrote the callback, so you already know what comes back.
- **Circuit open, no callback** → raises `CircuitBreakerOpenError`. If you didn't say where a rejected payload should go, we fail fast and let you handle it.

```python
# No callback: handle the open circuit yourself.
@circuit_breaker(name="payment-backend", persistence_store=persistence, config=config)
def charge(order: dict) -> dict:
    return payment_api.charge(order)


def handler(event, context):
    try:
        return charge(event)
    except CircuitBreakerOpenError:
        return {"statusCode": 202}   # accepted, will retry later (your call)
```

### The callback contract

`on_circuit_open` is called with two arguments:

- `payload`: the arguments the protected function was called with.
- `circuit`: a small `CircuitInfo` with `name`, `state`, `failure_count`, `opened_at`. Enough to act on, with no internal details leaked.

That is the entire promise: **if the circuit is open, we call your function with the payload and the circuit details. What happens next is yours.** We deliberately don't ship S3/SQS sinks: that's buffering infrastructure we'd have to maintain, and a one-line callback covers it without locking you into our choices.

### Where to put the decorator

Wrap the **function that makes the downstream call**, not the whole handler. The circuit's unit of protection is a single dependency, so a handler that parses the event, validates, and calls two backends should not be behind one circuit: a parsing bug would trip a circuit named after a backend that is perfectly healthy, and a single circuit can't tell which of two backends is failing. Decorating the handler directly is only appropriate when the handler *is* the downstream call (a thin pass-through, e.g. IoT telemetry ingestion).

## Flow

### Circuit States

```mermaid
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: N consecutive failures
    OPEN --> HALF_OPEN: recovery timeout elapsed
    HALF_OPEN --> CLOSED: success_threshold probes succeed
    HALF_OPEN --> OPEN: probe fails
```

- **CLOSED**: normal operation. Requests go to the downstream. Failures are counted.
- **OPEN**: downstream is unhealthy. The protected call is skipped; the `on_circuit_open` callback runs (or `CircuitBreakerOpenError` is raised). No traffic hits the backend.
- **HALF_OPEN**: testing recovery. A single probe request is allowed through at a time. A failed probe reopens the circuit; once `success_threshold` consecutive probes succeed (default 3), the circuit closes.

### What triggers the circuit to open?

Consecutive failures. If N requests in a row fail with a trackable exception (connection error, timeout, 5xx), the circuit opens.

Why consecutive and not time-based: a consecutive counter is predictable and needs no sliding-window bookkeeping. If the backend is actually down, you'll hit the threshold in a handful of invocations anyway.

Trade-off we're accepting: Martin Fowler's article and resilience4j both cover **error-rate** thresholds (e.g., open at 50% failures over a rolling window), which catch a degraded-but-not-dead backend that a consecutive counter would miss. We start with consecutive failures for v1 and leave rate-based thresholds as a future `failure_rate_threshold` option.

**Which exceptions count.** By default, **any exception** counts as a failure. That's the least surprising behavior and what pybreaker does. But not every error means the downstream is unhealthy: a `400` is the caller's fault, a `503` is not. If those "caller errors" count toward the threshold, the circuit opens for the wrong reason. So we let the customer scope it from either side:

- `handled_exceptions` (allowlist): only these count (e.g., `(TimeoutError, ConnectionError)`). Everything else propagates normally and does **not** trip the circuit.
- `ignored_exceptions` (denylist): everything counts *except* these (e.g., ignore `ValidationError`). Handy when most exceptions do signal an unhealthy downstream and only a few are benign.

Passing both is a config error. An exception that doesn't count as a failure is simply re-raised to the caller, so the circuit breaker stays out of the way.
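
The allowlist/denylist rule can be sketched as a single predicate. This is a minimal illustration, not the proposed implementation: the function name `counts_as_failure` and the use of `ValueError` for the config error are assumptions.

```python
from typing import Optional, Tuple, Type

ExcTypes = Tuple[Type[BaseException], ...]


def counts_as_failure(
    exc: BaseException,
    handled_exceptions: Optional[ExcTypes] = None,
    ignored_exceptions: Optional[ExcTypes] = None,
) -> bool:
    """Return True if `exc` should increment the consecutive-failure counter."""
    if handled_exceptions is not None and ignored_exceptions is not None:
        # The RFC treats passing both as a config error (exception type assumed here).
        raise ValueError("Pass handled_exceptions or ignored_exceptions, not both")
    if ignored_exceptions is not None:
        # Denylist: everything counts except the listed types.
        return not isinstance(exc, ignored_exceptions)
    # Allowlist; with nothing configured, any Exception counts (the default).
    return isinstance(exc, handled_exceptions or (Exception,))
```

Whatever the predicate returns, the exception is still re-raised to the caller; the result only decides whether the counter moves.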

### What triggers the circuit to close?

Successful probes during the half-open state. After a configurable recovery timeout (e.g., 30 seconds), the circuit moves to half-open and lets a single probe request through. Once `success_threshold` consecutive probes succeed (default 3), the circuit closes and normal traffic resumes; a failed probe reopens it.

### What happens when the circuit is open?

The protected call is skipped. If an `on_circuit_open` callback is registered, it runs with the payload and circuit details, and its return value becomes the result of the call. If no callback is registered, `CircuitBreakerOpenError` is raised. Either way, no traffic reaches the unhealthy backend.
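
To make the open-circuit contract concrete, here is a toy, in-memory sketch of the dispatch. It is not the proposed utility: `circuit_breaker_sketch` is a hypothetical name, the circuit state is passed in directly instead of being read from a store, `CircuitInfo` and `CircuitBreakerOpenError` are local stand-ins, and the payload is assumed to be a single argument.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional


class CircuitBreakerOpenError(Exception):
    """Local stand-in for the error the RFC proposes."""


@dataclass
class CircuitInfo:
    name: str
    state: str
    failure_count: int
    opened_at: Optional[float]


def circuit_breaker_sketch(
    circuit: CircuitInfo,
    on_circuit_open: Optional[Callable[[Any, CircuitInfo], Any]] = None,
):
    """Toy decorator that only models the OPEN branch."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(payload):
            if circuit.state != "OPEN":
                return func(payload)  # not OPEN: run the protected call
            if on_circuit_open is None:
                raise CircuitBreakerOpenError(circuit.name)  # no callback: fail fast
            # The callback's return value becomes the call's return value.
            return on_circuit_open(payload, circuit)

        return wrapper

    return decorator


info = CircuitInfo(name="payment-backend", state="OPEN", failure_count=5, opened_at=0.0)


@circuit_breaker_sketch(info, on_circuit_open=lambda payload, circuit: {"buffered": payload})
def charge(order: dict) -> dict:
    return {"charged": order["id"]}


print(charge({"id": 1}))  # the callback's result: {'buffered': {'id': 1}}
```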

## State Coordination Across Environments

Each Lambda execution environment handles one request at a time, and a function scales out to many environments. Circuit state therefore has to be shared, not in-process. The naive way to do this is "read the state and update the failure counter on every invocation", but that means a DynamoDB write on essentially every call, which adds cost (~2 WCU/call) and latency (~5-10 ms) to the happy path, where the circuit is healthy and we want it to be invisible.

We avoid that by splitting state into two things that are managed differently:

### Failure counter: local, in-memory

The count of *consecutive failures* lives in memory, per execution environment:

- **Success** → reset the local counter. No write.
- **Failure** → increment the local counter. No write until it hits the threshold.
- Only when an environment reaches N consecutive failures does it persist `OPEN` to the store.

So **writes are O(state transitions), not O(invocations)**. A circuit that stays healthy writes nothing. You only pay during an actual incident, which is exactly when you want to.

| Transition | Writes |
|---|---|
| Healthy operation (CLOSED, no failures) | **0** |
| CLOSED → OPEN | 1 (the env that trips) |
| OPEN → HALF_OPEN | 1 (conditional write = the probe lock) |
| HALF_OPEN → CLOSED / OPEN | 1 |
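
A minimal sketch of the local counter, assuming a hypothetical `LocalFailureCounter` class and a `persist_open` callable that stands in for the single store write. The point it illustrates is that successes and sub-threshold failures never touch the store.

```python
from typing import Callable


class LocalFailureCounter:
    """Per-environment count of consecutive failures; persists only on a trip."""

    def __init__(self, failure_threshold: int, persist_open: Callable[[int], None]):
        self.failure_threshold = failure_threshold
        self.persist_open = persist_open  # called once when the threshold is hit
        self.consecutive_failures = 0

    def record_success(self) -> None:
        self.consecutive_failures = 0  # in-memory reset, no store write

    def record_failure(self) -> bool:
        """Return True if this failure tripped the circuit."""
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False  # below the threshold, no store write
        self.persist_open(self.consecutive_failures)  # the single CLOSED -> OPEN write
        self.consecutive_failures = 0
        return True
```

With a threshold of 3, the sequence fail, fail, success, fail, fail produces no writes, because the success resets the streak; a third consecutive failure after that produces exactly one.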

### Circuit state: persisted, cached on read

The `OPEN` / `HALF_OPEN` / `CLOSED` flag is the shared truth and lives in the store (DynamoDB or Redis/Valkey, see State Store below). To avoid a read per invocation:

- **Local cache with TTL** (reusing the `LRUDict` from `shared/`): each environment reads the shared state once every N seconds, not per call.
- Reads can be **eventually consistent** (half the cost). Tolerating state that's a few seconds stale is the same trade-off the cache already makes.
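
The cached read can be sketched as follows. This is an illustration under assumptions: `CachedCircuitState` and `fetch_state` are hypothetical names, and a plain tuple stands in for the `LRUDict` the RFC plans to reuse. The clock is injectable so the expiry logic can be exercised without sleeping.

```python
import time
from typing import Callable, Optional, Tuple


class CachedCircuitState:
    """Reads the shared state at most once per `max_age` seconds."""

    def __init__(
        self,
        fetch_state: Callable[[], str],
        max_age: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch_state = fetch_state  # e.g. an eventually consistent store read
        self._max_age = max_age
        self._clock = clock
        self._cached: Optional[Tuple[str, float]] = None  # (state, fetched_at)

    def get(self) -> str:
        now = self._clock()
        if self._cached is not None and now - self._cached[1] < self._max_age:
            return self._cached[0]  # fresh enough, no store read
        state = self._fetch_state()
        self._cached = (state, now)
        return state
```

An environment can therefore act on a state that is up to `max_age` seconds stale, which is the window discussed under Defaults & Decisions.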

### The trade-off we accept

The counter is per-environment, not aggregated. With many environments and a threshold of N, the backend may absorb more than N failures before *every* environment trips. We accept this because:

- If the backend is genuinely down, each environment hits N failures within a handful of invocations anyway.
- The **first** environment to trip persists `OPEN`, and every other environment honors it on its next cached read, so one environment's detection protects the rest without each having to see N failures itself.

This is how in-process breakers like resilience4j behave per instance; the shared store turns "per instance" into "first instance protects all."

### Distributed half-open, anchored recovery

- **Half-open coordination is distributed**: when the recovery timeout expires, multiple environments may attempt the probe simultaneously. A DynamoDB conditional write elects exactly one: first wins, the rest are treated as circuit-open (callback or `CircuitBreakerOpenError`).
- **Recovery timeout is anchored, not sliding**: AWS Prescriptive Guidance warns that with multiple concurrent callers, the *first* failure must define the recovery window. Later failures while OPEN must not keep pushing `opened_at` forward, or the circuit never reaches half-open. We compute the half-open transition from a fixed `opened_at`, and only reset it on a confirmed state change.
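
Both rules can be illustrated with an in-memory stand-in for the shared store. This is a sketch, not the DynamoDB implementation: `InMemoryCircuitStore`, `put_open`, `try_acquire_probe`, and `probe_is_due` are hypothetical names, and a `threading.Lock` emulates the atomicity that a conditional write would provide in DynamoDB.

```python
import threading
from typing import Dict


def probe_is_due(opened_at: float, recovery_timeout: float, now: float) -> bool:
    """Anchored window: measured from the fixed opened_at, never from the latest failure."""
    return now - opened_at >= recovery_timeout


class InMemoryCircuitStore:
    """Stand-in for the shared store; the lock emulates a conditional write."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[str, dict] = {}

    def put_open(self, name: str, now: float) -> None:
        with self._lock:
            record = self._records.get(name)
            if record and record["state"] == "OPEN":
                return  # already OPEN: keep the original opened_at
            self._records[name] = {"state": "OPEN", "opened_at": now}

    def try_acquire_probe(self, name: str, recovery_timeout: float, now: float) -> bool:
        """Move OPEN -> HALF_OPEN if the timeout elapsed; only the first caller wins."""
        with self._lock:
            record = self._records.get(name)
            if not record or record["state"] != "OPEN":
                return False  # someone else already holds the probe
            if not probe_is_due(record["opened_at"], recovery_timeout, now):
                return False
            record["state"] = "HALF_OPEN"
            return True
```

A caller that gets `False` from `try_acquire_probe` is treated as circuit-open, so it runs the callback or raises `CircuitBreakerOpenError`.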

## On-Circuit-Open Callback

### Why a callback instead of built-in sinks

An earlier draft shipped managed sinks (`S3Fallback`, `SQSFallback`) that buffered the rejected payload for you. We dropped that in favor of a single callback, because the sinks were a maintenance liability with little upside:

- **Maintenance surface**: each sink means an S3/SQS client, payload-size handling, bucket/queue config, retries, and IAM docs that we own forever.
- **Leaky abstraction**: a managed sink has to tell the caller *where* the payload landed (an S3 key, a queue id), which couples callers to our storage choice and risks leaking internal topology into API responses.
- **It's one line anyway**: `s3.put_object(...)` or `sqs.send_message(...)` inside a callback does the same thing, with full control and zero lock-in.

So the contract is deliberately minimal: **if the circuit is open, we call your function with the payload and the circuit details. What happens next is yours.**

### Contract

`on_circuit_open(payload, circuit)`:

- `payload`: the arguments the protected function was called with.
- `circuit`: a `CircuitInfo` carrying `name`, `state`, `failure_count`, `opened_at`. No internal storage details, nothing to leak.

The callback's return value becomes the return value of the protected call. No `on_circuit_open` registered → `CircuitBreakerOpenError` is raised instead.

The callback owns its own outcome: buffer to S3, push to SQS, drop the request, return a cached value, raise its own exception. We don't retry it, replay it, or inspect it.

## State Store

The circuit breaker needs a shared, lockable key/value store keyed by circuit name. The obvious idea is to reuse Idempotency's `BasePersistenceLayer`, but reading the code, it doesn't fit directly:

- **`DataRecord.status` is a closed enum.** It raises `IdempotencyInvalidStatusError` on any value outside `INPROGRESS` / `COMPLETED` / `EXPIRED`, so we can't store `OPEN` / `HALF_OPEN` / `CLOSED` in it.
- **The public API is payload-keyed.** `save_success` / `save_inprogress` / `get_record` all derive the key by hashing the event via jmespath. Our key is the circuit name, not a payload hash.
- **The conditional write is idempotency-specific.** `DynamoDBPersistenceLayer._put_record` hardcodes a condition expression around `INPROGRESS` and `in_progress_expiry`, not the condition we need for a half-open lock.

`BasePersistenceLayer` is also a public extension point (customers subclass it), so reshaping it is a breaking change for them and risks destabilizing one of the most-used utilities.

### Decision: dedicated persistence layer, shared patterns

We build a `CircuitBreakerPersistenceLayer` (its own small ABC + `CircuitBreakerDynamoDBPersistence` / `CircuitBreakerCachePersistence` implementations) that **mirrors** Idempotency's proven patterns without coupling to it. We prefix the concrete classes with `CircuitBreaker` rather than reusing the generic `DynamoDBPersistenceLayer` name so a function using both Idempotency and the circuit breaker can import both without an alias:

- **Conditional `PutItem`** for the half-open probe lock: the same atomic "first writer wins, others fall through" technique Idempotency uses, but with our own condition expression.
- **`LRUDict`** from `aws_lambda_powertools.shared` for the local read cache. This is already generic (not idempotency-specific), so we reuse it as-is.
- DynamoDB and Redis/Valkey backends, so the customer's choice of store matches the rest of Powertools.

A single record per circuit:

| Field | Description |
|---|---|
| key (PK) | Circuit name (e.g., `payment-backend`) |
| state | `CLOSED`, `OPEN`, `HALF_OPEN` |
| failure_count | Consecutive failures recorded by the env that tripped |
| opened_at | When the circuit opened (drives the recovery timeout) |
| half_open_lock | Atomic probe lock (conditional write) |
| expiry | TTL attribute, auto-expire stale records |
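
As a rough illustration of that record, here is a dataclass mirroring the table. The field names come from the table; the class name `CircuitRecord`, the field types (epoch seconds for timestamps), and the `to_item` / `from_item` helpers are assumptions for this sketch.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CircuitRecord:
    key: str                              # circuit name, e.g. "payment-backend"
    state: str = "CLOSED"                 # CLOSED, OPEN, or HALF_OPEN
    failure_count: int = 0
    opened_at: Optional[int] = None       # epoch seconds (assumed type)
    half_open_lock: Optional[int] = None  # probe lock marker (assumed type)
    expiry: Optional[int] = None          # TTL attribute, epoch seconds (assumed type)

    def to_item(self) -> dict:
        """Drop unset fields so the stored item carries only real values."""
        return {k: v for k, v in asdict(self).items() if v is not None}

    @classmethod
    def from_item(cls, item: dict) -> "CircuitRecord":
        return cls(**item)
```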

### Future consolidation

Once both this layer and Idempotency's exist side by side, the genuinely shared base (a generic locked key/value store with a TTL cache, no status enum or payload hashing) becomes clear and can be extracted as a non-breaking refactor. We deliberately don't attempt that extraction up front: doing it before the second implementation exists is guesswork, and it would mean editing a stable public API to enable a feature that isn't built yet.

## Operational Controls

Both Martin Fowler and AWS Prescriptive Guidance call these out as non-negotiable for a production circuit breaker:

- **Manual force open / force close**: operators must be able to trip a circuit (e.g., to drain a backend for maintenance) or force it closed (e.g., after a confirmed fix, without waiting for the recovery timeout). Since state lives in the persistence layer, this can be done out-of-band by writing the record, so we should document the operation and consider a small CLI/helper. A forced state should be sticky (not auto-overridden by the next failure/success) until explicitly cleared.
- **Log every state transition**: CLOSED→OPEN, OPEN→HALF_OPEN, HALF_OPEN→CLOSED/OPEN must be logged with the circuit name, failure count, and trigger. Wire this through Powertools Logger so it lands in structured logs automatically.
- **Listeners / hooks**: mirror pybreaker's `CircuitBreakerListener` (`on_state_change`, `on_failure`, `on_success`) so customers can emit their own metrics or alerts on transitions.
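
A possible shape for the listener hooks, sketched with the standard library only. The hook names come from the bullet above; the method signatures, the `LoggingListener` class, and the `notify_state_change` helper are assumptions, and stdlib `logging` stands in for Powertools Logger.

```python
import logging
from typing import Iterable

logger = logging.getLogger("circuit_breaker_sketch")


class CircuitBreakerListener:
    """Base class with no-op hooks; subclasses override the ones they need."""

    def on_state_change(self, name: str, old_state: str, new_state: str) -> None:
        pass

    def on_failure(self, name: str, exc: BaseException) -> None:
        pass

    def on_success(self, name: str) -> None:
        pass


class LoggingListener(CircuitBreakerListener):
    """Logs every transition with the circuit name and both states."""

    def on_state_change(self, name: str, old_state: str, new_state: str) -> None:
        logger.warning(
            "circuit state change",
            extra={"circuit": name, "from_state": old_state, "to_state": new_state},
        )


def notify_state_change(
    listeners: Iterable[CircuitBreakerListener], name: str, old_state: str, new_state: str
) -> None:
    """Fan a transition out to every registered listener."""
    for listener in listeners:
        listener.on_state_change(name, old_state, new_state)
```

A metrics listener would follow the same pattern, overriding `on_state_change` to emit a data point instead of a log line.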

## Defaults & Decisions

- **Local cache TTL: default 5s.** A longer TTL means fewer state-store reads (cheaper, faster) but a wider window where an environment acts on stale state after it changed elsewhere. We match the Parameters utility default (`POWERTOOLS_PARAMETERS_MAX_AGE = 5`) for consistency; it's configurable.
- **Metrics: emit on state change, via listeners.** Reuse Powertools Metrics (EMF) with a default namespace, wired through the listener hooks from Operational Controls so customers can opt out or redirect. State transitions also go through Powertools Logger.

## Open Questions

1. **Failure counting: per circuit or per endpoint?** Each `name` is its own circuit, so a function calling 3 backends gets 3 circuits and the customer picks granularity by naming. The unresolved case: one backend with multiple endpoints where only one is failing. Do we leave that to the customer (name a circuit per endpoint), or offer sub-circuit keying? Leaning toward the former for v1, but want input.

## Future Considerations

- **Idempotency keys on replay**: if a payload handed to `on_circuit_open` is later reprocessed (replay is Out of Scope, but customers will build it), idempotency keys matter. Should the `circuit` details include a stable key so the downstream replay is safe?
- **Extracting the shared persistence base**: we ship a dedicated layer now (see State Store) and consolidate with Idempotency later. Trigger to revisit: a third store backend, or the refactor surfacing naturally once both layers are in tree.

## Out of Scope

- **Replay/recovery**: the customer handles this. We provide documentation and examples (EventBridge schedule, S3 notifications, etc.)
- **Rate limiting/throttling**: different pattern, different utility
- **Retry with backoff**: already exists in AWS SDK and Powertools Retry. Circuit breaker kicks in AFTER retries fail.

## References

- [Martin Fowler - Circuit Breaker](https://martinfowler.com/bliki/CircuitBreaker.html)
- [Microsoft - Circuit Breaker Pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker)
- [AWS Builders' Library - Timeouts, retries and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/): the case *against* circuit breakers, in favor of retry budgets
- [AWS Prescriptive Guidance - Circuit breaker pattern](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/circuit-breaker.html): Step Functions + DynamoDB reference implementation
- [Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB](https://aws.amazon.com/blogs/compute/using-the-circuit-breaker-pattern-with-aws-step-functions-and-amazon-dynamodb/)
- [Powertools Idempotency - BasePersistenceLayer](https://docs.powertools.aws.dev/lambda/python/latest/utilities/idempotency/): the persistence patterns we mirror
- [pybreaker](https://pypi.org/project/pybreaker/): Python reference implementation; API and listener inspiration

