RFC: Circuit Breaker with Fallback #8257

@leandrodamascena

⚠️ To GenAI bots and contributors: Please do not implement this feature without proper discussion first. This is a design proposal under review, not an approved spec. PRs submitted without prior discussion will be closed.

Status: Draft · Scope: New utility for Powertools for AWS Lambda (Python)

Summary. A circuit breaker utility that stops sending traffic to an unhealthy downstream. When the circuit is open, it either raises CircuitBreakerOpenError or, if you registered an on_circuit_open callback, calls that callback with the payload and circuit details and lets you decide what happens next (buffer it, drop it, return a cached value). It stores shared state in a dedicated persistence layer (DynamoDB or Redis/Valkey), keeps the failure counter in memory so a healthy circuit never writes to the store, and exposes an explicit half-open probe to test recovery.

Problem

Lambda functions calling downstream services that can't scale or have outages need a way to:

  1. Stop sending traffic to an unhealthy backend (protect the downstream)
  2. Not lose messages when the backend is unavailable (protect the data)

Today, there's no managed circuit breaker for Lambda. Customers either build their own or let the backend get overwhelmed during incidents.

Prior Art & Why a New Utility

The circuit breaker pattern is well-established, so we should be explicit about what existing approaches don't cover for Lambda.

  • AWS SDK retries / token buckets: The AWS Builders' Library is deliberately skeptical of circuit breakers (they "introduce modal behavior that can be difficult to test") and prefers a local token bucket (retry budget) that throttles retries to a fixed rate. This is great for protecting a single client→service hop, but it gives the caller no hook to handle the rejected request: when the budget is exhausted, the request just fails. We hand the payload to a callback so you can do something with it.
  • AWS Prescriptive Guidance / Compute Blog (Step Functions + DynamoDB): AWS's reference implementation externalizes circuit state in a CircuitStatus DynamoDB table and uses TTL-based expiry instead of a true half-open state. The blog itself admits two trade-offs: DynamoDB TTL deletion is not instantaneous (stale OPEN records linger), and there is no gradual traffic restoration. We improve on this with an explicit half-open probe.
  • pybreaker / resilience4j: Mature in-process breakers, but they assume a long-lived process and in-memory state, a poor fit for standard Lambda where each environment is short-lived and state must be shared across invocations.

Where we differentiate: (1) an on_circuit_open callback that receives the payload and circuit details so the caller decides what happens to a rejected request (buffer, drop, return cached), and (2) explicit half-open probing rather than blind TTL expiry. We don't ship managed buffering (S3/SQS sinks); we hand you the payload and stay out of the way.

Developer Experience

The common case is a decorator, following the same shape as @idempotent: an explicit persistence_store, an optional config, and the circuit-specific bits (name, on_circuit_open) as decorator arguments. You wrap the function that calls the downstream, not the handler (see "Where to put the decorator" below).

The smallest useful setup is a persistence store and a name. Everything else has a default:

from aws_lambda_powertools.utilities.circuit_breaker import circuit_breaker
from aws_lambda_powertools.utilities.circuit_breaker.persistence import CircuitBreakerDynamoDBPersistence

persistence = CircuitBreakerDynamoDBPersistence(table_name="CircuitBreakerState")


@circuit_breaker(name="payment-backend", persistence_store=persistence)
def charge(order: dict) -> dict:
    return payment_api.charge(order)   # the protected call


def handler(event, context):
    # No callback registered, so an open circuit raises CircuitBreakerOpenError.
    return charge(event)

With no config, the circuit uses CircuitBreakerConfig() defaults (same pattern as config = config or IdempotencyConfig() in @idempotent): open after 5 consecutive failures, probe after 30s, close after 3 probe successes, count any Exception as a failure. When you want to tune it, pass a config, and register an on_circuit_open callback to decide what happens to a rejected payload:

import json
from uuid import uuid4

import boto3

from aws_lambda_powertools.utilities.circuit_breaker import circuit_breaker, CircuitBreakerConfig, CircuitInfo
from aws_lambda_powertools.utilities.circuit_breaker.persistence import CircuitBreakerDynamoDBPersistence

s3 = boto3.client("s3")   # used by the callback below
persistence = CircuitBreakerDynamoDBPersistence(table_name="CircuitBreakerState")

config = CircuitBreakerConfig(
    failure_threshold=5,          # consecutive failures before opening
    recovery_timeout=30,          # seconds in OPEN before a half-open probe
    success_threshold=3,          # consecutive probe successes before closing
    # handled_exceptions defaults to (Exception,): any error counts as a failure.
    # Narrow it when only some errors signal an unhealthy downstream:
    handled_exceptions=(TimeoutError, ConnectionError),
)


def buffer_payload(payload: dict, circuit: CircuitInfo):
    # Circuit is OPEN. The protected call never ran; the payload is yours.
    # Do whatever you want: stash it in S3, push to SQS, drop it, return a cached value.
    s3.put_object(Bucket="payment-overflow", Key=f"{circuit.name}/{uuid4()}", Body=json.dumps(payload))


@circuit_breaker(
    name="payment-backend",
    persistence_store=persistence,
    on_circuit_open=buffer_payload,
    config=config,
)
def charge(order: dict) -> dict:
    return payment_api.charge(order)   # the protected call


def handler(event, context):
    # Circuit CLOSED  → charge() returns the backend's response.
    # Circuit OPEN    → charge() never runs; buffer_payload(order, circuit) runs,
    #                   and charge() returns whatever buffer_payload returns.
    return charge(event)

What charge() returns

There is no wrapper type and nothing to inspect. The contract is:

  • Circuit closed → returns the protected function's result.
  • Circuit open, on_circuit_open set → returns whatever the callback returns. You wrote the callback, so you already know what comes back.
  • Circuit open, no callback → raises CircuitBreakerOpenError. If you didn't say where a rejected payload should go, we fail fast and let you handle it.

# No callback: handle the open circuit yourself.
@circuit_breaker(name="payment-backend", persistence_store=persistence, config=config)
def charge(order: dict) -> dict:
    return payment_api.charge(order)


def handler(event, context):
    try:
        return charge(event)
    except CircuitBreakerOpenError:
        return {"statusCode": 202}   # accepted, will retry later (your call)

The callback contract

on_circuit_open is called with two arguments:

  • payload: the arguments the protected function was called with.
  • circuit: a small CircuitInfo with name, state, failure_count, opened_at. Enough to act on, with no internal details leaked.

That is the entire promise: if the circuit is open, we call your function with the payload and the circuit details. What happens next is yours. We deliberately don't ship S3/SQS sinks: that's buffering infrastructure we'd have to maintain, and a one-line callback covers it without locking you into our choices.

Where to put the decorator

Wrap the function that makes the downstream call, not the whole handler. The circuit's unit of protection is a single dependency, so a handler that parses the event, validates, and calls two backends should not be behind one circuit: a parsing bug would trip a circuit named after a backend that is perfectly healthy, and a single circuit can't tell which of two backends is failing. Decorating the handler directly is only appropriate when the handler is the downstream call (a thin pass-through, e.g. IoT telemetry ingestion).

Flow

Circuit States

stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: N consecutive failures
    OPEN --> HALF_OPEN: recovery timeout elapsed
    HALF_OPEN --> CLOSED: probe succeeds
    HALF_OPEN --> OPEN: probe fails
  • CLOSED: normal operation. Requests go to the downstream. Failures are counted.
  • OPEN: downstream is unhealthy. The protected call is skipped; the on_circuit_open callback runs (or CircuitBreakerOpenError is raised). No traffic hits the backend.
  • HALF_OPEN: testing recovery. One probe request is allowed through at a time. A failed probe reopens the circuit; the circuit closes once success_threshold consecutive probes succeed (3 by default).
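The diagram's edges can be written down as a small transition table. This is a minimal in-memory sketch for discussion, not the proposed implementation: CircuitState, TRANSITIONS, next_state, and the event names are all hypothetical.

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


# Allowed transitions keyed by (current state, event). Event names are
# illustrative; how many probe successes make up "probe_succeeded" is a
# config matter (success_threshold) and is not modeled here.
TRANSITIONS = {
    (CircuitState.CLOSED, "threshold_reached"): CircuitState.OPEN,
    (CircuitState.OPEN, "recovery_timeout_elapsed"): CircuitState.HALF_OPEN,
    (CircuitState.HALF_OPEN, "probe_succeeded"): CircuitState.CLOSED,
    (CircuitState.HALF_OPEN, "probe_failed"): CircuitState.OPEN,
}


def next_state(current: CircuitState, event: str) -> CircuitState:
    """Return the state after `event`; events with no edge leave the state unchanged."""
    return TRANSITIONS.get((current, event), current)
```

Anything not in the table is a no-op, so a failure reported while the circuit is already OPEN cannot move it anywhere.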

What triggers the circuit to open?

Consecutive failures. If N requests in a row fail with a trackable exception (connection error, timeout, 5xx), the circuit opens.

Why consecutive and not time-based: a consecutive counter is predictable and needs no window bookkeeping. If the backend is actually down, you'll hit the threshold in a handful of invocations anyway.

Trade-off we're accepting: Martin Fowler's description of the pattern and resilience4j both cover error-rate thresholds (e.g., open at 50% failures over a rolling window), which catch a degraded-but-not-dead backend that a consecutive counter would miss. We start with consecutive failures for v1 and leave rate-based thresholds as a future failure_rate_threshold option.
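To make the trade-off concrete, here is a small sketch comparing the two policies on the same outcome sequence. Both functions and the window_size parameter are hypothetical illustrations; only the failure_rate_threshold name comes from this RFC, and nothing here is the proposed implementation.

```python
from collections import deque
from typing import Iterable


def trips_on_consecutive(outcomes: Iterable[bool], failure_threshold: int) -> bool:
    """True if `failure_threshold` failures occur in a row (False = failed call)."""
    streak = 0
    for ok in outcomes:
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return True
    return False


def trips_on_failure_rate(
    outcomes: Iterable[bool], window_size: int, failure_rate_threshold: float
) -> bool:
    """True if any full rolling window has a failure rate at or above the threshold."""
    window = deque(maxlen=window_size)
    for ok in outcomes:
        window.append(ok)
        if len(window) == window_size:
            if window.count(False) / window_size >= failure_rate_threshold:
                return True
    return False


# A degraded backend: every other call fails, so there are never two failures in a row.
degraded = [True, False] * 10
# A dead backend: every call fails.
dead = [False] * 5
```

With failure_threshold=5, the consecutive counter trips on dead but never on degraded, while a 50% rate over a window of 10 does trip on degraded. That gap is the case v1 knowingly leaves uncovered.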

Which exceptions count. By default, any exception counts as a failure. That's the least surprising behavior and what pybreaker does. But not every error means the downstream is unhealthy: a 400 is the caller's fault, a 503 is not. If those "caller errors" count toward the threshold, the circuit opens for the wrong reason. So we let the customer scope it from either side:

  • handled_exceptions (allowlist): only these count (e.g., (TimeoutError, ConnectionError)). Everything else propagates normally and does not trip the circuit.
  • ignored_exceptions (denylist): everything counts except these (e.g., ignore ValidationError). Handy when failures are the norm and only a few are benign.

Passing both is a config error. An exception that doesn't count as a failure is simply re-raised to the caller, so the circuit breaker stays out of the way.
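The scoping rules above fit in one predicate. This is a sketch under assumptions: counts_as_failure is a hypothetical helper, and it uses None to mean "not set" so that passing both options can be detected (the real config may model defaults differently).

```python
from typing import Optional, Tuple, Type

ExceptionTypes = Tuple[Type[BaseException], ...]


def counts_as_failure(
    exc: BaseException,
    handled_exceptions: Optional[ExceptionTypes] = None,
    ignored_exceptions: Optional[ExceptionTypes] = None,
) -> bool:
    """Decide whether `exc` should increment the failure counter."""
    if handled_exceptions is not None and ignored_exceptions is not None:
        # Allowlist and denylist are mutually exclusive.
        raise ValueError("Pass handled_exceptions or ignored_exceptions, not both")
    if handled_exceptions is not None:
        return isinstance(exc, handled_exceptions)       # allowlist: only these count
    if ignored_exceptions is not None:
        return not isinstance(exc, ignored_exceptions)   # denylist: all but these count
    return isinstance(exc, Exception)                    # default: any Exception counts
```

In either mode the exception is still re-raised to the caller; the predicate only decides whether the counter moves.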

What triggers the circuit to close?

Successful probes during the half-open state. After a configurable recovery timeout (e.g., 30 seconds), the circuit moves to half-open and allows exactly one probe request through at a time. Once success_threshold consecutive probes succeed (3 by default), the circuit closes and normal traffic resumes; a failed probe reopens it.

What happens when the circuit is open?

The protected call is skipped. If an on_circuit_open callback is registered, it runs with the payload and circuit details, and its return value becomes the result of the call. If no callback is registered, CircuitBreakerOpenError is raised. Either way, no traffic reaches the unhealthy backend.

State Coordination Across Environments

Each Lambda execution environment handles one request at a time, and a function scales out to many environments. Circuit state therefore has to be shared, not in-process. The naive way to do this is "read the state and update the failure counter on every invocation", but that means a DynamoDB write on essentially every call, which adds cost (~2 WCU/call) and latency (~5-10 ms) to the happy path, where the circuit is healthy and we want it to be invisible.

We avoid that by splitting state into two things that are managed differently:

Failure counter: local, in-memory

The count of consecutive failures lives in memory, per execution environment:

  • Success → reset the local counter. No write.
  • Failure → increment the local counter. No write until it hits the threshold.
  • Only when an environment reaches N consecutive failures does it persist OPEN to the store.

So writes are O(state transitions), not O(invocations). A circuit that stays healthy writes nothing. You only pay during an actual incident, which is exactly when you want to.
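A minimal sketch of the per-environment counter, assuming a hypothetical LocalFailureCounter class and a persist_open callable that stands in for the single write to the store:

```python
from typing import Callable


class LocalFailureCounter:
    """Counts consecutive failures in memory; touches the store only on the trip."""

    def __init__(self, failure_threshold: int, persist_open: Callable[[], None]) -> None:
        self.failure_threshold = failure_threshold
        self._persist_open = persist_open   # stand-in for the write that persists OPEN
        self.consecutive_failures = 0

    def record_success(self) -> None:
        # Any success breaks the streak. No write.
        self.consecutive_failures = 0

    def record_failure(self) -> bool:
        """Increment the streak; persist OPEN and return True exactly when it hits the threshold."""
        self.consecutive_failures += 1
        if self.consecutive_failures == self.failure_threshold:
            self._persist_open()
            return True
        return False
```

Because the write happens only on the call that reaches the threshold, a sequence of failures interleaved with successes never touches the store.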

| Transition | Writes |
| --- | --- |
| Healthy operation (CLOSED, no failures) | 0 |
| CLOSED → OPEN | 1 (the env that trips) |
| OPEN → HALF_OPEN | 1 (conditional write = the probe lock) |
| HALF_OPEN → CLOSED / OPEN | 1 |

Circuit state: persisted, cached on read

The OPEN / HALF_OPEN / CLOSED flag is the shared truth and lives in the store (DynamoDB or Redis/Valkey, see State Store below). To avoid a read per invocation:

  • Local cache with TTL (reusing the LRUDict from shared/): each environment reads the shared state once every N seconds, not per call.
  • Reads can be eventually consistent (half the cost). Tolerating state that's a few seconds stale is the same trade-off the cache already makes.
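The cached read can be sketched as follows. This uses a plain attribute rather than LRUDict to stay self-contained; CachedCircuitState, fetch_state, and the injectable clock are hypothetical names for illustration.

```python
import time
from typing import Callable, Optional


class CachedCircuitState:
    """Reads the shared state at most once per `max_age` seconds."""

    def __init__(
        self,
        fetch_state: Callable[[], str],            # stand-in for the store read
        max_age: float = 5.0,                      # seconds a cached value stays fresh
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_state = fetch_state
        self._max_age = max_age
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._max_age:
            # Cache miss or stale entry: one read against the shared store.
            self._value = self._fetch_state()
            self._fetched_at = now
        return self._value
```

The staleness window equals max_age: an environment may keep acting on the old state for up to that long after another environment changes it.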

The trade-off we accept

The counter is per-environment, not aggregated. With many environments and a threshold of N, the backend may absorb more than N failures before every environment trips. We accept this because:

  • If the backend is genuinely down, each environment hits N failures within a handful of invocations anyway.
  • The first environment to trip persists OPEN, and every other environment honors it on its next cached read, so one environment's detection protects the rest without each having to see N failures itself.

This is how in-process breakers like resilience4j behave per instance; the shared store turns "per instance" into "first instance protects all."

Distributed half-open, anchored recovery

  • Half-open coordination is distributed: when the recovery timeout expires, multiple environments may attempt the probe simultaneously. A DynamoDB conditional write elects exactly one: first wins, the rest are treated as circuit-open (callback or CircuitBreakerOpenError).
  • Recovery timeout is anchored, not sliding: AWS Prescriptive Guidance warns that with multiple concurrent callers, the first failure must define the recovery window. Later failures while OPEN must not keep pushing opened_at forward, or the circuit never reaches half-open. We compute the half-open transition from a fixed opened_at, and only reset it on a confirmed state change.
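Both rules can be illustrated with an in-memory stand-in for the store, where a lock models the atomicity a conditional write would give us. InMemoryCircuitStore and its method names are hypothetical; a real backend would express the same checks as a condition on the write.

```python
import threading
from typing import Optional


class InMemoryCircuitStore:
    """Stand-in for the shared store; the lock models an atomic conditional write."""

    def __init__(self) -> None:
        self.state = "CLOSED"
        self.opened_at: Optional[float] = None
        self._lock = threading.Lock()

    def open_circuit(self, now: float) -> bool:
        """Move to OPEN. A no-op when already OPEN, so opened_at stays anchored."""
        with self._lock:
            if self.state == "OPEN":
                return False          # later failures must not push opened_at forward
            self.state = "OPEN"
            self.opened_at = now
            return True

    def try_acquire_probe(self, now: float, recovery_timeout: float) -> bool:
        """OPEN -> HALF_OPEN once the timeout has elapsed; exactly one caller wins."""
        with self._lock:
            if self.state != "OPEN" or self.opened_at is None:
                return False          # someone else already holds the probe
            if now - self.opened_at < recovery_timeout:
                return False          # still inside the recovery window
            self.state = "HALF_OPEN"
            return True
```

Callers that get False from try_acquire_probe treat the circuit as open (callback or CircuitBreakerOpenError), which is the "first wins" behavior described above.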

On-Circuit-Open Callback

Why a callback instead of built-in sinks

An earlier draft shipped managed sinks (S3Fallback, SQSFallback) that buffered the rejected payload for you. We dropped that in favor of a single callback, because the sinks were a maintenance liability with little upside:

  • Maintenance surface: each sink means an S3/SQS client, payload-size handling, bucket/queue config, retries, and IAM docs that we own forever.
  • Leaky abstraction: a managed sink has to tell the caller where the payload landed (an S3 key, a queue id), which couples callers to our storage choice and risks leaking internal topology into API responses.
  • It's one line anyway: s3.put_object(...) or sqs.send_message(...) inside a callback does the same thing, with full control and zero lock-in.

So the contract is deliberately minimal: if the circuit is open, we call your function with the payload and the circuit details. What happens next is yours.

Contract

on_circuit_open(payload, circuit):

  • payload: the arguments the protected function was called with.
  • circuit: a CircuitInfo carrying name, state, failure_count, opened_at. No internal storage details, nothing to leak.

The callback's return value becomes the return value of the protected call. No on_circuit_open registered → CircuitBreakerOpenError is raised instead.

The callback owns its own outcome: buffer to S3, push to SQS, drop the request, return a cached value, raise its own exception. We don't retry it, replay it, or inspect it.
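The dispatch this contract implies can be sketched in a few lines. This is not the proposed implementation: circuit_breaker_sketch and its is_open argument are hypothetical, the classes are local stand-ins for the names in this RFC, and the sketch assumes a protected function that takes a single payload argument, as in the examples above.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional


class CircuitBreakerOpenError(Exception):
    """Local stand-in for the error named in this RFC."""


@dataclass(frozen=True)
class CircuitInfo:
    """Local stand-in carrying the four fields listed in the contract."""
    name: str
    state: str
    failure_count: int
    opened_at: Optional[float]


def circuit_breaker_sketch(
    name: str,
    is_open: Callable[[], bool],   # hypothetical: replaces the persistence lookup
    on_circuit_open: Optional[Callable[[Any, CircuitInfo], Any]] = None,
):
    def decorator(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
        @functools.wraps(func)
        def wrapper(payload: Any) -> Any:
            if not is_open():
                return func(payload)                  # closed: run the protected call
            if on_circuit_open is None:
                raise CircuitBreakerOpenError(name)   # open, no callback: fail fast
            info = CircuitInfo(name=name, state="OPEN", failure_count=0, opened_at=None)
            return on_circuit_open(payload, info)     # open: the callback's result is the result
        return wrapper
    return decorator
```

Note that the wrapper neither catches nor retries anything the callback does; an exception raised inside on_circuit_open propagates to the caller unchanged.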

State Store

The circuit breaker needs a shared, lockable key/value store keyed by circuit name. The obvious idea is to reuse Idempotency's BasePersistenceLayer, but reading the code, it doesn't fit directly:

  • DataRecord.status is a closed enum. It raises IdempotencyInvalidStatusError on any value outside INPROGRESS / COMPLETED / EXPIRED, so we can't store OPEN / HALF_OPEN / CLOSED in it.
  • The public API is payload-keyed. save_success / save_inprogress / get_record all derive the key by hashing the event via jmespath. Our key is the circuit name, not a payload hash.
  • The conditional write is idempotency-specific. DynamoDBPersistenceLayer._put_record hardcodes a condition expression around INPROGRESS and in_progress_expiry, not the condition we need for a half-open lock.

BasePersistenceLayer is also a public extension point (customers subclass it), so reshaping it is a breaking change for them and risks destabilizing one of the most-used utilities.

Decision: dedicated persistence layer, shared patterns

We build a CircuitBreakerPersistenceLayer (its own small ABC + CircuitBreakerDynamoDBPersistence / CircuitBreakerCachePersistence implementations) that mirrors Idempotency's proven patterns without coupling to it. We prefix the concrete classes with CircuitBreaker rather than reusing the generic DynamoDBPersistenceLayer name so a function using both Idempotency and the circuit breaker can import both without an alias:

  • Conditional PutItem for the half-open probe lock: the same atomic "first writer wins, others fall through" technique Idempotency uses, but with our own condition expression.
  • LRUDict from aws_lambda_powertools.shared for the local read cache. This is already generic (not idempotency-specific), so we reuse it as-is.
  • DynamoDB and Redis/Valkey backends, so the customer's choice of store matches the rest of Powertools.

A single record per circuit:

| Field | Description |
| --- | --- |
| key (PK) | Circuit name (e.g., payment-backend) |
| state | CLOSED, OPEN, HALF_OPEN |
| failure_count | Consecutive failures recorded by the env that tripped |
| opened_at | When the circuit opened (drives the recovery timeout) |
| half_open_lock | Atomic probe lock (conditional write) |
| expiry | TTL attribute, auto-expire stale records |
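As a discussion aid, the record could be modeled like this. The field names come from the table; the class name, the attribute types (epoch seconds for timestamps, an integer lock marker), and the to_item helper are assumptions, not part of the proposal.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class CircuitRecordSketch:
    key: str                                # circuit name, the partition key
    state: str = "CLOSED"                   # CLOSED, OPEN, or HALF_OPEN
    failure_count: int = 0
    opened_at: Optional[int] = None         # assumption: epoch seconds
    half_open_lock: Optional[int] = None    # assumption: epoch seconds when the probe lock was taken
    expiry: Optional[int] = None            # assumption: epoch seconds, used as the TTL attribute

    def to_item(self) -> Dict[str, Any]:
        """Serialize to a plain dict, omitting attributes that are not set."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Omitting unset attributes keeps a healthy circuit's record small and lets a conditional write test for the absence of half_open_lock.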

Future consolidation

Once both this layer and Idempotency's exist side by side, the genuinely shared base (a generic locked key/value store with a TTL cache, no status enum or payload hashing) becomes clear and can be extracted as a non-breaking refactor. We deliberately don't attempt that extraction up front: doing it before the second implementation exists is guesswork, and it would mean editing a stable public API to enable a feature that isn't built yet.

Operational Controls

Both Martin Fowler and AWS Prescriptive Guidance call these out as non-negotiable for a production circuit breaker:

  • Manual force open / force close: operators must be able to trip a circuit (e.g., to drain a backend for maintenance) or force it closed (e.g., after a confirmed fix, without waiting for the recovery timeout). Since state lives in the persistence layer, this can be done out-of-band by writing the record, so we should document the operation and consider a small CLI/helper. A forced state should be sticky (not auto-overridden by the next failure/success) until explicitly cleared.
  • Log every state transition: CLOSED→OPEN, OPEN→HALF_OPEN, HALF_OPEN→CLOSED/OPEN must be logged with the circuit name, failure count, and trigger. Wire this through Powertools Logger so it lands in structured logs automatically.
  • Listeners / hooks: mirror pybreaker's CircuitBreakerListener (on_state_change, on_failure, on_success) so customers can emit their own metrics or alerts on transitions.
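A rough sketch of how logging and listener hooks could share one notification path. The hook names follow the list above, but their signatures, notify_state_change, and TransitionRecorder are hypothetical, and the standard library logger stands in for Powertools Logger.

```python
import logging
from typing import List, Tuple

logger = logging.getLogger("circuit_breaker_sketch")


class CircuitBreakerListener:
    """No-op base class; override only the hooks you need."""

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        pass

    def on_failure(self, circuit_name: str, exc: BaseException) -> None:
        pass

    def on_success(self, circuit_name: str) -> None:
        pass


class TransitionRecorder(CircuitBreakerListener):
    """Example listener that keeps transitions in memory (swap in metrics or alerts)."""

    def __init__(self) -> None:
        self.transitions: List[Tuple[str, str, str]] = []

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        self.transitions.append((circuit_name, old_state, new_state))


def notify_state_change(
    listeners: List[CircuitBreakerListener],
    circuit_name: str,
    old_state: str,
    new_state: str,
    failure_count: int,
) -> None:
    """Log the transition once, then fan out to every registered listener."""
    logger.info(
        "circuit state change",
        extra={
            "circuit_name": circuit_name,
            "old_state": old_state,
            "new_state": new_state,
            "failure_count": failure_count,
        },
    )
    for listener in listeners:
        listener.on_state_change(circuit_name, old_state, new_state)
```

Routing every transition through one function means the log line and the hooks cannot drift apart, and a customer who wants no metrics simply registers no listener.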

Defaults & Decisions

  • Local cache TTL: default 5s. A longer TTL means fewer state-store reads (cheaper, faster) but a wider window where an environment acts on stale state after it changed elsewhere. We match the Parameters utility default (POWERTOOLS_PARAMETERS_MAX_AGE = 5) for consistency; it's configurable.
  • Metrics: emit on state change, via listeners. Reuse Powertools Metrics (EMF) with a default namespace, wired through the listener hooks from Operational Controls so customers can opt out or redirect. State transitions also go through Powertools Logger.

Open Questions

  1. Failure counting: per circuit or per endpoint? Each name is its own circuit, so a function calling 3 backends gets 3 circuits and the customer picks granularity by naming. The unresolved case: one backend with multiple endpoints where only one is failing. Do we leave that to the customer (name a circuit per endpoint), or offer sub-circuit keying? Leaning toward the former for v1, but want input.

Future Considerations

  • Idempotency keys on replay: if a payload handed to on_circuit_open is later reprocessed (replay is Out of Scope, but customers will build it), idempotency keys matter. Should the circuit details include a stable key so the downstream replay is safe?
  • Extracting the shared persistence base: we ship a dedicated layer now (see State Store) and consolidate with Idempotency later. Trigger to revisit: a third store backend, or the refactor surfacing naturally once both layers are in tree.

Out of Scope

  • Replay/recovery: the customer handles this. We provide documentation and examples (EventBridge schedule, S3 notifications, etc.)
  • Rate limiting/throttling: different pattern, different utility
  • Retry with backoff: already exists in AWS SDK and Powertools Retry. Circuit breaker kicks in AFTER retries fail.
