Expire circuit breaker state#917
Conversation
Add TTLs to circuit breaker state written by the default failure policy so unreachable hosts do not leave per-host state behind forever. Custom failure policies can opt in with stateTtl because Fedify cannot infer their retention window. Migrate valid legacy state from earlier 2.3 releases instead of deleting active circuits, and guard the best-effort migration with a CAS marker so concurrent workers do not race or block delivery behind another sweep. Fixes fedify-dev#916 fedify-dev#917 Assisted-by: Codex:gpt-5.5
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces a TTL for outbound delivery circuit breaker states in the key-value store to prevent remote host states from being retained indefinitely. It also implements a legacy sweep mechanism to automatically clean up stale, no-TTL states from previous versions. The review feedback highlights a critical performance and reliability issue: the legacy sweep is currently executed synchronously on every outbound delivery path, which can block active requests and propagate transient database errors. The reviewer recommends running the sweep asynchronously in the background and exposing a helper to await its completion in unit tests.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1ac7e8bd6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
packages/fedify/src/federation/middleware.test.ts (1)
83-91: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate of
markLegacySweepDonein circuit-breaker.test.ts.Same key array and payload as
markLegacySweepDoneincircuit-breaker.test.ts. Consider consolidating into a shared test helper to keep the sweep-meta key in sync across both files.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/fedify/src/federation/middleware.test.ts` around lines 83 - 91, `markCircuitBreakerLegacySweepDone` duplicates `markLegacySweepDone` from `circuit-breaker.test.ts` with the same sweep-meta key and payload, so consolidate this into a shared test helper. Move the common helper to a shared test utility and have both `markCircuitBreakerLegacySweepDone` and `markLegacySweepDone` call it, keeping the `_fedify/circuit/__fedify_meta/circuit_breaker_state_ttl_sweep_v1` key definition in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/fedify/src/federation/circuit-breaker.test.ts`:
- Around line 77-90: The migration test currently only verifies the rewritten
value shape and does not confirm that TTL is actually applied during the
legacy-state rewrite. Extend CountingSweepKvStore in circuit-breaker.test.ts to
capture write options by overriding set/cas and recording KvStoreSetOptions,
then update the “migrates legacy states without TTL once” test to assert the
migration write for stale-a.example and stale-b.example includes a TTL. Use the
existing CountingSweepKvStore and migration test block to locate the change.
- Around line 129-136: The `markLegacySweepDone` helper duplicates the same
sweep-marker logic used by `markCircuitBreakerLegacySweepDone` in
`middleware.test.ts`, including the magic key path and payload. Refactor this
test setup into a shared helper/utility that both tests import, and update
`markLegacySweepDone` to delegate to that shared function so the sweep meta key
only lives in one place.
- Around line 852-889: `CircuitBreaker.#sweepLegacyStates()` is currently
causing transient kv.list() failures to reject the main flow, so update the
`CircuitBreaker` sweep path to catch and log sweep errors instead of propagating
them from `beforeSend`, `recordFailure`, `recordSuccess`, and `getState`. Keep
the primary state transition logic running after a failed sweep, and use the
existing `CircuitBreaker`/`#sweepLegacyStates` symbols to locate the shared
sweep call site and error handling.
In `@packages/fedify/src/federation/circuit-breaker.ts`:
- Around line 512-516: The `#deleteIfUnchanged` flow can leave
LEGACY_SWEEP_DELETING_MARKER behind permanently if the delete step fails or the
worker stops between the CAS and delete. Update the `#deleteIfUnchanged` method in
circuit-breaker.ts so the this.#kv.cas call writes the marker with an explicit
TTL/expiration option, ensuring the temporary deleting state self-expires even
when `#kv.delete` is never reached. Keep the change localized to
`#deleteIfUnchanged` and reuse the existing LEGACY_SWEEP_DELETING_MARKER logic.
---
Duplicate comments:
In `@packages/fedify/src/federation/middleware.test.ts`:
- Around line 83-91: `markCircuitBreakerLegacySweepDone` duplicates
`markLegacySweepDone` from `circuit-breaker.test.ts` with the same sweep-meta
key and payload, so consolidate this into a shared test helper. Move the common
helper to a shared test utility and have both
`markCircuitBreakerLegacySweepDone` and `markLegacySweepDone` call it, keeping
the `_fedify/circuit/__fedify_meta/circuit_breaker_state_ttl_sweep_v1` key
definition in one place.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 19e226d1-703b-489e-88cd-b780b0cadbee
📒 Files selected for processing (4)
CHANGES.mdpackages/fedify/src/federation/circuit-breaker.test.tspackages/fedify/src/federation/circuit-breaker.tspackages/fedify/src/federation/middleware.test.ts
Keep the legacy circuit-breaker migration out of the delivery path. Sweep failures are logged and retried later instead of rejecting the operation that happened to start the cleanup. Add test hooks and assertions for the background sweep, including the migration TTL and the temporary deleting marker TTL. Share the test helper that marks the legacy sweep complete so seeded circuit states do not depend on duplicated marker literals. fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) Assisted-by: Codex:gpt-5.5
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where the outbound delivery circuit breaker retains per-host state in the key-value store indefinitely. It introduces a TTL on writes for the default failure policy, adds a new stateTtl option for custom policies, and implements a background legacy sweep to clean up or migrate older, no-TTL state. The review feedback highlights a potential issue where parsing the stored retryUntil timestamp using Temporal.Instant.from is not guarded by a try/catch block. If the stored value is malformed, it could throw a RangeError and disrupt the background sweep process. It is recommended to wrap these parsing calls in try/catch blocks to handle invalid dates gracefully.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f0db201af
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report❌ Patch coverage is
🚀 New features to boost your workflow:
|
Keep default circuit breaker state past the recovery delay by adding the failure window to the derived TTL. That preserves an open circuit long enough for the first half-open probe and any immediate failure to reuse the existing state. Treat malformed legacy sweep retry markers as invalid marker state instead of letting Temporal parsing throw from the background sweep. fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where the outbound delivery circuit breaker retained per-host state in the key-value store indefinitely if a remote host never recovered. To resolve this, circuit breaker states are now written with a Time-To-Live (TTL) by default, and custom failure policies can opt-in using the new stateTtl option. Additionally, a background migration sweep has been introduced to automatically clean up or migrate legacy, TTL-less state entries written by previous 2.3 releases on CAS-backed stores. Comprehensive unit tests have been added to verify the new TTL behavior, custom policy options, and the legacy sweep migration logic. There are no review comments to address, and I have no further feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c1efef678
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Remember completed legacy sweep markers in each CircuitBreaker instance so normal delivery paths stop polling the marker after the migration is done. The done marker is cached until retryUntil, while final markers are cached permanently. fedify-dev#917 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request resolves an issue where the outbound delivery circuit breaker retained per-host state indefinitely in the key-value store for hosts that never recovered. It introduces a TTL on writes made with the default failure policy, allows custom failure policies to opt in via a new stateTtl option, and automatically sweeps stale state written by earlier 2.3 releases on CAS-backed stores. The review feedback highlights two key improvements: first, replacing optional chaining with non-null assertions inside the legacy sweep loop to prevent potential infinite loops if the CAS guard is bypassed; second, adding a key length check during the sweep to defensively skip non-host keys under the prefix.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 38c3e7fffb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Keep legacy sweeps focused on host state keys and make their guarded CAS calls fail fast if a future refactor bypasses the CAS-store precondition. Refresh Postgres KV row timestamps on upsert so TTLs are measured from the latest write rather than the original row creation time. fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces state TTLs for the outbound delivery circuit breaker to prevent per-host state from being retained indefinitely, along with a background legacy sweep mechanism to migrate or clear old no-TTL states. Additionally, the PostgreSQL key-value store was updated to refresh the created timestamp on updates to ensure correct TTL calculations. The review feedback highlights a potential deadlock if a worker crashes during a legacy sweep, leaving a stale sweeping marker, and suggests verifying if the sweep has timed out using LEGACY_SWEEP_LOCK_TTL.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c66ba1874
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Let stale legacy sweep markers be replaced after the lock TTL so a crashed or failed worker cannot block future cleanup forever. Extend the default state TTL to cover held messages created near the end of the recovery delay. fedify-dev#917 (comment) fedify-dev#917 (comment) fedify-dev#917 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a TTL (Time-To-Live) mechanism for outbound delivery circuit breaker states stored in the key-value store, preventing per-host state from being retained indefinitely when a remote host never recovers. For the default numeric failure policy, a default TTL is automatically derived, while custom failure policies can opt in using the new stateTtl option. Additionally, a background legacy sweep mechanism is introduced to automatically migrate or clear stale, no-TTL states written by earlier 2.3 releases on CAS-backed stores. The PostgreSQL KV store implementation (PostgresKvStore) is also updated to refresh the created timestamp on conflict updates, ensuring TTLs are correctly recalculated. There are no review comments to address, so no additional feedback is provided.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9afca89d57
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Update the normalized circuit breaker option test to match the derived state TTL after including the recovery delay on the held-activity side of the calculation. fedify-dev#917 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where outbound delivery circuit breaker states were retained indefinitely in the key-value store. It introduces a stateTtl option to expire circuit breaker states and implements a background legacy sweep mechanism to migrate or clean up stale, no-TTL states from previous versions. Additionally, the PostgreSQL KV store implementation is updated to refresh the created timestamp on updates, ensuring correct TTL calculation. No review comments were provided, so there is no additional feedback to address.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. 👍 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@nyanrus If you have some spare time, could you review this pull request? Thanks! |
Why this is needed
Fedify 2.3.0 introduced per-host circuit breaker state, but those entries were written without a TTL. That made sense for the immediate delivery logic, where the state is just a small record of recent failures, but it gave old entries no natural way to disappear when a remote host never recovered. Over time, a busy server could keep circuit breaker keys for hosts it no longer needed to track.
This PR treats circuit breaker state as soft state. The default failure policy has a bounded time window, so Fedify can derive a safe retention period from the failure window, recovery delay, and held activity TTL. Custom failure callbacks are different: Fedify cannot inspect how far back they want to look, so those states stay unexpired unless the application opts in with
stateTtl.Closes #916.
How the fix works
New circuit breaker writes now include a TTL when the normalized options have one. The stored value also gets a small internal version marker. Public reads still return the same CircuitBreakerKvState shape, so the marker stays an implementation detail rather than becoming part of the API.
The legacy cleanup path is deliberately conservative. Fedify 2.3.0 and 2.3.1 may already have valid open or half-open circuits in the configured KV store, so the migration does not delete every unversioned value. It parses valid legacy state and rewrites it with the internal marker and TTL. Malformed values under the circuit prefix are removed only when a CAS check confirms that they have not changed since listing.
The sweep is guarded with a marker under the circuit prefix. That avoids two upgraded workers racing through the same prefix, and it leaves room for a second sweep after a short grace window so rolling deployments can catch no-TTL state written by still-running old workers. If another worker is already sweeping, the current delivery path skips the migration instead of waiting behind a full prefix scan. The migration is useful cleanup, not a reason to stall outbound delivery.
Stores without CAS are left out of the legacy sweep. They still get TTLs on new state writes, but they cannot safely migrate or delete legacy entries after a prefix listing without risking stale writes. The non-CAS path therefore avoids extra marker writes on every circuit breaker operation.
Tests
The regression tests in packages/fedify/src/federation/circuit-breaker.test.ts cover TTL writes with and without CAS, custom failure policy retention, legacy state migration, rolling-upgrade retry behavior, transient sweep failures, and the case where another worker already holds the sweep marker. The middleware tests that seed circuit state now mark the migration as complete so they keep testing the intended delivery paths.
I verified the branch with:
deno test --config deno.json --allow-env packages/fedify/src/federation/circuit-breaker.test.ts mise run check git diff --check