
fix(audit): grow writeAuditEntry 412 backoff to exponential jitter#277

Merged: kptdobe merged 1 commit into main from fix/audit-412-backoff-exponential on May 12, 2026

kptdobe commented May 12, 2026

Summary

  • Replace linear 412-retry jitter with exponential jitter in writeAuditEntry.
  • Per-attempt jitter window grows from 0-50/0-100/0-150/0-200 ms (500 ms worst-case total) to 0-50/0-100/0-200/0-400 ms (750 ms worst-case total).
  • Retry count unchanged (4 retries, 5 total attempts).
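
The window arithmetic in the summary can be checked with a small sketch. The helper names (`linearUpperBound`, `exponentialUpperBound`, `windows`, `sum`) are hypothetical and not taken from the da-admin code; only the 50 ms base, the 4 retries, and the `50 * 2**attempt` growth come from this PR.

```javascript
// Hypothetical helpers for illustration; not the da-admin implementation.
const BASE_MS = 50;
const RETRIES = 4;

// Upper bound of the old linear window for a 0-based retry index: 50, 100, 150, 200.
function linearUpperBound(attempt) {
  return BASE_MS * (attempt + 1);
}

// Upper bound of the new exponential window: 50, 100, 200, 400.
function exponentialUpperBound(attempt) {
  return BASE_MS * 2 ** attempt;
}

// Collect the upper bound for every retry.
function windows(upperBound) {
  return Array.from({ length: RETRIES }, (_, attempt) => upperBound(attempt));
}

const sum = (xs) => xs.reduce((a, b) => a + b, 0);

// Worst-case total sleep: 500 ms for linear, 750 ms for exponential.
console.log(windows(linearUpperBound), sum(windows(linearUpperBound)));
console.log(windows(exponentialUpperBound), sum(windows(exponentialUpperBound)));
```

The first two windows are identical in both schemes, so the extra 250 ms of worst-case headroom comes entirely from the third and fourth retries.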

Why

The daily DA worker log review on 2026-05-12 measured writeAuditEntry failed / PreconditionFailed at 5011/24h (and still climbing at query time); the previous day was 1929/24h. Two consecutive days above 250/24h is the explicit re-tune trigger from the monitoring issue. The backoff merged in #274 (~500ms total) does not leave enough headroom during burst contention.

Test plan

  • New unit test: stubs setTimeout + Math.random to capture per-retry delays, then asserts the per-attempt exponential upper bounds and that total elapsed sleep falls in the range (500ms, 750ms).
  • npm test -- 390 passing, 0 failing.
  • npm run lint (via lint-staged on commit) -- clean on touched files.
  • Post-deploy: re-run the writeAuditEntry failed Coralogix query for last 1h and last 6h. Expect rate to drop materially.

Out of scope

  • Schema/ordering changes to the audit entry write (contention is structural to per-file If-Match writes).
  • The paired da-collab docroom 412 log -- resolves once da-admin race rate drops.
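
Why the contention is structural can be seen in a minimal sketch of an If-Match style conditional write. `VersionedStore` is an in-memory stand-in invented for illustration; it is not the storage API da-admin uses, and the integer `etag` is an assumption that replaces a real ETag header.

```javascript
// In-memory stand-in for a store with ETag preconditions (hypothetical).
class VersionedStore {
  constructor() {
    this.body = '';
    this.etag = 0;
  }

  get() {
    return { body: this.body, etag: this.etag };
  }

  // Mimics PUT with If-Match: succeeds only if the caller read the latest version.
  put(body, ifMatch) {
    if (ifMatch !== this.etag) return 412;
    this.body = body;
    this.etag += 1;
    return 200;
  }
}

const store = new VersionedStore();
const a = store.get();
const b = store.get(); // both writers read the same version

const statusA = store.put(`${a.body}entry from A\n`, a.etag); // first writer wins
const statusB = store.put(`${b.body}entry from B\n`, b.etag); // stale etag is rejected

// The losing writer has to re-read and retry so that A's entry is not dropped.
const fresh = store.get();
const statusRetry = store.put(`${fresh.body}entry from B\n`, fresh.etag);

console.log(statusA, statusB, statusRetry);
```

Any two writers that append to the same file at the same time will hit this path, so backoff only spreads the retries out; it does not remove the race.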

References

Bursts of PreconditionFailed on the per-file audit.txt If-Match write
hit 5011/24h on 2026-05-12 (1929/24h the day before) - the existing linear
0-50/0-100/0-150/0-200 ms jitter (~500 ms total) does not spread retries
far enough across the contention window.

Switch to exponential jitter 0-50/0-100/0-200/0-400 ms (~750 ms total,
i.e. Math.random() * 50 * 2**attempt). Retry count unchanged (4 retries,
5 total attempts). Early retries stay cheap; late retries get the headroom.

New test stubs setTimeout + Math.random to capture per-retry delays and
asserts the per-attempt upper bounds plus that total elapsed sleep exceeds
the prior linear worst-case (500 ms). All 390 existing tests continue to pass.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@kptdobe kptdobe requested a review from bosschaert May 12, 2026 08:08
@kptdobe kptdobe merged commit 17aac95 into main May 12, 2026
6 checks passed
@kptdobe kptdobe deleted the fix/audit-412-backoff-exponential branch May 12, 2026 14:55
adobe-bot pushed a commit that referenced this pull request May 12, 2026
## [1.8.2](v1.8.1...v1.8.2) (2026-05-12)

### Bug Fixes

* **audit:** grow writeAuditEntry 412 backoff to exponential jitter ([#277](#277)) ([17aac95](17aac95))
adobe-bot commented

🎉 This PR is included in version 1.8.2 🎉



kptdobe commented May 13, 2026

Post-deploy verification (release v1.8.2)

Re-ran the canonical Coralogix query for da-admin writeAuditEntry failed and the paired da-collab signal after merge + deploy:

| Window | da-admin writeAuditEntry failed | da-collab [docroom] Failed to update document |
| --- | --- | --- |
| last 30m | 0 | (not queried, expected ~0) |
| last 1h | 0 | 0 |
| last 2h | 2097 | n/a |
| last 3h | 2496 | n/a |
| last 6h | 4942 | 16 |
| last 24h | 6018 | n/a |

The 24h / 6h totals are dominated by the pre-merge burst. The cutover is sharp: rate goes to 0 between the 2h and 1h windows, which lines up with the v1.8.2 release deploy after this PR merged.
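
To make the cutover easier to see, the cumulative windows can be turned into per-interval counts. This is a sketch that assumes all the queries were run at roughly the same moment, so that each window fully contains the shorter ones; the counts are copied from the da-admin column above, and the variable names are illustrative.

```javascript
// Cumulative counts for da-admin writeAuditEntry failed, keyed by window length in hours.
const cumulative = [
  { hours: 0.5, count: 0 },
  { hours: 1, count: 0 },
  { hours: 2, count: 2097 },
  { hours: 3, count: 2496 },
  { hours: 6, count: 4942 },
  { hours: 24, count: 6018 },
];

// Failures between consecutive window edges (assumes the queries share one "now").
const buckets = cumulative.map((w, i) => {
  const prev = i === 0 ? { hours: 0, count: 0 } : cumulative[i - 1];
  return { fromHoursAgo: prev.hours, toHoursAgo: w.hours, failures: w.count - prev.count };
});

console.log(buckets);
```

Under that assumption, every failure in the last 2h falls in the interval between 1h and 2h ago (2097), and the two most recent intervals are empty.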

Before/after:

  • da-admin writeAuditEntry failed: 5011/24h (2026-05-12 pre-merge burst) -> 0/1h post-deploy.
  • da-collab [docroom] Failed to update document (paired): 209/24h baseline -> 0/1h post-deploy.

Both signals materially dropped, which is the COR-26 acceptance criterion.
