fix: loop safety — max iteration guard and malformed config detection by iemejia · Pull Request #254 · microsoft/pg_durable

iemejia · 2026-06-18T15:58:20Z

Split from #221 to ease review.

M7: Loops now track iteration count across continue_as_new generations via FunctionInput.loop_iteration. After 100,000 iterations (~27 hours at the minimum 1-second rate limit), the loop fails with a clear error directing users to df.break() or workflow restructuring.

M8: If a LOOP node's condition config is unparseable JSON, the code now returns an error immediately instead of silently falling through to infinite looping.

Files: src/orchestrations/execute_function_graph.rs, src/types.rs, src/dsl.rs, src/lib.rs
Tests: 3 pg_tests

pinodeca

Reviewed with Opus 4.8

Issues / discussion points

No test exercises the M7 runtime guard itself. All three tests cover serialization and DSL validation. The actual MAX_LOOP_ITERATIONS trip (the core behavior) is untested — triggering it needs 100k iterations, which is impractical in a pg_test. Consider making the limit injectable (e.g. a const overridable in tests, or a small helper that takes the max as a parameter) so a unit test can assert the failure fires at the boundary. As written, a regression in the counter math could pass CI silently.
Behavioral change for intentionally long-running loops. A daemon-style loop (e.g. poll-a-queue-forever with df.break) will now hard-fail after ~27h rather than run indefinitely. That's a deliberate safety tradeoff, but it ends the instance in failed state. Two things worth confirming with the author:
- Is failing (vs. gracefully completing with the last body result) the desired terminal state? The PR description says "fails with a clear error," so this seems intentional — but it's a semantic the USER_GUIDE should document.
- Should the limit be a GUC rather than a hard-coded const, given different deployments may want different ceilings?
Loops nested inside subtrees aren't covered by the guard. execute_function_graph.rs hardcodes loop_iteration: 0 and its custom input schema doesn't carry the counter, so a loop in a parallel/JOIN branch resets every generation. This appears to be pre-existing (subtree continue_as_new semantics already differ), not introduced here — but a one-line comment noting the guard only applies to top-level loops would prevent a false sense of safety.
Minor: error message off-by-one wording. The guard fires when next_iteration >= MAX_LOOP_ITERATIONS, so the body executes 100,000 times and the message says "exceeded maximum iteration count of 100000" — technically it's failing at the limit, not after exceeding it. Cosmetic; ignore if you prefer.

Verdict

Approve-with-nits. Nothing blocks merge functionally; the changes are safe and backward-compatible. I'd ask for (1) a boundary test for the counter and (2) a doc/comment on the new fail-after-27h behavior before merge.

pinodeca

LGTM

M7: Loops now track iteration count across continue_as_new generations via FunctionInput.loop_iteration. After 100,000 iterations (~27 hours at the minimum 1-second rate limit), the loop fails with a clear error directing users to df.break() or workflow restructuring. M8: If a LOOP node's condition config is unparseable JSON, the code now returns an error immediately instead of silently falling through to infinite looping.

iemejia mentioned this pull request Jun 18, 2026

fix: reliability hardening — pool isolation, client recovery, SSRF, loop safety #221

Closed

iemejia force-pushed the fix/loop-safety branch from 59fb32c to 55760f4 Compare June 18, 2026 16:18

pinodeca approved these changes Jun 18, 2026

View reviewed changes

iemejia force-pushed the fix/loop-safety branch from 55760f4 to fa9e580 Compare June 18, 2026 20:06

pinodeca force-pushed the fix/loop-safety branch from fa9e580 to b34aa85 Compare June 18, 2026 22:33

pinodeca approved these changes Jun 18, 2026

View reviewed changes

pinodeca force-pushed the fix/loop-safety branch from b34aa85 to 5fd5284 Compare June 18, 2026 22:49

pinodeca merged commit a26cab9 into microsoft:main Jun 18, 2026
5 checks passed

iemejia deleted the fix/loop-safety branch June 19, 2026 05:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: loop safety — max iteration guard and malformed config detection#254

fix: loop safety — max iteration guard and malformed config detection#254
pinodeca merged 1 commit into
microsoft:mainfrom
iemejia:fix/loop-safety

iemejia commented Jun 18, 2026

Uh oh!

pinodeca left a comment

Uh oh!

pinodeca left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iemejia commented Jun 18, 2026

Uh oh!

pinodeca left a comment

Choose a reason for hiding this comment

Issues / discussion points

Verdict

Uh oh!

pinodeca left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants