fix: loop safety — max iteration guard and malformed config detection#254
Conversation
pinodeca
left a comment
There was a problem hiding this comment.
Reviewed with Opus 4.8
Issues / discussion points
-
No test exercises the M7 runtime guard itself. All three tests cover serialization and DSL validation. The actual
MAX_LOOP_ITERATIONStrip (the core behavior) is untested — triggering it needs 100k iterations, which is impractical in apg_test. Consider making the limit injectable (e.g. aconstoverridable in tests, or a small helper that takes the max as a parameter) so a unit test can assert the failure fires at the boundary. As written, a regression in the counter math could pass CI silently. -
Behavioral change for intentionally long-running loops. A daemon-style loop (e.g. poll-a-queue-forever with
df.break) will now hard-fail after ~27h rather than run indefinitely. That's a deliberate safety tradeoff, but it ends the instance infailedstate. Two things worth confirming with the author:- Is failing (vs. gracefully completing with the last body result) the desired terminal state? The PR description says "fails with a clear error," so this seems intentional — but it's a semantic the USER_GUIDE should document.
- Should the limit be a GUC rather than a hard-coded
const, given different deployments may want different ceilings?
-
Loops nested inside subtrees aren't covered by the guard. execute_function_graph.rs hardcodes
loop_iteration: 0and its custom input schema doesn't carry the counter, so a loop in a parallel/JOIN branch resets every generation. This appears to be pre-existing (subtreecontinue_as_newsemantics already differ), not introduced here — but a one-line comment noting the guard only applies to top-level loops would prevent a false sense of safety. -
Minor: error message off-by-one wording. The guard fires when
next_iteration >= MAX_LOOP_ITERATIONS, so the body executes 100,000 times and the message says "exceeded maximum iteration count of 100000" — technically it's failing at the limit, not after exceeding it. Cosmetic; ignore if you prefer.
Verdict
Approve-with-nits. Nothing blocks merge functionally; the changes are safe and backward-compatible. I'd ask for (1) a boundary test for the counter and (2) a doc/comment on the new fail-after-27h behavior before merge.
fa9e580 to
b34aa85
Compare
M7: Loops now track iteration count across continue_as_new generations via FunctionInput.loop_iteration. After 100,000 iterations (~27 hours at the minimum 1-second rate limit), the loop fails with a clear error directing users to df.break() or workflow restructuring. M8: If a LOOP node's condition config is unparseable JSON, the code now returns an error immediately instead of silently falling through to infinite looping.
b34aa85 to
5fd5284
Compare
Split from #221 to ease review.
M7: Loops now track iteration count across
continue_as_newgenerations viaFunctionInput.loop_iteration. After 100,000 iterations (~27 hours at the minimum 1-second rate limit), the loop fails with a clear error directing users todf.break()or workflow restructuring.M8: If a LOOP node's condition config is unparseable JSON, the code now returns an error immediately instead of silently falling through to infinite looping.
Files:
src/orchestrations/execute_function_graph.rs,src/types.rs,src/dsl.rs,src/lib.rsTests: 3 pg_tests