Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision#228
Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision#228Copilot wants to merge 2 commits into
Conversation
pinodeca
left a comment
There was a problem hiding this comment.
Regarding your statement:
A sub-orchestration approach was evaluated but is blocked by a duroxide 0.1.29
limitation: ContinueAsNew drops the parent link, so the parent orchestration never
receives the child's completion.
I created this PR in duroxide to unblock that approach:
microsoft/duroxide#31
Therefore, re-evaluate and:
- if the approach is still blocked, explain why and what would have to change to unblock it.
- if the approach is unblocked, re-implement using the sub-orchestration approach and when you're done update the PR description accordingly.
The approach is still blocked. Here's the current state: microsoft/duroxide#31 status: The PR was opened today (2026-06-12) and is still open — not yet merged or released. Latest crates.io release: What would need to change to unblock it:
The sentinel approach in the current PR remains the correct implementation until that dependency chain is satisfied. I'll re-evaluate as soon as the new
Re-implemented using the sub-orchestration approach, now unblocked by your duroxide PR #31 branch. What changed (commit
Once PR #31 is merged and released to crates.io, the git/patch entries should be reverted to crates.io version pins as a compatible pair. |
|
@copilot I want you to modify Cargo.toml to get duroxide as a git dependency on the source branch of microsoft/duroxide#31 - and then use the sub-orchestration approach to redesign/reimplement this PR (228). You can mark this PR as "Do Not Merge Until microsoft/duroxide#31 is merged and a new duroxide version/crate is shipped" If you find that duroxide PR 31 is not sufficient, stop yor work and report back what else needs to change in duroxide (or duroxide-pg) to unblock this implementation. pg_durable should not have to implement tricky logic to work around minor duroxide feature gaps or bugs. |
Done. The sub-orchestration approach is implemented across these commits:
⛔ Do Not Merge until microsoft/duroxide#31 is merged and a new |
…s_new df.loop called continue_as_new inline in the main orchestration, so every new generation restarted from graph.root_node_id, re-executing prefix nodes on every iteration. Each df.loop() node now spawns a dedicated child sub-orchestration (execute_loop) that owns continue_as_new; the parent awaits it and runs any suffix nodes exactly once. Relies on duroxide PR #31 (parent link preserved across continue_as_new), pulled in as a git dependency until merged and released. Co-authored-by: copilot-swe-agent <copilot@github.com> Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
a593965 to
754f1f2
Compare
Nested loops and loop bodies containing JOIN/RACE branches hung forever
because duroxide auto-generated child instance IDs ({parent}::sub::{event_id})
reset their event counter on each continue_as_new generation, so every loop
generation re-derived the same child ID and collided with the previous
(now terminal) child.
Use schedule_sub_orchestration_with_id with a deterministic,
generation-qualified child ID ({instance}::e{execution_id}::{tag}::{node_id})
at all loop/join/race spawn sites.
Add nested-loop, loop-in-JOIN-branch, and non-root while-loop regression
tests, and a USER_GUIDE note that each df.loop runs as its own child
orchestration.
|
I checked out the PR branch in a codespace, rebased on main, reviewed with Opus 4.8, added 1 commit to fix the issues found (see copy of commit message below), and updated the title and description of this PR. We'll continue to hold the PR now until microsoft/duroxide#31 merges. |
|
If microsoft/duroxide#33 merges and is part of the next release, this PR can revert back to using Also, note that this PR should solve #230 and #233 (along with #227, which it initially targeted). Let's validate that by adding regression tests for the scenarios described in 230 and 233. |
Problem
Two related defects in how
df.loopdrivescontinue_as_new:Non-root loops restarted from the graph root.
df.loopcalledcontinue_as_newinline in the main orchestration, so every new generation restarted fromgraph.root_node_id— re-executing prefix nodes on every iteration. Forprefix ~> df.loop(body), the prefix ran once per iteration instead of once per instance.Loops that spawned sub-orchestrations hung forever across generations. Once loops run as child orchestrations (see below), a loop body that itself spawns sub-orchestrations — a nested
df.loop, or aJOIN/RACEbranch — would stall. Duroxide's auto-generated child instance IDs ({parent}::sub::{event_id}) reset their event counter on eachcontinue_as_newgeneration, so every loop generation re-derived the same child ID and collided with the previous (now terminal) child. The collided child was acked without processing and the loop never made progress.Approach
Loops run as child sub-orchestrations
Each
df.loop()node spawns a dedicated child sub-orchestration (execute_loop, registered aspg_durable::orchestration::execute-loop). The child handles all iterations viacontinue_as_new; when the loop exits it returns aSubtreeEnvelopeto the parent. The parent orchestration awaits the child and continues with any suffix nodes — so prefix nodes run exactly once.This is made possible by duroxide PR #31, which preserves the parent link across
continue_as_newgenerations in a child orchestration. Theduroxidedependency is switched to a git dependency on thepinodeca/continue-parent-linkbranch until that PR is merged and released.Security is preserved:
execute_loopcallsload_function_graphat the start of every generation (including aftercontinue_as_new), re-validatingsubmitted_byfrom the database and catching cross-iteration tampering exactly as the parentexecute()does.Generation-qualified child instance IDs
To fix the cross-generation collision, all sub-orchestration spawn sites (loop, join, join3 extras, race) now use
schedule_sub_orchestration_with_idwith a deterministic, generation-qualified child ID built by a newchild_instance_id()helper:execution_idincrements on eachcontinue_as_new, so a child spawned in generation N never collides with the terminal child from generation N-1, while the ID stays deterministic across replays of the same generation.Changes
Cargo.toml/Cargo.lock: switchesduroxideto a git dependency onmicrosoft/duroxidebranchpinodeca/continue-parent-link; adds[patch.crates-io]soduroxide-pg's transitive dependency resolves to the same version. Revert to crates.io pins once PR [DO NOT MERGE] Add native Duroxide provider, stop using duroxide-pg-opt #31 is merged and released.src/orchestrations/execute_function_graph.rsexecute_loop(new public sub-orchestration): runs iterations viacontinue_as_new; re-callsload_function_graphper generation for security; returns aSubtreeEnvelopeon exit (a break inside the body is the loop's own terminator, so the loop always exits withNormalcontrol)execute_loop_node: spawnsexecute_loopand merges results back viaparse_subtree_envelopechild_instance_id(new helper) +schedule_sub_orchestration_with_idat the loop, join, and race spawn sites — fixes the cross-generation child ID collisionsrc/registry.rs: registersLOOP_NAMEalongsideNAMEandSUBTREE_NAMEUSER_GUIDE.md: notes that eachdf.loop()now runs as its own child orchestration (prefix runs once, suffix runs after the loop exits, anddf.instancesshows the parent plus the loop's child instance)tests/e2e/sql/24_nonroot_loop.sql: regression tests covering:prefix ~> loop— prefix runs once, body runs N timesprefix ~> loop ~> suffix— prefix and suffix each run oncedf.loop(... ~> df.loop(...) ...)) — exercises the cross-generation child ID fixprefix ~> loop(body, while-condition) ~> suffixexits via the while conditionUpgrade & Migration
No extension schema changes and no upgrade script changes. The new
.soreads the samedfschema as before; loop execution is entirely a runtime/orchestration change. The only dependency change is the temporaryduroxidegit pin noted above.