Skip to content

Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision#228

Open
Copilot wants to merge 2 commits into
mainfrom
copilot/fix-loop-restart-issue
Open

Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision#228
Copilot wants to merge 2 commits into
mainfrom
copilot/fix-loop-restart-issue

Conversation

Copilot AI commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

Two related defects in how df.loop drives continue_as_new:

  1. Non-root loops restarted from the graph root. df.loop called continue_as_new inline in the main orchestration, so every new generation restarted from graph.root_node_id — re-executing prefix nodes on every iteration. For prefix ~> df.loop(body), the prefix ran once per iteration instead of once per instance.

  2. Loops that spawned sub-orchestrations hung forever across generations. Once loops run as child orchestrations (see below), a loop body that itself spawns sub-orchestrations — a nested df.loop, or a JOIN/RACE branch — would stall. Duroxide's auto-generated child instance IDs ({parent}::sub::{event_id}) reset their event counter on each continue_as_new generation, so every loop generation re-derived the same child ID and collided with the previous (now terminal) child. The collided child was acked without processing and the loop never made progress.

Approach

Loops run as child sub-orchestrations

Each df.loop() node spawns a dedicated child sub-orchestration (execute_loop, registered as pg_durable::orchestration::execute-loop). The child handles all iterations via continue_as_new; when the loop exits it returns a SubtreeEnvelope to the parent. The parent orchestration awaits the child and continues with any suffix nodes — so prefix nodes run exactly once.

This is made possible by duroxide PR #31, which preserves the parent link across continue_as_new generations in a child orchestration. The duroxide dependency is switched to a git dependency on the pinodeca/continue-parent-link branch until that PR is merged and released.

// execute_loop_node — spawns child sub-orchestration with an explicit, generation-qualified id
let child_id = child_instance_id(ctx, "loop", node_id);
let raw = ctx.schedule_sub_orchestration_with_id(LOOP_NAME, child_id, loop_input).await?;
parse_subtree_envelope(&raw, "LOOP", results)

Security is preserved: execute_loop calls load_function_graph at the start of every generation (including after continue_as_new), re-validating submitted_by from the database and catching cross-iteration tampering exactly as the parent execute() does.

Generation-qualified child instance IDs

To fix the cross-generation collision, all sub-orchestration spawn sites (loop, join, join3 extras, race) now use schedule_sub_orchestration_with_id with a deterministic, generation-qualified child ID built by a new child_instance_id() helper:

{instance_id}::e{execution_id}::{tag}::{node_id}

execution_id increments on each continue_as_new, so a child spawned in generation N never collides with the terminal child from generation N-1, while the ID stays deterministic across replays of the same generation.

Changes

  • Cargo.toml / Cargo.lock: switches duroxide to a git dependency on microsoft/duroxide branch pinodeca/continue-parent-link; adds [patch.crates-io] so duroxide-pg's transitive dependency resolves to the same version. Revert to crates.io pins once PR [DO NOT MERGE] Add native Duroxide provider, stop using duroxide-pg-opt #31 is merged and released.

  • src/orchestrations/execute_function_graph.rs

    • execute_loop (new public sub-orchestration): runs iterations via continue_as_new; re-calls load_function_graph per generation for security; returns a SubtreeEnvelope on exit (a break inside the body is the loop's own terminator, so the loop always exits with Normal control)
    • execute_loop_node: spawns execute_loop and merges results back via parse_subtree_envelope
    • child_instance_id (new helper) + schedule_sub_orchestration_with_id at the loop, join, and race spawn sites — fixes the cross-generation child ID collision
  • src/registry.rs: registers LOOP_NAME alongside NAME and SUBTREE_NAME

  • USER_GUIDE.md: notes that each df.loop() now runs as its own child orchestration (prefix runs once, suffix runs after the loop exits, and df.instances shows the parent plus the loop's child instance)

  • tests/e2e/sql/24_nonroot_loop.sql: regression tests covering:

    1. prefix ~> loop — prefix runs once, body runs N times
    2. prefix ~> loop ~> suffix — prefix and suffix each run once
    3. Named result from a prefix node accessible inside the loop body across generations
    4. Nested loop (df.loop(... ~> df.loop(...) ...)) — exercises the cross-generation child ID fix
    5. Loop inside a JOIN branch — exercises the same fix for parallel branches
    6. Non-root while-loopprefix ~> loop(body, while-condition) ~> suffix exits via the while condition

Upgrade & Migration

No extension schema changes and no upgrade script changes. The new .so reads the same df schema as before; loop execution is entirely a runtime/orchestration change. The only dependency change is the temporary duroxide git pin noted above.

Copilot AI changed the title [WIP] Fix df.loop continue_as_new restart behavior Fix df.loop continue_as_new restarting from graph root when loop is not root Jun 11, 2026
Copilot AI requested a review from pinodeca June 11, 2026 23:29
@pinodeca pinodeca marked this pull request as ready for review June 11, 2026 23:40

@pinodeca pinodeca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding your statement:

A sub-orchestration approach was evaluated but is blocked by a duroxide 0.1.29
limitation: ContinueAsNew drops the parent link, so the parent orchestration never
receives the child's completion.

I created this PR in duroxide to unblock that approach:
microsoft/duroxide#31

Therefore, re-evaluate and:

  • if the approach is still blocked, explain why and what would have to change to unblock it.
  • if the approach is unblocked, re-implement using the sub-orchestration approach and when you're done update the PR description accordingly.

Copilot AI commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Regarding your statement:

A sub-orchestration approach was evaluated but is blocked by a duroxide 0.1.29...

The approach is still blocked. Here's the current state:

microsoft/duroxide#31 status: The PR was opened today (2026-06-12) and is still open — not yet merged or released.

Latest crates.io release: duroxide 0.1.29 (published 2026-05-09), which predates PR 31 entirely. There is no released version that includes the parent-link fix.

What would need to change to unblock it:

  1. duroxide PR 31 gets merged
  2. A new duroxide version is published to crates.io with the parent-link fix
  3. duroxide-pg is updated to depend on the new duroxide version and a new duroxide-pg version is published
  4. pg_durable's Cargo.toml is updated to the new duroxide + duroxide-pg pair
  5. The sentinel approach in this PR is replaced with a proper sub-orchestration: execute_loop_node spawns a child orchestration for each iteration and continue_as_news inside the child; the parent awaits the child and loops without ever restarting from the graph root

The sentinel approach in the current PR remains the correct implementation until that dependency chain is satisfied. I'll re-evaluate as soon as the new duroxide release lands on crates.io.

Regarding your statement:

A sub-orchestration approach was evaluated but is blocked by a duroxide 0.1.29...

Re-implemented using the sub-orchestration approach, now unblocked by your duroxide PR #31 branch.

What changed (commit Re-implement df.loop via sub-orchestration using duroxide PR #31 git branch):

  • Cargo.toml: duroxide switched from a crates.io pin to a git dependency on microsoft/duroxide branch pinodeca/continue-parent-link; [patch.crates-io] added so duroxide-pg's transitive dependency resolves to the same version. duroxide-pg = "=0.1.34" is unchanged — PR [DO NOT MERGE] Add native Duroxide provider, stop using duroxide-pg-opt #31 is a runtime-only change with no public API delta, confirmed by a clean build and test run.
  • execute_loop (new public sub-orchestration, LOOP_NAME): runs iterations via continue_as_new within the child; re-calls load_function_graph at the start of every generation so cross-iteration security tampering is still caught (test 17 passes).
  • execute_loop_node (parent): just spawns the child via schedule_sub_orchestration and merges results back via parse_subtree_envelope — no sentinel signals, no continue_as_new in the parent.
  • All sentinel machinery (LOOP_CONTINUE_SENTINEL, helpers, propagation in execute_then_node, LoopContinuation in types.rs) removed.

Once PR #31 is merged and released to crates.io, the git/patch entries should be reverted to crates.io version pins as a compatible pair.

@pinodeca

Copy link
Copy Markdown
Contributor

@copilot I want you to modify Cargo.toml to get duroxide as a git dependency on the source branch of microsoft/duroxide#31 - and then use the sub-orchestration approach to redesign/reimplement this PR (228).

You can mark this PR as "Do Not Merge Until microsoft/duroxide#31 is merged and a new duroxide version/crate is shipped"

If you find that duroxide PR 31 is not sufficient, stop yor work and report back what else needs to change in duroxide (or duroxide-pg) to unblock this implementation.

pg_durable should not have to implement tricky logic to work around minor duroxide feature gaps or bugs.

Copilot AI commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@copilot I want you to modify Cargo.toml to get duroxide as a git dependency on the source branch of microsoft/duroxide#31 - and then use the sub-orchestration approach to redesign/reimplement this PR...

Done. The sub-orchestration approach is implemented across these commits:

  • Cargo.toml: duroxide is a git dependency on microsoft/duroxide branch pinodeca/continue-parent-link; [patch.crates-io] overrides the transitive dependency from duroxide-pg. Revert both to crates.io pins once PR [DO NOT MERGE] Add native Duroxide provider, stop using duroxide-pg-opt #31 is merged and released.
  • execute_loop (new LOOP_NAME sub-orchestration): all iterations run via continue_as_new inside the child; load_function_graph is re-called at every generation for security.
  • execute_loop_node: just spawns the child via schedule_sub_orchestration and merges results via parse_subtree_envelope — no sentinel signals, no continue_as_new in the parent.
  • All sentinel machinery (LoopContinuation, loop_continuation, LOOP_CONTINUE_SENTINEL) removed.
  • LOOP_NAME registered in registry.rs.
  • Regression tests in tests/e2e/sql/24_nonroot_loop.sql covering three non-root loop shapes.

⛔ Do Not Merge until microsoft/duroxide#31 is merged and a new duroxide + duroxide-pg pair is published to crates.io — then revert the git/patch entries back to crates.io version pins.

…s_new

df.loop called continue_as_new inline in the main orchestration, so every
new generation restarted from graph.root_node_id, re-executing prefix nodes
on every iteration. Each df.loop() node now spawns a dedicated child
sub-orchestration (execute_loop) that owns continue_as_new; the parent awaits
it and runs any suffix nodes exactly once.

Relies on duroxide PR #31 (parent link preserved across continue_as_new),
pulled in as a git dependency until merged and released.

Co-authored-by: copilot-swe-agent <copilot@github.com>
Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
@pinodeca pinodeca force-pushed the copilot/fix-loop-restart-issue branch from a593965 to 754f1f2 Compare June 12, 2026 21:06
Nested loops and loop bodies containing JOIN/RACE branches hung forever
because duroxide auto-generated child instance IDs ({parent}::sub::{event_id})
reset their event counter on each continue_as_new generation, so every loop
generation re-derived the same child ID and collided with the previous
(now terminal) child.

Use schedule_sub_orchestration_with_id with a deterministic,
generation-qualified child ID ({instance}::e{execution_id}::{tag}::{node_id})
at all loop/join/race spawn sites.

Add nested-loop, loop-in-JOIN-branch, and non-root while-loop regression
tests, and a USER_GUIDE note that each df.loop runs as its own child
orchestration.
@pinodeca pinodeca changed the title Fix df.loop continue_as_new restarting from graph root when loop is not root Fix df.loop: run loops as child sub-orchestrations and fix cross-generation child ID collision Jun 12, 2026
@pinodeca

pinodeca commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

I checked out the PR branch in a codespace, rebased on main, reviewed with Opus 4.8, added 1 commit to fix the issues found (see copy of commit message below), and updated the title and description of this PR.

We'll continue to hold the PR now until microsoft/duroxide#31 merges.

Fix sub-orchestration child ID collision across loop generations

Nested loops and loop bodies containing JOIN/RACE branches hung forever
because duroxide auto-generated child instance IDs ({parent}::sub::{event_id})
reset their event counter on each continue_as_new generation, so every loop
generation re-derived the same child ID and collided with the previous
(now terminal) child.

Use schedule_sub_orchestration_with_id with a deterministic,
generation-qualified child ID ({instance}::e{execution_id}::{tag}::{node_id})
at all loop/join/race spawn sites.

Add nested-loop, loop-in-JOIN-branch, and non-root while-loop regression
tests, and a USER_GUIDE note that each df.loop runs as its own child
orchestration.

@pinodeca

Copy link
Copy Markdown
Contributor

If microsoft/duroxide#33 merges and is part of the next release, this PR can revert back to using schedule_sub_orchestration rather than schedule_sub_orchestration_with_id thus avoiding managing an orchestration ID scheme.

Also, note that this PR should solve #230 and #233 (along with #227, which it initially targeted). Let's validate that by adding regression tests for the scenarios described in 230 and 233.

@pinodeca pinodeca mentioned this pull request Jun 17, 2026
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants