You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Both are nesting-of-combinators bugs at the runtime boundary (continue_as_new scope, sub-orchestration ID determinism) — see execute_loop_node and schedule_sub_orchestration usage. They share a profile: a correctly-shaped DSL program produces wrong runtime behaviour only when a specific combinator-inside-combinator pattern is exercised.
The DSL surface is small (~8 control-flow combinators: seq, if, if_rows, loop, break, join, join3, race, plus leaf nodes sql, sleep, wait_for_schedule, http), but the combinatorial surface of legal nestings is large and grows fast with depth. Hand-written E2E tests will not keep up.
This issue tracks a phased plan to move from purely hand-written E2E coverage to generative + invariant-driven testing of the DSL.
Non-goals
Replacing the existing tests/e2e/sql/*.sql suite. It stays as the source of truth for user-visible scenarios; new infra is additive.
Formal verification (TLA+/Coq/Lean). Listed as future direction only.
Fuzzing the SQL parser — pg_durable has no DSL parser; programs are JSON Durofut trees built by df.* functions and persisted in df.nodes. Grammar-based fuzzing (ANTLR, AFL grammar mode, Grammarinator) does not apply.
Proposed phases
Ranked by ROI (highest first). Phases 1–3 are the actionable proposal; 4–6 are follow-ons.
The runtime already writes a structural trace: every node transitions through df.nodes.status via the update_node_status activity (execute_function_graph.rs#L237-L264). Turn that into a first-class post-run invariant checker any test (or production debugger) can call:
Initial invariants to encode (each one named, individually skippable, returns the offending node id on failure):
every_reachable_node_completed — every node reachable from the root via the if-taken-branch tree ends in completed, failed, cancelled, or skipped.
single_execution_outside_loop — any node not transitively under a loop body completes exactly once. (Would have caught bug Adding Microsoft SECURITY.MD #1.)
loop_body_iteration_count_matches — children of a loop body complete exactly iteration_count(loop) times. (Would have caught bug This repo is missing important files #2: the join/race body would show 1 completion while the loop reports 2 iterations.)
race_loser_terminal — for every race, exactly one branch completes; the other(s) are cancelled or never started.
join_all_branches_completed — for every successful join, every branch reached a terminal completed state.
node_status_monotonic — no node transitions backward from a terminal status to a non-terminal one within a single iteration window.
result_name_uniqueness — result_name resolutions are unique per scope and never read before written.
Deliverables: df.assert_structural_invariants(text) SQL function, one regression test per invariant, opt-in invocation at the end of every existing E2E test (one line per file).
There are ~8 combinators. At depth ≤ 3 the universe of distinct shapes is in the low thousands — exhaustively enumerable. Build a small Rust generator that:
Enumerates every nesting shape up to a configurable depth (default 3).
Renders each shape to a SQL DSL expression with marker leaves that record (node_path, iteration, wall_clock) into a per-test trace table.
Runs each generated program through the live extension via the existing E2E harness.
Validates the trace against the Phase 1 oracles plus shape-specific expectations the generator knows by construction (e.g. "this race has 2 branches; trace must show exactly 1 branch completed").
This is exhaustive over the surface that historically breaks. The two recent bugs would have been caught at depth 2.
Deliverables: tests/e2e/generated/ directory, generator binary, CI job that runs the full matrix in <5 min, golden-trace snapshots checked into the repo for regression-style diffing.
Phase 3 — proptest random tree strategy (unbounded coverage + shrinking)
Once Phase 1+2 exist, add a recursive proptest::Strategy<FunctionGraph> with weighted depth and combinator-frequency knobs. Run thousands of random trees per CI run; assert with the Phase 1 oracles.
The key value-add over Phase 2 is shrinking: when a 17-node failing tree is found, proptest reduces it to the minimal failing shape automatically. This makes triaging future nesting bugs dramatically faster.
Deliverables: pg_durable-dsl-proptest test binary, persistent failure corpus checked into the repo, CI job with a fixed seed for reproducibility and a nightly job with a fresh seed for exploration.
Equivalences specific to pg_durable semantics. Each relation is a generated equivalence-class test: build two programs the runtime should consider equivalent, run both, assert observable equivalence (side-effect set / returned value).
Candidates:
seq(a, seq(b, c)) ≡ seq(seq(a, b), c) (sequence associativity)
if('SELECT true', a, b) ≡ a; if('SELECT false', a, b) ≡ b
loop(body, 'SELECT false') single-iteration ≡ body (do-while)
join(a, b) and join(b, a) produce the same multiset of side effects (commutativity)
race(a, sleep(very_large)) ≡ a when a terminates deterministically
loop(seq(a, break(...))) ≡ seq(a, ...) (loop with immediate break)
Deliverables: a small registry of (program_a, program_b, equivalence_predicate) triples, executed under Phase 2/3 infrastructure.
Write a synchronous, single-threaded tree-walking interpreter over FunctionGraph in plain Rust. SQL leaves are mocked to record (node_path, iteration) tuples; control combinators implement the intended semantics directly. Compare causally-ordered traces from this oracle to traces produced by the duroxide-backed runtime. Use it as the ground-truth check inside Phases 2 and 3.
This is the strongest single oracle but the biggest investment; it should follow Phase 6 (write down the semantics before implementing them twice).
Phase 6 — One-page operational-semantics document
Before Phase 5, document the intended semantics of each combinator under nesting (1–2 pages, prose + small-step rules). This is the artifact the recent bugs actually expose a gap in: there is currently no written contract saying "a loop body restarts only its own subgraph" or "sub-orchestration identity must be unique per iteration."
This doc becomes:
The acceptance contract for Phase 5's reference interpreter.
The spec to refer to when triaging future "is this a bug or by design?" questions.
The future input to formal models if Phases 1–5 prove insufficient.
Cross-cutting recommendations
Treat every reported bug as a generator seed, not just a regression test. Bug This repo is missing important files #2 is not "join inside loop" — it is the parametric family "any parallel/sub-orchestration construct inside any iterating construct." File regressions that way.
Distinguish DSL-semantic bugs from runtime-boundary bugs in tracking. The two recent ones are runtime-boundary: they would not have been caught by reading the DSL spec alone, only by exercising the duroxide integration. Phase 1's trace oracles target exactly this class.
Roadmap: thorough automated testing for the durable-function DSL
Background / motivation
Two recent bugs surfaced that hand-written E2E tests in tests/e2e/sql/ failed to catch despite being structurally simple:
Both are nesting-of-combinators bugs at the runtime boundary (continue_as_new scope, sub-orchestration ID determinism) — see
execute_loop_nodeandschedule_sub_orchestrationusage. They share a profile: a correctly-shaped DSL program produces wrong runtime behaviour only when a specific combinator-inside-combinator pattern is exercised.The DSL surface is small (~8 control-flow combinators:
seq,if,if_rows,loop,break,join,join3,race, plus leaf nodessql,sleep,wait_for_schedule,http), but the combinatorial surface of legal nestings is large and grows fast with depth. Hand-written E2E tests will not keep up.This issue tracks a phased plan to move from purely hand-written E2E coverage to generative + invariant-driven testing of the DSL.
Non-goals
tests/e2e/sql/*.sqlsuite. It stays as the source of truth for user-visible scenarios; new infra is additive.Durofuttrees built bydf.*functions and persisted indf.nodes. Grammar-based fuzzing (ANTLR, AFL grammar mode, Grammarinator) does not apply.Proposed phases
Ranked by ROI (highest first). Phases 1–3 are the actionable proposal; 4–6 are follow-ons.
Phase 1 — Trace-oracle library (highest ROI, lowest cost)
The runtime already writes a structural trace: every node transitions through
df.nodes.statusvia theupdate_node_statusactivity (execute_function_graph.rs#L237-L264). Turn that into a first-class post-run invariant checker any test (or production debugger) can call:Initial invariants to encode (each one named, individually skippable, returns the offending node id on failure):
every_reachable_node_completed— every node reachable from the root via theif-taken-branch tree ends incompleted,failed,cancelled, orskipped.single_execution_outside_loop— any node not transitively under aloopbody completes exactly once. (Would have caught bug Adding Microsoft SECURITY.MD #1.)loop_body_iteration_count_matches— children of aloopbody complete exactlyiteration_count(loop)times. (Would have caught bug This repo is missing important files #2: the join/race body would show 1 completion while the loop reports 2 iterations.)race_loser_terminal— for everyrace, exactly one branch completes; the other(s) arecancelledor never started.join_all_branches_completed— for every successfuljoin, every branch reached a terminalcompletedstate.node_status_monotonic— no node transitions backward from a terminal status to a non-terminal one within a single iteration window.result_name_uniqueness—result_nameresolutions are unique per scope and never read before written.Deliverables:
df.assert_structural_invariants(text)SQL function, one regression test per invariant, opt-in invocation at the end of every existing E2E test (one line per file).Phase 2 — Combinator-nesting matrix generator (deterministic, exhaustive at low depth)
There are ~8 combinators. At depth ≤ 3 the universe of distinct shapes is in the low thousands — exhaustively enumerable. Build a small Rust generator that:
(node_path, iteration, wall_clock)into a per-test trace table.This is exhaustive over the surface that historically breaks. The two recent bugs would have been caught at depth 2.
Deliverables:
tests/e2e/generated/directory, generator binary, CI job that runs the full matrix in <5 min, golden-trace snapshots checked into the repo for regression-style diffing.Phase 3 —
proptestrandom tree strategy (unbounded coverage + shrinking)Once Phase 1+2 exist, add a recursive
proptest::Strategy<FunctionGraph>with weighted depth and combinator-frequency knobs. Run thousands of random trees per CI run; assert with the Phase 1 oracles.The key value-add over Phase 2 is shrinking: when a 17-node failing tree is found, proptest reduces it to the minimal failing shape automatically. This makes triaging future nesting bugs dramatically faster.
Deliverables:
pg_durable-dsl-proptesttest binary, persistent failure corpus checked into the repo, CI job with a fixed seed for reproducibility and a nightly job with a fresh seed for exploration.Phase 4 — Metamorphic relations (cheap, complementary)
Equivalences specific to pg_durable semantics. Each relation is a generated equivalence-class test: build two programs the runtime should consider equivalent, run both, assert observable equivalence (side-effect set / returned value).
Candidates:
seq(a, seq(b, c)) ≡ seq(seq(a, b), c)(sequence associativity)if('SELECT true', a, b) ≡ a;if('SELECT false', a, b) ≡ bloop(body, 'SELECT false')single-iteration ≡body(do-while)join(a, b)andjoin(b, a)produce the same multiset of side effects (commutativity)race(a, sleep(very_large))≡awhenaterminates deterministicallyloop(seq(a, break(...)))≡seq(a, ...)(loop with immediate break)Deliverables: a small registry of
(program_a, program_b, equivalence_predicate)triples, executed under Phase 2/3 infrastructure.Phase 5 — Reference interpreter + differential testing (high value, real cost)
Write a synchronous, single-threaded tree-walking interpreter over
FunctionGraphin plain Rust. SQL leaves are mocked to record(node_path, iteration)tuples; control combinators implement the intended semantics directly. Compare causally-ordered traces from this oracle to traces produced by the duroxide-backed runtime. Use it as the ground-truth check inside Phases 2 and 3.This is the strongest single oracle but the biggest investment; it should follow Phase 6 (write down the semantics before implementing them twice).
Phase 6 — One-page operational-semantics document
Before Phase 5, document the intended semantics of each combinator under nesting (1–2 pages, prose + small-step rules). This is the artifact the recent bugs actually expose a gap in: there is currently no written contract saying "a
loopbody restarts only its own subgraph" or "sub-orchestration identity must be unique per iteration."This doc becomes:
Cross-cutting recommendations
cargo-llvm-covso new generated shapes demonstrably hit new branches in src/orchestrations/execute_function_graph.rs.Explicitly considered and deferred
cargo fuzz/ libFuzzer / AFL++ on the binarySuggested execution order
Each phase is independently shippable and provides incremental value; we do not need to commit to all six up front.
Acceptance criteria for closing this umbrella
Either:
Related files