Skip to content

Roadmap: thorough automated testing for the durable-function DSL #232

@pinodeca

Description

@pinodeca

Roadmap: thorough automated testing for the durable-function DSL

Background / motivation

Two recent bugs surfaced that hand-written E2E tests in tests/e2e/sql/ failed to catch despite being structurally simple:

  1. df.loop continue_as_new restarts from root when loop is not root #227 - the loop behaves as if it were the root regardless of where it sits.
  2. JOIN/RACE inside a loop stalls on iteration ≥2 (continue_as_new resets sub-orchestration id counter → child id collision) #230 - the sub-orchestration IDs generated inside the loop body collide with completed sub-orchestrations from the first iteration, so control never returns to the loop.

Both are nesting-of-combinators bugs at the runtime boundary (continue_as_new scope, sub-orchestration ID determinism) — see execute_loop_node and schedule_sub_orchestration usage. They share a profile: a correctly-shaped DSL program produces wrong runtime behaviour only when a specific combinator-inside-combinator pattern is exercised.

The DSL surface is small (~8 control-flow combinators: seq, if, if_rows, loop, break, join, join3, race, plus leaf nodes sql, sleep, wait_for_schedule, http), but the combinatorial surface of legal nestings is large and grows fast with depth. Hand-written E2E tests will not keep up.

This issue tracks a phased plan to move from purely hand-written E2E coverage to generative + invariant-driven testing of the DSL.

Non-goals

  • Replacing the existing tests/e2e/sql/*.sql suite. It stays as the source of truth for user-visible scenarios; new infra is additive.
  • Formal verification (TLA+/Coq/Lean). Listed as future direction only.
  • Fuzzing the SQL parser — pg_durable has no DSL parser; programs are JSON Durofut trees built by df.* functions and persisted in df.nodes. Grammar-based fuzzing (ANTLR, AFL grammar mode, Grammarinator) does not apply.

Proposed phases

Ranked by ROI (highest first). Phases 1–3 are the actionable proposal; 4–6 are follow-ons.

Phase 1 — Trace-oracle library (highest ROI, lowest cost)

The runtime already writes a structural trace: every node transitions through df.nodes.status via the update_node_status activity (execute_function_graph.rs#L237-L264). Turn that into a first-class post-run invariant checker any test (or production debugger) can call:

SELECT * FROM df.assert_structural_invariants(instance_id);

Initial invariants to encode (each one named, individually skippable, returns the offending node id on failure):

  • every_reachable_node_completed — every node reachable from the root via the if-taken-branch tree ends in completed, failed, cancelled, or skipped.
  • single_execution_outside_loop — any node not transitively under a loop body completes exactly once. (Would have caught bug Adding Microsoft SECURITY.MD #1.)
  • loop_body_iteration_count_matches — children of a loop body complete exactly iteration_count(loop) times. (Would have caught bug This repo is missing important files #2: the join/race body would show 1 completion while the loop reports 2 iterations.)
  • race_loser_terminal — for every race, exactly one branch completes; the other(s) are cancelled or never started.
  • join_all_branches_completed — for every successful join, every branch reached a terminal completed state.
  • node_status_monotonic — no node transitions backward from a terminal status to a non-terminal one within a single iteration window.
  • result_name_uniquenessresult_name resolutions are unique per scope and never read before written.

Deliverables: df.assert_structural_invariants(text) SQL function, one regression test per invariant, opt-in invocation at the end of every existing E2E test (one line per file).

Phase 2 — Combinator-nesting matrix generator (deterministic, exhaustive at low depth)

There are ~8 combinators. At depth ≤ 3 the universe of distinct shapes is in the low thousands — exhaustively enumerable. Build a small Rust generator that:

  1. Enumerates every nesting shape up to a configurable depth (default 3).
  2. Renders each shape to a SQL DSL expression with marker leaves that record (node_path, iteration, wall_clock) into a per-test trace table.
  3. Runs each generated program through the live extension via the existing E2E harness.
  4. Validates the trace against the Phase 1 oracles plus shape-specific expectations the generator knows by construction (e.g. "this race has 2 branches; trace must show exactly 1 branch completed").

This is exhaustive over the surface that historically breaks. The two recent bugs would have been caught at depth 2.

Deliverables: tests/e2e/generated/ directory, generator binary, CI job that runs the full matrix in <5 min, golden-trace snapshots checked into the repo for regression-style diffing.

Phase 3 — proptest random tree strategy (unbounded coverage + shrinking)

Once Phase 1+2 exist, add a recursive proptest::Strategy<FunctionGraph> with weighted depth and combinator-frequency knobs. Run thousands of random trees per CI run; assert with the Phase 1 oracles.

The key value-add over Phase 2 is shrinking: when a 17-node failing tree is found, proptest reduces it to the minimal failing shape automatically. This makes triaging future nesting bugs dramatically faster.

Deliverables: pg_durable-dsl-proptest test binary, persistent failure corpus checked into the repo, CI job with a fixed seed for reproducibility and a nightly job with a fresh seed for exploration.

Phase 4 — Metamorphic relations (cheap, complementary)

Equivalences specific to pg_durable semantics. Each relation is a generated equivalence-class test: build two programs the runtime should consider equivalent, run both, assert observable equivalence (side-effect set / returned value).

Candidates:

  • seq(a, seq(b, c)) ≡ seq(seq(a, b), c) (sequence associativity)
  • if('SELECT true', a, b) ≡ a; if('SELECT false', a, b) ≡ b
  • loop(body, 'SELECT false') single-iteration ≡ body (do-while)
  • join(a, b) and join(b, a) produce the same multiset of side effects (commutativity)
  • race(a, sleep(very_large))a when a terminates deterministically
  • loop(seq(a, break(...)))seq(a, ...) (loop with immediate break)

Deliverables: a small registry of (program_a, program_b, equivalence_predicate) triples, executed under Phase 2/3 infrastructure.

Phase 5 — Reference interpreter + differential testing (high value, real cost)

Write a synchronous, single-threaded tree-walking interpreter over FunctionGraph in plain Rust. SQL leaves are mocked to record (node_path, iteration) tuples; control combinators implement the intended semantics directly. Compare causally-ordered traces from this oracle to traces produced by the duroxide-backed runtime. Use it as the ground-truth check inside Phases 2 and 3.

This is the strongest single oracle but the biggest investment; it should follow Phase 6 (write down the semantics before implementing them twice).

Phase 6 — One-page operational-semantics document

Before Phase 5, document the intended semantics of each combinator under nesting (1–2 pages, prose + small-step rules). This is the artifact the recent bugs actually expose a gap in: there is currently no written contract saying "a loop body restarts only its own subgraph" or "sub-orchestration identity must be unique per iteration."

This doc becomes:

  • The acceptance contract for Phase 5's reference interpreter.
  • The spec to refer to when triaging future "is this a bug or by design?" questions.
  • The future input to formal models if Phases 1–5 prove insufficient.

Cross-cutting recommendations

  • Treat every reported bug as a generator seed, not just a regression test. Bug This repo is missing important files #2 is not "join inside loop" — it is the parametric family "any parallel/sub-orchestration construct inside any iterating construct." File regressions that way.
  • Distinguish DSL-semantic bugs from runtime-boundary bugs in tracking. The two recent ones are runtime-boundary: they would not have been caught by reading the DSL spec alone, only by exercising the duroxide integration. Phase 1's trace oracles target exactly this class.
  • Coverage-guide the generator with cargo-llvm-cov so new generated shapes demonstrably hit new branches in src/orchestrations/execute_function_graph.rs.

Explicitly considered and deferred

Idea Reason
Grammar-based fuzzing (ANTLR / Grammarinator / AFL grammar mode) No DSL parser exists; AST is JSON built programmatically.
cargo fuzz / libFuzzer / AFL++ on the binary Orchestration is async, Postgres-coupled, and stateful across processes; the harness cost dwarfs the bug-finding value at this stage.
Model checking (TLA+ / Alloy) Defer until Phase 6 exists. Worth revisiting once semantics stabilize and a class of concurrency bugs is found that Phase 5 still misses.
Formal proof (Coq / Lean) Out of scope for the foreseeable future.

Suggested execution order

  1. Phase 6 (semantics doc) — small, unblocks everything else.
  2. Phase 1 (trace oracles) — biggest immediate bug-catching power per line of code.
  3. Phase 2 (nesting matrix) — exhaustive at low depth, deterministic, CI-friendly.
  4. Phase 4 (metamorphic) — low-cost addition once Phase 2 infra is in place.
  5. Phase 3 (proptest) — unbounded coverage; runs nightly.
  6. Phase 5 (reference interpreter + differential) — strongest oracle, biggest cost; do after semantics are stable.

Each phase is independently shippable and provides incremental value; we do not need to commit to all six up front.

Acceptance criteria for closing this umbrella

Either:

  • All six phases shipped, or
  • A documented decision per phase to ship / defer / abandon, with the rationale captured in this repo's testing docs.

Related files

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions