Roadmap: thorough automated testing for the durable-function DSL

# Roadmap: thorough automated testing for the durable-function DSL

## Background / motivation

Two recent bugs surfaced that hand-written E2E tests in [tests/e2e/sql/](tests/e2e/sql/) failed to catch despite being structurally simple:

1. #227 - the loop behaves as if it were the root regardless of where it sits. 
2. #230 - the sub-orchestration IDs generated inside the loop body collide with completed sub-orchestrations from the first iteration, so control never returns to the loop.

Both are *nesting-of-combinators* bugs at the runtime boundary (continue_as_new scope, sub-orchestration ID determinism) — see [`execute_loop_node`](src/orchestrations/execute_function_graph.rs#L465) and [`schedule_sub_orchestration` usage](src/orchestrations/execute_function_graph.rs#L762). They share a profile: a correctly-shaped DSL program produces wrong runtime behaviour only when a specific combinator-inside-combinator pattern is exercised.

The DSL surface is small (~8 control-flow combinators: `seq`, `if`, `if_rows`, `loop`, `break`, `join`, `join3`, `race`, plus leaf nodes `sql`, `sleep`, `wait_for_schedule`, `http`), but the *combinatorial* surface of legal nestings is large and grows fast with depth. Hand-written E2E tests will not keep up.

This issue tracks a phased plan to move from purely hand-written E2E coverage to **generative + invariant-driven** testing of the DSL.

## Non-goals

- Replacing the existing `tests/e2e/sql/*.sql` suite. It stays as the source of truth for user-visible scenarios; new infra is additive.
- Formal verification (TLA+/Coq/Lean). Listed as future direction only.
- Fuzzing the SQL *parser* — pg_durable has no DSL parser; programs are JSON `Durofut` trees built by `df.*` functions and persisted in `df.nodes`. Grammar-based fuzzing (ANTLR, AFL grammar mode, Grammarinator) does not apply.

## Proposed phases

Ranked by ROI (highest first). Phases 1–3 are the actionable proposal; 4–6 are follow-ons.

### Phase 1 — Trace-oracle library (highest ROI, lowest cost)

The runtime already writes a structural trace: every node transitions through `df.nodes.status` via the `update_node_status` activity ([execute_function_graph.rs#L237-L264](src/orchestrations/execute_function_graph.rs#L237-L264)). Turn that into a first-class **post-run invariant checker** any test (or production debugger) can call:

```sql
SELECT * FROM df.assert_structural_invariants(instance_id);
```

Initial invariants to encode (each one named, individually skippable, returns the offending node id on failure):

- `every_reachable_node_completed` — every node reachable from the root via the `if`-taken-branch tree ends in `completed`, `failed`, `cancelled`, or `skipped`.
- `single_execution_outside_loop` — any node not transitively under a `loop` body completes **exactly once**. *(Would have caught bug #1.)*
- `loop_body_iteration_count_matches` — children of a `loop` body complete exactly `iteration_count(loop)` times. *(Would have caught bug #2: the join/race body would show 1 completion while the loop reports 2 iterations.)*
- `race_loser_terminal` — for every `race`, exactly one branch completes; the other(s) are `cancelled` or never started.
- `join_all_branches_completed` — for every successful `join`, every branch reached a terminal `completed` state.
- `node_status_monotonic` — no node transitions backward from a terminal status to a non-terminal one within a single iteration window.
- `result_name_uniqueness` — `result_name` resolutions are unique per scope and never read before written.

**Deliverables**: `df.assert_structural_invariants(text)` SQL function, one regression test per invariant, opt-in invocation at the end of every existing E2E test (one line per file).

### Phase 2 — Combinator-nesting matrix generator (deterministic, exhaustive at low depth)

There are ~8 combinators. At depth ≤ 3 the universe of distinct shapes is in the low thousands — exhaustively enumerable. Build a small Rust generator that:

1. Enumerates every nesting shape up to a configurable depth (default 3).
2. Renders each shape to a SQL DSL expression with **marker leaves** that record `(node_path, iteration, wall_clock)` into a per-test trace table.
3. Runs each generated program through the live extension via the existing E2E harness.
4. Validates the trace against the Phase 1 oracles **plus** shape-specific expectations the generator knows by construction (e.g. "this race has 2 branches; trace must show exactly 1 branch completed").

This is exhaustive over the surface that historically breaks. The two recent bugs would have been caught at depth 2.

**Deliverables**: `tests/e2e/generated/` directory, generator binary, CI job that runs the full matrix in <5 min, golden-trace snapshots checked into the repo for regression-style diffing.

### Phase 3 — `proptest` random tree strategy (unbounded coverage + shrinking)

Once Phase 1+2 exist, add a recursive `proptest::Strategy<FunctionGraph>` with weighted depth and combinator-frequency knobs. Run thousands of random trees per CI run; assert with the Phase 1 oracles.

The key value-add over Phase 2 is **shrinking**: when a 17-node failing tree is found, proptest reduces it to the minimal failing shape automatically. This makes triaging future nesting bugs dramatically faster.

**Deliverables**: `pg_durable-dsl-proptest` test binary, persistent failure corpus checked into the repo, CI job with a fixed seed for reproducibility and a nightly job with a fresh seed for exploration.

### Phase 4 — Metamorphic relations (cheap, complementary)

Equivalences specific to pg_durable semantics. Each relation is a generated equivalence-class test: build two programs the runtime should consider equivalent, run both, assert observable equivalence (side-effect set / returned value).

Candidates:

- `seq(a, seq(b, c)) ≡ seq(seq(a, b), c)` (sequence associativity)
- `if('SELECT true', a, b) ≡ a`; `if('SELECT false', a, b) ≡ b`
- `loop(body, 'SELECT false')` single-iteration ≡ `body` (do-while)
- `join(a, b)` and `join(b, a)` produce the same multiset of side effects (commutativity)
- `race(a, sleep(very_large))` ≡ `a` when `a` terminates deterministically
- `loop(seq(a, break(...)))` ≡ `seq(a, ...)` (loop with immediate break)

**Deliverables**: a small registry of `(program_a, program_b, equivalence_predicate)` triples, executed under Phase 2/3 infrastructure.

### Phase 5 — Reference interpreter + differential testing (high value, real cost)

Write a synchronous, single-threaded tree-walking interpreter over `FunctionGraph` in plain Rust. SQL leaves are mocked to record `(node_path, iteration)` tuples; control combinators implement the *intended* semantics directly. Compare causally-ordered traces from this oracle to traces produced by the duroxide-backed runtime. Use it as the ground-truth check inside Phases 2 and 3.

This is the strongest single oracle but the biggest investment; it should follow Phase 6 (write down the semantics before implementing them twice).

### Phase 6 — One-page operational-semantics document

Before Phase 5, document the intended semantics of each combinator under nesting (1–2 pages, prose + small-step rules). This is the artifact the recent bugs *actually* expose a gap in: there is currently no written contract saying "a `loop` body restarts only its own subgraph" or "sub-orchestration identity must be unique per iteration."

This doc becomes:
- The acceptance contract for Phase 5's reference interpreter.
- The spec to refer to when triaging future "is this a bug or by design?" questions.
- The future input to formal models if Phases 1–5 prove insufficient.

## Cross-cutting recommendations

- **Treat every reported bug as a generator seed**, not just a regression test. Bug #2 is not "join inside loop" — it is the parametric family "any parallel/sub-orchestration construct inside any iterating construct." File regressions that way.
- **Distinguish DSL-semantic bugs from runtime-boundary bugs** in tracking. The two recent ones are runtime-boundary: they would not have been caught by reading the DSL spec alone, only by exercising the duroxide integration. Phase 1's trace oracles target exactly this class.
- **Coverage-guide the generator** with `cargo-llvm-cov` so new generated shapes demonstrably hit new branches in [src/orchestrations/execute_function_graph.rs](src/orchestrations/execute_function_graph.rs).

## Explicitly considered and deferred

| Idea | Reason |
|---|---|
| Grammar-based fuzzing (ANTLR / Grammarinator / AFL grammar mode) | No DSL parser exists; AST is JSON built programmatically. |
| `cargo fuzz` / libFuzzer / AFL++ on the binary | Orchestration is async, Postgres-coupled, and stateful across processes; the harness cost dwarfs the bug-finding value at this stage. |
| Model checking (TLA+ / Alloy) | Defer until Phase 6 exists. Worth revisiting once semantics stabilize and a class of concurrency bugs is found that Phase 5 still misses. |
| Formal proof (Coq / Lean) | Out of scope for the foreseeable future. |

## Suggested execution order

1. Phase 6 (semantics doc) — small, unblocks everything else.
2. Phase 1 (trace oracles) — biggest immediate bug-catching power per line of code.
3. Phase 2 (nesting matrix) — exhaustive at low depth, deterministic, CI-friendly.
4. Phase 4 (metamorphic) — low-cost addition once Phase 2 infra is in place.
5. Phase 3 (proptest) — unbounded coverage; runs nightly.
6. Phase 5 (reference interpreter + differential) — strongest oracle, biggest cost; do after semantics are stable.

Each phase is independently shippable and provides incremental value; we do **not** need to commit to all six up front.

## Acceptance criteria for closing this umbrella

Either:
- All six phases shipped, or
- A documented decision per phase to ship / defer / abandon, with the rationale captured in this repo's testing docs.

## Related files

- DSL constructors: [src/dsl.rs](src/dsl.rs)
- Orchestration core: [src/orchestrations/execute_function_graph.rs](src/orchestrations/execute_function_graph.rs)
- Graph + node types: [src/types.rs](src/types.rs)
- E2E harness and existing tests: [tests/e2e/sql/](tests/e2e/sql/)
- Testing strategy docs: [docs/E2E_TESTING.md](docs/E2E_TESTING.md), [docs/TESTING.md](docs/TESTING.md)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Roadmap: thorough automated testing for the durable-function DSL #232

Roadmap: thorough automated testing for the durable-function DSL

Background / motivation

Non-goals

Proposed phases

Phase 1 — Trace-oracle library (highest ROI, lowest cost)

Phase 2 — Combinator-nesting matrix generator (deterministic, exhaustive at low depth)

Phase 3 — `proptest` random tree strategy (unbounded coverage + shrinking)

Phase 4 — Metamorphic relations (cheap, complementary)

Phase 5 — Reference interpreter + differential testing (high value, real cost)

Phase 6 — One-page operational-semantics document

Cross-cutting recommendations

Explicitly considered and deferred

Suggested execution order

Acceptance criteria for closing this umbrella

Related files

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Idea	Reason
Grammar-based fuzzing (ANTLR / Grammarinator / AFL grammar mode)	No DSL parser exists; AST is JSON built programmatically.
`cargo fuzz` / libFuzzer / AFL++ on the binary	Orchestration is async, Postgres-coupled, and stateful across processes; the harness cost dwarfs the bug-finding value at this stage.
Model checking (TLA+ / Alloy)	Defer until Phase 6 exists. Worth revisiting once semantics stabilize and a class of concurrency bugs is found that Phase 5 still misses.
Formal proof (Coq / Lean)	Out of scope for the foreseeable future.

Roadmap: thorough automated testing for the durable-function DSL #232

Description

Roadmap: thorough automated testing for the durable-function DSL

Background / motivation

Non-goals

Proposed phases

Phase 1 — Trace-oracle library (highest ROI, lowest cost)

Phase 2 — Combinator-nesting matrix generator (deterministic, exhaustive at low depth)

Phase 3 — proptest random tree strategy (unbounded coverage + shrinking)

Phase 4 — Metamorphic relations (cheap, complementary)

Phase 5 — Reference interpreter + differential testing (high value, real cost)

Phase 6 — One-page operational-semantics document

Cross-cutting recommendations

Explicitly considered and deferred

Suggested execution order

Acceptance criteria for closing this umbrella

Related files

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Phase 3 — `proptest` random tree strategy (unbounded coverage + shrinking)