Skip to content

Fix for Issue #232: generative + invariant-driven DSL testing (Phases 1-6)#234

Open
crprashant wants to merge 8 commits into
microsoft:mainfrom
crprashant:crprashant/dsl-testing-roadmap
Open

Fix for Issue #232: generative + invariant-driven DSL testing (Phases 1-6)#234
crprashant wants to merge 8 commits into
microsoft:mainfrom
crprashant:crprashant/dsl-testing-roadmap

Conversation

@crprashant

Copy link
Copy Markdown
Contributor

Implements the full six-phase roadmap from #232 — moving DSL coverage from
hand-written E2E tests to generative + invariant-driven testing. Each phase
is independently shippable; all six are included here (Phase 1 as a documented
slice — see Deferred below).

What's in this PR

Phase Deliverable Status
6 docs/dsl-semantics.md — one-page operational-semantics contract for every combinator under nesting
1 df.assert_structural_invariants(instance_id, fail_on_violation) post-run oracle (6 pure-state invariants) + tests/e2e/sql/24_structural_invariants.sql ✅ (slice)
2 tests/e2e/generated/ combinator-nesting matrix generator — depth-2 exhaustive: 154 shapes (128 live / 26 quarantined) — plus CI wiring
4 Metamorphic-relations registry (7 issue-#232 equivalences) in the same generator crate
3 proptest::Strategy random-tree generator with shrinking (model-level, over the reference interpreter) + failure corpus + nightly job
5 Synchronous tree-walking reference interpreter (pomset / causal-order oracle) + live causal-order assertions emitted into every clean generated test

Highlights

  • Phase 1 is a sound post-run snapshot checker (reachable-completed,
    join-all-completed, untaken-if-pending, race-one-completed, join-name-disjoint,
    query-json-wellformed). New src/db.rs node loader; loop-scope relaxations keep
    it false-positive-free. Packaged as extension 0.2.4 with upgrade script
    sql/pg_durable--0.2.3--0.2.4.sql.
  • Phase 2 is a standalone std-only crate (no pgrx/duroxide dep) that enumerates
    nestings → renders live SQL E2E tests with marker leaves → asserts the Phase 1
    oracle + per-path counts. A classifier splits clean shapes (blocking) from the
    known loop-bug zone (quarantined / xfail, non-blocking).
  • Phase 5 adds the causal-order dimension the count-based oracles can't see:
    for every happens-before (≺) edge the reference interpreter predicts, the
    generated SQL asserts the earlier marker's wall_clock precedes the later one;
    concurrent siblings carry no edge (no flakiness). The order oracle is
    fail-closed — a missing trace endpoint is a violation, never silently dropped.

Validation

  • cargo fmt / cargo clippy -D warnings clean; 196 extension unit tests +
    the generator's own unit tests pass.
  • Live ./scripts/test-e2e-local.sh --include-generated: 169 passed / 0 failed
    (128 generated + 7 metamorphic shapes exercise the new causal-order block with
    zero false positives).
  • Each phase was put through a multi-specialist review; triage records live under
    the testing dir.

Deferred (documented, per the issue's acceptance criteria)

The two count-based invariants that would directly catch #227
(single_execution_outside_loop) and #230 (loop_body_iteration_count_matches)
are deferred: df.nodes.status is current-state (in-place UPDATE), not an
append-only trace, so iteration/execution counts need a new df.node_events log
(or duroxide-history integration). The generated matrix (Phases 2–5) already
catches that same class differentially via the df_gen_trace marker table
today; the standalone oracle slice ships exactly the invariants that are
checkable from df.nodes alone.

New defect surfaced (bug-as-seed)

Per the issue's "treat every reported bug as a generator seed" recommendation,
the Phase 2 matrix surfaced a new runtime-boundary defect — a loop nested
inside a join/race branch (the inverse of #230; root cause: continue_as_new
is host-context-unaware) — filed as #233. A blast-radius experiment found it is
not a cluster DoS (parked hangs ~0 CPU, no pool/connection exhaustion); the
harms are permanently-stuck instances + silent over-execution.

Implements all six phases of #232; maintainers to decide on closing the umbrella
given the documented Phase 1 deferral.

…ft#232)

Introduce a sound post-run snapshot oracle that validates structural invariants of a completed durable function instance directly from df.nodes:

- src/invariants.rs: 6 invariants (every_reachable_node_completed, join_all_branches_completed, untaken_if_branch_pending, race_at_least_one_branch_completed, join_branch_result_name_disjoint, query_json_well_formed) with loop-scope relaxations to stay false-positive-free; 26 unit tests.
- src/db.rs: canonical NodeSnapshot loader for df.nodes (shared projection).
- df.assert_structural_invariants(instance_id, fail_on_violation) SQL function.

Packaging/docs/tests:
- Bump to 0.2.4; lib.rs func_sig + grant_usage re-emit; sql/pg_durable--0.2.3--0.2.4.sql upgrade script.
- USER_GUIDE.md, docs/upgrade-testing.md, docs/dsl-semantics.md (operational-semantics contract).
- tests/e2e/sql/24_structural_invariants.sql.

Validated: fmt, clippy, 26 unit tests, e2e 24_, upgrade 35/35.
Standalone std-only generator crate (tests/e2e/generated/generator/) that
deterministically enumerates DSL combinator nestings, renders each to a live
SQL E2E test with marker leaves, and validates it against the Phase 1
structural-invariant oracle plus per-path execution counts.

- depth-2 matrix: 154 shapes (128 live / 26 quarantined); 128 passed / 0 failed
- classifier splits clean shapes (blocking, sql/) from known loop-bug shapes
  (quarantine/, xfail, non-blocking) via an empirically-anchored fail set
- golden manifest.json + --check determinism guard; per-shape .sql is
  gitignored and regenerated on demand
- CI: generated-matrix determinism+unit job; live matrix folded into the E2E
  step; nightly non-blocking quarantine run that flags unexpected passes
- harness: --include-generated[-quarantine], unix socket in the data dir to
  keep socket paths short; Makefile generate-matrix target
- 32 generator unit tests; paw-sot multi-specialist review applied

Surfaced a new product defect (loop inside a JOIN/RACE branch) -> issue microsoft#233.
Phase 4 of the DSL automated-testing roadmap. Adds a metamorphic-relations
registry to the standalone std-only generator crate: pairs of DSL programs the
runtime must treat as equivalent, checked live.

meta.rs introduces a labeled-leaf program model (Leaf/Seq/If/Join/Race/DoWhile/
LoopBreak) where the same logical leaf carries the same label in both programs
of a pair. A tiny deterministic interpreter eval() computes the ground-truth
observable = multiset {label -> completed-count}. The renderer reuses Phase 2's
exact marker/loop/race/break DSL. registry() defines 7 equivalence relations
(seq-assoc, if-true, if-false, join-comm, race-winner, do-while-once,
loop-break-once).

Each generated SQL test runs BOTH programs (tags meta-NNNN-a / meta-NNNN-b in
the shared df_gen_trace) and asserts: both completed; both pass
df.assert_structural_invariants; HEADLINE observable(A) == observable(B) via an
EXCEPT-based multiset symmetric-diff; and a BACKSTOP per-label count plus a
no-unexpected-label check that catches the symmetric-count-bug blind spot the
Phase 1 pure-state invariants miss.

A separate committed golden meta-manifest.json mirrors the Phase 2 manifest.json
determinism contract; the generator's --check guard diffs both. Phase 2's
manifest.json stays byte-identical.

Adds a scoped tests/e2e/generated/.gitattributes pinning the goldens and
generator sources to LF so the Linux-CI --check gate matches a Windows-authored
commit byte-for-byte.

Validation: fmt, build, clippy -D warnings, 42 generator unit tests, and --check
all green. Live metamorphic E2E runs in CI (folded into the existing
generated-matrix and E2E jobs).
Adds a recursive proptest Strategy<Meta> over the labeled-leaf program model,
generating thousands of random combinator trees per run with proptest's
shrinking — the issue's stated value-add over the Phase 2 enumerated matrix.

Properties run in-process against the Phase 4 reference interpreter
(eval/observable) and the SQL renderer (not live PG), because shrinking
requires the property to execute in-process so proptest can drive reduction.
This hardens the same eval the live Phase 4 metamorphic suite depends on, so
it strengthens the live suite transitively.

src/prop.rs (new): 12 shrinking-enabled properties + 4 helper tests
  - eval determinism
  - differential: eval vs an independent functional reference interpreter
    over the whole random tree space (anti-circular; a Phase 5 bridge)
  - metamorphic algebra laws on random subtrees (seq-assoc, join-comm,
    join-assoc, seq==join multiset, if-branch selection, race->winner) —
    generalizing Phase 4's 7 hand cases to thousands
  - loop multiplier scaling (DoWhile k / LoopBreak n)
  - total-count conservation over structural combinators
  - render determinism + structural well-formedness (balanced parens, even
    $mk$/$c$ dollar-quotes, no leaked df.start, every label present,
    race arity, loop has cond)

proptest is a dev-dependency only, so the generation binary and both goldens
(manifest.json, meta-manifest.json) stay dependency-free and byte-identical.
Persistent failure corpus is proptest-native generator/proptest-regressions/
(committed when populated; LF-normalized via .gitattributes).

CI: clippy gate on the generator; PROPTEST_CASES=256 deterministic replay in
the generated-matrix job; nightly fresh-seed deep run (8192 cases,
non-blocking) with corpus surfacing. Makefile proptest target (1024 cases).

meta.rs: render_prog -> pub(crate) for the renderer property; strip_quoted
test helper loop -> while let (clippy). Rendered output unchanged.

Validated: fmt, clippy --all-targets -D warnings, 58 tests, --check
byte-identical goldens, shrinking demonstrated.
…#232)

Phase 5 adds the roadmap's strongest single oracle: a synchronous
tree-walking reference interpreter over `Shape` that simulates each
program into a pomset (events + happens-before edges). Unlike Phases 2
and 4 -- which differential-test only the COUNT dimension -- this adds
the missing causal-ORDER dimension the live wall-clock trace must obey.

- refinterp.rs: the interpreter (Seq/If/Loop/Join/Race/LoopBreak per
  docs/dsl-semantics.md) plus a `counts_match_render` model-level
  differential against render's closed-form counts, and 14 unit tests
  including a full depth-2 matrix x K in {1,2,3} count differential.
- prop.rs: 6 proptest properties over random `Shape` trees -- the count
  differential, determinism, forward-edge acyclicity, dense iterations,
  and structural Seq/Join happens-before laws (asserting the exact edge
  sets, not just their sizes).
- Docs: README Phase 5 section + dsl-semantics.md cross-reference.

Test-only: the generation binary does not link the `#[cfg(test)]`
interpreter, so manifest.json / meta-manifest.json stay byte-identical.
Validated: fmt, clippy -D warnings, 78 tests, generator --check.
Close the Phase 5 loop to the live runtime: emit a causal-order oracle into
each clean generated E2E test and pin the happens-before relation as a golden
in manifest.json.

- emit.rs: per live shape, render a set-based causal-order block — one VALUES
  row per happens-before (≺) edge from the reference interpreter, a double
  df_gen_trace self-join to resolve both endpoints, and a RAISE iff any edge ran
  with earlier.wall_clock >= later.wall_clock. Gated on
  emit_order = reason.is_none() && !ordered_pairs.is_empty(), with a conditional
  `ord_viol TEXT` declare. Concurrent (join/race) siblings carry no edge, so they
  are never compared and the block cannot flake.
- main.rs: wire ordered_pairs into build_records via
  refinterp::interpret(shape, loop_iters).ordered_pairs().
- refinterp.rs: promote off #[cfg(test)] for the binary; only the count-projection
  helpers (path_counts, counts_match_render) stay test-only.
- manifest.json: add an `order` golden (one [earlier, iter, later, iter] entry per
  ≺ edge) for all 88 edge-bearing shapes, live and quarantined (187896 -> 198914 B;
  intentionally breaks the prior byte-identical property).
- README/dsl-semantics: document the staged scope, the order golden, and a "Clock
  assumptions (known limitation)" caveat (clock_timestamp is wall-clock, not
  monotonic; a backward NTP step inside a sub-second test window is the only
  theoretical false positive, and it fails loudly, never silently).

Validated: fmt, clippy -D, generator tests, --check byte-identical (manifest
198914 B, meta-manifest 7730 B); live --include-generated run with the order
blocks green. paw-sot review (9 specialists) triaged against source; all must-fixes
refuted, net change is the doc-only clock caveat.
…-op (microsoft#232)

Self-review follow-ups on the Phase 5 work:

- emit.rs: the generated causal-order block now LEFT JOINs both
  happens-before endpoints to df_gen_trace and treats a missing endpoint
  (wall_clock IS NULL) as a violation, tagged "(missing endpoint)".
  Previously an INNER JOIN silently dropped any edge whose endpoint row
  was absent, so a dropped or duplicated marker iteration -- the very
  MAX(iteration)+1 numbering under test -- could pass unflagged. The
  oracle is now fail-closed and strictly stronger; concurrent siblings
  still carry no edge, so it cannot flake. Lockstep emit test updated.

- invariants.rs: document that df.assert_structural_invariants is STRICT,
  so a NULL argument yields an empty set and never raises.

- USER_GUIDE.md: note the same STRICT / non-NULL-boolean caveat in the
  API reference table.

Validated: generator fmt + clippy + 82 unit tests + --check byte-identical;
extension fmt + build + clippy + 196 unit tests; live e2e --include-generated
169 passed / 0 failed (128 generated + 7 metamorphic shapes exercise the
new block with zero false positives).
…rade (microsoft#232)

The pgspot SQL security gate flagged 26 findings on the re-emitted
df.grant_usage function in the new upgrade script, even though the
byte-identical function passes in the fresh-install SQL.

Root cause: pgspot's is_secure_searchpath only trusts a schema named in
a function's SET search_path when that schema is CREATEd earlier in the
same script. The fresh install runs CREATE SCHEMA df, so its grant_usage
re-emit is analyzed with a secure search_path; the upgrade fragment never
created df, so the SET search_path = pg_catalog, df, pg_temp was deemed
insecure and every unqualified body reference cascaded into findings.

Prepend CREATE SCHEMA IF NOT EXISTS df; to the upgrade script. df always
exists on any install being upgraded, so this is an idempotent runtime
no-op that preserves catalog upgrade-equivalence (test-upgrade.sh Scenario
A is a catalog-snapshot diff, unaffected). It makes pgspot recognize the
search_path as secure, collapsing all 26 findings into a single PS010
that is already on the gate's allowlist. Gate now passes.
@crprashant crprashant changed the title Issue #232: generative + invariant-driven DSL testing (Phases 1-6) Fix for Issue #232: generative + invariant-driven DSL testing (Phases 1-6) Jun 15, 2026
@pinodeca pinodeca linked an issue Jun 16, 2026 that may be closed by this pull request
@pinodeca pinodeca mentioned this pull request Jun 17, 2026
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Roadmap: thorough automated testing for the durable-function DSL

1 participant