feat(authoring): per-stage retrieval against the project corpus #112

Open

danielnaab wants to merge 1 commit into main from feat/authoring-per-stage-retrieval

Conversation

@danielnaab
Member

Please leave unmerged. Opening for review so we have a concrete proposal on the table, not to ship today.

Turns the authoring pipeline's RAG from "corpus-grounded generation" into "retrieval per stage". Every stage now issues a cosine-similarity query against a Titan-embedded index of the project's corpus chunks and sends only the top-k most relevant chunks to the LLM, instead of the full 21-chunk corpus.

Why this matters

The earlier RAG ablation (#104, #105) established that the corpus is load-bearing — removing it drops authoring recall from 10.6% to 4.7%. But in the current code, every authoring LLM call receives the full 21-chunk corpus. That's "context-injection with a citation," not retrieval.

The catalog page `pdf-field-extraction/sonnet-with-rag` is honest about this: retrieval is real for extraction (slug query, top-2), and explicitly not real for authoring. This PR closes that gap.

What it does

New primitive: `src/services/rag/corpus-retriever.ts`

  • `getCorpusRetriever(slug)` — per-slug memoized retriever. First call embeds the corpus chunks with Bedrock Titan; subsequent calls on the same process hit warm memory.
  • `retrieveOrFullCorpus(slug, query, k)` — high-level wrapper. Returns top-k cosine-similar chunks when Titan is available; falls back to the full corpus when it isn't. Surfaces the source (`retrieval` vs `full-corpus`) so callers can log and the pipeline can honestly say whether this particular run did RAG.

Hash-embedder fallback is deliberately disabled for authoring. Unlike extraction (which queries by fixture slug, for which a deterministic hash works fine as a lookup), authoring queries are natural-language ("Household Composition"). The hash embedder produces essentially random cosine scores for these — worse than just passing every chunk. So we fall back to the full corpus when Titan is unavailable, and the pipeline stays correct.
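
For concreteness, here is a minimal sketch of the wrapper's shape in TypeScript. Everything beyond the two exported names is an assumption: the `CorpusChunk` type, the retriever's `index`/`embedQuery` internals, and the cosine helper are illustrative, not the actual implementation.

```typescript
// Sketch only — the real module is src/services/rag/corpus-retriever.ts.
// The types, retriever internals, and cosine helper are assumed here.
type CorpusChunk = { id: string; text: string };
type Retrieved = { chunks: CorpusChunk[]; source: "retrieval" | "full-corpus" };

declare function getCorpusRetriever(slug: string): Promise<{
  index: { chunk: CorpusChunk; vector: number[] }[];
  embedQuery: (query: string) => Promise<number[]>;
} | null>;
declare function loadPolicyCorpus(args: { slug: string }): Promise<{ chunks: CorpusChunk[] }>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function retrieveOrFullCorpus(
  slug: string,
  query: string,
  k: number,
): Promise<Retrieved> {
  const retriever = await getCorpusRetriever(slug); // null when Titan is unavailable
  if (!retriever) {
    // Honest fallback: full corpus, with the source surfaced for logging.
    const { chunks } = await loadPolicyCorpus({ slug });
    return { chunks, source: "full-corpus" };
  }
  const queryVector = await retriever.embedQuery(query); // Titan embedding of the query
  return {
    chunks: retriever.index
      .map((entry) => ({ chunk: entry.chunk, score: cosine(queryVector, entry.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((scored) => scored.chunk),
    source: "retrieval",
  };
}
```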

Rewired authoring routes

Every `loadPolicyCorpus({ slug })` call in `src/entrypoints/app/routes/owner/edit/authoring.tsx` — six sites — is now a `retrieveOrFullCorpus(slug, query, k)` call.

Query construction:

| Stage | Query | k | Rationale |
| --- | --- | --- | --- |
| Criteria analysis | `<Form Name> — eligibility, required information, application process, regulatory requirements` | 15 | Broad coverage; criteria should reflect most of the policy surface |
| Structure planning | same | 15 | Structure needs to name every topical section the policy implies |
| Section generation | group title (e.g. "Household Composition") | 5 | Focused — fields in this section only need the chunks closest to its topic |
| Section evaluation | group title | 5 | Same |

For SNAP's 21-chunk corpus this means: planning stages see 15 of 21 chunks (cosine excludes the 6 least-relevant); section stages see 5.
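
At each call site the change is mechanical. A hedged sketch of the two query styles follows (the `planningQuery` name comes from the commit message below; `slug`, `group`, the logger, and the `formName` accessor are assumed shapes, not the actual route code):

```typescript
// Sketch of the two query styles — declared names are assumptions.
declare const slug: string;
declare const group: { title: string };
declare function log(line: string): void;
declare function loadPolicyCorpus(args: { slug: string }): Promise<{ formName: string }>;
declare function retrieveOrFullCorpus(
  slug: string,
  query: string,
  k: number,
): Promise<{ chunks: unknown[]; source: "retrieval" | "full-corpus" }>;

// Planning stages (criteria analysis, structure planning): broad query, k = 15.
async function planningQuery(slug: string): Promise<string> {
  const { formName } = await loadPolicyCorpus({ slug });
  return `${formName} — eligibility, required information, application process, regulatory requirements`;
}
const planning = await retrieveOrFullCorpus(slug, await planningQuery(slug), 15);
log(`criteria: ${planning.source} (${planning.chunks.length} chunks)`);

// Section stages (generation, evaluation): the group title is the query, k = 5.
const section = await retrieveOrFullCorpus(slug, group.title, 5);
log(`[${group.title}] ${section.source} (${section.chunks.length} chunks)`);
```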

Visible log output

Each stage logs whether it retrieved or fell back. Sample progress stream for a build:

```
Corpus: snap-wisconsin
criteria: retrieval (15 chunks)
structure: retrieval (15 chunks)
[Applicant Information] retrieval (5 chunks)
[Household Composition] retrieval (5 chunks)

```

What this does NOT do

  • No benchmark re-run. Re-running the SNAP ablation (`bun run cli evaluate authoring all-sonnet` and `... no-rag-sonnet`) against the new retrieval path requires Bedrock spend (~$0.30) and hasn't been done. Expected directional effect:
    • With-corpus (retrieval active): possibly similar or slightly lower recall than the 10.6 % we measured with full-corpus-per-call — if retrieval accidentally excludes a ground-truth-relevant chunk, fields go missing. Or slightly higher — if focus improves field quality per section.
    • Without-corpus: unchanged (4.7 %) — this path doesn't touch retrieval.
    • The ablation's shape — corpus matters — won't change.
  • No changes to the extraction RAG. That already uses retrieval (different bootstrap, different fallback semantics appropriate for slug-keyed queries). Unifying the two primitives could come later.

Risks worth reviewing

  • Hyperparameters are guesses. k = 15 for planning and k = 5 for sections are defensible starting points, not optimized. If k is too low on planning, criteria / structure miss topics. If k is too low on sections, fields are regulation-thin.
  • Silent fallback. If Titan access breaks in production, every call falls back to the full corpus and the pipeline keeps working — safe but easy to miss. The per-stage log helps; no monitoring/paging yet.
  • Memoization lives in module state. One retriever per process per slug for process lifetime. For demo scale this is fine; for a real deployment the memory footprint grows with the number of corpora (21 × 1024-dim float vectors ≈ 86 KB per corpus, trivial).
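
The module-state memoization is presumably the usual promise-cache pattern; a sketch under that assumption (the map name and bootstrap helper are invented):

```typescript
// Sketch: one retriever per slug for the life of the process.
// Caching the promise (not the resolved value) also dedupes concurrent
// first calls, so each corpus is Titan-embedded at most once per process.
type CorpusRetriever = { /* embedded index + embedQuery, as sketched above */ };

declare function bootstrapRetriever(slug: string): Promise<CorpusRetriever | null>;

const retrievers = new Map<string, Promise<CorpusRetriever | null>>();

export function getCorpusRetriever(slug: string): Promise<CorpusRetriever | null> {
  let cached = retrievers.get(slug);
  if (!cached) {
    cached = bootstrapRetriever(slug); // hypothetical: embeds all chunks with Titan
    retrievers.set(slug, cached);
  }
  return cached;
}
```

At 21 chunks × 1024 dims × 4 bytes, that cache works out to the ~86 KB per corpus estimated above.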

Testing

  • `bun run check` — 1354 tests pass (one new file; fallback semantics unit-tested).
  • Benchmark re-run against SNAP ground truth: not done (see "What this does NOT do" above).
  • Manual smoke: build a SNAP form from `/new` and watch the progress stream for `retrieval` lines.

Turns the authoring pipeline's RAG from "corpus-grounded generation"
into "retrieval per stage". Every stage now issues a cosine-similarity
query against a Titan-embedded index of the project's corpus chunks
and sends only the top-k most relevant chunks to the LLM, instead of
the full 21-chunk corpus.

Changes

- New `src/services/rag/corpus-retriever.ts`. Per-slug memoized
  retriever bootstrap — embeds each corpus's chunks with Titan on
  first use and caches the index for the process lifetime. Exposes
  `getCorpusRetriever(slug)` and `retrieveOrFullCorpus(slug, query, k)`.
  When Titan access is unavailable, retrieval returns null and the
  wrapper gracefully falls back to the full corpus (hash embeddings
  produce essentially random cosine scores on natural-language
  queries — worse than no retrieval at all).

- Replace every `loadPolicyCorpus({ slug })` call in the authoring
  routes with a `retrieveOrFullCorpus(slug, query, k)` call. Six
  sites in `src/entrypoints/app/routes/owner/edit/authoring.tsx`:
  criteria analysis (runBuild + endpoint), structure planning
  (runBuild + endpoint), section generation (runBuild + endpoint),
  section evaluation.

- Query construction:
  - Criteria / structure — `planningQuery(slug)` derives a broad
    query from the corpus's formName: "<Form Name> — eligibility,
    required information, application process, regulatory
    requirements". k = 15 (out of 21 for SNAP) so the planning
    stages see most of the corpus while still excluding chunks
    that cosine identifies as distant from the form's topic.
  - Section generation / evaluation — the group title is the query
    ("Household Composition", "Earned Income"). k = 5 — a focused
    window of the chunks most relevant to the fields this section
    collects.

- Progress log surfaces whether each stage used retrieval or the
  full-corpus fallback, so demos / debugging show real retrieval
  happening vs a degraded run.

- Test: `test/services/rag/corpus-retriever.test.ts` pins the
  fallback semantics (falls back when RAG_EMBEDDER=hash; returns
  empty for unknown slug).
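
  Those pinned semantics translate to assertions roughly like the
  following (sketch using bun:test; the import path, slug, and env
  handling are assumptions, not the actual test file):

```typescript
import { expect, test } from "bun:test";
import { retrieveOrFullCorpus } from "../../../src/services/rag/corpus-retriever";

test("falls back to the full corpus when RAG_EMBEDDER=hash", async () => {
  process.env.RAG_EMBEDDER = "hash"; // hash scores are noise for NL queries
  const result = await retrieveOrFullCorpus("snap-wisconsin", "Household Composition", 5);
  expect(result.source).toBe("full-corpus");
});

test("returns empty for an unknown slug", async () => {
  const result = await retrieveOrFullCorpus("no-such-corpus", "anything", 5);
  expect(result.chunks).toHaveLength(0);
});
```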

Not doing here

- Re-running the SNAP ablation against the new retrieval path. That
  requires real Bedrock spend (~$0.30). Deferred to whoever
  merges; the ablation variant
  (`bun run cli evaluate authoring no-rag-sonnet`) still exposes
  the with-vs-without-corpus delta.
- Touching the extraction RAG variant. It already uses a retriever
  (different bootstrap, slug-keyed fallback for the hash-embedder
  case). Unifying the two could come later.

Why this is worth reviewing carefully

- Changes the signal every authoring LLM call sees. If top-k is too
  narrow, criteria / structure stages produce less-complete forms.
  Benchmarks needed before trusting k=15 / k=5.
- Adds a failure mode: if Titan access breaks in production, every
  call silently falls back. That's safe (pipeline keeps working)
  but easy to miss. The per-stage log helps; monitoring doesn't
  yet page on it.
