feat(authoring): per-stage retrieval against the project corpus #112

Open

danielnaab wants to merge 1 commit into main from feat/authoring-per-stage-retrieval

Conversation

@danielnaab
Member

Please leave unmerged. Opening for review so we have a concrete proposal on the table, not to ship today.

Turns the authoring pipeline's RAG from "corpus-grounded generation" into "retrieval per stage". Every stage now issues a cosine-similarity query against a Titan-embedded index of the project's corpus chunks and sends only the top-k most relevant chunks to the LLM, instead of the full 21-chunk corpus.

Why this matters

The earlier RAG ablation (#104, #105) established that the corpus is load-bearing — removing it drops authoring recall from 10.6% to 4.7%. But in the current code, every authoring LLM call receives the full 21-chunk corpus. That's "context-injection with a citation," not retrieval.

The catalog page `pdf-field-extraction/sonnet-with-rag` is honest about this: retrieval is real for extraction (slug query, top-2), and explicitly not real for authoring. This PR closes that gap.

What it does

New primitive: `src/services/rag/corpus-retriever.ts`

  • `getCorpusRetriever(slug)` — per-slug memoized retriever. First call embeds the corpus chunks with Bedrock Titan; subsequent calls on the same process hit warm memory.
  • `retrieveOrFullCorpus(slug, query, k)` — high-level wrapper. Returns top-k cosine-similar chunks when Titan is available; falls back to the full corpus when it isn't. Surfaces the source (`retrieval` vs `full-corpus`) so callers can log and the pipeline can honestly say whether this particular run did RAG.

Hash-embedder fallback is deliberately disabled for authoring. Unlike extraction (which queries by fixture slug, for which a deterministic hash works fine as a lookup), authoring queries are natural-language ("Household Composition"). The hash embedder produces essentially random cosine scores for these — worse than just passing every chunk. So we fall back to the full corpus when Titan is unavailable, and the pipeline stays correct.
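
For concreteness, here is a minimal sketch of the wrapper's shape in TypeScript. Everything beyond the two exported names is an assumption: the `CorpusChunk` type, the retriever's `index`/`embedQuery` internals, and the cosine helper are illustrative, not the actual implementation.

```typescript
// Sketch only — the real module is src/services/rag/corpus-retriever.ts.
// The types, retriever internals, and cosine helper are assumed here.
type CorpusChunk = { id: string; text: string };
type Retrieved = { chunks: CorpusChunk[]; source: "retrieval" | "full-corpus" };

declare function getCorpusRetriever(slug: string): Promise<{
  index: { chunk: CorpusChunk; vector: number[] }[];
  embedQuery: (query: string) => Promise<number[]>;
} | null>;
declare function loadPolicyCorpus(args: { slug: string }): Promise<{ chunks: CorpusChunk[] }>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function retrieveOrFullCorpus(
  slug: string,
  query: string,
  k: number,
): Promise<Retrieved> {
  const retriever = await getCorpusRetriever(slug); // null when Titan is unavailable
  if (!retriever) {
    // Honest fallback: full corpus, with the source surfaced for logging.
    const { chunks } = await loadPolicyCorpus({ slug });
    return { chunks, source: "full-corpus" };
  }
  const queryVector = await retriever.embedQuery(query); // Titan embedding of the query
  return {
    chunks: retriever.index
      .map((entry) => ({ chunk: entry.chunk, score: cosine(queryVector, entry.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((scored) => scored.chunk),
    source: "retrieval",
  };
}
```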

Rewired authoring routes

Every `loadPolicyCorpus({ slug })` call in `src/entrypoints/app/routes/owner/edit/authoring.tsx` — six sites — is now a `retrieveOrFullCorpus(slug, query, k)` call.

Query construction:

| Stage | Query | k | Rationale |
| --- | --- | --- | --- |
| Criteria analysis | `<Form Name> — eligibility, required information, application process, regulatory requirements` | 15 | Broad coverage; criteria should reflect most of the policy surface |
| Structure planning | same | 15 | Structure needs to name every topical section the policy implies |
| Section generation | group title (e.g. "Household Composition") | 5 | Focused — fields in this section only need the chunks closest to its topic |
| Section evaluation | group title | 5 | Same |

For SNAP's 21-chunk corpus this means: planning stages see 15 of 21 chunks (cosine excludes the 6 least-relevant); section stages see 5.
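
At each call site the change is mechanical. A hedged sketch of the two query styles follows (the `planningQuery` name comes from the commit message below; `slug`, `group`, the logger, and the `formName` accessor are assumed shapes, not the actual route code):

```typescript
// Sketch of the two query styles — declared names are assumptions.
declare const slug: string;
declare const group: { title: string };
declare function log(line: string): void;
declare function loadPolicyCorpus(args: { slug: string }): Promise<{ formName: string }>;
declare function retrieveOrFullCorpus(
  slug: string,
  query: string,
  k: number,
): Promise<{ chunks: unknown[]; source: "retrieval" | "full-corpus" }>;

// Planning stages (criteria analysis, structure planning): broad query, k = 15.
async function planningQuery(slug: string): Promise<string> {
  const { formName } = await loadPolicyCorpus({ slug });
  return `${formName} — eligibility, required information, application process, regulatory requirements`;
}
const planning = await retrieveOrFullCorpus(slug, await planningQuery(slug), 15);
log(`criteria: ${planning.source} (${planning.chunks.length} chunks)`);

// Section stages (generation, evaluation): the group title is the query, k = 5.
const section = await retrieveOrFullCorpus(slug, group.title, 5);
log(`[${group.title}] ${section.source} (${section.chunks.length} chunks)`);
```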

Visible log output

Each stage logs whether it retrieved or fell back. Sample progress stream for a build:

```
Corpus: snap-wisconsin
criteria: retrieval (15 chunks)
structure: retrieval (15 chunks)
[Applicant Information] retrieval (5 chunks)
[Household Composition] retrieval (5 chunks)

```

What this does NOT do

  • No benchmark re-run. Re-running the SNAP ablation (`bun run cli evaluate authoring all-sonnet` and `... no-rag-sonnet`) against the new retrieval path requires Bedrock spend (~$0.30) and hasn't been done. Expected directional effect:
    • With-corpus (retrieval active): possibly similar or slightly lower recall than the 10.6 % we measured with full-corpus-per-call — if retrieval accidentally excludes a ground-truth-relevant chunk, fields go missing. Or slightly higher — if focus improves field quality per section.
    • Without-corpus: unchanged (4.7 %) — this path doesn't touch retrieval.
    • The ablation's shape — corpus matters — won't change.
  • No changes to the extraction RAG. That already uses retrieval (different bootstrap, different fallback semantics appropriate for slug-keyed queries). Unifying the two primitives could come later.

Risks worth reviewing

  • Hyperparameters are guesses. k = 15 for planning and k = 5 for sections are defensible starting points, not optimized. If k is too low on planning, criteria / structure miss topics. If k is too low on sections, fields are regulation-thin.
  • Silent fallback. If Titan access breaks in production, every call falls back to the full corpus and the pipeline keeps working — safe but easy to miss. The per-stage log helps; no monitoring/paging yet.
  • Memoization lives in module state. One retriever per process per slug for process lifetime. For demo scale this is fine; for a real deployment the memory footprint grows with the number of corpora (21 × 1024-dim float vectors ≈ 86 KB per corpus, trivial).
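
The module-state memoization is presumably the usual promise-cache pattern; a sketch under that assumption (the map name and bootstrap helper are invented):

```typescript
// Sketch: one retriever per slug for the life of the process.
// Caching the promise (not the resolved value) also dedupes concurrent
// first calls, so each corpus is Titan-embedded at most once per process.
type CorpusRetriever = { /* embedded index + embedQuery, as sketched above */ };

declare function bootstrapRetriever(slug: string): Promise<CorpusRetriever | null>;

const retrievers = new Map<string, Promise<CorpusRetriever | null>>();

export function getCorpusRetriever(slug: string): Promise<CorpusRetriever | null> {
  let cached = retrievers.get(slug);
  if (!cached) {
    cached = bootstrapRetriever(slug); // hypothetical: embeds all chunks with Titan
    retrievers.set(slug, cached);
  }
  return cached;
}
```

At 21 chunks × 1024 dims × 4 bytes, that cache works out to the ~86 KB per corpus estimated above.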

Testing

  • `bun run check` — 1354 tests pass (one new file; fallback semantics unit-tested).
  • Benchmark re-run against SNAP ground truth: not done (see "What this does NOT do" above).
  • Manual smoke: build a SNAP form from `/new` and watch the progress stream for `retrieval` lines.

Turns the authoring pipeline's RAG from "corpus-grounded generation"
into "retrieval per stage". Every stage now issues a cosine-similarity
query against a Titan-embedded index of the project's corpus chunks
and sends only the top-k most relevant chunks to the LLM, instead of
the full 21-chunk corpus.

Changes

- New `src/services/rag/corpus-retriever.ts`. Per-slug memoized
  retriever bootstrap — embeds each corpus's chunks with Titan on
  first use and caches the index for the process lifetime. Exposes
  `getCorpusRetriever(slug)` and `retrieveOrFullCorpus(slug, query, k)`.
  When Titan access is unavailable, retrieval returns null and the
  wrapper gracefully falls back to the full corpus (hash embeddings
  produce essentially random cosine scores on natural-language
  queries — worse than no retrieval at all).

- Replace every `loadPolicyCorpus({ slug })` call in the authoring
  routes with a `retrieveOrFullCorpus(slug, query, k)` call. Six
  sites in `src/entrypoints/app/routes/owner/edit/authoring.tsx`:
  criteria analysis (runBuild + endpoint), structure planning
  (runBuild + endpoint), section generation (runBuild + endpoint),
  section evaluation.

- Query construction:
  - Criteria / structure — `planningQuery(slug)` derives a broad
    query from the corpus's formName: "<Form Name> — eligibility,
    required information, application process, regulatory
    requirements". k = 15 (out of 21 for SNAP) so the planning
    stages see most of the corpus while still excluding chunks
    that cosine identifies as distant from the form's topic.
  - Section generation / evaluation — the group title is the query
    ("Household Composition", "Earned Income"). k = 5 — a focused
    window of the chunks most relevant to the fields this section
    collects.

- Progress log surfaces whether each stage used retrieval or the
  full-corpus fallback, so demos / debugging show real retrieval
  happening vs a degraded run.

- Test: `test/services/rag/corpus-retriever.test.ts` pins the
  fallback semantics (falls back when RAG_EMBEDDER=hash; returns
  empty for unknown slug).
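
  Those pinned semantics translate to assertions roughly like the
  following (sketch using bun:test; the import path, slug, and env
  handling are assumptions, not the actual test file):

```typescript
import { expect, test } from "bun:test";
import { retrieveOrFullCorpus } from "../../../src/services/rag/corpus-retriever";

test("falls back to the full corpus when RAG_EMBEDDER=hash", async () => {
  process.env.RAG_EMBEDDER = "hash"; // hash scores are noise for NL queries
  const result = await retrieveOrFullCorpus("snap-wisconsin", "Household Composition", 5);
  expect(result.source).toBe("full-corpus");
});

test("returns empty for an unknown slug", async () => {
  const result = await retrieveOrFullCorpus("no-such-corpus", "anything", 5);
  expect(result.chunks).toHaveLength(0);
});
```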

Not doing here

- Re-running the SNAP ablation against the new retrieval path. That
  requires real Bedrock spend (~$0.30). Deferred to whoever
  merges; the ablation variant
  (`bun run cli evaluate authoring no-rag-sonnet`) still exposes
  the with-vs-without-corpus delta.
- Touching the extraction RAG variant. It already uses a retriever
  (different bootstrap, slug-keyed fallback for the hash-embedder
  case). Unifying the two could come later.

Why this is worth reviewing carefully

- Changes the signal every authoring LLM call sees. If top-k is too
  narrow, criteria / structure stages produce less-complete forms.
  Benchmarks needed before trusting k=15 / k=5.
- Adds a failure mode: if Titan access breaks in production, every
  call silently falls back. That's safe (pipeline keeps working)
  but easy to miss. The per-stage log helps; monitoring doesn't
  yet page on it.
