feat(authoring): per-stage retrieval against the project corpus #112
Open
danielnaab wants to merge 1 commit into main from
Conversation
Turns the authoring pipeline's RAG from "corpus-grounded generation"
into "retrieval per stage". Every stage now issues a cosine-similarity
query against a Titan-embedded index of the project's corpus chunks
and sends only the top-k most relevant chunks to the LLM, instead of
the full 21-chunk corpus.
Changes
- New `src/services/rag/corpus-retriever.ts`. Per-slug memoized
retriever bootstrap — embeds each corpus's chunks with Titan on
first use and caches the index for the process lifetime. Exposes
`getCorpusRetriever(slug)` and `retrieveOrFullCorpus(slug, query, k)`.
When Titan access is unavailable, retrieval returns null and the
wrapper gracefully falls back to the full corpus (hash embeddings
produce essentially random cosine scores on natural-language
queries — worse than no retrieval at all).
- Replace every `loadPolicyCorpus({ slug })` call in the authoring
routes with a `retrieveOrFullCorpus(slug, query, k)` call. Six
sites in `src/entrypoints/app/routes/owner/edit/authoring.tsx`:
criteria analysis (runBuild + endpoint), structure planning
(runBuild + endpoint), section generation (runBuild + endpoint),
section evaluation.
- Query construction:
- Criteria / structure — `planningQuery(slug)` derives a broad
query from the corpus's formName: "<Form Name> — eligibility,
required information, application process, regulatory
requirements". k = 15 (out of 21 for SNAP) so the planning
stages see most of the corpus while still excluding chunks
that cosine identifies as distant from the form's topic.
- Section generation / evaluation — the group title is the query
("Household Composition", "Earned Income"). k = 5 — a focused
window of the chunks most relevant to the fields this section
collects.
- Progress log surfaces whether each stage used retrieval or the
full-corpus fallback, so demos and debugging sessions clearly
distinguish real retrieval from a degraded run.
- Test: `test/services/rag/corpus-retriever.test.ts` pins the
fallback semantics (falls back when RAG_EMBEDDER=hash; returns
empty for unknown slug).
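The fallback contract above can be sketched as follows. This is a minimal sketch, not the module's actual code: the corpus store and the "Titan unavailable" path are hypothetical stand-ins, and only `getCorpusRetriever` and `retrieveOrFullCorpus` are names from the real module.

```typescript
type Chunk = { id: string; text: string };
type Retriever = { query(q: string, k: number): Chunk[] };

// Hypothetical stand-in for the project's corpus store (SNAP's real
// corpus is 21 chunks).
const fullCorpus: Record<string, Chunk[]> = {
  "snap-wisconsin": Array.from({ length: 21 }, (_, i) => ({
    id: `chunk-${i}`,
    text: `corpus chunk ${i}`,
  })),
};

// Memoized per slug for the process lifetime.
const retrieverCache = new Map<string, Retriever | null>();

function getCorpusRetriever(slug: string): Retriever | null {
  if (!retrieverCache.has(slug)) {
    // The real module embeds the corpus chunks with Titan here and
    // builds a cosine-similarity index on first use. This sketch
    // simulates "Titan access unavailable" by caching null.
    retrieverCache.set(slug, null);
  }
  return retrieverCache.get(slug) ?? null;
}

function retrieveOrFullCorpus(slug: string, query: string, k: number): Chunk[] {
  const corpus = fullCorpus[slug] ?? [];
  const retriever = getCorpusRetriever(slug);
  // Graceful degradation: with no retriever, send the full corpus
  // (or an empty list for an unknown slug).
  if (retriever === null) return corpus;
  return retriever.query(query, k).slice(0, k);
}
```

This mirrors the pinned test semantics: without Titan, a known slug yields the full corpus and an unknown slug yields an empty list.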
Not doing here
- Re-running the SNAP ablation against the new retrieval path. That
requires real Bedrock spend (roughly $0.30). Deferred to whoever
merges; the ablation variant
(`bun run cli evaluate authoring no-rag-sonnet`) still exposes
the with-vs-without-corpus delta.
- Touching the extraction RAG variant. It already uses a retriever
(different bootstrap, slug-keyed fallback for the hash-embedder
case). Unifying the two could come later.
Why this is worth reviewing carefully
- Changes the signal every authoring LLM call sees. If top-k is too
narrow, criteria / structure stages produce less-complete forms.
Benchmarks needed before trusting k=15 / k=5.
- Adds a failure mode: if Titan access breaks in production, every
call silently falls back. That's safe (pipeline keeps working)
but easy to miss. The per-stage log helps; monitoring doesn't
yet page on it.
Why this matters
The earlier RAG ablation (#104 → #105) established that the corpus is load-bearing — removing it drops authoring recall from 10.6% to 4.7%. But in the current code, every authoring LLM call receives the full 21-chunk corpus. That's "context-injection with a citation," not retrieval.
The catalog page `pdf-field-extraction/sonnet-with-rag` is honest about this: retrieval is real for extraction (slug query, top-2), and explicitly not real for authoring. This PR closes that gap.
What it does
New primitive: `src/services/rag/corpus-retriever.ts`
The hash-embedder fallback is deliberately disabled for authoring. Unlike extraction (which queries by fixture slug, where a deterministic hash works fine as a lookup key), authoring queries are natural language ("Household Composition"). The hash embedder produces essentially random cosine scores for these — worse than just passing every chunk. So when Titan is unavailable we fall back to the full corpus, and the pipeline stays correct.
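A toy illustration of that asymmetry (not the real embedder, just a character-hash sketch): a deterministic hash embeds identical strings identically, so exact slug lookup works, while cosine between different natural-language strings carries no semantic signal.

```typescript
// Toy deterministic "embedding": bucket character codes into a small
// fixed-dimension count vector. Identical inputs always produce
// identical vectors; different inputs produce arbitrary ones.
function hashEmbed(text: string, dim = 8): number[] {
  const v = new Array(dim).fill(0);
  for (let i = 0; i < text.length; i++) {
    v[(text.charCodeAt(i) * 31 + i) % dim] += 1;
  }
  return v;
}

// Standard cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

`cosine(hashEmbed(slug), hashEmbed(slug))` is 1, a perfect lookup; but the score between "Household Composition" and a semantically related chunk is an arbitrary number unrelated to meaning, which is why authoring skips hash retrieval entirely.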
Rewired authoring routes
Every `loadPolicyCorpus({ slug })` call in `src/entrypoints/app/routes/owner/edit/authoring.tsx` — six sites — now calls `retrieveOrFullCorpus(slug, query, k)`.
Query construction:
For SNAP's 21-chunk corpus this means: planning stages see 15 of 21 chunks (cosine excludes the 6 least-relevant); section stages see 5.
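The query-and-k split can be sketched as follows. The `planningQuery` string shape follows the description above; `topK` is a hypothetical stand-in for the index's cosine ranking, not the real implementation.

```typescript
// Broad planning query derived from the corpus's form name, used by the
// criteria and structure stages with k = 15.
function planningQuery(formName: string): string {
  return `${formName} — eligibility, required information, application process, regulatory requirements`;
}

// Hypothetical top-k selection over (chunk, cosine score) pairs: keep
// the k chunks the index scores as closest to the query.
function topK<T>(scored: { item: T; score: number }[], k: number): T[] {
  return [...scored]
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.item);
}
```

Section stages skip `planningQuery` and use the group title directly as the query ("Household Composition") with k = 5.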
Visible log output
Each stage logs whether it retrieved or fell back. Sample progress stream for a build:
```
Corpus: snap-wisconsin
criteria: retrieval (15 chunks)
structure: retrieval (15 chunks)
[Applicant Information] retrieval (5 chunks)
[Household Composition] retrieval (5 chunks)
…
```