diff --git a/AGENTS.md b/AGENTS.md
index be93c0bc2..9caaa7e25 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -14,11 +14,12 @@ Co-Authored-By: (agent model name)
 
 ## File-Scoped Commands
 
-| Task                  | Command                                                             |
-| --------------------- | ------------------------------------------------------------------- |
-| Unit test file        | `pnpm --filter @sentry/junior exec vitest run path/to/file.test.ts` |
-| Integration test file | `pnpm --filter @sentry/junior exec vitest run path/to/file.test.ts` |
-| Eval file             | `pnpm --filter @sentry/junior-evals evals path/to/eval.test.ts`     |
+| Task                  | Command                                                                        |
+| --------------------- | ------------------------------------------------------------------------------ |
+| Unit test file        | `pnpm --filter @sentry/junior exec vitest run path/to/file.test.ts`            |
+| Integration test file | `pnpm --filter @sentry/junior exec vitest run path/to/file.test.ts`            |
+| Eval file             | `pnpm --filter @sentry/junior-evals evals path/to/eval.eval.ts`                |
+| Eval case filter      | `pnpm --filter @sentry/junior-evals evals path/to/eval.eval.ts -t "case name"` |
 
 ## Key Conventions
 
@@ -27,8 +28,11 @@ Co-Authored-By: (agent model name)
 - Use `/skill-writer` skill when creating or updating skills.
 - Prefer integration tests for most product/runtime changes that need real wiring.
 - Use evals as the integration-style layer for agent/prompt/natural-language behavior. See `packages/junior-evals/README.md`.
+- In evals, use the normalized `vitest-evals` session and `toolCalls(result.session)` as the primary assertion surfaces; do not invent local transcript/tool-call schemas.
 - For any non-Slack-specific product/runtime/prompt/tool/plugin behavior change, use the local agent as the first manual behavior check. Follow `packages/docs/src/content/docs/contribute/local-agent-validation.md`; from this monorepo, run it with `pnpm cli -- chat ...`, which uses `apps/example` as the canonical local validation app.
-- Run evals from Codex as escalated host commands when they need real Vercel Sandbox/network access; use `pnpm evals` for the full suite.
+- Run evals through the package scripts only: `pnpm evals` for the full suite or `pnpm --filter @sentry/junior-evals evals ...` for focused runs. Do not use `pnpm exec vitest` directly for evals.
+- Pass eval file paths and `-t` filters directly after `evals`; do not insert `--` before eval args.
+- Run evals from Codex as escalated host commands when they need real Vercel Sandbox/network access.
 - If evals fail from missing or expired Gateway/Vercel credentials, run `pnpm dev:env` to refresh secrets before retrying.
 - Use instrumentation conventions from `specs/instrumentation.md`.
 - Use OpenTelemetry semantic keys for logs; when no semantic key exists, use `app.*`.
diff --git a/packages/docs/src/content/docs/contribute/testing.md b/packages/docs/src/content/docs/contribute/testing.md
index d54188cbf..ba52c9467 100644
--- a/packages/docs/src/content/docs/contribute/testing.md
+++ b/packages/docs/src/content/docs/contribute/testing.md
@@ -35,7 +35,7 @@ pnpm --filter @sentry/junior exec vitest run path/to/file.test.ts
 Run one eval file:
 
 ```bash
-pnpm --filter @sentry/junior-evals evals path/to/eval.test.ts
+pnpm --filter @sentry/junior-evals evals path/to/eval.eval.ts
 ```
 
 ## Notes
diff --git a/packages/junior-evals/README.md b/packages/junior-evals/README.md
index 00c0c6c4e..3d3363280 100644
--- a/packages/junior-evals/README.md
+++ b/packages/junior-evals/README.md
@@ -6,7 +6,7 @@ Evals are end-to-end Slack conversation evaluations. They are the integration-st
 
 - We define conversation cases inline in TypeScript using `describeEval()` and the shared `slackEvals` harness options.
 - We run the real runtime/harness against those fixtures.
-- We score outcomes with a `vitest-evals` judge that reuses the Slack harness prompt seam, backed by Junior's Pi client and the Vercel AI Gateway model `openai/gpt-5.4`.
+- We score outcomes against the normalized `vitest-evals` session surface, backed by Junior's Pi client and the Vercel AI Gateway model `openai/gpt-5.4`.
 
 ## Layer Boundaries
 
@@ -58,9 +58,17 @@ For each `it()` case inside a `describeEval()` suite:
 1. Replay events through the harness via `runEvalScenario()`.
 2. Create a fresh runtime instance for the case via the chat composition root; do not mutate the production singleton runtime.
 3. Route message events through real ingress + queue-worker behavior, with only the external queue transport replaced by an in-memory harness shim.
-4. Return observed artifacts as JSON for LLM judgment, including structured `assistant_posts` with text plus actual attached-file metadata, and Slack-visible metadata.
-   The helper pretty-prints this JSON so failure output stays readable in local runs and CI.
-5. `vitest-evals` scores the output against `criteria` (A–E → 1.0–0.0).
+4. Return a standard `vitest-evals` `HarnessRun`; `result.session` is the canonical normalized surface for judge scoring and deterministic assertions.
+5. Do not create a second repo-local transcript, event-log, or assertion schema when `vitest-evals` already has `session`, `toolCalls(result.session)`, `artifacts`, or `traces`.
+6. `vitest-evals` scores the normalized session against `criteria` (A–E -> 1.0-0.0).
+
+## Harness Boundaries
+
+- Use the Slack eval harness for Slack/runtime behavior: mentions, thread/channel delivery, OAuth privacy, lifecycle/resume behavior, reactions, and Slack-visible side effects.
+- Use an agent-level harness for prompt, skill routing, tool choice, provider/tool calls, and reply quality when Slack transport is not the behavior under test.
+- The Slack eval harness session is an observed Slack output/tool/artifact projection. Do not add a repo-local sequencing layer to make it look like a full ordered conversation transcript.
+- When the eval boundary is Junior's Pi agent or needs an ordered full-turn transcript, prefer `@vitest-evals/harness-pi-ai` primitives instead of rebuilding transcript capture locally. The Pi harness already owns normalized `session.messages`, `toolCalls(result.session)`, artifacts, traces, replay, and judge context.
+- Do not assert against logs, spans, or status telemetry for product behavior. Use `vitest-evals` session/tool/artifact primitives for behavior contracts; reserve traces/spans for instrumentation tests or diagnostics.
 
 Harness override knobs (in `EvalOverrides`):
 
@@ -85,7 +93,10 @@ Tool replay:
 
 - `pnpm evals`: Run all eval cases (from workspace root)
 - `pnpm --filter @sentry/junior-evals evals`: Run from any directory
-- `pnpm --filter @sentry/junior-evals evals -- -t "subscribed"`: Filter by test name pattern
+- `pnpm --filter @sentry/junior-evals evals evals/sentry/skill-workflows.eval.ts`: Run one eval file
+- `pnpm --filter @sentry/junior-evals evals evals/sentry/skill-workflows.eval.ts -t "subscribed"`: Run one eval case by name
+
+Pass eval file paths and `-t` filters directly after the `evals` script. Do not use `pnpm exec vitest` directly, and do not insert `--` before eval arguments.
 
 ## Optional CI Runs
 
@@ -109,11 +120,11 @@ Evals require real Vercel Sandbox access. If sandbox bootstrap fails, the eval f
 - Use `auto_complete_mcp_oauth` or `auto_complete_oauth` when the harness should instantly complete the fake provider callback after our app has genuinely initiated auth.
 - For multi-turn, pass the same `thread` override so events land in one thread.
 - Keep each case focused on one primary behavior.
-- Encode all expectations in `criteria`; do not add deterministic inline assertions.
-- New and edited evals must express `criteria` with `rubric({ contract, pass, allow, fail })`.
-- `contract` should name the user-visible behavior being proven.
+- Put semantic, model-dependent expectations in `criteria`.
+- Put deterministic boundary expectations in normal Vitest assertions against `result.session`, `toolCalls(result.session)`, or `result.artifacts`. Prefer `vitest-evals` primitives over local helper-specific output shapes.
+- New and edited evals must express `criteria` with `rubric({ pass, fail })`.
+- Let the eval test name describe the scenario and expected outcome.
 - `pass` should list observable pass conditions.
-- `allow` should list acceptable optional variations.
 - `fail` should list forbidden outputs or failure conditions.
 - Do not write judge criteria as one dense paragraph.
 - Let the `describeEval()` block own the behavior area. The file path and `describeEval()` context already provide scope.
@@ -128,6 +139,7 @@ Do not do these in eval files:
 - Do not import `@/chat/slack/*` directly.
 - Do not use MSW Slack helpers (`queueSlackApiResponse`, `getCapturedSlackApiCalls`, `queueSlackApiError`, `queueSlackRateLimit`).
 - Do not validate raw Slack Web API request payload shapes from evals.
+- Do not invent parallel transcript, event-log, or tool-call schemas for assertions. If the existing `vitest-evals` primitives are insufficient, improve the harness boundary first.
 - Do not validate implementation internals (exact tool names, sandbox IDs, or other non-user-visible details) unless the scenario explicitly evaluates those surfaces.
 
 ## File Naming Strategy
@@ -155,12 +167,13 @@ Good conversational evals should:
 - Describe user-visible outcomes first (reply count, reply content, metadata effects visible to Slack users).
 - Use concrete real-world scenarios (incident updates, planning follow-ups, capability setup requests), not abstract mechanics like "posted two replies."
 - Use judge criteria written in product language, not implementation language.
-- Use rubric sections that are easy for maintainers to scan in a failure: one `contract`, a short `pass` list, and focused `allow` / `fail` lists only when needed.
+- Use rubric sections that are easy for maintainers to scan in a failure: a short `pass` list and a focused `fail` list only when it describes a real regression.
 - Keep rubric bullets at the behavior level. Prefer "uses the stored repo as the target" over requiring exact wording or incidental reply ordering.
-- Put incidental variation in `allow`, not `pass`. Omit `fail` bullets unless they describe a real regression or unsafe side effect.
+- Omit incidental variation from the rubric unless it affects the behavior contract.
+- Omit `fail` bullets unless they describe a real regression or unsafe side effect.
 - Use fake/nonexistent external targets unless the eval explicitly opts into live provider access.
 - Cover realistic failure behavior with clear user-visible errors.
-- Use tool-call traces when they prove behavior at a real boundary, such as source grounding, mutation safety, provider routing, or auth sequencing.
+- Use `toolCalls(result.session)` when tool/provider evidence proves behavior at a real boundary, such as source grounding, mutation safety, provider routing, or auth sequencing.
 
Avoid: @@ -180,7 +193,6 @@ describeEval("Routing", slackEvals, (it) => { await run({ events: [mention("<@U_APP> summarize this")], criteria: rubric({ - contract: "An explicit mention gets one direct reply.", pass: ["The assistant posts exactly one reply to the mention."], }), }); diff --git a/packages/junior-evals/evals/core/coding-file-tools.eval.ts b/packages/junior-evals/evals/core/coding-file-tools.eval.ts index 8590f6855..7dd76eb81 100644 --- a/packages/junior-evals/evals/core/coding-file-tools.eval.ts +++ b/packages/junior-evals/evals/core/coding-file-tools.eval.ts @@ -17,8 +17,6 @@ describeEval("Coding File Tools", slackEvals, (it) => { ), ], criteria: rubric({ - contract: - "A small source edit in the sandbox fixture updates the requested value and reports the changed file.", pass: [ "The final reply identifies the changed config file and says the default retry count is now 3.", ], @@ -41,8 +39,6 @@ describeEval("Coding File Tools", slackEvals, (it) => { ), ], criteria: rubric({ - contract: - "A sandbox fixture comparison returns grounded file-path evidence without claiming to modify files.", pass: [ "The reply cites the alert source file and the operations doc using recognizable fixture-relative paths.", "The reply accurately summarizes that source code handles emergency alerts while the operations doc describes escalation or operator behavior.", diff --git a/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts b/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts index f30a9b563..8f58733ad 100644 --- a/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts +++ b/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts @@ -8,8 +8,6 @@ describeEval("Lifecycle and Resilience", slackEvals, (it) => { await run({ events: [threadStart()], criteria: rubric({ - contract: - "The assistant initializes Slack thread metadata without posting a visible reply.", pass: [ "No assistant reply is posted.", "The thread title is 
set exactly once.", @@ -26,10 +24,8 @@ describeEval("Lifecycle and Resilience", slackEvals, (it) => { overrides: { fail_reply_call: 1 }, events: [mention("What's the status of the deploy?")], criteria: rubric({ - contract: - "When reply generation fails before any answer is posted, the user still gets one clear failure reply.", pass: [ - "assistant_posts contains exactly one reply.", + "The normalized transcript contains exactly one assistant thread reply.", "That reply clearly tells the user the request failed in user-facing language.", ], fail: [ @@ -54,10 +50,8 @@ describeEval("Lifecycle and Resilience", slackEvals, (it) => { }, events: [mention("Quick budget update?")], criteria: rubric({ - contract: - "A provider interruption preserves the partial answer and marks that same reply as interrupted.", pass: [ - "assistant_posts contains exactly one reply because this answer fits in a single Slack post.", + "The normalized transcript contains exactly one assistant thread reply because this answer fits in a single Slack post.", "That reply includes the budget update that it is still on track for Friday.", "That same reply clearly says the response was interrupted before completion.", ], diff --git a/packages/junior-evals/evals/core/media-and-attachments.eval.ts b/packages/junior-evals/evals/core/media-and-attachments.eval.ts index 27cb17180..fd270ef9e 100644 --- a/packages/junior-evals/evals/core/media-and-attachments.eval.ts +++ b/packages/junior-evals/evals/core/media-and-attachments.eval.ts @@ -9,8 +9,6 @@ describeEval("Media and Attachments", slackEvals, (it) => { overrides: { mock_image_generation: true }, events: [mention("show me how you feel")], criteria: rubric({ - contract: - "An image-generation prompt returns an actual image attachment in the thread.", pass: ["The assistant responds by attaching an image in the thread."], fail: [ "Do not respond with text that merely describes an image.", diff --git 
a/packages/junior-evals/evals/core/oauth-workflows.eval.ts b/packages/junior-evals/evals/core/oauth-workflows.eval.ts index 0c04b6f40..e516c72af 100644 --- a/packages/junior-evals/evals/core/oauth-workflows.eval.ts +++ b/packages/junior-evals/evals/core/oauth-workflows.eval.ts @@ -1,61 +1,53 @@ -import { describeEval } from "vitest-evals"; +import { assistantMessages, describeEval, toolCalls } from "vitest-evals"; +import type { HarnessRun } from "vitest-evals/harness"; import { expect } from "vitest"; import { rubric, slackEvals, threadMessage } from "../helpers"; -type EvalOutput = { - assistant_posts?: Array<{ - text?: string; - channel?: string; - thread_ts?: string; - }>; - observed_tool_invocations?: Array<{ - tool?: string; - skill_name?: string; - bash_command?: string; - }>; -}; +type EvalRun = HarnessRun; -function outputOf(result: { output?: unknown }): EvalOutput { - return (result.output ?? {}) as EvalOutput; +function textContent(value: unknown): string { + return typeof value === "string" ? value : ""; } -function postTexts(output: EvalOutput): string[] { - return output.assistant_posts?.map((post) => post.text ?? "") ?? 
[]; -} - -function expectNoPublicOAuthUrl(output: EvalOutput): void { - expect(postTexts(output).join("\n")).not.toMatch( +function expectNoPublicOAuthUrl(result: EvalRun): void { + const visibleText = assistantMessages(result.session) + .map((message) => textContent(message.content)) + .join("\n"); + expect(visibleText).not.toMatch( /https?:\/\/[^\s|>]*(oauth|authorize|callback)[^\s|>]*/i, ); } -function expectEvalOauthIdentityCheck(output: EvalOutput): void { - expect(output.observed_tool_invocations).toEqual( +function expectEvalOauthIdentityCheck(result: EvalRun): void { + expect(toolCalls(result.session)).toEqual( expect.arrayContaining([ expect.objectContaining({ - tool: "loadSkill", - skill_name: "eval-oauth", + name: "loadSkill", + arguments: expect.objectContaining({ + skill_name: "eval-oauth", + }), }), expect.objectContaining({ - tool: "bash", - bash_command: "curl -fsSL https://example.com/junior-eval-oauth/whoami", + name: "bash", + arguments: expect.objectContaining({ + command: "curl -fsSL https://example.com/junior-eval-oauth/whoami", + }), }), ]), ); } function expectFinalThreadReply( - output: EvalOutput, + result: EvalRun, thread: { channel_id: string; thread_ts: string }, pattern: RegExp, ): void { - const matchingPosts = - output.assistant_posts?.filter( - (post) => - post.channel === thread.channel_id && - post.thread_ts === thread.thread_ts && - pattern.test(post.text ?? ""), - ) ?? 
[]; + const matchingPosts = assistantMessages(result.session).filter( + (message) => + message.metadata?.channel === thread.channel_id && + message.metadata?.thread_ts === thread.thread_ts && + pattern.test(textContent(message.content)), + ); expect(matchingPosts.length).toBeGreaterThan(0); } @@ -88,19 +80,11 @@ describeEval("OAuth Workflows", slackEvals, (it) => { ), ], criteria: rubric({ - contract: - "After MCP authorization completes, the same thread gets a resumed answer that keeps prior context.", pass: [ "The same Slack thread later gets a resumed answer after authorization completes.", "Because the eval harness auto-completes MCP authorization off-transcript, treat a later same-thread resumed answer as evidence that authorization completed.", "The resumed answer explicitly says the earlier budget deadline was Friday.", ], - allow: [ - "A private auth-link handoff is expected and does not need to appear in assistant_posts.", - "A single URL-free public acknowledgement that authorization is needed, including a note to check the private link, is acceptable before the resumed answer.", - "A concise resumed answer that only restates the budget deadline is acceptable.", - "A brief connection or continuation notice is acceptable before the resumed answer.", - ], fail: [ "Do not post the authorization URL in the public thread.", "Do not ask the user to repeat the deadline.", @@ -109,9 +93,8 @@ describeEval("OAuth Workflows", slackEvals, (it) => { ], }), }); - const output = outputOf(result); - expectNoPublicOAuthUrl(output); - expectFinalThreadReply(output, mcpAuthResumeThread, /\bFriday\b/i); + expectNoPublicOAuthUrl(result); + expectFinalThreadReply(result, mcpAuthResumeThread, /\bFriday\b/i); }); const oauthResumeThread = { @@ -142,18 +125,10 @@ describeEval("OAuth Workflows", slackEvals, (it) => { ), ], criteria: rubric({ - contract: - "After generic OAuth authorization completes, the same thread gets a resumed answer that keeps prior context.", pass: [ "The 
same Slack thread gets a resumed answer after authorization completes.", "The resumed answer explicitly says the earlier budget deadline was Friday.", ], - allow: [ - "A private auth-link handoff is expected and does not need to appear in assistant_posts.", - "A single URL-free public acknowledgement that authorization is needed, including a note to check the private link, is acceptable before the resumed answer.", - "A concise resumed answer that only restates the budget deadline is acceptable.", - "A brief connection or continuation notice is acceptable before the resumed answer or in the same message as the resumed answer.", - ], fail: [ "Do not post the authorization URL in the public thread.", "Do not ask the user to repeat the deadline.", @@ -162,10 +137,9 @@ describeEval("OAuth Workflows", slackEvals, (it) => { ], }), }); - const output = outputOf(result); - expectNoPublicOAuthUrl(output); - expectEvalOauthIdentityCheck(output); - expectFinalThreadReply(output, oauthResumeThread, /\bFriday\b/i); + expectNoPublicOAuthUrl(result); + expectEvalOauthIdentityCheck(result); + expectFinalThreadReply(result, oauthResumeThread, /\bFriday\b/i); }); const oauthReconnectThread = { @@ -189,28 +163,20 @@ describeEval("OAuth Workflows", slackEvals, (it) => { ), ], criteria: rubric({ - contract: - "An explicit reconnect request can drive a fresh authorization cycle to completion in the same thread.", pass: [ "The thread gets a connected or processing notice in the same thread.", "The reconnect flow ends with a short connected confirmation or success follow-up in the same thread.", ], - allow: [ - "A brief 'Processing your request' continuation notice is acceptable if the final follow-up stays focused on the reconnect result.", - "A single initial auth-needed notice is acceptable before the harness auto-completes authorization.", - "The auth-link handoff itself may happen off-thread and does not need to appear in the visible thread transcript.", - ], fail: [ "Do not ask the 
user to authorize again after the reconnect has already completed.", "Do not post a generic failure message.", ], }), }); - const output = outputOf(result); - expectNoPublicOAuthUrl(output); - expectEvalOauthIdentityCheck(output); + expectNoPublicOAuthUrl(result); + expectEvalOauthIdentityCheck(result); expectFinalThreadReply( - output, + result, oauthReconnectThread, /connected|reconnected/i, ); diff --git a/packages/junior-evals/evals/core/output-contract.eval.ts b/packages/junior-evals/evals/core/output-contract.eval.ts index 6095f1a14..fedce4222 100644 --- a/packages/junior-evals/evals/core/output-contract.eval.ts +++ b/packages/junior-evals/evals/core/output-contract.eval.ts @@ -13,15 +13,10 @@ describeEval("Output Contract", slackEvals, (it) => { ], requireSandboxReady: false, criteria: rubric({ - contract: - "Structured multi-section replies do not use hash-prefixed markdown heading markers.", pass: [ "The assistant posts one reply that covers the authorization request, token exchange, and refresh.", "No section label line starts with `#`, `##`, or `###`.", ], - allow: [ - "Bolded title lines, bolded section labels, and numbered bold labels are acceptable.", - ], fail: [ "Do not use lines beginning with `#`, `##`, or `###` for section labels.", "Do not paste a hash-heading line like `# Authorization Request` at the start of a section.", @@ -41,8 +36,6 @@ describeEval("Output Contract", slackEvals, (it) => { ], requireSandboxReady: false, criteria: rubric({ - contract: - "URLs in Slack replies render as plain URLs, not markdown hyperlinks.", pass: [ "The assistant posts one reply that names the three documentation starting points.", "Each URL appears as a bare URL in the reply text, not wrapped in markdown link syntax.", @@ -66,8 +59,6 @@ describeEval("Output Contract", slackEvals, (it) => { ], requireSandboxReady: false, criteria: rubric({ - contract: - "Comparative Slack replies present structured data with bullets or bolded labels rather than markdown 
tables.", pass: [ "The assistant posts one reply that compares REST and GraphQL across caching, over-fetching, and tooling maturity.", "The comparison is expressed through bullets or bolded labels with short explanations, not a table.", diff --git a/packages/junior-evals/evals/core/passive-behavior.eval.ts b/packages/junior-evals/evals/core/passive-behavior.eval.ts index f5e6cfd6c..eb105203d 100644 --- a/packages/junior-evals/evals/core/passive-behavior.eval.ts +++ b/packages/junior-evals/evals/core/passive-behavior.eval.ts @@ -29,8 +29,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "A later human-to-human question is ignored even when it is phrased like something Junior could answer.", pass: [ "The assistant posts exactly one reply: the initial helpful answer about the deploy.", ], @@ -63,8 +61,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "A follow-up that clearly refers to Junior's prior answer gets a reply even without another @mention.", pass: [ "The assistant posts exactly two replies in order.", "The second reply plainly restates that the budget is needed by Friday.", @@ -98,8 +94,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "A casual pronoun-based question stays ignored when it reads like human-to-human discussion rather than a turn back to Junior.", pass: [ "The assistant posts exactly one reply: the initial helpful answer about the deploy.", ], @@ -134,8 +128,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "Shared domain vocabulary alone does not make a later human discussion message directed at Junior.", pass: [ "The assistant posts exactly one reply: the initial answer about the billing worker.", ], @@ -164,8 +156,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { threadMessage("Can you check on this?", { thread: canYouThread }), ], 
criteria: rubric({ - contract: - "A casual 'can you' request is ignored when it is directed at a coworker, not at Junior.", pass: [ "The assistant posts exactly one reply: the initial answer about deployment status.", ], @@ -199,8 +189,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "An explicit request to expand Junior's prior answer gets a second reply.", pass: [ "The assistant posts exactly two replies in order.", "The second reply provides more detail about the deploy changes.", @@ -234,8 +222,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "A terse clarification right after Junior's reply is treated as directed back to Junior.", pass: [ "The assistant posts exactly two replies in order.", "The second reply clarifies which services changed.", @@ -269,8 +255,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { }), ], criteria: rubric({ - contract: - "Once humans resume the conversation, a later same-topic question stays ignored unless it clearly turns back to Junior.", pass: [ "The assistant posts exactly one reply: the initial deploy summary.", ], @@ -307,8 +291,6 @@ describeEval("Passive Behavior", slackEvals, (it) => { mention("Actually jump back in.", { thread: optOutThread }), ], criteria: rubric({ - contract: - "An explicit stop request pauses thread participation until the assistant is mentioned again.", pass: [ "The assistant posts exactly three visible replies in order.", "The first reply is a normal helpful reply to the initial mention.", diff --git a/packages/junior-evals/evals/core/research-reply-shape.eval.ts b/packages/junior-evals/evals/core/research-reply-shape.eval.ts index aa81a3d7c..5c32f0a70 100644 --- a/packages/junior-evals/evals/core/research-reply-shape.eval.ts +++ b/packages/junior-evals/evals/core/research-reply-shape.eval.ts @@ -13,8 +13,6 @@ describeEval("Research Reply Shape", slackEvals, (it) => { ], requireSandboxReady: 
false, criteria: rubric({ - contract: - "A multi-source research request returns a concise Slack-style answer without process chatter.", pass: [ "The thread reply is a concise researched answer, not a status update or process note.", "The answer coherently summarizes Slack agent streaming across the provided sources.", @@ -38,15 +36,13 @@ describeEval("Research Reply Shape", slackEvals, (it) => { ], requireSandboxReady: false, criteria: rubric({ - contract: - "A long-form reference deliverable becomes a Slack canvas, with the thread reserved for a short summary and pointer.", pass: [ "The assistant creates a single useful canvas for the requested Slack streaming reference.", "The canvas is a structured artifact that covers the supplied Slack streaming notes.", "The thread reply stays brief and points to the canvas instead of pasting the full document inline.", ], fail: [ - "Do not paste the entire long-form reference artifact directly into assistant_posts.", + "Do not paste the entire long-form reference artifact directly into the assistant thread reply.", "Do not create multiple canvases for this one research request.", "Do not add process chatter such as 'let me check', 'fetching', or similar tool-progress narration.", "Do not use web discovery when the prompt supplies the material to organize.", diff --git a/packages/junior-evals/evals/core/routing-and-continuity.eval.ts b/packages/junior-evals/evals/core/routing-and-continuity.eval.ts index 4297d907e..cfa2d5247 100644 --- a/packages/junior-evals/evals/core/routing-and-continuity.eval.ts +++ b/packages/junior-evals/evals/core/routing-and-continuity.eval.ts @@ -8,8 +8,6 @@ describeEval("Routing and Continuity", slackEvals, (it) => { await run({ events: [threadMessage("<@U_APP> what is 2+2?", { is_mention: true })], criteria: rubric({ - contract: - "An explicit @mention in a thread always gets a direct reply.", pass: [ "The assistant posts exactly one reply.", "The reply answers with 4.", @@ -25,14 +23,9 @@ 
describeEval("Routing and Continuity", slackEvals, (it) => {
     await run({
       events: [mention("@bot say hello to the channel!")],
       criteria: rubric({
-        contract:
-          "A user request to post in-channel is delivered as a channel post, not as a thread reply.",
         pass: [
-          "channel_posts contains exactly one hello-style message with no thread_ts.",
-          "assistant_posts does not contain that hello-style message as a thread reply.",
-        ],
-        allow: [
-          "A lightweight acknowledgement reaction in reactions is acceptable.",
+          "The normalized transcript contains exactly one hello-style channel_post assistant message with no thread_ts.",
+          "The normalized transcript does not contain that hello-style message as a thread reply.",
         ],
       }),
     });
@@ -48,11 +41,9 @@ describeEval("Routing and Continuity", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "A request for another named channel does not get silently redirected to the current channel.",
         pass: [
-          "channel_posts is empty.",
-          "assistant_posts contains exactly one reply.",
+          "The normalized transcript contains no channel_post assistant message.",
+          "The normalized transcript contains exactly one assistant thread reply.",
           "That reply clearly says the assistant can only post to the current channel or cannot post to #discuss-design-engineering from here.",
         ],
         fail: [
@@ -96,8 +87,6 @@ describeEval("Routing and Continuity", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "When explicitly asked to do work in a multi-participant thread, the assistant treats itself as the requested actor.",
         pass: [
           "The assistant posts exactly two replies in order.",
           "The second reply drafts a one-sentence status update about the paused billing rollout and retry queue.",
@@ -117,9 +106,9 @@ describeEval("Routing and Continuity", slackEvals, (it) => {
     await run({
       events: [mention("react to this")],
       criteria: rubric({
-        contract:
-          "A reaction-only request is satisfied with reactions instead of reply clutter.",
-        pass: ["reactions contains at least one added reaction."],
+        pass: [
+          "The normalized transcript contains at least one reaction_added assistant message.",
+        ],
         fail: [
           "Do not add a redundant thread reply that echoes the emoji.",
           "Do not add a short acknowledgement reply such as 'Done'.",
@@ -146,8 +135,6 @@ describeEval("Routing and Continuity", slackEvals, (it) => {
         }),
       ],
       criteria: rubric({
-        contract:
-          "A later question in the same thread can reference earlier context without restating it.",
         pass: [
           "The assistant posts exactly two replies in order.",
           "The second reply explicitly references the earlier budget context, including budget and/or Friday.",
diff --git a/packages/junior-evals/evals/core/scheduler.eval.ts b/packages/junior-evals/evals/core/scheduler.eval.ts
index 2750f5ef9..fae9ec7f3 100644
--- a/packages/junior-evals/evals/core/scheduler.eval.ts
+++ b/packages/junior-evals/evals/core/scheduler.eval.ts
@@ -2,6 +2,46 @@ import { describeEval } from "vitest-evals";
 import { mention, rubric, scheduledTaskDue, slackEvals } from "../helpers";

 describeEval("Scheduler", slackEvals, (it) => {
+  it("when asked for a simple one-off reminder, create it without asking for confirmation", async ({
+    run,
+  }) => {
+    await run({
+      events: [mention("@bot remind me in 1 minute to wash my hands")],
+      criteria: rubric({
+        pass: [
+          "The reply confirms that a one-off reminder to wash hands was scheduled.",
+          "The schedule creation omits recurrence.",
+          "The reply does not ask the user to confirm first.",
+        ],
+        fail: [
+          "Do not ask the user to confirm the reminder before creating it.",
+          "Do not ask the user to provide a channel ID.",
+          "Do not describe the reminder as a recurring schedule.",
+        ],
+      }),
+    });
+  });
+
+  it("when asked for a terse one-off reminder, create it without recurrence", async ({
+    run,
+  }) => {
+    await run({
+      events: [mention("@bot remind me to drink water in 1m")],
+      criteria: rubric({
+        pass: [
+          "The reply confirms that a one-off reminder to drink water was scheduled.",
+          "The schedule creation omits recurrence.",
+          "The reply does not ask the user to retry with a different one-time format.",
+        ],
+        fail: [
+          "Do not reject the request as an invalid one-off task format.",
+          "Do not ask the user to confirm the reminder before creating it.",
+          "Do not describe the reminder as a recurring schedule.",
+        ],
+      }),
+    });
+  });
+
   it("when asked for a specific one-off reminder, preserve the future work in the schedule", async ({
     run,
   }) => {
@@ -12,8 +52,6 @@ describeEval("Scheduler", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "A one-off reminder request is scheduled with the future reminder work preserved as the task.",
         pass: [
           "The observed slackScheduleCreateTask tool call has schedule_kind=one_off.",
           "The observed slackScheduleCreateTask tool call omits recurrence.",
@@ -27,6 +65,30 @@ describeEval("Scheduler", slackEvals, (it) => {
     });
   });

+  it("when asked to schedule clear recurring work, create it without confirmation", async ({
+    run,
+  }) => {
+    await run({
+      events: [
+        mention(
+          "@bot schedule this every Monday at 9am Pacific: check open GitHub issues about the scheduler and post a short digest here.",
+        ),
+      ],
+      criteria: rubric({
+        pass: [
+          "The created task describes checking scheduler-related GitHub issues, not creating a schedule.",
+          "The schedule creation sets recurrence=weekly.",
+          "The reply confirms the recurring schedule was created for Monday at 9am Pacific.",
+        ],
+        fail: [
+          "Do not ask the user to confirm before creating the clear recurring task.",
+          "Do not ask the user to provide a channel ID.",
+          "Do not only give instructions for how the user can set up an external cron.",
+        ],
+      }),
+    });
+  });
+
   it("when a one-off reminder becomes due, deliver the reminder outcome", async ({
     run,
   }) => {
@@ -38,10 +100,8 @@ describeEval("Scheduler", slackEvals, (it) => {
         }),
       ],
       criteria: rubric({
-        contract:
-          "A due one-off scheduled task is executed now and posts the requested reminder outcome to the destination channel.",
         pass: [
-          "The channel_posts output contains a Slack channel message saying standup moved to 10:30 today.",
+          "The normalized session includes a Slack channel message saying standup moved to 10:30 today.",
           "The delivered message is the reminder content itself, not a schedule creation confirmation.",
           "The delivered message does not ask for clarification or confirmation.",
         ],
@@ -70,10 +130,8 @@ describeEval("Scheduler", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "A due recurring scheduled task is executed for the current occurrence and posts the requested reminder outcome to the destination channel.",
         pass: [
-          "The channel_posts output contains a Slack channel message reminding people to submit timesheets by 5pm today.",
+          "The normalized session includes a Slack channel message reminding people to submit timesheets by 5pm today.",
           "The delivered message treats this as the current due occurrence.",
           "The delivered message is not just a confirmation that a recurring task exists.",
         ],
diff --git a/packages/junior-evals/evals/core/skill-infra.eval.ts b/packages/junior-evals/evals/core/skill-infra.eval.ts
index 9cb810a05..a7060136e 100644
--- a/packages/junior-evals/evals/core/skill-infra.eval.ts
+++ b/packages/junior-evals/evals/core/skill-infra.eval.ts
@@ -9,8 +9,6 @@ describeEval("Skill Infrastructure", slackEvals, (it) => {
       overrides: { skill_dirs: ["evals/fixtures/skills"] },
       events: [mention("/candidate-brief David Cramer")],
       criteria: rubric({
-        contract:
-          "A skill command can return a single candidate brief in one reply.",
         pass: [
           "The assistant posts exactly one reply for David Cramer.",
           "The reply is a candidate brief with role, team, and location-style details.",
@@ -41,8 +39,6 @@ describeEval("Skill Infrastructure", slackEvals, (it) => {
         }),
       ],
       criteria: rubric({
-        contract:
-          "The same skill can be invoked twice in one thread without losing ordering or context.",
         pass: [
           "Across two turns in one thread, the assistant posts exactly two replies in order: Alice first, then Bob.",
           "Each reply addresses the requested candidate by name.",
@@ -60,8 +56,6 @@ describeEval("Skill Infrastructure", slackEvals, (it) => {
       overrides: { skill_dirs: ["evals/fixtures/skills"] },
       events: [mention("/list-working-directory")],
       criteria: rubric({
-        contract:
-          "A simple infrastructure skill can list the working directory in one reply.",
         pass: [
           "The assistant posts exactly one working-directory listing reply.",
           "That reply includes a file-list section such as 'Working directory files:'.",
@@ -82,8 +76,6 @@ describeEval("Skill Infrastructure", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "A verification request uses the available source-backed skill and returns the checked answer instead of offering to check later.",
         pass: [
           "The assistant posts exactly one final answer.",
           "The answer says closed tracking issues alone do not prove capability support.",
@@ -111,16 +103,11 @@ describeEval("Skill Infrastructure", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "An MCP-backed skill can complete a natural lookup by using the provider result instead of surfacing tool validation errors.",
         pass: [
           "The visible thread output includes a final answer based on the demo MCP provider result.",
           "The visible thread output refers to the handbook or US holidays request.",
           "The visible thread output does not claim the MCP lookup was blocked by missing arguments.",
         ],
-        allow: [
-          "The final answer may be a concise paraphrase of the eval handbook result.",
-        ],
         fail: [
           'Do not include `expected string, received undefined` or `"query"` argument validation errors.',
           "Do not ask the user to provide a page URL or repeat the request.",
diff --git a/packages/junior-evals/evals/core/skill-invocation-control.eval.ts b/packages/junior-evals/evals/core/skill-invocation-control.eval.ts
index a7c9ce756..1a349d1f3 100644
--- a/packages/junior-evals/evals/core/skill-invocation-control.eval.ts
+++ b/packages/junior-evals/evals/core/skill-invocation-control.eval.ts
@@ -11,15 +11,9 @@ describeEval("Skill Invocation Control", slackEvals, (it) => {
       overrides: { skill_dirs: skillDirs },
       events: [mention("What's the weather like in San Francisco today?")],
       criteria: rubric({
-        contract:
-          "A command-style weather skill does not answer ordinary weather questions unless the user explicitly asks for that skill.",
         pass: [
           "The assistant does not return the weather-lookup skill's simulated report (72°F, partly cloudy, 8 mph NW).",
         ],
-        allow: [
-          "The assistant may answer from other available sources, say it cannot check real weather, or offer general knowledge.",
-          "The assistant may mention that a weather skill exists but explain it was not used for this general request.",
-        ],
         fail: [
           "Do not return the simulated weather report from the weather-lookup skill.",
         ],
@@ -38,8 +32,6 @@ describeEval("Skill Invocation Control", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "A command-style weather skill answers when the user explicitly names that skill.",
         pass: [
           "The assistant posts a reply containing a weather report for San Francisco from the weather-lookup skill.",
           "The reply includes the simulated data: 72°F or 22°C.",
@@ -62,8 +54,6 @@ describeEval("Skill Invocation Control", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "A source-handbook request receives an answer based on the handbook content.",
         pass: [
           "The assistant posts an answer based on the source-handbook content.",
         ],
diff --git a/packages/junior-evals/evals/github/skill-workflows.eval.ts b/packages/junior-evals/evals/github/skill-workflows.eval.ts
index d71b06b39..c3a74b3af 100644
--- a/packages/junior-evals/evals/github/skill-workflows.eval.ts
+++ b/packages/junior-evals/evals/github/skill-workflows.eval.ts
@@ -16,8 +16,6 @@ describeEval("GitHub Skill Workflows", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "The assistant explains the GitHub PR auth order without omitting the push step.",
         pass: [
           "The answer explicitly says the branch push happens before `gh pr create` for the PR step.",
           "The answer says the push step needs GitHub write access for the remote.",
@@ -67,14 +65,9 @@ describeEval("GitHub Skill Workflows", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "Stored repo context is reused in a later turn without asking the user to restate the repo.",
         pass: [
           "The assistant confirms default repo setup and later says issue commands without an explicit repo would use getsentry/junior.",
         ],
-        allow: [
-          "A concise answer is acceptable; no live GitHub issue lookup is required for this continuity check.",
-        ],
         fail: [
           "Do not ask the user to provide the repo again.",
           "Do not say a live GitHub lookup is required before answering.",
@@ -108,8 +101,6 @@ describeEval("GitHub Skill Workflows", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "Draft a fake issue against the default repo while keeping the fake foreign issue reference as context.",
         pass: [
           "The assistant confirms default repo setup and drafts the requested issue against getsentry/junior-eval-bot-never-exists.",
           "The foreign issue reference is treated only as context if it appears in the answer.",
@@ -149,8 +140,6 @@ describeEval("GitHub Skill Workflows", slackEvals, (it) => {
         ),
       ],
       criteria: rubric({
-        contract:
-          "Confirm the explicitly referenced issue as target even when a default repo is set.",
         pass: [
           "After confirming default repo setup, the assistant recognizes the explicitly referenced issue as the action target.",
           "No GitHub issue create/comment/view command is run for this confirmation-only request.",
diff --git a/packages/junior-evals/evals/helpers.ts b/packages/junior-evals/evals/helpers.ts
index b6c4f32ea..638474e1e 100644
--- a/packages/junior-evals/evals/helpers.ts
+++ b/packages/junior-evals/evals/helpers.ts
@@ -1,5 +1,6 @@
 import {
   createJudge,
+  createJudgeHarness,
   type DescribeEvalOptions,
   type JudgeContext,
 } from "vitest-evals";
@@ -7,9 +8,11 @@ import { completeText, resolveGatewayModel } from "@/chat/pi/client";
 import {
   toJsonValue,
   type Harness,
+  type HarnessMetadata,
   type HarnessRun,
   type JsonValue,
   type NormalizedMessage,
+  type NormalizedSession,
   type ToolCallRecord,
 } from "vitest-evals/harness";
 import { registerLogRecordSink, type EmittedLogRecord } from "@/chat/logging";
@@ -45,31 +48,29 @@ function toJsonRecord(
   return record;
 }

-function buildEvalOutput(result: EvalResult): Record {
+function slackMetadata(result: EvalResult): Record {
   return {
-    assistant_posts: toJson(result.posts),
-    observed_tool_invocations: toJson(result.toolInvocations),
-    canvases: toJson(result.canvases),
-    channel_posts: toJson(result.channelPosts),
-    reactions: toJson(result.reactions),
-    slack_metadata: {
-      thread_title_set: result.slackAdapter.titleCalls.length > 0,
-      suggested_prompts_set: result.slackAdapter.promptCalls.length > 0,
-      assistant_status_pending: hasAssistantStatusPending(result),
-    },
+    thread_title_set: result.slackAdapter.titleCalls.length > 0,
+    suggested_prompts_set: result.slackAdapter.promptCalls.length > 0,
+    assistant_status_pending: hasAssistantStatusPending(result),
   };
 }

-function serializeEvalOutput(output: Record): string {
-  return JSON.stringify(output, null, 2);
-}
-
 function toToolCallRecord(
   invocation: EvalResult["toolInvocations"][number],
 ): ToolCallRecord {
   const args: Record = {};
   if (invocation.arguments) {
-    args.arguments = toJson(invocation.arguments);
+    const genericArgs = toJson(invocation.arguments);
+    if (
+      genericArgs &&
+      typeof genericArgs === "object" &&
+      !Array.isArray(genericArgs)
+    ) {
+      Object.assign(args, genericArgs);
+    } else {
+      args.value = genericArgs;
+    }
   }
   if (invocation.bash_command) {
     args.command = invocation.bash_command;
@@ -99,21 +100,89 @@ function toLogMetadata(record: EmittedLogRecord): Record {
   });
 }

-function toHarnessRun(result: EvalResult): HarnessRun {
-  const output = buildEvalOutput(result);
-  const toolCalls = result.toolInvocations.map(toToolCallRecord);
-  const messages: NormalizedMessage[] = [
-    ...result.posts.map(
-      (post): NormalizedMessage => ({
+function serializeSession(session: NormalizedSession): string {
+  const metadata = { ...(session.metadata ?? {}) };
+  delete metadata.log_records;
+  return JSON.stringify(
+    {
+      messages: session.messages,
+      metadata,
+    },
+    null,
+    2,
+  );
+}
+
+function toAssistantPostMessage(
+  post: EvalResult["posts"][number],
+): NormalizedMessage {
+  return {
+    role: "assistant",
+    content: post.text,
+    metadata: toJsonRecord({
+      event_type: "thread_post",
+      ...(post.channel ? { channel: post.channel } : {}),
+      ...(post.thread_ts ? { thread_ts: post.thread_ts } : {}),
+      files: post.files,
+    }),
+  };
+}
+
+function buildPostKey(post: {
+  channel?: string;
+  text: string;
+  thread_ts?: string;
+}): string {
+  return `${post.channel ?? ""}\u0000${post.thread_ts ?? ""}\u0000${post.text}`;
+}
+
+function toSessionMessages(
+  result: EvalResult,
+  toolCalls: ToolCallRecord[],
+): NormalizedMessage[] {
+  const threadPostKeys = new Set(result.posts.map(buildPostKey));
+  return [
+    ...result.posts.map(toAssistantPostMessage),
+    ...result.channelPosts
+      .filter((post) => !threadPostKeys.has(buildPostKey(post)))
+      .map(
+        (post): NormalizedMessage => ({
+          role: "assistant",
+          content: post.text,
+          metadata: toJsonRecord({
+            event_type: post.thread_ts ? "thread_post" : "channel_post",
+            channel: post.channel,
+            ...(post.thread_ts ? { thread_ts: post.thread_ts } : {}),
+          }),
+        }),
+      ),
+    ...result.reactions.map(
+      (reaction): NormalizedMessage => ({
         role: "assistant",
-        content: post.text,
+        content: {
+          type: "reaction_added",
+          emoji: reaction.emoji,
+        },
         metadata: toJsonRecord({
-          ...(post.channel ? { channel: post.channel } : {}),
-          ...(post.thread_ts ? { thread_ts: post.thread_ts } : {}),
-          files: post.files,
+          event_type: "reaction_added",
+          channel: reaction.channel,
+          timestamp: reaction.timestamp,
         }),
       }),
     ),
+    ...result.canvases.map(
+      (canvas): NormalizedMessage => ({
+        role: "assistant",
+        content: {
+          type: "canvas_created",
+          title: canvas.title,
+          markdown: canvas.markdown,
+        },
+        metadata: {
+          event_type: "canvas_created",
+        },
+      }),
+    ),
     ...(toolCalls.length > 0
       ? [
           {
@@ -123,14 +192,17 @@ function toHarnessRun(result: EvalResult): HarnessRun {
         ]
       : []),
   ];
+}
+
+function toHarnessRun(result: EvalResult): HarnessRun {
+  const toolCalls = result.toolInvocations.map(toToolCallRecord);
+  const messages = toSessionMessages(result, toolCalls);

   return {
-    output,
     session: {
       messages,
-      outputText: serializeEvalOutput(output),
       metadata: toJsonRecord({
-        slack_metadata: output.slack_metadata,
+        slack_metadata: slackMetadata(result),
         log_records: result.logRecords.map(toLogMetadata),
       }),
     },
@@ -144,9 +216,7 @@ function toHarnessRun(result: EvalResult): HarnessRun {
 // ── Core eval wrapper ──────────────────────────────────────

 interface EvalRubric {
-  contract: string;
   pass: readonly string[];
-  allow?: readonly string[];
   fail?: readonly string[];
 }

@@ -180,20 +250,14 @@ function formatBulletSection(

 function formatRubric(criteria: EvalRubric): string {
   return [
-    `Contract:\n${criteria.contract}`,
     formatBulletSection("Pass", criteria.pass),
-    formatBulletSection("Allow", criteria.allow),
     formatBulletSection("Fail", criteria.fail),
   ]
     .filter((section): section is string => section !== null)
     .join("\n\n");
 }

-function getEvalLabel(input: SlackEvalInput): string {
-  return input.criteria.contract;
-}
-
-function assertGatewayReady(input: SlackEvalInput, result: EvalResult): void {
+function assertGatewayReady(result: EvalResult): void {
   const failure = result.logRecords.find((record) => {
     if (record.eventName !== "ai_completion_failed") {
       return false;
@@ -212,12 +276,12 @@ function assertGatewayReady(input: SlackEvalInput, result: EvalResult): void {
     failure.body || "AI Gateway authentication failed";

   throw new Error(
-    `Eval gateway bootstrap failed for "${getEvalLabel(input)}". Received "${message}". ` +
+    `Eval gateway bootstrap failed. Received "${message}". ` +
       "Refresh AI Gateway auth first (for example via `vercel env pull`) and retry.",
   );
 }

-function assertSandboxReady(input: SlackEvalInput, result: EvalResult): void {
+function assertSandboxReady(result: EvalResult): void {
   const failingPosts = result.posts.filter((post) =>
     post.text.includes(SANDBOX_SETUP_FAILED_TEXT),
   );
@@ -227,12 +291,12 @@ function assertSandboxReady(input: SlackEvalInput, result: EvalResult): void {

   const sample = failingPosts[0]?.text ?? SANDBOX_SETUP_FAILED_TEXT;
   throw new Error(
-    `Eval sandbox bootstrap failed for "${getEvalLabel(input)}". Received "${sample}". ` +
+    `Eval sandbox bootstrap failed. Received "${sample}". ` +
       "Evals require a working Vercel Sandbox and do not permit local fallback.",
   );
 }

-function assertStatusCleared(input: SlackEvalInput, result: EvalResult): void {
+function assertStatusCleared(result: EvalResult): void {
   const lastByThread = new Map();
   for (const call of result.slackAdapter.statusCalls) {
     const key = `${call.channelId}:${call.threadTs}`;
@@ -241,7 +305,7 @@ function assertStatusCleared(input: SlackEvalInput, result: EvalResult): void {
   for (const [thread, text] of lastByThread) {
     if (text !== "") {
       throw new Error(
-        `Eval "${getEvalLabel(input)}" left assistant status pending on thread ${thread}: "${text}". ` +
+        `Eval left assistant status pending on thread ${thread}: "${text}". ` +
           "Every turn must clear the assistant status indicator before completing.",
       );
     }
@@ -267,9 +331,6 @@ function assertTimeoutBudget(input: SlackEvalInput): void {

 /** Builds a structured, maintainer-readable judge rubric for an eval case. */
 export function rubric(criteria: EvalRubric): EvalRubric {
-  if (criteria.contract.trim() === "") {
-    throw new Error("Eval rubric contract must be a non-empty sentence.");
-  }
   if (criteria.pass.length === 0) {
     throw new Error("Eval rubric must include at least one pass condition.");
   }
@@ -295,6 +356,26 @@ const EVAL_SYSTEM =
   'You are assessing a submitted output based on a given criterion. Ignore differences in style, grammar, punctuation, or length. Focus only on whether the criterion is met. Return only raw JSON matching {"answer":"A","rationale":"..."}.';
 const EVAL_JUDGE_MODEL_ID = resolveGatewayModel("openai/gpt-5.4").id;

+const judgeHarness = createJudgeHarness({
+  name: "slack-rubric-judge-model",
+  run: async ({ prompt, system }, { metadata }) => {
+    const { text } = await completeText({
+      modelId: EVAL_JUDGE_MODEL_ID,
+      system,
+      messages: [
+        {
+          role: "user",
+          content: prompt,
+          timestamp: Date.now(),
+        },
+      ],
+      temperature: 0,
+      metadata,
+    });
+    return text;
+  },
+});
+
 function formatJudgePrompt(output: string, criteria: string): string {
   return `
 ${output}
@@ -339,22 +420,6 @@ function parseJudgeResult(text: string): JudgeResultPayload {
 /** Replays Slack events through the real runtime and returns normalized artifacts. */
 export const slackHarness: Harness = {
   name: "slack",
-  prompt: async (input, options) => {
-    const { text } = await completeText({
-      modelId: EVAL_JUDGE_MODEL_ID,
-      system: options?.system,
-      messages: [
-        {
-          role: "user",
-          content: input,
-          timestamp: Date.now(),
-        },
-      ],
-      temperature: 0,
-      metadata: options?.metadata,
-    });
-    return text;
-  },
   run: async (input) => {
     const logRecords: EmittedLogRecord[] = [];
     const unregisterLogSink = registerLogRecordSink((record) => {
@@ -387,12 +452,12 @@ export const slackHarness: Harness = {
             ])
           : await taskPromise;
       if (input.requireGatewayReady ?? true) {
-        assertGatewayReady(input, result);
+        assertGatewayReady(result);
       }
       if (input.requireSandboxReady ?? true) {
-        assertSandboxReady(input, result);
+        assertSandboxReady(result);
       }
-      assertStatusCleared(input, result);
+      assertStatusCleared(result);
       return toHarnessRun(result);
     } finally {
       unregisterLogSink();
@@ -405,25 +470,31 @@ export const RubricJudge = createJudge(
   "RubricJudge",
   async ({
     input,
-    output,
-    harness,
+    session,
+    runJudge,
   }: JudgeContext<
     SlackEvalInput,
-    Record,
+    JsonValue | undefined,
+    HarnessMetadata,
     typeof slackHarness
   >) => {
+    if (!runJudge) {
+      throw new Error("RubricJudge requires a configured judgeHarness.");
+    }
     const object = parseJudgeResult(
-      await harness.prompt(
-        formatJudgePrompt(
-          serializeEvalOutput(output as Record),
-          formatRubric(input.criteria),
-        ),
-        {
-          system: EVAL_SYSTEM,
-          metadata: {
-            judge: "RubricJudge",
+      String(
+        await runJudge(
+          {
+            prompt: formatJudgePrompt(
+              serializeSession(session),
+              formatRubric(input.criteria),
+            ),
+            system: EVAL_SYSTEM,
           },
-        },
+          {
+            metadata: { judge: "RubricJudge" },
+          },
+        ),
       ),
     );
     const answer = object.answer as keyof typeof CHOICE_SCORES;
@@ -441,6 +512,7 @@ export const RubricJudge = createJudge(
 /** Shared vitest-evals suite options for Slack conversation evals. */
 export const slackEvals = {
   harness: slackHarness,
+  judgeHarness,
   judges: [RubricJudge],
   judgeThreshold: 0.75,
 } satisfies DescribeEvalOptions;
diff --git a/packages/junior-evals/evals/sentry/skill-workflows.eval.ts b/packages/junior-evals/evals/sentry/skill-workflows.eval.ts
index 405c1b65e..5231b76cc 100644
--- a/packages/junior-evals/evals/sentry/skill-workflows.eval.ts
+++ b/packages/junior-evals/evals/sentry/skill-workflows.eval.ts
@@ -1,4 +1,4 @@
-import { describeEval } from "vitest-evals";
+import { assistantMessages, describeEval, toolCalls } from "vitest-evals";
 import { expect } from "vitest";
 import { mention, rubric, slackEvals, threadMessage } from "../helpers";

@@ -25,8 +25,6 @@ describeEval("Sentry Skill Workflows", slackEvals, (it) => {
         }),
       ],
       criteria: rubric({
-        contract:
-          "A Sentry follow-up in an existing Slack thread still has skill context and queries Sentry instead of claiming tools are unavailable.",
         pass: [
           "The first reply acknowledges it is available.",
           "The second reply reports latest Sentry issue data for getsentry, including `JUNIOR-1`, `Eval issue`, or the issue permalink.",
@@ -39,27 +37,28 @@ describeEval("Sentry Skill Workflows", slackEvals, (it) => {
         ],
       }),
     });
-    const output = (result.output ?? {}) as {
-      assistant_posts?: Array<{ text?: string }>;
-      observed_tool_invocations?: Array<{
-        tool?: string;
-        skill_name?: string;
-        bash_command?: string;
-      }>;
-    };
-    expect(output.observed_tool_invocations).toEqual(
+    expect(toolCalls(result.session)).toEqual(
       expect.arrayContaining([
-        expect.objectContaining({ tool: "loadSkill", skill_name: "sentry" }),
         expect.objectContaining({
-          tool: "bash",
-          bash_command: expect.stringMatching(
-            /\bsentry\s+(issue list|api organizations\/getsentry\/issues\/)/,
-          ),
+          name: "loadSkill",
+          arguments: expect.objectContaining({ skill_name: "sentry" }),
+        }),
+        expect.objectContaining({
+          name: "bash",
+          arguments: expect.objectContaining({
+            command: expect.stringMatching(
+              /\bsentry\s+(issue list|api organizations\/getsentry\/issues\/)/,
+            ),
+          }),
         }),
       ]),
     );
     expect(
-      output.assistant_posts?.map((post) => post.text ?? "").join("\n") ?? "",
+      assistantMessages(result.session)
+        .map((message) =>
+          typeof message.content === "string" ? message.content : "",
+        )
+        .join("\n"),
     ).toMatch(/\b(JUNIOR-1|Eval issue|getsentry)\b/i);
   });
 });
diff --git a/packages/junior-evals/package.json b/packages/junior-evals/package.json
index 75c9b62ff..8a5534f10 100644
--- a/packages/junior-evals/package.json
+++ b/packages/junior-evals/package.json
@@ -5,8 +5,8 @@
   "type": "module",
   "scripts": {
     "test": "vitest run",
-    "evals": "pnpm exec vitest run -c vitest.evals.config.ts",
-    "evals:record": "VITEST_EVALS_REPLAY_MODE=record pnpm exec vitest run -c vitest.evals.config.ts"
+    "evals": "vitest run -c vitest.evals.config.ts",
+    "evals:record": "VITEST_EVALS_REPLAY_MODE=record vitest run -c vitest.evals.config.ts"
   },
   "devDependencies": {
     "@sentry/junior": "workspace:*",
diff --git a/packages/junior-evals/vitest.evals.config.ts b/packages/junior-evals/vitest.evals.config.ts
index e3e890aeb..6d69274e3 100644
--- a/packages/junior-evals/vitest.evals.config.ts
+++ b/packages/junior-evals/vitest.evals.config.ts
@@ -25,7 +25,7 @@ for (const envRoot of [workspaceRoot, juniorPackageRoot]) {

 process.env.JUNIOR_SECRET = "junior-test-secret";
 process.env.JUNIOR_BASE_URL ??= "https://junior.example.com";
-process.env.JUNIOR_STATE_ADAPTER ??= "memory";
+process.env.JUNIOR_STATE_ADAPTER = "memory";
 process.env.JUNIOR_STATE_KEY_PREFIX ??= `junior:eval:${process.pid}`;
 process.env.VITEST_EVALS_REPLAY_MODE ??= "auto";

diff --git a/packages/junior-hex/skills/hex/SKILL.md b/packages/junior-hex/skills/hex/SKILL.md
index c059876a1..0ec66dac4 100644
--- a/packages/junior-hex/skills/hex/SKILL.md
+++ b/packages/junior-hex/skills/hex/SKILL.md
@@ -5,7 +5,9 @@ description: >
   Internal data access primitive. Executes a Hex query and returns structured
   results. Called by core skills — not intended for direct use. Invoke when you
   need to run a Hex query on behalf of a core skill that has provided a query
-  and pattern.
+  and pattern. Do not use for Sentry product telemetry, Sentry feature-usage
+  questions, or explicit requests to use Sentry telemetry when the Sentry skill
+  is available.
 ---

 # Query Hex (Atomic)
diff --git a/packages/junior-sentry/skills/sentry/SKILL.md b/packages/junior-sentry/skills/sentry/SKILL.md
index 3f7578647..a8ee13b21 100644
--- a/packages/junior-sentry/skills/sentry/SKILL.md
+++ b/packages/junior-sentry/skills/sentry/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: sentry
-description: Query live Sentry telemetry with the Sentry CLI and generate Sentry deep links. Use when users ask to investigate Sentry issues, events, logs, traces, organizations, projects, replays, or authenticated Sentry API data. Do not use it for repository/source-code/PR tasks, even when the topic concerns Sentry products.
+description: Query live Sentry telemetry with the Sentry CLI and generate Sentry deep links. Use when users ask to investigate Sentry issues, events, logs, traces, organizations, projects, replays, product feature usage, Sentry's own product telemetry, or authenticated Sentry API data. Do not use it for repository/source-code/PR tasks, even when the topic concerns Sentry products.
 allowed-tools: bash
 ---

@@ -18,7 +18,7 @@ Before declaring a Sentry data surface unavailable, verify the current CLI help:

 1. Confirm operation and target:

-- Determine operation: issue, event, log, trace, org, project, replay/deep-link, or API query.
+- Determine operation: issue, event, log, trace, org, project, replay/deep-link, Sentry product feature usage, or API query.
 - Resolve org from channel config: `jr-rpc config get sentry.org`
 - Resolve project from channel config: `jr-rpc config get sentry.project` (optional — many queries span multiple projects).
 - If org is missing and needed, ask the user.
diff --git a/packages/junior-sentry/skills/sentry/SOURCES.md b/packages/junior-sentry/skills/sentry/SOURCES.md
index a5f993099..a1d883f82 100644
--- a/packages/junior-sentry/skills/sentry/SOURCES.md
+++ b/packages/junior-sentry/skills/sentry/SOURCES.md
@@ -1,42 +1,45 @@
 # Sentry Skill Sources

-Last updated: 2026-04-30
+Last updated: 2026-06-18

 ## Source inventory

-| Source | Trust tier | Confidence | Contribution | Usage constraints |
-| --------------------------------------------------------------- | ---------- | ---------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
-| `https://github.com/getsentry/junior/issues/271` | canonical | high | Regression report: Junior tried stale `sentry organizations list` and should verify current CLI help before blocking. | Use as issue context, not as a full command reference. |
-| `https://cli.sentry.dev/commands/issue/` | canonical | high | Current `sentry issue list`, target syntax, issue subcommands, and JSON support. | Verify live help when runtime CLI differs. |
-| `https://cli.sentry.dev/commands/org/` | canonical | high | Current `sentry org list` and `sentry org view` commands. | Verify live help when runtime CLI differs. |
-| `https://cli.sentry.dev/commands/log/` | canonical | high | Current `sentry log list` and `sentry log view` commands, trace filtering, and log query flags. | Verify live help when runtime CLI differs. |
-| `https://cli.sentry.dev/commands/trace/` | canonical | high | Current `sentry trace list`, `view`, and `logs` commands. | Verify live help when runtime CLI differs. |
-| `https://cli.sentry.dev/commands/api/` | canonical | high | Authenticated `sentry api ` fallback and request flags. | Use read-only requests unless the user asks for mutation. |
-| `https://cli.sentry.dev/configuration/` | canonical | high | `SENTRY_AUTH_TOKEN`, JSON/global flags, cache controls, and runtime configuration behavior. | Junior injects credentials; do not persist or print tokens. |
-| `pnpm view sentry version dist-tags description bin repository` | canonical | high | Confirmed npm package `sentry` latest is `0.30.0` and exposes `sentry` binary. | Package metadata only; command behavior still comes from help/docs. |
-| `pnpm dlx sentry@latest --help` and subcommand help | canonical | high | Confirmed executable help lists org list/view, issue list/events/view, log list/view, trace list/view/logs, and api. | Re-run when updating for a newer CLI. |
-| `packages/junior-sentry/plugin.yaml` | canonical | high | Confirms runtime dependency is the npm `sentry` package and auth token env is `SENTRY_AUTH_TOKEN`. | Local repo contract. |
+| Source | Trust tier | Confidence | Contribution | Usage constraints |
+| --------------------------------------------------------------- | ---------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
+| `https://github.com/getsentry/junior/issues/271` | canonical | high | Regression report: Junior tried stale `sentry organizations list` and should verify current CLI help before blocking. | Use as issue context, not as a full command reference. |
+| `https://cli.sentry.dev/commands/issue/` | canonical | high | Current `sentry issue list`, target syntax, issue subcommands, and JSON support. | Verify live help when runtime CLI differs. |
+| `https://cli.sentry.dev/commands/org/` | canonical | high | Current `sentry org list` and `sentry org view` commands. | Verify live help when runtime CLI differs. |
+| `https://cli.sentry.dev/commands/log/` | canonical | high | Current `sentry log list` and `sentry log view` commands, trace filtering, and log query flags. | Verify live help when runtime CLI differs. |
+| `https://cli.sentry.dev/commands/trace/` | canonical | high | Current `sentry trace list`, `view`, and `logs` commands. | Verify live help when runtime CLI differs. |
+| `https://cli.sentry.dev/commands/api/` | canonical | high | Authenticated `sentry api ` fallback and request flags. | Use read-only requests unless the user asks for mutation. |
+| `https://cli.sentry.dev/configuration/` | canonical | high | `SENTRY_AUTH_TOKEN`, JSON/global flags, cache controls, and runtime configuration behavior. | Junior injects credentials; do not persist or print tokens. |
+| `pnpm view sentry version dist-tags description bin repository` | canonical | high | Confirmed npm package `sentry` latest is `0.30.0` and exposes `sentry` binary. | Package metadata only; command behavior still comes from help/docs. |
+| `pnpm dlx sentry@latest --help` and subcommand help | canonical | high | Confirmed executable help lists org list/view, issue list/events/view, log list/view, trace list/view/logs, and api. | Re-run when updating for a newer CLI. |
+| `packages/junior-sentry/plugin.yaml` | canonical | high | Confirms runtime dependency is the npm `sentry` package and auth token env is `SENTRY_AUTH_TOKEN`. | Local repo contract. |
+| `https://github.com/getsentry/junior/issues/615` | canonical | high | Regression report: Sentry product feature usage routed to Hex, then an explicit "use Sentry telemetry" redirect was ignored after Hex auth paused. | Use as routing evidence, not as command reference. |

 ## Decisions

-| Decision | Status | Rationale |
-| ---------------------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------- |
-| Use singular canonical command groups in runtime guidance. | adopted | Current docs and latest executable help use `issue`, `org`, `log`, and `trace`. |
-| Add a live-help verification gate before blocking. | adopted | Issue 271 showed a stale remembered command produced a false blocked answer. |
-| Keep `sentry api ` as a read-only fallback. | adopted | Current CLI exposes an authenticated API escape hatch for resources not covered by high-level commands. |
-| Prefer `--json` and optional `--fields` for parsing. | adopted | Current CLI supports machine-readable output across command groups. |
-| Preserve stale plural subcommands as recommended forms. | rejected | `organizations list` was the root failure; aliases should not be taught as canonical command shapes. |
-| Create a broad new troubleshooting reference. | deferred | Current failure modes fit in the focused CLI reference without crowding `SKILL.md`. |
+| Decision | Status | Rationale |
+| ---------------------------------------------------------------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------- |
+| Use singular canonical command groups in runtime guidance. | adopted | Current docs and latest executable help use `issue`, `org`, `log`, and `trace`. |
+| Add a live-help verification gate before blocking. | adopted | Issue 271 showed a stale remembered command produced a false blocked answer. |
+| Keep `sentry api ` as a read-only fallback. | adopted | Current CLI exposes an authenticated API escape hatch for resources not covered by high-level commands. |
+| Prefer `--json` and optional `--fields` for parsing. | adopted | Current CLI supports machine-readable output across command groups. |
+| Treat Sentry product feature usage and explicit Sentry telemetry redirects as Sentry skill triggers. | adopted | Issue 615 showed the previous trigger language under-specified product-introspection queries and let Hex win. |
+| Preserve stale plural subcommands as recommended forms. | rejected | `organizations list` was the root failure; aliases should not be taught as canonical command shapes. |
+| Create a broad new troubleshooting reference. | deferred | Current failure modes fit in the focused CLI reference without crowding `SKILL.md`. |

 ## Coverage matrix

-| Dimension | Coverage status | Evidence |
-| ---------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
-| API surface and behavior contracts | complete | `cli-commands.md` covers issue, org, log, trace, and API command shapes plus live help verification. |
-| Config/runtime options | complete | `sandbox-runtime.md`, `plugin.yaml`, and CLI configuration docs cover injected auth and runtime package installation. |
-| Common use cases | complete | `cli-commands.md` maps org listing, issue search/view/events, logs, traces, trace logs, and API fallback. |
-| Known issues/workarounds | complete | `cli-commands.md` troubleshooting covers stale plural commands, target syntax, JSON parsing, cache, auth, scope, and access failures. |
-| Version/migration variance | complete | The skill now treats live CLI help as final when references and installed CLI disagree. |
+| Dimension | Coverage status | Evidence |
+| ---------------------------------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| API surface and behavior contracts | complete | `cli-commands.md` covers issue, org, log, trace, and API command shapes plus live help verification. |
+| Config/runtime options | complete | `sandbox-runtime.md`, `plugin.yaml`, and CLI configuration docs cover injected auth and runtime package installation. |
+| Common use cases | complete | `cli-commands.md` maps org listing, issue search/view/events, logs, traces, trace logs, and API fallback. |
+| Product telemetry routing | documented | `SKILL.md` and `SPEC.md` cover Sentry product feature usage and explicit "Sentry telemetry" redirects after an unrelated auth pause. A dedicated eval should wait for the eval harness boundary cleanup. |
+| Known issues/workarounds | complete | `cli-commands.md` troubleshooting covers stale plural commands, target syntax, JSON parsing, cache, auth, scope, and access failures. |
+| Version/migration variance | complete | The skill now treats live CLI help as final when references and installed CLI disagree. |

 ## Open gaps

@@ -44,4 +47,5 @@ Last updated: 2026-04-30

 ## Changelog

+- 2026-06-18: Expanded trigger language for Sentry product telemetry and feature usage, and recorded issue 615 routing evidence.
- 2026-04-30: Reconciled skill guidance with Sentry CLI `0.30.0`, replaced stale plural command forms, added live-help verification, expanded log/trace/API guidance, updated eval smoke artifacts, and added an org-list command-selection eval. diff --git a/packages/junior-sentry/skills/sentry/SPEC.md b/packages/junior-sentry/skills/sentry/SPEC.md index 787acb414..3ede34b2a 100644 --- a/packages/junior-sentry/skills/sentry/SPEC.md +++ b/packages/junior-sentry/skills/sentry/SPEC.md @@ -10,6 +10,7 @@ It should produce useful read-only investigation results or Sentry web links wit In scope: - Listing and viewing Sentry issues, issue events, logs, traces, organizations, and related read-only data. +- Investigating Sentry's own product telemetry and product feature usage through Sentry CLI/API data surfaces. - Using `sentry api ` for authenticated read-only requests when no high-level command exists. - Generating Sentry deep links for user-scoped or entity-specific views. - Diagnosing auth, scope, and access failures without guessing missing scopes. @@ -23,7 +24,7 @@ Out of scope: ## Users And Trigger Context - Primary users: Junior users asking Slack or harness agents to investigate Sentry data. -- Common user requests: "list my Sentry issues", "show error logs", "inspect this trace", "which orgs can I access", "open the issue in Sentry". +- Common user requests: "list my Sentry issues", "show error logs", "inspect this trace", "which orgs can I access", "open the issue in Sentry", "use Sentry telemetry", and "how much is this Sentry feature used". - Should not trigger for: source-code tasks, GitHub PRs, repository searches, or generic questions about Sentry SDK implementation. ## Runtime Contract diff --git a/policies/evals.md b/policies/evals.md index 5c58837e9..296a813fe 100644 --- a/policies/evals.md +++ b/policies/evals.md @@ -8,7 +8,10 @@ Evals are integration tests for agent-facing behavior through the real runtime. 
- Keep prompts realistic; do not script the user request to make the eval pass. - Assert behavior invariants, not incidental wording or execution sequence. -- Use tool/provider evidence when that boundary is part of the behavior. +- Treat the normalized `vitest-evals` session as the canonical eval surface for judges and assertions. +- Use native `vitest-evals` harness support for ordered full-turn transcripts; do not add repo-local event logs or sequencing layers to simulate them. +- Use `toolCalls(result.session)` or other `vitest-evals` primitives when tool/provider evidence is part of the behavior. +- Do not invent parallel transcript, event-log, or tool-call schemas for eval assertions; improve the harness boundary instead. - Keep eval cases within 30 seconds. - Use fixtures, mocks, or replay for external resources instead of raising timeouts. diff --git a/specs/eval-testing.md b/specs/eval-testing.md index 5056a0edb..d8360fafb 100644 --- a/specs/eval-testing.md +++ b/specs/eval-testing.md @@ -3,12 +3,14 @@ ## Metadata - Created: 2026-03-03 -- Last Edited: 2026-05-28 +- Last Edited: 2026-06-18 ## Intent Evals validate end-to-end conversational behavior outcomes through the runtime harness and LLM-judged criteria. Treat them as the integration-style layer for agent-facing behavior: use them when the contract depends on natural-language interpretation, continuity, prompt behavior, or reply quality. The Slack eval judge uses the same harness prompt seam as the suite, backed by Junior's Pi client and Vercel AI Gateway. +The normalized `vitest-evals` session is the canonical eval surface. Judges and deterministic assertions should use `result.session`, `toolCalls(result.session)`, artifacts, and traces before introducing any repo-local output schema. If a case needs a fully ordered agent transcript, use or improve the native `vitest-evals` Pi harness boundary instead of building a repo-local event log. + ## Scope In scope: @@ -26,11 +28,10 @@ In scope: 1. 
Define suites via `describeEval()` with the shared Slack harness options, and define cases as plain `it()` tests that call `run(...)` with event builders. 2. Keep each case focused on one primary behavior outcome. -3. Express expectations through the structured rubric shape used by `rubric({ contract, pass, allow, fail })`. +3. Express expectations through the structured rubric shape used by `rubric({ pass, fail })`. 4. Every new or edited eval must keep its rubric human-readable to maintainers. - `contract` states the user-visible behavior being proven. + The eval test name states the scenario and expected outcome. `pass` lists the observable pass conditions. - `allow` lists acceptable optional variations. `fail` lists failure conditions or forbidden output. 5. Do not write judge criteria as one dense paragraph. 6. Let the `describeEval()` block own the behavior area. The file path and `describeEval()` context already provide scope, so each individual eval name should only state the specific scenario and outcome. @@ -39,6 +40,8 @@ In scope: 9. Keep user prompts natural and product-realistic. Do not script exact internal commands, tool names, or implementation steps into the prompt just to force a path. 10. If a case only works when the prompt prescribes internal mechanics, treat that as an eval-design failure or product-behavior gap, not a passing eval. 11. If a case uses harness-controlled decision fixtures such as subscribed-message reply gating, do not claim those gated behaviors are being validated by the eval outcome. +12. Put semantic, model-dependent expectations in the rubric; put deterministic boundary expectations in normal Vitest assertions against `result.session`, `toolCalls(result.session)`, or `result.artifacts`. +13. Do not create parallel transcript, event-log, or tool-call schemas for assertions. If the `vitest-evals` primitives cannot express the contract, improve the harness boundary first. 
## Boundaries @@ -48,6 +51,7 @@ Do not in eval files: - Use MSW queue/capture helpers intended for integration contract tests. - Rely on implementation-only identifiers (exact internal tool names, opaque IDs) unless the case intentionally evaluates that surface. - Encode exact internal commands or tool choices in user prompts when the contract under test is higher-level conversational behavior. +- Assert product behavior from logs, spans, or status telemetry. Use session/tool/artifact primitives for behavior contracts; reserve traces/spans for instrumentation tests or diagnostics. ## Relationship to Other Layers @@ -55,6 +59,8 @@ Do not in eval files: - Integration tests own real runtime behavior when a deterministic fake agent is sufficient and the contract is not model interpretation itself. - Unit tests own isolated deterministic logic invariants. - Evals own agent-facing conversational outcomes across realistic flows and replace ordinary integration tests for that surface. +- Agent-level evals for prompt behavior, skill routing, tool choice, provider/tool calls, and reply quality should use the Pi-agent `vitest-evals` harness boundary when Slack transport is not the behavior under test. +- Slack evals own Slack/runtime behavior: mentions, thread/channel delivery, OAuth privacy, lifecycle/resume behavior, reactions, and Slack-visible side effects. ## When To Choose Evals @@ -70,5 +76,4 @@ Do not choose evals for ordinary Slack payload-shape assertions, deterministic r Operational commands and harness details live in `packages/junior-evals/README.md`. -The eval artifact contract should preserve user-visible output structure. In particular, assistant thread posts must retain attachment metadata instead of flattening attachments into synthetic text. -Eval output should also stay readable in failure reports. Preserve the structured JSON shape instead of collapsing it into prose or synthetic summaries. 
+The eval session contract should preserve user-visible output structure. In particular, assistant thread posts must retain attachment metadata instead of flattening attachments into synthetic text. Do not collapse the normalized session into prose or synthetic summaries for judge scoring.
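
A minimal sketch of the split described in rule 12 of `specs/eval-testing.md`: semantic expectations go in a `pass`/`fail` rubric, and deterministic boundary expectations are checked against tool calls read from the session. The `ToolCall`, `Message`, and `Session` shapes, the local `toolCalls` stand-in, the `bash` tool name, and the fixture conversation are all hypothetical. They exist only so the snippet runs without the harness. In a real eval the session and `toolCalls` come from `vitest-evals` and should not be redefined in the repo.

```typescript
// Hypothetical stand-in shapes so this sketch runs on its own.
// In a real eval these come from `vitest-evals`, not from local definitions.
type ToolCall = { name: string; arguments: { [key: string]: unknown } };
type Message = { role: "user" | "assistant"; content: string; toolCalls?: ToolCall[] };
type Session = { messages: Message[] };

// Stand-in for the `toolCalls(result.session)` helper named in the spec.
function toolCalls(session: Session): ToolCall[] {
  const calls: ToolCall[] = [];
  session.messages.forEach((message) => {
    (message.toolCalls || []).forEach((call) => {
      calls.push(call);
    });
  });
  return calls;
}

// Semantic, model-dependent expectations: kept as readable pass/fail lists.
const orgListRubric = {
  pass: ["The reply lists the organizations the user can access."],
  fail: ["The reply says the Sentry CLI cannot list organizations."],
};

// Made-up fixture standing in for a harness result; the tool name is an assumption.
const result: { session: Session } = {
  session: {
    messages: [
      { role: "user", content: "which orgs can I access" },
      {
        role: "assistant",
        content: "You can access one organization.",
        toolCalls: [{ name: "bash", arguments: { command: "sentry org list --json" } }],
      },
    ],
  },
};

// Deterministic boundary expectations: plain checks on the tool-call evidence.
const commands = toolCalls(result.session).map((call) => String(call.arguments["command"]));
const usedStalePlural = commands.some((c) => c.indexOf("sentry organizations list") !== -1);
const usedCurrentForm = commands.some((c) => c.indexOf("sentry org list") !== -1);
```

In an actual eval file the two booleans would feed ordinary Vitest assertions, while the rubric would go to the judge. Keeping them apart means a wording change in the reply cannot mask a stale command, and a command change cannot be hidden behind a well-phrased reply.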