feat: code-grader plain-text fallback + workspace env preflight by christso · Pull Request #1209 · EntityProcess/agentv

christso · 2026-05-02T13:47:20Z

Summary

Implements two new eval features from issues #1207 and #1208.

`type: shell` grader (#1207)

Runs a shell command and checks its stdout/exit code:

assertions:
  # Pass if exit code is 0
  - type: shell
    command: "pdfinfo report.pdf | grep Pages"

  # Exact string match
  - type: shell
    command: "echo 42"
    expected: "42"

  # Numeric comparison (>, <, >=, <=, ==, !=)
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"

The command runs in the workspace directory when available.
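
The three modes above can be sketched as a single decision function. This is a minimal illustration in Python, not the project's implementation. The name `grade_shell` and its parameters are hypothetical. Two behaviours are assumptions the description does not state: non-numeric output fails a numeric comparison, and the exit code is ignored once `expected` is set.

```python
import operator as op

# Comparison operators listed in the description, mapped to Python functions.
_NUMERIC_OPS = {
    ">": op.gt, "<": op.lt, ">=": op.ge,
    "<=": op.le, "==": op.eq, "!=": op.ne,
}


def grade_shell(stdout, exit_code, expected=None, operator=None):
    """Hypothetical sketch: return True when the assertion passes."""
    # Mode 1: no `expected`, so only the exit code matters.
    if expected is None:
        return exit_code == 0

    actual = stdout.strip()

    # Mode 2: `expected` without `operator` is an exact match on trimmed stdout.
    if operator is None:
        return actual == expected

    # Mode 3: `expected` plus `operator` compares both sides as floats.
    if operator not in _NUMERIC_OPS:
        raise ValueError(f"unsupported operator: {operator!r}")
    try:
        return _NUMERIC_OPS[operator](float(actual), float(expected))
    except ValueError:
        # Assumption: output that is not a number fails the comparison.
        return False
```

For example, `grade_shell("14\n", 0, expected="5", operator=">=")` passes because 14.0 >= 5.0, while stdout `3` against expected `10` fails.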

`workspace.env` preflight checks (#1208)

Declares required system dependencies, which are checked once before the `before_all` hooks run. If any are missing, the eval fails fast with a diagnostic that lists every missing item:

workspace:
  env:
    required_commands: [ffmpeg, pandoc, wkhtmltopdf]
    required_python_modules: [PIL, openai]

Error message on failure:

Preflight checks failed — missing dependencies:
  • command: ffmpeg
  • python module: PIL

Install the missing dependencies before running this eval.
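
A preflight of this kind can be sketched in a few lines of Python. The function names `find_missing` and `format_missing` are hypothetical, and the lookup strategy is an assumption: the sketch uses `shutil.which` for commands and `importlib.util.find_spec` for modules, which checks the interpreter running the sketch rather than whatever interpreter the eval would use. The key property it illustrates is that all missing items are collected before failing, so the user sees the full list at once.

```python
import importlib.util
import shutil


def find_missing(required_commands=(), required_python_modules=()):
    """Hypothetical sketch: return labels for every missing dependency."""
    missing = []
    for name in required_commands:
        # shutil.which returns None when the command is not on PATH.
        if shutil.which(name) is None:
            missing.append(f"command: {name}")
    for name in required_python_modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names with a missing parent package,
            # or for modules with a broken spec.
            found = False
        if not found:
            missing.append(f"python module: {name}")
    return missing


def format_missing(missing):
    """Render the bullet lines of the diagnostic, one per missing item."""
    return "\n".join(f"  • {item}" for item in missing)
```

A caller would run `find_missing(...)` once, and abort before any hook runs if the returned list is non-empty.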

Red/Green UAT Evidence

Red (before): type: shell in assertions → Unknown grader type "shell" error

Green (after):

1/4   ✅ test-exit-code       | 100% PASS  (exit code 0)
2/4   ✅ test-exact-match     | 100% PASS  (stdout "42" equals expected "42")
3/4   ✅ test-numeric-gte     | 100% PASS  (14 >= 5)
4/4   ⚠️ test-numeric-fail    | 0% FAIL    (3 >= 10 failed)

Preflight Red: eval with nonexistent_command_xyz_abc → immediate setup error with clear message before any test runs

Preflight Green: eval with bash, ls → preflight passes, eval proceeds normally

Test plan

14 unit tests for ShellGrader (all operator variants, error cases)
5 unit tests for shell grader parsing in grader-parser
All 2324 tests pass (build + typecheck + lint + test)
Manual e2e: shell grader with all 4 test modes ✅
Manual e2e: preflight fails fast on missing command ✅
Manual e2e: preflight passes on valid commands ✅

Closes #1207
Closes #1208

🤖 Generated with Claude Code

Adds two new eval features:

**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison

**Workspace env preflight** (`workspace.env`): declares required system dependencies that are checked once before before_all hooks run. Fails fast with a clear diagnostic listing all missing commands/modules.

Example:

```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]

assertions:
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"
```

Closes #1207, #1208

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-05-02T13:47:50Z

Deploying agentv with Cloudflare Pages

Latest commit:	`54b2032`
Status:	✅ Deploy successful!
Preview URL:	https://6e020390.agentv.pages.dev
Branch Preview URL:	https://feat-1207-1208-shell-grader.agentv.pages.dev


…1210)

Per design review: the `shell` grader type violated the "audit existing primitives first" principle, since `code-grader` already runs shell commands. Promptfoo solves this the same way (javascript/python fallbacks, no dedicated shell type).

Remove the `shell` grader type entirely and instead extend `code-grader` to accept plain-text stdout without requiring the JSON protocol:

| stdout (trimmed, case-insensitive) | score |
|---|---|
| empty string | 1 if exit 0, 0 if exit non-zero |
| "true", "pass", "1" | 1 |
| "false", "fail", "0" | 0 |
| numeric string | clamped float |
| anything else | 1 if exit 0, 0 if exit non-zero |

Scripts that write to stderr on non-zero exit still surface as errors (existing behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention.

Usage:

# numeric comparison via exit code
- type: code-grader
  command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"]

# score from stdout
- type: code-grader
  command: ["bash", "-c", "echo 0.75"]

Closes #1210

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
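
The stdout-to-score mapping in this commit message can be sketched as a small Python function. It is an illustration only: `score_plain_text` is a hypothetical name, the sketch covers just the plain-text fallback (not the existing JSON protocol or the stderr error path), and clamping to the range 0 to 1 is an assumption, since the message says "clamped float" without giving bounds.

```python
import math


def score_plain_text(stdout, exit_code):
    """Hypothetical sketch: map plain-text stdout and exit code to a score."""
    text = stdout.strip().lower()
    # Exit-code convention, used for empty and unrecognised output.
    exit_score = 1.0 if exit_code == 0 else 0.0

    if text == "":
        return exit_score
    if text in ("true", "pass", "1"):
        return 1.0
    if text in ("false", "fail", "0"):
        return 0.0

    try:
        value = float(text)
    except ValueError:
        # "anything else" falls back to the exit code.
        return exit_score
    if math.isnan(value):
        # Assumption: NaN is treated as unrecognised output.
        return exit_score
    # Assumption: scores are clamped to the range [0, 1].
    return min(1.0, max(0.0, value))
```

Under these assumptions `echo 0.75` scores 0.75, and a silent `[ "$pages" -ge 5 ]` scores 1 or 0 purely from its exit code.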

christso and others added 2 commits May 2, 2026 15:42

fix: resolve lint errors in shell grader and targets-validator imports

6c63a36

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso mentioned this pull request May 3, 2026

feat: extend code-grader to accept plain-text and exit-code output #1210

Open

christso and others added 2 commits May 3, 2026 06:21

style: fix biome formatting in code-grader

54b2032

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso changed the title ~~feat: shell grader + workspace env preflight checks~~ feat: code-grader plain-text fallback + workspace env preflight May 3, 2026

christso wants to merge 4 commits into main from feat/1207-1208-shell-grader-preflight
