Skip to content

feat: code-grader plain-text fallback + workspace env preflight#1209

Open
christso wants to merge 4 commits intomainfrom
feat/1207-1208-shell-grader-preflight
Open

feat: code-grader plain-text fallback + workspace env preflight#1209
christso wants to merge 4 commits intomainfrom
feat/1207-1208-shell-grader-preflight

Conversation

@christso
Copy link
Copy Markdown
Collaborator

@christso christso commented May 2, 2026

Summary

Implements two new eval features from issues #1207 and #1208.

type: shell grader (#1207)

Runs a shell command and checks its stdout/exit code:

assertions:
  # Pass if exit code is 0
  - type: shell
    command: "pdfinfo report.pdf | grep Pages"

  # Exact string match
  - type: shell
    command: "echo 42"
    expected: "42"

  # Numeric comparison (>, <, >=, <=, ==, !=)
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"

The command runs in the workspace directory when available.

workspace.env preflight checks (#1208)

Declares required system dependencies checked once before before_all hooks run. Fails fast with a clear diagnostic listing all missing items:

workspace:
  env:
    required_commands: [ffmpeg, pandoc, wkhtmltopdf]
    required_python_modules: [PIL, openai]

Error message on failure:

Preflight checks failed — missing dependencies:
  • command: ffmpeg
  • python module: PIL

Install the missing dependencies before running this eval.

Red/Green UAT Evidence

Red (before): type: shell in assertions → Unknown grader type "shell" error

Green (after):

1/4   ✅ test-exit-code       | 100% PASS  (exit code 0)
2/4   ✅ test-exact-match     | 100% PASS  (stdout "42" equals expected "42")
3/4   ✅ test-numeric-gte     | 100% PASS  (14 >= 5)
4/4   ⚠️ test-numeric-fail    | 0% FAIL    (3 >= 10 failed)

Preflight Red: eval with nonexistent_command_xyz_abc → immediate setup error with clear message before any test runs

Preflight Green: eval with bash, ls → preflight passes, eval proceeds normally

Test plan

  • 14 unit tests for ShellGrader (all operator variants, error cases)
  • 5 unit tests for shell grader parsing in grader-parser
  • All 2324 tests pass (build + typecheck + lint + test)
  • Manual e2e: shell grader with all 4 test modes ✅
  • Manual e2e: preflight fails fast on missing command ✅
  • Manual e2e: preflight passes on valid commands ✅

Closes #1207
Closes #1208

🤖 Generated with Claude Code

christso and others added 2 commits May 2, 2026 15:42
Adds two new eval features:

**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison

**Workspace env preflight** (`workspace.env`): declares required system
dependencies that are checked once before before_all hooks run. Fails fast
with a clear diagnostic listing all missing commands/modules.

Example:
```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]
assertions:
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"
```

Closes #1207, #1208

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 2, 2026

Deploying agentv with  Cloudflare Pages  Cloudflare Pages

Latest commit: 54b2032
Status: ✅  Deploy successful!
Preview URL: https://6e020390.agentv.pages.dev
Branch Preview URL: https://feat-1207-1208-shell-grader.agentv.pages.dev

View logs

christso and others added 2 commits May 3, 2026 06:21
…1210)

Per design review: the `shell` grader type violated the "audit existing primitives
first" principle — `code-grader` already runs shell commands. Promptfoo solves this
the same way (javascript/python fallbacks, no dedicated shell type).

Remove the `shell` grader type entirely and instead extend `code-grader` to accept
plain-text stdout without requiring the JSON protocol:

| stdout (trimmed, case-insensitive) | score |
|---|---|
| empty string | 1 if exit 0, 0 if exit non-zero |
| "true", "pass", "1" | 1 |
| "false", "fail", "0" | 0 |
| numeric string | clamped float |
| anything else | 1 if exit 0, 0 if exit non-zero |

Scripts that write to stderr on non-zero exit still surface as errors (existing
behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention.

Usage:
  # numeric comparison via exit code
  - type: code-grader
    command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"]

  # score from stdout
  - type: code-grader
    command: ["bash", "-c", "echo 0.75"]

Closes #1210
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@christso christso changed the title feat: shell grader + workspace env preflight checks feat: code-grader plain-text fallback + workspace env preflight May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[EVAL.yaml] Feature: Environment preflight checks for E2E evals [EVAL.yaml] Feature: Numeric comparison operators for shell assertions

1 participant