feat: code-grader plain-text fallback + workspace env preflight#1209
Open
feat: code-grader plain-text fallback + workspace env preflight#1209
Conversation
Adds two new eval features:
**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison
**Workspace env preflight** (`workspace.env`): declares required system
dependencies that are checked once before before_all hooks run. Fails fast
with a clear diagnostic listing all missing commands/modules.
Example:
```yaml
workspace:
env:
required_commands: [ffmpeg, pandoc]
required_python_modules: [PIL, openai]
assertions:
- type: shell
command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
operator: ">="
expected: "5"
```
Closes #1207, #1208
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deploying agentv with
|
| Latest commit: |
54b2032
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://6e020390.agentv.pages.dev |
| Branch Preview URL: | https://feat-1207-1208-shell-grader.agentv.pages.dev |
…1210) Per design review: the `shell` grader type violated the "audit existing primitives first" principle — `code-grader` already runs shell commands. Promptfoo solves this the same way (javascript/python fallbacks, no dedicated shell type). Remove the `shell` grader type entirely and instead extend `code-grader` to accept plain-text stdout without requiring the JSON protocol: | stdout (trimmed, case-insensitive) | score | |---|---| | empty string | 1 if exit 0, 0 if exit non-zero | | "true", "pass", "1" | 1 | | "false", "fail", "0" | 0 | | numeric string | clamped float | | anything else | 1 if exit 0, 0 if exit non-zero | Scripts that write to stderr on non-zero exit still surface as errors (existing behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention. Usage: # numeric comparison via exit code - type: code-grader command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"] # score from stdout - type: code-grader command: ["bash", "-c", "echo 0.75"] Closes #1210 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements two new eval features from issues #1207 and #1208.
type: shellgrader (#1207)Runs a shell command and checks its stdout/exit code:
The command runs in the workspace directory when available.
workspace.envpreflight checks (#1208)Declares required system dependencies checked once before
before_allhooks run. Fails fast with a clear diagnostic listing all missing items:Error message on failure:
Red/Green UAT Evidence
Red (before):
type: shellin assertions →Unknown grader type "shell"errorGreen (after):
Preflight Red: eval with
nonexistent_command_xyz_abc→ immediate setup error with clear message before any test runsPreflight Green: eval with
bash,ls→ preflight passes, eval proceeds normallyTest plan
ShellGrader(all operator variants, error cases)grader-parserCloses #1207
Closes #1208
🤖 Generated with Claude Code