fix: strip markdown code fences from LLM judge responses #1

Merged
himmi-01 merged 1 commit into Corbell-AI:main from cwilson613:fix/strip-json-code-fences
May 2, 2026

Conversation

@cwilson613
Contributor

Summary

  • Anthropic models (via litellm) wrap JSON responses in ```json ... ``` code blocks even when response_format=json_object is set
  • This causes json.loads() to fail, scoring every evaluation as 0/100
  • Fixed in both runner.py (benchmark scoring) and asset_generator.py (eval generation)

What changed

Added a _strip_code_fences() helper in runner.py and inline fence stripping in asset_generator.py. Both extract JSON from markdown code blocks before parsing. Normal JSON responses (without fences) pass through unchanged.
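
For reference, a minimal sketch of what that helper can look like, built around the regex quoted later in this thread — the exact signature and placement in runner.py are assumptions:

import re

def _strip_code_fences(content: str) -> str:
    # Some providers (e.g. Anthropic via litellm) wrap JSON in
    # ```json ... ``` fences; extract the payload if one is present.
    if content and "```" in content:
        match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", content, re.DOTALL)
        if match:
            return match.group(1).strip()
    # Responses without fences pass through unchanged.
    return content

With this in place, json.loads(_strip_code_fences(response)) parses both fenced and plain responses.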

How I found this

While running EvalMonkey with EVAL_MODEL=anthropic/claude-haiku-4-5 against a custom agent, every benchmark scored 0/100 with the error Expecting value: line 1 column 1 (char 0). The raw LLM response was:

```json
{"score": 90, "reasoning": "..."}
```
extra text here

After this fix, benchmarks score correctly.

Test plan

  • All 46 existing tests pass (python -m pytest tests/ -v)
  • Verified with Anthropic claude-haiku-4-5 as eval judge — scores now match expected behavior
  • Normal JSON responses (OpenAI, no fences) still parse correctly
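
As a sketch of what the fenced and unfenced cases look like in a unit test (hypothetical test, not from this PR; the import path is assumed):

import json
from runner import _strip_code_fences  # import path assumed

def test_strip_code_fences_fenced_and_plain():
    fenced = '```json\n{"score": 90, "reasoning": "ok"}\n```'
    plain = '{"score": 90, "reasoning": "ok"}'
    # Fenced: the payload is extracted and parses cleanly.
    assert json.loads(_strip_code_fences(fenced))["score"] == 90
    # Plain: content passes through unchanged.
    assert _strip_code_fences(plain) == plain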

🤖 Generated with Claude Code (https://claude.com/claude-code)

Some providers (notably Anthropic via litellm) wrap JSON responses
in ```json ... ``` markdown code blocks even when response_format=
json_object is requested. This causes json.loads() to fail with a
parse error, scoring every evaluation as 0/100.

Fixed in both scoring paths:
- runner.py (LLMJudgeProvider.score_run) — benchmark scoring
- asset_generator.py (generate_improvement_evals) — eval generation

The fix extracts JSON from code fences before parsing. Normal JSON
responses (without fences) pass through unchanged.

Tested with Anthropic claude-haiku-4-5 as eval judge — confirmed
scoring works correctly after this fix (was 0/100 on all benchmarks
before, now scores match expected behavior).
@himmi-01
Contributor

himmi-01 commented May 2, 2026

Thanks @cwilson613.

Maybe in the future we can remove the duplication of:

if content and "```" in content:
    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", content, re.DOTALL)
    if match:
        content = match.group(1).strip()
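
For example, both call sites could import one shared helper — a sketch only; the module name and final shape are hypothetical:

# shared helper, e.g. in a common utils module (name hypothetical)
import re

_FENCE_RE = re.compile(r"```(?:json)?\s*\n?(.*?)\n?\s*```", re.DOTALL)

def strip_code_fences(content: str) -> str:
    # Used by both runner.py and asset_generator.py so the
    # fence-stripping logic lives in one place.
    if content and "```" in content:
        match = _FENCE_RE.search(content)
        if match:
            return match.group(1).strip()
    return content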

himmi-01 merged commit 82a5925 into Corbell-AI:main May 2, 2026
1 check passed