fix: strip markdown code fences from LLM judge responses #1

Merged
himmi-01 merged 1 commit into Corbell-AI:main from cwilson613:fix/strip-json-code-fences
May 2, 2026

Conversation

@cwilson613
Contributor

Summary

  • Anthropic models (via litellm) wrap JSON responses in ```json ... ``` code blocks even when response_format=json_object is set
  • This causes json.loads() to fail, scoring every evaluation as 0/100
  • Fixed in both runner.py (benchmark scoring) and asset_generator.py (eval generation)

What changed

Added a _strip_code_fences() helper in runner.py and inline fence stripping in asset_generator.py. Both extract JSON from markdown code blocks before parsing. Normal JSON responses (without fences) pass through unchanged.
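
For reference, a minimal sketch of what that helper can look like, built around the regex quoted later in this thread — the exact signature and placement in runner.py are assumptions:

import re

def _strip_code_fences(content: str) -> str:
    # Some providers (e.g. Anthropic via litellm) wrap JSON in
    # ```json ... ``` fences; extract the payload if one is present.
    if content and "```" in content:
        match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", content, re.DOTALL)
        if match:
            return match.group(1).strip()
    # Responses without fences pass through unchanged.
    return content

With this in place, json.loads(_strip_code_fences(response)) parses both fenced and plain responses.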

How I found this

While running EvalMonkey with EVAL_MODEL=anthropic/claude-haiku-4-5 against a custom agent, every benchmark scored 0/100 with the error Expecting value: line 1 column 1 (char 0). The raw LLM response was:

```json
{"score": 90, "reasoning": "..."}
```
extra text here

After this fix, benchmarks score correctly.

Test plan

  • All 46 existing tests pass (python -m pytest tests/ -v)
  • Verified with Anthropic claude-haiku-4-5 as eval judge — scores now match expected behavior
  • Normal JSON responses (OpenAI, no fences) still parse correctly
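
As a sketch of what the fenced and unfenced cases look like in a unit test (hypothetical test, not from this PR; the import path is assumed):

import json
from runner import _strip_code_fences  # import path assumed

def test_strip_code_fences_fenced_and_plain():
    fenced = '```json\n{"score": 90, "reasoning": "ok"}\n```'
    plain = '{"score": 90, "reasoning": "ok"}'
    # Fenced: the payload is extracted and parses cleanly.
    assert json.loads(_strip_code_fences(fenced))["score"] == 90
    # Plain: content passes through unchanged.
    assert _strip_code_fences(plain) == plain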

🤖 Generated with Claude Code (https://claude.com/claude-code)

Some providers (notably Anthropic via litellm) wrap JSON responses
in ```json ... ``` markdown code blocks even when response_format=
json_object is requested. This causes json.loads() to fail with a
parse error, scoring every evaluation as 0/100.

Fixed in both scoring paths:
- runner.py (LLMJudgeProvider.score_run) — benchmark scoring
- asset_generator.py (generate_improvement_evals) — eval generation

The fix extracts JSON from code fences before parsing. Normal JSON
responses (without fences) pass through unchanged.

Tested with Anthropic claude-haiku-4-5 as eval judge — confirmed
scoring works correctly after this fix (was 0/100 on all benchmarks
before, now scores match expected behavior).
@himmi-01
Contributor

himmi-01 commented May 2, 2026

Thanks @cwilson613.

Maybe in the future we can remove the duplication of:

if content and "```" in content:
    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", content, re.DOTALL)
    if match:
        content = match.group(1).strip()
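
For example, both call sites could import one shared helper — a sketch only; the module name and final shape are hypothetical:

# shared helper, e.g. in a common utils module (name hypothetical)
import re

_FENCE_RE = re.compile(r"```(?:json)?\s*\n?(.*?)\n?\s*```", re.DOTALL)

def strip_code_fences(content: str) -> str:
    # Used by both runner.py and asset_generator.py so the
    # fence-stripping logic lives in one place.
    if content and "```" in content:
        match = _FENCE_RE.search(content)
        if match:
            return match.group(1).strip()
    return content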

himmi-01 merged commit 82a5925 into Corbell-AI:main May 2, 2026
1 check passed