fix: strip markdown code fences from LLM judge responses #1
Merged
himmi-01 merged 1 commit into Corbell-AI:main on May 2, 2026
Conversation
Some providers (notably Anthropic via litellm) wrap JSON responses in ```json ... ``` markdown code blocks even when `response_format=json_object` is requested. This causes `json.loads()` to fail with a parse error, scoring every evaluation as 0/100.

Fixed in both scoring paths:

- `runner.py` (`LLMJudgeProvider.score_run`): benchmark scoring
- `asset_generator.py` (`generate_improvement_evals`): eval generation

The fix extracts JSON from code fences before parsing. Normal JSON responses (without fences) pass through unchanged.

Tested with Anthropic claude-haiku-4-5 as the eval judge: scoring works correctly after this fix (was 0/100 on all benchmarks before; scores now match expected behavior).
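A minimal sketch of the fence-stripping approach, assuming a regex-based helper. The name `_strip_code_fences` comes from the PR; the regex and surrounding details are illustrative, not the merged code:

```python
import json
import re

# Matches an opening ``` fence (with an optional "json" language tag) and a
# closing ``` fence, capturing the payload between them. DOTALL lets the
# payload span multiple lines.
_FENCE_RE = re.compile(r"^\s*```(?:json)?\s*(.*?)\s*```\s*$", re.DOTALL)

def _strip_code_fences(text: str) -> str:
    """Return the payload inside a markdown code fence, if one is present.

    Responses without fences come back unchanged, so providers that honor
    response_format=json_object are unaffected.
    """
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text

# Fenced and bare responses now parse identically:
assert json.loads(_strip_code_fences('```json\n{"score": 87}\n```')) == {"score": 87}
assert json.loads(_strip_code_fences('{"score": 87}')) == {"score": 87}
```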
Contributor
Thanks @cwilson613. Maybe in the future we can remove the duplication of the fence-stripping logic between the two files.
Summary
- Some providers wrap JSON responses in ```json ... ``` code blocks even when `response_format=json_object` is set
- This causes `json.loads()` to fail, scoring every evaluation as 0/100
- Fixed in `runner.py` (benchmark scoring) and `asset_generator.py` (eval generation)

What changed
Added a `_strip_code_fences()` helper in `runner.py` and inline fence stripping in `asset_generator.py`. Both extract JSON from markdown code blocks before parsing. Normal JSON responses (without fences) pass through unchanged.
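As a sketch of how the two call sites might use the helper (the wrapper function below is an assumption; only `_strip_code_fences` and the file names come from the PR):

```python
import json

def parse_judge_json(raw: str) -> dict:
    # Strip any markdown fence before parsing; bare JSON passes through
    # _strip_code_fences() unchanged, so well-behaved providers see no
    # difference in behavior.
    return json.loads(_strip_code_fences(raw))
```

This relies on the `_strip_code_fences` sketch shown earlier.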
How I found this

Running EvalMonkey with `EVAL_MODEL=anthropic/claude-haiku-4-5` against a custom agent, every benchmark scored 0/100 with the error `Expecting value: line 1 column 1 (char 0)`. The raw LLM response was wrapped in a markdown code fence rather than being bare JSON.
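A minimal reproduction of that error; the JSON payload is made up for illustration, not the judge's actual output:

```python
import json

# A fenced response like the one Anthropic returns via litellm.
raw = '```json\n{"score": 87}\n```'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```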