fix(prompts): auto-append JUDGE_CONFIDENCE directive in buildJudgeSys…#4

Merged
marceloceccon merged 1 commit into main from fix/judge-confidence-contract
May 26, 2026

@marceloceccon

`extractJudgeConfidence` requires a trailing `JUDGE_CONFIDENCE: N` line and silently defaults to 50 when that line is absent. `buildJudgeSystemPrompt` was passing custom judge prompts through untouched, unlike `buildParticipantSystemPrompt`, which auto-appends the matching `CONFIDENCE: N` directive. As a result, any caller overriding the default `JUDGE_PERSONA` prompt silently received 50 on every run: a measurement-shaped value that polluted downstream statistics.
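The failure mode can be illustrated with a minimal sketch. The function below is a hypothetical stand-in, not the library's actual `extractJudgeConfidence`; the regular expression and the `DEFAULT_CONFIDENCE` name are assumptions made for illustration. It shows how a missing trailing line turns into a plausible-looking 50 with no error raised.

```typescript
// Hypothetical sketch; the real extractJudgeConfidence may differ in detail.
const DEFAULT_CONFIDENCE = 50;

function extractJudgeConfidenceSketch(output: string): number {
  // Accept only a final line of the form "JUDGE_CONFIDENCE: N".
  const match = output
    .trimEnd()
    .match(/(?:^|\n)JUDGE_CONFIDENCE:\s*(\d{1,3})\s*$/);
  // No trailing line: fall back silently instead of signalling a problem.
  if (!match) return DEFAULT_CONFIDENCE;
  return Number(match[1]);
}

// The judge was told to emit the line, so the reported value is real.
const reported = extractJudgeConfidenceSketch(
  "Verdict: B.\nJUDGE_CONFIDENCE: 87",
); // 87

// A custom prompt never asked for the line, so the value is the fallback.
const fallback = extractJudgeConfidenceSketch("Verdict: B."); // 50
```

From the caller's side the two results are indistinguishable in type and range, which is why the problem went unnoticed until the values were aggregated.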

This change makes `buildJudgeSystemPrompt` mirror the participant builder: it idempotently appends the directive, skipping the append when the input already contains the marker, so that neither `JUDGE_PERSONA`'s inline directive nor one supplied by a diligent custom caller is duplicated.
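A minimal sketch of the idempotent append described above, under stated assumptions: the directive wording, the 0 to 100 range, and the names `JUDGE_CONFIDENCE_MARKER`, `JUDGE_CONFIDENCE_DIRECTIVE`, and `buildJudgeSystemPromptSketch` are hypothetical, and the real builder may take additional arguments.

```typescript
// Hypothetical marker and directive text; the library's wording is not shown in this PR.
const JUDGE_CONFIDENCE_MARKER = "JUDGE_CONFIDENCE:";
const JUDGE_CONFIDENCE_DIRECTIVE =
  "End your response with a final line of the form `JUDGE_CONFIDENCE: N`, " +
  "where N is an integer from 0 to 100.";

function buildJudgeSystemPromptSketch(prompt: string): string {
  // Skip when the marker is already present, so the directive is never duplicated.
  if (prompt.includes(JUDGE_CONFIDENCE_MARKER)) return prompt;
  // Otherwise append the directive as its own trailing paragraph.
  return `${prompt.trimEnd()}\n\n${JUDGE_CONFIDENCE_DIRECTIVE}`;
}

const customPrompt = "You are a strict reviewer. Pick the best answer.";
const builtOnce = buildJudgeSystemPromptSketch(customPrompt);
const builtTwice = buildJudgeSystemPromptSketch(builtOnce);
// builtOnce === builtTwice, because the appended directive itself contains the marker.
```

Because the appended directive contains the marker it checks for, running the builder on its own output is a no-op, which is what makes the append safe for prompts that already include the directive.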

Discovered by a 12-run bench in ai-consensus-mcp where judge confidence was reported as exactly 50.0 ± 0.0 across every run.
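The symptom above suggests a cheap sanity check for bench output. The sketch below is an illustration, not code from ai-consensus-mcp: `summarizeConfidence` and the `looksLikeFallback` heuristic are hypothetical, and the fallback value of 50 is taken from the description above.

```typescript
// Hypothetical helper: population mean and standard deviation of confidence values.
function summarizeConfidence(values: number[]): { mean: number; stdDev: number } {
  if (values.length === 0) throw new Error("no values to summarize");
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Twelve runs that all fell back to the default, as in the bench described above.
const benchRuns: number[] = new Array<number>(12).fill(50);
const benchSummary = summarizeConfidence(benchRuns);

// Zero spread at exactly the default value points to a fallback, not a measurement.
const looksLikeFallback = benchSummary.stdDev === 0 && benchSummary.mean === 50;
```

A real judge would be expected to vary at least slightly across runs, so zero variance at the default value is a strong hint that the value is not being measured at all.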

No public API change. `buildJudgeSystemPrompt`'s output is longer when the input prompt lacks the marker; callers that snapshot-test that output need to regenerate snapshots.

…temPrompt (0.11.1)

extractJudgeConfidence requires a trailing `JUDGE_CONFIDENCE: N` line and
silently defaults to 50 when absent. buildJudgeSystemPrompt was passing
custom judge prompts through untouched, unlike buildParticipantSystemPrompt
which auto-appends the matching `CONFIDENCE: N` directive. Any caller
overriding the default JUDGE_PERSONA prompt silently received 50 on every
run — a measurement-shaped value that polluted downstream statistics.

This change makes buildJudgeSystemPrompt mirror the participant builder:
idempotently append the directive, skipping when the input already contains
the marker (so JUDGE_PERSONA's inline directive — and any diligent custom
caller — is not duplicated).

Discovered by a 12-run bench in ai-consensus-mcp where judge confidence
was reported as exactly 50.0 ± 0.0 across every run.

No public API change. buildJudgeSystemPrompt's output is longer when the
input prompt lacks the marker; callers that snapshot-test that output need
to regenerate snapshots.