Merged
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,22 @@
All notable changes to `ai-consensus-core` will be documented here.
Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), [SemVer](https://semver.org/spec/v2.0.0.html).

## [0.11.1] — 2026-05-25

### Fixed — judge-confidence parser contract

`buildJudgeSystemPrompt` now idempotently appends the `JUDGE_CONFIDENCE: [number 0-100]` directive, mirroring the `CONFIDENCE: [number 0-100]` handshake that `buildParticipantSystemPrompt` has always emitted. Previously, any caller that supplied a custom `ConsensusOptions.judge.systemPrompt` (instead of relying on `JUDGE_PERSONA.systemPrompt`, which carries the directive inline) silently broke the parser contract: the model was never told to emit the marker, so `extractJudgeConfidence` would not find it, would fall through to its default of 50, and would return a measurement-shaped value that polluted downstream statistics.
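
To make the failure mode concrete, here is a minimal sketch of a parser with a silent default. It is not the real `extractJudgeConfidence`, whose implementation is not part of this change; the function name, the regular expression, and the clamping are assumptions made purely for illustration.

```typescript
// Hypothetical stand-in for the real parser. Only the "default of 50 when the
// marker is missing" behaviour is taken from the changelog entry above.
const DEFAULT_JUDGE_CONFIDENCE = 50;

function extractJudgeConfidenceSketch(response: string): number {
  const match = /JUDGE_CONFIDENCE:\s*(\d{1,3})/i.exec(response);
  if (match === null) {
    // Silent fallback: the caller cannot tell this apart from a real 50.
    return DEFAULT_JUDGE_CONFIDENCE;
  }
  // Clamp to the documented 0-100 range.
  return Math.min(100, Math.max(0, Number(match[1])));
}

const withMarker = extractJudgeConfidenceSketch(
  "Option B is stronger.\nJUDGE_CONFIDENCE: 82",
);
const withoutMarker = extractJudgeConfidenceSketch(
  "Option B is stronger. I am fairly confident.",
);
console.log(withMarker, withoutMarker);
```

In this sketch the second response yields the default even though the model expressed confidence in prose, which is why the directive has to be present in the prompt.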

The bug was discovered by a 12-run benchmark in `ai-consensus-mcp`, where judge confidence was reported as exactly `μ=50.0, σ=0.0` across every run: the unmistakable fingerprint of the silent default. Every panel in that repo overrode `judgeSystemPrompt`, and none re-emitted the marker.
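
A benchmark harness could flag that fingerprint automatically. The sketch below is one possible check, not code from either repository: `summarize`, `looksLikeSilentDefault`, and the sample values in `plausible` are all invented for illustration.

```typescript
// Population mean and standard deviation of a non-empty sample.
function summarize(values: number[]): { mean: number; stdDev: number } {
  if (values.length === 0) {
    throw new Error("cannot summarize an empty sample");
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Hypothetical guard: several runs, zero spread, mean equal to the default.
function looksLikeSilentDefault(values: number[], defaultValue = 50): boolean {
  if (values.length < 2) {
    return false;
  }
  const { mean, stdDev } = summarize(values);
  return stdDev === 0 && mean === defaultValue;
}

// Twelve runs that all fell through to the default, as in the benchmark.
const suspicious = Array.from({ length: 12 }, () => 50);
// Made-up values standing in for real model-emitted confidences.
const plausible = [72, 85, 64, 90, 78, 81, 69, 88, 75, 83, 70, 86];

console.log(looksLikeSilentDefault(suspicious), looksLikeSilentDefault(plausible));
```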

- `buildJudgeSystemPrompt` auto-appends the directive when it is not already present in the supplied prompt
- The idempotency check is a case-insensitive substring match on `JUDGE_CONFIDENCE`, so `JUDGE_PERSONA`'s inline directive (and that of any diligent custom caller) is not duplicated
- New contract tests in `prompts.test.ts` mirror the existing participant-side test and fail loudly if a future edit breaks the handshake
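
The behaviour in the list above can be exercised in isolation. The sketch below copies the builder logic from the `src/prompts.ts` change in this release under a renamed function so that it runs standalone; `buildJudgeSystemPromptSketch`, `countMarkers`, and the sample prompts are illustrative names and inputs, not part of the package.

```typescript
// Same directive text as the one added in src/prompts.ts.
const JUDGE_CONFIDENCE_DIRECTIVE = `

IMPORTANT: End your response with a line in exactly this format:
JUDGE_CONFIDENCE: [number 0-100]`;

// Standalone copy of the builder logic, renamed to mark it as a sketch.
function buildJudgeSystemPromptSketch(params: {
  judgeSystemPrompt: string;
  question: string;
}): string {
  const base = `${params.judgeSystemPrompt}

The original prompt that was debated was:
"""
${params.question}
"""`;
  // Append only when the token is absent (case-insensitive substring check).
  return /JUDGE_CONFIDENCE/i.test(base) ? base : `${base}${JUDGE_CONFIDENCE_DIRECTIVE}`;
}

// Illustrative helper: how many times the token appears in a prompt.
const countMarkers = (s: string): number => (s.match(/JUDGE_CONFIDENCE/gi) ?? []).length;

// A custom prompt without the marker gets the directive appended.
const custom = buildJudgeSystemPromptSketch({
  judgeSystemPrompt: "You are the architecture judge. Pick one option.",
  question: "Q",
});

// A prompt that already mentions the token (in any case) is left alone.
const diligent = buildJudgeSystemPromptSketch({
  judgeSystemPrompt: "Judge the debate. End with judge_confidence: [number 0-100]",
  question: "Q",
});

console.log(countMarkers(custom), countMarkers(diligent));
```

Note that the check in this sketch, as in the diff, runs over the assembled string, so a debated question that itself contains the token would also suppress the append.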

### Backward compatibility

No public API change. The only observable difference is that `buildJudgeSystemPrompt`'s output string is longer when the input prompt lacks the marker. Callers that snapshot-test that output will need to regenerate snapshots. Callers that relied on the previous silent-50 behaviour will now see the real model-emitted value (which is the documented intent).

## [0.11.0] — 2026-04-30

### Added — tool calling
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
  "name": "ai-consensus-core",
- "version": "0.11.0",
+ "version": "0.11.1",
  "description": "Dependency-light TypeScript implementation of the Consensus Validation Protocol (CVP): multi-model debate with confidence-weighted scoring, disagreement detection, and optional judge synthesis.",
  "keywords": [
    "consensus",
37 changes: 37 additions & 0 deletions src/__tests__/prompts.test.ts
@@ -193,6 +193,43 @@ describe("buildJudgeSystemPrompt", () => {
    });
    expect(out).toContain(JUDGE_PERSONA.systemPrompt);
  });

  it("always ends with the JUDGE_CONFIDENCE marker directive (parser contract)", () => {
    // This is the handshake between prompt and parser. If a custom
    // judgeSystemPrompt lacks the directive, extractJudgeConfidence silently
    // returns its 50 default — a measurement-shaped value that pollutes
    // statistics rather than surfacing as missing data.
    const out = buildJudgeSystemPrompt({
      judgeSystemPrompt:
        "You are synthesising a debate. Produce a report. State your confidence.",
      question: "Q",
    });
    expect(out).toMatch(/JUDGE_CONFIDENCE: \[number 0-100\]\s*$/);
  });

  it("does not duplicate the directive when the input already mentions JUDGE_CONFIDENCE", () => {
    // JUDGE_PERSONA.systemPrompt has its own inline JUDGE_CONFIDENCE
    // directive. Diligent callers may add one too. In both cases the
    // builder must be idempotent.
    const out = buildJudgeSystemPrompt({
      judgeSystemPrompt: JUDGE_PERSONA.systemPrompt,
      question: "Q",
    });
    const matches = out.match(/JUDGE_CONFIDENCE/gi) ?? [];
    expect(matches.length).toBe(1);
  });

  it("appends the directive verbatim to a custom prompt that lacks it", () => {
    const customPrompt = "You are the architecture judge. Pick one option.";
    const out = buildJudgeSystemPrompt({
      judgeSystemPrompt: customPrompt,
      question: "Q",
    });
    expect(out).toContain(customPrompt);
    expect(out).toContain(
      "IMPORTANT: End your response with a line in exactly this format:\nJUDGE_CONFIDENCE: [number 0-100]",
    );
  });
});

describe("buildJudgeUserPrompt", () => {
19 changes: 18 additions & 1 deletion src/prompts.ts
@@ -100,21 +100,38 @@ IMPORTANT: End your response with a line in exactly this format:
CONFIDENCE: [number 0-100]`;
}

/**
 * Trailing directive that pins the judge's output to the parser contract
 * (`extractJudgeConfidence` looks for this exact token). Mirrors the
 * `CONFIDENCE: [number 0-100]` handshake on the participant side.
 */
const JUDGE_CONFIDENCE_DIRECTIVE = `

IMPORTANT: End your response with a line in exactly this format:
JUDGE_CONFIDENCE: [number 0-100]`;

/**
 * Build the judge's system prompt. We append the original user prompt to
 * the JUDGE_PERSONA's instructions so the model knows what was debated,
 * without having to infer it from participant text.
 *
 * Idempotently appends the `JUDGE_CONFIDENCE: [number 0-100]` directive so
 * `parser.extractJudgeConfidence` always finds a real value to parse rather
 * than silently returning its 50 default. If the caller's prompt already
 * contains a `JUDGE_CONFIDENCE` mention (as `JUDGE_PERSONA.systemPrompt`
 * does), the directive is not duplicated.
 */
export function buildJudgeSystemPrompt(params: {
  judgeSystemPrompt: string;
  question: string;
}): string {
- return `${params.judgeSystemPrompt}
+ const base = `${params.judgeSystemPrompt}

The original prompt that was debated was:
"""
${params.question}
"""`;
  return /JUDGE_CONFIDENCE/i.test(base) ? base : `${base}${JUDGE_CONFIDENCE_DIRECTIVE}`;
}

/**