perf: single-pass drift metric computation (#447) #464
Merged
Addresses #447 item 2b (drift).

_compute_drift built two lists (recent/baseline), then iterated each THREE times (rate(), tokens(), avg_len()), and the tokens pass re-ran the _extract_recommendation_tokens regex per entry.

Collapse to one pass: classify each entry into its window bucket and accumulate count, positive count, response-length sum, and the recommendation-token set inline, so each entry is visited once and its tokens extracted once. Metrics and the result dict are byte-identical (the existing drift tests pin them).

Deferred (same item): the time-windowed tail-read (read only the last N days instead of the whole capped log). It needs a chronological-ordering assumption on the JSONL log for a marginal CLI-only gain, so it's left out; the full read is already bounded by pythia.max_entries.

Tests: full oracle suite green (56), incl. the 8 drift tests that pin the acceptance-rate / jaccard / avg-length / count outputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses #447 item 2b (drift).
`_compute_drift` built two lists (recent/baseline), then iterated each three times (`rate()`, `tokens()`, `avg_len()`), and the tokens pass re-ran the `_extract_recommendation_tokens` regex per entry.

Collapse to one pass: classify each entry into its window bucket and accumulate count, positive count, response-length sum, and the recommendation-token set inline, so each entry is visited once and its tokens extracted once. Metrics and the result dict are byte-identical (the existing drift tests pin them).
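
For readers without the diff open, the single-pass shape described above might look roughly like the sketch below. It is not the code from this PR: the entry fields (`ts`, `accepted`, `response`), the `Bucket` class, the token regex, the result-dict keys, and the empty-set Jaccard convention are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-in for _extract_recommendation_tokens; the real regex
# is not shown in this PR.
_TOKEN_RE = re.compile(r"[a-z]{4,}")


def extract_tokens(text):
    """Return the set of lowercase word tokens of length >= 4 (assumed rule)."""
    return set(_TOKEN_RE.findall(text.lower()))


@dataclass
class Bucket:
    """Running totals for one window (recent or baseline)."""
    count: int = 0
    positive: int = 0
    length_sum: int = 0
    tokens: set = field(default_factory=set)

    def add(self, entry):
        # Everything the three old passes needed is accumulated here,
        # so the token regex runs exactly once per entry.
        self.count += 1
        self.positive += 1 if entry["accepted"] else 0
        self.length_sum += len(entry["response"])
        self.tokens |= extract_tokens(entry["response"])

    def rate(self):
        return self.positive / self.count if self.count else 0.0

    def avg_len(self):
        return self.length_sum / self.count if self.count else 0.0


def jaccard(a, b):
    """Jaccard similarity; two empty sets are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compute_drift(entries, cutoff):
    """Classify each entry into its window bucket in a single pass."""
    recent, baseline = Bucket(), Bucket()
    for entry in entries:
        (recent if entry["ts"] >= cutoff else baseline).add(entry)
    return {
        "recent_count": recent.count,
        "baseline_count": baseline.count,
        "recent_rate": recent.rate(),
        "baseline_rate": baseline.rate(),
        "recent_avg_len": recent.avg_len(),
        "baseline_avg_len": baseline.avg_len(),
        "token_jaccard": jaccard(recent.tokens, baseline.tokens),
    }
```

Keeping the derived values (`rate()`, `avg_len()`) as cheap functions of the accumulated totals is what lets the output stay unchanged while the per-entry work drops to one visit.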
Deferred (same item): the time-windowed tail-read needs a chronological-ordering assumption on the JSONL log for a marginal CLI-only gain; the full read is already bounded by `pythia.max_entries`.

Tests: full oracle suite green (56), incl. the 8 drift tests pinning the acceptance-rate / jaccard / avg-length / count outputs.
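
To illustrate why the deferred tail-read depends on chronological ordering, here is a small sketch under assumed names: `full_read` and `tail_read` are hypothetical helpers, and entries are reduced to a bare `ts` field. A reader that scans from the end of the log and stops at the first entry older than the cutoff matches the full filter only when the log is ordered by time.

```python
def full_read(entries, cutoff):
    """Filter the whole log; correct regardless of entry order."""
    return [e for e in entries if e["ts"] >= cutoff]


def tail_read(entries, cutoff):
    """Scan backwards and stop at the first entry older than the cutoff.

    This early exit is only valid if every earlier entry is older still,
    i.e. if the log is in chronological order.
    """
    recent = []
    for entry in reversed(entries):
        if entry["ts"] < cutoff:
            break
        recent.append(entry)
    recent.reverse()  # restore file order
    return recent


ordered = [{"ts": 1}, {"ts": 6}, {"ts": 7}]
out_of_order = [{"ts": 6}, {"ts": 1}, {"ts": 7}]

# On an ordered log both readers agree.
same = tail_read(ordered, 5) == full_read(ordered, 5)

# On an out-of-order log the tail read silently misses the ts=6 entry.
missed = [e for e in full_read(out_of_order, 5) if e not in tail_read(out_of_order, 5)]
```

With the full read already capped by `pythia.max_entries`, skipping this optimization avoids taking on that ordering assumption.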
🤖 Generated with Claude Code