fix(agg): reconcile scalar vs DAG aggregation (type-admission, STR/GUID, pearson, median, BOOL) + coverage#231
Merged
Conversation
Foundation checkpoint (full suite 3241/3243, 0 failed, ASan/UBSan clean):
- Coverage-boost: raised src coverage across io/csv, ops/{datalog,join,
fused_group,sort,string,builtins,opt,traverse}, lang/eval, table/sym,
mem/heap by extending the nearest existing RFL/C tests, with reachability
notes for OOM/defensive/dead-by-design lines.
- fix(collection): `(at table 'col)` now materializes RAY_IS_PARTED /
RAY_MAPCOMMON columns from a .db.parted.get table (was: type error on
direct column access).
- fix(agg): scalar ray_avg_fn/ray_sum_fn reject non-numeric (SYM/STR/GUID)
vectors instead of averaging/summing raw element ids (the first half of the
scalar<->DAG aggregation reconciliation).
- Corrected two false-positive tests the parted fix exposed (min/max of SYM
are defined lexicographically, not type errors) and documented the substr
null-vector path (no bug; an earlier SIGSEGV note was a mixed-ABI build
artifact, not exec_substr).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (M1)
The scalar aggregate builtins (ray_sum_fn/ray_avg_fn/...) and the DAG/group
executors had divergent type-admission rules, so a select-form aggregate could
silently fold raw symbol ids, string bytes, or absolute-temporal integers that
the scalar builtin rejects. Unify both on one canonical table via a shared
agg_type_admitted() helper:
sum : numeric (BOOL/U8/I16/I32/I64/F64) + TIME (duration-like); rejects
SYM/STR/GUID and the absolute temporals DATE/TIMESTAMP
prod : numeric only
avg/var/var_pop/stddev/stddev_pop : numeric + temporal -> F64; rejects
SYM/STR/GUID
min/max/first/last/count : any type
DAG-side guards added at both select agg-build sites (no-by and by-group) in
query.c, and at the direct-reduction dispatch in exec.c. The helper unwraps
RAY_PARTED columns to their base element type and only rejects element types it
positively recognises as inadmissible (deferring unknown wrappers such as
RAY_MAPCOMMON to the runtime) so a plan-time guard never false-rejects a parted
numeric column.
sum(TIME) now preserves the TIME type in every DAG path (no-by select,
by-group select, whole-table reduction), matching the scalar ray_sum_fn —
previously the DAG returned a raw I64. The affine/linear SUM fast paths leave
the input vector unmaterialised, so emit_agg_columns recovers the source type
from the aggregation input op when the vector is absent.
Tests: sum.rfl gains a scalar<->DAG parity section (type-admission + sum(TIME)
preservation across no-by and by-group); list_med_var.rfl and
agg/branch_coverage.rfl updated to assert sum(TIMESTAMP) is a type error
(canonical: absolute temporals are not summable, matching DuckDB).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wide element types — STR (a 16-byte pool-pointer+length) and GUID (16 raw
bytes) — overflow the 8-byte integer reduce accumulators, so every min / max /
first / last over such a column silently truncated the value to a single byte.
The scalar builtins (which route STR/GUID through the DAG reducer) returned
empty/garbage for min/max, and the DAG select path returned garbage for all
four; only the scalar first/last (via collection_elem) were correct, so scalar
and DAG diverged.
Resolve all of them by tracking the WINNING ROW INDEX instead of an 8-byte
value and materialising that element:
* Helpers in group.c: agg_is_wide_type, wide_elem_cmp (lexicographic memcmp
for STR, byte order for GUID), wide_winner_row (min/max by comparison,
first/last by position, skipping nulls), agg_wide_reduce.
* Whole-table / scalar path: exec_reduction intercepts wide min/max/first/
last and materialises the winner — this also fixes the scalar builtins,
which delegate STR/GUID to the DAG.
* By-group path: a new per-group kernel ray_wide_minmax_per_group_buf runs
through the existing holistic post-fill machinery (the OP_MEDIAN/TOP_N/
BOT_N plumbing). Wide min/max/first/last are flagged holistic in
ght_compute_layout (new agg_is_wide bit) so the radix and dense post-fill
sites dispatch the wide kernel and the row-layout emit skips them. The
dense attach now keys off the holistic bitmask rather than the op literal.
* No-by select: emit_agg_columns truncates wide aggs, so the scalar
n_keys==0 path overrides those columns post-emit via the winning-row
materialisation (new ray_table_set_col_idx helper swaps a column vector).
* The DA fast path is disqualified for wide aggs so they reach the holistic
post-fill.
The wide kernel runs serial: ray_str_vec_set COW-mutates a shared string pool,
unsafe to parallelise; wide min/max is a cold path, not a bench kernel.
The rowform specialisations (maxmin/median/sum_count) are already gated to
integer value columns, so STR/GUID never bypass the general exec_group path.
Tests: agg/{min,max,first,last}.rfl gain STR + GUID scalar<->DAG parity
(no-by and by-group); per_group_holistic.rfl adds a 6000-row / 200-group case
that exercises the radix post-fill with exact group-0 min/first/last values.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
exec_group_pearson_rowform (the dedicated single-pass per-group Pearson path)
finalised each group as r² = (cov/(σx·σy))² instead of the signed coefficient
r. Squaring collapsed -1 and +1 to the same 1.0, so any by-group correlation
silently lost its sign — a perfect negative correlation reported as +1.0. The
scalar pearson_corr builtin and the general radix HT finalize both return the
signed r; align the rowform path with them.
Two existing tests encoded the buggy behaviour and are corrected to assert the
signed coefficient:
* agg/pearson_corr.rfl squared r (pow(...,2)) to dodge the sign; now asserts
group A=+1.0 / group B=-1.0 directly, with the r² parity check kept.
* group/null_aware_helpers.rfl asserted row-0==1.0 with a comment noting "the
engine appears to compute |r| or r²"; now asserts the order-independent
{min,max} = {-1.0,+1.0} pair (group emit order is hash-dependent).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A no-by select aggregate routes through exec_group's n_keys==0 scalar fast
path, which has no accumulator for holistic aggregates and whose emitter
(emit_agg_columns) has no median case — so (select {m: (median v) from: T})
returned 0 while the scalar (med v) returned the real median. Recompute the
median column post-emit over the whole (optionally where-selected) column as a
single group via ray_median_per_group_buf, mirroring the by-group path.
top/bot are intentionally left to their scalar-builtin form here: the no-by
planner does not thread the K argument into the group op, so a no-by select
top/bot has no K to honour. Whole-table var/stddev already matched (they have
n_keys==0 accumulators), so only median needed the fill.
Two tests that asserted the buggy 0 (each with a comment stating the true
median) are corrected: agg/med.rfl gains no-by median parity (F64, I64-to-F64,
and where-selected), and query/query_dag_agg_coverage.rfl asserts 6.0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… DAG (M5)
reduction_i64_result (used by exec_reduction's min/max and first/last finalize)
had no RAY_BOOL case, so a BOOL reduction widened to i64. The scalar min/max
builtins delegate BOOL vectors to this path, so (min boolvec) returned an i64
0/1 while the DAG select path already returned a BOOL — every other element
type (U8/I16/I32/temporal/SYM) was preserved, BOOL alone was not. Add the
RAY_BOOL case so both paths return a BOOL, matching q (min 110b -> 0b) and
DuckDB (min(bool) -> bool).
Tests: agg/{min,max,first,last}.rfl gain BOOL type-preservation parity. The
"all-true -> 1 / any-true -> 0" idiom — (min boolvec) / (max boolvec) used as a
shard-independent boolean check — appeared across exec_worker_merge.rfl,
like_patterns.rfl and cross_type_workout.rfl asserting 0/1; those ~17 assertions
now assert false/true (the reduction still collapses the column to one boolean,
only the result type changed).
Other M5 audit items were confirmed NOT to be scalar<->DAG divergences and are
left as-is: prod has no scalar builtin (a feature gap, not a divergence); top/bot
reject a non-integer K by design; count(distinct) is exact (verified to 200k);
and an empty-table no-by aggregate uniformly yields an empty result in the DAG
(a consistent convention, distinct from the scalar vector-identity sum=0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ill loops Follow-up to the M2/M4 aggregation work: keep every type/operator decision at column granularity, with none inside the per-row loops. * wide_winner_row (STR/GUID min/max/first/last): the element type (STR vs GUID) and the operator are now resolved ONCE before any loop. first/last became pure positional scans (no value comparison); min/max run a single type-specialised compare loop with the direction (want_min) hoisted out. The former per-comparison wide_elem_cmp() — which branched on ->type on every call inside the loop — is gone. * no-by median fill: the selection branch (idx = sel[i] vs i) is hoisted into two straight loops instead of a per-row ternary. Behaviour is unchanged (full suite green under ASan, scalar<->DAG parity matrix intact); this only removes per-element dispatch, matching the convention of the integer reduce loops which template HAS_NULLS/HAS_IDX for the same reason. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reconciles every place where the scalar aggregate builtins (
sum/avg/min/max/med/…) and the DAG/group executor (theselect … by …path) disagreed on the result of an aggregation. Six audit mechanisms (M1–M5) were found where one path silently produced wrong or differently-typed results; all are fixed so the two paths now return identical values and types for the same input. A final perf pass keeps the changes off the hot loops.A
scalar ↔ DAGdifferential matrix (6 aggregates × 5 types, plus STR/GUID/temporal and rejection cases) now agrees end-to-end.The fixes
agg_type_admitted()shared by both paths:sum=numeric+TIME,avg/var/stddev=numeric+temporal→F64;SYM/STR/GUIDandsum(DATE/TIMESTAMP)→typeerror. Unwraps parted columns.sum(TIME)preserves the TIME type in every DAG path.min/max/first/lasttruncated to 1 byte (8-byte accumulators)ray_wide_minmax_per_group_bufwired through the existing holistic post-fill (radix + dense), plus a no-by override. Lexicographic for STR, byte-order for GUID.pearson_corrreturned r² →-1and+1both collapsed to1.0r(matching the scalar builtin and the radix finalize).(select {median})returned0instead of the medianray_median_per_group_bufas a single group (top/bot left to their scalar form — the no-by planner carries no K).min/max/first/lastover BOOL widened to i64 while the DAG kept boolreduction_i64_resultgained theRAY_BOOLcase → both return BOOL (matches qmin 110b → 0band DuckDB).A final commit (
perf(agg)) hoists all type/operator dispatch out of the wide-reduce and median-fill loops — every decision is made once per column, none per row — matching the convention of the integer reduce loops.Audit items confirmed NOT to be divergences (left as-is)
prodhas no scalar builtin (a feature gap, not a divergence);top/botreject a non-integer K by design;count(distinct)is exact (verified to 200k); an empty-table no-by aggregate uniformly yields an empty result in the DAG (a consistent convention, distinct from the scalar vector-identitysum=0).Tests
Regression tests added under each fix (
agg/{sum,min,max,first,last,med,pearson_corr,per_group_holistic,list_med_var}.rfl), including a 6000-row / 200-group case exercising the radix post-fill. Several tests that had encoded the buggy behaviour (squaring pearson to dodge the sign; asserting0for no-by median with a comment stating the true value;min boolvec → 1) are corrected to the canonical result.This branch also bundles a separate coverage-boost foundation commit (
datalog/csv/heap/sym/sorttests + parted-column access) — unrelated to the agg reconciliation but already part of the work in flight.Verification
-O3 -march=native -fno-signed-zeros) is within run-to-run noise vs baseline (by-group median ~49 ms both; whole-table parallel reduce identical). The hot morsel loops (par_reduce_fn, radix phase scatter/finalize, DA accumulators) are untouched — wide/median run serially as post-passes outside the parallel pipeline.🤖 Generated with Claude Code