Skip to content

fix(agg): reconcile scalar vs DAG aggregation (type-admission, STR/GUID, pearson, median, BOOL) + coverage#231

Merged
singaraiona merged 9 commits into
masterfrom
fix/agg-scalar-dag-reconcile
Jun 8, 2026
Merged

fix(agg): reconcile scalar vs DAG aggregation (type-admission, STR/GUID, pearson, median, BOOL) + coverage#231
singaraiona merged 9 commits into
masterfrom
fix/agg-scalar-dag-reconcile

Conversation

@ser-vasilich

Copy link
Copy Markdown
Collaborator

Summary

Reconciles every place where the scalar aggregate builtins (sum/avg/min/max/med/…) and the DAG/group executor (the select … by … path) disagreed on the result of an aggregation. Six audit mechanisms (M1–M5) were found where one path silently produced wrong or differently-typed results; all are fixed so the two paths now return identical values and types for the same input. A final perf pass keeps the changes off the hot loops.

A scalar ↔ DAG differential matrix (6 aggregates × 5 types, plus STR/GUID/temporal and rejection cases) now agrees end-to-end.

The fixes

# Divergence (scalar ≠ DAG) Fix
M1 DAG silently aggregated raw symbol ids / string bytes / absolute-date integers that the scalar builtin rejects One canonical agg_type_admitted() shared by both paths: sum=numeric+TIME, avg/var/stddev=numeric+temporal→F64; SYM/STR/GUID and sum(DATE/TIMESTAMP)type error. Unwraps parted columns. sum(TIME) preserves the TIME type in every DAG path.
M2 STR/GUID min/max/first/last truncated to 1 byte (8-byte accumulators) Row-index machinery: whole-table interception + a per-group kernel ray_wide_minmax_per_group_buf wired through the existing holistic post-fill (radix + dense), plus a no-by override. Lexicographic for STR, byte-order for GUID.
M3 by-group pearson_corr returned -1 and +1 both collapsed to 1.0 Emit the signed r (matching the scalar builtin and the radix finalize).
M4 no-by (select {median}) returned 0 instead of the median Recompute via ray_median_per_group_buf as a single group (top/bot left to their scalar form — the no-by planner carries no K).
M5 scalar min/max/first/last over BOOL widened to i64 while the DAG kept bool reduction_i64_result gained the RAY_BOOL case → both return BOOL (matches q min 110b → 0b and DuckDB).

A final commit (perf(agg)) hoists all type/operator dispatch out of the wide-reduce and median-fill loops — every decision is made once per column, none per row — matching the convention of the integer reduce loops.

Audit items confirmed NOT to be divergences (left as-is)

prod has no scalar builtin (a feature gap, not a divergence); top/bot reject a non-integer K by design; count(distinct) is exact (verified to 200k); an empty-table no-by aggregate uniformly yields an empty result in the DAG (a consistent convention, distinct from the scalar vector-identity sum=0).

Tests

Regression tests added under each fix (agg/{sum,min,max,first,last,med,pearson_corr,per_group_holistic,list_med_var}.rfl), including a 6000-row / 200-group case exercising the radix post-fill. Several tests that had encoded the buggy behaviour (squaring pearson to dodge the sign; asserting 0 for no-by median with a comment stating the true value; min boolvec → 1) are corrected to the canonical result.

This branch also bundles a separate coverage-boost foundation commit (datalog/csv/heap/sym/sort tests + parted-column access) — unrelated to the agg reconciliation but already part of the work in flight.

Verification

  • Full suite green under debug + ASan/UBSan: 3241/3243 passed, 0 failed (2 pre-existing skips).
  • Morsel/parallel group-by not regressed: release benchmark (20M rows, 100 groups, -O3 -march=native -fno-signed-zeros) is within run-to-run noise vs baseline (by-group median ~49 ms both; whole-table parallel reduce identical). The hot morsel loops (par_reduce_fn, radix phase scatter/finalize, DA accumulators) are untouched — wide/median run serially as post-passes outside the parallel pipeline.

🤖 Generated with Claude Code

ser-vasilich and others added 8 commits June 6, 2026 15:41
Foundation checkpoint (full suite 3241/3243, 0 failed, ASan/UBSan clean):

- Coverage-boost: raised src coverage across io/csv, ops/{datalog,join,
  fused_group,sort,string,builtins,opt,traverse}, lang/eval, table/sym,
  mem/heap by extending the nearest existing RFL/C tests, with reachability
  notes for OOM/defensive/dead-by-design lines.
- fix(collection): `(at table 'col)` now materializes RAY_IS_PARTED /
  RAY_MAPCOMMON columns from a .db.parted.get table (was: type error on
  direct column access).
- fix(agg): scalar ray_avg_fn/ray_sum_fn reject non-numeric (SYM/STR/GUID)
  vectors instead of averaging/summing raw element ids (the first half of the
  scalar<->DAG aggregation reconciliation).
- Corrected two false-positive tests the parted fix exposed (min/max of SYM
  are defined lexicographically, not type errors) and documented the substr
  null-vector path (no bug; an earlier SIGSEGV note was a mixed-ABI build
  artifact, not exec_substr).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (M1)

The scalar aggregate builtins (ray_sum_fn/ray_avg_fn/...) and the DAG/group
executors had divergent type-admission rules, so a select-form aggregate could
silently fold raw symbol ids, string bytes, or absolute-temporal integers that
the scalar builtin rejects.  Unify both on one canonical table via a shared
agg_type_admitted() helper:

  sum  : numeric (BOOL/U8/I16/I32/I64/F64) + TIME (duration-like); rejects
         SYM/STR/GUID and the absolute temporals DATE/TIMESTAMP
  prod : numeric only
  avg/var/var_pop/stddev/stddev_pop : numeric + temporal -> F64; rejects
         SYM/STR/GUID
  min/max/first/last/count : any type

DAG-side guards added at both select agg-build sites (no-by and by-group) in
query.c, and at the direct-reduction dispatch in exec.c.  The helper unwraps
RAY_PARTED columns to their base element type and only rejects element types it
positively recognises as inadmissible (deferring unknown wrappers such as
RAY_MAPCOMMON to the runtime) so a plan-time guard never false-rejects a parted
numeric column.

sum(TIME) now preserves the TIME type in every DAG path (no-by select,
by-group select, whole-table reduction), matching the scalar ray_sum_fn —
previously the DAG returned a raw I64.  The affine/linear SUM fast paths leave
the input vector unmaterialised, so emit_agg_columns recovers the source type
from the aggregation input op when the vector is absent.

Tests: sum.rfl gains a scalar<->DAG parity section (type-admission + sum(TIME)
preservation across no-by and by-group); list_med_var.rfl and
agg/branch_coverage.rfl updated to assert sum(TIMESTAMP) is a type error
(canonical: absolute temporals are not summable, matching DuckDB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wide element types — STR (a 16-byte pool-pointer+length) and GUID (16 raw
bytes) — overflow the 8-byte integer reduce accumulators, so every min / max /
first / last over such a column silently truncated the value to a single byte.
The scalar builtins (which route STR/GUID through the DAG reducer) returned
empty/garbage for min/max, and the DAG select path returned garbage for all
four; only the scalar first/last (via collection_elem) were correct, so scalar
and DAG diverged.

Resolve all of them by tracking the WINNING ROW INDEX instead of an 8-byte
value and materialising that element:

  * Helpers in group.c: agg_is_wide_type, wide_elem_cmp (lexicographic memcmp
    for STR, byte order for GUID), wide_winner_row (min/max by comparison,
    first/last by position, skipping nulls), agg_wide_reduce.
  * Whole-table / scalar path: exec_reduction intercepts wide min/max/first/
    last and materialises the winner — this also fixes the scalar builtins,
    which delegate STR/GUID to the DAG.
  * By-group path: a new per-group kernel ray_wide_minmax_per_group_buf runs
    through the existing holistic post-fill machinery (the OP_MEDIAN/TOP_N/
    BOT_N plumbing). Wide min/max/first/last are flagged holistic in
    ght_compute_layout (new agg_is_wide bit) so the radix and dense post-fill
    sites dispatch the wide kernel and the row-layout emit skips them. The
    dense attach now keys off the holistic bitmask rather than the op literal.
  * No-by select: emit_agg_columns truncates wide aggs, so the scalar
    n_keys==0 path overrides those columns post-emit via the winning-row
    materialisation (new ray_table_set_col_idx helper swaps a column vector).
  * The DA fast path is disqualified for wide aggs so they reach the holistic
    post-fill.

The wide kernel runs serial: ray_str_vec_set COW-mutates a shared string pool,
unsafe to parallelise; wide min/max is a cold path, not a bench kernel.

The rowform specialisations (maxmin/median/sum_count) are already gated to
integer value columns, so STR/GUID never bypass the general exec_group path.

Tests: agg/{min,max,first,last}.rfl gain STR + GUID scalar<->DAG parity
(no-by and by-group); per_group_holistic.rfl adds a 6000-row / 200-group case
that exercises the radix post-fill with exact group-0 min/first/last values.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
exec_group_pearson_rowform (the dedicated single-pass per-group Pearson path)
finalised each group as r² = (cov/(σx·σy))² instead of the signed coefficient
r.  Squaring collapsed -1 and +1 to the same 1.0, so any by-group correlation
silently lost its sign — a perfect negative correlation reported as +1.0.  The
scalar pearson_corr builtin and the general radix HT finalize both return the
signed r; align the rowform path with them.

Two existing tests encoded the buggy behaviour and are corrected to assert the
signed coefficient:
  * agg/pearson_corr.rfl squared r (pow(...,2)) to dodge the sign; now asserts
    group A=+1.0 / group B=-1.0 directly, with the r² parity check kept.
  * group/null_aware_helpers.rfl asserted row-0==1.0 with a comment noting "the
    engine appears to compute |r| or r²"; now asserts the order-independent
    {min,max} = {-1.0,+1.0} pair (group emit order is hash-dependent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A no-by select aggregate routes through exec_group's n_keys==0 scalar fast
path, which has no accumulator for holistic aggregates and whose emitter
(emit_agg_columns) has no median case — so (select {m: (median v) from: T})
returned 0 while the scalar (med v) returned the real median.  Recompute the
median column post-emit over the whole (optionally where-selected) column as a
single group via ray_median_per_group_buf, mirroring the by-group path.

top/bot are intentionally left to their scalar-builtin form here: the no-by
planner does not thread the K argument into the group op, so a no-by select
top/bot has no K to honour.  Whole-table var/stddev already matched (they have
n_keys==0 accumulators), so only median needed the fill.

Two tests that asserted the buggy 0 (each with a comment stating the true
median) are corrected: agg/med.rfl gains no-by median parity (F64, I64-to-F64,
and where-selected), and query/query_dag_agg_coverage.rfl asserts 6.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… DAG (M5)

reduction_i64_result (used by exec_reduction's min/max and first/last finalize)
had no RAY_BOOL case, so a BOOL reduction widened to i64.  The scalar min/max
builtins delegate BOOL vectors to this path, so (min boolvec) returned an i64
0/1 while the DAG select path already returned a BOOL — every other element
type (U8/I16/I32/temporal/SYM) was preserved, BOOL alone was not.  Add the
RAY_BOOL case so both paths return a BOOL, matching q (min 110b -> 0b) and
DuckDB (min(bool) -> bool).

Tests: agg/{min,max,first,last}.rfl gain BOOL type-preservation parity.  The
"all-true -> 1 / any-true -> 0" idiom — (min boolvec) / (max boolvec) used as a
shard-independent boolean check — appeared across exec_worker_merge.rfl,
like_patterns.rfl and cross_type_workout.rfl asserting 0/1; those ~17 assertions
now assert false/true (the reduction still collapses the column to one boolean,
only the result type changed).

Other M5 audit items were confirmed NOT to be scalar<->DAG divergences and are
left as-is: prod has no scalar builtin (a feature gap, not a divergence); top/bot
reject a non-integer K by design; count(distinct) is exact (verified to 200k);
and an empty-table no-by aggregate uniformly yields an empty result in the DAG
(a consistent convention, distinct from the scalar vector-identity sum=0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ill loops

Follow-up to the M2/M4 aggregation work: keep every type/operator decision at
column granularity, with none inside the per-row loops.

* wide_winner_row (STR/GUID min/max/first/last): the element type (STR vs GUID)
  and the operator are now resolved ONCE before any loop.  first/last became
  pure positional scans (no value comparison); min/max run a single
  type-specialised compare loop with the direction (want_min) hoisted out.  The
  former per-comparison wide_elem_cmp() — which branched on ->type on every
  call inside the loop — is gone.
* no-by median fill: the selection branch (idx = sel[i] vs i) is hoisted into
  two straight loops instead of a per-row ternary.

Behaviour is unchanged (full suite green under ASan, scalar<->DAG parity matrix
intact); this only removes per-element dispatch, matching the convention of the
integer reduce loops which template HAS_NULLS/HAS_IDX for the same reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ser-vasilich ser-vasilich marked this pull request as draft June 8, 2026 08:03
@ser-vasilich ser-vasilich marked this pull request as ready for review June 8, 2026 10:32
@singaraiona singaraiona merged commit 422500d into master Jun 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants