fix(agg): reconcile scalar vs DAG aggregation (type-admission, STR/GUID, pearson, median, BOOL) + coverage by ser-vasilich · Pull Request #231 · RayforceDB/rayforce

ser-vasilich · 2026-06-08T07:59:50Z

Summary

Reconciles every place where the scalar aggregate builtins (sum/avg/min/max/med/…) and the DAG/group executor (the select … by … path) disagreed on the result of an aggregation. Six audit mechanisms (M1–M5) were found where one path silently produced wrong or differently-typed results; all are fixed so the two paths now return identical values and types for the same input. A final perf pass keeps the changes off the hot loops.

A scalar ↔ DAG differential matrix (6 aggregates × 5 types, plus STR/GUID/temporal and rejection cases) now agrees end-to-end.

The fixes

#	Divergence (scalar ≠ DAG)	Fix
M1	DAG silently aggregated raw symbol ids / string bytes / absolute-date integers that the scalar builtin rejects	One canonical `agg_type_admitted()` shared by both paths: `sum`=numeric+TIME, `avg`/`var`/`stddev`=numeric+temporal→F64; `SYM`/`STR`/`GUID` and `sum(DATE/TIMESTAMP)`→`type` error. Unwraps parted columns. `sum(TIME)` preserves the TIME type in every DAG path.
M2	STR/GUID `min`/`max`/`first`/`last` truncated to 1 byte (8-byte accumulators)	Row-index machinery: whole-table interception + a per-group kernel `ray_wide_minmax_per_group_buf` wired through the existing holistic post-fill (radix + dense), plus a no-by override. Lexicographic for STR, byte-order for GUID.
M3	by-group `pearson_corr` returned r² → `-1` and `+1` both collapsed to `1.0`	Emit the signed `r` (matching the scalar builtin and the radix finalize).
M4	no-by `(select {median})` returned `0` instead of the median	Recompute via `ray_median_per_group_buf` as a single group (top/bot left to their scalar form — the no-by planner carries no K).
M5	scalar `min`/`max`/`first`/`last` over BOOL widened to i64 while the DAG kept bool	`reduction_i64_result` gained the `RAY_BOOL` case → both return BOOL (matches q `min 110b → 0b` and DuckDB).

A final commit (perf(agg)) hoists all type/operator dispatch out of the wide-reduce and median-fill loops — every decision is made once per column, none per row — matching the convention of the integer reduce loops.

Audit items confirmed NOT to be divergences (left as-is)

prod has no scalar builtin (a feature gap, not a divergence); top/bot reject a non-integer K by design; count(distinct) is exact (verified to 200k); an empty-table no-by aggregate uniformly yields an empty result in the DAG (a consistent convention, distinct from the scalar vector-identity sum=0).

Tests

Regression tests added under each fix (agg/{sum,min,max,first,last,med,pearson_corr,per_group_holistic,list_med_var}.rfl), including a 6000-row / 200-group case exercising the radix post-fill. Several tests that had encoded the buggy behaviour (squaring pearson to dodge the sign; asserting 0 for no-by median with a comment stating the true value; min boolvec → 1) are corrected to the canonical result.

This branch also bundles a separate coverage-boost foundation commit (datalog/csv/heap/sym/sort tests + parted-column access) — unrelated to the agg reconciliation but already part of the work in flight.

Verification

Full suite green under debug + ASan/UBSan: 3241/3243 passed, 0 failed (2 pre-existing skips).
Morsel/parallel group-by not regressed: release benchmark (20M rows, 100 groups, -O3 -march=native -fno-signed-zeros) is within run-to-run noise vs baseline (by-group median ~49 ms both; whole-table parallel reduce identical). The hot morsel loops (par_reduce_fn, radix phase scatter/finalize, DA accumulators) are untouched — wide/median run serially as post-passes outside the parallel pipeline.

🤖 Generated with Claude Code

Foundation checkpoint (full suite 3241/3243, 0 failed, ASan/UBSan clean): - Coverage-boost: raised src coverage across io/csv, ops/{datalog,join, fused_group,sort,string,builtins,opt,traverse}, lang/eval, table/sym, mem/heap by extending the nearest existing RFL/C tests, with reachability notes for OOM/defensive/dead-by-design lines. - fix(collection): `(at table 'col)` now materializes RAY_IS_PARTED / RAY_MAPCOMMON columns from a .db.parted.get table (was: type error on direct column access). - fix(agg): scalar ray_avg_fn/ray_sum_fn reject non-numeric (SYM/STR/GUID) vectors instead of averaging/summing raw element ids (the first half of the scalar<->DAG aggregation reconciliation). - Corrected two false-positive tests the parted fix exposed (min/max of SYM are defined lexicographically, not type errors) and documented the substr null-vector path (no bug; an earlier SIGSEGV note was a mixed-ABI build artifact, not exec_substr). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s (M1) The scalar aggregate builtins (ray_sum_fn/ray_avg_fn/...) and the DAG/group executors had divergent type-admission rules, so a select-form aggregate could silently fold raw symbol ids, string bytes, or absolute-temporal integers that the scalar builtin rejects. Unify both on one canonical table via a shared agg_type_admitted() helper: sum : numeric (BOOL/U8/I16/I32/I64/F64) + TIME (duration-like); rejects SYM/STR/GUID and the absolute temporals DATE/TIMESTAMP prod : numeric only avg/var/var_pop/stddev/stddev_pop : numeric + temporal -> F64; rejects SYM/STR/GUID min/max/first/last/count : any type DAG-side guards added at both select agg-build sites (no-by and by-group) in query.c, and at the direct-reduction dispatch in exec.c. The helper unwraps RAY_PARTED columns to their base element type and only rejects element types it positively recognises as inadmissible (deferring unknown wrappers such as RAY_MAPCOMMON to the runtime) so a plan-time guard never false-rejects a parted numeric column. sum(TIME) now preserves the TIME type in every DAG path (no-by select, by-group select, whole-table reduction), matching the scalar ray_sum_fn — previously the DAG returned a raw I64. The affine/linear SUM fast paths leave the input vector unmaterialised, so emit_agg_columns recovers the source type from the aggregation input op when the vector is absent. Tests: sum.rfl gains a scalar<->DAG parity section (type-admission + sum(TIME) preservation across no-by and by-group); list_med_var.rfl and agg/branch_coverage.rfl updated to assert sum(TIMESTAMP) is a type error (canonical: absolute temporals are not summable, matching DuckDB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wide element types — STR (a 16-byte pool-pointer+length) and GUID (16 raw bytes) — overflow the 8-byte integer reduce accumulators, so every min / max / first / last over such a column silently truncated the value to a single byte. The scalar builtins (which route STR/GUID through the DAG reducer) returned empty/garbage for min/max, and the DAG select path returned garbage for all four; only the scalar first/last (via collection_elem) were correct, so scalar and DAG diverged. Resolve all of them by tracking the WINNING ROW INDEX instead of an 8-byte value and materialising that element: * Helpers in group.c: agg_is_wide_type, wide_elem_cmp (lexicographic memcmp for STR, byte order for GUID), wide_winner_row (min/max by comparison, first/last by position, skipping nulls), agg_wide_reduce. * Whole-table / scalar path: exec_reduction intercepts wide min/max/first/ last and materialises the winner — this also fixes the scalar builtins, which delegate STR/GUID to the DAG. * By-group path: a new per-group kernel ray_wide_minmax_per_group_buf runs through the existing holistic post-fill machinery (the OP_MEDIAN/TOP_N/ BOT_N plumbing). Wide min/max/first/last are flagged holistic in ght_compute_layout (new agg_is_wide bit) so the radix and dense post-fill sites dispatch the wide kernel and the row-layout emit skips them. The dense attach now keys off the holistic bitmask rather than the op literal. * No-by select: emit_agg_columns truncates wide aggs, so the scalar n_keys==0 path overrides those columns post-emit via the winning-row materialisation (new ray_table_set_col_idx helper swaps a column vector). * The DA fast path is disqualified for wide aggs so they reach the holistic post-fill. The wide kernel runs serial: ray_str_vec_set COW-mutates a shared string pool, unsafe to parallelise; wide min/max is a cold path, not a bench kernel. The rowform specialisations (maxmin/median/sum_count) are already gated to integer value columns, so STR/GUID never bypass the general exec_group path. Tests: agg/{min,max,first,last}.rfl gain STR + GUID scalar<->DAG parity (no-by and by-group); per_group_holistic.rfl adds a 6000-row / 200-group case that exercises the radix post-fill with exact group-0 min/first/last values. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

exec_group_pearson_rowform (the dedicated single-pass per-group Pearson path) finalised each group as r² = (cov/(σx·σy))² instead of the signed coefficient r. Squaring collapsed -1 and +1 to the same 1.0, so any by-group correlation silently lost its sign — a perfect negative correlation reported as +1.0. The scalar pearson_corr builtin and the general radix HT finalize both return the signed r; align the rowform path with them. Two existing tests encoded the buggy behaviour and are corrected to assert the signed coefficient: * agg/pearson_corr.rfl squared r (pow(...,2)) to dodge the sign; now asserts group A=+1.0 / group B=-1.0 directly, with the r² parity check kept. * group/null_aware_helpers.rfl asserted row-0==1.0 with a comment noting "the engine appears to compute |r| or r²"; now asserts the order-independent {min,max} = {-1.0,+1.0} pair (group emit order is hash-dependent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A no-by select aggregate routes through exec_group's n_keys==0 scalar fast path, which has no accumulator for holistic aggregates and whose emitter (emit_agg_columns) has no median case — so (select {m: (median v) from: T}) returned 0 while the scalar (med v) returned the real median. Recompute the median column post-emit over the whole (optionally where-selected) column as a single group via ray_median_per_group_buf, mirroring the by-group path. top/bot are intentionally left to their scalar-builtin form here: the no-by planner does not thread the K argument into the group op, so a no-by select top/bot has no K to honour. Whole-table var/stddev already matched (they have n_keys==0 accumulators), so only median needed the fill. Two tests that asserted the buggy 0 (each with a comment stating the true median) are corrected: agg/med.rfl gains no-by median parity (F64, I64-to-F64, and where-selected), and query/query_dag_agg_coverage.rfl asserts 6.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… DAG (M5) reduction_i64_result (used by exec_reduction's min/max and first/last finalize) had no RAY_BOOL case, so a BOOL reduction widened to i64. The scalar min/max builtins delegate BOOL vectors to this path, so (min boolvec) returned an i64 0/1 while the DAG select path already returned a BOOL — every other element type (U8/I16/I32/temporal/SYM) was preserved, BOOL alone was not. Add the RAY_BOOL case so both paths return a BOOL, matching q (min 110b -> 0b) and DuckDB (min(bool) -> bool). Tests: agg/{min,max,first,last}.rfl gain BOOL type-preservation parity. The "all-true -> 1 / any-true -> 0" idiom — (min boolvec) / (max boolvec) used as a shard-independent boolean check — appeared across exec_worker_merge.rfl, like_patterns.rfl and cross_type_workout.rfl asserting 0/1; those ~17 assertions now assert false/true (the reduction still collapses the column to one boolean, only the result type changed). Other M5 audit items were confirmed NOT to be scalar<->DAG divergences and are left as-is: prod has no scalar builtin (a feature gap, not a divergence); top/bot reject a non-integer K by design; count(distinct) is exact (verified to 200k); and an empty-table no-by aggregate uniformly yields an empty result in the DAG (a consistent convention, distinct from the scalar vector-identity sum=0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ill loops Follow-up to the M2/M4 aggregation work: keep every type/operator decision at column granularity, with none inside the per-row loops. * wide_winner_row (STR/GUID min/max/first/last): the element type (STR vs GUID) and the operator are now resolved ONCE before any loop. first/last became pure positional scans (no value comparison); min/max run a single type-specialised compare loop with the direction (want_min) hoisted out. The former per-comparison wide_elem_cmp() — which branched on ->type on every call inside the loop — is gone. * no-by median fill: the selection branch (idx = sel[i] vs i) is hoisted into two straight loops instead of a per-row ternary. Behaviour is unchanged (full suite green under ASan, scalar<->DAG parity matrix intact); this only removes per-element dispatch, matching the convention of the integer reduce loops which template HAS_NULLS/HAS_IDX for the same reason. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…reconcile

ser-vasilich and others added 8 commits June 6, 2026 15:41

Merge remote-tracking branch 'origin/master' into fix/agg-scalar-dag-…

378365b

…reconcile

ser-vasilich marked this pull request as draft June 8, 2026 08:03

Merge remote-tracking branch 'origin/master' into fix/agg-scalar-dag-…

5032692

…reconcile

ser-vasilich marked this pull request as ready for review June 8, 2026 10:32

singaraiona merged commit 422500d into master Jun 8, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agg): reconcile scalar vs DAG aggregation (type-admission, STR/GUID, pearson, median, BOOL) + coverage#231

fix(agg): reconcile scalar vs DAG aggregation (type-admission, STR/GUID, pearson, median, BOOL) + coverage#231
singaraiona merged 9 commits into
masterfrom
fix/agg-scalar-dag-reconcile

ser-vasilich commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ser-vasilich commented Jun 8, 2026

Summary

The fixes

Audit items confirmed NOT to be divergences (left as-is)

Tests

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants