[SPARK-57718][SQL][TESTS] Add JOIN correctness tests for nanosecond-precision timestamp types#56812
Conversation
End-to-end equi-join coverage over the nanosecond timestamp types TIMESTAMP_NTZ(p) / TIMESTAMP_LTZ(p) (p in 7..9). Joins already work -- they ride on the nanos hash+equals (hash joins) and ordering+equals (sort-merge join) primitives -- but had no test coverage; this locks the behaviour in, mirroring the MIN/MAX follow-up which was likewise tests-only. TimestampNanosJoinSuiteBase (+ ANSI on/off) asserts sub-microsecond join-key correctness: every key shares epochMicros and differs only in nanosWithinMicro, so a fully-equal key must join, an epochMicros-equal-but-nanos-distinct key must not, and NULL keys must not match. Inner and left-outer, NTZ and LTZ, across whole-stage codegen on/off and each of the three equi-join strategies (broadcast-hash, sort-merge, shuffled-hash); each test asserts the intended physical join node fired before checking the answer. Tests only; no production change. Co-authored-by: Isaac
fd5aa46 to
b735ab5
Compare
|
cc @MaxGekk can you PTAL at this test-only PR. |
MaxGekk
left a comment
There was a problem hiding this comment.
0 blocking, 0 non-blocking, 0 nits.
A well-constructed, correct test-only PR. The asserted values are the intended sub-microsecond join semantics, the fixtures genuinely exercise sub-micro key equality/non-equality across all three join strategies and both codegen modes, suite placement and base-class choice are right, and parity with the existing nanos suites (TimestampNanosFunctionsSuiteBase / TimestampNanosWideningSuiteBase) is faithful.
Verification
I checked the asserted values are intended, not incidental, and that the tests can actually observe a precision bug (no SPARK-56395-style trap):
- The right relation is neither empty nor key-equal to the left — its 700ns/300ns keys share the same
epochMicrosas the left's 200ns/900ns. A truncate-to-micros bug would spuriously match every non-null left row to every non-null right row, blowing up INNER and turning thelid=2/lid=3left-outer NULLs into matches — so the suite fails loudly on such a bug.lid=2 -> nullproves a real sub-micro non-match (not an empty right side);lid=4 -> nullproves NULL != NULL. - The cross-precision-safety claim holds:
SparkDateTimeUtils.truncateNanosWithinMicroToPrecisionfloors to(n/100)*100at p=7 and(n/10)*10at p=8, and the remainders 200/300/500/700/900 are all multiples of 100, so they are exact and non-colliding at every p in [7,9] — the same fixtures and expected rows are valid verbatim at all three precisions. - Strategy forcing is sound:
forceApplyShuffledHashJoinshort-circuits the SHJ build-side selection (bypassing themuchSmallerheuristic that equal-sized tiny inputs would fail), SHJ permits LEFT OUTER, and with AQE disabledassertJoinUsed("exactly one")is deterministic.
The formal verdict stays with the committer.
|
All tests passed. +1, LGTM. Merging to master/4.x. |
…ecision timestamp types ### What changes were proposed in this pull request? Tests-only coverage for equi-JOINs over the nanosecond-precision timestamp types `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` (`p` in `[7, 9]`). One new suite, `TimestampNanosJoinSuiteBase`, with `AnsiOn` / `AnsiOff` subclasses. The headline assertion is sub-microsecond **join-key** correctness: every key in the test relations shares the same `epochMicros` and differs only in `nanosWithinMicro`, so a correct join must be driven by the full nanos value — - a fully-equal key (incl. the sub-microsecond remainder) must join, - an `epochMicros`-equal-but-`nanosWithinMicro`-distinct key must **not** join, - NULL keys must not match. ### Why are the changes needed? Joins over the nanosecond types already work (they ride on the hashing / ordering / `UnsafeRow` primitives), but there was no join coverage — Spark's join suites are organized by join type/strategy, not by column type. This locks the behaviour in, mirroring the MIN/MAX follow-up which was likewise tests-only. ### Does this PR introduce _any_ user-facing change? No, tests only. ### How was this patch tested? New suites in this PR. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.8) Closes #56812 from stevomitric/stevomitric/timestamp-nanos-join-tests. Authored-by: Stevo Mitric <stevomitric2000@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit ca4e07a) Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
Tests-only coverage for equi-JOINs over the nanosecond-precision timestamp types
TIMESTAMP_NTZ(p)/TIMESTAMP_LTZ(p)(pin[7, 9]). One new suite,TimestampNanosJoinSuiteBase, withAnsiOn/AnsiOffsubclasses.The headline assertion is sub-microsecond join-key correctness: every key in the test relations shares the same
epochMicrosand differs only innanosWithinMicro, so a correct join must be driven by the full nanos value —epochMicros-equal-but-nanosWithinMicro-distinct key must not join,Why are the changes needed?
Joins over the nanosecond types already work (they ride on the hashing / ordering /
UnsafeRowprimitives), but there was no join coverage — Spark's join suites are organized by join type/strategy, not by column type. This locks the behaviour in, mirroring the MIN/MAX follow-up which was likewise tests-only.Does this PR introduce any user-facing change?
No, tests only.
How was this patch tested?
New suites in this PR.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)