Skip to content

[SPARK-57718][SQL][TESTS] Add JOIN correctness tests for nanosecond-precision timestamp types#56812

Closed
stevomitric wants to merge 1 commit into
apache:masterfrom
stevomitric:stevomitric/timestamp-nanos-join-tests
Closed

[SPARK-57718][SQL][TESTS] Add JOIN correctness tests for nanosecond-precision timestamp types#56812
stevomitric wants to merge 1 commit into
apache:masterfrom
stevomitric:stevomitric/timestamp-nanos-join-tests

Conversation

@stevomitric

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Tests-only coverage for equi-JOINs over the nanosecond-precision timestamp types TIMESTAMP_NTZ(p) / TIMESTAMP_LTZ(p) (p in [7, 9]). One new suite, TimestampNanosJoinSuiteBase, with AnsiOn / AnsiOff subclasses.

The headline assertion is sub-microsecond join-key correctness: every key in the test relations shares the same epochMicros and differs only in nanosWithinMicro, so a correct join must be driven by the full nanos value —

  • a fully-equal key (incl. the sub-microsecond remainder) must join,
  • an epochMicros-equal-but-nanosWithinMicro-distinct key must not join,
  • NULL keys must not match.

Why are the changes needed?

Joins over the nanosecond types already work (they ride on the hashing / ordering / UnsafeRow primitives), but there was no join coverage — Spark's join suites are organized by join type/strategy, not by column type. This locks the behaviour in, mirroring the MIN/MAX follow-up which was likewise tests-only.

Does this PR introduce any user-facing change?

No, tests only.

How was this patch tested?

New suites in this PR.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

End-to-end equi-join coverage over the nanosecond timestamp types TIMESTAMP_NTZ(p) /
TIMESTAMP_LTZ(p) (p in 7..9). Joins already work -- they ride on the nanos hash+equals
(hash joins) and ordering+equals (sort-merge join) primitives -- but had no test coverage;
this locks the behaviour in, mirroring the MIN/MAX follow-up which was likewise tests-only.

TimestampNanosJoinSuiteBase (+ ANSI on/off) asserts sub-microsecond join-key correctness:
every key shares epochMicros and differs only in nanosWithinMicro, so a fully-equal key must
join, an epochMicros-equal-but-nanos-distinct key must not, and NULL keys must not match.
Inner and left-outer, NTZ and LTZ, across whole-stage codegen on/off and each of the three
equi-join strategies (broadcast-hash, sort-merge, shuffled-hash); each test asserts the
intended physical join node fired before checking the answer.

Tests only; no production change.

Co-authored-by: Isaac
@stevomitric stevomitric force-pushed the stevomitric/timestamp-nanos-join-tests branch from fd5aa46 to b735ab5 Compare June 26, 2026 14:43
@stevomitric stevomitric changed the title [SQL][TESTS] Add JOIN correctness tests for nanosecond-precision timestamp types [SPARK-57718][SQL][TESTS] Add JOIN correctness tests for nanosecond-precision timestamp types Jun 26, 2026
@stevomitric

Copy link
Copy Markdown
Contributor Author

cc @MaxGekk can you PTAL at this test-only PR.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 0 non-blocking, 0 nits.
A well-constructed, correct test-only PR. The asserted values are the intended sub-microsecond join semantics, the fixtures genuinely exercise sub-micro key equality/non-equality across all three join strategies and both codegen modes, suite placement and base-class choice are right, and parity with the existing nanos suites (TimestampNanosFunctionsSuiteBase / TimestampNanosWideningSuiteBase) is faithful.

Verification

I checked the asserted values are intended, not incidental, and that the tests can actually observe a precision bug (no SPARK-56395-style trap):

  • The right relation is neither empty nor key-equal to the left — its 700ns/300ns keys share the same epochMicros as the left's 200ns/900ns. A truncate-to-micros bug would spuriously match every non-null left row to every non-null right row, blowing up INNER and turning the lid=2/lid=3 left-outer NULLs into matches — so the suite fails loudly on such a bug. lid=2 -> null proves a real sub-micro non-match (not an empty right side); lid=4 -> null proves NULL != NULL.
  • The cross-precision-safety claim holds: SparkDateTimeUtils.truncateNanosWithinMicroToPrecision floors to (n/100)*100 at p=7 and (n/10)*10 at p=8, and the remainders 200/300/500/700/900 are all multiples of 100, so they are exact and non-colliding at every p in [7,9] — the same fixtures and expected rows are valid verbatim at all three precisions.
  • Strategy forcing is sound: forceApplyShuffledHashJoin short-circuits the SHJ build-side selection (bypassing the muchSmaller heuristic that equal-sized tiny inputs would fail), SHJ permits LEFT OUTER, and with AQE disabled assertJoinUsed("exactly one") is deterministic.

The formal verdict stays with the committer.

@MaxGekk

MaxGekk commented Jun 26, 2026

Copy link
Copy Markdown
Member

All tests passed.

+1, LGTM. Merging to master/4.x.
Thank you, @stevomitric.

@MaxGekk MaxGekk closed this in ca4e07a Jun 26, 2026
MaxGekk pushed a commit that referenced this pull request Jun 26, 2026
…ecision timestamp types

### What changes were proposed in this pull request?

Tests-only coverage for equi-JOINs over the nanosecond-precision timestamp types `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` (`p` in `[7, 9]`). One new suite, `TimestampNanosJoinSuiteBase`, with `AnsiOn` / `AnsiOff` subclasses.

The headline assertion is sub-microsecond **join-key** correctness: every key in the test relations shares the same `epochMicros` and differs only in `nanosWithinMicro`, so a correct join must be driven by the full nanos value —
- a fully-equal key (incl. the sub-microsecond remainder) must join,
- an `epochMicros`-equal-but-`nanosWithinMicro`-distinct key must **not** join,
- NULL keys must not match.

### Why are the changes needed?

Joins over the nanosecond types already work (they ride on the hashing / ordering / `UnsafeRow` primitives), but there was no join coverage — Spark's join suites are organized by join type/strategy, not by column type. This locks the behaviour in, mirroring the MIN/MAX follow-up which was likewise tests-only.

### Does this PR introduce _any_ user-facing change?

No, tests only.

### How was this patch tested?

New suites in this PR.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Closes #56812 from stevomitric/stevomitric/timestamp-nanos-join-tests.

Authored-by: Stevo Mitric <stevomitric2000@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ca4e07a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants