GH-49677 [Python][C++][Compute] Add search sorted compute kernel #49679
Alex-PLACET wants to merge 14 commits into apache:main
Conversation
- Added a new benchmark file `vector_search_sorted_benchmark.cc` to evaluate the performance of the SearchSorted function for various data types including Int64, String, and Binary.
- Created a comprehensive test suite in `vector_search_sorted_test.cc` to validate the correctness of SearchSorted across different scenarios, including handling of null values, scalar needles, and run-end encoded arrays.
- Ensured that the benchmarks cover both left and right search options, as well as edge cases like empty arrays and arrays with leading/trailing nulls.
…rks for needles with null runs
…tation overview and flow
…lize ranges for leading/trailing null counts
…ensive tests for supported types
…xcept to length method
[
  0,
  0,
  3,
  5
]
@Alex-PLACET just noticing that the elements here require two space chars to be removed. Similarly for the next two example outputs.
pitrou left a comment
I haven't looked at the implementation yet, but have reviewed the tests and benchmarks (which are quite comprehensive, thank you!).
One missing item is support for chunked arrays. Besides that, see comments below :)
                     SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
                     SearchSorted(Datum(values), Datum(needles),
                                  SearchSortedOptions(SearchSortedOptions::Right)));
Let's call ValidateFull on both results here too?
(also I'm curious, why not reuse CheckSimpleSearchSorted?)
std::string scalar_needle_json;
uint64_t expected_scalar_left;
uint64_t expected_scalar_right;
Note that you could also generate the scalar needle tests automatically by calling GetScalar on the array needles and the expected results. This would make this easier to maintain later.
  return std::static_pointer_cast<BinaryArray>(builder.Finish().ValueOrDie());
}

std::shared_ptr<BinaryArray> BuildBinaryNeedles(int64_t size_bytes) {
I don't think it's worth benchmarking both string and binary, as they are expected to perform similarly.
void SetSearchSortedArgs(benchmark::internal::Benchmark* bench) {
  bench->Unit(benchmark::kMicrosecond);
  for (const auto size : kMemorySizes) {
Unfortunately the largest size produces rather slow benchmark iterations, can we just keep {kL1Size, kL2Size}?
Which essentially means that all benchmarks become "quick" as per the definition of quick here :)
  RunSearchSortedBenchmark(state, values, needles, side);
}

static void BM_SearchSortedInt64ScalarNeedle(benchmark::State& state,
If there is just one needle to search for, we're just benchmarking the function call overhead rather than any significant part of the sorted search kernel, right? Is it useful?
(benchmarks for other compute functions focus on array performance, not scalar performance)
    ->Apply(SetSearchSortedArgs);

// String and binary scalar cases specifically exercise the direct scalar fast path that
// avoids boxing a scalar needle into a temporary one-element array.
Even if we wanted to benchmark this (which I doubt we do), we would only need a single benchmark IMHO.
…and add tests for chunked values and needles
@Alex-PLACET Is this ready for review again?
@Alex-PLACET It looks like the benchmarks don't compile anymore?
pitrou left a comment
Now I also took a look at the implementation, and some aspects of it seem overly complicated (but I may be missing something). See comments below.
"[1, 3, 3, 5]",
"[0, 3, 4, 6]",
Can we make the values and needles arrays different lengths?
auto values = ArrayFromJSON(int16(), "[]");
auto needles = ArrayFromJSON(int16(), "[1, 2, 3]");

ASSERT_OK_AND_ASSIGN(auto result, SearchSorted(Datum(values), Datum(needles)));
Why not reuse CheckSimpleSearchSorted? It would validate the results and also stress scalar needles.
auto values = ArrayFromJSON(int32(), "[null, 200, 300, 300]");
auto needles = ArrayFromJSON(int32(), "[50, 200, 250, 400]");

ASSERT_OK_AND_ASSIGN(auto left,
Again, let's re-use the testing helper CheckSimpleSearchSorted everywhere possible, otherwise we're forgetting some checks.
{"Float32", float32(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]",
 "[0, 1, 3, 4]", "[0, 3, 3, 4]"},
{"Float64", float64(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]",
 "[0, 1, 3, 4]", "[0, 3, 3, 4]"},
Do we want to test NaNs here? They will probably need special treatment, whether in the values or the needles.
For the record, sorting an array with Arrow always puts NaNs between nulls and non-null values,
e.g. either [1.0, 3.0, 3.0, 5.0, NaN, NaN, null] or [null, NaN, NaN, 1.0, 3.0, 3.0, 5.0].
ASSERT_OK_AND_ASSIGN(auto left,
                     SearchSorted(Datum(values), Datum(needles),
                                  SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
                     SearchSorted(Datum(values), Datum(needles),
                                  SearchSortedOptions(SearchSortedOptions::Right)));
Either you write or reuse a helper here, or you need to call ValidateFull directly.
And ideally we should also check scalar needles too.
auto values_type = run_end_encoded(int32(), int32());
ASSERT_OK_AND_ASSIGN(auto ree_values,
                     REEFromJSON(values_type, "[0, 0, 1, 1, 1, 4, 4, 9]"));
auto sliced = ree_values->Slice(2, 5);
Can we use a slice that does not happen to fall on run boundaries? (for example Slice(1, 5))
AssertArraysEqual(*ArrayFromJSON(uint64(), "[0, 0, 3, 3, 5]"), *result.make_array());
}

TEST(SearchSorted, BinaryValues) {
This is already part of the smoke tests, right?
ValueType Value(int64_t index) const {
  const auto physical_index = span_.PhysicalIndex(index);
  return GetViewType<ArrowType>::LogicalValue(values_.GetView(physical_index));
}
Hmm, so we're processing each run-end logical value individually, and then finding it in the array of physical values? This is wasteful, as it's O(n_needles * log n_values), while each physical value will produce the same output after all.
What run-end-encoded values should do instead is:
- run search_sorted on the run values, giving "physical results"
- turn the "physical results" into "logical results" by looking up the run start for each physical result
In other words, given:
values = run_end_encoded([10, 10, 10, 30, 30]) (i.e. run values [10, 30] and run ends [3, 5])
needles = [10, 20, 30, 40]
we would first compute the physical result [0, 1, 1, 2] of searching the needles in [10, 30], then turn it into the logical results [0, 3, 3, 5].
This would be O(n_needles * log n_value_runs) and therefore more efficient. It should also generate less template specializations, because you don't need to template this algorithm per (physical value type, run end type) pair.
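The two-step approach described in this comment could be sketched roughly as follows. This is a hypothetical illustration using plain vectors and std::lower_bound rather than Arrow types; the function name and signature are made up for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the suggested two-step algorithm: search the
// physical run values, then map each physical insertion point to a logical
// index via the run ends (the start of run i is run_ends[i - 1], or 0).
std::vector<int64_t> SearchSortedRee(const std::vector<int64_t>& run_values,
                                     const std::vector<int64_t>& run_ends,
                                     const std::vector<int64_t>& needles) {
  std::vector<int64_t> out;
  out.reserve(needles.size());
  for (int64_t needle : needles) {
    // Step 1: "physical result" -- left insertion point among the run values.
    const auto it =
        std::lower_bound(run_values.begin(), run_values.end(), needle);
    const int64_t physical = it - run_values.begin();
    // Step 2: "logical result" -- the logical start of that run.
    out.push_back(physical == 0 ? 0 : run_ends[physical - 1]);
  }
  return out;
}
```

With run values [10, 30], run ends [3, 5], and needles [10, 20, 30, 40], this yields the logical results [0, 3, 3, 5] from the example above.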
const auto& needle_data = needles.array();
if (needle_data->type->id() == Type::RUN_END_ENCODED) {
  RunEndEncodedArray ree(needle_data);
  return DispatchRunEndEncodedByRunEndType<Status>(
Similarly to what I pointed above about run-end-encoded values, you don't need complex indexing for run-end-encoded needles. It's actually even simpler as you can just search each physical needle and run-end-decode the (needle run ends, physical results).
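The run-end-encoded-needles idea amounts to searching each distinct physical needle once and then expanding the results over the runs. A minimal sketch with plain vectors (illustrative names, not Arrow APIs):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: search only the physical needle run values, then
// run-end-decode the results by repeating each one over its run's length.
std::vector<int64_t> SearchSortedReeNeedles(
    const std::vector<int64_t>& sorted_values,
    const std::vector<int64_t>& needle_run_values,
    const std::vector<int64_t>& needle_run_ends) {
  std::vector<int64_t> out;
  int64_t run_start = 0;
  for (size_t run = 0; run < needle_run_values.size(); ++run) {
    // One binary search per run, not per logical needle.
    const auto it = std::lower_bound(sorted_values.begin(), sorted_values.end(),
                                     needle_run_values[run]);
    const int64_t physical_result = it - sorted_values.begin();
    // Decode: the run spans logical indices [run_start, needle_run_ends[run]).
    for (int64_t i = run_start; i < needle_run_ends[run]; ++i) {
      out.push_back(physical_result);
    }
    run_start = needle_run_ends[run];
  }
  return out;
}
```

For example, with sorted values [1, 3, 5] and needles encoded as run values [2, 4] with run ends [2, 3] (logical needles [2, 2, 4]), this produces [1, 1, 2].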
const int64_t leading_null_count = CountLeadingNulls(
    values.length, [&](int64_t index) { return values.IsNull(index); });
const int64_t trailing_null_count =
    leading_null_count > 0 ? 0 : CountTrailingNulls(values.length, [&](int64_t index) {
      return values.IsNull(index);
    });
This is... complicated. We know there are null_count nulls, and they are either clustered at the start or at the end. We just have to look up the first validity bit (values.IsNull(0)) and then we know how many leading/trailing nulls there are.
And we can do the exact same thing for chunked arrays.
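The simplification suggested here can be sketched in a few lines (illustrative names; the real code would read the validity bitmap instead of taking a bool):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the suggestion above: in a sorted array, all null_count nulls
// are contiguous at one end, so checking whether position 0 is null is
// enough to decide which end they are on.
struct NullExtents {
  int64_t leading;
  int64_t trailing;
};

NullExtents GetNullExtents(int64_t length, int64_t null_count,
                           bool first_is_null) {
  if (null_count == 0 || length == 0) return {0, 0};
  // Nulls were sorted to the front...
  if (first_is_null) return {null_count, 0};
  // ...otherwise they must all be trailing.
  return {0, null_count};
}
```

The same one-bit check works per chunk for chunked arrays, since each chunk carries its own null count and validity bitmap.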
Rationale for this change
Add the implementation of the search sorted compute kernel based on the numpy function: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
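For reference, the left/right insertion-point semantics the kernel mirrors map directly onto the standard library; this is an illustrative sketch, not the PR's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// numpy.searchsorted semantics in stdlib terms: side="left" is the first
// index where the needle could be inserted while keeping `values` sorted
// (std::lower_bound); side="right" is the last such index (std::upper_bound).
int64_t SearchSortedLeft(const std::vector<int>& values, int needle) {
  return std::lower_bound(values.begin(), values.end(), needle) -
         values.begin();
}

int64_t SearchSortedRight(const std::vector<int>& values, int needle) {
  return std::upper_bound(values.begin(), values.end(), needle) -
         values.begin();
}
```

With values [1, 3, 3, 5] and needles [0, 3, 4, 6] (the data used in the PR's smoke tests), left gives [0, 1, 3, 4] and right gives [0, 3, 3, 4].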
What changes are included in this PR?
Implementation of the C++ kernel + Python API.
Tests in C++ and Python
Are these changes tested?
Yes
Are there any user-facing changes?
No breaking change
search_sorted kernel for all primitive types and run-end encoded arrays #49677