Skip to content

GH-49677 [Python][C++][Compute] Add search sorted compute kernel#49679

Open
Alex-PLACET wants to merge 14 commits intoapache:mainfrom
Alex-PLACET:add_search_sorted_compute_kernel
Open

GH-49677 [Python][C++][Compute] Add search sorted compute kernel#49679
Alex-PLACET wants to merge 14 commits intoapache:mainfrom
Alex-PLACET:add_search_sorted_compute_kernel

Conversation

@Alex-PLACET
Copy link
Copy Markdown

@Alex-PLACET Alex-PLACET commented Apr 7, 2026

Rationale for this change

Add the implemenation of the search sorted compute kernel based on the numpy function: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html

What changes are included in this PR?

Implementation of the C++ kernel + Python API.
Tests in C++ and Python

Are these changes tested?

Yes

Are there any user-facing changes?

No breaking change

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 7, 2026

⚠️ GitHub issue #49677 has been automatically assigned in GitHub to PR creator.

- Added a new benchmark file `vector_search_sorted_benchmark.cc` to evaluate the performance of the SearchSorted function for various data types including Int64, String, and Binary.
- Created a comprehensive test suite in `vector_search_sorted_test.cc` to validate the correctness of SearchSorted across different scenarios, including handling of null values, scalar needles, and run-end encoded arrays.
- Ensured that the benchmarks cover both left and right search options, as well as edge cases like empty arrays and arrays with leading/trailing nulls.
…lize ranges for leading/trailing null counts
@Alex-PLACET Alex-PLACET force-pushed the add_search_sorted_compute_kernel branch from 8e09ea3 to 4ec630e Compare April 8, 2026 08:15
Comment on lines +53 to +58
[
0,
0,
3,
5
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[
0,
0,
3,
5
]
[
0,
0,
3,
5
]

@Alex-PLACET just noticing that the elements here require two space chars to be removed. Similarly for the next two example outputs.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 8, 2026
Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the implementation yet, but have reviewed the tests and benchmarks (which are quite comprehensive, thank you!).

One missing item is support for chunked arrays. Besides that, see comments below :)

Comment thread cpp/src/arrow/compute/kernels/vector_search_sorted_test.cc
SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
SearchSorted(Datum(values), Datum(needles),
SearchSortedOptions(SearchSortedOptions::Right)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call ValidateFull on both results here too?

(also I'm curious, why not reuse CheckSimpleSearchSorted?)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fixed too

Comment on lines +90 to +92
std::string scalar_needle_json;
uint64_t expected_scalar_left;
uint64_t expected_scalar_right;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you could also generate the scalar needle tests automatically by calling GetScalar on the array needles and the expected results. This would make this easier to maintain later.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fixed

Comment thread cpp/src/arrow/compute/kernels/vector_search_sorted_test.cc
return std::static_pointer_cast<BinaryArray>(builder.Finish().ValueOrDie());
}

std::shared_ptr<BinaryArray> BuildBinaryNeedles(int64_t size_bytes) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth benchmarking both string and binary, as they are expected to perform similarly.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


void SetSearchSortedArgs(benchmark::internal::Benchmark* bench) {
bench->Unit(benchmark::kMicrosecond);
for (const auto size : kMemorySizes) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the largest size produces rather slow benchmark iterations, can we just keep {kL1Size, kL2Size}}?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which essentially means that all benchmarks become "quick" as per the definition of quick here :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fixed

RunSearchSortedBenchmark(state, values, needles, side);
}

static void BM_SearchSortedInt64ScalarNeedle(benchmark::State& state,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is just one needle to search for, we're just benchmarking the function call overhead rather than any significant part of the sorted search kernel, right? Is it useful?

(benchmarks for other compute functions focus on array performance, not scalar performance)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

->Apply(SetSearchSortedArgs);

// String and binary scalar cases specifically exercise the direct scalar fast path that
// avoids boxing a scalar needle into a temporary one-element array.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we wanted to benchmark this (which I doubt we do), we would only need a single benchmark IMHO.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 23, 2026

@Alex-PLACET Is this ready for review again?

@Alex-PLACET Alex-PLACET requested a review from pitrou April 23, 2026 12:02
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 23, 2026

@Alex-PLACET It looks like the benchmarks don't compile anymore?

/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc: In function 'void arrow::compute::BM_SearchSortedBinaryScalarNeedle(benchmark::State&, SearchSortedOptions::Side)':
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc:255:29: error: 'BuildSortedBinaryValues' was not declared in this scope; did you mean 'BuildSortedInt64Values'?
  255 |   const auto values_array = BuildSortedBinaryValues(state.range(0));
      |                             ^~~~~~~~~~~~~~~~~~~~~~~
      |                             BuildSortedInt64Values
In file included from /home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc:18:
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc: In lambda function:
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc:290:19: error: 'BM_SearchSortedInt64ScalarNeedle' was not declared in this scope; did you mean 'BM_SearchSortedBinaryScalarNeedle'?
  290 | BENCHMARK_CAPTURE(BM_SearchSortedInt64ScalarNeedle, left, SearchSortedOptions::Left)
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc: In lambda function:
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc:292:19: error: 'BM_SearchSortedInt64ScalarNeedle' was not declared in this scope; did you mean 'BM_SearchSortedBinaryScalarNeedle'?
  292 | BENCHMARK_CAPTURE(BM_SearchSortedInt64ScalarNeedle, right, SearchSortedOptions::Right)
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc: At global scope:
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc:253:13: warning: 'void arrow::compute::BM_SearchSortedBinaryScalarNeedle(benchmark::State&, SearchSortedOptions::Side)' defined but not used [-Wunused-function]
  253 | static void BM_SearchSortedBinaryScalarNeedle(benchmark::State& state,
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/vector_search_sorted_benchmark.cc:243:13: warning: 'void arrow::compute::BM_SearchSortedStringScalarNeedle(benchmark::State&, SearchSortedOptions::Side)' defined but not used [-Wunused-function]
  243 | static void BM_SearchSortedStringScalarNeedle(benchmark::State& state,
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I also took a look at the implementation and some aspects of it seem overly-complicated (but I may miss something). See comments below.

Comment on lines +113 to +114
"[1, 3, 3, 5]",
"[0, 3, 4, 6]",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the values and needles arrays different lengths?

auto values = ArrayFromJSON(int16(), "[]");
auto needles = ArrayFromJSON(int16(), "[1, 2, 3]");

ASSERT_OK_AND_ASSIGN(auto result, SearchSorted(Datum(values), Datum(needles)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reuse CheckSimpleSearchSorted? It would validate the results and also stress scalar needles.

auto values = ArrayFromJSON(int32(), "[null, 200, 300, 300]");
auto needles = ArrayFromJSON(int32(), "[50, 200, 250, 400]");

ASSERT_OK_AND_ASSIGN(auto left,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, let's re-use the testing helper CheckSimpleSearchSorted everywhere possible, otherwise we're forgetting some checks.

Comment on lines +174 to +177
{"Float32", float32(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]",
"[0, 1, 3, 4]", "[0, 3, 3, 4]"},
{"Float64", float64(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]",
"[0, 1, 3, 4]", "[0, 3, 3, 4]"},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to test NaNs here? They will probably need special treatment, whether in the values or the needles.

For the record, sorting an array with Arrow always put NaNs between nulls and non-null values.
e.g. either [1.0, 3.0, 3.0, 5.0, NaN, NaN, null] or [null, NaN, NaN, 1.0, 3.0, 3.0, 5.0].

Comment on lines +334 to +339
ASSERT_OK_AND_ASSIGN(auto left,
SearchSorted(Datum(values), Datum(needles),
SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
SearchSorted(Datum(values), Datum(needles),
SearchSortedOptions(SearchSortedOptions::Right)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either you write or reuse a helper here, or you need to call ValidateFull directly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And ideally we should also check scalar needles too.

auto values_type = run_end_encoded(int32(), int32());
ASSERT_OK_AND_ASSIGN(auto ree_values,
REEFromJSON(values_type, "[0, 0, 1, 1, 1, 4, 4, 9]"));
auto sliced = ree_values->Slice(2, 5);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a slice that does not happily falls on run boundaries? (for example Slice(1, 5))

AssertArraysEqual(*ArrayFromJSON(uint64(), "[0, 0, 3, 3, 5]"), *result.make_array());
}

TEST(SearchSorted, BinaryValues) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already part of the smoke tests, right?

Comment on lines +232 to +235
ValueType Value(int64_t index) const {
const auto physical_index = span_.PhysicalIndex(index);
return GetViewType<ArrowType>::LogicalValue(values_.GetView(physical_index));
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so we're processing each run-end logical value individually, and then finding it in the array of physical values? This is wasteful, as it's O(n_needles * log n_values), while each physical value will produce the same output after all.

What run-end-encoded values should do instead is:

  1. run search_sorted on the run values, giving "physical results"
  2. turn the "physical results" into "logical results" by looking up the run start for each physical result

In other words, given:

  • values = run_end_encoded([10,10,10,30,30]) (i.e. run values [10, 30] and run ends [3, 5])
  • needles = [10, 20, 30, 40]

we would first compute the physical result [0, 1, 1, 2] of searching the needles in [10, 30], then turn it into the logical results [0, 3, 3, 5].

This would be O(n_needles * log n_value_runs) and therefore more efficient. It should also generate less template specializations, because you don't need to template this algorithm per (physical value type, run end type) pair.

const auto& needle_data = needles.array();
if (needle_data->type->id() == Type::RUN_END_ENCODED) {
RunEndEncodedArray ree(needle_data);
return DispatchRunEndEncodedByRunEndType<Status>(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to what I pointed above about run-end-encoded values, you don't need complex indexing for run-end-encoded needles. It's actually even simpler as you can just search each physical needle and run-end-decode the (needle run ends, physical results).

Comment on lines +462 to +468
const int64_t leading_null_count = CountLeadingNulls(
values.length, [&](int64_t index) { return values.IsNull(index); });
const int64_t trailing_null_count =
leading_null_count > 0 ? 0 : CountTrailingNulls(values.length, [&](int64_t index) {
return values.IsNull(index);
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is... complicated. We know there are null_count nulls, and they are either clustered at the start or at the end. We just have to look up the first validity bit (values.IsNull(0)) and then we know how many leading/trailing nulls there are.

And we can do the exact same thing for chunked arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants