GH-49677 [Python][C++][Compute] Add search sorted compute kernel #49679
Alex-PLACET wants to merge 14 commits into apache:main
Conversation
- Added a new benchmark file `vector_search_sorted_benchmark.cc` to evaluate the performance of the SearchSorted function for various data types including Int64, String, and Binary.
- Created a comprehensive test suite in `vector_search_sorted_test.cc` to validate the correctness of SearchSorted across different scenarios, including handling of null values, scalar needles, and run-end encoded arrays.
- Ensured that the benchmarks cover both left and right search options, as well as edge cases like empty arrays and arrays with leading/trailing nulls.
…rks for needles with null runs
…tation overview and flow
…lize ranges for leading/trailing null counts
…ensive tests for supported types
…xcept to length method
[
  0,
  0,
  3,
  5
]
@Alex-PLACET just noticing that the elements here require two space chars to be removed. Similarly for the next two example outputs.
pitrou left a comment
I haven't looked at the implementation yet, but have reviewed the tests and benchmarks (which are quite comprehensive, thank you!).
One missing item is support for chunked arrays. Besides that, see comments below :)
                     SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
                     SearchSorted(Datum(values), Datum(needles),
                                  SearchSortedOptions(SearchSortedOptions::Right)));
Let's call ValidateFull on both results here too?
(also I'm curious, why not reuse CheckSimpleSearchSorted?)
std::string scalar_needle_json;
uint64_t expected_scalar_left;
uint64_t expected_scalar_right;
Note that you could also generate the scalar needle tests automatically by calling GetScalar on the array needles and the expected results. This would make this easier to maintain later.
  return std::static_pointer_cast<BinaryArray>(builder.Finish().ValueOrDie());
}

std::shared_ptr<BinaryArray> BuildBinaryNeedles(int64_t size_bytes) {
I don't think it's worth benchmarking both string and binary, as they are expected to perform similarly.
void SetSearchSortedArgs(benchmark::internal::Benchmark* bench) {
  bench->Unit(benchmark::kMicrosecond);
  for (const auto size : kMemorySizes) {
Unfortunately the largest size produces rather slow benchmark iterations, can we just keep {kL1Size, kL2Size}?
Which essentially means that all benchmarks become "quick" as per the definition of quick here :)
  RunSearchSortedBenchmark(state, values, needles, side);
}

static void BM_SearchSortedInt64ScalarNeedle(benchmark::State& state,
If there is just one needle to search for, we're just benchmarking the function call overhead rather than any significant part of the sorted search kernel, right? Is it useful?
(benchmarks for other compute functions focus on array performance, not scalar performance)
    ->Apply(SetSearchSortedArgs);

// String and binary scalar cases specifically exercise the direct scalar fast path that
// avoids boxing a scalar needle into a temporary one-element array.
Even if we wanted to benchmark this (which I doubt we do), we would only need a single benchmark IMHO.
…and add tests for chunked values and needles
@Alex-PLACET Is this ready for review again?
@Alex-PLACET It looks like the benchmarks don't compile anymore?
pitrou left a comment
Now I also took a look at the implementation, and some aspects of it seem overly complicated (but I may be missing something). See comments below.
"[1, 3, 3, 5]",
"[0, 3, 4, 6]",
Can we make the values and needles arrays different lengths?
auto values = ArrayFromJSON(int16(), "[]");
auto needles = ArrayFromJSON(int16(), "[1, 2, 3]");

ASSERT_OK_AND_ASSIGN(auto result, SearchSorted(Datum(values), Datum(needles)));
Why not reuse CheckSimpleSearchSorted? It would validate the results and also stress scalar needles.
auto values = ArrayFromJSON(int32(), "[null, 200, 300, 300]");
auto needles = ArrayFromJSON(int32(), "[50, 200, 250, 400]");

ASSERT_OK_AND_ASSIGN(auto left,
Again, let's re-use the testing helper CheckSimpleSearchSorted everywhere possible, otherwise we're forgetting some checks.
{"Float32", float32(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]",
 "[0, 1, 3, 4]", "[0, 3, 3, 4]"},
{"Float64", float64(), "[1.0, 3.0, 3.0, 5.0]", "[0.0, 3.0, 4.0, 6.0]",
 "[0, 1, 3, 4]", "[0, 3, 3, 4]"},
Do we want to test NaNs here? They will probably need special treatment, whether in the values or the needles.
For the record, sorting an array with Arrow always puts NaNs between nulls and non-null values,
e.g. either [1.0, 3.0, 3.0, 5.0, NaN, NaN, null] or [null, NaN, NaN, 1.0, 3.0, 3.0, 5.0].
ASSERT_OK_AND_ASSIGN(auto left,
                     SearchSorted(Datum(values), Datum(needles),
                                  SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
                     SearchSorted(Datum(values), Datum(needles),
                                  SearchSortedOptions(SearchSortedOptions::Right)));
Either you write or reuse a helper here, or you need to call ValidateFull directly.
And ideally we should also check scalar needles too.
auto values_type = run_end_encoded(int32(), int32());
ASSERT_OK_AND_ASSIGN(auto ree_values,
                     REEFromJSON(values_type, "[0, 0, 1, 1, 1, 4, 4, 9]"));
auto sliced = ree_values->Slice(2, 5);
Can we use a slice that does not happen to fall on run boundaries? (for example Slice(1, 5))
AssertArraysEqual(*ArrayFromJSON(uint64(), "[0, 0, 3, 3, 5]"), *result.make_array());
}

TEST(SearchSorted, BinaryValues) {
This is already part of the smoke tests, right?
ValueType Value(int64_t index) const {
  const auto physical_index = span_.PhysicalIndex(index);
  return GetViewType<ArrowType>::LogicalValue(values_.GetView(physical_index));
}
Hmm, so we're processing each run-end logical value individually, and then finding it in the array of physical values? This is wasteful, as it's O(n_needles * log n_values), while each physical value will produce the same output after all.
What run-end-encoded values should do instead is:
- run search_sorted on the run values, giving "physical results"
- turn the "physical results" into "logical results" by looking up the run start for each physical result
In other words, given:
values = run_end_encoded([10, 10, 10, 30, 30]) (i.e. run values [10, 30] and run ends [3, 5])
needles = [10, 20, 30, 40]
we would first compute the physical result [0, 1, 1, 2] of searching the needles in [10, 30], then turn it into the logical results [0, 3, 3, 5].
This would be O(n_needles * log n_value_runs) and therefore more efficient. It should also generate less template specializations, because you don't need to template this algorithm per (physical value type, run end type) pair.
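The two-step approach described in this comment could be sketched roughly as follows. This is a hypothetical illustration using plain vectors and std::lower_bound rather than Arrow types; the function name and signature are made up for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the suggested two-step algorithm: search the
// physical run values, then map each physical insertion point to a logical
// index via the run ends (the start of run i is run_ends[i - 1], or 0).
std::vector<int64_t> SearchSortedRee(const std::vector<int64_t>& run_values,
                                     const std::vector<int64_t>& run_ends,
                                     const std::vector<int64_t>& needles) {
  std::vector<int64_t> out;
  out.reserve(needles.size());
  for (int64_t needle : needles) {
    // Step 1: "physical result" -- left insertion point among the run values.
    const auto it =
        std::lower_bound(run_values.begin(), run_values.end(), needle);
    const int64_t physical = it - run_values.begin();
    // Step 2: "logical result" -- the logical start of that run.
    out.push_back(physical == 0 ? 0 : run_ends[physical - 1]);
  }
  return out;
}
```

With run values [10, 30], run ends [3, 5], and needles [10, 20, 30, 40], this yields the logical results [0, 3, 3, 5] from the example above.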
const auto& needle_data = needles.array();
if (needle_data->type->id() == Type::RUN_END_ENCODED) {
  RunEndEncodedArray ree(needle_data);
  return DispatchRunEndEncodedByRunEndType<Status>(
Similarly to what I pointed above about run-end-encoded values, you don't need complex indexing for run-end-encoded needles. It's actually even simpler as you can just search each physical needle and run-end-decode the (needle run ends, physical results).
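The run-end-encoded-needles idea amounts to searching each distinct physical needle once and then expanding the results over the runs. A minimal sketch with plain vectors (illustrative names, not Arrow APIs):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: search only the physical needle run values, then
// run-end-decode the results by repeating each one over its run's length.
std::vector<int64_t> SearchSortedReeNeedles(
    const std::vector<int64_t>& sorted_values,
    const std::vector<int64_t>& needle_run_values,
    const std::vector<int64_t>& needle_run_ends) {
  std::vector<int64_t> out;
  int64_t run_start = 0;
  for (size_t run = 0; run < needle_run_values.size(); ++run) {
    // One binary search per run, not per logical needle.
    const auto it = std::lower_bound(sorted_values.begin(), sorted_values.end(),
                                     needle_run_values[run]);
    const int64_t physical_result = it - sorted_values.begin();
    // Decode: the run spans logical indices [run_start, needle_run_ends[run]).
    for (int64_t i = run_start; i < needle_run_ends[run]; ++i) {
      out.push_back(physical_result);
    }
    run_start = needle_run_ends[run];
  }
  return out;
}
```

For example, with sorted values [1, 3, 5] and needles encoded as run values [2, 4] with run ends [2, 3] (logical needles [2, 2, 4]), this produces [1, 1, 2].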
const int64_t leading_null_count = CountLeadingNulls(
    values.length, [&](int64_t index) { return values.IsNull(index); });
const int64_t trailing_null_count =
    leading_null_count > 0 ? 0 : CountTrailingNulls(values.length, [&](int64_t index) {
      return values.IsNull(index);
    });
This is... complicated. We know there are null_count nulls, and they are either clustered at the start or at the end. We just have to look up the first validity bit (values.IsNull(0)) and then we know how many leading/trailing nulls there are.
And we can do the exact same thing for chunked arrays.
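The simplification suggested here can be sketched in a few lines (illustrative names; the real code would read the validity bitmap instead of taking a bool):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the suggestion above: in a sorted array, all null_count nulls
// are contiguous at one end, so checking whether position 0 is null is
// enough to decide which end they are on.
struct NullExtents {
  int64_t leading;
  int64_t trailing;
};

NullExtents GetNullExtents(int64_t length, int64_t null_count,
                           bool first_is_null) {
  if (null_count == 0 || length == 0) return {0, 0};
  // Nulls were sorted to the front...
  if (first_is_null) return {null_count, 0};
  // ...otherwise they must all be trailing.
  return {0, null_count};
}
```

The same one-bit check works per chunk for chunked arrays, since each chunk carries its own null count and validity bitmap.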
Rationale for this change
Add the implementation of the search sorted compute kernel based on the numpy function: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
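For reference, the left/right insertion-point semantics the kernel mirrors map directly onto the standard library; this is an illustrative sketch, not the PR's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// numpy.searchsorted semantics in stdlib terms: side="left" is the first
// index where the needle could be inserted while keeping `values` sorted
// (std::lower_bound); side="right" is the last such index (std::upper_bound).
int64_t SearchSortedLeft(const std::vector<int>& values, int needle) {
  return std::lower_bound(values.begin(), values.end(), needle) -
         values.begin();
}

int64_t SearchSortedRight(const std::vector<int>& values, int needle) {
  return std::upper_bound(values.begin(), values.end(), needle) -
         values.begin();
}
```

With values [1, 3, 3, 5] and needles [0, 3, 4, 6] (the data used in the PR's smoke tests), left gives [0, 1, 3, 4] and right gives [0, 3, 3, 4].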
What changes are included in this PR?
Implementation of the C++ kernel + Python API.
Tests in C++ and Python
Are these changes tested?
Yes
Are there any user-facing changes?
No breaking change
search_sorted kernel for all primitive types and run-end encoded arrays #49677