Skip to content

[env](compiler) Pick compiler optimization PRs to branch-4.1#62865

Open
Mryange wants to merge 10 commits intoapache:branch-4.1from
Mryange:branch-4.1-pick-pr-dev-4.27-part-4
Open

[env](compiler) Pick compiler optimization PRs to branch-4.1#62865
Mryange wants to merge 10 commits intoapache:branch-4.1from
Mryange:branch-4.1-pick-pr-dev-4.27-part-4

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented Apr 27, 2026

What problem does this PR solve?

  • #61179 Reduce the compilation time of aggregate functions
  • #61227 Reduce template instantiation in scan_operator
  • #61349 Reduce hash join build template instantiations
  • #61515 Reduce template instantiations in datetime floor/ceil functions
  • #61606 Optimize the compilation time of aggregate_function_orthogonal_bitmap.cpp
  • #61276 Optimize the compilation time of cast
  • #61232 Reduce window function template instantiations
  • #62047 Reduce compile time of aggregate_function_reader_replace.cpp
  • #62433 Enforce final on CRTP aggregate function derived classes to enable compiler devirtualization

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Mryange added 9 commits April 27, 2026 15:53
…pache#61179)

Previously, some of our aggregate functions were implemented in very
large .cpp files, which made compilation slow.
For example, be/src/exprs/aggregate/aggregate_function_window.cpp takes
347.25 seconds to compile. T

his PR splits these files into multiple smaller files so they can be
compiled in parallel.

before
```
 347.25  be/src/exprs/aggregate/aggregate_function_window.cpp
```

now
```
[1/7] Compiling: be/src/exprs/aggregate/aggregate_function_window.cpp
  → 25.9s

[2/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_first.cpp
  → 60.5s

[3/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_funnel.cpp
  → 19.6s

[4/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lag.cpp
  → 60.6s

[5/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_last.cpp
  → 63.9s

[6/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lead.cpp
  → 69.1s

[7/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_nth_value.cpp
  → 74.5s
```
…rove compile time (apache#61227)

`scan_operator.cpp` was the 2nd slowest file to compile (~289s), caused
by template
instantiation blowup. Methods defined on `ScanLocalState<Derived>` were
instantiated
for each of the 6 `Derived` types (OlapScan, JDBCScan, FileScan, EsScan,
MetaScan,
GroupCommitScan). Specifically:

- 4 inner `PrimitiveType`-templated methods × 6 Derived × 21
PrimitiveTypes = **504 instantiations**
- 5 non-template normalize methods × 6 Derived = **30 extra
compilations**

None of these methods actually depend on `Derived` — they only use base
class members
or virtual dispatch.

## Solution

Move the methods that do not depend on `Derived` up to
`ScanLocalStateBase`:

**Inner PrimitiveType-template methods** (instantiation reduced from
6×21 → 21, ÷6):
- `_normalize_in_predicate<T>`
- `_normalize_binary_predicate<T>`
- `_normalize_is_null_predicate<T>`
- `_change_value_range<T, Func>`

**Non-template normalize methods** (each compiled 6× → 1×):
- `_eval_const_conjuncts`
- `_normalize_bloom_filter`
- `_normalize_topn_filter`
- `_normalize_bitmap_filter`
- `_normalize_function_filters`

**Accompanying changes:**
- Move `_should_push_down_*` virtual methods, `_eos`,
`_push_down_functions` members
to `ScanLocalStateBase` so the moved methods can call them without
`Derived`.
- Add `_max_pushdown_conditions_per_column` to base class, initialized
from parent
  in `ScanLocalState<Derived>::init()`.
- `_normalize_topn_filter`: replace `p.node_id()` (via `Derived::Parent`
cast) with
  `_parent->node_id()` (base class method).


before
```
288.82s  be/src/exec/operator/scan_operator.cpp
```

now
```
[1/1] Compiling: be/src/exec/operator/scan_operator.cpp
  → 209.9s
```
…#61349)

before
```
[1/1] Compiling: be/src/exec/operator/hashjoin_build_sink.cpp
  → 69.6s
```

now
```
[1/1] Compiling: be/src/exec/operator/hashjoin_build_sink.cpp
  → 48.9s
```
…unctions (apache#61515)

`function_datetime_floor_ceil.cpp` has excessive template instantiation
bloat. The class `FunctionDateTimeFloorCeil<Flag, PType, ArgNum,
UseDelta>` has 4 template parameters, producing 192 instantiations.
However, all heavy computation functions (9 `vector_*` variants,
`time_round_two_args`, `floor_opt`, etc.) only depend on `Flag` and
`PType` — they never use `ArgNum` or `UseDelta`. This means these
expensive functions are compiled 4x more than necessary.

This PR applies three optimizations:

1. **Extract `DateTimeFloorCeilCore<Flag, PType>` struct**: All heavy
computation functions are moved into a separate struct with only 2
template parameters. `FunctionDateTimeFloorCeil` becomes a thin shell
that delegates to `Core::vector_*()`. This reduces instantiations of the
heaviest code from **192 → 48** (÷4).

2. **Extract `convert_utc_to_local_impl<DateValueType>` free function**:
This function only depends on `DateValueType`, not
`Flag`/`ArgNum`/`UseDelta`. Extracting it reduces its instantiations
from 192 → 3.

3. **Extract `FunctionDateTimeFloorCeilBase` non-template base class**:
Three virtual overrides (`get_number_of_arguments`, `is_variadic`,
`use_default_implementation_for_nulls`) use zero template parameters.
Moving them to a non-template base eliminates 192 redundant copies.
…thogonal_bitmap.cpp (apache#61606)

### What problem does this PR solve?

Optimize the compilation time of
`aggregate_function_orthogonal_bitmap.cpp` by reducing template
instantiation bloat and improving build parallelism.

### Changes

**1. Split single translation unit into multiple files**

The original `aggregate_function_orthogonal_bitmap.cpp` instantiated all
6 aggregate functions (~36 template instances) in a single file, forcing
serial compilation. Now split into:

- `aggregate_function_orthogonal_bitmap.cpp` — forward declarations +
registration only (compiles fast)
- `aggregate_function_orth_bitmap_intersect.cpp`
- `aggregate_function_orth_bitmap_intersect_count.cpp`
- `aggregate_function_orth_bitmap_union_count.cpp`
- `aggregate_function_orth_intersect_count.cpp`
- `aggregate_function_orth_bitmap_expr_cal.cpp`
- `aggregate_function_orth_bitmap_expr_cal_count.cpp`

**2. Remove unnecessary `PrimitiveType` template parameter from
`OrthBitmapUnionCountData`**

`OrthBitmapUnionCountData<T>` never uses `T` — it only operates on
`ColumnBitmap` directly. Removed the template parameter, reducing 6
useless instantiations to 1.

**3. Eliminate type dispatch for `ExprCal` / `ExprCalCount`**

FE guarantees these functions receive exactly 3 parameters with VARCHAR
filter columns. Replaced the generic 6-type dispatch with direct
`TYPE_STRING` instantiation, reducing 12 instantiations to 2.

**4. Use precise expression tags matching FE definitions**

Added `ExprTag` template parameter to `AggFunctionOrthBitmapFunc`
(default: `VarargsExpression`):

| Function | FE Arity | Old BE Tag | New BE Tag |
|----------|----------|------------|------------|
| `orthogonal_bitmap_union_count` | `UnaryExpression` (1 arg) |
`VarargsExpression` | `UnaryExpression` |
| `orthogonal_bitmap_expr_calculate` | Fixed 3 args |
`VarargsExpression` | `MultiExpression` |
| `orthogonal_bitmap_expr_calculate_count` | Fixed 3 args |
`VarargsExpression` | `MultiExpression` |
| Others (intersect/intersect_count) | Varargs (≥3) |
`VarargsExpression` | `VarargsExpression` (unchanged) |

**5. Add `inline` to template specializations in `bitmap_intersect.h`**

The `Helper::write_to` / `serialize_size` / `read_from` template
specializations in the header lacked `inline`, causing duplicate symbol
linker errors after splitting into multiple translation units.

### Summary

| Metric | Before | After |
|--------|--------|-------|
| Translation units | 1 (serial) | 7 (parallel) |
| Template instantiations | ~36 | ~21 (~42% reduction) |
| `ExprCal`/`ExprCalCount` instantiations | 12 | 2 |
| `OrthBitmapUnionCountData` instantiations | 6 | 1 |
Previously, many cast instantiations were done in .h files, which caused
a lot of code to be instantiated whenever other files included them.
Here we move those instantiations into .cpp files, and keep only the
code for casting a single value in the .h file.
…#61232)

1. **Remove `ColVecType` from window-path Data types** (`LeadLagData`,
`FirstLastData`, `NthValueData`): The window path enters via virtual
`IAggregateFunction*`, so devirtualizing `insert_from` is pointless. Use
`IColumn::insert_from` virtual call instead. Reader path (CRTP batch
loop) keeps `ColVecType`.

2. **Move `ColVecType` from class-level to function-level** on
`Value`/`CopiedValue`: Only `insert_into` needs it; other members
(`reset`, `is_null`, `set_value`) no longer get 24× instantiated.

3. **Use `void` as marker type**: `FirstLastData` inherits
`ReaderFirstAndLastData<void, ...>`. `if constexpr
(std::is_same_v<ColVecType, void>)` selects virtual vs devirtualized
path at compile time.

4. **Lift `arg_ignore_null` to compile-time**: The old runtime `if`
inside the 24-type switch doubled instantiations. Now dispatched at
registration via lambda.

5. **Delete the 300-line type switch**: Replaced by
`CREATE_WINDOW_FUNCTION_DIRECT` macro that directly instantiates 4
combinations (2×2).

## Compile time (single-thread -j1)
Before
```
[1/7] Compiling: be/src/exprs/aggregate/aggregate_function_window.cpp
  → 25.9s

[2/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_first.cpp
  → 60.5s

[3/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_funnel.cpp
  → 19.6s

[4/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lag.cpp
  → 60.6s

[5/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_last.cpp
  → 63.9s

[6/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lead.cpp
  → 69.1s

[7/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_nth_value.cpp
  → 74.5s

```
Now
```

[1/7] Compiling: be/src/exprs/aggregate/aggregate_function_window.cpp
  → 25.4s

[2/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_first.cpp
  → 24.4s

[3/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_funnel.cpp
  → 19.3s

[4/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lag.cpp
  → 24.4s

[5/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_last.cpp
  → 25.1s

[6/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lead.cpp
  → 24.9s

[7/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_nth_value.cpp
  → 25.2s
```
…ace.cpp (apache#62047)

`aggregate_function_reader_replace.cpp` took **86.91s** to compile
because it included `aggregate_function_reader_first_last.h`, which
pulled in heavy template machinery: a 28-case column-type switch × 4
bool combinations × 4 factory functions = ~448
`make_shared<ReaderFunctionData<...>>` instantiations in a single TU.

### How is it solved?

1. **Decouple from `aggregate_function_reader_first_last.h`**: The
reader/load path now has a self-contained, lightweight implementation
directly in the `.cpp` file. The shared header is no longer included.

2. **Introduce `PointerStore<ArgIsNullable>` /
`CopyStore<ArgIsNullable>`**: Two small storage classes with a uniform
interface (`is_null`, `set_value<SkipNull>`, `insert_into`, `reset`).
Reader path uses zero-copy `PointerStore`; load path uses `CopyStore`
with `Field` deep-copy.

3. **Use 3 bool template params `<IsFirst, SkipNull, ArgIsNullable>`**:
`IsCopy` is derived as `!IsFirst`. `ArgIsNullable` is resolved at
registration time (nullable vs non-nullable factory map), eliminating
runtime branches. `result_is_nullable` stays as a runtime bool (called
once per group).

4. **All column operations use `IColumn` virtual dispatch**: No
column-type-specific template instantiations. Performance impact is
negligible — `Field` operations and the `add()` virtual call itself
dominate.

5. **Clean up `aggregate_function_reader_first_last.h`**: Removed
`ReaderFunctionFirstData`, `ReaderFunctionLastData`,
`ReaderFunctionFirstNonNullData`, `ReaderFunctionLastNonNullData`,
`ReaderFunctionData`, `create_function_single_value`, the 28-case
switch, and the `CREATE_READER_FUNCTION_WITH_NAME_AND_DATA` macro — all
now unused. Also removed unnecessary includes (`column_array.h`,
`column_map.h`, etc.).

### Compile time

| Stage | Time |
|-------|------|
| Before | 86.91s |
| After | **17.0s (↓80%)** |
… classes to enable compiler devirtualization (apache#62433)

`IAggregateFunctionHelper<Derived>` implements hot aggregate paths such
as `add_batch()` and `add_batch_single_place()` by calling
`assert_cast<const Derived*>(this)->add(...)`. Because `add()` is
declared `virtual` in `IAggregateFunction`, the compiler cannot
devirtualize this call unless it can prove that `Derived` has no further
overrides — which requires `Derived` to be marked `final`.

This PR enforces this via a `static_assert` in the
`IAggregateFunctionHelper` constructor. Note that `final` here is a
**performance hint only** — correctness is not affected, since `add()`
remains virtual and subclasses always dispatch correctly through the
vtable regardless. Marking `Derived` as `final` allows the compiler to
devirtualize (and potentially inline) the `add()` call inside hot
aggregation paths.

Changes:

1. Add a `static_assert` in the `IAggregateFunctionHelper` constructor
asserting `std::is_final_v<Derived>`, so that missing `final` is caught
at compile time.
2. Introduce `AggregateFunctionNonFinalBase` as an opt-out marker for
classes that intentionally form inheritance hierarchies (e.g.
`AggregateStateUnion → AggregateStateMerge`, `AggregateFunctionForEach →
AggregateFunctionForEachV2`).
3. Mark all concrete aggregate function classes `final` across
`be/src/exprs/aggregate/`.
4. Refactor `AggregateFunctionTopNBase` to accept a `Derived` template
parameter so that `AggregateFunctionTopN` and
`AggregateFunctionTopNArray` can each be marked `final`.
5. Refactor `AggregateFunctionPercentileApprox` into
`AggregateFunctionPercentileApproxBase<Derived>` so the four concrete
specializations can each be marked `final`.
@Mryange Mryange requested a review from yiguolei as a code owner April 27, 2026 08:10
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

…ization PRs to branch-4.1

- Add 'final' to AggregateFunctionRegrSimple to satisfy the static_assert
  introduced by apache#62433
- Add change/change_if_less/change_if_greater methods to
  SingleValueDataComplexType to match the interface expected by
  aggregate_function_min_max_impl.h introduced by apache#61179
@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 27, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 54.30% (537/989) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.26% (20039/37627)
Line Coverage 36.73% (188767/513979)
Region Coverage 33.02% (146596/443928)
Branch Coverage 34.12% (64161/188029)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 58.75% (581/989) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.69% (23455/36829)
Line Coverage 46.91% (240341/512297)
Region Coverage 44.08% (197445/447900)
Branch Coverage 45.22% (85268/188580)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants