[env](compiler) Pick compiler optimization PRs to branch-4.1#62865
Open
Mryange wants to merge 10 commits intoapache:branch-4.1from
Open
[env](compiler) Pick compiler optimization PRs to branch-4.1#62865Mryange wants to merge 10 commits intoapache:branch-4.1from
Mryange wants to merge 10 commits intoapache:branch-4.1from
Conversation
…pache#61179) Previously, some of our aggregate functions were implemented in very large .cpp files, which made compilation slow. For example, be/src/exprs/aggregate/aggregate_function_window.cpp takes 347.25 seconds to compile. T his PR splits these files into multiple smaller files so they can be compiled in parallel. before ``` 347.25 be/src/exprs/aggregate/aggregate_function_window.cpp ``` now ``` [1/7] Compiling: be/src/exprs/aggregate/aggregate_function_window.cpp → 25.9s [2/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_first.cpp → 60.5s [3/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_funnel.cpp → 19.6s [4/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lag.cpp → 60.6s [5/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_last.cpp → 63.9s [6/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lead.cpp → 69.1s [7/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_nth_value.cpp → 74.5s ```
…rove compile time (apache#61227) `scan_operator.cpp` was the 2nd slowest file to compile (~289s), caused by template instantiation blowup. Methods defined on `ScanLocalState<Derived>` were instantiated for each of the 6 `Derived` types (OlapScan, JDBCScan, FileScan, EsScan, MetaScan, GroupCommitScan). Specifically: - 4 inner `PrimitiveType`-templated methods × 6 Derived × 21 PrimitiveTypes = **504 instantiations** - 5 non-template normalize methods × 6 Derived = **30 extra compilations** None of these methods actually depend on `Derived` — they only use base class members or virtual dispatch. ## Solution Move the methods that do not depend on `Derived` up to `ScanLocalStateBase`: **Inner PrimitiveType-template methods** (instantiation reduced from 6×21 → 21, ÷6): - `_normalize_in_predicate<T>` - `_normalize_binary_predicate<T>` - `_normalize_is_null_predicate<T>` - `_change_value_range<T, Func>` **Non-template normalize methods** (each compiled 6× → 1×): - `_eval_const_conjuncts` - `_normalize_bloom_filter` - `_normalize_topn_filter` - `_normalize_bitmap_filter` - `_normalize_function_filters` **Accompanying changes:** - Move `_should_push_down_*` virtual methods, `_eos`, `_push_down_functions` members to `ScanLocalStateBase` so the moved methods can call them without `Derived`. - Add `_max_pushdown_conditions_per_column` to base class, initialized from parent in `ScanLocalState<Derived>::init()`. - `_normalize_topn_filter`: replace `p.node_id()` (via `Derived::Parent` cast) with `_parent->node_id()` (base class method). before ``` 288.82s be/src/exec/operator/scan_operator.cpp ``` now ``` [1/1] Compiling: be/src/exec/operator/scan_operator.cpp → 209.9s ```
…#61349) before ``` [1/1] Compiling: be/src/exec/operator/hashjoin_build_sink.cpp → 69.6s ``` now ``` [1/1] Compiling: be/src/exec/operator/hashjoin_build_sink.cpp → 48.9s ```
…unctions (apache#61515) `function_datetime_floor_ceil.cpp` has excessive template instantiation bloat. The class `FunctionDateTimeFloorCeil<Flag, PType, ArgNum, UseDelta>` has 4 template parameters, producing 192 instantiations. However, all heavy computation functions (9 `vector_*` variants, `time_round_two_args`, `floor_opt`, etc.) only depend on `Flag` and `PType` — they never use `ArgNum` or `UseDelta`. This means these expensive functions are compiled 4x more than necessary. This PR applies three optimizations: 1. **Extract `DateTimeFloorCeilCore<Flag, PType>` struct**: All heavy computation functions are moved into a separate struct with only 2 template parameters. `FunctionDateTimeFloorCeil` becomes a thin shell that delegates to `Core::vector_*()`. This reduces instantiations of the heaviest code from **192 → 48** (÷4). 2. **Extract `convert_utc_to_local_impl<DateValueType>` free function**: This function only depends on `DateValueType`, not `Flag`/`ArgNum`/`UseDelta`. Extracting it reduces its instantiations from 192 → 3. 3. **Extract `FunctionDateTimeFloorCeilBase` non-template base class**: Three virtual overrides (`get_number_of_arguments`, `is_variadic`, `use_default_implementation_for_nulls`) use zero template parameters. Moving them to a non-template base eliminates 192 redundant copies.
…thogonal_bitmap.cpp (apache#61606) ### What problem does this PR solve? Optimize the compilation time of `aggregate_function_orthogonal_bitmap.cpp` by reducing template instantiation bloat and improving build parallelism. ### Changes **1. Split single translation unit into multiple files** The original `aggregate_function_orthogonal_bitmap.cpp` instantiated all 6 aggregate functions (~36 template instances) in a single file, forcing serial compilation. Now split into: - `aggregate_function_orthogonal_bitmap.cpp` — forward declarations + registration only (compiles fast) - `aggregate_function_orth_bitmap_intersect.cpp` - `aggregate_function_orth_bitmap_intersect_count.cpp` - `aggregate_function_orth_bitmap_union_count.cpp` - `aggregate_function_orth_intersect_count.cpp` - `aggregate_function_orth_bitmap_expr_cal.cpp` - `aggregate_function_orth_bitmap_expr_cal_count.cpp` **2. Remove unnecessary `PrimitiveType` template parameter from `OrthBitmapUnionCountData`** `OrthBitmapUnionCountData<T>` never uses `T` — it only operates on `ColumnBitmap` directly. Removed the template parameter, reducing 6 useless instantiations to 1. **3. Eliminate type dispatch for `ExprCal` / `ExprCalCount`** FE guarantees these functions receive exactly 3 parameters with VARCHAR filter columns. Replaced the generic 6-type dispatch with direct `TYPE_STRING` instantiation, reducing 12 instantiations to 2. **4. Use precise expression tags matching FE definitions** Added `ExprTag` template parameter to `AggFunctionOrthBitmapFunc` (default: `VarargsExpression`): | Function | FE Arity | Old BE Tag | New BE Tag | |----------|----------|------------|------------| | `orthogonal_bitmap_union_count` | `UnaryExpression` (1 arg) | `VarargsExpression` | `UnaryExpression` | | `orthogonal_bitmap_expr_calculate` | Fixed 3 args | `VarargsExpression` | `MultiExpression` | | `orthogonal_bitmap_expr_calculate_count` | Fixed 3 args | `VarargsExpression` | `MultiExpression` | | Others (intersect/intersect_count) | Varargs (≥3) | `VarargsExpression` | `VarargsExpression` (unchanged) | **5. Add `inline` to template specializations in `bitmap_intersect.h`** The `Helper::write_to` / `serialize_size` / `read_from` template specializations in the header lacked `inline`, causing duplicate symbol linker errors after splitting into multiple translation units. ### Summary | Metric | Before | After | |--------|--------|-------| | Translation units | 1 (serial) | 7 (parallel) | | Template instantiations | ~36 | ~21 (~42% reduction) | | `ExprCal`/`ExprCalCount` instantiations | 12 | 2 | | `OrthBitmapUnionCountData` instantiations | 6 | 1 |
Previously, many cast instantiations were done in .h files, which caused a lot of code to be instantiated whenever other files included them. Here we move those instantiations into .cpp files, and keep only the code for casting a single value in the .h file.
…#61232) 1. **Remove `ColVecType` from window-path Data types** (`LeadLagData`, `FirstLastData`, `NthValueData`): The window path enters via virtual `IAggregateFunction*`, so devirtualizing `insert_from` is pointless. Use `IColumn::insert_from` virtual call instead. Reader path (CRTP batch loop) keeps `ColVecType`. 2. **Move `ColVecType` from class-level to function-level** on `Value`/`CopiedValue`: Only `insert_into` needs it; other members (`reset`, `is_null`, `set_value`) no longer get 24× instantiated. 3. **Use `void` as marker type**: `FirstLastData` inherits `ReaderFirstAndLastData<void, ...>`. `if constexpr (std::is_same_v<ColVecType, void>)` selects virtual vs devirtualized path at compile time. 4. **Lift `arg_ignore_null` to compile-time**: The old runtime `if` inside the 24-type switch doubled instantiations. Now dispatched at registration via lambda. 5. **Delete the 300-line type switch**: Replaced by `CREATE_WINDOW_FUNCTION_DIRECT` macro that directly instantiates 4 combinations (2×2). ## Compile time (single-thread -j1) Before ``` [1/7] Compiling: be/src/exprs/aggregate/aggregate_function_window.cpp → 25.9s [2/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_first.cpp → 60.5s [3/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_funnel.cpp → 19.6s [4/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lag.cpp → 60.6s [5/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_last.cpp → 63.9s [6/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lead.cpp → 69.1s [7/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_nth_value.cpp → 74.5s ``` Now ``` [1/7] Compiling: be/src/exprs/aggregate/aggregate_function_window.cpp → 25.4s [2/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_first.cpp → 24.4s [3/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_funnel.cpp → 19.3s [4/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lag.cpp → 24.4s [5/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_last.cpp → 25.1s [6/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_lead.cpp → 24.9s [7/7] Compiling: be/src/exprs/aggregate/aggregate_function_window_nth_value.cpp → 25.2s ```
…ace.cpp (apache#62047) `aggregate_function_reader_replace.cpp` took **86.91s** to compile because it included `aggregate_function_reader_first_last.h`, which pulled in heavy template machinery: a 28-case column-type switch × 4 bool combinations × 4 factory functions = ~448 `make_shared<ReaderFunctionData<...>>` instantiations in a single TU. ### How is it solved? 1. **Decouple from `aggregate_function_reader_first_last.h`**: The reader/load path now has a self-contained, lightweight implementation directly in the `.cpp` file. The shared header is no longer included. 2. **Introduce `PointerStore<ArgIsNullable>` / `CopyStore<ArgIsNullable>`**: Two small storage classes with a uniform interface (`is_null`, `set_value<SkipNull>`, `insert_into`, `reset`). Reader path uses zero-copy `PointerStore`; load path uses `CopyStore` with `Field` deep-copy. 3. **Use 3 bool template params `<IsFirst, SkipNull, ArgIsNullable>`**: `IsCopy` is derived as `!IsFirst`. `ArgIsNullable` is resolved at registration time (nullable vs non-nullable factory map), eliminating runtime branches. `result_is_nullable` stays as a runtime bool (called once per group). 4. **All column operations use `IColumn` virtual dispatch**: No column-type-specific template instantiations. Performance impact is negligible — `Field` operations and the `add()` virtual call itself dominate. 5. **Clean up `aggregate_function_reader_first_last.h`**: Removed `ReaderFunctionFirstData`, `ReaderFunctionLastData`, `ReaderFunctionFirstNonNullData`, `ReaderFunctionLastNonNullData`, `ReaderFunctionData`, `create_function_single_value`, the 28-case switch, and the `CREATE_READER_FUNCTION_WITH_NAME_AND_DATA` macro — all now unused. Also removed unnecessary includes (`column_array.h`, `column_map.h`, etc.). ### Compile time | Stage | Time | |-------|------| | Before | 86.91s | | After | **17.0s (↓80%)** |
… classes to enable compiler devirtualization (apache#62433) `IAggregateFunctionHelper<Derived>` implements hot aggregate paths such as `add_batch()` and `add_batch_single_place()` by calling `assert_cast<const Derived*>(this)->add(...)`. Because `add()` is declared `virtual` in `IAggregateFunction`, the compiler cannot devirtualize this call unless it can prove that `Derived` has no further overrides — which requires `Derived` to be marked `final`. This PR enforces this via a `static_assert` in the `IAggregateFunctionHelper` constructor. Note that `final` here is a **performance hint only** — correctness is not affected, since `add()` remains virtual and subclasses always dispatch correctly through the vtable regardless. Marking `Derived` as `final` allows the compiler to devirtualize (and potentially inline) the `add()` call inside hot aggregation paths. Changes: 1. Add a `static_assert` in the `IAggregateFunctionHelper` constructor asserting `std::is_final_v<Derived>`, so that missing `final` is caught at compile time. 2. Introduce `AggregateFunctionNonFinalBase` as an opt-out marker for classes that intentionally form inheritance hierarchies (e.g. `AggregateStateUnion → AggregateStateMerge`, `AggregateFunctionForEach → AggregateFunctionForEachV2`). 3. Mark all concrete aggregate function classes `final` across `be/src/exprs/aggregate/`. 4. Refactor `AggregateFunctionTopNBase` to accept a `Derived` template parameter so that `AggregateFunctionTopN` and `AggregateFunctionTopNArray` can each be marked `final`. 5. Refactor `AggregateFunctionPercentileApprox` into `AggregateFunctionPercentileApproxBase<Derived>` so the four concrete specializations can each be marked `final`.
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
…ization PRs to branch-4.1 - Add 'final' to AggregateFunctionRegrSimple to satisfy the static_assert introduced by apache#62433 - Add change/change_if_less/change_if_greater methods to SingleValueDataComplexType to match the interface expected by aggregate_function_min_max_impl.h introduced by apache#61179
Contributor
Author
|
run buildall |
Contributor
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
Contributor
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
finalon CRTP aggregate function derived classes to enable compiler devirtualizationRelease note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)