[SPARK-57704][PYTHON][TESTS] Add ASV microbenchmark for SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF #56794

Open
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-57704

Yicong-Huang commented Jun 25, 2026

Contributor

What changes were proposed in this pull request?

Add ASV microbenchmarks for the SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF eval type in python/benchmarks/bench_eval_type.py, with both time_* and peakmem_* variants. They cover the same scenario grid as the plain SQL_TRANSFORM_WITH_STATE_PANDAS_UDF benchmark, plus a small seeded initial-state dataset per group.

The benchmark reconstructs the worker wire protocol for transformWithStateInPandas with initial state. The input is a single Arrow stream whose top-level schema is struct<inputData, initState>, matching TransformWithStateInPySparkPythonInitialStateRunner. All initial-state batches are emitted first, followed by all data batches (the JVM initData ++ data ordering). The inactive side of each batch is written as an all-null struct, so TransformWithStateInPandasInitStateSerializer never sees a mixed batch and regroups rows by the leading key.
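
To make the stream layout described above concrete, here is a minimal pure-Python model of it. This is a sketch, not the benchmark's code: plain dicts and lists stand in for Arrow structs and record batches, a null struct is modeled as None, and the helper names (wrap, build_stream, regroup_by_key) and the "key" field are hypothetical. The regrouping step only illustrates the idea of grouping by a key column; it is not the actual logic of TransformWithStateInPandasInitStateSerializer.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, Any]
Record = Dict[str, Optional[Row]]  # stand-in for one struct<inputData, initState> row
Batch = List[Record]               # stand-in for one Arrow record batch

SIDES = ("inputData", "initState")


def wrap(rows: List[Row], active: str) -> Batch:
    """Wrap rows in the two-field top-level struct; the inactive side is null."""
    inactive = SIDES[1] if active == SIDES[0] else SIDES[0]
    return [{active: row, inactive: None} for row in rows]


def build_stream(
    init_batches: Iterable[List[Row]], data_batches: Iterable[List[Row]]
) -> Iterator[Batch]:
    """Emit every initial-state batch first, then every data batch."""
    for rows in init_batches:
        yield wrap(rows, "initState")
    for rows in data_batches:
        yield wrap(rows, "inputData")


def regroup_by_key(
    stream: Iterable[Batch], key: str = "key"
) -> Dict[Any, Dict[str, List[Row]]]:
    """Hypothetical consumer: bucket rows per key, keeping the two sides apart."""
    groups: Dict[Any, Dict[str, List[Row]]] = {}
    for batch in stream:
        for record in batch:
            # No batch is mixed, so exactly one side of each record is non-null.
            side = "inputData" if record["inputData"] is not None else "initState"
            row = record[side]
            bucket = groups.setdefault(row[key], {"inputData": [], "initState": []})
            bucket[side].append(row)
    return groups


# Illustrative data only: one initial-state batch, then two data batches.
init = [[{"key": "a", "seed": 1}, {"key": "b", "seed": 2}]]
data = [[{"key": "a", "v": 10}, {"key": "b", "v": 20}], [{"key": "a", "v": 30}]]
stream = list(build_stream(init, data))
groups = regroup_by_key(stream)
```

In this model every batch has exactly one populated side, which is the property the all-null struct is meant to guarantee on the real wire format.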

Why are the changes needed?

This is the last transformWithState Pandas eval type without benchmark coverage. The eval type is slated for the serializer/eval-type refactor, and a microbenchmark establishes the baseline needed to prove the refactor introduces no regression.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. Test-only addition; no behavior change.

Ran locally with COLUMNS=120 asv run --python=same --bench TransformWithStatePandasInitState -a repeat=3. Results are stable across repeated runs; one representative run below.

[time] TransformWithStatePandasInitStateUDFTimeBench.time_worker
================ ============== ============ ============
--                                 udf
---------------- ----------------------------------------
    scenario      identity_udf    sort_udf    count_udf
================ ============== ============ ============
 few_groups_sm      810±4ms       833±3ms      835±20ms
 few_groups_lg     7.48±0.1s     7.70±0.3s    7.28±0.2s
 many_groups_sm    7.93±0.3s     7.95±0.1s    8.87±0.05s
 many_groups_lg    4.04±0.05s    4.10±0.02s   4.27±0.04s
   wide_cols       8.29±0.3s     8.20±0.2s    7.60±0.04s
   mixed_cols      3.42±0.05s    3.45±0.02s   3.25±0.03s
 nested_struct     7.99±0.2s     7.91±0.02s   5.67±0.03s
================ ============== ============ ============

[peakmem] TransformWithStatePandasInitStateUDFPeakmemBench.peakmem_worker
================ ============== ========== ===========
--                                udf
---------------- -------------------------------------
    scenario      identity_udf   sort_udf   count_udf
================ ============== ========== ===========
 few_groups_sm        116M         115M        106M
 few_groups_lg        248M         248M        248M
 many_groups_sm       176M         177M        161M
 many_groups_lg       151M         151M        151M
   wide_cols          364M         367M        342M
   mixed_cols         182M         182M        182M
 nested_struct        210M         210M        210M
================ ============== ========== ===========
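
For readers unfamiliar with how ASV produces a scenario-by-udf grid like the two tables above, the following is a minimal sketch of a parameterized benchmark pair. It relies only on ASV conventions (the params and param_names attributes, a setup method that receives the parameters, and the time_ and peakmem_ method prefixes). The class names, the two-value parameter lists, and the toy workload are hypothetical and far simpler than the worker replay in bench_eval_type.py.

```python
class _ExampleWorkerMixin:
    """Hypothetical shared setup; ASV calls setup() once per parameter combination."""

    # Cartesian product of these lists defines the rows and columns of the grid.
    params = [["few_groups_sm", "many_groups_sm"], ["identity_udf", "count_udf"]]
    param_names = ["scenario", "udf"]

    def setup(self, scenario, udf):
        # Toy stand-in for building the serialized worker input.
        n_rows = 1_000 if scenario == "few_groups_sm" else 10_000
        self.rows = list(range(n_rows))
        if udf == "identity_udf":
            self.func = lambda xs: list(xs)
        else:
            self.func = lambda xs: [len(xs)]

    def _run(self):
        return self.func(self.rows)


class ExampleTimeBench(_ExampleWorkerMixin):
    def time_worker(self, scenario, udf):
        # ASV times the body of any method whose name starts with time_.
        self._run()


class ExamplePeakmemBench(_ExampleWorkerMixin):
    def peakmem_worker(self, scenario, udf):
        # ASV records peak process memory for methods starting with peakmem_.
        self._run()
```

Splitting the time and peakmem variants into two classes that share one setup mirrors the two benchmark names in the result headers, though whether the PR uses a mixin for this is an assumption.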

Was this patch authored or co-authored using generative AI tooling?

No.

uros-b left a comment

Member

Thank you @Yicong-Huang!
