
Perf: Pre-size buffer allocations to avoid intermediate allocations #10262

Open
Rich-T-kid wants to merge 1 commit into apache:main from Rich-T-kid:rich-T-kid/re-use-buffers

@Rich-T-kid

Contributor

Which issue does this PR close?

Rationale for this change

TL;DR: It's useful to pre-allocate a vector when you know how much data it will need to hold.

When IpcDataGenerator uses the IpcBodySink::Write variant, record batch buffer bytes are written directly into a Vec. If that Vec is undersized, it repeatedly reallocates and copies its bytes into a larger buffer, growing geometrically (capacity typically doubles, from a few KB up into the MB range) and paying two costs on each reallocation:

  1. a request to the allocator (and possibly the OS) for a larger block, and
  2. a full copy of existing bytes into the new buffer.

For large batches this cascade is expensive, and paying it fresh on every record batch chunk compounds the problem further. Since FlightDataEncoder::split_batch_for_grpc_response splits record batches into roughly equal-sized chunks, we exploit this by using the previous buffer's final capacity as an estimate for the next call, keeping a correctly-sized Vec alive across iterations and avoiding repeated reallocation on the hot path.
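
The reallocation cascade described above can be observed with a small standalone sketch. It is not taken from the PR: `count_reallocations` is a hypothetical helper, and the 1 MiB payload and 4 KiB write size are arbitrary assumptions. It counts how often a Vec's capacity changes while bytes are appended, once starting empty and once pre-sized.

```rust
/// Hypothetical helper (not part of arrow-rs): appends `total` bytes in
/// `step`-sized writes and counts how often the Vec's capacity changed,
/// i.e. how often it had to reallocate.
fn count_reallocations(mut buf: Vec<u8>, total: usize, step: usize) -> usize {
    let chunk = vec![0u8; step];
    let mut reallocations = 0;
    let mut written = 0;
    while written < total {
        let before = buf.capacity();
        let n = step.min(total - written);
        buf.extend_from_slice(&chunk[..n]);
        if buf.capacity() != before {
            reallocations += 1;
        }
        written += n;
    }
    reallocations
}

fn main() {
    let total = 1 << 20; // assumed payload size: 1 MiB
    let step = 4 * 1024; // assumed write size: 4 KiB

    let unsized_count = count_reallocations(Vec::new(), total, step);
    let presized_count = count_reallocations(Vec::with_capacity(total), total, step);

    println!("starting empty: {unsized_count} reallocations");
    println!("pre-sized:      {presized_count} reallocations");
}
```

The pre-sized run never reallocates, because `Vec::with_capacity` guarantees room for at least `total` bytes. The exact count for the empty Vec depends on the standard library's growth strategy, so the sketch prints it rather than assuming a value.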

Why not pre-allocate the buffers using an estimate based on the length that split_batch_for_grpc_response uses?

Using the final capacity rather than the uncompressed dictionary size is intentional. IPC encoding and compression both affect the actual number of bytes written, so the final capacity naturally adapts to whatever encoding and compression settings are in effect rather than consistently overprovisioning.

What changes are included in this PR?

  • Move the scratch buffer out of ipc_write_context via mem::take (zero copy)
  • Write the IPC bytes into the buffer
  • Record the final capacity
  • Pre-allocate a fresh scratch buffer at that capacity for the next call
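
The four steps above can be sketched in isolation. This is a simplified model under stated assumptions, not the code in arrow-ipc/src/writer.rs: `WriteContext` and `encode_chunk` are hypothetical names, only the `scratch` field name comes from the PR, and a plain byte copy stands in for IPC encoding.

```rust
use std::mem;

/// Hypothetical stand-in for the writer's context; only the `scratch`
/// field name comes from the PR, everything else is illustrative.
#[derive(Default)]
struct WriteContext {
    scratch: Vec<u8>,
}

/// Encodes one chunk and returns the filled buffer, leaving a pre-sized
/// scratch buffer behind for the next call.
fn encode_chunk(ctx: &mut WriteContext, payload: &[u8]) -> Vec<u8> {
    // 1. Move the scratch buffer out; `mem::take` leaves an empty Vec
    //    behind and copies no bytes.
    let mut out = mem::take(&mut ctx.scratch);

    // 2. Write the bytes (a plain copy stands in for IPC encoding here).
    out.extend_from_slice(payload);

    // 3. Record the capacity the buffer ended up with.
    let final_capacity = out.capacity();

    // 4. Pre-allocate the next scratch buffer at that capacity.
    ctx.scratch.reserve(final_capacity);

    out
}

fn main() {
    let mut ctx = WriteContext::default();
    let chunk = vec![7u8; 64 * 1024];

    let first = encode_chunk(&mut ctx, &chunk);
    // After the first call the scratch buffer is already large enough
    // for another chunk of the same size.
    assert!(ctx.scratch.capacity() >= chunk.len());

    let capacity_before = ctx.scratch.capacity();
    let second = encode_chunk(&mut ctx, &chunk);
    // The second call filled the pre-sized buffer without growing it.
    assert_eq!(second.capacity(), capacity_before);
    assert_eq!(first.len(), second.len());
}
```

In this model the second chunk is written without any growth, which is the effect the PR relies on when chunks are roughly equal in size. A chunk larger than the hint would still trigger a reallocation.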

Are these changes tested?

n/a

Are there any user-facing changes?

no

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jul 2, 2026
@Rich-T-kid

Contributor Author

pretty big descriptions for a (+3,-2) PR 😅

@Jefffrey Jefffrey left a comment

Contributor


pretty big descriptions for a (+3,-2) PR 😅

you love to see it 🙂

Comment thread arrow-ipc/src/writer.rs
)?;
arrow_data.extend_from_slice(&PADDING[..tail_pad]);
let final_capcity = arrow_data.capacity();
ipc_write_context.scratch.reserve(final_capcity); // reset scratch to the same capacity as before, due to ['FlightDataEncoder::split_batch_for_grpc_response'] we know that batches are split up into roughly equal sized chunks,

Contributor


Nit, but I do wonder: does this mean the final encode call will always allocate a new scratch buffer that is then never used?
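
To make the question concrete, here is a minimal model of the take-then-reserve pattern. It is a sketch only: `leftover_capacity` is a hypothetical helper, a local `scratch` Vec stands in for `ipc_write_context.scratch`, and the real writer may behave differently. It reports how much capacity is still held after the last chunk has been encoded.

```rust
use std::mem;

/// Hypothetical helper: runs the take-then-reserve pattern over `chunks`
/// and returns the capacity still held by the scratch slot afterwards.
fn leftover_capacity(chunks: &[Vec<u8>]) -> usize {
    // Stand-in for `ipc_write_context.scratch`.
    let mut scratch: Vec<u8> = Vec::new();

    for chunk in chunks {
        let mut out = mem::take(&mut scratch);
        out.extend_from_slice(chunk);
        // Re-arm the slot for a call that, after the last chunk, never comes.
        scratch.reserve(out.capacity());
        // `out` would be handed to the caller here; it is dropped in this model.
    }

    // The slot holds no data, only reserved space.
    assert!(scratch.is_empty());
    scratch.capacity()
}

fn main() {
    let chunks = vec![vec![0u8; 4096]; 3];
    println!("capacity left after final chunk: {}", leftover_capacity(&chunks));
}
```

In this model the slot does end up holding one allocated but empty buffer after the final chunk, which stays alive until the context is dropped or used again.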

@Jefffrey

Jefffrey commented Jul 2, 2026

Contributor

run benchmark flight

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4863304054-797-gbhwf 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/re-use-buffers (53f1c95) to 32bba5a (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete



@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                     main                                    rich-T-kid_re-use-buffers
-----                                     ----                                    -------------------------
decode/fixed/65536x1                      1.00     48.3±0.21µs    40.4 GB/sec     1.06     51.1±0.56µs    38.2 GB/sec
decode/fixed/65536x4                      1.01    264.0±4.32µs    29.6 GB/sec     1.00    260.9±2.75µs    29.9 GB/sec
decode/fixed/65536x8                      1.00   580.1±12.93µs    26.9 GB/sec     7.11      4.1±0.06ms     3.8 GB/sec
decode/fixed/8192x1                       1.01      7.8±0.05µs    31.2 GB/sec     1.00      7.7±0.08µs    31.6 GB/sec
decode/fixed/8192x4                       1.00     28.4±0.21µs    34.4 GB/sec     1.02     28.9±0.16µs    33.8 GB/sec
decode/fixed/8192x8                       1.08     66.5±1.18µs    29.4 GB/sec     1.00     61.8±1.16µs    31.6 GB/sec
decode/nested/65536x1                     1.00  685.2±167.09µs     7.1 GB/sec     1.01  690.9±165.54µs     7.1 GB/sec
decode/nested/65536x4                     1.01      3.1±0.68ms     6.3 GB/sec     1.00      3.1±0.68ms     6.4 GB/sec
decode/nested/65536x8                     2.21     14.7±1.35ms     2.7 GB/sec     1.00      6.6±1.33ms     5.9 GB/sec
decode/nested/8192x1                      1.00    82.9±20.64µs     7.4 GB/sec     1.02    84.5±20.74µs     7.2 GB/sec
decode/nested/8192x4                      1.00   352.6±82.85µs     6.9 GB/sec     1.01   354.6±84.64µs     6.9 GB/sec
decode/nested/8192x8                      1.00  721.6±166.32µs     6.8 GB/sec     1.01  730.3±167.18µs     6.7 GB/sec
decode/variable/65536x1                   1.01  1216.0±184.37µs     7.2 GB/sec    1.00  1204.3±185.93µs     7.3 GB/sec
decode/variable/65536x4                   1.01      5.7±0.63ms     6.2 GB/sec     1.00      5.6±0.72ms     6.3 GB/sec
decode/variable/65536x8                   1.38     15.7±1.45ms     4.5 GB/sec     1.00     11.4±1.38ms     6.2 GB/sec
decode/variable/8192x1                    1.00   134.0±22.30µs     8.2 GB/sec     1.01   135.6±20.98µs     8.1 GB/sec
decode/variable/8192x4                    1.00   586.5±91.85µs     7.5 GB/sec     1.00   587.5±87.81µs     7.5 GB/sec
decode/variable/8192x8                    1.00  1216.8±181.84µs     7.2 GB/sec    1.04  1263.3±164.65µs     7.0 GB/sec
decode_stream/dict/65536x1x4              1.00   182.1±34.61µs     5.4 GB/sec     1.07   194.9±27.46µs     5.0 GB/sec
decode_stream/dict/65536x4x4              1.01  773.6±118.81µs     5.1 GB/sec     1.00  767.2±134.83µs     5.1 GB/sec
decode_stream/dict/65536x8x4              1.00  1596.7±170.94µs     4.9 GB/sec    1.04  1652.8±282.59µs     4.8 GB/sec
decode_stream/dict/8192x1x4               1.00     26.0±0.34µs     4.9 GB/sec     1.01     26.2±0.40µs     4.9 GB/sec
decode_stream/dict/8192x4x4               1.00    101.2±2.36µs     5.0 GB/sec     1.03    104.1±9.79µs     4.9 GB/sec
decode_stream/dict/8192x8x4               1.00    205.5±2.23µs     5.0 GB/sec     1.02    209.8±6.14µs     4.9 GB/sec
decode_stream/fixed/65536x1x4             1.07     52.2±0.52µs    37.4 GB/sec     1.00     48.8±0.16µs    40.0 GB/sec
decode_stream/fixed/65536x4x4             1.01    269.1±2.33µs    29.0 GB/sec     1.00    267.4±3.07µs    29.2 GB/sec
decode_stream/fixed/65536x8x4             1.00  590.0±119.26µs    26.5 GB/sec     1.01   594.1±98.28µs    26.3 GB/sec
decode_stream/fixed/8192x1x4              1.00      7.8±0.03µs    31.5 GB/sec     1.00      7.7±0.06µs    31.7 GB/sec
decode_stream/fixed/8192x4x4              1.00     28.4±0.35µs    34.4 GB/sec     1.04     29.5±0.10µs    33.2 GB/sec
decode_stream/fixed/8192x8x4              1.00     67.0±0.25µs    29.2 GB/sec     1.00     66.8±1.11µs    29.3 GB/sec
decode_stream/nested/65536x1x4            1.00  683.4±165.09µs     7.1 GB/sec     1.01  692.7±167.62µs     7.0 GB/sec
decode_stream/nested/65536x4x4            1.08      3.3±0.68ms     6.0 GB/sec     1.00      3.0±0.68ms     6.4 GB/sec
decode_stream/nested/65536x8x4            1.00      6.5±1.36ms     6.1 GB/sec     1.01      6.5±1.35ms     6.0 GB/sec
decode_stream/nested/8192x1x4             1.00    83.9±20.63µs     7.3 GB/sec     1.01    84.6±20.73µs     7.2 GB/sec
decode_stream/nested/8192x4x4             1.00   350.4±82.81µs     7.0 GB/sec     1.01   352.9±84.00µs     6.9 GB/sec
decode_stream/nested/8192x8x4             1.00  720.6±166.19µs     6.8 GB/sec     1.01  730.7±168.83µs     6.7 GB/sec
decode_stream/variable/65536x1x4          1.00  1217.2±183.86µs     7.2 GB/sec    1.02  1247.4±171.40µs     7.0 GB/sec
decode_stream/variable/65536x4x4          1.01      5.8±0.59ms     6.1 GB/sec     1.00      5.7±0.57ms     6.1 GB/sec
decode_stream/variable/65536x8x4          1.00     11.6±1.38ms     6.1 GB/sec     1.58     18.3±1.35ms     3.8 GB/sec
decode_stream/variable/8192x1x4           1.03   140.3±19.42µs     7.8 GB/sec     1.00   136.1±21.22µs     8.1 GB/sec
decode_stream/variable/8192x4x4           1.04   606.0±78.94µs     7.3 GB/sec     1.00   583.1±89.02µs     7.5 GB/sec
decode_stream/variable/8192x8x4           1.01  1232.0±178.50µs     7.1 GB/sec    1.00  1216.0±181.84µs     7.2 GB/sec
do_put_dictionary/dict/hydrate/65536x1    1.00    376.9±6.55µs   667.0 MB/sec     1.01    379.4±5.60µs   662.7 MB/sec
do_put_dictionary/dict/hydrate/65536x4    1.00  1402.1±19.91µs   717.3 MB/sec     1.03  1447.9±73.81µs   694.5 MB/sec
do_put_dictionary/dict/hydrate/65536x8    1.00      3.4±0.27ms   593.4 MB/sec     1.09      3.7±0.34ms   542.4 MB/sec
do_put_dictionary/dict/hydrate/8192x1     1.01     91.5±1.19µs   356.8 MB/sec     1.00     90.5±1.28µs   360.7 MB/sec
do_put_dictionary/dict/hydrate/8192x4     1.00    205.1±2.89µs   636.9 MB/sec     1.03    210.5±3.32µs   620.6 MB/sec
do_put_dictionary/dict/hydrate/8192x8     1.00    371.3±5.57µs   703.7 MB/sec     1.00    371.8±7.04µs   702.8 MB/sec
do_put_dictionary/dict/resend/65536x1     1.00    108.5±1.60µs     2.3 GB/sec     1.00    108.0±2.74µs     2.3 GB/sec
do_put_dictionary/dict/resend/65536x4     1.00    292.5±3.50µs     3.4 GB/sec     1.00    292.0±3.33µs     3.4 GB/sec
do_put_dictionary/dict/resend/65536x8     1.02    521.1±8.03µs     3.8 GB/sec     1.00    510.0±5.90µs     3.9 GB/sec
do_put_dictionary/dict/resend/8192x1      1.03     61.0±1.06µs   535.6 MB/sec     1.00     59.4±0.78µs   549.4 MB/sec
do_put_dictionary/dict/resend/8192x4      1.01     83.1±0.94µs  1571.9 MB/sec     1.00     82.3±1.14µs  1586.8 MB/sec
do_put_dictionary/dict/resend/8192x8      1.00    114.6±1.67µs     2.2 GB/sec     1.00    114.8±1.80µs     2.2 GB/sec
encode/fixed/65536x1                      1.00     10.1±0.04µs    48.4 GB/sec     1.03     10.4±0.01µs    46.9 GB/sec
encode/fixed/65536x4                      1.00     51.4±0.27µs    38.0 GB/sec     9.64    495.7±1.02µs     3.9 GB/sec
encode/fixed/65536x8                      1.00   1063.8±2.72µs     3.7 GB/sec     1.04   1101.3±3.94µs     3.5 GB/sec
encode/fixed/8192x1                       1.00      3.2±0.01µs    18.9 GB/sec     1.03      3.4±0.01µs    18.2 GB/sec
encode/fixed/8192x4                       1.00      9.0±0.02µs    27.2 GB/sec     1.22     11.0±0.02µs    22.3 GB/sec
encode/fixed/8192x8                       1.00     18.0±0.03µs    27.2 GB/sec     1.24     22.3±0.05µs    22.0 GB/sec
encode/nested/65536x1                     1.00     28.7±0.41µs    42.6 GB/sec     1.02     29.1±0.26µs    41.9 GB/sec
encode/nested/65536x4                     1.00   1415.5±5.89µs     3.5 GB/sec     1.02   1446.3±4.81µs     3.4 GB/sec
encode/nested/65536x8                     1.05      3.2±0.04ms     3.1 GB/sec     1.00      3.0±0.04ms     3.2 GB/sec
encode/nested/8192x1                      1.00      5.8±0.01µs    26.5 GB/sec     1.14      6.6±0.01µs    23.3 GB/sec
encode/nested/8192x4                      1.00     21.2±0.05µs    28.9 GB/sec     1.03     21.9±0.04µs    27.9 GB/sec
encode/nested/8192x8                      1.01     48.4±0.12µs    25.3 GB/sec     1.00     48.0±0.08µs    25.4 GB/sec
encode/variable/65536x1                   1.08     64.7±0.27µs    33.9 GB/sec     1.00     59.9±0.37µs    36.7 GB/sec
encode/variable/65536x4                   1.04      2.5±0.03ms     3.6 GB/sec     1.00      2.4±0.02ms     3.7 GB/sec
encode/variable/65536x8                   1.09      5.7±0.05ms     3.1 GB/sec     1.00      5.2±0.07ms     3.4 GB/sec
encode/variable/8192x1                    1.00      6.9±0.01µs    39.6 GB/sec     1.42      9.9±0.01µs    27.9 GB/sec
encode/variable/8192x4                    1.02     26.8±0.06µs    41.0 GB/sec     1.00     26.3±0.05µs    41.7 GB/sec
encode/variable/8192x8                    1.10     86.9±0.26µs    25.3 GB/sec     1.00     78.8±0.21µs    27.9 GB/sec
roundtrip/fixed/65536x1                   1.01    315.1±4.21µs  1587.2 MB/sec     1.00    311.5±3.39µs  1605.6 MB/sec
roundtrip/fixed/65536x4                   1.01  1221.8±16.47µs  1637.2 MB/sec     1.00  1209.4±20.20µs  1654.0 MB/sec
roundtrip/fixed/65536x8                   1.00      2.2±0.02ms  1779.6 MB/sec     1.02      2.3±0.05ms  1744.4 MB/sec
roundtrip/fixed/8192x1                    1.03     92.1±1.29µs   679.6 MB/sec     1.00     89.9±1.10µs   696.6 MB/sec
roundtrip/fixed/8192x4                    1.00    201.0±2.08µs  1245.7 MB/sec     1.00    201.0±2.16µs  1245.8 MB/sec
roundtrip/fixed/8192x8                    1.00    345.3±4.77µs  1450.2 MB/sec     1.01    347.8±4.13µs  1439.8 MB/sec
roundtrip/nested/65536x1                  1.01   884.3±44.76µs  1413.8 MB/sec     1.00   879.8±43.93µs  1421.0 MB/sec
roundtrip/nested/65536x4                  1.00      4.3±0.14ms  1151.3 MB/sec     1.00      4.4±0.12ms  1149.1 MB/sec
roundtrip/nested/65536x8                  1.04      9.1±0.36ms  1098.6 MB/sec     1.00      8.7±0.30ms  1146.8 MB/sec
roundtrip/nested/8192x1                   1.03    161.8±6.33µs   966.7 MB/sec     1.00    157.1±5.09µs   995.9 MB/sec
roundtrip/nested/8192x4                   1.01   479.2±21.25µs  1306.0 MB/sec     1.00   472.6±21.80µs  1324.3 MB/sec
roundtrip/nested/8192x8                   1.02   945.4±40.91µs  1323.9 MB/sec     1.00   926.2±43.62µs  1351.4 MB/sec
roundtrip/variable/65536x1                1.04  1364.7±84.52µs  1648.8 MB/sec     1.00  1308.1±61.99µs  1720.2 MB/sec
roundtrip/variable/65536x4                1.07      8.4±0.28ms  1065.7 MB/sec     1.00      7.9±0.36ms  1143.7 MB/sec
roundtrip/variable/65536x8                1.08     14.9±0.42ms  1206.1 MB/sec     1.00     13.9±0.47ms  1298.9 MB/sec
roundtrip/variable/8192x1                 1.01    210.1±5.65µs  1339.3 MB/sec     1.00    208.6±5.46µs  1349.3 MB/sec
roundtrip/variable/8192x4                 1.00   703.5±22.70µs  1600.2 MB/sec     1.02   715.1±24.40µs  1574.3 MB/sec
roundtrip/variable/8192x8                 1.01  1259.8±24.13µs  1787.1 MB/sec     1.00  1244.6±24.20µs  1808.9 MB/sec

Resource Usage

base (merge-base)

Metric         Value
Wall time      915.2s
Peak memory    171.1 MiB
Avg memory     63.7 MiB
CPU user       920.5s
CPU sys        136.5s
Peak spill     0 B

branch

Metric         Value
Wall time      940.2s
Peak memory    165.1 MiB
Avg memory     66.0 MiB
CPU user       930.7s
CPU sys        150.3s
Peak spill     0 B



Development

Successfully merging this pull request may close these issues.

Optimize arrow-flight

3 participants