Perf: Pre-size buffer allocations to avoid intermediate allocations#10262
Perf: Pre-size buffer allocations to avoid intermediate allocations#10262Rich-T-kid wants to merge 1 commit into
Conversation
|
pretty big descriptions for a (+3,-2) PR 😅 |
Jefffrey
left a comment
There was a problem hiding this comment.
pretty big descriptions for a (+3,-2) PR 😅
you love to see it 🙂
| )?; | ||
| arrow_data.extend_from_slice(&PADDING[..tail_pad]); | ||
| let final_capcity = arrow_data.capacity(); | ||
| ipc_write_context.scratch.reserve(final_capcity); // reset scratch to the same capacity as before, due to ['FlightDataEncoder::split_batch_for_grpc_response'] we know that batches are split up into roughly equal sized chunks, |
There was a problem hiding this comment.
nit but i do wonder does this mean the final encode call will always allocate again but not use scratch?
|
run benchmark flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/re-use-buffers (53f1c95) to 32bba5a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
TLDR: Its useful to pre-allocate vectors when you know the amount of data it will require
When
IpcDataGeneratoruses theIpcBodySink::Writevariant, record batch buffer bytes are written directly into aVec. If thatVecis undersized, it repeatedly reallocates and copies bytes into a larger buffer, growing exponentially (1, 4, 16, 32 ... KB ... MB) and paying two costs on each reallocation:For large batches this cascade is expensive, and paying it fresh on every record batch chunk compounds the problem further. Since
FlightDataEncoder::split_batch_for_grpc_responsesplits record batches into roughly equal-sized chunks, we exploit this by using the previous buffer's final capacity as an estimate for the next call, keeping a correctly-sizedVecalive across iterations and avoiding repeated reallocation on the hot path.why not pre-allocate the buffers using an estimate with the length
split_batch_for_grpc_responseuses?Using the final capacity rather than the uncompressed dictionary size is intentional, since IPC encoding and compression both affect the actual bytes written, the final capacity naturally adapts to whatever encoding and compression settings are in effect rather than consistently overprovisioning.
What changes are included in this PR?
Are these changes tested?
n/a
Are there any user-facing changes?
no