[10125] arrow-flight decode path optimizations (add skip_validation to arrow-flight)#10206
Conversation
e152490 to
080628d
Compare
|
benchmarks locally look very promising |
|
@Jefffrey (sorry for the ping) could you run the arrow-flight benchmarks command on this when you get a chance? |
9f81b71 to
65c9b89
Compare
65c9b89 to
316b8bd
Compare
|
run benchmark flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/arrow-flight-decode-opt-impl (316b8bd) to d2f1611 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Nice, decode path shows a large performance boost. The more data is being passed around, the larger the performance gains. This is difficult because of the lack of control we have over the byte offset that is handed to us by tonic. looking at it seems that the math for checking if the buffer is aligned just comes down to From my understanding arrow-ipc adds padding within the IPC body to account for alignment. The issue occurs when the initial buffer returned by Tonic starts at a misaligned byte offset. For example, if the buffer starts at byte offset 253, it isn't aligned to a 32 or 64-byte boundary. Because Arrow IPC's internal padding is relative to the start of the buffer, a misaligned starting offset means all the sub-buffers inside it are also misaligned, the padding does nothing to fix this. The result is that every sub-buffer has to be copied into a fresh, correctly aligned allocation anyway. This would avoid N buffer copies in exchange for 1 large buffer copy at the start. for future readers who are wondering why alignment is important: |
|
|
|
@gabotechs this PR is ready for review 🚀. |
gabotechs
left a comment
There was a problem hiding this comment.
Cool! just have on comment, but otherwise LGTM
|
@alamb could you take a look when you get a chance 🚀 We may also want to run the benchmarks again to see how much the buffer re-alignment impacted performance. |
b13b3da to
6379781
Compare
6379781 to
a2c436b
Compare
|
run benchmarks flight |
|
run benchmarks ipc_reader |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/arrow-flight-decode-opt-impl (a2c436b) to d2f1611 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/arrow-flight-decode-opt-impl (a2c436b) to d2f1611 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
With f2fcc0c, buffers are only copied when they're actually misaligned. Out of ~18 million total calls during benchmarking, only ~46k required reallocation, meaning 99.75% of the time we skip the copy entirely, while still correctly handling the cases where alignment is off. |
|
This should be good to go ( for real this time 😅). we may want to run the benchmarks again |
alamb
left a comment
There was a problem hiding this comment.
Thanks @Rich-T-kid -- the code looks good to me (I had only some minor nits)
The biggest comment is baout the breaking API change -- I had a suggestion. let me know what you think
76cc53b to
2d22728
Compare
|
@alamb pushed up a revision. |
skip_validation to arrow-flight)
9b3c27b to
aa41dc9
Compare
| let buf = if data.data_body.as_ptr() as usize % 64 == 0 { | ||
| Buffer::from(data.data_body.clone()) | ||
| } else { | ||
| Buffer::from(data.data_body.as_ref()) | ||
| }; |
There was a problem hiding this comment.
This is a bit subtle, how about adding a comment here about why this is needed?
There was a problem hiding this comment.
Or even leave this optimization aside for now, as it seems a different topic than exposing the with_skip_validation() method.
One thought I'm having is that it seems easy to shoot ourselves in the foot with this very low-level optimizations. For example, if it wasn't for the current tests it would have been very easy to introduce a silent bug in main (#10206 (comment)).
If it turns out that one of these small optimizations introduces one of these subtle bugs, but it happens that there's no specific test for that one, it will blow up in people's codebases instead of here. If that happens, the problematic PR in arrow-rs needs to be rolled back, but if that PR happens to introduce a mix of 4 or 5 minor optimizations, we would be rolling back all those minor optimizations instead of just the problematic one.
With this I mean that there's value in making different PRs for the different optimizations even if they are extra small, and avoiding the "small optimizations collection"-style PRs just for the sake of being able to roll them back manually if things go south.
There was a problem hiding this comment.
Ill introduce this change in a follow up PR once this is merged to avoid git conflicts.
There was a problem hiding this comment.
This is great advice @gabotechs -- and thank you @Rich-T-kid for bearing with us
|
@gabotechs is exposing (adding) new public methods considered a breaking API change? |
|
No, because existing code consuming this library will keep working without further changes from their maintainers. I'd anyways make sure it's strictly necessary to expose those new methods though. If there's another way that does not require broadening the public API that would be ideal. |
alamb
left a comment
There was a problem hiding this comment.
Thank you @Rich-T-kid and @gabotechs
|
@alamb pushed up the changes.
I can open the ticket for this later today If you don't a chance to. In any case, it should be a pretty small change so I'll open a follow up PR for it afterwards |
|
Let's do it |
|
Codex also noticed that dictionary decoding will always validate (there is no way to turn it off). I filed a ticket to track |

Which issue does this PR close?
Rationale for this change
see #10137 & #10125
What changes are included in this PR?
from_slice_ref()which would allocate another buffer and then copy bytes into it. instead replaced with.clone()which just is O(1)Are these changes tested?
N/A
Are there any user-facing changes?
Users now have the option to skip data validation