Skip to content

[FIX]: Spsc queue false sharing#960

Open
raphael-s-steiner wants to merge 5 commits into
hw-native-sys:mainfrom
huawei-csl:SPSCQueue-false-sharing
Open

[FIX]: Spsc queue false sharing#960
raphael-s-steiner wants to merge 5 commits into
hw-native-sys:mainfrom
huawei-csl:SPSCQueue-false-sharing

Conversation

@raphael-s-steiner
Copy link
Copy Markdown
Contributor

@raphael-s-steiner raphael-s-steiner commented Jun 1, 2026

The mask and data ptr of PTO2SpscQueue are used (read) by both producer and consumer, but lies on a consumer cache line where the consumer also writes to - > False Sharing!

FIX: Duplicated mask and data ptr such that producer and consumer both have a local copy.
(Duplicated rather than own cache line as there is an imposed size constraint of 4 cache lines.)

Additionally, a minor optimisation to pop_batch:
Cached head value gets updated when less elements than requested rather than if no elements.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2136b19b-9074-43b4-b34b-fe6dab8ceccf

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

PTO2SpscQueue across both a2a3 and a5 runtime implementations is refactored to maintain separate cached buffer pointers and masks for producer and consumer execution paths, replacing the previous shared buffer_/mask_ fields to improve cache line isolation.

Changes

SPSC Queue Per-Side Cache Isolation

Layer / File(s) Summary
Data structure: per-side cached fields
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
PTO2SpscQueue struct replaces single buffer_ and mask_ members with producer-side (buffer_p_, mask_p_) and consumer-side (buffer_c_, mask_c_) cached fields in both implementations.
Initialization: per-side mask setup
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
init_data_from_layout() writes the capacity-derived mask into both mask_p_ and mask_c_ instead of a single mask_ field in both code versions.
Arena wiring and lifecycle cleanup
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
wire_arena_pointers() wires the arena buffer pointer into both buffer_p_ and buffer_c_, and destroy() clears both per-side pointers instead of a single buffer in both implementations.
Producer path: push with isolated state
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
push() method now uses producer-cached mask_p_ for full-capacity checks and buffer_p_ for element writes in both a2a3 and a5 implementations.
Consumer path: pop_batch with isolated state
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
pop_batch() method reads queued entries from consumer-cached buffer_c_ indexed with mask_c_ (after head-cache refill logic) in both a2a3 and a5 implementations.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit's queue grows divided—
Producer and consumer, once tight-knit, now widen their stride,
Each path claims its own cache line with pride,
No false sharing to slow the ride! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[FIX]: Spsc queue false sharing' directly and clearly describes the main change: fixing false sharing in the SPSC queue by separating producer and consumer cached pointers.
Description check ✅ Passed The description is directly related to the changeset, explaining the false sharing problem, the fix (duplicating mask and data pointers for producer and consumer), and a minor optimization to pop_batch.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request splits the shared buffer and mask members of PTO2SpscQueue into separate producer (buffer_p_, mask_p_) and consumer (buffer_c_, mask_c_) local copies in both scheduler headers, and updates the pop_batch logic. However, these changes reduce the struct size from 256 to 192 bytes, which will cause static assertions regarding struct size and alignment to fail. To resolve this, 96 bytes of padding should be added to the end of the struct to maintain the 256-byte size.

alignas(64) uint64_t head_cached_{0};

// --- Shared (immutable after init) ---
PTO2TaskSlotState **buffer_{nullptr};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use alignas(64) for buffer_? and static_assert(5*64 == sizeof(...));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my initial thought. But I assumed the static_assert(4*64 == sizeof(...)); was there for a reason. Meaning there is some assumption somewhere that depends on that this is exactly 4 cache lines. In particular, because some of the memory management in the code base is manual, I could not guarantee that it does not affect something else.

If you tell me that 5 cache lines is fine, I am happy to change it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants