Skip to content

[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps#2186

Draft
sifakis wants to merge 110 commits into
AcademySoftwareFoundation:masterfrom
sifakis:vbm-cpu-port
Draft

[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps#2186
sifakis wants to merge 110 commits into
AcademySoftwareFoundation:masterfrom
sifakis:vbm-cpu-port

Conversation

@sifakis

@sifakis sifakis commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

Draft for CI and diff review. WIP.

@sifakis sifakis changed the title NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps [WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps Mar 30, 2026
sifakis and others added 2 commits April 2, 2026 13:46
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as
static __device__ members of VoxelBlockManager<Log2BlockWidth>. These
implement the first phase of a two-function WENO5 stencil gather:
resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to
resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided
array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: NanoVDB leaf layout encodes
(x,y,z) as x*64+y*8+z, so x/y/z strides are 0100/010/1 in octal.
WenoPt<i,j,k>::idx is used throughout to remain independent of any future
re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and
CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 15 commits April 2, 2026 13:50
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill
its non-obvious design decisions into the knowledge base:

- §11: decodeInverseMaps is intentionally single-threaded/stateless;
  caller distributes blocks; contrast with cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask
  words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields;
  mPrefixSum is still used for the cross-word offset in Step 5.
- §13: output fill is range-fill + contiguous copy (not scatter) because
  shuffleDownMask produces a sorted compacted array; std::fill/copy caveat
  for alignment when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse
maps into block-local scratch, then resolves neighbor leaf pointers and
fills N-point stencil index arrays for all active voxels in the block.
WENO5 (N=19, R=3) is the motivating instance; architecture is stencil-agnostic.
Covers GPU inner loop, CPU SIMD batch design (SIMDw=16, probeLeaf dedup),
unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: Stencil type as template parameter needs index→offsets direction
  (for-each-slot gather loop), not the offsets→index direction of WenoPt.
  Clarify relationship to BaseStencil/WenoStencil: geometry-only descriptor,
  no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u);
  output is homogeneous std::array (not tuple); K=1 degenerates to scalar;
  SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4-§8 → §5-§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes

Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.

The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h).  The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:

  1. std::max / bool isOutside ternaries → "control flow in loop"
  2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
     GCC's inability to see through std::tuple's recursive-inheritance
     struct layout in GIMPLE alias analysis

INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal
header-only SIMD abstraction backed by std::array<T,W> with arithmetic
operators, SimdMask, min/max, and where().  Mirrors the C++26 std::simd
interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float  : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both
execution contexts with no #ifdef, structurally matching Stencils.h.
Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions
in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and
next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and
__cpp_lib_experimental_parallel_simd.  When available, Simd<T,W> and
SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask,
delegating all arithmetic to the standard type.

The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into
the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly
(1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes
through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array
default, including the TS v2 where() adaptation, the auto-detection
mechanism, and the assembly comparison showing byte-for-byte identical
output between the two backends under Clang 18.

Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
StencilKernel.h — new prototype header:
  - BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
  - WenoStencilKernel<T>: derives from above, provides normSqGrad()
  - WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
  - T=float for GPU scalar path, T=Simd<float,W> for CPU batch path

lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
  - SIMD and scalar reference paths both instantiate the same class
  - dx passed to constructor; mValues populated via operator[]

Simd.h — refinements:
  - Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
    stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
  - element_aligned_tag / element_aligned: portable load/store tag, always
    present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
  - Backend B load constructor and store() accept element_aligned_tag (defaulted)
  - NANOVDB_NO_STD_SIMD opt-out flag to force Backend B

INVESTIGATION.md — updated:
  - Approach B section updated to reflect class hierarchy instead of free functions
  - Backend B GCC note: the struct-access failure was specific to Approach A's
    liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
    on GCC when used with the Generic-T class hierarchy
  - New ymm tables for both backends under GCC (Backend A: 1267 total,
    Backend B: 619 total); both pass correctness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add full CPU batch neighbor-leaf resolution design to the planning doc:

- §6: Replace StencilLeafPtrs struct with layered design — shared 3×3×3 bit
  encoding for probedMask/ptrs[27], stencil-specific batchPtrs population
  (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for box stencil),
  and GPU scalar design note kept separate.

- §8d: Update lazy-probe section to reference ptrs[27] and 27-bit probedMask;
  add batchPtrs population step (Phase 2) after the probeLeaf loop.

- §8e: Update computeNeededDirs direction table to use 3×3×3 bit positions
  (bits 4,10,12,14,16,22 for WENO5 face neighbors).

- §8f/§8g: Minor notation updates to match ptrs[27] naming.

- §9: Resolve ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Standalone CPU-only executable that verifies the neighbor leaf resolution
design from StencilGather.md §8d–§8f:

- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from
  jumpMap, then processes SIMDw=16 batches with the full probedMask /
  lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil.  Instead verifies batchPtrs against a
  direct probeLeaf reference for all 18 non-center WENO5 stencil offsets
  that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw>
vector, moving sentinel/masking responsibility to the single gather site
where leafMask is known:

- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- Caller broadcasts sentinel to all lanes, overwrites leafMask lanes with
  real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit
groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all
six WENO5 directions simultaneously.  kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm +
  vextracti128/vpsrldq shuffle-tree, no branches or calls in hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

Always-on scalar cross-check at every computeNeededDirs call site.
verifyComputeNeededDirsSentinel() tests both the sentinel carry property
and the straddle-lane non-pollution scenario before runPrototype().

StencilGather.md §8e and §8f updated to match new API and codegen notes.
Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor —
developed from the ex_stencil_gather_cpu Phase 1 prototype discussion.

Core concept: instead of caching the path to one leaf, cache the full
3×3×3 neighborhood of 27 leaf pointers around the current center leaf,
serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not
evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): 6 extremal taps (±R,0,0),(0,±R,0),(0,0,±R) are necessary
  and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): 8 corner taps (±1,±1,±1) collectively cover all 26
  non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask)  — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit,
dead-axis elimination, VDB convention alignment; runtime Coord overload
provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm),
gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar
bottlenecks from Phase 1 prototype are eliminated.

StencilGather.md: add cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduces BatchAccessor<BuildT,ValueT,VoxelOffsetT,LeafIDT,PredicateT>,
the SIMD analog of NanoVDB's ValueAccessor.  Instead of caching the path
to one leaf, it caches the 27-entry 3×3×3 neighbor pointer table around
the current center leaf and serves a SIMD batch of LaneWidth voxels per
call.

Key design decisions
--------------------
- Eager center: constructor and advance() populate mLeafNeighbors[13]
  directly (O(1), no probeLeaf), so cachedGetValue<0,0,0> is valid
  immediately and the probe loop never needs a center special-case.

- SWAR neededMask: prefetch<di,dj,dk> expands the 9-bit voxel offsets
  into a 15-bit packed form (lz@[0:2], lx@[6:8], ly@[12:14]) using SIMD
  bitwise ops, then adds a compile-time packed stencil offset and checks
  carry bits for crossing detection.  One vpaddw YMM instruction covers
  all 16 lanes; clang folds packed_d into the blend constants at compile
  time, reducing the expand+blend+add to 5 SIMD instructions.

- Heterogeneous where: Simd.h gains where<T,U,W>(SimdMask<U,W>, ...) so
  a PredicateT=SimdMask<float,W> can gate a VoxelOffsetT=Simd<uint16_t,W>
  blend without explicit casting.  Array backend uses a trivial bool loop;
  stdx backend converts via a bool[] round-trip.

- Correctness verified in-process: stencil_gather_cpu.cpp integrates
  BatchAccessor as an alternate execution path and cross-checks all 18
  WENO5 tap directions against direct tree references (12.3M lane checks).

Simd.h additions (array backend)
---------------------------------
- SimdMask<T,W>: converting constructor from SimdMask<U,W>
- Simd<T,W>: operator|, &, ^, <<(Simd), >>(Simd)
- where<T,U,W>: heterogeneous mask overload (both backends)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Simd.h:
- Add scalar_traits<T> / scalar_traits_t<T>: extracts element type from
  plain scalars (identity) and Simd<T,W> (T); used by BatchAccessor
  static_asserts and as the shift-count type for the new uniform-shift ops.
- Add Simd::operator<<(T) / operator>>(T): uniform scalar shift (all lanes
  by the same immediate).  Maps to vpsllw imm8 / vpsrlw imm8 on x86 —
  distinguished from the existing per-lane Simd<<Simd overload which would
  require the nonexistent vpsllvw instruction for 16-bit lanes.

BatchAccessor::prefetch SWAR section:
- Replace hard-coded uint16_t/uint32_t casts with VoxelOffsetScalarT
  (= scalar_traits_t<VoxelOffsetT>) so the code is correct for any
  unsigned 16+-bit instantiation, not just Simd<uint16_t,W>.
- Add class-level static_asserts (unsigned + sizeof >= 2) with explanatory
  messages referencing the SWAR carry-detection contract.
- Remove static_assert(LaneWidth >= 16): was a performance aspiration,
  not a correctness requirement; SWAR works for any LaneWidth >= 1.
- Use 'auto' for expanded / packed_lc / packed_sum (type already expressed
  by the initializer); keep explicit uint32_t for hor_or/hor_and/s where
  the width is a deliberate semantic choice.
- Replace kMask15 hex literal 0x71C7u with 0b111'000'111'000'111u (binary
  makes the three 3-bit mask fields and three 3-bit gaps visually explicit).
- Use vo << VoxelOffsetScalarT(9) (uniform shift) instead of
  vo << VoxelOffsetT(VoxelOffsetScalarT(9)) (broadcast-then-per-lane).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
@danrbailey

Copy link
Copy Markdown
Contributor

I know that this is marked as not for review yet, but I just wanted to comment and say that specifically there is one aspect of this PR which is something I've wanted in OpenVDB for a long time. I brought this up in our TSC meetings again recently. A lot of the time when we use the ValueAccessor, it is not true random access, but stencil access patterns and the fact that the ValueAccessor is only able to store one leaf node means you end up thrashing the cache in the ValueAccessor when accessing different neighboring leaf nodes along leaf boundaries.

I have done a little experimentation in this area and the lazy leaf neighbor pattern that you use here is what I was envisioning, though I was contemplating whether a templated stencil argument (ie 6-neighbor vs 27-neighbor) may provide improved performance depending on the use case. I also hadn't settled on a name - NeighborAccessor, StencilAccessor, etc.

I would be very interested in comparing performance of a similar structure for OpenVDB that doesn't include the SIMD logic to compare with performance of the ValueAccessor (that also does not use SIMD). If it outperforms the ValueAccessor, I think we should look to switch some of our neighbor use cases across.

The other big issue with the ValueAccessor IMO is the tight coupling with the Grid and Tree (getValueAndCache() etc). This hurts impacts compile times and a new acceleration structure that extends the ValueAccessor in a new direction might be a good opportunity to revisit this and the attach/release accessor mechanism. I know NanoVDB took a slightly different approach on these trade offs.

(@kmuseth - a good topic for a future TSC meeting I think)

sifakis and others added 9 commits April 16, 2026 10:33
…lingZeros

- Add 2-argument where(mask, target) = value proxy (stdx and array backends):
  stdx-style masked assignment; encourages GCC to emit vpblendvb
- Add util::reduce(v, op) to both backends: tree-reduces to a scalar with
  std::bit_or<>{} / std::bit_and<>{} etc.; replaces scalar horizontal loop
- Add scalar reduce(T, BinaryOp) identity overload for W=1 path
- Add util::countTrailingZeros(uint32_t) to nanovdb/util/Util.h: __hostdev__,
  CUDA/HIP/__builtin_ctz/MSVC/De Bruijn dispatch; removes ad-hoc ctz from Simd.h
- BatchAccessor: use 2-arg where for packed_lc blend, util::reduce for hor_or/and,
  util::countTrailingZeros in toProbe loop, hoist root ref out of loop body

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Replace the per-lane scalar loop in cachedGetValue with a fully SIMD
gather chain that populates offsets, prefixSums, and maskWords without
any scalar iteration over lanes.

Pipeline (all in Simd<T,W>):
  packed_sum (uint16_t)
    → ×1129 → >>10 → &31        : d_vec (0..26), stays uint16_t —
                                   bits [10:14] of product lie below bit 16
                                   so modular uint16_t multiply is exact
    → gather(mNeighborLeafIDs)   : leaf_id_vec (uint32_t)
    → ×kStride                   : raw_idx (int32_t, null lanes → 0)
    → gather(offset_base)        : mOffset per lane
    → gather(prefix_base)        : mPrefixSum packed, then shift-extract field w
    → gather(mask_word_base + w) : valueMask().words()[w] per lane

Switch mLeafNeighbors[27] (const LeafT*) to mNeighborLeafIDs[27] (uint32_t)
with kNullLeafID = ~uint32_t(0) sentinel, enabling the flat-base SIMD gather
pattern.  prefetch and advance updated accordingly.

Add simd_cast<DstT>(Simd<SrcT,W>) to Simd.h for widening (uint16_t → int32_t,
uint16_t → uint64_t, uint32_t → int32_t) used in gather index construction.

Debug cross-check (#ifndef NDEBUG) validates all three vectors against the
scalar reference path; 12M+ lane checks pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
§8a was stale ("prefetch uses a scalar loop") — prefetch has no per-lane
loop. Update with accurate description of its two-phase structure:

- SWAR expansion + vpblendvb + vpaddw: fully in YMM across all LaneWidth
  lanes (vpsllw, vpor, vpand, vpblendvb, vpaddw)
- Horizontal reduction: unavoidable vextracti128 + vpand/vpor tree to
  produce scalar hor_and / hor_or for the per-axis crossing decision

Include the actual Release assembly (ex_stencil_gather_cpu, -O3 -mavx2)
confirming the YMM path survived the mLeafNeighbors → mNeighborLeafIDs
encoding change.

§8c: update nullptr sentinel language to reflect kNullLeafID / valid_u32
mask in the SIMD gather chain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- Promote SWAR encoding literals to class-scope uint16_t constants:
  kSwarXZMask (0x1C07), kSwarYMask (0x00E0), kSwarSentinel (4|4<<5|4<<10).
  Shared between prefetch and cachedGetValue; implicit conversion to
  VoxelOffsetT at Simd construction time.
- Add direction-extraction local constants in cachedGetValue:
  kSwarCarryMask (0x6318), kDirMul (1129), kDirMask (31).
- Rename w_vec -> wordIndex for clarity.
- Move U64Traits alias inside #ifndef NDEBUG where it is only used.
- Replace non-ASCII characters (em dashes, arrows, element-of, comparison
  operators) with ASCII equivalents throughout.
- Add explicit parentheses around the d_u16 shift-then-mask chain to make
  operator precedence unambiguous.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…liases

Simd.h:
- Add scalar simd_cast<DstT>(SrcT) overload (degrades to static_cast).
- Add scalar 2-arg where(bool, T&) masked-assignment proxy matching the
  SIMD WhereExpression form.
- Add scalar gather(const T*, int32_t) and gather_if(T&, bool, ...) to
  complete the scalar overload set alongside the existing where(bool,T,T).

BatchAccessor.h:
- Remove LeafIDT template parameter (was reserved/unused; now derived).
- Add private class-scope LeafIDVecT and LeafDataVecT using conditional_t:
  plain uint32_t/uint64_t when LaneWidth==1, Simd<T,W> otherwise.
  This upholds the convention that scalar instantiations use underlying
  types directly rather than Simd<T,1> wrappers.
- Replace local U32T/U64T aliases in cachedGetValue with class-scope names.

stencil_gather_cpu.cpp:
- Drop the now-removed LeafIDT argument from the BAccT instantiation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
cachedGetValue is now fully vectorised end-to-end for ValueOnIndex grids.
The scalar leaf->getValue(offset) loop is removed; result is filled via a
2-arg where(isActive, result) = ... directly on the output argument so that
leafMask-clear and inactive-voxel lanes are never touched.

Key design decisions recorded in BatchAccessor.md §8e:

- tapLeafOffset_i64 widened to int64_t before *= kStride to avoid uint32_t
  overflow (kNullLeafID = 0xFFFFFFFF causes wild gather indices in uint32_t).
  simd_cast_if<int64_t>(dst, valid_u32, src) writes 0 for invalid lanes,
  keeping gather indices non-negative for vpgatherqq (signed int64_t).

- gather_if gains a MaskElemT template parameter to support heterogeneous
  masks: valid_u32 (SimdMask<uint32_t,W>) applied to uint64_t data fields.

- Activity check: ValueOnIndex::getValue returns 0 for inactive voxels.
  Detected as isActive = (maskWords & (1<<dest_yz)) != 0. Null-leaf lanes
  have maskWords=0 and are therefore implicitly handled by the same check.

- popcount uses a SWAR shift-and-add tree (popcount64 in Simd.h) rather than
  __builtin_popcountll: AVX2 has no 64-bit lane-wise popcount (VPOPCNTQ is
  AVX-512DQ only), and the scalar popcnt instruction is not vectorisable.

- mOffsetBase, mPrefixBase, mMaskWordBase promoted to class-level const
  pointers, computed once in the constructor, shared across all 18
  cachedGetValue instantiations in a WENO5 stencil gather.

Verified: debug build (-O0) + release build (-O3), 12 321 275 lane checks
across all 18 WENO5 non-center taps, zero mismatches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Add §8f with assembly comparison between GCC 13 and Clang 18
(-O3 -DNDEBUG -march=native, cachedGetValue<1,0,0>, W=16):

- GCC does not inline any Simd.h helper (gather_if, simd_cast, where,
  popcount); emits 14 out-of-line calls and 13 vzeroupper transitions.
  gather_if body uses scalar vmovq/vpinsrq/vinserti128 — no hardware gathers.

- Clang inlines everything except popcount; emits 2 vpgatherdd + 12 vpgatherqq
  hardware gather instructions and only 2 vzeroupper transitions.
  popcount body (88 ymm instrs, pure SWAR vpsrlq/vpand/vpaddq) is also
  fully vectorized but remains out-of-line.

- 43 vpinsrb in Clang output are mask-widening cost for heterogeneous
  gather_if (SimdMask<uint32_t,16> → 4x SimdMask<uint64_t,4>).

Action item added to §10: [[gnu::always_inline]] on Simd.h helpers
would eliminate the GCC regression and fold popcount inline in both
compilers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…adeoffs

Expand §8f with analysis of three popcount strategies for the 16-lane
uint64_t case:

- SWAR shift-and-add tree (current): 88 instructions, all AVX2-friendly
  but port-intensive; stays in vector registers throughout.

- Scalar popcnt with extract/reassemble: ~56 instructions; popcnt is
  pipelined (1/cycle throughput on port 1, 3-cycle latency) so 16
  independent lanes retire in ~16 cycles, but the ymm↔GPR domain
  crossing adds ~2 cycles bypass latency per extraction and port 1
  serialises all 16 popcnts.

- vpshufb nibble-LUT + vpsadbw (recommended): ~40 instructions, no
  domain crossing, uses ports 0/5 and 5 (orthogonal to SWAR ports),
  standard compiler-generated AVX2 popcount pattern. Added as action
  item in §10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds StencilAccessor<BuildT,W,StencilT> — a compile-time-parameterized SIMD
wrapper around BatchAccessor that owns the straddling loop, hull-prefetch
sequencing, and per-tap cachedGetValue blending for one VBM block.  Output
is mIndices[SIZE] of Simd<uint64_t,W>, one vector per stencil tap.

Includes Weno5Stencil (18 taps, 6-tap hull) and the findIndex constexpr
fold for compile-time getValue<DI,DJ,DK>() inverse-map lookup.

Also:
- BatchAccessor.h: add centerLeafID() getter used by StencilAccessor
- BatchAccessor.md: expand §8f assembly matrix (compiler × backend × ISA)
- stencil_gather_cpu.cpp: wire StencilAccessor into runPrototype with
  verifyStencilAccessor correctness checks; add rdtsc-based runPerf
- Simd.h: remove = {} default argument from element_aligned_tag overloads
- CLAUDE.md: add project build/architecture reference for Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 30 commits June 11, 2026 17:32
…teLeafNodesFunctor<NN_FACE_EDGE_VERTEX>

The NN_FACE_EDGE_VERTEX leaf functor accessed its [10][3][3] word stencil through a
reinterpret_cast view anchored mid-array at [1][1][1] with negative indices. Those negative
logical indices form addresses before the view object's base -- out-of-object pointer
arithmetic (UB). It is correct on-device today, but GCC at -O2/-O3 exploits the UB in the
read-modify-write z/y/x passes and miscompiles it (reproduced standalone: -O0 matches, -O2/-O3
diverge in ~44% of random trials), making it a latent landmine if ever compiled host-side --
which the host port now does.

Replace the anchored view with a +1-biased index accessor (originalWordsShifted[i+1][j+1][k+1]),
matching the host util::morphology::DilateLeafNodes fix. Every access now lands within the real
array's [0,10)x[0,3)x[0,3) bounds; no out-of-object arithmetic, correct at all optimization
levels. This is the only morphology functor with the before-base pattern; the other two
reinterpret views (MaskShift's [2][8][4] and DilateInternalNodesFunctor's [3][3][3]) anchor at
element [0] with in-bounds indices and are unaffected.

Verified: ex_dilate_nanovdb_cuda (DilateGrid + PruneGrid) CORRECT on-device for
NN_FACE_EDGE_VERTEX across dragon/armadillo/iss/space.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…__ helper

Record a future-refactoring note (§7.9): since DilateLeafNodes is thread-centric (no warp/CTA
cooperation) and its primitives are already __hostdev__, the per-leaf body could be a single
__hostdev__ helper shared by the host forEach and the CUDA kernel -- removing the duplication
that just forced the [10][3][3] stencil UB fix to be applied twice (host 0110561, cuda
c199fcb). DilateInternalNodes is explicitly not a candidate (warp-cooperative MaskShift/WarpReduce).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 4.8 is now in progress: DilateGrid's two compute-bearing operator methods
(dilateInternalNodes, dilateLeafNodes) are ported to host and validated end-to-end
against the OpenVDB dilateActiveValues reference for both NN_FACE and the full
NN_FACE_EDGE_VERTEX stencil. §7.7 rewritten with a per-method DilateGrid table, the new
host pieces (shuffleDown/Up, MaskShift, DilateInternalNodes, DilateLeafNodes) and their
validation, and the remaining work (processGridTreeRoot/dilateRoot host conversion, then
the dilate->prune round-trip once PruneGrid lands). Coarsen/Refine/Prune still pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…idTreeRootFunctor)

Verbatim reuse of the MergeGrids host version (7e5f965): replace the cudaMemcpyDeviceToDevice
of the source GridData with a host std::memcpy (both ends host-accessible) and call
BuildGridTreeRootFunctor directly on host instead of via a single-element lambdaKernel, behind a
leading stream-sync that drains getBuffer's output-buffer zero-fill. Add <cstring>.

This was the last kernel launch in DilateGrid.h; the operator is now kernel-free, at parity with
MergeGrids.h. dilateRoot still does its D2H source copy (a shared Phase 4.4/4.7 cleanup item,
pending for mergeRoot too).

Validated CORRECT against the OpenVDB dilateActiveValues reference across dragon/armadillo/iss/
space (NN_FACE_EDGE_VERTEX).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rid (drop D2H copy)

Remove the HostBuffer + cudaMemcpyAsync D2H copy of the source RootNode and Upper Nodes;
read them directly from the managed (UnifiedBuffer) source grid instead. This is the same
managed-access assumption dilateInternalNodes/dilateLeafNodes already rely on, and the source
pages are drained by the stream-sync at the top of getHandle. getChild() resolves its relative
offsets within the same managed grid, so the uppers remain reachable.

DilateGrid.h is now free of D2H copies and kernel launches; the only remaining device op is the
output mProcessedRoot.deviceUpload (the shared Phase 4.4 dual-mode-buffer item, pending for
mergeRoot too -- where the analogous source D2H copy can be dropped the same way).

Validated CORRECT against the OpenVDB dilateActiveValues reference across dragon/armadillo/iss/
space (NN_FACE_EDGE_VERTEX).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nctors

The eight topology::detail functor operator()s (BuildGridTreeRoot, UpdateLeafVoxelCountsAndPrefixSum,
UpdateLeafVoxelOffsets, UpdateAndPropagateLeafBBox, PropagateLowerBBox, PropagateUpperBBox,
UpdateRootWorldBBox, PostProcessGridTree) were __hostdev__ during the transition while some
operators still launched them as kernels. Now that every caller is host (util::forEach bodies and
direct host calls in TopologyBuilder's methods, plus BuildGridTreeRootFunctor invoked directly by
Merge/Dilate processGridTreeRoot) and TopologyBuilder.h has no kernel launches, these are host-only
and the __hostdev__ is dropped. These are the host tools::topology::detail functors -- distinct
from the CUDA tools::cuda::topology::detail ones the device operators use, so nothing calls them
from device.

The Data POD accessors (getGrid/getTree/... ) keep __hostdev__ (lower-level, harmless).

Both examples still build and check CORRECT (dilate dragon; merge dragon+armadillo).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ilateRoot host + __hostdev__ cleanup

Update the DilateGrid table: all four operator methods are HOST now -- processGridTreeRoot
(ce25500) and dilateRoot (38fa025, source read directly from the managed grid, D2H copy
dropped) joined dilateInternalNodes/dilateLeafNodes. DilateGrid.h has no kernel launches and no
D2H copies; only the shared Phase 4.4 mProcessedRoot.deviceUpload remains. Also note the
TopologyBuilder detail functors dropping __hostdev__ (517f94b). Phase 4.8 line updated to
"DilateGrid kernel-free, at merge parity".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…anovdb_cpu

Port PruneGrid following the DilateGrid playbook. Both functors are thread-centric (no warp
cooperation), ported to util/Morphology.h:
- PruneInternalNodes: per source leaf, retain iff its value mask intersects the leaf-mask
  sidecar; set the pruned upper/lower child masks via setOnAtomic (sibling leaves race).
- PruneLeafMasks: per source leaf, set the retained output leaf's value mask to
  srcValueMask & sidecar; flat over source leaves, distinct output leaves, no atomics.

tools/PruneGrid.h is a host operator modeled on the now-clean DilateGrid.h: reuses the host
TopologyBuilder, UnifiedBuffer getHandle default + ScratchBufferT alias, reads the source
root/grid/mask directly from managed memory (no D2H copies), and runs pruneInternalNodes,
processGridTreeRoot, and pruneLeafNodes on host. No kernel launches in PruneGrid.h; only the
shared Phase 4.4 mProcessedRoot.deviceUpload remains.

ex_dilate_nanovdb_cpu now does the dilate->prune round-trip: inject the original topology as a
UnifiedBuffer leaf-mask sidecar (InjectGridMaskFunctor kept a CUDA kernel -- an example helper,
not part of PruneGrid), then host-prune the dilated grid back and check against the original.

Validated CORRECT (DilateGrid + PruneGrid) across dragon/armadillo/iss/space for
NN_FACE_EDGE_VERTEX; merge unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch the example's default stencil from NN_FACE_EDGE_VERTEX to NN_FACE (matching the CUDA
ex_dilate_nanovdb_cuda default). The full dilate->inject->prune round-trip checks CORRECT for
both DilateGrid and PruneGrid across dragon/armadillo/iss/space at NN_FACE, complementing the
earlier NN_FACE_EDGE_VERTEX validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a PruneGrid section: all four operator methods HOST, kernel-free, both functors
thread-centric (PruneInternalNodes retain-check + setOnAtomic; PruneLeafMasks per-leaf
intersection). Document the leaf-mask sidecar and the dilate->inject->prune round-trip in
ex_dilate_nanovdb_cpu (CORRECT for Dilate+Prune across the geometry set at both stencils;
example now defaults to NN_FACE). Update the dilate example note (round-trip now present),
trim "Still to port" to Coarsen/Refine, and update the Phase 4.8 line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Buffer (cleanup step 1)

The mask/offset/parent scratch (ScratchBufferT) is written and read entirely on the host now --
no device kernel touches it -- so it becomes plain HostBuffer:
- ScratchBufferT = nanovdb::HostBuffer (+ <nanovdb/HostBuffer.h>, <cstring>)
- ScratchBufferT::create(size, nullptr, device, stream) -> create(size); drop the now-unused
  cudaGetDevice(&device) locals
- mask zero-fills cudaMemsetAsync(.deviceData()) -> std::memset(.data())
- EnumerateNodes fed mUpperMasks.data()/mLowerMasks.data(); deviceUpperMasks()/deviceLowerMasks()
  accessors deleted (their only host caller was countNodes); d_upperOffsets <- mUpperOffsets.data()

Self-contained: mData and mProcessedRoot stay DeviceBuffer, the output grid buffer stays
UnifiedBuffer, the stream param and the examples (still .cu, cudaMemcpy bufferCheck) are
unchanged. The remaining cudaStreamSynchronize drains still cover the device-side
mData/mProcessedRoot uploads and the output-buffer memset (migrated in later steps).

Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune round-trip
(dragon/armadillo/iss/space, NN_FACE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p 2)

mProcessedRoot is transient host scratch (the speculative root, discarded after the tree is
built), so it now shares the ScratchBufferT (= HostBuffer) storage policy:

- TopologyBuilder: mProcessedRoot member -> ScratchBufferT; ScratchBufferT moved to the public
  section so the operators (which own the <op>Root allocation) share this single definition;
  delete the now-unused deviceProcessedRoot() accessor; mProcessedRoot.clear(stream) -> clear().
- Merge/Dilate/Prune: alias ScratchBufferT from the builder (typename TopologyBuilder<BuildT>::
  ScratchBufferT) rather than re-declaring it; allocate mProcessedRoot via ScratchBufferT::create
  (was DeviceBuffer::create); drop the trailing deviceUpload and the now-unused cudaGetDevice
  local. The host-side speculative-root tile-fill is unchanged.

Self-contained: mData stays DeviceBuffer, the output grid buffer stays UnifiedBuffer, the examples
are untouched. The host path read mProcessedRoot via hostProcessedRoot() (.data()) already, so no
reader changes; the CUDA operators use their own tools::cuda::TopologyBuilder, unaffected.

Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune round-trip
(dragon/armadillo/iss/space, NN_FACE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… directly

d_upperOffsets was a kernel-era smuggling channel: it cached the mUpperOffsets scratch pointer
into the Data POD so the formerly-kernel processUpperNodes could reach it through its Data*
argument. processUpperNodes is now an inline forEach lambda in TopologyBuilder with direct member
access, so it reads mUpperOffsets.data() straight, and the field (plus its getBuffer set site) is
removed. Its sole reader was processUpperNodes; mUpperOffsets is allocated in countNodes and not
cleared before processUpperNodes runs, so the direct read is valid.

Data is now {d_bufferPtr, grid..size byte offsets, nodeCount[3]} + the get* accessors. d_bufferPtr
stays -- it's the only handle the detail functors (handed just a Data*) have to the output grid.

Validated CORRECT: merge (dragon+armadillo), dilate+prune (dragon, iss).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nup step 3)

mData is a single small POD that is written once (countNodes + getBuffer) and thereafter only
read on the host; its device copy had no readers (the deviceData() accessor was dead, all kernels
gone), so it was a pure deviceUpload staging target. Collapse it to a plain `Data mData{};`
member:
- drop the DeviceBuffer allocation in the constructor and the mData.deviceUpload in getBuffer
- delete the deviceData() accessor; data() now returns &mData
- refresh the processUpperNodes stream-sync comment (it now drains only the output-buffer memset)

No host/device divergence existed to preserve (verified earlier). The output grid buffer stays
UnifiedBuffer and the examples are untouched; d_bufferPtr still holds the (managed) output base,
host-derefable as before.

Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune round-trip
(dragon/armadillo/iss/space, NN_FACE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite §7.3 as the shared host-only cleanup tracker (Merge/Dilate/Prune + TopologyBuilder),
targeting HostBuffer directly (UnifiedBuffer was transitional). Record done: scratch -> HostBuffer
(db14618), mProcessedRoot -> HostBuffer with the ScratchBufferT alias now owned by the builder
and aliased by the operators (56f480b), mData -> plain Data member (5fee27c) + d_upperOffsets
removed (82f71cd), updateChecksum -> host (333c942). Remaining = Step 4: output grid buffer ->
HostBuffer, inputs host-resident (getTreeData/mergeRoot D2H, drop stream param), residual
stream-sync drains, dead code (CALL_CUBS/mTempDevicePool/cub include), and the .cu->.cpp example
milestone (Dilate needs a host InjectGridMask). Fix the stale mergeRoot note in §7.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nd-agnostic (step 4, merge path)

Make the output grid buffer host-allocable and flip the merge chain to host-resident inputs+output:

- TopologyBuilder::getBuffer is now backend-agnostic and CUDA-free: BufferT::create(size, &pool)
  + std::memset(buffer.data()) + d_bufferPtr = buffer.data(). Works for HostBuffer (merge) and
  UnifiedBuffer (dilate/prune still default to it) alike -- UnifiedBuffer's create(size,ref) places
  it CPU-side and data() is host-writable -- with no behavior change for dilate/prune.
- MergeGrids: getHandle default BufferT -> HostBuffer; getTreeData (D2H via DeviceGridTraits) ->
  direct host read (mSrcTreeData = srcGrid->tree()); mergeRoot reads the source RootNodes directly
  from the host grids (drop the D2H copies + their HostBuffer staging).
- ex_merge example: all grids HostBuffer (no deviceUpload/deviceGrid); the result is read via
  handle.grid<BuildT>() and checked with std::memcmp; mainMergeGrids signature simplified to the
  three host grid pointers.

This eliminated the GridHandle<UnifiedBuffer> link dependency the .cpp previously had (the source
grids were UnifiedBuffer). MergeGrids.h still carries CUDA includes (mTimer, cudaStreamSynchronize,
cub) so the example stays a .cu for now; stripping those + the .cu->.cpp completion are the next
sub-steps.

Validated CORRECT: merge across dragon/armadillo/iss/space (incl. reversed); dilate+prune
(dragon, iss) unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ated in-place)

Port InjectGridMaskFunctor (the thread-centric, non-warp-cooperative injector used by the
dilate->prune round-trip) to a host util::injectGridMask -- a util::forEach over destination
leaves, each writing its own sidecar entry (src.valueMask & dst.valueMask), no atomics. Lives in
util/Injection.h, parallel to util/cuda/Injection.cuh.

Swap the ex_dilate example's InjectGridMask CUDA kernel for the host call. Everything else stays
UnifiedBuffer (managed = host-accessible), so this validates the host inject in isolation before
any HostBuffer flip: dilate -> host inject -> prune checks CORRECT (Dilate + Prune) across
dragon/armadillo/iss/space at NN_FACE.

This removes the last kernel launch from the dilate example's mainDilateGrid; the remaining CUDA
there is the operator headers' includes (and getTreeData), to be shed when DilateGrid/PruneGrid
flip to HostBuffer and the example goes .cu -> .cpp.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ilate path)

Mirror the merge HostBuffer flip for the dilate->inject->prune chain:
- DilateGrid/PruneGrid: getHandle default BufferT -> HostBuffer; getTreeData (D2H via
  DeviceGridTraits) -> direct host read (mSrcTreeData = srcGrid->tree()). The <op>Root methods
  already read the managed source directly, so no other source reads remain.
- ex_dilate example: all grids HostBuffer (no deviceUpload/deviceGrid); dstLeafCount via host
  dstGrid->tree().nodeCount(0); the leaf-mask sidecar is a HostBuffer; the host injectGridMask,
  PruneGrid, and the Dilate/Prune checks all operate on host grids (std::memcmp). mainDilateGrid
  simplified: the original source doubles as the prune reference (one srcGrid param).

getBuffer (backend-agnostic since the merge step) yields HostBuffer output here too. DilateGrid.h/
PruneGrid.h still carry CUDA includes (mTimer/cudaStreamSynchronize/cub) so the example stays a
.cu; that and the .cu->.cpp completion are the shared next step.

Validated CORRECT (DilateGrid + PruneGrid) across dragon/armadillo/iss/space at NN_FACE, and
dragon/space at NN_FACE_EDGE_VERTEX.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nd examples)

Replace the CUDA event timer (util::cuda::Timer, constructed with a stream) with the host
util::Timer (util/Timer.h; same start/restart/stop API, no stream) in MergeGrids/DilateGrid/
PruneGrid: include swap, member type, and drop mTimer(stream) from the ctor init lists. The
example gpuTimers (which got util::cuda::Timer transitively via the operator headers) likewise
swap to util::Timer.

One of the operator headers' CUDA dependencies removed; remaining are cub, util/cuda/Morphology.cuh,
util/cuda/DeviceGridTraits.cuh, util/cuda/Util.h, and the cudaStream_t/cudaStreamSynchronize
plumbing -- the rest of the CUDA-strip.

Validated CORRECT: merge (dragon+armadillo) and dilate+prune (dragon).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every operation in the host operators + TopologyBuilder is now synchronous host code
(std::memset, util::forEach, host checksum) -- there is no async op, kernel launch, cudaMemcpy*,
or deviceUpload left -- so the cudaStreamSynchronize calls were draining an empty default stream
and the whole cudaStream_t/mStream apparatus was vestigial. Remove it:

- TopologyBuilder: drop the cudaStream_t param from every method (allocateInternalMaskBuffers,
  countNodes, getBuffer, processUpperNodes/LowerNodes/LeafOffsets, processBBox, postProcessGridTree)
  and the constructor; delete all cudaStreamSynchronize calls and the (void)stream.
- Merge/Dilate/Prune: drop the stream ctor param, the mStream member, every cudaStreamSynchronize,
  and pass no stream to the mBuilder.* calls; default-construct mBuilder.
- Remove the now-orphaned "Drain upstream ..." comments and stale @param stream / "Copy TreeData
  from GPU->CPU" doxygen.

Purely mechanical: behavior is identical (verified there was nothing stream-ordered to preserve).
The dead CALL_CUBS macro + mTempDevicePool (which still name `stream` in an unexpanded macro) and
the remaining cub/util/cuda includes are the next dead-code/include pass.

Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune (all four geometries).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng work

Update §7.3: the whole data path is now host and CUDA-free in behavior -- getBuffer
backend-agnostic (8d8993e), all three operators default to HostBuffer + host-resident inputs
(8d8993e, 7eb3a50), host InjectGridMask (7931906), host Timer (dae9740), and the cudaStream_t
plumbing ripped out (d5a962b). Remaining is cosmetic/build-capability only: dead code
(CALL_CUBS/mTempDevicePool/mNumThreads), include purge (cub, util/cuda/*, cuda buffers), and the
.cu->.cpp completion (which also proves CUDA-freeness under plain g++). Phase tracker: 4.4 done
for Merge/Dilate/Prune, 4.7 in progress. CoarsenGrid/RefineGrid remain unported and will be
written clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add §7.8.1: warm-state host timings vs the OpenVDB ground-truth reference
and the nanovdb CUDA path. Host merge is ~8-13x faster than OpenVDB's
deepCopy+compSum; host dilate is on par with dilateActiveValues (and
compute-only already ~4x faster, alloc-bound). Notes the FloatGrid-vs-
OnIndexGrid caveat and that the GPU stays ~30-50x ahead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
CoarsenInternalNodesFunctor and RefineInternalNodesFunctor named their
processed-root parameter and local `prunedRoot`/`prunedTile`, pasted from
PruneInternalNodesFunctor. They operate on the coarsened/refined root, not a
pruned one. Rename to coarsenedRoot/coarsenedTile and refinedRoot/refinedTile
respectively (PruneInternalNodesFunctor's prunedRoot is correct, left as-is).

Pure rename, no behavior change; the CUDA coarsen/refine examples still build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Port the CUDA-only CoarsenGrid to a host-only operator at tools/CoarsenGrid.h
(namespace nanovdb::tools), following the prune/dilate playbook: it rides the
already-host TopologyBuilder pipeline unchanged and adds only its own functors.

- util/Morphology.h: host CoarsenInternalNodes (every non-empty source leaf maps
  via coarsenCoord into the coarsened topology; upper/lower masks set via
  setOnAtomic) and CoarsenLeafMasks (coarsenMask word arithmetic + util::atomicOr
  scatter, since up to eight source leaves coarsen into one destination leaf).
  Near-verbatim transcriptions of the CUDA functors, wrapped in util::forEach.
- tools/CoarsenGrid.h: all four operator methods on the host, no kernel launches
  and no D2H copies (only the shared mProcessedRoot.deviceUpload remains).
  coarsenRoot reads the source root directly from the host-resident grid.
- ex_coarsen_nanovdb_cpu: HostBuffer example checking against an OpenVDB
  reference; CORRECT across dragon/iss/space/torus.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Port the CUDA-only RefineGrid to a host-only operator at tools/RefineGrid.h
(namespace nanovdb::tools), completing the host port of all five topology
operators (Merge/Dilate/Prune/Coarsen/Refine). Like coarsen it rides the
already-host TopologyBuilder pipeline and adds only its own functors.

- util/Morphology.h: host RefineInternalNodes (detect present octants from the
  source value-mask words, then refineCoord + setOnAtomic) and RefineLeafMasks
  (refineMask bit-spread into up to eight distinct destination leaves, no
  atomics). Near-verbatim transcriptions of the CUDA functors via util::forEach.
- tools/RefineGrid.h: all four operator methods on the host, no kernel launches
  and no D2H copies (only the shared mProcessedRoot.deviceUpload remains).
  refineRoot reads the source root and upper nodes directly from the
  host-resident grid for its bbox-based 26-connected octant speculation.
- ex_refine_nanovdb_cpu: HostBuffer example checking against an OpenVDB
  reference (TBB VoxelRefiner); CORRECT across dragon/iss/space/torus.
- TopologyCpuPortPlan.md: record CoarsenGrid/RefineGrid as ported (7.7) and
  mark Phase 4.8 compute-complete for all five operators.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Phase 4.7 cleanup (plan items 10-11): the topology operators' data path has
been host-only for a while, but the headers still carried inert CUDA artifacts.
Remove them now (no behavior change; still compiled via the example .cu files
until the item-12 .cu->.cpp flip).

- TopologyBuilder.h: drop the dead mNumThreads/numBlocks, mTempDevicePool, and
  the CALL_CUBS macro + its trailing #undef (all named the long-gone stream and
  were inert/unexpanded). Drop the cuda/{TempPool,DeviceBuffer,UnifiedBuffer}.h
  and util/cuda/Morphology.cuh includes (EnumerateNodes/ProcessLowerNodes come
  from the host util/Morphology.h).
- MergeGrids/DilateGrid/PruneGrid/CoarsenGrid/RefineGrid.h: drop the per-operator
  mNumThreads/numBlocks and the cub/cub.cuh, cuda/UnifiedBuffer.h,
  util/cuda/{DeviceGridTraits.cuh,Morphology.cuh,Util.h} includes. Add the direct
  <map> and host util/Morphology.h includes the operators had been getting
  transitively through the removed CUDA headers.

ScratchBufferT is HostBuffer, so UnifiedBuffer/DeviceBuffer/cub were unused as
types; NearestNeighbors comes from util/MorphologyHelpers.h (kept). All four
_cpu examples build and validate CORRECT across dragon/iss/space/torus.

The \warning "include only from .cu files" doxygen notes are now stale but kept
until item 12 (.cu->.cpp), which removes them and proves CUDA-freedom under g++.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Phase 4.7 completion signal (plan item 12): make the host topology operators
prove their CUDA-freedom by compiling under plain g++.

- Rename each example's *_nanovdb_cpu_kernels.cu -> *_nanovdb_cpu_benchmark.cpp
  (git mv). The "kernels" name was stale (no kernels remain); the file is the
  operator-driving/benchmark half. The nanovdb_example CMake function globs *.cpp
  and only invokes nvcc when a *.cu exists, so the rename alone turns each _cpu
  example into a pure C++ target (verified: both TUs build as CXX, no CUDA object).
- Drop the now-unused cuda/DeviceBuffer.h (+ cuda/UnifiedBuffer.h) includes from
  the example main .cpp files; everything uses HostBuffer.
- Remove the now-stale "\warning ... include only from .cu files" doxygen notes
  from all six headers (TopologyBuilder + the five operators), and tidy the two
  "(host-side port in progress)" briefs to "(host-side port)".

This compile under g++ (no __CUDACC__) is what proves the headers are CUDA-free;
nvcc's host pass is more permissive. The one dependency it had masked (<map>,
transitively via a removed CUDA header) was already fixed in the item-11 purge.
All four _cpu examples build as g++ targets and validate CORRECT (dragon/iss/
space/torus, incl. the dilate->prune round-trip). Phase 4.7 is complete.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
The CUDA benchmark warm-loop labels were copy-paste leftovers: merge printed
"Re-running entire dilation after warmstart" and refine printed "... refinment"
(misspelled). Correct them to "merge" and "refinement". Label-only change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
The host Timer measured with high_resolution_clock but reported via an integer
duration_cast<milliseconds>, truncating the fraction (e.g. "48 milliseconds"),
while util::cuda::Timer prints fractional ms from cudaEventElapsedTime. In code
that interleaves host and device timings (e.g. the Sem3D benchmark) the two were
inconsistent.

Make the default reporting fractional: milliseconds() returns duration<float,
milli>, and stop()/restart()/elapsed() default their AccuracyT to
duration<double, milli> (with a label branch). Measurement is unchanged; explicit
stop<std::chrono::microseconds>()/<seconds>()/<milliseconds>() still work, so this
is backward compatible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Host counterpart of the CUDA util::cuda::InjectGridDataFunctor: copy sidecar
values from a source grid into the sidecar of an overlapping destination grid,
for the voxels active in both (destination voxels with no source counterpart are
left unchanged). Pairs with the existing host injectGridMask.

Parallelizes over source leaves with util::forEach; each task probes the
destination for an overlapping leaf (early-out if none) and copies the
intersection of their value masks. Source-leaf origins are unique, so each task
writes a distinct destination leaf -- no atomics.

Instead of calling getValue() per voxel (a per-voxel popcount, costly on CPU
without AVX-512), each word's base sidecar offset is re-derived from the leaf's
packed mPrefixSum via a one-step carry, and the within-word rank is advanced one
increment per active bit -- the serial analogue of the CUDA functor's warp-
cooperative srcOffset/srcCnt decomposition. Templated on a compile-time channel
offset to match the CUDA functor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants