[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps by sifakis · Pull Request #2186 · AcademySoftwareFoundation/openvdb

sifakis · 2026-03-27T17:04:49Z

Draft for CI and diff review. WIP.

Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as static __device__ members of VoxelBlockManager<Log2BlockWidth>. These implement the first phase of a two-function WENO5 stencil gather: resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided array with the 19 global sequential indices using WenoPt<i,j,k>::idx. voxelOffset arithmetic uses octal notation: NanoVDB leaf layout encodes (x,y,z) as x*64+y*8+z, so x/y/z strides are 0100/010/1 in octal. WenoPt<i,j,k>::idx is used throughout to remain independent of any future re-alignment with OpenVDB's NineteenPt (which uses a different convention). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and CUDA VoxelBlockManager implementations, along with design documentation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill its non-obvious design decisions into the knowledge base: - §11: decodeInverseMaps is intentionally single-threaded/stateless; caller distributes blocks; contrast with cooperative GPU version. - §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields; mPrefixSum is still used for the cross-word offset in Step 5. - §13: output fill is range-fill + contiguous copy (not scatter) because shuffleDownMask produces a sorted compacted array; std::fill/copy caveat for alignment when output arrays come from TLS or stack pointers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Implementation complete; design rationale distilled into VBMImplementationKnowledge.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Design reference for the per-block stencil gather kernel: decodes inverse maps into block-local scratch, then resolves neighbor leaf pointers and fills N-point stencil index arrays for all active voxels in the block. WENO5 (N=19, R=3) is the motivating instance; architecture is stencil-agnostic. Covers GPU inner loop, CPU SIMD batch design (SIMDw=16, probeLeaf dedup), unified StencilLeafPtrs template, and reach-R generalization considerations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

- §3: Stencil type as template parameter needs index→offsets direction (for-each-slot gather loop), not the offsets→index direction of WenoPt. Clarify relationship to BaseStencil/WenoStencil: geometry-only descriptor, no accessor coupling. - §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u); output is homogeneous std::array (not tuple); K=1 degenerates to scalar; SoA output layout results[k][BlockWidth] for SIMD efficiency. - Renumber §4-§8 → §5-§9; update open questions accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

…notes Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an SoA-wide version that loops over W lanes and is the auto-vectorization target. The motivating kernel is WENO5 normSqGrad (19-point stencil, matching WenoStencil::normSqGrad from Stencils.h). The six weno5() calls vectorize cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways depending on how it is written: 1. std::max / bool isOutside ternaries → "control flow in loop" 2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to GCC's inability to see through std::tuple's recursive-inheritance struct layout in GIMPLE alias analysis INVESTIGATION.md documents all experiments, findings, current blockers, and proposed next steps (pointer-cache approach, Clang comparison, etc.). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Evangelos Sifakis <esifakis@gmail.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal header-only SIMD abstraction backed by std::array<T,W> with arithmetic operators, SimdMask, min/max, and where(). Mirrors the C++26 std::simd interface for forward compatibility. Rewrite the WENO5 normSqGrad kernel as a template on T: - T=float : scalar __hostdev__ path for GPU (one thread per voxel) - T=Simd<float,W> : W-wide CPU path (one call per batch) A single templated godunovsNormSqrd + normSqGrad definition serves both execution contexts with no #ifdef, structurally matching Stencils.h. Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions in the hot function, assembly-verified); GCC 13 does not. Update INVESTIGATION.md with the full scoreboard, both approaches, and next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration). Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and __cpp_lib_experimental_parallel_simd. When available, Simd<T,W> and SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask, delegating all arithmetic to the standard type. The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into the 3-arg select(mask, a, b) form expected by the kernels. Verified with clang++-18 -std=c++26: both paths produce identical assembly (1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes through the wrapper completely. Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Document the std::experimental::simd backend alongside the std::array default, including the TS v2 where() adaptation, the auto-detection mechanism, and the assembly comparison showing byte-for-byte identical output between the two backends under Clang 18. Update the vectorization results table and open questions accordingly. Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

StencilKernel.h — new prototype header: - BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling - WenoStencilKernel<T>: derives from above, provides normSqGrad() - WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h - T=float for GPU scalar path, T=Simd<float,W> for CPU batch path lift_test.cpp — rewritten to use WenoStencilKernel<T> directly: - SIMD and scalar reference paths both instantiate the same class - dx passed to constructor; mValues populated via operator[] Simd.h — refinements: - Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct) - element_aligned_tag / element_aligned: portable load/store tag, always present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B - Backend B load constructor and store() accept element_aligned_tag (defaulted) - NANOVDB_NO_STD_SIMD opt-out flag to force Backend B INVESTIGATION.md — updated: - Approach B section updated to reflect class hierarchy instead of free functions - Backend B GCC note: the struct-access failure was specific to Approach A's liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize on GCC when used with the Generic-T class hierarchy - New ymm tables for both backends under GCC (Backend A: 1267 total, Backend B: 619 total); both pass correctness Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

…chPtrs) Add full CPU batch neighbor-leaf resolution design to the planning doc: - §6: Replace StencilLeafPtrs struct with layered design — shared 3×3×3 bit encoding for probedMask/ptrs[27], stencil-specific batchPtrs population (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for box stencil), and GPU scalar design note kept separate. - §8d: Update lazy-probe section to reference ptrs[27] and 27-bit probedMask; add batchPtrs population step (Phase 2) after the probeLeaf loop. - §8e: Update computeNeededDirs direction table to use 3×3×3 bit positions (bits 4,10,12,14,16,22 for WENO5 face neighbors). - §8f/§8g: Minor notation updates to match ptrs[27] naming. - §9: Resolve ptrs-layout and nExtraLeaves open questions; add prototype scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Standalone CPU-only executable that verifies the neighbor leaf resolution design from StencilGather.md §8d–§8f: - For each VBM block: calls decodeInverseMaps, recomputes nLeaves from jumpMap, then processes SIMDw=16 batches with the full probedMask / lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline. - Does not call computeStencil. Instead verifies batchPtrs against a direct probeLeaf reference for all 18 non-center WENO5 stencil offsets that cross leaf boundaries. - Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw> vector, moving sentinel/masking responsibility to the single gather site where leafMask is known: - kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr) - Caller broadcasts sentinel to all lanes, overwrites leafMask lanes with real expandVoxelOffset() values before calling computeNeededDirs - computeNeededDirs is now a pure add+reduce with no masking or cross-check Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all six WENO5 directions simultaneously. kExpandCarryK = 0x514530C3. AVX2 codegen confirmed via objdump: - computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm + vextracti128/vpsrldq shuffle-tree, no branches or calls in hot path - activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2 - Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm Always-on scalar cross-check at every computeNeededDirs call site. verifyComputeNeededDirsSentinel() tests both the sentinel carry property and the straddle-lane non-pollution scenario before runPrototype(). StencilGather.md §8e and §8f updated to match new API and codegen notes. Phase 1 prototype marked complete in §9. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor — developed from the ex_stencil_gather_cpu Phase 1 prototype discussion. Core concept: instead of caching the path to one leaf, cache the full 3×3×3 neighborhood of 27 leaf pointers around the current center leaf, serving SIMDw voxels per call. Key design elements documented: Eviction policy: fires on none_of(leafMask) only — straddle lanes do not evict. leafMask is the "partial-hit" signal with no scalar-accessor analog. Prefetch coverage argument: - WENO5 (R=3): 6 extremal taps (±R,0,0),(0,±R,0),(0,0,±R) are necessary and sufficient — equivalent to the computeNeededDirs carry trick - Box stencil (R=1): 8 corner taps (±1,±1,±1) collectively cover all 26 non-center directions for any voxel position in the batch Three-tier API: - prefetch<di,dj,dk>(vo, leafMask, treeAcc) - cachedGetValue<di,dj,dk>(vo, leafMask) — no treeAcc, cache assumed warm - getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style) Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit, dead-axis elimination, VDB convention alignment; runtime Coord overload provided for generic stencil adapters. AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm), gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar bottlenecks from Phase 1 prototype are eliminated. StencilGather.md: add cross-reference to BatchAccessor.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Introduces BatchAccessor<BuildT,ValueT,VoxelOffsetT,LeafIDT,PredicateT>, the SIMD analog of NanoVDB's ValueAccessor. Instead of caching the path to one leaf, it caches the 27-entry 3×3×3 neighbor pointer table around the current center leaf and serves a SIMD batch of LaneWidth voxels per call. Key design decisions -------------------- - Eager center: constructor and advance() populate mLeafNeighbors[13] directly (O(1), no probeLeaf), so cachedGetValue<0,0,0> is valid immediately and the probe loop never needs a center special-case. - SWAR neededMask: prefetch<di,dj,dk> expands the 9-bit voxel offsets into a 15-bit packed form (lz@[0:2], lx@[6:8], ly@[12:14]) using SIMD bitwise ops, then adds a compile-time packed stencil offset and checks carry bits for crossing detection. One vpaddw YMM instruction covers all 16 lanes; clang folds packed_d into the blend constants at compile time, reducing the expand+blend+add to 5 SIMD instructions. - Heterogeneous where: Simd.h gains where<T,U,W>(SimdMask<U,W>, ...) so a PredicateT=SimdMask<float,W> can gate a VoxelOffsetT=Simd<uint16_t,W> blend without explicit casting. Array backend uses a trivial bool loop; stdx backend converts via a bool[] round-trip. - Correctness verified in-process: stencil_gather_cpu.cpp integrates BatchAccessor as an alternate execution path and cross-checks all 18 WENO5 tap directions against direct tree references (12.3M lane checks). Simd.h additions (array backend) --------------------------------- - SimdMask<T,W>: converting constructor from SimdMask<U,W> - Simd<T,W>: operator|, &, ^, <<(Simd), >>(Simd) - where<T,U,W>: heterogeneous mask overload (both backends) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Simd.h: - Add scalar_traits<T> / scalar_traits_t<T>: extracts element type from plain scalars (identity) and Simd<T,W> (T); used by BatchAccessor static_asserts and as the shift-count type for the new uniform-shift ops. - Add Simd::operator<<(T) / operator>>(T): uniform scalar shift (all lanes by the same immediate). Maps to vpsllw imm8 / vpsrlw imm8 on x86 — distinguished from the existing per-lane Simd<<Simd overload which would require the nonexistent vpsllvw instruction for 16-bit lanes. BatchAccessor::prefetch SWAR section: - Replace hard-coded uint16_t/uint32_t casts with VoxelOffsetScalarT (= scalar_traits_t<VoxelOffsetT>) so the code is correct for any unsigned 16+-bit instantiation, not just Simd<uint16_t,W>. - Add class-level static_asserts (unsigned + sizeof >= 2) with explanatory messages referencing the SWAR carry-detection contract. - Remove static_assert(LaneWidth >= 16): was a performance aspiration, not a correctness requirement; SWAR works for any LaneWidth >= 1. - Use 'auto' for expanded / packed_lc / packed_sum (type already expressed by the initializer); keep explicit uint32_t for hor_or/hor_and/s where the width is a deliberate semantic choice. - Replace kMask15 hex literal 0x71C7u with 0b111'000'111'000'111u (binary makes the three 3-bit mask fields and three 3-bit gaps visually explicit). - Use vo << VoxelOffsetScalarT(9) (uniform shift) instead of vo << VoxelOffsetT(VoxelOffsetScalarT(9)) (broadcast-then-per-lane). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

danrbailey · 2026-04-16T05:28:55Z

I know that this is marked as not for review yet, but I just wanted to comment and say that specifically there is one aspect of this PR which is something I've wanted in OpenVDB for a long time. I brought this up in our TSC meetings again recently. A lot of the time when we use the ValueAccessor, it is not true random access, but stencil access patterns and the fact that the ValueAccessor is only able to store one leaf node means you end up thrashing the cache in the ValueAccessor when accessing different neighboring leaf nodes along leaf boundaries.

I have done a little experimentation in this area and the lazy leaf neighbor pattern that you use here is what I was envisioning, though I was contemplating whether a templated stencil argument (ie 6-neighbor vs 27-neighbor) may provide improved performance depending on the use case. I also hadn't settled on a name - NeighborAccessor, StencilAccessor, etc.

I would be very interested in comparing performance of a similar structure for OpenVDB that doesn't include the SIMD logic to compare with performance of the ValueAccessor (that also does not use SIMD). If it outperforms the ValueAccessor, I think we should look to switch some of our neighbor use cases across.

The other big issue with the ValueAccessor IMO is the tight coupling with the Grid and Tree (getValueAndCache() etc). This hurts impacts compile times and a new acceleration structure that extends the ValueAccessor in a new direction might be a good opportunity to revisit this and the attach/release accessor mechanism. I know NanoVDB took a slightly different approach on these trade offs.

(@kmuseth - a good topic for a future TSC meeting I think)

…lingZeros - Add 2-argument where(mask, target) = value proxy (stdx and array backends): stdx-style masked assignment; encourages GCC to emit vpblendvb - Add util::reduce(v, op) to both backends: tree-reduces to a scalar with std::bit_or<>{} / std::bit_and<>{} etc.; replaces scalar horizontal loop - Add scalar reduce(T, BinaryOp) identity overload for W=1 path - Add util::countTrailingZeros(uint32_t) to nanovdb/util/Util.h: __hostdev__, CUDA/HIP/__builtin_ctz/MSVC/De Bruijn dispatch; removes ad-hoc ctz from Simd.h - BatchAccessor: use 2-arg where for packed_lc blend, util::reduce for hor_or/and, util::countTrailingZeros in toProbe loop, hoist root ref out of loop body Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Replace the per-lane scalar loop in cachedGetValue with a fully SIMD gather chain that populates offsets, prefixSums, and maskWords without any scalar iteration over lanes. Pipeline (all in Simd<T,W>): packed_sum (uint16_t) → ×1129 → >>10 → &31 : d_vec (0..26), stays uint16_t — bits [10:14] of product lie below bit 16 so modular uint16_t multiply is exact → gather(mNeighborLeafIDs) : leaf_id_vec (uint32_t) → ×kStride : raw_idx (int32_t, null lanes → 0) → gather(offset_base) : mOffset per lane → gather(prefix_base) : mPrefixSum packed, then shift-extract field w → gather(mask_word_base + w) : valueMask().words()[w] per lane Switch mLeafNeighbors[27] (const LeafT*) to mNeighborLeafIDs[27] (uint32_t) with kNullLeafID = ~uint32_t(0) sentinel, enabling the flat-base SIMD gather pattern. prefetch and advance updated accordingly. Add simd_cast<DstT>(Simd<SrcT,W>) to Simd.h for widening (uint16_t → int32_t, uint16_t → uint64_t, uint32_t → int32_t) used in gather index construction. Debug cross-check (#ifndef NDEBUG) validates all three vectors against the scalar reference path; 12M+ lane checks pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

§8a was stale ("prefetch uses a scalar loop") — prefetch has no per-lane loop. Update with accurate description of its two-phase structure: - SWAR expansion + vpblendvb + vpaddw: fully in YMM across all LaneWidth lanes (vpsllw, vpor, vpand, vpblendvb, vpaddw) - Horizontal reduction: unavoidable vextracti128 + vpand/vpor tree to produce scalar hor_and / hor_or for the per-axis crossing decision Include the actual Release assembly (ex_stencil_gather_cpu, -O3 -mavx2) confirming the YMM path survived the mLeafNeighbors → mNeighborLeafIDs encoding change. §8c: update nullptr sentinel language to reflect kNullLeafID / valid_u32 mask in the SIMD gather chain. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

- Promote SWAR encoding literals to class-scope uint16_t constants: kSwarXZMask (0x1C07), kSwarYMask (0x00E0), kSwarSentinel (4|4<<5|4<<10). Shared between prefetch and cachedGetValue; implicit conversion to VoxelOffsetT at Simd construction time. - Add direction-extraction local constants in cachedGetValue: kSwarCarryMask (0x6318), kDirMul (1129), kDirMask (31). - Rename w_vec -> wordIndex for clarity. - Move U64Traits alias inside #ifndef NDEBUG where it is only used. - Replace non-ASCII characters (em dashes, arrows, element-of, comparison operators) with ASCII equivalents throughout. - Add explicit parentheses around the d_u16 shift-then-mask chain to make operator precedence unambiguous. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

…liases Simd.h: - Add scalar simd_cast<DstT>(SrcT) overload (degrades to static_cast). - Add scalar 2-arg where(bool, T&) masked-assignment proxy matching the SIMD WhereExpression form. - Add scalar gather(const T*, int32_t) and gather_if(T&, bool, ...) to complete the scalar overload set alongside the existing where(bool,T,T). BatchAccessor.h: - Remove LeafIDT template parameter (was reserved/unused; now derived). - Add private class-scope LeafIDVecT and LeafDataVecT using conditional_t: plain uint32_t/uint64_t when LaneWidth==1, Simd<T,W> otherwise. This upholds the convention that scalar instantiations use underlying types directly rather than Simd<T,1> wrappers. - Replace local U32T/U64T aliases in cachedGetValue with class-scope names. stencil_gather_cpu.cpp: - Drop the now-removed LeafIDT argument from the BAccT instantiation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

cachedGetValue is now fully vectorised end-to-end for ValueOnIndex grids. The scalar leaf->getValue(offset) loop is removed; result is filled via a 2-arg where(isActive, result) = ... directly on the output argument so that leafMask-clear and inactive-voxel lanes are never touched. Key design decisions recorded in BatchAccessor.md §8e: - tapLeafOffset_i64 widened to int64_t before *= kStride to avoid uint32_t overflow (kNullLeafID = 0xFFFFFFFF causes wild gather indices in uint32_t). simd_cast_if<int64_t>(dst, valid_u32, src) writes 0 for invalid lanes, keeping gather indices non-negative for vpgatherqq (signed int64_t). - gather_if gains a MaskElemT template parameter to support heterogeneous masks: valid_u32 (SimdMask<uint32_t,W>) applied to uint64_t data fields. - Activity check: ValueOnIndex::getValue returns 0 for inactive voxels. Detected as isActive = (maskWords & (1<<dest_yz)) != 0. Null-leaf lanes have maskWords=0 and are therefore implicitly handled by the same check. - popcount uses a SWAR shift-and-add tree (popcount64 in Simd.h) rather than __builtin_popcountll: AVX2 has no 64-bit lane-wise popcount (VPOPCNTQ is AVX-512DQ only), and the scalar popcnt instruction is not vectorisable. - mOffsetBase, mPrefixBase, mMaskWordBase promoted to class-level const pointers, computed once in the constructor, shared across all 18 cachedGetValue instantiations in a WENO5 stencil gather. Verified: debug build (-O0) + release build (-O3), 12 321 275 lane checks across all 18 WENO5 non-center taps, zero mismatches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Add §8f with assembly comparison between GCC 13 and Clang 18 (-O3 -DNDEBUG -march=native, cachedGetValue<1,0,0>, W=16): - GCC does not inline any Simd.h helper (gather_if, simd_cast, where, popcount); emits 14 out-of-line calls and 13 vzeroupper transitions. gather_if body uses scalar vmovq/vpinsrq/vinserti128 — no hardware gathers. - Clang inlines everything except popcount; emits 2 vpgatherdd + 12 vpgatherqq hardware gather instructions and only 2 vzeroupper transitions. popcount body (88 ymm instrs, pure SWAR vpsrlq/vpand/vpaddq) is also fully vectorized but remains out-of-line. - 43 vpinsrb in Clang output are mask-widening cost for heterogeneous gather_if (SimdMask<uint32_t,16> → 4x SimdMask<uint64_t,4>). Action item added to §10: [[gnu::always_inline]] on Simd.h helpers would eliminate the GCC regression and fold popcount inline in both compilers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

…adeoffs Expand §8f with analysis of three popcount strategies for the 16-lane uint64_t case: - SWAR shift-and-add tree (current): 88 instructions, all AVX2-friendly but port-intensive; stays in vector registers throughout. - Scalar popcnt with extract/reassemble: ~56 instructions; popcnt is pipelined (1/cycle throughput on port 1, 3-cycle latency) so 16 independent lanes retire in ~16 cycles, but the ymm↔GPR domain crossing adds ~2 cycles bypass latency per extraction and port 1 serialises all 16 popcnts. - vpshufb nibble-LUT + vpsadbw (recommended): ~40 instructions, no domain crossing, uses ports 0/5 and 5 (orthogonal to SWAR ports), standard compiler-generated AVX2 popcount pattern. Added as action item in §10. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Adds StencilAccessor<BuildT,W,StencilT> — a compile-time-parameterized SIMD wrapper around BatchAccessor that owns the straddling loop, hull-prefetch sequencing, and per-tap cachedGetValue blending for one VBM block. Output is mIndices[SIZE] of Simd<uint64_t,W>, one vector per stencil tap. Includes Weno5Stencil (18 taps, 6-tap hull) and the findIndex constexpr fold for compile-time getValue<DI,DJ,DK>() inverse-map lookup. Also: - BatchAccessor.h: add centerLeafID() getter used by StencilAccessor - BatchAccessor.md: expand §8f assembly matrix (compiler × backend × ISA) - stencil_gather_cpu.cpp: wire StencilAccessor into runPrototype with verifyStencilAccessor correctness checks; add rdtsc-based runPerf - Simd.h: remove = {} default argument from element_aligned_tag overloads - CLAUDE.md: add project build/architecture reference for Claude Code Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

…teLeafNodesFunctor<NN_FACE_EDGE_VERTEX> The NN_FACE_EDGE_VERTEX leaf functor accessed its [10][3][3] word stencil through a reinterpret_cast view anchored mid-array at [1][1][1] with negative indices. Those negative logical indices form addresses before the view object's base -- out-of-object pointer arithmetic (UB). It is correct on-device today, but GCC at -O2/-O3 exploits the UB in the read-modify-write z/y/x passes and miscompiles it (reproduced standalone: -O0 matches, -O2/-O3 diverge in ~44% of random trials), making it a latent landmine if ever compiled host-side -- which the host port now does. Replace the anchored view with a +1-biased index accessor (originalWordsShifted[i+1][j+1][k+1]), matching the host util::morphology::DilateLeafNodes fix. Every access now lands within the real array's [0,10)x[0,3)x[0,3) bounds; no out-of-object arithmetic, correct at all optimization levels. This is the only morphology functor with the before-base pattern; the other two reinterpret views (MaskShift's [2][8][4] and DilateInternalNodesFunctor's [3][3][3]) anchor at element [0] with in-bounds indices and are unaffected. Verified: ex_dilate_nanovdb_cuda (DilateGrid + PruneGrid) CORRECT on-device for NN_FACE_EDGE_VERTEX across dragon/armadillo/iss/space. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…__ helper Record a future-refactoring note (§7.9): since DilateLeafNodes is thread-centric (no warp/CTA cooperation) and its primitives are already __hostdev__, the per-leaf body could be a single __hostdev__ helper shared by the host forEach and the CUDA kernel -- removing the duplication that just forced the [10][3][3] stencil UB fix to be applied twice (host 0110561, cuda c199fcb). DilateInternalNodes is explicitly not a candidate (warp-cooperative MaskShift/WarpReduce). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Phase 4.8 is now in progress: DilateGrid's two compute-bearing operator methods (dilateInternalNodes, dilateLeafNodes) are ported to host and validated end-to-end against the OpenVDB dilateActiveValues reference for both NN_FACE and the full NN_FACE_EDGE_VERTEX stencil. §7.7 rewritten with a per-method DilateGrid table, the new host pieces (shuffleDown/Up, MaskShift, DilateInternalNodes, DilateLeafNodes) and their validation, and the remaining work (processGridTreeRoot/dilateRoot host conversion, then the dilate->prune round-trip once PruneGrid lands). Coarsen/Refine/Prune still pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…idTreeRootFunctor) Verbatim reuse of the MergeGrids host version (7e5f965): replace the cudaMemcpyDeviceToDevice of the source GridData with a host std::memcpy (both ends host-accessible) and call BuildGridTreeRootFunctor directly on host instead of via a single-element lambdaKernel, behind a leading stream-sync that drains getBuffer's output-buffer zero-fill. Add <cstring>. This was the last kernel launch in DilateGrid.h; the operator is now kernel-free, at parity with MergeGrids.h. dilateRoot still does its D2H source copy (a shared Phase 4.4/4.7 cleanup item, pending for mergeRoot too). Validated CORRECT against the OpenVDB dilateActiveValues reference across dragon/armadillo/iss/ space (NN_FACE_EDGE_VERTEX). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rid (drop D2H copy) Remove the HostBuffer + cudaMemcpyAsync D2H copy of the source RootNode and Upper Nodes; read them directly from the managed (UnifiedBuffer) source grid instead. This is the same managed-access assumption dilateInternalNodes/dilateLeafNodes already rely on, and the source pages are drained by the stream-sync at the top of getHandle. getChild() resolves its relative offsets within the same managed grid, so the uppers remain reachable. DilateGrid.h is now free of D2H copies and kernel launches; the only remaining device op is the output mProcessedRoot.deviceUpload (the shared Phase 4.4 dual-mode-buffer item, pending for mergeRoot too -- where the analogous source D2H copy can be dropped the same way). Validated CORRECT against the OpenVDB dilateActiveValues reference across dragon/armadillo/iss/ space (NN_FACE_EDGE_VERTEX). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nctors The eight topology::detail functor operator()s (BuildGridTreeRoot, UpdateLeafVoxelCountsAndPrefixSum, UpdateLeafVoxelOffsets, UpdateAndPropagateLeafBBox, PropagateLowerBBox, PropagateUpperBBox, UpdateRootWorldBBox, PostProcessGridTree) were __hostdev__ during the transition while some operators still launched them as kernels. Now that every caller is host (util::forEach bodies and direct host calls in TopologyBuilder's methods, plus BuildGridTreeRootFunctor invoked directly by Merge/Dilate processGridTreeRoot) and TopologyBuilder.h has no kernel launches, these are host-only and the __hostdev__ is dropped. These are the host tools::topology::detail functors -- distinct from the CUDA tools::cuda::topology::detail ones the device operators use, so nothing calls them from device. The Data POD accessors (getGrid/getTree/... ) keep __hostdev__ (lower-level, harmless). Both examples still build and check CORRECT (dilate dragon; merge dragon+armadillo). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ilateRoot host + __hostdev__ cleanup Update the DilateGrid table: all four operator methods are HOST now -- processGridTreeRoot (ce25500) and dilateRoot (38fa025, source read directly from the managed grid, D2H copy dropped) joined dilateInternalNodes/dilateLeafNodes. DilateGrid.h has no kernel launches and no D2H copies; only the shared Phase 4.4 mProcessedRoot.deviceUpload remains. Also note the TopologyBuilder detail functors dropping __hostdev__ (517f94b). Phase 4.8 line updated to "DilateGrid kernel-free, at merge parity". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…anovdb_cpu Port PruneGrid following the DilateGrid playbook. Both functors are thread-centric (no warp cooperation), ported to util/Morphology.h: - PruneInternalNodes: per source leaf, retain iff its value mask intersects the leaf-mask sidecar; set the pruned upper/lower child masks via setOnAtomic (sibling leaves race). - PruneLeafMasks: per source leaf, set the retained output leaf's value mask to srcValueMask & sidecar; flat over source leaves, distinct output leaves, no atomics. tools/PruneGrid.h is a host operator modeled on the now-clean DilateGrid.h: reuses the host TopologyBuilder, UnifiedBuffer getHandle default + ScratchBufferT alias, reads the source root/grid/mask directly from managed memory (no D2H copies), and runs pruneInternalNodes, processGridTreeRoot, and pruneLeafNodes on host. No kernel launches in PruneGrid.h; only the shared Phase 4.4 mProcessedRoot.deviceUpload remains. ex_dilate_nanovdb_cpu now does the dilate->prune round-trip: inject the original topology as a UnifiedBuffer leaf-mask sidecar (InjectGridMaskFunctor kept a CUDA kernel -- an example helper, not part of PruneGrid), then host-prune the dilated grid back and check against the original. Validated CORRECT (DilateGrid + PruneGrid) across dragon/armadillo/iss/space for NN_FACE_EDGE_VERTEX; merge unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Switch the example's default stencil from NN_FACE_EDGE_VERTEX to NN_FACE (matching the CUDA ex_dilate_nanovdb_cuda default). The full dilate->inject->prune round-trip checks CORRECT for both DilateGrid and PruneGrid across dragon/armadillo/iss/space at NN_FACE, complementing the earlier NN_FACE_EDGE_VERTEX validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a PruneGrid section: all four operator methods HOST, kernel-free, both functors thread-centric (PruneInternalNodes retain-check + setOnAtomic; PruneLeafMasks per-leaf intersection). Document the leaf-mask sidecar and the dilate->inject->prune round-trip in ex_dilate_nanovdb_cpu (CORRECT for Dilate+Prune across the geometry set at both stencils; example now defaults to NN_FACE). Update the dilate example note (round-trip now present), trim "Still to port" to Coarsen/Refine, and update the Phase 4.8 line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Buffer (cleanup step 1) The mask/offset/parent scratch (ScratchBufferT) is written and read entirely on the host now -- no device kernel touches it -- so it becomes plain HostBuffer: - ScratchBufferT = nanovdb::HostBuffer (+ <nanovdb/HostBuffer.h>, <cstring>) - ScratchBufferT::create(size, nullptr, device, stream) -> create(size); drop the now-unused cudaGetDevice(&device) locals - mask zero-fills cudaMemsetAsync(.deviceData()) -> std::memset(.data()) - EnumerateNodes fed mUpperMasks.data()/mLowerMasks.data(); deviceUpperMasks()/deviceLowerMasks() accessors deleted (their only host caller was countNodes); d_upperOffsets <- mUpperOffsets.data() Self-contained: mData and mProcessedRoot stay DeviceBuffer, the output grid buffer stays UnifiedBuffer, the stream param and the examples (still .cu, cudaMemcpy bufferCheck) are unchanged. The remaining cudaStreamSynchronize drains still cover the device-side mData/mProcessedRoot uploads and the output-buffer memset (migrated in later steps). Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune round-trip (dragon/armadillo/iss/space, NN_FACE). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…p 2) mProcessedRoot is transient host scratch (the speculative root, discarded after the tree is built), so it now shares the ScratchBufferT (= HostBuffer) storage policy: - TopologyBuilder: mProcessedRoot member -> ScratchBufferT; ScratchBufferT moved to the public section so the operators (which own the <op>Root allocation) share this single definition; delete the now-unused deviceProcessedRoot() accessor; mProcessedRoot.clear(stream) -> clear(). - Merge/Dilate/Prune: alias ScratchBufferT from the builder (typename TopologyBuilder<BuildT>:: ScratchBufferT) rather than re-declaring it; allocate mProcessedRoot via ScratchBufferT::create (was DeviceBuffer::create); drop the trailing deviceUpload and the now-unused cudaGetDevice local. The host-side speculative-root tile-fill is unchanged. Self-contained: mData stays DeviceBuffer, the output grid buffer stays UnifiedBuffer, the examples are untouched. The host path read mProcessedRoot via hostProcessedRoot() (.data()) already, so no reader changes; the CUDA operators use their own tools::cuda::TopologyBuilder, unaffected. Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune round-trip (dragon/armadillo/iss/space, NN_FACE). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… directly d_upperOffsets was a kernel-era smuggling channel: it cached the mUpperOffsets scratch pointer into the Data POD so the formerly-kernel processUpperNodes could reach it through its Data* argument. processUpperNodes is now an inline forEach lambda in TopologyBuilder with direct member access, so it reads mUpperOffsets.data() straight, and the field (plus its getBuffer set site) is removed. Its sole reader was processUpperNodes; mUpperOffsets is allocated in countNodes and not cleared before processUpperNodes runs, so the direct read is valid. Data is now {d_bufferPtr, grid..size byte offsets, nodeCount[3]} + the get* accessors. d_bufferPtr stays -- it's the only handle the detail functors (handed just a Data*) have to the output grid. Validated CORRECT: merge (dragon+armadillo), dilate+prune (dragon, iss). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nup step 3) mData is a single small POD that is written once (countNodes + getBuffer) and thereafter only read on the host; its device copy had no readers (the deviceData() accessor was dead, all kernels gone), so it was a pure deviceUpload staging target. Collapse it to a plain `Data mData{};` member: - drop the DeviceBuffer allocation in the constructor and the mData.deviceUpload in getBuffer - delete the deviceData() accessor; data() now returns &mData - refresh the processUpperNodes stream-sync comment (it now drains only the output-buffer memset) No host/device divergence existed to preserve (verified earlier). The output grid buffer stays UnifiedBuffer and the examples are untouched; d_bufferPtr still holds the (managed) output base, host-derefable as before. Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune round-trip (dragon/armadillo/iss/space, NN_FACE). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rewrite §7.3 as the shared host-only cleanup tracker (Merge/Dilate/Prune + TopologyBuilder), targeting HostBuffer directly (UnifiedBuffer was transitional). Record done: scratch -> HostBuffer (db14618), mProcessedRoot -> HostBuffer with the ScratchBufferT alias now owned by the builder and aliased by the operators (56f480b), mData -> plain Data member (5fee27c) + d_upperOffsets removed (82f71cd), updateChecksum -> host (333c942). Remaining = Step 4: output grid buffer -> HostBuffer, inputs host-resident (getTreeData/mergeRoot D2H, drop stream param), residual stream-sync drains, dead code (CALL_CUBS/mTempDevicePool/cub include), and the .cu->.cpp example milestone (Dilate needs a host InjectGridMask). Fix the stale mergeRoot note in §7.2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nd-agnostic (step 4, merge path) Make the output grid buffer host-allocable and flip the merge chain to host-resident inputs+output: - TopologyBuilder::getBuffer is now backend-agnostic and CUDA-free: BufferT::create(size, &pool) + std::memset(buffer.data()) + d_bufferPtr = buffer.data(). Works for HostBuffer (merge) and UnifiedBuffer (dilate/prune still default to it) alike -- UnifiedBuffer's create(size,ref) places it CPU-side and data() is host-writable -- with no behavior change for dilate/prune. - MergeGrids: getHandle default BufferT -> HostBuffer; getTreeData (D2H via DeviceGridTraits) -> direct host read (mSrcTreeData = srcGrid->tree()); mergeRoot reads the source RootNodes directly from the host grids (drop the D2H copies + their HostBuffer staging). - ex_merge example: all grids HostBuffer (no deviceUpload/deviceGrid); the result is read via handle.grid<BuildT>() and checked with std::memcmp; mainMergeGrids signature simplified to the three host grid pointers. This eliminated the GridHandle<UnifiedBuffer> link dependency the .cpp previously had (the source grids were UnifiedBuffer). MergeGrids.h still carries CUDA includes (mTimer, cudaStreamSynchronize, cub) so the example stays a .cu for now; stripping those + the .cu->.cpp completion are the next sub-steps. Validated CORRECT: merge across dragon/armadillo/iss/space (incl. reversed); dilate+prune (dragon, iss) unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ated in-place) Port InjectGridMaskFunctor (the thread-centric, non-warp-cooperative injector used by the dilate->prune round-trip) to a host util::injectGridMask -- a util::forEach over destination leaves, each writing its own sidecar entry (src.valueMask & dst.valueMask), no atomics. Lives in util/Injection.h, parallel to util/cuda/Injection.cuh. Swap the ex_dilate example's InjectGridMask CUDA kernel for the host call. Everything else stays UnifiedBuffer (managed = host-accessible), so this validates the host inject in isolation before any HostBuffer flip: dilate -> host inject -> prune checks CORRECT (Dilate + Prune) across dragon/armadillo/iss/space at NN_FACE. This removes the last kernel launch from the dilate example's mainDilateGrid; the remaining CUDA there is the operator headers' includes (and getTreeData), to be shed when DilateGrid/PruneGrid flip to HostBuffer and the example goes .cu -> .cpp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ilate path) Mirror the merge HostBuffer flip for the dilate->inject->prune chain: - DilateGrid/PruneGrid: getHandle default BufferT -> HostBuffer; getTreeData (D2H via DeviceGridTraits) -> direct host read (mSrcTreeData = srcGrid->tree()). The <op>Root methods already read the managed source directly, so no other source reads remain. - ex_dilate example: all grids HostBuffer (no deviceUpload/deviceGrid); dstLeafCount via host dstGrid->tree().nodeCount(0); the leaf-mask sidecar is a HostBuffer; the host injectGridMask, PruneGrid, and the Dilate/Prune checks all operate on host grids (std::memcmp). mainDilateGrid simplified: the original source doubles as the prune reference (one srcGrid param). getBuffer (backend-agnostic since the merge step) yields HostBuffer output here too. DilateGrid.h/ PruneGrid.h still carry CUDA includes (mTimer/cudaStreamSynchronize/cub) so the example stays a .cu; that and the .cu->.cpp completion are the shared next step. Validated CORRECT (DilateGrid + PruneGrid) across dragon/armadillo/iss/space at NN_FACE, and dragon/space at NN_FACE_EDGE_VERTEX. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nd examples) Replace the CUDA event timer (util::cuda::Timer, constructed with a stream) with the host util::Timer (util/Timer.h; same start/restart/stop API, no stream) in MergeGrids/DilateGrid/ PruneGrid: include swap, member type, and drop mTimer(stream) from the ctor init lists. The example gpuTimers (which got util::cuda::Timer transitively via the operator headers) likewise swap to util::Timer. One of the operator headers' CUDA dependencies removed; remaining are cub, util/cuda/Morphology.cuh, util/cuda/DeviceGridTraits.cuh, util/cuda/Util.h, and the cudaStream_t/cudaStreamSynchronize plumbing -- the rest of the CUDA-strip. Validated CORRECT: merge (dragon+armadillo) and dilate+prune (dragon). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Every operation in the host operators + TopologyBuilder is now synchronous host code (std::memset, util::forEach, host checksum) -- there is no async op, kernel launch, cudaMemcpy*, or deviceUpload left -- so the cudaStreamSynchronize calls were draining an empty default stream and the whole cudaStream_t/mStream apparatus was vestigial. Remove it: - TopologyBuilder: drop the cudaStream_t param from every method (allocateInternalMaskBuffers, countNodes, getBuffer, processUpperNodes/LowerNodes/LeafOffsets, processBBox, postProcessGridTree) and the constructor; delete all cudaStreamSynchronize calls and the (void)stream. - Merge/Dilate/Prune: drop the stream ctor param, the mStream member, every cudaStreamSynchronize, and pass no stream to the mBuilder.* calls; default-construct mBuilder. - Remove the now-orphaned "Drain upstream ..." comments and stale @param stream / "Copy TreeData from GPU->CPU" doxygen. Purely mechanical: behavior is identical (verified there was nothing stream-ordered to preserve). The dead CALL_CUBS macro + mTempDevicePool (which still name `stream` in an unexpanded macro) and the remaining cub/util/cuda includes are the next dead-code/include pass. Validated CORRECT: merge (dragon+armadillo, iss+space) and dilate+prune (all four geometries). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ng work Update §7.3: the whole data path is now host and CUDA-free in behavior -- getBuffer backend-agnostic (8d8993e), all three operators default to HostBuffer + host-resident inputs (8d8993e, 7eb3a50), host InjectGridMask (7931906), host Timer (dae9740), and the cudaStream_t plumbing ripped out (d5a962b). Remaining is cosmetic/build-capability only: dead code (CALL_CUBS/mTempDevicePool/mNumThreads), include purge (cub, util/cuda/*, cuda buffers), and the .cu->.cpp completion (which also proves CUDA-freeness under plain g++). Phase tracker: 4.4 done for Merge/Dilate/Prune, 4.7 in progress. CoarsenGrid/RefineGrid remain unported and will be written clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add §7.8.1: warm-state host timings vs the OpenVDB ground-truth reference and the nanovdb CUDA path. Host merge is ~8-13x faster than OpenVDB's deepCopy+compSum; host dilate is on par with dilateActiveValues (and compute-only already ~4x faster, alloc-bound). Notes the FloatGrid-vs- OnIndexGrid caveat and that the GPU stays ~30-50x ahead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

CoarsenInternalNodesFunctor and RefineInternalNodesFunctor named their processed-root parameter and local `prunedRoot`/`prunedTile`, pasted from PruneInternalNodesFunctor. They operate on the coarsened/refined root, not a pruned one. Rename to coarsenedRoot/coarsenedTile and refinedRoot/refinedTile respectively (PruneInternalNodesFunctor's prunedRoot is correct, left as-is). Pure rename, no behavior change; the CUDA coarsen/refine examples still build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Port the CUDA-only CoarsenGrid to a host-only operator at tools/CoarsenGrid.h (namespace nanovdb::tools), following the prune/dilate playbook: it rides the already-host TopologyBuilder pipeline unchanged and adds only its own functors. - util/Morphology.h: host CoarsenInternalNodes (every non-empty source leaf maps via coarsenCoord into the coarsened topology; upper/lower masks set via setOnAtomic) and CoarsenLeafMasks (coarsenMask word arithmetic + util::atomicOr scatter, since up to eight source leaves coarsen into one destination leaf). Near-verbatim transcriptions of the CUDA functors, wrapped in util::forEach. - tools/CoarsenGrid.h: all four operator methods on the host, no kernel launches and no D2H copies (only the shared mProcessedRoot.deviceUpload remains). coarsenRoot reads the source root directly from the host-resident grid. - ex_coarsen_nanovdb_cpu: HostBuffer example checking against an OpenVDB reference; CORRECT across dragon/iss/space/torus. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Port the CUDA-only RefineGrid to a host-only operator at tools/RefineGrid.h (namespace nanovdb::tools), completing the host port of all five topology operators (Merge/Dilate/Prune/Coarsen/Refine). Like coarsen it rides the already-host TopologyBuilder pipeline and adds only its own functors. - util/Morphology.h: host RefineInternalNodes (detect present octants from the source value-mask words, then refineCoord + setOnAtomic) and RefineLeafMasks (refineMask bit-spread into up to eight distinct destination leaves, no atomics). Near-verbatim transcriptions of the CUDA functors via util::forEach. - tools/RefineGrid.h: all four operator methods on the host, no kernel launches and no D2H copies (only the shared mProcessedRoot.deviceUpload remains). refineRoot reads the source root and upper nodes directly from the host-resident grid for its bbox-based 26-connected octant speculation. - ex_refine_nanovdb_cpu: HostBuffer example checking against an OpenVDB reference (TBB VoxelRefiner); CORRECT across dragon/iss/space/torus. - TopologyCpuPortPlan.md: record CoarsenGrid/RefineGrid as ported (7.7) and mark Phase 4.8 compute-complete for all five operators. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Phase 4.7 cleanup (plan items 10-11): the topology operators' data path has been host-only for a while, but the headers still carried inert CUDA artifacts. Remove them now (no behavior change; still compiled via the example .cu files until the item-12 .cu->.cpp flip). - TopologyBuilder.h: drop the dead mNumThreads/numBlocks, mTempDevicePool, and the CALL_CUBS macro + its trailing #undef (all named the long-gone stream and were inert/unexpanded). Drop the cuda/{TempPool,DeviceBuffer,UnifiedBuffer}.h and util/cuda/Morphology.cuh includes (EnumerateNodes/ProcessLowerNodes come from the host util/Morphology.h). - MergeGrids/DilateGrid/PruneGrid/CoarsenGrid/RefineGrid.h: drop the per-operator mNumThreads/numBlocks and the cub/cub.cuh, cuda/UnifiedBuffer.h, util/cuda/{DeviceGridTraits.cuh,Morphology.cuh,Util.h} includes. Add the direct <map> and host util/Morphology.h includes the operators had been getting transitively through the removed CUDA headers. ScratchBufferT is HostBuffer, so UnifiedBuffer/DeviceBuffer/cub were unused as types; NearestNeighbors comes from util/MorphologyHelpers.h (kept). All four _cpu examples build and validate CORRECT across dragon/iss/space/torus. The \warning "include only from .cu files" doxygen notes are now stale but kept until item 12 (.cu->.cpp), which removes them and proves CUDA-freedom under g++. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Phase 4.7 completion signal (plan item 12): make the host topology operators prove their CUDA-freedom by compiling under plain g++. - Rename each example's *_nanovdb_cpu_kernels.cu -> *_nanovdb_cpu_benchmark.cpp (git mv). The "kernels" name was stale (no kernels remain); the file is the operator-driving/benchmark half. The nanovdb_example CMake function globs *.cpp and only invokes nvcc when a *.cu exists, so the rename alone turns each _cpu example into a pure C++ target (verified: both TUs build as CXX, no CUDA object). - Drop the now-unused cuda/DeviceBuffer.h (+ cuda/UnifiedBuffer.h) includes from the example main .cpp files; everything uses HostBuffer. - Remove the now-stale "\warning ... include only from .cu files" doxygen notes from all six headers (TopologyBuilder + the five operators), and tidy the two "(host-side port in progress)" briefs to "(host-side port)". This compile under g++ (no __CUDACC__) is what proves the headers are CUDA-free; nvcc's host pass is more permissive. The one dependency it had masked (<map>, transitively via a removed CUDA header) was already fixed in the item-11 purge. All four _cpu examples build as g++ targets and validate CORRECT (dragon/iss/ space/torus, incl. the dilate->prune round-trip). Phase 4.7 is complete. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

The CUDA benchmark warm-loop labels were copy-paste leftovers: merge printed "Re-running entire dilation after warmstart" and refine printed "... refinment" (misspelled). Correct them to "merge" and "refinement". Label-only change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

The host Timer measured with high_resolution_clock but reported via an integer duration_cast<milliseconds>, truncating the fraction (e.g. "48 milliseconds"), while util::cuda::Timer prints fractional ms from cudaEventElapsedTime. In code that interleaves host and device timings (e.g. the Sem3D benchmark) the two were inconsistent. Make the default reporting fractional: milliseconds() returns duration<float, milli>, and stop()/restart()/elapsed() default their AccuracyT to duration<double, milli> (with a label branch). Measurement is unchanged; explicit stop<std::chrono::microseconds>()/<seconds>()/<milliseconds>() still work, so this is backward compatible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

Host counterpart of the CUDA util::cuda::InjectGridDataFunctor: copy sidecar values from a source grid into the sidecar of an overlapping destination grid, for the voxels active in both (destination voxels with no source counterpart are left unchanged). Pairs with the existing host injectGridMask. Parallelizes over source leaves with util::forEach; each task probes the destination for an overlapping leaf (early-out if none) and copies the intersection of their value masks. Source-leaf origins are unique, so each task writes a distinct destination leaf -- no atomics. Instead of calling getValue() per voxel (a per-voxel popcount, costly on CPU without AVX-512), each word's base sidecar offset is re-derived from the leaf's packed mPrefixSum via a one-step carry, and the within-word rank is advanced one increment per active bit -- the serial analogue of the CUDA functor's warp- cooperative srcOffset/srcCnt decomposition. Templated on a compile-time channel offset to match the CUDA functor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

sifakis changed the title ~~NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps~~ [WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps Mar 30, 2026

sifakis force-pushed the vbm-cpu-port branch from 249e657 to 294f604 Compare April 2, 2026 18:44

sifakis and others added 2 commits April 2, 2026 13:46

sifakis force-pushed the vbm-cpu-port branch from 294f604 to c9906f4 Compare April 2, 2026 18:46

sifakis and others added 15 commits April 2, 2026 13:50

ex_voxelBlockManager_host_cuda: remove DecodeInverseMapsCPUPlan.md

9a31aed

Implementation complete; design rationale distilled into VBMImplementationKnowledge.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>

sifakis and others added 9 commits April 16, 2026 10:33

sifakis and others added 30 commits June 11, 2026 17:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps#2186

[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps#2186
sifakis wants to merge 110 commits into
AcademySoftwareFoundation:masterfrom
sifakis:vbm-cpu-port

sifakis commented Mar 27, 2026

Uh oh!

danrbailey commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sifakis commented Mar 27, 2026

Uh oh!

danrbailey commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants