C++ search by ms609 · Pull Request #210 · ms609/TreeSearch

ms609 · 2026-03-19T18:56:59Z

other optimizations + features

Manual testing underway; shiny app in particular has some usability issues.

TNT is 32-bit i386 with zero SIMD and 64KB LUT popcount. TreeSearch has ~4x throughput advantage (128-bit SSE2). TNT's 3-5x convergence speed is strategic, not implementation. Added T-249..T-253 investigation tasks to to-do.md.

Same approach as TreeDist::popcnt64: emit popcnt instruction via inline asm on GCC/Clang x86-64, __popcnt64 on MSVC. Software Hamming weight fallback for non-x86-64 platforms. CRAN-compatible (no compile flag change). Old: __builtin_popcountll → compiler emits ~10-instruction shift-mask New: single popcnt instruction (92 occurrences in compiled DLL) Also updated T-251 description to focus on candidates-per-improvement.

The TNT download page labels the Windows build as '[32 bits]'. The ~4x throughput advantage finding applies only to the Windows 32-bit binary and should not be generalized to Hamilton (Linux 64-bit) benchmarks.

… identified Trajectory comparison on 3 gap datasets (Geisler2001, Zhu2013, Wortley2006) at 30s budgets, 3 seeds each. Key findings: 1. Drift consumes 16-23% of wall time but gains <1% of score improvement (405-1498 ms/step vs 0.4-44 ms/step for other phases). 30-170x less efficient than the next-worst phase (ratchet). 2. TNT evaluates 1.5-3.6x more rearrangements/second than TreeSearch despite 32-bit scalar architecture vs TreeSearch's SSE2. Per-evaluation overhead in data structure management negates the SIMD advantage. 3. TNT's xmult does extensive intra-replicate sectorial search (~67% of trajectory entries are SECT), while TreeSearch does one XSS+RSS+CSS pass per outer cycle (6-10% of time). 4. TNT achieves 10-16 steps better per-replicate median scores. Recommendations: eliminate drift from default preset (save ~20% time), increase sectorial search rounds, reduce per-evaluation overhead. Files: bench_trajectory.R (comparison script), trajectory_results.rds (raw data), tnt_trajectory_analysis.md (full write-up).

Strategy tuning tasks: - T-254: drift MPT diversity experiment (validate before reducing) - T-255: reduce drift in default preset (blocked on T-254) - T-256: increase sectorial search intensity - T-257: post-ratchet sectorial search pass (blocked on T-256) - T-258: intra-replicate fusing - T-259: ratchet cycle count experiment - T-260: per-evaluation overhead profiling

Compared driftCycles=0 vs driftCycles=2 on Wortley2006 (37t), Zhu2013 (75t), Geisler2001 (68t). 3 seeds × 30s+120s budgets, with and without consensus stopping. Key findings: - Zero score benefit (identical best scores on all datasets/seeds) - Zero MPT benefit (no-drift finds 4 MPTs on Wortley2006 vs 1-3 with drift) - Zero diversity benefit (mean RF identical: 7.3 vs 7.4 on Geisler2001) - 10-22% replicate loss from drift time consumption (15-19% of wall time) - With consensus stopping, drift delays convergence without improving answer Unblocks T-255 (reduce/eliminate drift in default and thorough presets). Analysis: dev/benchmarks/drift_mpt_analysis.md

T-254 confirmed drift provides zero benefit on all metrics (score, MPT count, topological diversity) while consuming 15-19% of wall time and reducing replicates by 10-22%. Changes: - SearchControl() default: driftCycles 2 -> 0 - default preset: driftCycles 2 -> 0 - thorough preset: driftCycles 12 -> 0 - sprint and large presets: already 0, no change - Updated roxygen docs: removed drift from strategy descriptions - Updated test-SearchControl.R: expected default 0L Drift search infrastructure (ts_drift.h/.cpp, SearchControl params) remains available for explicit use via SearchControl(driftCycles = N).

devtools::check_man() loaded the old installed package via load_installed, writing stale n_mc=5000L into PrepareDataProfile.Rd and StepInformation.Rd. Corrected to match source (100000L). Updated spelling wordlist: added LCM, TREE's, speedup; removed 28 stale entries.

test-ts-parallel.R 'Parallel search respects timeout' used Vinther2008 (23 tips) which completes all replicates in <1s on fast ARM64 hardware, so timed_out was never set. Switched to Agnarsson2004 (62 tips) where each replicate takes ~100ms, making timeout reliable.

T-243: FlatBlock metadata, flat EW indirect functions, TBR prefetch

Add intraFuse parameter to SearchControl/DrivenParams. When enabled, each outer cycle fuses the current replicate tree against pool donors after TBR polish, approximating TNT's within-replicate fusing pattern. Implementation: - run_single_replicate() now accepts const TreePool* for read-only pool access during intra-replicate fusing - Fuse fires after TBR polish (step 6b) with max_rounds=3 - State arrays rebuilt after fuse via build_postorder()+reset_states() - Disabled in parallel mode (nullptr pool; between-replicate fusing via ThreadSafePool is already active) Fixed segfault: tree_fuse() internal TBR leaves state arrays in a form incompatible with subsequent pipeline phases. Explicit reset_states() after fuse resolves this. Tests: 5 new tests including dataset-size-change regression test.

2 configs × 5 datasets × 5 seeds × 30s = 50 runs (~30 min).

T-248 reduced annealCycles from 3 to 1 in the large preset but missed this test assertion. GHA 23590522833 caught it.

Hamilton HPC and r-package-profiling skills at ~/.positai/skills/ should be loaded via skill() tool before relevant work.

Top 3 hotspots at 88 tips (Dikow2009, EW, 1000 TBR passes): 1. StateSnapshot save/restore: 14.6% (memcpy ~190KB/candidate) 2. reset_states zeroing + tip reload: 9.1% 3. fitch_na_score: 29.2% (expected, core algorithm) Non-scoring overhead = 37.8% of TBR time. Combined fix potential ~16-19%. Write-up: dev/benchmarks/vtune_tbr_analysis.md Driver: dev/vtune-tbr-driver.R

T-261: Eliminate std::fill zeroing in reset_states (~3.9% savings) T-262: Reduce load_tip_states scope (~2-3% savings) T-263: Selective StateSnapshot save/restore (~10-12% savings) Combined potential: ~16-19% TBR wall time reduction.

C++ instrumentation of tbr_search() with post-acceptance sector-masked TBR on clip subtree. Hit rate ~35% regardless of scoring mode (no IW-specific benefit), but NET HARMFUL: disrupts global TBR trajectory. mbank_X30754 EW: +17 to +34 steps TAEB at 30-120s. Validates existing pipeline design (XSS as separate post-convergence phase). Closed.

Phase 1 (a159311) added diagnostic instrumentation and the TIPS_FIRST, INV_WEIGHT, BUCKET, ANTI_TIP, LARGE_FIRST ordering variants to ts_tbr.cpp. Phase 2 completes the implementation: Bug fix: clip_order was only propagated to the initial TBR and final TBR polish (~10% of replicate time). The ratchet and all sectorial TBR calls defaulted to RANDOM, making the ordering variants effectively inert for the dominant phase (ratchet ~76%). Fix: add clip_order field to RatchetParams and SectorParams, propagate from SearchControl through ts_driven.cpp into every TBR call site in ts_ratchet.cpp and ts_sector.cpp (6 sites + search_sector signature). Empirical validation (5 seeds, 30s, default config): Agnarsson2004 (62t, default preset): TIPS_FIRST -2%, INV_WEIGHT neutral Zhu2013 (75t, thorough preset): TIPS_FIRST +13%, INV_WEIGHT +9% Dikow2009 (88t, thorough preset): TIPS_FIRST +8%, INV_WEIGHT +3% Theoretical model (Poisson bucket, corrected): TIPS_FIRST saves ~48% per productive TBR pass at 88t; practical throughput gain is ~8-13% because null passes (ordering-invariant, exhaust all clips) dilute savings. Benefit is dataset-size dependent: < ~65t: tip enrichment is low (Agnarsson2004: 0.43); TIPS_FIRST neutral 65-120t (thorough): tip enrichment moderate; TIPS_FIRST +8-13% No preset defaults changed yet — pending GHA 10-seed validation. bench_clip_ordering.R contains the full benchmark driver.

@param

The SearchControl.Rd usage section was generated from an old installed build (missing clipOrder and many parameters added since). The codoc check correctly flagged the mismatch. - Added @param clipOrder documentation in R/SearchControl.R - Regenerated man/SearchControl.Rd with correct \usage and \item{clipOrder}

TBR clip-ordering strategy (SearchControl clipOrder)

When concordance mode uses the dataset (qc/mcc/spc/clc/phc) and the plotted tree's taxa don't match the dataset's names, return NULL rather than passing mismatched objects to QuartetConcordance/LabelSplits etc. This prevents the `mat[i, j, ..., drop=FALSE]: subscript out of bounds` crash when loading trees with non-overlapping taxa (T-293) or reloading a dataset with removed taxa (T-300). Remove T-292, T-293, T-300 from to-do.md (all fixed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ms609 marked this pull request as draft March 25, 2026 14:21

ms609 added 29 commits March 26, 2026 08:14

T-250: TNT Fitch kernel disassembly analysis

27a8636

TNT is 32-bit i386 with zero SIMD and 64KB LUT popcount. TreeSearch has ~4x throughput advantage (128-bit SSE2). TNT's 3-5x convergence speed is strategic, not implementation. Added T-249..T-253 investigation tasks to to-do.md.

Set maxReplicates default to 96 (multiple of 48 for parallel efficiency)

13501b1

docs: T-250 scope caveat — Windows TNT is 32-bit, Linux/Mac are 64-bit

42aa9fa

The TNT download page labels the Windows build as '[32 bits]'. The ~4x throughput advantage finding applies only to the Windows 32-bit binary and should not be generalized to Hamilton (Linux 64-bit) benchmarks.

chore: close T-251, update completed-tasks.md

caefa11

chore: park T-255 (GHA 23589478370)

b110e5b

chore: update T-255 GHA run ID (23590370573)

ba7cf2e

chore: update T-243 notes, agent-E status

ba0e60c

chore: re-dispatch T-255 GHA with all fixes (23590522833)

f1242d8

Merge pull request #230 from ms609/feature/hot-loop-opt

68a488e

T-243: FlatBlock metadata, flat EW indirect functions, TBR prefetch

T-258: Intra-fuse benchmark script for Hamilton

a524dc4

2 configs × 5 datasets × 5 seeds × 30s = 50 runs (~30 min).

.Renviron

0082642

metad

bfc71a2

fix: update anneal test for T-248 large preset (annealCycles 3→1)

7dc2ed9

T-248 reduced annealCycles from 3 to 1 in the large preset but missed this test assertion. GHA 23590522833 caught it.

chore: update T-255 GHA run ID (23591874696), anneal test fix

7dd853c

chore: S-COORD round 23

0ff7e3d

chore: S-RED focus 7 done, no bugs; T-255 still parked (GHA in progress)

09b619a

docs: add user-level skills reference to AGENTS.md

d91cf53

Hamilton HPC and r-package-profiling skills at ~/.positai/skills/ should be loaded via skill() tool before relevant work.

M-041: Migrate issues protocol to u.nnn files

4e7105f

Coordination

e2d2f41

ms609 and others added 30 commits March 29, 2026 08:06

chore: agent F — F-030 complete (PR #239 clip-ordering phase 2)

3cf476d

Merge branch 'cpp-search' into feature/weighted-clip-order

72fce2e

Merge pull request #239 from ms609/feature/weighted-clip-order

6972444

TBR clip-ordering strategy (SearchControl clipOrder)

chore: agent F — IDLE; TS-WeightClip worktree deregistered

14ff3f9

Concordance doc

f16e44c

Goloboff2021 [-b]

9b23311

Goloboff2021-b

5d987a1

Concordance refs

066fb14

Handle DNA in QuartetConcordance

f675172

ConcordanceTable margins

b99265d

characterwise

9ac78ba

Update .Rbuildignore

571ed05

User issues

a4627d2

memchex

0b28d9e

Simplify AGENTS.md

a941956

a. triage

d009425

Dynamic limit for perturbStopFactor

9a6ac50

roxygen2 v8

b189017

testthat/edition: 2

1bd54d1

Avoid full_rescore unless needed

db92b71

Deglitch UI

17f49c3

rm flicker

73cb644

Clarify intent

ef3cefa

Reorder

dbf4787

Top/right margins on ConcordanceTable

190ec25

mod-consensus rogue floor

f006918

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C++ search#210

C++ search#210
ms609 wants to merge 588 commits intomainfrom
cpp-search

ms609 commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ms609 commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant