Speed up browse over network drives (SM4-focused) by jacobson30-bot · Pull Request #3 · SPMQT-Lab/ProbeFlow

jacobson30-bot · 2026-06-02T02:19:53Z

Context

A colleague browsing RHK .sm4 files from a network drive reported that Browse opens slowly. On a LAN-mounted drive, per-operation latency and per-byte transfer dominate, and the browse path multiplied both: each file was sniffed twice, stat'd several times, fully read for metadata, and fully read again per thumbnail — with nothing cached across revisits. SM4 was the worst case (its "metadata" path decoded every pixel of every page).

Implements the plan in .claude/plans/we-re-going-to-work-transient-storm.md.

Changes

SM4-specific (the colleague's case)

S1 — read_sm4_metadata no longer decodes pixels. A metadata_only parse path skips _decode_image_payload/float64 conversions and builds ScanMetadata directly from page headers (metadata_from_rhk_sm4).
S3 — thumbnails decode only the displayed plane via read_sm4_thumbnail_plane (reads just that page's PAGE_DATA range), instead of all N pages.
S2 — metadata_only reads only header/table/string byte ranges via seek (_collect_sm4_metadata_ranges), never the PAGE_DATA blobs. On the real 4.4 MB fixture this fetches ~3.6 KB (0.08%) with byte-identical metadata. Falls back to a whole-file read for small files (≤1 MiB) or on any structural surprise.

All formats

R3 — index_folder_shallow/_peek_subfolder enumerate via os.scandir (type + cached stat from one directory read) instead of iterdir + per-entry is_file()/is_dir()/stat.
R4 — the sniffed FileType is threaded into read_scan_metadata/read_spec_metadata, so identify_* no longer re-sniffs. Each file is now sniffed exactly once per index.
R1 — a best-effort on-disk cache (browse_cache.py) for index metadata and rendered thumbnails, keyed by (abspath, mtime_ns, size). Revisiting an unchanged folder reads no file content. mtime+size invalidation, version tag, LRU eviction (512 MiB cap); PROBEFLOW_DISABLE_BROWSE_CACHE / PROBEFLOW_CACHE_DIR env controls.
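The sniff-once idea in R4 can be sketched as follows. This is a minimal illustration, not the ProbeFlow code: `sniff`, `read_metadata`, `index_file`, the `SNIFF_LOG` counter and the `DEMO` magic bytes are all hypothetical stand-ins for the real sniffing and `read_scan_metadata(..., file_type=...)` plumbing.

```python
SNIFF_LOG = []  # records every path that gets sniffed (demo instrumentation only)


def sniff(path):
    """Classify a file from its first bytes. The b"DEMO" magic is made up."""
    SNIFF_LOG.append(path)
    with open(path, "rb") as fh:
        head = fh.read(4)
    return "demo-scan" if head == b"DEMO" else None


def read_metadata(path, file_type=None):
    """Re-sniff only when the caller has not already identified the file."""
    if file_type is None:
        file_type = sniff(path)
    if file_type is None:
        raise ValueError(f"unrecognised file: {path}")
    return {"path": path, "file_type": file_type}


def index_file(path):
    """Sniff once, then pass the result down instead of letting readers repeat it."""
    file_type = sniff(path)
    if file_type is None:
        return None
    return read_metadata(path, file_type=file_type)
```

Keeping `file_type` optional means existing callers that do not know the type still work, while the indexer avoids a second open-and-read per file.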

Testing

New tests cover SM4 metadata-only parity (synthetic + real fixture), single-plane thumbnail parity, sparse-read byte savings, sniff-once indexing, and the cache (round-trip, invalidation, eviction, disabled-mode, second-pass cache hit). All pass. The full GUI suite can't run in this headless environment (pre-existing Qt SIGABRT unrelated to these changes); all non-Qt-window tests for the touched areas pass (108 passed, 23 skipped).

🤖 Generated with Claude Code

read_sm4_metadata previously went through read_sm4 -> metadata_from_scan, fully decoding every image page (frombuffer + reshape + two float64 scaling passes per page) just to read header fields. On a network drive this also forced the entire file into memory. Add a metadata_only path through read_rhk_sm4 / _parse_pages / _parse_page_from_objects that parses PAGE_HEADER + STRING_DATA but skips _decode_image_payload and the physical conversion, leaving page image arrays empty. read_sm4_metadata now builds ScanMetadata directly from the parsed pages via the new metadata_from_rhk_sm4 helper, which mirrors read_sm4's page selection / naming / scan-range logic for byte-identical output. Verified parity against the full-decode path on the real VT260430_0004.sm4 fixture and synthetic files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
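A toy version of a metadata_only switch is shown below. The page layout here (an 8-byte header holding width and height, followed by 32-bit little-endian samples) is an assumption chosen for the demo and is not the RHK SM4 layout; `parse_page` is a hypothetical name.

```python
import struct

HEADER = struct.Struct("<II")  # hypothetical page header: width, height


def parse_page(buf, offset, metadata_only=False):
    """Parse one page; with metadata_only=True the pixel payload is never touched."""
    width, height = HEADER.unpack_from(buf, offset)
    page = {"width": width, "height": height, "image": []}
    if metadata_only:
        return page  # header fields only, image left empty
    count = width * height
    data_offset = offset + HEADER.size
    page["image"] = list(struct.unpack_from(f"<{count}i", buf, data_offset))
    return page
```

Because the metadata-only branch returns before any payload access, it also works on a buffer in which the payload bytes are absent, which is what makes a later sparse read possible.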

Browse thumbnails only ever show one plane, but load_scan(.sm4) decoded every image page (an 8-page file did ~8x the needed pixel work). Add read_sm4_thumbnail_plane: parse headers once (metadata_only), resolve the target plane from its names, then read and decode only that page's PAGE_DATA byte range via the new per-page data_object location. A shared _select_image_pages helper keeps page selection identical to read_sm4 and metadata_from_rhk_sm4. The thumbnail workers now call a new gui.rendering.load_thumbnail_plane that uses the single-plane path for SM4 and falls back to a full load for other formats (whose decode is a single read anyway). Verified the single-plane result is byte-identical to the corresponding plane from a full read_sm4 on the real fixture and synthetic files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
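Reading a single plane by byte range can be sketched as below. The function name, its arguments, and the assumption of 4-byte little-endian signed samples are illustrative only; in the PR the offset and shape come from the parsed page headers.

```python
import struct


def read_plane_range(path, offset, width, height):
    """Seek to one plane's data and decode only that range of the file."""
    count = width * height
    nbytes = count * 4  # assumption: 4-byte little-endian signed samples
    with open(path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(nbytes)
    if len(raw) != nbytes:
        raise ValueError("plane data is truncated")
    flat = struct.unpack(f"<{count}i", raw)
    # Split the flat sample run into rows of `width` values.
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]
```

For an N-page file this transfers one plane's bytes instead of N, which is where the saving on a network drive comes from.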

read_rhk_sm4(metadata_only=True) still slurped the entire file via read_bytes(), so indexing a folder of SM4 files transferred every byte over the network just to read headers. Add _read_sm4_metadata_buffer: files at or below 1 MiB are read whole (the round-trips of a sparse read aren't worth it), but larger files are served by _collect_sm4_metadata_ranges, which walks the object graph with seek reads and fetches only the file header, object tables, page index, page headers and string data into a full-length zero-filled buffer. PAGE_DATA byte ranges are never read. The existing absolute-offset parser then runs unchanged on that buffer. Any structural surprise falls back to a whole-file read, so correctness never depends on the sparse traversal being exhaustive. On the real VT260430_0004.sm4 fixture this fetches ~3.6 KB of a 4.4 MB file (0.08%) and produces byte-identical ScanMetadata.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
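The zero-filled sparse buffer with a whole-file fallback might look roughly like this. It is a sketch under simplifying assumptions: the byte ranges are passed in by the caller, whereas the PR discovers them by walking the SM4 object graph, and `sparse_buffer` / `read_metadata_buffer` are hypothetical names.

```python
import os

SMALL_FILE_LIMIT = 1 << 20  # 1 MiB: at or below this, a single whole read is cheaper


def sparse_buffer(path, ranges, total_size):
    """Return a full-length zero-filled buffer with only `ranges` populated."""
    buf = bytearray(total_size)
    with open(path, "rb") as fh:
        for offset, length in ranges:
            if offset < 0 or length < 0 or offset + length > total_size:
                raise ValueError("range lies outside the file")
            fh.seek(offset)
            chunk = fh.read(length)
            if len(chunk) != length:
                raise ValueError("short read")
            buf[offset:offset + length] = chunk
    return bytes(buf)


def read_metadata_buffer(path, ranges, small_file_limit=SMALL_FILE_LIMIT):
    """Sparse read for large files; whole read for small files or on any surprise."""
    size = os.path.getsize(path)
    if size > small_file_limit:
        try:
            return sparse_buffer(path, ranges, size)
        except (ValueError, OSError):
            pass  # fall through to the whole-file read
    with open(path, "rb") as fh:
        return fh.read()
```

Because the buffer keeps the file's full length, a parser that uses absolute offsets can run on it unchanged; the skipped regions simply read as zeros.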

R3: index_folder_shallow and _peek_subfolder now enumerate via os.scandir instead of Path.iterdir() + per-entry is_file()/is_dir(). DirEntry serves type info (and a cached stat) from the single directory read, removing separate stat round-trips per file, which are the dominant cost on a network drive. The DirEntry.stat() is reused to fill mtime/size instead of a fresh _file_stat.

R4: the FileType from the indexing sniff is threaded through _build_item -> _item_from_scan/_item_from_spec into read_scan_metadata/read_spec_metadata via a new file_type= argument, so identify_scan_file/identify_spectrum_file (which re-sniff and redo exists/is_file/resolve) are skipped. Each file is now sniffed exactly once per index (regression-tested).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
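The os.scandir enumeration in R3 can be illustrated with a short sketch; `list_folder` and its return shape are assumptions for the demo, not the signature of index_folder_shallow.

```python
import os


def list_folder(folder):
    """Enumerate one directory level, reusing DirEntry type and stat information."""
    files, subdirs = [], []
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file():
                st = entry.stat()  # cached on the entry after the first call
                files.append((entry.path, st.st_mtime_ns, st.st_size))
    return sorted(files), sorted(subdirs)
```

How much this saves depends on the platform. On Windows the stat result generally comes with the directory listing. On POSIX systems the type checks usually need no extra system call, but DirEntry.stat() still costs one call on first use, after which the result is cached on the entry.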

Re-opening the same network folder previously re-read every file twice (once to build its ProbeFlowItem, once to render its thumbnail). Add a best-effort local disk cache (probeflow/core/browse_cache.py) keyed by (absolute_path, mtime_ns, size_bytes):

- index_folder_shallow consults the metadata cache before sniffing/reading a file; a hit returns the stored ProbeFlowItem (or a cached "not recognised" None) with zero content reads. Misses populate the cache.
- ThumbnailLoader and FolderThumbnailLoader cache the rendered PNG, keyed also by colormap/channel/clip/size/processing; a hit loads the pixmap straight from bytes without reading or decoding the file.

mtime+size are baked into the key, so modified/replaced files miss automatically; a cache-version tag in the directory name invalidates on format changes; least-recently-used eviction keeps the cache under a soft 512 MiB cap. All cache I/O is wrapped so a corrupt/unreadable cache degrades to a normal read. PROBEFLOW_DISABLE_BROWSE_CACHE=1 disables it; PROBEFLOW_CACHE_DIR relocates it.

Note: the metadata key tracks the scan file only, not its optional ScanFlow sidecar; a sidecar edited without touching the scan would be served stale until the scan changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
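A stripped-down version of the keying and best-effort I/O described above could look like this. The function names, the SHA-256 file naming and the one-file-per-entry layout are assumptions for illustration; only the (path, mtime_ns, size) key and the PROBEFLOW_DISABLE_BROWSE_CACHE variable come from the commit message. Version tagging and LRU eviction are left out.

```python
import hashlib
import os
from pathlib import Path


def cache_key(path, extra=()):
    """Key on absolute path, mtime_ns and size, plus optional render settings."""
    st = os.stat(path)
    parts = (os.path.abspath(path), st.st_mtime_ns, st.st_size, tuple(extra))
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


def _disabled():
    return os.environ.get("PROBEFLOW_DISABLE_BROWSE_CACHE") == "1"


def cache_get(root, key):
    """Return cached bytes, or None on a miss, a disabled cache, or any I/O error."""
    if _disabled():
        return None
    try:
        return (Path(root) / key).read_bytes()
    except OSError:
        return None


def cache_put(root, key, data):
    """Store bytes atomically; failures are swallowed so the cache stays optional."""
    if _disabled():
        return
    try:
        root = Path(root)
        root.mkdir(parents=True, exist_ok=True)
        tmp = root / (key + ".tmp")
        tmp.write_bytes(data)
        os.replace(tmp, root / key)  # atomic rename avoids half-written entries
    except OSError:
        pass
```

Since mtime and size are part of the key, there is no explicit invalidation step: a changed file simply hashes to a different entry, and the old one ages out.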

- Remove sm4_image_plane_names: dead code (read_sm4_thumbnail_plane computes names inline).
- load_thumbnail_plane re-sniffed/resolved twice for SXM/Createc on a cache miss (identify_scan_file, then load_scan re-identifying). Expose load_scan_from_signature and reuse the signature already resolved, so each thumbnail sniffs/resolves once.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Store metadata entries as (schema_tag, item) where schema_tag is the tuple of the dataclass's field names. On read, a tag mismatch (any added/removed/renamed ProbeFlowItem field) is treated as a miss, so a stale cache can never feed a half-constructed object into the app at attribute-access time. Bump the cache version to 2 (envelope shape changed; old bare-pickle entries now miss safely). Add tests: schema-change invalidation, and an end-to-end run through the real browse pipeline on VT260430_0004.sm4 (index -> cache-hit second pass -> single-plane thumbnail decode -> PNG cache round-trip).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
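The schema-tag envelope can be sketched as below. `Item` is a hypothetical two-field stand-in for ProbeFlowItem, and this sketch pickles the field dictionary rather than the object itself to stay self-contained, which differs from the (schema_tag, item) shape the commit describes.

```python
import pickle
from dataclasses import asdict, dataclass, fields


@dataclass
class Item:  # hypothetical stand-in for ProbeFlowItem
    path: str
    size_bytes: int


def schema_tag(cls):
    """The tuple of field names; it changes when a field is added, removed or renamed."""
    return tuple(f.name for f in fields(cls))


def dump_entry(item):
    """Wrap the payload in a (schema_tag, payload) envelope."""
    return pickle.dumps((schema_tag(type(item)), asdict(item)))


def load_entry(blob, cls):
    """Return the item, or None (a cache miss) for any stale or malformed entry."""
    try:
        tag, payload = pickle.loads(blob)
        if tag != schema_tag(cls):
            return None
        return cls(**payload)
    except Exception:
        return None  # corrupt data or an old bare-pickle entry
```

Treating every failure as a miss keeps the cache strictly optional: the worst outcome of a stale entry is one extra read of the source file. As with any pickle-based store, this is only appropriate for a local cache directory the application itself writes.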

The thumbnail workers now call load_thumbnail_plane (single-plane SM4 decode + cache) instead of load_scan, so the three TestGuiWorkers tests that monkeypatched worker_mod.load_scan failed in CI (they are skipped locally where Qt can't init). Repoint them at load_thumbnail_plane: the plane-selection assertion becomes a check that the requested channel is forwarded, and the failure-path tests raise from the new seam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jacobson30-bot and others added 9 commits June 2, 2026 11:56

Merge remote-tracking branch 'origin/main' into optimize-browse-netwo…

fad46cd

…rk-sm4

jacobson30-bot merged commit a06e5b1 into main Jun 2, 2026
3 checks passed

jacobson30-bot deleted the optimize-browse-network-sm4 branch June 2, 2026 02:50
