Speed up browse over network drives (SM4-focused) #3
Merged
read_sm4_metadata previously went through read_sm4 -> metadata_from_scan, fully decoding every image page (frombuffer + reshape + two float64 scaling passes per page) just to read header fields. On a network drive this also forced the entire file into memory.

Add a metadata_only path through read_rhk_sm4 / _parse_pages / _parse_page_from_objects that parses PAGE_HEADER + STRING_DATA but skips _decode_image_payload and the physical conversion, leaving page image arrays empty. read_sm4_metadata now builds ScanMetadata directly from the parsed pages via the new metadata_from_rhk_sm4 helper, which mirrors read_sm4's page selection / naming / scan-range logic for byte-identical output.

Verified parity against the full-decode path on the real VT260430_0004.sm4 fixture and synthetic files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Browse thumbnails only ever show one plane, but load_scan(.sm4) decoded every image page (an 8-page file did ~8x the needed pixel work).

Add read_sm4_thumbnail_plane: parse headers once (metadata_only), resolve the target plane from its names, then read and decode only that page's PAGE_DATA byte range via the new per-page data_object location. A shared _select_image_pages helper keeps page selection identical to read_sm4 and metadata_from_rhk_sm4.

The thumbnail workers now call a new gui.rendering.load_thumbnail_plane that uses the single-plane path for SM4 and falls back to a full load for other formats (whose decode is a single read anyway).

Verified the single-plane result is byte-identical to the corresponding plane from a full read_sm4 on the real fixture and synthetic files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
read_rhk_sm4(metadata_only=True) still slurped the entire file via read_bytes(), so indexing a folder of SM4 files transferred every byte over the network just to read headers.

Add _read_sm4_metadata_buffer: files at or below 1 MiB are read whole (the round-trips of a sparse read aren't worth it), but larger files are served by _collect_sm4_metadata_ranges, which walks the object graph with seek reads and fetches only the file header, object tables, page index, page headers and string data into a full-length zero-filled buffer. PAGE_DATA byte ranges are never read. The existing absolute-offset parser then runs unchanged on that buffer. Any structural surprise falls back to a whole-file read, so correctness never depends on the sparse traversal being exhaustive.

On the real VT260430_0004.sm4 fixture this fetches ~3.6 KB of a 4.4 MB file (0.08%) and produces byte-identical ScanMetadata.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
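The sparse-read approach described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: `read_sparse_buffer`, `SMALL_FILE_LIMIT`, `small_limit` and the `ranges` argument are hypothetical names, and the real implementation discovers the ranges by walking the SM4 object graph, which is not shown here.

```python
import os

SMALL_FILE_LIMIT = 1 << 20  # 1 MiB, the threshold named in the commit message


def read_sparse_buffer(path, ranges, small_limit=SMALL_FILE_LIMIT):
    """Return a full-length buffer that holds only the requested byte ranges.

    `ranges` is an iterable of (offset, length) pairs. It is a hypothetical
    input: the real code derives the ranges from the file's object tables.
    Bytes outside the ranges stay zero, so a parser that uses absolute
    offsets can run on the result unchanged.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        if size <= small_limit:
            return fh.read()  # small file: one whole read beats many seeks
        buf = bytearray(size)  # zero-filled, same length as the file
        for offset, length in ranges:
            if offset < 0 or length < 0 or offset + length > size:
                fh.seek(0)
                return fh.read()  # structural surprise: whole-file fallback
            fh.seek(offset)
            chunk = fh.read(length)
            if len(chunk) != length:
                fh.seek(0)
                return fh.read()  # short read: whole-file fallback
            buf[offset:offset + length] = chunk
        return bytes(buf)
```

Keeping the buffer full-length trades memory for simplicity: nothing is saved in RAM, but only the listed ranges cross the network, and the parser needs no offset translation.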
R3: index_folder_shallow and _peek_subfolder now enumerate via os.scandir instead of Path.iterdir() + per-entry is_file()/is_dir(). DirEntry serves type info (and a cached stat) from the single directory read, removing separate stat round-trips per file, which were the dominant cost on a network drive. The DirEntry.stat() is reused to fill mtime/size instead of a fresh _file_stat.

R4: the FileType from the indexing sniff is threaded through _build_item -> _item_from_scan/_item_from_spec into read_scan_metadata/read_spec_metadata via a new file_type= argument, so identify_scan_file/identify_spectrum_file (which re-sniff and redo exists/is_file/resolve) are skipped. Each file is now sniffed exactly once per index (regression-tested).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
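For readers unfamiliar with the `os.scandir` pattern in R3, a minimal sketch follows. The function name `list_files_shallow` is hypothetical and stands in for the PR's enumeration code, which is not reproduced here.

```python
import os


def list_files_shallow(folder):
    """Yield (path, mtime_ns, size) for the regular files directly in `folder`.

    DirEntry.is_file() is normally answered from the directory listing
    itself. DirEntry.stat() is cached on the entry after the first call;
    on Windows it usually needs no extra system call at all, while on
    POSIX the first call costs one stat.
    """
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # skip subfolders and other non-file entries
            st = entry.stat()
            yield entry.path, st.st_mtime_ns, st.st_size
```

The same `stat` result can then feed both the item's mtime/size fields and any cache key derived from them, without a second round-trip.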
Re-opening the same network folder previously re-read every file twice (once to build its ProbeFlowItem, once to render its thumbnail). Add a best-effort local disk cache (probeflow/core/browse_cache.py) keyed by (absolute_path, mtime_ns, size_bytes):

- index_folder_shallow consults the metadata cache before sniffing/reading a file; a hit returns the stored ProbeFlowItem (or a cached "not recognised" None) with zero content reads. Misses populate the cache.
- ThumbnailLoader and FolderThumbnailLoader cache the rendered PNG, keyed also by colormap/channel/clip/size/processing; a hit loads the pixmap straight from bytes without reading or decoding the file.

mtime+size are baked into the key, so modified/replaced files miss automatically; a cache-version tag in the directory name invalidates on format changes; least-recently-used eviction keeps the cache under a soft 512 MiB cap. All cache I/O is wrapped so a corrupt/unreadable cache degrades to a normal read. PROBEFLOW_DISABLE_BROWSE_CACHE=1 disables it; PROBEFLOW_CACHE_DIR relocates it.

Note: the metadata key tracks the scan file only, not its optional ScanFlow sidecar; a sidecar edited without touching the scan would be served stale until the scan changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
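One way to derive such a key and to keep cache reads best-effort is sketched below. The names `cache_key` and `cache_get`, the SHA-256 hashing and the one-file-per-key layout are all assumptions made for illustration; the PR's actual key encoding and on-disk layout are not shown in this conversation.

```python
import hashlib
import os


def cache_key(path, extra=()):
    """Build a key from (absolute path, mtime_ns, size) plus optional extras.

    `extra` stands in for render settings such as colormap or thumbnail
    size (hypothetical). Because mtime and size are part of the key, a
    modified or replaced file produces a different key and simply misses.
    """
    st = os.stat(path)
    parts = (os.path.abspath(path), st.st_mtime_ns, st.st_size, *extra)
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


def cache_get(cache_dir, key):
    """Return the cached bytes for `key`, or None on any I/O problem."""
    try:
        with open(os.path.join(cache_dir, key), "rb") as fh:
            return fh.read()
    except OSError:
        return None  # missing or unreadable entry degrades to a miss
```

As the note in the commit message points out, a key of this shape only sees the scan file, so anything stored alongside it (such as a sidecar) is invisible to invalidation.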
- Remove sm4_image_plane_names: dead code (read_sm4_thumbnail_plane computes names inline).
- load_thumbnail_plane re-sniffed/resolved twice for SXM/Createc on a cache miss (identify_scan_file, then load_scan re-identifying). Expose load_scan_from_signature and reuse the signature already resolved, so each thumbnail sniffs/resolves once.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Store metadata entries as (schema_tag, item) where schema_tag is the tuple of the dataclass's field names. On read, a tag mismatch (any added/removed/renamed ProbeFlowItem field) is treated as a miss, so a stale cache can never feed a half-constructed object into the app at attribute-access time. Bump the cache version to 2 (envelope shape changed; old bare-pickle entries now miss safely).

Add tests: schema-change invalidation, and an end-to-end run through the real browse pipeline on VT260430_0004.sm4 (index -> cache-hit second pass -> single-plane thumbnail decode -> PNG cache round-trip).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
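The schema-tag envelope can be illustrated with a small sketch. Two things here are assumptions: `Item` is a hypothetical stand-in for ProbeFlowItem, and the sketch serialises with JSON rather than the pickle format the commit message mentions, purely so it stays self-contained. The tag check itself works the same way in either format.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Item:  # hypothetical stand-in for ProbeFlowItem
    path: str
    size_bytes: int


def schema_tag(cls):
    """Field names of the dataclass, in declaration order."""
    return [f.name for f in fields(cls)]


def dump_entry(item):
    """Wrap the item in an envelope that records the schema it was written with."""
    return json.dumps({"tag": schema_tag(type(item)), "data": asdict(item)})


def load_entry(blob, cls):
    """Return the item, or None if the entry is corrupt or has a stale schema."""
    try:
        envelope = json.loads(blob)
        if envelope["tag"] != schema_tag(cls):
            return None  # field added, removed or renamed: treat as a miss
        return cls(**envelope["data"])
    except (ValueError, KeyError, TypeError):
        return None  # unreadable or legacy entry: treat as a miss
```

Comparing the tag before constructing the object is what keeps a stale entry from surfacing later as an attribute error far from the cache code.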
The thumbnail workers now call load_thumbnail_plane (single-plane SM4 decode + cache) instead of load_scan, so the three TestGuiWorkers tests that monkeypatched worker_mod.load_scan failed in CI (they are skipped locally where Qt can't init).

Repoint them at load_thumbnail_plane: the plane-selection assertion becomes a check that the requested channel is forwarded, and the failure-path tests raise from the new seam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Context
A colleague browsing RHK .sm4 files from a network drive reported that Browse opens slowly. On a LAN-mounted drive, per-operation latency and per-byte transfer dominate, and the browse path multiplied both: each file was sniffed twice, stat'd several times, fully read for metadata, and fully read again per thumbnail, with nothing cached across revisits. SM4 was the worst case (its "metadata" path decoded every pixel of every page).

Implements the plan in .claude/plans/we-re-going-to-work-transient-storm.md.

Changes
SM4-specific (the colleague's case)
- read_sm4_metadata no longer decodes pixels. A metadata_only parse path skips _decode_image_payload / float64 conversions and builds ScanMetadata directly from page headers (metadata_from_rhk_sm4).
- Thumbnails decode only the displayed plane via read_sm4_thumbnail_plane (reads just that page's PAGE_DATA range), instead of all N pages.
- For larger files, metadata_only reads only header/table/string byte ranges via seek (_collect_sm4_metadata_ranges), never the PAGE_DATA blobs. On the real 4.4 MB fixture this fetches ~3.6 KB (0.08%) with byte-identical metadata. Falls back to a whole-file read for small files (≤1 MiB) or on any structural surprise.

All formats
- index_folder_shallow / _peek_subfolder enumerate via os.scandir (type + cached stat from one directory read) instead of iterdir + per-entry is_file()/is_dir()/stat.
- The sniffed FileType is threaded into read_scan_metadata / read_spec_metadata, so identify_* no longer re-sniffs. Each file is now sniffed exactly once per index.
- A best-effort local disk cache (browse_cache.py) for index metadata and rendered thumbnails, keyed by (abspath, mtime_ns, size). Revisiting an unchanged folder reads no file content. mtime+size invalidation, version tag, LRU eviction (512 MiB cap); PROBEFLOW_DISABLE_BROWSE_CACHE / PROBEFLOW_CACHE_DIR env controls.

Testing
New tests cover SM4 metadata-only parity (synthetic + real fixture), single-plane thumbnail parity, sparse-read byte savings, sniff-once indexing, and the cache (round-trip, invalidation, eviction, disabled-mode, second-pass cache hit). All pass. The full GUI suite can't run in this headless environment (pre-existing Qt SIGABRT unrelated to these changes); all non-Qt-window tests for the touched areas pass (108 passed, 23 skipped).
🤖 Generated with Claude Code