Speed up browse over network drives (SM4-focused) #3

Merged: jacobson30-bot merged 9 commits into main from optimize-browse-network-sm4 on Jun 2, 2026

@jacobson30-bot
Contributor

Context

A colleague browsing RHK .sm4 files from a network drive reported that Browse opens slowly. On a LAN-mounted drive, per-operation latency and per-byte transfer dominate, and the browse path multiplied both: each file was sniffed twice, stat'd several times, fully read for metadata, and fully read again per thumbnail — with nothing cached across revisits. SM4 was the worst case (its "metadata" path decoded every pixel of every page).

Implements the plan in .claude/plans/we-re-going-to-work-transient-storm.md.

Changes

SM4-specific (the colleague's case)

  • S1: read_sm4_metadata no longer decodes pixels. A metadata_only parse path skips _decode_image_payload and the float64 conversions, and builds ScanMetadata directly from page headers (metadata_from_rhk_sm4).
  • S3: thumbnails decode only the displayed plane via read_sm4_thumbnail_plane (which reads just that page's PAGE_DATA range) instead of all N pages.
  • S2: metadata_only reads only the header, table, and string byte ranges via seek (_collect_sm4_metadata_ranges), never the PAGE_DATA blobs. On the real 4.4 MB fixture this fetches ~3.6 KB (0.08%) and yields byte-identical metadata. It falls back to a whole-file read for small files (≤1 MiB) or on any structural surprise.

All formats

  • R3: index_folder_shallow/_peek_subfolder enumerate via os.scandir (type + cached stat from one directory read) instead of iterdir + per-entry is_file()/is_dir()/stat.
  • R4: the sniffed FileType is threaded into read_scan_metadata/read_spec_metadata, so identify_* no longer re-sniffs. Each file is now sniffed exactly once per index.
  • R1: a best-effort on-disk cache (browse_cache.py) for index metadata and rendered thumbnails, keyed by (abspath, mtime_ns, size). Revisiting an unchanged folder reads no file content. Entries are invalidated by mtime+size and by a version tag, with LRU eviction (512 MiB cap). The PROBEFLOW_DISABLE_BROWSE_CACHE and PROBEFLOW_CACHE_DIR environment variables disable and relocate the cache.

Testing

New tests cover SM4 metadata-only parity (synthetic + real fixture), single-plane thumbnail parity, sparse-read byte savings, sniff-once indexing, and the cache (round-trip, invalidation, eviction, disabled-mode, second-pass cache hit). All pass. The full GUI suite can't run in this headless environment (pre-existing Qt SIGABRT unrelated to these changes); all non-Qt-window tests for the touched areas pass (108 passed, 23 skipped).

🤖 Generated with Claude Code

jacobson30-bot and others added 9 commits June 2, 2026 11:56
read_sm4_metadata previously went through read_sm4 -> metadata_from_scan,
fully decoding every image page (frombuffer + reshape + two float64 scaling
passes per page) just to read header fields. On a network drive this also
forced the entire file into memory.

Add a metadata_only path through read_rhk_sm4 / _parse_pages /
_parse_page_from_objects that parses PAGE_HEADER + STRING_DATA but skips
_decode_image_payload and the physical conversion, leaving page image arrays
empty. read_sm4_metadata now builds ScanMetadata directly from the parsed
pages via the new metadata_from_rhk_sm4 helper, which mirrors read_sm4's page
selection / naming / scan-range logic for byte-identical output.

Verified parity against the full-decode path on the real VT260430_0004.sm4
fixture and synthetic files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Browse thumbnails only ever show one plane, but load_scan(.sm4) decoded
every image page (an 8-page file did ~8x the needed pixel work). Add
read_sm4_thumbnail_plane: parse headers once (metadata_only), resolve the
target plane from its names, then read and decode only that page's PAGE_DATA
byte range via the new per-page data_object location. A shared
_select_image_pages helper keeps page selection identical to read_sm4 and
metadata_from_rhk_sm4.
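The single-plane read described above can be sketched as a seek plus one bounded read. This is a minimal illustration, not the project's read_sm4_thumbnail_plane: the function name read_plane, the fixed little-endian int32 sample type, and the back-to-back page layout in the demo are all assumptions made for the example.

```python
import os
import struct
import tempfile


def read_plane(path, offset, width, height):
    """Read one page's samples from [offset, offset + nbytes) and nothing else.

    Assumes 4-byte little-endian signed samples (an assumption for this sketch).
    """
    nbytes = width * height * 4
    with open(path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(nbytes)
    if len(raw) != nbytes:
        raise ValueError("page data range is truncated")
    flat = struct.unpack(f"<{width * height}i", raw)
    # Reshape the flat sample tuple into rows.
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


def demo_read_plane():
    # Three fake 2x2 planes stored back to back; only the second one is read.
    planes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake_pages.bin")
        with open(path, "wb") as fh:
            for plane in planes:
                fh.write(struct.pack("<4i", *plane))
        return read_plane(path, offset=16, width=2, height=2)
```

For an N-page file this transfers one page's bytes instead of N, which matches the roughly 8x saving the commit quotes for an 8-page file.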

The thumbnail workers now call a new gui.rendering.load_thumbnail_plane that
uses the single-plane path for SM4 and falls back to a full load for other
formats (whose decode is a single read anyway).

Verified the single-plane result is byte-identical to the corresponding plane
from a full read_sm4 on the real fixture and synthetic files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

read_rhk_sm4(metadata_only=True) still slurped the entire file via
read_bytes(), so indexing a folder of SM4 files transferred every byte over
the network just to read headers.

Add _read_sm4_metadata_buffer: files at or below 1 MiB are read whole (the
round-trips of a sparse read aren't worth it), but larger files are served by
_collect_sm4_metadata_ranges, which walks the object graph with seek reads and
fetches only the file header, object tables, page index, page headers and
string data into a full-length zero-filled buffer. PAGE_DATA byte ranges are
never read. The existing absolute-offset parser then runs unchanged on that
buffer. Any structural surprise falls back to a whole-file read, so
correctness never depends on the sparse traversal being exhaustive.

On the real VT260430_0004.sm4 fixture this fetches ~3.6 KB of a 4.4 MB file
(0.08%) and produces byte-identical ScanMetadata.
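The sparse-buffer idea can be sketched as follows. This is a simplified illustration under stated assumptions: the names read_sparse and read_metadata_buffer are hypothetical, and the byte ranges are passed in up front, whereas the commit discovers them by walking the object graph.

```python
import os
import tempfile

SMALL_FILE_LIMIT = 1 << 20  # 1 MiB, the threshold quoted in the commit message


def read_sparse(path, ranges):
    """Fetch only `ranges` into a full-length, zero-filled buffer."""
    size = os.path.getsize(path)
    buf = bytearray(size)  # bytearray(n) is n zero bytes
    with open(path, "rb") as fh:
        for offset, length in ranges:
            if offset < 0 or length < 0 or offset + length > size:
                raise ValueError("range falls outside the file")
            fh.seek(offset)
            chunk = fh.read(length)
            if len(chunk) != length:
                raise ValueError("short read")
            buf[offset:offset + length] = chunk
    return bytes(buf)


def read_metadata_buffer(path, ranges, small_file_limit=SMALL_FILE_LIMIT):
    """Whole-file read for small files or on any surprise; sparse otherwise."""
    if os.path.getsize(path) > small_file_limit:
        try:
            return read_sparse(path, ranges)
        except ValueError:
            pass  # structural surprise: fall through to the whole-file read
    with open(path, "rb") as fh:
        return fh.read()


def demo_sparse():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake.bin")
        with open(path, "wb") as fh:
            fh.write(b"HEAD" + b"\xff" * 100 + b"TAIL")
        # Force the sparse path by lowering the small-file limit to zero.
        return read_metadata_buffer(path, [(0, 4), (104, 4)], small_file_limit=0)
```

Because the buffer keeps the file's full length, a parser that uses absolute offsets can run on it unchanged; the unread payload region simply reads as zeros.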

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

R3: index_folder_shallow and _peek_subfolder now enumerate via os.scandir
instead of Path.iterdir() + per-entry is_file()/is_dir(). DirEntry serves
type info (and a cached stat) from the single directory read, removing
separate stat round-trips per file — the dominant cost on a network drive.
The DirEntry.stat() is reused to fill mtime/size instead of a fresh _file_stat.
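A minimal sketch of the scandir pattern in R3 follows. The function name list_entries and the returned tuple shape are assumptions for illustration, not the signature of index_folder_shallow.

```python
import os
import tempfile


def list_entries(folder):
    """Split a folder into files (name, size, mtime_ns) and subfolder names."""
    files, subdirs = [], []
    with os.scandir(folder) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
            elif entry.is_file(follow_symlinks=False):
                # The stat result is cached on the DirEntry after first use.
                st = entry.stat(follow_symlinks=False)
                files.append((entry.name, st.st_size, st.st_mtime_ns))
    return sorted(files), sorted(subdirs)


def demo_listing():
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "a.sm4"), "wb") as fh:
            fh.write(b"abc")
        os.mkdir(os.path.join(tmp, "sub"))
        return list_entries(tmp)
```

On Windows, DirEntry.stat() is served from the directory read itself. On POSIX systems the type checks usually need no extra system call, but the first stat() on each entry still costs one, so the saving there comes mainly from is_file()/is_dir().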

R4: the FileType from the indexing sniff is threaded through _build_item ->
_item_from_scan/_item_from_spec into read_scan_metadata/read_spec_metadata via
a new file_type= argument, so identify_scan_file/identify_spectrum_file (which
re-sniff and redo exists/is_file/resolve) are skipped. Each file is now sniffed
exactly once per index (regression-tested).
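The sniff-once threading in R4 can be illustrated with a small stand-in. Everything here is hypothetical: the FileType members, the b"FAKE" magic, and the function names only mimic the shape of passing file_type= down the call chain.

```python
from enum import Enum


class FileType(Enum):  # hypothetical stand-in for the project's FileType
    FAKE_SCAN = "fake_scan"
    UNKNOWN = "unknown"


SNIFF_CALLS = []  # records every sniff so the demo can count them


def sniff(header):
    """Classify a file from its leading bytes (made-up magic for this sketch)."""
    SNIFF_CALLS.append(header[:4])
    return FileType.FAKE_SCAN if header.startswith(b"FAKE") else FileType.UNKNOWN


def read_metadata(header, file_type=None):
    # Only sniff when the caller has not already identified the file.
    if file_type is None:
        file_type = sniff(header)
    return {"type": file_type.value, "header_bytes": len(header)}


def index_one(header):
    file_type = sniff(header)  # the single sniff per file
    if file_type is FileType.UNKNOWN:
        return None
    return read_metadata(header, file_type=file_type)
```

Keeping file_type optional means existing callers that do not know the type still work, while the indexing path avoids the second sniff.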

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Re-opening the same network folder previously re-read every file twice (once
to build its ProbeFlowItem, once to render its thumbnail). Add a best-effort
local disk cache (probeflow/core/browse_cache.py) keyed by
(absolute_path, mtime_ns, size_bytes):

- index_folder_shallow consults the metadata cache before sniffing/reading a
  file; a hit returns the stored ProbeFlowItem (or a cached "not recognised"
  None) with zero content reads. Misses populate the cache.
- ThumbnailLoader and FolderThumbnailLoader cache the rendered PNG, keyed also
  by colormap/channel/clip/size/processing; a hit loads the pixmap straight
  from bytes without reading or decoding the file.

mtime+size are baked into the key, so modified/replaced files miss
automatically; a cache-version tag in the directory name invalidates on format
changes; least-recently-used eviction keeps the cache under a soft 512 MiB cap.
All cache I/O is wrapped so a corrupt/unreadable cache degrades to a normal
read. PROBEFLOW_DISABLE_BROWSE_CACHE=1 disables it; PROBEFLOW_CACHE_DIR
relocates it.
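The key and eviction scheme can be sketched like this. It is not the code in browse_cache.py: cache_key and evict_lru are hypothetical names, hashing the key with SHA-256 is an assumption, and the sketch assumes a cache hit refreshes the entry's mtime so that oldest-mtime order approximates least-recently-used order.

```python
import hashlib
import os
import tempfile


def cache_key(path):
    """Derive a key from (absolute path, mtime_ns, size); any change misses."""
    st = os.stat(path)
    ident = f"{os.path.abspath(path)}|{st.st_mtime_ns}|{st.st_size}"
    return hashlib.sha256(ident.encode("utf-8")).hexdigest()


def evict_lru(cache_dir, max_bytes):
    """Delete oldest-mtime entries until the directory fits under max_bytes."""
    entries = []
    with os.scandir(cache_dir) as it:
        for entry in it:
            if entry.is_file():
                st = entry.stat()
                entries.append((st.st_mtime_ns, st.st_size, entry.path))
    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in sorted(entries):  # oldest first
        if total <= max_bytes:
            break
        try:
            os.remove(path)
        except OSError:
            continue  # best effort: skip entries that cannot be removed
        total -= size
        removed.append(os.path.basename(path))
    return removed


def demo_cache():
    with tempfile.TemporaryDirectory() as tmp:
        scan = os.path.join(tmp, "scan.bin")
        with open(scan, "wb") as fh:
            fh.write(b"v1")
        key_before = cache_key(scan)
        with open(scan, "ab") as fh:
            fh.write(b"-edited")  # size changes, so the key must change
        key_after = cache_key(scan)

        cache_dir = os.path.join(tmp, "cache")
        os.mkdir(cache_dir)
        for age, name in enumerate(["old", "mid", "new"], start=1):
            entry = os.path.join(cache_dir, name)
            with open(entry, "wb") as fh:
                fh.write(b"x" * 10)
            os.utime(entry, ns=(age * 10**9, age * 10**9))
        removed = evict_lru(cache_dir, max_bytes=15)
        return key_before != key_after, removed, sorted(os.listdir(cache_dir))
```

Because the stat values are part of the key, no separate invalidation step is needed: a modified file simply looks up a different key, and the stale entry ages out through eviction.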

Note: the metadata key tracks the scan file only, not its optional ScanFlow
sidecar; a sidecar edited without touching the scan would be served stale until
the scan changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Remove sm4_image_plane_names: dead code (read_sm4_thumbnail_plane computes
  names inline).
- load_thumbnail_plane re-sniffed/resolved twice for SXM/Createc on a cache
  miss (identify_scan_file, then load_scan re-identifying). Expose
  load_scan_from_signature and reuse the signature already resolved, so each
  thumbnail sniffs/resolves once.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Store metadata entries as (schema_tag, item) where schema_tag is the tuple of
the dataclass's field names. On read, a tag mismatch (any added/removed/renamed
ProbeFlowItem field) is treated as a miss, so a stale cache can never feed a
half-constructed object into the app at attribute-access time. Bump the cache
version to 2 (envelope shape changed; old bare-pickle entries now miss safely).
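The schema-tag envelope can be sketched as below. Item is a hypothetical stand-in for ProbeFlowItem, and the function names are illustrative. One deliberate simplification: the sketch pickles the item's fields as a plain dict, whereas the commit stores the item itself.

```python
import pickle
from dataclasses import asdict, dataclass, fields


@dataclass
class Item:  # hypothetical stand-in for ProbeFlowItem
    path: str
    width: int


def schema_tag(cls):
    """The tuple of field names; changes whenever a field is added or renamed."""
    return tuple(f.name for f in fields(cls))


def dump_entry(item):
    # Envelope: (schema_tag, payload). The payload is a dict in this sketch.
    return pickle.dumps((schema_tag(type(item)), asdict(item)))


def load_entry(blob, cls):
    """Return the cached item, or None (a miss) on any mismatch or corruption."""
    try:
        tag, payload = pickle.loads(blob)
    except Exception:
        return None  # corrupt entry or old non-envelope shape: treat as a miss
    if tag != schema_tag(cls):
        return None  # written by a different version of the dataclass
    return cls(**payload)
```

A round trip returns an equal item, while an entry written with a different field set returns None and the caller falls back to reading the file.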

Add tests: schema-change invalidation, and an end-to-end run through the real
browse pipeline on VT260430_0004.sm4 (index -> cache-hit second pass ->
single-plane thumbnail decode -> PNG cache round-trip).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The thumbnail workers now call load_thumbnail_plane (single-plane SM4 decode +
cache) instead of load_scan, so the three TestGuiWorkers tests that monkeypatched
worker_mod.load_scan failed in CI (they are skipped locally where Qt can't init).
Repoint them at load_thumbnail_plane: the plane-selection assertion becomes a
check that the requested channel is forwarded, and the failure-path tests raise
from the new seam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jacobson30-bot jacobson30-bot merged commit a06e5b1 into main Jun 2, 2026
3 checks passed
@jacobson30-bot jacobson30-bot deleted the optimize-browse-network-sm4 branch June 2, 2026 02:50