Skip to content

Add NISAR azimuth-block splitting for parallel full-frame processing#717

Draft
mgovorcin wants to merge 2 commits into
isce-framework:mainfrom
mgovorcin:azimuth-block
Draft

Add NISAR azimuth-block splitting for parallel full-frame processing#717
mgovorcin wants to merge 2 commits into
isce-framework:mainfrom
mgovorcin:azimuth-block

Conversation

@mgovorcin
Copy link
Copy Markdown

@mgovorcin mgovorcin commented May 22, 2026

Adds a block-as-burst pattern so NISAR full-frame GSLCs can be processed in parallel using the existing burst pool.

For burst inputs nothing changes. For NISAR (non-burst) inputs, when input_options.azimuth_blocks > 1 the frame is split into N azimuth blocks at the dispatch layer, each block is processed as a synthetic burst through ProcessPoolExecutor, and per-block outputs are stitched by the existing stitching_bursts.run step. No changes to wrapped_phase.run or the stitcher are required.

Change Summary

  • workflows/_block_split.py (new)split_frame_into_blocks() divides the pixel grid into N azimuth blocks with a halo; _min_halo_rows() auto-computes the minimum safe overlap from half-window, similarity search radius, and stride settings.
  • workflows/config/_common.pyInputOptions gains azimuth_blocks: int = 1 and halo_rows: Optional[int] = None.
  • workflows/displacement.py — extracts dispatch into a testable _prepare_grouped_inputs() function; sets HDF5_USE_FILE_LOCKING=FALSE before spawning workers when multiple processes share the same HDF5 files (prevents hangs on NFS/Lustre).
  • workflows/_utils.py_create_burst_cfg accepts an optional bounds parameter to set output_options.bounds per block.

Typical config for a 4-block run:

input_options:
  azimuth_blocks: 4
worker_settings:
  n_parallel_bursts: 4
  threads_per_worker: 2

CLAUDE NOTES

└── Process #k (one per active "burst" / NISAR block)
    │
    ├── ThreadPoolExecutor (size = workers_per_burst = n_parallel_bursts // active_bursts)
    │   └── threads doing GDAL/HDF5 read-ahead inside EagerLoader
    │
    └── Main thread: per-output-block phase-linking
        ├── JAX/XLA host pool (default = os.cpu_count(), can cap via XLA_FLAGS)
        │   └── parallelizes vmap'd ops across pixels
        └── BLAS via OMP_NUM_THREADS = threads_per_worker
            └── multi-threads eigh, solve, matmul inside JAX kernels

Key insight: n_parallel_bursts is dual-purpose. It's the outer pool size AND a budget that gets re-allocated to inner I/O workers when there are fewer "bursts" than pool slots.

n_parallel_bursts = 8, with N bursts available:
N = 1 (FULL) → 1 active process × 8 inner I/O threads
N = 4 → 4 active processes × 2 inner I/O threads each
N = 8 (BLOCKED) → 8 active processes × 1 inner I/O thread each

FULL vs BLOCKED

Aspect FULL (azimuth_blocks=1) BLOCKED (azimuth_blocks=N)
Processes 1 N (8 typical)
GIL Locked to 1 interpreter Escaped — N interpreters
JAX kernels Parallelize across pixels via vmap (within 1 block) Same per block, run N in parallel across blocks
Python control flow Strictly serial sweep through frame N serial sweeps in parallel
HDF5 reads 1 stream N concurrent streams
JIT compile Once, ~10–30 s N times in parallel, ~30 s wall
Working set Full frame in 1 process ~1/N of frame in each process

The speedup mechanism is N copies of single-threaded work running in parallel.

Each worker still uses ~1 core; you just have N of them running simultaneously on different rows.


Per-Knob Impact

n_parallel_bursts

  • Bigger = more outer pool slots (capped by available bursts)
  • For NISAR: this is the number of blocks running in parallel
  • Set equal to azimuth_blocks for maximum block parallelism
  • Typical sweet spot: both at 8

threads_per_worker

Sets:

  • OMP_NUM_THREADS
  • MKL_NUM_THREADS

Controls BLAS threads inside JAX kernels:

  • eigh
  • solve
  • matmul

Separate from JAX/XLA’s own thread pool.

Typical values:

  • 2 → reasonable default
  • 1 → if CPU constrained
  • 4+ → can help in FULL mode where one process can use more parallelism

JAX / XLA

JAX’s XLA backend maintains its own thread pool, sized by default to:

os.cpu_count()

Local benchmark full frame vs blocked results

displacement.run end-to-end through PS+PL+stitch benchmark run with taskset -c 0-33

GSLCs : 7 (Track 042 Frame 070 Descending orbit)
half_window : 5
num_blocks : 8
n_parallel_bursts : 8 (both modes)
threads_per_worker : 2 (both modes)
block_shape: 2048x2048 (both modes)
active threads FULL : 2
active threads BLK : 16

FULL wall = 10667.3s (177.79 min)
BLOCKED wall = 2411.2s ( 40.19 min)
SPEEDUP (BLOCKED vs FULL) : 4.42×

Checklist

  • The pull request title is a good summary of the changes - it will be used in the changelog
  • Unit tests for the changes exist
  • Tests pass on CI
  • Documentation reflects the changes where applicable
  • My PR is ready to review

mgovorcin and others added 2 commits May 21, 2026 22:22
Splits a single non-OPERA-burst frame (e.g. NISAR GSLC) into N azimuth
blocks processed in parallel as synthetic "bursts". Each block reuses
the full input file list but applies its own output_options.bounds, so
wrapped_phase.run masks per block via _get_mask. The existing burst pool
runs the blocks concurrently and stitching_bursts.run mosaics them.

Changes
-------
- New _block_split.split_frame_into_blocks: computes per-block (bounds,
  epsg) from cfg georef with a minimum-halo formula
    max(half_window_y,
        similarity_search_radius * stride_y,
        (corr_window_y // 2) * stride_y) + 5
  so block edges aren't cropped by SHP/similarity/stitcher windows.
- displacement.py: new module-level _GroupedInputs dataclass and
  _prepare_grouped_inputs helper that detects OPERA vs non-OPERA-burst
  mode from group_by_burst behavior. In non-OPERA mode, optional file
  lists (amp_disp, amp_mean, layover_shadow) are broadcast to every
  block id so second-batch reuse works.
- _create_burst_cfg gains a bounds= kwarg that writes
  output_options.bounds/_epsg per block. OPERA path passes None and is
  unchanged.
- config.InputOptions adds azimuth_blocks: int = 1 (default = no
  splitting, OPERA path unchanged) and halo_rows: int | None.
When running azimuth-block mode with n_parallel_bursts > 1, multiple
worker processes read the same NISAR GSLC (HDF5) files concurrently
at different spatial extents. On NFS/Lustre filesystems this triggers
HDF5 file-locking hangs. Set HDF5_USE_FILE_LOCKING=FALSE (via setdefault,
so an explicit caller setting is honoured) before spawning workers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scottstanie pushed a commit to scottstanie/dolphin that referenced this pull request May 24, 2026
Replaces the cube → per-layer tif export with a single per-layer VRT
shim that points at ``ZARR:cube.zarr:/<var>:i`` as its source. GDAL's
ZARR driver (3.10+) reads the slice natively, so the existing tif-based
downstream pipeline (interferogram formation, stitching, unwrap,
timeseries) sees ordinary 2D rasters with no pixel-data duplication.

Highlights:
- Default `zarr_format=2` for the cube (GDAL's ZARR read path supports
  it cleanly; v3's `bytes` codec isn't implemented on the GDAL read side
  yet). `zarr_format=3` is still selectable via writer kwarg.
- New `emit_layer_vrts`, `zarr_subdataset_uri`, `layer_vrt_path` helpers
  in `dolphin.io._geozarr`. VRTs use absolute paths so they survive the
  per-ministack → top-level rename in `sequential.py`.
- `_ZarrLayerStack.export_tifs` now writes VRT shims instead of dumping
  layer data into tif files. `_make_stack_output` returns `.slc.vrt`
  paths in GEOZARR mode.
- Skip the GeoTIFF-only `repack_rasters` passes in GEOZARR mode (the
  cube is already chunked and compressed; repack would create one tif
  per VRT, undoing the no-duplication win).
- `_ensure_data_array` rejects the reserved coord names (x/y/spatial_ref)
  and checks shape against any existing array of the same name to catch
  accidental writes that would silently corrupt a coord.
- `wrapped_phase.py` cached/skip path and `sequential._get_outputs_from_folder`
  also glob for `.vrt` so GEOZARR runs can resume.

Parallelism: matches PR isce-framework#717. Each "burst" (or synthetic-burst
azimuth-split block) process gets its own work directory and its own
cube — no cross-process zarr coordination. Inside a process the cube
writer drains on a single background thread, the direct analog of
`BackgroundStackWriter`'s single-thread tif feeder, so the parallel
ThreadPoolExecutor in `single.py` sees the same write-side throughput
either way.

Tests updated:
- `test_io_geozarr.py`: added VRT-readable-via-rasterio, VRT-survives-move,
  shape-mismatch rejection, reserved-coord-name rejection, and
  default-is-v2 cases.
- `test_workflows_single.py::test_sequential_geozarr` and
  `test_workflows_displacement.py::test_displacement_run_geozarr` now
  assert *no* per-date `.slc.tif` are written, that VRTs are emitted,
  and that VRT reads round-trip to the cube layer.

https://claude.ai/code/session_01UC8HNQWpNsKBmBNhdMhsLu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant