-
Notifications
You must be signed in to change notification settings - Fork 24
Add markdown for Schur/matrix-free memory performance in /doc #40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
f77fca2
Add markdown for Schur/matrix-free memory performance in /doc
SEOKWOOPARK df53e70
The rejection parameter 20 -> 30
SEOKWOOPARK eea3ea6
Replace the number of camera parameter from 7 to 10 with the correspo…
SEOKWOOPARK c767927
Remove a necessary expression in Configuration section
SEOKWOOPARK 6c30f0f
Change the title of the markdown focusing on neutral comparison
SEOKWOOPARK 0bbe421
Change the title of the markdown focusing on neutral comparison
SEOKWOOPARK 7dac7e7
Add the context of what Bundle Adjustment optimizers mean
SEOKWOOPARK a088bb4
Change the title of the markdown and remove BA's definition
SEOKWOOPARK File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,165 @@ | ||
| # Comparing GPU Memory: Vanilla LM, Schur-complement Optimizer, and Matrix-free options | ||
|
|
||
| This page benchmarks GPU memory across the two implementation choices `bae` | ||
| offers for that solve: | ||
|
|
||
| - **`LM` vs `Schur`** — solve the full normal equations (`LM`, | ||
| Levenberg–Marquardt) or use the **Schur complement** to reduce them to the | ||
| cameras only (`Schur`). | ||
| - **matrix-free vs materialized** — apply the operator through matrix–vector | ||
| products meaning matrix-free or build it explicitly in memory. | ||
|
|
||
| The sweep runs a set of [BAL](https://grail.cs.washington.edu/projects/bal/) | ||
| problems of increasing size through three configurations —> `Schur` matrix-free, | ||
| `Schur` materialized, and `LM` matrix-free. This records peak and average GPU | ||
| memory and plots it against problem size. | ||
|
|
||
| ## Output Graph | ||
|
|
||
|  | ||
|
|
||
| *Peak (left) and average (right) GPU memory allocation across the three | ||
| optimizer configurations, plotted against problem size.* | ||
|
|
||
| ## Background: Schur complement & matrix-free | ||
|
|
||
| ### 1) Schur complement | ||
|
|
||
| Each Bundle Adjustment step solves a large sparse linear system — the normal | ||
| equations `H Δx = g`, with `H = JᵀJ`. The unknowns split into two groups: | ||
| **camera** parameters (poses + intrinsics) and **3D point** parameters. Ordering | ||
| them that way gives `H` a 2×2 block structure (symbols match `Schur.step`): | ||
|
|
||
| ``` | ||
| [ U W ] [ Δc ] [ g_c ] | ||
| [ Wᵀ V ] [ Δp ] = [ g_p ] | ||
| ``` | ||
|
|
||
| - `U = J0ᵀ J0` — camera block | ||
| - `V = J1ᵀ J1` — point block, **block-diagonal**: each 3D point touches only its | ||
| own observations, so `V⁻¹` is cheap to apply | ||
| - `W = J0ᵀ J1` — camera–point coupling | ||
|
|
||
| Because `V` is trivially invertible, the point unknowns can be eliminated, | ||
| leaving a much smaller system in the cameras only — the **Schur complement** | ||
| system: | ||
|
|
||
| ``` | ||
| S Δc = g_c − W V⁻¹ g_p , where S = U − W V⁻¹ Wᵀ | ||
| ``` | ||
|
|
||
| ``` | ||
| Δp = V⁻¹ (g_p − Wᵀ Δc) | ||
| ``` | ||
|
|
||
|
|
||
| ### 2) Matrix-free | ||
|
|
||
| The operator `S` can be handled two ways, which is what the `matrix_free_normal` | ||
| flag selects. The branch lives in [`bae/optim/optimizer.py`](../bae/optim/optimizer.py) | ||
| — inside `Schur.step` (and `LM.step` for the non-Schur path): | ||
|
|
||
| - **`False` — normal path** (the *Schur, normal* condition) — the matrices are **materialized**: `W`, | ||
| `Wᵀ`, the product `W V⁻¹ Wᵀ`, and the assembled `S` are built explicitly as | ||
| sparse matrices and handed to the solver. | ||
| - **`True` — matrix-free path** (the *Schur, matrix-free* condition) — `S` (and even `W`) is **never | ||
| formed**. CG only needs the result of `S·x` for a vector `x`, so the operator | ||
| is applied on the fly as a chain of sparse matrix–vector products (`J0`, `J1`, | ||
| `V⁻¹`, …) inside `schur_matvec`. | ||
|
|
||
| Iterative solvers like CG never inspect the matrix entries — they only multiply | ||
| the matrix by a vector — so the matrix-free form computes the **same** step while | ||
| **avoiding the memory of storing `W`, `Wᵀ`, and `S`**. That saved memory grows | ||
| with problem size, which is exactly what the plots show: the matrix-free | ||
| conditions (*Schur, matrix-free* and *LM, matrix-free*) sit well below | ||
| *Schur, normal*. | ||
|
|
||
| The same flag also applies to the plain `LM` optimizer, where it swaps the | ||
| explicitly built `JᵀJ` matrix for the matrix-free `NormalMatVec` operator. | ||
|
|
||
| ## What it measures | ||
|
|
||
| | Metric | Meaning | | ||
| | --- | --- | | ||
| | Number of parameters | The x-axis count: 3 per 3D point (xyz position) plus 10 per camera (7 SE(3) pose + 3 intrinsics f, k1, k2) — i.e. 3 × (number of 3D points) + 10 × (number of cameras). With `OPTIMIZE_INTRINSICS = True` all 10 camera parameters are optimized, so `num_params` counts `NUM_CAMERA_PARAMS = 10` per camera. | | ||
| | Peak GPU memory | The peak allocation in MiB — the max of the background sampler, `torch.cuda.max_memory_allocated`, and the Warp mempool high-water mark | | ||
| | Average GPU memory | The mean allocation in MiB over the run, sampled every 5 ms | | ||
| | Initial / final loss | Sum-of-squared-residuals loss (`model.loss`) before and after the 20 LM steps — a convergence sanity check (not plotted) | | ||
| | Per-pixel loss | Final mean squared reprojection error per observation (`least_square_error`) = final loss ÷ number of observations | | ||
|
|
||
| GPU memory is the sum of **PyTorch** allocations | ||
| and the **Warp** memory pool captured by a | ||
| background `MemorySampler` thread with interval 5ms. | ||
|
|
||
| ## Conditions | ||
|
|
||
| Optimizer | `matrix_free_normal` | Color | | ||
| --- | --- | --- | | ||
| `Schur` | `True` | green | | ||
| `Schur` | `False` | orange | | ||
| `LM` | `True` | red | | ||
|
|
||
| The optimizer setup (shared by all conditions) is: | ||
|
|
||
| - Strategy: `TrustRegion(up=2.0, down=0.5**4)` | ||
| - Solver: `PCG(tol=1e-4, maxiter=250)` | ||
| - `reject=30` | ||
| - Model: `Residual` from [`examples/schur.py`](../examples/schur.py), optimizing | ||
| camera params + 3D points (**3 per point**: xyz). Intrinsics are included | ||
| (`OPTIMIZE_INTRINSICS = True` → 10 camera params per camera). | ||
|
|
||
| This whole setup about the `TrustRegion` / `PCG` / `reject=30` configuration and the | ||
| `Residual` model is taken from the reference example | ||
| [`examples/schur.py`](../examples/schur.py). | ||
|
|
||
| ## How it works | ||
|
|
||
| The three benchmark conditions all optimize the same reprojection `Residual` | ||
| model from [`examples/schur.py`](../examples/schur.py), and trace a path from the | ||
| `LM` baseline to `Schur` with matrix-free normal equations — each step lowers GPU | ||
| memory (the mechanism behind each is detailed in *Background* above): | ||
|
|
||
| 1. **`LM` → `Schur`** — instead of solving the full normal equations, `Schur` | ||
| eliminates the block-diagonal point block `V` and only solves the much | ||
| smaller camera-only *reduced* system. Fewer unknowns reach the CG solver and | ||
| the reduced system is better conditioned. | ||
| 2. **`+ matrix-free`** — the reduced operator `S` is applied through | ||
| matrix–vector products instead of being materialized, so `W`, `Wᵀ`, and `S` | ||
| are never stored. This is where the bulk of the memory saving comes from. | ||
|
|
||
| Measured on the largest benchmark problem (`venice/problem-1778`, ≈ 3.0 M | ||
| optimizable parameters): | ||
|
|
||
| | Condition | Peak | Average | | ||
| | --- | --- | --- | | ||
| | LM, matrix-free | ~10.3 GiB | ~5.8 GiB | | ||
| | Schur, normal | ~9.9 GiB | ~7.7 GiB | | ||
| | Schur, matrix-free | **~5.8 GiB** | **~2.3 GiB** | | ||
|
|
||
| So **Schur with matrix-free** uses about **43% less peak** memory than the | ||
| **LM (matrix-free)** baseline (~4.5 GiB less) and **~61% less average**. Notably, | ||
| **Schur with the normal (materialized) path** has the *highest* average of the | ||
| three — building `W`/`Wᵀ`/`S` explicitly keeps that memory occupied for the | ||
| entire solve — so it is the **matrix-free** half that makes Schur pay off, not | ||
| the complement alone. **Schur with matrix-free** also has the flattest | ||
| memory-vs-size trend (~2.0 vs ~3.6 MiB per 1 K parameters at peak), so the gap | ||
| keeps widening as problems grow. | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Constant | Value | Purpose | | ||
| | --- | --- | --- | | ||
| | `TARGET_PROBLEMS` | ladybug / trafalgar / dubrovnik / venice list | target datasets | | ||
| | `CONDITIONS` | 3 configs | Optimizer / matrix-free combinations to compare | | ||
| | `DEVICE` | `"cuda"` | Compute device | | ||
| | `NUM_ITERATIONS` | `20` | LM steps per run | | ||
| | `OPTIMIZE_INTRINSICS` | `True` | Optimize the 3 camera intrinsics → `NUM_CAMERA_PARAMS = 10` (7 pose + 3 intrinsics) | | ||
| | `POINT_PARAMS` | `3` | Parameters per 3D point (xyz position) | | ||
| | `MEMORY_SAMPLE_INTERVAL_S` | `0.005` | Background memory sampling interval | | ||
|
|
||
| ## Requirements | ||
|
|
||
| - CUDA-capable GPU (the script asserts a Warp CUDA device is available) | ||
| - PyTorch, NumPy, Matplotlib, [Warp](https://github.com/NVIDIA/warp) (`warp-lang`) | ||
| - `bae` installed (see the repo [README](../README.md)); BAL problems are | ||
| downloaded/cached on first use via `datapipes.bal_loader.get_problem`. | ||
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.