diff --git a/doc/memory_performance.md b/doc/memory_performance.md
new file mode 100644
index 0000000..d8c783c
--- /dev/null
+++ b/doc/memory_performance.md
@@ -0,0 +1,165 @@
+# Comparing GPU Memory: Vanilla LM, Schur-complement Optimizer, and Matrix-free options
+
+This page benchmarks GPU memory across the two implementation choices `bae`
+offers for that solve:
+
+- **`LM` vs `Schur`** — solve the full normal equations (`LM`,
+  Levenberg–Marquardt) or use the **Schur complement** to reduce them to the
+  cameras only (`Schur`).
+- **matrix-free vs materialized** — apply the operator through matrix–vector
+  products meaning matrix-free or build it explicitly in memory.
+
+The sweep runs a set of [BAL](https://grail.cs.washington.edu/projects/bal/)
+problems of increasing size through three configurations —> `Schur` matrix-free,
+`Schur` materialized, and `LM` matrix-free. This records peak and average GPU
+memory and plots it against problem size.
+
+## Output Graph
+
+![Peak and average GPU memory vs. number of optimizable parameters, for the three optimizer configurations](memory_performance.png)
+
+*Peak (left) and average (right) GPU memory allocation across the three
+optimizer configurations, plotted against problem size.*
+
+## Background: Schur complement & matrix-free
+
+### 1) Schur complement
+
+Each Bundle Adjustment step solves a large sparse linear system — the normal
+equations `H Δx = g`, with `H = JᵀJ`. The unknowns split into two groups:
+**camera** parameters (poses + intrinsics) and **3D point** parameters. Ordering
+them that way gives `H` a 2×2 block structure (symbols match `Schur.step`):
+
+```
+[ U   W ] [ Δc ]   [ g_c ]
+[ Wᵀ  V ] [ Δp ] = [ g_p ]
+```
+
+- `U = J0ᵀ J0` — camera block
+- `V = J1ᵀ J1` — point block, **block-diagonal**: each 3D point touches only its
+  own observations, so `V⁻¹` is cheap to apply
+- `W = J0ᵀ J1` — camera–point coupling
+
+Because `V` is trivially invertible, the point unknowns can be eliminated,
+leaving a much smaller system in the cameras only — the **Schur complement**
+system:
+
+```
+S Δc = g_c − W V⁻¹ g_p ,     where   S = U − W V⁻¹ Wᵀ
+```
+
+```
+Δp = V⁻¹ (g_p − Wᵀ Δc)
+```
+
+
+### 2) Matrix-free
+
+The operator `S` can be handled two ways, which is what the `matrix_free_normal`
+flag selects. The branch lives in [`bae/optim/optimizer.py`](../bae/optim/optimizer.py)
+— inside `Schur.step` (and `LM.step` for the non-Schur path):
+
+- **`False` — normal path** (the *Schur, normal* condition) — the matrices are **materialized**: `W`,
+  `Wᵀ`, the product `W V⁻¹ Wᵀ`, and the assembled `S` are built explicitly as
+  sparse matrices and handed to the solver.
+- **`True` — matrix-free path** (the *Schur, matrix-free* condition) — `S` (and even `W`) is **never
+  formed**. CG only needs the result of `S·x` for a vector `x`, so the operator
+  is applied on the fly as a chain of sparse matrix–vector products (`J0`, `J1`,
+  `V⁻¹`, …) inside `schur_matvec`.
+
+Iterative solvers like CG never inspect the matrix entries — they only multiply
+the matrix by a vector — so the matrix-free form computes the **same** step while
+**avoiding the memory of storing `W`, `Wᵀ`, and `S`**. That saved memory grows
+with problem size, which is exactly what the plots show: the matrix-free
+conditions (*Schur, matrix-free* and *LM, matrix-free*) sit well below
+*Schur, normal*.
+
+The same flag also applies to the plain `LM` optimizer, where it swaps the
+explicitly built `JᵀJ` matrix for the matrix-free `NormalMatVec` operator.
+
+## What it measures
+
+| Metric | Meaning |
+| --- | --- |
+| Number of parameters | The x-axis count: 3 per 3D point (xyz position) plus 10 per camera (7 SE(3) pose + 3 intrinsics f, k1, k2) — i.e. 3 × (number of 3D points) + 10 × (number of cameras). With `OPTIMIZE_INTRINSICS = True` all 10 camera parameters are optimized, so `num_params` counts `NUM_CAMERA_PARAMS = 10` per camera. |
+| Peak GPU memory | The peak allocation in MiB — the max of the background sampler, `torch.cuda.max_memory_allocated`, and the Warp mempool high-water mark |
+| Average GPU memory | The mean allocation in MiB over the run, sampled every 5 ms |
+| Initial / final loss | Sum-of-squared-residuals loss (`model.loss`) before and after the 20 LM steps — a convergence sanity check (not plotted) |
+| Per-pixel loss | Final mean squared reprojection error per observation (`least_square_error`) = final loss ÷ number of observations |
+
+GPU memory is the sum of **PyTorch** allocations 
+and the **Warp** memory pool captured by a
+background `MemorySampler` thread with interval 5ms.
+
+## Conditions
+
+Optimizer | `matrix_free_normal` | Color |
+--- | --- | --- |
+`Schur` | `True` | green |
+`Schur` | `False` | orange |
+`LM` | `True` | red |
+
+The optimizer setup (shared by all conditions) is:
+
+- Strategy: `TrustRegion(up=2.0, down=0.5**4)`
+- Solver: `PCG(tol=1e-4, maxiter=250)`
+- `reject=30`
+- Model: `Residual` from [`examples/schur.py`](../examples/schur.py), optimizing
+  camera params + 3D points (**3 per point**: xyz). Intrinsics are included
+  (`OPTIMIZE_INTRINSICS = True` → 10 camera params per camera).
+
+This whole setup about the `TrustRegion` / `PCG` / `reject=30` configuration and the
+`Residual` model is taken from the reference example
+[`examples/schur.py`](../examples/schur.py).
+
+## How it works
+
+The three benchmark conditions all optimize the same reprojection `Residual`
+model from [`examples/schur.py`](../examples/schur.py), and trace a path from the
+`LM` baseline to `Schur` with matrix-free normal equations — each step lowers GPU
+memory (the mechanism behind each is detailed in *Background* above):
+
+1. **`LM` → `Schur`** — instead of solving the full normal equations, `Schur`
+   eliminates the block-diagonal point block `V` and only solves the much
+   smaller camera-only *reduced* system. Fewer unknowns reach the CG solver and
+   the reduced system is better conditioned.
+2. **`+ matrix-free`** — the reduced operator `S` is applied through
+   matrix–vector products instead of being materialized, so `W`, `Wᵀ`, and `S`
+   are never stored. This is where the bulk of the memory saving comes from.
+
+Measured on the largest benchmark problem (`venice/problem-1778`, ≈ 3.0 M
+optimizable parameters):
+
+| Condition | Peak | Average |
+| --- | --- | --- |
+| LM, matrix-free | ~10.3 GiB | ~5.8 GiB |
+| Schur, normal | ~9.9 GiB | ~7.7 GiB |
+| Schur, matrix-free | **~5.8 GiB** | **~2.3 GiB** |
+
+So **Schur with matrix-free** uses about **43% less peak** memory than the
+**LM (matrix-free)** baseline (~4.5 GiB less) and **~61% less average**. Notably,
+**Schur with the normal (materialized) path** has the *highest* average of the
+three — building `W`/`Wᵀ`/`S` explicitly keeps that memory occupied for the
+entire solve — so it is the **matrix-free** half that makes Schur pay off, not
+the complement alone. **Schur with matrix-free** also has the flattest
+memory-vs-size trend (~2.0 vs ~3.6 MiB per 1 K parameters at peak), so the gap
+keeps widening as problems grow.
+
+## Configuration
+
+| Constant | Value | Purpose |
+| --- | --- | --- |
+| `TARGET_PROBLEMS` | ladybug / trafalgar / dubrovnik / venice list | target datasets |
+| `CONDITIONS` | 3 configs | Optimizer / matrix-free combinations to compare |
+| `DEVICE` | `"cuda"` | Compute device |
+| `NUM_ITERATIONS` | `20` | LM steps per run |
+| `OPTIMIZE_INTRINSICS` | `True` | Optimize the 3 camera intrinsics → `NUM_CAMERA_PARAMS = 10` (7 pose + 3 intrinsics) |
+| `POINT_PARAMS` | `3` | Parameters per 3D point (xyz position) |
+| `MEMORY_SAMPLE_INTERVAL_S` | `0.005` | Background memory sampling interval |
+
+## Requirements
+
+- CUDA-capable GPU (the script asserts a Warp CUDA device is available)
+- PyTorch, NumPy, Matplotlib, [Warp](https://github.com/NVIDIA/warp) (`warp-lang`)
+- `bae` installed (see the repo [README](../README.md)); BAL problems are
+  downloaded/cached on first use via `datapipes.bal_loader.get_problem`.
diff --git a/doc/memory_performance.png b/doc/memory_performance.png
new file mode 100644
index 0000000..eb5dc12
Binary files /dev/null and b/doc/memory_performance.png differ