
perf: prune explicit zeros at matrix assembly (cut build peak memory)#816

Draft
FBumann wants to merge 3 commits into master from perf/prune-zeros-at-source


@FBumann FBumann commented Jul 2, 2026

Collaborator

Intent placeholder — @FBumann to replace with your own words.

Important

Draft — depends on #815. Merge #815 first, then mark this ready.
This targets master, so until #815 merges the diff below also includes #815's commit (the _stack eliminate_zeros()). Once #815 is in, GitHub collapses the diff to just this PR's source-level prune.

Follow-up that completes the zero-drop story on the memory axis.

Note

The following was generated by AI.

Why a second PR

#815 eliminates zeros from the stacked constraint matrix with eliminate_zeros(). That call runs after scipy.sparse.vstack has already materialised the full block, explicit zeros included, so the peak allocation (exactly what the CodSpeed memory instrument tracks) is unchanged; only the resting matrix size and the solver-ingest cost drop. That's why the memory job shows no movement even though the handoff is measurably faster.
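To illustrate the ordering problem, here is a minimal sketch using plain SciPy rather than the project's own `_stack` code. The helper `block_with_explicit_zeros` and the 4x4 block size are made up for the example; the point is only that the stacked matrix holds every explicit zero before `eliminate_zeros()` gets a chance to run.

```python
import numpy as np
import scipy.sparse as sp


def block_with_explicit_zeros(n: int) -> sp.csr_matrix:
    """Hypothetical n x n block that stores every entry; only the diagonal is nonzero."""
    rows = np.repeat(np.arange(n), n)
    cols = np.tile(np.arange(n), n)
    data = np.zeros(n * n)
    data[:: n + 1] = 1.0  # flat positions 0, n+1, 2(n+1), ... form the diagonal
    return sp.csr_matrix((data, (rows, cols)), shape=(n, n))


blocks = [block_with_explicit_zeros(4), block_with_explicit_zeros(4)]

# Stacking copies every stored entry, zeros included, into the result.
stacked = sp.vstack(blocks, format="csr")
nnz_after_stack = stacked.nnz

# Only now are the zeros dropped; the large intermediate already existed.
stacked.eliminate_zeros()
nnz_after_prune = stacked.nnz

print(nnz_after_stack, nnz_after_prune)
```

The stored-entry count falls only after the stack has been allocated, which is consistent with a lower resting size but an unchanged peak.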

Change

Prune at the source: fold coeffs != 0 into the valid-entry mask in Constraint._matrix_export_data, so broadcast zeros never enter cols/data/the CSR at all. Snapshot capture and the matrix build both go through this path, so they stay consistent. #815's eliminate_zeros() stays as cheap safety for the one case a pre-filter can't catch — duplicate variable terms that cancel to zero after sum_duplicates.
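A rough sketch of the idea, not the actual body of `Constraint._matrix_export_data`: the function `export_triplets`, its arguments, and the convention that label `-1` marks a masked term are all assumptions made for illustration. It also shows the cancelling-duplicates case that a pre-filter cannot catch.

```python
import numpy as np
import scipy.sparse as sp


def export_triplets(coeffs, var_labels, prune_zeros=True):
    """Hypothetical export: return (rows, cols, data) for the valid terms.

    coeffs and var_labels have shape (n_rows, n_terms); a label of -1 is
    assumed to mark a masked term.
    """
    valid = var_labels != -1
    if prune_zeros:
        valid &= coeffs != 0  # fold the zero test into the same mask
    row_ids = np.arange(coeffs.shape[0])[:, None]
    rows = np.broadcast_to(row_ids, coeffs.shape)[valid]
    return rows, var_labels[valid], coeffs[valid]


coeffs = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [1.0, -1.0, 0.0]])
labels = np.array([[0, 1, 2],
                   [0, 1, 2],
                   [0, 0, -1]])  # row 2: x0 - x0, duplicates that cancel

rows, cols, data = export_triplets(coeffs, labels)
A = sp.csr_matrix((data, (rows, cols)), shape=(3, 3))  # duplicates are summed
nnz_with_cancelled_entry = A.nnz

# The summed 1.0 + (-1.0) is still stored as an explicit 0.0 until this call.
A.eliminate_zeros()
nnz_final = A.nnz
```

Neither `1.0` nor `-1.0` is zero on its own, so the mask keeps both; only the post-summation `eliminate_zeros()` can remove the resulting entry.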

Two ripples (both handled)

  1. Dual alignment. matrices.dual derived its active rows from stored nnz (np.diff(csr.indptr)). Once zeros are pruned, an all-zero-coefficient row (e.g. 0·x ≤ 5) keeps its row in A but stores no entry, so it would silently lose its dual slot → misaligned dual vector. Fixed by deriving active rows from row activity via a new ConstraintBase.active_row_mask (coefficient-independent; also skips a redundant CSR rebuild).
  2. Warm-start semantics. A zero-coefficient term no longer changes the sparsity pattern, so the persistent-solver path no longer forces a SPARSITY rebuild when one is added — correct, since the matrix is unchanged. test_shape_mismatch_triggers_sparsity_rebuild updated to use a non-zero coefficient (still exercises the real path), and test_zero_coefficient_term_needs_no_rebuild added for the new behaviour.
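The dual-alignment ripple can be reproduced with a toy matrix. This is a sketch under assumptions: the arrays below are invented, and `row_active` is only a stand-in for whatever `ConstraintBase.active_row_mask` returns in the real code.

```python
import numpy as np
import scipy.sparse as sp

# Three constraint rows over two variables; row 1 is 0*x0 <= 5.
rows = np.array([0, 1, 2])
cols = np.array([0, 0, 1])
data = np.array([1.0, 0.0, 2.0])

A_keep = sp.csr_matrix((data, (rows, cols)), shape=(3, 2))
nonzero = data != 0
A_pruned = sp.csr_matrix(
    (data[nonzero], (rows[nonzero], cols[nonzero])), shape=(3, 2)
)

# Old approach: a row counts as active if it stores at least one entry.
mask_keep = np.diff(A_keep.indptr) > 0
mask_pruned = np.diff(A_pruned.indptr) > 0  # row 1 now looks inactive

# Assumed stand-in for active_row_mask: derived from row activity,
# so it does not depend on which coefficients happen to be stored.
row_active = np.array([True, True, True])

solver_duals = np.array([0.5, 0.0, 1.5])  # illustrative, one value per row of A

slots_from_nnz = int(mask_pruned.sum())
slots_from_activity = int(row_active.sum())
```

With the nnz-derived mask there are fewer slots than dual values once zeros are pruned, which is the misalignment described in item 1; the activity-derived mask keeps one slot per row of A.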

Impact

Peak allocation building m.matrices.A — sparse_network(250)

| variant | build peak | final nnz |
|---|---|---|
| master (keep zeros) | 50.1 MB | 1,506,000 |
| #815 (_stack eliminate_zeros) | 49.9 MB | 18,000 |
| this PR (prune at source) | 36.3 MB | 18,000 |

~27% lower peak. The residual is the unavoidable dense-coeffs flatten — removing that needs expression-level sparsity, out of scope here. Solver result is identical; addMConstr/addRows stay ~2× faster from the smaller nnz.
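For reference, the headline percentages follow directly from the table and from the nnz counts quoted in the commit message; the variable names below are just labels for those figures.

```python
stored_nnz_master = 1_506_000  # entries stored on master
structural_nnz = 18_000        # entries left after pruning
peak_master_mb = 50.1
peak_pruned_mb = 36.3

# Share of stored entries that were explicit zeros on master.
zero_fraction = 1 - structural_nnz / stored_nnz_master

# Relative drop in build peak from master to this PR.
peak_reduction = 1 - peak_pruned_mb / peak_master_mb

print(f"{zero_fraction:.1%} zeros, {peak_reduction:.1%} lower peak")
```

This gives roughly 98.8% zeros and a peak reduction of about 27.5%, matching the "~27%" and "98.8%" figures quoted in this PR.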

Full test suite green (3770 passed, 45 skipped).

🤖 Generated with Claude Code

FBumann and others added 2 commits July 2, 2026 15:39
Expressions that broadcast against a dense coordinate store one
coefficient per pair, most of them structurally zero. Those explicit
zeros were carried all the way into `matrices.A` and thus into every
solver handoff. `highspy.Highs.addRows` (and the other direct backends'
matrix loaders) scale with *stored* nnz, so the handoff spent most of
its work describing zeros — e.g. sparse_network(250) stored 1.5M entries
for 18k structural nonzeros (98.8% zeros).

Prune them once, centrally, in `_stack` via `eliminate_zeros()`, so `A`
and `indicator_A` — and hence HiGHS, gurobi, xpress, copt, mosek and the
LP/MPS writers — all hand the solver only structural nonzeros. A zero
coefficient never changes a constraint, so this is mathematically
identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #815. That eliminates zeros from the stacked constraint matrix
after scipy.vstack has already materialised the full dense-with-zeros block,
so the peak allocation (what the CodSpeed memory instrument tracks) is
unchanged — only the resting size and solver-ingest cost drop.

Prune at the source instead: fold `coeffs != 0` into the valid-entry mask in
Constraint._matrix_export_data, so broadcast zeros never enter cols/data/the
CSR. Snapshot capture and the matrix build share this path, so they stay
consistent.

Two ripples handled:
- matrices.dual derived active rows from stored nnz (np.diff(indptr)); an
  all-zero-coefficient row would lose its dual slot once pruned. Derive active
  rows from row activity via a new ConstraintBase.active_row_mask (also avoids
  rebuilding the CSR just to count rows).
- A zero-coefficient term no longer changes the sparsity pattern, so the
  persistent warm-start path no longer forces a SPARSITY rebuild for it. Test
  updated to use a non-zero coefficient; a no-rebuild case added.

sparse_network(250): build peak 50 -> 36 MB (~27%), identical solver result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann changed the base branch from perf/drop-explicit-zeros to master July 2, 2026 14:35
@FBumann FBumann marked this pull request as draft July 2, 2026 14:35
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codspeed-hq Bot commented Jul 2, 2026


Merging this PR will improve performance by 43.25%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.


⚡ 6 improved benchmarks
✅ 167 untouched benchmarks
⏩ 173 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Memory | test_to_solver[highs-sparse_network-n=250] | 64.8 MB | 34.8 MB | +86.16% |
| Memory | test_to_solver[highs-kvl_cycles-severity=50] | 245.6 MB | 160.6 MB | +52.99% |
| Memory | test_to_solver[highs-kvl_cycles-severity=100] | 217 MB | 155 MB | +39.95% |
| Memory | test_to_solver[gurobi-sparse_network-n=250] | 48 MB | 35.1 MB | +37.07% |
| Memory | test_to_solver[gurobi-kvl_cycles-severity=100] | 198.8 MB | 155.4 MB | +27.96% |
| Memory | test_to_solver[gurobi-kvl_cycles-severity=50] | 198.8 MB | 160.9 MB | +23.58% |



Comparing perf/prune-zeros-at-source (97ab692) with master (e861678)


Footnotes

  1. 173 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.
