perf: scatter groupby-sum terms directly instead of unstacking#793

Open
FBumann wants to merge 2 commits into PyPSA:master from fluxopt:perf/groupby-sum-scatter-upstream
FBumann commented Jun 29, 2026

Collaborator

Note

The following content was generated by AI.

What this does

The fast path of LinearExpression.groupby(...).sum() previously did
ds.unstack(group_dim, fill_value=...) followed by a stack. That
materializes 2–3 intermediate copies of the padded result
(n_groups × max_group_size × nterm) and routes through pandas
MultiIndex machinery sized by the number of elements.
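
To make the size of the padded layout concrete, here is a rough back-of-the-envelope sketch. The helper name `padded_result_bytes`, the example group sizes, and the 8-byte item size are illustrative assumptions, not part of linopy.

```python
import numpy as np


def padded_result_bytes(group_sizes, nterm, itemsize=8):
    """Bytes for one padded array of shape (n_groups, max_group_size, nterm).

    Hypothetical helper for illustration only; itemsize=8 assumes
    float64/int64 entries.
    """
    group_sizes = np.asarray(group_sizes)
    n_groups = group_sizes.size
    max_group_size = int(group_sizes.max())
    return n_groups * max_group_size * nterm * itemsize


# One large group and three small ones: padding is driven by the largest group.
sizes = [1000, 10, 10, 10]
one_padded_copy = padded_result_bytes(sizes, nterm=2)
dense_input = sum(sizes) * 2 * 8  # unpadded input with the same nterm and itemsize
```

With skewed group sizes a single padded copy is already several times larger than the input, so every extra intermediate copy of it adds noticeably to peak memory.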

This change factorizes the groups and scatters coeffs/vars directly
into preallocated padded result arrays; constants are group-summed with
np.add.at. Peak memory drops to input + result (the minimum for the
padded layout), and the grouping itself gets considerably faster. The
result is unchanged: same dims, coords, term ordering and padding.
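
The factorize-and-scatter idea can be sketched in plain NumPy/pandas as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `scatter_groupby_sum`, the fill values (`NaN` for coeffs, `-1` for vars), the element-major term ordering, and the assumption that labels contain no missing values are all choices made for the example.

```python
import numpy as np
import pandas as pd


def scatter_groupby_sum(labels, coeffs, vars_, const, fill_var=-1):
    """Scatter per-element terms into padded per-group arrays (illustrative).

    labels: (n,) group label per element, assumed free of missing values
    coeffs, vars_: (n, nterm) arrays; const: (n,) array
    """
    # Integer code per element plus the sorted unique labels.
    codes, uniques = pd.factorize(np.asarray(labels), sort=True)
    n, nterm = coeffs.shape
    n_groups = len(uniques)
    counts = np.bincount(codes, minlength=n_groups)
    max_size = int(counts.max()) if n else 0

    # Position of each element inside its group, preserving input order.
    order = np.argsort(codes, kind="stable")
    starts = np.cumsum(counts) - counts
    pos = np.empty(n, dtype=np.intp)
    pos[order] = np.arange(n) - np.repeat(starts, counts)

    # Preallocate the padded result once and write each element directly.
    out_coeffs = np.full((n_groups, max_size, nterm), np.nan)
    out_vars = np.full((n_groups, max_size, nterm), fill_var, dtype=vars_.dtype)
    out_coeffs[codes, pos] = coeffs
    out_vars[codes, pos] = vars_

    # Constants are summed per group; np.add.at accumulates repeated indices.
    out_const = np.zeros(n_groups)
    np.add.at(out_const, codes, const)

    # Flatten (group member, term) into one padded term axis.
    return (
        uniques,
        out_coeffs.reshape(n_groups, -1),
        out_vars.reshape(n_groups, -1),
        out_const,
    )


labels = ["b", "a", "b", "a", "b"]
coeffs = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
vars_ = np.array([[10], [11], [12], [13], [14]])
const = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
groups, g_coeffs, g_vars, g_const = scatter_groupby_sum(labels, coeffs, vars_, const)
```

In this example group `a` has two members and group `b` has three, so the row for `a` ends with one padded slot. The only large allocations are the inputs and the padded outputs, which is the "input + result" peak described above.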

The unstack-based implementation is kept as _sum_by_unstack and is
still used for chunked (dask-backed) data, which cannot be scattered into
numpy arrays. NaN group labels now raise an informative ValueError
instead of failing inside unstack.
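
A hedged sketch of the two guards described above: rejecting NaN group labels up front, and falling back to the unstack path for chunked data. The helper names `check_group_labels` and `choose_path`, the error message, and the `chunks`-attribute test are assumptions for illustration; the actual checks in `linopy/expressions.py` may differ.

```python
import numpy as np
import pandas as pd


def check_group_labels(labels):
    """Factorize labels and fail early on missing values (illustrative)."""
    # pd.factorize marks missing values (NaN/None) with the sentinel code -1.
    codes, uniques = pd.factorize(np.asarray(labels, dtype=object))
    n_missing = int((codes == -1).sum())
    if n_missing:
        raise ValueError(
            f"Group labels contain {n_missing} NaN value(s); "
            "drop or fill them before calling groupby(...).sum()."
        )
    return codes, uniques


def choose_path(data):
    """Pick 'unstack' for chunked (dask-like) data, else 'scatter'.

    Assumes chunked containers expose a non-None `chunks` attribute,
    as xarray objects do when they are dask-backed.
    """
    return "unstack" if getattr(data, "chunks", None) is not None else "scatter"


class FakeChunked:
    """Stand-in for a dask-backed array, used only in this example."""

    chunks = ((2, 2),)


try:
    check_group_labels(["a", np.nan, "b"])
    nan_error = None
except ValueError as err:
    nan_error = str(err)

ok_codes, ok_uniques = check_group_labels(["a", "b", "a"])
numpy_path = choose_path(np.zeros(4))
chunked_path = choose_path(FakeChunked())
```

Checking the labels before any scattering keeps the error close to the user's input rather than surfacing from deep inside the reshaping code.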

Notes

  • Self-contained: touches only linopy/expressions.py and adds tests in
    test/test_linear_expression.py (124 new lines).
  • _sum_by_unstack retains the current master names_to_drop logic
    (drop every coordinate aligned to group_dim), so the slow path keeps
    the existing behavior.

Verification

  • pytest test/test_linear_expression.py → 309 passed
  • groupby/sum subset (-k "group or sum or scatter or unstack") → 73 passed
  • ruff check, ruff format --check, mypy linopy/expressions.py → clean

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codspeed-hq Bot commented Jun 29, 2026

Merging this PR will improve performance by ×2.1

⚡ 10 improved benchmarks
✅ 128 untouched benchmarks
⏩ 138 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Memory | test_to_lp[nodal_balance-severity=100] | 17.9 MB | 6 MB | ×3 |
| Memory | test_to_lp[nodal_balance-severity=50] | 9.2 MB | 3.1 MB | ×3 |
| Memory | test_to_lp[nodal_balance-severity=0] | 385.3 KB | 135.3 KB | ×2.8 |
| Memory | test_build[nodal_balance-severity=100] | 32 MB | 12.8 MB | ×2.5 |
| Memory | test_build[nodal_balance-severity=50] | 16.8 MB | 7 MB | ×2.4 |
| Memory | test_to_solver[highs-nodal_balance-severity=100] | 24.9 MB | 13.3 MB | +87.47% |
| Memory | test_to_solver[gurobi-nodal_balance-severity=100] | 25.1 MB | 13.5 MB | +86.1% |
| Memory | test_to_solver[highs-nodal_balance-severity=50] | 12.9 MB | 7.1 MB | +81.68% |
| Memory | test_to_solver[gurobi-nodal_balance-severity=50] | 13.1 MB | 7.3 MB | +79.32% |
| Memory | test_build[nodal_balance-severity=0] | 1.4 MB | 1.2 MB | +19.65% |


Comparing fluxopt:perf/groupby-sum-scatter-upstream (34d1332) with master (4ddf3fb) [2]

Footnotes

  1. 138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on master (1dbde37) during the generation of this report, so 4ddf3fb was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
