perf(pipeline): proportional + auto goroutine budget, parallel ProcessList forward write by rahst12 · Pull Request #9762 · dgraph-io/dgraph

rahst12 · 2026-06-23T06:18:34Z

What

Three small, additive, flag-gated changes on top of this PR's per-predicate mutation pipeline, so a single hot/dominant predicate's apply actually parallelizes (today the pipeline fans one goroutine per predicate, so a single dominant predicate gets ~no intra-predicate speedup).

Proportional goroutine budget across predicates — mutations-pipeline-goroutines=N distributes N workers across the predicates of a batch by edge count (largest-remainder), so a dominant predicate gets >1 worker.
Auto mode (-1) — budget derived at runtime: min(GOMAXPROCS * fraction, totalEdges / minEdgesPerWorker). Tunables: mutations-pipeline-goroutines-fraction (default 1.0) and mutations-pipeline-min-edges-per-worker (default 256).
Parallel forward data-write in ProcessList — the merge-light one-to-one <pred,uid> pass, mirroring the existing ProcessSingle split, so [uid] / @reverse predicates' forward write parallelizes. ProcessReverse stays serial (many-to-one, hot-target-prone — parallelizing it is a proven net loss).

Why

Profiling a reverse-heavy ingest workload (~50–70% @reverse [uid]) showed the per-predicate pipeline gives a dominant scalar predicate a good speedup but leaves @reverse [uid] predicates at 1.00× — they route to ProcessList, whose forward write was serial. Change #3 closes that gap; #1/#2 make the worker count adaptive instead of one-per-predicate.

How to configure

All keys live under the existing --feature-flags superflag (semicolon-separated key=value pairs). Two keys work together — the new intra-predicate parallelism only kicks in when both are set:

Key	Default (this branch)	Meaning
`mutations-pipeline-threshold`	`1`	Routes a mutation through the pipeline when `>0` and `len(edges) >= threshold`. `0` = legacy path entirely (the budget below is then irrelevant).
`mutations-pipeline-goroutines`	`30`	Intra-predicate worker budget. `0` or `1` = one goroutine per predicate (no intra-predicate split). `N>1` = fixed budget. `-1` = auto (derive from the two keys below).
`mutations-pipeline-goroutines-fraction`	`1.0`	Auto only. Budget = `round(GOMAXPROCS * fraction)`, capped by the next key.
`mutations-pipeline-min-edges-per-worker`	`256`	Auto only. Caps the budget at `totalEdges / this` so small batches don't over-spawn.

Note: this branch currently ships with the budget on (threshold=1, goroutines=30). To get byte-identical legacy behavior, set mutations-pipeline-goroutines=0 (or mutations-pipeline-threshold=0 to bypass the pipeline entirely).

Example: alpha with a fixed 64-goroutine budget

dgraph alpha \
  --my=localhost:7080 \
  --zero=localhost:5080 \
  --feature-flags "mutations-pipeline-threshold=1; mutations-pipeline-goroutines=64"

Example: alpha with the runtime auto budget (recommended on high-core hardware)

dgraph alpha \
  --my=localhost:7080 \
  --zero=localhost:5080 \
  --feature-flags "mutations-pipeline-threshold=1; mutations-pipeline-goroutines=-1; mutations-pipeline-goroutines-fraction=1.0; mutations-pipeline-min-edges-per-worker=256"

Auto sizes the budget to the host at runtime: for a host with H cores and a ~20k-edge batch it grants min(round(H * fraction), 20000/256) workers, distributed across the batch's predicates by edge count.

Disable (legacy one-goroutine-per-predicate)

dgraph alpha --my=localhost:7080 --zero=localhost:5080 \
  --feature-flags "mutations-pipeline-goroutines=0"

Correctness

Setting mutations-pipeline-goroutines=0 (or any value <2) ⇒ byte-identical to the current one-goroutine-per-predicate path.
Byte-identical committed output and identical conflict-key set verified (under -race) for: scalar predicates, [uid] @reverse predicates, and AUTO-budget == fixed-budget.
ProcessReverse, allocateWorkers, and the auto derivation are pure/serial where it matters; per-worker scratch key buffers; lock-free store only for provably-disjoint <pred,uid> keys; all workers join before reverse/index/count passes (MVCC order preserved).
New tests: TestAllocateWorkers, TestAutoBudget, TestPipelineBudgetByteIdentical, TestProcessListBudgetByteIdentical, TestPipelineBudgetAutoByteIdentical.

Benchmarks (8-core dev box — a floor; the win scales with cores)

~20k-edge batch, 5 hot reverse targets (modeling 2–5 hot nodes). edges/s, speedup vs budget=0:

Workload	budget=0	budget=8	budget=32	auto
`[uid] @reverse` dominant	1.00×	1.35×	1.29×	1.15×
50% reverse / 50% scalar	1.00×	1.42×	1.82×	1.39×
scalar `@index` dominant (control)	1.00×	1.23×	1.44×	1.19×

The @reverse case was 1.00× before change #3. The still-serial ProcessReverse caps it below the scalar ceiling, as expected. Numbers are a lower bound (8 cores); BenchmarkReverseDominant / BenchmarkReverseFiftyFifty are included so this can be re-swept on production-class hardware to tune fraction.

Notes

All benchmark numbers are from an 8-core box. On a high-core box the forward-write parallelism has more room, and the optimal fraction likely differs — auto currently trails the best fixed budget on a small box. Recommend a sweep on real hardware before defaulting to auto.
Happy to split this into separate commits (budget / auto / ProcessList) if you'd prefer to land them independently.

🤖 Generated with Claude Code

@reverse

…sList forward write Builds on the per-predicate mutation pipeline (dgraph-io#9467). Three additive, flag-gated changes that make a single hot/dominant predicate's apply actually parallelize: 1. Proportional goroutine budget across predicates (allocateWorkers): mutations-pipeline-goroutines=N distributes N workers by edge count (largest-remainder), so a dominant predicate gets >1 worker. 2. Auto mode (=-1): budget = min(GOMAXPROCS*fraction, edges/minEdgesPerWorker), derived at runtime. Tunables mutations-pipeline-goroutines-fraction (1.0) and mutations-pipeline-min-edges-per-worker (256). 3. Parallel forward data-write in ProcessList (the merge-light <pred,uid> pass), matching the existing ProcessSingle split, so [uid]/@reverse predicates' forward write parallelizes. ProcessReverse stays SERIAL. Default is off (0) => byte-identical to the current one-goroutine-per- predicate path. Byte-identical equivalence + conflict-key-set tests pass under -race for scalar, [uid] @reverse, and AUTO==fixed. Benchmarks (8-core dev box, ~20k-edge batch) included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

rahst12 requested a review from a team as a code owner June 23, 2026 06:18

matthewmcneely self-requested a review June 23, 2026 21:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(pipeline): proportional + auto goroutine budget, parallel ProcessList forward write#9762

perf(pipeline): proportional + auto goroutine budget, parallel ProcessList forward write#9762
rahst12 wants to merge 1 commit into
dgraph-io:harshil-goel/mutation-pipelinefrom
rahst12:rahst12/hybrid-pipeline-reverse-parallel

rahst12 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rahst12 commented Jun 23, 2026

What

Why

How to configure

Example: alpha with a fixed 64-goroutine budget

Example: alpha with the runtime auto budget (recommended on high-core hardware)

Disable (legacy one-goroutine-per-predicate)

Correctness

Benchmarks (8-core dev box — a floor; the win scales with cores)

Notes

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant