Skip to content

perf(pipeline): proportional + auto goroutine budget, parallel ProcessList forward write#9762

Open
rahst12 wants to merge 1 commit into
dgraph-io:harshil-goel/mutation-pipelinefrom
rahst12:rahst12/hybrid-pipeline-reverse-parallel
Open

perf(pipeline): proportional + auto goroutine budget, parallel ProcessList forward write#9762
rahst12 wants to merge 1 commit into
dgraph-io:harshil-goel/mutation-pipelinefrom
rahst12:rahst12/hybrid-pipeline-reverse-parallel

Conversation

@rahst12

@rahst12 rahst12 commented Jun 23, 2026

Copy link
Copy Markdown

What

Three small, additive, flag-gated changes on top of this PR's per-predicate mutation pipeline, so a single hot/dominant predicate's apply actually parallelizes (today the pipeline fans one goroutine per predicate, so a single dominant predicate gets ~no intra-predicate speedup).

  1. Proportional goroutine budget across predicatesmutations-pipeline-goroutines=N distributes N workers across the predicates of a batch by edge count (largest-remainder), so a dominant predicate gets >1 worker.
  2. Auto mode (-1) — budget derived at runtime: min(GOMAXPROCS * fraction, totalEdges / minEdgesPerWorker). Tunables: mutations-pipeline-goroutines-fraction (default 1.0) and mutations-pipeline-min-edges-per-worker (default 256).
  3. Parallel forward data-write in ProcessList — the merge-light one-to-one <pred,uid> pass, mirroring the existing ProcessSingle split, so [uid] / @reverse predicates' forward write parallelizes. ProcessReverse stays serial (many-to-one, hot-target-prone — parallelizing it is a proven net loss).

Why

Profiling a reverse-heavy ingest workload (~50–70% @reverse [uid]) showed the per-predicate pipeline gives a dominant scalar predicate a good speedup but leaves @reverse [uid] predicates at 1.00× — they route to ProcessList, whose forward write was serial. Change #3 closes that gap; #1/#2 make the worker count adaptive instead of one-per-predicate.

How to configure

All keys live under the existing --feature-flags superflag (semicolon-separated key=value pairs). Two keys work together — the new intra-predicate parallelism only kicks in when both are set:

Key Default (this branch) Meaning
mutations-pipeline-threshold 1 Routes a mutation through the pipeline when >0 and len(edges) >= threshold. 0 = legacy path entirely (the budget below is then irrelevant).
mutations-pipeline-goroutines 30 Intra-predicate worker budget. 0 or 1 = one goroutine per predicate (no intra-predicate split). N>1 = fixed budget. -1 = auto (derive from the two keys below).
mutations-pipeline-goroutines-fraction 1.0 Auto only. Budget = round(GOMAXPROCS * fraction), capped by the next key.
mutations-pipeline-min-edges-per-worker 256 Auto only. Caps the budget at totalEdges / this so small batches don't over-spawn.

Note: this branch currently ships with the budget on (threshold=1, goroutines=30). To get byte-identical legacy behavior, set mutations-pipeline-goroutines=0 (or mutations-pipeline-threshold=0 to bypass the pipeline entirely).

Example: alpha with a fixed 64-goroutine budget

dgraph alpha \
  --my=localhost:7080 \
  --zero=localhost:5080 \
  --feature-flags "mutations-pipeline-threshold=1; mutations-pipeline-goroutines=64"

Example: alpha with the runtime auto budget (recommended on high-core hardware)

dgraph alpha \
  --my=localhost:7080 \
  --zero=localhost:5080 \
  --feature-flags "mutations-pipeline-threshold=1; mutations-pipeline-goroutines=-1; mutations-pipeline-goroutines-fraction=1.0; mutations-pipeline-min-edges-per-worker=256"

Auto sizes the budget to the host at runtime: for a host with H cores and a ~20k-edge batch it grants min(round(H * fraction), 20000/256) workers, distributed across the batch's predicates by edge count.

Disable (legacy one-goroutine-per-predicate)

dgraph alpha --my=localhost:7080 --zero=localhost:5080 \
  --feature-flags "mutations-pipeline-goroutines=0"

Correctness

  • Setting mutations-pipeline-goroutines=0 (or any value <2) ⇒ byte-identical to the current one-goroutine-per-predicate path.
  • Byte-identical committed output and identical conflict-key set verified (under -race) for: scalar predicates, [uid] @reverse predicates, and AUTO-budget == fixed-budget.
  • ProcessReverse, allocateWorkers, and the auto derivation are pure/serial where it matters; per-worker scratch key buffers; lock-free store only for provably-disjoint <pred,uid> keys; all workers join before reverse/index/count passes (MVCC order preserved).
  • New tests: TestAllocateWorkers, TestAutoBudget, TestPipelineBudgetByteIdentical, TestProcessListBudgetByteIdentical, TestPipelineBudgetAutoByteIdentical.

Benchmarks (8-core dev box — a floor; the win scales with cores)

~20k-edge batch, 5 hot reverse targets (modeling 2–5 hot nodes). edges/s, speedup vs budget=0:

Workload budget=0 budget=8 budget=32 auto
[uid] @reverse dominant 1.00× 1.35× 1.29× 1.15×
50% reverse / 50% scalar 1.00× 1.42× 1.82× 1.39×
scalar @index dominant (control) 1.00× 1.23× 1.44× 1.19×

The @reverse case was 1.00× before change #3. The still-serial ProcessReverse caps it below the scalar ceiling, as expected. Numbers are a lower bound (8 cores); BenchmarkReverseDominant / BenchmarkReverseFiftyFifty are included so this can be re-swept on production-class hardware to tune fraction.

Notes

  • All benchmark numbers are from an 8-core box. On a high-core box the forward-write parallelism has more room, and the optimal fraction likely differs — auto currently trails the best fixed budget on a small box. Recommend a sweep on real hardware before defaulting to auto.
  • Happy to split this into separate commits (budget / auto / ProcessList) if you'd prefer to land them independently.

🤖 Generated with Claude Code

…sList forward write

Builds on the per-predicate mutation pipeline (dgraph-io#9467). Three additive,
flag-gated changes that make a single hot/dominant predicate's apply
actually parallelize:

1. Proportional goroutine budget across predicates (allocateWorkers):
   mutations-pipeline-goroutines=N distributes N workers by edge count
   (largest-remainder), so a dominant predicate gets >1 worker.
2. Auto mode (=-1): budget = min(GOMAXPROCS*fraction, edges/minEdgesPerWorker),
   derived at runtime. Tunables mutations-pipeline-goroutines-fraction (1.0)
   and mutations-pipeline-min-edges-per-worker (256).
3. Parallel forward data-write in ProcessList (the merge-light <pred,uid>
   pass), matching the existing ProcessSingle split, so [uid]/@reverse
   predicates' forward write parallelizes. ProcessReverse stays SERIAL.

Default is off (0) => byte-identical to the current one-goroutine-per-
predicate path. Byte-identical equivalence + conflict-key-set tests pass
under -race for scalar, [uid] @reverse, and AUTO==fixed. Benchmarks
(8-core dev box, ~20k-edge batch) included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rahst12 rahst12 requested a review from a team as a code owner June 23, 2026 06:18
@matthewmcneely matthewmcneely self-requested a review June 23, 2026 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant