perf(pipeline): proportional + auto goroutine budget, parallel ProcessList forward write#9762
Open
rahst12 wants to merge 1 commit into
Conversation
…sList forward write Builds on the per-predicate mutation pipeline (dgraph-io#9467). Three additive, flag-gated changes that make a single hot/dominant predicate's apply actually parallelize: 1. Proportional goroutine budget across predicates (allocateWorkers): mutations-pipeline-goroutines=N distributes N workers by edge count (largest-remainder), so a dominant predicate gets >1 worker. 2. Auto mode (=-1): budget = min(GOMAXPROCS*fraction, edges/minEdgesPerWorker), derived at runtime. Tunables mutations-pipeline-goroutines-fraction (1.0) and mutations-pipeline-min-edges-per-worker (256). 3. Parallel forward data-write in ProcessList (the merge-light <pred,uid> pass), matching the existing ProcessSingle split, so [uid]/@reverse predicates' forward write parallelizes. ProcessReverse stays SERIAL. Default is off (0) => byte-identical to the current one-goroutine-per- predicate path. Byte-identical equivalence + conflict-key-set tests pass under -race for scalar, [uid] @reverse, and AUTO==fixed. Benchmarks (8-core dev box, ~20k-edge batch) included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Three small, additive, flag-gated changes on top of this PR's per-predicate mutation pipeline, so a single hot/dominant predicate's apply actually parallelizes (today the pipeline fans one goroutine per predicate, so a single dominant predicate gets ~no intra-predicate speedup).
mutations-pipeline-goroutines=Ndistributes N workers across the predicates of a batch by edge count (largest-remainder), so a dominant predicate gets >1 worker.-1) — budget derived at runtime:min(GOMAXPROCS * fraction, totalEdges / minEdgesPerWorker). Tunables:mutations-pipeline-goroutines-fraction(default 1.0) andmutations-pipeline-min-edges-per-worker(default 256).ProcessList— the merge-light one-to-one<pred,uid>pass, mirroring the existingProcessSinglesplit, so[uid]/@reversepredicates' forward write parallelizes.ProcessReversestays serial (many-to-one, hot-target-prone — parallelizing it is a proven net loss).Why
Profiling a reverse-heavy ingest workload (~50–70%
@reverse [uid]) showed the per-predicate pipeline gives a dominant scalar predicate a good speedup but leaves@reverse [uid]predicates at 1.00× — they route toProcessList, whose forward write was serial. Change #3 closes that gap; #1/#2 make the worker count adaptive instead of one-per-predicate.How to configure
All keys live under the existing
--feature-flagssuperflag (semicolon-separatedkey=valuepairs). Two keys work together — the new intra-predicate parallelism only kicks in when both are set:mutations-pipeline-threshold1>0andlen(edges) >= threshold.0= legacy path entirely (the budget below is then irrelevant).mutations-pipeline-goroutines300or1= one goroutine per predicate (no intra-predicate split).N>1= fixed budget.-1= auto (derive from the two keys below).mutations-pipeline-goroutines-fraction1.0round(GOMAXPROCS * fraction), capped by the next key.mutations-pipeline-min-edges-per-worker256totalEdges / thisso small batches don't over-spawn.Example: alpha with a fixed 64-goroutine budget
dgraph alpha \ --my=localhost:7080 \ --zero=localhost:5080 \ --feature-flags "mutations-pipeline-threshold=1; mutations-pipeline-goroutines=64"Example: alpha with the runtime auto budget (recommended on high-core hardware)
dgraph alpha \ --my=localhost:7080 \ --zero=localhost:5080 \ --feature-flags "mutations-pipeline-threshold=1; mutations-pipeline-goroutines=-1; mutations-pipeline-goroutines-fraction=1.0; mutations-pipeline-min-edges-per-worker=256"Auto sizes the budget to the host at runtime: for a host with
Hcores and a ~20k-edge batch it grantsmin(round(H * fraction), 20000/256)workers, distributed across the batch's predicates by edge count.Disable (legacy one-goroutine-per-predicate)
dgraph alpha --my=localhost:7080 --zero=localhost:5080 \ --feature-flags "mutations-pipeline-goroutines=0"Correctness
mutations-pipeline-goroutines=0(or any value<2) ⇒ byte-identical to the current one-goroutine-per-predicate path.-race) for: scalar predicates,[uid] @reversepredicates, and AUTO-budget == fixed-budget.ProcessReverse,allocateWorkers, and the auto derivation are pure/serial where it matters; per-worker scratch key buffers; lock-free store only for provably-disjoint<pred,uid>keys; all workers join before reverse/index/count passes (MVCC order preserved).TestAllocateWorkers,TestAutoBudget,TestPipelineBudgetByteIdentical,TestProcessListBudgetByteIdentical,TestPipelineBudgetAutoByteIdentical.Benchmarks (8-core dev box — a floor; the win scales with cores)
~20k-edge batch, 5 hot reverse targets (modeling 2–5 hot nodes).
edges/s, speedup vsbudget=0:[uid] @reversedominant@indexdominant (control)The
@reversecase was 1.00× before change #3. The still-serialProcessReversecaps it below the scalar ceiling, as expected. Numbers are a lower bound (8 cores);BenchmarkReverseDominant/BenchmarkReverseFiftyFiftyare included so this can be re-swept on production-class hardware to tunefraction.Notes
fractionlikely differs —autocurrently trails the best fixed budget on a small box. Recommend a sweep on real hardware before defaulting toauto.🤖 Generated with Claude Code