Optional cross-worker lock for shared adaptor repo#1416
Conversation
|
ok @stuartc having looked a bit more closely - this implementation MUST move into the engine The engine already has an What is the sentinel stuff all about? |
| const locksDir = path.join(repoDir, '.locks'); | ||
| const sentinelsDir = path.join(repoDir, '.sentinels'); | ||
|
|
||
| const ensureDirs = (async () => { |
There was a problem hiding this comment.
This is weird. Why not remove the wrapper and just await the mkdir calls?
| logger.debug(`acquired install lock for ${specifier}`); | ||
|
|
||
| try { | ||
| const [hasSentinel, hasPkg] = await Promise.all([ |
There was a problem hiding this comment.
As we're going to move this to the engine, the engine already has an isInstalled function, which determines whether an adaptor is installed or not.
Rather than implement that test twice - as we're doing here - I'd like tot ry and re-use that logic if possible.
Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs across multiple workers sharing one repo directory (e.g. an NFS mount or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache, so the cache-hit path stays lock-free. Off by default; requires --repo-dir. Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead holder's stale window expires before waiters give up, making cold- start of N pods against an empty repo recoverable rather than fatal.
Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST), parallelise fileExists pairs in handleIsInstalled and the post-lock double-check, collapse trivial if/return, drop unused default export, and fix the worker test harness' mode list comment plus an any-typed logger.
Migrate the cross-worker adaptor install lock out of ws-worker and into the engine's autoinstall flow. The lock now lives next to the code that actually performs the install. - New `withInstallLock` primitive in engine-multi's util - `autoinstall.lockRepo` option (default on) wraps the install call - Strengthen `isInstalled` to also stat node_modules/<alias>/package.json, catching half-installed adaptors left by a crashed worker - Drop the sentinel cache; npm's per-package extraction is atomic so the node_modules entry is itself proof of completion - ws-worker CLI flag renamed to `--no-repo-lock` (WORKER_NO_REPO_LOCK) - Rework tests to use tmpdirs + esmock'd runtime install, dropping the handleInstall/handleIsInstalled injection hooks entirely
The pending map was left behind by a 2023 restructure that replaced per-adaptor promise dedup with the module-level queue + busy flag. Nothing writes to it; the delete on error is a no-op.
Defence-in-depth: refuse aliases containing `..` or absolute paths before composing the .locks/<alias>.lock path. Upstream whitelist filtering already restricts specifiers, so this guards against a future bug there becoming a path-traversal write.
Try the lock with retries: 0 first. On ELOCKED, log one info line so k8s operators can see why a worker is sat waiting on cold start, then retry with the full budget. Uncontended acquisitions stay silent.
Yargs auto-negates --no-foo into { foo: false }, so declaring the option
as 'no-repo-lock' meant --no-repo-lock produced { repoLock: false } and
left args.noRepoLock undefined. The downstream lockRepo check thus
stayed true and the flag did nothing.
Rename the option to its positive form 'repo-lock' (default true). yargs'
auto-negation now correctly sets repoLock: false when --no-repo-lock is
passed. WORKER_NO_REPO_LOCK env var still disables the lock.
Also add defence-in-depth path-traversal guard on the lock alias.
e968b73 to
5509997
Compare
ensureLockTarget already creates the parent directory recursively, so the explicit mkdir of locksDir was dead work.
The previous substring check would also reject innocuous names like 'foo..bar'. Split on path separators so only '..' as a real path component is rejected.
Hoist processQueue and doAutoinstall to module scope and pass context/repoDir/logger/useLock through the queue entry instead of relying on the first caller's closure. Removes a latent foot-gun if anyone later adds per-run autoinstall/repoDir overrides or proxies AUTOINSTALL_COMPLETE/ERROR through createWorkflowEvents.
Pure whitespace; line-wrap long template strings and object literals so `pnpm test:format` (CI) passes.
|
Thanks @josephjclark - all fair points, addressed. Lock lives in the engine now. New file Sentinels are gone. Those were the No duplicated
Two files as you suggested: A few smaller things came out of the review:
Changeset updated to declare both |
josephjclark
left a comment
There was a problem hiding this comment.
Posting as far as I've got to but not done looking through this.
I like that things feel cleaner and less impactful here. You've got a lot of the logic and testing contained to repo-lock.ts
Worried that the test framework looks very heavy and intense. I suppose it has to be for something like this, which is very process-y. I'll take a closer look and have a think about it it.
I'd love to see some kind of integration test for this in integration-tests/worker. I don't know if we could build something reliably - but we do already have a bunch of autoinstalll tests. But maybe we need the process tests because actual autoinstall tests are always going to be flaky?
| '@openfn/ws-worker': minor | ||
| --- | ||
|
|
||
| Add a filesystem-based lock so multiple workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC). The lock lives inside engine-multi's autoinstall and serialises installs across processes via a per-adaptor lockfile. It is enabled by default and can be disabled with `WORKER_NO_REPO_LOCK=true` (or `--no-repo-lock`) on ws-worker. The engine's `isInstalled` check has also been tightened to verify the adaptor's `node_modules` entry exists on disk, not just the repo `package.json` dependency entry — this catches half-installed adaptors left behind by a crashed worker. |
There was a problem hiding this comment.
I prefer the release notes to be way more terse. No explanation, no justification - just the headline. Maybe a migration guide if breaking but obviously that doesn't apply here
| ), | ||
| repoDir: setArg(args.repoDir, WORKER_REPO_DIR), | ||
| repoLock: | ||
| args.repoLock ?? |
There was a problem hiding this comment.
This doesn't work with setArg?
I expect setArg(args.repoLock, WORKER_NO_REPO_LOCK, true) to work
I suppose we might need a bit more logic to track the boolean but I doubt it, and we don't need to be strict or fussy
| const lockRepo = args.repoLock !== false; | ||
| if (lockRepo && !args.repoDir) { | ||
| logger.warn( | ||
| 'WARNING: repo lock is enabled but --repo-dir is not set; lock will be a no-op' |
There was a problem hiding this comment.
repodir has a default value so I don't think this is true?
| }; | ||
|
|
||
| const pending: Record<string, Promise<void>> = {}; | ||
| // Per-entry options are pinned at enqueue time so the worker that drains the |
There was a problem hiding this comment.
Is this comment valid? Looks random?
| if (useLock) { | ||
| // Re-check inside the lock so we skip work a peer worker | ||
| // completed while we were waiting. | ||
| await withInstallLock(repoDir, getAliasedName(a), logger, async () => { |
There was a problem hiding this comment.
hmm. Why is the last argument a callback? We can't just let the promise return?
| fn: () => Promise<void> | ||
| ): Promise<void> => { | ||
| // Defence-in-depth: refuse aliases that could escape the .locks directory. | ||
| // Upstream whitelist filtering should make this unreachable. |
There was a problem hiding this comment.
Should it? Where?
If this is true I expect to see some shared logic which sanitises aliases to be safe
| @@ -0,0 +1,43 @@ | |||
| /** | |||
There was a problem hiding this comment.
This is weird. Just stick the test in repo-lock.test.ts so that a) it can be seen and b) there isn't this weird "othering" of unit tests
josephjclark
left a comment
There was a problem hiding this comment.
I've been looking through a bit more closely at the code and tests. I really don't love those tests: there's a lot of business logic in that worker and I'm a bit nervous that we're testing test code rather than actual logic.
I'm taking a break for now but I think there are a few things here we should talk about before I look deeper
| } | ||
|
|
||
| const locksDir = path.join(repoDir, '.locks'); | ||
| const target = path.join(locksDir, `${alias}.lock`); |
There was a problem hiding this comment.
since locksDir isn't used anywhere else you can simplify to
const target = path.join(repoDir, '.locks', `${alias}.lock`);
| const locksDir = path.join(repoDir, '.locks'); | ||
| const target = path.join(locksDir, `${alias}.lock`); | ||
|
|
||
| await ensureLockTarget(target); |
There was a problem hiding this comment.
Hold up, this function creates a lock file at the target path. But that's exactly what lockfile.lock does for us right?
So how does that work? Doesn't lockfile just assume the file is always locked because we've just created the lock?
| ); | ||
|
|
||
| try { | ||
| release = await lockfile.lock(target, { |
There was a problem hiding this comment.
So we call lockfile.lock twice here. Once I guess to see if we can aquire the lock - and if we can't, we print this log line and then try again with a number of retries?
First observation is that we're then in effect retrying the lock LOCK_RETRY_OPTIONS + 1 times
Second observation is - is this worth it? Are we better calling lockfile.check() first, and then doing the log-and-retry call if a lock already exists?
| const LOCK_INTERVAL_MS = 2_000; | ||
| const UPDATE_MS = 5_000; | ||
|
|
||
| const LOCK_RETRY_OPTIONS = { |
There was a problem hiding this comment.
Surprises me to see this declared as a constant. I get the convention but an object isn't really a constant. Just feels wierd.
Also none of this is configurable? The lock uses hard-coded values and that's it? Not a blocker for me but we might want to open an issue to make this configurable later.
Not because I want to configure it today - but because it's weird that I am unable to.
Short Description
Adds a filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same
npm install.Fixes #1414
Implementation Details
When the repo directory is shared between worker pods, two workers can otherwise hit
installfor the same adaptor at the same time and corrupt each other'snode_modules. The lock lives inengine-multi's autoinstall flow — that's where the install actually happens, so that's where the coordination belongs.How it works:
withInstallLockhelper inpackages/engine-multi/src/util/repo-lock.tsacquires a per-adaptor lockfile under<repoDir>/.locks/<alias>.lockviaproper-lockfile. Different adaptors don't block each other.installcall with the lock. Inside the lock it re-checksisInstalledand skips the install if another worker has just finished it.isInstalledis now stricter: it ANDs the existing repopackage.jsondependency check with a stat ofnode_modules/<alias>/package.json. npm extracts each package atomically, so the presence of that file is itself proof of completion. This also fixes a latent bug where a worker killed mid-install could leave the repopackage.jsonsaying "installed" whilenode_modulesis half-written, and the next worker would happily import broken code.isInstalledreturns true without touching the lock directory.installthrows, the lock is released and the next worker will retry.proper-lockfile's mtime heartbeat updates every 5s and another worker considers a lock stale after 5 min. The retry ceiling (6 min) sits above that, so when a pod dies mid-install its lock expires before waiting workers give up.On by default. Disable with
--no-repo-lock/WORKER_NO_REPO_LOCK=true. Single-worker deployments don't need to do anything to opt out — they'll just never hit contention and so never log anything new.New dep on
proper-lockfile@4.1.2(withgraceful-fs,retry,signal-exittransitives) inengine-multi. All checked for CVEs — clean.QA Notes
Interesting cases to exercise (with
--repo-dirset):--repo-dirstarting a run that needs the same adaptor at the same time — only onenpm installshould actually run; the other should see the install completed during its wait and skip.--no-repo-lock, multi-worker behaviour should be unchanged from before this PR.--repo-lockshould be indistinguishable from before — no contention, no extra logs.There's a multi-process test suite using
child_process.forkthat exercises the cross-worker cases deterministically —pnpm test test/util/repo-lock.test.tsfrompackages/engine-multi.K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.
AI Usage
You can read more details in our
Responsible AI Policy