Optional cross-worker lock for shared adaptor repo by stuartc · Pull Request #1416 · OpenFn/kit

stuartc · 2026-05-19T13:38:26Z

Short Description

Adds a filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same npm install.

Fixes #1414

Implementation Details

When the repo directory is shared between worker pods, two workers can otherwise hit install for the same adaptor at the same time and corrupt each other's node_modules. The lock lives in engine-multi's autoinstall flow — that's where the install actually happens, so that's where the coordination belongs.

How it works:

A new withInstallLock helper in packages/engine-multi/src/util/repo-lock.ts acquires a per-adaptor lockfile under <repoDir>/.locks/<alias>.lock via proper-lockfile. Different adaptors don't block each other.
Autoinstall wraps its existing install call with the lock. Inside the lock it re-checks isInstalled and skips the install if another worker has just finished it.
The engine's isInstalled is now stricter: it ANDs the existing repo package.json dependency check with a stat of node_modules/<alias>/package.json. npm extracts each package atomically, so the presence of that file is itself proof of completion. This also fixes a latent bug where a worker killed mid-install could leave the repo package.json saying "installed" while node_modules is half-written, and the next worker would happily import broken code.
The cache-hit path stays lock-free — isInstalled returns true without touching the lock directory.
If install throws, the lock is released and the next worker will retry.
Stale-lock recovery: proper-lockfile's mtime heartbeat updates every 5s and another worker considers a lock stale after 5 min. The retry ceiling (6 min) sits above that, so when a pod dies mid-install its lock expires before waiting workers give up.
When a worker has to wait for the lock, one info-level log line fires so the cause of a slow cold-start is visible without enabling debug logging.

On by default. Disable with --no-repo-lock / WORKER_NO_REPO_LOCK=true. Single-worker deployments don't need to do anything to opt out — they'll just never hit contention and so never log anything new.

New dep on proper-lockfile@4.1.2 (with graceful-fs, retry, signal-exit transitives) in engine-multi. All checked for CVEs — clean.

QA Notes

Interesting cases to exercise (with --repo-dir set):

Two workers pointing at the same --repo-dir starting a run that needs the same adaptor at the same time — only one npm install should actually run; the other should see the install completed during its wait and skip.
A worker killed mid-install (SIGKILL) — the next worker should recover after the stale window (5 min) rather than getting stuck forever.
Different adaptors should install in parallel, not serialise.
With --no-repo-lock, multi-worker behaviour should be unchanged from before this PR.
Single-worker behaviour with the default --repo-lock should be indistinguishable from before — no contention, no extra logs.

There's a multi-process test suite using child_process.fork that exercises the cross-worker cases deterministically — pnpm test test/util/repo-lock.test.ts from packages/engine-multi.

K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.

AI Usage

I have used Claude Code
I have used another model
I have not used AI

You can read more details in our
Responsible AI Policy

josephjclark · 2026-05-20T10:29:13Z

ok @stuartc having looked a bit more closely - this implementation MUST move into the engine

The engine already has an autoinstall.ts. You should just extend that to use the lockfile stuff before calling the runtime's install. I think you probably want the lock and install logic in two different files, but that's up to you.

What is the sentinel stuff all about?

josephjclark · 2026-05-20T10:07:37Z

+  const locksDir = path.join(repoDir, '.locks');
+  const sentinelsDir = path.join(repoDir, '.sentinels');
+
+  const ensureDirs = (async () => {


This is weird. Why not remove the wrapper and just await the mkdir calls?

josephjclark · 2026-05-20T10:30:32Z

+    logger.debug(`acquired install lock for ${specifier}`);
+
+    try {
+      const [hasSentinel, hasPkg] = await Promise.all([


As we're going to move this to the engine, the engine already has an isInstalled function, which determines whether an adaptor is installed or not.

Rather than implement that test twice - as we're doing here - I'd like tot ry and re-use that logic if possible.

Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs across multiple workers sharing one repo directory (e.g. an NFS mount or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache, so the cache-hit path stays lock-free. Off by default; requires --repo-dir. Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead holder's stale window expires before waiters give up, making cold- start of N pods against an empty repo recoverable rather than fatal.

Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST), parallelise fileExists pairs in handleIsInstalled and the post-lock double-check, collapse trivial if/return, drop unused default export, and fix the worker test harness' mode list comment plus an any-typed logger.

Migrate the cross-worker adaptor install lock out of ws-worker and into the engine's autoinstall flow. The lock now lives next to the code that actually performs the install. - New `withInstallLock` primitive in engine-multi's util - `autoinstall.lockRepo` option (default on) wraps the install call - Strengthen `isInstalled` to also stat node_modules/<alias>/package.json, catching half-installed adaptors left by a crashed worker - Drop the sentinel cache; npm's per-package extraction is atomic so the node_modules entry is itself proof of completion - ws-worker CLI flag renamed to `--no-repo-lock` (WORKER_NO_REPO_LOCK) - Rework tests to use tmpdirs + esmock'd runtime install, dropping the handleInstall/handleIsInstalled injection hooks entirely

The pending map was left behind by a 2023 restructure that replaced per-adaptor promise dedup with the module-level queue + busy flag. Nothing writes to it; the delete on error is a no-op.

Defence-in-depth: refuse aliases containing `..` or absolute paths before composing the .locks/<alias>.lock path. Upstream whitelist filtering already restricts specifiers, so this guards against a future bug there becoming a path-traversal write.

Try the lock with retries: 0 first. On ELOCKED, log one info line so k8s operators can see why a worker is sat waiting on cold start, then retry with the full budget. Uncontended acquisitions stay silent.

Yargs auto-negates --no-foo into { foo: false }, so declaring the option as 'no-repo-lock' meant --no-repo-lock produced { repoLock: false } and left args.noRepoLock undefined. The downstream lockRepo check thus stayed true and the flag did nothing. Rename the option to its positive form 'repo-lock' (default true). yargs' auto-negation now correctly sets repoLock: false when --no-repo-lock is passed. WORKER_NO_REPO_LOCK env var still disables the lock. Also add defence-in-depth path-traversal guard on the lock alias.

ensureLockTarget already creates the parent directory recursively, so the explicit mkdir of locksDir was dead work.

The previous substring check would also reject innocuous names like 'foo..bar'. Split on path separators so only '..' as a real path component is rejected.

Hoist processQueue and doAutoinstall to module scope and pass context/repoDir/logger/useLock through the queue entry instead of relying on the first caller's closure. Removes a latent foot-gun if anyone later adds per-run autoinstall/repoDir overrides or proxies AUTOINSTALL_COMPLETE/ERROR through createWorkflowEvents.

Pure whitespace; line-wrap long template strings and object literals so `pnpm test:format` (CI) passes.

stuartc · 2026-05-22T13:49:14Z

Thanks @josephjclark - all fair points, addressed.

Lock lives in the engine now. New file packages/engine-multi/src/util/repo-lock.ts exports a single withInstallLock(repoDir, alias, logger, fn) helper - pure lock primitive, nothing else. Autoinstall wraps its existing install call with it. The worker doesn't touch lock files at all any more; it just flips autoinstall.lockRepo via a --no-repo-lock flag. Lock is on by default.

Sentinels are gone. Those were the .sentinels/<alias>.done marker files - written atomically once npm install returned successfully, so the install handler could check them as a "this finished cleanly" signal. They're redundant: npm extracts each package atomically, so the presence of node_modules/<alias>/package.json is enough. I've tightened the engine's existing isInstalled to AND its package.json dependency check with a stat of that file - so it's now a check of disk state rather than just npm's book-keeping. That catches manual node_modules cleanup, NFS visibility lag on shared volumes, and probably some mid-install crash cases too (haven't dug into npm's exact write ordering to be sure). Benefits every deployment, not just locked ones.

No duplicated isInstalled. Engine's isInstalled is the single source of truth now - initially this ws-worker version had its own sentinel + node_modules check inline, which duplicated the engine's pkg-deps check. The in-lock double-check (re-run isInstalled after acquiring, so a worker that queued behind another one's install sees the install has completed during the wait and skips its own attempt) is preserved - just calling the engine's check rather than a parallel implementation.

ensureDirs IIFE is gone - replaced with a small ensureLockTarget(target) helper that mkdirs the parent and touches the lock file. No IIFE-on-first-call dance.

Two files as you suggested: util/repo-lock.ts (lock primitive) and api/autoinstall.ts (orchestration + install). Runtime is untouched - engine still calls the runtime's install exactly as before, just inside a lock.

A few smaller things came out of the review:

Security/alias guard in withInstallLock - refuses aliases where .. appears as a path segment, or that are absolute, on top of the upstream whitelist.
Contention is logged at info level (one line, only when actually contended) so ops can see why a worker is sat waiting on cold start without needing debug logs on.
Dropped a dead pending map in autoinstall.ts left behind by your 2023 single-queue refactor - declared but never written to, just a stray delete pending[a] in the error path.
Caught a yargs auto-negation bug while wiring the flag up: --no-repo-lock was being parsed as { repoLock: false } rather than { noRepoLock: true }, so the original wiring was a no-op. Reversed the option and let yargs do the negation properly. Tests now cover all four combinations of CLI flag + env var.
Pinned per-call options (context/repoDir/logger/useLock) on the autoinstall queue entry. They were captured in the first caller's closure under the original code, which is fine today (engine is a singleton with stable options) but would silently misbehave the moment anyone wires per-run autoinstall overrides or proxies AUTOINSTALL_COMPLETE/ERROR through createWorkflowEvents. Cheap to fix now, expensive to debug later.

Changeset updated to declare both @openfn/engine-multi and @openfn/ws-worker as minors.

josephjclark

Posting as far as I've got to but not done looking through this.

I like that things feel cleaner and less impactful here. You've got a lot of the logic and testing contained to repo-lock.ts

Worried that the test framework looks very heavy and intense. I suppose it has to be for something like this, which is very process-y. I'll take a closer look and have a think about it it.

I'd love to see some kind of integration test for this in integration-tests/worker. I don't know if we could build something reliably - but we do already have a bunch of autoinstalll tests. But maybe we need the process tests because actual autoinstall tests are always going to be flaky?

josephjclark · 2026-05-26T17:23:45Z

+'@openfn/ws-worker': minor
+---
+
+Add a filesystem-based lock so multiple workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC). The lock lives inside engine-multi's autoinstall and serialises installs across processes via a per-adaptor lockfile. It is enabled by default and can be disabled with `WORKER_NO_REPO_LOCK=true` (or `--no-repo-lock`) on ws-worker. The engine's `isInstalled` check has also been tightened to verify the adaptor's `node_modules` entry exists on disk, not just the repo `package.json` dependency entry — this catches half-installed adaptors left behind by a crashed worker.


I prefer the release notes to be way more terse. No explanation, no justification - just the headline. Maybe a migration guide if breaking but obviously that doesn't apply here

josephjclark · 2026-05-26T17:26:49Z

    ),
    repoDir: setArg(args.repoDir, WORKER_REPO_DIR),
+    repoLock:
+      args.repoLock ??


This doesn't work with setArg?

I expect setArg(args.repoLock, WORKER_NO_REPO_LOCK, true) to work

I suppose we might need a bit more logic to track the boolean but I doubt it, and we don't need to be strict or fussy

josephjclark · 2026-05-26T17:27:21Z

+  const lockRepo = args.repoLock !== false;
+  if (lockRepo && !args.repoDir) {
+    logger.warn(
+      'WARNING: repo lock is enabled but --repo-dir is not set; lock will be a no-op'


repodir has a default value so I don't think this is true?

josephjclark · 2026-05-26T17:29:26Z

 };

-const pending: Record<string, Promise<void>> = {};
+// Per-entry options are pinned at enqueue time so the worker that drains the


Is this comment valid? Looks random?

josephjclark · 2026-05-26T17:31:14Z

+      if (useLock) {
+        // Re-check inside the lock so we skip work a peer worker
+        // completed while we were waiting.
+        await withInstallLock(repoDir, getAliasedName(a), logger, async () => {


hmm. Why is the last argument a callback? We can't just let the promise return?

josephjclark · 2026-05-26T17:35:44Z

+  fn: () => Promise<void>
+): Promise<void> => {
+  // Defence-in-depth: refuse aliases that could escape the .locks directory.
+  // Upstream whitelist filtering should make this unreachable.


Should it? Where?

If this is true I expect to see some shared logic which sanitises aliases to be safe

josephjclark · 2026-05-26T17:40:53Z

@@ -0,0 +1,43 @@
+/**


This is weird. Just stick the test in repo-lock.test.ts so that a) it can be seen and b) there isn't this weird "othering" of unit tests

josephjclark

I've been looking through a bit more closely at the code and tests. I really don't love those tests: there's a lot of business logic in that worker and I'm a bit nervous that we're testing test code rather than actual logic.

I'm taking a break for now but I think there are a few things here we should talk about before I look deeper

josephjclark · 2026-05-27T13:51:08Z

+  }
+
+  const locksDir = path.join(repoDir, '.locks');
+  const target = path.join(locksDir, `${alias}.lock`);


since locksDir isn't used anywhere else you can simplify to

const target = path.join(repoDir, '.locks', `${alias}.lock`);

josephjclark · 2026-05-27T13:59:17Z

+  const locksDir = path.join(repoDir, '.locks');
+  const target = path.join(locksDir, `${alias}.lock`);
+
+  await ensureLockTarget(target);


Hold up, this function creates a lock file at the target path. But that's exactly what lockfile.lock does for us right?

So how does that work? Doesn't lockfile just assume the file is always locked because we've just created the lock?

josephjclark · 2026-05-27T14:02:48Z

+    );
+
+    try {
+      release = await lockfile.lock(target, {


So we call lockfile.lock twice here. Once I guess to see if we can aquire the lock - and if we can't, we print this log line and then try again with a number of retries?

First observation is that we're then in effect retrying the lock LOCK_RETRY_OPTIONS + 1 times

Second observation is - is this worth it? Are we better calling lockfile.check() first, and then doing the log-and-retry call if a lock already exists?

josephjclark · 2026-05-27T14:05:49Z

+const LOCK_INTERVAL_MS = 2_000;
+const UPDATE_MS = 5_000;
+
+const LOCK_RETRY_OPTIONS = {


Surprises me to see this declared as a constant. I get the convention but an object isn't really a constant. Just feels wierd.

Also none of this is configurable? The lock uses hard-coded values and that's it? Not a blocker for me but we might want to open an issue to make this configurable later.

Not because I want to configure it today - but because it's weird that I am unable to.

taylordowns2000 added this to Core May 19, 2026

github-project-automation Bot moved this to New Issues in Core May 19, 2026

josephjclark reviewed May 19, 2026

View reviewed changes

Comment thread packages/ws-worker/src/start.ts Outdated

josephjclark reviewed May 19, 2026

View reviewed changes

Comment thread packages/ws-worker/src/util/repo-lock.ts Outdated

josephjclark reviewed May 20, 2026

View reviewed changes

stuartc added 7 commits May 22, 2026 10:48

refactor: remove dead pending map from autoinstall

e52257e

The pending map was left behind by a 2023 restructure that replaced per-adaptor promise dedup with the module-level queue + busy flag. Nothing writes to it; the delete on error is a no-op.

feat: log install-lock contention at info level

09503ae

Try the lock with retries: 0 first. On ELOCKED, log one info line so k8s operators can see why a worker is sat waiting on cold start, then retry with the full budget. Uncontended acquisitions stay silent.

stuartc force-pushed the 1414-distributed-adaptor-install-lock branch from e968b73 to 5509997 Compare May 22, 2026 11:37

stuartc added 4 commits May 22, 2026 14:59

refactor: drop redundant mkdir in withInstallLock

81eefab

ensureLockTarget already creates the parent directory recursively, so the explicit mkdir of locksDir was dead work.

refactor: only reject '..' as a path component in install-lock alias

f5c4abf

The previous substring check would also reject innocuous names like 'foo..bar'. Split on path separators so only '..' as a real path component is rejected.

style: apply prettier to repo-lock and autoinstall

dc8f04f

Pure whitespace; line-wrap long template strings and object literals so `pnpm test:format` (CI) passes.

stuartc requested a review from josephjclark May 22, 2026 13:48

stuartc marked this pull request as ready for review May 23, 2026 11:31

josephjclark reviewed May 26, 2026

View reviewed changes

josephjclark reviewed May 27, 2026

View reviewed changes

Conversation

stuartc commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Short Description

Implementation Details

QA Notes

AI Usage

Uh oh!

Uh oh!

Uh oh!

josephjclark commented May 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuartc commented May 22, 2026

Uh oh!

josephjclark left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

josephjclark left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

stuartc commented May 19, 2026 •

edited

Loading