Skip to content

feat: Support remote filesystem seeds#765

Open
mikeknep wants to merge 10 commits into
mainfrom
remote-seeds/mknepper
Open

feat: Support remote filesystem seeds#765
mikeknep wants to merge 10 commits into
mainfrom
remote-seeds/mknepper

Conversation

@mikeknep

@mikeknep mikeknep commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

📋 Summary

Adds support for injecting fsspec filesystems into DirectorySeedReader and FileContentsSeedReader so that they can be used in non-local contexts

🔗 Related Issue

Implements this plan

🔄 Changes

  • Introduces FileSystemProvider(Protocol) and a default implementation LocalFileSystemProvider, adding a seam where previously the local filesystem was effectively hardcoded
  • Updates FileSystemSeedSource and its subclasses to not validate dir/file existence upon config object creation, instead deferring that check to validation/read times
  • Refactors FileSystemSeedSource and its subclasses

🧪 Testing

  • make test passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable) -- none added, but existing tests pass
  • Added an implementation for the Nemo Platform Data Designer plugin that uses the Files service's fsspec filesystem and confirmed Directory- and FileContents seeds work. See feat: Support more Data Designer seed sources nemo-platform#413

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@mikeknep mikeknep requested a review from a team as a code owner June 23, 2026 18:20
@github-actions

Copy link
Copy Markdown
Contributor

Fern preview: https://nvidia-preview-pr-765.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces FileSystemProvider as an injectable protocol over the filesystem, allowing DirectorySeedReader and FileContentsSeedReader to operate against non-local backends (e.g., an fsspec-backed remote store). The companion change defers all path resolution and existence validation from config-construction time to read time.

  • Adds FileSystemProvider protocol and LocalFileSystemProvider default; threads the provider through FileSystemSeedReader with backward-compatible initializer handling for existing plugin subclasses that skip super().__init__().
  • Removes early-validation from FileSystemSeedSource (and subclasses), replacing the load-time is_dir() check with a LocalFileSystemProvider.ensure_root_exists call at read time, and adds a SeedReaderConfigErrorInvalidConfigError re-raise in the compiler so errors still surface cleanly.

Confidence Score: 5/5

Safe to merge; the seam is well-isolated, backward compatibility is explicitly handled, and the deferred validation path is fully covered by new tests.

All changed paths have corresponding tests, the LocalFileSystemProvider faithfully reproduces the previous existence-check behavior at read time, and the compiler wraps the new error type correctly. The backward-compat guard for plugins that skip super().init() is a thoughtful addition that avoids breaking existing readers.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py Core change: adds FileSystemProvider protocol, LocalFileSystemProvider default, and threads the provider through FileSystemSeedReader with backward-compat init guard; PurePath widening addressed with a type-narrowing guard in AgentRolloutSeedReader.
packages/data-designer-config/src/data_designer/config/seed_source.py Removes load-time path resolution/existence validation from FileSystemSeedSource; runtime_path now returns the raw path string; AgentRolloutSeedSource correctly falls back to format defaults; all field descriptions updated.
packages/data-designer-engine/src/data_designer/engine/compiler.py Adds SeedReaderConfigError → InvalidConfigError wrapping around get_column_names() so deferred directory-missing errors surface as config errors.
packages/data-designer-config/tests/config/test_seed_source.py Tests updated to reflect deferred validation; new tests for raw path preservation, runtime_path fallback, and plugin subclass inheritance of runtime_path.
packages/data-designer-engine/tests/engine/resources/test_seed_reader.py New tests for injected provider usage, provider call ordering, LocalFileSystemProvider behavior, and read-time vs construction-time path resolution.
packages/data-designer-engine/tests/engine/test_compiler.py Adds test verifying SeedReaderConfigError is re-raised as InvalidConfigError from the compiler.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant FileSystemSeedReader
    participant FileSystemProvider
    participant SeedReaderFileSystemContext

    Note over Caller,FileSystemSeedReader: Construction time (no filesystem access)
    Caller->>FileSystemSeedReader: __init__(fs_provider?)
    FileSystemSeedReader-->>Caller: reader (provider stored)

    Note over Caller,FileSystemSeedReader: Attach time
    Caller->>FileSystemSeedReader: attach(source, resolver)
    FileSystemSeedReader->>FileSystemSeedReader: _reset_attachment_state()
    Note right of FileSystemSeedReader: sets _fs_provider if missing (compat)

    Note over Caller,SeedReaderFileSystemContext: Read / validate time
    Caller->>FileSystemSeedReader: get_column_names()
    FileSystemSeedReader->>FileSystemSeedReader: _get_filesystem_context()
    FileSystemSeedReader->>FileSystemSeedReader: create_filesystem_context(source.runtime_path)
    FileSystemSeedReader->>FileSystemProvider: ensure_root_exists(runtime_path)
    FileSystemProvider-->>FileSystemSeedReader: (raises SeedReaderConfigError if missing)
    FileSystemSeedReader->>FileSystemProvider: create_context(runtime_path)
    FileSystemProvider-->>FileSystemSeedReader: SeedReaderFileSystemContext
    FileSystemSeedReader->>FileSystemSeedReader: build_manifest(context)
    FileSystemSeedReader-->>Caller: column names
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller
    participant FileSystemSeedReader
    participant FileSystemProvider
    participant SeedReaderFileSystemContext

    Note over Caller,FileSystemSeedReader: Construction time (no filesystem access)
    Caller->>FileSystemSeedReader: __init__(fs_provider?)
    FileSystemSeedReader-->>Caller: reader (provider stored)

    Note over Caller,FileSystemSeedReader: Attach time
    Caller->>FileSystemSeedReader: attach(source, resolver)
    FileSystemSeedReader->>FileSystemSeedReader: _reset_attachment_state()
    Note right of FileSystemSeedReader: sets _fs_provider if missing (compat)

    Note over Caller,SeedReaderFileSystemContext: Read / validate time
    Caller->>FileSystemSeedReader: get_column_names()
    FileSystemSeedReader->>FileSystemSeedReader: _get_filesystem_context()
    FileSystemSeedReader->>FileSystemSeedReader: create_filesystem_context(source.runtime_path)
    FileSystemSeedReader->>FileSystemProvider: ensure_root_exists(runtime_path)
    FileSystemProvider-->>FileSystemSeedReader: (raises SeedReaderConfigError if missing)
    FileSystemSeedReader->>FileSystemProvider: create_context(runtime_path)
    FileSystemProvider-->>FileSystemSeedReader: SeedReaderFileSystemContext
    FileSystemSeedReader->>FileSystemSeedReader: build_manifest(context)
    FileSystemSeedReader-->>Caller: column names
Loading

Reviews (4): Last reviewed commit: "Simplify agent rollout default path" | Re-trigger Greptile

@nabinchha

Copy link
Copy Markdown
Contributor

Thanks for putting this together, @mikeknep — really clean seam for plugging in non-local filesystems.

Summary

Introduces a FileSystemProvider protocol (with a default LocalFileSystemProvider) so FileSystemSeedReader subclasses can be backed by remote fsspec filesystems. Existence/resolution checks move out of FileSystemSeedSource config construction and into the provider at validate/read time, with SeedReaderConfigError getting normalized to InvalidConfigError at compile time so the user-facing error story stays the same. Implementation matches the stated intent in the PR description.

Findings

Warnings — Worth addressing

packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py:530-537ensure_root_exists runs unconditionally with the default LocalFileSystemProvider, which can break legacy plugins that override create_filesystem_context

  • What: _get_filesystem_context calls self._get_fs_provider().ensure_root_exists(...) before delegating to self.create_filesystem_context(...). The current Advanced Hooks docs (fern/versions/latest/pages/plugins/filesystem_seed_reader.mdx:166) tell plugin authors to override create_filesystem_context(...) for "custom rooted filesystem behavior" — so a plugin that historically used that hook to swap in, say, an S3-rooted DirFileSystem and didn't also inject an fs_provider will now hit LocalFileSystemProvider.ensure_root_exists, which does Path(runtime_path).expanduser().resolve().is_dir() against a non-local URI and falsely raises SeedReaderConfigError.
  • Why: This silently regresses any existing plugin that used the documented hook for non-local backends, and the failure mode is a confusing "directory does not exist" on a path that does exist (just not locally). Real-world blast radius is probably small (the framework was largely local-only before this PR), but the docs still actively point users at the now-fragile path.
  • Suggestion: Either (a) update the Advanced Hooks section in filesystem_seed_reader.mdx to recommend FileSystemProvider injection as the modern way to swap in custom backends and note that create_filesystem_context overrides alone keep the default local existence check active, or (b) move the existence check into LocalFileSystemProvider.create_context so it only fires for the default local provider and not when a plugin has swapped the context out.

packages/data-designer-engine/tests/engine/resources/test_seed_reader.py — No unit coverage for the new FileSystemProvider injection seam

  • What: The headline behavior of this PR — passing a custom FileSystemProvider into FileSystemSeedReader(fs_provider=...) and having both create_context and ensure_root_exists route through it — isn't exercised by any test. The end-to-end validation via the NeMo Platform PR is great, but the public seam itself isn't pinned down.
  • Why: Without a unit test, a future refactor could quietly drop the __init__ argument, stop threading the provider into _get_filesystem_context, or skip ensure_root_exists, and nothing in this repo would catch it. The contract is small but central to the PR's value.
  • Suggestion: Add a couple of focused tests using a stub FileSystemProvider (a tiny class or Mock(spec=FileSystemProvider)) that asserts: (1) DirectorySeedReader(fs_provider=stub) calls stub.ensure_root_exists and stub.create_context with the source's runtime_path, and (2) the returned SeedReaderFileSystemContext is what gets handed to build_manifest. A complementary LocalFileSystemProvider direct test (happy path + missing-dir raises SeedReaderConfigError) would document the default behavior too.

packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py:270-274 — Base SeedReader.create_filesystem_context is dead-but-public code that bypasses the new injection seam

  • What: SeedReader still carries a hardcoded create_filesystem_context that builds a local DirFileSystem directly. FileSystemSeedReader now always overrides it to delegate to the injected FileSystemProvider, so the base implementation is unreachable through the normal flow — but it's still part of the public API. (Pre-existing leftover, but the new override is what makes it dead and misleading.)
  • Why: Anyone subclassing SeedReader directly (not via FileSystemSeedReader) and calling create_filesystem_context will silently get hardcoded local behavior, completely bypassing the new FileSystemProvider seam this PR is built around. That's a real footgun for plugin authors trying to support non-local filesystems through any code path other than FileSystemSeedReader.
  • Suggestion: Drop the method from the SeedReader base class as part of this cleanup so the only create_filesystem_context on the public surface is the provider-aware one on FileSystemSeedReader. If non-filesystem readers ever need a filesystem context in the future, they should go through a provider too.

packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py:539-544_get_fs_provider lazy fallback hides a load-bearing contract with plugin subclasses

  • What: The getattr(self, "_fs_provider", None) fallback covers two cases at once: (a) defensive recovery if _fs_provider were ever cleared and (b) plugin subclasses that override __init__ without calling super().__init__() (e.g. OnAttachDirectorySeedReader at test_seed_reader.py:79-83). Case (b) is currently the only reason the fallback ever fires.
  • Why: From the code alone it looks like defensive paranoia against an attribute the __init__ literally just set, so a future contributor "tidying up" by replacing getattr(...) with self._fs_provider would silently break every plugin reader that defines its own __init__ without calling super — a class of bug that won't surface until a plugin tries to attach. The contract that plugin authors can skip super().__init__() is invisible in the code.
  • Suggestion: Pick one and make it explicit: (a) add a comment on _get_fs_provider noting the plugin-subclass case is load-bearing, (b) move the _fs_provider assignment into _reset_attachment_state (already called from attach(), so subclasses can keep skipping super().__init__() and the lazy fallback can go away), or (c) require super().__init__() in subclasses, update OnAttachDirectorySeedReader, and document the requirement in filesystem_seed_reader.mdx.

Suggestions — Take it or leave it

packages/data-designer-config/src/data_designer/config/seed_source.py:200-205_validate_filesystem_seed_source_path helper is now a trivial wrapper around a two-line check

  • What: The helper used to do real work (Path(value).expanduser().resolve() + is_dir() check + raise). After this PR it's reduced to if value is None: return None + if not value.strip(): raise + return value, and it still has exactly one caller (FileSystemSeedSource.validate_path).
  • Why: STYLEGUIDE's KISS guidance ("two similar lines is better than a premature helper") was already on the edge for this helper, and the simplification tips it over — validate_path now indirects through a module-level function just to do a strip-and-raise.
  • Suggestion: Inline the body into the @field_validator classmethod and drop the helper. (The sibling _validate_filesystem_seed_source_file_pattern has the same shape, but it's pre-existing and out of scope for this PR.)

packages/data-designer-config/src/data_designer/config/seed_source.py:284-286 — Defensive raise in AgentRolloutSeedSource.runtime_path is now unreachable

  • What: The model_validator(mode="after") at line 271 already guarantees self.path is not None or default_path is not None, so the if default_path is None: raise ValueError(...) inside the runtime_path property can't fire when self.path is None.
  • Why: Harmless, but it adds a code path that looks meaningful and isn't.
  • Suggestion: Either drop the conditional and just return default_path, or leave a # pragma: no cover / short comment explaining it's belt-and-suspenders against a future validator change.

What Looks Good

  • The FileSystemProvider Protocol is exactly the kind of minimal seam this codebase favors — two methods, easy to satisfy, easy to mock. Nice match for the SOLID guidance in STYLEGUIDE.
  • Wrapping SeedReaderConfigError -> InvalidConfigError in the compiler keeps the user-visible error story at compile time even though the actual check now runs at read time. That's the right boundary to do the translation at and the new compiler test (test_seed_reader_config_errors_are_invalid_config_errors) is a tight contract assertion, including the __cause__ chain.
  • The isinstance(root_path, Path) guard in AgentRolloutSeedReader._get_parse_context is a thoughtful belt-and-suspenders defense after the PurePath widening — clear error message, lazy enough that local readers pay nothing for it, and the comment explains exactly why it's there.
  • Test updates are crisp: the rename from *_cache_runtime_path_* to *_preserve_raw_runtime_path_* and from *_load_time_runtime_path_* to *_read_time_runtime_path_* makes the new semantics obvious, and test_filesystem_seed_source_subclass_inherits_runtime_path correctly pins the contract that plugin authors rely on.

Verdict

Needs changes — the ensure_root_exists interaction with create_filesystem_context overrides is worth either fixing or documenting before merge, and a couple of unit tests around the new FileSystemProvider seam would lock in the headline behavior. Everything else is take-or-leave-it.


This review was generated by an AI assistant.

@mikeknep mikeknep force-pushed the remote-seeds/mknepper branch from e6daeff to 590abf0 Compare June 26, 2026 16:24
@mikeknep

Copy link
Copy Markdown
Contributor Author

Thanks for the feedback @nabinchha. I've addressed them in the most recent commits, let me know if you have any followups!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants