Skip to content

Fix HTTP 500 in filestore getMetadata racing with concurrent delete#4367

Open
epugh wants to merge 2 commits intoapache:mainfrom
epugh:fix/filestore-getmetadata-delete-race
Open

Fix HTTP 500 in filestore getMetadata racing with concurrent delete#4367
epugh wants to merge 2 commits intoapache:mainfrom
epugh:fix/filestore-getmetadata-delete-race

Conversation

@epugh
Copy link
Copy Markdown
Contributor

@epugh epugh commented Apr 26, 2026

I disocvered this issue while working on #4249 (SOLR-16341: fix blank file zip handling).

ClusterFileStore.getMetadata could return a 500 when a file was deleted between fileStore.list(...) returning the entry and convertToResponse calling FileDetails.size()/getTimeStamp() on it. The underlying Files.size()/Files.getLastModifiedTime() would throw NoSuchFileException, which DistribFileStore wrapped as SolrException(SERVER_ERROR).

Reproducible via beasting TestDistribFileStore.testFileStoreManagement (seen ~1/20 with seed 30453DE51CF44AB7, US-ASCII, security manager).

Fix:

  • DistribFileStore.FileInfo: catch NoSuchFileException in size() (returns -1) and getTimeStamp() (returns null) so callers can distinguish a concurrent-delete race from a real I/O failure.
  • ClusterFileStore.convertToResponse: returns null when size() < 0, signalling the entry vanished mid-call.
  • ClusterFileStore.getMetadata: filters null entries from the DIRECTORY listing; the FILE/METADATA branch already maps an empty result to null, which now also covers the race case so the response matches NOFILE semantics (singletonMap(path, null)).

ClusterFileStore.getMetadata could return a 500 when a file was deleted
between fileStore.list(...) returning the entry and convertToResponse
calling FileDetails.size()/getTimeStamp() on it. The underlying
Files.size()/Files.getLastModifiedTime() would throw NoSuchFileException,
which DistribFileStore wrapped as SolrException(SERVER_ERROR).

Reproducible via beasting TestDistribFileStore.testFileStoreManagement
(seen ~1/20 with seed 30453DE51CF44AB7, US-ASCII, security manager).

Fix:
- DistribFileStore.FileInfo: catch NoSuchFileException in size()
  (returns -1) and getTimeStamp() (returns null) so callers can
  distinguish a concurrent-delete race from a real I/O failure.
- ClusterFileStore.convertToResponse: returns null when size() < 0,
  signalling the entry vanished mid-call.
- ClusterFileStore.getMetadata: filters null entries from the
  DIRECTORY listing; the FILE/METADATA branch already maps an empty
  result to null, which now also covers the race case so the response
  matches NOFILE semantics (singletonMap(path, null)).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses a filestore concurrency race where GET /cluster/filestore/metadata... could intermittently return HTTP 500 if a file is deleted between directory listing and attribute reads, aiming to make the response match “not found” semantics instead of erroring.

Changes:

  • Catch NoSuchFileException in DistribFileStore.FileInfo attribute accessors (size(), getTimeStamp()), returning sentinel values for concurrent-delete races.
  • Update ClusterFileStore.getMetadata/convertToResponse to treat vanished entries as absent (including filtering nulls from directory listings).
  • Add an unreleased changelog entry describing the behavioral fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
solr/core/src/java/org/apache/solr/filestore/DistribFileStore.java Converts concurrent-delete NoSuchFileException into sentinel return values for size/timestamp reads.
solr/core/src/java/org/apache/solr/filestore/ClusterFileStore.java Treats sentinel size as “vanished” and filters null entries from directory listings.
changelog/unreleased/PR#4367-filestore-getmetadata-delete-race.yml Documents the fix for release notes.
Comments suppressed due to low confidence (2)

solr/core/src/java/org/apache/solr/filestore/ClusterFileStore.java:268

  • convertToResponse calls details.getMetaData() twice. Besides doing duplicate I/O, this can still surface the same concurrent-delete race as an unhandled RuntimeException (e.g., DistribFileStore wraps metadata read failures in RuntimeException), causing the HTTP 500 this PR is trying to eliminate. Fetch the metadata once into a local variable and handle the case where metadata can’t be read because the entry disappeared (treat it as null/vanished).
    entryMetadata.timestamp = details.getTimeStamp();
    if (details.getMetaData() != null) {
      details.getMetaData().toMap(entryMetadata.unknownProperties());
    }

solr/core/src/java/org/apache/solr/filestore/DistribFileStore.java:316

  • FileDetails.size() and getTimeStamp() now use sentinel values (-1 / null) to indicate a concurrent-delete race. Since FileStore.FileDetails doesn’t document these return values, it’s easy for future callers to misinterpret them as valid data. Consider documenting this contract (ideally on FileStore.FileDetails) or at least adding a clarifying comment here indicating the sentinel values are part of an intentional API convention.
        @Override
        public Date getTimeStamp() {
          try {
            return new Date(Files.getLastModifiedTime(realPath()).toMillis());
          } catch (NoSuchFileException e) {
            // File was deleted concurrently between listing and reading its attributes.
            return null;
          } catch (IOException e) {
            throw new SolrException(
                SERVER_ERROR, "Failed to retrieve the last modified time for: " + realPath(), e);
          }
        }

        @Override
        public boolean isDir() {
          return type == FileType.DIRECTORY;
        }

        @Override
        public long size() {
          try {
            return Files.size(realPath());
          } catch (NoSuchFileException e) {
            // File was deleted concurrently between listing and reading its attributes.
            return -1;
          } catch (IOException e) {
            throw new SolrException(
                SERVER_ERROR, "Failed to retrieve the file size for: " + realPath(), e);
          }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

type: fixed # added, changed, fixed, deprecated, removed, dependency_update, security, other
authors:
- name: Eric Pugh
links: []
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entries in this directory typically include a link (PR and/or JIRA) under links: rather than an empty list. Please add an appropriate link entry so the change is traceable from the release notes.

Suggested change
links: []
links:
- https://github.com/apache/solr/pull/4367

Copilot uses AI. Check for mistakes.
Comment on lines +264 to 265
entryMetadata.size = size;
entryMetadata.timestamp = details.getTimeStamp();
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convertToResponse treats size < 0 as a concurrent-delete race, but getTimeStamp() can also now return null on the same race if the file is deleted after size() succeeds. To keep the response consistent with NOFILE semantics, consider treating a null timestamp as a vanished entry as well (return null from convertToResponse).

Suggested change
entryMetadata.size = size;
entryMetadata.timestamp = details.getTimeStamp();
final var timestamp = details.getTimeStamp();
if (timestamp == null) {
// File was deleted concurrently between reading its size and timestamp.
return null;
}
entryMetadata.size = size;
entryMetadata.timestamp = timestamp;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants