Fix HTTP 500 in filestore getMetadata racing with concurrent delete#4367
Fix HTTP 500 in filestore getMetadata racing with concurrent delete#4367epugh wants to merge 2 commits intoapache:mainfrom
Conversation
ClusterFileStore.getMetadata could return a 500 when a file was deleted between fileStore.list(...) returning the entry and convertToResponse calling FileDetails.size()/getTimeStamp() on it. The underlying Files.size()/Files.getLastModifiedTime() would throw NoSuchFileException, which DistribFileStore wrapped as SolrException(SERVER_ERROR). Reproducible via beasting TestDistribFileStore.testFileStoreManagement (seen ~1/20 with seed 30453DE51CF44AB7, US-ASCII, security manager). Fix: - DistribFileStore.FileInfo: catch NoSuchFileException in size() (returns -1) and getTimeStamp() (returns null) so callers can distinguish a concurrent-delete race from a real I/O failure. - ClusterFileStore.convertToResponse: returns null when size() < 0, signalling the entry vanished mid-call. - ClusterFileStore.getMetadata: filters null entries from the DIRECTORY listing; the FILE/METADATA branch already maps an empty result to null, which now also covers the race case so the response matches NOFILE semantics (singletonMap(path, null)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Addresses a filestore concurrency race where GET /cluster/filestore/metadata... could intermittently return HTTP 500 if a file is deleted between directory listing and attribute reads, aiming to make the response match “not found” semantics instead of erroring.
Changes:
- Catch
NoSuchFileExceptioninDistribFileStore.FileInfoattribute accessors (size(),getTimeStamp()), returning sentinel values for concurrent-delete races. - Update
ClusterFileStore.getMetadata/convertToResponseto treat vanished entries as absent (including filtering nulls from directory listings). - Add an unreleased changelog entry describing the behavioral fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| solr/core/src/java/org/apache/solr/filestore/DistribFileStore.java | Converts concurrent-delete NoSuchFileException into sentinel return values for size/timestamp reads. |
| solr/core/src/java/org/apache/solr/filestore/ClusterFileStore.java | Treats sentinel size as “vanished” and filters null entries from directory listings. |
| changelog/unreleased/PR#4367-filestore-getmetadata-delete-race.yml | Documents the fix for release notes. |
Comments suppressed due to low confidence (2)
solr/core/src/java/org/apache/solr/filestore/ClusterFileStore.java:268
convertToResponsecallsdetails.getMetaData()twice. Besides doing duplicate I/O, this can still surface the same concurrent-delete race as an unhandledRuntimeException(e.g.,DistribFileStorewraps metadata read failures inRuntimeException), causing the HTTP 500 this PR is trying to eliminate. Fetch the metadata once into a local variable and handle the case where metadata can’t be read because the entry disappeared (treat it asnull/vanished).
entryMetadata.timestamp = details.getTimeStamp();
if (details.getMetaData() != null) {
details.getMetaData().toMap(entryMetadata.unknownProperties());
}
solr/core/src/java/org/apache/solr/filestore/DistribFileStore.java:316
FileDetails.size()andgetTimeStamp()now use sentinel values (-1/null) to indicate a concurrent-delete race. SinceFileStore.FileDetailsdoesn’t document these return values, it’s easy for future callers to misinterpret them as valid data. Consider documenting this contract (ideally onFileStore.FileDetails) or at least adding a clarifying comment here indicating the sentinel values are part of an intentional API convention.
@Override
public Date getTimeStamp() {
try {
return new Date(Files.getLastModifiedTime(realPath()).toMillis());
} catch (NoSuchFileException e) {
// File was deleted concurrently between listing and reading its attributes.
return null;
} catch (IOException e) {
throw new SolrException(
SERVER_ERROR, "Failed to retrieve the last modified time for: " + realPath(), e);
}
}
@Override
public boolean isDir() {
return type == FileType.DIRECTORY;
}
@Override
public long size() {
try {
return Files.size(realPath());
} catch (NoSuchFileException e) {
// File was deleted concurrently between listing and reading its attributes.
return -1;
} catch (IOException e) {
throw new SolrException(
SERVER_ERROR, "Failed to retrieve the file size for: " + realPath(), e);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| type: fixed # added, changed, fixed, deprecated, removed, dependency_update, security, other | ||
| authors: | ||
| - name: Eric Pugh | ||
| links: [] |
There was a problem hiding this comment.
Changelog entries in this directory typically include a link (PR and/or JIRA) under links: rather than an empty list. Please add an appropriate link entry so the change is traceable from the release notes.
| links: [] | |
| links: | |
| - https://github.com/apache/solr/pull/4367 |
| entryMetadata.size = size; | ||
| entryMetadata.timestamp = details.getTimeStamp(); |
There was a problem hiding this comment.
convertToResponse treats size < 0 as a concurrent-delete race, but getTimeStamp() can also now return null on the same race if the file is deleted after size() succeeds. To keep the response consistent with NOFILE semantics, consider treating a null timestamp as a vanished entry as well (return null from convertToResponse).
| entryMetadata.size = size; | |
| entryMetadata.timestamp = details.getTimeStamp(); | |
| final var timestamp = details.getTimeStamp(); | |
| if (timestamp == null) { | |
| // File was deleted concurrently between reading its size and timestamp. | |
| return null; | |
| } | |
| entryMetadata.size = size; | |
| entryMetadata.timestamp = timestamp; |
I disocvered this issue while working on #4249 (SOLR-16341: fix blank file zip handling).
ClusterFileStore.getMetadata could return a 500 when a file was deleted between fileStore.list(...) returning the entry and convertToResponse calling FileDetails.size()/getTimeStamp() on it. The underlying Files.size()/Files.getLastModifiedTime() would throw NoSuchFileException, which DistribFileStore wrapped as SolrException(SERVER_ERROR).
Reproducible via beasting TestDistribFileStore.testFileStoreManagement (seen ~1/20 with seed 30453DE51CF44AB7, US-ASCII, security manager).
Fix: