fix(worker): reindex repos with missing zoekt shards #1350

Open
RitwijParmar wants to merge 3 commits into sourcebot-dev:main from RitwijParmar:codex/sourcebot-missing-shards-reindex

@RitwijParmar RitwijParmar commented Jun 18, 2026


Fixes #1210

Summary

  • detect indexed repos whose committed zoekt shard files are missing on worker startup
  • mark those repos stale and queue reindex jobs, while skipping repos that already have pending or in-progress index work
  • ignore temporary shard files so failed partial indexes do not count as searchable shards
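The detection step described above can be sketched as a pure function. This is a minimal illustration, not the PR's implementation: `IndexedRepo`, `ShardNameParser`, and `findReposMissingShards` are hypothetical names, and the parser parameter stands in for the real `getRepoIdFromShardFileName`, whose file-name format is not shown on this page.

```typescript
// Hypothetical shape; the real Prisma model in the PR may differ.
interface IndexedRepo {
    id: number;
    // True when the repo already has pending or in-progress index work.
    hasActiveIndexJob: boolean;
}

// Assumed stand-in for getRepoIdFromShardFileName. Returns undefined
// when the file name cannot be mapped to a repo ID.
type ShardNameParser = (fileName: string) => number | undefined;

function findReposMissingShards(
    shardFileNames: readonly string[],
    indexedRepos: readonly IndexedRepo[],
    parseRepoId: ShardNameParser,
): number[] {
    const repoIdsWithShards = new Set<number>();
    for (const fileName of shardFileNames) {
        // Only committed shards count. Temporary files left behind by a
        // failed partial index do not end in ".zoekt" and are skipped.
        if (!fileName.endsWith(".zoekt")) continue;
        const repoId = parseRepoId(fileName);
        if (repoId !== undefined) repoIdsWithShards.add(repoId);
    }

    // A repo is stale when the DB says it is indexed, no committed shard
    // exists on disk, and no index job is already queued or running.
    return indexedRepos
        .filter((repo) => !repo.hasActiveIndexJob && !repoIdsWithShards.has(repo.id))
        .map((repo) => repo.id);
}
```

Keeping the check free of filesystem and database calls makes it straightforward to unit test with plain arrays.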

Verification

  • yarn workspace @sourcebot/backend test repoIndexManager.test.ts
  • yarn workspace @sourcebot/backend test
  • yarn workspace @sourcebot/backend build

Summary by CodeRabbit

  • Bug Fixes
    • Fixed cases where repositories appeared indexed in the database but were missing their on-disk search shard files, which previously prevented automatic re-indexing.
    • On worker startup, the system now detects these mismatches, marks affected repositories for re-indexing, and schedules the required reindex jobs.
  • Tests
    • Added coverage to validate the new startup reconciliation behavior and job scheduling for stale repositories.

@RitwijParmar RitwijParmar marked this pull request as ready for review June 18, 2026 19:33
coderabbitai Bot commented Jun 18, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4819eed0-eafb-4da9-94b0-8f5e74f5677d

📥 Commits

Reviewing files that changed from the base of the PR and between 5f268f1 and b24f26f.

📒 Files selected for processing (2)
  • packages/backend/src/repoIndexManager.test.ts
  • packages/backend/src/repoIndexManager.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/backend/src/repoIndexManager.ts
  • packages/backend/src/repoIndexManager.test.ts

Walkthrough

Adds a startup reconciliation step to RepoIndexManager that scans INDEX_CACHE_DIR for .zoekt shard files, identifies DB-indexed repos with no corresponding shard on disk, resets their indexed state in batches via updateMany, and enqueues new INDEX jobs. A new test suite and changelog entry accompany the change.

Changes

Shard-Missing Startup Reconciliation

| Layer / File(s) | Summary |
| --- | --- |
| Core reconciliation method and startup wiring (packages/backend/src/repoIndexManager.ts, CHANGELOG.md) | Adds STALE_REPO_UPDATE_BATCH_SIZE constant, wires a new reconciliation call into startScheduler() after orphaned disk cleanup, and implements the private method that reads .zoekt shard filenames, maps them to repo IDs via getRepoIdFromShardFileName, queries Prisma for indexed repos without active INDEX jobs in the timeout window, identifies repos with no shard file, clears indexedAt/indexedCommitHash in batched updateMany calls, logs warnings, and enqueues INDEX jobs via createJobs. Changelog entry added under Unreleased → Fixed. |
| Test mocks and startup reconciliation test (packages/backend/src/repoIndexManager.test.ts) | Extends @sourcebot/shared, ./zoekt.js, and ./utils.js mocks with getRepoIdFromPath, REPOS_CACHE_DIR, cleanupTempShards, and getRepoIdFromShardFileName; adds updateMany to the Prisma repo mock; and adds a Startup Reconciliation test that mocks stale vs. healthy repo state, runs manager.startScheduler(), and asserts that repo.updateMany clears indexed markers for stale repos and that repo-index-job messages are enqueued with the correct payloads. |
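
The batched reset of indexedAt/indexedCommitHash mentioned in the summary could look roughly like the sketch below. The batch size value, the `chunk` helper, the `RepoUpdater` interface, and `clearIndexedState` are all assumptions made for illustration; only the `updateMany` call shape follows Prisma's documented API, and the PR's actual code is not shown on this page.

```typescript
// Assumed value; the real STALE_REPO_UPDATE_BATCH_SIZE is not shown here.
const STALE_REPO_UPDATE_BATCH_SIZE = 1000;

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
    if (!Number.isInteger(size) || size <= 0) {
        throw new RangeError("size must be a positive integer");
    }
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Hypothetical minimal view of the Prisma repo delegate used below.
interface RepoUpdater {
    updateMany(args: {
        where: { id: { in: number[] } };
        data: { indexedAt: null; indexedCommitHash: null };
    }): Promise<{ count: number }>;
}

// Clears the indexed markers for stale repos one batch at a time, so a
// large ID list never ends up in a single query. Returns the total
// number of rows reported as updated.
async function clearIndexedState(
    repo: RepoUpdater,
    staleRepoIds: readonly number[],
    batchSize: number = STALE_REPO_UPDATE_BATCH_SIZE,
): Promise<number> {
    let updated = 0;
    for (const batch of chunk(staleRepoIds, batchSize)) {
        const { count } = await repo.updateMany({
            where: { id: { in: batch } },
            data: { indexedAt: null, indexedCommitHash: null },
        });
        updated += count;
    }
    return updated;
}
```

Batching keeps each `in` list bounded, which matters because databases limit the number of bind parameters per statement.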

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • sourcebot-dev/sourcebot#805: Both PRs involve Zoekt .tmp shard file handling in packages/backend/src/repoIndexManager.ts; #805 adds best-effort cleanup of .tmp shards on indexing failure, while the main PR's startup reconciliation explicitly excludes .tmp files when deciding which repos need reindexing.
  • sourcebot-dev/sourcebot#973: Both PRs modify RepoIndexManager.startScheduler() to add/await startup filesystem scans and shard/repo ID parsing helpers; the main PR extends the same startup flow by additionally scheduling reindex jobs for indexed repos missing .zoekt shards.

Suggested reviewers

  • msukkari
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(worker): reindex repos with missing zoekt shards' directly describes the main change: automatic detection and recovery of indexed repositories with missing shard files. |
| Linked Issues check | ✅ Passed | The PR fully addresses the objectives from issue #1210: it detects repos marked indexed in DB but missing shard files, marks them stale, queues reindex jobs, skips active jobs, and ignores temp files. |
| Out of Scope Changes check | ✅ Passed | All changes directly implement the missing shard recovery mechanism specified in issue #1210. CHANGELOG and test additions support the core feature with no extraneous modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/backend/src/repoIndexManager.ts`:
- Around line 769-786: The startup reconciliation query in the findMany call on
this.db.repo excludes every repo that has a PENDING or IN_PROGRESS INDEX job,
without checking whether that job is itself stale. Modify the NOT.jobs.some
condition so that it only matches recent jobs, by adding a time-based filter on
the job's createdAt or updatedAt timestamp (e.g., createdAt greater than the
current time minus repoIndexTimeoutMs). This ensures that only active, non-stale
jobs prevent a repo from being included in the startup reconciliation, so repos
whose jobs have timed out can still be recovered.
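
A minimal sketch of the filter this comment asks for, written as a plain function that builds a Prisma-style `where` object. The helper name `buildReconciliationWhere` is hypothetical, and the field and enum names (`indexedAt`, `jobs`, `type`, `status`, `createdAt`, `INDEX`, `PENDING`, `IN_PROGRESS`) are taken from the review text rather than verified against the project's schema.

```typescript
// Hypothetical helper; not the PR's code. Builds the `where` argument
// for a repo findMany call during startup reconciliation.
function buildReconciliationWhere(now: Date, repoIndexTimeoutMs: number) {
    // Jobs created before this instant are treated as timed out (stale).
    const activeSince = new Date(now.getTime() - repoIndexTimeoutMs);

    return {
        // Only repos the database currently considers indexed.
        indexedAt: { not: null },
        // Exclude a repo only if it has a *recent* pending or running
        // INDEX job. Stale jobs no longer shield the repo.
        NOT: {
            jobs: {
                some: {
                    type: "INDEX",
                    status: { in: ["PENDING", "IN_PROGRESS"] },
                    createdAt: { gt: activeSince },
                },
            },
        },
    };
}
```

Passing `now` in explicitly, instead of calling `new Date()` inside the helper, keeps the cutoff deterministic in tests.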

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2bd2948c-857a-438a-8c6f-f20a93b745aa

📥 Commits

Reviewing files that changed from the base of the PR and between 9320065 and 5f268f1.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • packages/backend/src/repoIndexManager.test.ts
  • packages/backend/src/repoIndexManager.ts

Comment thread packages/backend/src/repoIndexManager.ts
Development

Successfully merging this pull request may close these issues.

[bug/rfe] Rebuild or mark repos stale when zoekt shard files are missing but DB marks repos indexed

1 participant