Skip to content

fix: prevent processing 3rd party maintainers [CM-1097]#4035

Open
mbani01 wants to merge 7 commits intomainfrom
fix/maintainers_from_thrid_party
Open

fix: prevent processing 3rd party maintainers [CM-1097]#4035
mbani01 wants to merge 7 commits intomainfrom
fix/maintainers_from_thrid_party

Conversation

@mbani01
Copy link
Copy Markdown
Contributor

@mbani01 mbani01 commented Apr 17, 2026

This pull request enhances the accuracy and reliability of maintainer extraction by introducing robust detection and exclusion of third-party and vendored files. The changes add new logic and rules to identify and skip files that are likely to be bundled dependencies or unrelated to project governance, both in the backend logic and the extraction prompt. This helps prevent false maintainer detections from third-party code and clarifies the extraction process.

Third-party and vendor file detection and exclusion:

  • Added a comprehensive set of directory names (THIRD_PARTY_DIR_EXACT) and a regular expression (_VERSION_DIR_RE) to identify third-party, vendored, or versioned directories in file paths. Also introduced a maximum path depth (MAX_NON_GOVERNANCE_DEPTH) for files without governance keywords to further filter out likely third-party files.
  • Implemented the _is_third_party_path class method to encapsulate the logic for detecting third-party or vendor file paths using the above rules.
  • Integrated the _is_third_party_path check into the analyze_and_build_result method, so that files matching any exclusion rule are skipped and logged, raising a NO_MAINTAINER_FOUND error.

Extraction prompt improvements:

  • Updated the maintainer extraction prompt in get_extraction_prompt to explicitly instruct the model to check for third-party/vendor file paths first, with detailed rules and examples for exclusion. This ensures the extraction process aligns with the backend logic and avoids extracting maintainers from irrelevant files.

Governance and scoring keyword updates:

  • Added contributing.md to the list of recognized governance filenames and contributing to the set of governance stems, improving coverage of legitimate maintainer files. [1] [2]

Dependency update:

  • Imported the re module to support regular expression matching for versioned directories.

Note

Medium Risk
Changes the maintainer detection/extraction flow to skip more files and alters the LLM prompt contract, which could reduce false positives but may also inadvertently exclude legitimate governance files in edge-case repos.

Overview
Prevents maintainer extraction from vendored/third-party content by adding a path-based exclusion check (known vendor dir names, versioned directories, and a hard max path depth) and applying it before running AI analysis.

Threads repo_url through maintainer extraction so the LLM prompt can explicitly reject likely submodules/dependencies using repo/path heuristics, and tightens Bedrock response parsing to more reliably strip trailing markdown fences and require raw JSON output.

Reviewed by Cursor Bugbot for commit 30a466d. Bugbot is set up for automated code reviews on this repo. Configure here.

mbani01 added 4 commits April 17, 2026 17:42
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 requested a review from joanagmaia April 17, 2026 16:48
@mbani01 mbani01 self-assigned this Apr 17, 2026
Copilot AI review requested due to automatic review settings April 17, 2026 16:48
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to reduce false maintainer extraction by adding heuristics to detect and skip third-party/vendored files (both in backend filtering and in the LLM extraction prompt), and expands governance filename/stem matching to include CONTRIBUTING files.

Changes:

  • Added _is_third_party_path() with directory-name, versioned-directory, and “deep path without governance keywords” rejection rules, and applied it in analyze_and_build_result().
  • Expanded governance matching by adding contributing.md to known paths and contributing to governance stems.
  • Updated the LLM extraction prompt to require an upfront third-party path check with rules and examples.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 70 to 74
".github/codeowners",
"security-insights.md",
"readme.md",
"contributing.md",
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contributing.md/contributing are added to the governance match sets, but _ripgrep_search() later excludes basenames in EXCLUDED_FILENAMES (which includes contributing.md and contributing). As a result, CONTRIBUTING files still won’t be discovered as candidates, so this change won’t improve coverage as intended. Consider either removing CONTRIBUTING from EXCLUDED_FILENAMES or not adding it to KNOWN_PATHS/GOVERNANCE_STEMS (and update the PR description accordingly).

Copilot uses AI. Check for mistakes.
Comment on lines +180 to +190
GOVERNANCE_PATH_KEYWORDS = {
"maintainer",
"codeowner",
"owner",
"contributor",
"governance",
"committer",
"reviewer",
"approver",
"emeritus",
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GOVERNANCE_PATH_KEYWORDS is used to exempt deep paths from the third-party filter, but it doesn’t include contributing even though this PR adds contributing to governance stems/known paths. This can cause deep CONTRIBUTING-related governance paths (e.g., docs/contributing/...) to be treated as third-party under Rule 3. Consider adding contributing (and any other governance stems you expect in paths) to GOVERNANCE_PATH_KEYWORDS to keep the heuristics consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +375 to +392
- **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these three rules match:

**Rule 1 — Vendor/dependency directory**: reject if any directory in the path is one of:
`vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`.

**Rule 2 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages.

**Rule 3 — Deep path without governance keyword**: reject if the path has more than 3 segments (e.g. `a/b/c/file`) AND does not contain any governance keyword (maintainer, codeowner, owner, contributor, governance, committer, reviewer, approver, emeritus). Deep files without governance keywords are typically unrelated to project governance.

**Examples of paths that MUST be rejected:**
- `vendor/google.golang.org/grpc/MAINTAINERS.md` (Rule 1: vendor)
- `node_modules/tunnel/README.md` (Rule 1: node_modules)
- `bundled/taskflow-3.10.0/README.md` (Rule 1: bundled + Rule 2: version)
- `gui-editors/gui-editor-apex/src/main/webapp/dist/js/jquery-ui-1.12.1/AUTHORS.txt` (Rule 2: version)
- `src/java.desktop/share/native/libsplashscreen/libpng/README` (Rule 3: deep, no governance keyword)
- `web/static-dist/libs/bootstrap/README.md` (Rule 3: deep, no governance keyword)
- `css/bootstrap/README.md` (Rule 3: depth=3 is fine, but this is actually a third-party asset — use your judgment)

Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prompt’s vendor/third-party rules are not aligned with the backend _is_third_party_path() implementation:

  • Rule 1’s directory list is missing entries that the backend rejects (e.g. externallibs, bower_components_external).
  • The prompt includes Pods/Godeps with capital letters, while the backend lowercases paths before matching.
  • The css/bootstrap/README.md example says to “use your judgment” even though it doesn’t match any of the three mandatory reject rules, which conflicts with “MANDATORY — evaluate FIRST” + “MUST return … immediately”.
    To avoid model/backend divergence, make the prompt list/casing match the code exactly and remove (or rework) the “use your judgment” example so it doesn’t contradict the stated rules.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@joanagmaia joanagmaia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mbani01 the copilot comments seem to make sense. Can you double check its suggestions and make sure we are not introducing inconsistencies with the new implementation? Thanks!

mbani01 and others added 2 commits April 20, 2026 18:18
Copilot AI review requested due to automatic review settings April 20, 2026 17:19
@mbani01
Copy link
Copy Markdown
Contributor Author

mbani01 commented Apr 20, 2026

Hey @mbani01 the copilot comments seem to make sense. Can you double check its suggestions and make sure we are not introducing inconsistencies with the new implementation? Thanks!

@joanagmaia indeed, that was relevant, it's fixed alongside the other missing cases

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 127 to 132
EXCLUDED_FILENAMES = {
"contributing.md",
"contributing",
"code_of_conduct.md",
"code-of-conduct.md",
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contributing.md/contributing are being added to EXCLUDED_FILENAMES, which prevents them from ever being considered as candidate governance/maintainer files (see _ripgrep_search). This conflicts with the PR description claiming contributing.md was added as a recognized governance filename/stem; either remove these from EXCLUDED_FILENAMES (and add to KNOWN_PATHS/GOVERNANCE_STEMS if desired) or update the PR description/intent to match the implemented behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +175
# Hard max depth (number of path segments). Files deeper than this are rejected
# regardless of content — legitimate governance files live at depth 1-3.
MAX_PATH_DEPTH = 3
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions a depth cap specifically for non-governance paths (e.g. MAX_NON_GOVERNANCE_DEPTH), but the implementation introduces an unconditional hard cap (MAX_PATH_DEPTH = 3) that rejects all paths deeper than 3 segments. If the intent is conditional depth filtering, this needs to be implemented (e.g., only apply the depth cap when no governance keywords/stems appear in the path) or the PR description/constants should be updated to reflect the unconditional behavior.

Copilot uses AI. Check for mistakes.
Comment on lines 698 to +702
self.logger.info(f"Analyzing maintainer file: {filename}")

if self._is_third_party_path(filename):
self.logger.warning(f"Skipping third-party/vendor file: '{filename}'")
raise MaintanerAnalysisError(error_code=ErrorCode.NO_MAINTAINER_FOUND)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_is_third_party_path is only applied inside analyze_and_build_result, which means candidate discovery (find_candidate_files) can still end up reading and scoring large vendored files (and incurring I/O + decode cost) before being rejected here. If the goal is to reduce third-party impact and cost, consider applying the third-party/path-depth filter earlier (e.g., skip paths in _ripgrep_search/find_candidate_files before _read_text_file) to avoid unnecessary file reads and processing.

Copilot uses AI. Check for mistakes.
Comment on lines +361 to +382
- **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** and the **repository URL** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these rules match:

**Rule 1 — Repo-name check (step by step)**:
1. Extract the repo name and org name from the repository URL (e.g. URL `https://github.com/numworks/epsilon` → repo=`epsilon`, org=`numworks`).
2. For each directory in the file path, check: is this directory name a common structural directory (like `src`, `docs`, `doc`, `.github`, `lib`, `pkg`, `test`, `community`, `content`, `tools`, `web`, `app`, `config`, `deploy`, `charts`, etc.)? If yes, skip it — it's fine.
3. For any directory that is NOT a common structural directory AND is NOT a governance keyword (maintainer, owner, contributor, etc.), check: does it appear as a substring of the repo name or org name, or vice versa? If NOT → this directory is a submodule or bundled library name that does not belong to this repo. Return `{{"error": "not_found"}}`.
Example: file `mylib/README.md` in repo `orgname/myproject` → `mylib` is not structural, not a governance keyword, and `mylib` does not appear in `myproject` or `orgname` → reject. But file `myproject/README.md` in the same repo → `myproject` matches the repo name → allow.

**Rule 2 — Vendor/dependency directory**: reject if any directory in the path is one of:
`vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`.

**Rule 3 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages.

**Rule 4 — Hard depth limit**: reject if the path has more than 3 segments (e.g. `a/b/c/file` is 4 segments → reject). Legitimate governance files live at the root or 1-2 directories deep. No exceptions.

**Examples of paths that MUST be rejected:**
- `src/somelibrary/AUTHORS` in a repo that is NOT somelibrary (Rule 1)
- `subcomponent/README.md` in a repo with a different project name (Rule 1)
- `vendor/some-package/MAINTAINERS.md` (Rule 2: vendor)
- `node_modules/some-pkg/README.md` (Rule 2: node_modules)
- `bundled/pkg-1.2.0/README.md` (Rule 2 + Rule 3: version)
- `a/b/c/d/AUTHORS.txt` (Rule 4: more than 3 segments)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prompt Rule 1 (repo-name/org-name substring heuristic) is enforced only by the LLM instruction; the backend _is_third_party_path logic doesn’t implement this rule. That makes third-party rejection behavior non-deterministic and hard to debug (a file can be rejected by the model without any corresponding backend log/guardrail). Consider either implementing the same repo-name heuristic server-side (so skips are explainable and consistent) or removing/softening this rule in the prompt to match backend behavior.

Suggested change
- **Third-Party Check (MANDATORYevaluate FIRST)**: Examine the **full file path** and the **repository URL** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these rules match:
**Rule 1Repo-name check (step by step)**:
1. Extract the repo name and org name from the repository URL (e.g. URL `https://github.com/numworks/epsilon`repo=`epsilon`, org=`numworks`).
2. For each directory in the file path, check: is this directory name a common structural directory (like `src`, `docs`, `doc`, `.github`, `lib`, `pkg`, `test`, `community`, `content`, `tools`, `web`, `app`, `config`, `deploy`, `charts`, etc.)? If yes, skip itit's fine.
3. For any directory that is NOT a common structural directory AND is NOT a governance keyword (maintainer, owner, contributor, etc.), check: does it appear as a substring of the repo name or org name, or vice versa? If NOTthis directory is a submodule or bundled library name that does not belong to this repo. Return `{{"error": "not_found"}}`.
Example: file `mylib/README.md` in repo `orgname/myproject``mylib` is not structural, not a governance keyword, and `mylib` does not appear in `myproject` or `orgname`reject. But file `myproject/README.md` in the same repo`myproject` matches the repo nameallow.
**Rule 2Vendor/dependency directory**: reject if any directory in the path is one of:
`vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`.
**Rule 3Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages.
**Rule 4Hard depth limit**: reject if the path has more than 3 segments (e.g. `a/b/c/file` is 4 segmentsreject). Legitimate governance files live at the root or 1-2 directories deep. No exceptions.
**Examples of paths that MUST be rejected:**
- `src/somelibrary/AUTHORS` in a repo that is NOT somelibrary (Rule 1)
- `subcomponent/README.md` in a repo with a different project name (Rule 1)
- `vendor/some-package/MAINTAINERS.md` (Rule 2: vendor)
- `node_modules/some-pkg/README.md` (Rule 2: node_modules)
- `bundled/pkg-1.2.0/README.md` (Rule 2 + Rule 3: version)
- `a/b/c/d/AUTHORS.txt` (Rule 4: more than 3 segments)
- **Third-Party Check (MANDATORYevaluate FIRST)**: Examine the **full file path** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these rules match:
**Rule 1Vendor/dependency directory**: reject if any directory in the path is one of:
`vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`.
**Rule 2Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages.
**Rule 3Hard depth limit**: reject if the path has more than 3 segments (e.g. `a/b/c/file` is 4 segmentsreject). Legitimate governance files live at the root or 1-2 directories deep. No exceptions.
**Examples of paths that MUST be rejected:**
- `vendor/some-package/MAINTAINERS.md` (Rule 1: vendor)
- `node_modules/some-pkg/README.md` (Rule 1: node_modules)
- `bundled/pkg-1.2.0/README.md` (Rule 1 + Rule 2: version)
- `a/b/c/d/AUTHORS.txt` (Rule 3: more than 3 segments)

Copilot uses AI. Check for mistakes.
return True

if len(parts) > cls.MAX_PATH_DEPTH:
return True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depth limit unconditionally rejects all deep governance files

Medium Severity

The PR description explicitly states the depth limit (MAX_NON_GOVERNANCE_DEPTH) applies only to "files without governance keywords," but the implementation uses MAX_PATH_DEPTH = 3 and applies it unconditionally to all files in _is_third_party_path. This means legitimate governance files at depth 4+ (e.g., community/governance/team/MAINTAINERS.md) are rejected as third-party, even though they have governance filenames. Previously tracked saved_maintainer_file entries at those depths would also start failing, causing regressions.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 639b271. Configure here.

# Versioned directory pattern — directories containing semver-like numbers
# (e.g. "jquery-ui-1.12.1", "zlib-1.2.8", "ffmpeg-7.1.1") are almost always
# bundled third-party packages. Real project directories don't have versions.
_VERSION_DIR_RE = re.compile(r"\d+\.\d+")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version regex matches non-version directory names

Low Severity

_VERSION_DIR_RE = re.compile(r"\d+\.\d+") used with .search() matches any directory containing a digit-dot-digit pattern anywhere in its name. This causes false positives on legitimate directory names like python3.9, go1.21, gcc12.3, or net5.0, incorrectly flagging them as versioned third-party packages. Adding word-boundary anchors or requiring a leading separator would reduce false positives.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 639b271. Configure here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 30a466d. Configure here.


if self._is_third_party_path(filename):
self.logger.warning(f"Skipping third-party/vendor file: '{filename}'")
raise MaintanerAnalysisError(error_code=ErrorCode.NO_MAINTAINER_FOUND)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third-party filtering missing from candidate discovery stage

Medium Severity

_is_third_party_path is only checked inside analyze_and_build_result, but find_candidate_files does not pre-filter third-party paths. Since only the single top-scoring subdir candidate is ever tried (line 848), a high-scoring third-party file (e.g. vendor/lib/MAINTAINERS.md) can shadow legitimate lower-scoring subdir candidates (e.g. docs/MAINTAINERS.md). The rejected third-party file causes an immediate fallthrough to expensive AI file detection, skipping all other valid subdir candidates entirely.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 30a466d. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants