fix: prevent processing 3rd party maintainers [CM-1097]#4035
fix: prevent processing 3rd party maintainers [CM-1097]#4035
Conversation
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
|
|
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce false maintainer extraction by adding heuristics to detect and skip third-party/vendored files (both in backend filtering and in the LLM extraction prompt), and expands governance filename/stem matching to include CONTRIBUTING files.
Changes:
- Added
_is_third_party_path()with directory-name, versioned-directory, and “deep path without governance keywords” rejection rules, and applied it inanalyze_and_build_result(). - Expanded governance matching by adding
contributing.mdto known paths andcontributingto governance stems. - Updated the LLM extraction prompt to require an upfront third-party path check with rules and examples.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ".github/codeowners", | ||
| "security-insights.md", | ||
| "readme.md", | ||
| "contributing.md", | ||
| } |
There was a problem hiding this comment.
contributing.md/contributing are added to the governance match sets, but _ripgrep_search() later excludes basenames in EXCLUDED_FILENAMES (which includes contributing.md and contributing). As a result, CONTRIBUTING files still won’t be discovered as candidates, so this change won’t improve coverage as intended. Consider either removing CONTRIBUTING from EXCLUDED_FILENAMES or not adding it to KNOWN_PATHS/GOVERNANCE_STEMS (and update the PR description accordingly).
| GOVERNANCE_PATH_KEYWORDS = { | ||
| "maintainer", | ||
| "codeowner", | ||
| "owner", | ||
| "contributor", | ||
| "governance", | ||
| "committer", | ||
| "reviewer", | ||
| "approver", | ||
| "emeritus", | ||
| } |
There was a problem hiding this comment.
GOVERNANCE_PATH_KEYWORDS is used to exempt deep paths from the third-party filter, but it doesn’t include contributing even though this PR adds contributing to governance stems/known paths. This can cause deep CONTRIBUTING-related governance paths (e.g., docs/contributing/...) to be treated as third-party under Rule 3. Consider adding contributing (and any other governance stems you expect in paths) to GOVERNANCE_PATH_KEYWORDS to keep the heuristics consistent.
| - **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these three rules match: | ||
|
|
||
| **Rule 1 — Vendor/dependency directory**: reject if any directory in the path is one of: | ||
| `vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`. | ||
|
|
||
| **Rule 2 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages. | ||
|
|
||
| **Rule 3 — Deep path without governance keyword**: reject if the path has more than 3 segments (e.g. `a/b/c/file`) AND does not contain any governance keyword (maintainer, codeowner, owner, contributor, governance, committer, reviewer, approver, emeritus). Deep files without governance keywords are typically unrelated to project governance. | ||
|
|
||
| **Examples of paths that MUST be rejected:** | ||
| - `vendor/google.golang.org/grpc/MAINTAINERS.md` (Rule 1: vendor) | ||
| - `node_modules/tunnel/README.md` (Rule 1: node_modules) | ||
| - `bundled/taskflow-3.10.0/README.md` (Rule 1: bundled + Rule 2: version) | ||
| - `gui-editors/gui-editor-apex/src/main/webapp/dist/js/jquery-ui-1.12.1/AUTHORS.txt` (Rule 2: version) | ||
| - `src/java.desktop/share/native/libsplashscreen/libpng/README` (Rule 3: deep, no governance keyword) | ||
| - `web/static-dist/libs/bootstrap/README.md` (Rule 3: deep, no governance keyword) | ||
| - `css/bootstrap/README.md` (Rule 3: depth=3 is fine, but this is actually a third-party asset — use your judgment) | ||
|
|
There was a problem hiding this comment.
The prompt’s vendor/third-party rules are not aligned with the backend _is_third_party_path() implementation:
- Rule 1’s directory list is missing entries that the backend rejects (e.g.
externallibs,bower_components_external). - The prompt includes
Pods/Godepswith capital letters, while the backend lowercases paths before matching. - The
css/bootstrap/README.mdexample says to “use your judgment” even though it doesn’t match any of the three mandatory reject rules, which conflicts with “MANDATORY — evaluate FIRST” + “MUST return … immediately”.
To avoid model/backend divergence, make the prompt list/casing match the code exactly and remove (or rework) the “use your judgment” example so it doesn’t contradict the stated rules.
joanagmaia
left a comment
There was a problem hiding this comment.
Hey @mbani01 the copilot comments seem to make sense. Can you double check its suggestions and make sure we are not introducing inconsistencies with the new implementation? Thanks!
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@joanagmaia indeed, that was relevant, it's fixed alongside the other missing cases |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| EXCLUDED_FILENAMES = { | ||
| "contributing.md", | ||
| "contributing", | ||
| "code_of_conduct.md", | ||
| "code-of-conduct.md", | ||
| } |
There was a problem hiding this comment.
contributing.md/contributing are being added to EXCLUDED_FILENAMES, which prevents them from ever being considered as candidate governance/maintainer files (see _ripgrep_search). This conflicts with the PR description claiming contributing.md was added as a recognized governance filename/stem; either remove these from EXCLUDED_FILENAMES (and add to KNOWN_PATHS/GOVERNANCE_STEMS if desired) or update the PR description/intent to match the implemented behavior.
| # Hard max depth (number of path segments). Files deeper than this are rejected | ||
| # regardless of content — legitimate governance files live at depth 1-3. | ||
| MAX_PATH_DEPTH = 3 |
There was a problem hiding this comment.
The PR description mentions a depth cap specifically for non-governance paths (e.g. MAX_NON_GOVERNANCE_DEPTH), but the implementation introduces an unconditional hard cap (MAX_PATH_DEPTH = 3) that rejects all paths deeper than 3 segments. If the intent is conditional depth filtering, this needs to be implemented (e.g., only apply the depth cap when no governance keywords/stems appear in the path) or the PR description/constants should be updated to reflect the unconditional behavior.
| self.logger.info(f"Analyzing maintainer file: {filename}") | ||
|
|
||
| if self._is_third_party_path(filename): | ||
| self.logger.warning(f"Skipping third-party/vendor file: '{filename}'") | ||
| raise MaintanerAnalysisError(error_code=ErrorCode.NO_MAINTAINER_FOUND) |
There was a problem hiding this comment.
_is_third_party_path is only applied inside analyze_and_build_result, which means candidate discovery (find_candidate_files) can still end up reading and scoring large vendored files (and incurring I/O + decode cost) before being rejected here. If the goal is to reduce third-party impact and cost, consider applying the third-party/path-depth filter earlier (e.g., skip paths in _ripgrep_search/find_candidate_files before _read_text_file) to avoid unnecessary file reads and processing.
| - **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** and the **repository URL** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these rules match: | ||
|
|
||
| **Rule 1 — Repo-name check (step by step)**: | ||
| 1. Extract the repo name and org name from the repository URL (e.g. URL `https://github.com/numworks/epsilon` → repo=`epsilon`, org=`numworks`). | ||
| 2. For each directory in the file path, check: is this directory name a common structural directory (like `src`, `docs`, `doc`, `.github`, `lib`, `pkg`, `test`, `community`, `content`, `tools`, `web`, `app`, `config`, `deploy`, `charts`, etc.)? If yes, skip it — it's fine. | ||
| 3. For any directory that is NOT a common structural directory AND is NOT a governance keyword (maintainer, owner, contributor, etc.), check: does it appear as a substring of the repo name or org name, or vice versa? If NOT → this directory is a submodule or bundled library name that does not belong to this repo. Return `{{"error": "not_found"}}`. | ||
| Example: file `mylib/README.md` in repo `orgname/myproject` → `mylib` is not structural, not a governance keyword, and `mylib` does not appear in `myproject` or `orgname` → reject. But file `myproject/README.md` in the same repo → `myproject` matches the repo name → allow. | ||
|
|
||
| **Rule 2 — Vendor/dependency directory**: reject if any directory in the path is one of: | ||
| `vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`. | ||
|
|
||
| **Rule 3 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages. | ||
|
|
||
| **Rule 4 — Hard depth limit**: reject if the path has more than 3 segments (e.g. `a/b/c/file` is 4 segments → reject). Legitimate governance files live at the root or 1-2 directories deep. No exceptions. | ||
|
|
||
| **Examples of paths that MUST be rejected:** | ||
| - `src/somelibrary/AUTHORS` in a repo that is NOT somelibrary (Rule 1) | ||
| - `subcomponent/README.md` in a repo with a different project name (Rule 1) | ||
| - `vendor/some-package/MAINTAINERS.md` (Rule 2: vendor) | ||
| - `node_modules/some-pkg/README.md` (Rule 2: node_modules) | ||
| - `bundled/pkg-1.2.0/README.md` (Rule 2 + Rule 3: version) | ||
| - `a/b/c/d/AUTHORS.txt` (Rule 4: more than 3 segments) |
There was a problem hiding this comment.
Prompt Rule 1 (repo-name/org-name substring heuristic) is enforced only by the LLM instruction; the backend _is_third_party_path logic doesn’t implement this rule. That makes third-party rejection behavior non-deterministic and hard to debug (a file can be rejected by the model without any corresponding backend log/guardrail). Consider either implementing the same repo-name heuristic server-side (so skips are explainable and consistent) or removing/softening this rule in the prompt to match backend behavior.
| - **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** and the **repository URL** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these rules match: | |
| **Rule 1 — Repo-name check (step by step)**: | |
| 1. Extract the repo name and org name from the repository URL (e.g. URL `https://github.com/numworks/epsilon` → repo=`epsilon`, org=`numworks`). | |
| 2. For each directory in the file path, check: is this directory name a common structural directory (like `src`, `docs`, `doc`, `.github`, `lib`, `pkg`, `test`, `community`, `content`, `tools`, `web`, `app`, `config`, `deploy`, `charts`, etc.)? If yes, skip it — it's fine. | |
| 3. For any directory that is NOT a common structural directory AND is NOT a governance keyword (maintainer, owner, contributor, etc.), check: does it appear as a substring of the repo name or org name, or vice versa? If NOT → this directory is a submodule or bundled library name that does not belong to this repo. Return `{{"error": "not_found"}}`. | |
| Example: file `mylib/README.md` in repo `orgname/myproject` → `mylib` is not structural, not a governance keyword, and `mylib` does not appear in `myproject` or `orgname` → reject. But file `myproject/README.md` in the same repo → `myproject` matches the repo name → allow. | |
| **Rule 2 — Vendor/dependency directory**: reject if any directory in the path is one of: | |
| `vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`. | |
| **Rule 3 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages. | |
| **Rule 4 — Hard depth limit**: reject if the path has more than 3 segments (e.g. `a/b/c/file` is 4 segments → reject). Legitimate governance files live at the root or 1-2 directories deep. No exceptions. | |
| **Examples of paths that MUST be rejected:** | |
| - `src/somelibrary/AUTHORS` in a repo that is NOT somelibrary (Rule 1) | |
| - `subcomponent/README.md` in a repo with a different project name (Rule 1) | |
| - `vendor/some-package/MAINTAINERS.md` (Rule 2: vendor) | |
| - `node_modules/some-pkg/README.md` (Rule 2: node_modules) | |
| - `bundled/pkg-1.2.0/README.md` (Rule 2 + Rule 3: version) | |
| - `a/b/c/d/AUTHORS.txt` (Rule 4: more than 3 segments) | |
| - **Third-Party Check (MANDATORY — evaluate FIRST)**: Examine the **full file path** below. You MUST return `{{"error": "not_found"}}` immediately if ANY of these rules match: | |
| **Rule 1 — Vendor/dependency directory**: reject if any directory in the path is one of: | |
| `vendor`, `node_modules`, `3rdparty`, `3rd_party`, `third_party`, `thirdparty`, `third-party`, `external`, `external_packages`, `extern`, `ext`, `deps`, `deps_src`, `dependencies`, `depend`, `bundled`, `bundled_deps`, `Pods`, `Godeps`, `bower_components`, `gems`, `submodules`, `internal-complibs`, `runtime-library`, `lib-src`, `lib-python`, `contrib`, `vendored`, or ends with `.dist-info`. | |
| **Rule 2 — Versioned directory**: reject if any directory in the path contains a version number pattern like `X.Y` or `X.Y.Z` (e.g. `jquery-ui-1.12.1`, `zlib-1.2.8`, `ffmpeg-7.1.1`, `mesa-24.0.2`). Versioned directories are almost always bundled third-party packages. | |
| **Rule 3 — Hard depth limit**: reject if the path has more than 3 segments (e.g. `a/b/c/file` is 4 segments → reject). Legitimate governance files live at the root or 1-2 directories deep. No exceptions. | |
| **Examples of paths that MUST be rejected:** | |
| - `vendor/some-package/MAINTAINERS.md` (Rule 1: vendor) | |
| - `node_modules/some-pkg/README.md` (Rule 1: node_modules) | |
| - `bundled/pkg-1.2.0/README.md` (Rule 1 + Rule 2: version) | |
| - `a/b/c/d/AUTHORS.txt` (Rule 3: more than 3 segments) |
| return True | ||
|
|
||
| if len(parts) > cls.MAX_PATH_DEPTH: | ||
| return True |
There was a problem hiding this comment.
Depth limit unconditionally rejects all deep governance files
Medium Severity
The PR description explicitly states the depth limit (MAX_NON_GOVERNANCE_DEPTH) applies only to "files without governance keywords," but the implementation uses MAX_PATH_DEPTH = 3 and applies it unconditionally to all files in _is_third_party_path. This means legitimate governance files at depth 4+ (e.g., community/governance/team/MAINTAINERS.md) are rejected as third-party, even though they have governance filenames. Previously tracked saved_maintainer_file entries at those depths would also start failing, causing regressions.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 639b271. Configure here.
| # Versioned directory pattern — directories containing semver-like numbers | ||
| # (e.g. "jquery-ui-1.12.1", "zlib-1.2.8", "ffmpeg-7.1.1") are almost always | ||
| # bundled third-party packages. Real project directories don't have versions. | ||
| _VERSION_DIR_RE = re.compile(r"\d+\.\d+") |
There was a problem hiding this comment.
Version regex matches non-version directory names
Low Severity
_VERSION_DIR_RE = re.compile(r"\d+\.\d+") used with .search() matches any directory containing a digit-dot-digit pattern anywhere in its name. This causes false positives on legitimate directory names like python3.9, go1.21, gcc12.3, or net5.0, incorrectly flagging them as versioned third-party packages. Adding word-boundary anchors or requiring a leading separator would reduce false positives.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 639b271. Configure here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 30a466d. Configure here.
|
|
||
| if self._is_third_party_path(filename): | ||
| self.logger.warning(f"Skipping third-party/vendor file: '{filename}'") | ||
| raise MaintanerAnalysisError(error_code=ErrorCode.NO_MAINTAINER_FOUND) |
There was a problem hiding this comment.
Third-party filtering missing from candidate discovery stage
Medium Severity
_is_third_party_path is only checked inside analyze_and_build_result, but find_candidate_files does not pre-filter third-party paths. Since only the single top-scoring subdir candidate is ever tried (line 848), a high-scoring third-party file (e.g. vendor/lib/MAINTAINERS.md) can shadow legitimate lower-scoring subdir candidates (e.g. docs/MAINTAINERS.md). The rejected third-party file causes an immediate fallthrough to expensive AI file detection, skipping all other valid subdir candidates entirely.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 30a466d. Configure here.


This pull request enhances the accuracy and reliability of maintainer extraction by introducing robust detection and exclusion of third-party and vendored files. The changes add new logic and rules to identify and skip files that are likely to be bundled dependencies or unrelated to project governance, both in the backend logic and the extraction prompt. This helps prevent false maintainer detections from third-party code and clarifies the extraction process.
Third-party and vendor file detection and exclusion:
THIRD_PARTY_DIR_EXACT) and a regular expression (_VERSION_DIR_RE) to identify third-party, vendored, or versioned directories in file paths. Also introduced a maximum path depth (MAX_NON_GOVERNANCE_DEPTH) for files without governance keywords to further filter out likely third-party files._is_third_party_pathclass method to encapsulate the logic for detecting third-party or vendor file paths using the above rules._is_third_party_pathcheck into theanalyze_and_build_resultmethod, so that files matching any exclusion rule are skipped and logged, raising aNO_MAINTAINER_FOUNDerror.Extraction prompt improvements:
get_extraction_promptto explicitly instruct the model to check for third-party/vendor file paths first, with detailed rules and examples for exclusion. This ensures the extraction process aligns with the backend logic and avoids extracting maintainers from irrelevant files.Governance and scoring keyword updates:
contributing.mdto the list of recognized governance filenames andcontributingto the set of governance stems, improving coverage of legitimate maintainer files. [1] [2]Dependency update:
remodule to support regular expression matching for versioned directories.Note
Medium Risk
Changes the maintainer detection/extraction flow to skip more files and alters the LLM prompt contract, which could reduce false positives but may also inadvertently exclude legitimate governance files in edge-case repos.
Overview
Prevents maintainer extraction from vendored/third-party content by adding a path-based exclusion check (known vendor dir names, versioned directories, and a hard max path depth) and applying it before running AI analysis.
Threads
repo_urlthrough maintainer extraction so the LLM prompt can explicitly reject likely submodules/dependencies using repo/path heuristics, and tightens Bedrock response parsing to more reliably strip trailing markdown fences and require raw JSON output.Reviewed by Cursor Bugbot for commit 30a466d. Bugbot is set up for automated code reviews on this repo. Configure here.