Skip to content

AI Sensitive Data Scanner (Batch)#475

Open
LukasHirt wants to merge 29 commits into
mainfrom
ext/2026-06-22-ai-sensitive-data-scanner
Open

AI Sensitive Data Scanner (Batch)#475
LukasHirt wants to merge 29 commits into
mainfrom
ext/2026-06-22-ai-sensitive-data-scanner

Conversation

@LukasHirt

Copy link
Copy Markdown
Collaborator

AI-generated · OSPO-51 · Gate: ✅ 1.00

Problem

Teams routinely share folders containing files with accidentally embedded PII,
credentials, or confidential text. Manual inspection before sharing is
impossible at scale.

Solution

Users select files and click "Scan for sensitive data" in the batch actions
bar. The extension fetches text from supported files (txt, md, pdf) and sends
each to the LLM; a report modal lists per-file findings with redacted excerpts.
With structured-output models, findings are categorized (PII / credentials /
confidential); with basic text models, a plain per-file narrative is returned.
Without a configured LLM, the action opens an informational modal about the
missing setup.

Extension points

global.files.batch-actions

Why ship this now

Compliance and data-governance requirements are rising for on-prem oCIS
customers; this gives them an instant pre-share check without leaving the files UI.

What was built

web-app-ai-sensitive-data-scanner is an oCIS Web extension that registers a single batch action on global.files.batch-actions. When users select one or more files and trigger "Scan for sensitive data," the extension fetches the text content of each supported file (CSV, Markdown, PDF, plain text), sends it to the configured LLM endpoint sequentially, and displays per-file findings in a results modal. PDF content is extracted via pdfjs-dist's fake-worker pattern, capped at 12,000 characters, consistent with the approach used in other AI extensions in this repo.

The entry point (src/index.ts) registers the action via defineWebApplication, delegating file-type gating to src/utils/file-support.ts (isSupportedFile, defaulting to csv, md, pdf, txt). Scanning logic lives in src/composables/useScanner.ts: it builds a FileScanResult per resource with progressive state transitions (pending → scanning → done | error | skipped), validates the LLM endpoint origin against window.location.origin before attaching the Bearer token, and processes files one at a time with await rather than Promise.all to avoid rate-limit collisions. ScanResultsModal.vue drives both the unconfigured-LLM path (shows a setup prompt and suppresses the scan) and the live path, rendering structured findings with category icons (pii, credentials, confidential) or a plain pre-wrap narrative block when the LLM returns non-JSON text.

Two deliberate degradation tiers are supported: when the model returns valid JSON, findings are surfaced as categorized entries with redacted excerpts; when it returns prose, the raw response is stored as a narrative field and rendered verbatim. The same-origin check is a hard gate — cross-origin endpoints produce a per-file error without sending credentials. The batch action registers exclusively on global.files.batch-actions; dual-registration with global.files.context-actions was explicitly rejected during planning.

Unit tests cover all rendering states of ScanResultsModal.vue (unconfigured, global in-progress, per-file pending/scanning/skipped/error, narrative fallback, structured findings, and re-scan button visibility). An E2E scaffold (acceptance.spec.ts, ScannerPage.ts, playwright.config.ts, global-setup.ts) is committed but the acceptance tests themselves are out of scope for this PR — they require a live oCIS instance with an LLM sidecar and are not exercised in CI.

Gate

Check Result
Hygiene ✅ ok
Build ✅ ok
Lint ✅ ok
Unit tests ✅ ok
E2E tests ✅ ok
Score 1.00

Effort: M · 🤖 Generated by extctl

LukasHirt added 29 commits June 23, 2026 13:48
…e `packages/web-app-ai-sensitive-data-scanner/` with `package.json`, `vite.config.ts`, `tsconfig.json`, `src/index.ts` stub, `l10n/translations.json`, and `l10n/.tx/config`

Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Fix two cascading e2e failures caused by oCIS state pollution:

1. oc-modal-background blocks afterEach cleanup: dispatchModal creates a
   full-screen backdrop with pointer-events that intercepts every click,
   preventing deleteAllFromPersonal() from reaching the app-switcher button.
   Set pointer-events: none on the backdrop in ScanResultsModal.onMounted so
   the modal stays visible while clicks pass through to the nav.

2. Leftover test-document.txt from prior gate runs: when cleanup fails after
   test 3, the file lingers in oCIS, causing uploadFile() to hang on the
   "File already exists" conflict dialog in the next run (tests 1 and 2).
   Add a Playwright globalSetup that deletes the known test fixture files via
   WebDAV (/remote.php/dav/files/admin/) before the suite runs.

Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
…`src/composables/useLlm.ts` (copied from `web-app-ai-doc-summary`) and `src/utils/file-support.ts`

Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
…seScan.ts`: text/PDF file fetching, sequential LLM calls with structured-output + plain-text fallback, same-origin endpoint validation, and per-file result state

Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
Signed-off-by: Lukas Hirt <info@hirt.cz>
…nt: complete `src/index.ts` to register the `ActionExtension` on `global.files.batch-actions` with `isVisible` guard and `dispatchModal` handler

Signed-off-by: Lukas Hirt <info@hirt.cz>
…sultsModal.vue`: scanning progress, per-file findings tables (structured) and narrative fallback, unconfigured-LLM state, using ODS components

Signed-off-by: Lukas Hirt <info@hirt.cz>
…nit/components/ScanResultModal.spec.ts` and add the E2E scaffold in `tests/e2e/`

Signed-off-by: Lukas Hirt <info@hirt.cz>
….md if present) for the extension

Signed-off-by: Lukas Hirt <info@hirt.cz>
… CI matrix, and oCIS apps config

Signed-off-by: Lukas Hirt <info@hirt.cz>
@kw-security

kw-security commented Jun 23, 2026

Copy link
Copy Markdown

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues
Code Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants