[telemetry] Detect Python package manager(s) at project setup #1918
rugpanov wants to merge 1 commit into
Why: We need first-party data on which Python package manager(s) our users' projects actually use (pip/conda/uv/poetry) to prioritize VPEX setup-flow investment, replacing public-survey estimates. Measurement only -- no setup behavior changes.

What:
- Add packageManagerDetection.ts: a pure, signal-based classifier that reports all applicable managers plus a best-guess primary (uv > poetry > conda > pip), the firing signals, hasLockfile, and interpreter source. Treats bare uv/poetry on PATH as weak signals.
- Add Events.PYTHON_ENV_SETUP_DETECTED with a typed, documented schema in telemetry/constants.ts (reuses existing Telemetry client; opt-out honored; categorical data only, no paths/package/cluster names).
- Add telemetry/packageManagerExtensions.ts: the emit half, layered onto the Telemetry class via the commandExtensions declare-module pattern (recordPackageManagerDetection). Keeps disk/Python-extension deps out of the Telemetry client.
- Add PackageManagerTelemetry.ts: the collection half -- a best-effort, non-blocking collector (disk + already-resolved interpreter metadata) that gathers signals, runs the pure classifier, and calls the emit method. Deduplicated per session on (trigger, projectRoot); failures degrade to unknown and are swallowed.
- Wire emission into three touchpoints: project-open env check (auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug).
- Add unit tests for the detector and pure helpers, and a dashboard-owner handoff note.

Detection correctness:
- interpreterSource is derived from the active interpreter alone, never from project files: a uv.lock project on a conda/venv/system interpreter reports that interpreter's real source, keeping the setup-flow gap visible. A genuinely uv-provisioned venv is identified by the `uv =` marker in pyvenv.cfg (pure pyvenvCfgMarksUv), not by uv.lock.
- conda is attributed only when the active interpreter resides under CONDA_PREFIX (pure interpreterUnderCondaPrefix, with a path-boundary check), not on the bare env var, which is session-global in the extension host (launching from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.
- pyproject [tool.uv]/[tool.poetry] detection uses a pure, bounded table-header scan (pyprojectHasToolSection) instead of substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn), and matches subtable and array-of-table headers (e.g. [tool.uv.sources], [[tool.poetry.source]]) that the substring check missed.
- No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal); detection now only reads disk and already-resolved interpreter metadata.

Verification:
- yarn run build (typecheck) passes.
- eslint clean; prettier formatted.
- yarn run test:unit: 228 passing, 0 failing (includes detector + helper tests).

Co-authored-by: Isaac
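To make the table-header scan concrete, here is a minimal sketch of the idea. It is not the PR's `pyprojectHasToolSection`: the function name, signature, and the line bound are assumptions made for illustration, and it deliberately ignores quoted keys and multi-line strings.

```typescript
// Hypothetical sketch; the real helper in the PR may differ in name,
// signature, and limits.
const MAX_SCANNED_LINES = 2000; // assumed bound, not taken from the PR

function hasToolSection(pyproject: string, tool: string): boolean {
    const wanted = `tool.${tool}`;
    const lines = pyproject.split(/\r?\n/).slice(0, MAX_SCANNED_LINES);
    for (const raw of lines) {
        // Only lines that are table headers count: [a.b] or [[a.b]],
        // optionally followed by a trailing comment.
        const match = /^\[\[?\s*([^\[\]]+?)\s*\]\]?\s*(#.*)?$/.exec(raw.trim());
        if (!match) {
            continue;
        }
        // Collapse optional whitespace around dots: "tool . uv" -> "tool.uv".
        const key = (match[1] ?? "").replace(/\s*\.\s*/g, ".");
        // Exact table or a subtable; "tool.uvicorn" must not match "tool.uv".
        if (key === wanted || key.startsWith(wanted + ".")) {
            return true;
        }
    }
    return false;
}
```

Because only header lines are considered, a commented-out `# [tool.uv]` or a string value that mentions `[tool.uv]` does not fire, while `[tool.uv.sources]` and `[[tool.poetry.source]]` do.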
c236f29 to be9f174
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success.
Integration tests: https://github.com/databricks-eng/eng-dev-ecosystem/actions/runs/27942293534
    /** A Python package/environment manager detected for a project. */
    export type PackageManagerName = "uv" | "poetry" | "pip" | "conda";
    /** Best-guess primary manager, or "unknown" when no signal fires. */
    export type PrimaryManagerName = PackageManagerName | "unknown";
    /** How the active interpreter was provisioned. */
    export type InterpreterSource = "uv" | "conda" | "system" | "venv" | "unknown";
These types look duplicated from packageManagerDetection.ts. Can we reuse the same types?
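One possible way to keep a single definition is sketched below. It assumes the detector module would own the lists and the telemetry constants file would import the types from it; the constant and guard names are hypothetical, and everything is shown in one file only to stay self-contained.

```typescript
// Hypothetical single source of truth (would live in packageManagerDetection.ts).
const PACKAGE_MANAGER_NAMES = ["uv", "poetry", "pip", "conda"] as const;
const INTERPRETER_SOURCES = ["uv", "conda", "system", "venv", "unknown"] as const;

// Types are derived from the lists, so the values cannot drift apart.
type PackageManagerName = (typeof PACKAGE_MANAGER_NAMES)[number];
type PrimaryManagerName = PackageManagerName | "unknown";
type InterpreterSource = (typeof INTERPRETER_SOURCES)[number];

// Runtime guards built from the same lists (names are illustrative).
function isPackageManagerName(value: string): value is PackageManagerName {
    return (PACKAGE_MANAGER_NAMES as readonly string[]).includes(value);
}

function isInterpreterSource(value: string): value is InterpreterSource {
    return (INTERPRETER_SOURCES as readonly string[]).includes(value);
}

// The event schema can then refer to the shared types instead of redeclaring them.
interface DetectionPayload {
    primary: PrimaryManagerName;
    interpreterSource: InterpreterSource;
}
```

With this shape, the constants file would only need a type-only import of the three names rather than its own copies.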
     * free-form content (see {@link detectPackageManagers}). Telemetry opt-out is
     * honoured by the underlying {@link Telemetry} client.
     */
    export class PackageManagerTelemetry {
No tests for PackageManagerTelemetry?
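A test along these lines could cover the two behaviours the description promises for the collector: per-session deduplication on `(trigger, projectRoot)` and failures degrading to `unknown`. This is only a sketch with hypothetical stand-ins (`SessionDedup`, `collectOnce`); the real class's constructor and dependencies are not shown on this page, so nothing here should be read as its actual API.

```typescript
// Hypothetical stand-in for the collector's per-session dedup state.
class SessionDedup {
    private readonly seen = new Set<string>();

    /** Returns true only the first time a (trigger, projectRoot) pair is seen. */
    shouldEmit(trigger: string, projectRoot: string): boolean {
        // JSON-encode the pair so the two parts cannot run into each other.
        const key = JSON.stringify([trigger, projectRoot]);
        if (this.seen.has(key)) {
            return false;
        }
        this.seen.add(key);
        return true;
    }
}

// Hypothetical collector step: detection errors are swallowed and reported
// as "unknown" so that telemetry never disrupts setup.
function collectOnce(
    dedup: SessionDedup,
    trigger: string,
    projectRoot: string,
    detect: () => string,
    emit: (primary: string) => void
): void {
    if (!dedup.shouldEmit(trigger, projectRoot)) {
        return;
    }
    let primary = "unknown";
    try {
        primary = detect();
    } catch {
        // Swallowed by design; the emitted value stays "unknown".
    }
    emit(primary);
}
```

Injecting `detect` and `emit` as plain functions keeps such a test free of disk access and of the real telemetry client.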
    prefix.startsWith(base + "/") ||
    prefix.startsWith(base + "\\")
This assumes the filesystem is case-sensitive. On Windows, C:\Some and C:\some compare as different here, so the check returns false even though they denote the same folder.
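A case-aware variant of the boundary check might look like the sketch below. The helper names and the `caseInsensitive` flag are assumptions for illustration; it also treats backslashes as separators on every platform, which is a simplification on POSIX systems where a backslash can appear in a file name.

```typescript
// Hypothetical helpers; not the PR's interpreterUnderCondaPrefix.
function normalizeForCompare(p: string, caseInsensitive: boolean): string {
    // Unify separators and drop trailing ones so "C:\\a\\" and "C:/a" compare equal.
    const unified = p.replace(/\\/g, "/").replace(/\/+$/, "");
    return caseInsensitive ? unified.toLowerCase() : unified;
}

function isUnderPrefix(
    candidate: string,
    base: string,
    caseInsensitive: boolean
): boolean {
    const c = normalizeForCompare(candidate, caseInsensitive);
    const b = normalizeForCompare(base, caseInsensitive);
    if (b === "") {
        // An empty (or root-only) base is rejected rather than matching everything.
        return false;
    }
    // Path-boundary check: "/opt/conda2" must not count as under "/opt/conda".
    return c === b || c.startsWith(b + "/");
}
```

A caller could pass `process.platform === "win32"` as the flag. Whether other platforms with case-insensitive volumes need the same treatment is left open here.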
    try {
        return fs
            .readdirSync(projectRoot)
            .some((name) => /^requirements.*\.txt$/.test(name));
This regex also matches "requirementswhatever.txt". Was that intentional?
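If the intent is to accept `requirements.txt` plus separated variants such as `requirements-dev.txt`, a stricter pattern could look like this. The accepted separators (`-`, `_`, `.`) are an assumption about the intent, not something the PR states.

```typescript
// Assumed intent: "requirements.txt" or "requirements<sep><suffix>.txt",
// where <sep> is one of "-", "_", "." and <suffix> is non-empty.
const REQUIREMENTS_FILE = /^requirements([-_.][^/\\]+)?\.txt$/;

function isRequirementsFile(name: string): boolean {
    return REQUIREMENTS_FILE.test(name);
}
```

Requiring a separator before the suffix is what rejects names like `requirementswhatever.txt` while still allowing the common `-dev` and `_test` conventions.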
Changes
Measurement-only telemetry to learn which Python package manager(s) our users' projects actually use (pip / conda / uv / poetry), so the VPEX setup-flow investment can be prioritized from first-party data instead of public-survey estimates. No setup behavior changes — this is detection only.
The work splits cleanly into three layers so each is independently testable and the dependency direction stays correct (high-level → low-level):

- Pure classifier (`packageManagerDetection.ts`): given a set of already-collected signals, reports every applicable manager, a best-guess primary (priority `uv > poetry > conda > pip`), the firing signals, `hasLockfile`, and interpreter source. Side-effect free and total.
- Emit half (`telemetry/packageManagerExtensions.ts`): adds `recordPackageManagerDetection` to the existing `Telemetry` class via the same `declare module` pattern as `commandExtensions.ts`. Keeps disk/Python-extension dependencies out of the telemetry client.
- Collection half (`PackageManagerTelemetry.ts`): a best-effort, non-blocking collector that reads disk and already-resolved interpreter metadata, runs the pure classifier, and calls the emit method. Deduplicated per session on `(trigger, projectRoot)`; any failure degrades to `unknown` and is swallowed so it never disrupts setup.

Emission is wired into three setup touchpoints: project-open environment check (`auto_open`), the set-up-environment command (`explicit_command`), and first Run/Debug with Databricks Connect (`run/debug`).

A new `Events.PYTHON_ENV_SETUP_DETECTED` event carries a typed, documented schema (reuses the existing telemetry transport; opt-out honored; categorical data only: no paths, package names, or cluster names). A handoff note for the analytics/dashboard owner is included at `src/telemetry/PACKAGE_MANAGER_DETECTION.md`.

Detection correctness (the parts most worth reviewing):

- `interpreterSource` is derived from the active interpreter alone, never from project files. A `uv.lock` project running a conda/venv/system interpreter reports that interpreter's real source, keeping the "uv project, interpreter not uv-managed yet" setup-flow gap visible. A genuinely uv-provisioned venv is identified by the `uv =` marker in `pyvenv.cfg`, not by `uv.lock`.
- conda is attributed only when the active interpreter resides under `CONDA_PREFIX` (path-boundary checked), not on the bare env var, which is session-global in the extension host (launching VS Code from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.
- `pyproject` `[tool.uv]`/`[tool.poetry]` detection uses a bounded table-header scan, not substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. `tool.uvicorn`), and matches subtable and array-of-table headers (`[tool.uv.sources]`, `[[tool.poetry.source]]`).
- No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal). Detection reads only disk and already-resolved interpreter metadata.

Scope / privacy: measurement only, with no changes to setup behavior (the VPEX flows are a separate effort). Only enum/categorical data and a closed set of signal identifiers are emitted; the existing telemetry opt-out (`telemetry.telemetryLevel`) is respected by the transport.

Tests

- `yarn run test:unit`: 202 passing, 0 failing. Includes the pure classifier (each manager, interpreter sources, overlaps like uv+pip / conda+pip / poetry+uv, weak signals, none) and pure helpers (`pyprojectHasToolSection`, `pyvenvCfgMarksUv`, `interpreterUnderCondaPrefix`), covering the conda-prefix boundary and shell-global false-positive cases.
- `yarn run build` (typecheck) passes.
- `eslint` clean; `prettier` formatted.

Reviewer can validate with: