feat (umap, clustering): Added UMAP projection and clustering benchmark pipeline #14
@Harsh16gupta Hi Harsh, I created a new GitHub issue, #15, so we can discuss the clustering algorithms, other ML approaches, and possible improvements in more depth. I think it's better to report your experiments, findings, and our discussions there rather than here.
Hey Bill, thank you for creating an issue to tackle this.
…bed logs allowed to show all the clusters formed by the all the algo
Pull request overview
Adds an end-to-end clustering benchmark pipeline for note-embedding vectors, optionally applying UMAP dimensionality reduction before running multiple clustering strategies and scoring them via silhouette coefficient. This supports Issue #15 by enabling side-by-side comparison of grouping approaches for note categorization.
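For readers unfamiliar with the scoring step, here is a minimal sketch of a silhouette coefficient in TypeScript. It is not the code from `src/pipeline/clustering/metrics.ts`; the function names, the choice to skip points labelled `-1` (treated here as noise), and the convention of returning 0 when fewer than two clusters exist are all assumptions made for illustration.

```typescript
type Distance = (a: number[], b: number[]) => number;

// Euclidean distance between two vectors of equal length.
const euclidean: Distance = (a, b) =>
  Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) * (ai - b[i]), 0));

// Mean silhouette over all clustered points.
// Assumption: negative labels mark noise and are ignored.
function silhouetteScore(
  points: number[][],
  labels: number[],
  dist: Distance = euclidean
): number {
  // Group point indices by cluster label.
  const clusters = new Map<number, number[]>();
  labels.forEach((label, i) => {
    if (label < 0) return;
    const members = clusters.get(label) || [];
    members.push(i);
    clusters.set(label, members);
  });
  const entries = Array.from(clusters.entries());
  if (entries.length < 2) return 0; // undefined for a single cluster; 0 by convention here

  let total = 0;
  let count = 0;
  for (const [label, members] of entries) {
    for (const i of members) {
      count++;
      if (members.length === 1) continue; // singleton clusters contribute 0

      // a: mean distance to the other members of the same cluster.
      const a =
        members.reduce((s, j) => s + (j === i ? 0 : dist(points[i], points[j])), 0) /
        (members.length - 1);

      // b: smallest mean distance to the members of any other cluster.
      let b = Infinity;
      for (const [other, otherMembers] of entries) {
        if (other === label) continue;
        const mean =
          otherMembers.reduce((s, j) => s + dist(points[i], points[j]), 0) /
          otherMembers.length;
        b = Math.min(b, mean);
      }

      const denom = Math.max(a, b);
      total += denom === 0 ? 0 : (b - a) / denom;
    }
  }
  return total / count;
}
```

Scores near 1 indicate compact, well-separated clusters; scores near or below 0 suggest points sit closer to a neighbouring cluster than to their own.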
Changes:
- Introduces deterministic benchmarking utilities: Mulberry32 PRNG, UMAP projector wrapper, and distance/quality metrics (incl. silhouette score).
- Adds clustering implementations/wrappers (K-Means, K-Medoids, HDBSCAN) plus a benchmark runner that logs a comparison table.
- Integrates the benchmark run into the existing “Test Embedding” command and adds new dependencies (umap-js, hdbscan-ts).
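As a point of reference for the deterministic utilities listed above, this is the commonly circulated Mulberry32 generator written in TypeScript. It is a sketch and may differ from what `src/utils/prng.ts` contains; the function name `mulberry32` and the factory-returning-a-closure shape are assumptions.

```typescript
// Mulberry32: a small 32-bit PRNG. Returns a function that yields
// floats in [0, 1), so it can stand in for Math.random where a
// reproducible sequence is needed.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // coerce the seed to an unsigned 32-bit integer
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators created from the same seed produce the same sequence,
// which is what makes repeated benchmark runs comparable.
const first = mulberry32(42);
const second = mulberry32(42);
const sameSequence = [1, 2, 3].every(() => first() === second());
```

Any step that draws random numbers, such as centroid initialization, can take such a function as a parameter so that repeated runs with the same seed are comparable.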
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tsconfig.json | Enables skipLibCheck for faster/less strict library type checking. |
| package.json | Adds umap-js and hdbscan-ts runtime dependencies. |
| package-lock.json | Locks newly added dependencies and their transitive tree. |
| src/utils/prng.ts | Adds Mulberry32 deterministic PRNG utility. |
| src/types/projector.ts | Defines UMAP projector option types. |
| src/types/cluster.ts | Defines clustering config/strategy/result types for benchmarking. |
| src/pipeline/UmapProjector.ts | Wraps umap-js with deterministic seeding and metric selection. |
| src/pipeline/clustering/metrics.ts | Adds cosine/euclidean distance and silhouette scoring. |
| src/pipeline/clustering/kmeans.ts | Implements K-Means with k-means++ initialization. |
| src/pipeline/clustering/kmedoids.ts | Implements K-Medoids (simplified PAM). |
| src/pipeline/clustering/hdbscan.ts | Wraps hdbscan-ts with cosine support via normalization. |
| src/pipeline/clustering/benchmark.ts | Runs strategies, computes scores, and logs a comparison table. |
| src/commands/testEmbed.ts | Runs the clustering benchmark after embedding and logs cluster assignments. |
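The "cosine support via normalization" entry for the HDBSCAN wrapper most likely relies on a standard identity: for unit-length vectors, squared Euclidean distance equals twice the cosine distance, so a Euclidean-only clusterer run on normalized vectors ranks neighbours the same way cosine distance would. The sketch below illustrates that identity only; whether the PR's wrapper does exactly this is an assumption, and all function names are hypothetical.

```typescript
// Scale a vector to unit length; zero vectors are returned as a copy.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) * (ai - b[i]), 0));
}

// Cosine distance = 1 - cosine similarity. Returns 1 if either vector is zero.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 1 : 1 - dot / denom;
}

// For unit vectors u and v: |u - v|^2 = 2 * (1 - cos(u, v)).
const u = l2Normalize([3, 4]);
const v = l2Normalize([4, 3]);
const squaredEuclidean = euclideanDistance(u, v) ** 2;
const twiceCosine = 2 * cosineDistance([3, 4], [4, 3]);
```

Because the mapping between the two distances is monotonic, nearest-neighbour orderings are preserved, although absolute thresholds expressed in one metric do not carry over to the other unchanged.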
HahaBill left a comment
All looks good for now. Thank you for this PR :)
This PR introduces a clustering benchmark pipeline to evaluate different grouping strategies for note categorization. High-dimensional (384D) note embeddings are optionally reduced with UMAP before the clustering algorithms are run side by side.
Fixes: #15
Implements:
test
