
feat (umap, clustering): Added UMAP projection and clustering benchmark pipeline #14

Merged: HahaBill merged 6 commits into master from feat/umap-clustering-benchmark on Jun 15, 2026

@Harsh16gupta Harsh16gupta commented Jun 6, 2026

Collaborator

This PR introduces a clustering benchmark pipeline to evaluate different grouping strategies for note categorization. The 384-dimensional note embeddings are optionally reduced with UMAP before the clustering algorithms are run side by side.

Fixes: #15

Implements:

  • UMAP projection (384D to 10D) using umap-js.
  • Custom K-Means and K-Medoids implementations.
  • Wrapped HDBSCAN with support for cosine distance and noise tuning.
  • Silhouette coefficient scoring to evaluate cluster quality.
  • Shared Mulberry32 utility for deterministic, reproducible results.
  • Benchmark integration to log a comparison table of strategies in the console.
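
As an illustration of the deterministic seeding mentioned in the list above, here is a minimal sketch of a Mulberry32 generator in TypeScript. It is not the code from this PR: the names `mulberry32` and `seededShuffle` are hypothetical, and the shuffle is included only to show how a seeded generator can replace `Math.random` in a stochastic step. As far as I know, umap-js accepts such a function through its `random` option, but treat that as an assumption to check against the library's documentation.

```typescript
// Sketch only: hypothetical helper names, not the PR's src/utils/prng.ts.

// Mulberry32: a small 32-bit PRNG. The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // force the seed into unsigned 32-bit range
  return (): number => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Scale the unsigned 32-bit result into [0, 1)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by a caller-supplied generator instead of Math.random.
function seededShuffle<T>(items: T[], random: () => number): T[] {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = result[i];
    result[i] = result[j];
    result[j] = tmp;
  }
  return result;
}
```

Creating a fresh generator from the same seed for each benchmark run is what makes two runs comparable: every random choice (initial centroids, shuffles, UMAP initialization) is then replayed identically.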

[Image: test]

@HahaBill

HahaBill commented Jun 7, 2026

Member

@Harsh16gupta Hi Harsh, I created a GitHub issue (#15) so we can discuss the clustering algorithms, other ML approaches, and possible improvements in more depth.

I think it's better to report your experiments and findings, and to continue our discussions, there rather than here.

@Harsh16gupta

Collaborator Author

Hey Bill, thank you for creating an issue to tackle this.
I have deleted the long testing results from here and moved them to the issue. I will continue sharing my findings there.

@Harsh16gupta Harsh16gupta marked this pull request as ready for review June 12, 2026 09:21
@Harsh16gupta Harsh16gupta changed the title from "feat: Implement UMAP with various clustering algorithms" to "feat (umap, clustering): Added UMAP projection and clustering benchmark pipeline" Jun 12, 2026
@Harsh16gupta Harsh16gupta requested a review from HahaBill June 12, 2026 09:37
@HahaBill HahaBill requested a review from Copilot June 13, 2026 15:15

Copilot AI left a comment

Contributor


Pull request overview

Adds an end-to-end clustering benchmark pipeline for note-embedding vectors, optionally applying UMAP dimensionality reduction before running multiple clustering strategies and scoring them via silhouette coefficient. This supports Issue #15 by enabling side-by-side comparison of grouping approaches for note categorization.

Changes:

  • Introduces deterministic benchmarking utilities: Mulberry32 PRNG, UMAP projector wrapper, and distance/quality metrics (incl. silhouette score).
  • Adds clustering implementations/wrappers (K-Means, K-Medoids, HDBSCAN) plus a benchmark runner that logs a comparison table.
  • Integrates the benchmark run into the existing “Test Embedding” command and adds new dependencies (umap-js, hdbscan-ts).
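
To make the silhouette scoring mentioned in the changes above concrete, the following is a rough sketch of how a mean silhouette coefficient can be computed. It is not the implementation in `src/pipeline/clustering/metrics.ts`; the function names are hypothetical, and two choices are assumptions of this sketch: points labeled with a negative value (for example HDBSCAN noise) are ignored, and a labeling with fewer than two clusters scores 0.

```typescript
// Sketch only: hypothetical names, not the PR's metrics module.
type Distance = (a: number[], b: number[]) => number;

function euclidean(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Mean silhouette over all non-noise points; the result lies in [-1, 1].
function silhouetteScore(
  points: number[][],
  labels: number[],
  dist: Distance = euclidean
): number {
  // Group point indices by cluster label, skipping noise (label < 0).
  const clusters = new Map<number, number[]>();
  labels.forEach((label, i) => {
    if (label < 0) return;
    const members = clusters.get(label);
    if (members) members.push(i);
    else clusters.set(label, [i]);
  });
  if (clusters.size < 2) return 0; // assumption: undefined case scored as 0

  // Mean distance from point i to the given members, excluding i itself.
  const meanDistance = (i: number, members: number[]): number => {
    let total = 0;
    let count = 0;
    for (const j of members) {
      if (j === i) continue;
      total += dist(points[i], points[j]);
      count++;
    }
    return count === 0 ? 0 : total / count;
  };

  const groups = Array.from(clusters.entries());
  let sum = 0;
  let n = 0;
  for (const [label, members] of groups) {
    for (const i of members) {
      n++;
      if (members.length === 1) continue; // singleton clusters contribute 0
      const a = meanDistance(i, members); // cohesion within own cluster
      let b = Infinity; // separation: nearest other cluster
      for (const [other, otherMembers] of groups) {
        if (other !== label) b = Math.min(b, meanDistance(i, otherMembers));
      }
      const denom = Math.max(a, b);
      sum += denom === 0 ? 0 : (b - a) / denom;
    }
  }
  return sum / n;
}
```

Values near 1 indicate compact, well-separated clusters, while values near or below 0 suggest overlapping or misassigned points. Note that this direct version is quadratic in the number of points, which is usually acceptable for a benchmark over a modest set of notes.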

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tsconfig.json | Enables skipLibCheck for faster/less strict library type checking. |
| package.json | Adds umap-js and hdbscan-ts runtime dependencies. |
| package-lock.json | Locks newly added dependencies and their transitive tree. |
| src/utils/prng.ts | Adds Mulberry32 deterministic PRNG utility. |
| src/types/projector.ts | Defines UMAP projector option types. |
| src/types/cluster.ts | Defines clustering config/strategy/result types for benchmarking. |
| src/pipeline/UmapProjector.ts | Wraps umap-js with deterministic seeding and metric selection. |
| src/pipeline/clustering/metrics.ts | Adds cosine/euclidean distance and silhouette scoring. |
| src/pipeline/clustering/kmeans.ts | Implements K-Means with k-means++ initialization. |
| src/pipeline/clustering/kmedoids.ts | Implements K-Medoids (simplified PAM). |
| src/pipeline/clustering/hdbscan.ts | Wraps hdbscan-ts with cosine support via normalization. |
| src/pipeline/clustering/benchmark.ts | Runs strategies, computes scores, and logs a comparison table. |
| src/commands/testEmbed.ts | Runs the clustering benchmark after embedding and logs cluster assignments. |
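
The file summary above describes cosine support "via normalization" for the HDBSCAN wrapper. A likely reading, sketched below under that assumption, is that vectors are L2-normalized before being passed to a Euclidean-only clusterer: for unit vectors, squared Euclidean distance equals twice the cosine distance, so neighbor rankings agree. The helper names are hypothetical and no hdbscan-ts call is shown, since its exact API is not quoted in this thread.

```typescript
// Sketch only: hypothetical helpers illustrating the normalization trick.

// Scale a vector to unit length; a zero vector is returned unchanged (assumption).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Cosine distance: 1 - cos(angle). Returns 1 if either vector is all zeros (assumption).
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  if (na === 0 || nb === 0) return 1;
  return 1 - dot / Math.sqrt(na * nb);
}

function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// For unit vectors u and v: |u - v|^2 = 2 - 2 * (u . v) = 2 * cosineDistance(u, v).
// Normalizing every embedding once lets a Euclidean clusterer order
// neighbors exactly as cosine distance would.
function normalizeAll(vectors: number[][]): number[][] {
  return vectors.map(l2Normalize);
}
```

One consequence worth keeping in mind: the distances themselves are not equal (Euclidean distance on unit vectors is the square root of twice the cosine distance), so any distance threshold tuned for one metric does not transfer directly to the other.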


Comment thread src/commands/testEmbed.ts
Comment thread src/commands/testEmbed.ts
Comment thread src/pipeline/clustering/benchmark.ts Outdated
Comment thread src/pipeline/clustering/kmedoids.ts
Comment thread src/pipeline/clustering/kmeans.ts

@HahaBill HahaBill left a comment

Member


All looks good for now. Thank you for this PR :)

@HahaBill HahaBill merged commit 311e137 into master Jun 15, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature] UMAP with various clustering algorithms

3 participants