feat: direct subsampling and applying of OSW during scoring #205

Merged
singjc merged 8 commits into PyProphet:master from singjc:master
Apr 26, 2026
singjc commented Apr 26, 2026

This pull request introduces improved and more flexible subsampling support for OSW files, aligning the behavior with other supported file types (such as parquet) and making it easier to perform semi-supervised learning on subsets of the data. The implementation ensures that subsampling is handled efficiently within DuckDB views, and that all downstream feature queries respect the sampled subset. Additionally, there are minor improvements to file type validation and weight saving logic.

Enhanced OSW subsampling and feature querying:

  • Added a new _init_duckdb_views method to BaseOSWReader to create a temporary table of sampled precursor IDs when subsampling is enabled, allowing efficient filtering for all feature queries. This method is now called in the OSW reader before creating feature views.
  • Updated all DuckDB feature view creation methods in pyprophet/io/scoring/osw.py (_fetch_ms2_features_duckdb, _fetch_ms1_features_duckdb, _fetch_transition_features_duckdb, _fetch_alignment_features_duckdb) to optionally filter by the sampled precursor IDs, ensuring that only the subsampled data is processed when requested.
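
The temp-table approach described above can be sketched roughly as follows. This is not the code from the PR: it uses Python's built-in sqlite3 module instead of DuckDB so that it runs without extra dependencies, and the two-table schema, the `sampled_precursor_ids` table name, and both helper functions are hypothetical stand-ins.

```python
import math
import random
import sqlite3


def create_sampled_precursor_table(con, ratio, seed=0):
    """Materialize a random subset of precursor IDs in a temporary table."""
    ids = [row[0] for row in con.execute("SELECT ID FROM PRECURSOR ORDER BY ID")]
    n_keep = math.ceil(len(ids) * ratio)
    sampled = random.Random(seed).sample(ids, n_keep)
    con.execute(
        "CREATE TEMP TABLE sampled_precursor_ids (PRECURSOR_ID INTEGER PRIMARY KEY)"
    )
    con.executemany(
        "INSERT INTO sampled_precursor_ids VALUES (?)", [(i,) for i in sampled]
    )
    return n_keep


def fetch_features(con, subsample):
    """Return feature rows, joined against the sampled IDs only when requested."""
    query = "SELECT f.ID, f.PRECURSOR_ID FROM FEATURE AS f"
    if subsample:
        # Every feature query reuses the same temp table, so all of them
        # see an identical subset of precursors.
        query += " JOIN sampled_precursor_ids AS s ON s.PRECURSOR_ID = f.PRECURSOR_ID"
    return con.execute(query + " ORDER BY f.ID").fetchall()


# Toy data (hypothetical schema): 10 precursors with two features each.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER)")
con.executemany("INSERT INTO PRECURSOR VALUES (?)", [(i,) for i in range(10)])
con.executemany("INSERT INTO FEATURE VALUES (?, ?)", [(i, i // 2) for i in range(20)])

n_keep = create_sampled_precursor_table(con, ratio=0.5)
all_rows = fetch_features(con, subsample=False)
sampled_rows = fetch_features(con, subsample=True)
```

Sampling the IDs once and joining against them, rather than sampling inside each query, is what keeps the MS1, MS2, transition, and alignment views consistent with one another.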

User experience and compatibility improvements:

  • Extended file type validation in the scoring CLI so that OSW files, in addition to parquet formats, now support subsampling directly. The warning message has been updated to reflect this, and users of unsupported formats are advised to manually subsample their data.
  • In the OSW writer, updated the logic so that SVM classifier weights are saved in the same table as LDA weights, and ensured that the database commit is performed after saving.
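
As a rough illustration of the weight-saving change, the sketch below stores LDA and SVM weights in one table and commits once the rows are written. The PYPROPHET_WEIGHTS table name comes from the commit notes in this PR; the column layout, the `save_weights` helper, and the example score name and values are assumptions made only for this example.

```python
import os
import sqlite3
import tempfile

# Assumption: both linear classifiers share one weights table.
LINEAR_CLASSIFIERS = {"LDA", "SVM"}


def save_weights(con, classifier, level, weights):
    """Write per-score weights for one classifier and level, then commit."""
    if classifier not in LINEAR_CLASSIFIERS:
        raise ValueError(f"no linear weights to save for classifier {classifier!r}")
    # Hypothetical column layout, not the schema used by PyProphet.
    con.execute(
        "CREATE TABLE IF NOT EXISTS PYPROPHET_WEIGHTS "
        "(classifier TEXT, level TEXT, score TEXT, weight REAL)"
    )
    # Replace any weights stored earlier for the same classifier and level.
    con.execute(
        "DELETE FROM PYPROPHET_WEIGHTS WHERE classifier = ? AND level = ?",
        (classifier, level),
    )
    con.executemany(
        "INSERT INTO PYPROPHET_WEIGHTS VALUES (?, ?, ?, ?)",
        [(classifier, level, name, value) for name, value in weights.items()],
    )
    # Without this commit the rows would be rolled back when the connection closes.
    con.commit()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.osw")
    con = sqlite3.connect(path)
    save_weights(con, "LDA", "ms2", {"var_xcorr_shape": 0.7})
    save_weights(con, "SVM", "ms2", {"var_xcorr_shape": 0.4})
    con.close()

    # A fresh connection only sees the rows because they were committed.
    check = sqlite3.connect(path)
    stored = check.execute(
        "SELECT classifier, score, weight FROM PYPROPHET_WEIGHTS ORDER BY classifier"
    ).fetchall()
    check.close()
```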

Minor/maintenance:

  • Added import math to pyprophet/io/_base.py for use in subsample size calculation.
  • Updated Cython-generated file references to reflect newer numpy versions (no functional impact).
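
The PR text does not spell out the subsample size formula, so the following is only one plausible use of math here: rounding the requested fraction up so that a small ratio never yields an empty sample. The `subsample_size` function and its validation rules are hypothetical.

```python
import math


def subsample_size(n_items, ratio):
    """Number of items to keep for a given ratio, rounded up."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in the interval (0, 1]")
    # Rounding up keeps at least one item whenever n_items > 0;
    # plain int() truncation would return 0 for small inputs.
    return min(n_items, math.ceil(n_items * ratio))
```

For example, `subsample_size(3, 0.5)` returns 2, whereas truncating with `int(3 * 0.5)` would give 1.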

singjc and others added 4 commits April 17, 2026 09:57
…scores arrays in lookup_values_from_error_table
- Enhanced `PyProphetRunner` to return output file for OSW file type.
- Improved error message for missing PYPROPHET_WEIGHTS table to include classifier type.
- Introduced new test outputs for OSW subsampling and weight application.
- Updated `OSWTestStrategy` to handle subsampling and weight application workflows.
- Added tests for OSW subsampling and applying weights to the full dataset.

Co-authored-by: Copilot <copilot@github.com>
Copilot AI review requested due to automatic review settings April 26, 2026 17:09
@singjc singjc enabled auto-merge April 26, 2026 17:09
Copilot AI left a comment

Pull request overview

This PR adds first-class OSW subsampling support during scoring by pushing the sampled subset into DuckDB views, so downstream feature queries operate only on the sampled precursors (similar to parquet workflows). It also adjusts scoring/apply-weights behavior for OSW and includes a few compatibility fixes (NumPy/Pandas).

Changes:

  • Add OSW DuckDB initialization to materialize a sampled precursor-id set and apply it across OSW feature views.
  • Extend scoring CLI subsampling validation to include OSW (alongside parquet variants).
  • Improve weight persistence/application behavior (SVM weights in OSW weight table, commit), plus NumPy/Pandas compatibility fixes and new regression tests.
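
The subsampling validation in the second item might look roughly like the sketch below. The file type labels, the `resolve_subsample_ratio` helper, the warning text, and the fallback to scoring all data are assumptions for illustration; the PR only states that OSW now joins the parquet formats and that users of other formats are advised to subsample manually.

```python
import warnings

# Assumption: labels for the input formats that support direct subsampling.
SUBSAMPLE_FILE_TYPES = frozenset(
    {"osw", "parquet", "parquet_split", "parquet_split_multi"}
)


def resolve_subsample_ratio(file_type, subsample_ratio):
    """Return the ratio to use, warning when the format cannot be subsampled."""
    if subsample_ratio >= 1.0 or file_type in SUBSAMPLE_FILE_TYPES:
        return subsample_ratio
    supported = ", ".join(sorted(SUBSAMPLE_FILE_TYPES))
    warnings.warn(
        f"Subsampling is only supported for {supported} input; got '{file_type}'. "
        "Scoring all data; subsample the file manually if needed."
    )
    # Assumed fallback: ignore the requested ratio and use the full dataset.
    return 1.0


kept = resolve_subsample_ratio("osw", 0.5)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = resolve_subsample_ratio("tsv", 0.5)
```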

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `pyprophet/io/_base.py` | Adds BaseOSWReader._init_duckdb_views() to create a sampled precursor-id temp table when subsampling. |
| `pyprophet/io/scoring/osw.py` | Calls _init_duckdb_views() and threads subsample filtering into DuckDB feature-view creation; updates OSW weight saving logic. |
| `pyprophet/cli/score.py` | Treats OSW as a supported format for --subsample_ratio workflows. |
| `pyprophet/scoring/runner.py` | Returns OSW path from scoring runner; improves apply-weights error message. |
| `pyprophet/scoring/classifiers.py` | Forces feature matrices/parameters to NumPy arrays (dtype float32) for better Pandas/NumPy interop and clearer failure mode. |
| `pyprophet/stats.py` | Copies arrays before calling optimized matching to avoid read-only buffer issues in newer NumPy. |
| `tests/test_pyprophet_score.py` | Adds OSW subsample and OSW apply-weights regression tests; plumbs --subsample_ratio into OSW strategy execution. |
| `tests/_regtest_outputs/test_pyprophet_score.test_osw_subsample.out` | Adds golden output for OSW subsampling test. |
| `tests/_regtest_outputs/test_pyprophet_score.test_osw_subsample_apply_weights.out` | Adds golden output for OSW apply-weights test. |
| `pyprophet/scoring/_optimized.c` | Updates Cython-generated references (no intended functional change). |
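
The read-only buffer fix noted for pyprophet/stats.py follows a general pattern: a routine that needs a writable buffer fails on a read-only view, and copying the data first avoids the error. The sketch below shows that pattern with the standard library's array and memoryview types rather than NumPy and Cython; `clip_in_place` is a hypothetical stand-in for a compiled routine, not a PyProphet function.

```python
from array import array


def clip_in_place(values, upper):
    """Stand-in for a compiled routine that writes into its input buffer."""
    view = memoryview(values)
    for i in range(len(view)):
        if view[i] > upper:
            view[i] = upper  # raises TypeError if the buffer is read-only


scores = array("d", [0.2, 1.7, 0.9])
frozen = memoryview(scores).toreadonly()

try:
    clip_in_place(frozen, 1.0)
    failed = False
except TypeError:
    failed = True

# Copying yields an independent, writable buffer, so the call succeeds
# and the original data is left untouched.
writable = array("d", frozen.tolist())
clip_in_place(writable, 1.0)
```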

Comment thread pyprophet/io/scoring/osw.py
Comment thread pyprophet/io/scoring/osw.py
Comment thread pyprophet/io/scoring/osw.py
Comment thread tests/test_pyprophet_score.py
Comment thread tests/test_pyprophet_score.py
singjc and others added 4 commits April 26, 2026 13:58
…adjust test commands for subsampling ratio

Co-authored-by: Copilot <copilot@github.com>
… commands

- Modified expected output values in test_pyprophet_score.test_osw_1.out and test_pyprophet_score.test_tsv_1.out to reflect changes in scoring results.
- Increased subsampling ratio from 0.5 to 1.0 in pyprophet scoring commands for both metabolomics and regular OSW workflows in test_pyprophet_score.py.

Co-authored-by: Copilot <copilot@github.com>
@singjc singjc merged commit 301e13b into PyProphet:master Apr 26, 2026
1 check passed