fix: osw to parquet export with global peptide/protein scores#206
Merged
singjc merged 5 commits intoPyProphet:masterfrom Apr 27, 2026
Merged
fix: osw to parquet export with global peptide/protein scores#206singjc merged 5 commits intoPyProphet:masterfrom
singjc merged 5 commits intoPyProphet:masterfrom
Conversation
- Introduced a new test file `test_osw_export_score_views.py` to validate the export of score views from OSW files. - Implemented helper functions to create test OSW databases and read joined scores using DuckDB. - Added tests to ensure global and experiment-wide scores are correctly handled when run IDs are null. - Enhanced `test_pyprophet_export.py` by adding a sorting function for exported parquet frames to ensure deterministic snapshots. - Updated existing tests to utilize the new sorting function for parquet exports.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes OSW→Parquet export behavior for peptide/protein score tables by improving how global vs run-scoped contexts are queried/merged and by making joins tolerant to RUN_ID being NULL, ensuring global scores are retained during export.
Changes:
- Refactors
_get_peptide_protein_score_tableto separately build/merge non-global (keyed by(ID, RUN_ID)) and global (keyed byID) score projections. - Updates score-view join conditions to allow matching on
(FEATURE.RUN_ID = view.RUN_ID OR view.RUN_ID IS NULL)so global-score rows withoutRUN_IDstill join. - Stabilizes parquet export regression snapshots by sorting exported parquet frames prior to printing.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
pyprophet/io/export/osw.py |
Refactors peptide/protein score view generation and adjusts join logic to handle global scores where RUN_ID may be NULL. |
tests/test_osw_export_score_views.py |
Adds focused tests validating that global peptide/protein scores are preserved when RUN_ID is NULL. |
tests/test_pyprophet_export.py |
Adds deterministic sorting before regtest snapshot output for parquet export tests. |
tests/_regtest_outputs/test_pyprophet_export.test_parquet_export_scored_osw.out |
Updates expected regtest snapshot after introducing deterministic sorting. |
tests/_regtest_outputs/test_pyprophet_export.test_parquet_export_no_transition_data.out |
Updates expected regtest snapshot after introducing deterministic sorting. |
Comments suppressed due to low confidence (1)
pyprophet/io/export/osw.py:2579
- Using DuckDB ANY_VALUE() to collapse potentially multiple rows per (context, ID, RUN_ID) can yield nondeterministic results if duplicates exist (it may pick any row). Prefer a deterministic aggregate (e.g., MIN/MAX) or enforce uniqueness (e.g., assert/count duplicates) so exports don’t silently vary across runs/files.
[
f"ANY_VALUE(CASE WHEN context = '{context}' THEN SCORE END) as {score_table}_{safe_context}_SCORE",
f"ANY_VALUE(CASE WHEN context = '{context}' THEN PVALUE END) as {score_table}_{safe_context}_PVALUE",
f"ANY_VALUE(CASE WHEN context = '{context}' THEN QVALUE END) as {score_table}_{safe_context}_QVALUE",
f"ANY_VALUE(CASE WHEN context = '{context}' THEN PEP END) as {score_table}_{safe_context}_PEP",
]
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ted before validation
- Introduced _stabilize_regtest_float function to ensure deterministic float rendering across platforms. - Updated _normalize_regtest_frame to utilize the new stabilization function for better consistency in test outputs. - Adjusted _normalize_peakgroup_regtest_frame to call the generalized normalization function. - Improved handling of tiny floating-point values and ensured zero values are consistently represented.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request improves the handling of score tables and joins in the
pyprophet/io/export/osw.pymodule, specifically addressing how global and non-global contexts are managed in SQL queries and how joins are constructed whenRUN_IDmay beNULL. The main focus is to ensure correct merging and selection of scores for both global and run-specific contexts, especially in cases where some data may lack aRUN_ID.Score table querying and merging improvements:
_get_peptide_protein_score_tableto separately track non-global and global context columns, and to handle cases where either or both types of context exist. This includes building the merged query with appropriateFULL OUTER JOINlogic and ensuring correct column selection and grouping. [1] [2]Join logic enhancements for handling NULL
RUN_ID:_build_score_column_selection_and_joinsto allow joining score views whereRUN_IDis either matching orNULL, improving robustness when global scores (without aRUN_ID) are present. This applies to both peptide and protein score joins. [1] [2]