fix: osw to parquet export with global peptide/protein scores #206

Merged
singjc merged 5 commits into PyProphet:master from singjc:master
Apr 27, 2026

singjc commented Apr 26, 2026

This pull request improves the handling of score tables and joins in the pyprophet/io/export/osw.py module, specifically addressing how global and non-global contexts are managed in SQL queries and how joins are constructed when RUN_ID may be NULL. The main focus is to ensure correct merging and selection of scores for both global and run-specific contexts, especially in cases where some data may lack a RUN_ID.

Score table querying and merging improvements:

  • Refactored the construction of pivot columns and queries in _get_peptide_protein_score_table to separately track non-global and global context columns, and to handle cases where either or both types of context exist. This includes building the merged query with appropriate FULL OUTER JOIN logic and ensuring correct column selection and grouping.
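
The split described above can be sketched as a small query builder. This is an illustration only, not the code in osw.py: the helper names (`pivot_columns`, `select_list`, `build_score_query`), the context-sanitising rule, the `NULL AS RUN_ID` placeholder for the global-only case, and the example table and column names are all assumptions. The sketch only builds SQL text; it does not execute it.

```python
from typing import List, Optional, Tuple

METRICS = ("SCORE", "PVALUE", "QVALUE", "PEP")


def pivot_columns(score_table: str, context: str) -> List[Tuple[str, str]]:
    """Return (expression, alias) pairs that pivot one context into columns."""
    safe_context = context.upper().replace("-", "_")  # assumed sanitising rule
    return [
        (
            f"ANY_VALUE(CASE WHEN context = '{context}' THEN {metric} END)",
            f"{score_table}_{safe_context}_{metric}",
        )
        for metric in METRICS
    ]


def select_list(columns: List[Tuple[str, str]]) -> str:
    return ", ".join(f"{expr} AS {alias}" for expr, alias in columns)


def build_score_query(score_table: str, id_col: str, contexts: List[str]) -> str:
    """Build one query for run-scoped contexts, global contexts, or both."""
    non_global = [c for ctx in contexts if ctx != "global"
                  for c in pivot_columns(score_table, ctx)]
    global_cols = [c for ctx in contexts if ctx == "global"
                   for c in pivot_columns(score_table, ctx)]

    ng_sql: Optional[str] = None
    g_sql: Optional[str] = None
    if non_global:
        # Run-scoped projection, keyed by (ID, RUN_ID).
        ng_sql = (
            f"SELECT {id_col}, RUN_ID, {select_list(non_global)} "
            f"FROM {score_table} WHERE context != 'global' "
            f"GROUP BY {id_col}, RUN_ID"
        )
    if global_cols:
        # Global projection, keyed by ID only.
        g_sql = (
            f"SELECT {id_col}, {select_list(global_cols)} "
            f"FROM {score_table} WHERE context = 'global' "
            f"GROUP BY {id_col}"
        )

    if ng_sql and g_sql:
        outer = ", ".join(
            [f"ng.{alias}" for _, alias in non_global]
            + [f"g.{alias}" for _, alias in global_cols]
        )
        # IDs that only have global scores keep a NULL RUN_ID after the join.
        return (
            f"SELECT COALESCE(ng.{id_col}, g.{id_col}) AS {id_col}, ng.RUN_ID, {outer} "
            f"FROM ({ng_sql}) AS ng FULL OUTER JOIN ({g_sql}) AS g "
            f"ON ng.{id_col} = g.{id_col}"
        )
    if ng_sql:
        return ng_sql
    if g_sql:
        aliases = ", ".join(alias for _, alias in global_cols)
        return f"SELECT {id_col}, NULL AS RUN_ID, {aliases} FROM ({g_sql}) AS g"
    raise ValueError("no score contexts found")


print(build_score_query("SCORE_PEPTIDE", "PEPTIDE_ID", ["run-specific", "global"]))
```

With both kinds of context present, the builder emits a FULL OUTER JOIN on the ID column; with only one kind present, it falls back to that single projection.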

Join logic enhancements for handling NULL RUN_ID:

  • Updated the join conditions in _build_score_column_selection_and_joins to allow joining score views where RUN_ID is either matching or NULL, improving robustness when global scores (without a RUN_ID) are present. This applies to both peptide and protein score joins.
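
The effect of the relaxed join condition can be shown on a toy schema. The sketch below uses the standard-library sqlite3 module as a stand-in for the DuckDB queries in the export code; the table names, column names, and data are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE feature (feature_id INTEGER, run_id INTEGER, peptide_id INTEGER);
    CREATE TABLE peptide_scores (peptide_id INTEGER, run_id INTEGER, qvalue REAL);
    INSERT INTO feature VALUES (1, 10, 100), (2, 11, 101);
    -- Peptide 100 has a run-scoped score; peptide 101 only has a global
    -- score, stored with a NULL run_id.
    INSERT INTO peptide_scores VALUES (100, 10, 0.01), (101, NULL, 0.02);
    """
)

STRICT = "f.run_id = s.run_id"
TOLERANT = "(f.run_id = s.run_id OR s.run_id IS NULL)"


def joined_qvalues(run_condition: str):
    """LEFT JOIN features to scores using the given run_id condition."""
    sql = f"""
        SELECT f.feature_id, s.qvalue
        FROM feature AS f
        LEFT JOIN peptide_scores AS s
          ON f.peptide_id = s.peptide_id AND {run_condition}
        ORDER BY f.feature_id
    """
    return con.execute(sql).fetchall()


strict_rows = joined_qvalues(STRICT)
tolerant_rows = joined_qvalues(TOLERANT)
print(strict_rows)    # [(1, 0.01), (2, None)]
print(tolerant_rows)  # [(1, 0.01), (2, 0.02)]
```

Under the strict condition, the comparison `11 = NULL` never evaluates to true, so the global score for feature 2 is dropped. One caveat of the tolerant form, at least in this toy setup: if a score view held both a run-scoped row and a NULL-RUN_ID row for the same ID, the feature would match both rows.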

singjc and others added 2 commits April 26, 2026 19:52
- Introduced a new test file `test_osw_export_score_views.py` to validate the export of score views from OSW files.
- Implemented helper functions to create test OSW databases and read joined scores using DuckDB.
- Added tests to ensure global and experiment-wide scores are correctly handled when run IDs are null.
- Enhanced `test_pyprophet_export.py` by adding a sorting function for exported parquet frames to ensure deterministic snapshots.
- Updated existing tests to utilize the new sorting function for parquet exports.
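
A minimal sketch of the idea behind sorting before snapshotting follows. It uses plain dictionaries rather than the data frames the tests operate on, and `sort_rows` is a hypothetical helper, not the function added in test_pyprophet_export.py.

```python
from typing import Any, Dict, List, Sequence


def sort_rows(rows: List[Dict[str, Any]], keys: Sequence[str]) -> List[Dict[str, Any]]:
    """Return rows in a stable order by the given key columns, with None last."""

    def sort_key(row: Dict[str, Any]):
        # Each part is (is_none, value): the flag is compared first, so a
        # None is never compared against a real value.
        return tuple(
            (row.get(k) is None, 0 if row.get(k) is None else row.get(k))
            for k in keys
        )

    return sorted(rows, key=sort_key)


rows = [
    {"RUN_ID": None, "FEATURE_ID": 2},
    {"RUN_ID": 10, "FEATURE_ID": 3},
    {"RUN_ID": 10, "FEATURE_ID": 1},
]
ordered = sort_rows(rows, ["RUN_ID", "FEATURE_ID"])
print([r["FEATURE_ID"] for r in ordered])  # [1, 3, 2]
```

Because the order depends only on the key columns, the printed snapshot is the same whatever order the export happened to write the rows in.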
Copilot AI review requested due to automatic review settings April 26, 2026 23:53

Copilot AI left a comment

Pull request overview

This PR fixes OSW→Parquet export behavior for peptide/protein score tables by improving how global vs run-scoped contexts are queried/merged and by making joins tolerant to RUN_ID being NULL, ensuring global scores are retained during export.

Changes:

  • Refactors _get_peptide_protein_score_table to separately build/merge non-global (keyed by (ID, RUN_ID)) and global (keyed by ID) score projections.
  • Updates score-view join conditions to allow matching on (FEATURE.RUN_ID = view.RUN_ID OR view.RUN_ID IS NULL) so global-score rows without RUN_ID still join.
  • Stabilizes parquet export regression snapshots by sorting exported parquet frames prior to printing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pyprophet/io/export/osw.py | Refactors peptide/protein score view generation and adjusts join logic to handle global scores where RUN_ID may be NULL. |
| tests/test_osw_export_score_views.py | Adds focused tests validating that global peptide/protein scores are preserved when RUN_ID is NULL. |
| tests/test_pyprophet_export.py | Adds deterministic sorting before regtest snapshot output for parquet export tests. |
| tests/_regtest_outputs/test_pyprophet_export.test_parquet_export_scored_osw.out | Updates expected regtest snapshot after introducing deterministic sorting. |
| tests/_regtest_outputs/test_pyprophet_export.test_parquet_export_no_transition_data.out | Updates expected regtest snapshot after introducing deterministic sorting. |
Comments suppressed due to low confidence (1)

pyprophet/io/export/osw.py:2579

  • Using DuckDB ANY_VALUE() to collapse potentially multiple rows per (context, ID, RUN_ID) can yield nondeterministic results if duplicates exist (it may pick any row). Prefer a deterministic aggregate (e.g., MIN/MAX) or enforce uniqueness (e.g., assert/count duplicates) so exports don’t silently vary across runs/files.
                    [
                        f"ANY_VALUE(CASE WHEN context = '{context}' THEN SCORE END) as {score_table}_{safe_context}_SCORE",
                        f"ANY_VALUE(CASE WHEN context = '{context}' THEN PVALUE END) as {score_table}_{safe_context}_PVALUE",
                        f"ANY_VALUE(CASE WHEN context = '{context}' THEN QVALUE END) as {score_table}_{safe_context}_QVALUE",
                        f"ANY_VALUE(CASE WHEN context = '{context}' THEN PEP END) as {score_table}_{safe_context}_PEP",
                    ]
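
The two options the comment suggests, checking for duplicate keys and using a deterministic aggregate, can be sketched as follows. The example uses the standard-library sqlite3 module in place of DuckDB, and the table layout and data are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE score_peptide (
        context TEXT, peptide_id INTEGER, run_id INTEGER, qvalue REAL
    );
    INSERT INTO score_peptide VALUES
        ('global', 100, NULL, 0.01),
        ('global', 101, NULL, 0.02),
        ('global', 101, NULL, 0.05);  -- deliberate duplicate key
    """
)

# Option 1: surface duplicate (context, ID, RUN_ID) keys before pivoting.
# GROUP BY treats NULL run_id values as one group, so global rows are covered.
duplicates = con.execute(
    """
    SELECT context, peptide_id, run_id, COUNT(*) AS n
    FROM score_peptide
    GROUP BY context, peptide_id, run_id
    HAVING COUNT(*) > 1
    """
).fetchall()

# Option 2: a deterministic aggregate returns the same value on every run.
min_qvalues = con.execute(
    """
    SELECT peptide_id, MIN(CASE WHEN context = 'global' THEN qvalue END)
    FROM score_peptide
    GROUP BY peptide_id
    ORDER BY peptide_id
    """
).fetchall()

print(duplicates)   # [('global', 101, None, 2)]
print(min_qvalues)  # [(100, 0.01), (101, 0.02)]
```

Whether MIN, MAX, or a hard failure on duplicates is the right choice depends on what a duplicate score row means in the data, which the comment leaves open.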

Comment thread pyprophet/io/export/osw.py
singjc added 3 commits April 26, 2026 21:15
- Introduced _stabilize_regtest_float function to ensure deterministic float rendering across platforms.
- Updated _normalize_regtest_frame to utilize the new stabilization function for better consistency in test outputs.
- Adjusted _normalize_peakgroup_regtest_frame to call the generalized normalization function.
- Improved handling of tiny floating-point values and ensured zero values are consistently represented.
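
One way such a stabilization step could look is sketched below. The function name `stabilize_float`, the significant-digit count, and the tiny-value threshold are assumptions chosen for the example; they are not taken from `_stabilize_regtest_float`.

```python
import math
from typing import Optional


def stabilize_float(
    value: Optional[float], sig_digits: int = 6, tiny: float = 1e-12
) -> Optional[float]:
    """Round to a fixed number of significant digits and snap near-zero to 0.0."""
    if value is None or math.isnan(value):
        return value
    if abs(value) < tiny:
        # Covers -0.0 and platform-dependent noise such as 1e-17.
        return 0.0
    return float(f"{value:.{sig_digits}g}")


print(stabilize_float(0.1 + 0.2))   # 0.3
print(stabilize_float(1e-15))       # 0.0
print(stabilize_float(-0.0))        # 0.0
print(stabilize_float(1234.56789))  # 1234.57
```

Rounding before printing keeps last-digit differences between platforms out of the snapshot, at the cost of hiding genuine changes smaller than the chosen precision.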
@singjc singjc enabled auto-merge April 27, 2026 04:38
@singjc singjc merged commit 5d4406d into PyProphet:master Apr 27, 2026
1 check passed