v2 preprint: MolOR updates, percept ablations, baselines #6
Merged
jdthamores merged 6 commits into microsoft:main on May 7, 2026
- Add .gitignore for training output dirs, ESM/DGL caches, plot outputs, deprecated/, CLAUDE.md, and the local-only fig4 stat-tests notebook
- Stop tracking notebooks/fig4_stat_tests.ipynb (kept locally, gitignored)
- Add MolOR MPNN-encoder config (data/configures/M2OR_Pairs/MolOR_MPNN_canonical.json)
- Add PerceiverCPI/Goldman M2OR train/val/test splits and naive_broad_ORs_30%.fasta
- Add notebooks/test_OR_logits_shuffle.ipynb (shuffle/null control)
- Update analyze_blast_results.ipynb
- README updated for v2: MPNN encoder option, ablation scripts under scripts/, Goldman/PerceiverCPI baselines, Zenodo placeholder

Co-Authored-By: Seyone <seyonec@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- data/datasets/HORDE/{functional,pseudogene}_indices.json: load-bearing for
the pseudogene-control variant of the GS-LF percept model
(classification_OR_feat_ESM.py:112-113)
- data/datasets/M2OR/{seqs,seqs_with_annotations}.csv: OR sequences and
family annotations used by utils.py for OR-subfamily analyses
Co-Authored-By: Seyone <seyonec@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the codebase to support the v2 preprint experiments: adding a MolOR variant with an MPNN molecular encoder, enabling OR subfamily holdout splits, expanding percept ablation tooling (including pseudogene/functional HORDE controls), and improving repo hygiene around large artifacts.
Changes:
- Add MolOR_MPNN support (MPNN encoder path in `MolORPredictor`, model loading, and inference wiring).
- Add an `or_subfamily_holdout` dataset split and loss-weighting normalization option for M2OR pairs.
- Add/revise scripts and small dataset artifacts to reproduce baseline splits/ablations and manage large outputs via `.gitignore`.
Reviewed changes
Copilot reviewed 20 out of 29 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `utils.py` | Adds MolOR_MPNN to featurizer handling, introduces OR subfamily holdout split, and updates OR-feature prediction path to pass edge features. |
| `classification_ESM.py` | Adds MolOR_MPNN / OR-subfamily holdout CLI options and new weighted-loss normalization flag; updates seeding behavior. |
| `gcn_or_predictor.py` | Adds MPNNGNN-based encoder option to MolORPredictor and threads optional edge_feats through forward(). |
| `data/m2or.py` | Adds normalize_loss_by_class_imbalance option; refactors sample-weight computation and path joining. |
| `classification_OR_feat_ESM.py` | Adds HORDE pseudogene/functional index selection and applies index filtering when loading logits. |
| `scripts/run_OR_percept_ablations_HORDE.sh` | Updates ablation runner to call classification_OR_feat_ESM.py. |
| `scripts/prepare_enzpred_data.py` | New utility to export PerceiverCPI M2OR splits into enz-pred CSV + PresetSplitter indices. |
| `scripts/merge_blast_annotations.py` | New script to merge BLAST-derived UniProt/gene IDs into sequence CSVs. |
| `scripts/get_gene_uniprot_IDs_blast.py` | New script to BLAST sequences and annotate UniProt/gene IDs. |
| `scripts/get_HORDE_metadata.ipynb` | New notebook to derive HORDE pseudogene/functional indices. |
| `receptor_binding/analyze_blast_results.ipynb` | Updates notebook paths/kernel metadata for BLAST result analysis. |
| `notebooks/OR_subfamily_analysis.ipynb` | New notebook analyzing OR subfamily distributions and weights. |
| `notebooks/cross_task_stats.ipynb` | Minor notebook edit (adds an empty code cell). |
| `README.md` | Updates repo overview and adds more structured usage/environment notes for v2. |
| `.gitignore` | Adds ignores for large artifacts, caches, and training output directories. |
| `data/datasets/HORDE/pseudogene_indices.json` | Adds precomputed pseudogene indices for HORDE. |
| `data/datasets/HORDE/functional_indices.json` | Adds precomputed functional indices for HORDE. |
| `data/datasets/naive_broad_ORs_30%.fasta` | Adds a small FASTA artifact referenced as “load-bearing” data. |
| `data/configures/M2OR_Pairs/MolOR_MPNN_canonical.json` | Adds config for MolOR_MPNN on M2OR_Pairs. |
| `data/configures/GS_LF/MPNN_canonical.json` | Updates MPNN config hyperparameters for GS_LF. |
Comment on lines 285 to 288

    else:
        device = torch.device('cpu')
    torch.cuda.set_device(device)
    args['device'] = device

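In the excerpt above, `torch.cuda.set_device(device)` appears to run after the CPU fallback has assigned `torch.device('cpu')`. The original indentation was lost in extraction, so this reading is an assumption. Since `torch.cuda.set_device` expects a CUDA device, one possible arrangement calls it only on the GPU branch. A minimal sketch, where `resolve_device` and `gpu_index` are hypothetical names that do not come from the PR:

```python
def resolve_device(gpu_index: int = 0):
    """Return a torch.device, touching the CUDA API only when a GPU exists."""
    # Imported lazily so this sketch can be loaded without PyTorch installed.
    import torch

    if torch.cuda.is_available():
        device = torch.device(f"cuda:{gpu_index}")
        # Safe here: the device is guaranteed to be a CUDA device.
        torch.cuda.set_device(device)
    else:
        # CPU fallback: skip torch.cuda.set_device entirely.
        device = torch.device("cpu")
    return device
```

The returned value could then be stored as `args['device']`, as the excerpt does.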
Comment on lines 291 to +296

    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

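The seeding calls above could be collected into a single helper so every entry point seeds the same generators. A minimal sketch, assuming a hypothetical `seed_everything` function (not part of the PR) that also seeds Python's `random` module and treats NumPy and PyTorch as optional so it runs in a bare environment:

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG and, when installed, NumPy and PyTorch."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; the remaining steps do not apply.

    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Same cuDNN flags as the excerpt: prefer deterministic kernels
    # and disable the autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

These settings reduce run-to-run variation, but some GPU operations may still be nondeterministic.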
    @@ -626,10 +630,9 @@ def forward(self, bg, feats, add_feats = None, seq_mask = None, node_mask = None
        """
        #print(bg)
        #print(feats)

Comment on lines +539 to +542

        print('normalizing loss using log1p ofclass imbalance alongside canonical weighing scheme')
        df['sample_weight'] = np.log1p(df['weight_class']) * df['weight_pair_imbalance'] * df['weight_quality']
    else:
        print('normalizing loss using cannonical weighing scheme')

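The first branch above multiplies three per-sample factors and passes the class weight through `log1p`, which compresses large class-imbalance weights. The following sketch reproduces only that branch with the standard library; the function name and the numeric inputs are hypothetical, and the `else` branch is not shown in the excerpt, so it is not reproduced here:

```python
import math


def log1p_sample_weight(weight_class: float,
                        weight_pair_imbalance: float,
                        weight_quality: float) -> float:
    """Per-sample weight with the class term damped by log1p."""
    return math.log1p(weight_class) * weight_pair_imbalance * weight_quality


# Hypothetical class weights spanning two orders of magnitude:
# after log1p, the 100x spread shrinks to well under 10x.
for w in (1.0, 10.0, 100.0):
    print(w, round(log1p_sample_weight(w, 1.0, 1.0), 3))
```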
Comment on lines +105 to +113

    print(f'Using OR subfamily holdout with subfamily: {args["or_subfamily_holdout"]}')
    from dgl.data.utils import Subset
    import pandas as pd
    import numpy as np

    # Validate input
    target_subfamily = args['or_subfamily_holdout']
    if target_subfamily is None:
        raise ValueError("--or_subfamily_holdout must be specified when using 'or_subfamily_holdout' split")

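The excerpt shows only the input validation for the holdout split. As an illustration of the general idea, the sketch below partitions sample indices so that every sample from the target subfamily lands in the held-out set. It is an assumption about how such a split can work, not the PR's implementation; the function name, the subfamily labels, and the absence of a validation set are all hypothetical:

```python
from typing import List, Optional, Sequence, Tuple


def subfamily_holdout_split(
    subfamilies: Sequence[str],
    target_subfamily: Optional[str],
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with the target subfamily held out."""
    if target_subfamily is None:
        raise ValueError("a target subfamily must be specified for a holdout split")

    test_idx = [i for i, fam in enumerate(subfamilies) if fam == target_subfamily]
    train_idx = [i for i, fam in enumerate(subfamilies) if fam != target_subfamily]

    # An empty held-out set usually means a misspelled subfamily name.
    if not test_idx:
        raise ValueError(f"no samples found for subfamily {target_subfamily!r}")
    return train_idx, test_idx


# Hypothetical labels, one per sample.
train, test = subfamily_holdout_split(["OR51", "OR2", "OR51", "OR7"], "OR51")
```

The resulting index lists could be wrapped in a `Subset`, which the excerpt imports from `dgl.data.utils`.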
Comment on lines +13 to +16

    DATA_DIR = os.path.join(os.path.dirname(__file__), "..", "data", "datasets")
    OUT_CSV = os.path.join(os.path.expanduser("~"), "enz-pred", "data", "processed", "m2or_binary.csv")
    OUT_PICKLE = os.path.join(os.path.expanduser("~"), "enz-pred", "data", "processed", "m2or_split_indices.p")

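The output locations above are fixed to an `enz-pred` checkout in the user's home directory. One way to keep that default while letting other users redirect the output is a command-line override. A minimal sketch, in which the `--out-dir` flag and both helper functions are hypothetical and not part of the PR:

```python
import argparse
import os
from pathlib import Path
from typing import Optional, Sequence, Tuple

# Same default location as the excerpt.
DEFAULT_OUT_DIR = Path(os.path.expanduser("~")) / "enz-pred" / "data" / "processed"


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Export M2OR splits (sketch).")
    parser.add_argument(
        "--out-dir",
        type=Path,
        default=DEFAULT_OUT_DIR,
        help="directory that receives the CSV and the split-index pickle",
    )
    return parser.parse_args(argv)


def output_paths(out_dir: Path) -> Tuple[Path, Path]:
    """Derive both output files from a single directory."""
    return out_dir / "m2or_binary.csv", out_dir / "m2or_split_indices.p"


args = parse_args(["--out-dir", "processed"])
out_csv, out_pickle = output_paths(args.out_dir)
```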
Comment on lines +73 to +77

    def main():
        blast_file = '/home/seyonec/olfaction/receptor_binding/blast_results.txt'
        seqs_file = '/home/seyonec/olfaction/data/datasets/M2OR/seqs.csv'
        output_file = '/home/seyonec/olfaction/data/datasets/M2OR/seqs_with_annotations.csv'

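The three paths above are absolute paths under one user's home directory, so the script would only run unchanged on that machine. Because the files all live inside the repository, they could instead be derived from the repository root. A minimal sketch, assuming a hypothetical `m2or_paths` helper (not part of the PR) and the directory layout visible in the excerpt:

```python
from pathlib import Path
from typing import Dict


def m2or_paths(repo_root: Path) -> Dict[str, Path]:
    """Build the BLAST and sequence file paths relative to the repo root."""
    m2or_dir = repo_root / "data" / "datasets" / "M2OR"
    return {
        "blast_file": repo_root / "receptor_binding" / "blast_results.txt",
        "seqs_file": m2or_dir / "seqs.csv",
        "output_file": m2or_dir / "seqs_with_annotations.csv",
    }


# Inside a script stored under scripts/, the root could be found with:
#     repo_root = Path(__file__).resolve().parents[1]
paths = m2or_paths(Path("olfaction"))
```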
Comment on lines 255 to +262

    elif args['OR_database'] == 'HORDE':
        if args['prev_model_loss'] == 'unweighted_loss':
            print("Loading logits from model trained on unweighed loss")
            full_OR_logits = torch.load('/home/seyonec/olfaction/data/datasets/olfactory_subgenome_OR_logits.pt')
            if args['num_OR_logits'] < 1237:
                full_OR_logits = full_OR_logits[:, :args['num_OR_logits']]
        else:
            print("Loading logits from model trained on weighed loss")
            full_OR_logits = torch.load('/home/seyonec/olfaction/data/datasets/weighted_loss_olfactory_subgenome_OR_logits.pt')
            if args['num_OR_logits'] < 1237:
                full_OR_logits = full_OR_logits[:, :args['num_OR_logits']]
        # Apply class selection first (if provided), then enforce num_OR_logits

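The trailing comment in the excerpt describes an ordering: select OR columns by index first, then truncate to `num_OR_logits`. The sketch below illustrates that ordering on plain nested lists standing in for the logit tensor. The function name, argument names, and toy values are hypothetical and do not come from the PR:

```python
from typing import List, Optional, Sequence


def select_or_columns(
    logits: Sequence[Sequence[float]],
    selected_indices: Optional[Sequence[int]] = None,
    num_or_logits: Optional[int] = None,
) -> List[List[float]]:
    """Keep the chosen OR columns, then cap the number of columns."""
    rows = [list(row) for row in logits]
    if selected_indices is not None:
        # Step 1: class selection (e.g. functional or pseudogene indices).
        rows = [[row[j] for j in selected_indices] for row in rows]
    if num_or_logits is not None:
        # Step 2: enforce the column budget on what remains.
        rows = [row[:num_or_logits] for row in rows]
    return rows


toy_logits = [[0.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0]]
filtered = select_or_columns(toy_logits, selected_indices=[3, 1, 0], num_or_logits=2)
```

The order matters: truncating first would discard columns before the index filter could select them.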
Summary
- `.gitignore` for training output dirs, ESM/DGL caches, large OR-logit tensors; ships small load-bearing files (HORDE filter indices, M2OR sequence annotations, naive_broad_ORs FASTA).

Test plan
- `python classification_ESM.py --model MolOR -d M2OR_Pairs -f canonical -esm 650m -cross_att -w -s random` trains end-to-end on M2OR (regenerates ESM-2 cache on first run).
- `bash scripts/run_OR_percept_ablations_HORDE.sh` reproduces GS-LF percept ablation given upstream MolOR checkpoint + HORDE OR-logit tensor (forthcoming Zenodo bundle).

🤖 Generated with Claude Code