v2 preprint: MolOR updates, percept ablations, baselines by seyonechithrananda · Pull Request #6 · microsoft/olfaction

seyonechithrananda · 2026-05-06T06:26:12Z

Summary

v2 preprint update: cross-attention MolOR over per-residue ESM-2 (650M / 3B) embeddings, with optional MPNN molecular encoder alongside the existing GCN.
Percept ablations sweeping OR-feature counts across HORDE (5–845) and M2OR (5–1237) sets, weighted/unweighted upstream MolOR variants, plus a pseudogene-only control showing percept gains come from functional ORs.
Reviewer revisions: Benjamini–Hochberg FDR correction and Jonckheere–Terpstra trend test for Table S1; Goldman et al. (FFN+ESM) baseline reproduction; PerceiverCPI/Goldman M2OR splits added.
Repo hygiene: .gitignore for training output dirs, ESM/DGL caches, large OR-logit tensors; ships small load-bearing files (HORDE filter indices, M2OR sequence annotations, naive_broad_ORs FASTA).
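
For readers unfamiliar with the Benjamini–Hochberg correction mentioned in the reviewer revisions, the following is a minimal sketch of the standard step-up procedure. It is not taken from this PR's notebooks. The function name and the example p-values are illustrative assumptions, not values from Table S1.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values only (hypothetical, not from the preprint).
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
```

A comparison is then called significant at FDR level alpha when its adjusted value is at most alpha.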

Test plan

`python classification_ESM.py --model MolOR -d M2OR_Pairs -f canonical -esm 650m -cross_att -w -s random` trains end-to-end on M2OR (regenerates the ESM-2 cache on first run).
`bash scripts/run_OR_percept_ablations_HORDE.sh` reproduces the GS-LF percept ablation given an upstream MolOR checkpoint and the HORDE OR-logit tensor (forthcoming Zenodo bundle).
Pseudogene control: `classification_OR_feat_ESM.py --OR_database HORDE --OR_gene_class pseudogene` matches the no-OR baseline AUROC (~87.04) per Table 2.
Per-seed `*eval.txt` files in `weighted_loss_HORDE_2/gs_lf{0,5,…,845}_OR_logits_layernorm_roc_auc_score/` reproduce Table S1 numbers.

🤖 Generated with Claude Code

…odel_backbone

- Add .gitignore for training output dirs, ESM/DGL caches, plot outputs, deprecated/, CLAUDE.md, and the local-only fig4 stat-tests notebook
- Stop tracking notebooks/fig4_stat_tests.ipynb (kept locally, gitignored)
- Add MolOR MPNN-encoder config (data/configures/M2OR_Pairs/MolOR_MPNN_canonical.json)
- Add PerceiverCPI/Goldman M2OR train/val/test splits and naive_broad_ORs_30%.fasta
- Add notebooks/test_OR_logits_shuffle.ipynb (shuffle/null control)
- Update analyze_blast_results.ipynb
- README updated for v2: MPNN encoder option, ablation scripts under scripts/, Goldman/PerceiverCPI baselines, Zenodo placeholder

Co-Authored-By: Seyone <seyonec@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- data/datasets/HORDE/{functional,pseudogene}_indices.json: load-bearing for the pseudogene-control variant of the GS-LF percept model (classification_OR_feat_ESM.py:112-113)
- data/datasets/M2OR/{seqs,seqs_with_annotations}.csv: OR sequences and family annotations used by utils.py for OR-subfamily analyses

Co-Authored-By: Seyone <seyonec@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the codebase to support the v2 preprint experiments: adding a MolOR variant with an MPNN molecular encoder, enabling OR subfamily holdout splits, expanding percept ablation tooling (including pseudogene/functional HORDE controls), and improving repo hygiene around large artifacts.

Changes:

Add MolOR_MPNN support (MPNN encoder path in MolORPredictor, model loading, and inference wiring).
Add an or_subfamily_holdout dataset split and loss-weighting normalization option for M2OR pairs.
Add/revise scripts and small dataset artifacts to reproduce baseline splits/ablations and manage large outputs via .gitignore.

Reviewed changes

Copilot reviewed 20 out of 29 changed files in this pull request and generated 8 comments.


| File | Description |
| --- | --- |
| `utils.py` | Adds MolOR_MPNN to featurizer handling, introduces OR subfamily holdout split, and updates OR-feature prediction path to pass edge features. |
| `classification_ESM.py` | Adds MolOR_MPNN / OR-subfamily holdout CLI options and new weighted-loss normalization flag; updates seeding behavior. |
| `gcn_or_predictor.py` | Adds MPNNGNN-based encoder option to `MolORPredictor` and threads optional `edge_feats` through `forward()`. |
| `data/m2or.py` | Adds `normalize_loss_by_class_imbalance` option; refactors sample-weight computation and path joining. |
| `classification_OR_feat_ESM.py` | Adds HORDE pseudogene/functional index selection and applies index filtering when loading logits. |
| `scripts/run_OR_percept_ablations_HORDE.sh` | Updates ablation runner to call `classification_OR_feat_ESM.py`. |
| `scripts/prepare_enzpred_data.py` | New utility to export PerceiverCPI M2OR splits into enz-pred CSV + PresetSplitter indices. |
| `scripts/merge_blast_annotations.py` | New script to merge BLAST-derived UniProt/gene IDs into sequence CSVs. |
| `scripts/get_gene_uniprot_IDs_blast.py` | New script to BLAST sequences and annotate UniProt/gene IDs. |
| `scripts/get_HORDE_metadata.ipynb` | New notebook to derive HORDE pseudogene/functional indices. |
| `receptor_binding/analyze_blast_results.ipynb` | Updates notebook paths/kernel metadata for BLAST result analysis. |
| `notebooks/OR_subfamily_analysis.ipynb` | New notebook analyzing OR subfamily distributions and weights. |
| `notebooks/cross_task_stats.ipynb` | Minor notebook edit (adds an empty code cell). |
| `README.md` | Updates repo overview and adds more structured usage/environment notes for v2. |
| `.gitignore` | Adds ignores for large artifacts, caches, and training output directories. |
| `data/datasets/HORDE/pseudogene_indices.json` | Adds precomputed pseudogene indices for HORDE. |
| `data/datasets/HORDE/functional_indices.json` | Adds precomputed functional indices for HORDE. |
| `data/datasets/naive_broad_ORs_30%.fasta` | Adds a small FASTA artifact referenced as “load-bearing” data. |
| `data/configures/M2OR_Pairs/MolOR_MPNN_canonical.json` | Adds config for MolOR_MPNN on M2OR_Pairs. |
| `data/configures/GS_LF/MPNN_canonical.json` | Updates MPNN config hyperparameters for GS_LF. |


    else:
        device = torch.device('cpu')
        torch.cuda.set_device(device)
        args['device'] = device


    torch.manual_seed(seed)
    np.random.seed(seed)
+    torch.cuda.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
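
A minimal sketch of how the seeding and device selection shown in this hunk could be wrapped into helpers. The helper names `seed_everything` and `select_device` are hypothetical and do not appear in the PR. The sketch only touches CUDA state when CUDA is actually available, since `torch.cuda.set_device` expects a CUDA device rather than a CPU one. NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed.

```python
import random


def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch RNGs (the latter two if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Seeds every visible GPU, not just the current one.
        torch.cuda.manual_seed_all(seed)
    # Trade some speed for reproducible cuDNN kernel selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def select_device(gpu_index=0):
    """Return a device string; only call torch.cuda.set_device on CUDA."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        torch.cuda.set_device(gpu_index)
        return f"cuda:{gpu_index}"
    return "cpu"


seed_everything(0)
first_draw = random.random()
seed_everything(0)
second_draw = random.random()
device_name = select_device()
```

Calling `seed_everything` with the same seed before each run should make the Python-level draws repeat. Full GPU determinism can still depend on the operators used.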


@@ -626,10 +630,9 @@ def forward(self, bg, feats, add_feats = None, seq_mask = None, node_mask = None
        """
        #print(bg)
        #print(feats)


+        print('normalizing loss using log1p of class imbalance alongside canonical weighting scheme')
+        df['sample_weight'] = np.log1p(df['weight_class']) * df['weight_pair_imbalance'] * df['weight_quality']
+    else:
+        print('normalizing loss using canonical weighting scheme')
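
To make the effect of the `log1p` normalization concrete, here is a small scalar sketch of the weight formula from this hunk. The hunk does not show the body of the `else` branch, so the sketch assumes the canonical scheme is the plain product of the three weights. The function name and the numeric inputs are hypothetical.

```python
import math


def sample_weight(weight_class, weight_pair_imbalance, weight_quality,
                  normalize_by_class_imbalance=False):
    """Per-sample loss weight as a product of three factors.

    With normalization on, the class weight is compressed with log1p,
    mirroring the added line in the diff. The non-normalized branch is
    an assumption (plain product), since the diff does not show it.
    """
    if normalize_by_class_imbalance:
        class_term = math.log1p(weight_class)
    else:
        class_term = weight_class
    return class_term * weight_pair_imbalance * weight_quality


# Hypothetical weights for one sample from a rare class.
canonical = sample_weight(9.0, 2.0, 0.5)
normalized = sample_weight(9.0, 2.0, 0.5, normalize_by_class_imbalance=True)
```

Because `log1p` grows slowly, very large class weights are damped, which limits how much a handful of rare-class samples can dominate the loss.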


+        print(f'Using OR subfamily holdout with subfamily: {args["or_subfamily_holdout"]}')
+        from dgl.data.utils import Subset
+        import pandas as pd
+        import numpy as np
+
+        # Validate input
+        target_subfamily = args['or_subfamily_holdout']
+        if target_subfamily is None:
+            raise ValueError("--or_subfamily_holdout must be specified when using 'or_subfamily_holdout' split")
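
The remainder of the split logic is not shown in this excerpt. As a rough sketch of the idea, a subfamily holdout places every pair whose receptor belongs to the target subfamily in the test set and trains on the rest. The function below is a hypothetical illustration using plain index lists, and the subfamily labels are made up. In the PR's code the resulting indices would presumably be wrapped with the `Subset` class imported above.

```python
def subfamily_holdout_split(subfamilies, target_subfamily):
    """Split sample indices so one OR subfamily is held out for testing.

    subfamilies: one subfamily label per sample, in dataset order.
    Returns (train_indices, test_indices).
    """
    if target_subfamily is None:
        raise ValueError("a target subfamily must be specified for this split")
    test_idx = [i for i, s in enumerate(subfamilies) if s == target_subfamily]
    if not test_idx:
        raise ValueError(f"no samples found for subfamily {target_subfamily!r}")
    train_idx = [i for i, s in enumerate(subfamilies) if s != target_subfamily]
    return train_idx, test_idx


# Hypothetical labels, for illustration only.
labels = ["OR51", "OR2", "OR51", "OR7", "OR2"]
train_idx, test_idx = subfamily_holdout_split(labels, "OR51")
```

Checking that the held-out subfamily actually occurs in the data avoids silently training with an empty test set.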


+DATA_DIR = os.path.join(os.path.dirname(__file__), "..", "data", "datasets")
+OUT_CSV = os.path.join(os.path.expanduser("~"), "enz-pred", "data", "processed", "m2or_binary.csv")
+OUT_PICKLE = os.path.join(os.path.expanduser("~"), "enz-pred", "data", "processed", "m2or_split_indices.p")
+


+def main():
+    blast_file = '/home/seyonec/olfaction/receptor_binding/blast_results.txt'
+    seqs_file = '/home/seyonec/olfaction/data/datasets/M2OR/seqs.csv'
+    output_file = '/home/seyonec/olfaction/data/datasets/M2OR/seqs_with_annotations.csv'
+
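
The absolute `/home/seyonec/...` paths above only resolve on one machine. One possible way to make them portable, sketched here under the assumption that the three files keep their locations relative to the repository root, is to expose them as command-line options with repo-relative defaults. The flag names and the `build_parser` helper are hypothetical and not part of the PR.

```python
import argparse
from pathlib import Path


def build_parser(repo_root):
    """Build a parser whose default paths are relative to repo_root."""
    repo_root = Path(repo_root)
    m2or_dir = repo_root / "data" / "datasets" / "M2OR"
    parser = argparse.ArgumentParser(
        description="Merge BLAST annotations into the M2OR sequence CSV.")
    parser.add_argument("--blast_file", type=Path,
                        default=repo_root / "receptor_binding" / "blast_results.txt")
    parser.add_argument("--seqs_file", type=Path,
                        default=m2or_dir / "seqs.csv")
    parser.add_argument("--output_file", type=Path,
                        default=m2or_dir / "seqs_with_annotations.csv")
    return parser


# In a script under scripts/, repo_root could be derived from __file__,
# e.g. Path(__file__).resolve().parent.parent; a literal is used here.
default_args = build_parser("olfaction").parse_args([])
custom_args = build_parser("olfaction").parse_args(
    ["--blast_file", "results/other_blast.txt"])
```

With defaults in place, `main()` can read `args.blast_file`, `args.seqs_file`, and `args.output_file` instead of hard-coded strings.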


    elif args['OR_database'] == 'HORDE':
        if args['prev_model_loss'] == 'unweighted_loss':
            print("Loading logits from model trained on unweighed loss")
            full_OR_logits = torch.load('/home/seyonec/olfaction/data/datasets/olfactory_subgenome_OR_logits.pt')
-            if args['num_OR_logits'] < 1237:
-                full_OR_logits = full_OR_logits[:, :args['num_OR_logits']]
        else:
            print("Loading logits from model trained on weighed loss")
            full_OR_logits = torch.load('/home/seyonec/olfaction/data/datasets/weighted_loss_olfactory_subgenome_OR_logits.pt')
-            if args['num_OR_logits'] < 1237:
-                full_OR_logits = full_OR_logits[:, :args['num_OR_logits']]
+        # Apply class selection first (if provided), then enforce num_OR_logits
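
The ordering described in the added comment (class selection first, then the `num_OR_logits` cap) can be illustrated with a small pure-Python sketch on nested lists. The function name and the toy values are hypothetical. With a 2-D tensor the same two steps would typically be column indexing followed by a slice, which is an assumption about the PR's implementation since the following lines are not shown here.

```python
def select_or_logits(rows, class_indices=None, num_or_logits=None):
    """Filter OR-logit columns by class, then keep the first N of those.

    rows: list of per-molecule logit lists (one column per OR).
    class_indices: column indices for the chosen gene class, or None.
    num_or_logits: cap on the number of retained columns, or None.
    """
    if class_indices is not None:
        # Step 1: keep only columns belonging to the selected class.
        rows = [[row[j] for j in class_indices] for row in rows]
    if num_or_logits is not None:
        # Step 2: truncate what remains to the requested feature count.
        rows = [row[:num_or_logits] for row in rows]
    return rows


# Toy logits: 2 molecules x 4 ORs; columns 0, 2, 3 form the chosen class.
toy_logits = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8]]
selected = select_or_logits(toy_logits, class_indices=[0, 2, 3], num_or_logits=2)
```

Applying the cap after the class filter means the requested count refers to ORs within the selected class rather than to positions in the full tensor.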


seyonechithrananda and others added 6 commits August 11, 2025 00:20

cleanup, baseline + functional/pseudogene test

8ce9d02

Merge branch 'main' of github.com:seyonechithrananda/olfaction into m…

5345924

…odel_backbone

statistical tests

2a72ee0

v2 preprint scripts, notebooks and misc changes

a4acae9

seyonechithrananda requested review from Copilot and yangkky May 6, 2026 06:27

Copilot started reviewing on behalf of seyonechithrananda May 6, 2026 06:28

Copilot AI reviewed May 6, 2026


jdthamores merged commit 840e57b into microsoft:main May 7, 2026
5 checks passed

jdthamores merged 6 commits into microsoft:main from seyonechithrananda:model_backbone