v2 preprint: MolOR updates, percept ablations, baselines #6

Merged
jdthamores merged 6 commits into microsoft:main from seyonechithrananda:model_backbone
May 7, 2026
Conversation

@seyonechithrananda
Collaborator

Summary

  • v2 preprint update: cross-attention MolOR over per-residue ESM-2 (650M / 3B) embeddings, with optional MPNN molecular encoder alongside the existing GCN.
  • Percept ablations sweeping OR-feature counts across HORDE (5–845) and M2OR (5–1237) sets, weighted/unweighted upstream MolOR variants, plus a pseudogene-only control showing percept gains come from functional ORs.
  • Reviewer revisions: Benjamini–Hochberg FDR correction and Jonckheere–Terpstra trend test for Table S1; Goldman et al. (FFN+ESM) baseline reproduction; PerceiverCPI/Goldman M2OR splits added.
  • Repo hygiene: .gitignore for training output dirs, ESM/DGL caches, large OR-logit tensors; ships small load-bearing files (HORDE filter indices, M2OR sequence annotations, naive_broad_ORs FASTA).
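
For readers unfamiliar with the Benjamini–Hochberg step mentioned under reviewer revisions, the adjustment can be sketched in plain Python as follows. This is a generic illustration of the procedure, not the code used for Table S1; the function name and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in q_values]
```

A hypothesis is then called significant at FDR level alpha when its adjusted value is at most alpha.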

Test plan

  • `python classification_ESM.py --model MolOR -d M2OR_Pairs -f canonical -esm 650m -cross_att -w -s random` trains end-to-end on M2OR (regenerates ESM-2 cache on first run).
  • `bash scripts/run_OR_percept_ablations_HORDE.sh` reproduces the GS-LF percept ablation given an upstream MolOR checkpoint and the HORDE OR-logit tensor (forthcoming Zenodo bundle).
  • Pseudogene control: `classification_OR_feat_ESM.py --OR_database HORDE --OR_gene_class pseudogene` matches the no-OR baseline AUROC (~87.04) per Table 2.
  • Per-seed `*eval.txt` files in `weighted_loss_HORDE_2/gs_lf{0,5,…,845}_OR_logits_layernorm_roc_auc_score/` reproduce Table S1 numbers.

🤖 Generated with Claude Code

seyonechithrananda and others added 6 commits August 11, 2025 00:20
- Add .gitignore for training output dirs, ESM/DGL caches, plot outputs,
  deprecated/, CLAUDE.md, and the local-only fig4 stat-tests notebook
- Stop tracking notebooks/fig4_stat_tests.ipynb (kept locally, gitignored)
- Add MolOR MPNN-encoder config (data/configures/M2OR_Pairs/MolOR_MPNN_canonical.json)
- Add PerceiverCPI/Goldman M2OR train/val/test splits and naive_broad_ORs_30%.fasta
- Add notebooks/test_OR_logits_shuffle.ipynb (shuffle/null control)
- Update analyze_blast_results.ipynb
- README updated for v2: MPNN encoder option, ablation scripts under scripts/,
  Goldman/PerceiverCPI baselines, Zenodo placeholder

Co-Authored-By: Seyone <seyonec@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- data/datasets/HORDE/{functional,pseudogene}_indices.json: load-bearing for
  the pseudogene-control variant of the GS-LF percept model
  (classification_OR_feat_ESM.py:112-113)
- data/datasets/M2OR/{seqs,seqs_with_annotations}.csv: OR sequences and
  family annotations used by utils.py for OR-subfamily analyses

Co-Authored-By: Seyone <seyonec@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates the codebase to support the v2 preprint experiments: adding a MolOR variant with an MPNN molecular encoder, enabling OR subfamily holdout splits, expanding percept ablation tooling (including pseudogene/functional HORDE controls), and improving repo hygiene around large artifacts.

Changes:

  • Add MolOR_MPNN support (MPNN encoder path in MolORPredictor, model loading, and inference wiring).
  • Add an or_subfamily_holdout dataset split and loss-weighting normalization option for M2OR pairs.
  • Add/revise scripts and small dataset artifacts to reproduce baseline splits/ablations and manage large outputs via .gitignore.

Reviewed changes

Copilot reviewed 20 out of 29 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| utils.py | Adds MolOR_MPNN to featurizer handling, introduces OR subfamily holdout split, and updates OR-feature prediction path to pass edge features. |
| classification_ESM.py | Adds MolOR_MPNN / OR-subfamily holdout CLI options and new weighted-loss normalization flag; updates seeding behavior. |
| gcn_or_predictor.py | Adds MPNNGNN-based encoder option to MolORPredictor and threads optional edge_feats through forward(). |
| data/m2or.py | Adds normalize_loss_by_class_imbalance option; refactors sample-weight computation and path joining. |
| classification_OR_feat_ESM.py | Adds HORDE pseudogene/functional index selection and applies index filtering when loading logits. |
| scripts/run_OR_percept_ablations_HORDE.sh | Updates ablation runner to call classification_OR_feat_ESM.py. |
| scripts/prepare_enzpred_data.py | New utility to export PerceiverCPI M2OR splits into enz-pred CSV + PresetSplitter indices. |
| scripts/merge_blast_annotations.py | New script to merge BLAST-derived UniProt/gene IDs into sequence CSVs. |
| scripts/get_gene_uniprot_IDs_blast.py | New script to BLAST sequences and annotate UniProt/gene IDs. |
| scripts/get_HORDE_metadata.ipynb | New notebook to derive HORDE pseudogene/functional indices. |
| receptor_binding/analyze_blast_results.ipynb | Updates notebook paths/kernel metadata for BLAST result analysis. |
| notebooks/OR_subfamily_analysis.ipynb | New notebook analyzing OR subfamily distributions and weights. |
| notebooks/cross_task_stats.ipynb | Minor notebook edit (adds an empty code cell). |
| README.md | Updates repo overview and adds more structured usage/environment notes for v2. |
| .gitignore | Adds ignores for large artifacts, caches, and training output directories. |
| data/datasets/HORDE/pseudogene_indices.json | Adds precomputed pseudogene indices for HORDE. |
| data/datasets/HORDE/functional_indices.json | Adds precomputed functional indices for HORDE. |
| data/datasets/naive_broad_ORs_30%.fasta | Adds a small FASTA artifact referenced as “load-bearing” data. |
| data/configures/M2OR_Pairs/MolOR_MPNN_canonical.json | Adds config for MolOR_MPNN on M2OR_Pairs. |
| data/configures/GS_LF/MPNN_canonical.json | Updates MPNN config hyperparameters for GS_LF. |

Comment thread classification_ESM.py
Comment on lines 285 to 288
else:
device = torch.device('cpu')
torch.cuda.set_device(device)
args['device'] = device
Comment thread classification_ESM.py
Comment on lines 291 to +296
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
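
The seeding lines quoted above could be collected into one helper so that every entry point seeds the same set of generators. The sketch below is an assumption about how that might look, not code from this PR: the name `seed_everything` is hypothetical, and NumPy and PyTorch are treated as optional imports so the helper also runs where they are not installed.

```python
import random


def seed_everything(seed):
    """Seed every available RNG; return the names of the libraries seeded."""
    seeded = ["random"]
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        # Only touch the CUDA generators when a GPU is actually present.
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Trade some speed for repeatable cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        seeded.append("torch")
    except ImportError:
        pass
    return seeded
```

Calling `seed_everything(seed)` once per run, before building datasets or models, keeps per-seed results comparable across scripts.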
Comment thread gcn_or_predictor.py
@@ -626,10 +630,9 @@ def forward(self, bg, feats, add_feats = None, seq_mask = None, node_mask = None
"""
#print(bg)
#print(feats)
Comment thread data/m2or.py
Comment on lines +539 to +542
    print('normalizing loss using log1p ofclass imbalance alongside canonical weighing scheme')
    df['sample_weight'] = np.log1p(df['weight_class']) * df['weight_pair_imbalance'] * df['weight_quality']
else:
    print('normalizing loss using cannonical weighing scheme')
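
The log1p branch quoted above damps the class-imbalance weight before multiplying it with the pair-imbalance and quality weights. A minimal scalar sketch of that one branch, assuming plain floats instead of DataFrame columns (the function name is hypothetical, and the canonical branch is not shown because it does not appear in the excerpt):

```python
import math


def log1p_sample_weight(weight_class, weight_pair_imbalance, weight_quality):
    """Scalar version of the log1p-normalized sample weight from the excerpt."""
    # log1p(x) = log(1 + x): equals 0 at x = 0 and grows slowly for large x,
    # so very rare classes are up-weighted less aggressively.
    return math.log1p(weight_class) * weight_pair_imbalance * weight_quality


# Hypothetical values: a class weight of 99 shrinks to about 4.6 after log1p.
damped = log1p_sample_weight(99.0, 1.0, 1.0)
```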
Comment thread utils.py
Comment on lines +105 to +113
print(f'Using OR subfamily holdout with subfamily: {args["or_subfamily_holdout"]}')
from dgl.data.utils import Subset
import pandas as pd
import numpy as np

# Validate input
target_subfamily = args['or_subfamily_holdout']
if target_subfamily is None:
    raise ValueError("--or_subfamily_holdout must be specified when using 'or_subfamily_holdout' split")
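
The idea behind a subfamily holdout can be sketched independently of DGL: every pair whose receptor belongs to the target subfamily goes to the test set, and everything else stays in training. The version below works on plain index lists and is an illustration under that assumption; the function name and the example labels are hypothetical, and the PR itself builds `Subset` objects.

```python
def subfamily_holdout_split(subfamilies, target_subfamily):
    """Split sample indices so one OR subfamily is held out for testing."""
    if target_subfamily is None:
        raise ValueError("a target subfamily must be specified for this split")
    test_idx = [i for i, s in enumerate(subfamilies) if s == target_subfamily]
    if not test_idx:
        raise ValueError(f"no samples found for subfamily {target_subfamily!r}")
    train_idx = [i for i, s in enumerate(subfamilies) if s != target_subfamily]
    return train_idx, test_idx


# Hypothetical per-sample subfamily labels.
train_idx, test_idx = subfamily_holdout_split(["OR1", "OR2", "OR1", "OR5"], "OR1")
```

Failing early when the subfamily matches no samples avoids silently evaluating on an empty test set.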
Comment on lines +13 to +16
DATA_DIR = os.path.join(os.path.dirname(__file__), "..", "data", "datasets")
OUT_CSV = os.path.join(os.path.expanduser("~"), "enz-pred", "data", "processed", "m2or_binary.csv")
OUT_PICKLE = os.path.join(os.path.expanduser("~"), "enz-pred", "data", "processed", "m2or_split_indices.p")

Comment on lines +73 to +77
def main():
    blast_file = '/home/seyonec/olfaction/receptor_binding/blast_results.txt'
    seqs_file = '/home/seyonec/olfaction/data/datasets/M2OR/seqs.csv'
    output_file = '/home/seyonec/olfaction/data/datasets/M2OR/seqs_with_annotations.csv'
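
The paths in the snippet above are absolute and tied to one user's home directory. One possible way to make such a script portable is to default to locations relative to the repository root and allow overrides from the command line. The sketch below assumes that layout; the flag names and the `build_parser` helper are hypothetical and not part of this PR.

```python
import argparse
from pathlib import Path


def build_parser(repo_root):
    """Build a CLI whose default paths are relative to the repository root."""
    root = Path(repo_root)
    parser = argparse.ArgumentParser(description="Merge BLAST annotations into sequence CSVs")
    parser.add_argument("--blast-file", type=Path,
                        default=root / "receptor_binding" / "blast_results.txt")
    parser.add_argument("--seqs-file", type=Path,
                        default=root / "data" / "datasets" / "M2OR" / "seqs.csv")
    parser.add_argument("--output-file", type=Path,
                        default=root / "data" / "datasets" / "M2OR" / "seqs_with_annotations.csv")
    return parser


# With no arguments the defaults resolve under the given root.
args = build_parser("/tmp/repo").parse_args([])
```

Inside a script, the root could be derived from the script's own location, for example `Path(__file__).resolve().parent.parent`.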

Comment on lines 255 to +262
elif args['OR_database'] == 'HORDE':
    if args['prev_model_loss'] == 'unweighted_loss':
        print("Loading logits from model trained on unweighed loss")
        full_OR_logits = torch.load('/home/seyonec/olfaction/data/datasets/olfactory_subgenome_OR_logits.pt')
        if args['num_OR_logits'] < 1237:
            full_OR_logits = full_OR_logits[:, :args['num_OR_logits']]
    else:
        print("Loading logits from model trained on weighed loss")
        full_OR_logits = torch.load('/home/seyonec/olfaction/data/datasets/weighted_loss_olfactory_subgenome_OR_logits.pt')
        if args['num_OR_logits'] < 1237:
            full_OR_logits = full_OR_logits[:, :args['num_OR_logits']]
# Apply class selection first (if provided), then enforce num_OR_logits
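
The order of operations named in the last comment, class selection first and truncation second, can be illustrated without PyTorch. The sketch below uses nested lists in place of the logit tensor and is an assumption about the intended behavior, not the PR's implementation; `select_or_logits` and its arguments are hypothetical names.

```python
def select_or_logits(logits, num_or_logits, class_indices=None):
    """Pick OR columns by class, then keep at most num_or_logits of them.

    logits is a list of rows (one per molecule); each row holds one value per OR.
    """
    # Step 1: restrict to the requested class (e.g. functional or pseudogene ORs).
    if class_indices is not None:
        logits = [[row[j] for j in class_indices] for row in logits]
    # Step 2: truncate, comparing against the actual width instead of a constant.
    n_cols = len(logits[0]) if logits else 0
    if num_or_logits < n_cols:
        logits = [row[:num_or_logits] for row in logits]
    return logits


# Two molecules, four ORs; keep ORs 1, 3 and 0, then the first two of those.
subset = select_or_logits([[1, 2, 3, 4], [5, 6, 7, 8]], 2, class_indices=[1, 3, 0])
```

Comparing `num_or_logits` against the width of the loaded data means the same function works for both the HORDE and M2OR feature sets, whose sizes differ.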
@jdthamores jdthamores merged commit 840e57b into microsoft:main May 7, 2026
5 checks passed
3 participants