This is the official code repository for the paper "Protein Circuit Tracing via Cross-layer Transcoders", by Darin Tsui, Kunal Talreja, Daniel Saeedi, and Amirali Aghazadeh, accepted into ICML 2026. A link to the paper can be found here.
Additionally, one can explore protein circuits through our web-based visualizer!
The easiest way to get started with ProtoMech is through our interactive Google Colab notebook. No local installation is required.
1. Models: ProtoMech currently supports running on ESM2-8M and ESM2-35M!
2. Circuit Discovery (optional): Train a probe on your custom dataset (binary classification or regression) to identify circuits.
3. Interactive Visualization: Generate the files required for our website and visualize circuits!
If you skip step 2, you can obtain circuit files in two ways:
- Use Our Pre-discovered Library: If you want to explore circuits from our paper, we provide a curated list of circuits here, which you can access through our notebook.
- Auto-generate Your Own: Even without a custom dataset, you can still generate a circuit! Just leave the `circuit` option blank.
Create the conda environment by running:
conda env create -f clt.yml
conda activate clt

ProtoMech/
├── training/ # CLT training code
├── training_block/ # Windowed CLT training code
├── training_transcoder/ # PLT training code
├── circuit_utils/ # Core circuit discovery utilities
├── family_circuit/ # Protein family-based circuit discovery
├── function_circuit/ # DMS function-based circuit discovery
├── steering/ # Probe and DMS steering experiments
├── esm_steering/ # CAA (Contrastive Activation Addition) steering
├── visualization/ # Circuit analysis and PyMOL visualization
├── data/ # Training data generation
└── plots_and_tables/ # Plots and tables
Location: training/clt_model.py
Replaces ESM2 MLP blocks with a sparse transcoder using information from all preceding layers:
- Top-K Activation: Only top-k latents are active (enforces sparsity)
- Cross-Layer Decoding: Layer l reconstructs using latents from layers 0 to l
- AuxK Loss: Encourages rarely-used latents to activate
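To make the first two bullets concrete, here is a minimal pure-Python sketch of top-k sparsification followed by cross-layer decoding. It is not the code in training/clt_model.py: the function names, the `decoders[(s, t)]` layout, and the toy numbers are all assumptions made for illustration, and the AuxK loss is left out.

```python
def top_k(pre_acts, k):
    """Keep the k largest pre-activations and set every other entry to zero."""
    keep = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k]
    sparse = [0.0] * len(pre_acts)
    for i in keep:
        sparse[i] = pre_acts[i]
    return sparse


def cross_layer_decode(latents, decoders, target):
    """Reconstruct layer `target` by summing contributions from layers 0..target.

    latents[s]       : sparse latent vector of source layer s
    decoders[(s, t)] : hypothetical n_latents x d_model matrix from layer s to layer t
    """
    d_model = len(decoders[(target, target)][0])
    recon = [0.0] * d_model
    for s in range(target + 1):
        weights = decoders[(s, target)]
        for i, act in enumerate(latents[s]):
            if act != 0.0:  # only active latents contribute
                for j in range(d_model):
                    recon[j] += act * weights[i][j]
    return recon


# Toy example: 2 layers, 3 latents per layer, d_model = 2, k = 1.
latents = [top_k([0.2, 1.5, -0.3], k=1), top_k([0.9, 0.1, 2.0], k=1)]
decoders = {
    (0, 0): [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    (0, 1): [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
    (1, 1): [[0.0, 0.0], [0.0, 0.0], [1.0, -1.0]],
}
recon_layer1 = cross_layer_decode(latents, decoders, target=1)
print(recon_layer1)  # [2.0, 1.0]
```

The reconstruction at layer 1 mixes one active latent from layer 0 with one from layer 1, which is what distinguishes a cross-layer transcoder from a per-layer one.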
Location: training_block/clt_model.py
Variant of CLTs that restricts cross-layer connectivity to localized windows. It trades off capturing cross-layer dependencies against compute time.
- Block size: Sets the window size for cross-layer connectivity.
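As a rough illustration of what a window does to connectivity, the sketch below lists which source layers feed a given target layer. It assumes a trailing window of `block_size` layers ending at the target; that definition and the helper name `source_layers` are assumptions, and the code in training_block/ may define its windows differently (for example, as non-overlapping blocks).

```python
def source_layers(target, block_size=None):
    """Return the layers whose latents feed the reconstruction at `target`.

    block_size=None reproduces the full CLT (layers 0..target).
    Otherwise we ASSUME a trailing window of the last `block_size` layers
    ending at `target`; the repository may define windows differently.
    """
    if block_size is None:
        start = 0
    else:
        start = max(0, target - block_size + 1)
    return list(range(start, target + 1))


print(source_layers(5))                # full CLT: [0, 1, 2, 3, 4, 5]
print(source_layers(5, block_size=3))  # windowed: [3, 4, 5]
print(source_layers(5, block_size=1))  # only its own layer: [5]
```

Under this assumption, a window of size 1 reduces to per-layer behaviour, while a window at least as large as the model depth recovers the full CLT.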
Location: training_transcoder/plt_model.py
Baseline where each layer has independent encoder/decoder pairs. Layer l only uses its own latents.
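One way to see the cost difference between the CLT and this baseline is to count (source layer, target layer) decoder maps. The sketch below does that for the 6-layer ESM2-8M and the 12-layer ESM2-35M; the helper name is hypothetical, and the count follows only from the connectivity rules stated above, not from the repository's parameterization.

```python
def n_decoder_maps(n_layers, cross_layer):
    """Count (source, target) decoder maps.

    cross_layer=True : every layer l decodes from layers 0..l (CLT)
    cross_layer=False: every layer decodes only from itself (PLT)
    """
    if cross_layer:
        return n_layers * (n_layers + 1) // 2
    return n_layers


for name, n_layers in [("ESM2-8M", 6), ("ESM2-35M", 12)]:
    clt = n_decoder_maps(n_layers, cross_layer=True)
    plt = n_decoder_maps(n_layers, cross_layer=False)
    print(f"{name}: CLT maps = {clt}, PLT maps = {plt}")
```

The quadratic growth on the CLT side is the motivation for restricting connectivity to windows.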
# Train CLT
cd training && sh main.sh
# Train Windowed CLT
cd training_block && sh main.sh
# Train PLT
cd training_transcoder && sh main_plt.sh

If you would like to train your own model, download training_sequences_5m.a2m from https://huggingface.co/datasets/ktalreja/ProtoMechData and put it in the data folder.
Identifies minimal subsets of latents that recover a target property (family classification or DMS fitness).
| File | Description |
|---|---|
| `clt_circuit.py` | CLT circuit discovery |
| `plt_circuit.py` | PLT circuit discovery |
| `esm_activation.py` | ESM-2 activation extraction |
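For intuition about finding a small subset of latents that recovers a target property, here is a generic greedy forward-selection sketch. This is not necessarily the procedure implemented in circuit_utils/: the `score` callable, the node labels, and the threshold are hypothetical, and greedy selection does not guarantee a truly minimal subset.

```python
def greedy_circuit(candidates, score, threshold):
    """Add the latent that raises `score` the most until `threshold` is reached.

    candidates: iterable of latent identifiers
    score     : callable mapping a list of latents to recovered performance
    """
    circuit = []
    remaining = list(candidates)
    while remaining and score(circuit) < threshold:
        best = max(remaining, key=lambda c: score(circuit + [c]))
        circuit.append(best)
        remaining.remove(best)
    return circuit


# Toy additive score: each hypothetical latent recovers a fixed share.
contribution = {"L0/12": 0.5, "L1/7": 0.25, "L2/3": 0.0625, "L3/40": 0.125}


def toy_score(latent_ids):
    return sum(contribution[i] for i in latent_ids)


circuit = greedy_circuit(contribution, toy_score, threshold=0.75)
print(circuit)  # ['L0/12', 'L1/7']
```

In practice the score would come from a probe or fitness predictor evaluated with only the selected latents active, which is far more expensive than this additive toy.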
Discovers circuits distinguishing protein families (InterPro domains).
cd family_circuit
sh main.sh # Full run for all families
sh main.sh --target IPR000724 # Specific family

You can download our Swiss-Prot data used for our family circuits, swissprot_seqid30_75k_all_info_with_3di.parquet, from https://huggingface.co/datasets/ktalreja/ProtoMechData and put it in the data folder.
Discovers circuits using DMS fitness data.
cd function_circuit && sh main.sh

Modifies sequence generation by amplifying or ablating circuit nodes.
| File | Description |
|---|---|
| `full_replacement_models.py` | FullCLTReplacementModel, FullPLTReplacementModel |
| `local_replacement_models.py` | Local replacement model for CLT |
| `run_probe_steering.py` | Probe-based steering |
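Amplifying or ablating circuit nodes can be pictured as rescaling the activations of the selected latents before they are decoded. The sketch below is a simplified illustration, not the logic of the replacement models listed above; the `(layer, latent_index)` node keys and the `steer_latents` helper are assumptions.

```python
def steer_latents(latents, circuit_nodes, scale):
    """Rescale the activations of circuit nodes and leave all others untouched.

    latents      : dict mapping (layer, latent_index) -> activation
    circuit_nodes: set of (layer, latent_index) keys to intervene on
    scale        : 0.0 ablates the nodes, values above 1.0 amplify them
    """
    return {
        node: act * scale if node in circuit_nodes else act
        for node, act in latents.items()
    }


latents = {(0, 12): 1.0, (1, 7): 2.0, (2, 3): 0.5}
circuit_nodes = {(0, 12), (1, 7)}

ablated = steer_latents(latents, circuit_nodes, scale=0.0)
amplified = steer_latents(latents, circuit_nodes, scale=3.0)
print(ablated)    # {(0, 12): 0.0, (1, 7): 0.0, (2, 3): 0.5}
print(amplified)  # {(0, 12): 3.0, (1, 7): 6.0, (2, 3): 0.5}
```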
Contrastive Activation Addition steering using steering vectors from contrastive pairs.
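In its general form, CAA takes the difference between the mean activations of the positive and negative examples and adds a scaled copy of that vector at inference time. The sketch below shows that arithmetic on toy vectors; the function names and the `alpha` parameter are illustrative assumptions, and the scripts in esm_steering/ may differ in which layer and positions they steer.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def caa_steering_vector(positive, negative):
    """Mean activation of positive examples minus mean of negative examples."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def apply_steering(activation, vector, alpha=1.0):
    """Add the steering vector, scaled by alpha, to one activation vector."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy activations with hidden size 2.
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 1.0], [2.0, 1.0]]

vector = caa_steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5], vector, alpha=2.0)
print(vector)   # [1.0, 2.0]
print(steered)  # [2.5, 4.5]
```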
cd esm_steering && sh main_caa_steering.sh

Location: visualization/
| File | Description |
|---|---|
| `circuit_analysis.py` | Family-level circuit analysis |
| `circuit_analysis_function.py` | Function/DMS-level analysis |
| `generate_pymol_view.py` | PyMOL visualization scripts |
| `compute_activations.py` | Computes top-10 sequences per act |
If you want to use compute_activations.py instead of using the pre-saved top activation results found in top10_activations.pt (which can be found here), download swissprot_full.parquet from https://huggingface.co/datasets/ktalreja/ProtoMechData and put it in the data folder.
You can find the models at https://huggingface.co/ktalreja/ProtoMechModels and the data used in this paper at https://huggingface.co/datasets/ktalreja/ProtoMechData.
If you use ProtoMech and enjoy it, please consider citing our paper!
@inproceedings{tsui2026protomech,
title={Protein Circuit Tracing via Cross-layer Transcoders},
author={Tsui, Darin and Talreja, Kunal and Saeedi, Daniel and Aghazadeh, Amirali},
booktitle={Proceedings of the 43rd International Conference on Machine Learning},
year={2026}
}