Oncodrive3D ships with a plotting pipeline that turns the clustering results into visualizations and annotated tables. They help you interpret why a gene or residue scored a significant 3D-clustering signal, judge whether that signal looks biologically plausible or artifactual, and produce downstream assets (summary panels, per-gene tracks, association charts, annotated CSVs) for follow-up.
Annotations are built once per dataset via oncodrive3d build-annotations, then reused by oncodrive3d plot for any cohort.
Building the annotation bundle (oncodrive3d build-annotations) is the demanding step and needs all of the following. Plotting (oncodrive3d plot) needs only items 1 and 2, plus the annotation bundle and the outputs of an oncodrive3d run (listed under Generating Plots).
- Datasets – oncodrive3d build-datasets (or build-datasets --mane_only) must have been run already. Both steps read from this folder (build-annotations uses pdb_structures and seq_for_mut_prob.tsv; plot uses confidence.tsv and seq_for_mut_prob.tsv).
- Python environment – use the same environment (e.g., uv virtualenv) that you rely on for the CLI.
- PDB_Tool binary – oncodrive3d build-annotations invokes the PDB_Tool executable named PDB_Tool on $PATH to compute per-residue solvent accessibility and secondary structure. See Installing PDB_Tool below for a recipe.
- Internet access – required to download Pfam annotations, UniProt features, and (unless --ddg_dir is set) RaSP ΔΔG predictions.
- Disk space – annotation folders contain many files; keep several GB free.
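The checks in the list above can be scripted before launching a long build. What follows is a minimal sketch and not part of Oncodrive3D: the function name check_prerequisites is hypothetical, and the only file names it uses are the ones mentioned above.

```python
import shutil
from pathlib import Path

# Entries that each step reads from the datasets folder, as listed above.
REQUIRED_BY_STEP = {
    "build-annotations": ["pdb_structures", "seq_for_mut_prob.tsv"],
    "plot": ["confidence.tsv", "seq_for_mut_prob.tsv"],
}


def check_prerequisites(datasets_dir, step="build-annotations"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    root = Path(datasets_dir)
    for name in REQUIRED_BY_STEP[step]:
        if not (root / name).exists():
            problems.append(f"missing in datasets folder: {name}")
    # Only the annotation build calls the external PDB_Tool binary.
    if step == "build-annotations" and shutil.which("PDB_Tool") is None:
        problems.append("PDB_Tool not found on $PATH")
    return problems
```

Calling check_prerequisites("datasets") and printing the returned list gives a quick go/no-go before the build starts.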
PDB_Tool compiles from source. If your conda env has no C/C++ toolchain, install one first:
conda install -c conda-forge gxx make
Then build and put it on $PATH:
git clone https://github.com/realbigws/PDB_Tool.git
cd PDB_Tool
make -C source_code
# The Makefile drops the binary at the repo root. Symlink into the active conda env so it resolves via `which PDB_Tool`:
ln -s "$(pwd)/PDB_Tool" "$CONDA_PREFIX/bin/PDB_Tool"
Run once per dataset (or whenever you update AlphaFold structures):
# Default (Homo sapiens, public RaSP ΔΔG):
oncodrive3d build-annotations -d <build_folder> -o <annot_folder>
# With a MANE-built dataset:
oncodrive3d build-annotations -d <mane_build_folder> -o <annot_folder>
# Mouse with custom ΔΔG predictions (omit --ddg_dir to skip the ΔΔG step):
oncodrive3d build-annotations -d <build_folder> -o <annot_folder> -s mouse --ddg_dir <ddg_path>
See oncodrive3d build-annotations --help for all options.
Worth knowing:
- ΔΔG predictions default to the public RaSP bundle (computed against the canonical AlphaFold v4 human proteome, not the MANE bundle). For datasets built with a different AF version, residue-level mismatches are filtered during validation (see --ddg_mismatch_threshold below).
- --ddg_dir overrides the default download with a folder of RaSP-style CSVs (columns variant and score_ml; UniProt accession auto-detected anywhere in the filename, any separator). For non-human organisms the public bundle doesn't apply, so without --ddg_dir the ΔΔG step is skipped with a warning; other annotations build normally. To generate predictions yourself on CPUs, see bbglab/rasp_cpu.
- --ddg_mismatch_threshold (default 0.1) drops a protein if the fraction of its wild-type residues that disagree with the canonical UniProt sequence exceeds this value. Set to 1.0 to disable the WT-mismatch check (positions outside the canonical sequence still drop the protein).
- If --output_dir exists and isn't empty, you're prompted before its contents are cleaned (excluding log/); pass --yes to auto-confirm.
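To make the --ddg_mismatch_threshold rule concrete, here is a small sketch of the kind of check it describes. It is an illustration, not the tool's implementation. It assumes that each entry in the variant column looks like M1A (wild-type letter, 1-based position, mutant letter) and that the fraction is computed over distinct predicted positions. The function names are hypothetical.

```python
import re

# Assumed variant format: wild-type letter, 1-based position, mutant letter ("M1A").
VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def wt_mismatch_fraction(variants, canonical_seq):
    """Fraction of predicted positions whose wild-type residue differs from canonical_seq.

    Returns None when there are no variants or when any position lies outside
    the canonical sequence (the case that drops the protein at any threshold).
    """
    wt_by_pos = {}
    for variant in variants:
        match = VARIANT_RE.match(variant)
        if match is None:
            raise ValueError(f"unrecognised variant: {variant!r}")
        wt_by_pos[int(match.group(2))] = match.group(1)
    if not wt_by_pos:
        return None
    if any(pos < 1 or pos > len(canonical_seq) for pos in wt_by_pos):
        return None
    mismatches = sum(canonical_seq[pos - 1] != wt for pos, wt in wt_by_pos.items())
    return mismatches / len(wt_by_pos)


def keep_protein(variants, canonical_seq, threshold=0.1):
    """Keep the protein only if the mismatch fraction does not exceed the threshold."""
    fraction = wt_mismatch_fraction(variants, canonical_seq)
    return fraction is not None and fraction <= threshold
```

With threshold=1.0 every in-range protein passes, while a protein with an out-of-range position is still rejected, mirroring the behaviour described above.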
The command assembles three annotation tracks:
- Stability change (ΔΔG) – RaSP predictions are downloaded (human) or read from --ddg_dir, then validated against the canonical sequence from seq_for_mut_prob.tsv; proteins failing validation are dropped with a warning. Positions with no prediction are kept as NaN (not 0.0) so plots show gaps and the logistic regression uses only real measurements.
- PDB features – AlphaFold structures are run through PDB_Tool into pdb_tool_df.tsv, with residue-level secondary structure (SSE) and relative accessibility (pACC).
- Domains and sites – Pfam coordinates (Ensembl BioMart) go to pfam.tsv; UniProt DOMAIN/PTM/SITE/MOTIF/MEMBRANE features (EMBL-EBI Proteins API) go to uniprot_feat.tsv.
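The NaN-versus-0.0 choice for missing ΔΔG positions matters more than it may seem. The toy example below (hypothetical helper names, made-up scores) shows how zero-filling would pull a per-protein average toward zero, whereas NaN gaps can simply be skipped.

```python
import math


def ddg_track(protein_length, predictions):
    """Per-residue track; positions without a prediction stay NaN."""
    track = [math.nan] * protein_length
    for pos, score in predictions.items():  # pos is 1-based
        track[pos - 1] = score
    return track


def mean_ignoring_gaps(track):
    """Average over positions that actually carry a prediction."""
    values = [v for v in track if not math.isnan(v)]
    return sum(values) / len(values) if values else math.nan


track = ddg_track(5, {2: 1.5, 4: 0.5})
zero_filled = [0.0 if math.isnan(v) else v for v in track]

print(mean_ignoring_gaps(track))            # 1.0
print(sum(zero_filled) / len(zero_filled))  # 0.4
```

The same reasoning applies to a regression predictor: a zero is a real-looking measurement, a NaN is an explicit gap.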
After a successful run the annotation folder contains:
annotations/
├── pdb_tool_df.tsv
├── pfam.tsv
├── uniprot_feat.tsv
├── stability_change/ # optional; absent for mouse builds without --ddg_dir
│ └── <UNIPROT>_ddg.json
└── log/
Keep this directory around: oncodrive3d plot reads the tables above and merges in ΔΔG values when stability_change/ is present; otherwise the ΔΔG track is omitted from per-gene plots and association analyses.
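Before plotting, it can help to confirm which pieces of the annotation folder are actually there. The sketch below is an assumption-laden helper (inspect_annotations is a hypothetical name) that relies only on the layout shown in the tree above.

```python
from pathlib import Path

# Tables listed in the annotation folder layout above.
EXPECTED_TABLES = ("pdb_tool_df.tsv", "pfam.tsv", "uniprot_feat.tsv")


def inspect_annotations(annot_dir):
    """Report missing tables and whether any ΔΔG JSON files are present."""
    root = Path(annot_dir)
    missing = [name for name in EXPECTED_TABLES if not (root / name).is_file()]
    ddg_dir = root / "stability_change"
    has_ddg = ddg_dir.is_dir() and any(ddg_dir.glob("*_ddg.json"))
    return {"missing_tables": missing, "has_ddg": has_ddg}
```

A result with has_ddg set to False is expected for builds where the ΔΔG step was skipped; it only means the ΔΔG track will not appear.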
Once you have:
- Gene-level results (<cohort>.3d_clustering_genes.csv),
- Residue-level results (<cohort>.3d_clustering_pos.csv),
- Processed mutations (<cohort>.mutations.processed.tsv),
- Missense probability dictionary (<cohort>.miss_prob.processed.json),
- Processed sequence dataframe (<cohort>.seq_df.processed.tsv),
- Built datasets (datasets/) and annotations (annotations/),
call:
oncodrive3d plot \
--gene_result_path output/COHORT/COHORT.3d_clustering_genes.csv \
--pos_result_path output/COHORT/COHORT.3d_clustering_pos.csv \
--maf_path output/COHORT/COHORT.mutations.processed.tsv \
--miss_prob_path output/COHORT/COHORT.miss_prob.processed.json \
--seq_df_path output/COHORT/COHORT.seq_df.processed.tsv \
--datasets_dir /path/to/datasets \
--annotations_dir /path/to/annotations \
--output_dir plots/COHORT \
  --cohort COHORT
See oncodrive3d plot --help for all options.
Worth knowing:
- --maf_path is the processed missense-only TSV (<cohort>.mutations.processed.tsv) from oncodrive3d run.
- --maf_for_nonmiss_path is optional and takes the original MAF (before processing); supply it to enable the non-missense track.
- All other input files (gene/pos results, --miss_prob_path, --seq_df_path, --datasets_dir, --annotations_dir) must come from that same oncodrive3d run invocation; a mismatch yields empty plots or missing-track errors.
- --lst_summary_tracks / --lst_gene_tracks accept comma-separated track names; pair them with --lst_*_hratios to redistribute vertical space.
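Because every input must come from the same run, it is convenient to derive all paths from one run directory and one cohort name. The sketch below does that and returns the argument list for the command shown above. The function build_plot_command is hypothetical; the flags and file-name patterns are the ones documented in this section.

```python
from pathlib import Path


def build_plot_command(run_dir, cohort, datasets_dir, annotations_dir, output_dir):
    """Assemble the `oncodrive3d plot` arguments for one cohort, checking inputs first."""
    run_dir = Path(run_dir)
    inputs = {
        "--gene_result_path": run_dir / f"{cohort}.3d_clustering_genes.csv",
        "--pos_result_path": run_dir / f"{cohort}.3d_clustering_pos.csv",
        "--maf_path": run_dir / f"{cohort}.mutations.processed.tsv",
        "--miss_prob_path": run_dir / f"{cohort}.miss_prob.processed.json",
        "--seq_df_path": run_dir / f"{cohort}.seq_df.processed.tsv",
    }
    missing = [str(path) for path in inputs.values() if not path.is_file()]
    if missing:
        raise FileNotFoundError("missing run outputs: " + ", ".join(missing))

    cmd = ["oncodrive3d", "plot"]
    for flag, path in inputs.items():
        cmd += [flag, str(path)]
    cmd += [
        "--datasets_dir", str(datasets_dir),
        "--annotations_dir", str(annotations_dir),
        "--output_dir", str(output_dir),
        "--cohort", cohort,
    ]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True); building it from a single run directory makes it hard to mix files from different runs by accident.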
The command produces:
- A summary plot of per-gene mutation counts, cluster residues, and score distributions.
- Per-gene plots overlaying the requested tracks (observed vs expected mutation counts, missense probabilities, clustering scores, PAE/pLDDT, ΔΔG, Pfam/UniProt domains, PTMs, membrane regions, motifs). Tracks not available for a gene are dropped automatically.
- Annotated tables (<cohort>.3d_clustering_pos.annotated.csv and <cohort>.uniprot_feat.tsv), merging the positional results with disorder (pLDDT), PDB features, transcript metadata, and UniProt domains. Written only when --output_csv is passed.
- Association plots (optional); see the "Association Analyses" section below.
The optional association module quantifies how strongly specific annotations track with significant clusters:
- Input preparation – residues with non-zero missense probability inherit standardized predictors: structural metrics (pLDDT, ΔΔG, surface exposure), categorical features (Pfam/UniProt/PTM/motif dummies), and the expected missense probability itself.
- Univariate logistic regressions – for each gene and each predictor, the pipeline fits logit(C ~ feature), where C is the binary cluster label. This yields log-odds, standard errors, and raw p-values, which are stored in <cohort>.logreg_result.tsv.
- Visualization – the statistics above feed three plot types under <cohort>.associations_plots/:
  - A cohort-wide volcano plot highlighting annotations with the most extreme log-odds and p-values.
  - Per-gene mini volcano plots to inspect feature enrichments gene by gene.
  - Log-odds strip charts with 95% confidence intervals to visualize effect sizes.
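For readers who want to see what a single univariate fit produces, the following is a self-contained sketch of a one-predictor logistic regression solved by Newton-Raphson, returning the slope (log-odds), its standard error, and a two-sided Wald p-value. It is an illustration written from scratch, not the pipeline's code, and the toy data are invented.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _score_and_information(x, y, b0, b1):
    """Log-likelihood gradient and 2x2 Fisher information at (b0, b1)."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        p = _sigmoid(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return (g0, g1), (i00, i01, i11)


def fit_univariate_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(P(C = 1)) = b0 + b1 * x; return (log_odds, std_err, p_value) for b1."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (i00, i01, i11) = _score_and_information(x, y, b0, b1)
        det = i00 * i11 - i01 * i01
        if det <= 0.0:
            raise ValueError("singular information matrix (constant feature?)")
        # Newton step: inverse information times the gradient.
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Standard error of the slope from the information at the final estimate.
    _, (i00, i01, i11) = _score_and_information(x, y, b0, b1)
    std_err = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    z = b1 / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return b1, std_err, p_value


# Toy data: a standardized feature and a binary cluster label C.
feature = [-2, -1, -1, 0, 0, 1, 1, 2]
in_cluster = [0, 0, 1, 0, 1, 0, 1, 1]
log_odds, std_err, p_value = fit_univariate_logit(feature, in_cluster)
```

A 95% interval for a strip chart follows as log_odds ± 1.96 × std_err. Note that a feature that perfectly separates clustered from non-clustered residues has no finite estimate, so a fit like this one would fail to converge in that case.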
Only raw p-values are provided; apply your preferred multiple-testing correction (e.g., BH-FDR) before drawing conclusions about specific features.
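As one option for that correction, here is a plain-Python sketch of the Benjamini-Hochberg adjustment. The function name is ours and the p-values are made up; how you read them out of <cohort>.logreg_result.tsv depends on its column names, which are not covered here.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)  # approximately [0.04, 0.053, 0.053, 0.2]
```

Whether to correct within each gene or across the whole cohort is an analysis decision; the function only needs the list of p-values that form one family of tests.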
For interactive-ready 3D views, the separate oncodrive3d chimerax-plot command renders PNG snapshots (plus .defattr attribute files) under <output_dir>/<cohort>.chimerax/. It reuses the gene/position CSVs from oncodrive3d run, the datasets directory (for AlphaFold structures), and the processed sequence dataframe. Each snapshot colours the AlphaFold model by a mutation or clustering metric (mutations in residue, mutations in volume, clustering score, log clustering score), highlighting the mutated or cluster residues as spheres. The Nextflow pipeline exposes the same functionality through the chimerax_plot flag.
Note
ChimeraX must be installed separately. The framework was tested with ChimeraX 1.6.1. Download ucsf-chimerax_1.6.1ubuntu20.04_amd64.deb from UCSF older releases (newer releases should also work) and install it with sudo apt install ./ucsf-chimerax_1.6.1ubuntu20.04_amd64.deb; the executable lands at /usr/bin/chimerax, which is where the command looks by default (pass --chimerax_bin if yours is elsewhere).
If instead you run Oncodrive3D through the provided chimerax or full Docker image, ChimeraX is already included, so no separate install is needed.
Example:
oncodrive3d chimerax-plot \
--gene_result_path output/COHORT/COHORT.3d_clustering_genes.csv \
--pos_result_path output/COHORT/COHORT.3d_clustering_pos.csv \
--datasets_dir /path/to/datasets \
--seq_df_path output/COHORT/COHORT.seq_df.processed.tsv \
--output_dir plots/COHORT \
--cohort COHORT \
--chimerax_bin /opt/ChimeraX/bin/ChimeraX \
--max_n_genes 20 \
--pixel_size 0.1 \
  --cluster_ext
See oncodrive3d chimerax-plot --help for all options.
Worth knowing:
- --pixel_size controls resolution: smaller values produce larger images (default 0.08).
- --cluster_ext displays extended clusters (mutations that contribute to but don't directly form significant clusters).
- --af_version is auto-detected from the structures in the datasets directory, so you normally don't set it. It's used only as a tiebreaker when the dataset contains more than one AlphaFold version, or as a fallback if none is detected (default 6).
- --spheres / --no-spheres (default on) highlights residues as spheres: mutated residues on the base plots, cluster residues on the *_clusters plots. With --no-spheres the base plots are cartoon-only while the *_clusters plots still mark the clusters.
- --cluster_markers (default off) adds translucent volume bubbles on the cluster residues in the *_clusters plots.
- --non_mutated_color (default gray) and --text_color (default black) set the colour of the non-mutated cartoon and of the title / color-bar label (any ChimeraX colour name or hex).
- --transparent_bg / --no-transparent_bg (default on) saves images with a transparent background; pass --no-transparent_bg for a white background.
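The inverse relation between --pixel_size and image size is easy to see with a little arithmetic. The sketch below assumes the value is interpreted as scene distance (Å) per pixel, as in the pixelSize option of ChimeraX's save command; the 80 Å view width and the function name are made up for illustration.

```python
def image_width_px(scene_width_angstrom, pixel_size):
    """Approximate image width in pixels for a view of the given width in Å."""
    if pixel_size <= 0:
        raise ValueError("pixel_size must be positive")
    return round(scene_width_angstrom / pixel_size)


print(image_width_px(80, 0.1))   # 800
print(image_width_px(80, 0.08))  # 1000
```

Halving the pixel size doubles each image dimension, so file size grows roughly fourfold.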
- PDB_Tool missing – install the binary or adjust $PATH. The build step logs the exact command being executed, making it easier to diagnose permission issues.
- Annotation mismatches – plots rely on UniProt IDs matching between the run outputs and the annotation tables. Make sure you pass the same datasets directory used during build-datasets.
- Association plots without data – if a gene does not have both clustered and non-clustered residues after filtering, it is excluded from the logistic regression. The log file will note “There aren’t any relationship to plot”.