------------ Under Internal Testing ------------

CopyKAT-Python

CopyKAT-Python is a Python reimplementation of the CopyKAT workflow for inferring large-scale copy number alterations (CNAs) from single-cell RNA-seq data. It reproduces the core CopyKAT strategy while improving scalability, usability, and integration with modern AnnData/Scanpy pipelines.

Why CopyKAT-Python?

The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. Recurring practical limitations include:

  • Long runtimes, with reports of >1 hour for ~8,000 cells
  • Inability to handle very large datasets (hundreds of thousands to millions of cells) due to hierarchical clustering limits

Highlights:

  • Same core parameters as CopyKAT-R, with convenient Python improvements
  • Handles datasets from thousands to hundreds of thousands of cells at significantly higher speed
  • Fixes some known CopyKAT-R bugs to improve aneuploid prediction accuracy
  • Annotated CNA heatmaps with per-cell metadata sidebars (cell type, cluster labels, etc.)
  • Pre-built Singularity container for reproducible deployment
  • Validated across 11 human cancer samples and a 170k-cell Xenium whole-transcript dataset

Installation

From source:

Installs copykat-py into your current environment

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
pip install -e .

From environment.yml with conda:

Creates a fresh conda environment for copykat-py with all required packages

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
conda env create -f environment.yml
conda activate copykit_py

## After activation, confirm the commands are available
copykat_matrix --help
copykat_anndata --help

Singularity container (recommended for HPC environments):

wget https://github.com/navinlabcode/Copykat_python/releases/download/v1.0.0/copykat_py.sif
singularity exec copykat_py.sif copykat-py --help

How to Run

CopyKAT-Python supports two main entry points:

| Entry point | When to use |
| --- | --- |
| copykat_matrix / copykat-py | Input is a .csv, .tsv, or .mtx matrix file on disk (Linux command line) |
| copykat_anndata() | Input is an already-loaded AnnData object in Python |

Terminal — copykat_matrix / copykat-py

CSV or TSV matrix:

copykat_matrix \
    --input sample_counts.csv \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

10X matrix market input:

copykat_matrix \
    --input filtered_feature_bc_matrix/matrix.mtx.gz \
    --genes filtered_feature_bc_matrix/features.tsv.gz \
    --barcodes filtered_feature_bc_matrix/barcodes.tsv.gz \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

Pass --meta (and optionally --row-split) to produce an annotated heatmap alongside the standard output. See Annotated Heatmap with Metadata.

Python API — copykat()

import pandas as pd
from copykat_py import copykat

counts = pd.read_csv("sample_counts.csv", index_col=0)

result = copykat(
    rawmat=counts,
    id_type="S",
    sam_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
)

print(result["prediction"].head())

rawmat can also be a dict with keys matrix, genes, and barcodes for sparse matrices.
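
As a minimal sketch of the sparse input described above, the dictionary could be assembled as follows. The key names come from the sentence above; the genes-by-cells orientation, the use of a SciPy CSR matrix, and the toy gene and barcode names are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse

# Toy counts: 4 genes (rows) x 3 cells (columns). The genes-by-cells
# orientation is an assumption, not something this README states.
dense = np.array([
    [0, 2, 0],
    [1, 0, 0],
    [0, 0, 3],
    [4, 0, 1],
])

rawmat = {
    "matrix": sparse.csr_matrix(dense),
    "genes": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],  # hypothetical symbols
    "barcodes": ["cell_1", "cell_2", "cell_3"],          # hypothetical barcodes
}

# Sanity check before passing the dict as copykat(rawmat=rawmat, ...):
n_genes, n_cells = rawmat["matrix"].shape
assert n_genes == len(rawmat["genes"])
assert n_cells == len(rawmat["barcodes"])
```

Checking the dimensions up front is cheap and catches a transposed matrix before a long run starts.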

Python API — copykat_anndata()

import anndata as ad
from copykat_py import copykat_anndata

adata = ad.read_h5ad("sample.h5ad")

result = copykat_anndata(
    adata=adata,
    selecting_meta=["CellType", "copykat_pred", "seurat_clusters"],
    row_split="CellType",
    sample_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
    output_dir="results/sample1_anndata",
)

print(result["prediction"]["copykat.pred"].value_counts())

Useful options: layer (use adata.layers[...]), use_raw (use adata.raw), selecting_meta (export obs columns for annotated heatmaps), row_split (column defining row groups).

Output files

All entry points produce the same output files as CopyKAT-R:

  • *_copykat_CNA_results.txt
  • *_copykat_prediction.txt
  • *_copykat_heatmap.png
  • copykat_run.log

When metadata is supplied, an additional annotated heatmap PNG is produced. AnnData workflows also write *_selected_obs_meta.csv.
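
The prediction file can be summarised with a few lines of pandas. This is a sketch only: the in-memory text stands in for a real *_copykat_prediction.txt, and the tab-separated layout, the cell.names column, and the label values are assumptions based on the CopyKAT-R format (only the copykat.pred column name appears in this README).

```python
import io
import pandas as pd

# Stand-in for sample1_copykat_prediction.txt. The layout and labels below
# are assumed from the CopyKAT-R output format, not read from a real run.
example_file = io.StringIO(
    "cell.names\tcopykat.pred\n"
    "cell_1\taneuploid\n"
    "cell_2\tdiploid\n"
    "cell_3\taneuploid\n"
    "cell_4\tnot.defined\n"
)

pred = pd.read_csv(example_file, sep="\t", index_col="cell.names")

# Count cells per call and compute the aneuploid fraction.
counts = pred["copykat.pred"].value_counts()
aneuploid_fraction = (pred["copykat.pred"] == "aneuploid").mean()

print(counts)
print(f"aneuploid fraction: {aneuploid_fraction:.2f}")
```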


Annotated Heatmap with Metadata

Produce a CNA heatmap with per-cell metadata annotations (cell type, cluster labels, etc.). Rows are split into labelled groups by a chosen metadata column and ordered by hierarchical or K-means clustering within each group.
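
The row layout described above can be sketched as follows. This is not the package's plotting code: order_rows_by_group is a hypothetical helper, the Ward linkage is an assumed choice, and the data are random. It only shows the idea of keeping each metadata group contiguous and ordering cells by hierarchical clustering inside each group.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def order_rows_by_group(mat, groups):
    """Return row indices grouped by label and clustered within each group."""
    groups = np.asarray(groups)
    order = []
    for label in sorted(set(groups.tolist())):
        idx = np.flatnonzero(groups == label)
        if len(idx) > 2:
            # Cluster only the rows of this group and take the leaf order.
            leaves = leaves_list(linkage(mat[idx], method="ward"))
            idx = idx[leaves]
        order.extend(idx.tolist())
    return order

rng = np.random.default_rng(0)
mat = rng.normal(size=(8, 5))  # 8 cells x 5 genomic bins (toy data)
groups = ["T", "B", "T", "B", "T", "B", "T", "T"]

row_order = order_rows_by_group(mat, groups)
print([groups[i] for i in row_order])
```

Groups of one or two cells are left in their original order, since clustering adds nothing there.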

CLI — standalone re-plot (copykat-py-plot)

Re-plot from an existing CNA results file without re-running the full analysis:

copykat-py-plot \
    --cna  sample_copykat_CNA_results.txt \
    --meta xenium_ft_full_meta_celltype_leiden.csv \
    --row-split inferred_CellType \
    --sample-name xenium_all_cells \
    --n-cores 40 \
    --output xenium_annotated_heatmap.png

| Flag | Default | Description |
| --- | --- | --- |
| --cna / -c | (required) | *_copykat_CNA_results.txt from a copykat-py run |
| --meta / -m | (required) | Annotation CSV; first column = cell name, rest = metadata |
| --row-split | second column | Column used to split and label row groups |

Meta CSV format — header row is auto-detected:

cell_name,leiden_cluster,inferred_CellType
aaaajgij-1,5,lumhr
aaaandia-1,5,lumhr
...

Cells present in the CNA results but absent from the CSV are labelled "unknown" and shown in grey. All remaining metadata columns are drawn as coloured annotation sidebars.
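
The "unknown" behaviour can be reproduced when preparing metadata by hand. The sketch below is an illustration with pandas, not the package's internal logic; the third cell name is made up to show a cell that is missing from the CSV.

```python
import io
import pandas as pd

# Metadata in the format shown above (first column = cell name).
meta_csv = io.StringIO(
    "cell_name,leiden_cluster,inferred_CellType\n"
    "aaaajgij-1,5,lumhr\n"
    "aaaandia-1,5,lumhr\n"
)

# Cell names as they would appear in the CNA results; the last one is a
# made-up cell with no metadata row.
cna_cells = ["aaaajgij-1", "aaaandia-1", "zzzzzzzz-1"]

meta = pd.read_csv(meta_csv, index_col=0, dtype=str)

# Align metadata to the CNA cell order and label missing cells "unknown".
aligned = meta.reindex(cna_cells).fillna("unknown")
print(aligned)
```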

Python API — plot_heatmap_annotated

import pandas as pd
from copykat_py.plotting import plot_heatmap_annotated

cna = pd.read_csv("sample_copykat_CNA_results.txt", sep="\t")
plot_heatmap_annotated(
    mat           = cna.iloc[:, 3:].values.astype("float32"),
    cell_names    = cna.columns[3:].tolist(),
    chrom_info    = cna.iloc[:, 0].values,
    meta_csv      = "xenium_ft_full_meta_celltype_leiden.csv",
    row_split_col = "inferred_CellType",
    sample_name   = "xenium_all_cells",
    n_cores       = 40,
    output_path   = "xenium_annotated_heatmap.png",
)

Benchmarking and Validation

Validation on 11 datasets from the Curated Cancer Cell Atlas (3CA)

Both CopyKAT-R and CopyKAT-Python were tested on raw datasets (no QC filtering) using 24 cores. For each of the 11 datasets, cell-type composition, aneuploid annotation, and UMAP embeddings were taken from the metadata, and per-sample count matrices were prepared in standard 10X MTX format for the CopyKAT-R vs CopyKAT-Python comparison.
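
For readers preparing their own samples the same way, a per-sample matrix can be written in 10X-style MTX layout with SciPy. This is a sketch under assumptions: the gene IDs, symbols, and barcodes are placeholders, the files are left uncompressed, and it is not the script used for the benchmark.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite

# Toy counts: 2 genes (rows) x 3 cells (columns), as in 10X matrix.mtx.
counts = sparse.csr_matrix(np.array([[0, 3, 0], [1, 0, 2]]))
genes = [("ENSG_FAKE_1", "GENE_A"), ("ENSG_FAKE_2", "GENE_B")]  # placeholders
barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]                        # placeholders

out_dir = tempfile.mkdtemp()

# matrix.mtx: sparse counts in Matrix Market format.
mmwrite(os.path.join(out_dir, "matrix.mtx"), counts)

# features.tsv: gene ID, gene symbol, feature type (one row per gene).
with open(os.path.join(out_dir, "features.tsv"), "w") as fh:
    for gene_id, symbol in genes:
        fh.write(f"{gene_id}\t{symbol}\tGene Expression\n")

# barcodes.tsv: one cell barcode per line, in column order.
with open(os.path.join(out_dir, "barcodes.tsv"), "w") as fh:
    fh.write("\n".join(barcodes) + "\n")

# Read the matrix back to confirm the round trip.
round_trip = mmread(os.path.join(out_dir, "matrix.mtx"))
```

The three files can then be gzipped and passed to copykat_matrix through --input, --genes, and --barcodes as in the 10X example earlier.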

3CA Benchmark Datasets

| # | Dataset | Cancer Type | Sample | n_cells | Tumor % (meta) | Ref |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Gao2021_Breast | Breast cancer | DCIS1 | 1,480 | 74.4% | Gao et al. 2021 |
| 2 | Chen2020_Head-and-Neck | Nasopharyngeal carcinoma | P11 | 6,890 | 26.3% | Chen et al. 2020 |
| 3 | Laughney2020_Lung | Lung adenocarcinoma | RU681 | 993 | 77.0% | Laughney et al. 2020 |
| 4 | Bi2021_Kidney | Renal cell carcinoma (RCC) | P90 | 8,426 | 39.4% | Bi et al. 2021 |
| 5 | Dong2020_Prostate | Prostate cancer | patient #5 | 8,690 | 19.3% | Dong et al. 2020 |
| 6 | Jerby-Arnon2021_Sarcoma | Synovial sarcoma | SyS14 | 2,522 | 94.4% | Jerby-Arnon et al. 2021 |
| 7 | Choudhury2022_Brain | Meningioma | MSC6-BTI | 13,171 | 62.2% | Choudhury et al. 2022 |
| 8 | Lin2020_Pancreas | PDAC | P08 | 1,139 | 74.1% | Lin et al. 2020 |
| 9 | Lee2020_Colorectal | Colorectal cancer (CRC) | SMC09 | 2,272 | 77.9% | Lee et al. 2020 |
| 10 | Geistlinger2020_Ovarian | HGSOC | T59 | 12,659 | 25.1% | Geistlinger et al. 2020 |
| 11 | Ji2020_Skin | Cutaneous SCC | P4 | 7,956 | 53.0% | Ji et al. 2020 |

Key Metrics Comparison

[Figure: key metrics comparison]

Side-by-Side Comparison: CopyKAT-R vs CopyKAT-Python

[Figure: side-by-side comparison of CopyKAT-R and CopyKAT-Python]

Large-Scale Testing: Xenium Atera Dataset

To evaluate scalability, the full FFPE Human Breast Cancer Xenium (Atera) dataset was run at three sizes: subsets of 50k and 100k cells, and the full dataset (~170k cells).

Runtime Comparison

[Figure: runtime comparison]

CNV Heatmap with Annotation

[Figure: CNV heatmap with annotation]

CNV comparison

[Figure: CNV comparison]

Why Results May Differ from CopyKAT-R

In the comparison of final predictions above, Seurat cluster 4 was called diploid by CopyKAT-R but aneuploid by CopyKAT-Python. In this case, the CopyKAT-Python call was confirmed as correct by the corresponding H&E cell morphology.

The key difference is in the final prediction step (step 8), where both implementations perform hierarchical clustering on the adjusted CNA matrix and cut the tree at k=2. R's copykat explicitly uses method = "ward.D" in hclust(), while CopyKAT-Python uses scipy/fastcluster's "ward", which implements the mathematically correct ward.D2 criterion. For cell clusters with subtle CNV profiles that sit near the boundary of the diploid/aneuploid split (such as Seurat cluster 4 here), the two linkage variants can produce different dendrogram topologies, causing the binary label assignment to flip.
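
The difference between the two linkage variants can be illustrated on toy data. The sketch below is not code from either package: ward_lance_williams is a hypothetical helper that applies the Ward update formula either to plain distances (as R documents for "ward.D") or to squared distances with square-rooted heights (as for "ward.D2"), and the random points merely stand in for a CNA matrix. Whether the two-cluster cut differs depends on the data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def ward_lance_williams(points, squared):
    """Naive agglomerative clustering with the Ward update formula.

    squared=False feeds plain distances to the update (ward.D style);
    squared=True squares them first and reports sqrt heights (ward.D2 style).
    Returns the merge heights and the labels of the two-cluster cut.
    """
    n = len(points)
    d = squareform(pdist(points))
    if squared:
        d = d ** 2
    np.fill_diagonal(d, np.inf)
    size = np.ones(n)
    members = {i: [i] for i in range(n)}
    heights, labels = [], None
    while len(members) > 1:
        if len(members) == 2:
            # Record the partition just before the final merge (cut at k=2).
            labels = np.zeros(n, dtype=int)
            labels[list(members.values())[1]] = 1
        active = list(members)
        sub = d[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        dij = d[i, j]
        heights.append(np.sqrt(dij) if squared else dij)
        # Lance-Williams update: distance from every other cluster k to i+j.
        for k in active:
            if k != i and k != j:
                total = size[i] + size[j] + size[k]
                d[i, k] = d[k, i] = (
                    (size[i] + size[k]) * d[i, k]
                    + (size[j] + size[k]) * d[j, k]
                    - size[k] * dij
                ) / total
        size[i] += size[j]
        members[i] += members.pop(j)
    return np.array(heights), labels

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 3))  # toy stand-in for a CNA matrix

h_d, lab_d = ward_lance_williams(points, squared=False)
h_d2, lab_d2 = ward_lance_williams(points, squared=True)
scipy_heights = linkage(points, method="ward")[:, 2]

same_cut = np.array_equal(lab_d, lab_d2) or np.array_equal(lab_d, 1 - lab_d2)
print("ward.D2 sketch matches SciPy ward:", np.allclose(np.sort(h_d2), scipy_heights))
print("same k=2 partition under both variants:", same_cut)
```

The comparison against SciPy serves as a check that the squared variant is the criterion SciPy's "ward" implements, while the unsquared variant yields different merge heights and, for borderline clusters, possibly a different split.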

CopyKAT-Python results may not be identical to CopyKAT-R due to differences in:

  • Gene annotation versions
  • Filtering and preprocessing steps
  • Numerical implementation details
  • Smoothing and segmentation algorithms
  • Clustering behavior (parDist + hclust vs. PCA + fastcluster)

High-confidence results typically show:

  • Clear chromosome-arm or whole-chromosome CNV patterns
  • Consistent CNV profiles within clusters
  • Strong separation between inferred diploid and aneuploid cells

Lower-confidence results may occur in samples with:

  • Weak CNV signal or low sequencing depth
  • Few normal reference cells
  • Strong batch effects
  • Near-diploid tumor genomes

Disclaimer: CopyKAT-Python is an independent reimplementation focused on scalability and usability, while faithfully reproducing the core CopyKAT analytical strategy.