------------ Under Internal Testing ------------

CopyKAT-Python

CopyKAT-Python is a Python reimplementation of the CopyKAT workflow for inferring large-scale copy number alterations (CNAs) from single-cell RNA-seq data. It reproduces the core CopyKAT strategy while improving scalability, usability, and integration with modern AnnData/Scanpy pipelines.

Why CopyKAT-Python?

The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. Recurring practical limitations include:

Long runtimes, with reports of >1 hour for ~8,000 cells
Inability to handle very large datasets (hundreds of thousands to millions of cells) due to hierarchical clustering limits

Highlights:

Same core parameters as CopyKAT-R, with Python-specific conveniences
Scales from thousands to hundreds of thousands of cells, with substantially shorter runtimes than CopyKAT-R
Fixes known CopyKAT-R bugs to improve aneuploid prediction accuracy
Annotated CNA heatmaps with per-cell metadata sidebars (cell type, cluster labels, etc.)
Pre-built Singularity container for reproducible deployment
Validated across 11 human cancer samples and a 170k-cell Xenium whole-transcript dataset

Installation

From source:

Installs copykat-py into your current environment

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
pip install -e .

From environment.yml with conda:

Creates a fresh conda environment for copykat-py with all required packages

git clone https://github.com/navinlabcode/Copykat_python.git
cd Copykat_python
conda env create -f environment.yml
conda activate copykit_py

# After activation, confirm the commands are available
copykat_matrix --help
copykat_anndata --help

Singularity container (recommended for HPC environments):

wget https://github.com/navinlabcode/Copykat_python/releases/download/v1.0.0/copykat_py.sif
singularity exec copykat_py.sif copykat-py --help

How to Run

CopyKAT-Python supports two main entry points:

Entry point	When to use
`copykat_matrix` / `copykat-py`	Input is a `.csv`, `.tsv`, or `.mtx` matrix file on disk; run from a Linux terminal
`copykat_anndata()`	Input is an `AnnData` object already loaded in Python

Terminal — `copykat_matrix` / `copykat-py`

CSV or TSV matrix:

copykat_matrix \
    --input sample_counts.csv \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

10X matrix market input:

copykat_matrix \
    --input filtered_feature_bc_matrix/matrix.mtx.gz \
    --genes filtered_feature_bc_matrix/features.tsv.gz \
    --barcodes filtered_feature_bc_matrix/barcodes.tsv.gz \
    --sample-name sample1 \
    --genome hg20 \
    --n-cores 24 \
    --output-dir results/sample1

Pass --meta (and optionally --row-split) to produce an annotated heatmap alongside the standard output. See Annotated Heatmap with Metadata.
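When several samples need the same settings, the command line can be assembled from Python. The sketch below only builds and prints the commands, using the flags shown above; the helper name `build_copykat_matrix_cmd` and the sample file names are hypothetical, and the `subprocess.run` call is left commented out.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented


def build_copykat_matrix_cmd(input_path, sample_name, output_dir,
                             genome="hg20", n_cores=24,
                             meta=None, row_split=None):
    """Assemble a copykat_matrix command as a list of arguments (hypothetical helper)."""
    cmd = [
        "copykat_matrix",
        "--input", str(input_path),
        "--sample-name", sample_name,
        "--genome", genome,
        "--n-cores", str(n_cores),
        "--output-dir", str(output_dir),
    ]
    if meta is not None:
        # --meta requests the annotated heatmap; --row-split is optional.
        cmd += ["--meta", str(meta)]
        if row_split is not None:
            cmd += ["--row-split", row_split]
    return cmd


# Hypothetical sample names and count files, for illustration only.
samples = {"sample1": "sample1_counts.csv", "sample2": "sample2_counts.csv"}
commands = [
    build_copykat_matrix_cmd(path, name, f"results/{name}")
    for name, path in samples.items()
]

for cmd in commands:
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch the run
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.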

Python API — `copykat()`

import pandas as pd
from copykat_py import copykat

counts = pd.read_csv("sample_counts.csv", index_col=0)

result = copykat(
    rawmat=counts,
    id_type="S",
    sam_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
)

print(result["prediction"].head())

rawmat can also be a dict with keys matrix, genes, and barcodes for sparse matrices.
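A minimal sketch of how such a dict might be assembled is shown below. The key names come from the sentence above; the genes x cells orientation, the value types (SciPy sparse matrix plus plain lists), and the helper name `make_rawmat` are assumptions, so check them against the package before relying on this.

```python
import numpy as np
from scipy import sparse


def make_rawmat(matrix, genes, barcodes):
    """Bundle a sparse count matrix with its labels (hypothetical helper).

    Assumes rows are genes and columns are cells, as in 10X matrix.mtx files.
    """
    matrix = sparse.csr_matrix(matrix)
    genes = list(genes)
    barcodes = list(barcodes)
    if matrix.shape != (len(genes), len(barcodes)):
        raise ValueError(
            f"matrix shape {matrix.shape} does not match "
            f"{len(genes)} genes x {len(barcodes)} barcodes"
        )
    return {"matrix": matrix, "genes": genes, "barcodes": barcodes}


# Tiny made-up example: 3 genes x 2 cells.
toy_counts = np.array([[0, 3], [1, 0], [5, 2]])
rawmat = make_rawmat(
    toy_counts,
    genes=["TP53", "MYC", "GAPDH"],
    barcodes=["AAACCTGA-1", "AAACGGGT-1"],
)
print(rawmat["matrix"].shape)
```

For 10X output, `scipy.io.mmread` can read `matrix.mtx.gz` into a sparse matrix that is then passed as `matrix`.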

Python API — `copykat_anndata()`

import anndata as ad
from copykat_py import copykat_anndata

adata = ad.read_h5ad("sample.h5ad")

result = copykat_anndata(
    adata=adata,
    selecting_meta=["CellType", "copykat_pred", "seurat_clusters"],
    row_split="CellType",
    sample_name="sample1",
    genome="hg20",
    distance="euclidean",
    n_cores=24,
    output_dir="results/sample1_anndata",
)

print(result["prediction"]["copykat.pred"].value_counts())

Useful options: layer (use adata.layers[...]), use_raw (use adata.raw), selecting_meta (export obs columns for annotated heatmaps), row_split (column defining row groups).

Output files

All entry points produce the same outputs as copykat-R:

*_copykat_CNA_results.txt
*_copykat_prediction.txt
*_copykat_heatmap.png
copykat_run.log

When metadata is supplied, an additional annotated heatmap PNG is produced. AnnData workflows also write *_selected_obs_meta.csv.
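As a quick check after a run, the prediction file can be tallied with pandas. This sketch assumes the file is tab-separated and carries a `copykat.pred` column, as in the Python API example above; the in-memory rows and the `cell.names` column are made up for illustration.

```python
import io

import pandas as pd


def summarize_predictions(path_or_buffer):
    """Count cells per predicted label in a *_copykat_prediction.txt file."""
    pred = pd.read_csv(path_or_buffer, sep="\t")
    return pred["copykat.pred"].value_counts()


# Hypothetical file content standing in for a real prediction file.
example = io.StringIO(
    "cell.names\tcopykat.pred\n"
    "cell_1\taneuploid\n"
    "cell_2\tdiploid\n"
    "cell_3\taneuploid\n"
)
counts = summarize_predictions(example)
print(counts)
```

In practice, pass the path to the real file, for example `results/sample1/sample1_copykat_prediction.txt`, if the output follows the naming pattern listed above.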

Annotated Heatmap with Metadata

Produce a CNA heatmap with per-cell metadata annotations (cell type, cluster labels, etc.). Rows are split into labelled groups by a chosen metadata column and ordered by hierarchical or K-means clustering within each group.
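The row ordering described above can be sketched as follows. This is an illustration of the idea, not the package's plotting code: it assumes a cells x bins matrix, uses SciPy's Ward linkage for the within-group ordering, and the function name `order_rows_within_groups` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def order_rows_within_groups(mat, groups):
    """Return a row order that keeps groups contiguous and clusters within each."""
    mat = np.asarray(mat, dtype=float)
    groups = np.asarray(groups)
    order = []
    for label in np.unique(groups):
        idx = np.flatnonzero(groups == label)
        if idx.size > 2:
            # Order rows of this group by the dendrogram leaf order.
            leaves = leaves_list(linkage(mat[idx], method="ward"))
            idx = idx[leaves]
        order.extend(idx.tolist())
    return np.array(order)


# Made-up data: 6 cells x 4 genomic bins with a group label per cell.
rng = np.random.default_rng(0)
mat = rng.normal(size=(6, 4))
groups = np.array(["tumor", "normal", "tumor", "normal", "tumor", "tumor"])

order = order_rows_within_groups(mat, groups)
print(groups[order])
```

Indexing the matrix and the metadata with the same `order` keeps the heatmap rows and the annotation sidebars aligned.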

CLI — standalone re-plot (`copykat-py-plot`)

Re-plot from an existing CNA results file without re-running the full analysis:

copykat-py-plot \
    --cna  sample_copykat_CNA_results.txt \
    --meta xenium_ft_full_meta_celltype_leiden.csv \
    --row-split inferred_CellType \
    --sample-name xenium_all_cells \
    --n-cores 40 \
    --output xenium_annotated_heatmap.png

Flag	Default	Description
`--cna` / `-c`	(required)	`*_copykat_CNA_results.txt` from a copykat-py run
`--meta` / `-m`	(required)	Annotation CSV — first column = cell name, rest = metadata
`--row-split`	second column	Column used to split and label row groups

Meta CSV format — header row is auto-detected:

cell_name,leiden_cluster,inferred_CellType
aaaajgij-1,5,lumhr
aaaandia-1,5,lumhr
...

Cells present in the CNA results but absent from the CSV are labelled "unknown" and shown in grey. All remaining metadata columns are drawn as coloured annotation sidebars.
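The "unknown" labelling rule can be illustrated with a pandas reindex. This is a sketch of the behaviour described above rather than the package's own code; the third cell name is invented to show a cell that is missing from the CSV.

```python
import io

import pandas as pd

# Metadata in the format shown above (first column = cell name).
meta_csv = io.StringIO(
    "cell_name,leiden_cluster,inferred_CellType\n"
    "aaaajgij-1,5,lumhr\n"
    "aaaandia-1,5,lumhr\n"
)
meta = pd.read_csv(meta_csv, index_col=0, dtype=str)

# Cell names as they would appear in the CNA results; the last one is
# hypothetical and has no row in the metadata CSV.
cna_cells = ["aaaajgij-1", "aaaandia-1", "zzzzzzzz-1"]

# Align metadata to the CNA cell order and label missing cells "unknown".
aligned = meta.reindex(cna_cells).fillna("unknown")
print(aligned)
```

Reading every column as a string keeps numeric cluster IDs categorical, which suits discrete annotation colours.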

Python API — `plot_heatmap_annotated`

import pandas as pd
from copykat_py.plotting import plot_heatmap_annotated

cna = pd.read_csv("sample_copykat_CNA_results.txt", sep="\t")
plot_heatmap_annotated(
    mat           = cna.iloc[:, 3:].values.astype("float32"),
    cell_names    = cna.columns[3:].tolist(),
    chrom_info    = cna.iloc[:, 0].values,
    meta_csv      = "xenium_ft_full_meta_celltype_leiden.csv",
    row_split_col = "inferred_CellType",
    sample_name   = "xenium_all_cells",
    n_cores       = 40,
    output_path   = "xenium_annotated_heatmap.png",
)

Benchmarking and Validation

Validation for 11 datasets from Cancer Cell Atlas (3CA)

Both CopyKAT-R and CopyKAT-Python were tested on raw datasets (no QC filtering) using 24 cores. A total of 11 datasets were used, each providing cell-type composition, aneuploid annotation, and UMAP embeddings in its metadata. Per-sample count matrices were prepared in standard 10X MTX format for the downstream CopyKAT-R vs CopyKAT-Py comparison.

3CA Benchmark Datasets

#	Dataset	Cancer Type	Sample	n_cells	Tumor % (meta)	Ref
1	Gao2021_Breast	Breast cancer	DCIS1	1,480	74.4%	Gao et al. 2021
2	Chen2020_Head-and-Neck	Nasopharyngeal carcinoma	P11	6,890	26.3%	Chen et al. 2020
3	Laughney2020_Lung	Lung adenocarcinoma	RU681	993	77.0%	Laughney et al. 2020
4	Bi2021_Kidney	Renal cell carcinoma (RCC)	P90	8,426	39.4%	Bi et al. 2021
5	Dong2020_Prostate	Prostate cancer	patient #5	8,690	19.3%	Dong et al. 2020
6	Jerby-Arnon2021_Sarcoma	Synovial sarcoma	SyS14	2,522	94.4%	Jerby-Arnon et al. 2021
7	Choudhury2022_Brain	Meningioma	MSC6-BTI	13,171	62.2%	Choudhury et al. 2022
8	Lin2020_Pancreas	PDAC	P08	1,139	74.1%	Lin et al. 2020
9	Lee2020_Colorectal	Colorectal cancer (CRC)	SMC09	2,272	77.9%	Lee et al. 2020
10	Geistlinger2020_Ovarian	HGSOC	T59	12,659	25.1%	Geistlinger et al. 2020
11	Ji2020_Skin	Cutaneous SCC	P4	7,956	53.0%	Ji et al. 2020

Key Metrics Comparison

Side-by-Side Comparison: CopyKAT-R vs CopyKAT-Python

Large-Scale Testing: Xenium Atera Dataset

The full FFPE Human Breast Cancer Xenium (Atera) dataset was subsetted to 50k, 100k, and full (~170k cells) to evaluate scalability.

Runtime Comparison

CNV Heatmap with Annotation

CNV comparison

Why Results May Differ from CopyKAT-R

In the final-prediction comparison above, Seurat cluster 4 was called diploid by CopyKAT-R but aneuploid by CopyKAT-Python. In this case, the CopyKAT-Python call was confirmed by the corresponding H&E cell morphology.

The key difference lies in the final prediction step (step 8), where both implementations perform hierarchical clustering on the adjusted CNA matrix and cut the tree at k=2. CopyKAT-R explicitly passes method = "ward.D" to hclust(), whereas CopyKAT-Python uses the "ward" linkage from scipy/fastcluster, which implements Ward's criterion on Euclidean distances and corresponds to R's ward.D2. For cell clusters with subtle CNV profiles that sit near the diploid/aneuploid boundary (such as Seurat cluster 4 here), the two linkage variants can produce different dendrogram topologies, causing the binary label assignment to flip.
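The effect of the two linkage variants can be explored on synthetic data. The sketch below is not the package's step 8: the matrix is random stand-in data, and R's ward.D is emulated in SciPy by relying on the relationship that ward.D applies the Ward update to the dissimilarities as given, while ward.D2 squares them first, so running "ward" on square-rooted distances reproduces the ward.D merge order.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Synthetic stand-in for an adjusted CNA matrix (cells x genomic bins).
diploid = rng.normal(0.0, 0.05, size=(40, 50))
aneuploid = rng.normal(0.0, 0.05, size=(40, 50))
aneuploid[:, :20] += 0.4           # clear gain over 20 bins
subtle = rng.normal(0.0, 0.05, size=(20, 50))
subtle[:, :20] += 0.1              # weak gain, near the decision boundary
cna = np.vstack([diploid, aneuploid, subtle])

d = pdist(cna, metric="euclidean")

# SciPy "ward" on Euclidean distances: Ward's criterion (R's ward.D2).
z_d2 = linkage(d, method="ward")
# Emulation of R's ward.D: same merge order as "ward" on sqrt(distances).
z_d = linkage(np.sqrt(d), method="ward")

# Cut both trees into two clusters, as in the diploid/aneuploid split.
labels_d2 = fcluster(z_d2, t=2, criterion="maxclust")
labels_d = fcluster(z_d, t=2, criterion="maxclust")

# Cluster IDs are arbitrary, so take the better of the two label matchings.
same = float(np.mean(labels_d2 == labels_d))
agreement = max(same, 1.0 - same)
print(f"fraction of cells with matching 2-way assignment: {agreement:.2f}")
```

Whether the two cuts agree depends on the data; varying the shift applied to the `subtle` group shows how borderline cells can move between the two branches.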

CopyKAT-Python results may not be identical to CopyKAT-R due to differences in:

Gene annotation versions
Filtering and preprocessing steps
Numerical implementation details
Smoothing and segmentation algorithms
Clustering behavior (parDist + hcluster vs. PCA + fastcluster)

High-confidence results typically show:

Clear chromosome-arm or whole-chromosome CNV patterns
Consistent CNV profiles within clusters
Strong separation between inferred diploid and aneuploid cells

Lower-confidence results may occur in samples with:

Weak CNV signal or low sequencing depth
Few normal reference cells
Strong batch effects
Near-diploid tumor genomes

Disclaimer: CopyKAT-Python is an independent reimplementation focused on scalability and usability, while faithfully reproducing the core CopyKAT analytical strategy.
