Data preparation capsule that converts BARseq gene-expression data from MATLAB format to R SingleCellExperiment (SCE) objects for downstream analysis, as part of:
Su, Kosillo, Jung, Chen et al. (2026). Topographic structure and function of locus coeruleus norepinephrine neurons. bioRxiv 2026.04.10.717727
This capsule does not produce manuscript figures. It saves its outputs as two derived data assets, one per subject. The downstream analysis capsule LC-NE_BARseq_MAPseq_analyses (Code Ocean) consumes these assets to generate Figure S5 of the manuscript.
GitHub: https://github.com/AllenNeuralDynamics/LC-NE_BARseq_MAT-RDS_conversion
Code Ocean: https://codeocean.allenneuraldynamics.org/capsule/3953531/tree
Full collection: https://codeocean.allenneuraldynamics.org/collections/9cf044ce-93c7-4c7e-bfa1-5d8c37aa42ec

| File | Description |
|---|---|
| `00_env_library_loading.R` | Loads the r4-base conda environment and core libraries (hdf5r, Matrix, SingleCellExperiment). Provided as a reference for interactive use; not called directly by the run script. |
| `00_conversion_lib.R` | Shared library holding the conversion logic. Defines `convert_v7_filtneurons()` (reads a v7.3 BARseq .mat into a SingleCellExperiment) and `convert_subject()` (end-to-end per-subject pipeline: read input, save initial SCE, clean, save cleaned SCE, Dbh-filter, save filtered SCE). Sourced by the per-subject scripts. |
| `01_BarSeq_RDSconvert_brain3_v2.R` | Per-subject driver for specimen 780345 (brain 3). Sources `00_conversion_lib.R` and calls `convert_subject()`. |
| `01_BarSeq_RDSconvert_brain4_v2.R` | Per-subject driver for specimen 780346 (brain 4). Sources `00_conversion_lib.R` and calls `convert_subject()`. |
| `02_update_metadata.py` | Generates AIND-compliant data_description.json and processing.json for each output folder, and copies peer metadata (acquisition.json, procedures.json, subject.json) from the input asset. Uses aind-data-schema Pydantic models for validation. |
| `run` | Bash entry point for Reproducible Run. Renders each conversion script to an HTML report via `knitr::spin`, then runs the metadata-generation script. |

convert_subject() performs the following steps for each subject:
- Opens the BARseq MATLAB file (`.mat`, HDF5 v7.3 format) using `hdf5r::H5File` and reads the `filt_neurons` group.
- Reconstructs the sparse gene-by-cell count matrix from stored CSC components using `Matrix::sparseMatrix`.
- Extracts per-cell metadata: slice, position, FOV coordinates, angle, depth, barcode status, batch number, CCF coordinates, and CCF annotation.
- Constructs a unique cell identifier (`uid`) from batch, slice, and cell ID.
- Assembles into a `SingleCellExperiment` object and saves as `combined_neurons_clust_CCFv2.rds`.
- Validates uid uniqueness, renames columns by uid, removes placeholder genes (`unused-*`) and duplicate hybridization-cycle genes.
- Saves the cleaned SCE as `combined_neurons_clust_CCFv2_uid.rds`; this is the file consumed by the downstream analysis capsule.
- Filters to putative LC-NE neurons (Dbh expression > 2) and saves as `DBHfilteredneurons_clust_CCFv2_uid.rds`.
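
The core steps above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the capsule's actual code: the CSC vectors, gene names, metadata values, and the underscore-separated uid format are all made up for the example. The real pipeline reads these components from the `filt_neurons` group with hdf5r and wraps the result in a SingleCellExperiment. The duplicate hybridization-cycle cleanup is left out because the rule for identifying duplicates is not described here.

```r
library(Matrix)

# Toy CSC (compressed sparse column) components for a 4-gene x 3-cell matrix.
# All values below are hypothetical.
csc_i <- c(0L, 2L, 1L, 3L, 0L)  # zero-based row index of each non-zero entry
csc_p <- c(0L, 2L, 4L, 5L)      # zero-based column pointers, length = ncol + 1
csc_x <- c(5, 1, 3, 7, 2)       # the non-zero counts
genes <- c("Dbh", "unused-1", "Th", "Slc6a2")

# Rebuild the gene-by-cell matrix; index1 = FALSE marks `i` as zero-based.
counts <- sparseMatrix(i = csc_i, p = csc_p, x = csc_x,
                       dims = c(4L, 3L), index1 = FALSE)
rownames(counts) <- genes

# Build a per-cell uid from batch, slice, and cell ID.
# The "_" separator and field order are assumptions for this sketch.
cell_meta <- data.frame(batch = c(1L, 1L, 2L),
                        slice = c(10L, 10L, 4L),
                        id    = c(1L, 2L, 1L))
uid <- paste(cell_meta$batch, cell_meta$slice, cell_meta$id, sep = "_")
stopifnot(!anyDuplicated(uid))  # uid must be unique before renaming columns
colnames(counts) <- uid

# Drop placeholder genes named unused-*.
keep_gene <- !grepl("^unused-", rownames(counts))
counts_clean <- counts[keep_gene, , drop = FALSE]

# Keep putative LC-NE neurons: Dbh expression strictly greater than 2.
is_lc_ne <- counts_clean["Dbh", ] > 2
counts_dbh <- counts_clean[, is_lc_ne, drop = FALSE]
```

In this toy input only the first cell has Dbh above the threshold, so `counts_dbh` keeps a single column; the third cell, with Dbh equal to 2, is excluded because the comparison is strict.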

| Asset | is_public | Description |
|---|---|---|
| `barseq_780345_2025-02-24_12-00-00` | true | BARseq data for specimen 780345 (brain 3). Contains `BARseq/combined_neurons_clust_CCFv2.mat`. Bucket: aind-open-data. |
| `barseq_780346_2025-06-13_12-00-00` | true | BARseq data for specimen 780346 (brain 4). Contains `BARseq/combined_neurons_clust_CCFv2.mat`. Bucket: aind-open-data. |

Each conversion script writes to a per-subject output folder under /results/, named:
/results/<input_asset_name>_processed_MAT2RDS_<timestamp>/
For example, a run on May 6, 2026 might produce:
/results/barseq_780345_2025-02-24_12-00-00_processed_MAT2RDS_2026-05-06_17-30-00/
/results/barseq_780346_2025-06-13_12-00-00_processed_MAT2RDS_2026-05-06_17-30-00/
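
The naming pattern above can be reproduced with a few lines of base R. This is a sketch only: `make_output_dir()` is a hypothetical helper, not a function from `00_conversion_lib.R`, and the use of UTC for the timestamp is an assumption.

```r
# Hypothetical helper: builds the per-subject output folder path.
# The time zone (UTC) is an assumption; the capsule may use a different one.
make_output_dir <- function(input_asset_name, when = Sys.time(),
                            root = "/results", tz = "UTC") {
  stamp <- format(when, "%Y-%m-%d_%H-%M-%S", tz = tz)
  file.path(root, paste0(input_asset_name, "_processed_MAT2RDS_", stamp))
}

# Example with a fixed timestamp so the result is reproducible.
when <- as.POSIXct("2026-05-06 17:30:00", tz = "UTC")
out_dir <- make_output_dir("barseq_780345_2025-02-24_12-00-00", when)
print(out_dir)
```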
Each output folder contains three .rds files:

| File | Description |
|---|---|
| `combined_neurons_clust_CCFv2.rds` | Initial SingleCellExperiment object before duplicate-gene cleanup |
| `combined_neurons_clust_CCFv2_uid.rds` | Cleaned SCE with unique cell IDs and `unused-*` / duplicate hybridization-cycle genes removed |
| `DBHfilteredneurons_clust_CCFv2_uid.rds` | Cleaned SCE filtered to putative LC-NE neurons (Dbh expression > 2) |

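
Before mounting an output folder downstream, it can be useful to confirm that all three files are present. The check below is a minimal sketch in base R; `missing_outputs()` is a hypothetical helper, and the demo folder is a temporary directory populated with empty placeholder objects rather than real SCE data.

```r
# The three .rds files expected in each output folder.
expected_rds <- c(
  "combined_neurons_clust_CCFv2.rds",
  "combined_neurons_clust_CCFv2_uid.rds",
  "DBHfilteredneurons_clust_CCFv2_uid.rds"
)

# Hypothetical helper: returns the names of expected files that are absent.
missing_outputs <- function(output_dir) {
  expected_rds[!file.exists(file.path(output_dir, expected_rds))]
}

# Demo on a temporary folder holding placeholders for two of the three files.
demo_dir <- file.path(tempdir(), "demo_output")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(list(), file.path(demo_dir, expected_rds[1]))
saveRDS(list(), file.path(demo_dir, expected_rds[2]))

still_missing <- missing_outputs(demo_dir)
print(still_missing)
```

On a real output folder, the cleaned object would then be loaded with `readRDS()` on `combined_neurons_clust_CCFv2_uid.rds`, with the SingleCellExperiment package attached so that its accessors are available.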
After a Reproducible Run of the released capsule, the two output folders are saved as separate AIND-metadata-tagged data assets in aind-open-data, each with a processing.json that points back to this capsule. The downstream analysis capsule mounts these published assets.
Click Reproducible Run in Code Ocean. The run script processes both brains sequentially. Runtime is approximately 10 minutes on a large instance.
Before launching the run, attach the Code Ocean API Credentials Secret to the capsule (Capsule Settings → Credentials). The metadata-generation step queries the Code Ocean API at runtime to record the capsule's release version in each output folder's processing.json. Without the Secret, the conversion still runs end-to-end and produces the RDS files plus data_description.json, subject.json, acquisition.json, and procedures.json; only processing.json is skipped, with a warning. To produce the canonical published derived assets, attach the Secret so that provenance is recorded.
The capsule runs R 4.2.3 in a conda environment (r4-base) with hdf5r, Matrix, and SingleCellExperiment as core dependencies. The full environment is defined in environment/r4-base.yml.
This project is licensed under the MIT License. See LICENSE for details.