AnnData Requirements for Upload

This page contains the requirements for uploading AnnData files to CAP. We recommend using the CAP validator tool to make sure your dataset meets the requirements prior to uploading. See the GitHub repository for installation and usage instructions: cap-validator GitHub repo. The CAP validator Python package can be downloaded from: cap-validator PyPI package page.

Cell Metadata (`obs`)

The file must contain the following fields in obs with the column names exactly as listed below.
These fields may not contain NA values.

Field name	Type	Accepted values
assay	string	This must be the specific assay, not a generic term such as scRNA-seq or 10x sequencing. Accepted value examples: `'10x 3' v2'` or `'Smart-seq2'`
disease	string	This must be the specific disease term. Accepted value examples: `'glioblastoma'` or `'Alzheimer disease'`. For healthy samples use `'normal'` or `'healthy'`
organism	string	This must be the Latin (Genus species) name. Accepted value examples: `'Homo sapiens'` or `'Mus musculus'`
tissue	string	The most accurate anatomical term for where the sample was collected from. Accepted value examples: `'retina'` or `'heart left ventricle'`
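
The following is a minimal sketch of how the four required fields could be checked before upload. It is not the CAP validator itself: the helper name `check_required_obs` and the toy table are illustrative assumptions, and the sketch only tests that each column exists and has no NA values, not that the values are acceptable terms.

```python
import pandas as pd

# Column names taken from the table above.
REQUIRED_OBS_FIELDS = ["assay", "disease", "organism", "tissue"]


def check_required_obs(obs: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the check passed.

    Hypothetical helper for illustration, not part of any CAP package.
    """
    problems = []
    for field in REQUIRED_OBS_FIELDS:
        if field not in obs.columns:
            problems.append(f"missing obs column: {field}")
        elif obs[field].isna().any():
            problems.append(f"obs column contains NA values: {field}")
    return problems


# Toy stand-in for adata.obs; with a real file, pass the obs table
# of the AnnData object loaded with anndata.read_h5ad(...).
obs = pd.DataFrame(
    {
        "assay": ["Smart-seq2", "Smart-seq2"],
        "disease": ["normal", None],  # NA value, not allowed
        "organism": ["Homo sapiens", "Homo sapiens"],
        # "tissue" is deliberately left out
    },
    index=["cell_1", "cell_2"],
)

problems = check_required_obs(obs)
for problem in problems:
    print(problem)
```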

Optional `obs` field: clustering

NOTE: CAP has specific requirements for naming clustering output stored in the AnnData obs field. This convention was adopted so that any saved clustering output can be unambiguously retrieved across all datasets.

Users may save multiple resolutions of clustering (with any algorithmic approach they choose) in the file, as long as the column names adhere to the following rules:

Cluster fields are not required, but if they are present in the dataset, they must be named cluster, leiden or louvain.
If you wish to add a descriptive suffix, the prefix cluster must be used, e.g. cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX].
Examples of acceptable cluster column names are: louvain, cluster_leiden, cluster_louvain_precise3, cluster_leiden_broad etc.
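
One possible reading of these naming rules can be sketched as a regular expression. The pattern below is an assumption made for illustration, not the check that CAP runs: it accepts the bare names `cluster`, `leiden` and `louvain`, and `cluster_` followed by an algorithm name with an optional suffix. The helper name `is_valid_cluster_column` is hypothetical.

```python
import re

# Assumed pattern: bare leiden/louvain/cluster, or
# cluster_<ALGORITHM_TYPE> with an optional _<SUFFIX>.
CLUSTER_COLUMN = re.compile(
    r"leiden|louvain|cluster(?:_[A-Za-z0-9]+(?:_\w+)?)?"
)


def is_valid_cluster_column(name: str) -> bool:
    """Return True if the column name fits the assumed pattern."""
    return CLUSTER_COLUMN.fullmatch(name) is not None


# The four examples listed above, plus one name that adds a suffix
# without the required cluster prefix.
for name in [
    "louvain",
    "cluster_leiden",
    "cluster_louvain_precise3",
    "cluster_leiden_broad",
    "leiden_broad",
]:
    print(name, is_valid_cluster_column(name))
```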

Please refer to the Cell Annotation schema for more details.

Embedding (`obsm`)

At least one embedding, tSNE, UMAP or PCA, is required, and more than one may be included.
The embedding(s) must be saved with the prefix X_, for example: X_tsne, X_pca, X_umap.
The embedding(s) must be stored as an array with shape [n_cells x 2].
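
As a rough illustration, the prefix and shape rules could be checked as follows. A plain dictionary of NumPy arrays stands in for `adata.obsm` here, and the helper name `check_embeddings` is an assumption made for this sketch.

```python
import numpy as np


def check_embeddings(obsm, n_cells):
    """Return a list of problems with the embeddings in obsm.

    Hypothetical helper; obsm is any mapping of name -> array.
    """
    problems = []
    keys = [key for key in obsm if key.startswith("X_")]
    if not keys:
        problems.append("no embedding with the X_ prefix found")
    for key in keys:
        shape = np.asarray(obsm[key]).shape
        if shape != (n_cells, 2):
            problems.append(
                f"{key} has shape {shape}, expected ({n_cells}, 2)"
            )
    return problems


# Toy stand-in for adata.obsm with three cells.
obsm = {
    "X_umap": np.zeros((3, 2)),   # correct shape
    "X_pca": np.zeros((3, 50)),   # too many columns
}
embedding_problems = check_embeddings(obsm, n_cells=3)
print(embedding_problems)
```

If an embedding has more than two columns, one option is to store only its first two components, e.g. `obsm["X_pca"][:, :2]`, although whether that is appropriate depends on the analysis.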

Gene Metadata (`var`)

CAP requires that genes be identified by ENSEMBL terms. These MUST be encoded in the index of var, following the AnnData standard.
var is a pandas.DataFrame object, so the ENSEMBL terms MUST be its row index, i.e. pandas.DataFrame.index.
Note: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support Homo sapiens and Mus musculus. If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.
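
A quick way to spot gene symbols that slipped into the index is to test each entry against the usual ENSEMBL gene ID layout. The sketch below assumes the standard human (`ENSG` plus 11 digits) and mouse (`ENSMUSG` plus 11 digits) forms without a version suffix. The exact pattern CAP checks is not stated here, and `non_ensembl_ids` is a hypothetical helper name.

```python
import re

import pandas as pd

# Assumed layout for the two supported species, without a version suffix.
ENSEMBL_GENE_ID = re.compile(r"ENS(?:MUS)?G\d{11}")


def non_ensembl_ids(var: pd.DataFrame) -> list:
    """Return the index entries of var that do not look like ENSEMBL IDs."""
    return [
        gene for gene in var.index
        if not ENSEMBL_GENE_ID.fullmatch(str(gene))
    ]


# Toy stand-in for adata.var: two ENSEMBL IDs and one gene symbol.
var = pd.DataFrame(
    index=["ENSG00000141510", "ENSMUSG00000059552", "TP53"]
)
bad_ids = non_ensembl_ids(var)
print(bad_ids)
```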

Count Matrix (`X`)

The file must contain a raw count matrix saved as .X or .raw.X.
If the file contains a count matrix in .X, the .raw layer must be empty.
The matrix must be either dense or, if sparse, in Compressed Sparse Row (CSR) format.
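
The storage format rule can be checked, and a sparse matrix converted to CSR, with SciPy. This is a sketch under stated assumptions: the helper name `check_count_matrix` is hypothetical, and treating "raw counts" as non-negative whole numbers is a heuristic introduced here, not a rule taken from this page.

```python
import numpy as np
from scipy import sparse


def check_count_matrix(X):
    """Return a list of problems with a dense or sparse count matrix."""
    problems = []
    if sparse.issparse(X):
        if X.format != "csr":
            problems.append(
                f"sparse matrix is in {X.format} format, expected csr"
            )
        # Stored (non-zero) values; tocoo() works for any sparse format.
        values = X.tocoo().data
    else:
        values = np.asarray(X)
    # Heuristic (assumption): raw counts are non-negative whole numbers.
    if values.size and (
        (values < 0).any() or not np.allclose(values, np.round(values))
    ):
        problems.append("values do not look like raw counts")
    return problems


# Toy matrix stored in CSC format, then converted to CSR.
counts = sparse.csc_matrix(np.array([[0, 3], [1, 0]]))
before = check_count_matrix(counts)
after = check_count_matrix(counts.tocsr())
print(before, after)
```

With a real AnnData object, the converted matrix would be assigned back, e.g. `adata.X = adata.X.tocsr()`, before the file is written.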

Dataset-wide Metadata (`uns`)

There are no requirements for fields in uns for uploading to CAP.
