AnnData Requirements for Upload

This page contains the requirements for uploading AnnData files to CAP.

Cell Metadata (`obs`)

The file must contain the following fields in obs, with column names exactly as listed below.
These fields must not contain NA values.

Field name	Type	Accepted values
assay	string	Must be the specific assay, not a generic term such as scRNA-seq or 10x sequencing. Accepted value examples: `10x 3' v2`, `10x 3' v3`, `Smart-seq2`
disease	string	Must be the specific disease term, e.g. glioblastoma or Alzheimer disease. For healthy samples use `normal` or `healthy`
organism	string	Must be the Latin (Genus species) name, e.g. `Homo sapiens`, `Mus musculus`
tissue	string	The most accurate anatomical term for where the sample was collected. Accepted value examples: `retina`, `heart left ventricle`
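
The required-field rules above can be checked before upload. The following is a minimal sketch, assuming the obs table is available as a pandas.DataFrame (as it is in AnnData). The function name check_required_obs and the wording of the messages are illustrative and not part of CAP.

```python
import pandas as pd

# Required obs columns, taken from the table above.
REQUIRED_OBS_FIELDS = ("assay", "disease", "organism", "tissue")


def check_required_obs(obs: pd.DataFrame) -> list[str]:
    """Return a list of problems found in the obs table (empty if none).

    Hypothetical helper: it only checks that each required column exists
    and has no NA values. It does not validate the values themselves.
    """
    problems = []
    for field in REQUIRED_OBS_FIELDS:
        if field not in obs.columns:
            problems.append(f"missing obs column: {field}")
        elif obs[field].isna().any():
            problems.append(f"obs column contains NA values: {field}")
    return problems
```

With an AnnData object loaded as adata, this would be called as check_required_obs(adata.obs).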

Optional `obs` field: clustering

NOTE: CAP has specific requirements for naming clustering output within the AnnData obs field. This convention was adopted so that any saved clustering output can be unambiguously retrieved across all datasets.

Users may save multiple resolutions of clustering (with any algorithmic approach they choose) in the file, as long as the column names adhere to the following rules:

Cluster fields are not required, but if they are in the dataset they must be denoted with cluster, leiden or louvain.
If you wish to add a descriptive suffix, the prefix cluster must be used, e.g. cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX].
Examples of acceptable cluster column names are: louvain, cluster_leiden, cluster_louvain_precise3, cluster_leiden_broad.
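
One possible reading of the naming rules above, expressed as a regular expression. This is a sketch under assumptions: it accepts the bare names cluster, leiden and louvain, or cluster followed by one or more underscore-separated alphanumeric parts. The exact pattern CAP applies is not stated here, so treat CLUSTER_NAME and is_cluster_column as hypothetical.

```python
import re

# Assumed pattern: "leiden", "louvain", or "cluster" optionally followed by
# "_<part>" segments (algorithm type, then descriptive suffix).
CLUSTER_NAME = re.compile(r"(?:leiden|louvain|cluster(?:_[A-Za-z0-9]+)*)")


def is_cluster_column(name: str) -> bool:
    """Return True if an obs column name fits the assumed naming convention."""
    return CLUSTER_NAME.fullmatch(name) is not None
```

Under this reading, a name such as leiden_broad is rejected because a descriptive suffix requires the cluster prefix.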

Please refer to the Cell Annotation schema for more details.

Embedding (`obsm`)

At least one embedding (tSNE, UMAP or PCA) is required, and more than one may be included.
The embedding(s) must be saved with the prefix X_, for example: X_tsne, X_pca, X_umap.
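
A quick way to confirm the prefix rule is to list the obsm keys that start with X_. The sketch below takes the keys as plain strings. The helper name find_embeddings is illustrative, and whether CAP accepts X_-prefixed keys other than the three examples is not stated here.

```python
from typing import Iterable


def find_embeddings(obsm_keys: Iterable[str]) -> list[str]:
    """Return the obsm keys that carry the required X_ prefix, in order."""
    return [key for key in obsm_keys if key.startswith("X_")]
```

With an AnnData object loaded as adata, find_embeddings(adata.obsm.keys()) should return at least one key.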

Gene Metadata (`var`)

CAP requires that gene names be provided as ENSEMBL terms. These MUST be encoded in the index of var, following the AnnData standard.
var is a pandas.DataFrame object. ENSEMBL terms MUST be used to index its rows, i.e. pandas.DataFrame.index.
Note: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support Homo sapiens and Mus musculus. If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.
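
To spot gene symbols that were left in the var index, the index values can be matched against the ENSEMBL gene ID format. This is a minimal sketch that assumes only the two supported organisms (ENSG for Homo sapiens, ENSMUSG for Mus musculus, each followed by 11 digits) and assumes IDs carry no version suffix such as ".12". Whether CAP accepts versioned IDs is not stated here, and the name non_ensembl_ids is hypothetical.

```python
import re
from typing import Iterable

# Assumed format: human (ENSG) or mouse (ENSMUSG) gene ID, 11 digits, no version.
ENSEMBL_GENE_ID = re.compile(r"ENS(?:MUS)?G\d{11}")


def non_ensembl_ids(var_index: Iterable[str]) -> list[str]:
    """Return the index entries that do not look like ENSEMBL gene IDs."""
    return [gene for gene in var_index if ENSEMBL_GENE_ID.fullmatch(str(gene)) is None]
```

With an AnnData object loaded as adata, non_ensembl_ids(adata.var.index) should return an empty list.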

Count Matrix (`X`)

The file must contain a raw count matrix saved as .X or .raw.X.
If the file contains a count matrix in .X, the .raw layer must be empty.
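
A common heuristic for telling raw counts from normalized values is to check that every entry is a finite, non-negative whole number. The sketch below uses NumPy. It is only a heuristic, not CAP's own validation, and the name looks_like_raw_counts is illustrative.

```python
import numpy as np


def looks_like_raw_counts(values) -> bool:
    """Heuristic: raw counts are finite, non-negative whole numbers."""
    arr = np.asarray(values, dtype=float)
    return bool(
        np.all(np.isfinite(arr))
        and np.all(arr >= 0)
        and np.all(arr == np.floor(arr))
    )
```

For a dense matrix this can be called on adata.X directly. For a SciPy sparse matrix in CSR or CSC format, passing its .data attribute checks only the stored entries, which is sufficient for this test.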

Dataset-wide Metadata (`uns`)

There are no requirements for fields in uns for uploading to CAP.
