Cell Annotation Platform

AnnData Requirements for Upload

This page contains the requirements for uploading AnnData files to CAP.

Cell Metadata (obs)

  • The file must contain the following fields in obs with the column names exactly as listed below.
  • These fields may not contain NA values.
Field name (type): accepted values

  • assay (string): This must be the specific assay, not a generic term such as scRNA-seq or 10x sequencing. Accepted value examples: '10x 3' v2' or 'Smart-seq2'.
  • disease (string): This must be the specific disease term. Accepted value examples: 'glioblastoma' or 'Alzheimer disease'. For healthy samples use 'normal' or 'healthy'.
  • organism (string): This must be the Latin (Genus species) name. Accepted value examples: 'Homo sapiens' or 'Mus musculus'.
  • tissue (string): The most accurate anatomical term for where the sample was collected from. Accepted value examples: 'retina' or 'heart left ventricle'.
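
The obs requirements above can be checked before upload. The following is a minimal sketch, not CAP's own validator: the helper name check_required_obs is hypothetical, and it only tests that the four columns exist and contain no NA values, not that the values are acceptable terms.

```python
import pandas as pd

# The four obs columns required by this page.
REQUIRED_OBS_FIELDS = ("assay", "disease", "organism", "tissue")


def check_required_obs(obs: pd.DataFrame) -> list:
    """Return a list of problems found in the required obs columns.

    Hypothetical helper: an empty list means every required column
    is present and free of NA values.
    """
    problems = []
    for field in REQUIRED_OBS_FIELDS:
        if field not in obs.columns:
            problems.append(f"missing obs column: {field}")
        elif obs[field].isna().any():
            problems.append(f"obs column contains NA values: {field}")
    return problems


# Example with a small, made-up obs table.
example_obs = pd.DataFrame(
    {
        "assay": ["Smart-seq2", "Smart-seq2"],
        "disease": ["normal", None],  # NA value, should be reported
        "organism": ["Homo sapiens", "Homo sapiens"],
        # "tissue" is deliberately missing
    }
)
print(check_required_obs(example_obs))
```

With an AnnData object loaded as adata, the same function could be called as check_required_obs(adata.obs).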

Optional obs field: clustering

  • NOTE: CAP has specific requirements for naming clustering output columns in the AnnData obs field. This convention was adopted so that any saved clustering output can be unambiguously retrieved across all datasets.

Users may save multiple resolutions of clustering (with any algorithmic approach they choose) in the file, as long as the column names adhere to the following rules:

  • Cluster fields are not required, but if they are present in the dataset they must be named with cluster, leiden or louvain.
  • If you wish to add a descriptive suffix, the prefix cluster must be used, e.g. cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX].
  • Examples of acceptable cluster column names are: louvain, cluster_leiden, cluster_louvain_precise3, cluster_leiden_broad etc.
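
One possible reading of these naming rules can be expressed as a regular expression. This is a sketch under assumptions: the pattern and the helper is_valid_cluster_name are illustrative, the allowed characters in each token (letters and digits) are a guess, and the Cell Annotation schema remains the authoritative definition.

```python
import re

# Assumed interpretation of the rules above:
#   - the bare names "cluster", "leiden" or "louvain", or
#   - "cluster" followed by one or more "_token" parts,
#     e.g. cluster_[ALGORITHM_TYPE] or cluster_[ALGORITHM_TYPE]_[SUFFIX].
CLUSTER_NAME = re.compile(r"(?:leiden|louvain|cluster(?:_[A-Za-z0-9]+)*)")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the column name fits the assumed pattern."""
    return CLUSTER_NAME.fullmatch(name) is not None


# The examples listed on this page, plus two names that do not fit.
for name in [
    "louvain",
    "cluster_leiden",
    "cluster_louvain_precise3",
    "cluster_leiden_broad",
    "leiden_res1",   # suffix without the cluster prefix
    "my_clusters",
]:
    print(name, is_valid_cluster_name(name))
```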

Please refer to the Cell Annotation schema for more details.

Embedding (obsm)

  • At least one embedding (tSNE, UMAP or PCA) is required, and more than one may be included.
  • The embedding(s) must be saved with the prefix X_, for example: X_tsne, X_pca, X_umap.
  • The embedding(s) must be stored as an array of shape [n_cells x 2].
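
The embedding rules above can be sketched as a small check. The helper check_embeddings is hypothetical, it treats "array" as a NumPy array (an assumption), and it accepts any mapping of names to arrays, such as adata.obsm.

```python
import numpy as np


def check_embeddings(obsm, n_cells: int) -> list:
    """Return a list of problems found in the X_-prefixed embeddings.

    Hypothetical helper: `obsm` is any mapping from key to array.
    """
    problems = []
    keys = [key for key in obsm.keys() if key.startswith("X_")]
    if not keys:
        problems.append("no embedding with the X_ prefix found in obsm")
    for key in keys:
        emb = obsm[key]
        if not isinstance(emb, np.ndarray):
            problems.append(f"{key} is not a NumPy array")
        elif emb.shape != (n_cells, 2):
            problems.append(
                f"{key} has shape {emb.shape}, expected ({n_cells}, 2)"
            )
    return problems


# Example with made-up data for 5 cells.
example_obsm = {
    "X_umap": np.zeros((5, 2)),   # passes
    "X_pca": np.zeros((5, 50)),   # flagged: more than two columns
}
print(check_embeddings(example_obsm, n_cells=5))
```

Under this check, an X_pca array with more than two components is reported, so it would need to be reduced to two columns or left out in favour of another embedding.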

Gene Metadata (var)

  • CAP requires that gene names be provided as ENSEMBL terms. These MUST be encoded in the index of var, following the AnnData standard.
  • var is a pandas.DataFrame object. ENSEMBL terms MUST be used to index these rows, i.e. pandas.DataFrame.index.
  • Note: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support Homo sapiens and Mus musculus. If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.
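
As an illustration, the var index can be screened for entries that do not look like ENSEMBL gene IDs. This sketch assumes the two supported organisms and the usual ENSEMBL stable ID layout (ENSG or ENSMUSG followed by 11 digits); the helper non_ensembl_ids is hypothetical, and whether CAP accepts IDs with a version suffix such as ".5" is not stated on this page, so the sketch flags them.

```python
import re

# Assumed pattern: human (ENSG) or mouse (ENSMUSG) gene IDs, no version suffix.
ENSEMBL_GENE_ID = re.compile(r"ENS(?:MUS)?G\d{11}")


def non_ensembl_ids(var_index) -> list:
    """Return the index entries that do not match the assumed ID pattern."""
    return [
        str(gene)
        for gene in var_index
        if ENSEMBL_GENE_ID.fullmatch(str(gene)) is None
    ]


# Example: two well-formed IDs and one gene symbol.
example_index = ["ENSG00000141510", "ENSMUSG00000059552", "TP53"]
print(non_ensembl_ids(example_index))
```

With an AnnData object loaded as adata, the call would be non_ensembl_ids(adata.var.index).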

Count Matrix (X)

  • The file must contain a raw count matrix saved as .X or .raw.X.
  • If the file contains a count matrix in .X, the .raw layer must be empty.
  • The matrix must be either dense or, if sparse, in Compressed Sparse Row (CSR) format.
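
The matrix format rule can be sketched as follows. The helper check_count_matrix is hypothetical. It treats a NumPy array as dense and recognises a sparse matrix by its format attribute (SciPy sparse matrices expose one), and the test that raw counts are non-negative whole numbers is a heuristic assumed here, not a rule stated on this page.

```python
import numpy as np


def check_count_matrix(X) -> list:
    """Return a list of problems found in a candidate count matrix."""
    if isinstance(X, np.ndarray):
        values = X
    elif getattr(X, "format", None) == "csr":
        # For a CSR matrix, .data holds the explicitly stored entries.
        values = np.asarray(X.data)
    else:
        return ["matrix must be a dense array or a CSR sparse matrix"]

    problems = []
    # Heuristic (assumption): raw counts are non-negative whole numbers.
    if values.size and (
        np.any(values < 0) or np.any(values != np.round(values))
    ):
        problems.append("matrix does not look like raw counts")
    return problems


# Example with made-up data.
print(check_count_matrix(np.array([[0, 3], [1, 0]])))          # no problems
print(check_count_matrix(np.array([[0.0, 1.5], [2.2, 0.0]])))  # flagged
```

A sparse matrix in another layout can usually be converted with its tocsr() method before saving.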

Dataset-wide Metadata (uns)

  • There are no requirements for fields in uns for uploading to CAP.