AnnData Requirements for Upload
This page contains the requirements for uploading AnnData files to CAP.
Cell Metadata (`obs`)
- The file must contain the following fields in `obs`, with the column names exactly as listed below.
- These fields may not contain NA values.
Field name | Type | Accepted values |
---|---|---|
assay | string | This must be the specific assay, not a generic term such as scRNA-seq or 10x sequencing. Accepted value examples: |
disease | string | This must be the specific disease term. Accepted value examples: |
organism | string | This must be the Latin (Genus species) name. Accepted value examples: |
tissue | string | The most accurate anatomical term for where the sample was collected from. Accepted value examples: |
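
The two rules above (all four columns present, no NA values) can be checked before upload. The following is a minimal sketch, not CAP's own validator: `check_required_obs` is a hypothetical helper name, and the toy values are placeholders rather than accepted terms. It assumes `obs` is the `pandas.DataFrame` stored in `adata.obs`.

```python
import pandas as pd

# Required obs columns, taken from the table above.
REQUIRED_OBS_FIELDS = ("assay", "disease", "organism", "tissue")


def check_required_obs(obs: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means both rules are satisfied."""
    problems = []
    for field in REQUIRED_OBS_FIELDS:
        if field not in obs.columns:
            problems.append(f"missing obs column: {field}")
        elif obs[field].isna().any():
            problems.append(f"obs column contains NA values: {field}")
    return problems


# Toy table with placeholder values: 'disease' has an NA and 'tissue' is absent.
obs = pd.DataFrame(
    {
        "assay": ["<specific assay>", "<specific assay>"],
        "disease": ["<specific disease>", None],
        "organism": ["Homo sapiens", "Homo sapiens"],
    }
)
problems = check_required_obs(obs)
print(problems)
```

With a real file, the same function would be called as `check_required_obs(adata.obs)`. It only checks presence and NA values; whether each value is an accepted term is not verified here.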
Optional `obs` field: clustering
- NOTE: CAP has specific requirements for naming clustering output stored in the AnnData `obs` field. This convention was adopted so that any saved clustering output can be unambiguously retrieved across all datasets.
Users may save multiple resolutions of clustering (with any algorithmic approach they choose) in the file, as long as the column names adhere to the following rules:
- Cluster fields are not required, but if they are in the dataset they must be denoted with `cluster`, `leiden` or `louvain`.
- If you wish to add a descriptive suffix, the prefix `cluster` must be used, e.g. `cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX]`
- Examples of acceptable cluster column names are: `louvain`, `cluster_leiden`, `cluster_louvain_precise3`, `cluster_leiden_broad`, etc.
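
One possible reading of the naming rules above can be written as a regular expression. The pattern below is an assumption, not CAP's actual validator: it treats `[ALGORITHM_TYPE]` and `[SUFFIX]` as single alphanumeric tokens and is case-sensitive. `is_valid_cluster_column` is a hypothetical helper name.

```python
import re

# Either a bare keyword, or 'cluster_<algorithm>' with an optional '_<suffix>'.
# Assumption: algorithm and suffix are single alphanumeric tokens.
CLUSTER_COLUMN = re.compile(
    r"(?:cluster|leiden|louvain)"                # bare names
    r"|cluster_[A-Za-z0-9]+(?:_[A-Za-z0-9]+)?"   # prefixed, optional suffix
)


def is_valid_cluster_column(name: str) -> bool:
    """True if the column name fits this sketch's reading of the rules."""
    return CLUSTER_COLUMN.fullmatch(name) is not None


# The acceptable examples from the list above, plus one name that adds a
# suffix without the required 'cluster' prefix.
for name in ["louvain", "cluster_leiden", "cluster_louvain_precise3",
             "cluster_leiden_broad", "leiden_broad"]:
    print(name, is_valid_cluster_column(name))
```

Under this reading, `leiden_broad` is rejected because a descriptive suffix requires the `cluster` prefix. If CAP allows underscores inside a suffix, the pattern would need to be loosened.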
Please refer to the Cell Annotation schema for more details.
Embedding (`obsm`)
- At least one embedding, tSNE, UMAP or PCA, is required, and more than one may be included.
- The embedding(s) must be saved with the prefix `X_`, for example: `X_tsne`, `X_pca`, `X_umap`.
- The embedding(s) must be stored as an array and be in [n_cells x 2] shape.
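
The embedding rules above can be sketched as a small check over the `obsm` mapping. This is an illustration under stated assumptions: `check_embeddings` is a hypothetical helper, any key starting with `X_` is treated as an embedding, and "array" is taken to mean a `numpy.ndarray`.

```python
import numpy as np


def check_embeddings(obsm, n_cells: int) -> list[str]:
    """Return a list of problems found in an obsm-like mapping of name -> array."""
    problems = []
    keys = [key for key in obsm.keys() if key.startswith("X_")]
    if not keys:
        problems.append("no embedding with the X_ prefix found in obsm")
    for key in keys:
        arr = obsm[key]
        if not isinstance(arr, np.ndarray):
            problems.append(f"{key}: not a NumPy array")
        elif arr.shape != (n_cells, 2):
            problems.append(f"{key}: expected shape {(n_cells, 2)}, got {arr.shape}")
    return problems


# Toy example with 5 cells: one 2-column embedding and one 50-column array.
obsm = {"X_umap": np.zeros((5, 2)), "X_pca": np.zeros((5, 50))}
problems = check_embeddings(obsm, n_cells=5)
print(problems)
```

With a real file, this would be called as `check_embeddings(adata.obsm, adata.n_obs)`.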
Gene Metadata (`var`)
- CAP requires that gene names be provided as ENSEMBL terms. These MUST be encoded in the index of the `var` field, following the AnnData standard. `var` is a `pandas.DataFrame` object. ENSEMBL terms MUST be used to index these rows, i.e. `pandas.DataFrame.index`.
- Note: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support `Homo sapiens` and `Mus musculus`. If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.
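
A quick way to spot gene symbols left in the `var` index is to match each entry against the ENSEMBL gene ID format. The sketch below is an assumption-laden illustration, not CAP's validator: it covers only the two supported species (`ENSG` and `ENSMUSG` prefixes followed by 11 digits), treats versioned IDs (with a trailing `.N`) as non-matching, and `non_ensembl_ids` is a hypothetical helper name.

```python
import re

# Human gene IDs start with ENSG, mouse gene IDs with ENSMUSG; both are
# followed by 11 digits. Version suffixes are not accepted in this sketch.
ENSEMBL_GENE_ID = re.compile(r"ENS(?:MUS)?G\d{11}")


def non_ensembl_ids(var_index) -> list[str]:
    """Return the index entries that do not look like ENSEMBL gene IDs."""
    return [str(g) for g in var_index if not ENSEMBL_GENE_ID.fullmatch(str(g))]


# Two well-formed IDs and one gene symbol.
var_index = ["ENSG00000141510", "ENSMUSG00000059552", "TP53"]
bad = non_ensembl_ids(var_index)
print(bad)
```

With a real file, this would be called as `non_ensembl_ids(adata.var.index)`; an empty list means every row is indexed by a well-formed ID under the assumptions above.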
Count Matrix (`X`)
- The file must contain a raw count matrix saved as `.X` or `raw.X`.
- If the file contains a count matrix in `.X`, the `.raw` layer must be empty.
- The matrix must be a dense matrix or, if sparse, in Compressed Sparse Row (CSR) format.
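
The storage-format rule above can be checked with SciPy, as in the following sketch. `check_count_matrix` is a hypothetical helper. The "raw counts" test is a heuristic of this example (non-negative whole numbers), not a rule stated by CAP.

```python
import numpy as np
from scipy import sparse


def check_count_matrix(matrix) -> list[str]:
    """Return a list of problems for a dense or SciPy sparse matrix."""
    problems = []
    if sparse.issparse(matrix):
        if matrix.format != "csr":
            problems.append(
                f"sparse matrix is in {matrix.format.upper()} format, expected CSR"
            )
        # Stored (non-zero) values only; implicit zeros are valid counts.
        values = matrix.tocoo().data
    else:
        values = np.asarray(matrix)
    # Heuristic (assumption): raw counts are non-negative whole numbers.
    if values.size and (values.min() < 0 or not np.all(values == np.round(values))):
        problems.append("matrix does not look like raw counts")
    return problems


counts = np.array([[0, 1], [2, 0]])
print(check_count_matrix(counts))                      # dense input
print(check_count_matrix(sparse.csr_matrix(counts)))   # CSR input
print(check_count_matrix(sparse.csc_matrix(counts)))   # CSC input
```

A matrix in another sparse format can be converted with `matrix.tocsr()` before saving. With a real file, the function would be applied to `adata.X` or `adata.raw.X`, whichever holds the raw counts.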
Dataset-wide Metadata (`uns`)
- There are no requirements for fields in uns for uploading to CAP.