Cell Annotation Metadata Terms

Annotating cell types and states – identifying them as separable entities and naming them as either known entities or new ones – is a cornerstone of biological research. The annotation of a cell state is a critical level of abstraction that structures our understanding of biology.

Currently, the annotation of cell types and states within single-cell datasets is rather ad hoc: it normally amounts to a single string associated with each cell within standard bioinformatic files such as Seurat or AnnData. Such an approach is not standardized enough to create reliable large-scale cell atlases within the Human Cell Atlas. Researchers disagree on the definitions of the cell labels used, the molecular definitions of the biological entities, and the relative precision of terms across datasets. There are also questions of bioinformatic transparency, including where these cell annotations came from and what precisely they mean.

Here, we promote a standard so that cell annotations can become more transparent for downstream analysis. This is required for publishing cell annotations on CAP. Information on how this cell annotation metadata is encoded within bioinformatic files can be found in the schema here: Cell Annotation Schema

Entering Cell Annotation Metadata

Cell annotation metadata may be entered on the 'Edit Dataset' page. See the Entering cell annotation metadata documentation for more information about how to enter and edit the cell annotation metadata.

Cell Label

The preferred text used by the author to annotate this cell type. This can be any free-text term that the author uses to annotate cells.

Abbreviations are permitted in this field; authors may annotate their cells using any label they wish. For example, in the 'Cell Label' field, the terms 'LC' or 'luminal cell' would both be acceptable.

Reserved terms

A few keywords are reserved for special cases when annotating cells. Users should use these terms where applicable.

  • Doublets: The term 'doublets' is reserved for encoding cells defined as doublets based on some computational analysis. By “doublets”, we refer to the sequencing artifact within droplet-based protocols whereby two or more cells are tagged with the same barcode.

  • Junk: The term 'junk' is reserved for encoding cells that failed sequencing (and QC filtering) for some reason, e.g. few genes detected or a high fraction of mitochondrial reads.

  • Unknown: The term 'unknown' is specifically reserved for cells which the author did not know how to annotate with a biological entity. It is a generic term meaning “I do not know”.
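
The reserved terms above lend themselves to a simple check before labels are processed further. The following is a minimal sketch in Python; the helper names are hypothetical, and the case-insensitive matching is an assumption, not a rule stated by CAP.

```python
# Reserved labels named in this section. Matching is case-insensitive here,
# which is an assumption for illustration rather than a documented rule.
RESERVED_LABELS = {"doublets", "junk", "unknown"}


def is_reserved_label(cell_label: str) -> bool:
    """Return True if the label is one of the reserved terms."""
    return cell_label.strip().lower() in RESERVED_LABELS


def split_reserved(labels):
    """Separate author-chosen biological labels from reserved ones."""
    biological = [label for label in labels if not is_reserved_label(label)]
    reserved = [label for label in labels if is_reserved_label(label)]
    return biological, reserved


biological, reserved = split_reserved(["LC", "luminal cell", "junk", "unknown"])
```

Separating the two groups early makes it easier to exclude artifacts such as doublets from any step that expects a biological entity.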

Cell Full name

The full-length name for the term used in 'Cell Label'. Abbreviations are not permitted. This must be the full-length name for the biological entity listed in 'Cell Label' by the author.

In most cases, the full name must equal the name of the ontology term in the Ontology Lookup Service (see the Ontology term section).

Synonyms

A comma-separated list of terms the author considers to be exactly or nearly the same as the value defined in the 'Cell Label' field. For example, for the term 'glial cell' a user could state the following synonyms: 'neuroglia, neuroglial cell'.
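
A comma-separated field like this one can be split into individual terms with a few lines of Python. This is a minimal sketch; the function name `parse_comma_list` is hypothetical, and trimming whitespace and dropping empty items are assumptions about how such a value might be cleaned.

```python
def parse_comma_list(value: str) -> list[str]:
    """Split a comma-separated field into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


# The synonym example given above for the term 'glial cell'.
synonyms = parse_comma_list("neuroglia, neuroglial cell")
```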

Ontology term

The term from the Ontology Lookup Service (OLS) which corresponds to the given cell type annotation. In contrast to 'Cell Label', this field must exactly match the biological entity given in OLS. When the cell type is not present in OLS, this field becomes a suggestion for a new OLS entity. To support both existing and missing terms, the ontology term is not a single field but a combination of the following fields:

Ontology term exists

A Boolean flag which shows whether the given cell type exists in OLS.

Ontology term id

The id of the cell type from OLS. When the term is not present in OLS, this id must point to the closest possible parent existing in OLS.

Ontology term

The full name of the cell type from OLS. When the OLS term is missing, this must equal the name of the term which the ontology term id points to (i.e. the closest existing parent).
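
The way the three ontology fields work together can be sketched as a small Python record. The class, attribute, and function names below are hypothetical and are not taken from the Cell Annotation Schema. Treating 'non-GABAergic non-glycinergic amacrine cell' as absent from OLS is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class OntologyTerm:
    exists: bool   # 'Ontology term exists'
    term_id: str   # 'Ontology term id'
    name: str      # 'Ontology term'


def describe(term: OntologyTerm, cell_full_name: str) -> str:
    """Summarize what the three fields say about an annotation."""
    if term.exists:
        # The id and name identify the annotated cell type itself.
        return f"{term.name} ({term.term_id}) exists in OLS"
    # The id and name identify the closest existing parent instead.
    return (
        f"'{cell_full_name}' is suggested as a new term "
        f"under {term.name} ({term.term_id})"
    )


existing = OntologyTerm(exists=True, term_id="CL:0000561", name="amacrine cell")
proposed = OntologyTerm(exists=False, term_id="CL:0000561", name="amacrine cell")

summary_existing = describe(existing, "amacrine cell")
summary_proposed = describe(proposed, "non-GABAergic non-glycinergic amacrine cell")
```

Note that the id and name are identical in both records; only the flag tells a reader whether they describe the cell type itself or its closest parent.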

Category

The category term denotes a biological entity which the author considers the nearest "class" or "broader term" (or "parent term") for the annotated cell type. Much like the field “Cell Term”, the category term may exist in the Cell Ontology or it may be a new term to be added to the Cell Ontology.

The parent term or category is normally the term directly above the specified cell type in the Cell Ontology hierarchy, which can be found in the Ontology Lookup Service tree view. For example, for the term 'glycinergic amacrine cell' the parent term would be 'amacrine cell'.

Evidence for Annotations

Marker Gene Evidence

The list of gene names which are explicitly used as evidence supporting the assignment of this cell annotation. Because this evidence is derived from the data itself, it must be recorded as a comma-separated list of gene names that exist within the bioinformatic file.
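
The requirement that every listed gene exists within the file can be checked mechanically. The sketch below is one hedged way to do so in Python. The function name is hypothetical, the gene list standing in for the file's contents is made up for illustration, and exact, case-sensitive matching is an assumption.

```python
def missing_marker_genes(marker_gene_evidence: str, file_gene_names) -> list[str]:
    """Return the listed marker genes that are absent from the file's gene names."""
    genes = [g.strip() for g in marker_gene_evidence.split(",") if g.strip()]
    known = set(file_gene_names)
    return [g for g in genes if g not in known]


# Illustrative stand-in for the gene names stored in a dataset file.
file_gene_names = ["GZMB", "IGKC", "SERPINF1", "GNLY", "NKG7"]

missing = missing_marker_genes("GZMB, IGJ, IGKC, SERPINF1", file_gene_names)
```

For an AnnData object, the gene names would typically be read from `adata.var_names`; an empty result means every listed gene was found.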

Canonical Marker Genes

The list of gene names of “legacy markers” or “known markers” for this entity, i.e. gene names widely recognized as defining this cell type (or cell state) using transcriptomics. This must be recorded as a comma-separated list of gene names. The meaning of this field differs from “Marker Gene Evidence”, which refers explicitly to genes identified through analysis of the data itself.

For example, researchers could list “GNLY, NKG7” as canonical markers for “Natural killer (NK) cells”. “IL7R, S100A4” may be listed as canonical markers of “Memory CD4+ cells”.

Rationale

A free-text statement communicating the user's rationale for their cell annotation. Justification and/or evidence for this, including citations, is encouraged. Users should explain why they chose this cell annotation, how they derived these cell annotations, and what the cell annotation means (i.e. the identity and function of the cell type or cell state).

Given this is free-text, the explanations/rationales will primarily be read by other researchers. We encourage researchers to provide as much context as possible. Such context is critical for resolving differences between cell annotations across publications and research groups.

For example, a user could provide the following informative rationale:

These cells were annotated as “plasmacytoid dendritic cells (pDCs)” upon running differential expression with Seurat v5 using default parameters after standard pre-processing and Leiden clustering. The differentially expressed genes of this cluster lacked key markers used to identify B cells, T cells, NK cells, or monocytes. More relevantly, this cluster expressed ‘GZMB’, ‘IGJ’, ‘IGKC’, and ‘SERPINF1’, which corresponds to the subcluster used to identify pDCs in Villani et al. (2017), doi: 10.1126/science.aah4573.

Rationale DOI

Comma-separated list of DOIs corresponding to the publications cited as justification for cell annotations in the 'rationale' field. For example, for 'chodl neurons' a user could list the following DOIs: 10.7554/eLife.59928, 10.7554/eLife.59928
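
A list of DOIs in this form can be split, checked, and de-duplicated before it is stored. The following Python sketch is illustrative only. The function name is hypothetical, and the regular expression is a loose approximation of common DOI shapes, not an official validation rule.

```python
import re

# Loose pattern: the "10." prefix, a numeric registrant code, a slash, a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def parse_rationale_dois(value: str) -> list[str]:
    """Split a comma-separated DOI field, reject malformed items, drop repeats."""
    seen = set()
    dois = []
    for item in value.split(","):
        doi = item.strip()
        if not doi:
            continue
        if not DOI_PATTERN.match(doi):
            raise ValueError(f"not a DOI: {doi!r}")
        key = doi.lower()  # DOIs are case-insensitive
        if key not in seen:
            seen.add(key)
            dois.append(doi)
    return dois


dois = parse_rationale_dois("10.7554/eLife.59928, 10.1126/science.aah4573")
```

Dropping repeats keeps the stored list short when the same publication is entered more than once.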

Cell Ontology Assessment

Optional free-text field to express any suggestions for improving any aspect of the Cell Ontology concerning this specific cell annotation. Disagreements with any aspect of the Cell Ontology should be noted here for ontology curators to review.

For example, a user could add additional information such as:

The CL term 'amacrine cell' (CL:0000561) should have four child terms: glycinergic, GABAergic, GABAergic glycinergic, and non-GABAergic non-glycinergic amacrine cells. Currently, this distinction is not clear.

or

A synonym listed for this annotation is 'T cell of appendix' (CL:0009031). It’s unclear how this is functionally different from other classes of mature or immature T cells.

or

The definition provided by CL for the 'retinal melanocyte' (CL:0002485) does not clearly distinguish it from cells of the 'retinal pigment epithelium' (RPE). The former should be distinguished by the uveal tract.