A longstanding goal in neuroscience is to classify the brain’s many cells into discrete categories according to their function.
Such categories can help researchers understand the complex neural circuits that ultimately give rise to behavior and disease.
However, there’s little consensus about what metrics should define a cell’s identity.
In a new study, a collaboration born in part from the Neural Systems & Behavior (NS&B) course at the Marine Biological Laboratory tests the notion that a cell’s identity can be described solely by the genes it expresses.
The study, published in Proceedings of the National Academy of Sciences, advocates a more “multimodal” approach to defining cell identity.
By using popular and powerful RNA sequencing techniques, researchers can take a snapshot of all the genes that are currently turned on inside a cell.
But it’s becoming increasingly clear that such strategies may be limited in their ability to give a complete picture of cell identity or to capture changes over time.
Along with their collaborators, NS&B instructors Hans Hofmann, David Schulz, and Eve Marder put two popular RNA-based methods to the test: single-cell RNA sequencing and quantitative RT-PCR.
They applied these techniques to two well-studied nerve clusters in the crab Cancer borealis – the stomatogastric and cardiac ganglia – which allowed them to compare the results from the RNA-based approaches to other known metrics of cell identity.
They found that the cell identities generated by the complete RNA profiles, or “transcriptomes,” did not match the existing cell identities they had compiled over years of observation. In fact, categorizing cells based on their entire transcriptome ultimately yielded “scrambled” identities.
However, as the researchers further refined their selection of key genes to input into their analysis, the RNA profiles began to more closely resemble the identities gleaned from other attributes, such as innervation patterns, morphology, and physiology.
Thus, a multimodal approach that combines gene expression with these other attributes has the potential to give a more accurate portrayal of cell identity than RNA sequencing alone.

The stomatogastric ganglion in the Jonah crab (Cancer borealis), tagged with GFP. This nerve cluster helps the crab chew and filter food. Image is credited to Adam Northcutt.
According to Hofmann, most studies don’t bother to validate transcriptomic data with other metrics of cell identity like morphology and physiology.
“Classification and characterization of cell types is often performed within the context of specific studies, and not based on a systematic approach,” he says.
“We really have to collect a lot of additional data, even across species, to come up with a robust taxonomy of cell types.”
“RNA sequencing is tremendously promising and powerful, but this study provides a valuable and necessary check,” Schulz adds.
“Rather than relying entirely on analytics applied blindly to cell type, whenever possible it’s important to consider multiple modalities of information as well.”
The trick, Hofmann and Schulz agree, is knowing which data are indicative of cell identity, and which are simply noise that will interfere with accurate classification.
Researchers must also eventually agree on a definition of “cell identity.” Drawing firm boundaries between cell types is useful in many ways, but may ultimately be problematic.
“Soon,” Schulz says, “we’ll start to see the limitations of trying to impose very discrete categories on the spectrum of cell types within and across individuals.”
Genes act in concert to maintain a cell’s identity as a specific cell type, to respond to external signals, and to carry out complex cellular activities such as replication and metabolism. Coordinating the necessary genes for these functions is frequently achieved through transcriptional co-regulation, where genes are induced together as a gene expression program (GEP) in response to the appropriate internal or external signal (Eisen et al., 1998; Segal et al., 2003).
By enabling unbiased measurement of the whole transcriptome, profiling technologies such as RNA-Seq are paving the way for systematically discovering GEPs and shedding light on the biological mechanisms that they govern (Liberzon et al., 2015).
Single-cell RNA-Seq (scRNA-Seq) has greatly enhanced our potential to resolve GEPs by making it possible to observe variation in gene expression over many individual cells. Even so, inferring GEPs remains challenging as scRNA-Seq data is noisy and high-dimensional, requiring computational approaches to uncover the underlying patterns.
In addition, technical artifacts such as doublets (where two or more distinct cells are mistakenly collapsed into one) can confound analysis. Methodological advances in dimensionality reduction, clustering, lineage trajectory tracing, and differential expression analysis have helped overcome some of these issues (Amir et al., 2013; Kharchenko et al., 2014; Satija et al., 2015; Trapnell et al., 2014).
Here, we focus on a key challenge of inferring expression programs from scRNA-Seq data: the fact that individual cells may express multiple GEPs but we only detect cellular expression profiles that reflect their combination, rather than the GEPs themselves. A cell’s gene expression is shaped by many factors including its cell type, its state in time-dependent processes such as the cell cycle, and its response to varied environmental stimuli (Wagner et al., 2016).
We group these into two broad classes of expression programs that can be detectable in scRNA-Seq data: (1) GEPs that correspond to the identity of a specific cell type such as hepatocyte or melanocyte (identity programs) and (2) GEPs that are expressed independently of cell type, in any cell that is carrying out a specific activity such as cell division or immune cell activation (activity programs).
In this formulation, identity programs are expressed uniquely in cells of a specific cell type, while activity programs may vary dynamically in cells of one or multiple types and may be continuous or discrete.
Thus far, the vast majority of scRNA-Seq studies have focused on systematically identifying and characterizing the expression programs of the cell types composing a given tissue, that is, identity GEPs. Substantially less progress has been made in identifying activity GEPs, and what progress there is has come primarily through direct manipulation of cells in controlled experiments, for example comparing stimulated and unstimulated neurons (Hrvatin et al., 2018) or cells pre- and post-viral infection (Steuerman et al., 2018).
If a subset of cells profiled by scRNA-Seq expresses a given activity GEP, there is a potential to directly infer the program from the data without the need for controlled experiments. However, this can be significantly more challenging than ascertaining identity GEPs; while some cells may have expression profiles that are predominantly the output of an identity program, activity programs will always be expressed alongside the identity programs of one or frequently many cell types.
Thus, while finding the average expression of clusters of similar cells may often be sufficient for finding reasonably accurate identity GEPs, it will often fail for activity GEPs.
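To illustrate why cluster averaging falls short, the following minimal sketch simulates a single cell type in which some cells also express an activity program. All names and numbers here (`identity`, `activity`, the 30% activation rate, the gamma-distributed gene weights) are hypothetical choices made for illustration and are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 40

# Hypothetical programs: one identity GEP and one activity GEP (assumed values).
identity = rng.gamma(0.5, 2.0, size=n_genes)
activity = rng.gamma(0.5, 2.0, size=n_genes)

# 100 cells of the same type; roughly 30% also use the activity program
# at a strength between 0.2 and 1.0 (an arbitrary illustrative choice).
n_cells = 100
is_active = rng.random(n_cells) < 0.3
activity_usage = np.where(is_active, rng.uniform(0.2, 1.0, n_cells), 0.0)

# Each cell's profile is its identity program plus its share of the activity program.
cells = identity[None, :] + activity_usage[:, None] * activity[None, :]

# The cluster average is the identity program contaminated by the activity program.
centroid = cells.mean(axis=0)
contamination = centroid - identity
```

In this toy setting the centroid equals the identity program plus the mean activity usage times the activity program, so activity genes leak into the estimated identity profile and the activity program itself is never isolated.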
We hypothesized that we could infer activity GEPs directly from variation in single-cell expression profiles using matrix factorization. In this context, matrix factorization would model the gene expression data matrix as the product of two lower rank matrices, one encoding the relative contribution of each gene to each program, and a second specifying the proportions in which the programs are combined for each cell.
We refer to the second matrix as a ‘usage’ matrix as it specifies how much each GEP is ‘used’ by each cell in the dataset (Stein-O’Brien et al., 2018) (Figure 1A). Unlike hard clustering, which reduces all cells in a cluster to a single shared GEP, matrix factorization allows cells to express multiple GEPs.
Thus, this computational approach would allow cells to express one or more activity GEPs in addition to their expected cell-type GEP, and could correctly model doublets as a combination of the identity GEPs for the combined cell types. To the best of our knowledge, no previously reported studies have benchmarked the ability of matrix factorization methods to accurately learn identity and activity GEPs from scRNA-Seq profiles.
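The factorization described above can be sketched in a few lines. The example below is a minimal illustration, assuming a noiseless toy dataset and a plain NumPy implementation of non-negative matrix factorization with Lee and Seung's multiplicative updates; the function name `nmf`, the simulated programs, and all parameter values are assumptions made for this sketch, not the implementation used in the study.

```python
import numpy as np

def nmf(X, k, n_iter=1000, seed=0, eps=1e-9):
    """Factor a non-negative (cells x genes) matrix X into usage @ spectra
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    usage = rng.random((n_cells, k)) + eps      # cells x programs
    spectra = rng.random((k, n_genes)) + eps    # programs x genes
    for _ in range(n_iter):
        spectra *= (usage.T @ X) / (usage.T @ usage @ spectra + eps)
        usage *= (X @ spectra.T) / (usage @ spectra @ spectra.T + eps)
    return usage, spectra

# Toy data (assumed): programs 0 and 1 are identity GEPs, program 2 is an activity GEP.
rng = np.random.default_rng(1)
true_spectra = rng.gamma(shape=0.5, scale=2.0, size=(3, 50))
n_cells = 200
cell_type = rng.integers(0, 2, size=n_cells)
true_usage = np.zeros((n_cells, 3))
true_usage[np.arange(n_cells), cell_type] = 1.0          # one identity GEP per cell
active = rng.random(n_cells) < 0.3
true_usage[active, 2] = rng.uniform(0.2, 1.0, size=active.sum())
X = true_usage @ true_spectra                            # noiseless for clarity

usage, spectra = nmf(X, k=3)
rel_err = np.linalg.norm(X - usage @ spectra) / np.linalg.norm(X)
```

Each row of `usage` gives the proportions in which a cell combines the programs, and each row of `spectra` gives the contribution of every gene to one program. Real scRNA-Seq counts are noisy, so a practical analysis would need normalization, gene selection, and a choice of the number of programs, none of which are shown here.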
We see three primary motivations for jointly inferring identity and activity GEPs in scRNA-Seq data. First, systematic discovery of GEPs could reveal unexpected or novel activity programs reflecting important biological processes (e.g. immune activation or hypoxia) in the context of the native biological tissue. Second, it could enable characterization of the prevalence of each activity GEP across cell types in the tissue.
Finally, accounting for activity programs could improve inference of identity programs by avoiding spurious inclusion of activity program genes in the latter. GEPs corresponding to different phases of the cell cycle are examples of widespread activity programs and are well known to confound identity (cell type) program inference in scRNA-Seq data (Chen and Zhou, 2017; Scialdone et al., 2015). However, the cell cycle is just one instance of the broader problem of confounding between identity and activity programs.
While matrix factorization is widely used as a preprocessing step in scRNA-Seq analysis, a priori it is unclear which, if any, factorization approaches would be most appropriate for inferring biologically meaningful GEPs.
In particular, Principal Component Analysis (PCA), Independent Component Analysis (ICA), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and Non-Negative Matrix Factorization (NMF) (Lee and Seung, 1999) have been used for dimensionality reduction of data prior to downstream analysis or as an approach to cell clustering.
However, while PCA (Shalek et al., 2014; Steuerman et al., 2018), NMF (Puram et al., 2017) and ICA (Saunders et al., 2018) components have been interpreted as activity programs, the dimensions inferred by these or other matrix factorization algorithms may not necessarily align with biologically meaningful gene expression programs and are frequently ignored in practice. This is because each method makes different simplifying assumptions that are potentially inappropriate for gene expression data.
For example, NMF and LDA are non-negative and so cannot directly model repression. ICA components are statistically independent, PCA components are mutually orthogonal, and both allow gene expression to be negative. Furthermore, none of these methods, except LDA, explicitly accounts for the count distribution of expression data in their error models.
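The contrast between these assumptions can be made concrete with a small sketch. Assuming a toy non-negative expression matrix built from three hypothetical programs, the code below computes principal components by singular value decomposition and checks that they are mutually orthogonal and carry negative gene loadings, even though the underlying data and programs are entirely non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative data (assumed): 100 cells mixing 3 non-negative programs.
spectra = rng.gamma(0.5, 2.0, size=(3, 30))   # programs x genes
usage = rng.random((100, 3))                  # cells x programs
X = usage @ spectra

# PCA via SVD of the gene-centered matrix; rows of Vt are the components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Vt[:3]

gram = pcs @ pcs.T                            # identity matrix if components are orthonormal
has_negative_loadings = bool((pcs < 0).any())
```

Because orthogonal vectors with overlapping support cannot all be non-negative, the principal components cannot coincide with the non-negative programs that generated the data. This is one way to see why the dimensions recovered by a method need not align with biologically meaningful GEPs.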
In this study, we motivate, validate, and enhance the use of matrix factorization for GEP inference. Using simulations, we show that despite their simplifying assumptions, ICA, LDA, and NMF—but not PCA—can accurately discover both activity and identity GEPs.
However, due to inherent randomness in their algorithms, they give substantially varying results when repeated multiple times, which hinders their interpretability. We therefore implemented a meta-analysis approach (Figure 1B), which demonstrably increased robustness and accuracy. Overall, the meta-analysis of NMF, which we call Consensus NMF (cNMF), gave the best performance in these simulations.
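This section does not spell out the meta-analysis procedure, so the following is only a sketch of one plausible form it could take: run NMF several times from different random seeds, pool the resulting program spectra, group similar spectra with k-means, and take the per-group median as the consensus. The helper names (`nmf`, `unit_rows`, `consensus_spectra`), the number of runs, and the toy data are all assumptions for illustration, and the sketch omits any filtering of outlier components or refitting of usages that a fuller pipeline might include.

```python
import numpy as np

def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """NMF by multiplicative updates; returns (usage, spectra)."""
    rng = np.random.default_rng(seed)
    usage = rng.random((X.shape[0], k)) + eps
    spectra = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        spectra *= (usage.T @ X) / (usage.T @ usage @ spectra + eps)
        usage *= (X @ spectra.T) / (usage @ spectra @ spectra.T + eps)
    return usage, spectra

def unit_rows(M):
    """Scale each row to unit Euclidean length so replicates are comparable."""
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)

def consensus_spectra(X, k, n_runs=10, n_kmeans_iter=20):
    # 1. Replicate: one NMF run per seed, spectra pooled into one matrix.
    stacked = np.vstack([unit_rows(nmf(X, k, seed=s)[1]) for s in range(n_runs)])
    # 2. Cluster: plain k-means, initialized from the first run's spectra.
    centers = stacked[:k].copy()
    for _ in range(n_kmeans_iter):
        dists = ((stacked[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = stacked[labels == j].mean(axis=0)
    dists = ((stacked[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # 3. Summarize: the median spectrum of each cluster is the consensus program.
    consensus = np.vstack([
        np.median(stacked[labels == j], axis=0) if (labels == j).any() else centers[j]
        for j in range(k)
    ])
    return consensus, labels

# Toy data (assumed): 150 cells mixing 3 non-negative programs over 50 genes.
rng = np.random.default_rng(1)
X = rng.random((150, 3)) @ rng.gamma(0.5, 2.0, size=(3, 50))
consensus, labels = consensus_spectra(X, k=3, n_runs=10)
```

Taking a median over replicates damps the run-to-run variation that comes from random initialization, which is the motivation the text gives for a meta-analysis.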
Applied to three real datasets generated by three different scRNA-Seq platforms, cNMF inferred expected activity programs (cell-cycle programs in a brain organoid dataset and depolarization-induced programs in visual cortex neurons), an unanticipated hypoxia program, and intriguing novel activity programs.
It also enhanced cell type characterization and enabled estimation of rates of activity across cell types. These findings on real datasets further validate our approach as a useful analysis tool to understand complex signals within scRNA-Seq data.
Source:
Marine Biological Laboratory
Media Contacts:
Gina Hebert – Marine Biological Laboratory
Original Research: Open access
“Molecular profiling of single neurons of known identity in two ganglia from the crab Cancer borealis”. Adam J. Northcutt, Daniel R. Kick, Adriane G. Otopalik, Benjamin M. Goetz, Rayna M. Harris, Joseph M. Santin, Hans A. Hofmann, Eve Marder, and David J. Schulz.
PNAS doi:10.1073/pnas.1911413116.