This vignette shows how to jointly define clusters using single-cell RNA-seq and single-nuclear ATAC-seq data. We will use the peripheral blood mononuclear cell (PBMC) scRNA-seq and snATAC-seq datasets provided by 10X. The workflow for jointly analyzing RNA and ATAC is quite similar to that for integrating multiple RNA datasets. The only differences are (1) we need to process the ATAC data into gene-level values comparable to gene expression and (2) we perform gene selection using the RNA data only.

Load Pre-Computed Count Matrices for RNA and ATAC Data

Because ATAC-seq measures chromatin accessibility, a different data modality than gene expression, we need to process the ATAC data to obtain gene-level features that we can use to integrate with the RNA data. Although one could imagine many strategies for calculating gene-level features from ATAC data, we found that the simplest possible strategy works well: counting the number of reads that overlap the gene body and promoter for each gene. We show how to do this below starting from the cellranger output, but for now we will simply read in the final gene x cell matrices for RNA and ATAC data. These matrices and all other inputs are provided here, so we can load them directly:

rna_clusts = readRDS("rna_cluster_assignments.RDS")
atac_clusts = readRDS("atac_cluster_assignments.RDS")
pbmc.atac <- readRDS('pbmc.atac.expression.mat.RDS')
pbmc.rna <- readRDS('pbmc.rna.expression.mat.RDS')

Calculate Gene-Level Features for ATAC Data

Here we show how to compute a gene x cell feature matrix starting from the fragments.tsv file output by the 10X cellranger count pipeline. Skip this section for now if you want to simply try out liger on the provided counts.

Although one could imagine many strategies for calculating gene-level features from ATAC data, we found that the simplest possible strategy works well: counting the number of reads that overlap the gene body and promoter for each gene. We found that summing the peak counts output by cellranger count for the peaks overlapping each gene can also work, but this strategy is less desirable because (1) information from reads not in peaks is lost and (2) the cellranger peak calling is performed on all cells, which leads to an overrepresentation of peaks from abundant cell populations and biases against rare cell populations.

We sort the fragments output by cellranger by barcode and use the BEDOPS bedmap tool to make a list of cell barcodes that overlap each gene and promoter. To perform this step, you need to install BEDOPS (see this page for installation instructions). The file atac_fragments.tsv is available from the 10X website at this link. Note that this file is very large and the step to sort the file is somewhat slow.

system("sort -k4,4 atac_v1_pbmc_10k_fragments.tsv > atac_fragments.sort.bed")
system("bedmap --delim \"\\t\" --echo --echo-map-id hg19_genes.bed atac_fragments.sort.bed > atac_genes_bc.bed")
system("bedmap --delim \"\\t\" --echo --echo-map-id hg19_promoters.bed atac_fragments.sort.bed > atac_promoters_bc.bed")

We then import the bedmap output into R. Note that we use the as.is parameter in read.table to supress the conversion of character columns like gene symbol to factors. Using Liger’s makeFeatureMatrix function, we calculate gene and promoter accessibility counts. Finally, we sort the rows of these matrices by gene symbol and add them together. Note that we could also calculate these combined counts in a single step by combining the gene and promoter BED files; we choose to calculate them separately in case this is useful for downstream analyses.

library(liger)

genes.bc <- read.table(file = "atac_genes_bc.bed", sep = "\t", as.is = c(4,7), header = FALSE)
promoters.bc <- read.table(file = "atac_promoters_bc.bed", sep = "\t", as.is = c(4,7), header = FALSE)

#The makeFeatureMatrix function requires a list of barcodes as a vector
barcodes = names(atac_clusts)
gene.counts <- makeFeatureMatrix(genes.bc, barcodes)
promoter.counts <- makeFeatureMatrix(promoters.bc, barcodes)

gene.counts <- gene.counts[order(rownames(gene.counts)),]
promoter.counts <- promoter.counts[order(rownames(promoter.counts)),]
pbmc.atac <- gene.counts + promoter.counts

Read 10X RNA Output

Liger’s read10X function can be used to read cellranger count output files into R.

sample.dir = '.'
sample.name = 'pbmc.rna'
pbmc.rna <- read10X(sample.dirs = list(sample.dir), sample.names = list(sample.name))

Data Preprocessing

The algorithm takes a list of two or more count matrices as input. Genes should be in rows and cells in columns. We can make a list of the count matrices and pass this as parameter to the createLiger function to construct a Liger object.

library(liger)
ggplot2::theme_set(theme_cowplot())
pbmc.data = list(atac=pbmc.atac[,names(atac_clusts)], rna=pbmc.rna[,names(rna_clusts)])
int.pbmc <- createLiger(pbmc.data)

Before computing the factorization, we need to normalize the data to account for different capture efficiency and sequencing depth per cell with normalize, select variable genes with selectGenes, and scale the data with scaleNotCenter. Note that we do not mean-center the data because nonnegative matrix factorization requires nonnegative values.

The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union of the result. The variable genes are selected by comparing the variance of each gene’s expression to its mean expression. The selectGenes function was written primarily scRNA-seq in mind, and snATAC-seq data distribution is quite different. So instead of taking the union of variable genes from RNA and ATAC, we set datasets.use = 2 in the function to perform gene selection using only the RNA dataset.

int.pbmc <- normalize(int.pbmc)
int.pbmc <- selectGenes(int.pbmc, datasets.use = 2)
int.pbmc <- scaleNotCenter(int.pbmc)

Factorization and Quantile Normalization

Next we perform integrative non-negative matrix factorization in order to identify shared and distinct metagenes across the datasets and the corresponding metagene loadings for each cell. The most important parameters in the factorization are k (the number of factors) and lambda (the penalty parameter which limits the dataset-specific component of the factorization). The default value of lambda=5.0 usually provides reasonable results for most analyses. For this analysis, we simply use k = 20 and the default value of lambda.

int.pbmc <- optimizeALS(int.pbmc, k=20)

After the factorization, we still need to quantile normalize the factor loadings across the datasets. Notice that if we plot a t-SNE representation of the factor loadings before normalization, the cells still cluster mainly by modality.

int.pbmc <- runTSNE(int.pbmc, use.raw = T)
p1 <- plotByDatasetAndCluster(int.pbmc, return.plots = T)
print(p1[[1]])