This walkthrough steps through a basic analysis and alignment of two datasets of peripheral blood mononuclear cells (PBMCs), provided by 10X and Gierahn et al., 2017. We show how LIGER can be used to identify trends and clusters across both datasets, in addition to dataset-specific differences. We also demonstrate some of the package’s visualization capabilities while reproducing some of the figure panels from the manuscript (Fig. 2).

The data can be downloaded from here. First we set up the liger object using the 10X and Seqwell datasets provided in this directory.

library(liger)
library(cowplot)

pbmc.10x <- read.table('~/Downloads/pbmc_alignment/pbmc_10X.expressionMatrix.txt',
                       sep="\t",stringsAsFactors=F,header=T,row.names = 1)
pbmc.seqwell <- read.table('~/Downloads/pbmc_alignment/pbmc_SeqWell.expressionMatrix.txt',
                           sep="\t",stringsAsFactors=F,header=T,row.names = 1)
pbmc.data = list(tenx=pbmc.10x, seqwell=pbmc.seqwell)

# Create liger object
a.pbmc <- createLiger(pbmc.data)

Data Preprocessing

Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell, select variable genes, and scale the data. Note that we do not center the data when scaling because nonnegative matrix factorization accepts only positive values.

The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union of the result. The most variable genes are selected by comparing the variance of each gene’s expression to its mean expression and keeping those with var/mean ratio above a certain threshold.

We can also simply pass pre-selected variable genes to the object if we’ve determined them from another source. In this case, we are using the variable genes determined by the Seurat package for their analysis of these two datasets (this was done in the manuscript for a cleaner comparison of the methods).

a.pbmc <- normalize(a.pbmc)
# Can pass different var.thresh values to each dataset if one seems to be contributing significantly
# more genes than the other
a.pbmc <- selectGenes(a.pbmc, var.thresh = c(0.3, 0.875), do.plot = F)

# In this case, we have a precomputed set of variable genes
s.var.genes <- readRDS('~/Downloads/pbmc_alignment/var_genes.RDS')
a.pbmc@var.genes <- s.var.genes
a.pbmc <- scaleNotCenter(a.pbmc)

Factorization

Next we perform integrative non-negative matrix factorization in order to identify shared and distinct metagenes across the datasets and the corresponding factor/metagene loadings for each cell. The most important parameters in the factorization are k (the number of factors) and lambda (the penalty parameter which limits the dataset-specific component of the factorization). The default value of lambda=5.0 usually provides reasonable results for most analyses, although the suggestLambda function can be used to determine a more appropriate lambda value for the desired level of dataset alignment.

To determine the appropriate number of factors to use, we can use the suggestK function which plots median K-L divergence from the uniform distribution in the factor loadings as a function of k. We want to look for the section of the plot where this metric stops increasing as sharply (the “elbow” of the plot). In general, we should expect a positive correlation between the number of subgroups we expect to find in the analysis and the appropriate number of factors to use. Since the suggestK function can take more than 10 minutes to run, it can sometimes be useful to run a quick preliminary analysis with k=20 to get an idea of whether a much higher number of factors is needed.

# running suggestK on multiple cores can greatly decrease the runtime
k.suggest <- suggestK(a.pbmc, num.cores = 5, gen.new = T, return.results = T, plot.log2 = F)

For this analysis, we select a k of 22, though you can try various values in that range for similar results. We select the default lambda=5.0. (You can also run suggestLambda to get a plot of expected alignment vs. lambda for a given number of factors.)

# Take the lowest objective of three factorizations with different initializations
# Multiple restarts are recommended for initial analyses since iNMF is non-deterministic
a.pbmc <- optimizeALS(a.pbmc, k=22, thresh = 5e-5, nrep = 3)

After the factorization, we still need to quantile align the factor loadings across the datasets. Notice that if we plot a t-SNE representation of the factor loadings, the data still cluster mainly by dataset.

a.pbmc <- runTSNE(a.pbmc, use.raw = T)
p1 <- plotByDatasetAndCluster(a.pbmc, return.plots = T)
# Plot by dataset
print(p1[[1]])