Comparing and contrasting heterogeneous single cell profiles using liger

Joshua D. Welch and Velina Kozareva

2019-03-12

Data preprocessing

The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns. Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell, select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.

dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is
recommended.
dge2 = readRDS("dge2.RDS")
ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets
ligerex = normalize(ligerex)
ligerex = selectGenes(ligerex, var.thresh = 0.1)
ligerex = scaleNotCenter(ligerex)

Loading 10X Data

liger also has functions for reading data generated by 10X’s cellranger count pipeline (including from 10X V3). We can merge data by data type (most commonly Gene Expression) across multiple samples and then use this as a single dataset in a new object for integration.

# 10X data to be combined for sample 1 
sample1_data1 = "/path/to/10X/output/dir1"
sample1_data2 = "/path/to/10X/output/dir2"
sample1_dge = read10X(sample.dirs = list(sample1_data1, sample1_data2), 
                      sample.names = c("s1_data1", "s1_data2"), min.umis = 500)
# 10X data for sample 2
sample2_data = "/path/to/10X/output/dir3"
sample2_dge = read10X(sample.dirs = list(sample2_data), 
                      sample.names = c("s2_data"), min.umis = 500)
liger10X = createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge))
# continue with other preprocessing steps
                      

Performing the factorization

Next we perform the factorization using an alternating least squares algorithm. After performing the factorization, we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets. The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. In most cases, the default settings of lambda=5.0 and resolution=1.0 provide reasonable results.

ligerex = optimizeALS(ligerex, k = 20) 
ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment

Visualizing the results

We can visualize the results by using dimensionality reduction techniques like t-SNE or UMAP (recommended for larger datasets). Visualizations can be colored by dataset of origin or cluster assignment. plotWordClouds is a useful way to visualize the most highly loading genes (both shared and dataset specific) for each factor, in conjunction with the factor loadings across cells.

ligerex = runTSNE(ligerex)
# for larger datasets, may want to use UMAP instead
ligerex = runUMAP(ligerex)
plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot
pdf("word_clouds.pdf")
plotWordClouds(ligerex)
dev.off()

Exploring factors and clusters

Another way to examine factor loadings across cells, and to help visualize the alignment process is through plotFactors; this can be helpful for seeing significant scale differences between the datasets. We can also visualize the correspondence between clusters and factors in the data with plotClusterProportions and plotClusterFactors. These can be especially useful for identifying clusters associated with certain factors and corresponding marker genes.

plotFactors(ligerex)
plotClusterProportions(ligerex)
plotClusterFactors(ligerex, use.aligned = T)

Finding and visualizing marker genes

We can use the factorization to more explicitly identify shared and dataset-specific markers. The function getFactorMarkers returns a list, where the first element contains dataset-specific markers for dataset 1, the second element contains sharedmarkers, the third element contains dataset-specific markers for dataset 2, and the last 2 elements indicate the number of factors in which each marker is found. This information allows the identification of ubiquitous vs. cell-type-specific dataset differences.

markers = getFactorMarkers(ligerex, num.genes = 10)
plotGene(ligerex, gene = "Malat1")
plotGeneViolin(ligerex, gene = "Malat1")

Comparing different cluster assignments

We can compare and visualize liger cluster assignments with existing clusterings (if cluster assignments are available for the individual datasets).

# published cluster assignments for all cells in dataset 1 and 2
clusters_published1 = readRDS("clusters1.RDS")
clusters_published2 = readRDS("clusters2.RDS")
# calculate adjusted Rand Index between liger cluster assignments and another assignment
calcARI(ligerex, c(clusters_published1, clusters_published2)) 
# calculate purity between liger cluster assignments and another assignment for just dataset 1 
calcPurity(ligerex, clusters_published1)
# visualize joint cluster assignment as related to individual dataset cluster assignments
makeRiverplot(ligerex, clusters_published1, clusters_published2)

Checking dataset alignment and individual dataset distortion

liger includes methods for estimating the level of integration (alignment) between datasets and the level of distortion of structure for each individual dataset after factorization and alignment (compared to before). In general, datasets which have been factorized and aligned in a meaningful way should show a high degree of integration (high alignment metric), while maintaining a low degree of distortion (high agreement metric).

calcAlignment(ligerex)
calcAgreement(ligerex)
# see if certain clusters are more integrated than others
calcAlignmentPerCluster(ligerex)

Selecting k and lambda

The suggestK and suggestLambda functions can aid in selecting k and lambda. We want to find the smallest k for which the increase in entropy metric begins to level off (an “elbow” in the plot). Similarly, we want the smallest lambda for which the alignment metric stabilizes.

suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors
suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda

Updating the factorization

If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions below provide an efficient method of updating. This is much faster than the naive approach of simply re-running the optimizeALS algorithm.

ligerex = optimizeNewK(ligerex, k = 15) #Can also decrease K
#Add new batches from the same condition/technology/species/protocol
ligerex = optimizeNewData(ligerex, newdata = list(name1 = dge1.new, name2 = dge2.new),
                          which.datasets = list(name1, name2), add.to.existing = T) 
#Add completely new datasets. Specify which existing datasets are most similar.
ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4),
                          which.datasets = list(name1, name2), add.to.existing = F) 
#cell.subset is a list of cells to retain from each dataset
ligerex = optimizeSubset(ligerex, cell.subset) 

Integration between liger and Seurat

We can easily create liger objects from Seurat (V2 or V3) objects (and vice versa), while keeping calculated features from the original objects.

# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable 
# genes from each object
ligerex = seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000)
# Create liger object from single integrated Seurat V2 object, keeping CCA factorization, 
# splitting datasets by Seurat meta.var column "original"
ligerex = seuratToLiger(seurat_obj, combined.seurat = T, meta.var = 'original', cca.to.H = T)
# Create liger object from single integrated Seurat V3 object, splitting datasets by two
# available assays in Seurat 
ligerex = seuratToLiger(seurat_obj, combined.seurat = T, assays.use = c('RNA', 'ADT'))
# Create Seurat object from liger object, keeping liger highly variable genes
seurat_obj = ligerToSeurat(ligerex, use.liger.genes = T)