Comparing and contrasting heterogeneous single cell profiles using liger

Joshua D. Welch

2018-10-20

Data Preprocessing

The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns. Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell, select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that coresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.

dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is accepted.
dge2 = readRDS("dge2.RDS")
ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets
ligerex = normalize(ligerex)
ligerex = selectGenes(ligerex, var.thresh = 0.1)
ligerex = scaleNotCenter(ligerex)

Performing the Factorization

Next we perform the factorization using an alternating least squares algorithm. After performing the factorization, we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets. The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. In most cases, the default settings of lambda=5.0 and resolution=1.0 provide reasonable results.

ligerex = optimizeALS(ligerex, k = 20) 
ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment

Visualizing the results

ligerex = runTSNE(ligerex)
plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot
pdf("word_clouds.pdf")
plotWordClouds(ligerex)
dev.off()

Finding marker genes

We can use the factorization to identify shared and dataset-specific markers. The function below returns a list, where the first element contains dataset-specific markers for dataset 1, the second element contains shared markers, the third element contains dataset-specific markers for dataset 2, and the last 2 elements indicate the number of factors in which each marker is found. This information allows the identification of ubiquitous vs. cell-type-specific dataset differences.

markers = getFactorMarkers(ligerex, num.genes = 10)
plotGene(ligerex, gene = "Malat1")
plotGeneViolin(ligerex, gene = "Malat1")

Selecting k and lambda

The suggestK and suggestLambda functions can aid in selecting k and lambda. We want to find the smallest k for which the increase in entropy metric begins to level off (an “elbow” in the plot). Similarly, we want the smallest lambda for which the alignment metric stabilizes.

suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors
suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda

Updating the Factorization

If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions below provide an efficient method of updating. This is much faster than the naive approach of simply re-running the optimizeALS algorithm.

ligerex = optimizeNewK(ligerex, k = 15) #Can also decrease K
#Add new batches from the same condition/technology/species/protocol
ligerex = optimizeNewData(ligerex, newdata = list(name1 = dge1.new, name2 = dge2.new),
                          which.datasets = list(name1, name2), add.to.existing = T) 
#Add completely new datasets. Specify which existing datasets are most similar.
ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4),
                          which.datasets = list(name1, name2), add.to.existing = F) 
#cell.subset is a list of cells to retain from each dataset
ligerex = optimizeSubset(ligerex, cell.subset)