debbuging new
Author

Lexanomics

Published

March 28, 2024

Parallelization tips on Seurat

For the series on basic, but useful tips on single-cell analysis. We want to introduce parallelization using future for Seurat.

Frequently a bioinformatician/computational biologist needs to test multiple parameters in a single-cell project, e.g., evaluating clustering resolutions. Although a conventional task, this procedure could take a while ranging from minutes to hours… It can get even worse if we talk about differential expressions analysis with well-known FindAllMarkers.

Life is short, and I want my results. That being said, why not use parallelization? To our surprise, not many data analysts are familiar with the future package [1]. Furthermore, the future package is very handy for R package development - Yes, it is not limited to Seurat, you could incorporate it into your package/script.

As for Seurat, there are at least six (06) functions that were written to leverage future parallelization.

  • NormalizeData
  • ScaleData -JackStraw
  • FindMarkers*
  • FindIntegrationAnchors
  • FindClusters*

*For teaching purposes we will focus only on these two functions.

1. Loading requirements

library(Seurat)
library(dplyr)
library(future)

Please note that we need to establish what parallelization strategy (plan) will be used in our analysis. Also, we are running multiple processes in parallel which can dramatically increase the memory consumption. Be thoughtful about it.

# Setting memory limit for 8Gb
options(future.globals.maxSize = 8000 * 1024^2)

# Enabling parallelization
plan("multicore", workers = 4) # To the date, RStudio does not support parallelization. Run this script using Rscript command-line application.

Next, we will load the PBMC dataset from 10X Genomics. Available here

pbmc_seurat <- readRDS(file = "path/to/pbmc_seurat.RDS")
pbmc_seurat
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 2000 variable features)
 3 layers present: counts, data, scale.data

2. Normalization and Dimensionality reduction

pbmc_seurat <- pbmc_seurat %>%
  NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
  ScaleData()
pbmc_seurat <- RunPCA(pbmc_seurat)

3. Clustering

As mentioned, often a bioinformatician will evaluate with multiple clustering resolutions. In this tutorial, we will do such a task considering eight different thresholds.

pbmc_seurat <- FindNeighbors(pbmc_seurat, dims = 1:10)
pbmc_seurat <- FindClusters(pbmc_seurat, 
                            resolution = c(0.1, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0))

4. Differential Expression

Finally, we will execute the differential expression analysis.

pbmc_markers <- FindAllMarkers(pbmc_seurat, only.pos = TRUE)
pbmc_markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1)
# A tibble: 2,830 × 7
# Groups:   cluster [11]
      p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
      <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
 1 4.62e-91       1.08 0.954 0.609  6.33e-87 0       LDHB     
 2 1.94e-89       2.11 0.517 0.124  2.66e-85 0       CCR7     
 3 1.68e-52       1.76 0.396 0.117  2.30e-48 0       PRKCQ-AS1
 4 9.77e-52       1.83 0.388 0.114  1.34e-47 0       LEF1     
 5 1.77e-43       1.89 0.323 0.09   2.42e-39 0       MAL      
 6 2.61e-43       2.05 0.235 0.048  3.58e-39 0       FHIT     
 7 3.45e-42       1.37 0.488 0.196  4.74e-38 0       PIK3IP1  
 8 1.41e-40       1.05 0.677 0.373  1.93e-36 0       NOSIP    
 9 1.51e-37       1.20 0.448 0.184  2.08e-33 0       TCF7     
10 1.83e-35       1.57 0.244 0.063  2.51e-31 0       TRABD2A  
# ℹ 2,820 more rows

Another lazy tip: If you are interested in assessing a preliminary result, you could combine the parallelization strategy with the parameter max.cells.per.ident from FindAllMarkers. It will downsample each cluster/ident based on the selected threshold.

pbmc_markers <- FindAllMarkers(
  pbmc_seurat, 
  max.cells.per.ident = 100, # for real datasets you should considered 1000 cells
  only.pos = TRUE)

  • The figures are derived from 1, and 2