dreval

library(dreval)
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
#> Warning: multiple methods tables found for 'setequal'
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
#> Warning: replacing previous import 'S4Arrays::read_block' by
#> 'DelayedArray::read_block' when loading 'SummarizedExperiment'

Introduction

The primary aim of the dreval package is to evaluate and compare one or more reduced dimension representations, in terms of how well they retain the structure from a reference representation. Here, we illustrate its application to a single-cell RNA-seq data set of peripheral blood mononuclear cells (PBMCs), representing a subset of the pbmc3k data set from the TENxPBMCData package. The set of genes has been subset to only highly variable ones, and several dimension reduction methods (PCA, tSNE and UMAP) have been applied.

data(pbmc3ksub)
pbmc3ksub
#> class: SingleCellExperiment 
#> dim: 1804 2700 
#> metadata(1): log.exprs.offset
#> assays(2): counts logcounts
#> rownames(1804): LYZ S100A9 ... AGTRAP ATXN1L
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames(2700): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(8): PCA PCA_k2 ... UMAP_nn15 UMAP_nn30
#> mainExpName: NULL
#> altExpNames(0):

Evaluation

The dreval() function is the main function in the package, and calculates several metrics comparing each of the reduced dimension representations in the SingleCellExperiment object to the designed reference representation. Here, we use the underlying normalized and log-transformed data contained in the logcounts assay as the reference representation. For a list of the calculated metrics, see the help page for dreval().

dre <- dreval(
  sce = pbmc3ksub, refType = "assay", refAssay = "logcounts", 
  dimReds = NULL, nSamples = 500, kTM = c(10, 75), 
  verbose = FALSE, distNorm = "l2"
)

Result visualization

The output of dreval() is a list with two elements. The plots entry contains several diagnostic plot objects, while the scores entry contains the calculated scores for each of the evaluated reduced dimension representations. These can be summarized visually with the plotRankSummary() function.

dre$scores
#>             Method dimensionality SpearmanCorrDist PearsonCorrDist KSStatDist
#> D...1          PCA             25        0.8155730       0.8257160  0.3231743
#> D...2       PCA_k2              2        0.6755452       0.6558962  0.4693307
#> D...3  TSNE_perp30              2        0.6482856       0.6000383  0.3659880
#> D...4   TSNE_perp5              2        0.4111228       0.3683701  0.3791743
#> D...5  TSNE_perp15              2        0.5911459       0.5574556  0.3687695
#> D...6 TSNE_perp100              2        0.6306086       0.5824035  0.3852104
#> D...7    UMAP_nn15              2        0.6004903       0.5721709  0.4249379
#> D...8    UMAP_nn30              2        0.6083531       0.5567674  0.4599679
#>       EuclDistBetweenDists SammonStress Trustworthiness_k10 Continuity_k10
#> D...1            0.2519976   0.06707261           0.9161812      0.9297478
#> D...2            0.4891927   0.25436476           0.8173676      0.8644743
#> D...3            0.3952114   0.16459791           0.8513288      0.8608281
#> D...4            0.4105094   0.17360639           0.8278914      0.8092301
#> D...5            0.3928673   0.16198379           0.8438089      0.8585416
#> D...6            0.4488881   0.21421422           0.8571905      0.8642444
#> D...7            0.5214532   0.29170515           0.8387282      0.8560252
#> D...8            0.5649890   0.34189444           0.8402497      0.8552194
#>       Trustworthiness_k75 Continuity_k75 MeanJaccard_k10 MeanJaccard_k75
#> D...1           0.9307742      0.9432491       0.1911720       0.4512849
#> D...2           0.8631356      0.9031720       0.1126294       0.3307295
#> D...3           0.8667626      0.8900429       0.1355732       0.3536190
#> D...4           0.8035737      0.8163307       0.1322705       0.2981947
#> D...5           0.8583365      0.8794768       0.1355053       0.3477706
#> D...6           0.8716046      0.8985398       0.1401506       0.3702075
#> D...7           0.8624017      0.8961107       0.1312572       0.3493354
#> D...8           0.8642634      0.8988611       0.1340936       0.3555657
#>       coRankingQlocal coRankingQglobal KmaxLCMC
#> D...1       0.3999471        0.2486593      124
#> D...2       0.2904145        0.1925418      134
#> D...3       0.3019631        0.1894801      126
#> D...4       0.2359816        0.1263050      106
#> D...5       0.2892327        0.1756569      119
#> D...6       0.3141772        0.1902151      129
#> D...7       0.3003102        0.1736805      131
#> D...8       0.3073870        0.1844069      133
suppressPackageStartupMessages({
  library(ggplot2)
})
cowplot::plot_grid(plotlist = lapply(dre$plots$disthex, function(w) w + 
  theme(axis.title = element_text(size = 7))), ncol = 2)
#> Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(density)` instead.
#> ℹ The deprecated feature was likely used in the dreval package.
#>   Please report the issue at <https://github.com/csoneson/dreval/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

plotRankSummary(dre$scores) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Using a reduced dimension representation as the reference space

In the example above, we used the original feature space (the logcounts assay) as the reference space, to which all reduced dimension representations were compared. It is also possible to use one of the reduced dimension representations as the reference space, as we will illustrate here.

dre <- dreval(
  sce = pbmc3ksub, refType = "dimred", refDimRed = "PCA", 
  dimReds = NULL, nSamples = 500, kTM = c(10, 75), 
  verbose = FALSE, distNorm = "l2"
)
dre$scores
#>             Method dimensionality SpearmanCorrDist PearsonCorrDist KSStatDist
#> D...1       PCA_k2              2        0.9209284       0.8982171 0.27183968
#> D...2  TSNE_perp30              2        0.7649254       0.7251368 0.12925852
#> D...3   TSNE_perp5              2        0.5765053       0.5481474 0.09867735
#> D...4  TSNE_perp15              2        0.7282644       0.7053298 0.11158317
#> D...5 TSNE_perp100              2        0.7639271       0.7235801 0.20747094
#> D...6    UMAP_nn15              2        0.7957255       0.7695924 0.30283768
#> D...7    UMAP_nn30              2        0.8244421       0.7564214 0.35959920
#>       EuclDistBetweenDists SammonStress Trustworthiness_k10 Continuity_k10
#> D...1            0.2940195    0.1157218           0.9105399      0.9450027
#> D...2            0.3105750    0.1050059           0.9644227      0.9615129
#> D...3            0.3786560    0.1527330           0.9450436      0.9368867
#> D...4            0.3160655    0.1096450           0.9602679      0.9580091
#> D...5            0.3453218    0.1333275           0.9657441      0.9684471
#> D...6            0.3784983    0.1823796           0.9546336      0.9598832
#> D...7            0.4220470    0.2249209           0.9549292      0.9631876
#>       Trustworthiness_k75 Continuity_k75 MeanJaccard_k10 MeanJaccard_k75
#> D...1           0.9417897      0.9579986       0.1916863       0.5036352
#> D...2           0.9427882      0.9503873       0.3531773       0.5642417
#> D...3           0.8814643      0.9070903       0.2938348       0.4638678
#> D...4           0.9370618      0.9458715       0.3356492       0.5564015
#> D...5           0.9544140      0.9619534       0.3715695       0.5956543
#> D...6           0.9500420      0.9546654       0.3180622       0.5758168
#> D...7           0.9532505      0.9593630       0.3147608       0.5846820
#>       coRankingQlocal coRankingQglobal KmaxLCMC
#> D...1       0.4334229        0.2767674      119
#> D...2       0.5049718        0.2633026       57
#> D...3       0.4308531        0.1928194       61
#> D...4       0.4965520        0.2462500       64
#> D...5       0.5182479        0.2860603       48
#> D...6       0.4891039        0.2646368       62
#> D...7       0.4953732        0.2803684       62
cowplot::plot_grid(plotlist = lapply(dre$plots$disthex, function(w) w + 
  theme(axis.title = element_text(size = 7))), ncol = 2)

plotRankSummary(dre$scores) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Computational efficiency

dreval relies heavily on the calculation of pairwise distances among samples, and is consequently computationally demanding for large data sets. The dreval() function allows for subsampling a smaller number of samples, as well as selecting a subset of the variables, to speed up calculations and reduce the memory footprint. Here we illustrate that for the example data set, the calculated metrics are stable under subsampling of cells.

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(ggplot2)
})

set.seed(1)
nS <- c(100, 400, 1000, 1500, 2700)
v <- lapply(nS, function(n) {
  if (n == 2700) nR <- 1
  else nR <- 2
  lapply(seq_len(nR), function(i) {
    message(n, " samples, replicate ", i)
    dreval(sce = pbmc3ksub, refAssay = "logcounts", dimReds = NULL,
           nSamples = n, kTM = 50, verbose = FALSE)$scores %>%
      dplyr::mutate(nSamples = n, replicate = i)
  })
})
vv <- do.call(dplyr::bind_rows, v)

vv <- vv %>%
  tidyr::gather(
    key = "measure", value = "value", 
    -Method, -dimensionality, -nSamples, -replicate
  )

suppressWarnings(
  ggplot(vv, aes(x = nSamples, y = value, color = Method)) + 
    geom_point(size = 3) + facet_wrap(~ measure, scales = "free_y") + 
    geom_smooth(se = FALSE) + theme_bw()
)

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1  dreval_0.1.5   rmarkdown_2.29
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.37.0 gtable_0.3.6               
#>  [3] xfun_0.49                   bslib_0.8.0                
#>  [5] sparsesvd_0.2-2             Biobase_2.67.0             
#>  [7] lattice_0.22-6              vctrs_0.6.5                
#>  [9] tools_4.4.2                 generics_0.1.3             
#> [11] stats4_4.4.2                tibble_3.2.1               
#> [13] fansi_1.0.6                 cluster_2.1.6              
#> [15] pkgconfig_2.0.3             Matrix_1.7-1               
#> [17] S4Vectors_0.45.1            lifecycle_1.0.4            
#> [19] GenomeInfoDbData_1.2.13     farver_2.1.2               
#> [21] compiler_4.4.2              iotools_0.3-5              
#> [23] munsell_0.5.1               coRanking_0.2.5            
#> [25] GenomeInfoDb_1.43.0         htmltools_0.5.8.1          
#> [27] sys_3.4.3                   buildtools_1.0.0           
#> [29] sass_0.4.9                  yaml_2.3.10                
#> [31] hexbin_1.28.5               pillar_1.9.0               
#> [33] crayon_1.5.3                jquerylib_0.1.4            
#> [35] tidyr_1.3.1                 MASS_7.3-61                
#> [37] SingleCellExperiment_1.29.1 DelayedArray_0.33.1        
#> [39] cachem_1.1.0                abind_1.4-8                
#> [41] tidyselect_1.2.1            digest_0.6.37              
#> [43] dplyr_1.1.4                 purrr_1.0.2                
#> [45] labeling_0.4.3              maketools_1.3.1            
#> [47] cowplot_1.1.3               fastmap_1.2.0              
#> [49] grid_4.4.2                  colorspace_2.1-1           
#> [51] cli_3.6.3                   SparseArray_1.7.1          
#> [53] magrittr_2.0.3              S4Arrays_1.7.1             
#> [55] utf8_1.2.4                  withr_3.0.2                
#> [57] UCSC.utils_1.3.0            scales_1.3.0               
#> [59] XVector_0.47.0              httr_1.4.7                 
#> [61] matrixStats_1.4.1           evaluate_1.0.1             
#> [63] knitr_1.49                  GenomicRanges_1.59.0       
#> [65] IRanges_2.41.0              rlang_1.1.4                
#> [67] Rcpp_1.0.13-1               wordspace_0.2-8            
#> [69] glue_1.8.0                  BiocGenerics_0.53.2        
#> [71] jsonlite_1.8.9              R6_2.5.1                   
#> [73] MatrixGenerics_1.19.0       zlibbioc_1.52.0

References