library(dreval)
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
#> Warning: multiple methods tables found for 'setequal'
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
#> Warning: replacing previous import 'S4Arrays::read_block' by
#> 'DelayedArray::read_block' when loading 'SummarizedExperiment'
The primary aim of the dreval
package is to evaluate and
compare one or more reduced dimension representations, in terms of how
well they retain the structure from a reference representation. Here, we
illustrate its application to a single-cell RNA-seq data set of
peripheral blood mononuclear cells (PBMCs), representing a subset of the
pbmc3k
data set from the TENxPBMCData
package.
The set of genes has been subset to only highly variable ones, and
several dimension reduction methods (PCA, tSNE and UMAP) have been
applied.
data(pbmc3ksub)
pbmc3ksub
#> class: SingleCellExperiment
#> dim: 1804 2700
#> metadata(1): log.exprs.offset
#> assays(2): counts logcounts
#> rownames(1804): LYZ S100A9 ... AGTRAP ATXN1L
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames(2700): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(8): PCA PCA_k2 ... UMAP_nn15 UMAP_nn30
#> mainExpName: NULL
#> altExpNames(0):
The dreval()
function is the main function in the
package, and calculates several metrics comparing each of the reduced
dimension representations in the SingleCellExperiment
object to the designed reference representation. Here, we use the
underlying normalized and log-transformed data contained in the
logcounts
assay as the reference representation. For a list
of the calculated metrics, see the help page for
dreval()
.
The output of dreval()
is a list with two elements. The
plots
entry contains several diagnostic plot objects, while
the scores
entry contains the calculated scores for each of
the evaluated reduced dimension representations. These can be summarized
visually with the plotRankSummary()
function.
dre$scores
#> Method dimensionality SpearmanCorrDist PearsonCorrDist KSStatDist
#> D...1 PCA 25 0.8155730 0.8257160 0.3231743
#> D...2 PCA_k2 2 0.6755452 0.6558962 0.4693307
#> D...3 TSNE_perp30 2 0.6482856 0.6000383 0.3659880
#> D...4 TSNE_perp5 2 0.4111228 0.3683701 0.3791743
#> D...5 TSNE_perp15 2 0.5911459 0.5574556 0.3687695
#> D...6 TSNE_perp100 2 0.6306086 0.5824035 0.3852104
#> D...7 UMAP_nn15 2 0.6004903 0.5721709 0.4249379
#> D...8 UMAP_nn30 2 0.6083531 0.5567674 0.4599679
#> EuclDistBetweenDists SammonStress Trustworthiness_k10 Continuity_k10
#> D...1 0.2519976 0.06707261 0.9161812 0.9297478
#> D...2 0.4891927 0.25436476 0.8173676 0.8644743
#> D...3 0.3952114 0.16459791 0.8513288 0.8608281
#> D...4 0.4105094 0.17360639 0.8278914 0.8092301
#> D...5 0.3928673 0.16198379 0.8438089 0.8585416
#> D...6 0.4488881 0.21421422 0.8571905 0.8642444
#> D...7 0.5214532 0.29170515 0.8387282 0.8560252
#> D...8 0.5649890 0.34189444 0.8402497 0.8552194
#> Trustworthiness_k75 Continuity_k75 MeanJaccard_k10 MeanJaccard_k75
#> D...1 0.9307742 0.9432491 0.1911720 0.4512849
#> D...2 0.8631356 0.9031720 0.1126294 0.3307295
#> D...3 0.8667626 0.8900429 0.1355732 0.3536190
#> D...4 0.8035737 0.8163307 0.1322705 0.2981947
#> D...5 0.8583365 0.8794768 0.1355053 0.3477706
#> D...6 0.8716046 0.8985398 0.1401506 0.3702075
#> D...7 0.8624017 0.8961107 0.1312572 0.3493354
#> D...8 0.8642634 0.8988611 0.1340936 0.3555657
#> coRankingQlocal coRankingQglobal KmaxLCMC
#> D...1 0.3999471 0.2486593 124
#> D...2 0.2904145 0.1925418 134
#> D...3 0.3019631 0.1894801 126
#> D...4 0.2359816 0.1263050 106
#> D...5 0.2892327 0.1756569 119
#> D...6 0.3141772 0.1902151 129
#> D...7 0.3003102 0.1736805 131
#> D...8 0.3073870 0.1844069 133
suppressPackageStartupMessages({
library(ggplot2)
})
cowplot::plot_grid(plotlist = lapply(dre$plots$disthex, function(w) w +
theme(axis.title = element_text(size = 7))), ncol = 2)
#> Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(density)` instead.
#> ℹ The deprecated feature was likely used in the dreval package.
#> Please report the issue at <https://github.com/csoneson/dreval/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
In the example above, we used the original feature space (the
logcounts
assay) as the reference space, to which all
reduced dimension representations were compared. It is also possible to
use one of the reduced dimension representations as the reference space,
as we will illustrate here.
dre <- dreval(
sce = pbmc3ksub, refType = "dimred", refDimRed = "PCA",
dimReds = NULL, nSamples = 500, kTM = c(10, 75),
verbose = FALSE, distNorm = "l2"
)
dre$scores
#> Method dimensionality SpearmanCorrDist PearsonCorrDist KSStatDist
#> D...1 PCA_k2 2 0.9209284 0.8982171 0.27183968
#> D...2 TSNE_perp30 2 0.7649254 0.7251368 0.12925852
#> D...3 TSNE_perp5 2 0.5765053 0.5481474 0.09867735
#> D...4 TSNE_perp15 2 0.7282644 0.7053298 0.11158317
#> D...5 TSNE_perp100 2 0.7639271 0.7235801 0.20747094
#> D...6 UMAP_nn15 2 0.7957255 0.7695924 0.30283768
#> D...7 UMAP_nn30 2 0.8244421 0.7564214 0.35959920
#> EuclDistBetweenDists SammonStress Trustworthiness_k10 Continuity_k10
#> D...1 0.2940195 0.1157218 0.9105399 0.9450027
#> D...2 0.3105750 0.1050059 0.9644227 0.9615129
#> D...3 0.3786560 0.1527330 0.9450436 0.9368867
#> D...4 0.3160655 0.1096450 0.9602679 0.9580091
#> D...5 0.3453218 0.1333275 0.9657441 0.9684471
#> D...6 0.3784983 0.1823796 0.9546336 0.9598832
#> D...7 0.4220470 0.2249209 0.9549292 0.9631876
#> Trustworthiness_k75 Continuity_k75 MeanJaccard_k10 MeanJaccard_k75
#> D...1 0.9417897 0.9579986 0.1916863 0.5036352
#> D...2 0.9427882 0.9503873 0.3531773 0.5642417
#> D...3 0.8814643 0.9070903 0.2938348 0.4638678
#> D...4 0.9370618 0.9458715 0.3356492 0.5564015
#> D...5 0.9544140 0.9619534 0.3715695 0.5956543
#> D...6 0.9500420 0.9546654 0.3180622 0.5758168
#> D...7 0.9532505 0.9593630 0.3147608 0.5846820
#> coRankingQlocal coRankingQglobal KmaxLCMC
#> D...1 0.4334229 0.2767674 119
#> D...2 0.5049718 0.2633026 57
#> D...3 0.4308531 0.1928194 61
#> D...4 0.4965520 0.2462500 64
#> D...5 0.5182479 0.2860603 48
#> D...6 0.4891039 0.2646368 62
#> D...7 0.4953732 0.2803684 62
cowplot::plot_grid(plotlist = lapply(dre$plots$disthex, function(w) w +
theme(axis.title = element_text(size = 7))), ncol = 2)
dreval
relies heavily on the calculation of pairwise
distances among samples, and is consequently computationally demanding
for large data sets. The dreval()
function allows for
subsampling a smaller number of samples, as well as selecting a subset
of the variables, to speed up calculations and reduce the memory
footprint. Here we illustrate that for the example data set, the
calculated metrics are stable under subsampling of cells.
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(ggplot2)
})
set.seed(1)
nS <- c(100, 400, 1000, 1500, 2700)
v <- lapply(nS, function(n) {
if (n == 2700) nR <- 1
else nR <- 2
lapply(seq_len(nR), function(i) {
message(n, " samples, replicate ", i)
dreval(sce = pbmc3ksub, refAssay = "logcounts", dimReds = NULL,
nSamples = n, kTM = 50, verbose = FALSE)$scores %>%
dplyr::mutate(nSamples = n, replicate = i)
})
})
vv <- do.call(dplyr::bind_rows, v)
vv <- vv %>%
tidyr::gather(
key = "measure", value = "value",
-Method, -dimensionality, -nSamples, -replicate
)
suppressWarnings(
ggplot(vv, aes(x = nSamples, y = value, color = Method)) +
geom_point(size = 3) + facet_wrap(~ measure, scales = "free_y") +
geom_smooth(se = FALSE) + theme_bw()
)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 dreval_0.1.5 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 gtable_0.3.6
#> [3] xfun_0.49 bslib_0.8.0
#> [5] sparsesvd_0.2-2 Biobase_2.67.0
#> [7] lattice_0.22-6 vctrs_0.6.5
#> [9] tools_4.4.2 generics_0.1.3
#> [11] stats4_4.4.2 tibble_3.2.1
#> [13] fansi_1.0.6 cluster_2.1.6
#> [15] pkgconfig_2.0.3 Matrix_1.7-1
#> [17] S4Vectors_0.45.1 lifecycle_1.0.4
#> [19] GenomeInfoDbData_1.2.13 farver_2.1.2
#> [21] compiler_4.4.2 iotools_0.3-5
#> [23] munsell_0.5.1 coRanking_0.2.5
#> [25] GenomeInfoDb_1.43.0 htmltools_0.5.8.1
#> [27] sys_3.4.3 buildtools_1.0.0
#> [29] sass_0.4.9 yaml_2.3.10
#> [31] hexbin_1.28.5 pillar_1.9.0
#> [33] crayon_1.5.3 jquerylib_0.1.4
#> [35] tidyr_1.3.1 MASS_7.3-61
#> [37] SingleCellExperiment_1.29.1 DelayedArray_0.33.1
#> [39] cachem_1.1.0 abind_1.4-8
#> [41] tidyselect_1.2.1 digest_0.6.37
#> [43] dplyr_1.1.4 purrr_1.0.2
#> [45] labeling_0.4.3 maketools_1.3.1
#> [47] cowplot_1.1.3 fastmap_1.2.0
#> [49] grid_4.4.2 colorspace_2.1-1
#> [51] cli_3.6.3 SparseArray_1.7.1
#> [53] magrittr_2.0.3 S4Arrays_1.7.1
#> [55] utf8_1.2.4 withr_3.0.2
#> [57] UCSC.utils_1.3.0 scales_1.3.0
#> [59] XVector_0.47.0 httr_1.4.7
#> [61] matrixStats_1.4.1 evaluate_1.0.1
#> [63] knitr_1.49 GenomicRanges_1.59.0
#> [65] IRanges_2.41.0 rlang_1.1.4
#> [67] Rcpp_1.0.13-1 wordspace_0.2-8
#> [69] glue_1.8.0 BiocGenerics_0.53.2
#> [71] jsonlite_1.8.9 R6_2.5.1
#> [73] MatrixGenerics_1.19.0 zlibbioc_1.52.0