Title: | Evaluate Reduced Dimension Representations |
---|---|
Description: | Evaluate and compare multiple reduced dimension representations, based on how well they retain structure from the original data set. |
Authors: | Charlotte Soneson [aut, cre] |
Maintainer: | Charlotte Soneson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.5 |
Built: | 2024-11-14 05:17:40 UTC |
Source: | https://github.com/csoneson/dreval |
Calculates a collection of metrics comparing one or more reduced dimension
representations to a reference representation. The function takes a
SingleCellExperiment object as input. The reference representation can
be either one of the included assays or one of the reduced dimension
representations. If an assay is used, reference distances can be calculated
based on all or a subset of the features (rows). These distances are then
compared to distances calculated from the specified reduced dimension
representations, and several scores are returned. The execution time of the
function depends strongly on both the number of retained variables (which
affects the distance calculation in the reference space) and the number of
samples that are randomly selected to use as the basis for the comparison.
Since subsampling of the columns (via the nSamples argument) is
random, setting the random seed is recommended to obtain reproducible
results.
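The effect of fixing the seed can be illustrated without the package itself. The following is a minimal base-R sketch, assuming the subsampling draws on R's standard random number generator; the variable names and the seed value are hypothetical and chosen for illustration only.

```r
# Toy stand-in for the number of columns (cells) in the input object.
nCells <- 2700
nSamples <- 150

# Fixing the seed before sampling makes the selected columns reproducible.
set.seed(123)
keep1 <- sample(seq_len(nCells), nSamples)
set.seed(123)
keep2 <- sample(seq_len(nCells), nSamples)
identical(keep1, keep2)  # TRUE

# Without resetting the seed, a new draw generally selects other columns.
keep3 <- sample(seq_len(nCells), nSamples)
```

In practice this would amount to calling set.seed() with a value of your choice immediately before dreval(sce = pbmc3ksub, nSamples = 150).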
dreval(
  sce,
  dimReds = NULL,
  refType = "assay",
  refAssay = "logcounts",
  refDimRed = NULL,
  features = NULL,
  nSamples = NULL,
  distNorm = "none",
  refDistMethod = "euclidean",
  kTM = c(10, 100),
  labelColumn = NULL,
  verbose = FALSE
)
sce |
A SingleCellExperiment object. |
dimReds |
A character vector with the names of the reduced dimension
representations from |
refType |
A character scalar, either "assay" or "dimred", specifying
whether to use an assay or a reduced dimension representation of |
refAssay |
A character scalar giving the name of the assay from
|
refDimRed |
A character scalar specifying the reduced dimension
representation to use as the reference data representation if
|
features |
A character vector giving the IDs of the features to use for
distance calculations from the chosen assay. Will be matched to the row
names of |
nSamples |
A numeric scalar, giving the number of columns to subsample
(randomly) from |
distNorm |
A character scalar indicating how the distance vectors in the reference and low-dimensional spaces should be normalized before they are compared. If set to "l2", the vectors are L2 normalized. If set to "median", they are divided by the median value times the square root of their length. If set to any other value, they are divided by the square root of their length, to avoid metrics scaling with the number of retained samples. |
refDistMethod |
A character scalar defining the distance measure to use in the reference space. Must be one of "euclidean", "manhattan", "maximum", "canberra" or "cosine". The distance in the low-dimensional representation will always be Euclidean. |
kTM |
An integer vector giving the number of neighbors to use for trustworthiness, continuity and Jaccard index calculations. |
labelColumn |
A character scalar defining a column of
|
verbose |
A logical scalar, indicating whether to print out progress messages. |
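The three normalization modes described for distNorm can be sketched in base R as follows. This is one reading of the argument description, not the package's actual code: normalizeDists is a hypothetical helper, the toy data are simulated, and the Sammon stress line uses the standard textbook formula.

```r
# Hypothetical helper mirroring the three modes described for distNorm.
normalizeDists <- function(d, distNorm = "none") {
  d <- as.vector(d)
  if (distNorm == "l2") {
    d / sqrt(sum(d^2))                        # unit L2 norm
  } else if (distNorm == "median") {
    d / (stats::median(d) * sqrt(length(d)))  # median times sqrt(length)
  } else {
    d / sqrt(length(d))                       # any other value, e.g. "none"
  }
}

# Toy data: 50 samples in rows (in a SingleCellExperiment, samples are columns).
set.seed(1)
x <- matrix(rnorm(50 * 20), nrow = 50)
ref <- dist(x, method = "euclidean")        # reference space, all 20 features
low <- dist(x[, 1:2], method = "euclidean") # stand-in low-dimensional space

refN <- normalizeDists(ref, "l2")
lowN <- normalizeDists(low, "l2")

# Euclidean distance between the two normalized distance vectors.
euclDistBetweenDists <- sqrt(sum((refN - lowN)^2))

# Sammon stress on the same normalized distances (standard formula).
sammonStress <- sum((refN - lowN)^2 / refN) / sum(refN)
```

Dividing by the square root of the vector length keeps the Euclidean distance between the two vectors from growing simply because more samples, and hence more pairwise distances, are retained.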
The following metrics are calculated:
SpearmanCorrDist - The Spearman correlation between the reference distances and the Euclidean distances in the low-dimensional representation. Ranges from -1 to 1, higher values are better.
PearsonCorrDist - The Pearson correlation between the reference distances and the Euclidean distances in the low-dimensional representation. Ranges from -1 to 1, higher values are better.
KSstatDist - The Kolmogorov-Smirnov statistic comparing the distribution of distances in the reference space and in the low-dimensional representation. Ranges from 0 to 1, lower values are better.
EuclDistBetweenDists - The Euclidean distance between the vector of
distances in the reference space and those in the low-dimensional
representation. Depending on the value of distNorm, distances are
scaled before they are compared. Lower values are better.
SammonStress - The Sammon stress (Sammon 1969). Depending on the
value of distNorm, distances are scaled before they are compared.
Lower values are better.
Trustworthiness_kNN - The trustworthiness score (Venna & Kaski 2001), using NN nearest neighbors, where NN is one of the values given in kTM. The trustworthiness indicates to which degree we can trust that the points placed closest to a given sample in the low-dimensional representation are also close to the sample in the reference space. Ranges from 0 to 1, higher values are better.
Continuity_kNN - The continuity score (Venna & Kaski 2001), using NN nearest neighbors, where NN is one of the values given in kTM. The continuity indicates to which degree we can trust that the points closest to a given sample in the reference space are also placed close to the sample in the low-dimensional representation. Ranges from 0 to 1, higher values are better.
MeanJaccard_kNN - The mean Jaccard index (over all samples), comparing the set of NN nearest neighbors in the reference space with the corresponding set in the low-dimensional representation, where NN is one of the values given in kTM. Ranges from 0 to 1, higher values are better.
MeanSilhouette_X - If a labelColumn X is supplied, the mean
silhouette score (Rousseeuw 1987) across all samples, with the grouping
given by this column and the distances obtained from the low-dimensional
representation. Ranges from -1 to 1, higher values are better.
coRankingQlocal - Q_local, defined as the average LCMC (local continuity meta-criterion) over the values to the left of the maximum, following the dimRed/coRanking package implementations (Kraemer et al. 2018, Lee and Verleysen 2009, Chen and Buja 2009). Measures the preservation of local distances, higher values are better.
coRankingQglobal - Q_global, defined as the average LCMC over the values to the right of the maximum, following the dimRed/coRanking package implementations (Kraemer et al. 2018, Lee and Verleysen 2009, Chen and Buja 2009). Measures the preservation of global distances, higher values are better.
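To make some of the metrics above concrete, here is a minimal base-R sketch on simulated data. It follows the textbook definitions of the distance correlations, the Kolmogorov-Smirnov statistic, the mean Jaccard index and trustworthiness; it is not taken from the package source, and the helper knn and all variable names are hypothetical.

```r
set.seed(1)
x <- matrix(rnorm(60 * 10), nrow = 60)  # toy data: 60 samples (rows), 10 features
low <- x[, 1:2]                         # stand-in low-dimensional representation

dRef <- as.matrix(dist(x))              # reference distances
dLow <- as.matrix(dist(low))            # Euclidean distances in the low-dim space
vRef <- dRef[lower.tri(dRef)]
vLow <- dLow[lower.tri(dLow)]

# Correlation between reference and low-dimensional distances.
spearmanCorrDist <- cor(vRef, vLow, method = "spearman")
pearsonCorrDist <- cor(vRef, vLow, method = "pearson")

# Two-sample Kolmogorov-Smirnov statistic between the distance distributions.
ksStatDist <- unname(suppressWarnings(stats::ks.test(vRef, vLow))$statistic)

# Indices of the k nearest neighbors of each sample, excluding the sample itself.
knn <- function(d, k) {
  lapply(seq_len(nrow(d)), function(i) setdiff(order(d[i, ]), i)[seq_len(k)])
}
k <- 10
nnRef <- knn(dRef, k)
nnLow <- knn(dLow, k)

# Mean Jaccard index of the two neighbor sets, averaged over all samples.
meanJaccard <- mean(mapply(function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}, nnRef, nnLow))

# Trustworthiness (Venna & Kaski 2001): penalize points that are among the
# k nearest neighbors in the low-dim space but not in the reference space,
# weighted by how far down the reference ranking they actually are.
n <- nrow(dRef)
rankRef <- t(apply(dRef, 1, rank, ties.method = "first")) - 1  # self has rank 0
penalty <- sum(sapply(seq_len(n), function(i) {
  intruders <- setdiff(nnLow[[i]], nnRef[[i]])
  sum(rankRef[i, intruders] - k)
}))
trustworthiness <- 1 - 2 / (n * k * (2 * n - 3 * k - 1)) * penalty
```

Continuity follows the same pattern with the roles of the two spaces swapped, that is, ranking in the low-dimensional space and penalizing reference neighbors that are missing from the low-dimensional neighborhood.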
A list with two elements:
scores - A data.frame with values of all evaluation metrics,
across the dimension reduction methods. In addition to the metrics, it
contains the dimensionality of the respective reduced dimension
representations, and the value of K giving the highest value of LCMC (used
for the calculations of Qlocal and Qglobal, see Kraemer et al. 2018, Lee and
Verleysen 2009, Chen and Buja 2009).
plots - A list of ggplot objects, representing diagnostic plots.
Charlotte Soneson
Venna J., Kaski S. (2001). Neighborhood preservation in nonlinear projection methods: An experimental study. In Dorffner G., Bischof H., Hornik K., editors, Proceedings of ICANN 2001, pp 485–491. Springer, Berlin.
Lee J.A., Verleysen M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing 72 (7-9):1431-1443.
Chen L., Buja A. (2009). Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association 104:209-219.
Kraemer G., Reichstein M., Mahecha M.D. (2018). dimRed and coRanking - Unifying dimensionality reduction in R. The R Journal 10 (1):342-358.
Sammon J.W. Jr (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers C-18 (5):401-409.
Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53-65.
data(pbmc3ksub)
dre <- dreval(sce = pbmc3ksub, nSamples = 150)
dreval evaluates and compares multiple reduced dimension representations, based on how well they retain structure from the original data set.
This data set contains expression profiles for 2,700 PBMCs. The original data set was obtained from the TENxPBMCData package (pbmc3k data set). This data set was subset to around 1,800 highly expressed genes, normalized using the scran package, and several dimensionality reduction methods were applied.
Charlotte Soneson
For each metric, rank the evaluated reduced dimension representations by performance, and plot a summary of the overall ranking. Metrics evaluating local and global structure preservation are colored in red and blue, respectively.
plotRankSummary(
  dreSummary,
  metrics = NULL,
  sortBars = "decreasing",
  scoreType = "rank",
  tiesMethod = "average"
)
dreSummary |
A |
metrics |
A character vector with the metrics to include in the summary.
Must be a subset of the column names of |
sortBars |
A character scalar indicating whether/how to sort the bars in the output. Either "decreasing", "increasing" or "none" (in which case the input order will be used). |
scoreType |
A character scalar indicating what type of values to show in
the plot. Either "rank" or "rescale". If set to "rank", the representations
will be ranked for each metric (with the best one assigned the highest
rank). If set to "rescale", the scores for each metric will first, if
necessary, be inverted so that a high (positive) value corresponds to
better performance, and then be linearly rescaled, mapping the lowest score
to 1 and the highest to P, where P is the number of evaluated
representations. If the original scores are approximately equally spaced
between the highest and lowest observed values, this gives similar results
as setting |
tiesMethod |
A character scalar indicating how ties are handled if
|
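The difference between the two scoreType settings can be sketched in base R. The scores and representation names below are made up for illustration, and inverting a lower-is-better metric by negation is an assumption about one way to do it, not a statement about the package's internals.

```r
# Hypothetical scores for P = 4 representations on one metric (higher is better).
scores <- c(PCA = 0.90, TSNE = 0.60, UMAP = 0.65, MDS = 0.10)
P <- length(scores)

# scoreType = "rank": the best representation receives the highest rank.
ranked <- rank(scores, ties.method = "average")

# scoreType = "rescale": linear map sending the lowest score to 1 and the
# highest score to P.
rescaled <- 1 + (scores - min(scores)) / (max(scores) - min(scores)) * (P - 1)

# For a metric where lower is better (e.g. KSstatDist), invert before ranking.
ksScores <- c(PCA = 0.05, TSNE = 0.30, UMAP = 0.25, MDS = 0.60)
rankedKs <- rank(-ksScores, ties.method = "average")
```

Here TSNE and UMAP receive ranks 2 and 3, but rescaled values of 2.875 and 3.0625, so rescaling preserves the fact that their original scores are close together while ranking does not.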
Nothing is returned, but a plot is generated.
Charlotte Soneson
data(pbmc3ksub)
dre <- dreval(sce = pbmc3ksub, nSamples = 150)
plotRankSummary(dre$scores)