HotellingEllipse

Christian L. Goueguel

This package is specifically designed to help draw Hotelling’s T-squared ellipses on PCA or PLS score scatterplots, a crucial tool in chemometrics for multivariate data analysis and quality control. In the field of chemometrics, these ellipses serve as powerful visual aids for measurements assessment, outlier detection, and process monitoring. By superimposing Hotelling’s T-squared ellipses onto score plots, analysts can quickly identify abnormal samples and establish well-defined confidence regions for normal process operations.

The package’s functionality enables computing the Hotelling’s T-squared statistic, the semi-minor and semi-major axes of the Hotelling’s T-squared ellipse, and generates the coordinate points necessary for constructing a confidence ellipse. Specifically, the package provides two primary functions:

Data

library(HotellingEllipse)
data("specData", package = "HotellingEllipse")

Principal component analysis

In this example, we use FactoMineR::PCA() to perform the Principal Component Analysis (PCA) on a LIBS spectral dataset specData, and extract the PCA scores.

set.seed(002)
pca_mod <- specData %>%
  select(where(is.numeric)) %>%
  PCA(scale.unit = FALSE, graph = FALSE)
pca_scores <- pca_mod %>%
  pluck("ind", "coord") %>%
  as_tibble() %>%
  print()
#> # A tibble: 100 × 5
#>      Dim.1    Dim.2  Dim.3   Dim.4 Dim.5
#>      <dbl>    <dbl>  <dbl>   <dbl> <dbl>
#>  1 25306.  -10831.  -1851.   -83.4 -560.
#>  2   -67.3   1137.  -2946.  2495.  -568.
#>  3 -1822.     -22.0 -2305.  1640.  -409.
#>  4 -1238.    3734.   4039. -2428.   379.
#>  5  3299.    4727.   -888. -1089.   262.
#>  6  5006.     -49.5  2534.  1917.  -970.
#>  7 -8325.   -5607.    960. -3361.   103.
#>  8 -4955.   -1056.   2510.  -397.  -354.
#>  9 -1610.    1271.  -2556.  2268.  -760.
#> 10 19582.    2289.    886.  -843.  1483.
#> # ℹ 90 more rows

Hotelling’s T-Squared Statistic

The Hotelling’s T-squared statistic can be computed using the ellipseParam function in two distinct ways. (1) By specifying the parameter k: This represents the number of principal components to retain. (2) By setting the threshold parameter: This defines the cumulative explained variance to determine the number of components.

Calculate using the first k components

T2 <- ellipseParam(pca_scores, k = 3)$Tsquare$value
T2
#>   [1] 8.79727659 0.42574446 0.26720102 1.12758053 0.73815225 0.50178543
#>   [7] 1.48895076 0.52248052 0.35995391 3.43505326 2.20554170 0.14665580
#>  [13] 1.23019477 1.08187330 0.06404802 0.82350867 4.12091044 0.18901522
#>  [19] 0.74937040 1.81186352 1.01387374 0.36322618 0.18201057 1.19587478
#>  [25] 0.82387604 0.59000025 1.64384908 1.35023679 0.04035757 1.07207307
#>  [31] 1.35289851 0.89287224 0.58566878 0.51087876 0.33557545 0.35322970
#>  [37] 2.75957814 0.73510967 0.42097537 0.65499771 0.41084500 0.26551834
#>  [43] 0.48229565 0.36048584 0.25235451 0.76994454 2.93735077 0.07159629
#>  [49] 0.15337115 0.09562119 0.12416659 0.41598605 0.38472295 0.21898479
#>  [55] 0.20860524 1.26153407 0.26389302 1.35227317 0.05781487 0.65100278
#>  [61] 2.61898699 0.94048980 1.60881058 1.02169526 0.41282496 0.27235598
#>  [67] 0.14854682 0.66617260 2.82560083 0.12785702 1.25327046 0.66607730
#>  [73] 1.22850658 1.13312107 1.46864227 1.46171265 0.10436499 0.26006454
#>  [79] 0.97128044 1.92028425 1.95885654 0.36084958 4.84793496 0.24714317
#>  [85] 0.55723720 0.64080960 1.87718652 0.14398651 0.50051409 0.43680390
#>  [91] 0.76213175 0.58130268 1.48307214 0.55700045 0.17208837 0.11091915
#>  [97] 0.76406153 0.66296513 0.11773916 1.40404240
T2 <- ellipseParam(pca_scores, k = 5)$Tsquare$value
T2
#>   [1] 5.35198031 0.80483847 0.41280666 1.09396270 0.54357239 1.05866612
#>   [7] 1.54758552 0.38932531 0.85084455 3.33998923 1.33661709 2.67345992
#>  [13] 1.44052380 0.74554569 0.09716023 0.63692986 3.03905923 0.14527271
#>  [19] 1.65973096 2.03032292 0.68825673 0.70721780 1.05449654 0.95268942
#>  [25] 0.67437285 0.51832530 1.60329970 1.65265923 0.46804197 1.61273965
#>  [31] 0.90591680 0.83720307 0.57655751 1.27928674 0.40830752 0.25568007
#>  [37] 1.92746313 0.43666648 0.42052805 0.53674167 0.33326678 0.31518697
#>  [43] 0.53420621 0.28056547 0.39345409 1.31456541 2.34976600 0.21583160
#>  [49] 0.19463398 0.06499698 0.76421962 0.82346628 0.60492739 0.25441618
#>  [55] 0.44826694 0.75245708 0.84507014 0.97855109 0.25134633 0.68527571
#>  [61] 2.26652396 1.59851453 1.12427685 0.84476325 0.25321959 0.44806235
#>  [67] 0.74685450 0.45892445 2.81481934 0.49094953 0.80909145 0.57714113
#>  [73] 0.77330414 0.74629051 1.84932002 0.88840206 0.46801132 0.33977349
#>  [79] 0.58104753 1.35207024 1.73787700 0.51984216 2.91262151 0.20476197
#>  [85] 0.70355158 0.55268503 1.63351195 0.24424689 0.55678461 1.86446207
#>  [91] 0.45875232 0.39755343 0.98469700 0.55441164 0.20947523 0.63142698
#>  [97] 0.71018219 0.51824037 0.17165582 0.88278586

Calculate using the cumulative variance threshold

T2 <- ellipseParam(pca_scores, threshold = 0.80)$Tsquare$value
T2
#>   [1] 13.098377222  0.053558139  0.042763017  0.596881084  1.064949043
#>   [6]  0.322634358  2.193588461  0.362201444  0.100269823  5.152202188
#>  [11]  0.136476512  0.187288163  0.457018299  0.836596116  0.037948942
#>  [16]  0.499423175  6.201798003  0.283938583  1.053279861  2.176093882
#>  [21]  1.495054754  0.328273167  0.001077433  0.385735197  0.770015903
#>  [26]  0.855099441  1.918512180  1.898354875  0.059730859  1.544579259
#>  [31]  2.032241146  1.336688195  0.733955470  0.192465562  0.189870642
#>  [36]  0.228768334  2.513724266  0.227842165  0.596628858  0.794052585
#>  [41]  0.579569949  0.180495608  0.375303017  0.101758464  0.290449128
#>  [46]  1.151911637  0.616332156  0.016266200  0.230869862  0.026046210
#>  [51]  0.181580829  0.187191855  0.318096039  0.049303079  0.301571988
#>  [56]  1.849456533  0.093358887  0.338259448  0.054705074  0.865871257
#>  [61]  3.419405506  0.932424618  1.363364796  0.870186313  0.512986568
#>  [66]  0.075112028  0.016206207  0.195475035  3.986477406  0.142755791
#>  [71]  0.379374196  0.809324698  1.859972412  0.058403575  1.958324315
#>  [76]  0.060646515  0.131828892  0.155979037  0.731558582  2.704001013
#>  [81]  0.243720099  0.546490035  6.914605380  0.369950255  0.425961362
#>  [86]  0.376793369  0.040293609  0.146837113  0.524107524  0.652320554
#>  [91]  0.013362935  0.319746716  1.968145064  0.729551769  0.214084611
#>  [96]  0.093838067  0.917982030  0.158779544  0.089226394  2.122042242
T2 <- ellipseParam(pca_scores, threshold = 0.95)$Tsquare$value
T2
#>   [1] 6.53045498 0.77985914 0.39875666 1.27633910 0.63623773 0.64621852
#>   [7] 1.94706323 0.39955561 0.65027891 2.60270069 1.68777974 2.33832862
#>  [13] 1.73627591 0.89148735 0.09154971 0.61396625 3.74797974 0.14792815
#>  [19] 1.16098653 1.37203684 0.75262904 0.74133604 0.39917534 1.18375135
#>  [25] 0.84423117 0.49281117 2.02255319 1.09460844 0.09591725 0.96401329
#>  [31] 1.02350705 1.05620821 0.58298463 1.26807473 0.51557341 0.27067599
#>  [37] 2.41734914 0.54565033 0.47149967 0.56074231 0.36897968 0.39285391
#>  [43] 0.52187589 0.26763215 0.48000911 1.01010296 2.54789673 0.16441988
#>  [49] 0.11408734 0.07745065 0.89519603 1.01950514 0.59274496 0.25875285
#>  [55] 0.50533086 0.94128973 1.06662628 1.21217577 0.13332013 0.49597035
#>  [61] 2.49744440 1.05004728 1.20861993 0.82193478 0.31977530 0.36498547
#>  [67] 0.91019140 0.50545945 2.13281990 0.27229890 0.93957749 0.72809042
#>  [73] 0.94080967 0.89232795 1.12148274 1.12208355 0.58120095 0.24318126
#>  [79] 0.72706291 1.70413391 1.59117880 0.26850066 3.67343823 0.25443017
#>  [85] 0.75731684 0.48234231 1.94290842 0.11555869 0.37941945 1.20420583
#>  [91] 0.56964432 0.45545274 1.11992496 0.57298359 0.14954172 0.75681961
#>  [97] 0.89556581 0.49625512 0.14666752 1.05902198

Hotelling’s T-Squared Ellipse

Calculating the semi-axes

To visualize the confidence region for our multivariate data, we employ the ellipseParam() function to generate a confidence ellipse. Our objective is to calculate the lengths of the semi-axes for this ellipse, focusing on the bivariate relationship within the PC2-PC4 subspace of our principal component analysis. We maintain the default value for k (number of components) at 2. This ensures we’re working with a two-dimensional representation. We specify pcx = 2 and pcy = 4 as inputs. This directs the function to use the 2nd and 4th principal components for the x and y axes, respectively.

ellipse_axes <- ellipseParam(pca_scores, pcx = 2, pcy = 4)
str(ellipse_axes)
#> List of 5
#>  $ Tsquare     : tibble [100 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ value: num [1:100] 13.0984 0.0536 0.0428 0.5969 1.0649 ...
#>  $ cutoff.99pct: num 9.76
#>  $ cutoff.95pct: num 6.24
#>  $ nb.comp     : num 2
#>  $ Ellipse     : tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
#>   ..$ a.99pct: num 10800
#>   ..$ b.99pct: num 5634
#>   ..$ a.95pct: num 8639
#>   ..$ b.95pct: num 4506

We can extract parameters for further use:

a1 <- ellipse_axes %>% pluck("Ellipse", "a.99pct")
b1 <- ellipse_axes %>% pluck("Ellipse", "b.99pct")
a2 <- ellipse_axes %>% pluck("Ellipse", "a.95pct")
b2 <- ellipse_axes %>% pluck("Ellipse", "b.95pct")
Tsq <- ellipse_axes %>% pluck("Tsquare", "value")
pca_scores %>%
  ggplot(aes(x = Dim.2, y = Dim.4)) +
  geom_ellipse(aes(x0 = 0, y0 = 0, a = a1, b = b1, angle = 0), linewidth = .5, linetype = "solid", fill = "white") + 
  geom_ellipse(aes(x0 = 0, y0 = 0, a = a2, b = b2, angle = 0), linewidth = .5, linetype = "dashed", fill = "white") +
  geom_point(aes(fill = Tsq), shape = 21, size = 3, color = "black") +
  scale_fill_viridis_c(option = "viridis") +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
  geom_vline(xintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
  labs(
    title = "Scatterplot of PCA scores", 
    subtitle = "PC1 vs. PC3", 
    x = "PC2", 
    y = "PC4", 
    fill = "T2", 
    caption = "Figure 1"
    ) +
  theme_grey()
plot of chunk unnamed-chunk-21

plot of chunk unnamed-chunk-21

Calculating the x-y coordinates

An alternative method for visualizing the Hotelling’s T-squared confidence region is to utilize the ellipseCoord() function. This function generates the x and y coordinates for plotting the confidence ellipse, offering greater flexibility in visualization and further analysis. It allows users to specify a custom confidence level via the confi.limit parameter. The default confidence level is set at 95%, which is commonly used in statistical analyses. The function returns a set of coordinates that define the ellipse’s boundary in the chosen subspace, thereby complementing the semi-axes information provided by the ellipseParam() function.

In the example below, we focus on the subspace spanned by the 1st and 5th components.

xy_coord <- ellipseCoord(pca_scores, pcx = 1, pcy = 5, conf.limit = 0.975, pts = 500)
str(xy_coord)
#> tibble [500 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ x: num [1:500] 17253 17252 17248 17241 17231 ...
#>  $ y: num [1:500] -5.97e-13 2.01e+01 4.02e+01 6.04e+01 8.05e+01 ...
ggplot() +
  geom_polygon(data = xy_coord, aes(x, y), color = "black", fill = "white") +
  geom_point(data = pca_scores, aes(x = Dim.1, y = Dim.5), shape = 21, size = 5, fill = "blue", color = "black", alpha = 1/2) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
  geom_vline(xintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
  labs(
    title = "Scatterplot of PCA scores", 
    subtitle = "PC1 vs. PC5", 
    x = "PC1", 
    y = "PC5", 
    caption = "Figure 2"
    ) +
  theme_grey()
plot of chunk unnamed-chunk-24

plot of chunk unnamed-chunk-24

Hotelling’s T-Squared Ellipsoid

The ellipseCoord function offers an optional parameter pcz. When activated, this parameter extends the function’s capabilities from two-dimensional ellipses to three-dimensional ellipsoids, providing a more comprehensive visualization of multivariate data, particularly in situation where the first two principal components do not adequately capture the data’s variability.

Calculating the x-y-z coordinates

In the example below, the resulting 3D Hotelling’s T-squared ellipsoid serves as a volumetric confidence region in the subspace spanned by the 1st, 2nd, and 3rd components. It encapsulates a specified proportion of data points, determined by the confidence level, providing a more holistic view of the data’s distribution and outliers in three dimensions.

xyz_coord <- ellipseCoord(pca_scores, pcx = 1, pcy = 2, pcz = 3, conf.limit = 0.95, pts = 100)
str(xyz_coord)
#> tibble [10,000 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ x: num [1:10000] -2.32e-13 -2.32e-13 -2.32e-13 -2.32e-13 -2.32e-13 ...
#>  $ y: num [1:10000] 6.93e-13 6.93e-13 6.93e-13 6.93e-13 6.93e-13 ...
#>  $ z: num [1:10000] 7745 7745 7745 7745 7745 ...
T2 <- ellipseParam(pca_scores, k = 3)$Tsquare$value
color_palette <- viridisLite::viridis(nrow(pca_scores))
scaled_T2 <- scales::rescale(T2, to = c(1, nrow(pca_scores)))
point_colors <- color_palette[round(scaled_T2)]
rgl::setupKnitr(autoprint = TRUE)
rgl::plot3d(
  x = xyz_coord$x, 
  y = xyz_coord$y, 
  z = xyz_coord$z,
  xlab = "PC1", 
  ylab = "PC2", 
  zlab = "PC3",
  type = "l", 
  lwd = 0.5,
  col = "lightgray",
  alpha = 0.5)
rgl::points3d(
  x = pca_scores$Dim.1, 
  y = pca_scores$Dim.2, 
  z = pca_scores$Dim.3, 
  col = point_colors,
  size = 5,
  add = TRUE)
rgl::bgplot3d({
    par(mar = c(0,0,0,0))
    plot.new()
    color_legend <- as.raster(matrix(rev(color_palette), ncol = 1))
    rasterImage(color_legend, 0.85, 0.1, 0.9, 0.9)
    text(
      x = 0.92, 
      y = seq(0.1, 0.9, length.out = 5), 
      labels = round(seq(min(T2), max(T2), length.out = 5), 2),
      cex = 0.7)
    text(x = 0.92, y = 0.95, labels = "T2", cex = 0.8)})
rgl::view3d(theta = 30, phi = 25, zoom = .8)