Multivariate Dimension Reduction Workshop

Author

Alex Shaw

About this workflow

The title of this project is R Workshop - Multivariate Statistical Analysis I: Dimension Reduction. The slides for the presentation accompanying this workflow can be found here.

In this workflow we focus on practical data analysis by presenting statistical workflows applicable in any software for dimension reduction. The R code used to create output is also included.

In order to re-run this .qmd file, either open it in RStudio and press "Render" or, if you want both the .pdf and the .html file to be produced, run the following code with this .qmd file present in your working directory:

Code
quarto::quarto_render("multivariate_dimension_reduction_workflow.qmd", output_format = "all")

Annotations and explanations of this software workflow are still being moved across from PowerPoint format. In the meantime, the additional material (https://sydney-informatics-hub.github.io/stats-resources/specialist_statistics.html#multivariate-statistical-analysis-1-dimension-reduction) will aid your interpretation and understanding of examples 2 and 3.

Acknowledgements

If you use this workflow, please don’t forget to acknowledge us. You can use the following sentence:

The authors acknowledge the Statistical Workshops and Workflows provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney.

We acknowledge Professor François Husson (https://husson.github.io/) for providing the datasets and elegant explanations of dimension reduction methods that formed the basis of our workshop examples. We also acknowledge Murray’s R Resources (https://www.flutterbys.com.au/stats/) for the NMDS example.

R setup

Before we delve into the content, we need to load several libraries. If you have not installed them yet, install each one first with install.packages("library_name").
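
For convenience, here is a one-off sketch that installs any of the packages used in this workflow that are missing (the package names are taken from the library() calls below):

Code
# Install any packages used in this workflow that are not already present
# (does nothing if everything is installed)
pkgs <- c("tibble", "magrittr", "ggplot2", "dplyr", "FactoMineR",
          "factoextra", "tourr", "tidyr", "GGally", "stringr",
          "ggmosaic", "vcd", "readr", "vegan", "shape",
          "gganimate", "gifski")
install.packages(setdiff(pkgs, rownames(installed.packages())))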

Code
suppressPackageStartupMessages({
  library("tibble")
  library("magrittr")
  library("ggplot2")
  library("dplyr")
  library("FactoMineR")
  library("factoextra")
  library("tourr")
  library("tidyr")
  library("GGally")
  library("stringr")
  library("ggmosaic")
  library("vcd")
  library("readr")
  library("vegan")
  library("shape")
  })

# Optional to produce tourr animations
suppressPackageStartupMessages({
  library("gganimate")
  library("gifski")
  })

# GGplot theme
theme_set(theme_bw())

# No warnings
options(warn = -1)

# Load the data
data(decathlon)

Dimension Reduction Theory

Here is some code to run the tours (created with the tourr package) shown in the workshop. These all use the decathlon example that is discussed below.

Code
# Take the 10 decathlon event variables (columns 1-10) and centre them
# (scale=F: the data are centred but not rescaled here)
d <- scale(decathlon[,1:10],center=T,scale=F)

# Create animate gifs for each of the tours
save_gif(animate_xy(d,tour_path=little_tour()),"little_tour.gif",delay=0.01,loop=F)
save_gif(animate_xy(d,tour_path=grand_tour()),"grand_tour.gif",delay=0.01,loop=F)
save_gif(animate_xy(d,tour_path=guided_tour(cmass())),"guided_tour.gif",delay=0.01,loop=F)

Example 1: PCA

In the first example we use PCA to examine the decathlon dataset. The decathlon data is a built-in dataset in the FactoMineR package. So once we have loaded FactoMineR, we load the dataset using the function below, and it gets stored in the decathlon variable.

Code
data(decathlon)

The data consists of 41 performances across 10 variables that make up the 10 events in the decathlon.

What are the 10 events of the decathlon? The first ten column names answer that; the last three columns (Rank, Points, Competition) are additional information, discussed below.

Code
names(decathlon)

100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Pole.vault, Javeline, 1500m, Rank, Points, Competition

Performance is measured in these different variables in two different ways:

  • Track events: seconds to complete distance
  • Field events: distance or height achieved in metres

Before we consider dimension reduction analysis, we should do our usual one-by-one EDA.

Step 0: Identify your variable types and perform appropriate Exploratory Data Analysis

All of the [performance] variables are continuous measurements. Let's look at what is in the decathlon data by looking at the first few rows:

Code
head(decathlon)
         100m Long.jump Shot.put High.jump  400m 110m.hurdle Discus Pole.vault
SEBRLE  11.04      7.58    14.83      2.07 49.81       14.69  43.75       5.02
CLAY    10.76      7.40    14.26      1.86 49.37       14.05  50.72       4.92
KARPOV  11.02      7.30    14.77      2.04 48.37       14.09  48.95       4.92
BERNARD 11.02      7.23    14.25      1.92 48.93       14.99  40.87       5.32
YURKOV  11.34      7.09    15.19      2.10 50.42       15.31  46.26       4.72
WARNERS 11.11      7.60    14.31      1.98 48.68       14.23  41.10       4.92
        Javeline 1500m Rank Points Competition
SEBRLE     63.19 291.7    1   8217    Decastar
CLAY       60.15 301.5    2   8122    Decastar
KARPOV     50.31 300.2    3   8099    Decastar
BERNARD    62.77 280.1    4   8067    Decastar
YURKOV     63.44 276.4    5   8036    Decastar
WARNERS    51.77 278.1    6   8030    Decastar

In addition to our performance variables we have additional information:

  • The rows are named for the athlete - note that row names are a special variable in R that is not technically part of the data frame but merely a label for each row
  • We have the Rank: the position the athlete finished in that competition
  • We have the Points they were awarded in that competition
  • We have the Competition they competed in

The row names being stored as a label is a bit annoying. For now let’s make it into a column in its own right:

Code
decathlon <- decathlon %>% rownames_to_column("Athlete") 

This dataset covers two competitions:

Code
decathlon %>% distinct(Competition)
  Competition
1    Decastar
2    OlympicG

Let’s make a new version of the dataset that just contains the performances. We will need to keep the name of the athlete for labeling of our PCA output later. We can turn this back into a row label when we need to input only the numeric variables in the PCA.

Code
decathlon_performances <- decathlon %>% dplyr::select(-c(Rank,Points,Competition))

head(decathlon_performances)
  Athlete  100m Long.jump Shot.put High.jump  400m 110m.hurdle Discus
1  SEBRLE 11.04      7.58    14.83      2.07 49.81       14.69  43.75
2    CLAY 10.76      7.40    14.26      1.86 49.37       14.05  50.72
3  KARPOV 11.02      7.30    14.77      2.04 48.37       14.09  48.95
4 BERNARD 11.02      7.23    14.25      1.92 48.93       14.99  40.87
5  YURKOV 11.34      7.09    15.19      2.10 50.42       15.31  46.26
6 WARNERS 11.11      7.60    14.31      1.98 48.68       14.23  41.10
  Pole.vault Javeline 1500m
1       5.02    63.19 291.7
2       4.92    60.15 301.5
3       4.92    50.31 300.2
4       5.32    62.77 280.1
5       4.72    63.44 276.4
6       4.92    51.77 278.1

Let’s look at how they are distributed using boxplots:

Code
# Transform the data for plotting
decathlon_performances_long <- pivot_longer(decathlon_performances,-Athlete,names_to="Event",values_to="Performance")

ggplot(decathlon_performances_long,aes(x=Event,y=Performance)) + geom_boxplot()

Wow, the distributions of these performances are very different for the different events. The 1500m has the highest values, which makes sense because it is a long-distance event with performance measured in seconds taken to complete the distance.

When we perform the PCA, the variables must be mean centred so that biases are not introduced. So we can start by doing that.

Code
decathlon_performances_centred <- scale(decathlon_performances %>% column_to_rownames('Athlete'),center=T,scale=F) 
decathlon_performances_centred_long <- pivot_longer(as.data.frame(decathlon_performances_centred),everything(),names_to="Event",values_to="Performance")
ggplot(decathlon_performances_centred_long,aes(x=Event,y=Performance)) + geom_boxplot()

Now that we have centred the values, we can see that there is a big difference in the amount of variation in each of the events. We are interested in patterns in overall performance, and in the decathlon each event has an equal weight. So we will standardise each event by dividing each value by the standard deviation for that event. This is a commonly used approach that gives each variable an equal influence over your PCA solution, but you may not always want to do this.

Code
decathlon_performances_centred_scaled <- scale(decathlon_performances %>% column_to_rownames('Athlete'),center=T,scale=T)

decathlon_performances_centred_scaled_long <- pivot_longer(as.data.frame(decathlon_performances_centred_scaled),everything(),names_to="Event",values_to="Performance")
ggplot(decathlon_performances_centred_scaled_long,aes(x=Event,y=Performance)) + geom_boxplot()

Let’s also take a look at a correlation and scatterplot matrix:

Code
ggpairs(decathlon_performances %>% dplyr::select(-Athlete))

The starred correlation coefficients show evidence of a non-zero correlation. We see evidence of a positive correlation between the track events (e.g. 100m and 400m), which makes sense. However, we also see strong evidence of a negative correlation between some field events and track events (e.g. 100m and Long jump). This is simply an artifact of how the performances are measured:

  • A bigger number is a good thing in long jump (you jumped further)
  • A bigger number is a bad thing in 100m (you took longer)

We decide to multiply all track times by -1 so that all events have the same ‘polarity’. This will make the output from the PCA easier to interpret, because the correlations among the inputs will have a consistent direction of effect. We should make sure to label this change to the variables by putting a minus sign in front of the name of each track event.

Code
# A note about this code: to use the tidyverse functions below (mutate and
# rename_with) we need to turn the centred, scaled matrix returned by scale()
# back into a data frame
decathlon_performances_track_reversed_centred_scaled <- 
  as.data.frame(decathlon_performances_centred_scaled) %>% 
  mutate(across(c("100m","400m","110m.hurdle","1500m"), ~ .x * -1)) %>% 
  rename_with(~str_c("-", .), c("100m","400m","110m.hurdle","1500m"))

Okay, we are now ready to run the PCA!

Step 1: Run dimension reduction analysis

Code
decathlon_PCA = PCA(decathlon_performances_track_reversed_centred_scaled,graph = F) 

Let’s examine the output:

Step 2: Examine the relationships between variables

Code
fviz_pca_var(decathlon_PCA)

Let’s throw a couple of supplementary variables in to aid our interpretation and rerun the PCA.

Code
# We will add these two supplementary variables as the first two columns, and then tell the PCA function that they have column index 1 and 2 by using quanti.sup=1:2
decathlon_PCA = PCA(cbind(decathlon[,c("Rank","Points")],decathlon_performances_track_reversed_centred_scaled),quanti.sup=1:2,graph=F)

fviz_pca_var(decathlon_PCA)

Interpreting the PCs as latent variables:

Clearly PC1 represents the overall performance of the athlete: it is essentially a weighted sum of all events, with all variables positively correlated. Those with higher PC1 scores have higher points and lower (better) rank. Despite explaining the most variability (33%), this is not particularly interesting as a latent variable.

PC2 appears to represent the specialisation of athletes. It is a contrast between events that involve sprinting and those that don't, which have oppositely signed correlations with PC2*. Sprint specialists achieve their points by running really fast; non-sprint specialists achieve most of their points by being good at the other, field-based events. Despite explaining about half as much of the variability as PC1 (17%), PC2 has a more interesting interpretation as a latent variable: if specialisation is present in decathletes, this PC will appear reliably in multiple samples.

*A technical note: the sign of the PC itself is arbitrary! What is not arbitrary is the pattern of signs of the correlations between the variables and the PC.
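
If you want to check this interpretation numerically, the correlations between each variable and the PCs are stored in the PCA result; here is a quick look using the $var$cor element of the FactoMineR output:

Code
# Correlations between each (transformed) variable and the first two PCs;
# the pattern of signs within each PC supports the interpretation above
round(decathlon_PCA$var$cor[, 1:2], 2)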

Step 3: Examine the relationships between subjects

Code
fviz_pca_ind(decathlon_PCA)

We can plot the individual observations in PC space, in what is often called the score plot: their PC scores for two dimensions at a time.

This is again a projection into PC space, but of the individual subjects rather than the variables.

Observations close to each other have similar values for the depicted PCs (but are not necessarily similar overall). Recall the latent variable interpretation of each PC.

In some examples (not this one) there may be distinct clusters, and you can perform various types of clustering on observations in the PC space.
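
As a sketch of what that could look like (purely illustrative here, since this dataset has no clear clusters), FactoMineR's HCPC function performs hierarchical clustering on the PC scores:

Code
# Hierarchical clustering on the principal components (illustrative only);
# nb.clust = -1 cuts the tree automatically at the suggested level
decathlon_clusters <- HCPC(decathlon_PCA, nb.clust = -1, graph = FALSE)
# Cluster assignment for each athlete
head(decathlon_clusters$data.clust$clust)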

Step 4: Further summarisation and interpretation

Code
fviz_screeplot(decathlon_PCA, ncp=5)

The scree plot helps you decide how many dimensions to consider. There are as many PCs as there were input variables (10), but by default the PCA function only keeps the first five. There are different approaches to choosing the number of dimensions, depending on the type of dataset you have and the domain of your problem.
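
For example, one common (though not universal) rule of thumb retains PCs with an eigenvalue greater than 1 when the input variables were standardised; the eigenvalues are stored in the PCA result:

Code
# Eigenvalue, percentage of variance, and cumulative percentage per PC
decathlon_PCA$eig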

A couple of other plots help you examine the PCs more closely. Firstly the quality of representation plot tells you how well each original variable’s information is preserved in the PCs. We can see that Pole vault is not well represented by the first two PCs.

Quality of representation plot:

Code
fviz_cos2(decathlon_PCA,choice="var",axes = 1:2)

Similarly, the contribution plot shows how much each variable contributes to each of the PCs (PC1 is shown below): considering the loadings of each variable onto the PC, how much does each contribute relative to the others? The reference line is at 10% because there were 10 input variables, so a variable above this line is ‘overrepresented’ in PC1 compared to the other variables.

Contribution of [original] variables plot

Code
fviz_contrib(decathlon_PCA,choice="var")

Example 2: Correspondence Analysis

Step 0: Perform appropriate EDA

The Nobel Prize data must be loaded from a (semicolon-delimited) csv file.

Code
Nobel <- read.table("data_CA_NobelPrize_withMaths.csv", 
           header=TRUE, sep=";", row.names=1, check.names=FALSE)

This is the complete dataset, but in this example we only use the G8 countries (the first 8 rows) and exclude the mathematics prize (the 7th column).

Code
Nobel_subset <- Nobel[1:8,1:6]

We can look at this contingency table as frequencies:

Code
Nobel_subset
        Chemistry Economic science Literature Medicine Peace Physics
Canada          4                3          2        4     1       4
France          8                3         11       12    10       9
Germany        24                1          8       18     5      24
Italy           1                1          6        5     1       5
Japan           6                0          2        3     1      11
Russia          4                3          5        2     3      10
UK             23                6          7       26    11      20
USA            51               43          8       70    19      66

Or proportions:

Code
prop.table(Nobel_subset) %>% round(2)
        Chemistry Economic science Literature Medicine Peace Physics
Canada       0.01             0.01       0.00     0.01  0.00    0.01
France       0.01             0.01       0.02     0.02  0.02    0.02
Germany      0.04             0.00       0.01     0.03  0.01    0.04
Italy        0.00             0.00       0.01     0.01  0.00    0.01
Japan        0.01             0.00       0.00     0.01  0.00    0.02
Russia       0.01             0.01       0.01     0.00  0.01    0.02
UK           0.04             0.01       0.01     0.05  0.02    0.04
USA          0.09             0.08       0.01     0.12  0.03    0.12

We can also visualise it as a mosaic plot using the ggmosaic package:

Code
Nobel_subset_tidy <- Nobel_subset %>% rownames_to_column("country") %>% pivot_longer(-country,names_to="prize")

ggplot(data=Nobel_subset_tidy) + geom_mosaic(aes(x=product(prize,country),weight=value,fill=prize)) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Before we proceed to the method, we can look at whether the prize and country categorical variables are independent overall.

The below plot is from the vcd package, which aids visualisation of multivariate categorical data. We see the same mosaic plot as above, but now shaded by Pearson residuals (the distance of each combination from what would be expected under independence). The tiny p-value shown at the bottom right is for a Chi-squared test of independence. On the basis of this test, there is evidence of a dependence between the two variables, which we can explore using correspondence analysis.

Code
vcd::mosaic(as.matrix((rev(Nobel_subset))),split_vertical=T,shade=T,labeling_args = list(rot_labels = c(top = 90, left = 0),abbreviate_labs=4))
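
If you want the test on its own (it is the same Chi-squared test of independence that vcd::mosaic reports), you can run it directly:

Code
# Chi-squared test of independence between country and prize category
chisq.test(Nobel_subset)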

Step 1: Run dimension reduction

Code
res.ca=CA(Nobel_subset)

Step 2 & 3: Examine the relationships between ‘variables’ and ‘subjects’

Code
plot(res.ca)
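
For a numeric view of the same information (eigenvalues, plus coordinates, contributions and cos2 for both rows and columns), the summary of the CA object prints the key tables:

Code
# Numeric summary of the correspondence analysis
summary(res.ca)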

Step 4: Further summarisation and interpretation

Code
fviz_screeplot(res.ca, addlabels = TRUE, ylim = c(0, 60))
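
While the annotations for this example are still being migrated, here is an optional sketch mirroring Step 4 of the PCA example: fviz_contrib from factoextra also accepts CA objects, showing the contributions of countries (rows) and prizes (columns) to the first CA dimension.

Code
# Contributions of countries (rows) and prizes (columns) to CA dimension 1
fviz_contrib(res.ca, choice = "row", axes = 1)
fviz_contrib(res.ca, choice = "col", axes = 1)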

Example 3: nMDS

Let’s take an ecology example for nMDS:

Mac Nally bird abundance data

  • Ralph Mac Nally (1989)
  • Maximum abundance for 102 bird species
  • 37 sites, 6 different forest types (Gippsland manna gum, montane forest, woodland, box-ironbark, river red gum, and mixed forest)
  • Research question: Do the bird assemblages differ between forest types?

Step 0: EDA

Code
macnally <- read_csv("macnally_full_site.csv")
Rows: 37 Columns: 104
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): Site, HABITAT
dbl (102): GST, EYR, GF, BTH, GWH, WTTR, WEHE, WNHE, SFW, WBSW, CR, LK, RWB,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
macnally_forlabels <- macnally %>% column_to_rownames("Site")
macnally_tidy <- macnally %>% pivot_longer(-c(Site,HABITAT),names_to="Bird",values_to="Abundance")
ggplot(macnally_tidy,aes(x=Site,y=Abundance,colour=Bird)) + geom_jitter(alpha=0.5) +scale_x_discrete(guide = guide_axis(angle = 90)) + theme_classic() + theme(legend.position="none")

Code
ggplot(macnally_tidy,aes(x=Site,y=Abundance,fill=Bird)) + geom_bar(position="stack",stat="identity") +scale_x_discrete(guide = guide_axis(angle = 90)) + theme_classic() + theme(legend.position = "none")

Code
# Zero counts: how many of the 102 species were absent (abundance 0) at each site?
macnally_tidy %>% group_by(Site) %>% summarise(zerocounts=sum(Abundance==0)) %>% arrange(desc(zerocounts))
# A tibble: 37 × 2
   Site         zerocounts
   <chr>             <int>
 1 Pearcedale           81
 2 Cape Schanck         77
 3 Bittern              76
 4 Red Hill             76
 5 Devilbend            75
 6 Bailieston           74
 7 Balnarring           74
 8 Warneet              74
 9 Hawke                73
10 Lysterfield          73
# ℹ 27 more rows
Code
# Standardisation: Wisconsin double standardisation (vegan) divides each
# species by its maximum and then standardises sites by site totals

macnally.std <- wisconsin(macnally[,c(-1,-2)]) %>% cbind(macnally[,1:2],.)
macnally.std_tidy <- macnally.std %>% pivot_longer(-c(Site,HABITAT),names_to="Bird",values_to="Abundance")
ggplot(macnally.std_tidy,aes(x=Site,y=Abundance,colour=Bird)) + geom_jitter(alpha=0.5) +scale_x_discrete(guide = guide_axis(angle = 90)) + theme_classic() + theme(legend.position="none")

Code
ggplot(macnally.std_tidy,aes(x=Site,y=Abundance,fill=Bird)) + geom_bar(position="stack",stat="identity") +scale_x_discrete(guide = guide_axis(angle = 90)) + theme(legend.position = "none")

Step 1: Run dimension reduction analysis

Code
# Fit the nMDS (metaMDS defaults: Bray-Curtis dissimilarity, k = 2 dimensions)
macnally.full.mds <- metaMDS(macnally.std[,-(1:2)])
Run 0 stress 0.1243679 
Run 1 stress 0.1562021 
Run 2 stress 0.1925009 
Run 3 stress 0.1243679 
... New best solution
... Procrustes: rmse 6.342555e-06  max resid 2.08651e-05 
... Similar to previous best
Run 4 stress 0.1660106 
Run 5 stress 0.1361338 
Run 6 stress 0.1734126 
Run 7 stress 0.1243679 
... Procrustes: rmse 7.226363e-06  max resid 3.130177e-05 
... Similar to previous best
Run 8 stress 0.1243679 
... Procrustes: rmse 2.249612e-05  max resid 7.695735e-05 
... Similar to previous best
Run 9 stress 0.124368 
... Procrustes: rmse 1.46231e-05  max resid 6.821196e-05 
... Similar to previous best
Run 10 stress 0.1243679 
... Procrustes: rmse 1.137193e-05  max resid 3.948229e-05 
... Similar to previous best
Run 11 stress 0.1362469 
Run 12 stress 0.1243679 
... Procrustes: rmse 4.352368e-06  max resid 1.199799e-05 
... Similar to previous best
Run 13 stress 0.124368 
... Procrustes: rmse 4.231432e-05  max resid 0.0001354954 
... Similar to previous best
Run 14 stress 0.1456979 
Run 15 stress 0.1247501 
... Procrustes: rmse 0.01200537  max resid 0.06251301 
Run 16 stress 0.1243679 
... New best solution
... Procrustes: rmse 7.542984e-06  max resid 2.859018e-05 
... Similar to previous best
Run 17 stress 0.1362466 
Run 18 stress 0.1243679 
... Procrustes: rmse 1.898389e-05  max resid 8.035575e-05 
... Similar to previous best
Run 19 stress 0.1753109 
Run 20 stress 0.1247507 
... Procrustes: rmse 0.01214907  max resid 0.06324879 
*** Best solution repeated 2 times

Because the distance-based methods don’t have the same concept of variables and subjects, we will start by looking at the ‘subject’ plot, which plots the individual sites.

Step 3: Examine the subject (site) plot

  • Environmental variables can be included in the plot in a similar way to supplementary variables for PCA (i.e. not part of the input, but correlation with each dimension calculated post-hoc)
  • Could include variables such as soil pH, soil composition, altitude, etc.
  • In this example we have just used each type of forest as a [binary] variable. The directions generally match the clusters of forest type
Code
ordiplot(macnally.full.mds,type="n",display="sites")
text(macnally.full.mds,lab=rownames(macnally_forlabels), col=as.numeric(as.factor(macnally$HABITAT)))
habitat <- model.matrix(~-1+macnally$HABITAT)
colnames(habitat) <-gsub("macnally\\$HABITAT","",colnames(habitat))
envfit <- envfit(macnally.full.mds, env=habitat)
plot(envfit, col="gray")

Step 2: Examine the variables (species) plot

  • Variable (species) scores are included by taking a weighted average of the site scores
  • Species that are closest to the sites in the ordination map are expected to have the highest abundances
Code
ordiplot(macnally.full.mds,type="n") |>  text("species", cex=0.5) |> points("sites", col=as.numeric(as.factor(macnally$HABITAT)))

Step 4: Further summarisation

  • We would like to (formally) test the hypothesis that the type of forest is associated with species composition; for this we can use PERMANOVA
Code
macnally.dist <- vegdist(macnally.std[,-(1:2)],"bray")
adonis2(macnally.dist~macnally$HABITAT)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

adonis2(formula = macnally.dist ~ macnally$HABITAT)
         Df SumOfSqs      R2      F Pr(>F)    
Model     5   3.3873 0.43021 4.6812  0.001 ***
Residual 31   4.4863 0.56979                  
Total    36   7.8735 1.00000                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
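
A significant PERMANOVA result can reflect differences in within-group dispersion as well as differences in location. As an optional companion check (not part of the output above), vegan's betadisper tests the homogeneity of multivariate dispersions across forest types:

Code
# Test whether multivariate dispersion differs among forest types;
# a significant result here would complicate the PERMANOVA interpretation
macnally.disp <- betadisper(macnally.dist, macnally$HABITAT)
permutest(macnally.disp)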