The title of this project is R Workshop - Multivariate Statistical Analysis I: Dimension Reduction. The slides for the presentation accompanying this workflow can be found here.
In this workflow we focus on practical data analysis by presenting statistical workflows applicable in any software for dimension reduction. The R code used to create output is also included.
In order to re-run this .qmd file, either open it with RStudio and press “render” or, if you want both the .pdf and the .html file to be produced, run the following code with this .qmd file being present in your working directory: quarto::quarto_render("multivariate_dimension_reduction_workflow.qmd", output_format = "all")
Annotations and explanations of this software workflow are still being moved from PowerPoint format. In the meantime, the [additional material](https://sydney-informatics-hub.github.io/stats-resources/specialist_statistics.html#multivariate-statistical-analysis-1-dimension-reduction) will help with your interpretation and understanding of examples 2 and 3.
Acknowledgements
If you use this workflow, please don’t forget to acknowledge us. You can use the following sentence:
The authors acknowledge the Statistical Workshops and Workflows provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney
We acknowledge Professor François Husson (https://husson.github.io/) for providing the datasets and elegant explanations of dimension reduction methods that formed the basis of our workshop examples. We also acknowledge Murray’s R Resources (https://www.flutterbys.com.au/stats/) for the NMDS example.
R setup
Before we delve into the content, it is necessary to load certain libraries. If you want to use them, please install them beforehand with install.packages("library_name").
Code
suppressPackageStartupMessages({
  library("tibble")
  library("magrittr")
  library("ggplot2")
  library("dplyr")
  library("FactoMineR")
  library("factoextra")
  library("tourr")
  library("tidyr")
  library("GGally")
  library("stringr")
  library("ggmosaic")
  library("vcd")
  library("readr")
  library("vegan")
  library("shape")
})
# Optional: to produce tourr animations
suppressPackageStartupMessages({
  library("gganimate")
  library("gifski")
})
# ggplot theme
theme_set(theme_bw())
# No warnings
options(warn = -1)
# Load the data
data(decathlon)
Dimension Reduction Theory
Here is some code to run the tours (made with the tourr package) shown in the workshop. These all use the decathlon example that is discussed below.
Code
# Take the decathlon data (the 10 event columns), centre and scale it
d <- scale(decathlon[, 1:10], center = TRUE, scale = TRUE)
# Create animated gifs for each of the tours
save_gif(animate_xy(d, tour_path = little_tour()), "little_tour.gif", delay = 0.01, loop = FALSE)
save_gif(animate_xy(d, tour_path = grand_tour()), "grand_tour.gif", delay = 0.01, loop = FALSE)
save_gif(animate_xy(d, tour_path = guided_tour(cmass())), "guided_tour.gif", delay = 0.01, loop = FALSE)
Example 1: PCA
In the first example we use PCA to examine the decathlon dataset. The decathlon data is a dataset built into the FactoMineR package, so once we have loaded FactoMineR, we load the dataset with the function below and it is stored in the decathlon variable.
Code
data(decathlon)
The data consists of 41 performances across 10 variables that make up the 10 events in the decathlon.
In addition to our performance variables we have additional information:
The rows are named for the athlete - note that row names are a special variable in R that are not technically part of the data frame but merely a label for each row
We have the Rank: the position they finished in the competition
We have the Points they were awarded in the competition
We have the Competition they competed in
The row names being stored as a label is a bit annoying. For now let’s make it into a column in its own right:
Let’s make a new version of the dataset that just contains the performances. We will need to keep the name of the athlete for labeling of our PCA output later. We can turn this back into a row label when we need to input only the numeric variables in the PCA.
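The chunk that created this object is still being moved over from the slides; a minimal sketch, assuming the 10 event columns come first in decathlon and using tibble’s rownames_to_column:

```r
library("tibble")
library("FactoMineR")
data(decathlon)

# Keep only the 10 event columns, then turn the athlete row names
# into a column of their own called Athlete
decathlon_performances <- rownames_to_column(decathlon[, 1:10], var = "Athlete")
```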
Let’s look at how they are distributed using boxplots:
Code
# Transform the data to long format for plotting
decathlon_performances_long <- pivot_longer(decathlon_performances, -Athlete,
                                            names_to = "Event", values_to = "Performance")
ggplot(decathlon_performances_long, aes(x = Event, y = Performance)) +
  geom_boxplot()
Wow, the distributions of these performances are very different for the different events. The 1500m has the highest values, which makes sense because it is a long-distance event whose performance is measured as the number of seconds taken to complete the distance.
When we perform the PCA, the variables must first be mean centred so that differences in their average values do not bias the solution. We can start by doing that.
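A sketch of the centring step (the intermediate object name is an assumption):

```r
library("tibble")
library("FactoMineR")
data(decathlon)
decathlon_performances <- rownames_to_column(decathlon[, 1:10], var = "Athlete")

# Move the Athlete labels back to row names so only numeric columns remain,
# then subtract each event's mean; scale = FALSE leaves the spread untouched
perf <- column_to_rownames(decathlon_performances, var = "Athlete")
decathlon_performances_centred <- scale(perf, center = TRUE, scale = FALSE)
```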
Now that we have centred the values, we can see that there is a big difference in the amount of variation in each of the events. We are interested in patterns in overall performance, and in the decathlon, each of the events have an equal weight. So we will standardise each of the events, by dividing each value by the standard deviation for that event. This is a commonly used approach that gives each variable an equal influence over your PCA solution, but you may not always want to do this.
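A sketch of the standardisation step, producing the object name used in the next chunk:

```r
library("tibble")
library("FactoMineR")
data(decathlon)
decathlon_performances <- rownames_to_column(decathlon[, 1:10], var = "Athlete")

# Centre and divide by each event's standard deviation so every event
# has equal influence on the PCA solution
perf <- column_to_rownames(decathlon_performances, var = "Athlete")
decathlon_performances_centred_scaled <- scale(perf, center = TRUE, scale = TRUE)
```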
The starred correlation coefficients show evidence of a non-zero correlation. We see evidence of a positive correlation between the track events (e.g. 100m and 400m), which makes sense. However, we also see strong evidence for a negative correlation between some field events and track events (e.g. 100m and Long jump). This is simply an artifact of how the performances are measured:
A bigger number is a good thing in long jump (you jumped further)
A bigger number is a bad thing in 100m (you took longer)
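The starred display in the workshop likely came from a package plot (e.g. GGally::ggpairs); the raw coefficients themselves can be inspected with base R (correlations are unaffected by centring and scaling, and we assume the 10 event columns come first in decathlon):

```r
library("FactoMineR")
data(decathlon)

# Pairwise correlations between the 10 events, rounded for readability
round(cor(decathlon[, 1:10]), 2)
```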
We decide to multiply all track times by -1 so all events have the same ‘polarity’. This will make the output from the PCA easier to interpret, because the correlations among the inputs will have a consistent direction of effect. We should flag this change to the variables by putting a minus sign in front of each of the track events.
Code
# A note about this code: to use the tidyverse functions below (mutate and
# rename_with) we need to make sure the centred, scaled matrix is turned
# back into a data frame
decathlon_performances_track_reversed_centred_scaled <-
  as.data.frame(decathlon_performances_centred_scaled) %>%
  mutate(across(c("100m", "400m", "110m.hurdle", "1500m"), ~ .x * -1)) %>%
  rename_with(~ str_c("-", .), c("100m", "400m", "110m.hurdle", "1500m"))
Step 2: Examine the relationships between variables
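The plot below assumes a PCA object called decathlon_PCA. The chunk that created it was not carried over, but based on the objects defined above it would look something like:

```r
library("FactoMineR")
# Run the PCA on the standardised, polarity-corrected performances;
# graph = FALSE suppresses the default base plots so we can draw them
# with factoextra instead
decathlon_PCA <- PCA(decathlon_performances_track_reversed_centred_scaled, graph = FALSE)
```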
Code
fviz_pca_var(decathlon_PCA)
Let’s throw a couple of supplementary variables in to aid our interpretation and rerun the PCA.
Code
# We will add these two supplementary variables as the first two columns, and
# then tell the PCA function that they have column indices 1 and 2 by using
# quanti.sup = 1:2
decathlon_PCA <- PCA(cbind(decathlon[, c("Rank", "Points")],
                           decathlon_performances_track_reversed_centred_scaled),
                     quanti.sup = 1:2, graph = FALSE)
fviz_pca_var(decathlon_PCA)
Interpreting the PCs as latent variables:
Clearly PC1 represents the overall performance of the athlete: it is a weighted sum of all events, and all variables are positively correlated with it. Those with higher PC1 scores have higher points and lower rank. Despite explaining the most variability (33%), this is not a particularly interesting latent variable.
PC2 appears to represent specialisation of athletes. It is a contrast between events that involve sprinting, and those that don’t, which have opposite signed correlations with PC2*. Sprint specialists achieve by running really fast. Non-sprint specialists achieve most of their points by being good at other field events. Despite explaining about half as much of the variability as PC1 (17%), PC2 has a more interesting interpretation as a latent variable. If specialisation is present in decathletes, this PC will appear reliably in multiple samples.
*A technical note: the sign of the PC itself is arbitrary! The sign of the correlation between a variable and the PC, however, is not.
Step 3: Examine the relationships between subjects
Code
fviz_pca_ind(decathlon_PCA)
We can plot the individual observations in PC space (their PC scores for two dimensions at a time); this is often called the score plot.
This is again a projection into PC space, but this time of the individual subjects rather than the variables.
Observations close to each other have similar values for the depicted PCs (but not necessarily similar overall). Recall latent variable interpretation for each PC.
In some examples (not this one) there may be distinct clusters, and you can perform various types of clustering on observations in the PC space.
Step 4: Further summarisation and interpretation
Code
fviz_screeplot(decathlon_PCA, ncp=5)
The Scree plot helps you decide how many dimensions to consider. There are as many PCs as there were variables input (10), but by default the PCA function only keeps the first five. There are different approaches to choosing the number of dimensions, depending on the type of dataset you have and the domain of your problem.
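If you want the numbers behind the scree plot, the eigenvalues and cumulative percentage of variance can be printed directly (get_eigenvalue is from factoextra; the same table also lives in the $eig slot of the PCA object):

```r
library("factoextra")
# Eigenvalue, percentage of variance and cumulative percentage per dimension
get_eigenvalue(decathlon_PCA)
```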
A couple of other plots help you examine the PCs more closely. Firstly the quality of representation plot tells you how well each original variable’s information is preserved in the PCs. We can see that Pole vault is not well represented by the first two PCs.
Quality of representation plot:
Code
fviz_cos2(decathlon_PCA, choice = "var", axes = 1:2)
Similarly, we can look at the contribution of the variables to each of the PCs (PC1 is shown below): considering the loadings of each variable onto the PC, how much does each contribute relative to the others? The reference line is at 10% because 10 variables were input, so a variable above this line is ‘overrepresented’ in PC1 compared to the other variables.
Contribution of [original] variables plot
Code
fviz_contrib(decathlon_PCA, choice = "var")
Example 2: Correspondence Analysis
Step 0: Perform appropriate EDA
The Nobel Prize data must be loaded from a csv.
Code
Nobel <- read.table("data_CA_NobelPrize_withMaths.csv", header = TRUE, sep = ";",
                    row.names = 1, check.names = FALSE)
This is the complete dataset, but in this example we only use the G8 countries (first 8 rows) and exclude the mathematics prize (7th column).
Code
Nobel_subset <- Nobel[1:8,1:6]
We can look at this contingency table as frequencies:
Code
Nobel_subset
Chemistry Economic science Literature Medicine Peace Physics
Canada 4 3 2 4 1 4
France 8 3 11 12 10 9
Germany 24 1 8 18 5 24
Italy 1 1 6 5 1 5
Japan 6 0 2 3 1 11
Russia 4 3 5 2 3 10
UK 23 6 7 26 11 20
USA 51 43 8 70 19 66
Or proportions:
Code
prop.table(Nobel_subset) %>% round(2)
Chemistry Economic science Literature Medicine Peace Physics
Canada 0.01 0.01 0.00 0.01 0.00 0.01
France 0.01 0.01 0.02 0.02 0.02 0.02
Germany 0.04 0.00 0.01 0.03 0.01 0.04
Italy 0.00 0.00 0.01 0.01 0.00 0.01
Japan 0.01 0.00 0.00 0.01 0.00 0.02
Russia 0.01 0.01 0.01 0.00 0.01 0.02
UK 0.04 0.01 0.01 0.05 0.02 0.04
USA 0.09 0.08 0.01 0.12 0.03 0.12
We can also visualise the table as a mosaic plot using the ggmosaic package.
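The ggmosaic chunk was not carried over; a sketch, assuming ggmosaic’s weight/product interface after reshaping the counts to long format:

```r
library("ggplot2")
library("ggmosaic")

# One row per country x prize cell, with the cell count as a weight
nobel_long <- as.data.frame(as.table(as.matrix(Nobel_subset)))
names(nobel_long) <- c("Country", "Prize", "Count")
ggplot(nobel_long) +
  geom_mosaic(aes(weight = Count, x = product(Country), fill = Prize))
```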
Before we proceed to the method, we can look overall at whether the prize and country categorical variables are independent.
The below plot is from the vcd package, which aids visualisation of multivariate data. We see the same mosaic plot as above, but now shaded by Pearson residuals (the distance of that combination from expected under independence). The tiny p-value shown on the bottom right is the p-value for a Chi-squared test for independence. On the basis of this test, there is a dependence between the two variables, which we can explore using correspondence analysis.
Code
vcd::mosaic(as.matrix(rev(Nobel_subset)), split_vertical = TRUE, shade = TRUE,
            labeling_args = list(rot_labels = c(top = 90, left = 0),
                                 abbreviate_labs = 4))
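The independence test reported on the plot can also be run directly with base R’s chisq.test. A sketch, rebuilding the contingency table from the counts shown above (the name nobel_counts is just for illustration):

```r
# G8 countries x 6 prize categories, counts copied from the table above
nobel_counts <- matrix(
  c( 4,  3,  2,  4,  1,  4,
     8,  3, 11, 12, 10,  9,
    24,  1,  8, 18,  5, 24,
     1,  1,  6,  5,  1,  5,
     6,  0,  2,  3,  1, 11,
     4,  3,  5,  2,  3, 10,
    23,  6,  7, 26, 11, 20,
    51, 43,  8, 70, 19, 66),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    c("Canada", "France", "Germany", "Italy", "Japan", "Russia", "UK", "USA"),
    c("Chemistry", "Economic science", "Literature", "Medicine", "Peace", "Physics")))

# Pearson chi-squared test of independence between country and prize
res <- chisq.test(nobel_counts)
res$p.value  # very small, consistent with the p-value shown on the mosaic plot
```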
Step 1: Run dimension reduction
Code
res.ca <- CA(Nobel_subset)
Step 2 & 3: Examine the relationships between ‘variables’ and ‘subjects’
Example 3: NMDS
The data for this example consist of bird surveys at 37 sites across 5 different forest types (Gippsland manna gum, montane forest, woodland, box-ironbark and river redgum and mixed forest).
Research question: Do the bird assemblages differ between forest types?
Step 0: EDA
Code
macnally <-read_csv("macnally_full_site.csv")
Rows: 37 Columns: 104
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Site, HABITAT
dbl (102): GST, EYR, GF, BTH, GWH, WTTR, WEHE, WNHE, SFW, WBSW, CR, LK, RWB,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
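The run trace below was produced by an NMDS fit. That chunk did not survive the move from the slides, but it would look something like the following sketch, assuming vegan’s metaMDS with its default Bray-Curtis distance on the bird abundance columns (the object name macnally_mds is an assumption):

```r
library("dplyr")
library("vegan")

# Drop the Site and HABITAT identifier columns, keeping the 102 species counts
macnally_birds <- select(macnally, -Site, -HABITAT)
# Two-dimensional NMDS with Bray-Curtis dissimilarity; metaMDS restarts from
# multiple random configurations, producing the run trace shown below
macnally_mds <- metaMDS(macnally_birds, distance = "bray", k = 2)
```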
Run 0 stress 0.1243679
Run 1 stress 0.1562021
Run 2 stress 0.1925009
Run 3 stress 0.1243679
... New best solution
... Procrustes: rmse 6.342555e-06 max resid 2.08651e-05
... Similar to previous best
Run 4 stress 0.1660106
Run 5 stress 0.1361338
Run 6 stress 0.1734126
Run 7 stress 0.1243679
... Procrustes: rmse 7.226363e-06 max resid 3.130177e-05
... Similar to previous best
Run 8 stress 0.1243679
... Procrustes: rmse 2.249612e-05 max resid 7.695735e-05
... Similar to previous best
Run 9 stress 0.124368
... Procrustes: rmse 1.46231e-05 max resid 6.821196e-05
... Similar to previous best
Run 10 stress 0.1243679
... Procrustes: rmse 1.137193e-05 max resid 3.948229e-05
... Similar to previous best
Run 11 stress 0.1362469
Run 12 stress 0.1243679
... Procrustes: rmse 4.352368e-06 max resid 1.199799e-05
... Similar to previous best
Run 13 stress 0.124368
... Procrustes: rmse 4.231432e-05 max resid 0.0001354954
... Similar to previous best
Run 14 stress 0.1456979
Run 15 stress 0.1247501
... Procrustes: rmse 0.01200537 max resid 0.06251301
Run 16 stress 0.1243679
... New best solution
... Procrustes: rmse 7.542984e-06 max resid 2.859018e-05
... Similar to previous best
Run 17 stress 0.1362466
Run 18 stress 0.1243679
... Procrustes: rmse 1.898389e-05 max resid 8.035575e-05
... Similar to previous best
Run 19 stress 0.1753109
Run 20 stress 0.1247507
... Procrustes: rmse 0.01214907 max resid 0.06324879
*** Best solution repeated 2 times
Because the distance-based methods don’t have the same concept of variables and subjects, we will start by looking at the ‘subject’ plot, which plots the individual sites.
Step 3: Examine the subject (site) plot
Environmental variables can be included in the plot in a similar way to supplementary variables for PCA (i.e. not part of the input, but correlation with each dimension calculated post-hoc)
Could include variables such as soil pH, soil composition, altitude, etc.
In this example we have just used each type of forest as a [binary] variable. The directions generally match the clusters of forest type.
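A sketch of how such post-hoc fitting could be done with vegan’s envfit (the names macnally_mds and habitat_fit, and the dummy coding of HABITAT, are assumptions):

```r
library("vegan")

# Code habitat as binary indicator columns, one per forest type
habitat_dummies <- model.matrix(~ HABITAT - 1, data = macnally)
# Fit each indicator onto the ordination; significance assessed by permutation
habitat_fit <- envfit(macnally_mds, habitat_dummies, permutations = 999)
ordiplot(macnally_mds, type = "text")  # base ordination plot of the sites
plot(habitat_fit)                      # overlay arrows for each habitat indicator
```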