The dataset contains the results of a standardized vocabulary test taken in grades 8 through 11 by 64 students. There are five numerical variables in total: Grade 8, Grade 9, Grade 10, Grade 11 and Mean, where Mean is the average of the four scores for each student.
##     Grade 8           Grade 9          Grade 10         Grade 11
##  Min.   :-2.1900   Min.   :-1.310   Min.   :-0.660   Min.   :-2.220
##  1st Qu.:-0.0525   1st Qu.: 1.113   1st Qu.: 1.448   1st Qu.: 2.330
##  Median : 1.2300   Median : 2.455   Median : 2.715   Median : 3.270
##  Mean   : 1.1384   Mean   : 2.542   Mean   : 2.988   Mean   : 3.472
##  3rd Qu.: 2.1825   3rd Qu.: 3.405   3rd Qu.: 4.160   3rd Qu.: 4.478
##  Max.   : 8.2600   Max.   : 9.550   Max.   :10.240   Max.   :10.580
##       Mean
##  Min.   :-1.380
##  1st Qu.: 1.100
##  Median : 2.340
##  Mean   : 2.535
##  3rd Qu.: 3.638
##  Max.   : 9.660
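As a quick sanity check on the definition of Mean, the average of the four grade means should match the mean of the Mean column, because averaging is linear. The sketch below uses only the numbers printed in the summary output above; the variable names are chosen for illustration.

```python
# Column means copied from the summary output above.
grade_means = {
    "Grade 8": 1.1384,
    "Grade 9": 2.542,
    "Grade 10": 2.988,
    "Grade 11": 3.472,
}
reported_mean_of_mean = 2.535  # "Mean" row of the Mean column

# The mean of the per-student averages equals the average of the
# four column means, so the two numbers should agree up to rounding.
mean_of_means = sum(grade_means.values()) / len(grade_means)

print(f"average of the grade means:   {mean_of_means:.4f}")
print(f"reported mean of Mean column: {reported_mean_of_mean:.3f}")
print("consistent:", abs(mean_of_means - reported_mean_of_mean) < 0.001)
```

The two values agree to three decimals, which is consistent with Mean being an exact linear combination of the four grade variables.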
The summary output of the PCA is shown below. The first component explains roughly 88 % of the total variance, and the first two together explain about 93 %. This share is already high, so (and for reasons of interpretability) we will work only with the first two principal components. Note that the fifth component has essentially zero variance: Mean is a linear combination of the four grade variables, so the data effectively have only four dimensions.
## Importance of components:
##                           PC1     PC2     PC3     PC4      PC5
## Standard deviation     4.1702 1.01915 0.92261 0.71527 0.003707
## Proportion of Variance 0.8787 0.05248 0.04301 0.02585 0.000000
## Cumulative Proportion  0.8787 0.93114 0.97415 1.00000 1.000000
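The proportions in the table above follow directly from the standard deviations: each component's variance is its squared standard deviation, and its share is that variance divided by the sum over all components. A minimal sketch that recomputes the two lower rows from the first one (values copied from the output; variable names are illustrative):

```python
# Standard deviations of PC1..PC5, copied from the PCA summary.
sdev = [4.1702, 1.01915, 0.92261, 0.71527, 0.003707]

# Variance of each component and the total variance.
variances = [s ** 2 for s in sdev]
total = sum(variances)

# Share of the total variance per component, and the running sum.
proportion = [v / total for v in variances]
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion = {p:.5f}, cumulative = {c:.5f}")
```

Up to rounding of the printed standard deviations, this reproduces the roughly 88 % and 93 % figures quoted in the text.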
It is useful to visualize the first two principal components in a biplot.
All arrows point to the right, that is, all five variables load on the first component with the same sign, so they are all strongly positively correlated. The correlations between Grade 10 and Grade 11 and between Grade 8 and Mean appear to be very close to 1. The latter is quite surprising, as it suggests that the result in Grade 8 already “determines” how the students will do on average in the following three years as well.
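Reading correlations off a biplot relies on the angles between arrows: under the usual covariance-biplot scaling, the cosine of the angle between two variable arrows approximates their correlation, with the approximation limited to what the two plotted components capture. The sketch below only illustrates that rule. The arrow coordinates are hypothetical and are not read from the actual biplot, and `arrow_cosine` is a helper name introduced here.

```python
import math


def arrow_cosine(u, v):
    """Cosine of the angle between two 2-D arrows given as (PC1, PC2)."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))


# Hypothetical arrow end points, NOT taken from the real plot.
arrow_a = (1.0, 0.10)
arrow_b = (1.0, 0.12)   # almost the same direction as arrow_a
arrow_c = (1.0, -0.60)  # also points right, but at a wider angle

# Nearly overlapping arrows give a cosine close to 1.
print(round(arrow_cosine(arrow_a, arrow_b), 3))
# A wider angle gives a smaller, but still positive, cosine.
print(round(arrow_cosine(arrow_a, arrow_c), 3))
```

Arrows that nearly overlap therefore suggest a correlation close to 1, while arrows that merely share the same half-plane suggest a positive but weaker correlation.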
The data points appear to be scattered fairly evenly, so there do not seem to be any “bundles.” However, there are some outliers (far to the right or far to the left), representing students who did much better or much worse on the tests than their classmates. Furthermore, slightly more points seem to lie to the left of the arrows, suggesting that the differences among students who scored below average are smaller than the differences among above-average students (which could be explained by a few exceptional scores).
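The asymmetry between below-average and above-average students can also be checked roughly against the summary table at the start of the section, without the plot. The sketch below compares, for each variable, the distance from the minimum to the first quartile with the distance from the third quartile to the maximum. The numbers are copied from the summary output; the comparison itself is an informal check added here, not part of the original analysis.

```python
# (Min, 1st Qu., 3rd Qu., Max) copied from the summary output.
summary = {
    "Grade 8": (-2.19, -0.0525, 2.1825, 8.26),
    "Grade 9": (-1.31, 1.113, 3.405, 9.55),
    "Grade 10": (-0.66, 1.448, 4.160, 10.24),
    "Grade 11": (-2.22, 2.330, 4.478, 10.58),
    "Mean": (-1.38, 1.100, 3.638, 9.66),
}

tails = {}
for name, (minimum, q1, q3, maximum) in summary.items():
    lower_tail = q1 - minimum   # spread of the lowest quarter of scores
    upper_tail = maximum - q3   # spread of the highest quarter of scores
    tails[name] = (lower_tail, upper_tail)
    print(f"{name:8s} lower tail = {lower_tail:.3f}, upper tail = {upper_tail:.3f}")
```

For every variable the upper tail is longer than the lower tail, which is in line with a few exceptionally high scores stretching the right-hand side of the biplot.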