Task 6 - Vocabulary data

The dataset contains the results of a standardized vocabulary test taken by 64 students in each of grades 8 through 11. There are 5 numerical variables in total: Grade 8, Grade 9, Grade 10, Grade 11 and Mean, where Mean is the average of the four scores for each student.

##     Grade 8           Grade 9          Grade 10         Grade 11     
##  Min.   :-2.1900   Min.   :-1.310   Min.   :-0.660   Min.   :-2.220  
##  1st Qu.:-0.0525   1st Qu.: 1.113   1st Qu.: 1.448   1st Qu.: 2.330  
##  Median : 1.2300   Median : 2.455   Median : 2.715   Median : 3.270  
##  Mean   : 1.1384   Mean   : 2.542   Mean   : 2.988   Mean   : 3.472  
##  3rd Qu.: 2.1825   3rd Qu.: 3.405   3rd Qu.: 4.160   3rd Qu.: 4.478  
##  Max.   : 8.2600   Max.   : 9.550   Max.   :10.240   Max.   :10.580  
##       Mean       
##  Min.   :-1.380  
##  1st Qu.: 1.100  
##  Median : 2.340  
##  Mean   : 2.535  
##  3rd Qu.: 3.638  
##  Max.   : 9.660
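
Because Mean is defined as the per-student average of the four grade scores, the summary table can be checked for internal consistency: the mean of the Mean column should equal the average of the four grade means, up to rounding in the printed output. The following is a minimal Python sketch of that check, using only the values printed above; the variable names are illustrative and not part of the original analysis.

```python
# Column means copied from the "Mean" row of the summary table above.
grade_means = {
    "Grade 8": 1.1384,
    "Grade 9": 2.542,
    "Grade 10": 2.988,
    "Grade 11": 3.472,
}
reported_mean_of_mean = 2.535  # "Mean" row of the Mean column

# Averaging is linear, so the mean of the per-student means must equal
# the average of the four grade means (up to rounding in the table).
recomputed = sum(grade_means.values()) / len(grade_means)

print(round(recomputed, 4))  # 2.5351
print(abs(recomputed - reported_mean_of_mean) < 1e-3)  # True
```

The two values agree to the three decimals shown in the table, which is what the definition of Mean predicts.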

The summary output of the PCA is shown below. The first component explains roughly 88 % of the total variance, and the first two together explain about 93 %. This share is already high, so (also for the sake of interpretability) we work only with the first two principal components. The fifth component has essentially zero variance, which is expected because Mean is an exact linear combination of the four grade scores.

## Importance of components:
##                           PC1     PC2     PC3     PC4      PC5
## Standard deviation     4.1702 1.01915 0.92261 0.71527 0.003707
## Proportion of Variance 0.8787 0.05248 0.04301 0.02585 0.000000
## Cumulative Proportion  0.8787 0.93114 0.97415 1.00000 1.000000
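
The "Proportion of Variance" row follows directly from the standard deviations: each component's variance is its squared standard deviation, divided by the sum of all five variances. A minimal Python sketch that reproduces the two proportion rows from the printed standard deviations is given here; the function name `explained_variance` is hypothetical and not taken from the original analysis.

```python
# Standard deviations copied from the "Importance of components" output.
sdev = [4.1702, 1.01915, 0.92261, 0.71527, 0.003707]


def explained_variance(sdev):
    """Return (proportions, cumulative proportions) of explained variance."""
    variances = [s ** 2 for s in sdev]  # variance = squared standard deviation
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


proportions, cumulative = explained_variance(sdev)
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.5f} cumulative={c:.5f}")
```

The squared standard deviations sum to about 19.79 rather than 5, which suggests (as an inference from the output, not something the report states) that the PCA was run on the unscaled variables.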

A biplot of the first two principal components helps to interpret them.

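All arrows point to the right, so all five variables load positively on the first component and are fairly strongly positively correlated. The correlations between Grade 10 and Grade 11, as well as between Grade 8 and Mean, appear to be very close to 1. The latter is somewhat surprising, as it suggests that the result in Grade 8 already largely “determines” how the students will do on average in the following three years. Two caveats apply: Mean includes the Grade 8 score by construction, so part of this correlation is mechanical, and the angles between arrows only approximate the correlations within the two-dimensional projection.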
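
When reading the Grade 8 and Mean arrows, it helps to know how much correlation arises purely because Mean contains the Grade 8 score. For four mutually independent scores with equal variance, the correlation between any one score and their average is 1/sqrt(4) = 0.5. The Python sketch below illustrates this with simulated data; the data, sample size and seed are assumptions made for illustration and are not the vocabulary data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_students = 20000      # hypothetical size, far larger than the 64 real students

# Four mutually independent simulated "grades" with equal variance
# (simulated values, NOT the real vocabulary scores).
grades = [[rng.gauss(0.0, 1.0) for _ in range(n_students)] for _ in range(4)]
row_mean = [sum(scores) / 4 for scores in zip(*grades)]

r = pearson(grades[0], row_mean)
print(round(r, 2))  # close to the theoretical value 1 / sqrt(4) = 0.5
```

A correlation close to 1 between Grade 8 and Mean is therefore well above this mechanical baseline, but not all of it reflects predictive power of the Grade 8 result.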

The data points appear to be scattered fairly evenly, with no visible clusters. However, there are some outliers (far to the right or far to the left) representing students who did much better or much worse on the tests than their classmates. Furthermore, slightly more points seem to lie to the left of the arrows, suggesting that the differences among students who scored below average are smaller than the differences among above-average students (which could be explained by a few exceptional scores). This agrees with the summary table, where for every variable the maximum lies much further above the third quartile than the minimum lies below the first quartile.