Task 7 - Vocabulary data

We use the same dataset as in Task 6 (PCA). It contains the results of a standardized vocabulary test taken in grades 8 through 11 by 64 students. There are 5 numerical variables in total: Grade 8, Grade 9, Grade 10, Grade 11 and Mean, where Mean is the average of the four grade scores for each student.

We will now focus on dimension reduction using factor analysis.

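Since our data is five-dimensional, we will consider factor analysis with one or two factors; with five variables, a model with three or more factors would have negative degrees of freedom. Because the one- and two-factor models give very similar p-values and similarly high cumulative variances (and in line with the principal component analysis results), we will work with just 1 factor.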
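
The restriction to one or two factors can be checked with the usual degrees-of-freedom formula for a k-factor model on p variables, ((p - k)^2 - (p + k)) / 2. The following is a minimal sketch; the helper name `factor_model_df` is made up for illustration and is not part of the original analysis.

```r
# Degrees of freedom of the likelihood-ratio test for a k-factor model
# on p observed variables: ((p - k)^2 - (p + k)) / 2.
# `factor_model_df` is a hypothetical helper name used only in this sketch.
factor_model_df <- function(p, k) {
  ((p - k)^2 - (p + k)) / 2
}

p <- 5      # Grade 8, Grade 9, Grade 10, Grade 11, Mean
k <- 1:3    # candidate numbers of factors

dof <- factor_model_df(p, k)
names(dof) <- paste(k, "factor(s)")
print(dof)  # 5, 1 and -2 degrees of freedom for k = 1, 2, 3
```

For one factor this gives 5 degrees of freedom, which agrees with the chi-square test reported in the output, while three factors would give a negative value.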

## 
## Call:
## factanal(x = vocabulary, factors = 1, rotation = "varimax")
## 
## Uniquenesses:
##  Grade 8  Grade 9 Grade 10 Grade 11     Mean 
##    0.126    0.179    0.117    0.185    0.005 
## 
## Loadings:
##          Factor1
## Grade 8  0.935  
## Grade 9  0.906  
## Grade 10 0.939  
## Grade 11 0.903  
## Mean     0.998  
## 
##                Factor1
## SS loadings      4.388
## Proportion Var   0.878
## 
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 500.6 on 5 degrees of freedom.
## The p-value is 5.93e-106
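
The summary rows of the output can be reproduced from the printed loadings alone. The sketch below assumes the analysis is based on standardized variables (the correlation matrix), so that each uniqueness equals one minus the squared loading. The loadings are typed in by hand from the output above, so they are rounded to three decimals.

```r
# Loadings copied from the printed output (rounded to three decimals).
loadings <- c("Grade 8" = 0.935, "Grade 9" = 0.906, "Grade 10" = 0.939,
              "Grade 11" = 0.903, "Mean" = 0.998)

communalities <- loadings^2                       # variance explained by the factor
uniquenesses  <- 1 - communalities                # variance left unexplained
ss_loadings   <- sum(communalities)               # "SS loadings" row
prop_var      <- ss_loadings / length(loadings)   # "Proportion Var" row

print(round(uniquenesses, 3))
print(round(c(ss_loadings = ss_loadings, prop_var = prop_var), 3))

# Upper-tail chi-square probability for the printed statistic and
# degrees of freedom, which is how the reported p-value is assumed to arise.
p_value <- pchisq(500.6, df = 5, lower.tail = FALSE)
print(p_value)
```

Because the inputs are rounded, the recomputed uniquenesses can differ from the printed ones in the third decimal place, while SS loadings (4.388) and Proportion Var (0.878) are recovered as printed.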

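As we can see, all 5 variables load very highly on the single factor (all loadings are above 0.9), which reflects their strong mutual correlation and corresponds to the PCA results. The factor accounts for 87.8% of the total variance, so in a regression model we could use it in place of the five original variables and avoid the issue of multicollinearity while preserving most of the variance of the data.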