Task 8 - Vocabulary data

We use the same dataset as in task 6 (PCA) and task 7 (factor analysis). The dataset contains the results of a standardized vocabulary test in grades 8-11 for 64 students. There are 5 numerical variables in total: Grade 8, Grade 9, Grade 10, Grade 11 and Mean, where Mean is the average of the four grade scores for each student.

1. K-means algorithm

Our first task is to apply the K-means algorithm with a reasonable choice for the number of clusters. We divide the data into 3 clusters, which we interpret as low-performing, average-performing and high-performing students. The scatterplot below shows the resulting clusters for the Grade 8 and Grade 11 test scores. Since all 5 variables are highly correlated, plots for other pairs of variables would look quite similar.
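The K-means step can be sketched as follows. This is only a minimal illustration in Python using the standard library: the original analysis does not state which software, initialization or number of iterations was used, so the function kmeans, the helper make_scores and the synthetic scores it produces are all assumptions standing in for the real dataset.

```python
import random


def make_scores(n_students=64, seed=42):
    """Hypothetical stand-in for the vocabulary data (NOT the real scores)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_students):
        ability = rng.gauss(0.0, 2.0)  # shared ability makes the grades correlated
        grades = [ability + shift + rng.gauss(0.0, 0.5)
                  for shift in (1.0, 2.5, 3.0, 4.5)]  # Grade 8, 9, 10, 11
        rows.append(grades + [sum(grades) / 4])  # fifth variable: Mean
    return rows


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct observations chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


students = make_scores()
labels, centers = kmeans(students, k=3)

# Name the clusters by the Mean coordinate (index 4) of their centers.
order = sorted(range(3), key=lambda j: centers[j][4])
names = {order[0]: "low", order[1]: "average", order[2]: "high"}
groups = [names[lab] for lab in labels]

for name in ("low", "average", "high"):
    print(name, groups.count(name))
```

Ordering the clusters by the Mean coordinate of their centers is one simple way to attach the low, average and high labels; with the real data, the Grade 8 and Grade 11 columns of students together with groups would be what the scatterplot displays.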

2. Hierarchical agglomerative clustering algorithm

Now the task is to apply a hierarchical agglomerative clustering algorithm, again with 3 clusters, and compare the results with those of K-means.

As we can see, the clusters are not the same, which is not surprising, since the two algorithms work differently. The left cluster corresponds to high-performing students and now contains only 2 students. On the scatterplot for the K-means algorithm we can see that these are the highest-ranking students. The middle cluster corresponds to the low-performing students and the right cluster to the average-performing students.
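The agglomerative procedure can be sketched in the same way. The text above does not say which linkage or distance was used, so complete linkage with Euclidean distance is an assumption here, and the function agglomerative, the helper make_scores and the synthetic scores are hypothetical stand-ins for the real analysis.

```python
import math
import random
from collections import Counter


def make_scores(n_students=64, seed=42):
    """Hypothetical stand-in for the vocabulary data (NOT the real scores)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_students):
        ability = rng.gauss(0.0, 2.0)  # shared ability makes the grades correlated
        grades = [ability + shift + rng.gauss(0.0, 0.5)
                  for shift in (1.0, 2.5, 3.0, 4.5)]  # Grade 8, 9, 10, 11
        rows.append(grades + [sum(grades) / 4])  # fifth variable: Mean
    return rows


def agglomerative(points, n_clusters, linkage=max):
    """Naive agglomerative clustering.

    linkage=max gives complete linkage (assumed here); linkage=min would
    give single linkage. Returns one cluster label per point.
    """
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, a, b = min(
            (cluster_dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Merge that pair; b > a, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


students = make_scores()
hc_labels = agglomerative(students, n_clusters=3)

# Cluster sizes and average Mean score (index 4) per cluster.
sizes = Counter(hc_labels)
for label in sorted(sizes):
    mean_score = sum(s[4] for s, lab in zip(students, hc_labels)
                     if lab == label) / sizes[label]
    print(label, sizes[label], round(mean_score, 2))
```

Printing the cluster sizes next to the average Mean score is a quick way to spot a very small cluster, such as the 2-student group described above, and to see which cluster plays the low, average or high role before comparing with the K-means result.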