Data selection

I have chosen the data set “ushealth”, the same one used for the principal components analysis. It consists of 50 observations of 13 variables. For one year (1985), it gives the reported number of deaths in each of the 50 U.S. states, classified into 7 categories (accident, cardiovascular, cancer, pulmonary, pneumonia, diabetes, liver), together with variables describing the number of doctors and hospitals, land area, population, and the U.S. region and division of each state. We first perform hierarchical clustering and then K-means clustering.

Hierarchical clustering

First we take only the columns containing the numbers of deaths and compute a distance matrix from them. Then we compute the clusters and plot a dendrogram in which the regions are distinguished by color:

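# Euclidean distances between states, based on the seven death-cause columns
D <- dist(ushealth[, 3:9])
# hclust() uses complete linkage by default
HC1 <- hclust(D)

# Region code of each state, used as the color of the edge leading to its leaf
mycols <- as.numeric(ushealth$reg)

# For a dendrogram built from hclust(), a leaf holds the row index of its observation
colLab <- function(n) {
  if (is.leaf(n)) {
    a <- attributes(n)
    attr(n, "edgePar") <- c(a$edgePar, list(col = mycols[as.integer(n)]))
  }
  n
}

x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)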

The plot shows that observations from the West region lie mostly on the left, many of them close to each other, while observations from the other regions are fairly mixed across the rest of the tree.
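
The visual impression from a dendrogram can also be checked numerically by cutting the tree into a fixed number of groups and cross-tabulating the groups against the regions. The following is only a minimal sketch on made-up toy data; the names toy, toy_region, hc, groups and tab are hypothetical and are not part of the analysis above.

```r
# Toy stand-in for the death-count columns and the region variable.
# These values are made up for illustration and are not the ushealth data.
toy <- data.frame(
  acc  = c(40, 42, 41, 39, 43, 60, 61, 59, 62, 58),
  card = c(400, 405, 395, 410, 398, 300, 305, 295, 310, 302)
)
toy_region <- rep(c(1, 2), each = 5)

# Same steps as in the text: distance matrix, then complete-linkage clustering
hc <- hclust(dist(toy))

# Cut the tree into as many groups as there are regions
groups <- cutree(hc, k = 2)

# Rows: cluster obtained from the tree, columns: region
tab <- table(cluster = groups, region = toy_region)
print(tab)
```

With the objects from the text, the corresponding call would be table(cutree(HC1, k = 4), ushealth$reg), assuming that reg distinguishes four regions. A row dominated by a single region would then support what the dendrogram suggests for the West.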

Non-hierarchical clustering

Next we run the K-means algorithm with the number of groups set to 4. Then we draw a few plots that try to separate the groups clearly.

set.seed(12345)
# K-means on the seven death-cause columns, with 4 clusters
Kmeans <- kmeans(ushealth[, 3:9], centers = 4)
colVector <- as.numeric(Kmeans$cluster)

# Accident vs. cancer deaths: points filled by cluster, asterisks mark the cluster centers
plot(ushealth[,3] ~ ushealth[,5], bg = colVector, xlab = "Deaths(cancer)", ylab = "Deaths(accident)", pch = 21, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,3], col = 1:4, pch = 8, cex = 2)

# Region codes used to color the state labels, with codes 1 and 2 swapped
cols <- as.numeric(ushealth$reg)
cols[cols == 1] <- 5
cols[cols == 2] <- 1
cols[cols == 5] <- 2

# Cardiovascular vs. cancer deaths, with state names colored by region
plot(ushealth[,4] ~ ushealth[,5], bg = colVector, xlab = "Deaths(cancer)", ylab = "Deaths(cardio)", pch = 21, col = "black")
points(Kmeans$centers[,2] ~ Kmeans$centers[,3], col = 1:4, pch = 8, cex = 2)
text((ushealth[,4] + 12) ~ ushealth[,5], labels = rownames(ushealth), col = cols, cex = 0.5)

The plots show the clusters for two pairs of death causes (accident vs. cancer, cardiovascular vs. cancer), together with the cluster centers. In the second plot we have also added labels colored by region. We again observe that the clusters produced by the K-means algorithm match the regions quite well, especially for extreme values and, once more, mainly in the West.
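
The number of groups was fixed at 4 in advance. One common way to check such a choice is to compare the total within-cluster sum of squares over a range of values of k and look for a bend in the curve. The following is a minimal sketch on made-up toy data; the names toy, ks and wss are hypothetical, and the use of nstart = 20 is an assumption that does not appear in the analysis above.

```r
# Toy stand-in data: three well-separated groups in two variables.
# Made-up values for illustration, not the ushealth data.
toy <- data.frame(
  x = c(1, 2, 1, 2, 10, 11, 10, 11, 20, 21, 20, 21),
  y = c(1, 1, 2, 2, 10, 10, 11, 11,  1,  1,  2,  2)
)

set.seed(12345)
ks <- 1:6

# Total within-cluster sum of squares for each candidate number of groups.
# nstart = 20 reruns the algorithm from several random starts and keeps the best run.
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# The curve drops quickly up to a suitable k and flattens afterwards
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For the data in the text, toy would be replaced by ushealth[, 3:9]. A visible bend near k = 4 would support the choice made above, while a bend elsewhere would suggest trying a different number of groups.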