Data choosing

I have chosen a data set called “ushealth” the same as for principal components analysis. This is a data set consisting of 50 measurements of 13 variables. It states for one year (1985) the reported number of deaths in the 50 states of the U.S. classified according to 7 categories (accident, cardiovascular, cancer, pulmonar, pneumonia, diabetis, liver) and some other variables describing number of doctors and hospitals, land area, population and specification of US region ad division. We will at first perform principal analysis and then connect it with results from factor analysis.

Principal components analysis

In previous task we have performed principal analysis just on data including information of number of deaths for different reasons. Now we will include also information about number of doctors and hospitals and again distinguish between different regions. I have dicide not to include land area and population, because they (especially land area) are on different scale and it could be problem for factor analysis. At first we perform principal analysis analogous to the task from previous week.

pc <- prcomp(ushealth[,3:11], scale=TRUE)
sum = summary(pc)
pca_importance <- function(x) {
  vars <- x$sdev^2
  vars <- vars/sum(vars)
  rbind(`Standard deviation` = x$sdev, `Proportion of Variance` = vars, 
      `Cumulative Proportion` = cumsum(vars))
}
pc_sum <- pca_importance(sum)
colnames(pc_sum) <-c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9")
knitr::kable(pc_sum)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 1.8979914 1.4183153 1.0841553 1.0171339 0.7004291 0.6087444 0.4589580 0.2501822 0.2040883
Proportion of Variance 0.4002635 0.2235132 0.1305992 0.1149513 0.0545112 0.0411744 0.0234047 0.0069546 0.0046280
Cumulative Proportion 0.4002635 0.6237766 0.7543758 0.8693271 0.9238383 0.9650127 0.9884174 0.9953720 1.0000000

From the table we can see that first principal component obtains 40 percent of variance and first three components with standard deviation above 1 together contains 86,9 percent of total variance so we would choose 4 components and reduce the dimension from 9 do 4 (maybe it could be also possible to choose just 3 principal components having together 75,4 percent of the “information”).

We plot some describing plots of principal components:

library(ggfortify)
pc <- prcomp(ushealth[,3:11], scale=TRUE)
autoplot(pc,x=1,y=2, data = ushealth, colour = 'reg',
         loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3)

From the plot we observe that states in the West (violet points in the right) have more deaths caused by accidents then the others and states in Northeast have more deaths cause by diseases so they make “cloud” on the left side of plot. States from Midwest and South are somewhere in the middle. South being located more to the right and Midwest to the left. We can also see, that accident have different direction against all others, number of doctors and hospitals have the same direction which makes sense. The reasons of death also points into the same direction except death caused by liver problems. From that we could think about 4 different clusters which we will try to compare with result from factor analysis.

Factor analysis

fa1 <- factanal(ushealth[,3:11], factors = 4, rotation="varimax")
print(fa1, digits=2, cutoff=.5, sort=TRUE)
## 
## Call:
## factanal(x = ushealth[, 3:11], factors = 4, rotation = "varimax")
## 
## Uniquenesses:
##  acc card canc  pul pneu diab  liv  doc shop 
## 0.53 0.04 0.04 0.51 0.39 0.28 0.00 0.00 0.20 
## 
## Loadings:
##      Factor1 Factor2 Factor3 Factor4
## acc  -0.62                          
## card  0.75            0.62          
## canc  0.84                          
## diab  0.84                          
## doc           0.95                  
## shop          0.89                  
## pul                   0.55          
## pneu                  0.76          
## liv                           0.95  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       2.42    1.96    1.44    1.18
## Proportion Var    0.27    0.22    0.16    0.13
## Cumulative Var    0.27    0.49    0.65    0.78
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 11.74 on 6 degrees of freedom.
## The p-value is 0.0679

I have consider more settings but finally choose to present this one with 4 factors because of good interpretation and also the fact that we do not reject the hypothesis that 4 factors are sufficient.

From the output we observe that the first factor consist of the accident with negative loading and the “well known or usual” diseases. The second factor consist of number of doctors and number of hospitals which makes sense. The factor has higher loadings for not that common diseases (less deaths) and the last factor has higher loading just for liver which could be also seen from the principal components analysis, that it has much different direction than other diseases.

We will also provide simple plot of loading for each factor:

rbPal <- colorRampPalette(c('red','blue'))

loadings <- data.frame(fa1$loadings[,1:4])
loadings$colF1 <- rbPal(20)[as.numeric(cut(loadings[,1],breaks = seq(-1,1,length = 20)))]
loadings$nam1 <- paste(row.names(loadings), round(loadings[,1], digits = 3), sep = "\n")
loadings$colF2 <- rbPal(20)[as.numeric(cut(loadings[,2],breaks = seq(-1,1,length = 20)))]
loadings$nam2 <- paste(row.names(loadings), round(loadings[,2], digits = 3), sep = "\n")
loadings$colF3 <- rbPal(20)[as.numeric(cut(loadings[,3],breaks = seq(-1,1,length = 20)))]
loadings$nam3 <- paste(row.names(loadings), round(loadings[,3], digits = 3), sep = "\n")
loadings$colF4 <- rbPal(20)[as.numeric(cut(loadings[,4],breaks = seq(-1,1,length = 20)))]
loadings$nam4 <- paste(row.names(loadings), round(loadings[,4], digits = 3), sep = "\n")

par(mfrow = c(4,1))
barplot(abs(loadings[,1]), col = loadings$colF1, ylim = c(0,1), ylab = "Factor 1", names.arg = loadings$nam1, cex.names = 0.8)
barplot(abs(loadings[,2]), col = loadings$colF2, ylim = c(0,1), ylab = "Factor 2", names.arg = loadings$nam2, cex.names = 0.8)
barplot(abs(loadings[,3]), col = loadings$colF3, ylim = c(0,1), ylab = "Factor 3", names.arg = loadings$nam3, cex.names = 0.8)
barplot(abs(loadings[,4]), col = loadings$colF4, ylim = c(0,1), ylab = "Factor 4", names.arg = loadings$nam4, cex.names = 0.8)

par(mfrow = c(1,1))

From the plot we can see the same result graphically. Red color means negative loading and blue means positive. We can see again the main “effects” in each factor:

  1. accident, cardiovascular disease, cancer and diabetes
  2. number of doctors and hospitals
  3. cardiovascular disease, cancer, pulmonar and pneumonia flu
  4. liver

The results correspond with our expectation and also with results from principal components.