Choosing the data

I chose a data set called “ushealth”. It consists of 50 observations of 13 variables. For one year (1985) it gives the reported number of deaths in each of the 50 U.S. states, classified into 7 categories (accident, cardiovascular, cancer, pulmonary, pneumonia, diabetes, liver), together with further variables describing the number of doctors and hospitals, land area, population, and the U.S. region and division.

Correlation and principal components analysis

We use only the variables that record the number of deaths by cause (accident, cardiovascular, cancer, pulmonary, pneumonia, diabetes, liver) and compare them across the different regions of the US. First we explore the correlation structure of the data set.

library(corrplot)
library("qgraph")
data_death <- ushealth[3:9]
corrplot(cor(data_death), mar=c(0,0,6,0), tl.offset = 1)
mtext("Corrplot between reasons of death", at=4, line=-0.5, cex=1.5)

qgraph(cor(data_death), title="Correlation between reasons of death")

From the correlation plots above we can see that the number of deaths caused by accidents is negatively correlated with the other causes of death, which is plausible since accidents are the only cause in the set that is not a disease.
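The same reading can be taken numerically from the correlation matrix instead of from the plot. The sketch below is only an illustration: `cor_with()` is a hypothetical helper, and the data frame `toy` with its columns `acc`, `card` and `canc` is simulated, not taken from “ushealth”.

```r
# Hypothetical helper: correlations of one column with all other columns,
# sorted from most negative to most positive.
cor_with <- function(df, target) {
  cm <- cor(df)
  sort(cm[target, colnames(cm) != target])
}

# Simulated stand-in data (NOT the ushealth data): two "disease" counts
# share a common factor, the "accident" count moves against it.
set.seed(1)
n <- 50
disease <- rnorm(n)
toy <- data.frame(
  acc  = -disease + rnorm(n, sd = 0.5),
  card =  disease + rnorm(n, sd = 0.5),
  canc =  disease + rnorm(n, sd = 0.5)
)

res <- cor_with(toy, "acc")
print(round(res, 2))
```

With the data of this report one would call `cor_with(data_death, names(data_death)[1])`, assuming the accident counts are the first selected column.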

Now we run a principal component analysis on the scaled variables and print the result as a table:

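# PCA on standardized variables (equivalent to PCA of the correlation matrix)
pc <- prcomp(data_death, scale. = TRUE)

# Build the importance table (standard deviation, proportion and
# cumulative proportion of variance) from the component standard deviations
pca_importance <- function(x) {
  vars <- x$sdev^2
  vars <- vars / sum(vars)
  rbind(`Standard deviation` = x$sdev,
        `Proportion of Variance` = vars,
        `Cumulative Proportion` = cumsum(vars))
}
# The fitted object is passed directly; no variable named `sum` is created,
# so the base function sum() is not masked
pc_sum <- pca_importance(pc)
colnames(pc_sum) <- paste0("PC", seq_len(ncol(pc_sum)))
knitr::kable(pc_sum)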
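
|                        | PC1       | PC2       | PC3       | PC4       | PC5       | PC6       | PC7       |
|:-----------------------|----------:|----------:|----------:|----------:|----------:|----------:|----------:|
| Standard deviation     | 1.8442043 | 1.1087184 | 1.0302897 | 0.7816730 | 0.6585562 | 0.4651182 | 0.2170561 |
| Proportion of Variance | 0.4858699 | 0.1756081 | 0.1516424 | 0.0872875 | 0.0619566 | 0.0309050 | 0.0067305 |
| Cumulative Proportion  | 0.4858699 | 0.6614780 | 0.8131204 | 0.9004079 | 0.9623645 | 0.9932695 | 1.0000000 |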

From the table we can see that the first principal component explains 48.6 percent of the variance, and that the first three components, each with a standard deviation above 1, together explain 81.3 percent of the total variance. We would therefore keep 3 components and reduce the dimension from 7 to 3 (it might also be possible to keep just 2 principal components, which together hold 66.1 percent of the “information”).
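The choice of 3 components can be re-derived from the reported standard deviations alone. The sketch below copies the seven values from the table and applies two common rules: keep components with variance above 1, and keep the smallest number of components that reaches a cumulative share. The 80 percent threshold is an assumption made for illustration, not a value used in this report.

```r
# Standard deviations of the 7 components, copied from the table above
sdev <- c(1.8442043, 1.1087184, 1.0302897, 0.7816730,
          0.6585562, 0.4651182, 0.2170561)

eigenvalues <- sdev^2                          # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

# Rule 1: keep components whose variance exceeds 1
n_kaiser <- sum(eigenvalues > 1)

# Rule 2: smallest number of components reaching the threshold
threshold <- 0.80                              # assumed, for illustration
n_cum <- which(cum_var >= threshold)[1]

print(round(cum_var, 3))
print(c(variance_above_1 = n_kaiser, cumulative_rule = n_cum))
```

Both rules select 3 components here; lowering the threshold to 0.65 would select 2, which corresponds to the alternative mentioned in the text.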

We now draw some descriptive plots of the principal components:

pc <- prcomp(data_death, scale=TRUE)
par(mfrow = c(1,2))
plot(pc, type = "l", col = "darkblue", main = "",lwd=2)
abline(h=1,col="red")
plot(pc, col = "darkblue", main = "")

par(mfrow = c(1,1))

From the plots we can see that the first three principal components have a variance above 1, which agrees with the results in the previous table.
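The red line at 1 is a natural cut-off because, for scaled data, the component variances are the eigenvalues of the correlation matrix and add up to the number of variables, so a value of 1 is the variance of a single standardized variable. The sketch below checks this on the built-in `iris` data, used purely as a stand-in because “ushealth” is not part of base R; the object names are illustrative.

```r
# Stand-in numeric data (iris replaces the death counts of this report)
X <- iris[, 1:4]

# Scaled PCA, as in the report
pc_demo <- prcomp(X, scale. = TRUE)

# Eigenvalues of the correlation matrix, in decreasing order
eig_cor <- eigen(cor(X))$values

# Component variances and eigenvalues agree up to numerical tolerance
same_values <- isTRUE(all.equal(pc_demo$sdev^2, eig_cor))
print(same_values)

# The variances sum to the number of variables
print(sum(pc_demo$sdev^2))
```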

Now we plot the observations in the plane of the first and second, and of the first and third, principal components, and color them by the 4 regions to see whether the regions differ.

library(ggfortify)
pc <- prcomp(data_death, scale=TRUE)

autoplot(pc,x=1,y=2, data = ushealth, colour = 'reg',
         loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3)

autoplot(pc,x=1,y=3, data = ushealth, colour = 'reg',
         loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3)

From the plots we observe that states in the West have more deaths caused by accidents than the others, and states in the Northeast have more deaths caused by diseases, so they form a “cloud” on the left side of the plot. States from the Midwest and South lie somewhere in the middle but merge with the other observations and do not form a separate group.
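The visual impression of separated or overlapping groups can be backed up by the mean component scores per group. The sketch below is a minimal illustration on the built-in `iris` data, with `Species` standing in for the region column; it does not reproduce the results of this report.

```r
# Stand-in data: iris replaces ushealth, Species replaces the region column
pc_demo <- prcomp(iris[, 1:4], scale. = TRUE)

# Scores of the observations on the first two components
scores <- as.data.frame(pc_demo$x[, 1:2])

# Mean score per group; well separated means suggest distinct groups
group_means <- aggregate(scores, by = list(group = iris$Species), FUN = mean)
print(group_means)
```

With the objects of this report the analogous call would be `aggregate(as.data.frame(pc$x[, 1:3]), by = list(region = ushealth$reg), FUN = mean)`, assuming `reg` is the region column used in the `autoplot()` calls above. The sign of each component is arbitrary, so only differences between groups should be interpreted.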