Multivariate Analysis

Correlation and principal components analysis

We select only the variables that record the number of deaths by cause (accident, cardiovascular, cancer, pulmonary, pneumonia, diabetes, liver) and compare them across the different regions of the US. First we explore the correlation structure of the data set.

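library(corrplot)
library(qgraph)
# Columns 3 to 9 of ushealth hold the death counts by cause
data_death <- ushealth[3:9]
# Correlation matrix drawn as a corrplot, with room left at the top for a title
corrplot(cor(data_death), mar = c(0, 0, 6, 0), tl.offset = 1)
mtext("Corrplot between reasons of death", at = 4, line = -0.5, cex = 1.5)

# The same correlation matrix drawn as a network graph
qgraph(cor(data_death), title = "Correlation between reasons of death")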

From the correlation plots above we can see that the number of deaths caused by accidents is negatively correlated with the other causes of death (diseases), which makes sense.
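The same kind of pattern can be read directly from the correlation matrix by pulling out the row of a single variable. The following is only a minimal sketch: since the ushealth data is not reproduced here, it uses the built-in iris data as a stand-in, and the variable Sepal.Width merely plays the role that the accident column plays in our data.

```r
# Stand-in data: the built-in iris measurements (NOT the ushealth data)
cm <- cor(iris[1:4])

# Correlations of one chosen variable with all the others
# (the diagonal entry, which is always 1, is dropped)
target <- "Sepal.Width"
others <- cm[target, colnames(cm) != target]
sort(others)

# TRUE when the chosen variable is negatively correlated with every other one
all(others < 0)
```

With the real data, one would replace iris[1:4] by data_death and target by the name of the accident column.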

Next we run a principal component analysis on the standardized variables and print the result as a table:

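pc <- prcomp(data_death, scale. = TRUE)
pc_summary <- summary(pc)  # named pc_summary so that base::sum() is not masked
# Build the importance table (standard deviation, proportion and
# cumulative proportion of variance) from the component standard deviations
pca_importance <- function(x) {
  vars <- x$sdev^2
  vars <- vars / sum(vars)
  rbind(`Standard deviation` = x$sdev, `Proportion of Variance` = vars,
        `Cumulative Proportion` = cumsum(vars))
}
pc_sum <- pca_importance(pc_summary)
colnames(pc_sum) <- paste0("PC", seq_len(ncol(pc_sum)))
knitr::kable(pc_sum)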

	PC1	PC2	PC3	PC4	PC5	PC6	PC7
Standard deviation	1.8442043	1.1087184	1.0302897	0.7816730	0.6585562	0.4651182	0.2170561
Proportion of Variance	0.4858699	0.1756081	0.1516424	0.0872875	0.0619566	0.0309050	0.0067305
Cumulative Proportion	0.4858699	0.6614780	0.8131204	0.9004079	0.9623645	0.9932695	1.0000000

From the table we can see that the first principal component explains 48.6 percent of the variance. The first three components, each with a standard deviation above 1, together explain 81.3 percent of the total variance, so we would keep 3 components and reduce the dimension from 7 to 3. (It might also be possible to keep just 2 principal components, which together carry 66.1 percent of the “information”.)
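The percentages quoted above follow directly from the standard deviations in the table. As a short worked check, assuming only the values printed there (the variable names below are ours and purely illustrative):

```r
# Standard deviations copied from the table above
sdev <- c(1.8442043, 1.1087184, 1.0302897, 0.7816730,
          0.6585562, 0.4651182, 0.2170561)

vars <- sdev^2            # variance of each component
total <- sum(vars)        # about 7, because the 7 variables were standardized
prop <- vars / total      # proportion of variance per component
cum_prop <- cumsum(prop)  # cumulative proportion

round(100 * prop[1], 1)      # share of PC1 in percent
round(100 * cum_prop[2], 1)  # share of PC1 and PC2 together
round(100 * cum_prop[3], 1)  # share of PC1 to PC3 together
```

Because the variables are scaled, the component variances add up to the number of variables, so each proportion is simply the squared standard deviation divided by 7.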

We now plot the variances of the principal components:

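pc <- prcomp(data_death, scale. = TRUE)
# Left: scree plot as a line; right: the same variances as a bar plot
par(mfrow = c(1, 2))
plot(pc, type = "l", col = "darkblue", main = "", lwd = 2)
abline(h = 1, col = "red")  # reference line at variance 1
plot(pc, col = "darkblue", main = "")

par(mfrow = c(1, 1))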

From the plots we can see that the first three principal components have a variance above 1, which agrees with the results in the previous table.
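The rule of keeping components whose variance exceeds 1 (often called the Kaiser criterion) can also be applied numerically rather than read off the plot. A minimal sketch, assuming only the standard deviations printed in the table above; the names eigenvalues and n_keep are ours:

```r
# Standard deviations copied from the table above
sdev <- c(1.8442043, 1.1087184, 1.0302897, 0.7816730,
          0.6585562, 0.4651182, 0.2170561)

eigenvalues <- sdev^2            # component variances
n_keep <- sum(eigenvalues > 1)   # number of components above the red line
n_keep
```

On the fitted object the same count would be sum(pc$sdev^2 > 1). The threshold of 1 is only meaningful here because the variables were standardized, so that each original variable contributes a variance of 1.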

Next we plot the observations in the plane of the first and second, and of the first and third, principal components, and colour them by the 4 regions to see whether the regions differ.

library(ggfortify)
pc <- prcomp(data_death, scale. = TRUE)

# Biplot of PC1 against PC2, points coloured by region
autoplot(pc, x = 1, y = 2, data = ushealth, colour = 'reg',
         loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3)

# Biplot of PC1 against PC3, points coloured by region
autoplot(pc, x = 1, y = 3, data = ushealth, colour = 'reg',
         loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3)

From the plots we observe that states in the West have more deaths caused by accidents than the others, while states in the Northeast have more deaths caused by diseases and therefore form a “cloud” on the left side of the plot. States from the Midwest and South lie somewhere in the middle and overlap with the other observations without forming a separate group.
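A visual impression of this kind can be backed up by comparing the average component scores per group. The following is only a sketch under stated assumptions: the ushealth data is not reproduced here, so the built-in iris data serves as a stand-in, with Species taking the place of our region variable reg.

```r
# Stand-in data: built-in iris (NOT the ushealth data);
# Species plays the role of the region variable
pc_demo <- prcomp(iris[1:4], scale. = TRUE)

# Scores of the observations on the first two components
scores <- as.data.frame(pc_demo$x[, 1:2])

# Mean score per group on PC1 and PC2
group_means <- aggregate(scores, by = list(group = iris$Species), FUN = mean)
group_means
```

Groups whose mean scores lie far apart on a component are the ones that separate along that axis in the plot. Note that the sign of a principal component is arbitrary, so only the relative positions of the groups carry meaning, not whether a group lies on the left or on the right.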

Veronika Roubínová

31.3.2022
