Multivariate Analyses

Plots

# Total deaths per state: sum over all listed causes of death
ushealth <- transform(ushealth, deaths = acc + card + canc + pul + pneu + diab + liv)
x <- ushealth[order(ushealth$deaths), ]  # sort states by total deaths

# One color per region, one point symbol per division
region_cols <- c(Northeast = "red", Midwest = "blue", South = "darkgreen", West = "gold2")
x$color <- unname(region_cols[as.character(x$reg)])
x$point <- as.numeric(x$div)

dotchart(x$deaths, labels = row.names(x), cex = 0.75, groups = x$reg,
         main = "Deaths per region", xlab = "Number of deaths",
         gcolor = "black", color = x$color, pch = x$point)

# Sample mean of total deaths in each region
cNE <- mean(ushealth$deaths[ushealth$reg == "Northeast"])
cMW <- mean(ushealth$deaths[ushealth$reg == "Midwest"])
cS <- mean(ushealth$deaths[ushealth$reg == "South"])
cW <- mean(ushealth$deaths[ushealth$reg == "West"])

# Vertical line at each regional mean; the y ranges are hard-coded to the
# rows that each region's group occupies in the dot chart above
lines(c(cNE, cNE), c(48, 56), col = "red", lwd = 2)
lines(c(cMW, cMW), c(34, 45), col = "blue", lwd = 2)
lines(c(cS, cS), c(16, 31), col = "darkgreen", lwd = 2)
lines(c(cW, cW), c(1, 13), col = "gold2", lwd = 2)

The first plot shows the number of deaths per state, computed as the sum of deaths over all listed causes, grouped by US region and division. Colors mark the regions, and point symbols mark the divisions within each region. The states (labeled by abbreviation) are ordered from the lowest to the highest number of deaths. The vertical lines, drawn in the color of the region, show the regional sample means. States in the West have lower numbers of deaths than the others, and the Northeast has the highest values. Within the Northeast there is also a visible difference between divisions: the Mid Atlantic (triangles) has higher values than New England (circles).
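The regional sample means drawn as lines in the plot can also be computed in one call. The following is only a minimal sketch on a toy data frame: the values are made up and are not taken from `ushealth`, and the names `toy` and `reg_means` are hypothetical.

```r
# Toy data with made-up values, NOT taken from ushealth
toy <- data.frame(
  reg    = factor(c("Northeast", "Northeast", "West", "West")),
  deaths = c(900, 1000, 600, 700)
)

# Mean of deaths within each region, returned as a named vector
reg_means <- tapply(toy$deaths, toy$reg, mean)
print(reg_means)
```

Applied to the real data, the same call would presumably read `tapply(ushealth$deaths, ushealth$reg, mean)`, which replaces the four separate `mean()` calls.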

library(ggplot2)
library("reshape2") 
library(lattice)
library(wesanderson)
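data_long_2 <- melt(ushealth[c(3:9, 12)], id = "reg")  # long format: one row per state and cause
ggplot(data_long_2, aes(x = variable, y = value, color = reg)) +
  geom_boxplot() +
  labs(y = "number of deaths", x = "reason of death", title = "Number of deaths due to reason and region") +
  theme(plot.title = element_text(hjust = 0.5))  # center the title instead of padding it with spaces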

data_long_2 <- melt(ushealth[c(3, 6:9, 12)], id = "reg")  # same, without columns 4 and 5
ggplot(data_long_2, aes(x = variable, y = value, color = reg)) +
  geom_boxplot() +
  labs(y = "number of deaths", x = "reason of death", title = "Number of deaths due to reason and region") +
  theme(plot.title = element_text(hjust = 0.5))  # center the title instead of padding it with spaces

The next two plots keep the information about the reason of death. They show boxplots of the number of deaths for each given reason, split by region. In the first plot, cardiovascular disease and cancer have a much higher number of deaths than the other reasons, and the differences between regions follow the same pattern as in the dot chart above. The second plot shows only the reasons other than cardiovascular disease and cancer, so that their differences are easier to see. The number of deaths differs slightly between regions, but no single pattern holds for all reasons.
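The comparison that the boxplots make visually can be checked numerically by computing the median per reason and region. The sketch below uses base R only, with `stack()` standing in for the `melt()` step; the data frame `toy` and its values are made up for illustration and do not come from `ushealth`.

```r
# Toy wide data with made-up values, NOT taken from ushealth
toy <- data.frame(
  reg  = c("West", "West", "South", "South"),
  acc  = c(40, 50, 60, 70),
  card = c(300, 320, 400, 420)
)

# Wide to long: stack() returns the columns `values` and `ind`
long <- stack(toy[c("acc", "card")])
long$reg <- rep(toy$reg, times = 2)  # repeat the region once per stacked column

# Median number of deaths for each reason within each region
med <- aggregate(values ~ ind + reg, data = long, FUN = median)
print(med)
```

Each row of `med` corresponds to the center line of one box in a plot of this kind.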

library(psych)
pairs.panels(ushealth[, c("popu.1985", "land.area", "doc", "shop", "deaths")],
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE, # show correlation ellipses
             main = "Dependence of some numeric variables")

The last plot shows the pairwise dependence between some numeric variables in our dataset: population, land area, and the numbers of doctors, hospitals, and deaths. The scatter plots in the lower left panels show a positive dependence between the number of doctors, the number of hospitals, and the population, and the upper right panels show high correlations between them. Between the other variables we cannot observe any obvious dependence.
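The correlations printed in the upper right panels are Pearson coefficients, which base R also returns as a matrix through `cor()`. The following is a small sketch under stated assumptions: the data frame `toy` and its columns are invented for illustration, with `doc` constructed to be exactly proportional to `popu`.

```r
# Toy data with made-up values, NOT taken from ushealth
toy <- data.frame(
  popu = c(1, 2, 3, 4, 5),
  doc  = c(2, 4, 6, 8, 10),  # exactly proportional to popu
  area = c(5, 1, 4, 2, 3)
)

# Pearson correlation matrix: symmetric, with ones on the diagonal
r <- cor(toy, method = "pearson")
print(round(r, 2))
```

On the real data the analogous call would take the same five columns that are passed to `pairs.panels()`.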

Veronika Roubínová

16.2.2022
