Dataset description

I have chosen the data set called “ushealth”. It consists of 50 observations of 13 variables: for the year 1985 it states the reported number of deaths in each of the 50 states of the U.S., classified according to 7 categories (accident, cardiovascular, cancer, pulmonary, pneumonia, diabetes, liver), together with variables describing the number of doctors and hospitals, the land area, the population, and the US region and division of each state. For this task I have chosen the states from the South and the West, and I will compare their reported numbers of deaths (vectors with 5 elements; we omit cancer and cardiovascular disease because their values are on a different scale).
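A quick look at the structure of the data can be obtained as follows (a sketch only; it assumes the ushealth data frame is already loaded in the R session, e.g. from the course materials):

str(ushealth)        # 50 observations of 13 variables
head(ushealth)       # first few states
table(ushealth$reg)  # number of states per US region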

Motivation

First we load the data and extract the observations from the South and the West to compare their vectors of reported deaths. To motivate the statistical test we plot boxplots showing the differences between the reported numbers of deaths in the South and the West.

library(ggplot2)   # plotting
library(reshape2)  # melt() for wide-to-long reshaping

Data_S_W <- subset(ushealth, reg == "West" | reg == "South")  # Southern and Western states

data_long_2 <- melt(Data_S_W[c(3, 6:9, 12)], id = "reg")  # five death counts, keyed by region
ggplot(data_long_2, aes(x = variable, y = value, color = reg)) +
  geom_boxplot() +
  labs(y = "number of deaths", x = "cause of death",
       title = "Number of deaths by cause and region")

From the plots we can see that there are differences between the numbers of reported deaths in the South and the West.

We will perform the two-sample Hotelling test to compare the two population means. We assume a multivariate sample \(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n} \sim N_{p}(\boldsymbol{\mu}_{1}, \Sigma)\) and another sample \(\boldsymbol{Y}_{1}, \dots, \boldsymbol{Y}_{m} \sim N_{p}(\boldsymbol{\mu}_{2}, \Sigma)\) with a possibly different mean parameter \(\boldsymbol{\mu}_{2}\). We are interested in testing the null hypothesis \[ H_{0}: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_{2} \] against the alternative hypothesis that the null does not hold.

In our case the assumptions of multivariate normality and of a common variance matrix \(\Sigma\) might be violated, but we will perform the test anyway.
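A rough check of these assumptions could look as follows (a sketch only; the mvnormtest and heplots packages are assumed to be installed, and the column indices are the same as in the plotting code above):

library(mvnormtest)  # mshapiro.test: multivariate Shapiro-Wilk test
library(heplots)     # boxM: Box's M test of equal covariance matrices

deaths <- as.matrix(Data_S_W[, c(3, 6:9)])            # the five death counts
mshapiro.test(t(deaths[Data_S_W$reg == "South", ]))   # normality in the South
mshapiro.test(t(deaths[Data_S_W$reg == "West", ]))    # normality in the West
boxM(deaths, factor(Data_S_W$reg))                    # equality of covariance matrices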

From the exercise class we have that the rejection region is defined as \[ \frac{nm(n + m - p - 1)}{p(n + m)^2}(\overline{\boldsymbol{X}}_{n} - \overline{\boldsymbol{Y}}_{m})^{\top} \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_{n} - \overline{\boldsymbol{Y}}_{m}) \geq F_{p, n + m - p - 1}(1 - \alpha), \]

where \(\mathcal{S} = (n + m)^{-1} \cdot (n \mathcal{S}_1 + m \mathcal{S}_2)\) and \(n \mathcal{S}_{1} = \mathbb{X}^\top \mathcal{H}\mathbb{X} \sim W_p(\Sigma, n - 1)\) and \(m \mathcal{S}_{2} = \mathbb{Y}^\top \mathcal{H}\mathbb{Y} \sim W_p(\Sigma, m - 1)\) respectively are the (rescaled) empirical estimates of the variance-covariance matrix \(\Sigma\).

In our case \(n = 16\) and \(m = 13\) are the sample sizes (16 Southern and 13 Western states) and \(p = 5\) is the dimension of the mean vectors \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_{2}\).
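The statistic can also be computed by hand directly from the formula above. The following is a minimal sketch, reusing Data_S_W and the column selection (3, 6–9) from the plotting code, taking \(\boldsymbol{X}\) as the Southern and \(\boldsymbol{Y}\) as the Western sample, and assuming the level \(\alpha = 0.05\):

X <- as.matrix(Data_S_W[Data_S_W$reg == "South", c(3, 6:9)])
Y <- as.matrix(Data_S_W[Data_S_W$reg == "West",  c(3, 6:9)])
n <- nrow(X); m <- nrow(Y); p <- ncol(X)

S1 <- cov(X) * (n - 1) / n          # biased estimate, so that n * S1 = X'HX
S2 <- cov(Y) * (m - 1) / m
S  <- (n * S1 + m * S2) / (n + m)   # pooled matrix S from above

d     <- colMeans(X) - colMeans(Y)
Fstat <- n * m * (n + m - p - 1) / (p * (n + m)^2) * drop(t(d) %*% solve(S) %*% d)

alpha <- 0.05
Fstat >= qf(1 - alpha, p, n + m - p - 1)         # TRUE -> reject H0
pf(Fstat, p, n + m - p - 1, lower.tail = FALSE)  # p-value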

When we perform the test we get:

Hotelling \(T^2\) test

Test statistic: 18.732

Degrees of freedom: 5, 23

P-value: 0.025

From the output we can see that the p-value 0.025 is below the usual level \(\alpha = 0.05\), so we reject the null hypothesis \(H_{0}: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_{2}\); the difference between the mean vectors is statistically significant.
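Note that the reported test statistic appears to be on the scale of the classical Hotelling \(T^2\) statistic rather than the \(F\) scale used in the rejection region above. Assuming 18.732 is the classical two-sample \(T^2\) statistic, it can be rescaled to the \(F_{p, n + m - p - 1}\) distribution with the factor \(\frac{n + m - p - 1}{p(n + m - 2)}\), which reproduces the reported p-value (a sketch, reusing n, m and p from the manual computation above):

T2   <- 18.732                                    # reported statistic (assumed to be T^2)
Fval <- (n + m - p - 1) / (p * (n + m - 2)) * T2  # approximately 3.19
pf(Fval, p, n + m - p - 1, lower.tail = FALSE)    # approximately 0.025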

We will also compute the confidence intervals for all elements of the difference between the mean vectors and present them in a table.

We will use the fact that the corresponding simultaneous confidence intervals for all possible linear combinations \(\boldsymbol{a}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\) of the difference of the mean vectors are given as \[ P\Big(\forall \boldsymbol{a} \in \mathbb{R}^{p}:~ \boldsymbol{a}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \in \big( \boldsymbol{a}^\top(\overline{\boldsymbol{X}}_{n} - \overline{\boldsymbol{Y}}_{m}) - \sqrt{K_{\alpha} \boldsymbol{a}^\top \mathcal{S} \boldsymbol{a}},~ \boldsymbol{a}^\top(\overline{\boldsymbol{X}}_{n} - \overline{\boldsymbol{Y}}_{m}) + \sqrt{K_{\alpha} \boldsymbol{a}^\top \mathcal{S} \boldsymbol{a}} \big) \Big) = 1 - \alpha, \] where \(K_{\alpha} = \frac{p(n + m)^2}{nm(n + m - p - 1)} F_{p, n + m - p - 1}(1 - \alpha)\) is the corresponding (rescaled) quantile of the Fisher \(F\) distribution and \(\mathcal{S}\) is the pooled sample variance-covariance matrix from above.

We use this fact with \(\boldsymbol{a}\) equal to the \(i\)th unit vector for \(i = 1, \dots, 5\), which gives intervals for the individual components of \(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\).
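A minimal sketch of this computation, reusing X, S, n, m, p, d and alpha from the manual test above (for the unit vectors, \(\boldsymbol{a}^\top \mathcal{S} \boldsymbol{a}\) reduces to the diagonal of \(\mathcal{S}\)):

K    <- p * (n + m)^2 / (n * m * (n + m - p - 1)) * qf(1 - alpha, p, n + m - p - 1)
half <- sqrt(K * diag(S))                   # half-widths of the intervals
data.frame(lower = d - half, upper = d + half, estimate = d, row.names = colnames(X))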

The confidence intervals are reported in the following table. The largest estimated difference is in the expected number of deaths caused by accidents, and this component also has the widest confidence interval, so this variable might be the main cause of rejecting the null hypothesis.

Confidence intervals

                 Lower Boundary   Upper Boundary   Mean Estimate
accident            -20.2285157        12.180439      -4.0240385
pulmonar            -11.3566711         7.843210      -1.7567308
pneumonia flu        -4.8480453         4.940353       0.0461538
diabetes             -0.8385584         8.718366       3.9399038
liver                -5.0982763         2.457892      -1.3201923