For the data analysis I chose to work with the dataset \(\mathsf{vocabulary}\) from the library \(\mathsf{SMSdata}\):
library(SMSdata)
data(vocabulary)
I will work only with the first two variables, \(\mathsf{Grade 8}\) and \(\mathsf{Grade 9}\):
X = vocabulary[,c(1,2)]
head(X)
## Grade 8 Grade 9
## 1 1.75 2.60
## 2 0.90 2.47
## 3 0.80 0.93
## 4 2.42 4.15
## 5 -1.31 -1.31
## 6 -1.56 1.67
summary(X)
## Grade 8 Grade 9
## Min. :-2.1900 Min. :-1.310
## 1st Qu.:-0.0525 1st Qu.: 1.113
## Median : 1.2300 Median : 2.455
## Mean : 1.1384 Mean : 2.542
## 3rd Qu.: 2.1825 3rd Qu.: 3.405
## Max. : 8.2600 Max. : 9.550
To check the normality of the data informally, we can look at the following 2D scatterplot, with the 95% ellipse of the fitted normal distribution drawn over it:
library(mixtools)
plot(X, pch = 21, bg = "lightblue", xlab = "Grade 8", ylab = "Grade 9", xlim = c(-5,8), ylim = c(-4,10))
ellipse(mu = colMeans(X), sigma = cov(X), col = "red")
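The share of observations falling inside such an ellipse can also be computed directly, as a numeric complement to the plot. The sketch below assumes that the ellipse is the set of points whose squared Mahalanobis distance from the sample mean is at most the \(\chi^2_2\) quantile at level \(1-\alpha\), which is how I understand the default of \(\mathsf{ellipse}\) with \(\alpha = 0.05\). The helper `share_inside_ellipse` and the simulated matrix `Z` are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: share of rows inside the (1 - alpha) normal ellipse.
share_inside_ellipse <- function(X, alpha = 0.05) {
  X <- as.matrix(X)
  # squared Mahalanobis distances from the sample mean
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  mean(d2 <= qchisq(1 - alpha, df = ncol(X)))
}

# Simulated stand-in data (an assumption; pass X instead of Z for the real check)
set.seed(1)
Z <- cbind(rnorm(64, mean = 1, sd = 2), rnorm(64, mean = 2.5, sd = 2))
share <- share_inside_ellipse(Z)
share
```

Under bivariate normality this share should be roughly 0.95, so a value far below that would be a warning sign.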
The plot looks reasonable: almost all points lie inside the ellipse, which is what we expect under normality. We can now formulate the problem:
Consider iid random vectors \(\mathbf{X}_1, \ldots, \mathbf{X}_n \sim \mathcal{N}_2(\boldsymbol{\mu}, \Sigma)\), where \(\mathbf{X}_i = (X_{i1}, X_{i2})^T\). We have \[\begin{align*} (n-1)(\boldsymbol{\overline{X} - \mu})^T S^{-1} (\boldsymbol{\overline{X} - \mu}) \sim T^2(n-1,2), \end{align*}\] where \(S\) stands for the sample variance matrix of \(X\) (with divisor \(n\), which is the version for which this scaling holds).
Equivalently, in terms of the F-distribution, \[\begin{align*} \frac{n-2}{2} (\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \sim F_{2,n-2}. \end{align*}\]
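The step between the two forms rests on the standard link between Hotelling's \(T^2\) and the F-distribution. As a short sketch, writing \(p\) for the dimension (here \(p = 2\)) and keeping the \(T^2(n-1,p)\) notation used above: \[\begin{align*} (n-1)(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) &\sim T^2(n-1,p) = \frac{(n-1)\,p}{n-p}\, F_{p,n-p}, \\ \frac{n-p}{(n-1)\,p}\cdot(n-1)(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) &= \frac{n-p}{p}(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \sim F_{p,n-p}. \end{align*}\] Setting \(p = 2\) gives the factor \(\frac{n-2}{2}\) and the \(F_{2,n-2}\) distribution.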
Based on the sample means from the summary table, we can formulate the null and alternative hypotheses as follows:
\[H_0: \boldsymbol{\mu} = \boldsymbol{\mu_0} = \begin{pmatrix} 1 \\ 2.5 \end{pmatrix}\] \[H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu_0}\] We will base our test on the following test statistic: \[(n-1)(\boldsymbol{\overline{X} - \mu_0})^T S^{-1}(\boldsymbol{\overline{X} - \mu_0}) \overset{H_0}{\sim} T^2(n-1,2).\] Running this test in R gives the following results:
library(DescTools)
HotellingsT2Test(X, y = NULL, mu = c(1,2.5), test = "f")
##
## Hotelling's one sample T2-test
##
## data: X
## T.2 = 0.31096, df1 = 2, df2 = 62, p-value = 0.7339
## alternative hypothesis: true location is not equal to c(1,2.5)
The p-value (0.7339) is much higher than 0.05, thus we do not reject the null hypothesis at the 5% level. This does not prove that the true mean equals \((1,2.5)\), but the data are fully consistent with a true mean close to this value.
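To see what the test computes, the F form of the statistic can be evaluated by hand in base R. This is a minimal sketch: `hotelling_one_sample` is a hypothetical helper, not a function from \(\mathsf{DescTools}\), and the matrix `Z` is simulated stand-in data rather than the vocabulary scores. Applied to \(X\) with \(\boldsymbol{\mu_0} = (1, 2.5)^T\), it should reproduce the reported statistic and p-value, provided \(\mathsf{HotellingsT2Test}\) with `test = "f"` reports the F-scaled statistic, which is my understanding.

```r
# Hypothetical helper: one-sample Hotelling test in its F form, base R only.
hotelling_one_sample <- function(X, mu0) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  d <- colMeans(X) - mu0
  S <- cov(X) * (n - 1) / n                  # variance matrix with divisor n
  # (n - p) / p * d' S^{-1} d follows F(p, n - p) under H0
  stat <- (n - p) / p * drop(t(d) %*% solve(S) %*% d)
  c(F = stat, df1 = p, df2 = n - p,
    p.value = pf(stat, p, n - p, lower.tail = FALSE))
}

# Simulated stand-in data (an assumption; pass X instead of Z for the real data)
set.seed(1)
Z <- cbind(rnorm(64, mean = 1, sd = 2), rnorm(64, mean = 2.5, sd = 2))
res <- hotelling_one_sample(Z, mu0 = c(1, 2.5))
res
```

The rescaling of `cov(X)` matters because `cov` divides by \(n-1\), while the formulas above use the variance matrix with divisor \(n\).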
Finally, we will provide a confidence region for the true mean \(\boldsymbol{\mu}\). As mentioned before, \[\frac{n-2}{2}(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \sim F_{2,n-2}.\] Thus the \((1-\alpha)\) confidence region is of the form \[\left\{\boldsymbol{\mu} \in \mathbb{R}^2: (\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \leq \frac{2}{n-2}F_{2,n-2}(1-\alpha) \right\},\] where \(F_{2,n-2}(1-\alpha)\) denotes the \((1-\alpha)\) quantile of the \(F_{2,n-2}\) distribution.
We can visualise this confidence region for \(\alpha = 0.05\) as well, with the density of the two-dimensional normal distribution, whose parameters are given by the sample mean and variance matrix of \(X\), in the background:
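Checking whether a given point belongs to this region only requires comparing the quadratic form with the scaled F quantile. The sketch below illustrates this. The helper `in_conf_region` is hypothetical, and the values of `xbar` and `S` are made up for illustration. They are not the estimates from the vocabulary data.

```r
# Hypothetical helper: is mu inside the (1 - alpha) confidence region?
in_conf_region <- function(mu, xbar, S, n, alpha = 0.05) {
  p <- length(xbar)
  # quadratic form (xbar - mu)' S^{-1} (xbar - mu)
  q <- mahalanobis(mu, center = xbar, cov = S)
  q <= p / (n - p) * qf(1 - alpha, p, n - p)
}

# Made-up inputs for illustration only
xbar <- c(1.1, 2.5)
S <- matrix(c(3.5, 2.5, 2.5, 4.5), nrow = 2)   # assumed to use divisor n
inside_near <- in_conf_region(c(1, 2.5), xbar, S, n = 64)
inside_far <- in_conf_region(c(3, 0), xbar, S, n = 64)
```

With these inputs the first point lies inside the region and the second one outside. A point \(\boldsymbol{\mu_0}\) lies inside the 95% region exactly when the test of \(H_0: \boldsymbol{\mu} = \boldsymbol{\mu_0}\) does not reject at the 5% level.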
library(mvtnorm)
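x <- seq(-5, 6, 0.01)
y <- seq(-5, 10, 0.01)
n <- nrow(X)                 # n = 64 observations
m <- colMeans(X)
# variance matrix with divisor n: the (n-2)/2 scaling with F(2, n-2) assumes this version
S <- cov(X) * (n - 1) / n
# background: density of the bivariate normal distribution with mean m and variance S
contour(x, y, outer(x, y, function(x, y) dmvnorm(cbind(x, y), mean = m, sigma = S)))
points(m[1], m[2], pch = 8, col = "red", cex = 2)
S1 <- solve(S)
# boundary of the 95% confidence region for the mean
contour(x, y, outer(x, y, function(x, y) {
  (n - 2) * mahalanobis(cbind(x, y), center = m, cov = S1, inverted = TRUE) <
    2 * qf(0.95, 2, n - 2)
}), col = "red", add = TRUE)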