Individual assignment IV

For the data analysis I chose to work with the dataset \(\mathsf{vocabulary}\) from the library \(\mathsf{SMSdata}\):

library(SMSdata)
data(vocabulary)

I will work only with the first two variables, \(\mathsf{Grade 8}\) and \(\mathsf{Grade 9}\):

X = vocabulary[,c(1,2)]
head(X)
##   Grade 8 Grade 9
## 1    1.75    2.60
## 2    0.90    2.47
## 3    0.80    0.93
## 4    2.42    4.15
## 5   -1.31   -1.31
## 6   -1.56    1.67
summary(X)
##     Grade 8           Grade 9      
##  Min.   :-2.1900   Min.   :-1.310  
##  1st Qu.:-0.0525   1st Qu.: 1.113  
##  Median : 1.2300   Median : 2.455  
##  Mean   : 1.1384   Mean   : 2.542  
##  3rd Qu.: 2.1825   3rd Qu.: 3.405  
##  Max.   : 8.2600   Max.   : 9.550

To check the normality of the data visually, we can look at the following scatterplot, overlaid with the ellipse given by the sample mean and variance matrix:

library(mixtools)
plot(X, pch = 21, bg = "lightblue", xlab = "Grade 8", ylab = "Grade 9", xlim = c(-5,8), ylim = c(-4,10))
ellipse(mu = colMeans(X), sigma = cov(X), col = "red")
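
The visual check can be complemented by a rough numerical one. The sketch below assumes that the ellipse is the 95% contour based on the \(\chi^2_2\) quantile (the default \(\mathsf{alpha = 0.05}\) of \(\mathsf{ellipse}\)) and counts the share of points whose squared Mahalanobis distance from the sample mean falls below that quantile. The matrix \(\mathsf{Z}\) is a simulated stand-in, not the vocabulary data; with the real data one would pass \(\mathsf{X}\) instead.

```r
# Hypothetical stand-in data (simulated); replace Z by X to check the real sample.
set.seed(539)
Z <- matrix(rnorm(2 * 64), ncol = 2)

# Squared Mahalanobis distances from the sample mean, using the sample variance matrix.
d2 <- mahalanobis(Z, colMeans(Z), cov(Z))

# A point lies inside the 95% ellipse if its distance is below the chi-square quantile.
inside <- d2 <= qchisq(0.95, df = 2)
mean(inside)   # share of points inside; roughly 0.95 is expected under normality
```

A share far below 0.95 would be a warning sign, but this is only an informal check and not a formal test of normality.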

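The plot looks reasonable: almost all points lie inside the ellipse, which with the default setting \(\mathsf{alpha = 0.05}\) should cover about 95% of a bivariate normal sample. We can therefore formulate the problem: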

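Consider iid random vectors \(\mathbf{X}_1, \ldots, \mathbf{X}_n \sim \mathcal{N}_2(\boldsymbol{\mu}, \Sigma)\), where \(\mathbf{X}_i = (X_{i1}, X_{i2})^T\). We have \[\begin{align*} (n-1)(\boldsymbol{\overline{X} - \mu})^T S^{-1} (\boldsymbol{\overline{X} - \mu}) \sim T^2(n-1,2), \end{align*}\] where \(\boldsymbol{\overline{X}}\) is the sample mean and \(S\) stands for the sample variance matrix of \(X\) with divisor \(n\). (With the divisor \(n-1\), which \(\mathsf{cov}\) in R uses, the leading factor \(n-1\) becomes \(n\).)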

Equivalently, this can be expressed as \[\begin{align*} \frac{n-2}{2} (\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \sim F_{2,n-2}. \end{align*}\]
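
The equivalence follows from the standard relation between Hotelling's \(T^2\) and the \(F\) distribution. Written in the notation used here and for a general dimension \(p\) (in our case \(p = 2\)), it reads \[\begin{align*} T^2(n-1,p) &= \frac{(n-1)p}{n-p}\, F_{p,n-p}, \\ \frac{n-p}{(n-1)p} \cdot (n-1)(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) &= \frac{n-p}{p}(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \sim F_{p,n-p}. \end{align*}\] Setting \(p = 2\) gives the factor \(\frac{n-2}{2}\) and the \(F_{2,n-2}\) distribution.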

Based on the sample means from the summary table, we can formulate the null and alternative hypotheses as follows:

\[H_0: \boldsymbol{\mu} = \boldsymbol{\mu_0} = \begin{pmatrix} 1 \\ 2.5 \end{pmatrix}\] \[H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu_0}\]

We will base our test on the following test statistic: \[(n-1)(\boldsymbol{\overline{X} - \mu_0})^T S^{-1}(\boldsymbol{\overline{X} - \mu_0}) \overset{H_0}{\sim} T^2(n-1,2).\] If we run this test in R (with \(\mathsf{test = "f"}\), so that the statistic is reported on the equivalent \(F\) scale with 2 and \(n-2\) degrees of freedom), we obtain the following results:

library(DescTools)
HotellingsT2Test(X, y = NULL, mu = c(1,2.5), test = "f")
## 
##  Hotelling's one sample T2-test
## 
## data:  X
## T.2 = 0.31096, df1 = 2, df2 = 62, p-value = 0.7339
## alternative hypothesis: true location is not equal to c(1,2.5)

We obtained a p-value of 0.7339, which is much higher than 0.05, so we do not reject the null hypothesis at the 5% level. The data are consistent with a true mean of \((1,2.5)^T\), although this does not prove that the true mean equals this value.
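
To make the computation behind this output more transparent, here is a minimal base R sketch of the one-sample test. The function name \(\mathsf{hotelling\_one\_sample}\) is made up for illustration and is not part of \(\mathsf{DescTools}\); I assume it mirrors what \(\mathsf{HotellingsT2Test}\) does with \(\mathsf{test = "f"}\), which is supported by the fact that the p-value recomputed from the reported statistic agrees with the output above. The matrix \(\mathsf{Z}\) is simulated stand-in data, not the vocabulary data.

```r
# Illustrative helper (hypothetical name), base R only.
hotelling_one_sample <- function(X, mu0) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  d <- colMeans(X) - mu0                    # sample mean minus hypothesised mean
  S_u <- cov(X)                             # variance matrix with divisor n - 1
  T2 <- n * drop(t(d) %*% solve(S_u, d))    # Hotelling's T^2 (factor n because of divisor n - 1)
  F_stat <- (n - p) / (p * (n - 1)) * T2    # same statistic on the F scale
  p_value <- pf(F_stat, p, n - p, lower.tail = FALSE)
  list(T2 = T2, F = F_stat, df1 = p, df2 = n - p, p.value = p_value)
}

# p-value recomputed from the reported statistic 0.31096 with 2 and 62 degrees of freedom.
p_reported <- pf(0.31096, df1 = 2, df2 = 62, lower.tail = FALSE)
round(p_reported, 4)

# Hypothetical simulated data, only to show the call; use X and mu0 = c(1, 2.5) for the real test.
set.seed(1)
Z <- matrix(rnorm(2 * 64), ncol = 2)
res <- hotelling_one_sample(Z, mu0 = c(0, 0))
res$F
res$p.value
```

Note that the helper works with the divisor \(n-1\) used by \(\mathsf{cov}\); this leads to the same \(F\) statistic as \(\frac{n-2}{2}\) times the quadratic form with the divisor \(n\).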

Finally, we will provide a confidence region for the true mean \(\boldsymbol{\mu}\). As mentioned before, \[\frac{n-2}{2}(\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \sim F_{2,n-2}.\] Thus the \((1-\alpha)\) confidence region is of the form \[\{\boldsymbol{\mu} \in \mathbb{R}^2: (\boldsymbol{\overline{X} - \mu})^T S^{-1}(\boldsymbol{\overline{X} - \mu}) \leq \frac{2}{n-2}F_{2,n-2}(1-\alpha) \},\] where \(F_{2,n-2}(1-\alpha)\) denotes the \((1-\alpha)\)-quantile of the \(F_{2,n-2}\) distribution.
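
As a small sketch of how this region can be used, the code below computes the bound on the right-hand side for \(n = 64\) (the sample size implied by the 62 degrees of freedom in the test output) and \(\alpha = 0.05\), and defines a membership check. The function \(\mathsf{in\_region}\) and the demo inputs are hypothetical and serve only to illustrate the call.

```r
n <- 64          # sample size, implied by df2 = n - 2 = 62 in the test output
alpha <- 0.05

# Right-hand side of the inequality defining the confidence region.
threshold <- 2 / (n - 2) * qf(1 - alpha, df1 = 2, df2 = n - 2)
threshold

# TRUE if mu lies in the region; xbar is the sample mean, S_n the variance matrix with divisor n.
in_region <- function(mu, xbar, S_n, threshold) {
  d <- xbar - mu
  drop(t(d) %*% solve(S_n, d)) <= threshold
}

# Hypothetical inputs, chosen only to demonstrate the function.
xbar_demo <- c(0, 0)
S_demo <- diag(2)
in_region(c(0.1, 0.1), xbar_demo, S_demo, threshold)   # quadratic form 0.02
in_region(c(1, 1), xbar_demo, S_demo, threshold)       # quadratic form 2
```

For the vocabulary data one would call it with \(\mathsf{xbar = colMeans(X)}\) and \(\mathsf{S\_n = cov(X) * (n - 1) / n}\).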

We can visualise this confidence region as well (with the density of the two-dimensional normal distribution, with parameters given by the sample mean and variance matrix of \(X\), in the background):

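library(mvtnorm)
n <- nrow(X)                  # number of observations (64)
x <- seq(-5, 6, 0.01)
y <- seq(-5, 10, 0.01)
m <- colMeans(X)              # sample mean
S <- cov(X) * (n - 1) / n     # variance matrix with divisor n, for which the F result holds exactly

# background: contours of the N2(m, S) density
contour(x, y, outer(x, y, function(x, y) dmvnorm(cbind(x, y), mean = m, sigma = S)))
points(m[1], m[2], pch = 8, col = "red", cex = 2)
# boundary of the 95% confidence region: (n - 2) * quadratic form equals 2 * F quantile
S1 <- solve(S)
contour(x, y, outer(x, y, function(x, y) (n - 2) * mahalanobis(cbind(x, y), m, S1, inverted = TRUE)),
        levels = 2 * qf(0.95, 2, n - 2), drawlabels = FALSE, col = "red", add = TRUE)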