Theoretical Exercise

From the theoretical exercises I have chosen the third one:

Exercise 3

Consider a random vector \(\boldsymbol{X} \sim N_{2}(\boldsymbol{\mu}, \Sigma)\), for \(\boldsymbol{\mu} = (2,2)^\top\) and \(\Sigma = \mathbb{I}_2\) (a unit matrix). Consider also matrices \(\mathbb{A} = (1, 1)\) and \(\mathbb{B} = (1, -1)\). Show, that the random variables \(\mathbb{A}\boldsymbol{X}\) and \(\mathbb{B}\boldsymbol{X}\) are independent.

Solution

\(\boldsymbol{X}\) has a multivariate normal distribution, so by the definition of the multivariate normal distribution both \(\mathbb{A}\boldsymbol{X}\) and \(\mathbb{B}\boldsymbol{X}\) have univariate normal distributions. Moreover, the vector \((\mathbb{A}\boldsymbol{X}, \mathbb{B}\boldsymbol{X})^{\top}\) is a linear transformation of \(\boldsymbol{X}\) and is therefore jointly normal, so it is enough to show that \(\mathbb{A}\boldsymbol{X}\) and \(\mathbb{B}\boldsymbol{X}\) are uncorrelated to get independence. Let us compute:

\(\text{cov}(\mathbb{A}\boldsymbol{X},\mathbb{B}\boldsymbol{X})=\mathbb{A}\text{var}(\boldsymbol{X})\mathbb{B}^{\top}=\mathbb{A}\mathbb{B}^{\top}=(1, 1)(1, -1)^{\top}=1-1=0\)

From this we get that \(\mathbb{A}\boldsymbol{X}\) and \(\mathbb{B}\boldsymbol{X}\) are uncorrelated and normally distributed, so they are independent.
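This zero covariance can also be checked directly in R. The following is a quick numerical sketch of the matrix computation above (with \(\Sigma = \mathbb{I}_2\)):

```r
A <- matrix(c(1, 1), nrow = 1)   # row vector (1, 1)
B <- matrix(c(1, -1), nrow = 1)  # row vector (1, -1)
Sigma <- diag(2)                 # unit covariance matrix

# cov(AX, BX) = A Sigma B^T
cov_AB <- A %*% Sigma %*% t(B)
cov_AB                           # 1x1 matrix containing 0
```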

We now verify this result empirically with an R simulation: we simulate \(\boldsymbol{X} \sim N_{2}(\boldsymbol{\mu}, \Sigma)\), compute \(\mathbb{A}\boldsymbol{X}\) and \(\mathbb{B}\boldsymbol{X}\), and inspect their independence with a plot.

library("mvtnorm")
set.seed(12345)
n <- 10000
# simulate n draws from N_2((2, 2)^T, I_2)
X <- rmvnorm(n, mean = c(2, 2), sigma = diag(2))
A <- c(1, 1)
B <- c(1, -1)
a <- as.vector(A %*% t(X))  # realizations of AX
b <- as.vector(B %*% t(X))  # realizations of BX
plot(a, b, pch = 21, bg = "skyblue", main = "Independence")
abline(lm(b ~ a), col = "deepskyblue4", lwd = 2)

r_ab <- cor(a, b)           # sample Pearson correlation

We can see that the line fitted by the linear model is almost constant, which suggests that the variables are uncorrelated, consistent with independence.

We have also computed the sample Pearson correlation coefficient, which is approximately -0.01889. This agrees with the fact that the variables are uncorrelated.

Practical Exercise

In this part we propose a simple example, using a two-dimensional random vector \((X,Y)^{\top}\), showing that even when both marginal distributions are normal, the random vector may fail to follow a joint (bivariate) normal distribution.

This is true for the random vector \((X,Y)^{\top}\) with the joint distribution:

\(f_{(X,Y)}(x,y)= \begin{cases} 2f_X(x)f_Y(y), & \text{if } x\geq 0,y\geq 0,\text{or } x < 0,y < 0,\\ 0, &\text{otherwise}, \end{cases}\)

where \(f_X(x)\) and \(f_Y(y)\) are densities of the standard normal distribution. We first check the marginal densities.

\(f_X(x)=\int 2f_X(x)f_Y(y)(\mathbb{I}_{x\geq 0,y\geq 0}+\mathbb{I}_{x < 0,y < 0})\,dy=2f_X(x)\mathbb{I}_{x< 0}\int_{-\infty}^{0} f_Y(y)\,dy+2f_X(x)\mathbb{I}_{x \geq 0}\int_{0}^{\infty} f_Y(y)\,dy=2f_X(x)\mathbb{I}_{x< 0}\cdot\tfrac{1}{2}+2f_X(x)\mathbb{I}_{x \geq 0}\cdot\tfrac{1}{2}=f_X(x),\)

where both integrals equal \(\tfrac{1}{2}\) because \(f_Y\) is the standard normal density.

So \(X\) has marginal standard normal distribution and from the symmetry the same thing holds for \(Y\).
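The marginal computation can also be checked numerically. The sketch below integrates the joint density over \(y\) at a few values of \(x\) (splitting the integral at the discontinuity at \(0\)) and compares the result with the standard normal density:

```r
# joint density f_{(X,Y)}(x, y) as defined above
joint <- function(x, y) {
  2 * dnorm(x) * dnorm(y) * ((x >= 0 & y >= 0) | (x < 0 & y < 0))
}

# marginal of X: integrate the joint density over y,
# splitting at 0 where the indicator jumps
marginal_x <- function(x) {
  integrate(function(y) joint(x, y), lower = -Inf, upper = 0)$value +
    integrate(function(y) joint(x, y), lower = 0, upper = Inf)$value
}

xs <- c(-1.5, -0.3, 0.4, 2)
cbind(numeric = sapply(xs, marginal_x), exact = dnorm(xs))
```

The two columns agree, confirming that \(X\) has the standard normal marginal density.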

To generate random sample from this distribution, we will need to use the fact that:

\(f_{(X,Y)}(x,y)=f_{Y|X}(y|x)\cdot f_X(x)\)

For that we need to compute the conditional density, which is given by

\(f_{Y|X}(y|x)=\dfrac{f_{(X,Y)}(x,y)}{f_X(x)}=\dfrac{2f_X(x)f_Y(y)(\mathbb{I}_{x\geq 0,y\geq 0}+\mathbb{I}_{x < 0,y < 0})}{f_X(x)}=2f_Y(y)(\mathbb{I}_{x\geq 0,y\geq 0}+\mathbb{I}_{x < 0,y < 0})\)

We then use the same inverse-transform algorithm as in the previous assignment: draw from the uniform distribution and apply the quantile function of the conditional distribution. For that we need to find the conditional distribution function and invert it.

\(F_{Y|X}(y|x)=\int_{-\infty}^{y}2f_Y(u)(\mathbb{I}_{x\geq 0,u\geq 0}+\mathbb{I}_{x < 0,u < 0})\,du= \begin{cases} 2\Phi(y)-1 & \text{if } y\geq 0,x\geq 0,\\ 1 & \text{if } y\geq 0,x < 0,\\ 2\Phi(y) & \text{if } y < 0,x < 0,\\ 0 & \text{if } y< 0,x \geq 0. \end{cases}\)

From the formula for the conditional distribution we can see that it is not normal, therefore our random vector cannot have a multivariate normal distribution. The corresponding quantile function is

\(F_{Y|X}^{-1}(z)= \begin{cases} \Phi^{-1}\left(\dfrac{z+1}{2}\right) & \text{if } z \in (0,1),x\geq 0,\\ \Phi^{-1}\left(\dfrac{z}{2}\right) & \text{if } z \in (0,1),x < 0. \end{cases}\)
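Before using the quantile function for sampling, we can check numerically that it inverts the conditional distribution function for both signs of \(x\). A small sketch (the functions below simply transcribe the two formulas above):

```r
# conditional CDF F_{Y|X}(y | x) from the case distinction above
F_cond <- function(y, x) {
  if (x >= 0) ifelse(y >= 0, 2 * pnorm(y) - 1, 0)
  else        ifelse(y >= 0, 1, 2 * pnorm(y))
}

# conditional quantile function F^{-1}_{Y|X}(z | x)
Q_cond <- function(z, x) {
  if (x >= 0) qnorm((z + 1) / 2) else qnorm(z / 2)
}

z <- seq(0.05, 0.95, by = 0.05)
err_pos <- max(abs(sapply(z, function(zz) F_cond(Q_cond(zz,  1),  1)) - z))
err_neg <- max(abs(sapply(z, function(zz) F_cond(Q_cond(zz, -1), -1)) - z))
c(err_pos, err_neg)  # both essentially zero
```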

Using this fact we generate a sample of random vectors \((X_i,Y_i)^{\top}\), \(i=1,\dots,n\), with \(n=50000\), and plot a scatterplot with a kernel density estimate to see that the joint distribution is not bivariate normal. We also plot histograms with empirical densities for both \(X\) and \(Y\), which confirm the normal marginal distributions.

library("ggplot2")
set.seed(12345)  # for reproducibility
n <- 50000
X <- rnorm(n, mean = 0, sd = 1)
Z <- runif(n)
# inverse-transform sampling from the conditional distribution of Y given X
Y <- as.numeric(X >= 0) * qnorm((Z + 1) / 2) + as.numeric(X < 0) * qnorm(Z / 2)
data <- data.frame(X, Y)

# scatterplot with contours of a 2D kernel density estimate
ggplot(data, aes(x = X, y = Y)) +
  geom_point(shape = 21, colour = "black", fill = "skyblue") +
  stat_density2d(aes(colour = after_stat(level)))

# Histograms with kernel density estimates of the marginals
ggplot(data, aes(x = X)) +
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white", bins = 50) +
  geom_density(lwd = 1, colour = 4,
               fill = 4, alpha = 0.25) + ggtitle("Marginal distribution of X")

ggplot(data, aes(x = Y)) +
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white", bins = 50) +
  geom_density(lwd = 1, colour = 2,
               fill = 2, alpha = 0.25) + ggtitle("Marginal distribution of Y")