Choosing the data

I have chosen the data set “ushealth”, the same one used for the principal component and factor analyses. It consists of 50 observations (one per U.S. state) of 13 variables. For the year 1985 it records the reported numbers of deaths in the 50 states, classified into 7 categories (accident, cardiovascular, cancer, pulmonary, pneumonia, diabetes, liver), together with variables describing the numbers of doctors and hospitals, the land area, the population, and the U.S. region and division.
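For a first orientation we can quickly inspect the data; this is a minimal sketch assuming the data frame ushealth is already loaded as in the previous sections:

str(ushealth)     # 50 observations of 13 variables
summary(ushealth[, c("acc", "card", "canc", "pul", "pneu", "diab", "liv")])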

Discriminant analysis

In the following, let us consider the ushealth data set. We use only a subset of the continuous covariates, namely the seven variables with the numbers of deaths from different causes, and we search for a linear classification rule (a specific partition of the sample space \(\mathbb{R}^7\)) that classifies the observations by location. The data are grouped by location (states in the West versus all other states; this grouping was chosen in view of the previous findings). We visualize the linear discriminant analysis here:

library(MASS)

# binary grouping factor: states in the West versus all other states
ushealth <- transform(ushealth,
                      freg = factor(as.numeric(ushealth$reg == "West"),
                                    levels = 1:0, labels = c("West", "Other")))
lda1 <- lda(freg ~ acc + card + canc + pul + pneu + diab + liv, data = ushealth)

# histogram of the linear discriminant scores by group
pred <- predict(lda1)
ldahist(data = pred$x[, 1], g = ushealth$freg, col = "darkblue")

From the plot it looks like the “critical” value distinguishing between the two groups lies somewhere around -1.
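To complement the visual impression, we can check the confusion table and the apparent (training) error rate of the seven-variable rule; a small sketch reusing the pred object from above:

table(Predicted = pred$class, True = ushealth$freg)
mean(pred$class != ushealth$freg)   # apparent error rate

We can also visualize the result using a partition plot: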

library("klaR")
partimat(as.factor(freg) ~ acc + card + canc + pul + pneu + diab + liv, data = ushealth, method = "lda",nplots.vert=7,nplots.hor=3, image.colors=c("olivedrab1","lightblue1"))

In the figure above, the sample space is partitioned with respect to all possible pairs of variables, shown as two-dimensional scatterplots. In some cases the partition works quite well and almost all western states are separated, but in other cases it fails. The apparent (empirical) error rates range from 0.02 to 0.26.
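As a rough check, the range of these pairwise error rates can be recomputed directly; a sketch assuming the rates shown by partimat are the apparent error rates of the pairwise LDA rules:

vars <- c("acc", "card", "canc", "pul", "pneu", "diab", "liv")
pair.err <- apply(combn(vars, 2), 2, function(v) {
  fit <- lda(reformulate(v, response = "freg"), data = ushealth)
  mean(predict(fit)$class != ushealth$freg)   # training error for this pair
})
round(range(pair.err), 2)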

For better graphical insight into the LDA results we can consider a simplification where only one continuous covariate is used as a predictor, so that the linear discriminant rule is given in terms of a single scaling factor. I have chosen the variable with the number of deaths caused by cancer.

lda2 <- lda(freg ~ canc, data = ushealth)
lda2cv <- lda(freg ~ canc, data = ushealth, CV = TRUE)  # leave-one-out CV

col <- c("darkblue", "olivedrab3")

# one-dimensional LDA scores: scaling factor times the covariate
lda_scores <- as.numeric(lda2$scaling) * ushealth$canc
par(mfrow = c(1, 2))
plot(lda_scores ~ ushealth$canc, pch = 21, bg = col[ushealth$freg],
     ylab = "LDA Scores", xlab = "Deaths (cancer)")

# cross-validated posterior probabilities, ordered by the covariate
lda2cv <- cbind(lda2cv$posterior, ushealth$canc, ushealth$freg)
lda2cv <- lda2cv[order(ushealth$canc), ]

plot(lda2cv[, 1] ~ lda2cv[, 3], pch = 21, bg = col[lda2cv[, 4]],
     ylab = "Posterior Probability", xlab = "Deaths (cancer)")
lines(lda2cv[, 1] ~ lda2cv[, 3], col = "blue")
points(lda2cv[, 2] ~ lda2cv[, 3], pch = 22, bg = col[lda2cv[, 4]])
lines(lda2cv[, 2] ~ lda2cv[, 3], col = "olivedrab1")

# dotted vertical guide at each observed value of the covariate
for (i in 1:nrow(lda2cv)) {
  lines(rep(lda2cv[i, 3], 2), c(0, 1), col = "gray", lty = 3)
}

We have plotted the LDA scores against the number of deaths caused by cancer, and next to them the cross-validated posterior probabilities of belonging to the West group or to the other group. The colors are determined by the actual group membership.
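Since the lda2cv object was overwritten by the plotting code above, here is a small sketch refitting with CV = TRUE to obtain the leave-one-out cross-validated error rate of this single-covariate rule:

cv <- lda(freg ~ canc, data = ushealth, CV = TRUE)
mean(cv$class != ushealth$freg)   # leave-one-out error rate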

Now we will try several other approaches: a generalized linear model, linear discriminant analysis, quadratic discriminant analysis, and a linear regression model. We plot the numbers of deaths caused by accidents and by cardiovascular diseases and, with each approach, fit a line/curve dividing the two colored groups.

# dense grid over the (acc, card) plane for drawing the decision boundaries
mowers.gr <- expand.grid(seq(20, 90, len = 1000), seq(100, 600, len = 1000))
names(mowers.gr) <- c("acc", "card")

par(mfrow = c(2, 2))
cols <- c("darkblue", "olivedrab3")
mowe.ii <- unique(mowers.gr$acc)
mowe.ll <- unique(mowers.gr$card)

# numeric group coding: 1 = West, 2 = Other
ushealth <- transform(ushealth, numreg = as.numeric(freg))

plot(ushealth$acc, ushealth$card, pch = 21 + ushealth$numreg,
     bg = cols[ushealth$numreg], main = "Generalized Linear Model")
mow.nn <- glm(factor(numreg) ~ acc + card, family = binomial, data = ushealth)

# boundary where the fitted probability crosses 0.5
contour(mowe.ii, mowe.ll,
        matrix(as.numeric(predict(mow.nn, newdata = mowers.gr, type = "response") > 0.5),
               1000, 1000),
        nlev = 1, drawlabels = FALSE, add = TRUE, lwd = 2)

plot(ushealth$acc, ushealth$card, pch = 21 + ushealth$numreg,
     bg = cols[ushealth$numreg], main = "Linear Discriminant Analysis")
mow.nn <- lda(factor(numreg) ~ acc + card, data = ushealth)
contour(mowe.ii, mowe.ll,
        matrix(as.numeric(predict(mow.nn, newdata = mowers.gr)$class), 1000, 1000),
        nlev = 1, drawlabels = FALSE, add = TRUE, lwd = 2)

plot(ushealth$acc, ushealth$card, pch = 21 + ushealth$numreg,
     bg = cols[ushealth$numreg], main = "Quadratic Discriminant Analysis")
mow.nn <- qda(factor(numreg) ~ acc + card, data = ushealth)
contour(mowe.ii, mowe.ll,
        matrix(as.numeric(predict(mow.nn, newdata = mowers.gr)$class), 1000, 1000),
        nlev = 1, drawlabels = FALSE, add = TRUE, lwd = 2)

plot(ushealth$acc, ushealth$card, pch = 21 + ushealth$numreg,
     bg = cols[ushealth$numreg], main = "Linear Regression Model")
text(ushealth$acc, ushealth$card + 9, labels = rownames(ushealth),
     col = cols[ushealth$numreg])
# response coded as -1/+1; boundary where the fitted value crosses 0
mow.nn <- lm(I(2 * (as.numeric(numreg) - 1) - 1) ~ acc + card, data = ushealth)
contour(mowe.ii, mowe.ll,
        matrix(as.numeric(predict(mow.nn, newdata = mowers.gr) > 0), 1000, 1000),
        nlev = 1, drawlabels = FALSE, add = TRUE, lwd = 2)

We can see that the dividing lines differ, but the resulting classifications (which points end up in which group) are very similar. With every approach, 4 western states are mixed in with the other states, and one state from the other group lies inside “the cloud” of western states. Taking a closer look at the western states situated “closer” to the rest, they are the states directly on the West Coast (Oregon, Washington, California).
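To back this impression up numerically, here is a minimal sketch (refitting the four classifiers, since mow.nn was overwritten in the plotting code) comparing their apparent error rates on the training data:

fit.glm <- glm(factor(numreg) ~ acc + card, family = binomial, data = ushealth)
fit.lda <- lda(factor(numreg) ~ acc + card, data = ushealth)
fit.qda <- qda(factor(numreg) ~ acc + card, data = ushealth)
fit.lm  <- lm(I(2 * (as.numeric(numreg) - 1) - 1) ~ acc + card, data = ushealth)

c(glm = mean((predict(fit.glm, type = "response") > 0.5) + 1 != ushealth$numreg),
  lda = mean(predict(fit.lda)$class != as.character(ushealth$numreg)),
  qda = mean(predict(fit.qda)$class != as.character(ushealth$numreg)),
  lm  = mean(sign(predict(fit.lm)) != 2 * (ushealth$numreg - 1) - 1))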

We can also try neural networks. Unfortunately, the results depend heavily on the initial configuration: compare, for instance, the following four outputs (the same neural network is applied to the same data set in all four cases):

set.seed(12345)
par(mfrow = c(2, 2))
library(nnet)

# the same network fitted four times; each run starts from random weights
for (k in 1:4) {
  mow.nn <- nnet(factor(numreg) ~ acc + card, size = 50, data = ushealth)
  plot(ushealth$acc, ushealth$card, pch = 21 + ushealth$numreg,
       bg = cols[ushealth$numreg], main = "Neural Network")
  contour(mowe.ii, mowe.ll,
          matrix(as.numeric(predict(mow.nn, newdata = mowers.gr, type = "class")),
                 1000, 1000),
          nlev = 1, drawlabels = FALSE, add = TRUE)
}
## # weights:  201
## initial  value 27.945943 
## iter  10 value 16.809677
## iter  20 value 10.362816
## iter  30 value 9.541977
## iter  40 value 8.742259
## iter  50 value 8.300610
## iter  60 value 5.061847
## iter  70 value 1.938868
## iter  80 value 1.559524
## iter  90 value 1.435861
## iter 100 value 1.401719
## final  value 1.401719 
## stopped after 100 iterations
## # weights:  201
## initial  value 40.018829 
## iter  10 value 17.236669
## iter  20 value 14.645893
## iter  30 value 9.538097
## iter  40 value 8.711909
## iter  50 value 8.698753
## final  value 8.698728 
## converged
## # weights:  201
## initial  value 93.691099 
## iter  10 value 14.104722
## iter  20 value 11.761781
## iter  30 value 8.717074
## iter  40 value 6.450915
## iter  50 value 1.855225
## iter  60 value 0.005783
## final  value 0.000072 
## converged
## # weights:  201
## initial  value 30.210371 
## iter  10 value 15.737749
## iter  20 value 11.803170
## iter  30 value 9.050575
## iter  40 value 8.895819
## iter  50 value 8.842978
## iter  60 value 8.817397
## iter  70 value 8.804843
## iter  80 value 8.798630
## iter  90 value 8.795778
## iter 100 value 8.793648
## final  value 8.793648 
## stopped after 100 iterations
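
One common way to mitigate this sensitivity, sketched below, is to fit the network several times from different random starts and keep the run with the smallest final value of the fitting criterion (stored by nnet in the value component):

set.seed(12345)
fits <- lapply(1:10, function(i)
  nnet(factor(numreg) ~ acc + card, size = 50, data = ushealth, trace = FALSE))
vals <- sapply(fits, function(f) f$value)   # final fitting criterion per run
best.nn <- fits[[which.min(vals)]]          # keep the best of the ten runs
round(vals, 3)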