I have chosen a data set called “ushealth” the same as for principal components and factor analysis. This is a data set consisting of 50 measurements of 13 variables. It states for one year (1985) the reported number of deaths in the 50 states of the U.S. classified according to 7 categories (accident, cardiovascular, cancer, pulmonar, pneumonia, diabetes, liver) and some other variables describing number of doctors and hospitals, land area, population and specification of US region ad division.
In the following, let us consider the ushealth data set. We use the subset of the continuous covariates specially variables including information about number of deaths caused by different reasons only and we search for a linear classification rule (a specific partition of the sample space \(\mathbb{R}^7\)) to classify the given observations by the location. We group the data with respect to location (data from West versus data from other states, it was chosen like this with respect to previous findings). We visualize the linear discrimination analysis here:
library(MASS)
ushealth = transform(ushealth, freg = factor(as.numeric(ushealth$reg=="West"),levels = 1:0, labels = c("West","Other")))
lda1 <- lda(freg ~ acc + card + canc + pul + pneu + diab + liv, data = ushealth)
pred <- predict(lda1)
ldahist(data = pred$x[,1], g=ushealth$freg, col = "darkblue")
From the plot it looks like the “critical” value distinguishing between these two groups is somewhere -1. We can also visualize our result using partition plot:
library("klaR")
partimat(as.factor(freg) ~ acc + card + canc + pul + pneu + diab + liv, data = ushealth, method = "lda",nplots.vert=7,nplots.hor=3, image.colors=c("olivedrab1","lightblue1"))
In the figure above, the partition of the sample space is provided with respect to all possible pairs of two-dimensional (xy) scatterplots. We can see that in some cases the partition is quite well performed and the states from West are almost all separated, but there are also cases where the partition fails. The empirical error rate is between 0.02 and 0.26.
For a better graphical insight into the LDA results we can consider a simplification where we only use one continuous covariate as a predictor. Thus, the linear discriminant rule is given in terms of one scaling factor only. I have chosen variables including information about number of deaths caused by cancer.
lda2 <- lda(freg ~ canc, data = ushealth)
lda2cv <- lda(freg ~ canc, data = ushealth, CV = T)
col=c("darkblue","olivedrab3")
lda_scores <- as.numeric(lda2$scaling) * ushealth$canc
par(mfrow = c(1,2))
plot(lda_scores ~ ushealth$canc, pch = 21, bg = col[ushealth$freg], ylab ="LDA Scores", xlab = "Deaths(cancer)")
lda2cv <- cbind(lda2cv$posterior, ushealth$canc, ushealth$freg)
lda2cv <- lda2cv[order(ushealth$canc), ]
plot(lda2cv[,1] ~ lda2cv[,3], pch = 21, bg = col[lda2cv[,4]], ylab = "Posterior Probability", xlab = "Deaths(cancer)")
lines(lda2cv[,1] ~ lda2cv[,3], col = "blue")
points(lda2cv[,2] ~ lda2cv[,3], pch = 22, bg = col[lda2cv[,4]])
lines(lda2cv[,2] ~ lda2cv[,3], col = "olivedrab1")
for (i in 1:nrow(lda2cv)){
lines(rep(lda2cv[i,3], 2), c(0,1), col = "gray", lty = 3)
}
We have plotted the values of LDA scores with respect to number of death caused by cancer and then also posterior probabilities for being in the West group or others. The colors are determine by the algorithm grouping.
Now we will use other approaches: generalized linear model; linear discriminant analysis; quadratic discriminant analysis; linear regression model. We will plot number of deaths caused by accident and cardiovascular diseases and try to fit a line/curve between the colored groups to divide them.
mowers.gr = expand.grid(seq(20,90,len=1000),seq(100,600,len=1000))
names(mowers.gr) = c("acc","card")
par(mfrow=c(2,2))
cols=c("darkblue","olivedrab3")
mowe.ii = unique(mowers.gr$acc)
mowe.ll = unique(mowers.gr$card)
ushealth = transform(ushealth, numreg=as.numeric(freg))
plot(ushealth$acc,ushealth$card,pch=21 + ushealth$numreg, bg = cols[ushealth$numreg], main = "Generalized Linear Model")
mow.nn = glm(factor(numreg)~acc+card,family=binomial,data=ushealth)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="response")>0.5),1000,1000),
nlev=1,drawlabels=FALSE,add=T, lwd = 2)
plot(ushealth$acc,ushealth$card,pch=21 + ushealth$numreg, bg = cols[ushealth$numreg], main = "Linear Discriminant Analysis")
mow.nn = lda(factor(numreg)~acc+card,data=ushealth)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr)$class),1000,1000),
nlev=1,drawlabels=FALSE,add=T, lwd = 2)
plot(ushealth$acc,ushealth$card,pch=21 + ushealth$numreg, bg = cols[ushealth$numreg], main = "Quadratic Discriminant Analysis")
mow.nn = qda(factor(numreg)~acc+card,data=ushealth)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr)$class),1000,1000),
nlev=1,drawlabels=FALSE,add=T, lwd = 2)
plot(ushealth$acc,ushealth$card,pch=21 + ushealth$numreg, bg = cols[ushealth$numreg], main = "Linear Regression Model")
text(ushealth$acc,ushealth$card+9,labels=rownames(ushealth), col=cols[ushealth$numreg])
mow.nn = lm(I(2*(as.numeric(numreg)-1)-1)~acc+card,data=ushealth)
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="response")>0),1000,1000),
nlev=1,drawlabels=FALSE,add=T, lwd = 2)
We can see that dividing lines are different, but the results(when we look into which points are in which group). In use of all approaches we can see, that there are 4 states from West mixed with other states and also there is one state inside “the cloud” of West states. When we take a closer look on the states from West situated “closer” to the rest, they are the states directly on the West shore(Oregon, Washington, California).
We can also try to use neural networks. Unfortunately, the results heavily depends on an initial configuration: compare, for instance, the following four outputs (the same neural network is applied for the same dataset for all four cases):
set.seed(12345)
par(mfrow=c(2,2))
library(nnet)
for (k in 1:4) {
mow.nn = nnet(factor(numreg)~acc+card,size=50,data=ushealth)
plot(ushealth$acc,ushealth$card,pch=21 + ushealth$numreg, bg = cols[ushealth$numreg], main = "Generalized Linear Model")
contour(mowe.ii,mowe.ll,
matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="class")),1000,1000),
nlev=1,drawlabels=FALSE,add=T)
}
## # weights: 201
## initial value 27.945943
## iter 10 value 16.809677
## iter 20 value 10.362816
## iter 30 value 9.541977
## iter 40 value 8.742259
## iter 50 value 8.300610
## iter 60 value 5.061847
## iter 70 value 1.938868
## iter 80 value 1.559524
## iter 90 value 1.435861
## iter 100 value 1.401719
## final value 1.401719
## stopped after 100 iterations
## # weights: 201
## initial value 40.018829
## iter 10 value 17.236669
## iter 20 value 14.645893
## iter 30 value 9.538097
## iter 40 value 8.711909
## iter 50 value 8.698753
## final value 8.698728
## converged
## # weights: 201
## initial value 93.691099
## iter 10 value 14.104722
## iter 20 value 11.761781
## iter 30 value 8.717074
## iter 40 value 6.450915
## iter 50 value 1.855225
## iter 60 value 0.005783
## final value 0.000072
## converged
## # weights: 201
## initial value 30.210371
## iter 10 value 15.737749
## iter 20 value 11.803170
## iter 30 value 9.050575
## iter 40 value 8.895819
## iter 50 value 8.842978
## iter 60 value 8.817397
## iter 70 value 8.804843
## iter 80 value 8.798630
## iter 90 value 8.795778
## iter 100 value 8.793648
## final value 8.793648
## stopped after 100 iterations