Chapter 7 Classification
7.1 Indirect Classification - The Bayes Classifier
In classification, the goal is to assign observations to a group. Similar to regression, where we estimate \(m(x) = E[Y \mid X = x]\), we want to assign class probabilities to the observations, \[\pi_j (x) = \mathbb{P}[Y = j \mid X = x] \;\;\; (j = 0, 1, \dots, J-1).\] Def: A classifier maps a multidimensional input vector to a class label, or mathematically: \(C: \mathbb{R}^p \rightarrow \{0, \dots, J-1\}\). The quality of a classifier is measured via the zero-one test error \[\mathbb{P}[C(X_{new}) \neq Y_{new}].\]
The optimal classifier with respect to the zero-one error is the Bayes classifier. It assigns an observation to the group with the highest conditional class probability: \[ C_{Bayes}(x) = \arg\max_{0 \le j \le J-1}\pi_j(x).\] Hence, the Bayes classifier is a point-wise classifier. For the Bayes classifier, the zero-one test error is known as the Bayes risk, \[ \mathbb{P}[C_{Bayes}(X_{new}) \neq Y_{new}].\]
In practice, \(\pi_j(\cdot)\) is unknown (just as the test MSE in regression is unknown) and hence the Bayes classifier and the Bayes risk are unknown too. However, we can estimate \(\pi_j(\cdot)\) from the data and plug it into the Bayes classifier: \[\hat{C}(x) = \arg\max_{0 \le j \le J-1}\hat{\pi}_j(x).\] This is an indirect estimator, since we first estimate the class probabilities \(\pi_j(\cdot)\) for each observation \(x\) and then assign to it the class for which the estimated probability is highest. Question: how is that more indirect than discriminant analysis? Don't we use the Bayes classifier in the end?
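As a minimal sketch of the plug-in rule in R: assuming prob_hat is a hypothetical \(n \times J\) matrix whose column \(j+1\) holds the estimates \(\hat{\pi}_j(x_i)\) (obtained from any model), the plug-in classifier is just a row-wise arg max.
# max.col() returns, for each row, the index (1, ..., J) of the largest entry;
# subtracting 1 turns this into class labels 0, ..., J-1
predicted_class <- max.col(prob_hat) - 1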
7.2 Direct Classification - The Discriminant View
7.2.1 LDA
One example of direct classification is discriminant analysis. Using Bayes' theorem, \[ \mathbb{P}[Y = j \mid X = x] = \frac{\mathbb{P}[X = x \mid Y = j]}{\mathbb{P}[X = x]}\cdot\mathbb{P}[Y = j], \] and assuming
\[ (X \mid Y = j) \sim \mathcal{N}_p(\mu_j, \Sigma), \qquad p_j = \mathbb{P}[Y = j], \qquad \sum\limits_{k = 0}^{J-1}p_k = 1,\]
we can write \[ \mathbb{P}[Y = j \mid X = x] = \frac{f_{X \mid Y = j}(x) \, p_j}{\sum\limits_{k = 0}^{J-1} f_{X \mid Y = k}(x) \, p_k}. \] Note that there is no distributional assumption on \(Y\) so far. You can estimate the group means by \[\hat{\mu}_j = \frac{\sum_{i = 1}^n x_i \, 1_{[Y_i = j]}}{\sum_{i = 1}^n 1_{[Y_i = j]}}\] and the pooled covariance matrix by \[\hat{\Sigma} = \frac{1}{n-J}\sum\limits_{j = 0}^{J-1}\sum\limits_{i = 1}^n(x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)'\,1_{[Y_i = j]}. \] Note that the group means of \(X\) differ, but the covariance structure is assumed to be the same for all groups. We also need to estimate \(p_j\); a straightforward estimate is \[ \hat{p}_j = n^{-1}\sum\limits_{i = 1}^n{1_{[Y_i = j]}} = \frac{n_j}{n}. \]
From here, you can compute the classifier (as done in the exercise) by plugging the estimates into the posterior and maximizing its logarithm over \(j\), which yields the linear discriminant functions \[\delta_j(x) = x'\hat{\Sigma}^{-1}\hat{\mu}_j - \tfrac{1}{2}\hat{\mu}_j'\hat{\Sigma}^{-1}\hat{\mu}_j + \log \hat{p}_j.\] Then, you can derive the decision boundary between groups \(j\) and \(k\) from \(\delta_j - \delta_k = 0\). In a two-dimensional predictor space with two classes, the decision boundary is a line: every combination of the two predictors on one side of the line results in a prediction of class one, everything on the other side in a prediction of class two. Note that the discriminant functions \(\delta_j\) (and hence the decision boundary) are linear in \(x\).
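As a small illustration (not part of the script), LDA is implemented in the MASS package; the sketch below assumes a data frame my_data with a factor column response and numeric predictor columns.
library(MASS)
# lda() estimates the group means, the pooled covariance matrix and the
# class priors p_j = n_j / n from the data
fit_lda <- lda(response ~ ., data = my_data)
fit_lda$means   # estimated group means
fit_lda$prior   # estimated priors p_j
# predict()$class classifies each observation to the group with the largest
# discriminant function, i.e. it applies the plug-in Bayes classifier
pred_lda <- predict(fit_lda, newdata = my_data)$class
mean(pred_lda != my_data$response)   # in-sample zero-one error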
7.2.2 QDA
Quadratic discriminant analysis loosens the assumption of a shared covariance matrix: each group gets its own covariance matrix. This leads to quadratic discriminant functions \(\delta_j\) and hence to non-linear decision boundaries. QDA is more flexible, but for large \(p\) over-fitting can occur, since \(J \cdot p(p+1)/2\) parameters have to be estimated for the covariance matrices alone (instead of \(p(p+1)/2\) for LDA).
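Under the same assumptions on my_data as in the LDA sketch above, QDA is available in MASS as well:
# qda() estimates one covariance matrix per group instead of a pooled one
fit_qda <- qda(response ~ ., data = my_data)
pred_qda <- predict(fit_qda, newdata = my_data)$class
mean(pred_qda != my_data$response)   # in-sample zero-one error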
7.3 Indirect Classification - The View of Logistic Regression
There are also ways to come up with an indirect assignment of the class label, namely via the plug-in Bayes classifier \(\hat{C}(x) = \arg\max_{0 \le j \le J-1}\hat{\pi}_j(x)\). For the binary case (\(J = 2\)), the logistic model for \(\pi(x) = \mathbb{P}[Y = 1 \mid X = x]\) is \(\log\big(\frac{\pi(x)}{1-\pi(x)}\big) = g(x)\), or equivalently \(\pi(x) = \frac{e^{g(x)}}{1+ e^{g(x)}}\). The logistic transformation maps the real values that \(g(\cdot)\) can take to the interval \((0, 1)\), which gives a natural interpretation of the response as a probability. Note that we want the transformation to be monotone so that the mapping is invertible. The responses \(Y_1, \dots, Y_n\) are Bernoulli distributed, i.e. \(Y_i \sim \textrm{Bernoulli}(\pi(x_i))\). The logistic model belongs to the class of generalized linear models. These models have three characteristics:
- A link function (in our case the logit function, \(\log\big(\frac{\pi(x)}{1-\pi(x)}\big) = g(x)\)).
- A response distribution (in our case the Bernoulli distribution).
- The concrete form of \(g(\cdot)\) (in logistic regression most often just a linear model, \(g(x) = \sum\limits_{k = 1}^p \beta_k x_k\)).
There is no closed-form solution for the maximum likelihood problem above, hence we need to rely on iterative numerical optimization (e.g. gradient descent; R's glm() uses iteratively re-weighted least squares) to find the maximum likelihood solution. You can fit a logistic regression model in R as follows:
# fit the model (response assumed to be coded 0/1 or as a two-level factor)
fit <- glm(response ~ predictors, data = my_data, family = "binomial")
# predicted probabilities, thresholded at 0.5 to obtain class labels
prediction <- as.numeric(predict(fit, newdata = my_data, type = "response") > 0.5)
# evaluate in-sample accuracy (assuming a 0/1 coded response)
mean(prediction == my_data$response)
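As a small check of the link-function idea (reusing the fit object from above, so still only a sketch): type = "link" returns the linear predictor \(g(x)\) on the log-odds scale, and applying the inverse logit plogis() to it recovers the predicted probabilities.
# linear predictor g(x) on the log-odds scale
eta <- predict(fit, newdata = my_data, type = "link")
# plogis() is the inverse logit exp(eta) / (1 + exp(eta)); this should return TRUE
all.equal(plogis(eta), predict(fit, newdata = my_data, type = "response"))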
7.4 Discriminant Analysis or Logistic Regression?
- Logistic regression assumes the log-odds to be linear in the predictors, i.e. \(\log\Big(\frac{\pi}{1-\pi}\Big) = \sum\limits_{i = 1}^p \beta_i x_i\).
- Discriminant analysis assumes \(X \mid Y = j \sim \mathcal{N}_p(\mu_j, \Sigma)\), which also leads to log-odds that are linear in \(x\). Hence, the two methods are quite similar.
- It is quite natural to include factor (categorical) predictors in logistic regression, while for discriminant analysis this is not natural (the normality assumption on \(X\) does not fit categorical predictors).
- Empirically, the two methods yield similar results even under a violation of the normality assumption.
TODO multinomial likelihood (see footnote.)
7.5 Multiclass case (J > 2)
With logistic regression, you cannot model multiclass problems directly, but you can handle them indirectly using one of the following methods:
- Encode the J-class problem as J binary problems (one class against the rest), that is:
For each observation you want to label, you obtain \(J\) probability estimates, i.e. \(\hat{\pi}_j(x) = \frac{\exp\big(\sum_k \hat{\beta}^{(j)}_k x_k\big)}{1 + \exp\big(\sum_k \hat{\beta}^{(j)}_k x_k\big)}\) for \(j = 0, \dots, J-1\). They do not necessarily sum up to one, but you can normalize them to obtain proper probabilities. Then, use the Bayes classifier to choose the class label (\(\arg\max_{0 \le j \le J-1} \hat{\pi}_j(x)\)); see the sketch after this list.
- Similarly, you can also model every class against a reference class, i.e. \(\log(\frac{\pi_j}{\pi_0})\) for \(j = 1, \dots, J-1\). This can be helpful when we want to compare different group memberships with a reference group. For example, in clinical trials we may want to know how many times more likely it is that a patient belongs to the group 'ill with disease A' than to the healthy reference group.
- You can also look at pair-wise comparisons: choose the two groups you want to compare, drop all observations that do not belong to one of the two groups, and estimate the model. There are \(\binom{J}{2}\) possible models, each potentially based on a different number of observations.
- In the special case of ordered groups, an appropriate model is often the proportional odds model \[\text{logit}\big(\mathbb{P}[Y < k \mid X = x]\big) = \alpha_k + g(x), \qquad k = 1, \dots, J-1,\] with \(\alpha_1 < \alpha_2 < \dots < \alpha_{J-1}\). The cumulative odds for two different covariate values are proportional (their ratio does not depend on \(k\)), which becomes apparent if we exponentiate both sides of the equation. Check this webpage for more information. Note that proportional odds do not imply proportional probabilities, since the two are only linked through a non-linear mapping (the logistic transformation).
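As a rough sketch of the one-against-the-rest approach (again only under the hypothetical my_data used above, with the response coded 0, ..., J-1):
J <- length(unique(my_data$response))
# one binary logistic regression per class: class j against the rest
prob_hat <- sapply(0:(J - 1), function(j) {
  dat_j <- my_data
  dat_j$is_j <- as.numeric(dat_j$response == j)  # binary indicator for class j
  dat_j$response <- NULL                         # drop the original response
  fit_j <- glm(is_j ~ ., data = dat_j, family = "binomial")
  predict(fit_j, newdata = dat_j, type = "response")
})
# normalize so the estimated probabilities sum to one per observation,
# then apply the plug-in Bayes classifier (row-wise arg max)
prob_hat <- prob_hat / rowSums(prob_hat)
predicted_class <- max.col(prob_hat) - 1
Ready-made alternatives exist as well, e.g. nnet::multinom() for multinomial logistic regression and MASS::polr() for the proportional odds model.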