7.4 Discriminant functions for classification

Let $\mathbf x\in \mathbb R^n$ denote a real-valued random input vector and $\mathbf y$ a categorical random output variable that takes values in the set $\{c_1,\dots ,c_K\}$ such that

[ \sum_{k=1}^ K \mbox{Prob}\lbrace \mathbf y=c_k|x \rbrace =1 ]

A classifier can be represented in terms of a set of $K$ discriminant functions $g_k(x)$, $k=1,\dots ,K$, such that the classifier applies the following decision rule: it assigns a feature vector $x$ to the class $\hat{y}(x)=c_k$ if

\begin{equation} g_k(x) > g_j(x) \quad \mbox{for all } j \neq k \end{equation}

Section 5.3 showed that in the case of a zero-one loss function (Equation 5.3.5), the optimal classifier corresponds to a maximum a posteriori discriminant function $g_k(x)=\mbox{Prob}\lbrace \mathbf y=c_k|x \rbrace $. This means that if we are able to define the $K$ functions $g_k(\cdot )$, $k=1,\dots ,K$, and we apply the classification rule 7.4.1 to an input $x$, we obtain a classifier equivalent to the Bayes classifier.

The discriminant functions divide the feature space into $K$ decision regions $D_k$, where a decision region $D_k$ is a region of the input space ${\mathcal X}$ where the discriminant classifier returns the class $c_k$ for each $x \in D_k$. The regions are separated by decision boundaries, i.e. surfaces in the domain of $x$ where ties occur among the largest discriminant functions.
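In computational terms, the rule 7.4.1 is simply an argmax over the $K$ discriminant values. The following minimal R sketch is added for illustration; the two discriminant functions below are arbitrary placeholders.

## Sketch of the decision rule 7.4.1: assign x to the class whose
## discriminant value is the largest (placeholder discriminant functions)
g<-list(function(x) -sum((x-c(0,0))^2),   ## hypothetical g_1
        function(x) -sum((x-c(2,2))^2))   ## hypothetical g_2

classify<-function(x,g){
  scores<-sapply(g,function(gk) gk(x))  ## g_1(x),...,g_K(x)
  which.max(scores)                     ## index k of the winning class
}

classify(c(0.5,0.5),g)  ## returns 1 (closer to the first centre)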

Example

Consider a binary classification problem where $\mathbf y$ can take values in $\{c_1,c_2\}$ and $x \in \mathbb R^2$. Let $g_1(x)=3x_1+x_2+2$ and $g_2(x)=2x_1+2$ be the two discriminant functions associated with the classes $c_1$ and $c_2$, respectively. The classifier will return the class $c_1$ if

[ 3x_1+x_2+2 > 2x_1+2 \Leftrightarrow x_1 > -x_2 ]

The decision regions $D_1$ and $D_2$ are depicted in Figure 7.7.

Figure 7.7: Decision boundary and decision regions for the binary discrimination functions $g_1(x)=3x_1+x_2+2$ and $g_2(x)=2x_1+2$

$\bullet $
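The decision regions of the example above can also be visualised numerically. The following R sketch is added for illustration; the grid range and resolution are arbitrary choices. It colours a grid of points according to the winning discriminant and draws the boundary $x_1=-x_2$.

## Sketch: colour a grid of points according to the example's discriminants
g1<-function(x) 3*x[1]+x[2]+2
g2<-function(x) 2*x[1]+2
grid<-expand.grid(x1=seq(-5,5,by=0.1),x2=seq(-5,5,by=0.1))
cl<-apply(grid,1,function(x) ifelse(g1(x)>g2(x),1,2))
plot(grid$x1,grid$x2,col=c("red","green")[cl],pch=".",xlab="x1",ylab="x2")
abline(a=0,b=-1)  ## decision boundary x1 = -x2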

We can multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision. More generally, if we replace every $g_k(z)$ by $f(g_k(z))$, where $f(\cdot )$ is a monotonically increasing function, the resulting classification is unchanged.

For example, in the case of a zero-one loss function, any of the following choices gives identical classification results:

\begin{align} & g_k(x)=\mbox{Prob}\lbrace \mathbf y=c_k|x \rbrace = \frac{p(x| \mathbf y=c_k) P(\mathbf y=c_k)}{\sum_{j=1}^K p(x| \mathbf y=c_j) P(\mathbf y=c_j)} \\ & g_k(x)=p(x| \mathbf y=c_k) P(\mathbf y=c_k)\\ & g_k(x)= \ln p(x| \mathbf y=c_k) + \ln P(\mathbf y=c_k) \end{align}

and each of them returns the minimum-error-rate classification of the Bayes classifier.
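As a quick numerical check (the means, standard deviations and priors below are arbitrary assumptions, not taken from the text), the three formulations return different values but select the same class:

## Sketch: three equivalent discriminants for a univariate Gaussian example
mu<-c(0,3); s<-c(1,1); Pk<-c(0.3,0.7)   ## hypothetical class parameters and priors
x<-1.2
lik<-dnorm(x,mu,s)               ## p(x | y=c_k)
g.post<-lik*Pk/sum(lik*Pk)       ## posterior Prob{y=c_k | x}
g.joint<-lik*Pk                  ## p(x | y=c_k) P(y=c_k)
g.log<-log(lik)+log(Pk)          ## log formulation
c(which.max(g.post),which.max(g.joint),which.max(g.log))  ## same class index three times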

Discriminant functions in the Gaussian case

Let us consider a binary classification task where the inverse conditional densities are multivariate normal, i.e. $p(\mathbf x=x|\mathbf y=c_k) \sim \mathcal{N}(\mu_k,\Sigma_k)$ where $x \in \mathbb R^ n$, $\mu_k$ is a $[n,1]$ vector and $\Sigma_k$ is a $[n,n]$ covariance matrix. Since

[ p(\mathbf x=x|\mathbf y=c_k)= \frac{1}{(\sqrt {2 \pi })^ n \sqrt {\det (\Sigma_k)}} \exp \lbrace -\frac{1}{2} (x-\mu_k)^ T \Sigma_k^{-1}(x-\mu_k) \rbrace ]

from 7.4.4 we obtain

\begin{align} g_k(x) & = \ln p(x| \mathbf y=c_k) + \ln P(\mathbf y=c_k) \\ & = -\frac{1}{2}(x-\mu_k)^ T \Sigma_k^{-1} (x-\mu_k) -\frac{n}{2} \ln 2 \pi -\frac{1}{2} \ln \det ( \Sigma_k)+ \ln P(\mathbf y=c_k) \end{align}
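A direct transcription of 7.4.5 into R could look as follows (a sketch added for illustration; the function name g.quad and its arguments are hypothetical):

## Sketch: quadratic discriminant g_k(x) of 7.4.5 for a given class mean mu,
## covariance Sigma and prior probability (all assumed inputs)
g.quad<-function(x,mu,Sigma,prior){
  n<-length(x)
  d<-x-mu
  q<-drop(t(d)%*%solve(Sigma)%*%d)   ## (x-mu_k)' Sigma_k^{-1} (x-mu_k)
  -0.5*q-(n/2)*log(2*pi)-0.5*log(det(Sigma))+log(prior)
}

g.quad(c(1,1),mu=c(0,0),Sigma=diag(2),prior=0.5)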

Let us consider the simplest case, that is, the case where all the distributions have the same diagonal covariance matrix $\Sigma_k=\sigma ^2 I$, where $I$ is the $[n,n]$ identity matrix.

In geometrical terms, this means that, for each class, the $x$ samples fall in equal-size spherical clusters centred at the class mean $\mu_k$. In analytical terms, this means that the two quantities in 7.4.5

[ \det (\Sigma_k)=\sigma ^{2n}, \qquad \Sigma_k^{-1}=(1/\sigma ^2)I ]

are independent of $k$, i.e. the corresponding terms are unimportant additive constants that can be ignored by the decision rule 7.4.1. From 7.4.5 we obtain the simpler discriminant function

\begin{align*} g_k(x) & =-\frac{\| x-\mu_k\| ^2}{2\sigma ^2}+\ln P(\mathbf y=c_k) \\ & = -\frac{(x-\mu_k)^ T (x-\mu_k)}{2\sigma ^2}+\ln P(\mathbf y=c_k) \\ & = -\frac{1}{2 \sigma ^2} [x^ T x -2 \mu_k^ T x +\mu_k^ T \mu_k] + \ln P(\mathbf y=c_k) \end{align*}

Since the quadratic term $x^T x$ is the same for all $k$, it is an ignorable additive constant as well, and we obtain the equivalent linear discriminant function

[ g_k(x)=w_k^ T x + w_{k0} ]

where $w_k$ is a $[n,1]$ vector

[ w_k=\frac{1}{\sigma ^2}\mu_k ]

and the term $w_{k0}$

[ w_{k0}=-\frac{1}{2 \sigma ^2} \mu_k^ T \mu_k + \ln P(\mathbf y=c_k) ]

is called the bias or threshold.
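In R, the coefficients $w_k$ and $w_{k0}$ can be computed directly from the class mean, the common variance and the prior. The sketch below is added for illustration; the function names are hypothetical and the means are the same as those used in the script at the end of the section.

## Sketch: linear discriminant coefficients for the spherical Gaussian case
lin.coef<-function(mu,sigma2,prior){
  w<-mu/sigma2                               ## w_k
  w0<- -sum(mu^2)/(2*sigma2)+log(prior)      ## bias w_k0
  list(w=w,w0=w0)
}

g.lin<-function(x,coef) sum(coef$w*x)+coef$w0   ## g_k(x) = w_k' x + w_k0

c1<-lin.coef(c(-1,-2),sigma2=1,prior=0.5)
c2<-lin.coef(c(2,5),sigma2=1,prior=0.5)
which.max(c(g.lin(c(0,0),c1),g.lin(c(0,0),c2)))  ## class assigned to the origin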

In the two-class problem, the decision boundary (i.e. the set of points where $g_1(x)=g_2(x)$) can be obtained by solving the equation

[ w_1^ Tx+w_{10}=w_2^ Tx+w_{20} \Leftrightarrow (w_1-w_2)^ T x -(w_{20}-w_{10})=0 ]

We obtain a hyperplane with equation

\begin{equation} w^ T(x-x_0)=0 \end{equation}

where

[ w=\frac{\mu_1-\mu_2}{\sigma ^2} ]

and

[ x_0=\frac{1}{2}(\mu_1+\mu_2)- \frac{\sigma ^2}{\| \mu_1-\mu_2\| ^2} \ln \frac{\mbox{Prob}\lbrace \mathbf y=c_1 \rbrace }{\mbox{Prob}\lbrace \mathbf y=c_2 \rbrace }(\mu_1-\mu_2) ]

This can be verified by the fact that $w^ T x_0=w_{20}-w_{10}$. The equation 7.4.7 defines a hyperplane through the point $x_0$ and orthogonal to the vector $w$.
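In detail, the first claim can be checked using the definitions of $w$, $x_0$, $w_{10}$ and $w_{20}$ given above:

\begin{align*} w^T x_0 & = \frac{(\mu_1-\mu_2)^T(\mu_1+\mu_2)}{2\sigma ^2} - \frac{(\mu_1-\mu_2)^T(\mu_1-\mu_2)}{\| \mu_1-\mu_2\| ^2} \ln \frac{\mbox{Prob}\lbrace \mathbf y=c_1 \rbrace }{\mbox{Prob}\lbrace \mathbf y=c_2 \rbrace } \\ & = \frac{\mu_1^T \mu_1-\mu_2^T \mu_2}{2\sigma ^2} - \ln \frac{\mbox{Prob}\lbrace \mathbf y=c_1 \rbrace }{\mbox{Prob}\lbrace \mathbf y=c_2 \rbrace } = w_{20}-w_{10} \end{align*}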

Uniform prior case

If the prior probabilities $P(\mathbf y=c_k)$ are the same for all the $K$ classes, then the term $\ln P(\mathbf y=c_k)$ is an unimportant additive constant that can be ignored. In this case, it can be shown that the optimal decision rule is a minimum-distance classifier: in order to classify an input $x$, it measures the squared Euclidean distance $\| x-\mu_k\| ^2$ from $x$ to each of the $K$ mean vectors and assigns $x$ to the class of the nearest mean. It can be shown that in the more general case $\Sigma_k=\Sigma $, the discriminant rule amounts to minimising the Mahalanobis distance

[ \hat{c}(x)=\arg \min_{k} (x-\mu_k)^ T \Sigma ^{-1} (x-\mu_k) ]
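For instance, a minimum-distance classifier of this kind can be sketched in R using the base function mahalanobis, which returns the squared Mahalanobis distance (the function name min.dist.classify and the class means below are hypothetical):

## Sketch: minimum Mahalanobis-distance classifier for K class means
## stored as the rows of Mu, with common covariance Sigma (assumed inputs)
min.dist.classify<-function(x,Mu,Sigma){
  d2<-apply(Mu,1,function(mu) mahalanobis(x,mu,Sigma))  ## squared distances
  which.min(d2)                                         ## class of the nearest mean
}

Mu<-rbind(c(-1,-2),c(2,5))              ## hypothetical class means
min.dist.classify(c(0,0),Mu,diag(2))    ## Sigma = I: Euclidean case, returns 1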

R script

The R script discri.R considers a binary classification task where $\mathbf x\in \mathbb R^2$ and the inverse conditional distributions of the two classes are $\mathcal{N}(\mu_1,\sigma ^2 I)$ and $\mathcal{N}(\mu_2,\sigma ^2 I)$, respectively. Suppose that the two a priori probabilities are identical, that $\sigma =1$, $\mu_1=[-1,-2]^T$ and $\mu_2=[2,5]^T$. The positions of 100 points randomly drawn from $\mathcal{N}(\mu_1,\sigma ^2 I)$ and of 100 points drawn from $\mathcal{N}(\mu_2,\sigma ^2 I)$, together with the optimal decision boundary computed by 7.4.7, are plotted in Figure 7.8.

Figure 7.8: Binary classification problem: distribution of inputs and linear decision boundary

script discri.R

rm(list=ls())
norm<-function(x){
  sqrt(sum(x^2))
}

N1<-100   ## number of samples of class 1
N2<-100   ## number of samples of class 2
P<-c(N1,N2)/(N1+N2)   ## a priori probabilities
sigma2<-1

mu.1<-c(-1,-2)   ## mean of cluster 1
mu.2<-c(2,5)     ## mean of cluster 2

## Data generation
X1<-cbind(rnorm(N1,mu.1[1],sqrt(sigma2)),rnorm(N1,mu.1[2],sqrt(sigma2)))
X2<-cbind(rnorm(N2,mu.2[1],sqrt(sigma2)),rnorm(N2,mu.2[2],sqrt(sigma2)))

plot(X1[,1],X1[,2],col="red",xlim=c(-10,10),ylim=c(-10,10),xlab="x1",ylab="x2")
points(X2[,1],X2[,2],col="green")

## Decision boundary 7.4.7: w^T (x-x0) = 0
w<-(mu.1-mu.2)/sigma2
x0<-0.5*(mu.1+mu.2)-sigma2/(norm(mu.1-mu.2)^2)*(mu.1-mu.2)*log(P[1]/P[2])

m<- -w[1]/w[2]                 ## slope of the boundary
intc<-w[1]/w[2]*x0[1]+x0[2]    ## intercept of the boundary

abline(a=intc,b=m)
points(x0[1],x0[2])

$\bullet $