Machine Learning Fundamentals: Linear Classification (1/2)



Introduction


In my previous article, I covered supervised learning. Now, we’ll dive deeper into one of the topics covered in that article - namely, linear classification. This article is the first part of two - in this part, I’ll lay out a baseline understanding of the classification problem in general, as well as some useful techniques.

As before, my goal with this series is to serve as an educational tool for those looking to review the fundamentals of machine learning with an emphasis on ease of understanding. This series assumes some familiarity with machine learning in general, and as such I will try to focus on the underlying mathematics and intuition behind the concepts that we discuss.


Linear Classification


As you may recall from the last article, classification refers to a set of problems where the input is from a $d$-dimensional set $\mathcal{X}$, and the output is from a finite set of categorical values, which we will refer to as $\mathcal{Y}$. A classifier will refer to a function $f(x)$ that outputs a label $y \in \mathcal{Y}$ for an input sample $x$. We will train this classifier on input data $X$ and output labels $Y$.

For example, for the dataset below, we’d like a function that can take in a sample point $x$ (here, $d = 2$) and output a label $y$ for either Class A (red) or Class B (blue). Usually, we’ll assign numbers to classes (i.e. Class A becomes 0, and Class B becomes 1) so we can write $y \in \{0, 1\}$.

Example Data

There are a few types of classifiers we could naively construct off the top of our heads. For example, the constant classifier outputs $f(x) = y$ for some fixed $y \in \mathcal{Y}$ for all inputs $x \in \mathcal{X}$ (i.e. in the example above, classify everything as red).

Another simple classifier would be the majority class classifier, i.e. $f(x) = \argmax_{y \in \mathcal{Y}} P[Y = y]$. There are definitely some datasets this could work really well for (see below), but in general we can do better.

Majority Classifier Data
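Both of these baselines are trivial to implement. Here’s a minimal sketch in numpy (the label array is made up purely for illustration):

import numpy as np

# hypothetical labels for illustration: 0 = Class A (red), 1 = Class B (blue)
y_train = np.array([0, 0, 0, 1, 0, 0, 1, 0])

def constant_classifier(x, label=0):
    # always predicts the same fixed label, no matter the input
    return label

def majority_class_classifier(x):
    # predicts the most frequent label seen in the training data
    values, counts = np.unique(y_train, return_counts=True)
    return values[np.argmax(counts)]

print(constant_classifier([1.5, 2.0]), majority_class_classifier([1.5, 2.0]))  # 0 0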

Bayes Classifier

How much better? To answer that quantitatively, we’ll take a look at a classifier known as the Bayes classifier, which is defined rather simply:
$$f(x) = \argmax_{y \in \mathcal{Y}} P[Y = y | X = x]$$

As it turns out, this classifier is actually optimal! More formally, for an arbitrary classifier $g: \mathcal{X} \to \mathcal{Y}$, we can write
$$P_{x, y}[g(x) = y] \leq P_{x, y}[f(x) = y]$$
where $P_{x, y}[f(x) = y]$ is known as the accuracy of a classifier $f$. Intuitively, you can just think of accuracy as the percentage of samples that a classifier labels correctly.

It’s understandable to be skeptical of this result - after all, why do we need all these fancy neural networks if the optimal classifier is this simple? Let’s first go over the proof of this result and then we can take a look at why this classifier isn’t practical.

Here, we’ll cover the proof of this result for binary $\mathcal{Y}$ (i.e. the only classes are $0$ and $1$) for simplicity, although the proof does extend generally to the case of any number of categories.

Let’s consider an individual sample $x \in \mathcal{X}$. Then, for any classifier $h$,
$$\begin{align*} P[h(x) = y | X = x] &= P[h(x) = 1, Y = 1 | X = x] + P[h(x) = 0, Y = 0 | X = x]\\ &= \mathbf{1}[h(x) = 1]P[Y = 1 | X = x] + \mathbf{1}[h(x) = 0]P[Y = 0 | X = x] \end{align*}$$
Let’s let $\eta(x) = P[Y = 1 | X = x]$. Then, we can write
$$P[h(x) = y | X = x] = \mathbf{1}[h(x) = 1]\eta(x) + \mathbf{1}[h(x) = 0](1 - \eta(x))$$
Now, let’s use our original $f$ and $g$ again. We can write
$$\begin{align*} &P[f(x) = y | X = x] - P[g(x) = y | X = x]\\ &= \left(\mathbf{1}[f(x) = 1]\eta(x) + \mathbf{1}[f(x) = 0](1 - \eta(x))\right) - \left(\mathbf{1}[g(x) = 1]\eta(x) + \mathbf{1}[g(x) = 0](1 - \eta(x))\right) \\ &= \eta(x)\left(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]\right) + (1 - \eta(x))\left(\mathbf{1}[f(x) = 0] - \mathbf{1}[g(x) = 0]\right)\\ &= \eta(x)(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) + (\mathbf{1}[f(x) = 0] - \mathbf{1}[g(x) = 0]) \\ &\quad - \eta(x)(\mathbf{1}[f(x) = 0] - \mathbf{1}[g(x) = 0]) \\ &= \eta(x)[(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) - (\mathbf{1}[f(x) = 0] - \mathbf{1}[g(x) = 0])] \\ &\quad + (\mathbf{1}[f(x) = 0] - \mathbf{1}[g(x) = 0]) \end{align*}$$
Here, we can use the fact that $\mathbf{1}[f(x) = 1] = 1 - \mathbf{1}[f(x) = 0]$ (and similarly for $g$) to simplify further.
$$\begin{align*} &\eta(x)\left[(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) - ((1 - \mathbf{1}[f(x) = 1]) - (1 - \mathbf{1}[g(x) = 1]))\right] \\ &\quad + ((1 - \mathbf{1}[f(x) = 1]) - (1 - \mathbf{1}[g(x) = 1])) \\ &= \eta(x)[(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) - (-\mathbf{1}[f(x) = 1] + \mathbf{1}[g(x) = 1])] \\ &\quad + (-\mathbf{1}[f(x) = 1] + \mathbf{1}[g(x) = 1]) \\ &= 2\eta(x)(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) - (\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) \\ &= (2\eta(x) - 1)(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) \end{align*}$$
Recall that $\eta(x) = P[Y = 1 | X = x]$. We can now consider three cases.

First, if $\eta(x) = \frac{1}{2}$, it doesn’t matter what either classifier picks, since the term $2\eta(x) - 1 = 0$, so the entire expression will be $0$.

Second, when $\eta(x) > \frac{1}{2}$, the Bayes classifier will pick $1$ since $\eta(x) = P[Y = 1 | X = x] > P[Y = 0 | X = x]$, which means $\mathbf{1}[f(x) = 1] = 1$. If the arbitrary classifier selects $1$ for $x$, the overall expression will be $0$, and otherwise it will be positive. Mathematically, $(2\eta(x) - 1)(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) \geq 0$ when $\eta(x) > \frac{1}{2}$.

Finally, when $\eta(x) < \frac{1}{2}$, the Bayes classifier will pick $0$ since $\eta(x) = P[Y = 1 | X = x] < P[Y = 0 | X = x]$, which means $\mathbf{1}[f(x) = 1] = 0$. If the arbitrary classifier selects $0$ for $x$, the overall expression will be $0$, and otherwise it will still be positive, since in that case both factors are negative: $2\eta(x) - 1 < 0$ and $\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1] = 0 - 1 < 0$. Mathematically, $(2\eta(x) - 1)(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1]) \geq 0$ when $\eta(x) < \frac{1}{2}$ as well.

Therefore, this expression is always non-negative! What does this mean? Recall where we started from, i.e. $P[f(x) = y | X = x] - P[g(x) = y | X = x]$. We’ve shown that this is equal to $(2\eta(x) - 1)(\mathbf{1}[f(x) = 1] - \mathbf{1}[g(x) = 1])$ and furthermore that this is always non-negative, meaning we can write
$$\begin{align*} P[f(x) = y | X = x] &- P[g(x) = y | X = x] \geq 0\\ P[f(x) = y | X = x] &\geq P[g(x) = y | X = x] \end{align*}$$
or in other words, the accuracy of the Bayes classifier is at least as high as that of any arbitrary classifier at a fixed point $x$. Our logic above didn’t depend on the particular choice of $x$, so the same inequality holds at every point, and averaging over all $x$ gives the global result: the Bayes classifier is optimal. A quick note here: optimal does not necessarily mean perfect, as shown below.

Linear Classifier (left) vs. Bayes Classifier (right) from The Elements of Statistical Learning, Section 2
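If you’d like to convince yourself of this result empirically, here’s a quick simulation sketch on an assumed toy distribution where $\eta(x)$ is known exactly, comparing the Bayes classifier against an arbitrary (suboptimal) threshold rule:

import numpy as np

rng = np.random.default_rng(0)

# assumed toy distribution: x ~ Uniform(0, 1) and eta(x) = P[Y = 1 | X = x] = x
n = 200_000
x = rng.uniform(0, 1, size=n)
y = (rng.uniform(0, 1, size=n) < x).astype(int)  # label is 1 with probability eta(x)

bayes_pred = (x > 0.5).astype(int)  # Bayes classifier: predict 1 whenever eta(x) > 1/2
other_pred = (x > 0.8).astype(int)  # some arbitrary alternative threshold rule

print("Bayes accuracy:", (bayes_pred == y).mean())  # ~0.75
print("Other accuracy:", (other_pred == y).mean())  # ~0.66, never above the Bayes accuracy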

We’ve shown that the Bayes classifier cannot be beaten (as visualized above). Why, then, do we need any of the techniques we’re about to discuss in this article? The answer lies in the Bayes classifier’s need for the conditional distribution $P[Y | X]$.

Since the true data distribution is unknown, this conditional distribution is also unknown, meaning we need to estimate it from observed data samples. However, there can be infinitely many possible choices of $x \in \mathcal{X}$, meaning that the amount of data needed to estimate it is not practical, especially as the dimension of $\mathcal{X}$ increases. So although the Bayes classifier is optimal, it is generally infeasible to model directly in practice, leading us to find smarter routes.


Naive Bayes Classifier

Let’s take a look at the definition of the Bayes classifier again.
$$f(x) = \argmax_{y \in \mathcal{Y}} P[Y = y | X = x]$$
One way we can simplify this is by using Bayes’ rule i.e.
$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$$
Applying it here, we’ll get
$$\begin{align*} f(x) &= \argmax_{y \in \mathcal{Y}} P[Y = y | X = x]\\ &= \argmax_{y \in \mathcal{Y}} \frac{P[X = x | Y = y] P[Y = y]}{P[X = x]} \end{align*}$$
In our equation, since we are taking the $\argmax$ over all possible values of $y$, and $P[X = x]$ is independent of $y$, we can actually remove that term entirely and write
$$f(x) = \argmax_{y \in \mathcal{Y}} P[X = x | Y = y] P[Y = y]$$
Intuitively, this equation tells us that if we can estimate the probability that the input sample $x$ came from the class-conditional distribution $P[X = x | Y = y]$, we can then weight this probability by the class prior (i.e. $P[Y = y]$), and choose the most likely class accordingly.

This is much more tractable than the original specification of the Bayes classifier, since we don’t need to estimate the full posterior distribution $P[Y = y | X = x]$, and can instead estimate $|\mathcal{Y}|$ class-specific distributions $P[X = x | Y = y]$ and prior probabilities $P[Y = y]$.
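As a sketch of how this decomposition looks in code, suppose we already had one density estimate per class and the class priors (the densities and priors below are made up for illustration):

import numpy as np
from scipy.stats import norm

# hypothetical 1-D class-conditional densities P[X = x | Y = y] and priors P[Y = y]
class_densities = {
    0: norm(loc=-1.0, scale=1.0).pdf,
    1: norm(loc=1.5, scale=0.5).pdf,
}
class_priors = {0: 0.7, 1: 0.3}

def predict(x):
    # f(x) = argmax_y P[X = x | Y = y] * P[Y = y]
    scores = {y: class_densities[y](x) * class_priors[y] for y in class_priors}
    return max(scores, key=scores.get)

print(predict(0.0), predict(2.0))  # 0 1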

With this formulation, we have a lot of flexibility in terms of how we choose to model $P[X = x | Y = y]$. If we assume that the data follows some particular functional form, we can do parametric density estimation. Essentially, we choose a type of model or distribution (i.e. linear, Gaussian, Bernoulli, etc.), and choose the parameters that are most likely to fit the given data. There are also non-parametric ways of estimating the distributions, which we may cover at another time.

One very simple way of estimating these distributions is to add another assumption to our problem. If we assume that individual features are independent given the class label, we can split the $d$-dimensional $x$ into separate probabilities for each of the $d$ features. This classifier is known as the Naive Bayes Classifier.

$$f(x) = \argmax_{y \in \mathcal{Y}} P[Y = y] \prod_{j=1}^d P[X_j = x_j | Y = y]$$
This is a very simple model to implement in practice. However, because the assumption of independence between features generally does not hold, this classifier tends to produce poor probability estimates.
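If you want to experiment with this model, scikit-learn provides a Gaussian Naive Bayes implementation (one univariate Gaussian per feature per class); a minimal sketch with made-up data might look like:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# made-up 2-D training data, purely for illustration
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.2, 3.9]])
y_train = np.array([0, 0, 1, 1])

model = GaussianNB()  # fits each feature independently given the class
model.fit(X_train, y_train)

print(model.predict([[1.1, 2.1]]))        # -> [0]
print(model.predict_proba([[1.1, 2.1]]))  # per-class probability estimates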

A very common choice in practice is to fit a Gaussian distribution for each class, parameterized by class-specific means and variances. This is a common choice of class-conditional distribution due to the nice properties the Gaussian distribution has, especially as we increase the dimensionality of our problem. Let’s visualize what this classifier might look like for our toy example from earlier.

Gaussian MLE Class

This classifier also gives us a measure of confidence in the prediction for each sample we classify, since we obtain a probability for each class, which is very useful for certain applications.


Maximum Likelihood Estimation

A reasonable question to ask at this point would be how we actually fit models like the one mentioned above to the data we have been given (as well as how any of this relates to linear classification). Rest assured, we’re slowly getting there - in this section we’ll tie all these concepts together through our discussion of maximum likelihood estimation.

Maximum likelihood estimation (MLE) is a technique used to fit the parameters of a model to some set of data. First, let’s assume we have some data $X \sim \mathcal{X}$, and that each sample $x \in X$ is independent and identically distributed (i.i.d.) (this last assumption is important for later!). Then, let’s say we want to fit some class of parameterized models $\mathcal{P} = \{p_\theta | \theta \in \Theta\}$, where each model $p_\theta$ can be uniquely described by a set of parameters $\theta$ from the set of all possible parameter choices $\Theta$.

We define the likelihood function as follows
$$\mathcal{L}(\theta | X) = P(X | \theta) = \prod_{i=1}^n P(x_i | \theta) = \prod_{i=1}^n p_\theta(x_i)$$
where the second-to-last step is possible because of our i.i.d. assumption from earlier. Intuitively, this function represents how probable (or likely) the data is given the model $p_\theta$. Maximum likelihood estimation then simply asks which parameter setting maximizes the likelihood function, or mathematically
$$\argmax_{\theta} \mathcal{L}(\theta | X) = \argmax_{\theta} \prod_{i=1}^n p_\theta(x_i)$$

With this new formulation, let’s take a look at a toy example where we model our class conditional distributions with univariate Gaussians. Recall the probability density function for the Gaussian distribution
$$p_{\{\mu, \sigma^2\}}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Then, our MLE problem becomes
$$\argmax_{\theta} \mathcal{L}(\theta | x) = \argmax_{\mu, \sigma^2} \prod_{i=1}^n p_{\{\mu, \sigma^2\}}(x_i)$$
There are two tricks we can use here to make this problem a little easier. First, since the logarithm function is strictly increasing over positive values, we know that
$$\argmax_{\theta} \mathcal{L}(\theta | x) = \argmax_{\theta} \log \mathcal{L}(\theta | x)$$
Therefore, instead of our earlier equation, we have
$$\begin{align*} \argmax_{\theta} \mathcal{L}(\theta | x) &= \argmax_{\theta} \log \mathcal{L}(\theta | x) \\ &= \argmax_{\mu, \sigma^2} \log \prod_{i=1}^n p_{\{\mu, \sigma^2\}}(x_i)\\ &= \argmax_{\mu, \sigma^2} \sum_{i=1}^n \log p_{\{\mu, \sigma^2\}}(x_i) \end{align*}$$
which is a lot nicer to analyze than the nasty product we had before.
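The sum of logs is also much better behaved numerically: the raw product of many densities quickly underflows to zero in floating point, while the log-likelihood stays a perfectly ordinary number. A quick sketch (with a simulated sample) to see this:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=2000)  # simulated sample, for illustration

densities = norm(loc=2.0, scale=1.0).pdf(x)
print(np.prod(densities))         # 0.0 -- the product underflows
print(np.sum(np.log(densities)))  # the log-likelihood, a usable (negative) number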

The second trick is one you’re probably familiar with from high school calculus - namely, that finding the maximum or minimum of a function boils down to analyzing its stationary points, which are points where the derivative of the function is zero.

Let’s take a look at the derivative of $\sum_{i=1}^n \log p_{\{\mu, \sigma^2\}}(x_i)$ with respect to $\mu$ and $\sigma^2$ respectively. First, let’s simplify the inner expression. We can write
$$\log p_{\{\mu, \sigma^2\}}(x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}$$

Then, to find our MLE estimate of $\mu$, we have
$$\begin{align*} 0 &= \nabla_\mu \left(\sum_{i=1}^n \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right]\right)\\ &= \sum_{i=1}^n \nabla_\mu \left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right)\\ &= \sum_{i=1}^n \left[\nabla_\mu \left(-\frac{1}{2}\log(2\pi\sigma^2)\right) - \nabla_\mu\left(\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right] \end{align*}$$
We can discard the $\nabla_\mu\left(-\frac{1}{2}\log(2\pi\sigma^2)\right)$ term since there is no $\mu$ involved. Continuing on, we have
$$\begin{align*} 0 &= \sum_{i=1}^n \nabla_\mu\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^n \nabla_\mu\left[(x_i - \mu)^2\right]\\ &= -\frac{1}{2\sigma^2} \sum_{i=1}^n -2(x_i - \mu)\\ &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \end{align*}$$
We can then discard the constant factor $\frac{1}{\sigma^2}$ since we’re setting the expression to $0$ anyways, and solve
$$\begin{align*} \sum_{i=1}^n (x_i - \mu) &= 0\\ -n\mu + \sum_{i=1}^n x_i &= 0\\ \sum_{i=1}^n x_i &= n\mu\\ \mu_{MLE} &= \frac{1}{n}\sum_{i=1}^n x_i \end{align*}$$
Intuitively, the best estimate of the mean parameter for the class conditional distribution is the mean of all the samples in that class in our training data - pretty simple if you think about it, but it’s nice to show it mathematically.

Now for $\sigma^2$ (phew, here we go)
$$\begin{align*} 0 &= \nabla_{\sigma^2} \left(\sum_{i=1}^n \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right]\right)\\ &= \sum_{i=1}^n \nabla_{\sigma^2} \left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}\right)\\ &= \sum_{i=1}^n \left[\nabla_{\sigma^2} \left(-\frac{1}{2}\log(2\pi\sigma^2)\right) - \nabla_{\sigma^2} \left(\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]\\ &= \sum_{i=1}^n \left[-\frac{1}{2\sigma^2} + \frac{(x_i - \mu)^2}{2\sigma^4}\right]\\ &= \sum_{i=1}^n \left[\frac{(x_i - \mu)^2 - \sigma^2}{2\sigma^4}\right] \end{align*}$$
We can ignore the denominator since we want the entire expression to be $0$. We have
$$\begin{align*} 0 &= \sum_{i=1}^n \left[(x_i - \mu)^2 - \sigma^2\right]\\ n\sigma^2 &= \sum_{i=1}^n (x_i - \mu)^2\\ \sigma^2_{MLE} &= \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2 \end{align*}$$
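As a sanity check, these closed forms are exactly the sample mean and the biased (divide-by-$n$) sample variance, which we can verify numerically on simulated data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)  # simulated samples, for illustration

mu_mle = np.mean(x)                      # (1/n) * sum of x_i
sigma2_mle = np.mean((x - mu_mle) ** 2)  # (1/n) * sum of (x_i - mu)^2

print(mu_mle, sigma2_mle)  # close to the true values 3.0 and 4.0
print(np.var(x, ddof=0))   # matches sigma2_mle (ddof=0 gives the MLE, not the unbiased estimate)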

We have the MLE estimates for each class-conditional distribution now - we can just assume a Bernoulli distribution for the class priors (i.e. $P[Y = y]$ is just the fraction of samples with that label) and we’re done!

A small confession - the analysis I did above was for a univariate Gaussian, but in our case our data would require a multivariate Gaussian. Rest assured, the derivation ends up being essentially the same barring some complications associated with vector calculus - for the multivariate Gaussian defined as
$$p_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
we’d have
$$\mu_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i \qquad \Sigma_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_{MLE})(x_i - \mu_{MLE})^T$$
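In code, these multivariate estimates again reduce to a sample mean and a biased sample covariance. Here’s a small sketch on simulated data (the true parameters below are made up) checking the formula against np.cov:

import numpy as np

rng = np.random.default_rng(0)
mean_true = np.array([1.0, -2.0])
cov_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean_true, cov_true, size=5000)  # simulated 2-D samples

mu_mle = X.mean(axis=0)
X_c = X - mu_mle
sigma_mle = (X_c.T @ X_c) / len(X)  # (1/n) * sum of (x_i - mu)(x_i - mu)^T

print(mu_mle)
print(sigma_mle)
print(np.cov(X, rowvar=False, bias=True))  # bias=True divides by n, matching sigma_mle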


Implementation

Our work has all been very theoretical up until now. For the remainder of the article, we’ll take a look at some toy examples based off this Kaggle dataset. I’ll include some small code snippets to show how you might implement these algorithms in practice. Let’s first visualize our dataset.

Kaggle Data

We can now write a function based off our derivations above to compute the class means and covariance matrices.

        
import pandas as pd
import numpy as np

# df is assumed to be the Kaggle dataset loaded into a pandas DataFrame
# with columns 'feature1', 'feature2', and 'class'
CLASSES = np.sort(df['class'].unique())

def fit_gaussian_mle(df):
    mle_params = {}

    for c in CLASSES:
        # get class data
        class_data = df[df['class'] == c]
        X = class_data[['feature1', 'feature2']].values

        # class prior P[Y = y]; used for classification, but not needed for the mean/covariance estimates
        p_y = len(class_data) / len(df)

        # mle parameter estimate for class mean
        mu_mle = np.mean(X, axis=0)

        # mle parameter estimate for class covariance
        sigma_mle = np.zeros((X.shape[1], X.shape[1]))

        n = X.shape[0]
        X_centered = X - mu_mle  # (x_i - mu)

        for i in range(n):
            x_c_i = X_centered[i]
            x_c_i = x_c_i.reshape(-1, 1)  # make it a column vector
            sigma_mle += x_c_i @ x_c_i.T  # (x_i - mu)(x_i - mu)^T
        sigma_mle = sigma_mle / n  # multiply by 1/n

        mle_params[int(c)] = {
            'mu': mu_mle,
            'sigma': sigma_mle,
            'prior': p_y
        }

    return mle_params

fit_gaussian_mle(df)

In my code above, I’ve refrained from using numpy functions like np.cov to make it clearer how our derivations end up being implemented in code. With the means and covariance matrices derived from this code, we can visualize our data as follows

Gaussian MLE Fit

As you can see, in addition to classification, we are also able to plot ellipses showing how the relative likelihood of each class falls off as we move away from that class’s mean. In cases where the classes overlap, we’ll rely on the $\argmax$ in our original classifier to simply choose the class with the highest probability at that point.
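To actually classify new points with these fitted parameters, we can plug them back into the $\argmax$ rule from earlier. Here’s a sketch of what that might look like, using scipy for the Gaussian density and the mle_params dictionary returned above:

from scipy.stats import multivariate_normal

def predict(x, mle_params):
    # f(x) = argmax_y P[X = x | Y = y] * P[Y = y] with Gaussian class-conditionals
    scores = {
        c: multivariate_normal.pdf(x, mean=p['mu'], cov=p['sigma']) * p['prior']
        for c, p in mle_params.items()
    }
    return max(scores, key=scores.get)

mle_params = fit_gaussian_mle(df)
preds = df[['feature1', 'feature2']].apply(lambda row: predict(row.values, mle_params), axis=1)
print((preds == df['class']).mean())  # training accuracy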


Wrap-up


In this article, we’ve covered the basics of the classification problem, including topics such as the theory of the Bayes classifier and Maximum Likelihood Estimation. We also fit a sample dataset with MLE assuming Gaussian conditional distributions and demonstrated how you might implement this simple model in code.

If you’ve made it this far, thank you for reading through as always! We’ve set up a baseline understanding of the world of classification in this article. Next time, I’ll dive more into linear classification itself, building up from the concepts we’ve covered this time.

All diagrams are made by me unless noted otherwise.