Lecture Script

Summary

Overview of Classification Approaches

The lecture covered the following topics:

Preliminaries

Bayes Rule

Multivariate Gaussian Distribution

Gaussian Discriminant Analysis (GDA)

Model Setup for Binary Classification

Maximum Likelihood Estimation

Naïve Bayes Classifier

Motivation: Spam Filtering

Naïve Bayes Assumption

Model for Binary Features

Parameter Estimation

Examples

MNIST Binary Digit Recognition

Training and Testing Procedure

Training steps:

  1. Estimate $P(x_i=1|t=k)$ as the fraction of samples in class $k$ where feature $i$ equals 1
  2. Estimate $P(t=k)$ as the fraction of training samples in class $k$

Testing: predict the class $k$ that maximizes the product of the prior $P(t=k)$ and the likelihood $\prod_i P(x_i|t=k)$ for the test sample (a minimal sketch of both steps follows)
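The two steps above can be made concrete with a short sketch. This is a minimal illustration rather than the lecture's actual code: it assumes binarized images `X` (an $N \times d$ 0/1 array) and labels `t` already mapped to $\{0, 1\}$ (e.g., digits 5 and 6), and it adds Laplace smoothing, which the summary does not mention, to avoid zero probabilities.

```python
import numpy as np

def train_nb(X, t):
    """Estimate naive Bayes parameters from binary features X (N x d), labels t in {0, 1}."""
    priors = np.array([np.mean(t == k) for k in (0, 1)])  # P(t = k), step 2
    # P(x_i = 1 | t = k): fraction of class-k samples where feature i equals 1 (step 1),
    # with Laplace smoothing (+1 / +2) so no probability is exactly 0 or 1.
    cond = np.array([(X[t == k].sum(axis=0) + 1) / ((t == k).sum() + 2)
                     for k in (0, 1)])
    return priors, cond

def predict_nb(x, priors, cond):
    """Pick the class maximizing log P(t = k) + sum_i log P(x_i | t = k)."""
    log_post = np.log(priors) + (x * np.log(cond)
                                 + (1 - x) * np.log(1 - cond)).sum(axis=1)
    return int(np.argmax(log_post))
```

With $d = 784$ pixel features, summing log-probabilities avoids the numerical underflow that multiplying 784 small probabilities would cause.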

Generative vs Discriminative Models Comparison

Generative Models

Advantages: can deal with missing data, capture uncertainty, and generate new data from the learned distribution

Disadvantages: typically have many parameters, so they need more data to avoid overfitting

Discriminative Models

Advantages: learn the decision function/boundary directly, which is all that classification requires

Disadvantages: do not model the data distribution, so they cannot generate data or naturally handle missing features

The lecture noted that generative models are becoming increasingly important, especially for data generation tasks.

Notes

Transcript

Now we turn to the naïve Bayes classifier, which is why we need discrete random variables.

This is actually a binary classification problem. Consider spam filtering: we have an email, and we want to classify it as spam or benign. When we try to classify, how can we represent the email? One common way is to represent the email as a binary feature vector whose dimension equals the number of words in the dictionary. This is one way.
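To make this representation concrete, here is a minimal sketch of the binary bag-of-words encoding just described; the toy dictionary and email text are made up for illustration.

```python
import numpy as np

def encode_email(email_text, dictionary):
    """Binary bag-of-words: entry i is 1 iff dictionary word i occurs in the email."""
    words = set(email_text.lower().split())
    return np.array([1 if w in words else 0 for w in dictionary], dtype=np.uint8)

# Toy dictionary; the lecture assumes a real one with ~50,000 words.
dictionary = ["buy", "cheap", "hello", "meeting", "now", "viagra"]
x = encode_email("Buy cheap meds NOW", dictionary)
print(x)  # [1 1 0 0 1 0]
```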

We want to model the probability of $x$ given $t$. Here, assume the dictionary has about 50k words; each feature vector $x$ is then a 50,000-dimensional binary vector. Now, if we directly model this likelihood, the probability of $x$ given $t$, we run into a problem.

How many parameters do we need to model this likelihood in this setting? Note that $x$ is a binary vector, so it can take $2^{50{,}000}$ possible values.
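Spelling out the counting argument (a standard calculation consistent with this setup): a distribution over $2^d$ outcomes needs $2^d - 1$ free parameters, so directly modeling the likelihood costs

$$\underbrace{2^d - 1}_{\text{per class}} \;\times\; 2 \text{ classes} \;=\; 2\,(2^d - 1), \qquad d = 50{,}000,$$

which is astronomically many.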

This motivates the naïve Bayes assumption, a widely used assumption that significantly reduces the number of parameters.

So the summary here is that we assume all features are conditionally independent given the class label $t$, either spam or benign. That is, for the feature vector $x$ with features $x_1, x_2, \ldots, x_d$, the conditional joint factorizes into a product of per-feature conditionals:

$$P(x_1, x_2, \ldots, x_d \mid t) = \prod_{i=1}^{d} P(x_i \mid t).$$

What does this assumption mean intuitively? Say an email is spam, so $t = 1$. The assumption says that, once we know the email is spam, learning the value of $x_1$ has no effect on your beliefs about the value of $x_{800}$. This is the meaning of the conditional independence in the naïve Bayes assumption. In this case, you can just write the probability of $x_1, \ldots, x_d$ given $t$ as the product $P(x_1 \mid t) \times \cdots \times P(x_d \mid t)$, as above.

Note that this holds only once $t$ is given: the conditional independence of $x_i$ and $x_j$ given $t$ is different from (unconditional) independence between $x_i$ and $x_j$.

With this naïve Bayes assumption, we just need to model each conditional distribution $P(x_i \mid t)$. Count the number of parameters now: it is just about $2 \times 50\text{k}$, because we only need to estimate these per-feature conditional distributions $P(x_i \mid t)$ for each class, whereas the full joint is exponential in $d$.
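To make the reduction explicit (the counts follow directly from the model above, with one Bernoulli parameter per feature per class):

$$\underbrace{2\,(2^d - 1)}_{\text{full joint}} \;\longrightarrow\; \underbrace{2d}_{\text{naïve Bayes}} \;(+\,1 \text{ for the class prior}), \qquad d = 50{,}000.$$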

OK, now let's see how we can use this. Note that this is a very strong assumption, because in reality the words are not independent. With that caveat, let's see how we can construct the naïve Bayes model. Here we consider binary random variables, as just set up: each feature $x_i \in \{0, 1\}$. Remember that in a probabilistic generative model, we want to model the likelihood and the prior.

Of course, because $x_i$ is binary, we simply model each conditional distribution $P(x_i \mid t)$ as a Bernoulli distribution. The class label itself is also modeled with a Bernoulli distribution.

First, $\phi$, the parameter of the Bernoulli prior: $\phi$ defines the probability of the label, $P(t = 1) = \phi$. By setting the derivative of the log-likelihood to zero, we can calculate $\phi$. The $\mathbb{1}$ here is the indicator function: if $t_n = 1$, that is, the label of the $n$-th sample equals 1, then the indicator equals 1; otherwise it is 0. You can see that the MLE of $\phi$ can be easily interpreted: $\phi$ is the fraction of the training points that belong to class 1.

Similarly, you can estimate the other parameters, which also have very intuitive meanings. For instance, the conditional parameter $\phi_{i \mid t=k}$ is the fraction of the points belonging to class $k$ (either 0 or 1) whose feature $x_i$ equals 1. Once we have these parameters, we can compute the likelihood.
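Written out, the maximum likelihood estimates just described are:

$$\hat{\phi} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[t_n = 1], \qquad \hat{\phi}_{i \mid t=k} = \frac{\sum_{n=1}^{N} \mathbb{1}[t_n = k,\; x_{n,i} = 1]}{\sum_{n=1}^{N} \mathbb{1}[t_n = k]}.$$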

That gives us the likelihood $P(x \mid t)$ and also the prior $P(t)$, and we can then use them to make predictions. So, the overall steps:

Overall, to fit a naïve Bayes classifier, you first just need to introduce the likelihood and the prior, and then you can define the joint distribution. You learn the parameters of the distributions you have introduced by maximum likelihood, and then you make predictions by comparing posteriors. The joint distribution is divided into two parts: the likelihood $P(x \mid t)$ and the prior $P(t)$.

So this is what we obtain from the maximum likelihood estimation: you can see these are the estimated probability distributions.

For example, this is the likelihood $P(x \mid y = 5)$; $y = 5$ means the digit is 5. And this is the likelihood of $x$ given the label $y = 6$. These are shown here as heat maps of the per-pixel parameters. The second part is the prior: the prior $P(y = k)$ is the fraction of training samples in class $k$. Here we can assume the priors of the two digits are roughly the same.

Then, for testing with new samples, we just need to pick the label $t$ that has the maximum product of the likelihood of the test data and the prior.
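As an equation, the decision rule just described is

$$\hat{t} = \arg\max_{k} \; P(t = k) \prod_{i=1}^{d} P(x_i \mid t = k),$$

which in practice is computed in log space to avoid numerical underflow.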

Basically, here are some example predictions, and they show something interesting. The first example shows the model predicting this digit with probability 1. The second shows the model predicting this digit is 6 with probability 0.997.

For the third image, the predicted probability that the digit is 5 is quite high, 0.86, but this is actually a wrong prediction.

To recap Gaussian discriminant analysis: there, too, we define the joint distribution and learn its parameters by maximum likelihood. The difference is that $x$ is a continuous random vector, so we model the likelihoods as Gaussians, but we share the covariance matrix across classes.
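For reference, the GDA model being recapped (with the shared covariance $\Sigma$ mentioned above) is

$$p(x \mid t = k) = \mathcal{N}(x;\, \mu_k, \Sigma), \qquad p(t = 1) = \phi,$$

so each class has its own mean $\mu_k$ while all classes share one covariance matrix $\Sigma$.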

Right, so in generative models, the key is that we want to model how the data is generated. What we want to estimate is the distribution of the data, the class-conditional distributions. So here, everything is about distributions. The discriminative models are a whole other approach: they focus on learning the decision function, where you can read off the decision boundary directly.

So discriminative models do not care about the distribution of the data. Generative models, since they really model the distributions, have corresponding strengths: they can deal with missing data, and they can capture uncertainty.

As for the cons: with a generative model, we need more data to avoid overfitting.

This is because if the model parameters outnumber the data, then it is easy to overfit. Generative models often have many parameters, so when the data is scarce, they easily overfit.

But the pros, of course: the generative model can deal with missing data and can capture uncertainty. A generative model can also be used to generate new data from the learned distribution. This is why generative models are now becoming more and more important and popular: they generate data, as in all the new generative systems we see today. OK, now let's talk about...


Probabilistic Generative Models:

Gaussian Discriminant Analysis & Naïve Bayes Classifier


Naive Bayes Classifier (NBC):

Discrete Random Variables