Lecture Script

Summary

Overview of Classification Approaches

The lecture covered the following topics:

Preliminaries

Bayes Rule

Multivariate Gaussian Distribution

Gaussian Discriminant Analysis (GDA)

Model Setup for Binary Classification

Maximum Likelihood Estimation

Naïve Bayes Classifier

Motivation: Spam Filtering

Naïve Bayes Assumption

Model for Binary Features

Parameter Estimation

Examples

MNIST Binary Digit Recognition

Training and Testing Procedure

Training steps:

  1. Estimate $P(x_i=1|t=k)$ as the fraction of samples in class $k$ where feature $i$ equals 1
  2. Estimate $P(t=k)$ as the fraction of training samples in class $k$

Testing: predict the class $k$ that maximizes the product of the prior $P(t=k)$ and the likelihood $\prod_i P(x_i|t=k)$ for the test sample (a minimal sketch of both steps follows)
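The two steps above can be made concrete with a short sketch. This is a minimal illustration rather than the lecture's actual code: it assumes binarized images `X` (an $N \times d$ 0/1 array) and labels `t` already mapped to $\{0, 1\}$ (e.g., digits 5 and 6), and it adds Laplace smoothing, which the summary does not mention, to avoid zero probabilities.

```python
import numpy as np

def train_nb(X, t):
    """Estimate naive Bayes parameters from binary features X (N x d), labels t in {0, 1}."""
    priors = np.array([np.mean(t == k) for k in (0, 1)])  # P(t = k), step 2
    # P(x_i = 1 | t = k): fraction of class-k samples where feature i equals 1 (step 1),
    # with Laplace smoothing (+1 / +2) so no probability is exactly 0 or 1.
    cond = np.array([(X[t == k].sum(axis=0) + 1) / ((t == k).sum() + 2)
                     for k in (0, 1)])
    return priors, cond

def predict_nb(x, priors, cond):
    """Pick the class maximizing log P(t = k) + sum_i log P(x_i | t = k)."""
    log_post = np.log(priors) + (x * np.log(cond)
                                 + (1 - x) * np.log(1 - cond)).sum(axis=1)
    return int(np.argmax(log_post))
```

With $d = 784$ pixel features, summing log-probabilities avoids the numerical underflow that multiplying 784 small probabilities would cause.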

Generative vs Discriminative Models Comparison

Generative Models

Advantages: can deal with missing data, capture uncertainty, and generate new data from the learned distribution

Disadvantages: typically have many parameters, so they need more data to avoid overfitting

Discriminative Models

Advantages: learn the decision function/boundary directly, which is all that classification requires

Disadvantages: do not model the data distribution, so they cannot generate data or naturally handle missing features

The lecture noted that generative models are becoming increasingly important, especially for data generation tasks.

Notes

Transcript

Now we turn to the naïve Bayes classifier, which is why we need discrete random variables.

This is actually a binary classification problem. Consider spam filtering: we have an email, and we want to classify it as spam or benign. When we try to classify, how can we represent the email? One common way is to represent the email as a binary feature vector whose dimension equals the number of words in the dictionary. This is one way.
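To make this representation concrete, here is a minimal sketch of the binary bag-of-words encoding just described; the toy dictionary and email text are made up for illustration.

```python
import numpy as np

def encode_email(email_text, dictionary):
    """Binary bag-of-words: entry i is 1 iff dictionary word i occurs in the email."""
    words = set(email_text.lower().split())
    return np.array([1 if w in words else 0 for w in dictionary], dtype=np.uint8)

# Toy dictionary; the lecture assumes a real one with ~50,000 words.
dictionary = ["buy", "cheap", "hello", "meeting", "now", "viagra"]
x = encode_email("Buy cheap meds NOW", dictionary)
print(x)  # [1 1 0 0 1 0]
```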

We want to model the probability of $x$ given $t$. Here, assume the dictionary has about 50k words; each feature vector $x$ is then a 50,000-dimensional binary vector. Now, if we directly model this likelihood, the probability of $x$ given $t$, we run into a problem.

How many parameters do we need to model this likelihood in this setting? Note that $x$ is a binary vector, so it can take $2^{50{,}000}$ possible values.
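Spelling out the counting argument (a standard calculation consistent with this setup): a distribution over $2^d$ outcomes needs $2^d - 1$ free parameters, so directly modeling the likelihood costs

$$\underbrace{2^d - 1}_{\text{per class}} \;\times\; 2 \text{ classes} \;=\; 2\,(2^d - 1), \qquad d = 50{,}000,$$

which is astronomically many.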

This motivates the naïve Bayes assumption, a widely used assumption that significantly reduces the number of parameters.

So the summary here is that we assume all features are conditionally independent given the class label $t$, either spam or benign. That is, for the feature vector $x$ with features $x_1, x_2, \ldots, x_d$, the conditional joint factorizes into a product of per-feature conditionals:

$$P(x_1, x_2, \ldots, x_d \mid t) = \prod_{i=1}^{d} P(x_i \mid t).$$

What does this assumption mean intuitively? Say an email is spam, so $t = 1$. The assumption says that, once we know the email is spam, learning the value of $x_1$ has no effect on your beliefs about the value of $x_{800}$. This is the meaning of the conditional independence in the naïve Bayes assumption. In this case, you can just write the probability of $x_1, \ldots, x_d$ given $t$ as the product $P(x_1 \mid t) \times \cdots \times P(x_d \mid t)$, as above.

Note that this holds only once $t$ is given: the conditional independence of $x_i$ and $x_j$ given $t$ is different from (unconditional) independence between $x_i$ and $x_j$.

With this naïve Bayes assumption, we just need to model each conditional distribution $P(x_i \mid t)$. Count the number of parameters now: it is just about $2 \times 50\text{k}$, because we only need to estimate these per-feature conditional distributions $P(x_i \mid t)$ for each class, whereas the full joint is exponential in $d$.
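To make the reduction explicit (the counts follow directly from the model above, with one Bernoulli parameter per feature per class):

$$\underbrace{2\,(2^d - 1)}_{\text{full joint}} \;\longrightarrow\; \underbrace{2d}_{\text{naïve Bayes}} \;(+\,1 \text{ for the class prior}), \qquad d = 50{,}000.$$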

OK, now let's see how we can use this. Note that this is a very strong assumption, because in reality the words are not independent. With that caveat, let's see how we can construct the naïve Bayes model. Here we consider binary random variables, as just set up: each feature $x_i \in \{0, 1\}$. Remember that in a probabilistic generative model, we want to model the likelihood and the prior.

Of course, because $x_i$ is binary, we simply model each conditional distribution $P(x_i \mid t)$ as a Bernoulli distribution. The class label itself is also modeled with a Bernoulli distribution.

First, $\phi$, the parameter of the Bernoulli prior: $\phi$ defines the probability of the label, $P(t = 1) = \phi$. By setting the derivative of the log-likelihood to zero, we can calculate $\phi$. The $\mathbb{1}$ here is the indicator function: if $t_n = 1$, that is, the label of the $n$-th sample equals 1, then the indicator equals 1; otherwise it is 0. You can see that the MLE of $\phi$ can be easily interpreted: $\phi$ is the fraction of the training points that belong to class 1.

Similarly, you can estimate the other parameters, which also have very intuitive meanings. For instance, the conditional parameter $\phi_{i \mid t=k}$ is the fraction of the points belonging to class $k$ (either 0 or 1) whose feature $x_i$ equals 1. Once we have these parameters, we can compute the likelihood.
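Written out, the maximum likelihood estimates just described are:

$$\hat{\phi} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[t_n = 1], \qquad \hat{\phi}_{i \mid t=k} = \frac{\sum_{n=1}^{N} \mathbb{1}[t_n = k,\; x_{n,i} = 1]}{\sum_{n=1}^{N} \mathbb{1}[t_n = k]}.$$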

That gives us the likelihood $P(x \mid t)$ and also the prior $P(t)$, and we can then use them to make predictions. So, the overall steps:

Overall, to fit a naïve Bayes classifier, you first just need to introduce the likelihood and the prior, and then you can define the joint distribution. You learn the parameters of the distributions you have introduced by maximum likelihood, and then you make predictions by comparing posteriors. The joint distribution is divided into two parts: the likelihood $P(x \mid t)$ and the prior $P(t)$.

So this is what we obtain from the maximum likelihood estimation: you can see these are the estimated probability distributions.

For example, this is the likelihood $P(x \mid y = 5)$; $y = 5$ means the digit is 5. And this is the likelihood of $x$ given the label $y = 6$. These are shown here as heat maps of the per-pixel parameters. The second part is the prior: the prior $P(y = k)$ is the fraction of training samples in class $k$. Here we can assume the priors of the two digits are roughly the same.

Then, for testing with new samples, we just need to pick the label $t$ that has the maximum product of the likelihood of the test data and the prior.
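As an equation, the decision rule just described is

$$\hat{t} = \arg\max_{k} \; P(t = k) \prod_{i=1}^{d} P(x_i \mid t = k),$$

which in practice is computed in log space to avoid numerical underflow.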

Basically, here are some example predictions, and they show something interesting. The first example shows the model predicting this digit with probability 1. The second shows the model predicting this digit is 6 with probability 0.997.

For the third image, the predicted probability that the digit is 5 is quite high, 0.86, but this is actually a wrong prediction.

To recap Gaussian discriminant analysis: there, too, we define the joint distribution and learn its parameters by maximum likelihood. The difference is that $x$ is a continuous random vector, so we model the likelihoods as Gaussians, but we share the covariance matrix across classes.
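For reference, the GDA model being recapped (with the shared covariance $\Sigma$ mentioned above) is

$$p(x \mid t = k) = \mathcal{N}(x;\, \mu_k, \Sigma), \qquad p(t = 1) = \phi,$$

so each class has its own mean $\mu_k$ while all classes share one covariance matrix $\Sigma$.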

Right, so in generative models, the key is that we want to model how the data is generated. What we want to estimate is the distribution of the data, the class-conditional distributions. So here, everything is about distributions. The discriminative models are a whole other approach: they focus on learning the decision function, where you can read off the decision boundary directly.

So discriminative models do not care about the distribution of the data. Generative models, since they really model the distributions, have corresponding strengths: they can deal with missing data, and they can capture uncertainty.

As for the cons: with a generative model, we need more data to avoid overfitting.

This is because if the model parameters outnumber the data, then it is easy to overfit. Generative models often have many parameters, so when the data is scarce, they easily overfit.

But the pros, of course: the generative model can deal with missing data and can capture uncertainty. A generative model can also be used to generate new data from the learned distribution. This is why generative models are now becoming more and more important and popular: they generate data, as in all the new generative systems we see today. OK, now let's talk about...


Probabilistic Generative Models:

Gaussian Discriminant Analysis & Naïve Bayes Classifier


Naive Bayes Classifier (NBC):

Discrete Random Variables