Lecture Script

Summary

Overview

Discriminative Models Review

Logistic Regression Fundamentals

Logistic Function Properties

Likelihood and Optimization

Gradient Computation and Comparison

Iteratively Reweighted Least Squares (IRLS)

Classification and Confidence

Multi-class Classification Extensions

Notes

Transcript

Previously we did not consider any distribution; here we need to consider probability distributions, that is, the probabilistic discriminative models. The algorithm we will use is called logistic regression. So here, this algorithm is called logistic regression, but it actually is not a regression algorithm, okay? It is actually a classification algorithm.

Okay? So let's recap the discriminative models. Given data points from two classes, we want to find a linear decision boundary w^T x = 0, okay. In Fisher's discriminant analysis (FDA), once you decide the projection direction, you apply a threshold to the inner product, and that line is the decision boundary. In the perceptron, we just need to check the sign of the product t_n w^T x_n:

for a correctly classified point, this product should be positive, right? And in the SVM, we want this product to be larger than some margin. So here comes the question: how confident are we that a point is classified as the label plus one? In this example, okay, you can define the decision boundary, and then compare the confidence of classifying this data point and that data point.

Is the confidence the same when both are predicted as class plus one? Of course, this data point in this case should be predicted with more confidence than that data point, because that data point is closer to the decision boundary, right? But all these discriminative models, you know, cannot measure that confidence. This is where logistic regression comes in. The question: given a data point x and a decision boundary with parameter vector w,

what is the probability of x being classified as class one? This is actually the posterior distribution. You can tell that this is the posterior distribution of the label t; if you use the language of probability, it is the posterior distribution p(t = 1 | x). Okay, now let's take a look at the binary-class logistic regression. As in linear regression, the starting point is the linear model w^T x.

Notice the similarity: it uses the same inner product w^T x, and then in the binary-class logistic regression we just need to append a function sigma on top of it. This sigma is defined in the following form; it is called the logistic function. That is actually the reason the algorithm is called logistic regression: because it applies the logistic function, which comes from the logistic distribution.
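The formula itself is on the slide rather than in the audio; as a reconstruction, the standard definition being described is:

σ(a) = 1 / (1 + e^{-a}),   applied to a = w^T x.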

This is the graph of the logistic function. You can see that it is kind of S-shaped; that is why it is also called an S-shaped (sigmoid) function. Some properties are here. When a equals 0, the value of the function is one half. When a is positive, the value is larger than one half; when a is negative, the value is smaller than one half. So you can think about that:

this 0.5 can actually be used as the threshold, right? So then, by applying this sigma function to w^T x, you obtain a new function, the composition of the two. This is basically what logistic regression is: σ(w^T x) is the probability of predicting the label as plus 1, and then the probability of predicting the label as 0 is just one minus that.

That is just the complementary event, basically, right? Two cases: the probability of predicting the label as plus 1 plus the probability of predicting it as 0 should sum to 1.
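Spelled out (standard notation, reconstructed from the description):

p(t = 1 | x) = σ(w^T x)
p(t = 0 | x) = 1 − σ(w^T x)
p(t = 1 | x) + p(t = 0 | x) = 1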

So the logistic function, okay, the logistic function is very useful.

There are some very useful properties of the logistic function. The first one is the symmetry, okay.

Sigma of negative a, σ(−a), is equal to 1 minus σ(a). You can verify this directly from the definition of σ(a), right? Start from 1 − σ(a) and simplify; you end up with the same expression as plugging the negative argument into the first form.

So here I also just show the details of the derivation.
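A standard version of that derivation (the slide content is not in the transcript):

1 − σ(a) = 1 − 1/(1 + e^{-a}) = e^{-a}/(1 + e^{-a}) = 1/(e^{a} + 1) = σ(−a)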

And there is actually a function called the logit function, which is the inverse of the sigma. Just keep in mind: a is recovered from σ by inverting σ(a) = 1/(1 + e^{-a}); you define it like this, okay? And then you can also derive an explicit form of a, which can be written as the log of sigma divided by one minus sigma. And here I will highlight another property, which I think is the most important one for logistic regression.
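Written out (standard notation):

a = logit(σ) = ln( σ / (1 − σ) )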

You see, the derivative of the sigma function with respect to the variable a can be written in terms of sigma itself. The sigma function looks very complicated, but the derivative here, right, is simply the sigma function times 1 minus the sigma function, so it is again expressed through the logistic function.

Let's see, I'll just show the details here.
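The details are presumably the standard computation:

dσ/da = e^{-a} / (1 + e^{-a})^2 = σ(a) · (1 − σ(a))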

So this property is very, very important, because later, when you define an objective function and then want to calculate the gradient, you can directly write down the derivative using the sigma function itself.

Right here.

And in the derivative, right, in the derivative, you can actually also apply the symmetric property here: 1 − σ(a) is just σ(−a).


And now we know the logistic function, right. Next, let's define the objective function of logistic regression, because every machine learning algorithm has its own objective function.


So we said that we know the posterior probability of data x, right: for t equals 1 it is y(x) = σ(w^T x), and then p(t = 0 | x) is 1 − σ(w^T x). These two expressions can be unified.

You can see that if t equals 1, you only keep the first term; if t equals 0, then you only keep the second term. So the two cases unify into the single posterior distribution p(t | x) = y^t (1 − y)^{1−t}. And then here we have the training data, of course, right? And in most cases, we assume the data points are independent samples, independently drawn.

And then you can define the log-likelihood function. The likelihood function is basically the product of this posterior distribution over all data points; this is the likelihood, okay? And then we just apply the log function, and this product can be written as a summation over the n samples. So, does anybody know the meaning of likelihood?

Have you heard of the term likelihood? Likelihood is a statistical concept. So here, this probability distribution is called the posterior distribution; it is the probability that a data point is predicted as its label t, right? And then, because we have the assumption that the data points are independent samples, you can think of the likelihood as the probability of generating the n data points.

Sorry, as the probability of making the observed prediction on each data point x with label t, because the data points are independent. So the likelihood is the product of those posteriors, okay, and the log-likelihood just applies the log function to this. Then you have a function which is a summation of n terms, one per sample. From this definition, you can also see that we should make the likelihood as large as possible.
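In symbols (reconstructed, with y_n = σ(w^T x_n)):

L(w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n}
ln L(w) = Σ_{n=1}^{N} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ]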

Maximizing this with respect to the parameter w is called maximum likelihood estimation. And we know that this is an unconstrained problem, so we can just directly compute the gradient.

Let's just take a look at this gradient term. This term is basically the difference between the label t_n and the prediction y_n, times the feature vector x_n itself, okay. And if you recall linear regression, when you compute the gradient, it also involves the difference between the true label and the estimate, times the features. The only difference here is that the prediction functions are different: here y has the sigmoid inside, while in linear regression it is just w^T x.

The form actually looks pretty close.
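Side by side (reconstructed in standard notation; the sign convention follows minimizing the negative log-likelihood):

∇_w E(w) = Σ_{n=1}^{N} (y_n − t_n) x_n

The same expression holds for least-squares linear regression; only the definition of y_n changes (σ(w^T x_n) versus w^T x_n).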

So it has an almost identical form to the gradient for solving least squares in linear regression. And why is it, sorry, why are the two gradients of such a similar form? There is actually a fundamental reason. The deep reason here is what are called generalized linear models; okay, we'll talk about that later. So logistic regression and linear regression, and in fact not just these two algorithms,

have gradients that are very similar: the error multiplied by the data features. And the fundamental reason the forms are similar is that these algorithms all belong to the family of generalized linear models.

In logistic regression, however, the gradient is actually a nonlinear function of w.

It is a very complicated function here. So that means that even if you set the gradient of the loss with respect to w to zero, you cannot solve for w: it is a very complicated nonlinear equation. But in linear regression, you can get the exact solution for w through the normal equation.

But here you cannot. Right.

So you cannot set this gradient to zero and solve for w directly. Instead, there is a specialized algorithm for solving logistic regression.

Given that we cannot get w easily in closed form, we use this algorithm, which is called iteratively reweighted least squares.

Let's try to understand this algorithm. Right, so you compute the gradient and set it to zero, but because of the complicated form of y(x), there is no closed-form solution here. To solve it, we can apply an iterative algorithm, in this case Newton's method, okay. Let me just show some more details here, right?

In Newton's method, you need to compute the Hessian matrix and the gradient. In this case, the gradient is defined like this,

and the Hessian matrix is like this.

Okay, assuming that you have this Hessian, look a little closer at the gradient. You can concatenate all the data samples into the design matrix X, and then you have the vector y − t here, where y is the vector of logistic function outputs y_n = σ(w^T x_n) and t is the vector of labels, the 0's and 1's. So the gradient is X^T (y − t). And then the Hessian is the matrix

product of X transpose, R, and X. This R is a diagonal matrix, which also depends on the predictions y, and therefore on w.
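In the standard notation (a reconstruction of the slide):

∇E(w) = X^T (y − t)
H = X^T R X,   R diagonal with R_nn = y_n (1 − y_n)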

Going one step further, you can apply a transformation: you define an X-tilde built from X and R (here, remember that R itself depends on w through the predictions), and you further define a working target vector, a y-tilde. Okay, through this transformation you can see that the Newton update for the parameter vector w, written out here, has a very similar form to the normal equation of linear regression.

Okay. But the difference is this: compare it with the original normal equation we saw before.

And also, because this R (and the working target) depends on w, the update has to be applied iteratively: you run the Newton step many times. Right, so here R is not a constant but depends on w, so it is iteratively updated, okay? It is iteratively updated.

So each step is a weighted least-squares problem, and then we need to iteratively update the weights here. That is why this algorithm is called iteratively reweighted least squares. I will give you two minutes to take a look at the technical details, and then you can understand the relation between the normal equation of linear regression and this weighted least squares.

That's why it's called IRLS.
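The exact X-tilde and y-tilde on the slide are not recoverable from the audio; below is a minimal runnable sketch of the update, assuming the standard formulation where the Newton step is w = (X^T R X)^{-1} X^T R z with z = Xw − R^{-1}(y − t). The function and variable names are mine, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(X, t, n_iters=20, eps=1e-8):
    """Iteratively reweighted least squares for logistic regression.

    X: (N, D) design matrix; t: (N,) labels in {0, 1}.
    Each iteration solves a weighted least-squares problem whose
    weights R depend on the current w, hence 'reweighted'.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        y = sigmoid(X @ w)                 # current predictions y_n
        r = y * (1.0 - y) + eps            # diagonal of R (eps avoids division by zero)
        z = X @ w - (y - t) / r            # working target: z = Xw - R^{-1}(y - t)
        XtR = X.T * r                      # equivalent to X^T R, with R diagonal
        # Newton step: w = (X^T R X)^{-1} X^T R z
        w = np.linalg.solve(XtR @ X, XtR @ z)
    return w
```

Each pass through the loop is exactly one weighted least-squares solve; the recomputation of r from the current w is the reweighting the name refers to.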

And now let's see, right? Suppose we have already solved for the parameter w*; how can we use it for classification? Right, we just need to plug w* into the posterior, σ(w*^T x), right. And then if this probability is larger than one half, the point is classified as class 1; if the probability is less than 0.5, then it is classified as class 0, right?

And furthermore, for any two points A and B: if A's predicted probability is larger and both are above 0.5, then A is more confident than B to be classified as class 1; symmetrically, if both probabilities are below 0.5, the point with the smaller probability is more confidently class 0. So you can see that, by using this logistic function, you now have the confidence of each prediction.
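As a compact summary of the rule (standard notation, not read verbatim from the slide):

predict t = 1 if σ(w*^T x) > 0.5, equivalently w*^T x > 0; otherwise predict t = 0
σ(w*^T x_A) > σ(w*^T x_B) > 0.5 implies A is the more confident positive prediction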

I mean, you can now decide which prediction is more confident, and make comparisons among points that are all classified into class 1 or all into class 0. I guess we only have three minutes left, but let's see what we can cover. Okay, right, we just talked about the binary-class logistic regression, and then of course it is more common to

classify into multiple classes.

Think of the digit recognition problem: for handwritten digit recognition the number of classes is 10, right, the digits 0 to 9. And then, to generalize logistic regression from the sigmoid to the multi-class case, we need to introduce a new function, the softmax function.

It looks like this.

The softmax function here is defined as this, as p_k: the exponential of phi_k, divided by the sum over all classes of the exponentials. Okay, so here, given the scores phi, the probability of class k is the softmax of the scores. Basically, you get the probability of predicting class k, such that the probabilities over all classes sum to 1.
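In symbols (reconstructed, with φ_k the score for class k):

p_k = exp(φ_k) / Σ_{j=1}^{K} exp(φ_j),   so that Σ_{k=1}^{K} p_k = 1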

I want to mention another very, very useful strategy, called the one-hot encoder, or one-hot encoding. In binary classification, we can define the classes to be plus 1 and negative 1, or plus 1 and 0, right? But in the multi-class case, okay, we need something else; we actually mentioned this before. For a label t, for example when the number of classes is 10, how can we make the labels comparable? Then we need to introduce what is called a one-hot encoding.

This is the 1-of-K binary coding scheme. Say the number of classes is 10, and the class of t_n is equal to 3. That means we define a one-hot encoding which is a 10-dimensional vector: in this vector, the element corresponding to the label 3 is 1, and the remaining elements are 0. This is how to define the label, okay. Okay, so I guess time is up. In the next lecture, we will continue from here with another very important concept.
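A small sketch of the encoding just described (the helper name is mine, not from the lecture):

```python
import numpy as np

def one_hot(labels, num_classes):
    """1-of-K coding: label k maps to a num_classes-dimensional
    vector with a 1 in position k and 0 everywhere else."""
    T = np.zeros((len(labels), num_classes), dtype=float)
    T[np.arange(len(labels)), labels] = 1.0
    return T

# The lecture's example: 10 classes, label 3
print(one_hot([3], 10)[0])
# -> [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```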


Classification: Three Different Methods


Discriminant Models



Probabilistic Discriminant Models