Lecture Script

Summary

Normal Equation Requirements

Dense Solutions and Feature Importance

Under-fitting vs Over-fitting: Polynomial Curve Fitting Example

The Need for Regularization

Converting Constrained to Unconstrained Optimization

Ridge Regression (L2 Regularization)

Lambda Hyperparameter Effects

Action Items

Notes

Linear Regression

Overview of the Regression Problem

Regression is the problem of predicting a continuous value from input features.

A representative example is predicting a house's price from its various features.

The training data consist of features and labels, and the goal is to learn a model that approximates the label well from the features.


Linear Algebra Basics

Vectors and Matrices

A vector is a one-dimensional array of numbers; it comes in column-vector and row-vector forms.

A matrix is a two-dimensional array, a structure that stacks several vectors.

The transpose is the operation that swaps rows and columns.

Vector / Matrix Operations

A vector–vector inner product returns a scalar.

A vector–vector outer product produces a matrix.

A matrix–vector product outputs a vector.

Matrix–Matrix Product

It can be interpreted in terms of inner products between rows and columns.

It can also be expressed as a sum of outer products.

It can also be understood as an extension of the matrix–vector product.
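A minimal numpy sketch of the operations listed above (the array values are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # vector (1-D array)
y = np.array([4.0, 5.0, 6.0])
A = np.arange(6.0).reshape(2, 3)         # 2 x 3 matrix
B = np.arange(12.0).reshape(3, 4)        # 3 x 4 matrix

A_T   = A.T                              # transpose: 3 x 2
inner = x @ y                            # inner product: scalar (32.0)
outer = np.outer(x, y)                   # outer product: 3 x 3 matrix
Ax    = A @ x                            # matrix-vector product: vector of length 2
AB    = A @ B                            # matrix-matrix product: 2 x 4 matrix

# The matrix-matrix product can also be written as a sum of outer products
# of A's columns with B's rows.
AB_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))
assert np.allclose(AB, AB_outer)
```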


Gradients and Matrix Derivatives

The gradient of a function with a scalar output is the vector that collects the partial derivatives with respect to each variable.

The gradient with respect to a matrix or vector has the same dimensions as that matrix or vector.
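For example (a quick check of the shape rule): for f(w) = w^T x with w, x in R^d, the gradient is ∇_w f = x, a d-dimensional vector with the same shape as w; for f(w) = ||w||^2 = w^T w, the gradient is ∇_w f = 2w, again the same shape as w.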


Matrix / Vector의 성질

Transcript

See, the normal equation is here. So here, to get the value of w-hat, sorry, w-star, we need to ensure that this X X-transpose is full rank. Full rank means that the number of samples should be larger than the number of features. Ready?
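For reference, a compact statement of this point, assuming the convention used in this lecture where the N training samples are stacked as the columns of X (so X is d x N, with d features, and t is the length-N label vector):

    w* = (X X^T)^{-1} X t

The d x d matrix X X^T can only be full rank, and hence invertible, when N >= d, which is exactly the requirement stated above.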

And also, I want to show that the w-star here is always a dense vector. A dense vector means that most of the entries

in this vector have non-zero values. Two questions follow from this. First: the normal equation requires the number of samples to be larger than the number of features. But what if the number of samples is less than the number of features, meaning that the data are very high-dimensional, as in some practical applications?

In that case the number of features can be much larger than the number of samples. How can we still get a solution? The answer is regularized linear regression, which is what we will come to.

And the second question concerns this w-hat: it is a very dense vector,

which means that most of its entries are non-zero.

So let's think about what a dense w-hat means. w-hat is applied to all of the samples, right? We need to compute the inner product between w-hat and each sample. If w-hat has non-zero values in almost all its entries,

that means all the features in the training samples are treated as useful.

But this is not always the case. In other words, in many applications it is more desirable to learn a w

that is sparse.

Sparse means that just a small fraction of the entries are non-zero, and most of them are zero. To get that kind of solution there is another very famous model, called the Lasso.
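(For reference, in the notation used later for the regularized objective: the Lasso replaces the squared L2 penalty with an L1 penalty, i.e. it minimizes L(w) + lambda * ||w||_1, and the L1 norm tends to drive many entries of w exactly to zero, which is what produces the sparse solution.)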

Now I also want to discuss some fitting issues. We have seen that the plain formulation of linear regression can be insufficient, so we need to come up with better versions, which leads to regularized linear regression. Let's use a polynomial curve fitting example to illustrate these basic concepts: under-fitting, over-fitting, sparse solutions, and dense solutions. So now, suppose we have ten data points sampled from a sine curve. Here the feature is x_n,

and the label t_n is generated from this sine function, with a little noise added. Basically, the points shown in the figure are the samples, right?

And now let's try to design a regression algorithm

that approximates this sine curve, using a polynomial of order M as the model f(x).

Let's start with the simplest case, M equals 0. By putting M equals 0, you are going to see that this function f is just a constant. Is this clear? Then f(x) equals w_0; it is just a horizontal line, the red one. Of course, this f cannot fit the data points at all. You can see this curve is very simple.

It's very simple; it is insufficient to represent this sine curve. Alright, now let's try increasing the order. If the order is 1, then f is just w_0 plus w_1 times x, and this f

corresponds to a straight line with some slope.

It now has some slope, but you can see that it still does not fit the data; it is still a poor fit. Then we further increase the order to three, and f becomes a cubic polynomial, which fits the data noticeably better.

So here I want to mention that there are ten data points.

There are only ten data points, and now we set the order to M = 9.

Then we minimize the squared loss we have defined.

Remember that the squared loss is just the difference between the label t_n and this f(x_n), right? And if we set the order to nine, then f is a degree-9 polynomial,

and there are only ten data points.

Because a degree-9 polynomial has ten coefficients, we can create this very complicated red curve, which passes through every data point, meaning that the error is zero. This curve gives us a very small, in fact zero, training loss: if the curve passes through all the points, the loss becomes zero. But you can see that this curve is very, very complicated and wiggly, right?

This is exactly what over-fitting looks like.
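A minimal numpy sketch of this experiment, just to make the pattern concrete (the data, the noise level, and the use of np.polyfit in place of the explicit normal equation are illustrative assumptions, not the lecture's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)    # noisy samples of a sine curve

x_test = np.linspace(0.0, 1.0, 100)                          # held-out points for the test error
t_test = np.sin(2 * np.pi * x_test) + 0.1 * rng.standard_normal(x_test.size)

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)                         # least-squares polynomial fit of order M
    rms_train = np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    rms_test = np.sqrt(np.mean((np.polyval(coeffs, x_test) - t_test) ** 2))
    print(f"M={M}: train RMS = {rms_train:.3f}, test RMS = {rms_test:.3f}")

# M = 0 or 1 under-fits (both errors stay large); M = 9 drives the training error
# to essentially zero while typically making the test error much worse (over-fitting).
```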

So let's just summarize this with the concepts of under-fitting and over-fitting. What causes under-fitting? A small M,

which means that the model has a small capacity.

The model is not sufficiently powerful. Okay?

Now look at the training error and the test error. Here, by error we just mean the squared loss we minimize, shown as a root-mean-square error, okay? This curve shows that when the model is simple, you cannot reduce the error very much.

On the other hand, when M is very large, that means the model is very powerful, like M = 9.

The model is so powerful that the training error, the training squared loss, becomes very small; in this case it is actually zero with M = 9.

But at the same time you get a very large test error.

So here, as M increases, the magnitude of the parameters becomes very large.

The values become very, very large, and this maps to a very complicated curve.

Look at the values of w, up to w_9, for this curve in the table.

You can see that these parameters are extremely huge.

The fundamental reason is that the problem we have designed is actually an unconstrained optimization problem, where we do not have any constraint on the parameter values.

This may cause some issues:

the values of w can become very, very large.

This is not good, right?

That means we need to penalize such large parameters. Basically, the way to penalize large parameters is to constrain the norm of the parameter vector.

The norm represents the length of a vector; constraining it limits how long the parameter vector can be. This is where regularized linear regression comes in. So this is the formulation: we still minimize the loss function, and now

we also want the w to satisfy a constraint.

The constraint is that the norm of this parameter vector should be smaller than some value, gamma. Is this transition clear? Why do we need to enforce that constraint here? You should be able to answer this through the polynomial curve fitting example: why we enforce this constraint.

Then the natural next step is to convert this constrained problem into an unconstrained problem.

Basically, we need to move this constraint into the objective.

You can do that by adding a penalty term, in the spirit of the Lagrangian; let me explain the plan.

Basically, the first term here is still the loss function we want to minimize.

This constraint, right? This constraint we can put directly into the objective function, weighted by a hyperparameter lambda. So how should we understand this? Intuitively, this second term is there to ensure this w

stays, as much as possible, smaller than that gamma. Now look back at this objective here, right? We want to minimize L(w).

This is what the least-squares loss tries to do: we need to minimize this loss.

And at the same time we also minimize this second, regularization term.

Right, it corresponds to that constraint. So basically, by making this term as small as possible, we are at the same time minimizing the norm of this parameter vector.

If instead the constraint required this vector to be larger than some value, okay, then we would need to minimize the negative of this term, or equivalently maximize it; we just change the sign. Either way, we transfer the constrained problem to an unconstrained one. This is a general technique. And we call this added term the regularization, or penalty, term.

So the objective is the loss plus this term; that's why we call it regularized linear regression. Okay. I guess there is one question in homework assignment one that asks you to transfer a constrained problem to an unconstrained problem like this. Any questions?
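In symbols, the conversion described above is (restating the two problems, writing the constraint on the squared norm for convenience, with gamma the constraint level and lambda >= 0 the penalty weight):

    minimize over w:   L(w)    subject to   ||w||^2 <= gamma

becomes

    minimize over w:   L(w) + lambda * ||w||^2

and for a constraint in the opposite direction, ||w||^2 >= gamma, the penalty term would enter with a minus sign (equivalently, its negative would be maximized), as mentioned above.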

Here, right, this regularization hyperparameter lambda is actually very important.

And you should understand the effect caused by this lambda, okay?

So here, assume that lambda is very large. What will happen to the learned w?

Think about it: if lambda is very huge and you want to minimize this objective, right, then because lambda is so large, you are of course forced to make the norm of w very small.

If lambda is very large, this penalty term is very large unless the norm is small. So to minimize this new loss function you are forced to make the w very small; I mean, every w_i should be very small, so the magnitude of the whole parameter vector is very small.

Then L(w), the data-fitting loss, cannot be made small, which means that the model is not powerful enough; the model loses its power.

Think about the extreme case: lambda goes to infinity. Then w is pushed all the way to zero, and f(x) becomes the zero polynomial, right? That is clearly insufficient to fit the sine curve; this is under-fitting again.

Now assume lambda is very small.

Then the penalty term barely matters, and to minimize this objective you just focus on minimizing the loss L(w). Ideally, the loss can be zero. If the loss is zero, the fitted curve will be very complicated,

the w values can be very large, and that is over-fitting again. So here you can see that the value of lambda can somehow control where the model sits between under-fitting and over-fitting. In other words, lambda is very important.

Think about it, right? Lambda is very small; what if you directly set it to zero? Then this reduces to the ordinary least squares we had before, and we know ordinary least squares can over-fit. Are there any questions about this? Okay. So you can see that lambda also controls the norm of w. Now let's try to apply this and see the solution of this regularized problem.

Think about this offline as well. This regularized least squares has many names: you can call it penalized least squares, constrained least squares. Now here, with a general Lp norm this is a very general model, and by setting p = 2 we get what is called ridge regression.

So let's start with ridge regression.

Here is the corresponding constrained form of the problem. We use the L2 norm, the Euclidean norm, rather than, say, L1, so this is also called L2 regularization.

Then, similar to the original derivation of the normal equation, you can derive the solution; I will leave the details for you to work out offline. Just take the first-order condition by setting the gradient of this new loss to zero, and then you get this. Okay. And now you can see the difference here: this is the normal equation for ridge regression. So from this equation, what's the difference?

What's the difference between this and the normal equation for ordinary linear regression? What's different from the original version?

You can see that the key difference is that there is an additional term, lambda I. Here, I is the identity matrix,

whose diagonal elements are 1 and whose off-diagonal elements are 0.
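For instance, in three dimensions,

    lambda * I = [ lambda   0        0
                   0        lambda   0
                   0        0        lambda ]

so adding lambda I to X X^T simply adds lambda to every diagonal entry.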

So with this normal equation you can get the ridge solution w-hat. Here you need to compute the inverse of X X-transpose plus lambda I. Okay, I want to highlight something very important. Remember that in the conventional normal equation we do not have this lambda I term; we only have X X-transpose, and we want to compute the inverse of that X X-transpose.

Here in ridge regression we have this additional term lambda I.

With this lambda I, the matrix X X-transpose plus lambda I is always full rank. Meaning that now there is no constraint that the number of samples should be larger than the number of features. It is always invertible, so we can always get the inverse. So here, hopefully, you can derive this by yourselves. Linear regression and regularized linear regression are very basic, and of course you need to be able to deal with vectors, to deal with matrices, to deal with the gradients.

Of course, you can apply this in many cases; you do not need to consider whether the number of samples is more than the number of features. You still need to compute the inverse, though.
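A minimal numpy sketch of this point, under the same column-samples convention as above (X is d x N; the sizes and the lambda value are made up for illustration): even with more features than samples, the ridge normal equation can still be solved.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, lam = 50, 20, 0.1                    # more features than samples
X = rng.standard_normal((d, N))            # columns are the training samples
t = rng.standard_normal(N)                 # labels

# Ordinary least squares: X @ X.T is d x d but has rank at most N < d, so it is singular.
print(np.linalg.matrix_rank(X @ X.T))      # prints 20, not 50

# Ridge regression: X @ X.T + lam * I is positive definite for lam > 0, hence invertible.
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ t)
print(w_ridge.shape)                       # (50,)
```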

What does this mean for the fitting? So here, when we set lambda to a very small value, for instance log lambda going toward negative infinity, then lambda is a very small value, right? Lambda is very small, and then you can observe over-fitting.

And when lambda is very large, for instance lambda equal to 1,

then the model is too constrained; the fitted curve becomes very flat, this simple line here, and we are back to under-fitting. That is what this figure shows.

Okay, let's also look at the learned coefficients for a moment. Here we still use the degree-9 polynomial. Without regularization, you have seen that the values are very, very large. But by setting lambda to a

reasonably large value, you can see the w values become much smaller. So, from another perspective, by controlling the value of lambda we are controlling the magnitude of the parameters.

That lets us avoid under-fitting and avoid over-fitting. Okay.
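A small follow-up sketch of this last point (again with made-up data; note that here the design matrix Phi stacks the samples as rows, so the ridge solution reads (Phi^T Phi + lambda I)^{-1} Phi^T t): the coefficient magnitudes of the degree-9 fit shrink as lambda grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

Phi = np.vander(x, M + 1, increasing=True)   # N x (M+1) polynomial features 1, x, ..., x^M

for lam in (0.0, 1e-4, 1.0):
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
    print(f"lambda = {lam:g}: max |w_i| = {np.abs(w).max():.3e}")

# The unregularized (lambda = 0) fit typically has much larger coefficients;
# increasing lambda shrinks them, and a very large lambda pushes w toward zero.
```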

I guess I will record the last part separately, because we have lost about two minutes and it is very important. So, by the way, this Friday the weather is supposed to be very bad, so I will stay online; I will send an announcement and then you can join. And if you have any questions about this course, whether about the lectures or something else, you can talk to me then.


Linear Regression


Basic Concepts of the Regression Problem



Regression Example: House Price Prediction



Linear Algebra Basics