Summary

Problem Complexity Overview

Convex vs Non-Convex Optimization

Gradient Descent Fundamentals

Mathematical Formulation

Learning Rate (η)

Gradient Descent Variants

Parameters vs Hyperparameters

Advanced Optimizers

Overfitting and Underfitting

Regularization

Common Optimization Challenges

Case Study: House Price Prediction

Key Takeaways

Notes

Transcript

We have a term called complexity: the complexity of a problem we want to solve. Problems fall into different categories, such as P, NP, and NP-hard. For some problems, the complexity is such that we can't solve them with the methods we have, so the papers you see coming out try to solve these kinds of problems. NP-hard is one of these classes.

Finding the global minimum in general has still not been solved, so there are always papers being published that attack this problem in their own setting.

So if you reach a local minimum, and the problem is convex optimization, you can be sure that this is the global minimum. The surface is bowl-shaped.

These methods, linear regression, logistic regression, SVMs, solve convex optimization problems, and convergence to the minimum is guaranteed. But in non-convex problems, as you saw in the examples, there are multiple local minima and saddle points on the surface. Neural networks and deep learning are the methods that try to solve non-convex problems.

There is no global convergence guarantee, so work keeps coming out on optimization in different settings. As you now know, most of these problems are non-convex. Does everybody understand the difference now?

Gradient descent is the most famous optimization method in machine learning. I think it's pretty old, but it's an intro, a window into what optimization in machine learning is. So imagine you are blindfolded on a hillside.

We are seeking the minimum, at least a local minimum, with gradient descent.

We try to find the direction such that, if you move that way, you step down, so your loss function decreases. So, for example, I have a model I want to train, but which set of parameters do I start from: this one, or this one? If I'm on the right side of the figure, I try to step...

OK. This is the formulation of linear regression, and a and b are its parameters. I don't use theta yet, for simplicity. We know that a gives the slope. So if I have this linear regression and I change a, for example I put 2 there, it changes the slope. And if I add something to b, negative or positive, it shifts the line up or down.

So you see, when we change a from 1 to 2, we get a different value of y-hat; that's the right part. Then we measure this with the loss function. Now y-hat is further off, so the loss is larger. So we change the value of the parameter, theta, and the loss function changes.

Ready? And we repeat this until we reach a local minimum or the global minimum. Mathematically we write it as θ_{t+1} = θ_t − η ∇_θ L(θ_t): the new parameter θ_{t+1} comes from the previous parameter θ_t, minus the learning rate times the gradient of the loss function with respect to the parameters. Is it clear what is happening?
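The update rule above can be sketched in a few lines. This is a minimal illustration on a one-dimensional quadratic loss L(θ) = (θ − 3)², a toy example of my own, not the lecture's figure:

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimum sits at theta = 3.
def grad(theta):
    return 2 * (theta - 3)              # dL/dtheta

theta = 0.0                             # starting parameter
eta = 0.1                               # learning rate (step size)
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta)                            # ends up very close to 3
```

Repeating the same subtraction over and over is all there is to the loop; everything else in the lecture is about choosing η and computing the gradient.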

So, yeah. This is how we scale the effort we put into each step. We call that η, the learning rate; it's the step size. If it is bigger, you take larger steps, so you may reach the global minimum sooner.

If this step size, which we call eta, is bigger, the decrease per step is larger; you can land here. If it is smaller, you land here. But do you think we always go for the larger learning rate, or do we sometimes have to tune it?

So for example, if I choose the learning rate, the step size, as 2, it gets me from theta t to theta t plus one here. If I choose 3, it gets me further down. And if I choose something even larger, it can overshoot past the local minimum. And what happens if where we land is not the global minimum, but only a local minimum where we get stuck? We have to engineer this value and change it according to what stage we are at, what step we are at, and what we need.

Sometimes we have to go with a lower one so we don't get stuck in a local minimum; it happens a lot in regression studies. So that's the learning rate, and we have to think about what to choose. What about the next ingredient, the gradient, the direction of steepest descent? How do you think it helps us? It's the main thing. Am I going too fast?
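To see the effect of the learning rate concretely, here is a small sketch on a toy loss of my own, L(θ) = θ², comparing a small, a well-chosen, and a too-large η:

```python
# Effect of eta on gradient descent for L(theta) = theta^2.
# grad L = 2*theta, so each update multiplies theta by (1 - 2*eta):
#   |1 - 2*eta| < 1  -> converges;  |1 - 2*eta| > 1 -> diverges.
def run(eta, steps=50, theta=1.0):
    for _ in range(steps):
        theta -= eta * 2 * theta
    return abs(theta)

print(run(0.01))   # small eta: steady but slow decrease
print(run(0.4))    # well-chosen eta: fast convergence toward 0
print(run(1.1))    # too large: |1 - 2*eta| = 1.2, the iterates blow up
```

So a bigger step is not automatically better: past a threshold set by the curvature, the iterates overshoot more on every step instead of settling.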

Yeah, a bit slow. But the thing is: which way are you going, to the left or to the right? It depends on which side you are on. If you're on the right side of the figure, the gradient is positive. A positive gradient times the learning rate, which is always positive, is subtracted, so you go down, you go to the left. If you are on the left side, the gradient is negative. Negative times the positive learning rate, subtracted, gives a positive step, so you go forward, you go to the right.

So on this side the slope is negative, and on this side it's positive. So when you multiply the learning rate by the gradient of the loss function, this product is positive here, so you take a negative step, to the left, and vice versa.

Let me take simple linear regression: theta 1 times x, plus theta 0. We have two parameters. And x can be a vector, because different features contribute to the prediction, so there can be one parameter for each feature. As you can see, the gradient of the loss function with respect to the parameters is then a vector: the partial derivatives of the loss with respect to theta 1, theta 2, up to theta d.

And the magnitude indicates how steep the slope is. So the negative gradient gives the direction of the steepest decrease.
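As a sketch of what that gradient vector looks like for mean squared error on a linear model (tiny synthetic data of my own, assumed for illustration):

```python
import numpy as np

# MSE loss L(theta) = mean((X @ theta - y)^2); its gradient is a d-vector:
# dL/dtheta = (2/n) * X.T @ (X @ theta - y), one partial derivative per parameter.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # column of ones + one feature
y = np.array([2.0, 3.0, 4.0])                       # labels (here y = 1 + x)
theta = np.zeros(2)

grad = (2 / len(y)) * X.T @ (X @ theta - y)
print(grad)     # one entry per parameter: [dL/dtheta_0, dL/dtheta_1]
print(-grad)    # minus the gradient points in the steepest-descent direction
```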

So if we compute the gradient over all the samples at once in every iteration, it's batch gradient descent.

And stochastic gradient descent uses one random sample, while mini-batch gradient descent uses b random samples. So you can imagine: either I compute the gradient over all the samples, or I pick one random sample and compute the gradient on it, or I choose a specific number of samples and average the gradient over them. So we are playing with the number of samples.
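The three variants differ only in which rows they use for each update. A minimal sketch (synthetic data and helper names are my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def mse_grad(theta, Xb, yb):
    """Gradient of MSE computed on whichever batch of rows we pass in."""
    return (2 / len(yb)) * Xb.T @ (Xb @ theta - yb)

def step(theta, eta=0.05, batch="mini", b=16):
    if batch == "batch":       # all n samples
        idx = np.arange(len(y))
    elif batch == "sgd":       # one random sample
        idx = rng.integers(len(y), size=1)
    else:                      # mini-batch: b random samples
        idx = rng.choice(len(y), size=b, replace=False)
    return theta - eta * mse_grad(theta, X[idx], y[idx])

theta = np.zeros(3)
for _ in range(500):
    theta = step(theta, batch="mini")
print(theta)   # should approach the true coefficients [1, -2, 0.5]
```

Swapping the `batch` argument is the whole difference between the three methods; the update rule itself never changes.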

The complexity notation shows that the bigger the term inside the parentheses, the more expensive the method is. So for example, in stochastic gradient descent we only care about the number of parameters: the worst case is that you update all d parameters.

If you're doing batch gradient descent, besides going over all d parameters, you are going over all the samples for each parameter.

And you can see how they behave on a linear regression; there were 10 iterations of gradient descent, and maybe that's the reason for this kink. The mini-batch version, the green line, is faster than batch gradient descent: each update is cheaper, but an epoch contains more iterations.

And it's smoother than pure stochastic gradient descent, and benefits from GPU and CPU parallelism. So that's how we train machine learning models. For real-world applications, we usually use mini-batches and tune the batch size along with the other hyperparameters.

OK. So you know parameters; you play with parameters. But what are the parameters, exactly?

Obviously, we don't know their correct values in advance, but there are always values of the parameters that give us the best solution. So, when I want to minimize the prediction error: if I choose a to be 2, it's worse than a being 1, because it increases the error, so a should be close to 1. Say the best value for a is 1: then, over the different samples, the best value for this parameter is 1.

So this is the difference between a parameter and a variable. But what's a hyperparameter? It is something outside of the model; you don't have access to it during training, so you have to fix it by hand, or find another method to set its value. Eta is one of them. Why? Because it's outside of the training procedure: your model is linear regression, and there is no eta in the model. Eta only helps you find the best parameters. So it's outside the training, and we call such values hyperparameters.

Most hyperparameters show up in deep learning, in neural networks, in language models; we don't have many of them in classical machine learning.

Plain gradient descent makes slow progress along low-curvature directions. Now that we know how to move, we can think about a velocity term: momentum accumulates past gradients. You can study that too, and work out what the momentum coefficient should be.

As I told you, Adam is the default optimizer in most deep learning libraries, and it improves itself internally through this procedure. If you're interested, you can read the details, but everybody should know roughly how it works.
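A minimal sketch of the Adam update, using the standard published defaults β₁ = 0.9, β₂ = 0.999; this compact implementation and the toy loss are my own, not the lecture's code:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m), squared-gradient average (v), bias correction."""
    m = b1 * m + (1 - b1) * grad           # first moment (the momentum part)
    v = b2 * v + (1 - b2) * grad**2        # second moment (per-parameter scale)
    m_hat = m / (1 - b1**t)                # bias correction for the early steps
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Try it on the toy loss L(theta) = (theta - 3)^2
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):
    grad = 2 * (theta - 3)
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)
print(theta)   # close to 3
```

The division by sqrt(v_hat) is what gives each parameter its own effective learning rate, which is exactly why Adam needs less manual tuning than plain SGD.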

And a practical guide: whatever project you have, you can start with Adam. Then, if it doesn't work, or you want to control things manually, go for SGD with momentum.

In what situations would people avoid Adam? Adam is a self-tuning method; it has turned into a black box.

But internally it has momentum and it corrects its biases. So if you want to see, OK, probably I'm hitting a lot of saddle points, probably my model doesn't work well, this Adam kind of hides that from you. You can't easily understand what's going on and find the solution to the problem.

Given the label y, we want to calculate the predicted value for y; it's a prediction. So we are fitting a model. In the training phase, we call it fitting the model: a machine learning model is fit after we train it with the data, and then we get its results. So when we say overfitting, it means your model is learning, but it learns the training data too well. Then, when you want to test this model on something else, it cannot give you good predictions.

Say you have 1000 samples, your model's parameters are optimized as well as possible on them, and then you want to test it on a different dataset. The results you get are much worse, because the model knows only this training data. So, by error we mean the difference between prediction and label: you have a training dataset and a testing dataset, and when you calculate the error on the training dataset, we call that the training error.

And when it's on the test set, you get the test error. As you can see, when you overfit, it means you are fitting the training dataset too well: the training error comes down, but the testing error goes up. This is a problem; you don't want it. So you see, the test error sits far above the training error, and it shouldn't be like that, because we train the model so that it performs well in the testing phase.

So we should be careful about how strict we are when fitting the training data. There is a solution for this problem and also for the underfitting problem; underfitting is the reverse of this. Underfitting is that we don't fit enough.

When you underfit, it's as if you don't care about the error: the test error goes high, but you don't control the training error either, so you're not learning much.

We call that a polynomial equation: polynomial regression. It's possible to have a model like this.

But it learns the curve of the training dataset so well that it doesn't work well on the test dataset, because when you bring another dataset here, the fit is much worse.
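A small sketch of this phenomenon (synthetic data of my own; degree 9 is chosen arbitrarily to force overfitting on so few points):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=15)
x_test = np.linspace(0.02, 0.98, 15)   # fresh inputs from the same range
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=15)

coeffs = np.polyfit(x_train, y_train, deg=9)   # high-degree polynomial: overfits
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(train_mse, test_mse)   # training error is small; test error comes out larger
```

The high-degree curve chases the noise in the 15 training points, so it reproduces them almost exactly while missing the fresh test points.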

So the solution for this, which I would say is a cornerstone of the field, is regularization.

With it, the model doesn't overfit or underfit as much. The regularizer is that term you see: we write it as R(θ), and there is a lambda. The first part is the data fit; that's where we fit the model.

And lambda is the factor controlling the strength of the regularization.

And this is pretty important: there are different regularization techniques, L2, L1, elastic net, which is a combination of L1 and L2, and dropout.

We usually use dropout in neural networks and L1 and L2 in classical machine learning. When you use Python or other languages for machine learning tasks, the usual default for regularization, if you add one, is L2, also called ridge or weight decay. It shrinks the weights towards zero, so the fit won't wiggle like this; it leans toward a simpler curve.
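A sketch of L2 (ridge) regularization in its closed form, θ = (XᵀX + λI)⁻¹ Xᵀy; the toy data and the λ value are my own, picked only to make the shrinkage visible:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=50)

def ridge(X, y, lam):
    """Minimize ||X theta - y||^2 + lam * ||theta||^2 via the closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_plain = ridge(X, y, lam=0.0)    # ordinary least squares, no penalty
theta_l2 = ridge(X, y, lam=100.0)     # strong L2 penalty shrinks the weights
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_l2))  # second norm is smaller
```

The larger λ is, the more the penalty dominates the data-fit term and the smaller the weight vector becomes.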

The first famous one is vanishing or exploding gradients: the gradients, as we compute them, become extremely small or extremely large in deep networks.

The fix is to initialize the weights carefully.

There are famous initialization schemes and learning rate schedules, and we can use batch normalization and residual connections. I know these terms are probably unfamiliar, but they are the names of techniques we use in deep learning, and in machine learning, to solve these issues.

The other problem can be saddle points. If you are interested in the math side of optimization: flat regions slow down the convergence, and in non-convex optimization the loss surface has very different curvatures in different directions. Feature normalization and adaptive optimizers help here.

So we can go for larger batch sizes; we can do gradient clipping, meaning we cap the gradient before taking the gradient descent step and move on from there; or learning rate warm-up, which is a procedure for ramping up the learning rate at the start.
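Gradient clipping by norm can be sketched as follows (the helper is my own illustration, not a library function):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale the gradient if its norm exceeds max_norm; leave it alone otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])            # norm 50: an 'exploding' gradient
print(clip_by_norm(g, max_norm=5.0))  # rescaled to norm 5, direction preserved
print(clip_by_norm(np.array([0.3, 0.4]), max_norm=5.0))  # small gradient unchanged
```

Note that clipping changes only the length of the step, never its direction, which is why it tames exploding gradients without redirecting the descent.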

So we want to predict the house price y from features x using a linear model. x is the feature vector and y is the label, and the thetas are the parameters of the model. So consider you have different features. For example, for putting a price on a house, what features can you think of?

We have categorical and continuous variables. That's okay; we can use all of them at the same time. But each of them has a different role in the optimization, so each of them has a different parameter. We consider we have d features, so we have d parameters.

The prediction function, exactly. It's a function that takes the input x and gives you the prediction h_θ(x). As you can see in the loss, in the mean squared error, you have h_θ(x_i): it gives you the prediction. When you subtract y_i (and what's y_i? it's the label), it gives you the error. And if you square this and average over all the samples, it gives you the loss, the mean squared error.
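The mean squared error described above, written out on toy numbers of my own:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: average of (prediction - label)^2 over all samples."""
    return np.mean((y_hat - y) ** 2)

y = np.array([2.0, 3.0, 4.0])        # labels y_i
y_hat = np.array([2.5, 3.0, 3.0])    # predictions h_theta(x_i)
print(mse(y_hat, y))                 # (0.25 + 0 + 1) / 3
```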

For linear regression, we should know that the loss is convex. What does it mean that the loss is convex?

It means that the global minimum is the same as the local minimum, so gradient descent will surely converge to it. And the closed form of the solution is that one, the normal equation θ = (XᵀX)⁻¹Xᵀy. So this equation finds the best parameters of the machine learning model directly.
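The closed-form (normal equation) solution can be sketched on toy data of my own:

```python
import numpy as np

# Fit y = theta_0 + theta_1 * x exactly via the normal equation.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x                          # noiseless line, so theta = [1, 2]
X = np.column_stack([np.ones_like(x), x])  # prepend the intercept column

# Solve (X^T X) theta = X^T y instead of forming an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)    # recovers [1, 2]
```

In practice we still use gradient descent for large problems, because forming and solving the d-by-d system gets expensive as d grows; the closed form is exact but does not scale the same way.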

So, I would say my key takeaway from today is that training ML models means solving an optimization problem. The loss function defines what good means; the optimizer finds it. This is the difference between the optimizer and the loss function. And the gradient is the backbone of optimization.

And mini-batch stochastic gradient descent: if you know it well now, that's one of the best ways to work with gradient descent. It makes gradient descent practical for large datasets; you can use it in whatever task you have.

Then you can use adaptive optimizers to auto-tune per-parameter learning rates. The learning rate is the single most important hyperparameter in machine learning.

And regularization, as we learned, prevents overfitting by penalizing complexity.