Transcript
So it's totally different. Another very common classifier is the k-nearest-neighbor (k-NN) classifier. The K here is actually a hyperparameter of k-NN, but we'll talk about the meaning of K later. So consider a classification task, especially multi-class classification. The input is a set of training examples X1 to Xn, each with a label.
And then we want to classify a new test point. How do we classify it? For the models we have seen so far, what we do is first use the training examples to learn decision boundaries, and then classify a data point according to which region it falls into.
Okay, so here you see that whether it is a linear model or a nonlinear model, the first step is always to learn the decision function from the training samples and their labels.
k-NN works differently. For a test point XT, say with K equal to three, we first find its three nearest neighbors among the training samples. And then, once we have these nearest neighbors, we let the K nearest neighbors vote.
So the first step is to find the K nearest neighbors. Of course, one question is: how do we define the nearest neighbor? What does "nearest" mean? We need a similarity or distance measure. For instance, we can use the Euclidean distance, the cosine similarity, or even a Gaussian kernel.
Right, so the first step is to measure the similarity, and the second is to find the K nearest neighbors. How can we find them? We have n training samples, and to find the K nearest neighbors we just need to first compute the pairwise similarities between XT and all the training examples, and then sort the scores.
First, compute the pairwise similarity between XT and all the Xi.
Then we sort the similarities, and take the top K.
You can see that each sample here is d-dimensional. So computing the similarity between one Xi and XT costs O(d), and we need to compute all n similarities, so the time for testing is O(nd).
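The brute-force procedure just described can be sketched as follows. This is a minimal illustration, not code from the lecture; the data, the function name, and the use of Euclidean distance are my own choices for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point by majority vote of its k nearest
    training samples (Euclidean distance, brute force)."""
    # Distances between x_test and all n training samples: O(n*d)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances: the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # class 0 cluster
                    [2.0, 2.0], [2.1, 1.9]])              # class 1 cluster
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15, 0.1]), k=3))  # → 0
```

Note that nothing is learned here: the "model" is just the stored training set, and all the work happens at prediction time.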
In the first option, we assume that every neighbor has the same weight. So in this example, among the five nearest neighbors, three are blue and two are green; with equal weights, we classify the test point as blue by majority vote. But what if we also look at the corresponding similarities, or equivalently the distances, of these five neighbors?
In this case, if we still apply option one, we assume the neighbors have the same weight and classify by majority vote. But the two green points are actually much nearer. So the second option is: we do not force the neighbors to have the same weight; instead, the nearer a neighbor is, the larger its weight.
So here, we use the similarities as the weights of the test point's nearest neighbors.
And even though there are three blue points, the total weight of the three blue points is actually smaller than the total weight of the two green points. So if you consider the weights, you should classify the test point as green instead.
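The weighted vote of option two can be sketched like this. It's a minimal example with made-up distances; I use inverse distance as the weight, which is one common choice among several (the lecture's "similarity as weight" would work the same way).

```python
import numpy as np

def weighted_knn_vote(dists, labels):
    """Option 2: weight each neighbor by inverse distance, so nearer
    neighbors count more. dists/labels are for the k neighbors only."""
    weights = 1.0 / (np.asarray(dists) + 1e-12)  # avoid division by zero
    score = {}
    for w, lab in zip(weights, labels):
        score[lab] = score.get(lab, 0.0) + w
    return max(score, key=score.get)

# Three "blue" neighbors far away, two "green" neighbors very close:
dists  = [5.0, 5.0, 5.0, 0.5, 0.5]
labels = ["blue", "blue", "blue", "green", "green"]
print(weighted_knn_vote(dists, labels))  # → green (unweighted vote says blue)
```

With weights 0.2 each for the blue points (total 0.6) against 2.0 each for the green points (total 4.0), the two near green neighbors outvote the three far blue ones.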
Right, so if you want to classify some test sample XT, the first step is to compute the similarities between XT and all training samples, and then we find the top K and apply the voting rule. You can see the difference between k-NN and the other algorithms: there is no training at all. We only need the training samples and their labels.
So k-NN actually has no training procedure. Okay, no training procedure at all. On the other hand, the cost is that to classify each test sample, the complexity involves all the training samples, because computing the similarities to all of them costs O(nd). If n is very large, of course, this complexity is a problem.
Right, this is the fundamental cost of k-NN, so of course we would like to make it more efficient. Actually, I would like to give you one minute to think about this: if you want to make k-NN efficient, do you have any solution? We know the bottleneck is computing the similarity between the test sample and all the training samples.
Here is an analogy. There are about 30,000 post offices in the US, each with a latitude and longitude. If you want to find the nearest one naively, you need to compute the distance between your location and all 30,000 post office locations. But of course, there's no need for you to consider post offices far away in the east or in the west, right?
We can avoid that with the following idea. At query time, we don't need to compare with all of the post offices. Instead, we can build some landmarks, one for each state. This is kind of like treating each state as a cluster: first, generate one landmark per state, meaning we find the center point...
...of the post offices in that state, right? And then we assign the post offices to their nearest landmarks. That's the training stage.
At query time, we first just compare our location with the landmarks.
Because in the training stage, in step 2, we already know which post offices are near which landmark. So we find the nearest landmark, and then we only compare with the post offices assigned to that landmark. You can see that the time complexity is now reduced from the number of training samples to roughly the number of landmarks plus the size of one cluster.
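The two-stage landmark search might look like the following sketch. The grouping, the point coordinates, and the function names are all invented for illustration; here the cluster assignment is given by hand rather than learned.

```python
import numpy as np

def build_landmarks(points, assign):
    """Training stage: one landmark (centroid) per group, plus the
    membership list of each group. assign[i] is the group of point i."""
    groups = {}
    for i, g in enumerate(assign):
        groups.setdefault(g, []).append(i)
    landmarks = {g: points[idx].mean(axis=0) for g, idx in groups.items()}
    return landmarks, groups

def nearest_via_landmarks(points, landmarks, groups, q):
    """Query stage: compare q with the m landmarks first, then only
    with the members of the closest group -- roughly O(m + n/m)
    distance computations instead of O(n)."""
    g = min(landmarks, key=lambda g: np.linalg.norm(landmarks[g] - q))
    return min(groups[g], key=lambda i: np.linalg.norm(points[i] - q))

# Two "states": points near (0,0) belong to state A, near (10,10) to B.
pts = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.], [11., 10.]])
lm, gr = build_landmarks(pts, ["A", "A", "A", "B", "B"])
print(nearest_via_landmarks(pts, lm, gr, np.array([10.2, 10.0])))  # → 3
```

One caveat of this scheme: if the query sits near a cluster boundary, the true nearest neighbor may live in a neighboring cluster, so the answer is approximate.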
Actually, there are also some other efficient algorithms, like the k-d tree and locality-sensitive hashing, but we won't go into them here.
The strategy of cross-validation is actually used for all types of machine learning algorithms that have hyperparameters. So let's first look at what a hyperparameter is. The prefix "hyper" means these parameters cannot be learned with the model.
Like I said, we try to learn the decision function, and this function basically contains the model parameters. The model parameters can be learned through the training procedure. Hyperparameters, in contrast, cannot be learned by the machine learning model itself, and I want to explain why with some examples.
Let's see what the typical hyperparameters are. In polynomial regression, there is the degree of the polynomial...
...which we actually need to specify, right? In regularized linear regression, we have the regularization coefficient lambda. If we run kernel regression with a Gaussian kernel, it has a bandwidth sigma. If you use a Gaussian kernel in your SVM, you also need to specify sigma.
And in stochastic gradient descent, we have the step size alpha, also called the learning rate; that is a hyperparameter.
And we will also give some introduction to deep neural networks in future slides; they have their own hyperparameters, like the number of layers and the number of neurons per layer. And in the k-NN classifier, we have the number of neighbors K. All these hyperparameters are really important: if you set these values inaccurately, then of course the performance can be very bad.
All right. So the terms overfitting and underfitting can actually be illustrated by this figure, where the x-axis is the model complexity and the y-axis is the error.
We know that if the model is very complicated, then the training error can be very small, even zero. But a very complicated model tends to overfit the data, so the test error is very high.
If the model is very simple, the training error is very large; that is underfitting.
Somewhere in between, the test error is small. In the polynomial regression example, the degree-1 (linear) model is very simple, the degree-15 model is very complicated, and the degree-4 model is about right. We will just use this polynomial regression model as an example to show the idea of how we can leverage cross-validation to decide the optimal value of the hyperparameter, the degree.
Right, so which one is the optimal? Like I said, the optimal is basically the model that gives the smallest mean squared error. If we compare across the polynomial degrees, the degree-4 model gives the smallest test error, so it seems the degree-4 model is the best.
But actually, this is a common mistake, and I want to emphasize it here.
I want to say that this is a big mistake. Actually, even published papers have made mistakes like this, and it is totally wrong. What's wrong? When you compute the mean squared error on the test set to choose the hyperparameter, that means the test labels are already being used. But if you want to apply the model in practice, of course you don't know the test labels.
The test labels are what you want to predict. Okay?
It's problematic because the test labels are unavailable in practice. But even if you happened to have the test labels, you still cannot do this. And that's why we use cross-validation to address this issue. So here, we really only know the training samples.
We really only know the training samples, and we also know their labels. So a very common strategy is that, instead of using the test samples, for which we don't have labels, we take the training samples, whose labels we know, and separate them into...
...two portions. The first part is called the training samples, and the second part is called the validation samples.
So here, you can see that the degree-4 polynomial regression produces the smallest validation mean squared error, so we choose degree 4.
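The hold-out validation procedure can be sketched as follows. The data here is synthetic (a cubic with noise, my own choice), and the candidate degrees are illustrative; the point is only that the degree is chosen on the validation split, never on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 60)
y = x**3 - x + rng.normal(scale=0.2, size=x.size)  # noisy cubic data

x_tr, y_tr = x[::2], y[::2]      # training portion
x_va, y_va = x[1::2], y[1::2]    # held-out validation portion

def val_mse(deg):
    """Fit a degree-`deg` polynomial on the training portion and
    return its mean squared error on the validation portion."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    pred = np.polyval(coeffs, x_va)
    return float(np.mean((pred - y_va) ** 2))

degrees = [1, 2, 3, 4, 15]
best = min(degrees, key=val_mse)
print(best, val_mse(best))
```

A degree-1 model underfits this data badly, so its validation error is much larger than that of degrees near the true one; the selected `best` depends on the noise draw.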
Now I'll show a common cross-validation strategy, K-fold cross-validation, which lets us maximally use the available labelled data for both training and validation.
How can we do that? The first step is to define a predefined set of hyperparameter candidates, say P, and P can be {1, 2, 5, 8}. And then we randomly partition the training set into K folds. We use K minus 1 folds for training...
...and the remaining one fold for testing, or rather for validation.
And then we compute the average error over the K runs, because when you partition the training set into K folds, each fold takes its turn as the held-out fold, giving K different combinations.
And this average error over the K repeats is called the validation error. And then we just choose the hyperparameter value that gives the smallest validation error.
You can see that across these K combinations, each fold is used for validation exactly once, so we maximally leverage the labelled data.
For example, with K equal to 10, for each run you have 9 parts for training and the remaining part for validation, and then you can compute the error for each run.
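The K-fold procedure just described can be sketched like this (again on synthetic cubic data of my own choosing, with polynomial degree as the hyperparameter and K = 10):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 100)
y = x**3 - x + rng.normal(scale=0.2, size=x.size)  # noisy cubic data

K = 10
idx = rng.permutation(x.size)
folds = np.array_split(idx, K)       # random partition into K folds

def cv_error(deg):
    """Average validation MSE over the K repeats for one degree."""
    errs = []
    for i in range(K):
        va = folds[i]                                   # held-out fold
        tr = np.concatenate([folds[j] for j in range(K) if j != i])
        coeffs = np.polyfit(x[tr], y[tr], deg)          # train on K-1 folds
        errs.append(np.mean((np.polyval(coeffs, x[va]) - y[va]) ** 2))
    return float(np.mean(errs))      # the cross-validation error

degrees = [1, 2, 3, 4, 8]
best = min(degrees, key=cv_error)
print(best, cv_error(best))
```

Note that the same random partition is reused for every candidate degree, so the comparison between degrees is fair.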
Right, so to summarize: for k-NN, there is no training, and the time for testing is O(nd), which we can reduce by introducing landmarks or other index structures.
And to tune hyperparameters, we use K-fold cross-validation. Basically, we split the training data into folds, train the model on some of them, and evaluate on the held-out fold instead of the test set. All right.
Cross Validation