Lecture Script

Summary

Text Classification Overview

Definition and Problem Setup

Text classification involves assigning a document to one of a fixed set of predefined classes
Input: document d and fixed class set C = {c₁, c₂, …, cⱼ}
Output: predicted class c ∈ C
Applications include spam detection, sentiment analysis, gender identification, authorship identification, language identification, and topic categorization

Approaches to Text Classification

Rule-based methods: Use expert-defined rules based on word combinations or features
- Can be highly accurate when carefully designed
- Effective for simple problems (e.g., basic email classification)
- However, rule maintenance is costly and scalability is limited
Supervised machine learning: More commonly used approach
- Requires labeled training data showing examples of documents with their correct classes
- System learns patterns from training examples to build a classifier

Data Requirements and Training Process

Supervised Learning Setup

Requires annotated corpus with documents labeled by class
Dataset typically divided into training set, validation set (optional for hyperparameter tuning), and test set
Training set used to learn the classifier, test set for final evaluation

Automatic Labeling Strategies

Large-scale manual labeling is impractical
Can leverage existing labels in online data (e.g., star ratings, review scores, emoji reactions)
Example: Amazon or IMDb reviews come with ratings that can serve as sentiment labels
Social media posts with emojis can indicate sentiment automatically

Feature Extraction: Bag of Words

Core Concept

Converts variable-length text documents into fixed-size numerical feature vectors
"Bag" metaphor: toss all words into a bag, pull them out and count them
Process:
1. Define a vocabulary/dictionary of relevant words
2. For each document, count occurrences of vocabulary words
3. Create vector where each position represents a word from vocabulary

Key Characteristics

Major limitation: loses word order and positional information
Words are treated independently without considering relationships or context
Results in sparse vectors (most values are zero) when vocabulary is large
Computationally efficient and surprisingly effective despite simplicity

Vocabulary Selection Challenges

Using entire English dictionary is often too broad and includes irrelevant words
Social media text includes abbreviations, intentional misspellings, and custom terms not in standard dictionaries
Need domain-specific vocabulary tailored to the classification problem
Stop words may or may not be useful depending on the task
Emojis and special characters may need to be included for certain applications

Representation Variants

Binary representation: 0 if word absent, 1 if present
Count representation: 0 if absent, integer count if present
Choice depends on application:
- For spam: one occurrence of "Viagra" may be sufficient indicator
- For sentiment: multiple occurrences of "like" may indicate stronger positive sentiment

Practical Issues

Synonyms (e.g., "soccer" vs "football", "buy" vs "purchase") treated as different features despite similar meaning
Without preprocessing, semantically similar documents may have very different vector representations
Can compare documents using vector similarity measures (e.g., Hamming distance, cosine similarity)

Naive Bayes Classifier

Probabilistic Framework

Core question: Given a document, which class is most probable?
Mathematically: Find class y that maximizes P(y|x)
Apply Bayes' Rule:
- P(y|x) = [P(x|y) × P(y)] / P(x)
- P(y) = prior probability of class
- P(x|y) = likelihood of document given class
- P(x) = evidence (constant across classes, can be ignored)

MAP Decision Rule

Goal: Find class that maximizes P(y) × P(x|y)
MAP = Maximum A Posteriori probability
Since P(x) is constant for all classes, can ignore it in argmax calculation
Decision rule: y_MAP = argmax_y P(y) × P(x|y)

The Naive Independence Assumption

Document x = (x₁, x₂, …, xₙ) is a sequence of feature values
Key assumption: All features are conditionally independent given the class
P(x|y) = P(x₁|y) × P(x₂|y) × … × P(xₙ|y)
Reality: This assumption is false—words are not independent (e.g., "New York" often appear together, "happy" and "camper" correlate)
Rationale: Makes computation tractable and surprisingly effective despite being "naive"

Parameter Estimation

Prior probabilities: P(y) = (count of documents with label y) / (total documents)
Likelihood probabilities: P(word|class) = (count of word in class) / (total words in class)

Spam Detection Example

Vocabulary and Setup

Example vocabulary: {I, rolex, own, replica, watch, buy, cheap}
Each document represented as 7-dimensional vector
Training set includes documents labeled as HAM or SPAM
Model training = computing all prior and likelihood probabilities

Classification Process

For each test document:
Test multiple documents to evaluate classifier performance

Example Classification Results

Document with certain word combinations yielded non-zero SPAM probability → classified as SPAM
Another document yielded zero for one class → classified as the other class
Model makes predictions but needs evaluation to determine accuracy

Notes

Transcript

Not even scratching the surface is what we could do. So some of those that I mentioned here isThe classification you have to possibly-We write the same problems, different labels, different input. So How would we know about that? At the most basic level. For example, if you were to develop something that decides, is looking at a document and decides this is an email or not, right?

Would you need a neural network?

Some simple problems could beOh. subject were rule-based So it should be like that. Advising methods. Always open your mind when you can. I don't mean a big deal. a solution go for something similar. Could this be actually very effective? How about we apply that spam filter? The same approach. A rule that is found There you go. So there are problems where you will have an argument problem straight through the The system has to continue to be about is sound detection is part of that.

We can't just have a lexicon of words. If you see those words, this is what I'm going to say. Because all the words are changing, intentional mis-stallings and whatnot. but There will be problems where just a rule-based approach might work. Just like that. Now, More often than not, that is approached using some supervised machine It's not much of a machine learning class, but I'm sure that you are familiar with what Super 5 is.

you You're showing examples to your system. What you sell belongs to category X. So that means a little update to our system. The problem is on top of having a lot of documents and a set of classes, we also have a A training sense in our... The situation will be an annotated corp trying to have a seat. There's a series of documents that are also coming with a correspondingLadies and gentlemen. The standardized machine will produce a classifier based Everybody knows how How the training and tests would happen?

Or a supervised machine learning. Got the same big... Data set divided into training or test sections. What about the validation?

So validation is not a necessity when you're building a supervised machine anymore, but Should you choose to build a company?

proto-moles, a set of them with different parameters, you can use a validation test. intermediate to test the symptom. Where do you actually start? Building the final form. Stop. Let's do some sentiment analysis. So for sentimental analysis in supervised machine learning you would need a bunch of documents that are labeled.

A smiley face is a...

Just as good a label as a thumbs up or positive, negative. Does that make sense? That's actually given the internet and Tons of data out there.

Do you need... So remember, we need a code. Individual documents. Label, classifier, this. Here's a document that belongs to. Ready? And you want to do that by hand. And you do it without it. I'm not giving you any data set. Hey, just go on Reddit or go on IMDb or go on Amazon. Scrape the data. Will you be able to create an annotated dataset? Automatic.

All the reviews come with the date, the number of stocks.

Okay. Not exactly, but it goes. But first, Let's take this drone fax. and try to turn it into a numerical representation.

So we're dealing with text. I'm giving you a document of any length. And As you can imagine, you're probably better off if you have a fixed size feature vector Like some numerical structure of a fixed size of a ball. The old classifier. It's pretty easy when you're dealing with, I don't know, like, the only thing that we expect to have here, a future vector will be critical mass. What text is that?

Here's the dumbest idea that it works that we can start with. A bag of words, right? What is a bag of words? The way of representing any sort of actual document by a fixed size feature vector. So by any Thank you. It could be a few paragraphs of an article one way, or it could be more of these. You can distill all three into a same size test vector. Actually very easy. Okay. Take a document. Thank you.

The bag of words comes from a bag that you just Toss all the words or all the topics into a back and then you start pulling them out And count them. What's the problem with this approach? Oh, he's already again.

It's actually very interesting.

In a computational sense, it's very easy to use. efficiency in terms of accuracy of your model. That's actually a completely different story. Just the process of it. There you go. There's a bad problem at the bottom. The moment I toss everything into it. ...sictional bag. I lose. Relationships. So how is that going to work? Well... First of all you have to define a dictionary or a lexicon that you will be using.

Set of words that you choose. choose be valuable for your classification problem. Would you possibly take into account entire English? Dictionary of the D Okay. Give or take. Any kind of communication problem. Is this... This became an entire English-Too much.

People are misguided. intentionally because They have that character on it? So as a human reading it, you understand? What they wrote, what those... abbreviations and whatnot won't show up in the dictionary.

Okay, how about mixing a new dictionary with some... Customized Twitter or X dictionary that you've learnedExperience.

You see the problem, don't you? If it's not even a problem, it's a challenge. You have to make some decisions. Now, the bigger this-The bigger my dictionary is, the bigger my vector will be. That's always a problem, right? And you will see in a moment that it actually goes a little beyond just Just a storage problem. So In general, How would you select your List of words that you care for.

No specific problem, right?

You can wrap something, right? Remember stop words. It all depends on the problem. It is unlikely that this guy will be so bad. Thank you. You want that vector of information. The culture of fascination rules and variation Does that make sense? What about emojis? There's going to be a ton of work needed. Thumbs up, right? Thank you. You should probably. Another question for you. So this is work. This is one possible application of acclimatization actually good.

Become heavy. All right. So, once you have your list of words, you will simply Take that document, toss all the words into the bag, And then we'll be just picking out words and checking, do I have it on the list? If I do have it on the list, let's update the template. This is for better or worse the most basic representation of a document. How often is there a thing you want to do? Open. There will be some words that are in the-say Adventure.

Adventure is not on my list, so... What I show you right here is one possible way of building a bag of work factors and You could, though, have a binary version. There is a word like this in my document or not. Which one is better? Let's consider a discernment, SPAT, right? You have your spam award. Thank you, Don. Bye, Murray. That works versus a binary value. In the end,Spot all 10 instances of Viagra in your email.

-Like these, that one will one suffice. One is most likely, one instance is case in that, right? What about Hope you are healing. I liked it. I liked it very much. My friend liked it better. Number of lights, wood. Would the count of the word like nudge you towards, oh, this is more positive than other reviews?

I'm going to tell you to use the binary back and forth versus non-binary. Any other thoughts? Okay. We have a numerical vector that somehow represents a document in a numerical fashion. Can we use it to compare documents that way? Take two back of Morse vectors. To compare two documents.

What is a vector given?

Right. According to some measure,Thanks. So this is a very, very important But for now, Let's just observe that. If I did a semi-opaque job with my bag of words, a document, representation, I can do some comparisons. Is that far so good? Okay. By the way, you didn't expand that idea in And rest. All right. Your classifier, whatever that model is, whether it's a neural network or a U-bay as classifier, will essentially be some Function with you.

I imagine you are familiar, are you familiar with the Hamming distance? You're worried about the distance. I'm at half a distance. One-star shirt, there's ways to-This is the thing. Medical instructors. We will focus on two possible What about the cosine? Is this? Are you familiar with that measure?

Sorry, I asked you a moment ago. about the possible problems. If you haven't thought about those,Vocabulary has Individual words from soccer and football. Well, if my document-I will have a number eight. Of a CEO or vice versa. All right. That means that I will have better than. Different Thanks. Areas of my vector space, right? Even though Those documents technically might be discussing the same thing and they should be very close to one another in that case, right?

Does that make sense? Your bag of words, unless you clean up your vocabulary, unless you do something processing, so for every instance of self-created alternative, you do football at one time, you will deal with that problem. That new one that you're thinking about, How about these two sentences? Buy an old desk, don't want to purchase a new PC. Everybody knows. Bracket everybody knows what that is.

Technically the same thing, right? Depending on the vocabulary of the children, they will have completely different vectors. Go again. What was the directions in her house? That's not good, right? So now we work. -Will be Later on we'll cover them more. -Vector representations. Let's stick to the back. Okay, so back to our classification problem. We have our documents and we know how to turn them into kind of words representations, right?

It's pretty easy. or visionary. So they are the document and the briefings is done. Now back to our classifier. In the document, If it's classes, they'll fill something up. They'll pick one of those classes. I changed the notation from D and C. X or Y because it's the society. Thank you. All right, so there's a question. In English, given a document which category class that belong to? For which category of classes What kind of mathematical questions Are we answering here?

Let's say this document could be less than or equal to plus 2, plus 2, plus 3, plus 3. How will I know? How will I determine the probability? Probability. Very good. Good. you Which class is most important? So long story short, This is what we are interested in. Actually, in multiple of those. Every class, yes. One plus three. Three plus two. Answer that question, stack them up, and you have your answer.

How do we? You have a conditional probability. Okay. Start with a basel, right? If we apply the basel in the document in the figure A bar,Label class. Give him the document. What is its contemporary? Express by... How likely is thatDoctor, will you please get that ready? How likely is the probability? How likely is that the figurine itself? How likely is the document itself? So far so good. Where can we get those?

Remember our starting point. I, you, have a corpus. That documents and every document in that office has an assigned label on it. I already have something to work with. Problem. It's actually going to be Thank you for the news. Right. Document. Vector. Scalar label. No words. This Given a vector representation of some documentYou're gonna breathe. that this is our least concern. What the? Remember, we are trying toTwo, probability one.

Y3, YN. I'm feeding the document to the glassifier and I'm going through all the possible labels trying to pick the one that I'm using the highest probability. So this is aHow about this part of the building? What can we do about it? Do I actually care about the number itself? A realistic it is. Same problem as before. I don't care about it. I want to stack. the probabilities or that we're able to The label.

So... Let's go. Here we go. Okay. We still want to... Find the probability that actually the label maximizes this expression, but that expression will be maximized, the same label will maximizeThe ratio right here and... Just enumerate it, correct? The probabilities are positive. So what if I just ditch? I'm fine. The only change that happens here is that I move from the quality sign to this little What is that thing?

proportion MAP stands for maximum posteriority Maximum posterior probability. We will not be going for the food form, the map component, but it means the problem or the value that would most likely generate them. You All right. So far so good. We introduced our problem. You want to find Sasha? Class. Among all classes that maximizes This expression. How do we calculate that? Before I proceed, is it clear how are we slowly, what are we trying to achieve?

Sorry, the, the... I want people to estimate that for multiple layers. Following the Bayes rule, I get this and my document is actually a vector. individual values. And that is a constant I can do. So, A given B times probability of B. given me, well, if I have this joint probability with a sequence of little pieces of information, right? This is aFirst element in my vector, second element in my vector, Finally at the end I have a label.

There's nothing that is stopping me from Yes. Putting in, putting the, sequence of values in a different spot. Does that make sense? Using the same What a difference. Now. Now this is my A. This is like... And this is my heat. So far so good. Just a moment. Where is this at? Now? Compare. That expression equals this expression right here on the right. What can you tell me? What's the difference between the ballpark joint probability from abilities.

But they're different. Now I can play this game again. I can do that, apply the protocol to this.

And if I keep doing that, I will know what the chain was doing. This is where I started. Right?

There's a conditional answer. one, even everything else. All right.

Times another general probability without x1.

Now if I unpack that. I have an X2 human, whatever else? Now x1 and x2 is gone. I have a nice little product. Okay, everyone can see that? for every individual component Oh, my future background, or my back and ears background in this case, has its ownAs found in conditional probability. Yes, this is dependent on the rest of the vector and some assumed And finally at the end I have just the probability of of the labor.

Am I losing you or not? Right, this is where the naive thing is comes into play. If you remember what independence is, it means that two variables are independent. That means that they have no conditional independence If I know oneWell done. Pieces of evidence, the rest of evidence pieces are irrelevant to you. Correct. The naive Bayer's assumption is naively assuming conditional independence. Meaning of life.

And breathe. One of those. -Also specific. are of a minor teacher. We are assuming that once I know The label, the rest does not matter. In other words, under the naive Bayes assumption, all future value, future vector values or individualWords are independent. Is that true? Absolutely not. They are dependent on one another. New York shows up in your future vector for a reason. Usually goes together, right?

A happy camper, it's bad, it comes together. They're not independent. The naivety in the naive phase of socialization Take it. Take this assumption. Assume that independent elements of your inputs All the elements in your input are independent from one another. You know that it's not true. For the sake of, hey, I don't know how to calculate it, I don't know how to calculate it, I don't know how to estimate it, let's make a simplification and turn every single one of those And two.

A very simpleCouldn't this one work out? Does that make sense? Everybody knows what a naive base is now? are naively assuming independence in all they know. Like you're looking at a human being and you're trying to judge the future behavior, predict the future behavior, right? Are they aggressive? Are they... they have a criminal history or things like that. You can look at them, right? But with the main thing is you're detaching aggressiveness and criminal history and whatever, you're saying those are not related at all, right?

A fairly glass or label. I need to find all the... Be sure to give them the label. Under this assumption, Turn our original maximization problem in. Aversion on it, right? Formal abilities. So, a training set. Let's do someSpam, but green. This is my training zone. A set of documents? of the corresponding label. That work? If I use the bag of words for representation, meaning that first I will establish a vocabulary dictionary for my problem, Correspond to a fixed length vector.

I can replace every document with that vector. So far so good. I still have a corresponding length.

And consider Actual How about using words such as "Rollings" and "Reflector" for your smartphone? I haven't looked at some messages in a long while, so I don't know if I remember "Rollings" and "Reflector." Does that make sense? That works as a scant words Awesome one. Okay. So now let's say that we have a document that that does not have those words and has a label and, which is not Spanish, and so on and so forth.

that document was labeled by someoneor actually someone using your mail system label an email with Rolex and reputas and business decadence. This is how you got that. That works?

Alright, so I have a bunch of documents. Somewhere like a whiskey. Pam? Right leg forward, left step. How do we build a class of us? Well, remember when we came up with Our classifier will tell you What the label is, but by checking All labels. The gains of that formula And the--Which one is max? That's my answer. The one that maximizes this formula is a win. All right. This, on the other hand, that expression is really-There's a probability of a label occurring.

And then probability of a certain feature. ... such as the word "erolism" or "repulia" appearing in your document. givenLabel. So far so good. We don't have those probabilities, but they can be very easily--Estimated and you know The probability of the label is probably the easiest. Number of all the samples in your corpus are labeled. Four of them. What were the total number of samples that were had Four over.

That's how likely a certain label is. Does that make sense? The other one gets two over seven. We did that only on two. Sparrow message is 047. What about those individuals? Oh, you've done that before? The bottom ground is... Or else we should have the same approach. Count all the times. That specific word, The future shows up in Specific label. documents. So Rolex, how often it shows up in spam documents.

We have two spam documents. How often do you roll a shingles? One and two. Where did two came from? Over.

All the word counts in the spam documents go 1 plus 2 plus 1, 2, 3, 4, 5, 6, 7, 8.

You have nothing, right? All right, so Building in the evening. Class of '99, Google Power Pace. Yes. Finding. All probabilities will fall into that x couple. Once you have those probabilities, youYou have your mole. Let's train the model. I don't know. I overwatch. These are obviously artificial sentences by G for all its replica. Well, you get the picture, right? Would you classify them as amortized?

vocabulary. I use the words I, frolics, bones,Thank you. So I think all my sentences usedWords from that book, there's nothing outstanding here. What's going to happen if I have a sentence that is That includes a word outside of my dictionary vocabulary. Or. But here, I already fixed my vocabulary. I can do my back awards, back first for those. This is my training set and I can start Building a model, training a model.

This is different from a machine for a neural network where you do it E for E for E for E. Here you just estimate your probabilities and you're done. Alright, we already have that.

And for the conditionals, every word we do a counting, and we have two sets.

-No label? You got a one. Let's check out some of this. Let's see. The probability of y--You technically get two months One for the ham label, the other one is for the stamina label. If you have three labels, I would have to have another.

The most likely output that I could get based on those fundamentals on average. Does that make sense? All right, test it. Test it, what does it mean testing? Well, now I do have a Set of sentences set aside. I know their labels, but I don't know how to tell them. Letting baseball go to the line.

And, I will try to find, that's one.

It is a, prediction correct. So my model will ask the question, is this span or half? Is this span or half? Is this span or half? I know the answer, but I will not tell my model. So, given my... Two models that I already trained. I can now block those numbers in. Let's start with that first sentence, which is,This is our vector. And people actually come from two probabilities.

And see which one is either.

All right. So the probability that that sentence, given that sentence, The number is proportional to the ability of a document of the label being am times this. Plug in the numbers. Get a seat.

What about Stan?

Well, now we get non-zero values. We're just hiding them in the one-point span, which means that our model told us what? It classified it as Spanish. All right. That checks out. Let's go to another one. Correct classification, let's do the... Night. Same process. All the values. This is for M, this is for S, M, M came out as 0. So, let's classify our M. Again, for our classification, and let's do it for the formula.

Lecture Script

Text Classification Overview

Data Requirements and Training Process

Feature Extraction: Bag of Words

Naive Bayes Classifier

Spam Detection Example

Text Classification 및 Naive Bayes 모델

Text Classification 기본 개념

Naive Bayes Classifier 학습 (Learning)

Spam Detection 예제

Training (학습 과정)