Summary
Add-one smoothing (Laplace smoothing): add one to every count and compensate by adding the vocabulary size to the denominator, so no estimate is ever zero.
Unknown token approach: map rare or out-of-vocabulary words to a special unknown token, either by pooling rare words from the corpus or by fixing a lexicon in advance.
Large language model solution: chop words into subword pieces, so an unseen word can be assembled from pieces that are in the vocabulary.
Testing approach: hold out a test set of legitimate sentences and prefer the model that assigns them higher probabilities, averaging when results are mixed.
Perplexity: the inverse probability of a test sentence, normalized by the number of tokens; a good model is less perplexed by legitimate sentences.
Extrinsic evaluation: judging a model by human comparison or downstream performance; quick and dirty, but not the best way.
Purpose and applications: n-gram probabilities reveal what a corpus reflects (restaurant preferences, public sentiment) and support next-word prediction.
Notes
Transcript
So if I have some sequence, word one followed by word two, I count how often that sequence appears in the corpus and divide by how often word one appears. It will tell you, out of all the sequences that start with word one, this one shows up x amount of the time, and it gives me an approximation of how likely that sequence is to show up. Again, forget about the precise numbers reflecting the English language.
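In symbols, that relative-frequency (maximum likelihood) estimate of a bigram probability is:

```latex
P(w_2 \mid w_1) = \frac{C(w_1\, w_2)}{C(w_1)}
```

where C(.) counts occurrences in the corpus.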
And as you can imagine, and I will be stressing this a couple of times in this course, your choice of corpus will affect those probabilities. If you don't have certain expressions, or have very little of certain expressions, in your corpus.
Your language model will not learn them. Okay. So: the probability of "Sam" appearing, the probability of "Sam" given that "am" was already spotted. "Sam" follows only one of the few instances of "am". And you will do it for every bigram, possibly. Now, you could store that information in some table for every bigram, but would that be a problem? Unigrams are easy, right? You have a dictionary: this word appears that many times.
That's kind of the baseline approach. For bigrams? Your tables grow, right? Trigrams, even bigger. So you may want to make a decision: should I do it on the spot, when I need it, estimate that probability at that point? At any given point, I will just do it. Now let me go back to these two examples.
Now we can hopefully see the value of me adding those two extra tokens, <s> and </s>.
Now I can have a probability of "I" being at the beginning of the sentence, or "Sam" being at the end of the sentence. If that probability is really high, your language model will learn to stop producing the sentence after "Sam". Okay. Or actually, an exclamation mark could be our "Sam" here, right?
Your language model will learn: oh, I see an exclamation mark.
This is typically my cue to stop the sentence. Alright, so everybody understands how to do that on a basic example. Let's do another one. Actually, have you had a chance to look at that online textbook that I've been referring to? The draft, yes. Okay, so both examples are taken from that textbook, and this is the more elaborate one. The corpus, which I think Professor Jurafsky collected himself, was essentially restaurant conversations.
Case folded?
Case folded, or removed as well. Does it make sense? Probably: it simplifies your problem, because now you have a smaller table of all the unigrams and a smaller table of all the bigrams. Whether it's going to work for your application depends on what you're building it for. All right, so now a few more words about that data set. There are almost a thousand sentences and a certain number of unique words, which means?
Which rule helps me turn that probability into this product? The chain rule, right? Am I making Markov assumptions here? Not yet. This is the probability of the i-th word given all the words preceding it. That's the ideal situation, where to predict the next word I'm using everything that came before.
Markov assumption: every conditional probability here will be turned into a conditional probability based on just the previous word.
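Written out, the chain rule followed by the bigram Markov assumption looks like this:

```latex
P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
                 \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```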
That works. Again, it's a very destructive assumption, because we're losing some information, but at the same time we're actually making the whole computation possible at all. Alright, with that said, all I have to do is line up those probabilities and multiply them. Does that make sense? So: I built a language model, I extracted what were called unigram counts, then I extracted bigram counts, and then I was able to build this table.
You know, you give me a sentence, and I'm just looking up the corresponding probabilities and multiplying them in sequence. Can I get an example? That number does not have much to do with actual English probabilities, but it has some meaning. So far so good.
So: a unigram count table and a bigram count table, those are for the n-grams, and then a table of probabilities. The chain rule
with the Markov assumption, and I'm ready to go. Now let's do another one: "I want Chinese food" instead.
All right. If I do that, this is what happens. You can verify the numbers yourself.
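A minimal sketch of that pipeline in Python, using a toy corpus in place of the textbook's restaurant data (the corpus, function names, and test sentence here are all illustrative assumptions):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence starts and ends.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

tokens = [tok for sent in corpus for tok in sent.split()]
unigram_counts = Counter(tokens)                  # unigram count table
bigram_counts = Counter(zip(tokens, tokens[1:]))  # bigram count table
# Note: zipping the flat list also counts the "</s> <s>" pair between
# sentences; it never matters for the sentence-internal lookups below.

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate: count(w1 w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sentence_prob(sentence):
    """Chain rule with the bigram Markov assumption: multiply the bigrams."""
    words = sentence.split()
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= bigram_prob(w1, w2)
    return prob

print(sentence_prob("<s> I am Sam </s>"))  # 2/3 * 2/3 * 1/2 * 1/2 = 1/9
```

Swapping in a real corpus only changes the `corpus` list; the tables and the lookup stay the same.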
This is what I got. Okay, so I assigned values, probability values; I estimated probability values for two different sentences. Now, what do you see in this result? This is not just two numbers. Yes, I was technically capable of assigning two probabilities. But is this information telling me something?
There you go. That's one possible conclusion you can draw just from listening to people. You don't have to do the tasting yourself; just read the room, right? The corpus is telling you that information. You want to get into the restaurant business? Just listen to people. By the way, obviously you can apply that to anything, right? You're getting all sorts of political commentary, online or otherwise, right?
You don't need a commentator: just scour text from all sorts of media outlets, run something like this, and you will have an idea of what people feel about this or that. You're a prophet. "I like Y" versus "I like Z". Does that make sense? It seems trivial, but it's also a very powerful tool, and you don't really have to do much. Is this making sense? Alright, that's the basic language model. Absolutely the most basic.
Instead: once you have your probabilities, you have a product, and it gets smaller and smaller and smaller. So I'm telling you: use logarithms instead. "But you will have to go back to the actual linear space later on, so what's the benefit?" Well, in your computers, adding is cheaper than multiplying. And what about underflow?
On the computer side: 0.0001 times 0.0001 times 0.001, and so on.
Those products shrink pretty fast, okay? Even if all those little individual probabilities are non-zero, if you have a long enough chain, a long enough sentence that you're analyzing, you are bound to get a zero. And a probability of zero is not telling you anything. Now, if you move to logarithmic space, you can go farther, because you're adding; you will not get to zero, right?
You can go much further, but then you have to go back.
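To see the underflow concretely, here is a small sketch (the numbers are arbitrary): multiplying a hundred tiny probabilities collapses to exactly zero in double precision, while summing their logs stays finite:

```python
import math

probs = [1e-4] * 100  # a 100-token "sentence", each token with probability 0.0001

product = 1.0
for p in probs:
    product *= p      # 1e-400 is below the smallest representable double
print(product)        # 0.0 -- the product underflowed

log_prob = sum(math.log(p) for p in probs)
print(log_prob)       # about -921.03, perfectly representable
```

Converting back is a single exp at the very end, if you need the raw probability at all.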
So you go from multiplying p1 times p2 to adding log p1 plus log p2, and at the very end you exponentiate to get back to a probability, not a log. Okay, that was a little trick. Now let's talk about the other problem: zeros. Everybody knows what a sparse matrix is? If you don't: it's a matrix filled mostly with zeros. There's nothing wrong with that, technically speaking, because this is what the data is. In general, when you're doing any sort of machine learning, you don't want sparse matrices, because you're wasting space. But here, the fact that we have zeros leads to another problem.
What's the challenge here? Okay: we have a lot of probability estimates that are equal to 0.
I multiply the probabilities one after another, right? And if I make the sentence longer and then hit a zero, no matter how accurate the other probabilities are, the whole thing goes to zero.
Is that a good thing? It will quickly become problematic, and not only because you have zeros in your counts; well, that part is okay, your corpus simply does not include certain combinations of words. But what about something that you have not seen in your corpus? It will automatically get 0, right? Something that you are not familiar with. Your language model does not generalize: it will always give you a zero probability for something that it has not seen.
So that's unacceptable.
Words were chopped into subwords, right? So part of the deal here is to be able to build an unknown word from those little pieces.
So, for example, here the pieces are going to be similar to other things that are part of the vocabulary, and with a large language model, if it doesn't know what "Alphorn" means, it will try to stick word pieces together and give you some probability anyway, rather than just raising its hands: I don't know what this is.
Okay, so one way to handle it is to use so-called smoothing. This will take care of the zero probabilities, and it handles them all at once. I'm going to show you one way of doing that. There are other, more advanced ones, but understanding this one is easy enough; the rest are just more elaborate versions. So what you do is, essentially: to every count you add one, but then you compensate.
And then, let me hear you, compensate by what? By the size of the vocabulary. Why? Because if for every word I add a plus one, then I have to compensate by adding the size of the vocabulary to the denominator. There are other ways; you might see a more general version where you add some alpha instead of one, and then you have to multiply the size of the vocabulary by that alpha in the denominator.
This is not an optimal solution, but it's better than nothing. This way you will never get a zero probability.
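A sketch of add-one smoothing with made-up counts (the words and counts are illustrative, not the textbook's):

```python
from collections import Counter

unigram_counts = Counter({"I": 3, "am": 2, "Sam": 2})
bigram_counts = Counter({("I", "am"): 2, ("am", "Sam"): 1})
V = len(unigram_counts)  # vocabulary size, the compensation term

def smoothed_prob(w1, w2):
    """Add-one (Laplace): (count(w1 w2) + 1) / (count(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(smoothed_prob("I", "am"))    # seen bigram:   (2 + 1) / (3 + 3) = 0.5
print(smoothed_prob("Sam", "am"))  # unseen bigram: (0 + 1) / (2 + 3) = 0.2
```

The add-alpha generalization replaces the +1 with a small +alpha and the +V with +alpha*V.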
Another way to do it is to introduce a special token, and you will see that in some applications, where they have an unknown token; some of the large language models use a token like that as well. That's the easy way. But now, okay, it's standing in for a ton of different words, right?
One way to do it, the easiest way, is to find all the unknown words in your text,
and build a probability based on the count of all the unknowns.
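One minimal recipe for that (the threshold of "seen only once" is an illustrative choice, not the only one): replace every rare training word with a single <UNK> token and pool their counts:

```python
from collections import Counter

train_tokens = "the cat sat on the mat the dog ran".split()
counts = Counter(train_tokens)

# Keep only words seen more than once; everything else becomes <UNK>.
vocab = {w for w, c in counts.items() if c > 1}
mapped = [w if w in vocab else "<UNK>" for w in train_tokens]
unk_counts = Counter(mapped)

print(unk_counts)  # <UNK> now pools the counts of all six rare words
```

At test time, any word outside `vocab` gets looked up as <UNK> too, so nothing is ever out of vocabulary.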
Or you can do something more elaborate: create a lexicon. And, what was a lexicon, we talked about it last time? A list of words that are used for some specific purpose. Before you start building a language model, you create that list of words without looking at your corpus, right? And then whatever shows up in your corpus that is not in that lexicon is counted as an unknown.
An even cruder approach: you could just assign some fixed low probability to unknowns, and that will do as well; that is obviously suboptimal, okay? But it works. Smoothing should take care of business for now. All right, so now a tricky question. Many of you have some experience with machine learning, right? Supervised learning, neural networks.
Ooh. How were you evaluating your models? "I have a loss function."
Right, typically with supervised learning you would have a training set and a test set, with labels, and if your model is not producing the correct labels, then you have a loss function. Can you picture a loss function for a language model?
Remember, there are two tasks, right? Predicting the next word and assigning a probability. Yes, I could possibly have a list of sentences with known probabilities in English, but we already discussed this: can I actually have a reference probability for a given sentence? Can you?
You can't, right?
So whatever your language model produces, you have no reference for it. Nothing to compare against. So the loss function is at a loss, right? You have nothing to compare to. Seriously, you can try, but...
With that said, you would still use the same approach as with other machine learning scenarios where you have some test set and you run it through your model and then you see how it performs.
So that test set will be, let's say, ten sentences that I set up. Then compare. How would you choose the better model?
Let's make it 0.3. If you have those numbers, and this model on the right is consistently better probability-wise than the other one, which one would you consider the better language model?
The one that produces higher probabilities for those same sentences, or lower? Higher. Right.
You know those sentences; you know they're legit. They should be recognized as absolutely legit.
Well recognized. And that's where a problem starts. What if the results are mixed, and sometimes one model is producing higher probabilities, sometimes the other one?
We can just average them, right?
So that's a very reasonable way of approaching the problem. In essence, you do have to have some benchmark test set to start with.
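As a sketch, with made-up per-sentence probabilities from two hypothetical models scored on the same test sentences, averaging log-probabilities reduces a mixed win/loss pattern to one comparable number:

```python
import math

# Hypothetical probabilities each model assigns to the same three test sentences.
model_a = [0.30, 0.05, 0.20]
model_b = [0.25, 0.10, 0.15]

def avg_log_prob(scores):
    """Average log-probability over the test set (higher is better)."""
    return sum(math.log(p) for p in scores) / len(scores)

better = "A" if avg_log_prob(model_a) > avg_log_prob(model_b) else "B"
print(better)  # B wins on average, even though A wins on two of the three sentences
```

Averaging in log space rather than raw probability space keeps one near-zero sentence from dominating the comparison.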
So why would we leave those words in at all?
Well, this goes back to what I was hinting at before: whether you remove stuff or not depends on the application. Later on in the semester we'll do classification, right?
If you're just trying to classify whether a certain review is more favorable than another, you're counting positive or negative words.
"For", "to", "a": those words are not contributing anything to your decision, right? You will not even be counting them. And even if you were counting them, they're most likely going to be equally represented in positive and negative reviews. They're not swaying the decision, so you don't want them. A language model, though, is something where ultimately you want to build something like ChatGPT, which talks to you in the language it was trained to talk in.
A basic metric, from the olden and not so terrible ways of doing it: just sit a person in front of two models. This kind of goes hand in hand with the reinforcement learning from human feedback used for large language models. Not the best way.
But if you want something quick and dirty, it's a comparison of two models.
That was extrinsic evaluation, and it doesn't always work, right?
Perplexity is the opposite of that. It's a measure of how perplexed your language model is. Now, it actually makes sense to call it perplexity. What would that mean, looking at probabilities?
The model is perplexed by a sentence. If the model is
puzzled, or perplexed, when it comes to legit English sentences, then there's something wrong with that model: it assigns a low probability to them. Something does not add up in that model. Does that make sense? So perplexity is really the inverse of the probability of a sentence. But there's another aspect to it, and it should make sense to you: it is also normalized by the number of tokens in that sentence.
And the normalization makes sense because it allows you to compare perplexity measures for sentences of different lengths.
Once you have the probabilities calculated, this is simple.
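A minimal sketch of that, assuming you already have the sentence's probability: perplexity is the inverse probability, normalized (as a geometric mean) by the token count. The example numbers are illustrative.

```python
def perplexity(sentence_prob, n_tokens):
    """PP(W) = P(W) ** (-1 / N): inverse probability, length-normalized."""
    return sentence_prob ** (-1.0 / n_tokens)

# A 4-token sentence with probability 1/9 versus an 8-token one with the
# same probability: the longer sentence is less perplexing per token.
print(perplexity(1 / 9, 4))  # 9 ** 0.25, about 1.732
print(perplexity(1 / 9, 8))  # about 1.316
```

In practice you would compute this in log space, as exp of the negative average log probability, for the same underflow reasons as before.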
The calculation of the probability is based on the chain rule and then the Markov assumption. Alright, do you understand perplexity? Not difficult at all. Any questions? If you haven't looked at your programming assignment yet: your programming assignment is asking you to build a language model.
Like this, with the Markov assumption in the background. Yeah.
You're going to be disappointed if you expect too much of it. But I added an extra part. Just building the model, given the corpus that I've assigned to you: first there's the bigram assumption, then the rest; you'll see once you start. But I also gave you an option, and hopefully you'll have some fun with it, to make some changes to your model. Perhaps you want to use a trigram, whatever it is; you have an option to modify your model
however you please. Remove stop words or keep them, lowercase or not, whatever works for you.
Just to see if this will work. You don't have to do anything elaborate.
Hopefully this will be an interesting experience. For the basic model, I want you to strictly follow the assignment as specified,
because there are expected outcomes for your language model, and your language model will be graded on those numbers. For the other version, you're free. I posted additional problems for you.
Here, let me show you what they are: 10, 10a, 11. Please try them at home and see if you run into any issues.
Distribution of words, word counts, and now you see the problem that I mentioned before, why your solution is going to be funky. The most common token in the corpus that we're using here, which is the Brown corpus, is "the", with about 70,000 occurrences; the second one is the comma, and the third is the period. What does that mean? Because they are so frequent, I expect your language model to predict a lot of commas and periods where they shouldn't be, which is fine.
I hope you will notice that yourself and perhaps do something about it.
So use those. First do 10A, 10, and 12; that's like 40 or 50 percent of your work right now. Okay, cool. If you encounter any errors, let me know.
The last thing, and it's obviously covered in the textbook: what is part-of-speech tagging? It means assigning a part of speech to every word in your text. Why would we do that?
It changes the lemma, or rather it helps you find the lemma, because the context in which a word appears changes its interpretation.
"It's a baseball." Okay, so there is that. And, I think, it would also give the model an idea of how to parse the sentence, beyond the probability, because it would be able to tell that, say, the subject comes after the verb, and so on and so forth. Very good comment: the probabilities alone are not enough. Now, let me be clear: a lot of the language models that you are so familiar with kind of fuse all of that together.
<s> I love NLP </s>