Lecture Script

Summary

Introduction to Collocations and Language Patterns

Collocations come in different types, including phrasal verbs and common phrases
Preprocessing may require treating multi-word expressions (like "social media") as single tokens
Words do not randomly appear in text - they follow statistical patterns and tend to co-occur in predictable ways

What is a Language Model?

A language model is a probability distribution over words
Language models work on the principle of identifying statistical patterns in text
The process is entirely based on probabilities
Language models predict the next word in a sequence by calculating probabilities
Joint probabilities and conditional probabilities are interchangeable in this context

Applications of Language Models

Word prediction: Predicting the next word in a sequence
Spell checking: Identifying misspelled words by assigning very low probabilities to invalid sequences
Machine translation: Selecting between multiple possible translations by choosing the one with highest probability
Speech recognition: Mapping sound sequences to words by comparing probabilities
Sentence validation: Determining if a sentence is valid based on its probability

Building Language Models: The Challenge

Cannot build an infinite lookup table for all possible word sequences
Need to map sequences to probability values using functions rather than exhaustive tables
Using uniform probabilities (every word equally likely) does not work well
Must use corpus-based frequencies to estimate probabilities
All probabilities are estimates, not precise values
The choice of corpus significantly affects probability estimates

N-grams

An n-gram is a subsequence of n items from a sequence
Examples: bigrams (2 items), trigrams (3 items)
N-grams can be applied to any sequence, but are primarily used for words

Calculating Sentence Probability

The zero-count problem: If a sentence doesn't appear in the corpus, its probability is zero, which breaks the language model
Must use the chain rule to decompose sentence probability into a product of conditional probabilities
Conditional probability formula: count of sequence divided by count of preceding context
Longer sequences are more likely to have zero counts in the corpus

Markov Assumption

Conditional independence: Future words depend only on a limited context, not the entire history
The Markov assumption states that the future depends only on the present, not the past
This is an assumption made for simplicity, whether or not it's strictly true
Example: In "I like apples," the word "apples" depends on both preceding words, but Markov assumption simplifies this

Context Window and Orders of Markov Models

More context generally allows better predictions
However, not all preceding words are equally relevant (e.g., first word of a 10,000-page novel doesn't affect the last word)
There's a "sweet spot" for context window size that depends on the problem
Context windows explain why chatbots eventually "forget" the beginning of long conversations

N-gram Orders

0th order: Future doesn't depend on anything
1st order (bigrams): Future depends on one previous word - most commonly used in this course
2nd order (trigrams): Future depends on two previous words
Higher-order n-grams are more accurate but harder to estimate

Maximum Likelihood Estimation

Using counts from the corpus to estimate probabilities is called maximum likelihood estimation
These probability estimates are most likely to generate a corpus similar to the original

Practical Implementation

Start and end tokens (like <s> and </s>) are added to handle sentence boundaries
The start token allows calculating the probability that a sentence begins with a particular word
Example shown with calculating frequencies and probabilities from a small corpus

Notes

Transcript

Collocations come in different sorts, right? There are different types of collocations.

That makes sense. Phrasal verbs, that's a lot of them in English.

Phrases, right? Do you want, do you feel that you need, right? Figure out, oh, this goes together.

Why would you? Well, you can find keywords, or you can use it for other purposes.

This is a good example for additional stepping of preprocessing. Social media. That has to be treated as a single token.

Alright, so now time for a very important topic of this course. Everybody in this room knows what a... It's a large language model. What I'm going to introduce you to right now is just a language model as a concept. And it's not the kind of...

But it is working on the same principle. Does that make sense? So what is a language memo?

A language model is related to this concept. There is a statistical pattern in that you can find, and this is... Thank you for your time.

Words going together, collocations, right? They do not randomly appear in text.

So, it won't start short.

Your GPT, longevity, and whatever language language you're using is technically doing this.

It predicts, I guess is not a good term. It's like, I like a man who has to...

This is what language models do.

This process is entirely based on probabilities.

But, this is not the only thing that language models are doing. Language models can also...

Now, you're judging the work of whatever is producing a response to that.

Actually, this is a very long list of possibilities, at the very least, it is going to be nameless language. So now it has to pick the one that goes there. One way to do it is to assign a probability of each sequence, of each sentence, right? And pick the one that is the highest.

The other one is to calculate the conditional probability of each one. But that's, you already know the relationship between joint probabilities and conditional probabilities, right? One, we can move between one and another.

So, that's the language model for you. Of course, word prediction is not only...

Fixing things like this, right? Why would that be a problem? Well, that's a misspelled word.

Well, you assign a probability to it and it's going to be very, very upright. Something's wrong with that. As opposed to... Let's not talk about details right now, but there's a lot of applications for that probability.

Those applications will go beyond what I will cover.

So, long story short, and that's one of the most important definitions and one of the most disappointing definitions for you in this course, is a language model is a probability distribution over words.

So now, the number of cones kind of goes into infinity, right, if you think about it. Then, that also means that the number of rows will go towards infinity. And for all intents and purposes, that table is infinity by infinity. It probably is not, but...

It's close, right? In other words, you cannot build that. So you need other ways of doing it. So, essentially, when you're dealing with languages, language models are in place,

I don't know, sequence, or I don't know, is some function of that sequence. Correct? A table like this could be used as a function. A lookup table, that will just give you an answer. But, if you can't do that... Well, maybe you can come up with some way of mapping a sequence into a probability value. This is exactly what we are largely learning, which models are generally used.

Does that make sense? So far so good? Alright, so how would you... I'm already telling you, this is no go. You will not be able to do that.

It can also be an answer. Is this sentence valid? Low probability sentence? Most likely there is something wrong with it, right?

Those who are bilingual know that a certain sentence is going to be translated in multiple ways, right? You have to pick one. Now, with a language model, you can decide that. One of those options has a higher probability based on some... This is actually part of what's going on with machine translation that he used sometimes.

Looking at the sentence as a sequence of sound tokens, right? Some sequences would be more likely than others.

By the way, I'm sure you've noticed when you're speaking with a machine over the phone, right?

You don't have a nice little open-ended conversation. For a yes on that, press 1, or say 1, or 2, yes, no, right? Notice that there's usually 2, maybe 3 options, or maybe 6 options. No more than that. If you reduce the complexity and then there is a probabilistic mapping, your sound sample is compared to some reference, right? This is probabilistically more close to yes than no. So we're dealing with yes or no. If the probability that that machine came up with is very low, it will actually...

Is that going to work well? Every word has the same probability. You can hear them as you wish. That doesn't...

So, okay, scratch that. Let's use some specific probabilities, okay? I've already introduced you to some of them above. Text focus, which has tons of examples, contains certain words that appear more often than others. Right?

If I give you a corpus, right, then we could be in a corpus, right? With that, we will come up with some value. It will be the number of...

So does that make sense? That would be it. Can I put an equality sign here? Well, in this class that will not come down. Can we have a precise probability of the word rabbit at co-occurring endings?

I just ruined your work, right? So, one thing that you have to forget for the rest of the class is that those probabilities will be just estimates. For better or worse, right? And it gets even worse than that. If you pick your corpus that is not rabid, really, right?

There is a chance that that probability estimate will be horribly different to some rabbit racing manual, right? Picking your data source will matter. So far so good? Okay, let's go.

And n-gram is a subsequence of n items from any enum sequence.

You see, right? D, C, D. D, C, D, E. These are three angles, trigrams. Well, you'll see why that is important. So, an engram is a subsequence of an oversequence of a predefined length.

So, this could be applied to any sequences of anything, really we'll be doing it mostly for filaments. That's all I've got.

But, depending on the corpus, this will be zero. Is that a good answer? Zero over some... There's our number.

There you go. Your language model is broken from the beginning. You are not able.

According to the chain rule, I could turn that into a product of conditional probabilities.

All of them computers are useless, shows up in the corpus as the numerator. Okay, fine. Over. And we have something different here. The computer's are X. Does that make sense? Because this condition tells you that we already have. Computers are proceeding useless. Correct? So I cannot take into account any three word sentence.

All right, but that is also different. It can be simplified to the following expression. And while I will...

Also, the longer the sequence is, the more likely it is that I will not find it any further.

But let's use it in an actual scenario. Let me assume conditional independence.

This guide is conditionally independent of that guide if I know it.

The more context I have, the better prediction I can make. So, should I be expanding this historical chain of words independently? Maybe.

You read a novel that's 10,000 pages long, right? It starts with the word hello. It's a Friday novel, so you learn who done it at the end, right? Would you say that all the words that happen from page 1 to page 1,000 would be useful to predict the last word in that novel?

But you would not argue with me that there is some... that it's a non-zero probability that that first word actually affects the last word. Okay. Now let's take a magazine. People or whatever. The Atlantic, right? You take the first word on the first page and the last word on the last page.

I imagine there's a sweet spot somewhere, but it's hard. It will depend on the problem. Does that make sense?

This is why when you talk with Chad for days, it will eventually tell you that it doesn't remember where you started.

mathematically proof. Now the Markov assumption is called an assumption because you make that assumption whether you are right or wrong. If you do it for simplicity and convenience, you make the assumption that this holds. If you... Am I making sense so far? Because that mark of assumption is going to be showing up in your AI work a lot.

Another way and more general way and the right way to think about the Markov assumption is if you have a prime line, right? This is now, this is the past, this is the future, right? So what Markov assumption says is that the future depends on the present only.

It is an assumption.

Because if you think about it, I like apples. It's absolutely, the apples are absolutely dependent on both I like and I love.

I'm reducing the conditional to just one previous token, or one previous item.

All right, now I could extend that to, um...

Technically speaking, the more ground we use, the more accurate our conditionals should be.

But it will be, the more n-grams we use, the harder it will be to estimate this. But you can play this game as you wish. So, long story short, this is...

So now back to orders of Markovian assumption. The year of order, that means that the future does not depend on anything. Okay, there's no additional, whatever, yes. Zeroth order. First order. This is the one that you will be mostly using in this class.

Now, future. By the way, this way to approximate or estimate that conditional... It is the result of so-called maximum likelihood estimation. Are you familiar with that term?

Did you take any machine learning classes? Some of you did, right? I'm pretty sure the maximum likelihood estimation was on the median there. All right. We're now going into mathematical details. Thus, if you use. The counts from the original corpus, two, estimate the probabilities of occurrences of certain words in certain sequences. And then he randomly tried to build a corpus from nothing. You're more likely to degenerate that original.

Those probability estimates are most likely the ones to generate a corpus like that.

Assume that there was no proper punctuation here. You could still have a punctuation right here, as a token, and still have the and at the same time. This is how it goes.

So, first, let's find the frequencies and the probabilities corresponding to individual units for a given purpose.

According to this corpus, that's a 50% chance that you would need to get that.

How likely is it that it is starting to start?

Does that make sense? Do you actually have that number in the mail? This is part of a reason why you're adding this extra token so you can do the punctuation.

N-grams

정의

N-gram은 연속된 N개의 단어 sequence

종류

Unigram → 1개 단어
Bigram → 2개 단어
Trigram → 3개 단어
n-gram → 일반화

예시

문장

I love natural language processing

Unigram

I
love