Midterm Exam Summary

Script

Summary

Action Items & Next Steps

[ ] Bring glasses and cheat sheet to exam
[ ] Print cheat sheet on two sheets and staple them together

Byte Pair Encoding (BPE) Tokenization

The instructor demonstrated the BPE algorithm for building vocabularies and tokenizing text
Process involves iteratively identifying the most frequent character pairs and merging them into new tokens
Algorithm naturally discovers meaningful linguistic patterns like suffixes (e.g., "er") without being explicitly programmed to recognize them
The process repeats K times (suggested 2-3 iterations in the example, though real applications use many more)
Final vocabulary includes not only complete words but also sub-word units and common affixes
Once the vocabulary is learned, it can be applied to tokenize new text by greedily merging character sequences according to the learned rules

Hidden Markov Models and Viterbi Algorithm

The combination of Hidden Markov Models and the Viterbi algorithm provides a general framework applicable to any system that can be represented by a Markov model
Viterbi decoding aims to maximize the product of two probabilities: the sequence of state transitions and the corresponding sequence of emissions
For part-of-speech tagging, states represent possible POS tags (the example used 4 tags, though realistic systems use 30-40)

Dynamic Programming and Viterbi Table

Dynamic programming is essential because it breaks problems into smaller sub-problems and reuses solutions, avoiding repeated calculations
The Viterbi table has tags/states as rows and observations/words/tokens as columns
Building the table involves two steps: calculating the first column, then populating all other columns
First column cells are calculated as the product of: transition probability from start state to each tag, multiplied by emission probability of the first word given that tag
For subsequent columns, each cell considers all possible prior paths and maximizes over: (prior cell value) × (transition probability) × (emission probability)
Back pointers are maintained throughout to track the path that produced each maximum value
Final decoding: identify the highest value in the last column, then follow back pointers to reconstruct the most likely tag sequence
This approach is significantly faster than brute force despite the repetitive calculations

Issues Noted in Previous Assignment

One issue: probability of the sentence was not given, but students should have assumed equal sentence probabilities when comparing tag sequences
Another issue: the Markov model in the assignment had no explicit end state, which is appropriate for some scenarios (like weather modeling) but not others

CKY Parsing Algorithm

Discussion of lemmatization and word relationships (e.g., "fantastic" vs "fantasy")
CKY parsing requires grammars in Chomsky Normal Form, where every rule must produce either two non-terminals or a single terminal
The parsing table is built diagonally, with cells representing sub-sections of the sentence
Each cell can be built from different partitions of the sub-sentence it represents
A successful parse requires finding at least one way to derive the start symbol S for the complete sentence
Purpose of parsing: groups sequences of words together meaningfully, which is useful for applications like question answering systems
Probabilistic parsing is especially useful for resolving ambiguities

Exam Information

Exam duration: 75 minutes
Students may bring glasses
One cheat sheet allowed, printed on two sheets and stapled together

Notes

Transcript

Incoming. I'll work a week. So One way to resolve the ties is whatever shows up first. Right, left, right. So we're looking for the highest frequency, but we have two of those. According to the logic that I just described, we would go after that. This is our first month. So I will now merge these two. Okay, can you create a new token? It is from the perspective of a tokenizer, it is a single new character even though it's made of new gear.

Now I have an ER. With that set up, I will go back to myHorrible. Nicely divided into two. words and characters right now, and I will replace every single instanceOf separate D, separate R, or the D. E-R together. So now we haveAnd we can go through the process again, which means Identify unique tokens. I'm sorry, I think it's bifares. All right, now, whatever else was there before, Say some of them will be replaced.

So we have W and ER. This is going to turn a pair right now. E-R-N-N underscore. See how it's already picking up on an important suffix in the English language. That means something the PPE does not know what it means, but it's found out that it matters Thank you. In other words, in the end, dividing u and er carry some information that can be exploited. All right, so... Notice I also added an inch to mySo capillary right now.

And I keep counting again. This time there's only one winner. Or with others. That's emergent. All right, so I have a new character. I will add to my vocabulary. I'm old enough to eat my own food. And you repeat that process. K times the number of large times. I don't know, what was it? The only assignment, two or three times? I think three minutes. Is that right?

Don't leave it. If you keep repeating this process over and over for a long time, you will not only Fine.

Things as suffixes. That means... S is for? You will also start-Using actual words. So you will build a vocabulary that is not only made of English words but also Many-Supports depending on how rich your purpose is. Does that make sense? So, one, two, three, ten. King repeats. Do you have a final? The final book Have you learned? I expect it to include a lot of words, some words and whatnot. And you can use it.

to apply it to the Tokenizing things, right? More learning, we can-Take any sentence that comes out of the way. Break it down. Somebody from the test set or the actual application. Thank you.

And now, greedily, Go through it to find what can be merged. And we will see if we have an N. We do have an E. This will be merged. And then we do have a NU. In this particular instance we managed to tokenized and work more into the actual If I do it for the lower, I don't have-Yeah, if I were doing it long enough, I would.

End up having a lower. I don't have it right now, so I will stop. Where I can use the-So I would definitely stumble up on a low. And this is how you would tokenize lower according to my vocabulary. Of course, once everything is done, you really drop that stuff token because it wasn't part of the original. Thanks. And you have a tokenized center. Does that make sense? What's the purpose of the The language for it, so you can do sequencing there.

However, and don't forget that, that combination of market model and therapy algorithm is just a-General. Whenever you have a system that you can represent by a given market model, you can apply it.

You can decode it.

Find the underlying system. Everybody understands how that market model is being built, what's in it, transition and emission probabilities. Operations will be input to the brain. There's a set of possible observations Thank you.

Two things.

at the same time What are those two things? I mean, we're trying to find-Probably. Most likely sequence, but we are using

There you go. We're trying to. We're trying to maximize the following expression. We're trying to maximize the probability of a product of two-Number one.

Sequence of tags or sequence of state changes in the market model and then the corresponding Underlying.

Sequence of emissions. We move state to state.

Next state, commit. Next state, commit. Responding.

So really, big therapy is trying to maximize that problem.

How's it going? And the answerWhat would be a sequence? How does it work?

Well, first of all, we need a set of states, right? So now we're in case part of speech tracking says possible states is Is that a false sign of attack? In the slides, we're Using four distinct tags is a very low number of realistic and practical data. 30 or 40. Something about that thing. Difficult to explain. Good morning.

So, assuming that we have a Corpus within building.

. What are the transition probabilities for one?

What does that state?

By the way. There were two places in the other sceneExcuse me, ma'am. Word. You should have raised some questions. You are not givenAll the information that we are required to solve it. It was fine if you made a mental note somewhere, an assumption, this is it, okay? And you're not going to be penalized for it. I was hoping that someone would just Hey. It does not add up, OK? Well, being down in the.

Any thoughts on what that was? One was rather intentional, the other one was--Slip up on that part. Three pounds. One problem that he had, he were supposed to-Decide whether a certain set of tabs is better for one sentence or another, right? Always blessed.

The probability of ascendance is when we were building the Piedmont Market Mall in Abacate, there was a step when the assumption was made that, hey, we're dealing with the same sentence. The probability of the sentence does not matter. So. You, I imagine you solved it under the assumption that they have the same probability. Fine. I was hoping that someone would take it.

And the other one was with the-back to the market Model starts.

Start, state. In our case, we don't have an end state. Assignment right. Can you imagine a scenario where there is no room for a handstand?

Or if you had a human marker ball that just reflects the weather changes and observations of temperature and weather, right? Stop. Supposedly don't come, boom, right?

We're not assuming. Everyday temperatures are changing. There's no stop. Stop.

Sometimes you will need it, sometimes you will not need it. Here, we don't have it, we can just keep generating. Sequences of sentences of A. Anyone? Everybody understands how that becomesYeah. Once you arrive at the state, you have One of those observations words Alright, so given those two tables, or this table and this diagram, you are ready to decode.

The sequence.

Unfortunately, The brute force approach is not-And what kind of an organ? Day 9 program, what is so special about Day 9 program? In general. To some extent, yeah. So you're both right, Breaking down into smaller problems and you are able to reuse The solutions to smaller problems. You're not. repeatedly We're doing the same calculation.

You can go back and...

We need some specific sub-problem value or whatnot. This is what the-Eternity table. Looks like this. Lowercase.

It has All the tags are states.

As rose. and observations or words or tokens as calls. Of course you can flip it out, as long as you have that information in order, And your task is to...

Hello, ladies. Populate every single value in thatAnd then tape.

Ultimately, Your last code in that table will store a solution in your property table.

The highest value someone, And this value is corresponding to the path. The ruler, they all maximizes all the calculations. All the probabilities and the decisions So just reading---Can you see the other belt? Now the sequence to decode the sequence you also have.

Backtracking point.

How are we building this table? There's really two steps to it. First step is calculating The first column and the other one is--After laying all the other columns.

First column you're using A simpler calculation for all the others here is a little more than that.

Approach, so how do we calculate the first column?

This is pretty simple. There's a lot of back and forths here.

Really what matters is for every cell here, you're using a product of a few other builds, which is,We have transitioning from the beginningsDo you have specific statements they have for me? That would be-When you are here, what's the probability that you willThere were flags. The first column is really Every cell in that first column is a product ofTransitioning from the start stateAny other state?

and then admitting that first word.

What do you have there?

back pointers in this case. Let's keep all of them. Some zeros here, but the pseudocode is not specific on that, so we might as well keep all of them. With back pointers and the values, we are ready to move on. To the next column. And here every cell is calculatedThis award is also is completing The same approach. Yeah. For the first column to begin here, we always start from the start. Now, We have to, let's say, we're trying to do This way.

Those cryo paths are If you look at the formula for calculating a single cell, A max, that's important, out of...

Multiple products, multiple, I think there's four ways to get them.

Now, those products. For the first column, we were only looking atTransition. Times emission probability here. Since we are already somewhere along the path, we have to factor in The information about the crime. That priorThe value that we Okay, let's look at this one. I might have came from B11, right? So I would Okay.

We're all So in that whole same approach, you get some numbers. You also should keep track of all the back pointers here. That's all right.

But what we do with those calculations Now, after we're done, we're going to go into health.

That was our initial sequence of observations. All right, let's see. The highest number is here. Now... We found the highest number of our own all-inclusive piece.

And we followed the back features.

And that is the rule of maximum sequence. And we are still not convinced that this is going Way, way, way faster. force Despite all the repetitiveIf you think that you can skip some of those, don't. It ranksYou hear me?

That's the-Fantastic delays will lead you to fantastic. PerhapsThe way the process goes is like a dictionary of information.

It will go to fantastically and it will most likely lead you back to fantastic.

And that is what would we say fantasy with respect to fantastic. My answer is probably not the answer that you're looking for. It's a fantasy and fantastic view of the same semantic field, which is a term that I ended up using.

Chomsky Normal Form. What is special about Chomsky Normal Form? You can see on the screen that the rules of the grammar on the screen are Chomsky Normal Form.

Not a single rule.

That means that whenever youA rhyme and a symbol, whatever it is.

That diagonal can be built on two components of connecting He's so cool.

We'll see that. I hope so. Think about it. This cell right here.

with that subsection of-Sub-sentence. That's all right here.

And everything else is already captured by it. We hope that's one way to break it down. There's four different ways to break it down.

-The next one would be Joe Eats. That's the blue one. Unfortunately, and the rest is known, right?

Unfortunately, there's no rule that connects the start symbol with another person. This is a null. So we already had an S, right? Now we're getting a null.

So I had for the first partition, I had this. All the others I have notes. If I manage to find a single S, it's possible that I would have all fours. There's a single S right here. I have a legal way of R, saying that sentence.

Thank you. Thank you. You should not be doing anything like that. Basically, what is the purpose of parsing? Okay, so parsing will go to... Sequences of words together.

It will partition your sentence in a way, it will partition This goes Together, the rest goes together. And you can have-that's not a perfect example, but-New York City, right? They should go together. President Trump's son, right? If you want to group it together, for example, the question answering is later on. No. President Trump andThe work side work. Thank you. According to if you have a question answering system that goes, 9x, right?

So this is specifically useful when What was this? Is this specifically useful to resolve that kind of problems? Even more so, probabilistic problems, even more so than theDoes that make sense?

I'm fully aware that you only have seven to five minutes for your exam.

I already mentioned that I have glasses.

Bring them in, put them on, and smile at me or whatever other glasses are out there. One cheat sheet allowed. Print it out on two sheets.

Everything about parsing ON THE MIDTERM EXAM!

Week 1

What language is

1 Fundamentals of Language & Linguistics • Language: 구조화된 의사소통 체계 특정 체계(English)와 인간의 언어 능력 모두를 지칭 → Phonemes / Morphemes&Lexemes / Syntax / Context&Semantic • Linguistics: 언어와 그 구조를 연구하는 과학적 학문 • Linguistics Subfields: ◦ Phonetics: 개별 소리 연구 | Phonology: 소리 체계 연구 ◦ Morphology: 단어 내부 구조 | Syntax: 문장 구조/규칙 ◦ Semantics: 단어/문장의 의미 | Pragmatics: 맥락 속 실질적 의미

2 Phonemes, Morphemes, and Lexemes • Phoneme: 의미를 구별하는 최소 소리 단위 (예: /p/, /b/) 자체 의미는 없음 • Morpheme: 의미를 가진 최소 단위 ◦ 예: un-break-able (3 morphemes), cats (cat + plural -s) • Lexeme: 의미적 단일 단위인 단어들의 집합 ◦ 예: run, running, ran은 모두 동일한 Lexeme RUN에 속함

3 Syntax & Sentence Structure • Syntax: 문법적으로 옳은 문장을 만드는 규칙 체계 • Parse Tree: 문장의 계층적 구조를 시각화 (S, NP, VP, V, D, N 등) • Lexical Categories (품사): ◦ Major: Noun, Verb, Adjective, Adverb ◦ Functional: Conjunction, Determiner, Preposition • Noun Subcategories: Common(town), Proper(London), Pronoun(he/she)

What are blocks of language and their relationship to various NLP tasks

Basics of morphology

Morphology (단어의 구조) • 구성: Lexeme(Root) + Affix(Prefix/Suffix) • Inflectional (굴절): 문법 정보만 추가, 새 Lexeme 생성 X (예: dog → dogs) • Derivational (파생): 단어의 품사나 근본 의미를 바꿔 새 Lexeme 생성 (예: happy → unhappy)