Lecture Script

Summary

Text Preprocessing Decisions

Stop Words

Tokens and Types

Type-to-Token Ratio (TTR)

Text Segmentation

Tokenization Approaches

Whitespace-based tokenization:

Character-based tokenization:

Subword tokenization (Industry standard):

Tokenizer Demonstration

Byte Pair Encoding (BPE) Algorithm

Overview:

Token Learner (Training Phase):

  1. Start with a character-level vocabulary extracted from the training corpus
  2. Add a special end-of-word token (underscore "_") to mark word boundaries
  3. Perform whitespace tokenization, then character-based tokenization
  4. Count the frequencies of all adjacent symbol pairs
  5. Merge the most frequent pair and add the merged symbol to the vocabulary
  6. Replace every occurrence of that pair in the corpus with the merged symbol
  7. Repeat steps 4-6 until K merges are completed
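The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer; the function name and the tie-breaking rule (the first pair encountered wins, as the transcript describes) are choices of this sketch.

```python
from collections import Counter

def learn_bpe(corpus_words, k):
    """Learn k BPE merges from a list of training words.

    Returns the ordered merge list; the final vocabulary is the initial
    character set plus one new symbol per merge.
    """
    # Steps 1-3: split each word into characters plus the end-of-word marker "_"
    words = Counter(tuple(w) + ("_",) for w in corpus_words)
    merges = []
    for _ in range(k):
        # Step 4: count frequencies of all adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 5: most frequent pair wins; ties go to the first pair encountered
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 6: replace every occurrence of the pair with the merged symbol
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

For example, `learn_bpe(["newer"] * 6 + ["low"] * 5, 3)` returns three merges, starting with the most frequent adjacent pair in the corpus.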

Key BPE characteristics:

Token Segmenter (Inference Phase):

Parameter K:

Practical Tokenization Challenges

Notes

Transcript

Remove the punctuation? It depends on the context. Again, these question marks are important. If you're asking something, a question, then having that signal would help in the response.

Right, right. A strategically placed comma is also going to change the meaning of the sentence every once in a while. Your call, right? So you do have to have some understanding of what you're dealing with in the text and what your future steps are, because you're preparing the text for some future processing. At least now you know what punctuation removal is.

There's a term here, though: stop words. If I asked you what a stop word is, what would that be? Removing stop words means removing certain words from the text. Are there any words that come to your mind where you'd say, okay, let's dispose of that because it's not bringing much to the table?

Maybe words like "the" or "have", where you don't seem to be able to get a whole lot of value? Yes. Yes. I was thinking more of introductory phrases, like "in conclusion", or sentences that are there but don't really add context.

That's a very good comment, but you would have to be careful with those. If you absolutely know that removing them isn't taking anything away from future processing in the same system, then by all means. And "in conclusion": I agree with you, that might not be important. So why would you do that, technologically or engineering-wise? It simplifies the sentence. Simplifies the sentence, shortens the sentence.

The number of little pieces that you have to go through is reduced automatically. We're living in times where everything is super fast, right? It will be faster. Memory seems to not be an issue. But imagine you're running something like a GPT service and you're getting a bazillion requests every second. Again, this will be up to you: should you employ that? And how would you know that a word is a stop word? How would you select a stop word?

You could ask your English professor, right? "Meaningless" words in English. Is that a good way to do it? The difference between "and" and "that" may matter in some cases, right?

My distinction was: we talked about sentence structure before, and certain words in those sentence structures were integral, like nouns and verbs, but maybe the articles might be unnecessary and could be disposed of as stop words. So maybe come to a conclusion based on sentence structure. Okay.

Let me hopefully give you an example. I said that we're discussing English here, but let's relax that condition for a moment. You're being tasked to build a system that processes a language that you don't know. You don't have a whole year or two to learn that language. Can you imagine other languages having words like "the" and "that"? Possibly. Sure, right? Why not? But you have never spoken that language, you have never used it, so you wouldn't know, right?

You might ask someone, but if you were to write a general tool that produces a list of stop words for any language, how would you go about that?

Maybe those words, since they're kind of filler, have a higher frequency in the entire document, so you can remove words that appear more than a certain number of times.

Excellent. So word frequency: the words that come out on top of the frequency list. We'll talk more about frequency later, because it will be helpful. So you get the idea. These are optional pre-processing steps. The next two are kind of optional too, given the current state of NLP, but I'll walk you through them. Now, there's one thing that I did not explain here.
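The frequency idea just discussed can be sketched in a few lines of Python. The function name and cutoff are illustrative choices, not a standard recipe.

```python
from collections import Counter
import re

def stopword_candidates(text, top_n=10):
    """Rank words by raw frequency. In any large corpus, the top of the
    list is dominated by function words like 'the', 'of', 'and', which
    makes frequency a language-agnostic way to propose stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w, _ in Counter(words).most_common(top_n)]

sample = "the cat sat on the mat and the dog sat by the door"
print(stopword_candidates(sample, 3))   # 'the' comes out on top
```

On a real corpus you would then review the top of the list by hand, or with a native speaker, before committing to a stop-word list.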

Yeah, so it's like an atomic piece to be processed. The smallest piece to be processed.

Yeah, like a building block.

Right. Keep in mind, let's say that you approach only the breakdown process, tokenization: here's a sentence and you're breaking it up, right? Forget about stop words; you would tokenize it by words, and you get to the end. What about this full stop at the end? Should you keep it? If yes, would that be a word? No. It's not a word, it's a character. If you keep it, it will be processed.

Hopefully you can extract some meaning from it. That would be a token as well, don't forget. So tokens are a wider concept than words. All right, a few technical explanations for you. Let's go to segmentation. Sometimes the term tokenization causes a little confusion, so I like "segmentation" here: segmenting text into sentences. One very easy approach today is to use the dots: split on the full stops and you're done, right?

You will have some problems, such as a full stop showing up in, quote unquote, unexpected places that you have to take care of, but that is not a big deal. One way to handle those edge cases is to have a classifier. Again, this is a solved problem, so libraries will give you near-perfect English sentence segmentation. But if you were to build one for a specific language, you would write a decision tree that handles all sorts of edge cases.

Is this an end-of-sentence character? If yes, done. If not, well, then: is this an abbreviation? All sorts of tests that you custom-build for the problem. Does that make sense? But we will not be doing that. Usually, splitting on the dot will take care of most of your problems. If not, you can always rely on regular expressions or some custom logic. Now, let's talk about tokens and types.
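A dot-split plus a custom abbreviation test, as described above, might look like this in Python. The abbreviation list is a tiny illustrative stand-in; a real segmenter needs a much longer list, or a trained classifier.

```python
import re

# A small, illustrative abbreviation list; a real segmenter would need many more
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Naive segmentation: split after . ! ? followed by whitespace,
    then re-join pieces whose previous chunk ended in a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            # The previous full stop belonged to an abbreviation, not a sentence end
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(split_sentences("Dr. Smith arrived. He was late!"))
# -> ['Dr. Smith arrived.', 'He was late!']
```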

I think I already explained what a token is. Do you remember what a type is in programming languages? It classifies different kinds of data; every data type is treated in a different way. A type in NLP is similar: it's an element of the vocabulary. Loosely speaking, it's an entry in some dictionary, except this dictionary will have words, but it will also have punctuation, an emoji, and whatnot, whatever you make it.

That's a very good question. Every single unique entry in that dictionary is one type. Now, your text will have multiple instances of that type: you will have multiple "dog"s in that text. Those instances are, specifically, called tokens. There can be multiple tokens of the same type, but there will be only one type. That distinction is actually helpful in some cases when you're trying to do text analysis.

Here's one of the super crude, easy, basic performance measures, one of the metrics for text: the type-to-token ratio. So here we're looking at a sentence, "A good horse is a horse that you love", right? How many tokens do we have? One, two, three, four, five, six, seven, eight, nine. How many types do we have? Well, if we lowercase everything, "A" and "a" become one repetition of the same type, and "horse" appears twice.

So we have seven types. You could argue that there are eight types, fine, but let's assume seven. So the type-to-token ratio is seven over nine. All right, now if I gave you the type-to-token ratio for some textbook or for a novel, what's your favorite novel, War and Peace, whatever, right? What would that number mean? It's a spectrum, right?

The lower the type-to-token ratio, the more tokens there are relative to types: you're reusing the same words very often. The other option is a high ratio: you have a lot of distinct words and you use each one very sparingly. In loose terms, what does that say about the vocabulary? The richness of the text is high if your type-to-token ratio is up here.

If you're down here, right, it means the vocabulary is limited. Of course, the length of the text matters; a very short text is not very meaningful. But take War and Peace: if it had a type-to-token ratio down here, the writer doesn't use a very rich vocabulary, right? All right. Does that make sense? Now, how would you use that piece of information? Again, you may not use it in your NLP applications, but...

Let me give you a hypothetical. I'm giving you two datasets; let's say they're both made up of tons of notes. You calculate the average type-to-token ratio, and one has a very low type-to-token ratio and the other one has a very high type-to-token ratio. Which one would you use to train your own ChatGPT? Would you go for the one with a very rich vocabulary, or for the one that has a very limited vocabulary and is reusing the same thing over and over and over?

We can only pick one. And remember that your own ChatGPT has to learn to speak, and it will imitate whatever is in your dataset. If you give it a very rigid dataset in terms of vocabulary, phrasing everything with the same few words, it will have to make do with that. All right, does that make sense? Very simple metric, but it's good to know.
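The type-to-token ratio for the lecture's example sentence can be computed directly; here with a simple whitespace split and lowercasing, matching the seven-type count above.

```python
def type_token_ratio(text):
    """Unique words (types) divided by total words (tokens),
    lowercased so 'A' and 'a' count as one type."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

sentence = "A good horse is a horse that you love"
print(type_token_ratio(sentence))   # 7 types / 9 tokens, about 0.778
```

A higher ratio suggests a richer vocabulary, with the caveat from the lecture that the measure is only meaningful for texts of comparable length.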

Here's another thing that could possibly trip you up, right? We talked about segmentation for sentences, where you're trying to find the borders of a sentence. Now, tokenization itself, for English, usually starts with whitespace. Most of you have spent enough time in computer science classes to probably understand that this term does not mean just the space character. All right.

And also: the blank character, invisible as it is, is a whitespace character. A tab is a whitespace character. A non-breaking space is a whitespace character; you don't see it, but it breaks the words in the sentence, okay? So, loosely speaking, let's broaden the category. We're not breaking where the space is; we're breaking where a whitespace character is. Does that make sense? Now, I'm pretty sure there are people here who would tell me that, in the language they use,

a space would not be enough to break a sentence into words. So this will backfire again if you step out of the English language; you may need to use other approaches. I asked you about German words last time, right? German words are clusters of words fused into one. And there are languages such as Chinese or Japanese that do not separate words with spaces. I'm not an expert, but that's how it is.

So, long story short, you will see a lot of instances where the boundary between tokens may not be as easy to find as "look for whitespace". And that also applies to the English language. Take a word like "doing": splitting it into "do" and "ing" may let you extract a little more meaning.
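The point that "whitespace" is broader than the space character is easy to see in Python, where the no-argument `str.split()` treats tabs, newlines, and even the non-breaking space as separators:

```python
# Tab, newline, and non-breaking space (U+00A0) are all whitespace characters.
text = "one\ttwo\nthree\u00a0four five"

print(text.split())      # splits on any whitespace run: five tokens
print(text.split(" "))   # splits only on the literal space: just two pieces
```

The second call misses three of the four boundaries, which is exactly the trap described above.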

Here's the Hugging Face tokenizer playground. If you've never heard about Hugging Face and you're very interested in NLP, this is a great place for all things NLP. The tokenizer playground shows you how different tokenizers work: tokenizers from working NLP systems such as BERT or GPT-4. These are not made up, these are not academic exercises; these are real tokenizers.

So let's go back to the Shrek movie script and grab some text. Okay, here's Shrek. This is how the GPT-4 tokenizer tokenizes the text. Does it make sense to you? Or some sense, at least. You can see that it broke down certain words into subwords in ways that you might not have expected. Every little color in there is one token.

Notice that "rainbow" is really one token; it didn't split it. That makes sense. "Fractured", though, it did split; I would not have expected that. And you have the comma as a specific, individual token. And so on and so forth. Why would I display this? Hopefully, after this class ends, you will have a better picture of why.

Can you screenshot the bottom portion? I'll just do that. Here's a different model's tokenizer, for everyone who wants to compare. There you go: you can see "fractured" is already split here too. Long story short, if you play around with those different tokenizers, you will see that they differ slightly. Why do they differ? The reason they perform slightly differently is not a matter of a specific design; nobody sat down and said, "I want this word to be tokenized that way."

It might be a result of the text that they used to build the tokenizer, and a couple of other factors. All right. Let's get to building a tokenizer. How do you build a tokenizer? So we're past the scenario where we're just using the whitespace; now we'll be looking for boundaries inside words as well. At the core of what you just saw in that tokenizer playground is the idea that whitespace-based tokenization is not enough.

For a number of reasons. I gave you at least one: suffixes, prefixes, and root forms may be useful on their own. The other one is: what about unknown words, the words that your tokenizer, or your system, is not familiar with? How do you make sense of a word you haven't seen before? You will see down the line that one way to approach it, the crude way, is just to reject that word.

Does that make sense? Remember the types, the unique words: the tokenizer will build such a list, and it will include whatever it found in the training text. If that text is very elaborate, with a lot of English, it will probably capture the entire English dictionary just fine. Okay. The other end of the spectrum is character-based tokenization. What is character-based tokenization?

Does that make sense? Yes? Okay. Now, that word list obviously has a limit. At the other extreme, you could just decide that every word, every sequence of characters, is made of individual characters. So you would take a word such as "learning" and break it down into a sequence of characters, essentially. What's the problem with this approach?

Remember, you're tokenizing your text to look at the individual pieces individually. In whitespace-based tokenization, you're looking at words later on and trying to extract meaning from them. Which is fine: you look at the word, the word has a meaning. Now, if you look at an individual character, does it have a meaning? It may have one, but beyond that, not so much. So you would use that breakdown to build words back.

Just like "fractured" was built out of three components here: yes, it's broken down, but ultimately it's representing a bigger word. Okay, so what's the problem with that? Can we use it? Is it easy to look at a sequence of characters and figure out what it's representing? If I give you a sequence of three characters, it can mean a specific word that is three characters long, or it can be part of a ten-character word, a twelve-character word, a four-character word; it's very hard to put it back together later. There's that challenge, and it also creates a ton of combinations, because yes, you will be putting characters back together, and there will be combinations such as "xy", for example, or "yx". Is there a word in English that has "yx" inside? I don't know. You surely don't have anything in English that starts like that.

There's nothing, there you go. But a character-based tokenization would have to consider those kinds of combinations and try to put them back together. Am I making some sense? All right. It seems like this leads to a relatively small vocabulary size: your vocabulary will be just an alphabet and a couple of digits. But really, later on, when you build up on it, every possible combination has to be accounted for, which is difficult.
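Character-based tokenization itself is trivial, which is exactly why its starting vocabulary is so small; the burden moves to reassembling meaning later.

```python
word = "learning"
print(list(word))   # each character becomes its own token

# The whole starting "vocabulary" is just the character inventory of the text
corpus = "low lower newest learning"
vocab = sorted(set(corpus) - {" "})
print(vocab)        # a handful of letters, nothing more
```

A tiny vocabulary, but as the lecture notes, every sequence of those letters now has to be interpreted and recombined downstream.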

The middle ground is to have something called subword tokenization, which will go beyond whitespace tokenization: it will break a word up into subwords. So, what goes in there? Well, you could do it by hand. We already talked about it: you could have a list of typical subwords, and suffixes and prefixes are perfect candidates for it. Does that make sense? Or you could go the other way:

Let the machine figure it out. And this subword tokenization, as you can see right here, is the industry standard right now. It has a lot of benefits, such as handling unknown words, and whatnot. It's a middle ground. It's a bit more complex than whitespace tokenization: building the tokenizer takes a little time versus, hey, just chop where the whitespaces are. But it has the benefits.

So we'll talk about an algorithm called Byte Pair Encoding (BPE), which is a subword tokenizer, and BPE is actually a part of ChatGPT. There will be two components: the token learner and the token segmenter. The first one replaces that language specialist who says, "Okay, these are the subwords that you should be looking for." It will figure out what a token is; it will give you the list of possible tokens.

Would you call this a whitespace tokenizer? I guess so, yeah. It goes a little beyond whitespace, because a hyphen and quotation marks are also treated as boundaries. Take a look right here; there's a different one. So that's how a whitespace tokenizer works. Oh, there you go: I did not have a German example for you last time. Here's a compound noun in German.

My German understanding tells me that this is a word, and this is a word, and so on; you can just build words like that, and whitespace tokenization does not really help you there. Of course you have other problems too. Here's a perfect example: do you want to split "San Francisco"? You shouldn't, right? Because the two parts go together. But we might still break it down and read "San" and "Francisco" as two distinct words.

So again, it can be messy; you will always find a problem case. You can write little custom handlers for those and say, hey, I'll deal with that. Neural network approaches usually handle those kinds of cases. All right, so I promised to explain how BPE works, right? Do you know how Zip works? Compression. Why is it compressing? It's doing text compression.

Remarkably well. All right, so here's the idea. By the way, because sometimes students confuse this: BPE is not a compression algorithm, but it is built on the same idea. Let's start with the compression. Here's a string; count it, eleven characters. Now, long story short, I introduce another unique character, and I will remember that it means two a's.
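The compression idea can be shown in two lines. The eleven-character string below is an assumed example, since the transcript does not spell out the exact string used in class: a fresh symbol "Z" stands in for the frequent pair "aa".

```python
s = "aaabdaaabac"                      # an eleven-character string
compressed = s.replace("aa", "Z")      # "Z" is a new symbol meaning the pair "aa"
print(compressed, len(s), "->", len(compressed))
```

The text shrinks from eleven symbols to nine, at the cost of remembering one dictionary entry ("Z" means "aa"); BPE borrows this pair-replacement step but keeps the merged pair as a vocabulary item instead of compressing.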

That's the training set. This is the text that is being used by the learner to see the patterns. Once those patterns are discovered, your segmenter is deployed, and it applies them for you on any new text that comes in. I'm going to show you how it works on an example, a very artificial example. Here is a training corpus. I have not explained what a corpus is just yet, but you can guess.

There's not much to it. It's a textual dataset; for our purposes, a large body of textual data. This one is by no means large, and it's pretty clunky because it's just, you know, a repetition of some words. But that repetition is there for a reason: so you can easily see some patterns that you would find in English. And there will be a test set. This is not something that the learner will use.

It will be used later on to actually do the job in practice. Okay. Here's our text. The first step is to decide on the parameter K. The parameter K specifies the number of merges. A merge is fusing two different characters together to produce a new one. In the compression context I showed you, "aa" was replaced by "Z"; that would be considered a merge, except here we will not be getting a brand-new character. The two characters, a and a, will become one new symbol, "aa".

It's still two characters, but treated as one. Does that make sense? So you have to pre-specify the number of merges; eight is obviously very little, we're not doing a big one. Okay. K ultimately determines the size of your dictionary, your vocabulary. So if you say that your K is 60,000, for example, your GPT will work off a dictionary of roughly 60,000 tokens, no more, no less. Of course, if there are fewer words or subwords in the text it was trained on, you will not hit 60,000.

The English language has hundreds of thousands of unique words; add subwords, suffixes, prefixes, digits, characters, special characters and whatnot, and you easily get there. Does that make sense? Good. Here we're doing eight. Your first step is to distill your corpus into its unique characters, to specify the initial vocabulary. All right, ignore the whitespace character; just look at the words in this text.

And from that, we will extract a set of individual characters. Does that make sense? That text right here is made solely of those characters. It's not the full English alphabet; we're just doing it for demonstration purposes. Your actual starting vocabulary for a regular text would include exclamation marks, emojis, and whatnot. That's step number one. Step number two is very important.

Add a so-called end-of-word token to the vocabulary. In our case it will be an underscore; it can be anything that is not already in there, but typically it's an underscore. What is that token for? It will be used to indicate the ends of words. Why would that matter? Why highlight the end of a word? Take "er". This is a very common suffix in English, right? It means something.

There's an "er". Is an "er" inside a word the same, meaning-wise, as the "er" at the end? There you go: you specifically want to make a distinction between them. Some words just happen to contain "er", and then there's the "er" suffix. Okay. Once you have that, do a whitespace-based tokenization. Yes, I told you that we're building a subword tokenizer, but the first step is to just split into individual words.

So far so good. You do have this vocabulary that we already created, with the end-of-word token. All right, now add that end-of-word token at the end of every individual word. Then do the character-based tokenization; your characters are this alphabet. Once you do that character-based tokenization, you have just sequences of loose characters. But because we did the whitespace tokenization first and we added the end-of-word token, we remember where the ends of the words were, where the separations were.

Does that make sense? So far so good? Okay, now for the K merges. Look, this is your collection of bytes. You will be looking for the byte pairs that are most common. Look for the unique byte pairs in this corpus: any unique two-character sequence. List them. I'm just going through it: there is l-o, there is o-w, there is w-underscore, and so on. If you have, for example, an e-r here, and then you stumble on another one, you don't count it twice.

Just unique combinations. There it is. So far so good. All right, and then we count their frequencies. How often do you see each pair? One, two, three, four, five, six, seven for l-o. And these two, e-r and r-underscore, are the most frequent. So you want to merge the most frequent pair first. But we have a tie. How do you resolve that tie? How would you resolve it?

By convention, here you go for the first one that you encounter scanning the text left to right. We see e-r before r-underscore, so e-r will be our candidate for the merge. All right. I know it might be challenging to grasp: the two characters are still two characters, but you treat whatever came out of the merge as one character. That "er" right there is your new character.

This is kind of the equivalent of that "Z" replacing "aa". And then you update your vocabulary: this is a new entry to our vocabulary. This is a set, and I know everyone in this room is super mathematically inclined: does the ordering matter in a set? No. But I will be adding entries in the sequence I encountered them. Okay. Now, the last step for this merge is to replace every occurrence of a separate e and r with the merged "er".

So we will end up with "er" everywhere an e came before an r. Okay, and you repeat the process. That was i equals one, the first merge. Now we're doing the second merge, same process. You identify your unique pairs, but now you have w-n, e-r, and so on; "er" is treated as a single character. You count them again. Now we have no tie: the winner is "er" plus underscore. This is how we will learn the "er_" token.

So these two are completely different: this is the "er" inside a word, and this is the "er" suffix. Does that make sense? Notice this is an automated process; there is no linguist driving it. And immediately, because there are so many instances of "er" in this text, we have already stumbled upon an important English suffix that will become part of the vocabulary. Okay, and you just repeat that.

We were supposed to do K merges, so the next merges you can try yourself for practice. n-e becomes "ne", then "ne" plus "w" becomes "new". Now this is interesting: we have just discovered an actual English word. We put it together because it's so common that it becomes part of the dictionary. Now, I set K to be eight, right? So it will stop after all these merges. We found "low", "newer", and so on; you see the rest.

Why that K? That's a practical issue, right? You want to have a manageable dictionary to work with. You don't want it to be too large, because then you have more unique tokens to deal with. You don't want it too small, because then you will not capture the variation in the vocabulary and the actual words. I don't know what, specifically, OpenAI used, or how they chose it, so I don't have a precise answer.

But I think the usual approach is to use the dictionary size of the language, English in this case, as a starting point, and then adjust a little. Once that number is fixed, it's also fixed in your large language model, and it affects other components of that model. Does that make sense? I'll try to look up what the procedure was for choosing it; I'm pretty sure it's trial and error to some degree.

I'll definitely get you an answer. Okay. So, do you see how it works? Great. After this process is done, our vocabulary is a larger set: not only individual characters, subwords, prefixes, and suffixes, but we also have specific whole words. Again, if K was high enough, we would capture every word in the corpus. Now we're ready to use it for subword segmentation.

Approach to... Using individual characters. So you look at the mirror,Copy and learn. If you recall my test set Was too much. It's newer and lower. We don't have a lower here. You do have a new one, right? The second anchor is--Now ready to-Submit those words. Newer and lower there are again broken into individual characters to stuff word hereIs being added. And then you're lookingGreetings. For pairs that you have an equivalent So there is an ER, you are going to fuse ER and you are going to fuse later on ER to ER underscore until you are done, until you cannot.

If you make any more of these merges or fuses, Again, does that make sense? There is a new reward. This newer will be actually fused into the entire newer. Lower, on the other hand, will be left as the underscore our later job because you don't have them in your regular text. Does that make sense?

Let's see our vocabulary here. This is my vocabulary; let me copy-paste it in here and clean it up a little bit. Here's your "newer" broken down into individual characters. You look at the pair n-e: do I have anything that I can replace n-e with? Yes, but we have e-r first; it was the more frequent one, so you have to remember which merge came first, which is why I put them in order, right?

So n-e is not merged first. I greedily grab e-r, done; that becomes "er". Then I go through it again and find the next applicable merge: I find n-e and fuse it, then n-e-w produces "new", and then "new" and "er" are fused, and "er" with the underscore. Does that make sense? So the ordering we came up with matters: because e-r was the most frequent, I want to apply the merges in that same order later on.
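The greedy segmentation pass walked through above can be sketched as follows. The merge list mirrors the lecture's example, learned in frequency order, but should be treated as illustrative; applying the merges in learned order reproduces the "newer" versus "lower" behavior.

```python
def segment(word, merges):
    """Apply learned BPE merges, in learning order, to one word."""
    symbols = list(word) + ["_"]          # character split plus end-of-word token
    for a, b in merges:                   # earlier (more frequent) merges first
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]    # fuse the pair in place
            else:
                i += 1
    return symbols

# Merges in the order the lecture learns them (illustrative subset)
merges = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("l", "o"), ("lo", "w"), ("new", "er_")]

print(segment("newer", merges))   # fully fused: it appeared in training
print(segment("lower", merges))   # unseen word falls back to learned subwords
```

"newer" collapses all the way to a single token, while "lower" comes out as "low" plus the "er_" suffix, exactly the behavior described in the walkthrough.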



Text Processing & Pre-Processing

Character Encoding

ASCII

ASCII is the basic encoding scheme for English-language text processing, and it became the foundation for later extended encoding schemes.


ISO 8859-1 (Latin-1)

Allows representation of an extended character set beyond ASCII.

However, it has limits when it comes to representing all of the world's languages.