Summary
[]: Character variations (e.g., [wW]oodchuck matches "Woodchuck" or "woodchuck")
|: OR operator (e.g., woodchuck|groundhog)
^: Negation when inside brackets (e.g., [^A-Z] matches non-uppercase)
?: Optional character (e.g., colou?r matches "color" or "colour")
.: Wildcard matching any single character
\: Escapes special characters to use them literally
\d: Any digit
\D: Non-digit
\w: Alphanumeric and underscore
\W: Non-alphanumeric
\s: Whitespace
\S: Non-whitespace
^: Start of string
$: End of string
\b: Word boundary
\B: Word non-boundary
Notes
Transcript
There you go. There's the outcome of this thing. It doesn't make much sense, right? But it can be used for a very simple task where you don't care about the elements. A similar approach is to lemmatize the text. I will explain what it is, but first let me show you how that would work.
So some lemmatizers would change "went" into "go". Because if you look in the dictionary, I'm pretty sure you would not find the full entry under "went", right? You would be redirected to "go", and then under "go" you would have the full description. What lemmatizers are doing is exactly that. They find the word and they look for its lemma, which is the base word from which the word in question is derived.
That's easy. It becomes "across". Notice how we're not cutting off the "s" as a stemmer would do. It's pretty much a lookup scenario. Okay, "groceries": the corresponding lemma is "grocery". Now, is this process of lemmatization going to be faster or slower than stemming? Slower, absolutely, because you have to do a lookup, you have to pretty much go... Oh, here we go again.
You have to go through one case in order to get through all of them. You're fine with stemming? You're not going through all the words in the dictionary; you're applying a rule and cutting. Do not confuse a stem with something that is called a base word, or lemma. Lemma, stem: different concepts. Can you end up with the same stem as the lemma? It absolutely must be "go", right?
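To make the stem-versus-lemma contrast concrete, here is a minimal toy sketch. The suffix list and the mini-dictionary are hypothetical and exist only for illustration; real stemmers and lemmatizers use much richer rules and vocabularies.

```python
# Toy illustration only: real stemmers and lemmatizers are far more complete.
SUFFIXES = ("ies", "es", "s", "ing", "ed")  # hypothetical rule list

# Hypothetical mini-dictionary mapping word forms to their lemmas.
LEMMA_TABLE = {"went": "go", "groceries": "grocery", "across": "across"}


def toy_stem(word):
    """Cut off the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ("went", "groceries", "across"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer only applies a rule and cuts, so "across" loses its final "s", while the lookup keeps it intact and maps "went" to "go".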
Can you come up with a scenario where you would prefer a lemmatizer over a stemmer? I think lemmatizing would be better for a chatbot and stemming for specific code. Since, I don't know, a legal document or medical records or something may have a very particular structure to the language they use, or spelling. Stemming would be more efficient because you don't need the benefits in that context, but when someone's writing to a chatbot, they might be making a bunch of typos and random errors that can be different from person to person.
So I think that lemmatization makes it more of a box connection. Good.
Here's another scenario. Remember when I was showing the NLP pipeline? Steps and steps, and every step takes the previous step's output, where some processing has already happened, and you have to use it for every other processing step. Here, we do stemming like this. And that text on green would be past forms, right? Now, what about the stems? Are you going to extract some meaning from that token, because it's normally a word, right?
You might want to, you could try to revert it back to the original word, right? You know what that is: wasting space. And long story short, in general, when you don't need the precise meaning of that word later on, then a stem might be okay. You're just counting. There are algorithms that simply count tokens; they don't care about the meaning, the form, the inflection, whatever. Then you can get away with it.
It may not exactly match every lemmatizer. You got the picture, right? Stemming: cut off the suffix, a known suffix, and whatever's left you keep, and it may not be a word. A lemma is always a word. This is what your token is being reduced to. Always a word. Okay, stop words: we already talked about stop words. This is sort of a by the way.
What you could do as part of your pre-processing: stemming and lemmatization standardize your text, right? For every different form of a word, you get the same stem or lemma. Another way to standardize is, for example, to make sure that every "don't" becomes a "do not", or every "do not" becomes a "don't", so you don't have two versions of it and you have one token across the board, and so on and so on.
Numerals, digits, same thing. If your downstream task relies on counting, specifically counting tokens, standardize them, because otherwise "nine" and "9" are two tokens considered to be completely different. Language detection: you've seen it, right? Specifically with Google Translate, for example: you type something in a different language, and it tries to guess what that language is first.
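A minimal sketch of that kind of standardization before counting. The two lookup tables are hypothetical and tiny on purpose; a real pipeline would need far larger ones and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical lookup tables, for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}
NUMERALS = {"nine": "9", "two": "2"}


def normalize(text):
    """Lowercase, expand contractions, tokenize, and map numerals to digits."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [NUMERALS.get(tok, tok) for tok in tokens]


raw = "I don't have nine dogs. I do not have 9 cats."
counts = Counter(normalize(raw))
print(counts["9"], counts["not"])
```

After normalization, "nine" and "9" are counted as one token, and so are "don't" and "do not".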
It could be wrong, but it would at least try. It looks for specific words along the way. Code mixing: this one is tricky. What it means is trying to detect something like "This movie was no bueno." Some people speak like that sometimes, and for the computer it might be a problem. If you're dealing with that sort of material, you might want to handle code mixing. And transliteration, okay? Cyrillic versus Latin alphabet. You have to, if you want, use a standard.
Here's the first one, which shows you how to do pattern matching with regular expressions, some basic stuff. There are two common packages for regular expressions in Python. The one that I'm using here is very easy to use. This is our text, just lowercase. This is a pattern, okay? So this is the tough part when it comes to regular expressions, because you have to specify a pattern. I'll give you some examples.
And the good news for you is I will definitely not be quizzing you on "hey, come up with a regular expression pattern for this or that." Do it yourselves. Besides, nowadays you can just ask ChatGPT for the pattern. Well, listen, for the basic ones, let me give you the lay of the land. Okay. Here's a couple. You see those square brackets? This is the pattern, okay?
Square brackets indicate some sort of variation that we're looking for. In this case, you're looking for all the "woodchuck" instances that start with either lowercase w or uppercase W. This one means that you're looking for a single digit, right? One of those. So when you have a little bit like this in your pattern, it means look for either lowercase w or uppercase W in front of "oodchuck". This means look for either 1, 2, 3, 4, and so on.
It's an OR, right? It's just that every single one of those characters is an option. You can actually simplify that, so this can be rewritten as that, as a range. Makes it much simpler. This is a similar pattern: it looks for any uppercase letter. Another one is using the pipe symbol. So that should be familiar to you. It's an OR. All right, so look for either a woodchuck or a groundhog.
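The bracket and OR patterns above can be tried out with Python's built-in re module. This is a small sketch; the sample sentence is made up for illustration.

```python
import re

text = "Woodchuck or woodchuck? A groundhog ate 7 apples."

# Square brackets: either a lowercase or an uppercase first letter.
either_case = re.findall(r"[wW]oodchuck", text)

# A single digit, written out in full and then as a range.
digits_long = re.findall(r"[1234567890]", text)
digits_range = re.findall(r"[0-9]", text)

# Any single uppercase letter.
uppercase = re.findall(r"[A-Z]", text)

# The pipe means OR between whole alternatives.
animals = re.findall(r"woodchuck|groundhog", text)

print(either_case, digits_range, uppercase, animals)
```

Note that the last pattern is case-sensitive, so the capitalized "Woodchuck" is not among its matches.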
Where is mine? Let's go to something more sophisticated. There is a caret symbol that is used sometimes, and when it's used inside brackets, it means negation. So that means that you're looking for, in this case, a three-character-long sequence not starting with a capital letter. Does that make sense? So "aOD" would work, but "AOD" would not work; it would not be a match.
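A minimal sketch of the negation case. The exact pattern from the slide is not in the transcript, so the one below (a non-uppercase character followed by any two characters) is an assumption that fits the description.

```python
import re

# Assumed pattern: one character that is NOT A-Z, then any two characters.
pattern = re.compile(r"[^A-Z]..")

lower_start = pattern.fullmatch("aOD")   # starts with a lowercase letter
upper_start = pattern.fullmatch("AOD")   # starts with a capital letter

print(lower_start is not None, upper_start is not None)
```

Keep in mind that [^A-Z] accepts anything that is not an uppercase letter, including digits and punctuation.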
Yes, meaning that you are not looking for an x. So far so good. Looks complicated. This question mark can be used as the marker of optionality. So this is actually a very cool example of handling two different versions of English. That question mark shows up after a character that we consider optional. It may or may not be there. "Color", "colour", and so on.
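As a quick sketch of the optional character, using Python's re module and a made-up sentence:

```python
import re

# The "u" is optional, so both spellings match.
pattern = re.compile(r"colou?r")

text = "The color red and the colour blue."
spellings = pattern.findall(text)
print(spellings)
```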
We've already seen a dot, a period. This is a wildcard, so it represents any single character. There are some additional cool operators that are sort of shortcuts: backslash d means any digit, backslash capital D means the opposite, any non-digit. This becomes really... well, here's whitespace, actually. This one is different. We talked about whitespace: there are quite a few different characters to capture in there.
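A short sketch of the shorthand classes with Python's re module. The sample string is made up and contains a tab to show that whitespace covers more than the space character.

```python
import re

text = "Room 42\tis open"   # note the tab between "42" and "is"

digits = re.findall(r"\d", text)        # any digit
whitespace = re.findall(r"\s", text)    # spaces, tabs, newlines, ...
words = re.findall(r"\w+", text)        # runs of letters, digits, underscore
only_digits = re.sub(r"\D", "", text)   # drop every non-digit

print(digits, whitespace, words, only_digits)
```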
So if you want to actually look for a period, or look for a question mark, use a backslash to let regex know that you're actually not using it as an operator here. All right, some more sophisticated aspects of regex: anchors. As opposed to just looking for a specific pattern anywhere, you can specify where you are looking for the pattern in the string. Are you looking to find that pattern at the beginning or at the end of the string, right?
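Escaping and anchors can be sketched as follows, again with Python's re module and made-up sample strings. The word-boundary anchor from the summary is included for comparison.

```python
import re

# Escaped: match literal periods and question marks only.
periods = re.findall(r"\.", "Wait. What? Go.")
questions = re.findall(r"\?", "Wait. What? Go.")

# Unescaped: the dot is a wildcard and matches every character.
wildcard = re.findall(r".", "a.b")

phrase = "the end of the"
starts_with_the = bool(re.search(r"^the", phrase))   # anchored at the start
ends_with_the = bool(re.search(r"the$", phrase))     # anchored at the end
starts_with_end = bool(re.search(r"^end", phrase))   # "end" is not at the start

# \b restricts matches to whole words.
sample = "the other theme, the end"
anywhere = re.findall(r"the", sample)
whole_words = re.findall(r"\bthe\b", sample)

print(periods, questions, wildcard, len(anywhere), len(whole_words))
```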
Backslash b, it means the word boundary, doesn't it? So what I showed you already was finding a pattern. Okay. Answering the question: do I have something that matches this pattern in my text? But that's not all that you could do with regex. Here's a little bit of text. This is the pattern: we're looking for the word "the". The r means that we're dealing with a raw string.
There is a useful re package function, findall. All right, so we have all of that. This is the result. It will give you a list of instances. What's the benefit of it? You're counting, right? You've seen it, right? You're searching a website, you're searching your Word document with Control-F, and it will give you how many instances of something there may be in your text.
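A minimal sketch of counting matches with re.findall. The sentence is made up; the pattern for the word "the" is written as a raw string, as in the lecture, with word boundaries added as an assumption so that "other" is not counted.

```python
import re

text = "The cat sat on the mat while the other cat watched."

# findall returns a list of every non-overlapping match.
matches = re.findall(r"\bthe\b", text)
count = len(matches)

# The same search ignoring case also picks up the sentence-initial "The".
matches_any_case = re.findall(r"\bthe\b", text, flags=re.IGNORECASE)

print(count, matches_any_case)
```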
We're doing replacement. I've got a "woodchuck" in there. There's a command, sub, for substitute, right? A pattern to match, and the replacement: what are you going to replace the match with? Does that make sense? Find, search and replace, find and replace, same thing. All right. Python has its own split function, obviously, but re has its own split function that is more sophisticated, because it will split based on a pattern.
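Here is a small sketch of re.sub and re.split next to the plain string split. The sample strings are made up for illustration.

```python
import re

text = "How much wood would a woodchuck chuck?"

# Find and replace: pattern, replacement, text.
replaced = re.sub(r"woodchuck", "groundhog", text)

messy = "a,b;c  d"
plain_split = messy.split(",")               # str.split: one fixed separator
pattern_split = re.split(r"[,;\s]+", messy)  # re.split: any run of , ; or whitespace

print(replaced)
print(plain_split, pattern_split)
```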
Someone misspelled "giraffe" and wrote "graffe". That would be one possible application: spelling correction. What about typos? Homophones. Anybody know what a homophone is? It's a word that sounds exactly like some other word but is spelled differently. So these definitely need context to be figured out, right? Otherwise there's nothing wrong with those words; they're correct words.
We need a vocabulary. And then context, I guess. If your whole query is like "the giraffe had..." you would be able to figure it out. Through the bag-of-words representation? We don't have context there. Context we definitely have. But if you go off of the number of steps it takes to... There you go. So you also mentioned having a vocabulary, right? So it's a reference that you can go through.
Okay. What are the words that are closest to it? Right, right. And you go through your dictionary and you come up with a bunch of possibilities, and that's probably not the whole list, right? Because this starts with an assumption that the g is in the right place, but it may not be. There may be a whole lot wrong with this. There is going to be a list, quite a long list, of possibilities. You have a vocabulary of English, right?
There are other places where you would want to know how different one string is from another. Same idea here. How do we measure the difference between two strings? Minimum edit distance is one way to do it, and it comes down to counting: how much work do you have to do to turn one into the other? So let's say that we are trying to compare... Okay, our hypothesis is that the word we're looking for is "giraffe". Does that make sense?
We've already gone through those. We'd probably be looking at others downstream, but we're focused on "giraffe" right now. This is our function. So the idea here is to measure how much work I have to perform to turn "graffe" into "giraffe". And you have some options here. Depending on the string, or token, length, there may be various ways to do it. Here are the possible ways. We can insert. That would be a way to turn it into "giraffe".
We can delete. You get the picture. Okay. We can insert a character, delete a character, or substitute a character. So you have three possible actions: delete, substitute, and insert. Counting how many of those you require to align the two will give you the number of edits necessary to turn one string into another. So, I'm going to make it up, but at some point in the process, let's say that this requires two steps, right?
Insert c, and substitute with u; everything else stays. We perform only one, two, three, four, five operations. Cost: five, to move from one to the other. That count is called an edit distance. However, this edit distance is calculated under the assumption that all deletions, insertions, and substitutions are worth the same. There's another approach to it.
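The five operations can be replayed by hand. The transcript only names the last two (insert c, substitute with u); the first three steps below follow the common textbook alignment of "intention" and "execution" and are an assumption.

```python
word = "intention"
steps = [word]

word = word[1:]                    # 1. delete "i"              -> "ntention"
steps.append(word)
word = "e" + word[1:]              # 2. substitute "n" with "e" -> "etention"
steps.append(word)
word = word[:1] + "x" + word[2:]   # 3. substitute "t" with "x" -> "exention"
steps.append(word)
word = word[:3] + "c" + word[3:]   # 4. insert "c"              -> "execntion"
steps.append(word)
word = word[:4] + "u" + word[5:]   # 5. substitute "n" with "u" -> "execution"
steps.append(word)

operations = len(steps) - 1
print(" -> ".join(steps), "| operations:", operations)
```

With every operation costing 1, this path costs 5. If substitutions were priced differently from insertions and deletions, the same path would get a different total.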
Is this the only way to turn "intention" into "execution"? You could actually start by substituting i with e, right? It's possible. The longer your strings are, the more options you have. And it really becomes a bit of a search problem, right? You can solve that problem by building a search tree where every branch is an action (deletion, insertion, substitution), and you have to determine whether it will lead you to a goal state, which is "execution".
Somewhere here is "execution", and that path has a certain length, and that will be your edit distance. Of course, you can have a different path that leads to "execution", and maybe it's a shorter one. And you know how tree searches work already. So, sounds like a lot of work, right? Do you have any idea how to reduce the search?
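One way to picture that search tree is a plain recursive function that tries all three actions at every step. This is a sketch under the assumption that every action costs 1; the function name is made up, and this is deliberately the slow version.

```python
def naive_edit_distance(source, target):
    """Explore the search tree: up to three branches at every node."""
    if not source:
        return len(target)   # only insertions are left
    if not target:
        return len(source)   # only deletions are left
    if source[-1] == target[-1]:
        # Last characters agree: no edit needed for them.
        return naive_edit_distance(source[:-1], target[:-1])
    return 1 + min(
        naive_edit_distance(source[:-1], target),       # delete
        naive_edit_distance(source, target[:-1]),       # insert
        naive_edit_distance(source[:-1], target[:-1]),  # substitute
    )


print(naive_edit_distance("graffe", "giraffe"))
print(naive_edit_distance("intention", "execution"))
```

The same pair of shorter strings is reached through many different branches, so a lot of the work is repeated.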
Is it possible that parts of that tree are repeated? In some subtree you did the same work as in some other subtree. Where is this going? Is there an algorithmic approach or a tool that is used when you are repeating the same work? Is anybody familiar with dynamic programming here? Not at all? Well, for that purpose, I have a little refresher for you. Pretty simple. Here's a typical dynamic programming approach.
So instead of recalculating them, I could just look them up. Yeah. I get my answers by simply adding things that I've already computed. The results are stored already. Does that make sense? That five is that four and that three, right? That three, we calculated it somewhere else. We can reuse that. Dynamic programming.
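The refresher slide is not in the transcript. "That five is that four and that three" suggests a Fibonacci-style recursion, so the sketch below assumes that example and uses Python's functools.lru_cache to store results that were already computed.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Each value is computed once, then looked up on every later request."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)   # fib(5) reuses fib(4) and fib(3)


print(fib(5), fib(30))
```

Without the cache, fib(3) would be recomputed inside both fib(5) and fib(4); with it, the stored value is reused.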
We're using little subproblems to solve a bigger one: the solutions to the subproblems give the solution to the bigger problem. Can we use dynamic programming for our minimum edit distance? Yes, we can do that. Let's start with some formalism. Okay, we have two strings. They do not have to be the same length for us to align them. Does that make sense? And we go from source to target through substitution, insertion, and deletion, right? For our dynamic programming approach, we will use something that amounts to a sub-edit distance: an edit distance for a sub-part of our problem.
D(i, j), which is how much editing we have to do to turn the first i characters of the source into the first j characters of the target. Does that make sense?
You want the answer: how many edits are required to get from the first string to the second. For minimum edit distance with dynamic programming, we will have to build a table. I'll show you how. Step number one, setup. Well, here's what I did. Take the source and target lengths and set up a corresponding matrix with plus one in each direction: plus one cell. If you end up looking at other sources on how it's done, that matrix could be flipped.
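The table setup and fill can be sketched as below. The function name and the sub_cost parameter are made up for illustration; sub_cost=1 treats all three operations as equally expensive, and sub_cost=2 is one common alternative weighting, included as an assumption.

```python
def edit_distance_table(source, target, sub_cost=1):
    """Fill a (len(source)+1) x (len(target)+1) table of sub-edit distances."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # First column: delete every source character. First row: insert every target character.
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # deletion
                D[i][j - 1] + 1,                              # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return D


table = edit_distance_table("intention", "execution")
print(table[-1][-1])                                                   # all costs equal
print(edit_distance_table("intention", "execution", sub_cost=2)[-1][-1])
```

Here rows follow the source and columns follow the target, with the answer in the last cell; other presentations flip or rotate the table.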
You can get here in three different ways. That means that you have three possibilities right here: deletion, insertion, and substitution. So, to make a note for the future: if you're interested in actually turning one string into another, you have to keep something that is called a backpointer. Here I have backpointers because I could go back through the previous cells.
Now, how do we know how to turn one into the other? So there are two questions, really, that this table can answer. How many edits do I have to do? And the other question is: what is the sequence of those edits, if you were to transform one into the other? To answer that second question, you go to the upper right corner here and you follow the backpointers. That section is easy because I don't have any options.
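A minimal sketch of keeping backpointers and following them back from the final cell. It assumes all operations cost 1 and breaks ties in favor of the diagonal move; the function name is made up. In this layout the final cell is the last row and last column, which corresponds to the corner mentioned above when the table is drawn in a different orientation.

```python
def edit_script(source, target):
    """Return (distance, list of (operation, source_char, target_char))."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "delete"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "insert"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            options = [
                (D[i - 1][j - 1] + (0 if same else 1), "match" if same else "substitute"),
                (D[i - 1][j] + 1, "delete"),
                (D[i][j - 1] + 1, "insert"),
            ]
            # min with a key keeps the first option on ties (diagonal first).
            D[i][j], back[i][j] = min(options, key=lambda option: option[0])

    # Follow the backpointers from the final cell to the origin.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op in ("match", "substitute"):
            ops.append((op, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif op == "delete":
            ops.append((op, source[i - 1], ""))
            i -= 1
        else:
            ops.append((op, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return D[n][m], ops


distance, script = edit_script("graffe", "giraffe")
for op in script:
    print(op)
```

The table answers "how many edits", and the backpointers answer "which edits, in which order".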
It's a table, and we're just populating a table of size 10 times 10. N squared.
Whatever the longest one is, right? Let's just make them the same, right? But it's going to be M or N. Is it going to be greater than n times m? Well, let's make it m = 10. So 3 to the power of 10.
10 times 10 is 100. 3 to the power of 10 is much more. You get the picture. If you went through CS430 and did dynamic programming, and you looked at it and asked "why are we doing that?", here is the answer.
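The arithmetic behind that comparison, as a tiny sketch. It takes the lecture's rough figures at face value: three choices at each of 10 steps for the search tree, and a 10-by-10 table for dynamic programming.

```python
n = m = 10

tree_paths = 3 ** n    # three actions to choose from at each of n steps
table_cells = n * m    # one cell per pair of positions

print(tree_paths, table_cells, tree_paths // table_cells)
```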
Information Retrieval
Sentiment Analysis / Topic Modeling