Lecture Script

Summary

Overview of Natural Language Processing

NLP aims to enable machines to understand and process human language
Key goals include language understanding, prediction, and interactive communication with machines
From a user perspective, NLP should provide tools that understand natural language without users needing to understand the underlying technology
From an engineering perspective, NLP involves building applications using various algorithms, approaches, and data depending on the specific task

NLP Applications Discussed

Students identified several familiar NLP applications:
- Automated phone call systems and conversation automation
- Grammarly for grammar checking and correction
Other major NLP tasks covered:
- Language models
- Text classification
- Sentiment analysis
- Information extraction and information retrieval
- Conversational agents
- Text summarization
- Question and answer systems
- Machine translation
- Topic modeling

Topic Modeling and Text Classification

Topic modeling works similarly to clustering - grouping documents based on patterns without predefined categories
Example use case: automatically categorizing documents into sports news, novels, short stories, letters, etc. based on content
Live demo of sentiment analysis showed how text is classified with numerical scores (e.g., -1 to +1 scale for positive/negative sentiment)

Sentiment Analysis Deep Dive

Basic approach involves identifying words with negative or positive connotations and weighting them
Simplest method: maintain lists of positive and negative words and count their occurrences
Limitation: simple word counting struggles with context
- Example: "positively horrendous" would be misclassified as positive
- Negations like "not great" repeated multiple times could be incorrectly classified
- Phrases like "no redeeming qualities" vs "only redeeming qualities" require contextual understanding
Data sources: reviews with scores (e.g., Rotten Tomatoes) provide pre-labeled training data
Statistical approaches can offset effects of outlier reviews
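
The word-counting method above can be sketched in a few lines. This is a minimal illustration, not the model behind the live demo: the word lists and the `lexicon_score` function are hypothetical, and the score is simply (positives - negatives) divided by the number of matched words, so that it falls on the same -1 to +1 scale.

```python
import re

# Hypothetical, tiny word lists; a real lexicon would be far larger.
POSITIVE = {"great", "good", "happy", "redeeming", "fun"}
NEGATIVE = {"boring", "bad", "horrendous", "awful", "dull"}


def lexicon_score(text: str) -> float:
    """Return a score in [-1, 1]: (positives - negatives) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


print(lexicon_score("A boring, awful movie"))                     # -1.0
print(lexicon_score("This was not great, not great, not great"))  # 1.0
print(lexicon_score("no redeeming qualities"))                    # 1.0
```

The last two calls show the limitation discussed above: negation is invisible to a pure count, so clearly negative text receives the maximum positive score.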

Information Extraction

Live demo showed extraction of named entities (people, organizations, locations, dates) from text
Example results: correctly identified dates and some proper nouns, but misclassified "Mass Effect" (a game) as an organization
Process is more sophisticated than simple classification - involves identifying and labeling specific pieces of information
Modern systems typically use neural networks rather than simple lookup tables
Approach discussed:
- First identify part of speech (e.g., proper nouns)
- Then classify the entity type
- Can use lookup tables for entities with finite lists (countries, weekdays, numbers)
- More challenging for dynamic categories like company names
Applications:
- Calendar integration and command execution
- Analyzing reviews and sentiment for businesses
- Many current AI assistants use this technology
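
The lookup-table idea for closed categories can be sketched as follows. The tables, labels, and the `extract_entities` function are hypothetical and deliberately incomplete. The sketch skips the part-of-speech step and is not how the demo system works, which more likely relies on a trained model.

```python
import re

# Hypothetical, deliberately incomplete lookup tables for closed categories.
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}
COUNTRIES = {"france", "japan", "brazil", "canada"}


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Label tokens found in the lookup tables; everything else is ignored."""
    entities = []
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        key = token.lower()
        if key in WEEKDAYS:
            entities.append((token, "DATE"))
        elif key in COUNTRIES:
            entities.append((token, "LOCATION"))
        elif token.isdigit():
            entities.append((token, "NUMBER"))
    return entities


print(extract_entities("Let's meet this Friday in France at 3"))
# [('Friday', 'DATE'), ('France', 'LOCATION'), ('3', 'NUMBER')]
```

Open-ended categories such as company names, or multi-word names like "Mass Effect", have no finite table to consult, which is why they are the hard cases.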

Other NLP Applications

Question Answering and Summarization: Systems can process transcripts and extract relevant information to answer specific queries
Information Retrieval: Search engines (like Google) scan billions of documents and return relevant results - not a trivial scanning process
Open vs. Closed Domain Conversational Agents:
- Open domain agents can discuss any topic
- Closed domain agents are limited to specific contexts (e.g., company customer service)
- Current state: Still not fully satisfactory, but improving rapidly
- Gemini noted as having the most natural human-like voice interactions, though still imperfect

Levels of Language Analysis

Linguistics divides language into hierarchical levels with increasing complexity:
- Phonetics (sound) - mostly skipped in this course
- Morphology (word structure)
- Syntax (sentence structure)
- Semantics (meaning)
- Pragmatics (context and intention)

The NLP Pipeline

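Fundamental NLP pipeline for building applications:
- Understanding: split text into words and sub-words, parse sentences, then distill meaning and context
- Reasoning: act on everything discovered along the way
- Generation: turn the result (e.g., a date such as 1956) back into natural-language text, reversing the understanding steps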
The course will focus on the understanding and generation stages, not the reasoning stage (which is application-specific)
Modern systems like ChatGPT perform many of these stages in a unified manner

Text Preprocessing: Stemming and Lemmatization

Morphological Analysis: Examining word parts (prefixes, suffixes, roots)
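Removing Stop Words: Filler words carry little meaning, so a sentence can usually still be understood after they are removed
Why Reduce Word Forms: Each distinct form is a separate dictionary entry with its own numeric ID; mapping forms such as happy, happier, happiest to one base shrinks the dictionary, which reduces space complexity and, as a byproduct, processing time
Trade-offs: Reduction loses information ("happy" and "happiest" mean different things), so whether to apply it depends on the task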
Good news: Modern neural network-based models can handle multiple word forms and extract relevant patterns
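
A minimal sketch of how reducing word forms shrinks the dictionary. The stop-word list, the suffix rules, and the functions `naive_stem` and `build_vocab` are hypothetical and far cruder than a real stemmer or lemmatizer. They are only meant to show the effect on vocabulary size.

```python
import re

# Hypothetical stop-word list and suffix rules, chosen only for this example.
STOP_WORDS = {"i", "am", "the", "a"}
SUFFIXES = ("iest", "ier", "ing", "est", "er", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least two letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: -len(suffix)]
            # "happier" / "happiest" lose "ier" / "iest" and regain the "y".
            return stem + "y" if suffix in ("ier", "iest") else stem
    return word


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Give each distinct token an integer ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


text = "I am happy watching Shrek, happier watching Shrek 2, happiest going again"
tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
stems = [naive_stem(w) for w in tokens]

print(len(build_vocab(tokens)))  # 7 distinct surface forms
print(len(build_vocab(stems)))   # 5 distinct stems
print(naive_stem("went"))        # "went": irregular forms need lemmatization
```

The smaller vocabulary is the space saving. The cost is that "happy", "happier", and "happiest" become indistinguishable, which matters if the task is to find when someone is happiest.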

Handling Unknown Words

Example discussed: How to handle made-up words, misspellings, or slang terms
Modern neural networks (like ChatGPT) can understand text even with significant misspellings - handling words spelled with only half their letters
Strategy: Find closest equivalent in dictionary based on context
Some words can be handled with lookup tables (countries, weekdays, numbers don't change)
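
One simple way to find a closest dictionary equivalent is plain spelling similarity, sketched here with Python's standard `difflib` module. The mini-dictionary and the `closest_known_word` function are hypothetical. Note that this ignores context entirely, so it is a cruder stand-in for the context-based strategy described above.

```python
import difflib

# Hypothetical mini-dictionary; a real system would use its full vocabulary.
DICTIONARY = ["language", "processing", "sentiment",
              "translation", "monday", "friday"]


def closest_known_word(word: str, cutoff: float = 0.6):
    """Return the most similar dictionary word, or None if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_known_word("langauge"))  # language
print(closest_known_word("frday"))     # friday
print(closest_known_word("xyzzy"))     # None
```

The `cutoff` value is a judgment call: lowering it recovers more heavily misspelled words but also raises the chance of mapping a genuinely new word to an unrelated one.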

Semantics and Pragmatics

Semantics: Extracting meaning from text, potentially as formal logical rules
Goal: Enable machines to learn relationships from text, not just process word sequences
Pragmatics: Higher-level understanding of intentions, context, conversations, and speech patterns
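
As an illustration of the formal logical rules mentioned under Semantics, the birds example from the lecture could be written as a first-order logic rule. The predicate names Bird and CanFly are assumed here for illustration only:

```latex
\forall x \, \bigl( \mathrm{Bird}(x) \rightarrow \mathrm{CanFly}(x) \bigr)
```

Read as: for every x, if x is a bird, then x can fly.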

Notes

Transcript

You use the, is that it has worn... Judging that he has learned something from written sources, right? I'm not going to define the learning process here. I'm not going to argue with her. She actually knows how to make things and whatever she does. This class, having the... And we'll be there. Finally, understanding languages themselves is... Okay, so. Here's what we want to do with them. Understand what's next for them.

Is it 100% possible? I don't think so. Is it mostly possible? Absolutely yes. Successful. The dishes are done. Now, that's... Indulge me. Do you think that... Because that's the next step. Good things to do. Question: I don't even know how to teach that. Anybody know what that location is? An open question. I'm on the side of no. But, ultimately, nice to have that. And this is progressing very fast: having an interactive communication with a machine and a translation.

Now, so this is what a regular user would want from NLP. Right, you don't care what's under the hood? You just have a tool that understands you. What does it mean from an engineering perspective? I pressed a couple times last time to do that. We use what we learn and more to build your own applications. Those applications will do all sorts of things with us. They won't have to be chatbots. They won't have to understand everything.

Depending on the task, you'll be using different algorithms, different approaches, and different data. I'm sure that this will open your eyes on the NLP engineering perspective: scan the facts and extract information and whatnot. Spelling, right? What other applications are you familiar with? Natural language processing, which means human, natural language interactions, whatever it means. What have you seen, what have you used?

I think a lot of companies are automating phone calls and whatnot. Phone calls, automating conversations? All right.

Okay, good. Do you trust him? I don't believe this. I don't either. Anything else? Can I move it? Grammarly. Grammarly, yes. That's a very good tool. All right, so I'm going to give you some other things. There's an overlap with what you said. If you don't know what a language model is just yet, you can find that in our... Big field. Building a model of a language.

What a burden. Text classification, we'll do that in this class. Sentiment analysis, there's a lot of flavors to information extraction, information retrieval. These two are different paths. I'm sure some of you, CS4.29... Different topics. Conversational agents, that's obvious. Text summarization, question and answer, and machine translation. These are not surprising to me. What about topic modeling?

Have you heard that term? What do you think topic modeling would be? I imagine most of you, if not all of you, know how clustering works. Right, you have a data set. In this case, it's going to be documents. You suspect that there are patterns in the documents or in that data, or that some of the data points are sharing something, but you don't know what the groupings are. Topic modeling is kind of like that.

Okay. These documents, I'm just giving you a pile of documents, separate them into your own bins. Sports news, this is a novel, this is a short story, this is a... Letters from your grandma, whatever. Based on what's inside. So that's it. Obvious. Have you seen text classification in action? Or not. Let me show you how that works. How do you use the text classification? Any favorite news source?

Favorite movie? Let's play it safe. Okay. What did just happen? We have a document full of text, right? Words after words, and that document ended up being classified with a number, right? Thank you. On a scale, plus one being positive. Here's a question for you. How would you approach that kind of... What was the text? Was it a bad review of the movie? It was a bad review. I picked up one out of them. Oh, okay.

I would look at some of the words and see if they are typically negative. Words with negative connotations or positive connotations and, based on that, potentially add weights to those words, like how negative they are, I guess, and you could create a desirable model using that. Perfect.

Okay, so... Nowadays, I'm not sure what this model is. Specifically, what's the underlying technology: is it a neural network or something else? But if you don't believe, that kind of sophistication when you use that, right? Take the, look at the words, right, and, you know, get the negative words. Such as? Crowing. Boring. And at the most basic level, we just come back. We'll get to text plus information.

The most simplistic way would be just to have a kind of a list of words that are negative and words that are positive, and count their occurrences. I mean, if there's more on one side than on the other: positive. Of course, that would be just very, very, very... What's that? Most of... Text classification is based on the center. We look for certain words. Yes.

Does this classifier take into account context? For instance, if somebody says this is positively horrendous?

I don't know. This is the first demo that I found online. I have no clue whether this one actually is taking the context into account. I doubt that. What you highlighted is something very important: this is a scenario where this counting of positive and negative words kind of breaks down, right? By the same token, suppose someone wrote: this was not great, not great, not great, not great, right? For a simple model, it would probably be classified as a positive, because there's a lot of "great"s in it.

Yeah, this one says no redeeming qualities. It says only redeeming qualities.

It's in your face right now how hard it is, right? There's ways of, you know, massaging and things. Thank you. Trick here, a little trick there. But, well, we'll get to that point. So here's text classification. Where would you get data? Warwick. By the way, You have a text that has a score. A lot of times for a lot of problems people already have. Yeah. And of course, Would you be able to weed out a trouble in Review out of that?

Just one minute, 'cause there's so many little steps that you can do here. I can't do that all in two programs. Information extraction. This is another very useful application of NLP. Does anyone have any favorite news source? There you go. Here's a random text that I plugged in. What happened? Information extraction. Actually, let's do it again. Let's talk briefly about information extraction.

When I ask you about your texts, let's meet this Friday, right? That's information extraction. Did it do a good job? You have some text, you have a person, Mark. Mass Effect was labeled as organization. Probably not. That's a game, right? Outdoor Wilds game is kind of the one thing that is related to all sorts of games. Misclassified, torn in all. It did a good job on "the past decade". It did a good job for E-Current.

Do you see what's happening here? So instead of this time, instead of classifying the text, we're plucking up a piece of paper. Thank you. Relevant to some problem in this case. Is it an instance of tending? It is an instance of tending, except it's a kind of more sophisticated I will guess you may-Word.

I don't have a specific answer for this particular demo. I don't know exactly what... This is probably based on... Stacy was just... So I don't know how it was trained. What is it? I'm pretty sure that it's not just the cutting over... My guess would be that it looks like Oscar Wilde or something. Yeah. You will get to information extraction at some point. You will see how that's being done. Nowadays, mostly it's neural networks.

Some of the sentences here definitely follow that structure where you have Birds, mountains, and... Well, mass effect, these can be considered in this context. So maybe we start with that sentence structure and maybe with data, we'd be able to identify whether they're proper nouns or something else. Great.

By the way, notice what he just described. Before you even slap the tag on Mass Effect, you look... wait a minute, whether it's a noun or not. There's an underlying process that precedes. That's exactly how it works. First, we look at part of speech, and I mean, do that. Of course you can play it easy, for example, and we don't have any... How difficult is it to have a list of 200 countries? Go through the text and see if I spot it.

Companies, that's a little trickier because they come and go. Good. What I'm trying to draw your attention to is this. That's working, and there will be some really good lookup tables. Numbers, they don't change, right? Weekdays, they don't change. Okay, so... We kind of got to the point, to some understanding of how you would do that.

So you already covered the calendar. It will connect you to your calendar and possibly formulate and execute a command. Finally, this is what a lot of your Claws are doing. There is a language model that's built to process words, and on top of that you will have other little components. But what would you use that for?

How about question and answer and then summarization? You may ask a question, okay? You have a... You have a conversation. President Trump, there's a transcript of that. And you asked the question, what is President Trump's stand on this or that, right? It has to look at the parts where President Trump is saying something versus the interviewer and whatnot. Yes?

Maybe like Google or like Yelp reviews for a company or something. Right, right, right.

Yeah. Just write your own script that runs "what are people thinking about..." Okay, so information extraction. Information retrieval, you've seen that on Google. I'm not going to give you an example; the search engine is an example of information retrieval. Why? It will just search the bazillion of websites and documents and show you the ones that seem to be relevant to your search, to your query. Right here in the back.

How is... Not a trivial let's scan through all the documents and When you ask and then you give. The answer is I will too. Take ages. Are you a sectional agent? Thank you. See this, you got that one. You will. An outlier. Use it anymore, but, uh, uh, introduction law space fiction. A few things about Space Station. Never mind. You know how it works. Machine translation, You found that yourself? I think we'll have a little time to talk about that.

You can see that there's a ton of different applications, and once again, I'm repeating myself. Those tasks... You use different parts of that. We'll be going through that. All right, there's others obviously that you can think of. I do hope that many of you will come up... Some of those tasks that you will see in this class, I've heard. Some of... Then let me... Asking what do you think about this?

It's an open-domain conversational agent, right? It's pretty easy to understand. And what it is: a conversational agent that... you can talk about... As opposed to a closed domain. Have you seen those? I'm gonna do what my company's reviews is not gonna go to like, shape.com. You've been dealing with chatbots a lot, so how do you rate the quality of existing open domain in decision making? Yes. They actually talk about... Anything he did in a serious way satisfying. Thank you.

Still not there. Moving in that direction, like I think. Gemini probably has the best, like, when you're using your voice and communicating back and forth. I think they probably have the most, quote unquote, human interactions with it, but it'll still trip off where it doesn't.

I feel like a completely organic conversation, but it also doesn't entirely feel like you're just... It feels like it's trying to add some personality versus just... anything. All right, last time I showed you this table just to remind you the way the language is studied linguistically... divided into... Let's talk about how those levels of tests are being processed. So we have the complexity grows in this direction, right?

This is the opposite direction right here. We'll skip most of the sound aspect of it. How do we do that? Because these are levels. to understand the attitudes, to understand the syntax and the semantics and whatnot. So what you would-building yourself. Our pilots, okay? You might, if you did any NLP work on agent, creating agent, I'm playing new processes. It has a, Reason, for the most part, it comes from So.

A fundamental NLP pipeline is broken down in that particular way. Let's ignore this future analysis. Thank you. Great film. Next, into words and sub-words. Then you parse sentences. Then you apply grammatics and distill the context. Here. This is you right here, okay? Grab everything that was discovered along the way. Then, the part that we will not be spending time on. Then you generate the answer and it expands.

Your response might be, I don't know, the number that we're going to 1956, right? That's, it's a date. response to some question. Now you have to massage it and then turn it into a English Paragraph that will be then used in the next one. You're reversing the process. All those little steps along the bullseye are Really, Chris. There's a reason. Behind having the pipe. Reasoning, and this will be application specific, so this is where you come in.

Interest in building an NLP system. Now those two sections of the pipeline are... generation. We are going to be here. Does that make sense? For the most part, your ChatGPTs are doing all this stuff in one lesson. Wait, they do have stages, they do have... But before we will get their rules. So let's start here. Alright, so morphological and lexical analysis. We're looking at text. We're looking at words and their parts.

The prefixes, the suffixes, the roots, and whatnot. This is called morphological analysis. Have you heard the term "lemmatization"? What if... we're standing in both our... When you simplify the word. Okay. Simplifying words as in... Happier, happiest, go, going, went. Stemming, I'll just... describe and explain the differences and how we can do that. Stemming and lemmatization. Hands up.

Looking at your text. Let's stay here. Go, go, into, and into. You got every single area,Two questions for you. Why would you do that? And why would that work or not work?

Sentences can have a lot of filler words that are nice for context, but if you remove them from the sentence, you can still understand. What was the point of trying to do that?

Okay, so that would be... Moving somewhere, surrounding, go, go. Would you keep going, going, yeah. It's an action movie, I think. Probably. This means something. The words that you would skip, probably. There is meaning to it, but very little compared to an action movie.

We haven't got there yet. But anything, any word in your text will have to have a... Thank you. Right? You can't just store "go" and... So let's say, let me make it up. Let's say that "go" has 234. "Go" has 4,096. When is the... Let's say that they have individual IDs, right? Was that my cat? I'm adding a little dictionary. It's not a proper dictionary because it's a dictionary. You don't have, you don't keep all the forms.

necessarily, right? For his limbs. All right, so. Is this the same word? From a computer perspective, you said it goes from different numbers, right? If you strip that from meaning, you just replace every word with a number. You're dealing with different words, technically. If you remove the contents or you don't get to the contents level just yet, you can't... That'd be sick, right? Process and goal going forward as... Is that affecting, possibly affecting the processing?

It could be affecting the space complexity. Space complexity, right. We have a bigger dictionary that has to be... We're all a version of you. Very good. Let's go back to the Shrek and sentiment analysis. Would you care? Happy, happiest, greatest, greatest, greatest. Would you care whether those different versions are being analyzed, or can you just replace all of them with "happy"? Would you get the same...

Sentiment analysis. More or less. I'm happy watching Shrek. I would be happier watching Shrek 2, right? I'm the happiest whenever I watch. You can see that. So reducing the forms of different words is just helping you reduce the space complexity of the problem and as a byproduct of that your processing time and time complexity will be reduced. So that's the answer to the first of my questions. Now the second one.

Can you imagine the reverse when you're actually losing some of your time? Not included in that.

Saying you're happy watching Shrek and saying you're happiest watching Shrek are two very different things. We'll say you're happy to watch. You're just happy when you're happy. Yes, that means you're happy. Like the happiest that you could ever be. Yeah, correct. So depending on the context of that statement, it could mean. Very good.

And depending on what you are trying to achieve, how many interested in being in answering the question, I don't care whether it's happy or happiest, right? When are you happiest? There you go. If I reduce "happiest" to "happy", I will not have an answer as easily. So again, here's another fundamental problem for you. You will have to make that call. That's helpful. Good way. Do I care about that? Am I willing to sacrifice it on this base because it's in the war?

The good news to some degree is that all the neural-network-based models that you have right now take everything in and actually will... Okay, we'll try. Happy, happier, happiest... by the new... Extract everything, so that you don't have to. However, it is used in some cases. Name your question, we'll get to it slowly. One more example, going back to how those different levels of language are related to applications.

--Would that work for any language? 30ish, right? Is this a real English word already? Probably not. Everybody understands that, right? Yeah. You may rest now, thank you. Any other examples? All right, let me ask you a different question. Have you ever thrown a misspelled word or a sentence? Crazy word that only used in your friends, whatever some slang word. Mason Sineworth at HRGVP. Give it a go and see what happens.

I'm giving you an answer right here. I'm ready. I'm giving you a scenario where you are throwing in a word made up of pieces that are unknown because they're not part of the language or the dictionary or whatnot, right? Or the list of prefixes. Everybody knows what it is, or everybody can understand it. But yeah. Would that be a challenge for the computer?

I saw a YouTube video where a guy talked about how people are becoming lazier with their typing and ChatGPT and other LLMs are actually perpetuating that, because he was doing experiments with it where he intentionally was only spelling words with half of their letters and it was still able to understand everything he was saying. So I think that works from the neural network side of things, but if we were going to use these older approaches, I'm not sure if they would be able to have the same success rate.

How close is it to the spelling of the other words? Given the context of the plantin or plantin, you can kind of figure it out just given the sentence of the type of plantin.

Long story short, the neural network, if it sees an unknown word, unknown component... Cut off the ink, I'm sure. It would find the closest equivalent in its own dictionary, do it. And then you do it. Interesting. To actually help massage that situation, but... All right, so: morphology. Pieces of words. Words become sentences, sentences are structured. Understanding if they are correctly structured. This will be.

The next step. And then extracting knowledge. I already told you that. This happens at multiple... Syntax. Let's stop at semantics for a moment. Four of them. Semantics means extracting meaning from text. What is this? For all birds, they're very big, they can fly. Okay, so we, um... I don't think it's a rule of logic. It's a first-order logic rule, right? Then can you possibly extract a rule like that from a really good text?

Can you represent the text that you wrote or spoke or whatever, do you think, with something like this or some formal structure in that way? This goes into the reasoning and everything, right? Perhaps. The idea is that you want your machine to learn from text and learn relationships and whatnot. Otherwise, it's just a word next to another word. Pragmatics. This is above semantics and it's related to intentions.

Introduction to NLP

Natural Language Processing (NLP)

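Definition

Natural Language Processing is an interdisciplinary field at the intersection of linguistics, computer science, and artificial intelligence. It studies the interaction between computers and human language, and covers how to program computers to process and analyze large amounts of natural language data.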