Lecture Script

Summary

Probabilistic Context-Free Grammars (PCFGs)

Sentiment Analysis Fundamentals

Challenges in Sentiment Analysis

Feature Engineering for Sentiment Analysis

Vector Semantics and Word Representation

One-Hot Encoding

Course Context

Notes

Transcript

We are really maximizing the probability of the parse. Does that make sense? That's the only moving part: how the tree is constructed for the sentence. I'm hoping that you're starting to get used to all the shenanigans and shortcuts that we're doing to get to the answer.

All right, with that said, we went all the way from the conditional probability, right? Very simple: the probability of a tree. How do we calculate that probability of a tree? What would you need to give me an answer to that question?

Part-of-speech tagging? Yes, if I could do that, that would help, but that's a lot of work, right? Then again, how do you know which one is the right one? Okay, let's go back to part-of-speech tagging. What would you need to mathematically put a number on how often the noun tag is assigned to the word "fly"? Where are you getting any bearings on that number? It would be all the times that "fly" was tagged as a noun, right?

So I need a corpus, I need some data where I have a word and a tag: "fly" and its tag. Now follow that logic for trees.

Loosely speaking, instead of part-of-speech tags you would have sequences of words and a corresponding parse attached to each one.

This is how you parse that sentence. Does that make sense?

So, let's say that you have a sentence: "Eat sushi," right?

And there is a special corpus that has that information. You have this parse added as an annotation. Does that make sense? There could be an alternative one that also works.

All you would have to do is count how often this rule is used to explain that part of the tree, that sequence of words. Does that make sense? Yes?

So in other words, you need a special corpus that, instead of part-of-speech tags, has parse annotations, a treebank, where every rule, or production rule, is recorded, with tags assigned not just to individual words but also to phrases and whatnot. With that, we can move on and apply something called a probabilistic context-free grammar. So you already know what a context-free grammar is. Now let's add another aspect to it.

We assign probabilities to every production rule. Here is what we were working with so far: a grammar with some production rules. I could assign a probability to each one. Does that make sense? How would I get those probabilities, or estimates, for how often a noun phrase is expanded to just a noun versus a noun phrase expanding to a determiner and a noun? It would all be counted in the corpus. Are you with me?
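To make that counting idea concrete, here is a minimal Python sketch of estimating rule probabilities from an annotated corpus. The toy rules below are made up for illustration, not the grammar from the slides; the estimate is simply count(A -> beta) / count(A):

```python
from collections import Counter

# Toy "treebank": every production rule used by every annotated parse,
# flattened into one list. (Hypothetical data, for illustration only.)
treebank_rules = [
    ("S", "NP VP"), ("NP", "Det N"), ("VP", "V NP"),
    ("NP", "N"), ("VP", "V"), ("NP", "Det N"),
]

lhs_counts = Counter(lhs for lhs, _ in treebank_rules)  # count(A)
rule_counts = Counter(treebank_rules)                   # count(A -> beta)

# Maximum-likelihood estimate: P(A -> beta) = count(A -> beta) / count(A)
rule_probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

for (lhs, rhs), p in sorted(rule_probs.items()):
    print(f"{lhs} -> {rhs}  {p:.2f}")   # e.g. NP -> Det N  0.67
```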

Okay. Why is there a one here, at the bottom?

But there's another one right here.

If you sum them up, you will get more than one.

Why do I have ones on those individual words? There is no other way to break down that symbol: there's only one rule that we can apply, therefore you use it always, 100% of the time. As opposed to the verb phrase, right? There I have three possible options. And you could imagine this.

A probabilistic context-free grammar has...

We have multiple production rules that correspond to the same symbol. Simple: together they sum to a total probability of one.

Does that make sense? Okay, so say that I gave you this probabilistic context-free grammar.

How do we actually calculate the probability of a tree? This is how a tree may look.

How do we estimate it? Right. And this is going to be probably the simplest way of computing any probability in this course.

Now take a look at this tree broken down to every individual rule application. Every time a rule is being used, it's a part of the tree. Does that make sense? Every sub-tree now will have its own probability. I use this one, and use that one, and so on, and multiply them: this one goes here and this one goes here, and the product is your probability of the tree. That works? Of course, if I asked you that kind of question on the exam, I would expect you to write a little more than just numbers. Does that make sense?
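As a hedged sketch of that multiplication, reusing the rule_probs dictionary from the earlier snippet (the particular derivation listed here is hypothetical):

```python
def tree_probability(rules_used, rule_probs):
    """Probability of a parse tree under a PCFG: the product of the
    probabilities of every rule application in the tree."""
    p = 1.0
    for rule in rules_used:
        p *= rule_probs[rule]
    return p

# Hypothetical derivation: S -> NP VP, NP -> N, VP -> V NP, NP -> Det N
tree = [("S", "NP VP"), ("NP", "N"), ("VP", "V NP"), ("NP", "Det N")]
print(tree_probability(tree, rule_probs))
```

In practice you would sum log probabilities instead of multiplying raw ones, to avoid numerical underflow on big trees.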

And then you build the same CKY matrix, but instead of just rules, the rules also carry probabilities. And when you're done with your matrix, you're able to resolve ambiguities just by looking at the different probabilities and picking the highest one. But I'm pretty sure you realize that, say for this cell right here, I could have multiple values. I'm showing you only one way, but there are really multiple ways to break it down. Okay? That works?
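Here is a minimal probabilistic CKY sketch, assuming a grammar already in Chomsky normal form; the shapes of the `lexical` and `binary` rule dictionaries are assumptions for illustration. Each cell keeps, for every nonterminal, the probability of the best way found so far to build it over that span:

```python
def pcky(words, lexical, binary):
    """Probabilistic CKY. table[i][j] maps nonterminal -> best probability
    of deriving words[i:j]. `lexical` maps word -> [(nonterminal, prob)];
    `binary` maps (lhs, left_child, right_child) -> prob."""
    n = len(words)
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # fill the diagonal
        for lhs, p in lexical.get(w, []):
            table[i][i + 1][lhs] = max(table[i][i + 1].get(lhs, 0.0), p)
    for span in range(2, n + 1):                       # wider and wider spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # every split point
                for (lhs, b, c), p in binary.items():
                    if b in table[i][k] and c in table[k][j]:
                        cand = p * table[i][k][b] * table[k][j][c]
                        if cand > table[i][j].get(lhs, 0.0):
                            table[i][j][lhs] = cand    # keep the best value
    return table
```

A full parser would also store back-pointers to recover the best tree, but the table alone already shows how one cell can hold several competing nonterminals with different probabilities.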

All right, so let's close this topic for today, and what comes next will not be news to you. We'll talk about two topics.

Or maybe just one. How about text classification?

Where were we last time? Spam filtering, right?

Your classes, your tags, your labels were spam and not spam, but they could be something else, like positive and negative.

Sentiment analysis. Can we do a quick recap?

You could use a naive Bayes classifier for labeling our sentences. Again, you would have a training set where you have a set of sentences, each with a label. Of course your labels could be something completely different. You estimate the prior probability of each class and the conditional probability of each word given the class.

You build those and you have your model. Is it clear? That's smoothing. Why are we using smoothing? To avoid zero probabilities for words we haven't seen in a class. And what is this term? The size of the dictionary, your vocabulary. Okay, good. All right, so, nothing new here: we can apply this same approach, technically speaking. But what more could we do to improve our sentiment analysis rather than just counting all the words?
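As a minimal sketch of that recap: multinomial naive Bayes with add-one (Laplace) smoothing, where the smoothed likelihood is P(w|c) = (count(w,c) + 1) / (total words in c + |V|). The tiny training set is hypothetical:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    vocab = {w for d in docs for w in d}
    priors, likelihoods = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # P(w|c) = (count(w,c) + 1) / (total + |V|), stored as log
        likelihoods[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                          for w in vocab}
    return priors, likelihoods, vocab

def classify(doc, priors, likelihoods, vocab):
    # Sum of log probabilities; unknown words are simply skipped here.
    return max(priors, key=lambda c: priors[c] +
               sum(likelihoods[c][w] for w in doc if w in vocab))

docs = [["great", "movie"], ["boring", "movie"], ["great", "cast"]]
labels = ["pos", "neg", "pos"]
print(classify(["great", "boring", "movie"], *train_nb(docs, labels)))  # 'pos'
```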

Do you think it's better to have a binary bag of words or a non-binary one? Do you count, for example, an adjective multiple times, or is once enough?

So this applies even more so in sentiment analysis.

Empirically speaking, the fact that an adjective appears at all affects the result more than how many times it appears.

Binary counts could be just enough.
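A quick sketch of what that looks like, clipping raw counts down to 0/1 presence indicators (the tiny review is made up):

```python
from collections import Counter

def binary_bag_of_words(tokens):
    # Clip every count to at most 1: presence matters, frequency doesn't.
    return {word: 1 for word in set(tokens)}

review = "great great great movie , great cast".split()
print(Counter(review))              # raw counts: 'great' appears 4 times
print(binary_bag_of_words(review))  # binary: 'great' -> 1
```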

Okay, so binary counts could suffice, as in the sketch above. The process, by the way, if you're interested, of reducing the counts to binary indicators is called clipping. Now, here is challenge number one.

Say I have a feature corresponding to "don't", and then somewhere here I have a feature corresponding to "like". So this sentence would have a zero here and a one here, right?

The other one will have a one and a one. Correct?

If you have a "don't" and a "like" versus just a "like", just by looking at those two vectors, can you tell me whether one corresponds to a positive review or not? We don't know what the negation applies to. Exactly, because it's a bag of words, and the "don't" might be negating "dislike", right? "I don't dislike it. In fact, I like it." It could be a sentence like this.

Or you have a sentence: "They don't like this movie." Okay. So negations. The plain bag of words is a very, very easy approach, but not very effective at the same time.

Negations can be handled as follows.

The moment you see "didn't", or "doesn't", or "don't", or whatever, something that negates, you turn every word that follows, up until the next punctuation, into a negated version of itself: negative versions of the following words.

So your vocabulary has to grow; you have to accommodate NOT_like, NOT_this, NOT_movie. They'll be separate entries in your bag of words, and they'll be counted in those separate ways.
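A minimal sketch of that trick, assuming a simple token list and a small, hypothetical set of negation words:

```python
import re

NEGATIONS = {"not", "no", "never", "don't", "doesn't", "didn't", "isn't"}

def mark_negation(tokens):
    """Prefix every token after a negation word with NOT_, up until the
    next punctuation mark (a common, deliberately crude heuristic)."""
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negating = False          # punctuation ends the negated span
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True
    return out

print(mark_negation("i don't like this movie , but i like the cast".split()))
# ['i', "don't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', ...]
```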

Absolutely, but it requires another step. Someone already mentioned having a lexicon of all the negative words, or positive, or neutral ones, and just counting the words that are in this lexicon or in that one. Am I making any sense? Okay, this is your bag of words, right? "Like", "happy", and so on: this is your bag of words. You could...

It's a vector. You could have another feature.

That is separate from counting the words of the text individually: a count of every word corresponding to some positive lexicon that you already have, and a count of all the words that are in the negative lexicon. Say, so many words from the positive lexicon and so many from the negative lexicon in our review.

An extra feature that is not coming directly from the vocabulary. Does that make sense? This process right here, in typical machine learning work, is called feature engineering. You're building your vector of features. We started with just a bag of words, but we can tack on other features, such as counts of all the positive words that you have in some lexicon. You of course have lexicons available, and you can build your own.
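A hedged sketch of those lexicon-count features; the mini-lexicons below are toy stand-ins for real sentiment lexicons, which contain thousands of entries:

```python
# Toy lexicons (hypothetical; real sentiment lexicons are much larger).
POSITIVE = {"good", "great", "happy", "brilliant", "like"}
NEGATIVE = {"bad", "awful", "corrupt", "brutal", "boring"}

def lexicon_features(tokens):
    """Two extra features that don't come directly from the vocabulary."""
    return {
        "pos_lexicon_count": sum(t in POSITIVE for t in tokens),
        "neg_lexicon_count": sum(t in NEGATIVE for t in tokens),
    }

print(lexicon_features("a brilliant cast in an awful boring movie".split()))
# {'pos_lexicon_count': 1, 'neg_lexicon_count': 2}
```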

Another way to boost your sentiment analysis is to identify word polarity one way or the other. Let's say we know that "corrupt" is a negative word, right? You have it in the lexicon. But you don't have the word "brutal".

If you see "corrupt and brutal", would it make sense to infer that they're both negative, because they're joined by an "and"? Most likely, yes.

So these are little rules.

Heuristics, if you please, a little set of tricks that help you build a better model. "But" works in the opposite direction: plus, "but", minus, right? And you scan your texts looking for those cases.

This is also a process that could be used to build your own lexicon, starting with some so-called seed words. Then look at your corpus: whenever you see "good and something", add that something to the good lexicon, the positive lexicon. Very easy tricks help you build a lexicon. A more advanced version of that is to have a graph representation of word polarities. Does that make sense?
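Here is a minimal sketch of one pass of that seed-word trick over a tokenized corpus; the seed sets and sentences are hypothetical, and a real system would add frequency thresholds and run multiple passes:

```python
def expand_lexicon(sentences, seed_pos, seed_neg):
    """'X and Y' -> same polarity; 'X but Y' -> opposite polarity."""
    pos, neg = set(seed_pos), set(seed_neg)
    for sent in sentences:
        for i in range(1, len(sent) - 1):
            left, conj, right = sent[i - 1], sent[i], sent[i + 1]
            if conj == "and":                 # shared polarity
                if left in pos: pos.add(right)
                if left in neg: neg.add(right)
            elif conj == "but":               # opposite polarity
                if left in pos: neg.add(right)
                if left in neg: pos.add(right)
    return pos, neg

sentences = [["corrupt", "and", "brutal"], ["reliable", "but", "cheap"]]
print(expand_lexicon(sentences, {"reliable"}, {"corrupt"}))
# 'brutal' and 'cheap' both land in the negative set
```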

You build a sentiment analyzer well, and then you get something like: "Great, and I need to pay more." Would that be a positive or negative review? Is that positive or negative?

It's sarcasm. This is not something that a computer will be very good at. Even worse is this. Do you know who Katharine Hepburn is? There's a famous quip that her range of emotions ran the gamut from A to B: just two emotions. If it were A to Z, that would be a wide range. You can see that this is not easy, right?

And your computer model will not be able to catch certain nuances that humans can.

What about this one? This is a perfect example of a review that sounds positive all the way through: regular praise, "brilliant", "great", whatever, but in the end you're getting your negative right there.

Neural networks solve part of that problem.

If you train them on millions and millions of reviews and whatnot, they will start to pick up on those patterns. Okay, but what else can we do? We have our bag of words, and I already told you one of the ideas: the counts of positive and negative lexicon words. Here is an interesting one.

If there is a "no" somewhere, then count it separately; don't just count it as another word, but record the presence of a negation as a separate feature to amplify its effect. And how about this?

What would be your takeaway from all of this? I'm seeing an exclamation point there. Right, so when someone likes something, they amplify it with an exclamation point, just times 2 or times 10, whatever. You can bake it in as a specific feature. Do you feel that that piece of information will sway your sentiment analysis? I wonder, since I mostly see exclamation points in positive messages, but I'm sure that they can be used for negative messages too.

Sure, I think this is the intention here, because it amplifies the overall message. It's a stand-alone feature, right? Not just counting how often the exclamation points happen: this piece of information is sort of strong by itself. Yes. What about the last one? Think of it as just a word count: how many words do I have in my document?

Would that be of any value for the sentiment analysis?

Is there a word X or Y? That's an extra feature. Could you consider it a reliability factor: the more words are in that review, the more weight it carries? Or even, psychologically speaking, if someone spent that amount of time to write that review, they have to feel strongly about something. Therefore, that longer review carries more weight than some shorter one.

It's not necessarily always true, but you're adding an extra dimension to your input to the classifier. Is that all making sense? Because that's an unfortunate reality when it comes to NLP: some aspects of it are psychology-based.
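Putting those last ideas together, here is a hedged sketch of a few hand-crafted features of this kind; the feature names and the small negation set are assumptions for illustration:

```python
def extra_features(text):
    """Features that come from the document itself, not from the vocabulary."""
    tokens = text.lower().split()
    return {
        "exclamation_count": text.count("!"),                        # emphasis
        "negation_count": sum(t in {"no", "not", "never"} for t in tokens),
        "word_count": len(tokens),    # longer reviews may carry more weight
    }

print(extra_features("Never again ! The cast was great , the plot was not ."))
# {'exclamation_count': 1, 'negation_count': 2, 'word_count': 13}
```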

All right, so you can play this game of adding more features as you please. Now, let's go back. Again, I skipped the antonym and synonym section, but I'll come back to it after the exam.

That's not super important. Would it be valuable, we already talked about it, to have a numerical way of representing that this word is the opposite of another word, or that a word is a friend of some other word, mathematically speaking? I think I showed you something: "nice" could be close to "good".

You want to capture that relationship.

We already talked about measuring distances between vectors: a long way from "good" to "bad", and a short way between "good" and "nice"; that means something, right? Or using the cosine similarity: two words are similar because the angle between those two vectors is small. If that angle is large, that means they're pointing in opposite directions.
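As a minimal sketch, cosine similarity between two dense vectors; the toy 2-D vectors at the end are made-up numbers, not real embeddings:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|): 1 means same direction,
    0 means orthogonal, -1 means opposite directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

good, nice, bad = [0.9, 0.8], [0.8, 0.9], [-0.9, -0.7]  # toy "embeddings"
print(cosine_similarity(good, nice))   # close to 1: similar
print(cosine_similarity(good, bad))    # close to -1: opposite
```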

There's a lot of value in being able to encode everything about your text in a numerical way. And there's logic behind that.

Does that make sense? Context, right? In most cases the meaning of a word is defined by its context. If A and B have almost identical environments, we'd say that they are synonyms. Does that make sense?

Going back to my little drawing.

This is actually a very useful chart. What's on it? I have a bunch of words.

I know the words. And then you give me a word that lands somewhere here. Could you possibly, along a certain dimension, draw a conclusion that C might be a synonym of A and B?

Let's say "a duck swims and quacks". Would that be a sentence that is very much related to ducks? All right, and now: "XYZ swims and quacks".

Imagine you're talking to someone and they say "duk". Is that even a word, "duk"? Someone says "a duk swims and quacks", right? Misspelled, whatever.

Your language model, the one you're building right now for your assignment, has no way of knowing what "duk" means.

But if you train on enough data, the model should be able to figure out that "duk" means something like duck, only because it has seen so many sentences where there is that same neighborhood surrounding "duck". Does that make sense? So: reading the corpus, reading Wikipedia, reading Reddit and the Internet, right?

That forces your NLP system to learn what "duck" is, because there are some other words that are always nearby, "quack" or whatever. Now, once it sees "duk", it will produce a vector for it; how that happens comes later. And that vector will be somewhere here. Can you see right now how the vector representation helps us resolve a problem of that nature? It will not know exactly what the word is, but it will use the association with the other words.

And that's what the system does. Every single word receives its own vector; it's turned into a vector in some common vector space, sometimes with thousands of dimensions.

Now, if you have a word that it has not seen, it will deduce a vector for it based on the context.

So if I change the context and use the word "duk" again, those vectors are also contextualized, so it will land somewhere else. Anyway: you want to mathematically capture those relationships as well as possible, whatever those relationships are, and vectors are the way to go. Vector semantics. So vector semantics is based on two ideas.

Meaning is defined by context.

That's number one, right? And number two: take that context, take the word meaning, and turn it into some point in vector space. These are the two ideas behind vector semantics. And you have already seen one way of doing this, except we were actually not doing specific words; we were doing documents.

So, what can we do next? Well, once again, let's keep this concept close.

Similar words are nearby in some vector space or semantic space; they're going to be close to one another. Does that make sense to you? Okay, so here's another way to divide a vector space: according to some dimensions, like joy and sadness, right? Efficiency, inefficiency. Positive or negative, objective or subjective. You can use those dimensions and place words on those scales.

If you have a zero and a one, for example, and zero means one end of the scale, then in general you could feed that to a neural network, where one value carries more weight than the other. What am I doing? In terms of word vectors, one-hot encoding is the first idea. Let's take a step back.

Say the vocabulary has fifty-something thousand words. I'm asking you: okay, I want vectors that represent each word's meaning and, possibly, capture the relationships that I drew over there. What would be the most trivial and basic approach to building unique vectors for every word? So we have the first word, which, I suppose, is "a", right? And the last one, whatever closes the dictionary.

And one-hot encoding means that you have a vector where there's only one 1 in the vector and the rest is 0. So this guy will get a 1 in position number one and zeros everywhere else. And you can imagine that guy would get a one at the end and zeros everywhere else. That would be one-hot.

A little more on that. Will it work? In principle, yes: you'll be able to distinguish two vectors from one another. What are the problems? Space complexity, for one.
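A minimal sketch of one-hot encoding over a toy four-word vocabulary (real vocabularies run to tens of thousands of entries, which is exactly the space problem):

```python
def one_hot(word, vocab):
    """A vector with a single 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["a", "bad", "good", "nice"]   # toy stand-in for a real dictionary
print(one_hot("good", vocab))   # [0, 0, 1, 0]
print(one_hot("nice", vocab))   # [0, 0, 0, 1]
```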

Does anyone know the name for a matrix or a vector that is mostly filled with zeros? Sparse. And no, having sparse vectors like this is not a good thing.

Exactly, for the same reason: you're wasting space holding the words. That's a technical problem. What about measuring relationships between those vectors? So now, how many dimensions do I have in my vector space?

I have fifty-something thousand dimensions, and every word is its own axis. Say I have three words: A would be along this direction, B along that one, and there would be nothing in the middle, right?

Every pair of word vectors has a 90-degree angle between them.

This is not the way to capture "good" and "bad", or "good" and "very good", which you would expect to be close anyway.
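Reusing the cosine_similarity and one_hot sketches from above, you can see the problem directly: every pair of distinct one-hot vectors is orthogonal, so the similarity is zero no matter which words you pick:

```python
# Distinct one-hot vectors always have dot product 0, so cosine is 0:
print(cosine_similarity(one_hot("good", vocab), one_hot("bad", vocab)))   # 0.0
print(cosine_similarity(one_hot("good", vocab), one_hot("nice", vocab)))  # 0.0
```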

So, one-hot encoding: it's just a starting point.

I understand the concepts, but I think I need a bit more practice to understand how to apply them effectively.