Lecture Script

Summary

Introduction to Table Indices

Problem with unindexed tables: Records are scattered randomly across many disk blocks, requiring sequential scans of entire disk to retrieve data
Reserved blocks approach: Slightly better but still requires scanning all reserved blocks to find specific records
Need for indices: A better mechanism is needed to speed up searches, especially when using primary key conditions
Index structure: Consists of search keys (identifiers) and pointers to physical data locations
Key advantages: Index files are much smaller than data files and can fit in memory, enabling faster lookups
Pointer types: Block pointers (point to blocks) vs record pointers (point to specific records within blocks)

Ordered Indices and Index Types

Ordered indices: Index entries are sorted and stored based on search key values
Two main categories: Ordered indices (including B+ trees) and hash indices for unordered data

Clustering vs Secondary Indices

Clustering index (Primary index): Data file is physically sorted in the same order as the index file; ordering of search key matches ordering of sort key
Key constraint: A table can have only one clustering index because data can only be sorted one way
Secondary index (Non-clustering): Search key order differs from data file's physical sort order; multiple secondary indices allowed per table

Dense Index

Definition: Contains an index entry for every single search key value in the data file
Structure: Uses record pointers that point directly to specific records
Advantages:
- Can determine record existence from index alone without accessing data file
- Direct access to records via pointers, no need to search within blocks
- If index fits in memory, requires at most one disk I/O operation
Disadvantages:
- Larger index size (one entry per record)
- Higher maintenance cost for insertions and deletions

Sparse Index

Definition: Contains index entries only for the first record of each data block
Structure: Uses block pointers; much smaller than dense index
Search algorithm: To find record K, locate the largest key ≤ K in index, retrieve that block, then search within the block
Advantages:
- Significantly smaller index size (one entry per block vs per record)
- Fewer index blocks to maintain
Disadvantages:
- Cannot determine record existence from index alone
- Requires additional block-level search after locating correct block
- Cannot be used with unsorted files - requires sequential/sorted data files

Multi-Level Indices

Purpose: When a single-level index becomes too large to fit in memory, build additional index levels on top
Structure:
- Inner/basic index: First level index on data file
- Outer index: Sparse index built on top of inner index
- Can continue building levels until top level fits in memory
Design principle: Upper levels should always be sparse; building dense on dense provides no space savings
Example structure: Data file → 1st level sparse → 2nd level sparse → 3rd level sparse (single block in memory)

Handling Duplicate Search Keys

Context: Non-primary key attributes can have duplicate values; index must handle these appropriately

Dense Index Approach 1: Create separate entry for each record with same key value

Advantage: Easy to count records and locate all instances
Disadvantage: Large index size, many entries, potential performance issues

Dense Index Approach 2: One entry per unique key value, pointer to first record only

Search proceeds by following pointer then scanning sequentially for additional records
Advantage: Smaller index, faster key searches
Disadvantage: More complex implementation, requires sequential scanning

Sparse Index with Duplicates: Works but can be confusing when same key appears as first entry in multiple blocks

Deletion Operations

Sparse Index Deletion:

If deleted record is not first in block: No index modification needed
If deleted record is first in block: Replace index entry with next record in block
If block becomes empty: Remove corresponding index entry
Key principle: Index only modified when first record of block changes

Dense Index Deletion:

Must delete both the data record and its corresponding index entry
Nearly all deletions require index modifications
Higher maintenance cost than sparse indices

Notes

Transcript

We're going to study about the traditional indices. Then we'll talk about the new embedded indices, the binary searches, the binary indices, the binary algorithm. And to be specific, we'll be looking at the B plus tree. So, for those people who still don't get this index, an index with an identifier, suppose you have this, suppose we have scattered

It was cut at hour records, randomly across many pages or blocks. And you want to read all the records that belong to Table R. Let's say I'm a student. I want to read all the 30,000 students. So, what does it mean? If the 30,000 students are spread over 100 blocks, you are going to read every single block.

That is going to be time consuming, which we call the scan. So we're just going to do a sequential scan. That's slow and you're going to have a lot of performance overhead.

A better version, someone can say, hey, I want to reserve some blocks that actually store that data. So, this can be slightly better, but still, it's going to need to scan the entire disk. You don't want something that scans the entire disk. You want to specifically find the actual two poles. So, I go ahead and say, okay, find the two poles given a value of a primary key. Okay, now you're talking.

If I have my reserved blocks, I know these are the blocks that have a student table. I resolved it. But I'm not going to want to... I'm not going to... I'm not going to read all the blocks. I have reserved blocks, but I'm still going to scan all the records in the reserved blocks.

It has been edited to include proper punctuation. So, even if we scan the entire record in the result block, that is still slower because you're still going to scan all the blocks. And we say that, hey, is there a better way to speed up these searches?

We need to find a better mechanism. And, again...

which we can call it the maintenance overhead. So the indexes that we're talking about or that we refer to at the low level, I'm going to have components that we're going to require. We're going to call a search key. So a search key, look at it under a single attribute or a set of attribute, just like I showed you, name and phone number. They're going to be used to look up records so quickly.

OK? And then the other one is going to be a pointer. So if you don't have such key, you're a pointer. So we're going to have a structure, which we're going to call the key-value structure. The key and the value, the key-value structure. So, in that file, we're going to keep this combination, the search key and the pointer.

A is going to be our pointer, pointing us to the physical location where the data is. Whereas the search key, which is always going to be an idifier. An id and a pointer. So if you have so many ids and a pointer, they're going to make up a file. So we're going to call that an index file.

That index file is going to be so important to a database system. Before I actually query the actual data, I go through the index file, quickly. Like, I mean, if you're reading a book, a textbook, and you want to read a few topics, you know. I'm going to go to the last page where there's an index page, then you're going to really look at, oh, they're talking about concurrency. Concurrency is on page 3, is on page 49, and page 110. So it's something like that. The database also needs some kind of assistance or...

for you to quickly find the data it's looking for. So, this index file, we're going to look at the index file and we're going to see how this index file can be used. So, automatically, the index file is going to be smaller than the actual database file.

And somebody can ask, oh, if it's smaller, can we put it in memory? Automatically, we're going to put it in memory. Because it's smaller, it fits in memory. The original file, remember. The database file needs to stay at the disk. So, how about the pointer? So, we're going to have two pointers. We have a pointer that's pointing to the block, and a pointer that points to the actual data. And I give an example of my unit, my block.

If you come to my block... There are over 6 units, but how do you first go to that block? Remember, after this road, there's another block. After this road, there's another block. So, me telling you that come to Block 1, when you go to Block 1, it's a block pointer. But when you get to a block pointer, there's so many units. There's so many units that you need to go there. So, we're also going to need a pointer inside here to point me here, to point me here, to point me here, to point me to this unit, depending on where you want to go.

So, we're going to have a record pointer and also a block pointer. And we're going to look at... But how, where to implement it. So the record pointer will definitely take you to the last mile. Two kinds of indices. I've already talked about, if you want the ordered indices, index, then you're talking about the B plus three and all the other ones.

But basically here we're talking about the B index. If you're talking about anything that's unordered. You want to find a file quickly, but what you're going to find is unordered. Then you can apply the hashing that we talked about. And you can see the order by illustration is very clear here. And what does this order mean? It means that, hey, the index entry, this is our file here, our index file. The index entry...

It is sorted. The inner entries are sorted and stored on the search key. So, my A is going to be, storing A here, and then my A, another A is here. B is here. So basically you're going to see that hey the index entries these entries A, B, C, D entries are stored sorted on the search key. So if you look at the search key

It's going to be in this way. So 10, 20, 30, 40, 50. You can see that, hey, the sorting here is sorted. But we can go farther than just ordered, and we call it a clustered index. Because ordered may be a little not clear. So clustered, which is also called the primary index.

What does it do? The ordering on the file must be also the same ordering here on the And the ordering on the index file should dictate the ordering on the data file. And that's why you see the key that does the ordering in the file, it's called in the memory, it's called the index in the search key. And the index that orders them...

Sorting on the file is called a sequential file. So when your sequential file, when your sequential, when your sort key on the file is the same as the sort key, In a search key or in an index file, we say that you are basically doing a clustering index. So the way you cluster your...

Index file is the same way you cluster your actual files on a disk. If it's not a clustering, then you talk about non-clustering, which is also secondary. Secondary is the opposite, where the search key is not the same as the sort key. You can see it here. So you're going to find that this was one-to-one mapping. You can see.

But now, it's not one-to-one mapping. You're going to see an overlap here. You're going to find that, hey, the ordering here... The ordering on an index file in memory is not the same ordering of the search key.

And what we call that, we call that non-clustering or secondary clustering, which is basically exactly the opposite of a clustering. Meaning that, hey, the order of the search key should not actually be the order of the index file. We're going to stop here for now.

So, Someone's going to say, "That might be so efficient because it has a direct entry to the record." Okay. How is it? Mm. So if you look at the number of blocks here, here we have five. 6 we have, 6 we have, 1, 2, 3. Entries. And I can ask, why do you think we have fewer blocks here? Full of gloves, done here. than the original file on the disk. Somebody. I'll try to answer that. We have three blocks.

This one we're only keeping a key and a pointer. So the entries here are key and a pointer. By the time you flew this full cage, Foke du de kien pointa. You know, it's going to be so small. So that's why we only need fewer Fewer blows here. than The number of index blocks usually is small compared to the number of data blocks. This is exactly what I'm talking about. If the index is too, just in case you find that index is too big.

I know we're going to look at that. However, if you have such an index file, if your index file is under dense, If your inner fire is in a dense format, Daniel. We may not really have to do any other such in our original file. So we may not really need to fact all these blocks because Because every single identifier here is pointing us to the exact tuple.

If your index fits in memory, that's even very good.

Because we're just going to go and read a block once. Okay.

So if you're not looking at the death index, you're looking at the sparse index. And look at this sparse index as a An index that's gonna have entries only for the first record per Look. So this is 5 blocks, we are going to take the first entry 10, first entry, 30 First entry 50, Pavlov. So one, only the first record a block is entered. as our entries in this class. Now somebody's going to say, how can we find 40?

So still how many blocks here we have? We have only three blocks 1, 2, 3, and 1, 2, 3, 4, 5. We've just seen that, hey, our indexes, actually our indexes here is even going to be much smaller because we are going to be storing only the first record per dollar.

So in case we have a new record here, which is 210, a new block there, 210 has to appear. Okay. Okay. So, just like the, just like the Just like this index has the pros and cons, this also has pros and cons. So one of it is-We know very well that he All he stole was one knife. index by block So we're going to have even much, much fewer data stored. That means our memory will have There's no way we're going to find that our index pi, our past index pi is going to be even much bigger than the number we know.

Because we are starting on the firstRecord. The problem here is we may not be able to find the record directly. Just like 40.

If you're looking for 40, you don't have any index 10 sending you 40 right over here. If you look at this index file, we don't have 40.

What you do, we're going to go through an algorithm that locates the number 40. Let's say K is 40. You want to record, you want to locate.

The 2 for 40. So what are you going to find? You're going to find the key that is largest, largest key that is less of equal to 2. than the key you're looking for. After you get that key, You're going to make sure that you retrieve the block. The blog.

The blog pointed to by that index pi. Then later you're going to make a set within the block. So the set needs to go through the block. And this is an illustration using SPARKS. We are looking for index 40. You're looking for index 40.

What are we going to do? I don't have 40 here. When I look at my index file, I don't have 40. But the algorithm says, hey, can you look for a number that is larger than the largest number that is less than? Ten is the, a number that's less but is not the largest. 30 is a number that is less but the largest. So that is the largest but less than or equal to 40. So what do we do? We basically go into Follow the pointer.

So when you follow the pointer, we see that it's pointing to 30. That means you have to go through another kind of indexing to find the record 40, which is going to be just going down here in the equation. And this is identical to, let's say, the block.

That's exactly the kind of searching the SPAS index is going to go through if it's looking for a particular record.

Alright, so So Can we have both these records in one such structure, yes, we could call, we might call the multi-structure. level So on the first level you may want to have a dance The next level you may want to have a sparse, the next level you can have a sparse SPAR. That is especially, this technique is very common if your index files are big and they cannot really store, cannot be stored in the in memory So yes, we can span many blocks.

If you want to get a high efficiency security, if you want your security to be very effective, it's the same thing here. If you want our retriever to be effective, we can build a multi-layered system. Indexing. Level. Okay. So if your index does not really fit in memory, then it calls for this kind of approach. That means you're either going toPeace.

So what you need to do the solution that we're going to do We're going to build a layered approach index and the first one we're going to say, okay the outer index might industry nobody we're going to build a sparse index and of the basic So the basic index is the original file with build as first index.

Then the inner one, the inner index, which is going to be our normal index. So you're building Sparse index on top of sparse and sparse and sparse, but given the fast one is just normal. So... If again, as you build, if you find that matches is building sparse index, but they are still larger, You still want to Make them smaller and smaller by only storing the first in entry of each block.

OK. Hello. Let's look at this example of a multi-level. So we have first level. First level is also going to be spas. And how do you know it's fast? By looking at the first entry. Only the first entries So this is our normal file, our database file and it is sequential.

10 follows, 20, 20 follows, 30, 30 follows, they're not mixed up. Okay, it's in an order. And then the other thing is we are only you can see my one of my first block here block one block two block three The entries are only So the first level's path or the first level is only going to have this. Now look at it is pointing to the last entry here. Let's say the last entry block. It's too tight. Game.

Now One, two, three blocks. This is one, two, three, four, five. Five blocks could not fit in memory. Let's say there's only one block. Our memory can only occupy one block. That means we not even here, one, two blocks will still also not be fitting. So what we need, we need to extend it to the third level. And what are we building? We're building spas. So this is how it's going to be. This is going to be my, who can tell me which values are we going to enter here?

At this level, I want somebody to tell me what entries are going to be here. If this is my one block, at the third level I only have one block.

So at this level, look, the entry, the first entry there, this is going to be 10, and this is going to be the next there, 3,000. If we have something else here, it will also show there. If we have something else here, it will also show there.

So if you We are going to build a dance. That means this... At the second level, This is gonna be tiny This is going to be 50, 70, 90. 110, 130, 150, 170. You're basically going to have the same number of blocks.

And you're going to have a one-to-one mapping.

Are you serving anything? No. So... The answer is We compute it, but it doesn't make sense. Because you're not making any savings. You're not even saving on space. Ned, I used to tell you on... Um... On the time. So, viewing Building a dance As a next level, it's very pointless. You don't really need it. Does it make sense to build as pass on uncertainty? No. Okay. If you're building this pass, it needs to be sorted because remember we are only storing the first entry per block.

So it's going to be very difficult for you, it's going to be so messy for you to deal, to implement a sparse index on an unordered.

We want to examine how do you deal with duplicate keys.

When you have duplicate keys, what happens? And then what happens if you're doing a dilation and insertion? Let's start with duplicate keys.

And how does Duplicate Keys come Coming to existence What we've looked at is we looked at the primary key whereby we have only one I did fire, which is supposed to be unique. Okay, but Other Nankies can be allowed to be duplicated. So it's only the Unique identifier that should not be unique. Any other key can have duplicate. So, If you are dealing with duplicate keys We want to see how do we deal with a duplicate case when it's dense.

and in the dense index. How about when it's fast? So let's look at duplicate keys when it's dense and when it's fast. And you can see the keys here are three duplicates. 2 duplicates, 3 duplicates, 1 and then So how are you going to deal with Duplicate keys. that are undeadified. Are you also going to allow duplicate here? So let's say The dance, let's first handle the dance. How does the dance handle duplicates?

So we know that dance is going to have a pointer for every single record. that appears in the desert fire. And that's why you see we have three things that are going to appear here because of this. 2020 will appear here.

One, two, three, four, five. We still have five blocks. I have this, I have this, 20 appears, 30 appears. This is like still Keeping track of every single record. But look at how annoying it can be. Yes, we may find the record so easily because I have a direct link to that record. But you're going to have more fields than necessary. Your index might span more blocks. which may not be really fit into the memory because you have so many duplicate And it can be very If you want to find all the data records With such QK, then you have to follow all the pointers.

That half K. Rich bicycle is not illegal, it may not really solve the problem of indices, of speeding up the retrieval.

Do we have a better way? Let's see if there is a better way. of solving that. You don't want to be following all the pointers that have the value you're looking for. Four. So this is your data. This is your sequential file with duplicates but we want to find a better way and a better way we're saying hey we're only going to be dealing we're only going to store this listing entries. So these deep entries are the ones going to be entered in our index file.

So this is my index file. It's only gonna take these things. values and this is my sequential file This sequential file that you see right over here has duplicate. So when you look at my index file, I'm only having entries that are unique. I don't care how many times you duplicate, but I'm going to be unique. But again, This kind of approach is going to be very Quick How is it going to be kicked? Because I'm not dealing with the duplicates.

I'm only dealing with unique values. Our weapon is going to be more complicated. Imagine, especially if you want to find successive records, if I want to find the last two tenths, Which one am I going to follow? The last two chains, the first chain is on the block one and the next chain is on the second block. That means I'm going to be following this pointer which I'm going to call pointer 1 and pointer 2.

This is a bit complicated. If you want to find records with the search key K, you're going to follow one pointer to the index, then move forward.

Now, Maybe we don't even need to use a desk if we want to work with the duplicates So let's say, okay, Sparse index Since when is this happening? How are they efficient? Is it efficient in sparse duplicates? Do they leverage a sparse index structure or do they leverage dense structure. So we know that here we only store the first value.

which is in this block 10, 10 is right there. Another 10, 10 is right there. 20, 20 is right there. 30, 30 is there. If I'm continuing, this will be here.

So, when you look at this structure right over here, What is the pro here? The benefit here is yes, we're still going to have a smaller, every time you have a smaller index, meaning you'll have fewer entries and those entries can fit into the memory and execute so fast. However, the drawback Off it! is even going to complicate things much further because finding a record like 30 How many 30s? We have two 30s.

So you're still going to be confused whether you go with pointer 3 or pointer 4.

We are tracking two things. We are tracking a file on a disk and we are tracking an index. 5. In memory When I delete record 40, I have to delete this entry here such that we are in I don't want to have a pointer pointing to a null Enough! In our value that is already deleted, we don't want to have orphans in our data pointer. So let's see how deletion really works.

Because remember, Overflows are a little bit like buffer pages. They are not primary blocks. So Modification in a dance. If you want to do some modification in your desk, If Um To a record, you have a record that you want to delete and the record is on disk. So what we need to note here is a modified-a record is modified if a record-This entry, our index file entry The file here, the index file here is modified if a record is created.

Why do you think we need to do that in our dance record?

If record 40 is created Why do we need to update? our index file and if record 40 is deleted why do you think we need to update Yes. Index file. If we have moved this to a second, why do we need to... Why do we need to modify every single time we add or remove?

Deletion using spasms If at an entry of a such key exists in our index file if we have the index file If it is deleted, that is by replacing an entry In the index file for the next key, The next touch key So we We have our index file.

If I have a such key which is 40 Okay. 40 40-40 If 40 is deleted by replacing 40 by 50, Then I need to also change this for a provider. This is the first entry. If this is the first entry, In the data block, in your data file, then you need to change it. All the entries that must be here must be first entries of each block.

So you can see that this indexing structure E is for S+. And what shows this first? The entries in my index file Only fast entries. First entry is on block 1First entry is on block two. First entries on block three. Fast entries on block 4 So, Can we delete a number? Yes, let's delete a record on the file. Let's delete a record Delete 40 What is going to happen if you delete 40? If you delete 40 First of all, how can we even get to Folly?

This is the desired record that we need to delete. Can we delete that? When you delete the file, this is the data file or the sequential file. Which is on the disk, this is our index file in memory. The index file in memory has to Reconcile. has to be in sync. So, when the fight goes, what happens? The block is 4KB. I'm removing half of it. half of the data. So this will remove, but nothing is going to, no change is going to be decided because I'm not tempering with the fast record.

And there you see it. 40 is gone and this block is empty Not a block, this section, half of the section of the block is empty. Because fortune is gone. So the other question you're gonna ask is okay, I want to delete Daddy. Go for the Angelique Thierry. If you go further to delete 30, what is going to? Happy. If you're deleting 30, Now, 30 appears as the first entry of block 2 And that It's going to force a change in our index file.

That is done, delete 40. If you're gonna delete 40, that means delete 40. This whole block two is empty. Block 2 is empty. Reconciling with this, this is going to also go away. Block 4, because it's pointing to something that doesn't have data, including its pointer, is going to go away. Now, we are going to have an empty slot here, we are going to push 50 up here and then 70 is going to come here.

You can see as you delete all the records in the block, you are going to delete this pointer plus this entry. And this is the final modification that you're going to see right over here.

Remember, a dance has a one-to-one.

Every record has a mapping. So let's try to delete When you delete the 30, what is going to happen? When you delete a thari Tariq is going to go away. including its pointer, including the value here and this will construct "CupUp, CupUp" here You and see exactly what is happening. You know So our second block is halfway full You know And we have a assumption here, no pointers to records from Moon anywhere.

So when you're creating, this is gonna be empty slot. This is gonna be empty And this pointer will go away.

Basic Ideas: sparse, dense, multi-level

Index
Index 기본 구조
특징
목적

How to Store a Table

Table 저장 기본 방식
가장 단순한 방식
문제

SELECT * FROM R WHERE key = value

결과

Table Indexes

Table Index