Summary

Hardware and Software Page Size Mismatches

Hardware typically uses a standard 4KB page size
Operating systems also understand 4KB pages
Database systems use varying page sizes depending on the implementation:
- PostgreSQL: 8KB
- MySQL: 16KB (four times the standard hardware block size)
- Oracle: 4KB
When software sends larger pages (e.g., 16KB) than hardware can handle (4KB), the data must be broken into multiple pieces to avoid partial writes
This creates a need for failsafe write mechanisms to handle situations where database pages exceed hardware page capacity
Failsafe writes send data in increments that the hardware can accommodate, writing in bits until all data is written

File Organization Methods

The course covers four main file organization approaches:
- Heap: Unordered collection of pages stacked randomly
- Sequential/Sorted: Files arranged in a specific order
- Hash: Hash-based organization (to be covered in detail later)
- Tree: Tree-based file organization

Heap File Implementation

Heap files store tuples in no particular order, requiring mechanisms to iterate through pages
Two approaches exist for navigating heap files:

Linked List Approach

Uses two trackers: one for free pages and one for data pages
Header page points to the heads of both lists
Each page tracks its own free slot count
Navigation requires sequential traversal following "next" pointers

Page Directory Approach

First page serves as a special directory page
Directory maintains references to all physical page locations
Provides more efficient page lookup compared to linked lists
Requires synchronization between directory entries and actual physical pages
Similar to an index in a book that points to specific page numbers

Page Structure and Headers

Each page reserves approximately 10% for metadata (header) and 90% for actual data
Page headers contain:
- Page size
- Checksum (to verify data integrity during transmission)
- Database system version
- Transaction visibility information
- Compression support information
Headers make pages self-sufficient by containing all necessary metadata

Page Layout Approaches

Tuple-Oriented Layout

Basic approach: Track number of tuples and append new data
Initial implementation (Stroman's idea) tracks tuple count and free slots
When tuples are deleted, compaction strategies move remaining tuples together to avoid sparse storage
Works well for fixed-size tuples with 4KB pages storing tuples compactly

Slotted Pages

Designed to handle variable-length tuples
Header contains an array tracking the starting offset of each tuple
Slot array grows from the front while tuple data grows from the back of the page
Page is full when these two areas meet
Drawback: Creates holes/gaps between variable-sized tuples, leading to performance degradation

Log-Structured Layout

Focuses on recording actions/logs rather than tuples themselves
Operations (insert, update, delete) are appended to the end of the file
Advantages: Very fast for writes (append-only operations)
Disadvantages: Slow for reads, as deleted operations are hard to reconstruct
Can be improved by using indexing/identifiers to jump directly to entries
Requires periodic log compaction to prevent excessive growth
Can lead to write amplification when rewriting the same data multiple times
Commonly used in non-relational databases like Cassandra, HBase, and LevelDB

Tuple Layout and Structure

Each tuple is a sequence of bytes that the database system interprets
Tuples have their own headers (similar to page headers) occupying ~10% of space
Tuple headers contain:
Actual tuple data follows the header with attributes stored according to their defined data types
Data type selection is critical for performance:

Handling Large Tuples

Most database systems do not allow tuples to exceed page size
When data (e.g., images) exceeds page capacity, it must be broken down and linked across multiple pages
Physical denormalization (pre-joining): Related tables can be joined and stored together on the same page
Benefits of denormalization:
Drawbacks of denormalization:

Record Identification

Each record needs a unique Record ID, similar to page IDs
Strong record identifiers use composite keys combining:
System-generated composite IDs are preferred over application-level IDs for uniqueness guarantees

Key Takeaways

Database organization operates at multiple levels: pages, files, and tuples
Pages can be tracked using page IDs via directory or linked list approaches
Storage can be organized as heap, sorted, hash, or tree structures
Tuples can use row-oriented or log-oriented storage
Tuple data is stored as byte sequences interpreted by the database system
Proper data type selection and page layout choices significantly impact performance

Notes

Transcript

Yeah, so the hardware out there knows a standard block is for KB.

So the hardware out there knows the standard page is 4KB. Hardware usually has a 4KB.

Your software, depending on which software you have, But our operating system also understands F4KB So remember the operating system is going to manage the system.

but in the database system depending which system you're using For example, if you're using Postgres, It's an 8KB block. MySQL Workbench or MySQL is going to We have a 16K beat.

This is four times the size of a normal standard pet block. Okay. Oracle has a okay beat.

So What happens? If your software is sending you 16 KB But the hardware can only have 4kb. That means you're going to send, you're going to break this data into four pieces. Otherwise if you try to bombard you know what is going to happen? You're going to fail. You're not going to write all the data. You're going to, instead of writing 16 KB, you're only writing. That's why we need to have like a failing chef Every time you fail CEP you're looking for an alternative.

better way of handling the situation. Failure set. And that is why We say, hey, if you have the scenario where the hardware might not be able to accommodate all that is coming in, You can be writing these pieces in pathways.

then there we have to find a mechanism of handling this writing and we call that a false self-write where we are going to only send what the hardware can accommodate. in beats.

So before we actually break down the real structure, we would want to know How are these files organized? How are pages organized? How is the file organized? So we have a hip.

Just like you hear a heap Okay?

When I'm collecting pieces of papers from you, Your answer sheet, I just put them one after another. So it can, he upsets. without any order. Okay. If you are looking for your answer sheet There is no mechanism, there is no way I am going to look at it. Unless I have a mechanism, something to iterate, one to jump onto the other. page then I can do that but this randomly piled one on top of another.

We're going to look at that sequential, which is just easy to understand.

You are arranging files in a certain order. hash we talked about when we get to hashing and then this is the tree firing organization at this point We don't want to know anything inside the page. If this is the page, page ID 001, You just want to know how these pages are arranged but not not anything yet in the page we'll look at that if we are looking at tuple layout So the hipping approach, like I said, You are going to have an ordered collection of pages, one on top of another.

And they're going to be this much. And then you come to him and say My maps need to be updated. Then I say find your script. It's gonna take you some time to look through until you find it because there's no other You know, So it's the same thing, we need something to iterate.

We need something to iterate over these pages. And we have two approaches.

If everything is full but For a leaf list, We're only going to have two trackers one tracking the head the head of three pages and They are the head of use pages So this is data page list and this is free page list. If I want to write, I have to go to a free location. So I'm going to follow this.

I'm going to add a More information to store. So I need another 4 KB which is going to be stored here. I need another one. So my home is already full. But the point is here the iterator The iterator which is in form of a linked list How do I get? How do I reach? Peg 004. How do I read pages of this raffle? I have to go here and then go to the next and then go to the next you're gonna see that there's a kind of a list a link a list A linked list So in programming languages, just like we have arrays, we also have labels.

And this is pointing to my Sorry, my previous analysis pointing to my next. So This link list is one way that can help us navigate through the hip. to go to the desired page of your choice, just like how I show you.

Now, if you're not using a linked list approach, you're using a directory approach. And it's still the same thing. Remember We want the page to be self-sufficient, meaning that it can manage itself. It has everything. So how many pages do I have here? So many pages including this last one. Let's say eight pages. in in pages These end pages need to have a reference in the directory page. That's why we call this a page directory.

The first page, this P001, is a special page. what's the importance of this it must have A reference of what is actually on the physical hard disk. and the physical location so for example this is This is ID 1, 2, 3, 4, 5, 6, 7, 8, 9. If this is 1, this is going to be pointing 1. If this is 2, if this is 3, if this is the last one, it will point to the left. You can see that he There is no... There is no...

The actual reference is Let me use a book. You know very well that, hey, When you open the last bit of book they put an index They'll tell you, hey. A new one called concurrency has been used in page 116, has been in page 200, and has been in page... So they tell you pages where this keyword has been used. You know? So this is a directory pointing you to those pages. So imagineThe... Imagine when you go to 115, you find an empty null pointer.

That doesn't make sense. So maybe they removed it there and they forgot to update it That's why we say, hey, we need to think. They need to be some kind of thinking. and this means hey you are going to synchronize What is on the page and what is on the actual page What is the actual physical location and what is in your directory, which is basically our indexing "A-la-lay." So, That is how we can lay on a hip.

That's how we can deal with a hip. Let's go inside the page itself and see what happens inside the page so inside the page inside this 4kb block or page we We're going to find, remember this is a 4KB. Let's say 10% of that 4 KB is going to be reserved for... Metadata, data that is needed by the Blob to survive or to do its own role. And you can see that, hey, 90% is gonna be accidated. Let's start with what we find in the header.

The header can contain all these Functionality, those features that will, for example, checksum is going to help us to see if we have not lost any data when we are transmitting. The database system is going to come and say, hey, block. Page 001, tell me what's happening. You know, and then it has to first go through the header. The header is going to tell us, hey, the page size is this. Okay, and then check some with this The version of the system is this.

So the reason we want all this is we want to make the block cell sufficient. So we want to see what is inside this block and how can we organize data in this block. We know that 10%, the first bits of the first bytes of that block are going to go to metadata, its header.

So start with the two-point oriented. The database system is going to come in there and say, "Hey, page, which page is this? P001. Tell me what's happening. What's the status?" It has to first go through the header and read the status. When I read the status, it's going to be here, number of tuples is zero. I don't have any tuples here. So this wall is empty. No. We use Stroman's idea to give us the status of that page.

Stroman's idea is going to track the number of data the number of tuples in the page, but it's also going to tell us We can also get the number of free slots. for example If we add a new tuple, we don't have any tuple. So the actual data, this page is empty. How do you know it's empty? Stroman's idea looks at the statistic which we call the number of tuples. Now if you look at this one, this is different.

I can know that hey my block is half full my block is Well... maybe completely full No. What? How is Trowman going to handle this division let's say a deletion happens And 2.2 has been deleted. If 2.2 has been deleted, we had empty, this was empty and this is also empty. So we have 2.1 and 2.3 in existence. And we say that we advocate for compacted strategies storage. We don't want to have sparse storage.

We don't want to leave space, we want to compact our data. And on reconstruction, this is going to slide up here. Say that 2.1 and 2.3 are in close contact. And then the two slots are empty and are going to be pushed down. Okay. So, straw man will still come and reduce it. I have two tuples only. One has been deleted. So, Sumer is just tracking the number ofUh.

which requires what we call slotted pages.

Yes. You may find that this whole page that needs to be 4KB will actually storing this why because of this space you see this space here We left some space here, we left some space here. We left some space here. So you live in gaps. You know, because you are dealing with variable sizes of tuples, you're not dealing with fixed tuples. This starts from here, ends here. This starts from here, ends here.

This starts from here, ends here. They are all compacted.

This is going to call for Some performance degradation.

So the starting point Location, and if you look at it, You see the first two posters from here and it's tracked.

is an array and its role is just to help us track the starting point.

of each tuppo, that is all we want. If I know the starting point, I know even the ending point of The starting point is the egg The starting point of air is It's also the beginning of the next. Okay. So, that header here that you see here we're going to have number of uh two post number of uh Yours too, lords. So these are all your slots And these are empty slots. You need that tape. They are waiting if you add it if you add this load they're also gonna be Drawing a link to it An offset of the starting location.

now this is what was if you're going to continue that if you add as when you add a new tuple you also adding a new array the array size is going to increase until these two arrows meet If these two arrows meet then it's gonna be that this Your 4kb will be full Okay. But we say this has a drawback The drawback here is the holes that we saw, the holes that we're leaving, empty. You may find that this will never be Fueled again And it just has to be left like that.

So here, We focus mainly on records, on logs. as opposed to two-pole. information So in the logs what is there is The actions that are happening on the file What have you... Created or modified? So here it is easy to append To tell us what has happened So maybe after updating I did a delete After delete, I did an update.

After an insight I need a delete So it's easy to see a log file.

by just appearing at the end of the file. Okay, so here we just have to look at the end of the file. The end of the file tells us the last action that happened. And then the modification operations are normal in SAT, update and delete. Now.

You can see new entries. It's easier. This form this file format or layout is very easy if you just Append only a pen only meaning you're just adding only to a file but if you are also reading then we're going to have a problem with the reads so this structure is It's very fast on reads because I mean you just have to establish the end of file In the end of the file, they knew it's easy for the write, but it's very slow.

when you're reading because some of the operations like delete This operation is going to be very hard to reconstruct because it has been deleted.

If you're choosing these two layout formats, you need to know the pros and cons of the other. So yes, if you are only going to add to the file this one of good and you're doing less reads if you're doing less readsDean. You can do the log file. So, how can this log structure look oriented be modify be enhanced You can improve the slow reads by using the identifiers. Basically you're indexing. You're using indexing to throw you to a specific entry.

You know, it's very easy to to jump into an ISO. The only way the log file can improve is by introducing in the ID numbers which is basically an indexing system.

So you can see that it's going to be very easy for a system that appends so easily. so you compact everything at the end of the file.

Mm. The only problem with that We are going to end up with an amplification, a right amplification. whereby we're going to rewrite the same data over and over. For example, you adjust editing These are gonna be three entries But you're just fast changing this one and then thinking, oh, I think it's 18. I would think of...

Remember everything that you do is appended at the end of the log file. That is going to lead to amplification. So for people who want to, who will study more of this, you're gonna find that things like Cassandra, HBase, and Level, Those are systems Those are systems that Those are systems that are going to that support this kind of fire organization. Also call it the rock oriented. Sorry, colon oriented.

This one? You are telling the database system that, hey, I am telling you how my data needs to be. Can you resolve this? Because people are going to enter their debtLittle bit. You know? Day 2 back. Okay. So... Like you see on the right here Anything? So it is the work of the database systemTwo. Interpret those sequential bytes to really reveal to us Okay, the value that is stored in each attribute.

So, Just like a page that has the first portion dedicated to the header, the same thing, we have the first 10% dedicated to the tuple header. So you can see this 10% is to put the data and 90% from here to here, 90% is dedicated to the actual attribute data. So don't confuse. We have a page header. And we have... Each tuppo is going to have what we call the tuppo header.

The tuple header is going to have self-sufficient information. That is going to help it survive. You okay? Call it the metadata. Things like visibility information, we're going to have bitmap information.

of storage So, of course, at that point, you don't want to. So, Now that we looked at the header and we know what the header can do, Let's look at the actual data. In the two-foot.

And if you look at this table full A, B, C, D, E. A, B are integer. See? Yeah, including C. D is a double and a float. Now... you can see ABCD are all side the sizes are equal why the only teachers I'm saying integers in memory and it you You are resolving for four bites. you're reserving four bites so from here to here is four bites because of this from here to here is four bites because of this from here to here four bites because of this now a double You result 18.

Bites. of memory. That's why this is twice D is twice. E float is also 4. It's 4 bytes. That is why choosing the right data type is so important. If you are thinking in terms of performance Why do I need an 8? Bytes reserved in memory when you know I actually need four.

Now, we're saying that hay Most database systems do not allow tuples to exceed the size of a page. If you have a tuple that exceeds the size of the page, maybe let's say F.

Would you note I love It took put two to exceed So what did you do? We're basically going to Have it we're going to break that into another and then link it So didn't normalize I do put data So If two tables are related, you have two tables that are related, Foo and baa We can rejoin them This can be pre-joined. Let the table end up on the same page. So... Because we cannot allow some tuples to exceed the size.

So break it down so that You, uh... If you have another one, let's say "vous" This cannot also fit in vu Make sure you also And if you have another one too You just have to make sure that whatever is there is 4KB, 4KB, 4KB, 4KB. It can fit. So you're going to find that you're going to have a chain. You're gonna have a chain Of those pages that you pre-join them, So we can physically denomalize, which is a pre-join.

You come to join? We can physically rejoin these related tuples and store them together on the same page. So If you if we if we have it linked like this then When you're storing, you can store just the pre-drawn. The linkUm. between these two Look at it as a foreign key. Okay. So we don't have to store both of them. On the same page. We just have to install its Well... It's JOIN, which is basically What we call the denormalized.

If you have your data denominalized, Pre-join We're going to reduce the number of times we do the input outputs. The number of times you read and write So you just have, because everything is on one page, you just have to load one page instead of loading two or more pages. I've just Loaded everything on one page instead of having stuff on separate pages. The only thing we're going to make updates more expensive And you're going to need more space for each tuple.

These systems are He made use of his technique But several new driver assistance do this without. Um... Calling it physical normalization. Identifying a record. Identifying the record. Now that you have You have a tuple? Maybe have several tuples. You need... and record ID. Just as we have a PID We also need a record ID. And this can be record ID 001, record ID, page ID 001. What is this? This is basically uniquely Iden Fine.

The starting slot in a physical, the starting spot At the physical level, Okay others also added file number the file ID so imagine you You have a file ID plus the offset No. You find that, hey, that identifier is a strong one. Why? Because it's a composite. Composite primary keys Very strong primary keys, they can never be replicated. Y Why do we choose this kind of scenario? Using the system generated ID numbers, As opposed to just having our own ID from from an application point of view.

So we don't want to rely on the On that, what is this conclusion? OK. So... enough for storage B that we've covered We have covered that hay Database is organized in pages. Different ways to track the pages. As long as you have a page ID, You can Track the pages. You can use the directory. Or you can use a "links-links-bash-bash" Store pages, you can store them in a row-oriented fashion or, I mean, store pages.

Storing pages, you can store them in a heap. Or you can store them In a sauteed manner. You can use a hash file, you can use a tree organization. Two poles. We saw the tape Our two posts can also be In the raw fashion or in a love fashion We've looked at, hey, we have Two per hand now. and two potatoes. These two-pointed data is a sequence of bytes. which is Ciphered. By the database system Basically For you to understand, okay, what does this bite mean?

This bite means there are three attributes. And these attributes are all integers. The same size attributes and each size is a 4KB. That is all done by the database system. You don't have to I don't have to Um, So...

Now that we are at the lowest level of storage, which is a tuple, We want to see, we want to break down. The representation of each attribute Okay.

Database Storage (Part I)

Database Storage 파트는 DBMS가 데이터를 디스크에 어떻게 저장하는지 설명하는 부분이다.

DBMS 구현에서 가장 중요한 두 가지 문제

데이터를 디스크에 어떻게 표현할 것인가
디스크와 메모리 사이 데이터를 어떻게 관리할 것인가

System Design Goals

DBMS 구현의 핵심 목표

1. 메모리보다 큰 데이터 관리

대부분의 데이터베이스는

데이터 크기 > 메모리 크기

따라서 DBMS는

대부분의 데이터를 disk에 저장
필요한 데이터만 memory로 가져와 처리