Summary

Data Representation Fundamentals

DBMS stores all data as bytes at the lowest level, not as high-level data types
Tuples are essentially byte sequences that the DBMS interprets based on schema information
Schema information is stored in system catalogs, which the DBMS uses to encode and decode tuples
Storage Manager coordinates with other DBMS components including buffer manager, query optimizer, and indexes

Data Types and Internal Representation

Integers (INTEGER, BIGINT, SMALLINT, TINYINT): Use native C/C++ types with 8-bit standardization
- Fixed-length storage
- Choose appropriate size based on data range for optimization
Variable Precision Numbers (FLOAT, REAL, DOUBLE): IEEE-754 standard, CPU-optimized
- Faster performance but prone to rounding errors
- Example: 0.1 and 0.2 cannot be precisely represented in binary
- Not suitable for financial data requiring exact precision
Fixed Precision Numbers (NUMERIC, DECIMAL): Arbitrary precision with guaranteed accuracy
- Similar to VARCHAR in storage (variable-length binary)
- Trade performance for accuracy
Variable Length Data (VARCHAR, TEXT, BLOB): Arbitrary byte arrays with length header
- Default 255 bytes if not specified
- Should specify appropriate length for optimization (e.g., 3 for country codes)
Dates and Times: Stored as integers representing elapsed time since Unix epoch (1970)
- Uses 32-bit or 64-bit timestamps

Large Value Storage

Overflow Pages: Used when values exceed page size limits
- PostgreSQL TOAST: Triggered when data exceeds 2KB
- MySQL: Triggered at half page size (4KB for 8KB pages)
- SQL Server: Triggered when exceeding page size
- Can chain multiple overflow pages together
External Storage: Large objects stored outside DBMS
- Oracle uses BFILE data type
- SQL Server uses FILESTREAM
- DBMS maintains pointer to external file location
- Suitable for images, videos, and large binary data

System Catalog

Internal metadata repository storing information about tables, columns, indexes, views, users, permissions, and statistics
Catalog is self-contained and stored as DBMS tables
Can be queried using standard SQL
INFORMATION_SCHEMA: ANSI standard metadata views
- Read-only in most DBMS implementations
- Provides access to table schemas, columns, and other metadata
Critical for query processing - DBMS consults catalog to verify table and attribute existence

Storage Models and Workload Types

OLTP (Online Transaction Processing):
- Short-lived, fast transactions
- Operates on few tables (2-3 out of many)
- Write-heavy workload
- Repetitive operations
- Example: E-commerce transactions, simple profile queries
OLAP (Online Analytical Processing):
- Long-running complex queries
- Spans multiple tables with aggregations
- Read-heavy workload analyzing large data portions
- Example: Top-selling items analysis across locations
HTAP (Hybrid Transaction/Analytical Processing): Combines both OLTP and OLAP

N-ARY Storage Model (Row Store)

Stores all attributes of a tuple contiguously in a single page
Advantages:
Disadvantages:

Decomposition Storage Model (Column Store)

Stores values of a single attribute together in column-oriented fashion
Also called vertical partitioning or DSM
Advantages:
Disadvantages:
Tuple Identification Methods:

Tuple Modification Operations

Insertion:
Deletion Strategies:
Deletion Complications:
Updates:

Key Principles

Storage model selection depends on workload characteristics
Row store optimal for OLTP, column store optimal for OLAP
Storage Manager works in collaboration with other DBMS components
Tuples are encoded/decoded based on schema information from system catalog
Physical storage layout and operations are computationally expensive and require careful design

Action Items

[ ] Complete first lab assignment (10 days)
[ ] Review assignment rubric before submission
[ ] Demonstrate file creation and partitioning into 4KB blocks
[ ] Implement reading blocks into buffer memory

Notes

Transcript

And of course This the database system needs a catalog it needs metadata to be able to Um... Do all this way. So we need the metadata that's going to have schema information about the tables. Uh, The system uses, so the data in the system needs this metadata for us to figure out The two points itself are the same. Let's think about this Attitude.

But we think about this attribute At a low level, Okay.

Now, At the low level, We say that the database system understands Bites! Understand bites. So the value is gonna be bytes You know...

Dates time. Intervose If you go to Postgres, PS2L And you look at how many data types they support, they support so many. Character, Which is just a car. Okay. Alright, I think we... Only four minutes left, we've got six minutes. I can just sign off from here. OK, so one thing that we-I'm going to upload The first assignment, the first lab, Assignment. which I'm going to applaud. And, um... Normally, assignments are going to be taking 10 days.

So you got 10 days to work And the Simon including the weekends, you can decide I encourage you guys to This cast aside But when Juha When you are Submitting assignment. Please do Do not... Do not submit the same piece of work. Okay. Always look at the rubric. I am also going to upload the rubric. And I normally say, refer to the rubric before submission. And you can actually know how much you're going to score before submission if you Uh... Look at this rubric. A rubric tells you-It basically sets the expectation.

in CLC++. So it means that hey, we are reserving eight bits of data. And the beauty about The beauty about Standardizing this across all the is that, hey, because there's so many applications, internet applications out there, They are going to use integers, they're going to use the different types of represented differently. Still you want to know that hey my application that uses Java, takes an integer of 8 bits, and my application that uses C++.

uses an integer of 8 bits. So it's kind of a standard that we want to represent.

yeah so if you look at this standardization you will see specification of of that of this integer say hey integer can be broken down in this one two three four And the difference is, I just told you, the difference depends on the range of numbers they support.

Because we're talking about, we're going to talk about optimization, which is enhancing or improving performance. You want to choose data types that are Appropriate for the data.

So, What's going to be the appropriate? Are you just going to have it as a... Are you just gonna have it as a... as a voucher of 225 bytes that is wasting memory so you don't want to you don't want your System. to reserve 255 bytes just for three characters that you want to store. So those ones are clear. So integers are very straightforward, but please pick the most appropriate integer. If you go to precision, this is floating points.

You'll find that on the 64th, that's when these Four species are different. The problem we're going to have with the varying precision They can be fixed they can be fixed to a standard for example a float a float is going to give an act the float is going to give I fixed six decimal points. Okay, and I really Explain why We normally go to ditch. The Flux. So floats normally we're going to use, we're going to be using things like dabbles, Um...

Rounding numbers or rounding errors just like you see right over here.

This one's plusses. Now I have not changed The only thing I've changed is only here Instead of having a default F You know, I'm just saying, hey, can I look at 20? Procedural points So you want to see a higher precision point. of these floating points. So what you see These two values are different. In the first example, you saw the two values are actually the same. But now the two values are different.

So that is what you say, hey, rounding errors, that's one of the problems that you can get if you are going to be using mostly Variable procedure.

Then we introduce A floating point, but they guarantee you an accuracy regardless the number of points you want to look at. For example, things like numeric and decimal, you know that they're going to ask you for a precision and a decimal number. If you have numeric 2 numeric 5 2 That means there are going to be five Numerals by day, last two are basically decimal. So The beauty about using This is basically giving a Giving the system An idea.

of How best how best the system can really represent without altering the accuracy like that we've seen in the previous example. So it's also going to be like a bacha, variable of varchar where you can fix So look at this as a vaja then can fix Um... There's more places.

all the text. So You having this in that you having these two values But these two values are the same. They are both not 0.3. that if you fix them, you're guaranteed of a naturalcy, although you're gonna trade offUm. Performance. Basically your performance is going to degrade a little bit but you're going to get accuracy. Now, the Greek about postgres is a bit extensible. And what do I mean with this?

Variable length data is very common, especially with a variable character Variable binary is not really gonna talk about that. Text. Like NIMH Then if you're working in a database And you say, you know, okay, I'm working with the Chinese names, you know, the names are going to be short. shorter Again. You basically want to fix them, you know? If you don't fix it, then it's gonna give the default which is not efficient which is a 255 bytes 255 bytes.

So you've got to fix it the voucher, fix it that is It's gonna be a five So Varying dataVarioData.

If I don't put the maximum, the system is going to give me default 255. If we're talking about optimizing databases, You're not doing a good job.

If I say A new variable, country, and you know that You want to only use three characters to...

So it's better to have it as a chart Three So you can have this, you can have this, you can have Japan, you can have... Blah blah, all those three digits away, all they can do.

So what you do We're going to basically create an overflow page. Look at an overflow as a backup page. So, we can actually have a chain of this overflow, another overflow down there, another chain of overflow there. If your picture is 64 KB That means you need 4 overflows. This is one overflow You need three overflows. The original, page, The original plate, the overflow phases are going to be three.

because of 16 I'm thinking I'm saying 16 K Here is your image and we have big images If you are storing photo information That means I need four blocks. Four blocks of 4 kW. So how many overflows am I going to have? I'm going to have the original, which is the primary page. This is going to be the primary page. For KB, feel it there. If anything is overflowing, should you go to... A, a chain. a chain of these pages.

So we're going to have a chain on those pages.

You just need to maintain a link. So there should be a leak From the textbook from one overflow from one page to another page basically keeping track of it Now, if you're using different operating system, the code is featured differently If you're using force grease, it's going to be a toss. feature so you can go there and manipulate it if you are looking at the database system software. So the TOS is the Oversized Attribute Storage.

And it's always triggered If If what you're storing is greater than You can do KB. So we know that hey the full page is 4KB. So we are saying hey if what I'm restoring is greater than half the standard page size What does that mean? It means that Trigger the overflow page meaning that activate the overflow page push This is the next floor. If you're using a maestro It has a goal and overfloor feature.

It's also 3-guide if If the page is more than half But remember the page in MySQL is 8KB. So if it's a 4KB, anything greater than 4KB, Yeah. Um, S.U.S.S.A.M.A. The feature is also called an Overflow. It's always triggered if it is greater than the size of the page.

You can have additional overflow pages until your whole data fits well. Now if you are not using an internal approach you are going to use an external approach and this is exterior to the database system meaning that hey You are going to look at something physically isolated from the original system.

So this is gonna be, could be any storage devices that is outside of the database.

So blob is normally blob. So the photo information that you're trying to store is a blob. and I'm sure you've opened a file, you've ever opened a file and you just see a bunch of zeros and ones that you don't understand. Yes, on the screen you don't understand, but 2a database system is actually understanding.

So if you're using Oracle, it's got a B5. and It always has, it's going toI have a link to that physical location that you're using. the external location. Microsoft uses a file stream feature to do to have it on a separate Can you mind if I play something on the next turn or no? So we're just going to store it somewhere. We don't want it to fill up our database. So we just have it, we just call it upon, just like how you have an external.

This is like a new data type Dates and time or timestamps Ten years ago You would represent death as a voucher You know, because it's a mix of numbers and characters. Today, a date is recognized as a date, date of time. Or you can put a timestamp, meaning that hey, You just want it. String of number you want a number that represents today for example today has its own timestamp The actual time the actual date and time as isAs is represented by the number.

That number is initialized In what we call the approach time, the unique approach time, this is the time that was initialized uh... when the computers Studied the incorporating time, that is 1970. So the time, the number there is 00000. The approach time when the computer started the time was 00000. And... Every day that comes up that number increases If you get a timestamp for today and a timestamp for yesterday, the word of today is So This is um and the other the other Data types.

that we can represent our attributes in. This is a loader, especially Postgres has a bunch of them. Like I told you, it's also extensible. You can add your own user-defined data types. A system catalog, a very good database has to come with a system catalog.

Look at a system catalog as a mini Database within the database system. You will already say that the database system is an intelligent database. But where does it get intelligence from? It gets intelligence from the mini database, which is our metadata.

So yes, um In the metadata of the system, this is the kind of data you're gonna find.

Things like tables, indices, permissions. You see internal statistics. Like there's some statistics we're going to use when we come to optimisation.

It's generated by the system, you know, And they tell you, so every good database should have a system catalog. And this should be, most of the time, self-contained, meaning that it's encapsulated in that one unit.

So So I already told you that this system catalogs can be queried, can be bootstrapped with standard query.

So, the most common one is information schema.

Every time you see an information schema, you just know that, hey, I go there, I get information about the information or information on my actual data. database and the actual system So you find things that are really only there that we don't really have to change. views. If you are creating a temporary table, it will be there, you know. Um. So... These are some examples of querying the metadata. No.

If you're using an old version of SQL Server 92, Okay, and you want to know about this database, put the database name there to show you. If you're in Postgres, just say /t and you're going to see everything about your database. My S2L show tables will show you how many tables are there.

As you will notice the door tables You can actually go farther To look at the actual Table schema.

So you specify student table. We want to break Down to no... What is the benefit of a student table? Maybe a student temple has 21 fields.

If we At the low level, this information Yes, we can use it at the high level, but the system also needs it. The database system needs it. You're going to tell the database system, can you pick for me return data from student tables and professors? Those are three tables. student courses and professors. If it is not of the catalogue, if it is not In the catalogue, it's not going to know it.

So your query is going to go to the catalog. Say hey I know I'm going to return later from three tables, but do we have three tables?

And from each of the tables, I'm picking two attributes from one table, one attribute from another, and one attribute from there. Are those attributes actually present in that table? So that is a...

Um... So your question is you may want to know what's the best layout? For me. That question is going to be asked from the other From the mode of operation, which we call the workload, Gabe? The workload. So this storage layout is motivated based on the workload that is coming.

So this workload, OLTP workload, is normally a workload.

A certain day to the current date. When my host name is given, Um... I want to update this too. where I was given. This is an insight Ok, you can see it's a simple insert. I'm just, you know, in setting new values inner vision. Now, This is regarded as online Transaction Processing Wacklot. And I'm giving an example of it. This is where normally We are going to do You know. We are going to do simple work.

Like, hey, can I please check my-My My profile information Thank you. Can I please check? Um... Which classes are still having slots? Those are simple queries. They will just pick data from a few tables. No. Some of the characteristics of this kind of workload, they are short lived Meaning that, hey, we have fast to finish. And, um... their queries operate a few tables, if you have fifty tables You're probably looking at three tables or two tables.

you typically handles more right so this is ideal for more rights than read. Okay. So we can say read heavy applications Require these. More rice than wheat. Repetitive Something repetitive. And I'll give an example of you buying something from Amazon. How many people are buying something from Amazon right now? Many. But what are they going to do? They do the same procedures. Gets. Get the item, put it in the cart, go to the checkout, check out.

Next person in the queue does the same. Get the item, put in the card, check out this Good idea.

repetitive Simple and repetitive Simple and repetitive So, we are saying, hey, if my database system is going to deal only OLTP, Then I am going to motivate a role oriented as a storage model. However, If you have it the other way around, The queries are large. The complex They take long to finish. which we call the online analytical processing workload. You know, they're going to span multiple tables.

multiple entries like this table that has some extra it has some group by aggregation You can see that You know You're going to do a lot of logic. You're asking me to do a lot of logicEagle. Picking from how many tables here? Okay. We have a filter, we have a pattern recognition, a pattern like, you know, substring. So... This is regarded as complex, although we don't really have a canonical definition of what is complex.

But look at a complex query as something that is going to require a series of logic. We have one logic that is the input of one logic is going to be fed into the other logic and logic until you get an output If I want to see If I write a query that is going to return My cumulative Red point average Okay, for the first two times and then I want to use that computed one to see okay Um... How many GP points are my left knee.

So they have to get the repetitive data from OLTP workload such that they are going to be able to analyze the data. So there must exist OLTP application before OLTP. TP. And these are the characteristics. Um... They ride on the existence of LLTP data. An example would be You want to return the Top. Uh... Five both items from Amazon. in different sites. You want to see in New York, in Chicago, and in LA.

I want to know what is the most bought items from the Amazon site. Maybe the guys in New York are buying electronics. We are buying mostly winter related items. You know? So that is a complex way, that's the, you know, because it's going to kind of aggregate with some kind of samurais. This is You know, Uh... They do a lot of rings. so reads large portions of data, a really large portion of the database.

So those two workloads If you have 50/50 Then you can go with a hybrid.

And there's a paper talking about a hybrid scale. We say new It's a new paper. It's a new kind of a, and it provides both walls. So you have a bit of OLAP and a bit of OLTP.

So Because we know that our What could be either OLTP or OLAP or even both.

You know so we want to motivate our story model based on that I will say that hey, we have two ways. Um. to store tuples in the page. And the database system can store tuples in different ways that are better for OLTP or OLAP work logs. I can say that hey We are going to be assuming that an N array which is the row oriented Storage Um... Which is like... Yeah. ideal for relational database. So... N array is basically a row-oriented one.

So N array, which is also N, S, M. The database system here is going to store all attributes. of that single Uh, two-pole. close to each other, continuously in a single page. So we say everything is going to be in a single page. Thank you.

It's going to pollute 80% of useless data into the memory. You don't want Something that you don't need in your Memory. which you are trying to manage efficiently. Okay. So if we are not using NSM, we are using DSM, Decomposition Storage Model. And... This is basically So different from the other all the attributes there are fixed in one entire single tuple here we're saying hey We're going to store the values of a single attribute in a column.

which you can also call this model a colon-oriented. So all the values of that column All the values of that attributeIt's also, it's going to be like in one block So it is also looked at as a vertical petition. So you basically petitioning your database vertically, you're collecting individual columns and you're physically isolating.

Also can be generalized as a sea storm.

That's a colon stop.

And we say these are very good if you're looking at OLAP workloads.

There's an easy or simple state when you want to use that Simple. Case here, so we are assuming that the tuples are fixed sizes and they're not in sequence. Most of the time, they are always ordering something that's not in sequence. So, inside you just have to establish the end of file. When you establish the end of file, it just goes there. or in deleted slots. We have always looked at that. If you're not using the end of file, You have to use for any deleted slot And I think I've shown that before.

Deletion When you are deleting a tuple from the block Thank you. And a deletion. When you deleteItaly?

Thank you. You can delete and immediately reclaim the space How do you do that? Delete and then shift The other two poles are the lower to occupy the space. or marketer delete and let it free for reuse So just mark it as delete and if any SAT comes in, then you can easily use it. So the problem is you if you have so many deletes you may have a chain of deletes and you need another way to manage them.

It's gonna waste a lot of space. Okay. So the trade-offs here when you're deleting, And again I'm saying deletions and addition or insertions. of two posts on the lower low list They're quite expensive computationally. Because look at all this stuff we're going through. So what's the trade-off here? If you want to go with the fast approach, It's gonna be expensive, why? Because You are actually moving All the tuples, tu.

You know to the new position because you are basically reclaiming new spaces. And then if you have a chain-or at least Of many free reused patches You're going to find that if you accumulate it over time, you're going to have a lot of space being wasted. You don't want to do that. So we have a caveat. or a warning when you're going to do these deletions.

Us. As opposed to this, Special tuple. So this tuple is basically used to contain the information deleted Now update Update is very similar to insert The beauty about this is, hey, if whatever you're updating is shorter, Meaning that hey, they fit.

This fits into a fixed size tubule. That is an easy case. We consider that as an easy case. If the tuple is longer, And we talked about this If the tuple is longerWar is more than what we can afford.

As we conclude this storage, We have seen that hay The company that is responsible for database storage the complete layoutOf a database file on a disk He's a store manager and he's not entirely independent. It is a system that what? In collaboration with other systems, The bathroom manager the query optimizer the index You know All that, the next step is the database system. So the first component that we have done and all that is responsible is that.

And your first assignment, requires you to demonstrate How a file can be created? The file is partitioned in two blocks Each block is 4 KB In that block you can You know, rig They fight into memory Which is the buffer memory and that is our next topic But the other thing we talked about is the smallest representation of Daytime. Okay, on a disc is when we lay out in a tuple. And that is the sequence of bus.

The database system does not understand the two-point name and the value, but it understands the bytes. Where it translates is the codes and the codes, you know, and they understand this is the attribute Holding this much Okay, on the physical memory, No. in the on a disk So the database system does all that.

Um... And then we talked about two Um... Two ways of lay data. But we say, hey, We are going to lay this data Either in row format or in column format If you are showing low performance You must be leveraging the OLTP, which is the online transaction process.

Database Storage (Part II)

Part II에서는 데이터를 디스크에 어떻게 저장할 것인가에 대한 Storage Model을 설명한다.

핵심 질문

Tuple의 attribute들을 디스크에 어떻게 배치할 것인가

즉

Row 기반 저장
Column 기반 저장

이 두 가지 방식이 존재한다.

Data Representation

Database에서 데이터 저장 방식은 Storage Model에 의해 결정된다.

대표적인 storage model

N-ary Storage Model (NSM)
Decomposition Storage Model (DSM)