Lecture

Summary

API Management and Gateway Concepts

OLTP vs OLAP Systems

Database Types and Use Cases

Data Normalization

SQL Language Components

Database Security and Access

Database Indexing

Database Design Principles

Tools and Technologies

Upcoming Course Assignments

Action Items

Notes

Transcript

multiple purposes. One of them is policy applying, end point abstracting. They don't want you to know the next one. They give you a proxy header and then behind the scene they'll call. They don't want to express the real end point. So there are so many things you can do with API management. And then you can also do federation where you need to call multiple APIs and then assemble them. So there are so many things that are done You're not going to do all of it.

You're just going to make a simple API. But be aware, I mean, there are many companies who purely thrive onAPI, just selling. Epi, end points. All they do is sell and you pay for it. And traffic data is one of them. I think there are only one or two companies that You got this. And then all the construction and everything else that happens, weather, they combine all that. You don't want to build that.

Even if you're a big company, still you don't want to build that. I don't think Google has its own. I know that Microsoft has its own. They all buy data from us. for the big maps and like Tesla or any other traffic vehicle system, they're all buying it from the same people. There's one person who, so the API gateway is the one that's getting all these calls. throttling it and sending back the information.

Yeah, so you'll be-So you'll be building your own API. So... What is the difference between All that. and all TP. I intentionally put that abbreviation just to see if you know what the difference is. So So how many people, is anyone specializing in data here? Data warehousing. You're all on the application side. That's a separate specialization, Data Warehouse.

It comes from the data science area. Oh, even data warehousing comes under data center, eh? It's not possible because I can't data warehousing here as part of the ITMD. I don't think it's possible. But even ITMD has data science and AI, right?

Okay, maybe it comes under that. What does Lee, Professor Lee teach? Let's come center. I can't. Dr. Lee, like I taught the same class as him.

For a leader warehouse.

So what is the difference between a data warehouse And a typical MVC database as an application developer. What is the difference? Why are they creating a data warehouse if the data is already in the database? You won't get into that thing since it's not a data warehouse. But there are two types of data depending on the purpose for which you're using. When you're developing an application, You build a database.

that are tuned for Getting a transaction. Done. What do I mean by that? You go to Amazon, you want everything to appear very, very quickly, right? So when I want to get the products, all the products should come. And then when I go, then when I click on a product, that product brings the price. Right. And then when I want to go to like a Find similar products and then it lists all the products. Right.

And when I want to go add it to a cart, I immediately add it and then I pick a payment method. If you notice, each one of them are kind of isolated. Each one of them. So in a typical transactional database system, Do you think we'll have one huge table Or do you think we have multiple tables? Multiple. Multiple tickets. And explain to me why. Don't you think it would be easier if everything I want is in one row that I could quickly query that?

But you have to go through everything to get it now. Well, the intuition might say no, I can go to one group, get the product, description, pricing, things. But your first answer is right. That's what we call normalization. So if I have one huge, so the transactional systems, they're kind of navigating through a set of actions. Okay, the first thing is Hey, I log in so I know who you are, right? And then if I want to get every information about me, I can just go into my profile table and get our information.

Because I already know who I am. And then if I pick a product, then I've already done the product ID. And then once that is the product ID, I can get pricing and everything else, right? So I look at multiple products and pick a product. Immediately I've identified the product and then I can go everything about the Friday. And then once they want Enter an order The product the order gets into my table, so there's an order table.

But in the order table, there will be another order details table that will tell you what are all the items in the order. So everything is kind of Split. And that's what we call normalization. Even though I may repeat few things, right? When it's split, I have to have some kind of foreign key or some kind of a reference, right? I split it. But that makes it easier because Don't have to go throughOkay.

multiple transactions Like, if I put all of them in one row, To get All the others, what do I have to do? The entire thing. So if I have 100,000 artists with 200,000, with 20,000 people to get my order, I have to first look for mine and then go through all of them, find mine, and then within that find my order which is current. So It is It makes it very, very inefficient for transactional data. On the other hand, In a data warehousing, we typically do reporting.

In the reporting, we don't go and say, give me the order of this person. They will say, hey, give me all the order for this month. For that, Do you think We should normalize the table or denormalize the table. When I say normalize the table, you're breaking down the table, right? Denormalize means one table, everything. So if I'm doing that kind of a query for analytics and reporting, Do you think I love a lot of tables or fewer tables?

When is a query efficient? When you have a lot of tables or fewer tables? A query, a single query, when is it more efficient? When you have a lot of tables or when you have fewer tables?

When you have a lot of tables. The query is efficient when you have a lot of tables.

Give me the logic for that.

I can see a few tables here with data, so it takes time to retrieve data.

No, but if I index it properly, I can scan through. But if I have a lot of tables, what will happen? How many joints do I have to have? I have to join from this key to that key, from that key to the other key, other key to that key. The joints take more time. Scanning a properly indexed table takes a lot less. Right. So whenever I'm doing a query, it is better to have Pure or mortals? Two of them. So the analytical, when we say OLTP, we're talking about online transactional processing system.

And when you take all that, we are doing online analytical processing systems. I'll let both of them use database. These modern days they call something called data lake where you may not use database but you still get various like RK and things where it's stored as a file but they give you a SQL interface. End of the day SQL is the most popular query language. When we create Usually, databases are created for a transactional system.

But because of these multiple tables, They are not very good for reporting. So... A lot of jobs that you see about data warehousing Snowflake, data breaks, We're all about analytics. which stores data completely differently. Its needs and its organizations are different. But in our class, we're not doing data warehouse, and we are doing what? Transactions system. for your back and front end development typically it is transaction system What is the front end for analytical system?

Dashboard. What would be a tableau? but you as a full-stack developer you use those tools but connected So it's typically done, that's why I said data science, data warehouse people, they do that. But you are developing a web application and APIs. So you typically use transactional database system, which is normal. What is nominalization? I will tell you. and one of your assignment is If I give you the huge rows of multiple columns, how would you normalize?

Because until you, most of the deficiencies in queries for transaction systems comes out of poor database design. So as an API backend developer, you should have a fairly good knowledge. We're not going to go through complete database fundamentals or anything, but we're just going to go how would you organize your data so that when you're creating an API, if I ask you to create two or three tables, you do them properly, right?

You do them properly normalized. Not create one single table and do your APL. So that's the purpose. So even in both the cases, our goal is we want to store the data properly and easy to retrieve. The easy to retrieve will depend on how you have normalized and depends on what context you're using it. Are you using OTP or OLAP? and data consistency and integrity. So... Let's say you have a database.

Let's take we are using NoSQL. So what is one of the things I told you about NoSQL? Does NoSQL manage referential integrity? What is referential integrity?

But if you're in a no-sick world, Will it allow me to read it?

It will, because there's no lock. And then what happens is that there are so many eventual consistency, but if you do not want this thing to be, then you'll have to do something to force that, hey, when some bracket is being accessed, you need to Like you need to make sure that you update after, right? Because what will happen is, When this person sees it, it'll say refrigerator. But when somebody has changed, it'll let you do it.

Right now I can use the ID to find you very quickly. And I can use that ID to look for all the others that you have. So... The trans-agriculture system flows very nicely to work with normalized tables because you always have this ID, ID, with this ID you can go to the next one and you find, right, you don't need to scan the whole thing. I don't need to scan to find your ID because as soon as you're logged in I have your ID.

I don't need to scan all the orders because I'm going to filter it by your ID to get your orders. So, Having normalised table in a trans-circuit system, the way you navigate makes it very, very conducive. And then, of course, all of this, you know, all the three things above are for all the databases, but You can use any type of data scope. You can use key value pair, cache, memory cache, you can use SQL Server, no SQL Server.

Why is Google not a fit for a company like Apple? each product Yeah, we talked about remember sparse columns and things where, hey, if I have a consistent way of describing an entity, SQL is the best. But if my entities have A scattered schema where this thing will have five fields and then those five fields will be completely different, right? What describes a clothing is completely different than a shoe.

So Most of them use SQL, but these huge, gigantic e-commerce people don't use SQL.

Now in this modern modern Postgres has become the de facto standard you And now Postgres is, you know, they have so many extensions, so many, like they are better stored in Postgres, right? So. SQL server was dominant, of course Oracle was dominant, Most of the people, nobody will start with Oracle these days. They cost, they're a little bit old fashioned. Postgres is the most, it's an open source I'm very popular.

Now I always put Cosmos DB but it's not very popular. I'm a Microsoft guy so I put them myself. But the most popular is MongoDB. I'm Cassandra. And then I would put graph as a NoSQL database, but it may deserve its own category. Okay. So where do we, what's an example I gave you for graph D2Bs? Blame this. This one? LinkedIn. LinkedIn. where relationship notes and ages, right? Hey, this person knows this person.

Obviously the reason I'm asking is because you can't, you don't do in cache, it's either you flush the cache and create a new cache, you cannot do the SQL type manipulation. But then why do we use cash? for quicker access of people.

Yeah, and there are some data that does not change that often. Right? And that does not-Why do, first of all, we need quicker, but if every time I'm flushing the cash and recreating cash, Not a good idea. Or in other words, others database is not a candidate for cache. Is product a good candidate? Is product catalog a good candidate? It is good, not maybe for my, for Amazon in this case I constantly updating adding products, but for most companies the product doesn't change.

So less change, more stable, and most often used.

That's the definition. And on the right hand side is one of those records in the customer table. But if, for example, age integer but if you put I'll read 35 It won't stop unless you have some checks and balances before. Thank you. Yeah, the way that we, and I won't go deep into this, but the way we store data In relation In online transaction system is by row.

But in a data lake, are a thing, they take columnist stories. And the reason for doing that is There are a lot of repetitive, like for example, a product, there will be only 10 products, right? So, I'm just saying as an example, there could be ten fathers. So instead of putting the 10 products every time in a row, they will compress it and take Everything related to one product in a column. So this way when I do a query, I can go to the column restore and quickly build everything related to that product.

Because in an analytical system, I'm querying data. But in a... In a The transactional system, most often I'm I may be quitting, but most of the time I'm transacting. I'm inserting a row for an order. I'm inserting a row for a-customer, right? So the raw storage, the raw type of indexing storage is used in the transactional system. Columnar indexing and storage is used for data warehouses. So...

One of the things I'd like you to do is how many people have a MySQL on your database?

And Visual Studio Code itself Pass.

SQLite So we don't want to use MySQL for demonstration purposes. But eventually I want you to create a database in the cloud. But for playing locally, use SQLite. SQLite is very good. It's even less complex than MySQL and all that. And there are visuals to be put as extensions to Work with it.

And so there are three things. So who can expand each one of them? Forget Anomalization, what's DDF? Time to brush up your database. So it's data definition language. And DML There are many places. And DCL? Data Control Lens. So with the definition of then you should know what it's going to do. What do you think will be the definition of language 2? Creates tailing.

Create, but more specifically, the schema. It defines that table.

When I say define, it says, hey, here's the name of the table. Here are the fields. This field is a primary key, and it cannot be null. This is a foreign key. Um... Right, that's... That's the data definition. And then the data manipulation language As it says, what is it? Insert, delete, modify, update. And then by definition what is data control language?

I just want to Axis Like granting access to the schema, user creating sub-elements into a schema, controlling the table, right? Without, not the data, but the schema and that level.

So... The data definition language is to create And what is Alter? UpdateAlter.

Modifying the schema. So as an example is what? Give me an example of modifying this schema. Changing the name of the thing, adding another column, removing a column, they're all there. And drop your node, just dropping the shape. So this is a DDL, right? Paying the table employee. And look at everything. So it painted-generate the schema for you, like employee ID is the name of the column and it's an integer and it is the primary.

And then, first day He's there. Blake. You will say how long you want it to be. So last name and then You can also have unique keys, some of them allow that, so that Hi, Dave. Salary. and the decimal and the decimal of up to 10 Unless you're a CEO, maybe they'll need more. Yeah, 10 fields. So it just kind of gives you the definition, right, when you create. And then data manipulation, Language is As you mentioned, select, insert, update.

Like for example, we credit another employee with the same ID, You're violating the DDM. That's why the DDL is very important. and then if you insert a string in the place That's the beauty of the relational database, right? That's why DDL is very important, because when you have a good DDL, as a developer, you're safe. You won't dump garbage. So, properly designing the schema is one of the fundamental things for and select same thing and then the query of doing joins and doing inner join, outer join, group by aggregate That did itself.

You know I'm doing quite a bit of AI. The biggest thing that AA has still not figured out He's constructing a SQL query from natural language. It's one of the most difficult things to do. What do I mean by that? Everybody thinks Chajabini is smart, but just give about 10 tables. And then ask, Give me all the others for the customer within this day's range. Most of them feel miserable. people working on it Crazy.

how to interpret your natural language query into proper SQL. And all of them are trying to use metadata to Like if you explain your data with metadata, hey, data means it's the data of the customer and the customer may have one or more records. And if you add all the metadata and then use language model to generate the query, It seems to be doing well.

People are talking about AGI and this stuff. There are some fundamental things that Avalanche cannot do. The reasoning, natural language query, they are still primitive. The way people talk about it, you think, "Oh, it's so easy." And then people get blown away by one or two experiences where they did it right, But can it do it consistently? I know most of the language models as far as I know. The reason I'm saying is that this is a good skill to have, good SQL skills.

So Mike Sutherland has something called managed identity to do that.

No, but as a permission. The, the... The admin, user admin name and the password. It is a terrible way, the custom. Which means that you have to expose it But if your middleware is doing that, then you can store it there or store it in a key vault or something. But a better way would be to use managed identity where you get a token from an issuing authority and then you tell the database, hey, if this token comes, then you can allow them to access it.

One thing is firewall rule. So When you create a database, for example in Azure, there's a firewall. where by default they'll say, "Do you want to add the client IP?" What does that mean? It means... You're setting up the database and you're doing it in cloud, but there's an IP address of your machine. And most of the time these days they fix it, right? Because if it's constantly changing, then you have to go every time and put the new IP.

So what it's saying is that in addition to your username and password, I will only allow if the traffic comes from this IP. That IP can be an application IP. where the API is running, or it can be your. When you're locally testing, you include all those IPs in the firewall to say, hey, Only this These IPs can access with the user admin address.

By Naturally You will find your database doesn't work. because of the firewall. So you learn how to fix that. You'll be doing all the connected and then it won't work. from your API. it'll refuse the connection. Then you have to say, They allow this machine to connect to you. That's the way to secure it. So it's more than just user ID and password, but also the IP address of that. Most SQL servers, if you're doing an Azure SQL server, You will have to add the IP or use managed identity of the application server to accessThe And here is where you grab, and then some of the times you can take once, like let's have 10 tables, right?

And I can create a grant read-only. Like I can say that if this is a product catalog, you only have the permission to read. If it's an order schema, you have the permission to insert order. So you can do grant and revoke By user by schema for this schema this user is granted a read permission. So you can really very granular from a security perspective.

So it says, "Grant select on the employee's table to user who is called analyst user." And this can be a person or a role. A group to which there will be analysts and you can assign people to that group. See how granular you can do. Grant, select, insert, update on employee. If you're an HR staff, right?

I don't know about the flow, but let's say if I have a customer, right? And if I have order, and if I have order detail, the visual representation will be Here's the Testament table. We give it all the attributes. So visually I can see and then to his point, My water is, it will show What will be the, what will be in the honors table? to connect the customer. How will you connect customer and order table?

Customer's ID. Customer's ID. So it will show that relationship with the line and if you double click it, it will say the primary key of customer ID is the foreign key in the order. So that's the relationship. And then what is the nature of the relationship? What can be the nature of the relationship? One customer can have Both of the buttons. Bunch of lollies. So one too many What's my name?

We have to get some mapping tables to So Their reference key is in the mapping tables Yeah, so that's what we call a bridge table, right?

An intermediary table. That takes this many to many into one to many by combining the key. So in that table, you will have an employee ID and project ID, right? That combination has to be unique. So in other words, That would say that this employee I'll query that table to say how many projects this employee has. Because what it will have is a unique entry for each employee for each project. Like employee ID and project ID.

But employee ID One and one you cannot have because they're violations. So I'll have a five unique entry to say that, hey, I'm part of the project. So instead of going through this multi-relationship, I'll go to that table to say, give me all the project ID. Where the applied e equals to 1. And then from the project ID I can get the information about the project table from the project table. That's how I work to this point.

We do a mapping to bridge and convert many-to-many relationship into one-to-many. One-to-many is the only way you can do a legitimate transactional system.

So it shows the type of relationship. What is cardinality? Let's leave that for you to come back next class. Like if I have a drop down values in a field, What is the comment? I'm going to let you and maybe you can explain to us. What a cardinality piece.

Before it used to be Irwin and a lot of tools. But now you get so many tools as part of the system. So it's really a beautiful way to visualize. So...

So how many people work with extensions in Visual Studio Code? GitHub is an extension, right? So many extensions. So look for an extension for you to design The VR entity Lucid chart is very popular, but they're not paid. But there must be some free extension. It looks like a few. The problem is you have to first get the schema in a JSON format. But if you We're using a SQL Server, Microsoft SQL Server, and we're using SQL Server Management Studio that can create diagrams for you.

What do I mean by, how many of them, is like, hey, I want to find this record. and I want to find it by location. So how many indexes can I have to have a fixed location? One. Only one. Why? Because Then next, I know where it is, so I can find it. and that is called Cluster. Because what we're saying is that And how do you pick your cluster index?

Which one will you pick as your cluster index? The one by which you query the most.

Right? So typically, if I'm going to get it by customer ID, then customer ID is my cluster index. Because that's the query I use most often. And hence, I want to use the physical location index using that.

If I want to index it by a thing, I have to use non-clustered index. What I will do is I'll create and index and then point the physical location Two as a clustered index. Right?

Because I can only have one physical location in this. Everything else is a pointer. and the one that the pointer index called non-cluster index.

The one by which I am physically locating a record is called cluster index. And you should be very careful as to what you pick for cluster, because you cannot have multiple cluster index, right? And also, all your queries are not going to be 100% based on the cluster index. So you pick the one that you mostly query and use that as a cluster index.

And for all of the query time, you create non-cluster index.

And then the thing that they always ask for Is It's minimal and stability, simplicity and familiarity. They are, they sound like aApple pie, but they are very, very fun. So, minimality means don't make a, Fifty...

Feel long with the primary. which nobody knows. Make it very simple.

And it should be stable. What do we mean by stable? Doesn't change.

So after we come back, we'll finish that so that you can do the two assignments that are due next week and the following day. One is the simple one, just setting up. your minimum API and deploying it. I hope most of you have done that and the other one is this. creating normalization. And after that we'll go to the third exercise will be on the API, creating the API. And after that we'll go to the third exercise will be on the API, creating the API. Since you love the D to B's, maybe we can do this. Okay.

The star schema is the opposite of normalization.

What do I mean by that? I only have fewer dimensions And I just repeat everything. Why? Because when I do query, I just... I want to query for all the customers and I don't want to do multiple joins because Wearing multiple joints, like if I normalize this and I use it in an analytical system, I have to go get the customer ID, then go get the customer's zip code. I want to get everything in a single table.

And why You may say, "Hey, it means you're repeating it." That's okay, because in an analytical system, do I do update of a record? In the reporting system, do I update the record? Most probably not. Most of the time I'm appending, right? So I don't have to go and make 100 changes. I just... So in an analytical system you use dimensional model where Your number of tables are less You denormalize them.

But I just wanted to-yeah, this is a-when you do this, This is the kind of analytical system when you slice and dice it. I won't get into that because I use the same slide to teach the data normalization for data warehouse. This is how the data warehouse is, but you won't be doing this. So that's all. You are now ready to complete the two assignments. In the next class, then I'll give you an assignment for building andAnd then in that assignment you won't have any friend there.

But you'll have an API credit operation. Add the backend database working. That would be the third assignment. And fourth assignment probably will do something on architecture. How do I architect? Like now that you know RenMAP, okay, how do I do services? I'll ask you to use some tools to draw an architecture for a given application. And if time permits, then we'll do object relational mapping. But if not, then you'll have a final project where you put all this together.

And then do a presentation for that. That will make the resume. So first assignment will be API. with a straight HTTPS take their assignment and then make it into a graph cure. End point. and then present it in the class. That will be the midterm. So we'll build an API with the database and then we'll convert that into a GraphQL.

So when is our midterm? You don't have to strictly do that. but around that time, right? So let's say this week, The coming week you have this assignment and the following week you have the database. Now the coming week you have the first assignment. I still gave some time for people to get it. Getting the infrastructure and putting some people may struggle. Has anyone completed in your, in our, this group, the first assignment?

No? You did. You did. You already did. They're putting, deploying API in a cloud? No, I didn't see the summit. You haven't submitted yet. No, but most of you are there, right? You already have subscriptions. You have already done it. So you have one week to do this, right, from now to one week.

And within the following week you'll finish the database. And then the following week you'll do the API. and the following week you'll do the presentation. So maybe three to four weeks from now.

But since I won't be here, maybe it's after March, end of March. So you'll have four to five weeks to do that presentation. So what I might do, I'll do the API and the draft QL And then we'll do the archery.

Public API

https://public-api-lists.github.io/public-api-lists/


Databases

The Tole of Database in OLTP or OLAP Systems


Types of Databases

Relational Databases

NoSQL databases

In-memory databases


Database Models