From the course: Databricks for Data Analysts

Creating catalog objects for analytics - Databricks Tutorial

From the course: Databricks for Data Analysts

Creating catalog objects for analytics

Alright friends, so now as you start analyzing your data using SQL queries, what happens over the time is that the logic of your queries and their results become something important for the others. So that means at this point you have to start learning how to store your query and the results as an object inside the Databricks catalog. And here we have three options. We have tables, views, and materialized views. So now we're going to cover all of them and we're going to start with the first option how to create tables based on queries in Databricks. Let's start. So this is going to be very simple. We're going to go to the first line over here. We're going to say create table. And then after that we're going to specify the name of the table. So we start usually with the catalog name, then the schema name. I'm going to say customers underscore analyzes. And now since I'm going to create different things, I'm going to call it table. Then we have to give the the keyword AS. Now all what you have to do is to run it. Let's go and do it. If you look to the output, you will not see the result of your query because now this is a command. We are telling Databricks go and create a table. Now, in order to see the result, all you have to do is to go and refresh the catalog, then go to the workspace, to the schema, and then open the tables. Look at this, we have already a new table called customers analysis table. And now I can deal with this as any other table. Like for example, I can go and create a new query. And now in order to see the results, just simple select all sales data, and I have my new table. So we're gonna run it. Now, look at this. I'm able to see the customers analyzes without writing the whole logic. Now, of course, I can tell the others, hey, I have a new table, go query it, and you're gonna see the whole reports about customers. Now, I want you to understand something about those tables. Now, by looking to this, you can see the customers, orders, products, stores. Those are the original tables, but the new table that we just created, the source for this is the query that we created on top of those raw tables. But at the end, all of them are tables, so they are physically stored inside Databricks and as well permanently there. And with this, we have like a trade-off. So first of all, it is fast to query, but here there is like a disadvantage using tables and that's if something changed over the time the customers with the data inside the orders those new updates will not be reflected inside the table unless you recreate the table by the way if you run this again you're gonna get an error because it's gonna say we have already a table like this now in order to fix this we're gonna say create or replace table so this is the disadvantage of tables you have to maintain it in order to get the new updated data by for example recreating the table and now you might say you you know what I don't want to deal with the whole thing recreating stuff I would like to have something live so if an update happens to my original tables I would like to see it in my new objects and this is exactly why we have something called views so by creating view you will be storing the SQL query as an object inside the catalog and each time you use the view your SQL query gonna run behind the scenes in order to get you the fresh updated data so let's how we can do it. It's gonna be very simple about the syntax. All that you have to do is to just replace table with a view. And now maybe let's give it another name. So I'm gonna say view. Actually that's it. Let's go and run it and see the results. Same thing here. It is a command, so you will not see anything in the output. Let's go to the workspace, sales data, and now we have a view. Even it has like small new icon. Now let's try this out. There is no new syntax. You're gonna deal with this like any other table so select from the view name and then let's go and run it but what happens here is different from the table Databricks went to the query that is stored inside the view executed the whole query and the result is returned for you here for the output and of course there is always a trade-off so the advantage of using views is that you don't have to maintain it so if you have a new data inside the customers the orders you don't have to rerun the view. But now of course the disadvantage of this view is that the performance is going to be depending on the underlying query. So this means if the original data are getting bigger and as well the query is getting more complex with joins, the simple select query on the view might take longer time. So yes the views are live, you are getting fresh data without maintaining anything, but on the other side you are consuming more resources and they might be slower than normal table so now if you look to those two and you cannot decide i would like to have both of the advantages that's why we have the third option we have the materialized views the query going to be pre-computed and the result going to be stored physically in the catalog and as well based on the trigger it's going to be refreshed automatically now as usual the syntax going to be the easiest thing so before view you're going to say materialized and for the name i'm going to just gonna add an mp4 view so actually that's it let's go and execute it so it is done as usual i'm gonna go and refresh my catalog to see the new objects so as you can see we have our new materials view now what happens here that Databricks executed your query and the result of this query is stored physically inside the view now you might say okay this sounds exactly like the tables right You are totally right, if any update happens to the original data, you will not find those updates in the view. So how are we going to fix it? Of course, we will not go manually refreshing the view, instead, we're going to build an automated trigger. And this is the power of the Materialize Views. So we're going to go over here and say Trigger, and then we're going to say On Updates. So now if you leave it like this, what's going to happen, Databricks has now to keep its eye on the underlying tables, on the customers and on the orders. And once it finds one update on those tables, it has to go and re-execute the query. And then the result gonna be stored again physically inside the MView. So with this, we have a new advantages of that. Our view is always up to date and as well, it is faster than the normal view. So once I go and execute my MView, it will be way faster because everything is pre-calculated behind the scenes and stored physically inside Databricks. So my friends, those are the three options on how to create objects in Databricks and that's it for this chapter. Now you know how to work with SQL in Databricks. And now I totally recommend you to do the exercises because I'm gonna challenge your new SQL skills in Databricks. And after that, in the next chapter, we will be learning how to build analytical dashboards and reports in Databricks. This is gonna be amazing. So I will see you there.

Contents