From the course: Testing Python Data Science Code
Using schemas - Python Tutorial
- [Instructor] The data that you are using has a schema. It is very important to validate data against a schema, both in production and in testing. You are probably familiar with libraries such as pydantic that help with data validation. However, pydantic works object by object, and in scientific computation, most of the time we use dataframes and tabular data. I find pandera to be a really nice library for that. Let's have a look.

Assume you have some sales data that contains the time, the value, and the IP address. The data is in a CSV file. CSV does not carry any schema information, so you want to validate it. You have code for the loaders. So, we are loading the sales data: we call pandas read_csv with the file name, and we tell pandas that the time column is a date.

Now we can define our schema. I'm going to use a function that validates that the value is an IPv4 address. I'm using IPv4Address from the built-in ipaddress module; I try to parse the value and return True, and otherwise I return False. Now I can define a sales_schema. The sales_schema is a pa.DataFrameSchema, which gets a dictionary where the keys are the column names and the values describe how to validate that the data in each column conforms to the schema. So, we're going to say that the time is a Timestamp. The value is an Int64, and we want to make sure that it is greater than zero, because we don't sell anything at a zero value. For the ip, we say that it's a string, but to check it, we use our own function and tell pandera to apply it element_wise.

Now we can write our test. Here's the absolute path to the current directory; the CSV file is in the current directory. We load the data and call the schema to validate the data that is being loaded. Once we run the test, we can see it passes, meaning the loaded data is valid according to the schema we defined. A nice thing about pandera is that you can also use it in your loaders.
So, what I can do is import check_output from pandera, and then import the sales_schema from the schema module. Now I can add a decorator, @check_output(sales_schema), which means that every time load_sales gets called, before it returns the value to the caller, the sales_schema validates that the data is valid; otherwise, an error is raised. I've found that when using pandera, my assumptions about the data are validated, and from time to time I need to adjust the schema when new data or new types of data come in. But it's well worth the effort.