From the course: Python Statistics Essential Training
Visualizing distributions - Python Tutorial
- [Instructor] In the last lesson, we looked at summary statistics to understand these two neighborhoods. Let's use some visualization now to understand them as well. I'm going to pull the North Ames and College Creek data out into their own series so that they're a little easier to look at. I'll just use some pandas to do that. The next thing I'm going to do is make a histogram of both of those, plotted on top of each other. I'm going to call hist on our North Ames series and give it a label so that it shows up in the legend. That call returns a Matplotlib axes. I'm going to take that axes and pass it into the College Creek histogram call so that College Creek is plotted on the same plot. I'm also giving that a label, and after I do both of those histograms, I'm going to tell Matplotlib to put a legend on there with ax.legend(). This is the result of that. Now, note that both of these are fully opaque, and because we plotted College Creek after North Ames, it is occluding some of it. It might be the case, though it probably isn't, that North Ames exactly matches College Creek wherever it is hidden, but it's hard to tell because we can't see what's going on underneath College Creek. To deal with this, I'm going to adjust the transparency, and we do that by changing the alpha parameter. Here I'm going to set alpha to 0.7. Remember, alpha controls how opaque something is: 1 is fully opaque and 0 is fully transparent. This should let us see what's going on underneath. Let's look at this. From a quick look at the histogram, these look like different distributions. When we looked at the summary statistics, we thought they were different distributions as well, and this histogram appears to confirm that. Let's look at one more plot that's useful for looking at distributions: the cumulative distribution function, or CDF.
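The overlaid histograms described above can be sketched roughly as follows. This is a minimal illustration, not the instructor's notebook: the sale prices are randomly generated stand-ins for the exercise-file data, and the names `n_ames`, `college_cr`, and `SalePrice` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the lesson uses real sale prices from the exercise files.
rng = np.random.default_rng(0)
n_ames = pd.Series(rng.normal(145_000, 30_000, 200), name="SalePrice")
college_cr = pd.Series(rng.normal(200_000, 50_000, 150), name="SalePrice")

# Series.hist returns a Matplotlib Axes; passing it to the second call
# draws both histograms on the same plot.
ax = n_ames.hist(label="North Ames", alpha=0.7)
college_cr.hist(ax=ax, label="College Creek", alpha=0.7)

# The labels given above only appear once a legend is requested.
ax.legend()
```

With `alpha=0.7`, bars from the first histogram stay visible where the second one overlaps them.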
I'm going to show how to do that with pandas. Let me walk through this chain, because it has a few steps. We have our North Ames series. I'm going to convert it to a data frame by calling to_frame. Now it's a data frame with a single column. That's fine. A data frame is two-dimensional, but it is possible to have a data frame with a single column. Now that I have a data frame, I can add a new column to it. I'm going to make a column called CDF, which is the result of calling the rank method on our data and converting that into a percentage. So it looks like the first entry there is around the 96th percentile. The next entry is around the 6th percentile. The next thing I'm going to do is sort the values by sale price, and when we do this, we should see the order of the index change. And indeed it does. The next thing I'm going to do is plot this. I've said multiple times in this course that one of the keys to plotting in pandas is understanding how pandas makes plots. In this case, I'm just going to call the plot method, which will make a line plot. With a line plot, I can specify x and y. If I don't specify x, it will use the index for x. I don't want to use the index for x. What I want on the x-axis is the sale price, and on the y-axis I want that percentage, the CDF. Let's uncomment that, run it, and see what it looks like. We get something that looks like this. If we created a CDF for College Creek and for North Ames, and they had the same distribution, we would expect these plots to overlap, essentially tracing each other. Let's see if they do. I'm going to create a function called plot_cdf that takes a series as input, along with an optional Matplotlib axes and a label.
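A rough sketch of the chain just described, using five made-up sale prices so the effect of each step is easy to follow. The column name `SalePrice` and the use of `rank(pct=True)` to turn ranks into fractions are assumptions; the instructor's exact code may differ.

```python
import pandas as pd

# Hypothetical sale prices standing in for the North Ames series.
n_ames = pd.Series([215_000, 105_000, 172_000, 149_000, 160_000], name="SalePrice")

cdf_frame = (
    n_ames
    .to_frame()                                               # one-column DataFrame
    .assign(CDF=lambda df: df["SalePrice"].rank(pct=True))    # rank as a fraction of all rows
    .sort_values("SalePrice")                                 # reorders the rows, so the index order changes
)

# Without x=, the line plot would use the index on the x-axis.
ax = cdf_frame.plot(x="SalePrice", y="CDF")
```

After sorting, the `CDF` column rises from its smallest value to 1.0, which is what gives the line its characteristic climbing shape.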
It's going to do the logic that I just showed above: converting a series to a frame, making a CDF column, sorting it, and then doing the plot. But it's going to return the series as the output. Let's run that and make sure that it works. You can see that this returned the series. At the bottom, there is a series, and below that is the plot it made as a side effect. With that function in hand, let's now call it on both of our data sets. I'm going to call it with North Ames and College Creek, passing in the same Matplotlib axes for both, so it should draw them on the same plot. I got an error. It says Matplotlib is not defined. I ran into this because I restarted my codespace and Matplotlib was not imported. If I were doing this in the real world, I would come up to the top, where I've put my imports, note that I don't have Matplotlib there, and add import matplotlib.pyplot as plt. That way, when I run this in the future, I'm not going to have that issue. Let's run this again. You can see that it prints out a series, because plot_cdf returns a series. Below that we see our plot, and we can see that these CDFs do not overlap, giving us further evidence that these do not have the same distribution. In this lesson, we compared distributions by looking at plots. We looked at histograms and cumulative distribution functions. We saw that, for the two neighborhoods we're looking at, the plots don't overlap, suggesting that they do not have the same distribution.
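One way a function like plot_cdf could look, assuming the signature described in the lesson (a series, an optional Matplotlib axes, and a label). The data is randomly generated for illustration, and the body is a sketch rather than the instructor's actual implementation.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_cdf(ser, ax=None, label=None):
    """Draw an empirical CDF of `ser` and return the original series."""
    col = ser.name if ser.name is not None else "value"
    (
        ser
        .to_frame(name=col)
        .assign(CDF=lambda df: df[col].rank(pct=True))
        .sort_values(col)
        .plot(x=col, y="CDF", ax=ax, label=label or "CDF")
    )
    return ser  # the plot is a side effect; the input series is the return value


# Hypothetical stand-in data for the two neighborhoods.
rng = np.random.default_rng(0)
n_ames = pd.Series(rng.normal(145_000, 30_000, 200), name="SalePrice")
college_cr = pd.Series(rng.normal(200_000, 50_000, 150), name="SalePrice")

# Sharing one axes puts both CDFs on the same plot.
fig, ax = plt.subplots()
returned = plot_cdf(n_ames, ax=ax, label="North Ames")
plot_cdf(college_cr, ax=ax, label="College Creek")
```

Note the `import matplotlib.pyplot as plt` at the top: without it, `plt.subplots()` raises a NameError like the one encountered in the lesson.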