Exploratory Spatial Data Analysis: Spatial Autocorrelation and Choropleth Maps

Thuc Dao

Published Jun 18, 2022

To keep this article readable, the full Python code is not included here. The complete code can be found in the Jupyter Notebook file (.ipynb) on my GitHub:

https://github.com/ThucDao/ExploratorySpatialDataAnalysis

As a data analyst, you are familiar with Exploratory Data Analysis (EDA). It helps you find patterns and relationships between variables and see how they affect each other. To measure these relationships, you calculate correlation coefficients and visualize them in a heat map.

What if you want to know how a variable relates to location, so that you can see geographic patterns? Conventional EDA is not enough, because it treats location data like any other feature. You need a different kind of analysis: Exploratory Spatial Data Analysis (ESDA). Instead of the correlation coefficient and the heat map, the measure is spatial autocorrelation, and the visualization is a choropleth map.

Instead of continuing with the theory, we should jump into an example now: Airbnb average listing prices in Canadian cities in 2022.

Here is the process:

1. Import required libraries

2. Load listings data and neighbourhoods geodata of the chosen city

3. Convert the listings data to listings geodata

4. Join listings geodata and neighbourhoods geodata

5. Calculate the average price of every neighbourhood

6. Create an interactive choropleth map of the average listing price in the chosen city

(I will also show the difference between the three classifications of the choropleth map.)

7. Determine the global spatial autocorrelation with the Moran's I statistic to test for the presence (or absence) of clusters.

8. Determine the local spatial autocorrelation with LISA statistics and make a choropleth map of LISA clusters to show where the clusters are in the chosen city.

(I will verify the choropleth maps of some cities with their Moran's I values and p-values from the Moran's I statistics.)

STEP 1:

The following Python libraries or modules are needed:

pandas for data manipulation
matplotlib.pyplot for static visualization
folium for interactive geovisualization
geopandas for geodata manipulation
libpysal for spatial computation
esda for statistics and classes in exploratory spatial data analysis
splot for connecting spatial analysis done in PySAL (e.g., libpysal) to different popular visualization toolkits like matplotlib.
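
A quick way to confirm that these libraries are available before running the notebook could look like the sketch below. It uses only the standard library; the package list mirrors the one above, and the helper name `check_packages` is hypothetical, not something from the original notebook.

```python
from importlib.util import find_spec

# Top-level package names from the list above
# (matplotlib.pyplot is a submodule of the matplotlib package).
REQUIRED = ["pandas", "matplotlib", "folium", "geopandas", "libpysal", "esda", "splot"]


def check_packages(names):
    """Return {package name: True if it can be imported in this environment}."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, installed in status.items():
    print(f"{name:12s} {'ok' if installed else 'MISSING'}")
```

Any package reported as MISSING would need to be installed before the later steps can run.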

STEPS 2 TO 5:

These steps are straightforward and should not need much explanation. You can refer to the Python code on my GitHub to see how each step is done. At the end of step 5, you should have a data frame with one row per neighbourhood and its average listing price.
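
For readers who want a feel for steps 3 to 5 without opening the notebook, here is a minimal sketch on toy data. It is not the notebook's code: the tiny in-memory tables stand in for listings.csv and neighbourhoods.geojson, and the column names (`latitude`, `longitude`, `price`, `neighbourhood`) are assumptions about those files.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for listings.csv (column names are assumptions).
listings = pd.DataFrame({
    "latitude": [0.5, 0.6, 0.5, 0.4],
    "longitude": [0.5, 0.4, 1.5, 1.6],
    "price": [100.0, 200.0, 50.0, 70.0],
})

# Toy stand-in for neighbourhoods.geojson: two adjacent unit squares.
neighbourhoods = gpd.GeoDataFrame(
    {"neighbourhood": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Step 3: turn the listings table into point geodata.
listings_geo = gpd.GeoDataFrame(
    listings,
    geometry=gpd.points_from_xy(listings["longitude"], listings["latitude"]),
    crs="EPSG:4326",
)

# Step 4: spatial join, attaching to each listing the polygon it falls within.
joined = gpd.sjoin(listings_geo, neighbourhoods, how="inner", predicate="within")

# Step 5: average price per neighbourhood.
avg_price = joined.groupby("neighbourhood", as_index=False)["price"].mean()
print(avg_price)
```

With real data, the two toy tables would be replaced by `pd.read_csv(...)` and `gpd.read_file(...)` on the downloaded files.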

STEP 6:

I have made interactive choropleth maps of all six cities which can be viewed here:

https://interactive-choropleth-map.thucdao.repl.co/

It should be noted that there are three common classification schemes for a choropleth map:

1. Classification by equal intervals: divides the range of values into classes of equal width (here, classes are price ranges). This is the one used in all interactive choropleth maps above.

In this classification, it seems that only one neighbourhood (Hampstead) has a high average price, and other neighbourhoods all have low average prices.

2. Classification by quantiles: places an equal number of observations in each class (here, it means an equal number of neighbourhoods per price range).

With this classification, the number of high-price neighbourhoods seems equal to the number of low-price neighbourhoods.

3. Classification by natural breaks: minimizes within-class variance and maximizes between-class differences.

This classification tends to give a balanced arrangement of classes (price ranges), because it keeps the variance within each price range as small as possible.
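
The difference between the first two schemes can be sketched with NumPy alone. The prices below are made up for illustration (one expensive outlier among otherwise similar values), and five classes are used instead of ten to keep the output short. Natural breaks is left out because it needs an optimisation routine from a dedicated library.

```python
import numpy as np

# Made-up average prices: nine similar values and one expensive outlier.
prices = np.array([60, 65, 70, 72, 75, 80, 85, 90, 95, 400], dtype=float)
k = 5  # number of classes

# Equal intervals: k classes of equal width between the minimum and maximum.
equal_edges = np.linspace(prices.min(), prices.max(), k + 1)

# Quantiles: edges chosen so that each class holds about the same number of values.
quantile_edges = np.quantile(prices, np.linspace(0, 1, k + 1))


def classify(values, edges):
    """Assign each value a class index 0..k-1 using the interior edges."""
    return np.digitize(values, edges[1:-1], right=True)


equal_counts = np.bincount(classify(prices, equal_edges), minlength=k)
quantile_counts = np.bincount(classify(prices, quantile_edges), minlength=k)

print("equal intervals:", equal_counts)     # observations per class
print("quantiles:      ", quantile_counts)  # observations per class
```

In this toy case the outlier stretches the equal-interval classes so that nine of the ten values land in the lowest class, which mirrors the single high-price neighbourhood seen on the equal-interval map, while the quantile scheme puts two values in every class.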

While these choropleth maps show the average price per neighbourhood grouped into 10 ranges, they do not reveal any spatial pattern by themselves. We do not know whether prices are dispersed, clustered, or randomly distributed. And if they are clustered, where are the clusters? We will answer these two questions in step 7 (global spatial autocorrelation) and step 8 (local spatial autocorrelation).

STEP 7:

Determine the global spatial autocorrelation with Moran's I statistics.

Moran's I is a way to measure spatial autocorrelation. In simple terms, it's a way to quantify how closely values are clustered together in a 2-D space.

Moran's I Test uses the following null and alternative hypotheses:

Null Hypothesis: The data is randomly dispersed.
Alternative Hypothesis: The data is not randomly dispersed, i.e., it is either clustered or dispersed in noticeable patterns.

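The value of Moran's I typically ranges from about -1 to 1, where:

-1: The variable of interest is perfectly dispersed
0: The variable of interest is randomly distributed (strictly, the expected value under the null hypothesis is -1/(n - 1) for n areas, which is close to 0 for large n)
1: The variable of interest is perfectly clustered together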
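
To make these values concrete, here is a minimal from-scratch sketch of the statistic on a toy 4 x 4 grid. It is not the notebook's code (which relies on esda); the helper names `rook_weights` and `morans_i` are hypothetical, and binary rook contiguity (cells sharing an edge are neighbours) is an assumption chosen for simplicity.

```python
import numpy as np


def rook_weights(rows, cols):
    """Binary weights for a grid: w[i, j] = 1 if cells i and j share an edge."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # neighbour to the right
                w[i, i + 1] = w[i + 1, i] = 1
            if r + 1 < rows:  # neighbour below
                w[i, i + cols] = w[i + cols, i] = 1
    return w


def morans_i(values, w):
    """Moran's I = (n / sum of weights) * (z' W z) / (z' z), z = deviations from the mean."""
    z = values - values.mean()
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)


w = rook_weights(4, 4)

# Checkerboard: every cell differs from all of its neighbours.
checkerboard = (np.indices((4, 4)).sum(axis=0) % 2).ravel().astype(float)

# Two halves: high values on the left, low values on the right.
halves = np.tile([1.0, 1.0, 0.0, 0.0], 4)

i_checkerboard = morans_i(checkerboard, w)
i_halves = morans_i(halves, w)
print(i_checkerboard, i_halves)
```

The checkerboard gives -1 (perfect dispersion), while the two-halves pattern gives a clearly positive value of about 0.67.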

The corresponding p-value can be used to decide whether the data is randomly distributed or not. If the p-value is less than a chosen significance level (e.g., α = 0.05), we can reject the null hypothesis and conclude that the spatial pattern (clustered if Moran's I is positive, dispersed if it is negative) is unlikely to have occurred by chance alone.
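
One common way to obtain such a p-value is a permutation test: shuffle the values across the areas many times, recompute Moran's I each time, and see how often the shuffled statistic is at least as extreme as the observed one. The sketch below illustrates the idea on a toy 6 x 6 grid. The helper names and the one-sided counting convention are assumptions for illustration, and it is also an assumption that the p-values reported in this article were computed along these lines.

```python
import numpy as np


def rook_weights(rows, cols):
    """Binary weights for a grid: w[i, j] = 1 if cells i and j share an edge."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                w[i, i + 1] = w[i + 1, i] = 1
            if r + 1 < rows:
                w[i, i + cols] = w[i + cols, i] = 1
    return w


def morans_i(values, w):
    z = values - values.mean()
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)


def permutation_p_value(values, w, permutations=999, seed=0):
    """Pseudo p-value: share of shuffles at least as extreme as the observed I."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, w)
    simulated = np.array(
        [morans_i(rng.permutation(values), w) for _ in range(permutations)]
    )
    # One-sided count in the direction of the observed statistic.
    if observed >= simulated.mean():
        extreme = np.sum(simulated >= observed)
    else:
        extreme = np.sum(simulated <= observed)
    # The +1 counts the observed arrangement itself as one of the permutations.
    return observed, (extreme + 1) / (permutations + 1)


w = rook_weights(6, 6)
values = np.tile([1.0, 1.0, 1.0, 0.0, 0.0, 0.0], 6)  # high left half, low right half
observed, p_value = permutation_p_value(values, w)
print(observed, p_value)
```

With 999 permutations, the smallest p-value this procedure can return is 1/1000 = 0.001.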

Let’s look at the results for the six cities:

Toronto – Moran's I value: 0.37939917945817603 | p-value: 0.001

Vancouver – Moran's I value: 0.25926115056716487 | p-value: 0.014

Victoria – Moran's I value: 0.23031037456272505 | p-value: 0.014

Montreal – Moran's I value: 0.08129999107479662 | p-value: 0.109

Quebec City – Moran's I value: -0.0998559053213681 | p-value: 0.227

Winnipeg – Moran's I value: -0.34351245464294594 | p-value: 0.002

At a significance level of 0.05, we can reject the null hypothesis for Toronto, Winnipeg, Vancouver, and Victoria, which have p-values < 0.05. These cities show evidence of a non-random spatial pattern in neighbourhood prices. Among the four, only Winnipeg has a negative Moran's I value (between -0.5 and 0), which indicates slightly dispersed prices. The other three cities have positive Moran's I values below 0.5, which can be interpreted as slightly clustered prices.

We do not reject the null hypothesis for Montreal and Quebec City, as their p-values are > 0.05. For these cities, there is no evidence against a random spatial distribution of prices. The fact that their Moran's I values are close to 0 is consistent with a random price pattern.
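
The decision rule used in the two paragraphs above can be written down in a few lines. The values are the ones reported in this article, rounded to three decimals; the function name `interpret` and the wording of the labels are illustrative choices.

```python
# (Moran's I, p-value) per city, as reported above, rounded for readability.
results = {
    "Toronto": (0.379, 0.001),
    "Vancouver": (0.259, 0.014),
    "Victoria": (0.230, 0.014),
    "Montreal": (0.081, 0.109),
    "Quebec City": (-0.100, 0.227),
    "Winnipeg": (-0.344, 0.002),
}

ALPHA = 0.05  # significance level


def interpret(moran_i, p_value, alpha=ALPHA):
    """Label the global pattern from the sign of Moran's I and its p-value."""
    if p_value >= alpha:
        return "no significant pattern"
    return "clustered" if moran_i > 0 else "dispersed"


summary = {city: interpret(i, p) for city, (i, p) in results.items()}
for city, label in summary.items():
    print(f"{city:12s} {label}")
```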

STEP 8:

Determine the local spatial autocorrelation with LISA statistics and make a choropleth map of LISA clusters.

While the global spatial autocorrelation can indicate that clusters exist (that is, a positive spatial autocorrelation between listing prices in nearby neighbourhoods), it does not show where the clusters are. That is where the local spatial autocorrelation, measured with Local Indicators of Spatial Association (LISA) statistics, comes into play.

In general, local Moran's I values are interpreted as follows:

Negative: nearby areas are dissimilar or dispersed, e.g., High-Low or Low-High
Neutral (near zero): nearby areas have no particular relationship, i.e., the pattern is random
Positive: nearby areas are similar or clustered, e.g., High-High or Low-Low

LISA uses local Moran's I values to identify clusters in localized map regions and assigns each area to one of five categories:

High-High (HH): the area having high values of the variable is surrounded by neighbors that also have high values
Low-Low (LL): the area having low values of the variable is surrounded by neighbors that also have low values
Low-High (LH): the area having low values of the variable is surrounded by neighbors that have high values
High-Low (HL): the area having high values of the variable is surrounded by neighbors that have low values
Not Significant (NS): the local statistic of the area is not statistically significant

High-High and Low-Low represent positive spatial autocorrelation, while High-Low and Low-High represent negative spatial autocorrelation.
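
The four quadrant labels can be illustrated with a small from-scratch sketch. It compares each area's deviation from the mean with the average deviation of its neighbours (the spatial lag). The 3 x 3 grid, the rook contiguity, and the helper names are assumptions for illustration, and the significance test that produces the Not Significant category is deliberately left out.

```python
import numpy as np


def rook_weights(rows, cols):
    """Binary weights for a grid: w[i, j] = 1 if cells i and j share an edge."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                w[i, i + 1] = w[i + 1, i] = 1
            if r + 1 < rows:
                w[i, i + cols] = w[i + cols, i] = 1
    return w


def lisa_quadrants(values, w):
    """Return local Moran's I values and quadrant labels (no significance test)."""
    w_row = w / w.sum(axis=1, keepdims=True)  # row-standardise: lag = neighbour average
    z = values - values.mean()                # deviation of each area from the mean
    lag = w_row @ z                           # average deviation of its neighbours
    local_i = z * lag / (z @ z / len(z))
    # Areas exactly at the mean (z == 0 or lag == 0) fall on the "low" side here.
    labels = np.where(
        z > 0,
        np.where(lag > 0, "HH", "HL"),
        np.where(lag > 0, "LH", "LL"),
    )
    return local_i, labels


# One expensive area in the centre, surrounded by cheap ones (made-up prices).
prices = np.array([10, 10, 10,
                   10, 100, 10,
                   10, 10, 10], dtype=float)
local_i, labels = lisa_quadrants(prices, rook_weights(3, 3))
print(labels.reshape(3, 3))
print(local_i.reshape(3, 3).round(3))
```

The centre comes out as High-Low with a negative local Moran's I, its four direct neighbours as Low-High, and the corners as Low-Low with positive values. In esda, the `Moran_Local` class exposes comparable information through its `Is`, `q`, and `p_sim` attributes.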

Finally, we make LISA cluster maps from the LISA results. Although LISA cluster maps are also choropleth maps, they do not show the average price per neighbourhood. Instead, they show how the price in each neighbourhood relates to the prices in its neighbours.

Let’s view some LISA cluster maps and compare them with Moran's I values and p-values.

Toronto – Moran's I value: 0.37939917945817603 | p-value: 0.001

The map clearly shows some clusters, consistent with Moran's I value and p-value. Because the total number of High-High and Low-Low areas is greater than the total number of High-Low and Low-High areas, the overall trend is positive spatial autocorrelation.

Winnipeg – Moran's I value: -0.34351245464294594 | p-value: 0.002

The map shows only one High-Low, which is consistent with the negative Moran's I value and hence the overall negative spatial autocorrelation.

Vancouver – Moran's I value: 0.25926115056716487 | p-value: 0.014

The map clearly shows some clusters, consistent with Moran's I value and p-value. Because there are only High-High and Low-Low areas, the overall trend is positive spatial autocorrelation.

Montreal – Moran's I value: 0.08129999107479662 | p-value: 0.109

From the global spatial autocorrelation, we already know that Montreal shows no significant departure from a random price pattern. This is consistent with the map, where only 2 of the 34 neighbourhoods are flagged, both as Low-Low.

Data source: Inside Airbnb

http://insideairbnb.com/get-the-data/

Montreal

http://data.insideairbnb.com/canada/qc/montreal/2022-03-12/visualisations/listings.csv

http://data.insideairbnb.com/canada/qc/montreal/2022-03-12/visualisations/neighbourhoods.geojson

Quebec City

http://data.insideairbnb.com/canada/qc/quebec-city/2022-03-09/visualisations/listings.csv

http://data.insideairbnb.com/canada/qc/quebec-city/2022-03-09/visualisations/neighbourhoods.geojson

Toronto

http://data.insideairbnb.com/canada/on/toronto/2022-03-08/visualisations/listings.csv

http://data.insideairbnb.com/canada/on/toronto/2022-03-08/visualisations/neighbourhoods.geojson

Vancouver

http://data.insideairbnb.com/canada/bc/vancouver/2022-03-10/visualisations/listings.csv

http://data.insideairbnb.com/canada/bc/vancouver/2022-03-10/visualisations/neighbourhoods.geojson

Victoria

http://data.insideairbnb.com/canada/bc/victoria/2022-03-29/visualisations/listings.csv

http://data.insideairbnb.com/canada/bc/victoria/2022-03-29/visualisations/neighbourhoods.geojson

Winnipeg

http://data.insideairbnb.com/canada/mb/winnipeg/2022-06-08/visualisations/listings.csv

http://data.insideairbnb.com/canada/mb/winnipeg/2022-06-08/visualisations/neighbourhoods.geojson
