Introduction to Spatial Data Analysis
In this article, we'll discuss Machine Learning for spatial data analysis in a Geographic Information System (GIS). There are many Machine Learning applications and competitions for tabular, time series, text, and image data. We can also find tutorials and articles on Spatial Data Science, such as those on Analytics Vidhya. However, they are primarily about visualizing spatial data or performing fundamental spatial analysis, such as clipping and buffering. This article looks at the field that integrates machine learning with geographic information systems to solve spatial problems. After reading this article, you will:
1. Gain an understanding of the basics of machine learning and spatial analysis
2. Be familiar with both conventional and machine learning methods for spatial data analysis
This article will also introduce you to Geographic Information Systems (GIS), or spatial data analysis, and how they can integrate with Machine Learning. Because it covers Machine Learning basics before exploring Machine Learning applications for spatial analysis, this article is also relevant to GIS users.
Intro to Machine Learning
The first thing we will do is understand Machine Learning basics. You may skip this section if you already know the basic concepts of Machine Learning. Machine learning trains a machine to build a model from a large dataset. Today's discussion will focus on tabular and later spatial data, not on images and text. Machine learning is commonly divided into three categories: supervised, unsupervised, and reinforcement learning. In this article, we will only discuss supervised and unsupervised learning.
Supervised learning needs a training dataset and a test dataset. In these datasets, the columns are variables, and the rows are observations. What distinguishes the training dataset is that it has one dependent/target variable, also referred to as a label, while the rest of the variables are independent variables. The training dataset is used to train a model to learn the pattern of the independent variables in order to predict the target variable.
In contrast to the training dataset, the test dataset contains only independent variables without target variables. To predict the missing target variable of the test dataset, we use the machine learning model trained on the training dataset.
Supervised learning covers regression and classification. Regression predicts a continuous target variable. Classification predicts a categorical target variable.
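The train-then-predict workflow described above can be sketched with a one-variable linear regression. This is only an illustration: the floor areas and prices are made-up numbers, and `fit_line` is a hypothetical helper, not part of any library.

```python
# Training data: one independent variable (floor area) and the target (price).
# All numbers are invented for illustration.
train_x = [50.0, 70.0, 90.0, 110.0]
train_y = [150.0, 210.0, 270.0, 330.0]

# Test data: independent variable only; the target is what we want to predict.
test_x = [60.0, 100.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Learn the pattern from the training dataset ...
slope, intercept = fit_line(train_x, train_y)

# ... then predict the missing target variable of the test dataset.
predictions = [slope * x + intercept for x in test_x]
print(predictions)
```

Because the target here is continuous, this is a regression task; a classification task would predict a category instead.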
Unsupervised learning works with a dataset that does not have a target variable. Unlike supervised learning, it does not predict target variables. Instead, it compares observations and identifies the essential variables in order to simplify large datasets. Throughout this article, we use "feature" as another word for "variable."
Clustering and dimensionality reduction are two types of unsupervised learning. Clustering groups observations with similar patterns into clusters. Dimensionality reduction determines which variables best distinguish the observations. Typically, variables with mostly the same value are removed because they do not contribute significantly to the pattern. We will not discuss dimensionality reduction further in this article since it is not directly related to spatial analysis.
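To make clustering concrete, here is a minimal sketch using k-means. The article does not name a specific clustering algorithm, so k-means is an illustrative choice, and the points and starting centers are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of its assigned points, and repeat."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        centers = new_centers
    return labels, centers


# Two features per observation, no target variable.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels)
```

Note that nothing in this sketch knows about location. The two features are treated as ordinary columns, which is exactly the limitation the spatial methods later in the article address.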
What is Spatial Data?
I hope you enjoyed reading the intro on machine learning, even if you already have experience in Machine Learning. Let us move on to Geographic Information Systems (GIS) basics. GIS users may want to skip the following five paragraphs. The central roles of GIS include collecting, managing, manipulating, analyzing, and visualizing spatial data. Today, we will focus on spatial analysis in particular.
Unlike tabular data, spatial data has spatial attributes associated with each observation. There are two types of spatial data: vector data and raster data. Points, lines, and polygons are all vector data types. In contrast, raster data is composed of pixels, in the form of an image.
Spatial data is in fact tabular, but each observation has spatial characteristics. In other words, each observation pertains to a particular location in the real world. The spatial attributes of a single observation include latitude and longitude, area (polygons), perimeter (polygons), centroid (polygons), and length (lines). The spatial features of a group of observations include density, distance, and cartography (points). Ordinary tabular data does not have any of these features.
Polygon data represents areas such as a city, a block of houses, or a land-use area, and is often stored in a polygon shapefile. Line data expresses networks such as roads, pipelines, rivers, and routes. Point data typically contains information about elevation points, water table depth points, and other points of interest. Depending on what we need to accomplish, polygon, line, and point data can be converted into one another.
In tabular data, one observation does not have any spatial relationship with any other observation. In spatial data, each observation is separated from the other observations by a distance. Thanks to the spatial attributes, we can perform spatial analysis (or geometric manipulation), such as clipping, erasing, buffering, union, and interpolation.
"Clip" returns the observations of one group where their areas overlie another group of observations. "Erase," on the other hand, returns the observations whose areas do not overlap with another group of observations. "Buffer" creates an area surrounding observations up to a certain distance. "Union" combines more than one group of observations. Interpolation converts points into polygons (or a raster) by estimating values between the points. Machine Learning is closely related to interpolation since interpolation predicts what the values will be between known points. This will be discussed in more detail later on.
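As a small illustration of polygon attributes, the sketch below derives area, perimeter, and centroid from a polygon's vertices using the shoelace formula. It assumes planar (projected) coordinates rather than latitude and longitude in degrees, and `polygon_attributes` is a hypothetical helper name.

```python
import math


def polygon_attributes(vertices):
    """Return (area, perimeter, centroid) of a simple polygon.

    vertices: list of (x, y) in order around the boundary, without
    repeating the first vertex at the end.
    """
    n = len(vertices)
    twice_area = 0.0
    cx = cy = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = twice_area / 2.0  # signed: positive for counter-clockwise order
    centroid = (cx / (6.0 * area), cy / (6.0 * area))
    return abs(area), perimeter, centroid


# A 4 x 3 rectangle standing in for a block of houses.
block = [(0, 0), (4, 0), (4, 3), (0, 3)]
area, perimeter, centroid = polygon_attributes(block)
print(area, perimeter, centroid)
```

Each of these derived values can become a column in the table, which is one way spatial data yields features that plain tabular data lacks.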
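The clip, erase, and buffer operations can be sketched for the simplest case, a set of points tested against one polygon. This is a simplified illustration with made-up coordinates and hypothetical function names; real GIS tools also handle line and polygon inputs and return new geometries.

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal ray going right from the point?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def clip(points, polygon):
    """Keep the points that fall inside the polygon."""
    return [p for p in points if point_in_polygon(p, polygon)]


def erase(points, polygon):
    """Keep the points that fall outside the polygon."""
    return [p for p in points if not point_in_polygon(p, polygon)]


def within_buffer(points, center, distance):
    """Keep the points inside a circular buffer around `center`."""
    return [
        p for p in points
        if math.hypot(p[0] - center[0], p[1] - center[1]) <= distance
    ]


district = [(0, 0), (10, 0), (10, 10), (0, 10)]
houses = [(5, 5), (15, 5), (1, 9)]

clipped = clip(houses, district)
erased = erase(houses, district)
near_corner = within_buffer(houses, center=(0, 10), distance=2)
print(clipped, erased, near_corner)
```

In this simplified setting, a union of two point groups is just their combined list.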
Machine Learning for Spatial Analysis
We can run Machine Learning tasks such as regression, classification, and clustering on spatial data. In GIS, one of the most frequently used tools is interpolation, for example, interpolating a set of points containing house price information into a polygon or a raster. In spatial data, the purpose of regression analysis is essentially interpolation, because we want to predict the unknown values in the areas between the points.
Kriging is one of the most commonly used tools for interpolation. Empirical Bayesian Kriging (EBK) is a variant that brings a Machine Learning flavor to interpolating points. Conventional Kriging uses a single semivariogram model to predict unknown values, while EBK estimates multiple semivariograms and combines them using Bayes' rule to predict unknown values.
The EBK method explained above interpolates univariate data. We can also input independent (explanatory) variables that affect the target variable. For example, EBK can support the interpolation of house prices with added variables such as "distance from the main road," "distance from the public facility," "criminal occurrence," and "disaster risk." Besides EBK, other algorithms are available for this kind of spatial prediction, such as Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR).
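A semivariogram summarizes how different the values of two points tend to be as the distance between them grows. The sketch below computes an empirical semivariogram, the raw curve to which a Kriging model is fitted. It is not Kriging or EBK itself, and the points, values, and lag width are made up.

```python
import math
from collections import defaultdict


def empirical_semivariogram(points, values, lag_width):
    """Average 0.5 * (z_i - z_j)^2 over all point pairs, grouped into
    distance bins (lags) of width `lag_width`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            lag = int(h // lag_width)
            sums[lag] += 0.5 * (values[i] - values[j]) ** 2
            counts[lag] += 1
    return {lag: sums[lag] / counts[lag] for lag in sorted(sums)}


# Four sample points along a line with a measured value at each.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 7.0]

semivariogram = empirical_semivariogram(points, values, lag_width=1.0)
print(semivariogram)
```

In this toy dataset the semivariance rises with the lag, meaning that nearby points have more similar values than distant ones.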
Machine Learning for Interpolation
Machine Learning regressions, such as linear regression, tree-based regression, or Support Vector Machine regression, can predict the target variable based on the independent variables, but they do not consider that observations in closer proximity tend to have more similar target values. For example, areas that are closer together tend to have similar house prices. Tobler's first law of geography states that "everything is related to everything else, but near things are more related than distant things."
In addition to point interpolation, we can also perform areal interpolation. Areal interpolation converts a set of bigger polygons into smaller polygons, estimating their values according to their surroundings. Polygons can also be resampled into a different set of polygons whose values are influenced by the neighboring polygons.
Classification is the second task of Machine Learning. Conventional Machine Learning uses algorithms such as Maximum Likelihood, Support Vector Machine, and Decision Trees for classification. The same algorithms, under the same names, are used in spatial analysis. One of the most popular Machine Learning tasks for spatial classification is identifying land cover types from satellite images. Remote sensing is the science that studies this.
Clustering is yet another machine learning application for spatial analysis. Using conventional Machine Learning, we can cluster many observations based on how similar their patterns are. This can also be done with spatial data. In the regression section, we discussed that things located close to one another are more likely to be similar. Therefore, we can consider spatially constrained multivariate clustering. "Spatially constrained" ensures that each cluster consists of adjacent polygons, so the polygons of a cluster are never separated from one another.
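One simple way to build Tobler's law into a prediction is inverse distance weighting (IDW), where nearer known points get larger weights. IDW is not discussed in this article and is much simpler than Kriging; it is used here only as a minimal sketch, with made-up house prices and a hypothetical `idw_predict` helper.

```python
import math


def idw_predict(known_points, known_values, target, power=2.0):
    """Inverse distance weighted estimate at `target`.

    Each known value is weighted by 1 / distance**power, so near points
    influence the estimate more than distant ones.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in zip(known_points, known_values):
        d = math.hypot(target[0] - x, target[1] - y)
        if d == 0.0:
            return value  # the target coincides with a known point
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Two houses with known prices, 10 distance units apart.
houses = [(0.0, 0.0), (10.0, 0.0)]
prices = [100.0, 200.0]

midway = idw_predict(houses, prices, target=(5.0, 0.0))
near_first = idw_predict(houses, prices, target=(1.0, 0.0))
print(midway, near_first)
```

The estimate halfway between the houses is the plain average, while the estimate next to the first house stays close to that house's price.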
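The simplest form of areal interpolation is area weighting: each source polygon hands its value to the target polygons in proportion to the overlapping area. The sketch below assumes axis-aligned rectangles and a count-like variable such as population; it ignores the influence of neighboring polygons that more advanced methods model, and all zones and numbers are made up.

```python
def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def areal_interpolate(source_zones, source_values, target_zones):
    """Reallocate source values to target zones in proportion to the
    share of each source zone's area that the target zone covers."""
    result = []
    for target in target_zones:
        total = 0.0
        for source, value in zip(source_zones, source_values):
            total += value * overlap_area(source, target) / rect_area(source)
        result.append(total)
    return result


# Two large source zones with known populations.
source_zones = [(0, 0, 2, 2), (2, 0, 4, 2)]
populations = [400.0, 800.0]

# Target zones: one straddling both sources, one small corner cell.
target_zones = [(1, 0, 3, 2), (0, 0, 1, 1)]
estimates = areal_interpolate(source_zones, populations, target_zones)
print(estimates)
```

Area weighting assumes the value is spread evenly inside each source polygon, which is rarely true for variables like population or house prices.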
Using Hot Spot Analysis, we can determine where clusters of polygons with high and low values are concentrated. A hot spot is where high values are concentrated, and a cold spot is where low values are concentrated. The accompanying illustration shows how the polygons are grouped into clusters using three different tools.
Density-based clustering is a Machine Learning tool designed specifically for point data. It groups a series of points according to their density: points gathered at high density form a cluster, separated from other points gathered far away. The accompanying illustration shows points clustered according to their spatial density. The method is the same as DBSCAN from conventional Machine Learning.
In addition to vector data, we can also cluster raster images using "Image Segmentation." Satellite images and aerial photos are examples of images in which objects are segmented. The technique is no different from image segmentation in conventional machine learning. In conventional machine learning, we segment objects, such as people, trees, and houses, photographed from any angle. In spatial data, we usually segment objects, such as trees, in a vertical (top-down) image. The result can then be used for mapping.
Space-Time Pattern Mining is the last Machine Learning application for spatial analysis. This tool analyzes data by clustering it in space and time simultaneously. The data is represented as a three-dimensional cube made of bins. The x and y axes define the spatial dimensions, and the z-axis represents time. Each bin holds a value. This allows us to analyze emerging hot spots and cold spots. Over time, we can determine which areas have increasing, decreasing, or constant values.
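Hot spot tools are commonly built on the Getis-Ord Gi* statistic. The article does not name the statistic behind Hot Spot Analysis, so that choice is an assumption here. The sketch computes a Gi* z-score per polygon with simple binary neighbor weights; the values and the neighbor list are made up.

```python
import math


def getis_ord_gi_star(values, neighbors):
    """Gi* z-score for each feature, using binary weights.

    neighbors[i] lists the indices of the features adjacent to feature i;
    feature i itself is always included in its own neighborhood.
    Assumes the values are not all equal and that no neighborhood covers
    every feature (otherwise the denominator is zero).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        members = set(neighbors[i]) | {i}
        w_sum = len(members)      # sum of the binary weights
        w_sq_sum = len(members)   # sum of the squared binary weights
        local_sum = sum(values[j] for j in members)
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Six polygons in a row: high values on the left, low values on the right.
values = [10.0, 9.0, 8.0, 1.0, 2.0, 1.0]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

scores = getis_ord_gi_star(values, neighbors)
print([round(z, 2) for z in scores])
```

Large positive scores indicate a hot spot (high values surrounded by high values), and large negative scores indicate a cold spot.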
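Since the method is the same as DBSCAN, a compact plain-Python version of DBSCAN shows how the grouping works. The `eps` (search radius) and `min_pts` (minimum neighbors for a dense core) values and the sample points are made up for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def region(i):
        """Indices of all points within eps of point i (including i)."""
        return [
            j for j, q in enumerate(points)
            if math.hypot(points[i][0] - q[0], points[i][1] - q[1]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # dense group A
    (5, 5),                                  # isolated point
    (10, 10), (10, 11), (11, 10), (11, 11),  # dense group B
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not chosen in advance, and isolated points are left unassigned as noise.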
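The cube of bins can be sketched by counting events per (x, y, time) bin and then checking each location's series for a trend. The Mann-Kendall S statistic used below is one common trend test; the article does not say which test the tool applies, so treat it as an assumption. The events, cell size, and time step are made up.

```python
from collections import Counter


def build_space_time_cube(events, cell_size, time_step):
    """Count events per (x_bin, y_bin, t_bin). events: (x, y, t) tuples."""
    cube = Counter()
    for x, y, t in events:
        key = (int(x // cell_size), int(y // cell_size), int(t // time_step))
        cube[key] += 1
    return cube


def location_series(cube, x_bin, y_bin, n_steps):
    """Values of one location's bins, ordered through time."""
    return [cube.get((x_bin, y_bin, t), 0) for t in range(n_steps)]


def mann_kendall_s(series):
    """Mann-Kendall S: positive for an upward trend, negative for a
    downward trend, near zero when there is no consistent trend."""
    s = 0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    return s


events = [
    # location bin (0, 0): 1, 2, then 3 events over three time steps
    (1, 1, 0), (2, 2, 1), (3, 3, 1), (1, 2, 2), (4, 4, 2), (5, 5, 2),
    # location bin (1, 0): 3, 1, then 0 events
    (12, 1, 0), (13, 2, 0), (14, 3, 0), (15, 1, 1),
]
cube = build_space_time_cube(events, cell_size=10, time_step=1)

rising = location_series(cube, 0, 0, n_steps=3)
falling = location_series(cube, 1, 0, n_steps=3)
print(rising, mann_kendall_s(rising))
print(falling, mann_kendall_s(falling))
```

A full emerging hot spot analysis would combine such a trend test with a hot spot statistic computed for every bin.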
Conclusion
With Machine Learning, a prediction model can be built for regression, classification, and clustering tasks. In contrast to tabular data, observations in spatial data are spatially related to one another. With Machine Learning for spatial data analysis, we build a model that predicts, classifies, or clusters unknown locations based on the known locations in the training dataset, taking the spatial attributes into account.