Sometimes a Spatial Problem requires a Non-Spatial Solution
Yesterday, I was trying to solve a frustrating problem. While eating a bar of chocolate, I noticed how the crumbs were lying on the foil: the individual chunks of chocolate had left an impression on the foil that looked like graph paper, and instantly I saw how I could solve my problem.
To provide some context: over the last few months, we have been helping a large fintech company unlock the spatial potential of their data. We finished geocoding all their customer addresses and managed to get most of them geocoded at the locality level. Those of you who know about Indian addresses can imagine how much of a miracle that was. It was an epic adventure and deserves a post by itself. Now we needed to get some actionable information out of the dataset.
It’s one thing to put a few hundred points on a map and show that to clients. It’s something totally different when you have tens of millions of points spread all over the country. We needed something more: we had to visualize the data in a way that was easy to understand and could effectively convey to the stakeholders some insight into their customers.
One of the important teachings of statistics is that aggregates are usually more representative than individual data points.
People who work in data science know this very well. This meant that we had to aggregate the data in some way, and then calculate statistics on those clusters. This raises the question: how do you aggregate, cluster, or bin the data? Since we are talking about spatial points, the natural solution is to create grids (be they hexagonal or square) and then run a basic point-in-polygon analysis.
This is where the large amount of data came into play. Creating a grid over a city is easy enough; how do you create one over the entire country? And once you create it, how do you do a point-in-polygon join over such large datasets? A naive version of this operation is of O(N²) complexity. (In layman’s terms, this simply means that since you need to compare each point with each polygon, if the dataset size doubles, the operation takes four times as long.) We had almost 5 crore (50 million) polygons and even more points.
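The brute-force join described above can be sketched in a few lines of pure Python. This is an illustration of why the cost is quadratic, not the actual method we used; the cells here are simplified to axis-aligned rectangles, and the function name is my own:

```python
def naive_spatial_join(points, cells):
    """Assign each point to a grid cell by brute force.

    O(N*M): every point is tested against every cell. Here `cells`
    are axis-aligned rectangles (xmin, ymin, xmax, ymax); a true
    point-in-polygon test would be even costlier per pair.
    """
    assignments = {}
    for i, (px, py) in enumerate(points):
        for j, (xmin, ymin, xmax, ymax) in enumerate(cells):
            if xmin <= px < xmax and ymin <= py < ymax:
                assignments[i] = j  # record the first matching cell
                break
    return assignments
```

With 50 million cells and even more points, the inner loop alone makes this hopeless without a spatial index.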
We tried every traditional method. We tried doing it in a desktop GIS: when we ran the spatial join with a coarse grid (only 5,000 cells covering the country), the process ran for over two days and finally failed. We tried the same in a spatial database, but the query never finished.
This kind of problem seems like a perfect match for big-data systems; Hadoop and similar platforms can process even larger amounts of data with ease. I have personally worked on a project for another client where we used Redis and Accumulo for fast queries on firehose data coming from hundreds of IoT devices. We could have solved this problem with those kinds of technologies, but alas, we were working with a traditional fintech company, and with its IT policies and other limitations, such fancy toys were not accessible.
We then decided to run it using Spark in standalone mode, and the queries ran, but only on the coarse-grained grids. When we ran them on the fine-grained grids, we began running into memory issues. This is where we were stuck yesterday.
When I saw the crumbs on the grid-like pattern of the foil, it was as if a light bulb had switched on. I realized that to figure out which rectangular grid cell a point falls in, you only need to know where the point lies relative to the grid’s origin. You can then divide the offsets by the size of a cell to get the row and column indices.
The only thing left to do was to make these calculations in a flat, projected coordinate system, where such linear measurements are valid. For this, we chose the LCC projection for India defined in the NNRMS standards. If you are interested in the details, it was added to the EPSG database recently and has the WKID EPSG::7755.
Once I had this realization, it took me hardly 15 minutes to write about 30 lines of Python code that does this calculation. When we ran it on our data, it took about 10 minutes to bin the entire dataset and generate aggregates on it.
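The original script wasn’t shared in the post, but the core calculation can be sketched roughly as follows. The function and parameter names are illustrative, and it assumes the point coordinates have already been reprojected into a flat CRS such as EPSG:7755:

```python
import math
from collections import Counter

def grid_cell(x, y, origin_x, origin_y, cell_size):
    """Return the (column, row) of the grid cell containing a point.

    Coordinates must already be in a flat, projected CRS (the post
    uses the Indian LCC projection, EPSG:7755) so that x/y offsets
    from the origin are linear distances.
    """
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((y - origin_y) / cell_size)
    return col, row

def bin_points(points, origin, cell_size):
    """Count points per grid cell in a single O(N) pass."""
    counts = Counter()
    ox, oy = origin
    for x, y in points:
        counts[grid_cell(x, y, ox, oy, cell_size)] += 1
    return counts
```

Because each point is mapped straight to its cell index by arithmetic, the polygons never need to exist at all, and the whole job reduces to one linear scan over the points.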
One might think from the title that I’m discounting the spatial insight required for the solution. Far from it. What I’m trying to highlight is that very often you need an out-of-the-box solution for the challenges you are facing. And this wasn’t the only insight I gained from the whole ordeal.
Other things that I realized:
- We usually don’t pay attention to performance when doing GIS analysis. When you are working with large datasets, it is vital to do so.
- It’s helpful to be proficient in multiple applications and paradigms. The more tools that you carry in your toolset, the more efficient you become.
- Those lectures on cartography and projections that I slept through were really important.
- If you know some programming and scripting, you’ll always have an upper hand over those who don’t.
- And finally, GIS is more than just the software you use. Unless you understand the spatial operations and algorithms behind it, you’ll never unlock the full potential of GIS.