A Tale of Geospatial Big Data
Are you sure a GIS is a complete solution for your organization?
Ten or more years ago, spatial data was pretty straightforward in the intel business. Standard schemas and a typed ontology for cataloging small areas of the globe worked just fine. There were some not-so-small vector layers like the Admin Boundaries of the world, VMap, and whatnot, but most data was pretty small and manageable by the individual analyst. The biggest chunk of vector data you might deal with was the NGA Gazetteer at around 8 million points, if you actually needed the whole thing, and some other special datasets could get pretty big. And if you needed raster datasets (multiband, DEM, whatever) of some area, it was OK to wait for them. Most people gravitated to the typical GIS solutions, and those worked well enough. However, as time went on, war got more complex, so decision making got more complex, so more complex data was needed. This complex data also needed to be available at the speed of unconventional warfare and at the speed with which analytical perspectives adapted to situations on the ground. There was now ever-growing pressure on analysts to “fuse” more sources of large spatial data into their workflows in order to add value to this new “holistic” and “whole of government” form of analysis. This new breed of analytics demanded larger AOIs full of weird data, and as AOIs got bigger, the need to piece together a patchwork of data arose, and the pain of data integration, transformation, conflation, and enrichment began to burden the individual analyst. The days of small shops creating data only for themselves, by hand, at a slow pace were over; the world had just gotten too big. Moreover, around this same time, spatial data started being produced by machines, by crowds of volunteers, and in large quantities by commercial entities.
Soon this paradigm of having to deal with random layers became a GIS analyst’s worst nightmare. Analysts spent most of their time doing data management and transformation rather than analysis. Some of these naughty spatial layers dared to be huge, because they came from a sensor, e.g., SIGINT. Other, even more nefarious layers were produced by automated extraction of millions of toponyms from large collections of reporting like M3, using crazy things like natural language processing, where each point had a big nasty block of text associated with it and a probability representing veracity…imagine the horror. Every source seemed to have different fields and inconsistent values, and even worse, some sources had the same fields as other sources but with completely different contextual meanings. Pain and suffering grew until the analysts started crying “why can’t everyone just use my schema and my ontology?”…and in some cases they actually expected it to happen, for a while. All the while, as if the whole problem were a living organism at war with the GIS world, different sources continued to be drawn at different scales in different parts of the world by different organizations in different ways into different schemata on different networks. Soon some of the analysts, in an almost Darwinian way, stepped up to the plate and succeeded, while others faltered and fell by the wayside. All the pain of performing this type of integration was pushed down to the analyst, and an analyst’s effectiveness depended on his or her personal technical sagacity; those who could do it were coveted, and some of the stubborn masses actually feared them for their wizardry and witchcraft. These analysts continued to row their chicken-wire-canoe systems as fast as they could, and only the ferociousness of their effort kept them afloat as they supported project after project.
Soon, out in the web-sphere, a thing called Big Data came into existence and was acknowledged. Sure enough, much of this scary new breed of massive data contained ludicrous amounts of valuable spatial information. In fact, even when it didn’t contain spatial data natively, some of the new big data technologies allowed developers to infer spatial data from it. But although this Big Data was spatial data, it was certainly not GIS data. Many hardcore GISers still clung to their columns and rows, their single-server instances, and their increasingly convoluted ontologies. This lasted until, one day, their battle was lost. Suddenly, there was demand for geospatial social media and crowdsourced spatial data…e.g., Twitter and OSM: conventional database weaponry was useless against it.
Billions of geotagged posts, tens of millions of shapes, time-series streams; it was textual and sparsely structured, with tens of thousands of edits or millions of posts per day…and it wasn’t on the classified networks. This new “Geospatial Big Data” attacked systems in waves with unprecedented speed and unpredictability…massive volume, random languages, free text, high velocity, unknown veracity, randomly structured and tagged, multidimensional, sparsely structured or not structured at all. As if this public stream of big data weren’t enough, the modern age of big data computing and cloud technologies let the sparse vector-layer world and the heavy image-data world join forces, in the form of insane-o-scale vector generation using image mining algorithms against petabytes of multispectral imagery…vectors were being spawned from raster at enormous scale, a tactic not many people saw coming…again, every database technology’s firearms were useless against it…it was hell in the GIS world. Most of the GISers who resisted change were forced underground, where they helplessly waged small insurgencies in the far corners of the data plane.
As if maniacally fearless, the open source world of data wranglers fought back in the form of a full frontal assault. Core search technologies started to become more spatially aware to meet the demand. Spatial query was built into core open source indexing solutions like Apache Lucene. The big elephant in the room, Hadoop (pun intended), stepped up to the plate as a behemoth that gave programmers the ability to process any data no matter how big and nasty it was, and a small army of Hadoop’s volunteer programming minions revolutionized how the world controls data. Elasticsearch emerged, enabled GeoShape indexing and search inside its massively scalable search solution, and solved some hugely important real-time analytics problems over big data via what it calls Aggregations…an essential tool for browsing billions of features on a map (see the sketch at the end of this post). Amazon Web Services provided the ability to scale compute and storage on demand, an ecosystem that lends itself well to heavy raster data operations. DigitalGlobe, previously known for the awesome imagery from its constellation of satellites, which produces 50TB+ of data per day, stood up what it calls an Information and Insight Platform (IIP) to control and leverage all the forces of Geospatial Big Data so analysts don’t have to anymore. The IIP leverages Amazon, Hadoop (as a Cloudera partner), Apache Storm, and Elasticsearch to enable enrichment on heavy raster data as well as to tame and enrich massive amounts of open vector data, including social media from Gnip, OSM, and gigantic amounts of vectorized output from image mining algorithms in AWS. Finally, a system that serves as the individual geospatial analyst’s first line of defense against Geospatial Big Data…at DG we call this type of system a NoGIS solution, where NoGIS stands for “Not Only GIS.” So as you examine your organization’s spatial data management and analysis needs and challenges, maybe you should ask yourself: do you need a GIS to solve your problem, or should you be looking for a Geospatial Big Data platform?
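P.S. For the technically curious, here is a rough idea of what “GeoShape indexing plus Aggregations” looks like in practice. This is a minimal sketch using the official Elasticsearch Python client, not actual IIP code; the index name, field names, and coordinates are made up for illustration, and it assumes a cluster reachable at localhost:9200.

```python
# Minimal sketch: geo_shape indexing + a geohash_grid aggregation in
# Elasticsearch. Index and field names are hypothetical, illustration only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local cluster

# Map one field as geo_shape (arbitrary geometries) and one as geo_point
# (cheap to bucket with grid aggregations for map-browsing summaries).
es.indices.create(
    index="features",
    mappings={
        "properties": {
            "geometry": {"type": "geo_shape"},
            "location": {"type": "geo_point"},
            "source": {"type": "keyword"},
        }
    },
)

# Index one feature (a GeoJSON point); refresh so it is searchable now.
es.index(
    index="features",
    document={
        "geometry": {"type": "Point", "coordinates": [-77.03, 38.89]},
        "location": {"lat": 38.89, "lon": -77.03},
        "source": "osm",
    },
    refresh=True,
)

# Ask for everything intersecting an AOI envelope, then roll the matches
# up into geohash cells. Drawing cell counts instead of raw points is
# what makes "billions of features on a map" browsable.
resp = es.search(
    index="features",
    size=0,  # only the aggregation buckets, no raw hits
    query={
        "geo_shape": {
            "geometry": {
                "shape": {
                    "type": "envelope",  # [top-left, bottom-right] lon/lat
                    "coordinates": [[-78.0, 39.5], [-76.0, 38.0]],
                },
                "relation": "intersects",
            }
        }
    },
    aggs={"cells": {"geohash_grid": {"field": "location", "precision": 5}}},
)
for bucket in resp["aggregations"]["cells"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```

Each bucket comes back as a geohash cell plus a feature count, which a map client can render as a heat layer; zooming in just means re-running the aggregation at a higher precision over a smaller envelope.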
Mark - great write-up and I hope that you are doing well. I am going to pass this on...take care.
Nice write up. I hope all is well.
Awesome, Mark. I think you've just spawned a very unique genre: GIS humor-lit that happens to be right on the money.