Access Overture's Places Data

Overture Maps Foundation released the alpha version of their open data, and I couldn’t wait to get my hands on it. When I went to their download page, I saw that they have released the data not in any of the standard GIS formats, nor in OSM’s PBF format, but in the Apache Parquet format.

The kind folks at Overture have a write-up on how you can access it here, but it is far too technical for someone who isn’t familiar with the Parquet format, and hence not too easy for us GIS folk to follow.

So let us go step by step, and figure out how to access and download this data.

Before we get started, let’s understand a couple of things:

  • Apache Parquet (https://parquet.apache.org/) is a highly compressed, column-oriented data format, which makes files small in size and quick to parse. However, this also means that you can’t open these files in a text editor, or in standard GIS software like QGIS. It is meant for programmatic access by big-data applications. 
  • DuckDB (https://duckdb.org/) is a serverless, in-process database for Online Analytical Processing (OLAP), or, as they rightly claim on their website, DuckDB offers ‘All the benefits of a database, none of the hassle’. If you look at the documentation, you will see that it can read Parquet files, among many other formats, so we will use DuckDB to access this data. It also has a spatial extension, which we will use for spatial querying and for writing the data out to GeoJSON.
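
To see DuckDB’s Parquet support in action before touching any remote data, here is a minimal, self-contained sanity check you can paste into the DuckDB shell. It writes a tiny table to a local Parquet file and reads it back; the file name demo.parquet is just an example:

```sql
-- Write a one-row table to a local Parquet file ...
COPY (SELECT 1 AS id, 'cafe' AS category) TO 'demo.parquet' (FORMAT PARQUET);

-- ... and read it back with the same read_parquet() function
-- we will later point at Overture's files on S3.
SELECT * FROM read_parquet('demo.parquet');
```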

Here are the steps to get the POIs for a small area on to our local system.

  • The first step is to download and install DuckDB on our system. Follow the steps given on this page (https://duckdb.org/docs/installation/) to download and install the DuckDB CLI executable on your system.
  • Now start DuckDB by running a command like `duckdb` on the command line, from the folder where it was extracted or installed.
  • We will need two extensions to make our lives easier: httpfs, which is used for streaming files from a remote store like Amazon’s S3, and spatial, which enables the spatial functionality we will use to query the data and save it in a spatial format. You can install them by running the following commands within DuckDB: 

INSTALL httpfs;
INSTALL spatial;

  • Now you are ready to access data.
  • Go to a site like http://bboxfinder.com/ and draw a polygon for your Area of Interest; the bbox coordinates are shown at the bottom of the page. In my case I have selected an area around Pune.

[Screenshot: bounding box drawn around Pune on bboxfinder]

  • Let’s load the required extensions, and also set the AWS region that we will be using. This can be done with the following commands in DuckDB:

LOAD spatial;
LOAD httpfs;
SET s3_region='us-west-2';

  • Before we query the data, we need to understand the columns. To do this, we can run a command like:

DESCRIBE
SELECT *
FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1);

  • This will show us the columns within the data, like this:

[Screenshot: DESCRIBE output listing the columns]

  • We only want some of the columns from the data, and some of them are nested structures, so we will have to convert them to see them. Let’s query a couple of records, and confirm that we get the right data.

SELECT
    id,
    JSON(names) AS names,
    JSON(categories) AS categories,
    JSON(brand) AS brand,
    JSON(addresses) AS addresses,
    ST_GeomFromWKB(geometry) AS geom
FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1)
LIMIT 2;

  • This will fetch only two records, and convert the names, categories, brand and addresses columns to JSON, so that we can then extract values from them.
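
Once a column is cast to JSON, DuckDB’s JSON functions let you drill into it. As an illustration, a query like the following could pull the first common name out of the names structure; note that the '$.common[0].value' path is my assumption about the alpha schema, so verify it against your own DESCRIBE output first:

```sql
SELECT
    id,
    -- assumed path into the names struct; adjust to the actual schema
    json_extract_string(JSON(names), '$.common[0].value') AS primary_name
FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1)
LIMIT 2;
```
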
  • Now I want to get only the data for our bounding box, so let us filter on the bbox column. We can do this by running the following query:

SELECT
    id,
    JSON(names) AS names,
    JSON(categories) AS categories,
    JSON(brand) AS brand,
    JSON(addresses) AS addresses,
    ST_GeomFromWKB(geometry) AS geom
FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1)
WHERE
    bbox.minX > 73.77 AND
    bbox.maxX < 73.955 AND
    bbox.minY > 18.43 AND
    bbox.maxY < 18.61
LIMIT 2;

  • This will show you two records which match our query parameters.
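
Before exporting everything, it can be useful to count how many POIs fall inside the bounding box, so you know roughly how large the output will be:

```sql
SELECT count(*) AS poi_count
FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1)
WHERE
    bbox.minX > 73.77 AND
    bbox.maxX < 73.955 AND
    bbox.minY > 18.43 AND
    bbox.maxY < 18.61;
```
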
  • The last step is to write a query which will save this data to a GeoJSON file. This can be done with the following query:

COPY (
    SELECT
        id,
        JSON(names) AS names,
        JSON(categories) AS categories,
        JSON(brand) AS brand,
        JSON(addresses) AS addresses,
        ST_GeomFromWKB(geometry) AS geom
    FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1)
    WHERE
        bbox.minX > 73.77 AND
        bbox.maxX < 73.955 AND
        bbox.minY > 18.43 AND
        bbox.maxY < 18.61
) TO 'poi_pune.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON');

  • Do note that this query might take some time to run; depending on your configuration and internet speed, it might even take a few hours.
  • Once the query has run, the data will be written to a GeoJSON file, which you can open in a GIS software like QGIS.

[Screenshot: the exported POIs displayed in QGIS]
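
Since the export goes through the spatial extension’s GDAL integration, other vector formats should work too. For example, the following sketch writes a subset of the same data to a GeoPackage instead; driver availability can depend on how your DuckDB spatial extension was built, so treat this as an untested variant:

```sql
COPY (
    SELECT
        id,
        JSON(names) AS names,
        ST_GeomFromWKB(geometry) AS geom
    FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*', filename=true, hive_partitioning=1)
    WHERE
        bbox.minX > 73.77 AND
        bbox.maxX < 73.955 AND
        bbox.minY > 18.43 AND
        bbox.maxY < 18.61
) TO 'poi_pune.gpkg'
WITH (FORMAT GDAL, DRIVER 'GPKG');
```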



