Tutorial

11 min read

Introducing the Geoparquet data format

The need for a unified format for geospatial data

In recent years, a lot of geospatial frameworks have been created to process and analyze big geospatial data from various data sources. A lot of them struggled with a unified data format which can be distributed across many machines. The most popular geospatial data format so far is shapefile but it has many drawbacks, such as:

it's multi file format (.shp, .shx, .dbf, and .prj and even some optional ones)
it doesn’t compress well
it is limited by its column name size (10 characters)
it doesn’t support nested fields
it has a 2GB size limit
it can only store one geometry field in the file, also with the size of shapefile processing efficiency drops a lot
Its format is not specifically designed for distributed systems.

Geojson can store nested fields but it is not distributed. There was an approach to store geojson as separate lines to enable distributing these files:

{"type": "Feature", "geometry": {"type": "Point", "coordinates": [21.022516, 52.207257] }, "properties": { "city": "Warszawa"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [19.511426, 51.765019] }, "properties": { "city": "Łódź"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [2.121482, 41.381212] }, "properties": { "city": "Barcelona"}}

Nevertheless it is not compressed and stored as plain text, so as you can see, it's size is huge. Additionally, geojson only allows you to store one geometry field at a time and it doesn't support spatial indexes.

There is also geopackage which is an SQLLite container, its size is less than shapefile and it’s relatively new but it’s not constructed for distributed systems.

In geospatial big data frameworks like Geomesa and Apache Sedona (incubating), the community uses parquets a lot for storing geospatial data, but in many cases it uses wkb only or wkt to store geospatial data objects, without any improvements in the case of metadata or indexing techniques.

Parquet introduction

Apache parquet is an open source columnar data storage format which stores data for efficient loads, it compresses well and decreases data size tremendously. It has API for languages like Python, Java, C++ and more and is well integrated with Apache Arrow. For more information on this, please refer to the official website of apache parquet https://parquet.apache.org/docs/.

Let’s take a look at the biggest benefits of using parquet:

It is compressed well so it reduces storage costs on cloud
Data skipping and field statistics can help improve performance of data processing by loading smaller chunks of the data
Designed to store big data of any type

Parquet is a great data format for storing complex huge amounts of data, but it is missing geospatial support, so that’s where the idea of geoparquet came about.

Goal of geoparquet

To establish a geospatial columnar data format, which increases efficiency in analytical based use cases. Current data formats like shapefile can’t easily be partitioned and processed across many machines.
To introduce a columnar data format to the geospatial world. Most of the current geospatial data analytics tools do not benefit from all the breakthroughs developed over recent years.
To create a unified, easy to interchange data format which can be used interchangeably across modern analytics systems like BigQuery, Snowflake, Apache Spark or Redshift. Currently, loading geospatial data can be problematic in the case of such systems, for example in redshift, geospatial data can only be loaded using copy command from shapefile*. Apache Spark users often store the data using a wkt or wkb format, but in many ways (with crs or without, using ewkt or wkt) Geoparquet can solve interoperability problems.
To persist geospatial data from Apache Arrow (GeoArrow spec was developed in parallel with geoparquet).

We went through the advantages of geoparquet, now let’s look at file specification.

Geoparquet format specification

The key components of geoparquet are:

Geometry columns
Metadata
File metadata

Column metadata

Geometry columns

Geometry columns are stored in wkb (well known binary) data format. Wkb translates a geometry object into an array of bytes.

The first two bytes indicate the byte order

00 - means big endian

01 - means little endian

Next is to four-code the geometry type (which also includes the geometry dimension) ie the 2D point is represented by 0001. The last 16 bytes represent x and y coordinates (8 byte floats) for points.

POINT(50.323281, 19.029889)

01 | 1 0 0 0 |-121 53 -107 69 97 41 73 64 | -103 -126 53 -50 -90 7 51 64

For more details please visit https://libgeos.org/specifications/wkb/

* It’s common that a geometry column in wkt or wkb format has a lot of points and exceeds the redshift column length.

* This blog was written when version 0.4 was specified.

File Metadata

Version - the version of the geoparquet file at the time of writing this article was geoparquet 0.4.0

Primary_column - the primary geometry column name, some systems allow for the storage of multiple geometry columns and require the default ones.

Columns - a map of geometry column names and its metadata.

Geometry column metadata

encoding - the geometry column is encoded into bytes, as the initial approach is wkb but additional methods can be included in later versions of the specification. This field refers to the encoding method.
geometry_type - defining the geometry type of the column like "Point", "LineString", "Polygon" - for 3D points it is PointZ. It is useful to optimize the data loading and during complex operations like spatial join, prior information about geometry column type may be useful.
crs - coordinate reference system specifying in PROJJSON format. This parameter is optional so when it's not available the data should be stored in longitude latitude format (EPSG:4326 crs)
orientation - coordinate order can be clockwise or counterclockwise.
edges - indicates how to approach the edge between two points, whether it's a cartesian straight line or a spherical distance, possible options are planar and spherical.
bbox - defining the boundary box for each geometry column, it can speed up read times by pruning only parquet within a specified boundary box. It can be especially useful for queries between huge datasets. Also, providing multiple geometry columns allows the user to pre-calculate the boundary box for all columns, which can improve spatial join performance for complex geometry objects (ex polygons with a lot of coordinates). Bbox is an array with min and max coordinates for each dimension.
epoch is the time when the coordinates were measured and is important for points measured on the surface which can change over time.

Geoparquet features

Multiple spatial reference systems - it’s important to keep the information about coordinate reference systems in the file to be able to compare different data records.
Multiple geometry columns - the specification tells us about one specific geometry default column or many others.
Great compression / small files - the huge disadvantage of shapefiles, geojson files or KML are their size;parquet reduces file size.
Working with both planar and spherical coordinates, geoparquet aims to support 2D and 3D data.
Great at read-heavy analytic workflows, columnar data formats like parquet allow a cheap reads subset of columns or filtering based on column statistics.
Support for data partitioning - parquet allows data splitting to increase parallelism, geoparquet aims to create geospatial partitions to efficiently load data from the data lake.
Enable spatial indices - for the best possible performance, spatial indexes are essential.

The Road map of the geoparquet project

The Geoparquet project aims to provide a unified data format within months, not years; the official repository was created in August 2021.

0.1 - basics established, target for implementations

0.2 / 0.3 - First feedback, 3D coordinates support, geometry types, crs optional.

0.x - spatial indexes etc.

1.0.0-RC.1 - 6 implementations which work interoperably.

1.0.0 - The first version will be released once 12 working implementations are available.

Current implementations

Packages which currently implement geoparquet

Geopandas (Python)
Sfarrow (R)
GDAL/OGR (C++, wrappers to many languages like Python)
GeoParquet.jl (Julia)
Geopandas example code (Python)

Let’s analyze some points of interest data from California (downloaded as shape files geofabrik).

data-2

As you can see on screen image there, the shape files representing northern California in total have around 13 MB. When the data is loaded onto the geopandas dataframehas 82425 records and 5 columns.

data-colums

Let’s directly, without any changes, save the data to geoparquet data format as one file. To do so we can use geopandas implementation of geoparquet ( v0.4 as of the time of writing this blog).

To save to geoparquet we need one line of code (assuming that our dataset is named gdf)

gdf.to_parquet(
    "ncal_points.geoparquet"
)

geopraquet line code

We have already reduced the size of the file to 2.7 MB. Additionally the read time is faster, including the fact that you can easily select only those columns you want to analyze.

Now we can look at the file metadata.

file-metadata

We can see information about the primary geometry column, its crs, geometry type and bbox. From a query standpoint, the bbox field is really important, because we can easily select only those parquet files which we need to do our spatial operations, like spatial join.

Now let’s split the area into smaller rectangles and create parquet files based on those.

map

Target file sizes look like those below. We can consider other data splits to distribute data evenly, but for the purpose of this blog let’s keep it simple.

data-2

One partition is totally empty, but for other 3 the bbox looks like this:

'bbox': [-123.67493, 35.790856, -120.0241473, 38.8988234]

'bbox': [-124.3963383, 38.8997457, -120.0221298, 42.0085916]

'bbox': [-120.0214167, 38.9019867, -119.9051073, 41.99494]

The data split into smaller files allows for techniques like data skipping which can improve query performance tremendously. Also, geoparquet format aims to have indexes implemented which can also drastically reduce the amount of time needed to analyze geospatial data at scale.

Is Geoparquet the future of geospatial data processing at scale?

The Geospatial community needs one unified data format for storing and processing big geospatial data workloads. Currently, many data engineers struggle with integrating geospatial data across systems, often wkt and wkb columns are treated as interchangeable processing units between systems, yet in some cases can not be handled by other systems (ex. Redshift ). A lot of geospatial work is not performed optimally due to a lack of geospatial support in popular big data formats like parquet and orc. It’s common that the data is loaded and then filtered to a specific extent to perform spatial joins, which is not an optimal approach. With geoparquet, a lot of computation power and data storage can be saved by efficiently storing big geospatial data. A lot of frameworks create data loading optimization (skipping, filtering, indexing) to perform geospatial queries even faster. There is a lot of work to do to make the geoparquet format acceptable by the community and be widely used and production ready, but this is exactly what geospatial data processing at scale needs.

streaming

big data

geospatial data

geospatial analytics

python

Streaming Geospatial

Last updated: 8 August 2022

Written by

Paweł Tokaj

Senior Data Engineer

Like this post?
Spread the word

Want more? Check our articles

5 reasons to follow us on Linkedin. Celebrating 1,000 followers on our profile!

We are excited to announce that we recently hit the 1,000+ followers on our profile on Linkedin. We would like to send a special THANK YOU :) to…

DATA Pill – the blue pill that (accidentally) works!

Ever felt overwhelmed by the flood of news about the latest technologies, tools, and trends in Data, AI, and ML? A new framework here, a revolutionary…

highly available airflow cluster aws notext

Tutorial

Highly available Airflow cluster in Amazon AWS

These days, companies getting into Big Data are granted to compose their set of technologies from a huge variety of available solutions. Even though…

getindator man standing in front of a modern scheme showing mil 476f21ba 2f04 44d0 8c3b 8493e593b122

Tutorial

News Recommendation: the challenging area in building recommendation systems

Remember our whitepaper “Guide to Recommendation Systems. Implementation of Machine Learning in Business” from the middle of last year? Our data…

Tutorial

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

Machine learning is becoming increasingly popular in many industries, from finance to marketing to healthcare. But let's face it, that doesn't mean ML…

Tutorial

dbt Semantic Layer - What Is and How to Use

note: Read the second part of this post here. Introduction Many companies nowadays are facing the question, “How can I get value from my data easier…

Check All

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.

Type the form or send a e-mail: hello@getindata.com

Introducing the Geoparquet data format

The need for a unified format for geospatial data

Parquet introduction

Goal of geoparquet

Geometry columns

File Metadata

Geometry column metadata

Geoparquet features

The Road map of the geoparquet project

Current implementations

Is Geoparquet the future of geospatial data processing at scale?

Want to know more about Big Data?

Join our newsletter and do not miss anything!

Like this post?Spread the word

Want more? Check our articles

5 reasons to follow us on Linkedin. Celebrating 1,000 followers on our profile!

DATA Pill – the blue pill that (accidentally) works!

Highly available Airflow cluster in Amazon AWS

News Recommendation: the challenging area in building recommendation systems

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

dbt Semantic Layer - What Is and How to Use

Contact us

Interested in our solutions?Contact us!

Like this post?
Spread the word

Interested in our solutions?
Contact us!