Tutorial
11 min read

Introducing the Geoparquet data format

The need for a unified format for geospatial data

In recent years, a lot of geospatial frameworks have been created to process and analyze big geospatial data from various data sources. A lot of them struggled with a unified data format which can be distributed across many machines. The most popular geospatial data  format so far is shapefile but it has many drawbacks, such as:

  • it's multi file format (.shp, .shx, .dbf, and .prj and even some optional ones)
  • it doesn’t compress well
  • it is limited by its column name size (10 characters)
  • it doesn’t support nested fields
  • it has a 2GB size limit 
  • it can only store one geometry field in the file, also with the size of shapefile processing efficiency drops a lot
  • Its format is not specifically designed for distributed systems. 

Geojson can store nested fields but it is not distributed. There was an approach to store geojson as separate lines to enable distributing these files:

{"type": "Feature", "geometry": {"type": "Point", "coordinates": [21.022516, 52.207257] }, "properties": { "city": "Warszawa"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [19.511426, 51.765019] }, "properties": { "city": "Łódź"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [2.121482, 41.381212] }, "properties": { "city": "Barcelona"}}

Nevertheless it is not compressed and stored as plain text, so as you can see, it's size is huge. Additionally, geojson only allows you to store one geometry field at a time and it doesn't support spatial indexes. 

There is also geopackage which is an SQLLite container, its size is less than shapefile and  it’s relatively new but it’s not constructed for distributed systems.

In geospatial big data frameworks like Geomesa and Apache Sedona (incubating), the community uses parquets a lot for storing geospatial data, but in many cases it uses wkb only or wkt to store geospatial data objects, without any improvements in the case of metadata or indexing techniques.

Parquet introduction

Apache parquet is an open source columnar data storage format which stores data for efficient loads, it compresses well and decreases data size tremendously. It has API for languages like Python, Java, C++ and more and is well integrated with Apache Arrow. For more information on this, please refer to the official website of apache parquet https://parquet.apache.org/docs/

Let’s take a look at the biggest benefits of using parquet: 

  • It is compressed well so it reduces storage costs on cloud
  • Data skipping and field statistics can help improve performance of data processing by loading smaller chunks of the data
  • Designed to store big data of any type

Parquet is a great data format for storing complex huge amounts of data, but it is missing geospatial support, so that’s where the idea of geoparquet came about.

Goal of geoparquet

  • To establish a geospatial columnar data format, which increases efficiency in analytical based use cases. Current data formats like shapefile can’t easily be partitioned and processed across many machines. 
  • To introduce a columnar data format to the geospatial world. Most of the current geospatial data analytics tools do not benefit from all the breakthroughs developed over  recent years. 
  • To create a unified, easy to interchange data format which can be used interchangeably across modern analytics systems like BigQuery, Snowflake, Apache Spark or Redshift. Currently, loading geospatial data can be problematic in the case of such systems, for example in redshift, geospatial data can only be loaded  using copy command from shapefile*. Apache Spark users often store the data using a wkt or wkb format, but in many ways (with crs or without, using ewkt or wkt) Geoparquet can solve interoperability problems.
  • To persist geospatial data from Apache Arrow (GeoArrow spec was developed in parallel with geoparquet). 

We went through the advantages of geoparquet, now let’s look at file specification.

Geoparquet format specification

The key components of geoparquet are:

  • Geometry columns 
  • Metadata
  • File metadata

Column metadata 

Geometry columns

Geometry columns are stored in wkb (well known binary) data format. Wkb translates a geometry object into an array of bytes. 

The first two bytes indicate the byte order 

00 - means big endian

01 - means little endian

Next is to four-code the geometry type (which also includes the geometry dimension) ie the 2D point is represented by 0001. The last 16 bytes represent x and y coordinates (8 byte floats) for points.

POINT(50.323281, 19.029889)

01 | 1 0 0 0 |-121 53 -107 69 97 41 73 64 | -103 -126 53 -50 -90 7 51 64

For more details please visit https://libgeos.org/specifications/wkb/

* It’s common that a geometry column in wkt or wkb format has a lot of points and exceeds the redshift column length.

* This blog was  written when version 0.4 was specified. 

File Metadata

Version - the version of the geoparquet file at the time of writing this article was geoparquet 0.4.0

Primary_column - the primary geometry column name, some systems allow for the storage of multiple geometry columns and require the default ones. 

Columns - a map of geometry column names and its metadata. 

Geometry column metadata

  • encoding - the geometry column is encoded into bytes, as the initial approach is wkb but additional methods can be included in later versions of the specification. This field refers to the encoding method.
  • geometry_type - defining the geometry type of the column like "Point", "LineString", "Polygon" - for 3D points it is PointZ. It is useful to optimize the data loading and during complex operations like spatial join, prior information about geometry column type may be useful. 
  • crs - coordinate reference system specifying in PROJJSON format. This parameter is optional so when it's not available the data should be stored in longitude latitude format (EPSG:4326 crs)
  • orientation - coordinate order can be clockwise or counterclockwise.
  • edges - indicates how to approach the edge between two points, whether it's a cartesian straight line or a spherical distance, possible options are planar and spherical.
  • bbox - defining the boundary box for each geometry column, it can speed up read times by pruning only parquet within a specified boundary box. It can be especially useful for queries between huge datasets. Also, providing multiple geometry columns allows the user to pre-calculate the boundary box for all columns, which can improve spatial join performance for complex geometry objects (ex polygons with a lot of coordinates). Bbox is an array with min and max coordinates for each dimension.
  • epoch is the time when the coordinates were measured and is important for points measured on the surface which can change over time.

Geoparquet features 

  • Multiple spatial reference systems - it’s important to keep the information about coordinate reference systems in the file to be able to compare different data records.
  • Multiple geometry columns - the specification tells us about one specific geometry default column or many others.
  • Great compression / small files - the huge disadvantage of shapefiles, geojson files or KML are their size;parquet reduces file size.
  • Working with both planar and spherical coordinates, geoparquet aims to support 2D and 3D data. 
  • Great at read-heavy analytic workflows, columnar data formats like parquet allow a cheap reads subset of columns or filtering based on column statistics.
  • Support for data partitioning - parquet allows data splitting to increase parallelism, geoparquet aims to create geospatial partitions to efficiently load data from the data lake.
  • Enable spatial indices - for  the best possible performance, spatial indexes are essential. 

The Road map of the geoparquet project

The Geoparquet project aims to provide a unified data format within months, not years; the official repository was created in August 2021.

0.1 - basics established,  target for implementations

0.2 / 0.3 - First feedback, 3D coordinates support, geometry types, crs optional.

0.x - spatial indexes etc. 

1.0.0-RC.1 - 6 implementations which work interoperably. 

1.0.0 - The first version will be released once 12 working implementations are   available.

Current implementations

Packages which currently implement geoparquet

  • Geopandas (Python)
  • Sfarrow (R)
  • GDAL/OGR (C++, wrappers to many languages like Python)
  • GeoParquet.jl (Julia)
  • Geopandas example code (Python)

Let’s analyze some points of interest data from California (downloaded as shape files geofabrik). 

data-2

As you can see on screen image there, the shape files representing northern California in total have around 13 MB. When the data is loaded onto the geopandas dataframehas 82425 records and 5 columns.

data-colums

Let’s directly, without any changes, save the data to geoparquet data format as one file. To do so we can use geopandas implementation of geoparquet ( v0.4 as of the time of writing this blog). 

To save to geoparquet we need one line of code (assuming that our dataset is named gdf)

gdf.to_parquet(
    "ncal_points.geoparquet"
)

geopraquet line code

We have already reduced the size of the file to 2.7 MB. Additionally the read time is faster, including the fact that you can easily select only those columns you want to analyze.

Now we can look at the file metadata. 

file-metadata

We can see information about the primary geometry column, its crs, geometry type and bbox. From a query standpoint, the bbox field is really important, because we can easily select only those parquet files which we need to do our spatial operations, like spatial join.

Now let’s split the area into smaller rectangles and create parquet files based on those. 

map

Target file sizes look like those below. We can consider other data splits to distribute data evenly, but for the purpose of this blog let’s keep it simple.

data-2

One partition is totally empty, but for other 3 the bbox looks like this:

'bbox': [-123.67493, 35.790856, -120.0241473, 38.8988234]

'bbox': [-124.3963383, 38.8997457, -120.0221298, 42.0085916]

'bbox': [-120.0214167, 38.9019867, -119.9051073, 41.99494]

The data split into smaller files allows for techniques like data skipping which can improve query performance tremendously. Also, geoparquet format aims to have indexes implemented which can also drastically reduce the amount of time needed to analyze geospatial data at scale.

Is Geoparquet the future of geospatial data processing at scale? 

The Geospatial community needs one unified data format for storing and processing big geospatial data workloads. Currently, many data engineers struggle with integrating geospatial data across systems, often wkt and wkb columns are treated as interchangeable processing units between systems, yet in some cases can not be handled by other systems (ex. Redshift ). A lot of geospatial work is not performed optimally due to a lack of geospatial support in popular big data formats like parquet and orc. It’s common that the data is loaded and then filtered to a specific extent to perform spatial joins, which is not an optimal approach. With geoparquet, a lot of computation power and data storage can be saved by efficiently storing big geospatial data. A lot of frameworks create data loading optimization (skipping, filtering, indexing) to perform geospatial queries even faster. There is a lot of work to do to make the geoparquet format acceptable by the community and be widely used and production ready, but this is exactly what geospatial data processing at scale needs.

Want to know more about Big Data?

Join our newsletter and do not miss anything!

The administrator of your personal data is GetInData Sp. z o.o. Sp.k with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the  Terms & Conditions. For more information on personal data processing and your rights please see  Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy
streaming
big data
geospatial data
geospatial analytics
python
Streaming Geospatial
8 August 2022

Want more? Check our articles

16kxTuxGkZjskytKJLQKsJg
Tech News

Two BI companies in play. Tableau acquired by Salesforce and Looker by Google.

The two recently announced acquisitions by Google and Salesforce in the thriving business analytics market appear to be strategic moves to remain…

Read more
modern data platform dp framework components getindata
Tech News

Announcing the GetInData Modern Data Platform - a self-service solution for Analytics Engineers

The GID Modern Data Platform is live now! The Modern Data Platform (or Modern Data Stack) is on the lips of basically everyone in the data world right…

Read more
dynamodb aws jedraszewski getindata big data blog
Tutorial

Amazon DynamoDB - single table design

DynamoDB is a fully-managed NoSQL key-value database which delivers single-digit performance at any scale. However, to achieve this kind of…

Read more
kafka gobblin hdfs getindata linkedin
Tutorial

Data pipeline evolution at Linkedin on a few pictures

Data Pipeline Evolution The LinkedIn Engineering blog is a great resource of technical blog posts related to building and using large-scale data…

Read more
big data blog getindata from spreadsheets automated data pipelines how this can be achieved 2png
Tutorial

From spreadsheets to automated data pipelines - and how this can be achieved with support of Google Cloud

CSVs and XLSXs files are one of the most common file formats used in business to store and analyze data. Unfortunately, such an approach is not…

Read more
screenshot 2022 08 02 at 10.56.56
Tech News

2022 Big Data Trends: Retail and eCommerce become one of the hottest sectors for AI/ML

Nowadays, we can see that AI/ML is visible everywhere, including advertising, healthcare, education, finance, automotive, public transport…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

The administrator of your personal data is GetInData Sp. z o.o. Sp.k with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the  Terms & Conditions. For more information on personal data processing and your rights please see  Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy