The need for a unified format for geospatial data
In recent years, many geospatial frameworks have been created to process and analyze big geospatial data from various sources. Many of them have struggled with the lack of a unified data format that can be distributed across many machines. The most popular geospatial data format so far is the shapefile, but it has many drawbacks, such as:
- it is a multi-file format (.shp, .shx, .dbf, .prj and even some optional ones)
- it doesn't compress well
- it limits column names to 10 characters
- it doesn't support nested fields
- it has a 2 GB size limit
- it can only store one geometry field per file, and processing efficiency drops a lot as the shapefile grows
- its format is not designed for distributed systems
GeoJSON can store nested fields, but it is not designed for distribution either. One approach is to store GeoJSON features as separate lines (newline-delimited GeoJSON) so that the files can be split across machines:
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [21.022516, 52.207257] }, "properties": { "city": "Warszawa"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [19.511426, 51.765019] }, "properties": { "city": "Łódź"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [2.121482, 41.381212] }, "properties": { "city": "Barcelona"}}
Nevertheless, it is stored as uncompressed plain text, so as you can see, its size is huge. Additionally, GeoJSON only allows you to store one geometry field per feature and it doesn't support spatial indexes.
There is also GeoPackage, which is an SQLite container; it is smaller than a shapefile and relatively new, but it is not designed for distributed systems either.
In geospatial big data frameworks like GeoMesa and Apache Sedona (incubating), the community already uses parquet files a lot for storing geospatial data, but in many cases only WKB or WKT is used to store the geometry objects, without any improvements in terms of metadata or indexing techniques.
Parquet introduction
Apache Parquet is an open source columnar data storage format designed for efficient loads; it compresses well and decreases data size tremendously. It has APIs for languages like Python, Java, C++ and more, and is well integrated with Apache Arrow. For more information, please refer to the official Apache Parquet website: https://parquet.apache.org/docs/.
Let’s take a look at the biggest benefits of using parquet:
- It compresses well, which reduces storage costs in the cloud
- Data skipping and field statistics can help improve performance of data processing by loading smaller chunks of the data
- Designed to store big data of any type
Parquet is a great format for storing huge amounts of complex data, but it lacks geospatial support, and that's where the idea of geoparquet came about.
Goals of geoparquet
- To establish a geospatial columnar data format which increases efficiency in analytics use cases. Current data formats like shapefile can’t easily be partitioned and processed across many machines.
- To introduce a columnar data format to the geospatial world. Most of the current geospatial data analytics tools do not benefit from all the breakthroughs developed over recent years.
- To create a unified, easy-to-interchange data format which can be used across modern analytics systems like BigQuery, Snowflake, Apache Spark or Redshift. Currently, loading geospatial data into such systems can be problematic; for example, in Redshift geospatial data can only be loaded with the COPY command from a shapefile*. Apache Spark users often store the data in WKT or WKB format, but in many inconsistent ways (with a CRS or without, using EWKT or WKT). Geoparquet can solve these interoperability problems.
- To persist geospatial data from Apache Arrow (GeoArrow spec was developed in parallel with geoparquet).
We went through the advantages of geoparquet; now let’s look at the file specification.
Geoparquet format specification
The key components of geoparquet are:
- Geometry columns
- Metadata:
  - File metadata
  - Column metadata
Geometry columns
Geometry columns are stored in the WKB (well-known binary) format. WKB translates a geometry object into an array of bytes.
The first byte indicates the byte order:
00 - means big endian
01 - means little endian
The next four bytes encode the geometry type (which also includes the geometry dimension); e.g. a 2D Point is represented by the value 1 (bytes 01 00 00 00 in little-endian order). For a point, the last 16 bytes represent the x and y coordinates as two 8-byte floats.
POINT (50.323281 19.029889)
01 | 1 0 0 0 |-121 53 -107 69 97 41 73 64 | -103 -126 53 -50 -90 7 51 64
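To see this byte layout in practice, here is a minimal sketch using the shapely Python library (our choice for illustration, not part of the geoparquet spec) to serialize the same point to WKB:

from shapely.geometry import Point
from shapely import wkb

point = Point(50.323281, 19.029889)

# hex=True returns the WKB as a readable hex string; on a little-endian
# machine it starts with 01 (byte order), then 01000000 (geometry type 1 =
# Point), then two 8-byte doubles for x and y, e.g.
# 01 01000000 8735954561294940 998235CEA6073340
print(wkb.dumps(point, hex=True))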
For more details please visit https://libgeos.org/specifications/wkb/
* It’s common for a geometry column in WKT or WKB format to have a lot of points and exceed the Redshift column length limit.
* This blog was written when version 0.4 was specified.
File Metadata
Version - the version of the geoparquet specification; at the time of writing this article it was 0.4.0.
Primary_column - the name of the primary geometry column; some systems allow storing multiple geometry columns and require a default one.
Columns - a map of geometry column names to their metadata.
Geometry column metadata
- encoding - how the geometry column is encoded into bytes; the initial approach is WKB, but additional methods can be added in later versions of the specification.
- geometry_type - the geometry type of the column, like "Point", "LineString", "Polygon" (for 3D points it is PointZ). Knowing the geometry type upfront is useful for optimizing data loading and during complex operations like spatial joins.
- crs - the coordinate reference system, specified in PROJJSON format. This parameter is optional; when it's not present the data should be stored in longitude-latitude format (EPSG:4326).
- orientation - the winding order of polygon coordinates, clockwise or counterclockwise.
- edges - indicates how to interpret the edge between two points, whether it's a cartesian straight line or follows the spherical surface; possible options are planar and spherical.
- bbox - the bounding box of each geometry column. It can speed up reads by pruning to only the parquet files that fall within a specified bounding box, which is especially useful for queries between huge datasets. With multiple geometry columns, a pre-calculated bounding box for every column can improve spatial join performance for complex geometry objects (e.g. polygons with a lot of coordinates). Bbox is an array with the min and max coordinates for each dimension.
- epoch - the time when the coordinates were measured; it is important for points measured on a surface which can change over time.
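To make the structure concrete, below is a rough sketch (in Python, with purely illustrative values) of what the "geo" key-value metadata described above could look like for a file with a single point column:

geo_metadata = {
    "version": "0.4.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_type": "Point",
            "crs": "<PROJJSON object>",          # optional; longitude-latitude assumed if absent
            "orientation": "counterclockwise",   # optional
            "edges": "planar",                   # optional
            "bbox": [-124.4, 32.5, -114.1, 42.0],  # [xmin, ymin, xmax, ymax]
        }
    },
}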
Geoparquet features
- Multiple spatial reference systems - it’s important to keep the information about coordinate reference systems in the file to be able to compare different data records.
- Multiple geometry columns - the specification defines a default (primary) geometry column and allows additional ones.
- Great compression / small files - the huge disadvantage of shapefiles, GeoJSON files or KML is their size; parquet reduces file size.
- Planar and spherical coordinates - geoparquet aims to support both, as well as 2D and 3D data.
- Great at read-heavy analytic workflows - columnar data formats like parquet allow cheap reads of a subset of columns and filtering based on column statistics.
- Support for data partitioning - parquet allows data splitting to increase parallelism, geoparquet aims to create geospatial partitions to efficiently load data from the data lake.
- Enable spatial indices - for the best possible performance, spatial indexes are essential.
The roadmap of the geoparquet project
The Geoparquet project aims to provide a unified data format within months, not years; the official repository was created in August 2021.
0.1 - basics established, target for implementations
0.2 / 0.3 - First feedback, 3D coordinates support, geometry types, crs optional.
0.x - spatial indexes etc.
1.0.0-RC.1 - 6 implementations which work interoperably.
1.0.0 - The first version will be released once 12 working implementations are available.
Current implementations
Packages which currently implement geoparquet
- Geopandas (Python)
- Sfarrow (R)
- GDAL/OGR (C++, wrappers to many languages like Python)
- GeoParquet.jl (Julia)
Geopandas example code (Python)
Let’s analyze some points of interest data from California (downloaded as shapefiles from Geofabrik).
As you can see in the screenshot, the shapefiles representing northern California total around 13 MB. When loaded into a geopandas dataframe, the data has 82,425 records and 5 columns.
Let’s save the data directly, without any changes, to the geoparquet format as one file. To do so we can use the geopandas implementation of geoparquet (v0.4 as of the time of writing this blog).
To save to geoparquet we need one line of code (assuming that our dataset is named gdf)
gdf.to_parquet(
"ncal_points.geoparquet"
)
We have already reduced the size of the file to 2.7 MB. Additionally, reads are faster, and you can easily select only those columns you want to analyze.
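For example, with geopandas you can read back just a subset of the columns (a minimal sketch; "fclass" and "name" are hypothetical attribute columns, and the geometry column has to be included):

import geopandas as gpd

# Load only the columns needed for the analysis instead of the whole file
points = gpd.read_parquet(
    "ncal_points.geoparquet",
    columns=["fclass", "name", "geometry"],
)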
Now we can look at the file metadata.
We can see information about the primary geometry column: its crs, geometry type and bbox. From a query standpoint, the bbox field is really important, because we can easily select only those parquet files which we need for our spatial operations, like a spatial join.
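One way to inspect this metadata yourself is with pyarrow, since geopandas stores it under the "geo" key of the parquet schema metadata (a small sketch):

import json
import pyarrow.parquet as pq

schema = pq.read_schema("ncal_points.geoparquet")
geo = json.loads(schema.metadata[b"geo"])

print(geo["primary_column"])
print(geo["columns"]["geometry"]["geometry_type"])
print(geo["columns"]["geometry"]["bbox"])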
Now let’s split the area into smaller rectangles and create parquet files based on those.
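A minimal sketch of such a split (a 2x2 grid over the dataset's total bounds; the grid size and file names are just for illustration):

import numpy as np

# total_bounds gives [xmin, ymin, xmax, ymax] for the whole dataset
xmin, ymin, xmax, ymax = gdf.total_bounds
xs = np.linspace(xmin, xmax, 3)
ys = np.linspace(ymin, ymax, 3)

for i in range(2):
    for j in range(2):
        # .cx selects rows whose geometries intersect the coordinate slice
        part = gdf.cx[xs[i]:xs[i + 1], ys[j]:ys[j + 1]]
        if len(part) > 0:
            part.to_parquet(f"ncal_points_{i}_{j}.geoparquet")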
Target file sizes look like those below. We can consider other data splits to distribute data evenly, but for the purpose of this blog let’s keep it simple.
One partition is completely empty, but for the other 3 the bbox looks like this:
'bbox': [-123.67493, 35.790856, -120.0241473, 38.8988234]
'bbox': [-124.3963383, 38.8997457, -120.0221298, 42.0085916]
'bbox': [-120.0214167, 38.9019867, -119.9051073, 41.99494]
Splitting the data into smaller files enables techniques like data skipping, which can improve query performance tremendously. The geoparquet format also aims to have spatial indexes implemented, which can further drastically reduce the time needed to analyze geospatial data at scale.
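As a rough illustration of the data-skipping idea, a reader can check each partition's bbox metadata and load only the files that intersect the query window (the file names and query window below are illustrative, following the split above):

import json
import os
import geopandas as gpd
import pandas as pd
import pyarrow.parquet as pq

query = (-121.0, 37.0, -120.0, 38.0)  # xmin, ymin, xmax, ymax of the area of interest
files = [f"ncal_points_{i}_{j}.geoparquet" for i in range(2) for j in range(2)]

def bbox_intersects(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

selected = []
for path in files:
    if not os.path.exists(path):
        continue  # the empty partition was never written
    geo = json.loads(pq.read_schema(path).metadata[b"geo"])
    if bbox_intersects(geo["columns"][geo["primary_column"]]["bbox"], query):
        selected.append(path)

# Only the matching partitions are actually read from storage
gdf_subset = pd.concat([gpd.read_parquet(p) for p in selected])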
Is Geoparquet the future of geospatial data processing at scale?
The geospatial community needs one unified data format for storing and processing big geospatial data workloads. Currently, many data engineers struggle with integrating geospatial data across systems: WKT and WKB columns are often treated as interchangeable processing units between systems, yet in some cases they cannot be handled by the other system (e.g. Redshift). A lot of geospatial work is not performed optimally due to the lack of geospatial support in popular big data formats like parquet and ORC. It’s common that the data is loaded and only then filtered to a specific extent to perform spatial joins, which is not an optimal approach. With geoparquet, a lot of computation power and data storage can be saved by storing big geospatial data efficiently, and frameworks can build data loading optimizations (skipping, filtering, indexing) on top of it to perform geospatial queries even faster. There is still a lot of work to do before the geoparquet format is accepted by the community, widely used and production ready, but this is exactly what geospatial data processing at scale needs.