Enabling Hive on Spark on CDH 5.14 — a few problems (and solutions)
Recently I’ve had an opportunity to configure CDH 5.14 Hadoop cluster of one of GetInData’s customers to make it possible to use Hive on Spark…
Read moreIn recent years, a lot of geospatial frameworks have been created to process and analyze big geospatial data from various data sources. A lot of them struggled with a unified data format which can be distributed across many machines. The most popular geospatial data format so far is shapefile but it has many drawbacks, such as:
Geojson can store nested fields but it is not distributed. There was an approach to store geojson as separate lines to enable distributing these files:
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [21.022516, 52.207257] }, "properties": { "city": "Warszawa"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [19.511426, 51.765019] }, "properties": { "city": "Łódź"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [2.121482, 41.381212] }, "properties": { "city": "Barcelona"}}
Nevertheless it is not compressed and stored as plain text, so as you can see, it's size is huge. Additionally, geojson only allows you to store one geometry field at a time and it doesn't support spatial indexes.
There is also geopackage which is an SQLLite container, its size is less than shapefile and it’s relatively new but it’s not constructed for distributed systems.
In geospatial big data frameworks like Geomesa and Apache Sedona (incubating), the community uses parquets a lot for storing geospatial data, but in many cases it uses wkb only or wkt to store geospatial data objects, without any improvements in the case of metadata or indexing techniques.
Apache parquet is an open source columnar data storage format which stores data for efficient loads, it compresses well and decreases data size tremendously. It has API for languages like Python, Java, C++ and more and is well integrated with Apache Arrow. For more information on this, please refer to the official website of apache parquet https://parquet.apache.org/docs/.
Let’s take a look at the biggest benefits of using parquet:
Parquet is a great data format for storing complex huge amounts of data, but it is missing geospatial support, so that’s where the idea of geoparquet came about.
We went through the advantages of geoparquet, now let’s look at file specification.
Geoparquet format specification
The key components of geoparquet are:
Column metadata
Geometry columns are stored in wkb (well known binary) data format. Wkb translates a geometry object into an array of bytes.
The first two bytes indicate the byte order
00 - means big endian
01 - means little endian
Next is to four-code the geometry type (which also includes the geometry dimension) ie the 2D point is represented by 0001. The last 16 bytes represent x and y coordinates (8 byte floats) for points.
POINT(50.323281, 19.029889)
01 | 1 0 0 0 |-121 53 -107 69 97 41 73 64 | -103 -126 53 -50 -90 7 51 64
For more details please visit https://libgeos.org/specifications/wkb/
* It’s common that a geometry column in wkt or wkb format has a lot of points and exceeds the redshift column length.
* This blog was written when version 0.4 was specified.
Version - the version of the geoparquet file at the time of writing this article was geoparquet 0.4.0
Primary_column - the primary geometry column name, some systems allow for the storage of multiple geometry columns and require the default ones.
Columns - a map of geometry column names and its metadata.
The Geoparquet project aims to provide a unified data format within months, not years; the official repository was created in August 2021.
0.1 - basics established, target for implementations
0.2 / 0.3 - First feedback, 3D coordinates support, geometry types, crs optional.
0.x - spatial indexes etc.
1.0.0-RC.1 - 6 implementations which work interoperably.
1.0.0 - The first version will be released once 12 working implementations are available.
Packages which currently implement geoparquet
Let’s analyze some points of interest data from California (downloaded as shape files geofabrik).
As you can see on screen image there, the shape files representing northern California in total have around 13 MB. When the data is loaded onto the geopandas dataframehas 82425 records and 5 columns.
Let’s directly, without any changes, save the data to geoparquet data format as one file. To do so we can use geopandas implementation of geoparquet ( v0.4 as of the time of writing this blog).
To save to geoparquet we need one line of code (assuming that our dataset is named gdf)
gdf.to_parquet(
"ncal_points.geoparquet"
)
We have already reduced the size of the file to 2.7 MB. Additionally the read time is faster, including the fact that you can easily select only those columns you want to analyze.
Now we can look at the file metadata.
We can see information about the primary geometry column, its crs, geometry type and bbox. From a query standpoint, the bbox field is really important, because we can easily select only those parquet files which we need to do our spatial operations, like spatial join.
Now let’s split the area into smaller rectangles and create parquet files based on those.
Target file sizes look like those below. We can consider other data splits to distribute data evenly, but for the purpose of this blog let’s keep it simple.
One partition is totally empty, but for other 3 the bbox looks like this:
'bbox': [-123.67493, 35.790856, -120.0241473, 38.8988234]
'bbox': [-124.3963383, 38.8997457, -120.0221298, 42.0085916]
'bbox': [-120.0214167, 38.9019867, -119.9051073, 41.99494]
The data split into smaller files allows for techniques like data skipping which can improve query performance tremendously. Also, geoparquet format aims to have indexes implemented which can also drastically reduce the amount of time needed to analyze geospatial data at scale.
The Geospatial community needs one unified data format for storing and processing big geospatial data workloads. Currently, many data engineers struggle with integrating geospatial data across systems, often wkt and wkb columns are treated as interchangeable processing units between systems, yet in some cases can not be handled by other systems (ex. Redshift ). A lot of geospatial work is not performed optimally due to a lack of geospatial support in popular big data formats like parquet and orc. It’s common that the data is loaded and then filtered to a specific extent to perform spatial joins, which is not an optimal approach. With geoparquet, a lot of computation power and data storage can be saved by efficiently storing big geospatial data. A lot of frameworks create data loading optimization (skipping, filtering, indexing) to perform geospatial queries even faster. There is a lot of work to do to make the geoparquet format acceptable by the community and be widely used and production ready, but this is exactly what geospatial data processing at scale needs.
Recently I’ve had an opportunity to configure CDH 5.14 Hadoop cluster of one of GetInData’s customers to make it possible to use Hive on Spark…
Read moreOutstanding customer experience is usually backed by robust data analytics. Same applies to Mamava, a business that celebrates and supports…
Read moreWe are proud to present you our first e-book, created by GetInData specialists. Apache NiFi: A Complete Guide is the result of long and fruitful work…
Read moreMachine learning is becoming increasingly popular in many industries, from finance to marketing to healthcare. But let's face it, that doesn't mean ML…
Read moreBig Data Technology Warsaw Summit 2020 is fast approaching. This will be 6th edition of the conference that is jointly organised by Evention and…
Read moreA year is definitely a long enough time to see new trends or technologies that get more traction. The Big Data landscape changes increasingly fast…
Read moreTogether, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.
What did you find most impressive about GetInData?