Building Bridges: Arrow, Parquet, and Geospatial Computing

May 17, 2022 18 min read

There’s lots of cool chatter about the potential for Apache Arrow in geospatial and for geospatial in Apache Arrow. Some recent blog posts have touched on some of the opportunities that are unlocked by storing geospatial vector data in Parquet files (as opposed to something like a Shapefile or a GeoPackage), like the ability to read directly from JavaScript on the web, the ability to register a Parquet file as a virtual table in Postgres/PostGIS, and the general support for Parquet in cloud-based data providers. Most recently, Even Rouault submitted an RFC to GDAL proposing a native Arrow C Data interface to the OGR drivers (which power most of the “read this file” capability of modern GIS systems).

The Arrow project is a couple of things. First and foremost, it’s an in-memory data format that’s built to take advantage of how modern computers are built. Instead of dripping one value at a time, the Arrow format makes it easy to operate on data in “chunks”, which is often much faster and more easily parallelized than operating on a single row or value at a time. Secondly, it’s a C++ library with bindings in Python and R that implements reading and writing of file formats that follow the “data in chunks” philosophy (like Parquet). There are also implementations in Java, Go, JavaScript, Julia, Rust, MATLAB, and an ABI-stable C Data interface enabling zero-copy transfer among them all.

If you do anything with geospatial data, you should care about Arrow because Arrow is fast. You should care about GeoParquet because it follows the Arrow data storage philosophy and, therefore, is fast.

As you can imagine, that statement needs qualifying. Importantly, when I call Arrow “fast”, I’m not calling anything else slow. Many file formats are fast; GDAL is fast; QGIS is fast; the geopandas package for Python is fast; the sf package for R is fast.

This entire post is about what happens when your data flows from one of those things to another—which happens many times every time you do work with geospatial data—and how Arrow—all the Arrows—can make that faster, for everybody.

Using Arrow, Parquet, and Geospatial today

A lot of what I’m about to write will refer to brand new or in-development functionality. That said, if you want to use Parquet to store geospatial information today, you can do it two ways. In R, you can read Parquet with a geometry column to sf like this:

library(geoarrow)
# curl::curl_download(
#   "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.parquet",
#   "nshn_water_line.parquet"
# )
# curl::curl_download(
#   "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
#   "nshn_water_line.gpkg"
# )
(water_line <- read_geoparquet_sf("nshn_water_line.parquet"))
#> Simple feature collection with 483268 features and 33 fields
#> Geometry type: MULTILINESTRING
#> Dimension:     XYZ
#> Bounding box:  xmin: 215869.1 ymin: 4790519 xmax: 781792.9 ymax: 5237312
#> z_range:       zmin: -41.732 zmax: 529.8
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia
#> # A tibble: 483,268 × 34
#>    OBJECTID FEAT_CODE ZVALUE PLANLENGTH  MINZ  MAXZ LINE_CLASS FLOWDIR
#>       <int> <chr>      <dbl>      <dbl> <dbl> <dbl>      <int>   <int>
#>  1        2 WACO20       0.2      280.    0.2   0.2          3       0
#>  2        3 WACO20       0.2      185.    0.2   0.2          3       0
#>  3        4 WACO20       0.2      179.    0.2   0.2          3       0
#>  4        5 WACO20       0.2     1779.    0.2   0.2          3       0
#>  5        6 WACO20       0.2      470.    0.2   0.2          3       0
#>  6        7 WACO20       0.2       57.7   0.2   0.2          3       0
#>  7        8 WACO20       0.2      223.    0.2   0.2          3       0
#>  8        9 WACO20       0.2       89.3   0.2   0.2          3       0
#>  9       10 WACO20       0.2      315.    0.2   0.2          3       0
#> 10       11 WACO20       0.2      799.    0.2   0.2          3       0
#> # … with 483,258 more rows, and 26 more variables: LEVELPRIOR <int>,
#> #   LAKEID_1 <chr>, LAKENAME_1 <chr>, LAKEID_2 <chr>, LAKENAME_2 <chr>,
#> #   RIVID_1 <chr>, RIVNAME_1 <chr>, RIVID_2 <chr>, RIVNAME_2 <chr>,
#> #   MISCID_1 <chr>, MISCNAME_1 <chr>, MISCID_2 <chr>, MISCNAME_2 <chr>,
#> #   MISCID_3 <chr>, MISCNAME_3 <chr>, MISCID_4 <chr>, MISCNAME_4 <chr>,
#> #   MISCID_5 <chr>, MISCNAME_5 <chr>, MISCID_6 <chr>, MISCNAME_6 <chr>,
#> #   MISCID_7 <chr>, MISCNAME_7 <chr>, HID <chr>, SHAPE_LEN <dbl>, …

…and you can write a Parquet file with a geometry column like this:

write_geoparquet(water_line, tempfile(fileext = ".parquet"))

Both of these are currently faster than the corresponding sf invocations using the GeoPackage format:

bench::mark(
  read_geoparquet_sf("nshn_water_line.parquet"),
  sf::read_sf("nshn_water_line.gpkg"),
  check = FALSE
) %>% 
  select(expression, min, median)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 3
#>   expression                                         min   median
#>   <bch:expr>                                    <bch:tm> <bch:tm>
#> 1 read_geoparquet_sf("nshn_water_line.parquet")    3.27s    3.27s
#> 2 sf::read_sf("nshn_water_line.gpkg")             13.81s   13.81s

bench::mark(
  write_geoparquet(water_line, tempfile(fileext = ".parquet")),
  sf::write_sf(water_line, tempfile(fileext = ".gpkg")),
  check = FALSE
) %>% 
  select(expression, min, median)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 3
#>   expression                                                        min   median
#>   <bch:expr>                                                   <bch:tm> <bch:tm>
#> 1 write_geoparquet(water_line, tempfile(fileext = ".parquet"))    2.21s    2.21s
#> 2 sf::write_sf(water_line, tempfile(fileext = ".gpkg"))          12.36s   12.36s

In Python, geopandas can also read/write GeoParquet much faster than the equivalent pyogrio equivalent¹.

import pyogrio
import geopandas
water_line = pyogrio.read_dataframe("nshn_water_line.gpkg")
%timeit water_line.to_parquet("temp.parquet")
# 3.27 s ± 42.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pyogrio.write_dataframe(water_line, "temp.gpkg")
# 10.2 s ± 45.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can also use GDAL’s GeoParquet driver to read and write geospatial data like that written by geoarrow::write_geoparquet(), although until binary packages catch up, you will probably have to build GDAL and Arrow yourself or use the osgeo/gdal:latest Docker image.

Reading files

The first time your data flows from one place to another is when you read a file. A lot of workflows start with geopandas.read_file() in Python or sf::read_sf() in R. For some reasons I’ll go into in a second, pyarrow.parquet.read_pandas() and arrow::read_parquet() are much faster than the usual alternative:

import pyogrio
import pyarrow.parquet as parquet
%timeit parquet.read_pandas("nshn_water_line.parquet")
# 1.61 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pyogrio.read_dataframe("nshn_water_line.gpkg")
# 7.71 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

library(arrow)
library(geoarrow)

bench::mark(
  read_parquet = arrow::read_parquet("nshn_water_line.parquet"),
  read_sf = sf::read_sf("nshn_water_line.gpkg"),
  iterations = 5,
  check = FALSE
) %>% 
  select(expression, min, median)

#> # A tibble: 2 × 3
#>   expression        min   median
#>   <bch:expr>   <bch:tm> <bch:tm>
#> 1 read_parquet    1.56s    1.69s
#> 2 read_sf         6.94s    7.08s

The Arrow Parquet readers in Python and R are faster for more or less the same reason: the underlying C++ that reads the files is effectively parallelized and the conversion to Pandas/R vectors is optimized to be a zero-copy operation in most cases. In R this is cheating a little bit because it means that when the values do have to be accessed that the operation is slower; however, in many cases a lot of those rows or columns will never see the light of day so the gamble is worth it (I’m looking at you, MISCNAME_7).

All benchmarks should be taken with a grain of salt including this one, which is just one high-level operation on one data set on one computer. Importantly, there are lots of tricks that you can use once loading everything into memory becomes impossible or annoying. There’s dask-geopandas; there’s passing SQL into GDAL to select only some features or columns; there’s PostGIS for when you need the full Postgres stack and the full OSGeo stack combined; and, as we’ll see in a moment, Arrow has some tricks of its own.

All of those techniques might save you analysis time but they take valuable thinking time to get started, particularly the first time you have to learn them. Using GeoParquet + pyarrow.parquet.read_pandas()/arrow::read_parquet() might mean your data have to be a lot bigger before you spend valuable thinking time on alternatives.

Querying a multi-file dataset

Before I start delving into some of the nerdier points about the bits and bytes of storing geometry in Arrow Arrays, I want to highlight a particularly slick workflow for querying multiple (Geo)Parquet files at once. In R, it looks like this:

library(dplyr)

bucket <- s3_bucket("voltrondata-public-datasets")
ds <- open_dataset(bucket$path("phl-parking"))
ds %>% 
  select(issue_datetime, violation_desc, fine, geometry) %>% 
  filter(fine > 300) %>%
  collect()
#> # A tibble: 168,959 × 4
#>    issue_datetime      violation_desc       fine geometry                
#>    <dttm>              <chr>               <dbl> <grrw_pnt>              
#>  1 2011-12-31 22:03:00 HP RESERVED SPACE     301 POINT (484642.5 4419268)
#>  2 2011-12-31 20:52:00 UNREG/ABANDONED VEH   301 POINT (487284.6 4421182)
#>  3 2015-05-15 16:52:00 HP RESERVED SPACE     301 POINT (486607.1 4422361)
#>  4 2015-04-14 11:06:00 HP RESERVED SPACE     301 POINT (485989.5 4422722)
#>  5 2015-04-25 17:46:00 HP RESERVED SPACE     301 POINT (485622.7 4422203)
#>  6 2015-05-21 10:15:00 HP RESERVED SPACE     301 POINT (486690.4 4422851)
#>  7 2015-06-23 08:21:00 HP RESERVED SPACE     301 POINT (487488.5 4422247)
#>  8 2015-06-25 07:27:00 HP RESERVED SPACE     301 POINT (485329.5 4423043)
#>  9 2015-04-22 17:59:00 HP RESERVED SPACE     301 POINT (485291.3 4422285)
#> 10 2015-04-13 09:57:00 UNREG/ABANDONED VEH   301 POINT (486285.5 4426662)
#> # … with 168,949 more rows

Here ds is an Arrow Dataset made up of a few files (here, one for each year):

ds$files
#> [1] "phl-parking/year=2011/part-0.parquet"
#> [2] "phl-parking/year=2012/part-0.parquet"
#> [3] "phl-parking/year=2013/part-0.parquet"
#> [4] "phl-parking/year=2014/part-0.parquet"
#> [5] "phl-parking/year=2015/part-0.parquet"
#> [6] "phl-parking/year=2016/part-0.parquet"
#> [7] "phl-parking/year=2017/part-0.parquet"

Users familiar with GDAL’s SQL capability will note that you can do something similar there (e.g., using the sql argument in sf::read_sf() or pyogrio.read_dataframe()), although doing so with multiple files that might be stored remotely (in this example, on S3 storage) requires some manual engineering². Arrow’s compute engine support provides functionality well beyond the SQLite/OGR SQL dialects, providing support for many advanced datetime and string operations. Geometry column support will hopefully come soon as well so that you’ll be able to leverage the Arrow query engine’s built-in parallelization and lazy streaming out of the box.

Bridging packages, bridging languages

The relatively simple example of “read this file” involves data flow across an important boundary: the boundary between GDAL and GDAL’s language binding (for this example in R that’s sf; for this example in Python it’s pyogrio). Both do something along the lines of: for each feature, loop over each column and convert the value from the GDAL’s representation of the value into R/Python’s representation of the value. For geometry, this means that the language binding has to parse well-known binary into something else: in Python that’s a pointer to a GEOS geometry, in R it’s a list() where each element is some combination of list() and matrix(). Before it even gets to that point, each value has been read from a file somewhere and converted into GDAL’s representation of the value.

That’s two more layers of conversion than will happen with the Parquet read, where the data is stored on disk in what is more-or-less already Arrow format³, and where the R/Python object from which you access it is just a shell around Arrow’s C++ representation. The GDAL-to-R/Python conversion is particularly expensive; for example, if we read the GeoPackage directly into an Arrow Table, we see a speed bump over sf::read_sf().

bench::mark(
  read_gpkg_table = as.data.frame(gpkg::read_gpkg_table("nshn_water_line.gpkg")),
  read_sf = sf::read_sf("nshn_water_line.gpkg"),
  check = FALSE,
  iterations = 2
)

#> # A tibble: 2 × 3
#>   expression           min   median
#>   <bch:expr>      <bch:tm> <bch:tm>
#> 1 read_gpkg_table    1.21s    1.31s
#> 2 read_sf             7.6s    8.47s

Sure, we could build more of that zero-copy logic into sf and geopandas, but why should we maintain it for every language binding/runtime out there when we already have an Arrow language binding in pretty much every language? There’s a lot of code to make this kind of improvement available for every data format, but it’s a good demonstration of how using Arrow’s representation can let developers and users leverage the growing family of general-purpose data manipulation tools built on Arrow.

The data conversion problem arises from an important constraint when linking tools together: ABI stability. GDAL provides a C API with strong backward compatibility so that tools (like the R/Python language bindings) can be written on top of it without having to build the underlying library. This is the programmatic equivalent of squeezing a large and powerful library through a very small hole. The constraints on a C API are rather harsh and force software that’s built on top of them to adopt a specific set of types defined in that specific C API header: if you build a converter for GDAL’s OGR geometry API to R objects, you can only use it for converting OGR geometry to R objects. Your work might inspire a Python developer, but they can’t reuse your code.

Enter the Arrow C Data interface. This interface is the ABI-stable C-language materialization of the Arrow columnar format and consists of 3 struct definitions that take up a few dozen lines of C. As a developer, you can copy them into any project and produce/consume data from any library that has done the same. The prototype gpkg reader I used in the above benchmark was built using the Arrow C Data interface and takes advantage of the extensive engineering work on the Arrow R language bindings implementing the zero-copy Arrow–R conversion; the GDAL RFC I mentioned above is able to provide massive speed gains for some formats by allowing individual drivers to export Arrow Arrays directly without going through the current rowwise OGR API.

This doesn’t just apply to big compiled libraries like GDAL, PROJ, and GEOS, it’s also a problem within package ecosystems. As a developer it’s hard to control the build environment of your users/package distributors, so you can’t just pass any data structure between packages and assume the other package will interpret it correctly. The sf package handles this by exclusively using R data structures to represent geometry, which are ABI stable and can be interpreted by any other R package. This is great for the R package ecosystem, but doesn’t help a C++/Python developer who wants to make a computational tool available in R. The typical approach is to convert to well-known binary, do a computation, and convert back again (i.e., the GDAL reading problem all over again).

I’m less familiar with how this is done in the Python ecosystem, although I gather that the primary storage type is a pointer to a GEOS geometry (a wrapper around a C++ type), where interpreting the pointer from another package can be problematic depending on how much one can trust that the underlying library was compiled with the same version/flags.

It sounds like a niche problem but the language/package bridge problem has an important consequence: instead of computational tools (and their associated maintenance load) being spread across packages and languages, existing computational tools are forced to duplicate more and more functionality unless they want to be slow. A current example of this is the terra package for R, which duplicates almost all of the sf/stars stack. It’s often faster because it makes better use of sharing C++ objects, at the expense of any would-be extension developer who can’t maintain a tool separately from terra without one or more (probably) slow conversion steps.

In the OSGeo world, QGIS, GRASS, GDAL, SAGA, lwgeom, Whitebox and others all maintain copies of a lot of format drivers and algorithms. I’m a relatively new to the open source GIS world, but I have seen all of those libraries add functionality over time. Without the history it’s hard to speculate on why that is, but I suspect that it’s because the bridges between them are either slow or difficult to maintain. I think that with Arrow we can build those bridges to be faster and stronger.

The GeoArrow Specification

Arrow defines a rich set of types to encode vectors of pretty much anything…integers, doubles, strings, dates, times, nested lists, and more. It doesn’t define an encoding for a geometry, but there is an in-development set of extension types to do just that. This means we can leverage the ABI-stability of the Arrow C Data interface for sending geometries among languages, packages, and runtimes. The first draft memory layout was just finalized and we should have a formalized set of metadata fields soon for encoding things like CRS and spherical/planar edges. (All of these are prototyped in the geoarrow R package prototype, the geoarrow Python package prototype and the GDAL driver so that we can get community feedback before finalizing the definitions).

I think the best way to demonstrate the format is to write a function! Let’s see how fast we can compute the length of all of the features in the water_lines dataset:

#include 
#include "narrow.h"

SEXP c_multilinestring_length(SEXP array_data_xptr) {
  struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");
  
  int coord_size = 3;
  double* coords = (double*) array_data->children[0]->children[0]->children[0]->buffers[1];
  int32_t* coord_offsets = (int32_t*) array_data->children[0]->buffers[1];
  int32_t* geom_offsets = (int32_t*) array_data->buffers[1];
  
  geom_offsets = geom_offsets + array_data->offset;
  coord_offsets = coord_offsets + geom_offsets[0];
  coords = coords + coord_offsets[0];
  
  double* coord0;
  double* coord1;
  int32_t* coord_offset;
  double feature_length, dx, dy;
  SEXP output_sexp = PROTECT(Rf_allocVector(REALSXP, array_data->length));
  double* output = REAL(output_sexp);
  
  for (int64_t feat_id = 0; feat_id < array_data->length; feat_id++) {
    feature_length = 0;
    int32_t n_geoms = geom_offsets[feat_id + 1] - geom_offsets[feat_id];
    coord_offset = coord_offsets + geom_offsets[feat_id];
    
    for (int32_t geom_id = 0; geom_id < n_geoms; geom_id++) {
      int32_t n_coords = coord_offset[geom_id + 1] - coord_offset[geom_id];
      if (n_coords < 2) {
        continue;
      }
      
      for (int32_t point_id = 1; point_id < n_coords; point_id++) {
        coord0 = coords + (coord_size * (coord_offset[geom_id] + point_id - 1));
        coord1 = coords + (coord_size * (coord_offset[geom_id] + point_id));
        dx = coord1[0] - coord0[0];
        dy = coord1[1] - coord0[1];
        
        feature_length += sqrt(dx * dx + dy * dy);
      }
    }
    
    output[feat_id] = feature_length;
  }
  
  UNPROTECT(1);
  return output_sexp;
}

I’m not saying this is the most readable bit of C anybody has ever written…there’s a lot of index math that would be well-served by some helper functions. The point is that we’re storing this array of geometries using 3 buffers: one for the coordinates themselves (xyzxyzxyzxyz…), one for the offsets into the coordinate array for each linestring, and one for the offsets into the linestring array for each multilinestring. The exact arrangement of these for the 6 main geometry types is detailed in the memory layout spec and follows this pattern of a single coordinate buffer while providing offsets into the various levels of nesting.

This tiny bit of code also happens to be faster than the other ways I know how to do this in R:

multilinestring_length <- function(x) {
  x <- narrow::as_narrow_array(as_geoarrow(x))
  .Call("c_multilinestring_length", x$array_data)
}

water_line_arrow <- as_geoarrow(water_line)
water_line_geos <- geos::as_geos_geometry(water_line)
water_line_sfc <- water_line$geometry

bench::mark(
  length_arrow = multilinestring_length(water_line_arrow),
  length_geos = geos::geos_length(water_line_geos),
  length_sf = units::drop_units(sf::st_length(water_line_sfc)),
  iterations = 5
)
#> # A tibble: 3 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 length_arrow  19.46ms  19.62ms    28.2      3.69MB     0   
#> 2 length_geos   48.53ms  48.82ms    16.0      3.69MB     0   
#> 3 length_sf       2.06s    2.06s     0.486   568.5MB     1.95

In particular, it’s faster than the sf version because sf converts to the GEOS representation to do the calculation, which in this case happens to take a while because it’s a fairly big dataset. There’s no reason why somebody couldn’t implement an sf-native length calculation in another R package, although it wouldn’t help users of any other language unless they want to spin up an R interpreter, serialize their geometries to WKB, and create the sf object. With the Arrow representation we get the ability to pass the object between languages and packages, with the added bonus that it’s memory layout lends itself to fast calculations.

What next?

I hope the takeaway here is that the Arrow project—all the components—have a lot to offer geospatial. The Arrow C++ library can read files really fast and can simplify the process of reading from many files hosted locally or remotely. The Arrow R and Python bindings are really good at converting Arrow Arrays into R and Python objects, and the Arrow C Data interface provides a means by which components can send data to each other, frequently without copying. Finally, the columnar “data in chunks” philosophy enables parallelization and other performance gains that are just not possible with a rowwise approach.

In this post I’ve referred to a lot of prototyped/in-development code and specifications. There’s a lot of potential for geospatial in Arrow and for Arrow in geospatial, but there’s also a lot of work to do. Here’s where some of that work is happening:

geopandas/geoarrow-spec Developing a specification for encoding geometries using the Arrow Columnar format
opengeospatial/geoparquet Developing a specification for encoding geometries specific to the Parquet file format
apache/arrow Improving extension type support and the extensibility of the query engine for future geospatial functions
osgeo/gdal Implementing the Arrow and Parquet drivers, which include support for the GeoArrow and GeoParquet specifications in various forms
python-geoarrow Implementing the GeoArrow extension types using the Arrow Python bindings
paleolimbot/geoarrow Implementing the GeoArrow extension types using the Arrow R bindings
paleolimbot/geoarrow-cpp A small C++ library that can convert between GeoArrow-encoded arrays and WKB, WKT, or other representations (powers the R package)
paleolimbot/geoarrow-data Data files encoded in GeoArrow/GeoParquet format for testing and benchmarking

The pyogorio package is the forthcoming GDAL/OGR interface for geopandas and is much faster than the previous iteration based on the fiona package (hence its use in benchmarks here). ↩︎
As of this writing, Dataset support was just added into GDAL development as part of the Parquet driver, which means that some of this, minus the Arrow-native querying, will be possible through GDAL driver in the future. ↩︎
If you want to store your data exactly in Arrow format on disk you can use the Feather file format, which is faster if you want to read the whole thing, but potentially slower if you want to query it (one of the features of Parquet is the ability to store column statistics so that the reader can skip reading entire chunks of the file if it knows about the user-provided filter). ↩︎