GDAL 3.6, ADBC, nanoarrow, and the humble ArrowArrayStream

Nov 9, 2022 11 min read

GDAL’s 3.6 release is a huge deal. As with every GDAL release, there are a lot of bugfixes and incremental updates that represent a huge amount of maintainer time and effort. The feature I’m going to sing the praises of, though, is a shiny new feature first proposed by Even Rouault in RFC 86 and whose implementation could make reading vector layers (i.e., sf::read_sf()/geopandas.read_file()) a lot faster. The official name is the column-oriented read API and not only will it make read operations a lot faster, it has the potential to make the lives of spatial package maintainers (R, Python, and otherwise) a lot easier.

I come at this, of course, as a maintainer of several R packages in the r-spatial universe, a contributor to the arrow R package, a huge fan of Arrow Database Connectivity (ADBC), and a contributor to the brand-new in-development nanoarrow library whose vision is that it should be trivial for other libraries to follow in GDAL’s footsteps.

With all that in mind, let’s start with a basic read operation using the existing API in R:

library(sf)
read_sf("nshn_water_line.gpkg")

## Simple feature collection with 483268 features and 33 fields
## Geometry type: MULTILINESTRING
## Dimension:     XYZ
## Bounding box:  xmin: 215869.1 ymin: 4790519 xmax: 781792.9 ymax: 5237312
## z_range:       zmin: -41.732 zmax: 529.8
## Projected CRS: NAD83 / UTM zone 20N
## # A tibble: 483,268 × 34
##    OBJECTID FEAT_CODE ZVALUE PLANLENGTH  MINZ  MAXZ LINE_CLASS FLOWDIR
##       <dbl> <chr>      <dbl>      <dbl> <dbl> <dbl>      <int>   <int>
##  1        2 WACO20       0.2      280.    0.2   0.2          3       0
##  2        3 WACO20       0.2      185.    0.2   0.2          3       0
##  3        4 WACO20       0.2      179.    0.2   0.2          3       0
##  4        5 WACO20       0.2     1779.    0.2   0.2          3       0
##  5        6 WACO20       0.2      470.    0.2   0.2          3       0
##  6        7 WACO20       0.2       57.7   0.2   0.2          3       0
##  7        8 WACO20       0.2      223.    0.2   0.2          3       0
##  8        9 WACO20       0.2       89.3   0.2   0.2          3       0
##  9       10 WACO20       0.2      315.    0.2   0.2          3       0
## 10       11 WACO20       0.2      799.    0.2   0.2          3       0
## # ℹ 483,258 more rows
## # ℹ 26 more variables: LEVELPRIOR <int>, LAKEID_1 <chr>, LAKENAME_1 <chr>,
## #   LAKEID_2 <chr>, LAKENAME_2 <chr>, RIVID_1 <chr>, RIVNAME_1 <chr>,
## #   RIVID_2 <chr>, RIVNAME_2 <chr>, MISCID_1 <chr>, MISCNAME_1 <chr>,
## #   MISCID_2 <chr>, MISCNAME_2 <chr>, MISCID_3 <chr>, MISCNAME_3 <chr>,
## #   MISCID_4 <chr>, MISCNAME_4 <chr>, MISCID_5 <chr>, MISCNAME_5 <chr>,
## #   MISCID_6 <chr>, MISCNAME_6 <chr>, MISCID_7 <chr>, MISCNAME_7 <chr>, …

…and in Python:

import pyogrio
pyogrio.read_dataframe("nshn_water_line.gpkg")

        OBJECTID FEAT_CODE  ...    SHAPE_LEN                                           geometry
0              2    WACO20  ...   280.019087  MULTILINESTRING Z ((328482.026 4837060.779 0.2...
1              3    WACO20  ...   185.330590  MULTILINESTRING Z ((346529.825 4850301.279 0.2...
2              4    WACO20  ...   178.544110  MULTILINESTRING Z ((346689.525 4850232.378 0.2...
3              5    WACO20  ...  1779.161641  MULTILINESTRING Z ((328730.525 4838191.479 0.2...
4              6    WACO20  ...   469.582433  MULTILINESTRING Z ((328454.526 4835592.979 0.2...
...          ...       ...  ...          ...                                                ...
483263    480838  WARVLK59  ...   101.018982  MULTILINESTRING Z ((539704.874 4979211.005 33....
483264    483210    WARV50  ...   413.965206  MULTILINESTRING Z ((531918.900 4981152.347 64....
483265    483215    WARV50  ...   626.079500  MULTILINESTRING Z ((534077.319 4981625.182 75....
483266    483232  WACOIS10  ...    79.876979  MULTILINESTRING Z ((555691.253 4971749.403 0.1...
483267    483246  WACOIS10  ...   119.084324  MULTILINESTRING Z ((564092.019 4971405.182 0.1...

[483268 rows x 34 columns]

If you’re a user, you should be excited by how much faster the column-based API can make the read operation. With the new API comes driver optimizations for GeoParquet, GeoPackage, and FlatGeoBuf: the original RFC document quotes a speedup of 6-10x, depending on the driver, which probably reflects a combination of the Arrow-to-Python converters and the driver improvements that were implemented in tandem with the new API. The PR that will implement this in pyogrio noted an improvement of about 2x over its previous implementation for a GeoPackage file; the PR that implements this in R/sf implements a few other optimizations such that the improvement will probably be more like 3-4x for a similar GeoPackage dataset.

If you have access to the latest distribution of GDAL, you can give this a try yourself! See the end of this post for how to get the test data…it’s about 400,000 line segments of Nova Scotia rivers. Timing our original example from above, we get:

# install.packages("sf")
library(sf)
system.time(
  tbl1 <- read_sf("nshn_water_line.gpkg")
)
#>    user  system elapsed 
#>  20.264   1.355  21.619

…and with a few very specific dependencies, we can time the same thing with the columnar access API:

# ...with GDAL 3.6 and
# remotes::install_github("apache/arrow-nanoarrow/r#65")
# remotes::install_github("paleolimbot/sf@stream-reading")
library(sf)
system.time(
  tbl2 <- read_sf("nshn_water_line.gpkg", use_stream = TRUE)
)
#>    user  system elapsed 
#>   5.960   0.568   5.421

Timing our example from above in Python, we get:

# pip install pyogrio
import pyogrio
pyogrio.read_dataframe("nshn_water_line.gpkg")

…which takes 20.7 seconds on the same machine. With some very specific dependencies, we can do the same thing with the columnar access API:

# ...with GDAL 3.6 and
# pip3 install git+https://github.com/jorisvandenbossche/pyogrio@gdal-arrow-stream
import pyogrio
pyogrio.read_dataframe("some/file.gpkg", use_arrow=True)

For me, this takes 13.2 seconds. (The comparison is not quite the same to the R code…geopandas uses external pointers to GEOS geometries and sf’s representation is much faster to generate but makes subsequent operations that use GEOS slower).

Getting into the weeds: the Arrow C Stream

If you are a spatial package developer, you should be excited for a completely different reason: the new API returns an ArrowArrayStream (one of three copyable struct definitions in the Arrow C Data and Arrow C Stream interfaces) that reads the layer one chunk at a time (you can pick how many rows are in a chunk) and returns it in a well-understood, well-documented, and well-supported format for consumption. If you’re a developer who has ever dabbled in reading an OGR layer yourself from C++, you might be excited to see that your implementation in compiled code can now be considerably simpler:

#include <ogr_api.h>
#include <ogrsf_frmts.h>
// GDALDataset* poDS = GDALDataset::Open("path/to/file.gpkg");
int read_ogr_stream(GDALDataset* poDS, struct ArrowArrayStream* stream) {
  OGRLayer* poLayer = poDS->GetLayer(0);
  OGRLayerH hLayer = OGRLayer::ToHandle(poLayer);
  return OGR_L_GetArrowStream(hLayer, stream, nullptr);
}

With the help of the arrow R package, you can consume the layer into a data.frame() with a few lines:

library(arrow)
stream <- arrow:::allocate_arrow_array_stream()
stopifnot(read_ogr_stream(some_gdal_dataset, stream))
reader <- RecordBatchReader$import_from_c(stream)
as.data.frame(reader)

Outside of R, of course, there are implementations in Python, Java, Rust, JavaScript, Julia, and many other languages: if you’re willing to take on the dependency, you get a lot of tools at your disposal for free. If you’re not willing to take on the dependency and want to work with the ArrowArray objects directly, the code you just wrote to consume your favourite GDAL layer now immediately applies to any library that has a C interface based on the ArrowArray/ArrowArrayStream, which includes (you guessed it!) Arrow implementations in just about every language plus independent projects like DuckDB and Arrow Database Connectivity (ADBC).

It’s worth taking a look at the existing implementations of the humble read operation, which ranges from “a little verbose” to “horrifyingly complex”. The complexity mostly has to do with type support: in addition to the geometry column that we know and love, we often care as much or more about the attributes which can be any one of a dozen or so types. A cursory review of some implementations revealed that the “convert from GDAL’s C++ representation to some user-accessible form” (e.g., an R data.frame() or Python/pandas DataFrame/Series/numpy Array) has been implemented a lot of times: sf, vapour, rgdal, terra, Fiona, pyogrio, and I’m pretty sure it’s in GDAL’s built-in Python bindings too I just couldn’t find it quickly.

That isn’t to say the duplication is bad…implementations can learn from each other and are all built around different constraints. Still, if you’re a developer and you need to read a GDAL/OGR layer, there’s a lot of boilerplate to support (or not) all of the field types and it’s not clear what access pattern is best (e.g., Will iterating over columns or iterating over rows be fastest? Is reading geometries separately from attributes going to be slow?).

The result of that from a user standpoint is that some of the less common types like int64 have varying degrees of support: depending on what library you use to read a layer in R, you might get a double(), an integer(), or a character(). Some libraries warn for out-of-range values; some do not. There’s a lot of variability in end-user flexibility and performance that in most cases has nothing to do with the maximum theoretical flexibility or speed of the OGR driver itself.

With an Arrow-based API, the type conversion problem doesn’t go away but it does open a future where (1) those type conversions don’t need to happen at all because the data never leaves the Arrow format or (2) the code implementing those type conversions can be written fewer times because conversions exist in most languages already.

Enter Arrow Database Connectivity (ADBC)

Another corner of the universe where many, many thousands of lines of code have been written to create table-like things is the world of database drivers. I did a quick perusal of R packages implementing database access and without much effort found some version of optimized data frame creation in RSQLite, RPostgreSQL, odbc, RODBC, duckdb, and bigrquery. I don’t spend any time in the Python database world, but all of these and more exist there, too. Database access has been implemented in almost every language under the sun and maintainers have spent a huge amount of time writing and optimizing these conversions: databases are a big deal.

The Arrow Database Connectivity (ADBC) project is a new component of the Arrow ecosystem that provides a framework for database drivers whose access point is an ArrowArrayStream, just like GDAL’s columnar access API. Just as GDAL’s new API shrinks the amount of code required to access a table to a few lines of C, ADBC lets developers re-use the same few lines of code to access any database implementing a driver. The output of any ADBC operation returning tables is an ArrowArrayStream, so any tools written to process the output apply to more than just ADBC! Because the basic unit of the ArrowArrayStream is more than a single row, database drivers have more flexibility to implement optimizations (e.g., if looping over columns is more efficient than looping over rows for a particular database).

As of this writing, the ADBC drivers and a driver manager coordinating access are still under development. Keep an eye out for more as this exciting project evolves!

Enter nanoarrow

I hope I’ve painted a picture of a future where major library after major library has made its array- and table-like outputs available as fast and friendly (streams of) ArrowArrays. GDAL’s 3.6 release is an example of how that future can provide immediate tangible benefits to users (via speed improvements) and developers (via code portability) alike. To get to that future, I think two things have to be true:

Building ArrowArrays needs to be trivial: GDAL is fortunate to have a talented, hard-working, and forward-thinking maintainer that implemented creating the C-data interface structures from scratch. Open source maintenance time is at a premimum and not every library maintainer has the time/motivation to do this.
Converting ArrowArrays to data frames needs to be trivial: Again, GDAL has a talented, hard-working, and forward-thinking maintainer that implemented his own conversions from ArrowArray streams into Python objects; not every library is going to do this. Many libraries can use the converters provided by Arrow implementations like pyarrow, the arrow R package, or the Arrow C++ library that powers them both; however, the size and scope of these libraries is often a poor fit for targets like GDAL (e.g., that have considerable install complexity to manage) or ADBC (e.g., that are composed of small components distributed as independent R/Python packages).

The nanoarrow C and C++ library is all about building ArrowArrays. For example, the ADBC Postgres driver uses nanoarrow to create ArrowArrays from PostgreSQL results and more nanoarrow-based drivers (e.g., for SQLite3) are in the works. You can use nanoarrow in your C/C++ code by copy/pasting two files from the nanoarrow repo; there are utilities to build data types, arrays, and streams of all kinds. It will take some iteration to realize the vision in its entirity, but the idea is simple: with minimal effort, your library can expose an Arrow-based API, with all the speed and portability that entails.

The nanoarrow R package is all about data.frame() conversion from (streams of) ArrowArrays. The dozen or so implementations I linked to above that were all doing various forms of creating the data.frame() from C/C++ are all solving the same rather hard problem of converting database/GDAL data row-by-row into R vector land. If you’re the hard-working maintainer of a library that just made a super fast and flexible ArrowArray-based interface, the nanoarrow R package helps you make that that API to R users with minimal effort. For example, the PR implementing GDAL’s columnar interface in the sf package that I linked to earlier uses the nanoarrow R package.

Experiments with nanoarrow in Python are ongoing…if you have ideas about what that package might look like or how it should be implemented there’s an issue just for you!. Stay tuned to the repo for updates as we move development focus from the core C/C++ library to bindings and extensions.

The future

As of this writing, I’m in the process of developing both of these components. For example, I’m trying to use the C/C++ library to do things like write geospatial Arrow types and functions and make a minimal GeoPackage reader to refine the size/capability tradeoff (i.e., make sure it’s not so minimal that it can’t do anything useful). From the R package perspective, I’m in the process of implementing the data frame conversion in a fast and flexible way so that libraries of all types can use it. There’s a lot of work to do, but an exciting future ahead.

Getting the data

curl::curl_download(
  "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
  "nshn_water_line.gpkg"
)