Profiling point representations in GeoArrow

The first post I wrote about GeoArrow builds on a lot of great work in the GeoArrow specification repo, which has accumulated a lot of really exciting discussion over the past year or so. For that post I prototyped a low-level implementation, an R implementation that uses it, and made a bunch of data available in that form for testing. Since then there have been more great blog posts, like the one by Kyle Barron on using the C Data interface in JavaScript, talks by Joris Van den Bossche, and a GDAL implementation that continues to keep current with the state of the specification in the repository. Cool stuff!

A note for most people currently using GeoParquet: the part of the GeoArrow specification that I’m about to talk about doesn’t apply to you yet! The current GeoParquet spec mandates a WKB geometry encoding…this post is about a potentially more efficient encoding that makes better use of Arrow’s type system.

When I started prototyping the low-level library, I wrote in support for points that are stored like this (which I’ll call xxyyzz for the rest of this post):

x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
z <- c(9, 10, 11, 12)

…in addition to points that are stored like this (which I’ll call xyzxyz):

xyz <- c(1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12)
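
In Arrow terms, xxyyzz maps naturally to a struct array with one double child per dimension, while xyzxyz maps to a fixed-size list of width 3. Here’s a minimal sketch of the two types using the arrow package’s type constructors (assuming fixed_size_list_of() is available in your arrow version):

library(arrow)

# xxyyzz: one contiguous child array per dimension
type_xxyyzz <- struct(x = float64(), y = float64(), z = float64())

# xyzxyz: each vertex is a run of three interleaved doubles
type_xyzxyz <- fixed_size_list_of(float64(), 3)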

Eventually that got too hard to support in the way that I’d written the library, so I dropped support for xx yy zz. Since then there have been questions about why that decision was made. That decision happened before I really knew what Arrow was, and I never did any investigation. This post is an attempt to investigate the tradeoffs.

Example data

Scan to the end of this post for how to generate it! It’s based on the Nova Scotia Hydrometric dataset: approximately 23 million points organized as roughly 483,000 multilinestrings. There’s a Z dimension in the original dataset and I didn’t drop it…I don’t think it makes a difference but feel free to give it a shot! You can find the parquet files here.
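
If you have the files locally, a quick way to confirm those sizes:

nrow(arrow::read_parquet("points-xxyyzz.parquet", as_data_frame = FALSE))
nrow(arrow::read_parquet("multilines-xxyyzz.parquet", as_data_frame = FALSE))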

Reading files

First, is there any advantage to either representation with respect to how fast it is to read the raw Parquet file? I’ll try just the points first:

library(arrow)

bench::mark(
  xyzxyz = read_parquet("points-xyzxyz.parquet", as_data_frame = FALSE),
  xxyyzz = read_parquet("points-xxyyzz.parquet", as_data_frame = FALSE),
  iterations = 5,
  check = FALSE
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 xyzxyz        886ms    888ms      1.13    4.25MB        0
## 2 xxyyzz        469ms    473ms      2.04    3.54KB        0

It looks like the columnar approach (xx yy zz) is about 1.9 times faster to read, although the advantage is mellowed by the layers of nesting around the multilinestring:

bench::mark(
  xyzxyz = read_parquet("multilines-xyzxyz.parquet", as_data_frame = FALSE),
  xxyyzz = read_parquet("multilines-xxyyzz.parquet", as_data_frame = FALSE),
  iterations = 5,
  check = FALSE
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 xyzxyz        1.42s    1.51s     0.652    3.54KB        0
## 2 xxyyzz        1.02s    1.12s     0.888    3.54KB        0

Still, reading the multilinestrings in xx yy zz form is about 1.4 times faster.
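
If you want to double-check how each file encodes its geometry column, the type of the column makes it obvious (the xyz xyz file stores a fixed-size list of doubles; the xx yy zz file stores a struct of x, y, and z):

read_parquet("points-xyzxyz.parquet", as_data_frame = FALSE)$geometry$type
read_parquet("points-xxyyzz.parquet", as_data_frame = FALSE)$geometry$type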

Zero-copy support in R

I’m less familiar with what the options are in Python, but in R we get out-of-the-box zero-copy support for conversion both to and from Arrow, in a form that users can work with without any extra code:

table_points_xxyyzz <- read_parquet("points-xxyyzz.parquet", as_data_frame = FALSE)
system.time(df_points_xxyyzz <- as.data.frame(table_points_xxyyzz))
##    user  system elapsed 
##   0.002   0.000   0.003
# The Struct array maps nicely to a data.frame
str(df_points_xxyyzz)
## 'data.frame':	23040506 obs. of  3 variables:
##  $ feature_id: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ part_id   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ geometry  : tibble [23,040,506 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ x: num  328482 328486 328489 328490 328495 ...
##   ..$ y: num  4837061 4837061 4837062 4837065 4837083 ...
##   ..$ z: num  0.2 0.2 0.2 0.2 0.2 ...
# Underneath this is just an arrow::ChunkedArray
.Internal(inspect(df_points_xxyyzz$geometry$x))
## @1375c97b0 14 REALSXP g0c0 [REF(65535)] arrow::array_dbl_vector<0x131f3c8e8, double, 1 chunks, 0 nulls> len=23040506

With layers of nesting we lose the speed and the zero-copy nature of the out-of-the-box conversion, but it’s still in a recognizable form (list of list of list of data.frame).

table_lines_xxyyzz <- read_parquet("multilines-xxyyzz.parquet", as_data_frame = FALSE)
df_lines_xxyyzz2 <- as.data.frame(head(table_lines_xxyyzz, 2))
head(df_lines_xxyyzz2[[1]][[1]][[1]])
## # A tibble: 6 × 3
##         x        y     z
##     <dbl>    <dbl> <dbl>
## 1 328482. 4837061. 0.200
## 2 328486. 4837061. 0.200
## 3 328489. 4837062. 0.200
## 4 328490. 4837065. 0.200
## 5 328495. 4837083. 0.200
## 6 328496. 4837089. 0.200

With the xyz xyz version, the out-of-the-box conversion is slow (although this probably can and should be fixed).

table_points_xyzxyz <- read_parquet("points-xyzxyz.parquet", as_data_frame = FALSE)
# Takes ~170 seconds
# system.time(df_points_xyzxyz <- as.data.frame(table_points_xyzxyz))
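
In the meantime, one workaround is to flatten the coordinates into a single double vector and de-interleave them in R. This is only a sketch (not benchmarked here), and it assumes the list_flatten compute kernel accepts fixed_size_list arrays:

flat <- as.vector(
  call_function("list_flatten", table_points_xyzxyz$geometry$chunk(0))
)
coords <- matrix(flat, ncol = 3, byrow = TRUE)
df_points_xyzxyz_manual <- data.frame(
  x = coords[, 1], y = coords[, 2], z = coords[, 3]
)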

The structure is easy enough to work with, although we lose the speed and the zero-copy conversion that the xx yy zz encoding gave us.

table_lines_xyzxyz <- read_parquet("multilines-xyzxyz.parquet", as_data_frame = FALSE)
df_lines_xyzxyz2 <- as.data.frame(head(table_lines_xyzxyz, 2))
head(df_lines_xyzxyz2[[1]][[1]][[1]])
## <fixed_size_list<double, 3>[6]>
## [[1]]
## [1]  328482.0 4837060.8       0.2
## 
## [[2]]
## [1]  328486.0 4837060.7       0.2
## 
## [[3]]
## [1]  328489.0 4837061.7       0.2
## 
## [[4]]
## [1]  328490.1 4837064.6       0.2
## 
## [[5]]
## [1]  328495.4 4837083.5       0.2
## 
## [[6]]
## [1]  328495.5 4837089.5       0.2

Arrow compute support

Arrow’s compute function registry contains an ever-growing collection of kernels that are optimized for data in Arrow form. Functions like min, max, multiplication, division, and many more are included! If your processor has special SIMD instruction sets that apply to your operation, Arrow will (probably) dispatch to kernels that use them. As an example, we can calculate a bounding box without any special handling:

# (these all have a single chunk in the chunked array)
array_points_xxyyzz <- table_points_xxyyzz$geometry$chunk(0)
call_function("min_max", array_points_xxyyzz$x)
## StructScalar
## {min:double = 215869.09059999976, max:double = 781792.8552999999}
call_function("min_max", array_points_xxyyzz$y)
## StructScalar
## {min:double = 4790519.438100001, max:double = 5237312.2392}
call_function("min_max", array_points_xxyyzz$z)
## StructScalar
## {min:double = -41.73200000000361, max:double = 529.8000000000029}

The nested version doesn’t work out of the box in R (although I believe you can access children of list arrays in Python).

Parquet metadata support

One of the advantages of using Parquet is filter pushdown: the Parquet format has a place for “column statistics”, the most useful of which in our case are the minimum and maximum values that appear in a column. In the Parquet sense, “column” also applies to a column nested inside a parent array like a struct or a list. In the case of our xx yy zz encoding, we get column statistics out of the box for the points array.

# (not implemented in R/arrow yet!)
from pyarrow import parquet
points_xxyyzz = parquet.ParquetFile("points-xxyyzz.parquet")
points_xxyyzz.metadata.row_group(0).column(2).statistics
<pyarrow._parquet.Statistics object at 0x126f00e00>
  has_min_max: True
  min: 215869.09059999976
  max: 781792.8552999999
  null_count: 0
  distinct_count: 0
  num_values: 23040506
  physical_type: DOUBLE
  logical_type: None
  converted_type (legacy): NONE

We also get column statistics out of the box for the xyz xyz version, but the statistics are meaningless because the x, y, and z values are all mixed together as far as Parquet is concerned:

points_xyzxyz = parquet.ParquetFile("points-xyzxyz.parquet")
points_xyzxyz.metadata.row_group(0).column(2).statistics

This even works for the nested lists of our multilinestring! You can check at the bottom of this post…I did nothing special when I wrote this file.

lines_xxyyzz = parquet.ParquetFile("multilines-xxyyzz.parquet")
lines_xxyyzz.metadata.row_group(0).column(0).statistics

This file was written in a single row group, but you could organize your row groups to reflect spatial proximity, which would let a reader implementation avoid reading chunks of data that are irrelevant to a spatial query.
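
As a sketch of what that could look like from R (hilbert_order() is a hypothetical helper here, standing in for any ordering that puts nearby features next to each other), you could sort the features before writing and pick a smaller row-group size so that each row group covers a compact area:

# Hypothetical: order features by spatial proximity so that each row group's
# x/y statistics cover a small bounding box, then write smaller row groups
# so a reader can skip the ones that can't intersect a query window.
sorted_lines <- table_lines_xxyyzz[hilbert_order(table_lines_xxyyzz$geometry), ]
write_parquet(sorted_lines, "multilines-xxyyzz-sorted.parquet", chunk_size = 100000)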

Kernel speed?

One of the arguments for an xyz xyz encoding is data locality: a lot of algorithms are going to use the x, y, and z values in the same inner for() loop. If they’re all together, there’s a chance that we get fewer cache misses. I’m not an expert here, but we have enough infrastructure in place to test a few kernels. I’m very happy to defer to any HPC experts and/or GPU-wielding enthusiasts to test this (or tell me that I’m testing the wrong thing!).

Here I’ll use the narrow prototype package to make it easier to stick these kernels in an RMarkdown blog post. This basically gives us access to the Arrow C Data interface structs and the buffers they contain.

library(narrow) # remotes::install_github("paleolimbot/narrow")

lines_xyzxyz <- as_narrow_array(
  read_parquet(
    "multilines-xyzxyz.parquet",
    as_data_frame = FALSE
  )$geometry$chunk(0)
)

lines_xxyyzz <- as_narrow_array(
  read_parquet(
    "multilines-xxyyzz.parquet",
    as_data_frame = FALSE
  )$geometry$chunk(0)
)
lines_xyzxyz$schema$children[[1]]$children[[1]]
## <narrow_schema '+w:3' at 0x245d4040290>
## - format: +w:3
## - name: vertices
## - flags: nullable
## - metadata:  list()
## - dictionary: NULL
## - children[1]:
##   <narrow_schema 'g' at 0x245d4040410>
##   - format: g
##   - name: xyz
##   - flags: nullable
##   - metadata:  list()
##   - dictionary: NULL
##   - children[0]:
lines_xxyyzz$schema$children[[1]]$children[[1]]
## <narrow_schema '+s' at 0x245d4040e90>
## - format: +s
## - name: 
## - flags: 
## - metadata:  list()
## - dictionary: NULL
## - children[3]:
##   <narrow_schema 'g' at 0x121746420>
##   - format: g
##   - name: x
##   - flags: nullable
##   - metadata:  list()
##   - dictionary: NULL
##   - children[0]:
##   <narrow_schema 'g' at 0x121746468>
##   - format: g
##   - name: y
##   - flags: nullable
##   - metadata:  list()
##   - dictionary: NULL
##   - children[0]:
##   <narrow_schema 'g' at 0x1217464b0>
##   - format: g
##   - name: z
##   - flags: nullable
##   - metadata:  list()
##   - dictionary: NULL
##   - children[0]:

Length

Below is a length kernel, which I thought might be a good example of an operation that has to loop over every coordinate, using both x and y together, and one that I would expect to perform better with the xyz xyz encoding. Here’s the xyz xyz version:

#include "R.h"
#include "narrow.h"

SEXP length_xyzxyz(SEXP array_data_xptr) {
  struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");
  
  int coord_size = 3;
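  // Multilinestring as list<list<fixed_size_list<double, 3>>>: feature and
  // part offsets live in buffers[1] of the two outer arrays; the interleaved
  // xyz doubles live in the innermost child's data buffer.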
  double* coords = (double*) array_data->children[0]->children[0]->children[0]->buffers[1];
  int32_t* coord_offsets = (int32_t*) array_data->children[0]->buffers[1];
  int32_t* geom_offsets = (int32_t*) array_data->buffers[1];
  
  geom_offsets = geom_offsets + array_data->offset;
  coord_offsets = coord_offsets + geom_offsets[0];
  coords = coords + coord_offsets[0];
  
  double* coord0;
  double* coord1;
  int32_t* coord_offset;
  double feature_length, dx, dy;
  SEXP output_sexp = PROTECT(Rf_allocVector(REALSXP, array_data->length));
  double* output = REAL(output_sexp);
  
  for (int64_t feat_id = 0; feat_id < array_data->length; feat_id++) {
    feature_length = 0;
    int32_t n_geoms = geom_offsets[feat_id + 1] - geom_offsets[feat_id];
    coord_offset = coord_offsets + geom_offsets[feat_id];
    
    for (int32_t geom_id = 0; geom_id < n_geoms; geom_id++) {
      int32_t n_coords = coord_offset[geom_id + 1] - coord_offset[geom_id];
      
      for (int32_t point_id = 1; point_id < n_coords; point_id++) {
        coord0 = coords + (coord_size * (coord_offset[geom_id] + point_id - 1));
        coord1 = coords + (coord_size * (coord_offset[geom_id] + point_id));
        dx = coord1[0] - coord0[0];
        dy = coord1[1] - coord0[1];
        
        feature_length += sqrt(dx * dx + dy * dy);
      }
    }
    
    output[feat_id] = feature_length;
  }
  
  UNPROTECT(1);
  return output_sexp;
}

Here’s the xx yy zz version:

#include "R.h"
#include "narrow.h"

SEXP length_xxyyzz(SEXP array_data_xptr) {
  struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");
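  // Same two levels of offsets as the interleaved version, but the vertices
  // are a struct<x, y, z>: each dimension has its own contiguous double buffer.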
  
  double* xs  = (double*)array_data->children[0]->children[0]->children[0]->buffers[1];
  double* ys = (double*)array_data->children[0]->children[0]->children[1]->buffers[1];
  int32_t* coord_offsets = (int32_t*) array_data->children[0]->buffers[1];
  int32_t* geom_offsets = (int32_t*) array_data->buffers[1];
  
  geom_offsets += array_data->offset;
  coord_offsets += geom_offsets[0];
  xs += coord_offsets[0];
  ys += coord_offsets[0];

  int32_t* coord_offset;
  double* x;
  double* y;
  double feature_length, dx, dy;
  
  SEXP output_sexp = PROTECT(Rf_allocVector(REALSXP, array_data->length));
  double* output = REAL(output_sexp);
  
  for (int64_t feat_id = 0; feat_id < array_data->length; feat_id++) {
    feature_length = 0;
    int32_t n_geoms = geom_offsets[feat_id + 1] - geom_offsets[feat_id];
    coord_offset = coord_offsets + geom_offsets[feat_id];
    
    for (int32_t geom_id = 0; geom_id < n_geoms; geom_id++) {
      int32_t n_coords = coord_offset[geom_id + 1] - coord_offset[geom_id];
      
      for (int32_t point_id = 1; point_id < n_coords; point_id++) {
        x = xs + (coord_offset[geom_id] + point_id - 1);
        y = ys + (coord_offset[geom_id] + point_id - 1);
        
        dx = x[1] - x[0];
        dy = y[1] - y[0];
        
        feature_length += sqrt(dx * dx + dy * dy);
      }
    }
    
    output[feat_id] = feature_length;
  }
  
  UNPROTECT(1);
  return output_sexp;
}

Besides the fact that the index math is a little easier when we don’t have to consider the coord_size, neither version is meaningfully faster (on my CPU, for this dataset, anyway):

bench::mark(
  length_xyzxyz = .Call("length_xyzxyz", lines_xyzxyz$array_data),
  length_xxyyzz = .Call("length_xxyyzz", lines_xxyyzz$array_data)
)
## # A tibble: 2 × 6
##   expression         min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 length_xyzxyz     21ms   21.8ms      44.0    3.69MB    10.4 
## 2 length_xxyyzz   19.6ms   20.1ms      48.4    3.69MB     9.68

Next I’ll try one that I would expect to be faster for xx yy zz: a bounding box calculation. For brevity I’ll just do the x min and max. The xyz xyz version is here:

#include "R.h"
#include "narrow.h"

SEXP minmaxx_xyzxyz(SEXP array_data_xptr) {
  struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");
  
  int coord_size = 3;
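  // array_data here is the fixed_size_list<double, 3> of vertices, so its
  // child's data buffer holds x, y, z interleaved and x is every third value.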
  double* coords = (double*) array_data->children[0]->buffers[1];

  double maxx = R_NegInf;
  double minx = R_PosInf;
  double xi;
  for (int64_t i = 0; i < array_data->length; i++) {
    xi = coords[i * coord_size];
    if (xi > maxx) {
      maxx = xi;
    }
    if (xi < minx) {
      minx = xi;
    }
  }
  
  SEXP result = PROTECT(Rf_allocVector(REALSXP, 2));
  REAL(result)[0] = minx;
  REAL(result)[1] = maxx;
  UNPROTECT(1);
  return result;
}

The xx yy zz version is here:

#include "R.h"
#include "narrow.h"

SEXP minmaxx_xxyyzz(SEXP array_data_xptr) {
  struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");
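  // array_data here is the struct<x, y, z> of vertices, so children[0] is the
  // x child and its data buffer is already a contiguous run of x values.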
  
  double* xs = (double*) array_data->children[0]->buffers[1];

  double maxx = R_NegInf;
  double minx = R_PosInf;
  double xi;
  for (int64_t i = 0; i < array_data->length; i++) {
    xi = xs[i];
    if (xi > maxx) {
      maxx = xi;
    }
    if (xi < minx) {
      minx = xi;
    }
  }
  
  SEXP result = PROTECT(Rf_allocVector(REALSXP, 2));
  REAL(result)[0] = minx;
  REAL(result)[1] = maxx;
  UNPROTECT(1);
  return result;
}

These are also about the same:

bench::mark(
  minmaxx_xyzxyz = .Call("minmaxx_xyzxyz", lines_xyzxyz$array_data$children[[1]]$children[[1]]),
  minmaxx_xxyyzz = .Call("minmaxx_xxyyzz", lines_xxyyzz$array_data$children[[1]]$children[[1]])
)
## # A tibble: 2 × 6
##   expression          min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 minmaxx_xyzxyz   28.9ms   30.6ms      31.0    5.11KB        0
## 2 minmaxx_xxyyzz   28.7ms   29.4ms      33.8        0B        0

Of course, the Arrow version blows both of these out of the water (~2x faster), but it only works on the xx yy zz encoding out of the box:

bench::mark(
  arrow_minmaxx_xxyyzz = {
    points_narrow <- narrow_array(
      lines_xxyyzz$schema$children[[1]]$children[[1]],
      lines_xxyyzz$array_data$children[[1]]$children[[1]],
      validate = FALSE
    )
    points_arrow <- from_narrow_array(points_narrow, arrow::RecordBatch)
    as.vector(call_function("min_max", points_arrow$x))
  }
)
## # A tibble: 1 × 6
##   expression                min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 arrow_minmaxx_xxyyzz   15.8ms     16ms      62.3     610KB        0

And since this is in a Parquet file with column stats, we don’t even need to read the file and we can do it in a few ms (consider also that we had to spend 800 ms reading the file to get the array above).

import timeit
def read_minmax_x():
    points_xxyyzz = parquet.ParquetFile("points-xxyyzz.parquet")
    stats_x = points_xxyyzz.metadata.row_group(0).column(2).statistics
    return (stats_x.min, stats_x.max)

timeit.timeit(read_minmax_x, number=100) * 1000

Zero-copy transfer to other libraries

There are a lot of libraries that use some version of xyz xyz encodings of points. OpenGL seems to use something like xyzw; well-known binary stores coordinates like xyz xyz; cuSpatial uses this representation; S2 mostly uses std::vector<S2Point> which looks something like xyzxyzxyz in memory; GEOS has some exciting changes afoot with its coordinate sequence class to remove virtual method call overhead. Does that mean zero-copy transfer to those libraries? Probably only if the dimensions match, and only if there are no preprocessing steps (although feel free to set me straight by example!).

Another consideration is that the kernels I implemented above are read-only. There’s a very good chance that modifying a buffer with xyz together is faster or more readable (not tested here).

Summary

Other than historical reasons and GPU data-locality reasons (which I don’t have the hardware to test), I think there is a good case for allowing coordinates to be encoded as xx yy zz, since it gives some concrete performance benefits today without generic libraries like Arrow or specifications like Parquet having to special-case geometry (and without maintainers having to spend time implementing it). This isn’t to say that xyz xyz isn’t useful elsewhere or useful as an option in GeoArrow; it’s just that support for it today is not great.

In case you’d like to recreate the dataset…

library(geoarrow)
library(arrow)
library(narrow)


# Download a ~300MB multiline dataset in the current geoarrow spec encoding
curl::curl_download(
  "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.parquet",
  "nshn_water_line.parquet"
)

# Current encoding is xyzxyz
lines_table <- read_parquet("nshn_water_line.parquet", as_data_frame = FALSE)
lines_geom_xyzxyz <- as_narrow_array(lines_table$geometry$chunk(0))

# Manually create the xx yy zz encoding
coords <- wk::wk_coords(lines_geom_xyzxyz)
counts <- wk::wk_count(lines_geom_xyzxyz)

points_array <- as_narrow_array(coords[c("x", "y", "z")])
line_offsets <- c(0L, cumsum(rle(coords$part_id)$lengths))
part_offsets <- c(0L, cumsum(counts$n_geom - 1L))

# drop all the metadata for now
lines_geom_xyzxyz$schema$metadata <- NULL
lines_geom_xyzxyz$schema$children[[1]]$metadata <- NULL
lines_geom_xyzxyz$schema$children[[1]]$children[[1]]$metadata <- NULL

lines_geom_xxyyzz_schema <- lines_geom_xyzxyz$schema
lines_geom_xxyyzz_schema$children[[1]]$children[[1]] <- points_array$schema

lines_geom_xxyyzz_data <- narrow_array_data(
  length = length(part_offsets) - 1L,
  null_count = 0L,
  buffers = list(NULL, part_offsets),
  children = list(
    narrow_array_data(
      length = length(line_offsets) - 1L,
      null_count = 0L,
      buffers = list(NULL, line_offsets),
      children = list(points_array$array_data)
    )
  )
)

lines_geom_xxyyzz <- narrow_array(lines_geom_xxyyzz_schema, lines_geom_xxyyzz_data)

# write Parquet
table_xyzxyz <- arrow_table(
  geometry = from_narrow_array(lines_geom_xyzxyz, arrow::Array)
)
write_parquet(table_xyzxyz, "multilines-xyzxyz.parquet")

table_xxyyzz <- arrow_table(
  geometry = from_narrow_array(lines_geom_xxyyzz, arrow::Array)
)
write_parquet(table_xxyyzz, "multilines-xxyyzz.parquet")

# Also write just the points as Parquet
table_points_xyzxyz <- arrow_table(
  feature_id = coords$feature_id,
  part_id = coords$part_id,
  geometry = from_narrow_array(
    narrow_array(
      lines_geom_xyzxyz$schema$children[[1]]$children[[1]],
      lines_geom_xyzxyz$array_data$children[[1]]$children[[1]]
    ),
    arrow::Array
  )
)
write_parquet(table_points_xyzxyz, "points-xyzxyz.parquet")

table_points_xxyyzz <- arrow_table(
  feature_id = coords$feature_id,
  part_id = coords$part_id,
  geometry = from_narrow_array(points_array, arrow::Array)
)
write_parquet(table_points_xxyyzz, "points-xxyyzz.parquet")