geos version 0.2

A new version of the geos package for R is upon us! Buckle up and, if you so choose, get some of the example data to follow along!

library(sf)
library(geos)

curl::curl_download(
  "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
  "nshn_water_line.gpkg"
)

This example data is ~800 MB to highlight some of the features of this release designed to help with bigger data sets, but I’ll load a small sample to illustrate some of the other features, too.

cornwallis_river_segs <- read_sf(
  "nshn_water_line.gpkg", 
  query = "SELECT * from nshn_water_line
    WHERE 
      RIVNAME_1 = 'Cornwallis River' AND
      LINE_CLASS = 1"
) |> 
  as_geos_geometry()

str(cornwallis_river_segs)
##  geos_geometry[1:106] <MULTILINESTRING [386231 4992937...386356 4993194]>, <MUL
plot(cornwallis_river_segs)

New features in GEOS 3.11.0

The main update is adding new functionality available in GEOS version 3.11.0. You can check which version is being used by running geos_version():

geos_version()
## [1] '3.12.0'

If this shows something else, you can run install.packages(c("libgeos", "geos")) and that should update both such that the new features are available.

The first new functions added in this release are geos_line_merge() and geos_line_merge_directed(). Technically the regular line merge function was available in GEOS before 3.11.0, I just forgot to implement it. Now you can do both! This is useful for stringing together multiple segments into a single feature. A river is a prefect example: the cornwallis_river variable is actually a bunch of segments, but could be represented by a single feature. This data set defines coordinates by flow direction, for which the directed option is particularly useful:

cornwallis_river <- cornwallis_river_segs |> 
  wk::wk_flatten() |> 
  wk::wk_collection(wk::wk_geometry_type("multilinestring")) |> 
  geos_line_merge_directed()

str(cornwallis_river)
##  geos_geometry[1:1] <LINESTRING [362369 4989973...392435 4997731]>
plot(cornwallis_river)
plot(geos_point_start(cornwallis_river), col = "red", add = T)
plot(geos_point_end(cornwallis_river), col = "blue", add = T)

An exciting feature added in the underlying GEOS library and exposed in this release is a set of concave hull algorithms. While geos_convex_hull() has been around for a while, geos_concave_hull() and geos_concave_hull_of_polygons() were added to create simplified coverings of geometries where the convex hull may be a poor approximation.

plot(geos_convex_hull(cornwallis_river), col = "grey90", border = NA)
plot(geos_concave_hull(cornwallis_river, ratio = 0.5), col = "grey80", border = NA, add = T)
plot(geos_concave_hull(cornwallis_river, ratio = 0.25), col = "grey70", border = NA, add = T)
plot(cornwallis_river, add = T)

My favourite feature added in GEOS 3.11.0 and exposed in this release is geos_hilbert_code(). Hilbert curves are space-filling curves and you can read up on the (really cool!) details if you’d like…the main takeaway is that this lets you “sort” a vector of geometries so that sequential geometries are more likely to be close to eachother. This is useful for chunking up an arbitrary vector of geometries in such a way that the chunks are meaningful (say, to split them up and write them into multiple files whose bounding boxes are more useful).

library(ggplot2)

data.frame(
  hilbert_code = geos_hilbert_code(cornwallis_river_segs, level = 2),
  geometry = cornwallis_river_segs
) |> 
  st_as_sf() |> 
  ggplot(aes(col = factor(hilbert_code))) +
  geom_sf() +
  theme_bw() +
  theme(legend.position = "bottom")

As you can see, it’s not a perfect system, but as a quick-and-dirty heuristic for arbitrary input it’s much faster than something like clustering. You can adjust level to a higher value (up to 15) to get more “bins”.

A few other functions were added too, including geos_remove_repeated_points(), geos_transform_xy(), geos_create_rectangle(), and the fix_structure argument in the WKB and WKT readers (which allows polygon input with unclosed rings to be fixed on import). They’re exciting too but I’ll leave them to the documentation/a future post!

A brand new geos_geometry_writer()

In a previous release, there was a problem identified with the way I converted sf (and other) objects to geos_geometry() vectors. Instead of fixing the problem, I just fell back on converting sf objects to well-known binary and then using geos’ reader to read them in. This release boasts an all-new converter that is (1) fixed, (2) way faster, and (3) uses a lot less memory!

water_line <- read_sf("nshn_water_line.gpkg")

# old way
system.time(geos_read_wkb(st_as_binary(water_line$geometry)))
##    user  system elapsed 
##   2.243   0.081   2.324
# new way
system.time(as_geos_geometry(water_line$geometry))
##    user  system elapsed 
##   0.442   0.028   0.471

Because as_geos_geometry() is used in every GEOS function to convert non-GEOS input, and because pretty much every geometry in R starts with sf::read_sf(), this should mean that most GEOS operations for most users are much faster.

Better spatial join keys

The “look up all the features in y that intersect x” operation is super important and usually one that is accessed by users silently when they do a spatial join (e.g., using sf::st_join()). I wrote a few months ago about the various overheads associated with spatial joins, but there were too many steps required to make the join operation fast with the existing geos_matrix_*() functions that I decided to include a new workflow in the geos package based on the implementation in GeoPandas.

The below example finds neighbouring Nova Scotia river segments (i.e., locates all pairs of intersecting river segments), which returns about 2 million pairs in total.

water_line_geos <- data.frame(
  OBJECTID = water_line$OBJECTID,
  geometry = as_geos_geometry(water_line)
)

system.time(
  water_line_neighbours <-
    geos_inner_join(water_line_geos, water_line_geos, "intersects")
)
##    user  system elapsed 
##  16.412   0.401  16.922
head(water_line_neighbours)
##   OBJECTID.x                                          geometry.x OBJECTID.y
## 1          2 <MULTILINESTRING [328482 4837061...328624 4837208]>          1
## 2          2 <MULTILINESTRING [328482 4837061...328624 4837208]>      32080
## 3          2 <MULTILINESTRING [328482 4837061...328624 4837208]>      46329
## 4          2 <MULTILINESTRING [328482 4837061...328624 4837208]>          2
## 5          2 <MULTILINESTRING [328482 4837061...328624 4837208]>        183
## 6          3 <MULTILINESTRING [346530 4850232...346690 4850301]>      46340
##                                                                           geometry.y
## 1                                <MULTILINESTRING [328463 4836920...328675 4837061]>
## 2                                <MULTILINESTRING [328453 4837061...328482 4837091]>
## 3   <MULTILINESTRING Z ((328623.5255 4837198.179 0.2, 328674.9025 4836996.215 0.2))>
## 4                                <MULTILINESTRING [328482 4837061...328624 4837208]>
## 5                                <MULTILINESTRING [328624 4837198...328779 4838191]>
## 6 <MULTILINESTRING Z ((346689.5248 4850232.3785 0.2, 346409.2429 4850128.3145 0.2))>

This isn’t all that much faster than sf::st_join(), but it’s built on a more flexible foundation that exposes some of the lower-level tree index functionality available from the GEOS C API.

system.time(
  water_line_neighbours_sf <- st_join(
    water_line["OBJECTID"],
    water_line["OBJECTID"],
  )
)
##    user  system elapsed 
##  31.918   0.609  32.577

One of those foundational pieces are exposing the raw indices of the left and right geometries. This is similar to the sparse matrix returned by sf::st_intersects() and geos_matrix_intersects() but is in a form that’s more easily used to implement joins:

head(geos_inner_join_keys(water_line_geos$geometry, water_line_geos$geometry))
##    x      y
## 1  1 452960
## 2  1  33019
## 3  1  46512
## 4  1      1
## 12 1    120
## 14 2  46578

Even lower level is the geos_basic_strtree() itself, which you can build up incrementally without keeping all the geometries in memory at any given time:

tree <- geos_basic_strtree()

chunk_start <- 0
chunk_end <- 1000

while (TRUE) {
  chunk <- read_sf(
    "nshn_water_line.gpkg", 
    query = glue::glue_sql(
      "SELECT geometry from nshn_water_line 
        WHERE FID >= {chunk_start} AND FID < {chunk_end}")
  )
  
  if (nrow(chunk) == 0) {
    break
  }
  
  geos_basic_strtree_insert(tree, chunk)
  chunk_start <- chunk_start + 1000
  chunk_end <- chunk_end + 1000
}

geos_basic_strtree_size(tree)
## [1] 483268
str(
  geos_basic_strtree_query(
    tree, 
    wk::rct(246816.4, 4809440, 349874, 4896209)
  )
)
## 'data.frame':	40796 obs. of  2 variables:
##  $ x   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ tree: int  46840 2548 47333 2125 46842 864 874 7188 16695 7187 ...

Using the basic strtree requires some bookeeping on your end (hence the convenience functions described above); however, exposing it is the first step towards making bigger-than-memory workflows in R easier to build tools for.

The whole game

Above I’ve described a few incremental changes but I also want to bring a few of them together to demonstrate the “bigger than memory” game. Pretending there’s some world where nshn_water_line.gpkg is bigger than memory and we want to repeatedly query it, lets first divide it up into roughly continguous chunks in space. Because we’re pretending this is bigger than memory, we’re going to do a “loop and collect” approach.

loop_chunks <- function(fun) {
  chunk_start <- 0
  chunk_end <- 10000
  chunk_i <- 0
  
  while (TRUE) {
    chunk_i <- chunk_i + 1
    chunk <- read_sf(
      "nshn_water_line.gpkg", 
      query = glue::glue_sql(
        "SELECT * from nshn_water_line 
          WHERE FID >= {chunk_start} AND FID < {chunk_end}")
    )
    
    if (nrow(chunk) == 0) {
      break
    }
    
    fun(chunk_i, chunk)
    chunk_start <- chunk_start + 10000
    chunk_end <- chunk_end + 10000
  }
}

To use geos_hilbert_code() to chunk our file into roughly spatially contiguous chunks, we first need to know the full bounding box. You might know this already, but if you don’t, you can loop through chunks to calculate it.

chunk_bboxes <- rep(
  wk::rct(
    Inf, Inf, -Inf, -Inf,
    crs = st_layers("nshn_water_line.gpkg")$crs[[1]]
  ),
  100
)

loop_chunks(function(i, chunk) {
  chunk_bboxes[i] <<- st_bbox(chunk)
})

(dataset_bbox <- wk::wk_bbox(chunk_bboxes))
## <wk_rct[1] with CRS=EPSG:26920>
## [1] [215869.1 4790519 781792.9 5237312]

Then we can use geos_hilbert_code() to split up the features into multiple files.

loop_chunks(function(i, chunk) {
  hilbert_code <- geos_hilbert_code(chunk, dataset_bbox, level = 3)
  filename <- sprintf("hilbert_nshn_water_line_%03d.gpkg", hilbert_code)
  by_filename <- split(chunk, filename)
  for (i in seq_along(by_filename)) {
    file <- names(by_filename)[i]
    write_sf(by_filename[[i]], file, append = file.exists(file))
  }
})

Lets see how our chunking did with respect to “spatially continguous”!

hilbert_files <- list.files(".", "^hilbert_")
names(hilbert_files) <- hilbert_files
chunk_bboxes <- lapply(hilbert_files, function(f) st_bbox(read_sf(f)))
chunk_index <- data.frame(
  file = names(chunk_bboxes), 
  geometry = do.call(c, lapply(chunk_bboxes, st_as_sfc))
) |> 
  st_as_sf()

ggplot(chunk_index) +
  geom_sf(fill = rgb(0, 0, 0, 0.1)) +
  theme_bw()

There do happen to be a few really big features in this file which make some of the chunks unnecessarily large, but other than that the method did a pretty good job!

An important part of that last chunk is the chunk_index, which lets us pre-query the data to load fewer files if a query involves loading intersecting features. Also, because we’ve split the data into multiple files, we can perform some of that work in parallel (e.g., using the furrr package to loop over files).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)

target_feature <- cornwallis_river
target_feature_bbox_wkt <- as.character(wk::as_wkt(wk::wk_bbox(target_feature)))

target_feature_intersects <- chunk_index |> 
  # only read some files
  filter(geos_intersects(geometry, target_feature)) |> 
  pull(file) |> 
  # push down the bounding box to GDAL if the datasource supports it
  map_dfr(read_sf, wkt_filter = target_feature_bbox_wkt) |> 
  # filter on the final predicate check
  filter(geos_intersects(geom, target_feature))

plot(target_feature_intersects$geom, col = "red")
plot(target_feature, add = T)

Note that the wkt_filter technique will do a pretty good job for most files you’ll run into in the wild; however, the chunking technique means that your data can (1) live remotely and you don’t have to do any downloading of files that aren’t relevant to you, (2) you can use file types that don’t support spatial indexing but are fast to read (like GeoParquet!), and (3) you can read the files in parallel. Generally, it’s also just easier to work with a bunch of smaller files than an 8 GB any kind of file.

What’s next

This release caught up on a bunch of maintenance tasks and I feel pretty good about where the package is at the moment. Next up is whatever arrives in the next GEOS release!

Software Engineer & Geoscientist