s2 version 1.1.0

s2 version 1.1.0

This release of the s2 package for R is an exciting one! Most users will use s2 implicity through its use in sf; however, the s2 package can be used on its own and these improvements will likely be available in future releases of sf.

Most of the hours spent on this release went into a massive refactor of the initial implementation so that the internals are more useful elsewhere (e.g., maybe a Python package), but that refactor also opened up opportunities to speed things up and add a few nice features. Let’s get to it!

Unions and within-distance queries are faster

A long standing issue with the current release of s2 is that s2_union_agg() are slow. The initial implementation accumulated geographies as it went; the new version starts by grouping geographies into pairs and recursively merges the pairs together until one final feature has been built. This is about 10 times faster than the previous version and should scale much better.

bench::mark(s2_union_agg(s2_data_countries()))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 s2_union_agg(s2_data_countries())    205ms  205ms      4.88     871KB        0

Another long standing issue was that the within-distance query was unusably slow. This is an important type of query because it’s much like checking for an intersection but with a tolerance. In GEOS you can create a buffered feature; in S2 that functionality isn’t implemented yet, and the workaround used a lot of memory. The new release adds s2_prepared_dwithin() and fixes s2_dwithin_matrix() such that it efficiently uses preselection based on a simplified S2 cell covering. For the test case I was using it resulted in a query that was 30 times faster although I’m sure timing will vary depending on how useful preselection is for a given use case.

points <- s2_point(runif(1e4, -1, 1), runif(1e4, -1, 1), runif(1e4, -1, 1))
geog <- s2::as_s2_geography(as_s2_lnglat(points))
countries <- s2_data_countries()

bench::mark(
  s2_dwithin_matrix(countries, geog, 1e6),
  s2_dwithin_matrix(geog, countries, 1e6),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                <bch:tm> <bch:t>     <dbl> <bch:byt>
#> 1 s2_dwithin_matrix(countries, geog, 1e+06)    162ms   165ms      6.08   133.2KB
#> 2 s2_dwithin_matrix(geog, countries, 1e+06)    414ms   416ms      2.40    80.7KB
#> # … with 1 more variable: `gc/sec` <dbl>

For the pairwise s2_dwithin() it’s sometimes faster to use preselection based on an index and sometimes not. For most calls where the length of the second argument is 1 (common, I think), you can now use the new s2_prepared_dwithin() to speed up your query.

bench::mark(
  s2_dwithin(geog, countries[1], 1e6),
  s2_prepared_dwithin(geog, countries[1], 1e6)
)
#> # A tibble: 2 × 6
#>   expression                                          min   median `itr/sec`
#>   <bch:expr>                                     <bch:tm> <bch:tm>     <dbl>
#> 1 s2_dwithin(geog, countries[1], 1e+06)            9.23ms   9.52ms      105.
#> 2 s2_prepared_dwithin(geog, countries[1], 1e+06)   3.38ms   3.48ms      282.
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

Plotting

Testing spherical geometry operations is hard, particularly because most plotting tools are Cartesian-centric and make it hard to visualize geodesic edges like those in S2 geographies. The S2 library has tools that make projecting spherical geometries relatively simple and I was tired of “flying blind” when trying to make sure an operation “worked”. Thus, enter s2_plot()! It only does one projection (orthographic with customizable centre) and is not designed for presentation-ready plots but was a total game changer for me debugging the features in this release.

s2_plot(s2_data_countries("Antarctica"))

Convex hulls and point-on-surface on the sphere

Thanks to Sylvain Piry, the underlying S2 library’s convex hull implementation is now hooked up in R! You can now do spherical convex hulls item-wise (s2_convex_hull()) or as an aggregate function (s2_convex_hull_agg()).

s2_plot(s2_convex_hull(s2_data_countries("Antarctica")))
s2_plot(s2_data_countries("Antarctica"), add = TRUE)

Thanks to Kyle Butts, the s2 R package has an implementation of “point on surface” (s2_point_on_surface()), which is originally a GEOS function that returns a deterministic point that definitely intersects a feature and is sort of in the middle of it (perhaps as a starting point for label placement). The implementation for points and lines is readily translated from GEOS; however, the polygon implementation required a new approach (we use the centre of the largest S2 cell that will fit inside the polygon).

Distance-constrained k-nearest neighbours

Almost a year ago, Roger Bivand opened an issue noting that distance-constrained k-nearest neighbours was slow when using S2 as a backend. This helped push a few features forward (including the faster s2_dwithin_matrix() implementation), but the direct fix was to add a max_distance argument to match the min_distance argument in s2_closest_edges().

cities <- s2_data_cities()
city_names <- s2_data_tbl_cities$name

knn_1000_km <- s2_closest_edges(
  cities[city_names == "Paris"], cities,
  k = 20, 
  min_distance = 0, 
  max_distance = 1e6
)

city_names[knn_1000_km[[1]]]
#>  [1] "Ljubljana"  "San Marino" "Prague"     "The Hague"  "Dublin"    
#>  [6] "Berlin"     "Andorra"    "Geneva"     "Vaduz"      "Monaco"    
#> [11] "London"     "Bern"       "Luxembourg" "Amsterdam"  "Brussels"

Cells and Cell Unions

The previous release quietly added an s2_cell() vector class; however, the most used feature of the underlying s2 library that involved s2 cells (the “covering”) wasn’t implemented. This release adds a s2_cell_union() vector class and s2_covering_cell_ids(), which implements the covering logic returning cell union vector. The covering is a simplified version of a geography that is kind of like a bounding box but potentially more accurate and sphere-friendly.

(covering <- s2_covering_cell_ids(s2_data_countries("Canada")))
#> <s2_cell_union[1]>
#> List of 1
#> $ : s2_cell[1:8] 4b , 4d , 4f , 51 , 53 , 55 , 56b, 89
s2_plot(s2_cell_polygon(unlist(covering)), border = "red")
s2_plot(s2_data_countries(), add = T)

The cell union class supports predicates like contains and intersects that are usually faster than their geographical counterparts. You can use transformers like intersection and union to approximate the result an operation rather than actually compute it which is potentially much faster.

countries <- s2_data_countries()
canada <- s2_data_countries("Canada")
countries_approx <- s2_covering_cell_ids(s2_data_countries())
may_intersect_canada <- s2_cell_union_intersects(countries_approx, covering)
s2_data_tbl_countries$name[may_intersect_canada]
#> [1] "The Bahamas"              "Canada"                  
#> [3] "Cuba"                     "Greenland"               
#> [5] "Iceland"                  "Russia"                  
#> [7] "United States of America"

bench::mark(
  s2_cell_union_intersects(countries_approx, covering),
  s2_intersects(countries, canada),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                               min  median `itr/sec`
#>   <bch:expr>                                           <bch:t> <bch:t>     <dbl>
#> 1 s2_cell_union_intersects(countries_approx, covering)  23.3µs  24.1µs    39981.
#> 2 s2_intersects(countries, canada)                     584.5µs 593.6µs     1664.
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

Another compelling use case for s2 cell manipulation is grouping geographical features such that groups contain relatively nearby features. This kind of grouping is useful because it means that the bouding box/covering of a group can be used more effectively for preselection.

# use the aggregate union to find cells to use for groups
aggregate_union <- s2_covering_cell_ids_agg(s2_data_countries(), max_cells = 10)
groups_cell <- unlist(aggregate_union)

# assign every feature exactly one group
group_id <- rep(NA_integer_, length(countries_approx))
for (i in seq_along(groups_cell)) {
  group_id[s2_cell_union_intersects(countries_approx, groups_cell[i])] <- i
}

# sample a group to see if they're vageuly close together
split(s2_data_tbl_countries$name, group_id)[[9]]
#> [1] "Argentina"                           "Antarctica"                         
#> [3] "French Southern and Antarctic Lands" "Australia"                          
#> [5] "Chile"                               "Falkland Islands"                   
#> [7] "New Zealand"

The cell and cell union vector classes are still experimental and feedback about missing features or rough edges (pun intended) are welcome!

Planar import and export

Many of the issues related to using S2 as a geometry backend for spherical coordinates are in some way or another due to the difference between edges in planar space and spherical space. Almost every data set in longitude/latitude assumes GeoJSON-like handling of edges, meaning that points are connected by straight lines assuming normal planar rules. For example, the following polygon is totally valid in most GIS software’s definition of “valid geometry”:

poly <- "POLYGON ((-70 50, 0 50, 0 70, -35 52, -70 70, -70 50))"
wk::wk_plot(wk::wkt(poly), border = "red")

If you try to import this into s2, you’ll get the dreaded error:

s2_geog_from_text(poly)
#> Error in wk_handle.wk_wkt(wkt, s2_geography_writer(oriented = oriented, : Loop 0 is not valid: Edge 0 crosses edge 2

That’s because s2 interprets the polygon’s boundary like this:

poly_boundary <- wk::wk_linestring(wk::wkt(poly))
s2_plot(s2_geog_from_text(poly_boundary), col = "red")
s2_plot(s2_data_countries(), add = TRUE)

The latest release of s2 adds the ability to keep the usual GIS rules for edges when importing features by adding points to edges to keep them within a given tolerance of the original. Using a smaller tolerance will add more points, so using the largest possible tolerance for your use-case is recommended.

s2_plot(
  s2_geog_from_text(
    poly,
    planar = TRUE,
    tessellate_tol_m = 1e3
  ),
  border = "red"
)
s2_plot(s2_data_countries(), add = TRUE)

Similarly, when exporting geographies from s2 into other GIS software, the previous release made it difficult to export edges in the way that other GIS software expected them. One example of this is exporting s2 cells, which by definition have geodesic edges.

cell <- s2_cell_parent(as_s2_cell(s2_lnglat(-64, 45)), 2)

cell_geod <- s2_as_binary(s2_cell_polygon(cell))
cell_planar <- s2_as_binary(s2_cell_polygon(cell), planar = TRUE)
wk::wk_plot(wk::as_wkb(cell_geod))
wk::wk_plot(wk::as_wkb(cell_planar), border = "red", add = TRUE)

Usually vertices of normal GIS layers are close enough together that this won’t make a meaningful difference in the output. If you are getting invalid geometry errors from s2 input or output, though, it’s another tool you can use to work around the problem.

Internal refactoring

The major change in this release was a big refactor of the initial implementation to make it possible for other to re-use some of the code we’d written to make something like simple features work with s2 as a backend. The result was a standalone library and a lot of code quality improvements! There is interest in the Python geospatial community to add Python bindings and perhaps a set of geoarrow/Arrow query engine bindings in the future!

This release also eliminates a dependency on an old version of the wk package and commits to the infrastructure in recent wk releases. The most useful thing this does is add the ability to speed up conversions between sf and s2 representations of a geometry (particularly in the sf direction).

fast_as_s2 <- function(x) {
  wk::wk_handle(x, s2_geography_writer())
}

fast_as_sfc <- function(geog) {
  wk::wk_handle(geog, wk::sfc_writer())
}

nc_sfc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))$geometry
nc_big_sfc <- rep(nc_sfc, 100)
nc_big_s2 <- as_s2_geography(nc_big_sfc)

bench::mark(
  fast_as_s2(nc_big_sfc),
  sf::st_as_s2(nc_big_sfc),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fast_as_s2(nc_big_sfc)      203ms    204ms      4.88    81.6KB     0   
#> 2 sf::st_as_s2(nc_big_sfc)    224ms    224ms      4.46    5.68MB     2.23

bench::mark(
  fast_as_sfc(nc_big_s2),
  sf::st_as_sfc(nc_big_s2),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fast_as_sfc(nc_big_s2)     15.1ms   15.8ms     55.9     4.36MB     9.31
#> 2 sf::st_as_sfc(nc_big_s2)  744.4ms  744.4ms      1.34  155.52MB    14.8

The handler/writer implementations allow some other wk infrastructure to work out of the box:

wk::wk_flatten(as_s2_geography("MULTIPOINT (0 1, 2 3, 4 5)"))
#> <geodesic s2_geography[3] with CRS=OGC:CRS84>
#> [1] POINT (0 1) POINT (2 3) POINT (4 5)

What’s next?

In addition to maintaining the features added in this release, there are some exciting features coming up! Buffers on the sphere were just added to the upstream google/s2 release, and the ability to simplify features together (e.g., polygon coverages) has always been available in s2 but never exposed in the R package. These plus modernizing the s2 package build system is up next!

Avatar
Dewey Dunnington
Geoscientist, Programmer, Educator