Partitioning strategies for bigger-than-memory spatial data

There’s a great discussion going on in the GeoParquet repo about how exactly to split up GeoParquet datasets for optimal querying. This post is my adventure test-driving those partitioning strategies for practicality. All things are possible with enough software engineering effort, of course, but what I’m after here is the ability to get these things done today on a bigger-than-memory dataset. Notably, using DuckDB and Apache Sedona, since DuckDB can spill to disk when it runs out of memory and Apache Sedona can run on more than one computer if you need it to.

In general, the partitioning has a few goals:

  • Rows in each partition should contain features that are near each other. This maximizes the chance that fewer files with highly relevant content will be downloaded for a given search area.
  • For a given search area, there should be as few files as possible to download (i.e., partitions shouldn’t contain overlapping areas if possible).
  • Each partition should contain approximately the same number of features. This is helpful because too many tiny files bloat the metadata required for the entire dataset, and huge files take longer to download.

For GeoParquet specifically there is also the question of rolling up those partitions as Parquet row groups or files, which I’ll leave to another post.

TL;DR

A KD tree gives nicely arranged non-overlapping partitions with almost identical numbers of features; however, it is expensive to construct. An adaptive S2-cell based partitioning is cheap to construct (~15 seconds) and gives nicely arranged non-overlapping partitions; however, there can be many small files left over from subdivided cells.

As always, the wrangling tools are awesome!

The Data

I’m using the same data here as in my last post on spatial joins: 130 million buildings from the lower 48 U.S. states. Partition-wise, I’m aiming for ~1 million buildings per partition, which is maybe not the perfect number for querying but is coarse enough that we can tell some things from a map about how (and how well) each partitioning strategy worked.
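
To keep the numbers concrete, here’s a quick sanity check of the input size (a minimal sketch, assuming the Parquet file from that post is sitting in the working directory):

import duckdb

# Count the input features (should be on the order of 130 million)
duckdb.sql("SELECT count(*) AS n FROM 'microsoft-buildings-point.parquet'")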

The Tools

I’ll be using the same tools for this post as for the last one on spatial joins. As of this writing, DuckDB Geography is the only one that’s a little tricky to get (hopefully it will be available via a quick INSTALL geography FROM community by the time you read this).

First, we’ll set up the connection. I’m using Ibis here because the geopandas integration lets us plot this quickly and because in general I’d like to get better at it.

import ibis
from ibis import _

ibis.options.interactive = True

# Because the geography extension isn't published yet and we'll use it!
con = ibis.duckdb.connect(allow_unsigned_extensions=True)
con.load_extension("spatial")
con.load_extension("geography")

Using S2 Cells (fixed cell level)

A fairly simple approach is to choose an S2 cell level and go! This could also be an H3 cell, which is approximately the same idea (a 64-bit integer representation with a configurable level), since there’s also an H3 extension for DuckDB.
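
For reference, the H3 version of that grouping might look something like this (a sketch I haven’t run here; it assumes the community h3 extension and its h3_latlng_to_cell() function, with the resolution chosen arbitrarily):

# Hypothetical H3 equivalent of the fixed-level S2 grouping below
# (assumes INSTALL h3 FROM community has already been run)
con.load_extension("h3")
partitions_h3 = con.sql("""
SELECT
    h3_latlng_to_cell(st_y(geom), st_x(geom), 3) AS cell,
    count(*) AS n
FROM 'microsoft-buildings-point.parquet'
GROUP BY cell
""")

Back to the S2 version: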

I fiddled a bit with the level here to get about a million buildings in each cell, but as we can see, that’s pretty much impossible to do since there are a lot of buildings in some places and not a lot of buildings in others!

fixed_level = 5

partitions_fixed = con.sql(f"""
SELECT
    -- Map each building point to the level-{fixed_level} S2 cell containing it
    geom.st_aswkb().s2_arbitrarycellfromwkb().s2_cell_parent({fixed_level}) AS cell,
    count(*) AS n,
    -- Convert the cell back into a polygon for plotting
    s2_aswkb(cell::GEOGRAPHY).st_geomfromwkb() as geom
FROM 'microsoft-buildings-point.parquet'
GROUP BY
    cell
""")

partitions_fixed_pd = partitions_fixed.to_pandas()

One cool aspect of this is that it is fast: the partitions get calculated in more or less the amount of time it takes to read the file (4 seconds, for me).
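
If you’d like to reproduce that timing, one way to do it (not necessarily how I measured above) is with time.perf_counter():

import time

# Time the query plus the transfer to pandas; re-running to_pandas()
# re-executes the query
start = time.perf_counter()
partitions_fixed_pd = partitions_fixed.to_pandas()
print(f"{time.perf_counter() - start:.1f} seconds")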

When we plot, we can see that this is perhaps not a very good strategy if we want each file to contain anywhere near the same number of rows.

partitions_fixed_pd.plot(column="n", legend=True)
[figure: map of fixed-level S2 cells colored by building count]

partitions_fixed_pd.hist("n")
[figure: histogram of building counts per fixed-level cell]

Using S2 Cells (adaptive cell level)

A slight variation on this would be to subdivide cells with too many buildings into their four S2 child cells. Using a recursive common table expression (CTE), we can start with the “fixed cell” approach and roll up the results of that for progressively coarser cell levels.

max_level = 10
min_level = 1
max_rows = 1_000_000

partition_adaptive_options = con.sql(f"""
WITH RECURSIVE s2tree(cell, level, n) AS (
-- Seed: count buildings in every cell at the finest level we'll consider
SELECT
    geom.st_aswkb().s2_arbitrarycellfromwkb().s2_cell_parent({max_level}) AS cell,
    {max_level} AS level,
    count(*) AS n
FROM 'microsoft-buildings-point.parquet'
GROUP BY
    cell

UNION ALL

-- Recurse: roll the counts up into parent cells, one level coarser per pass
SELECT
    cell.s2_cell_parent(level::TINYINT - 1) AS cell,
    level - 1 AS level,
    sum(n) AS n
FROM s2tree
WHERE level > {min_level}
GROUP BY level, cell.s2_cell_parent(level::TINYINT - 1)
)

-- Keep cells small enough to be partitions, plus seed-level cells that are
-- too big but can't be subdivided further here. Also record each cell's
-- immediate parent for the deduplication step below.
SELECT
    s2_cell_parent(cell, -1) as parent_cell,
    *
FROM s2tree
WHERE n <= {max_rows} OR (level == {max_level})
""")

The way I’ve written this, it returns all hypothetical partitions without accounting for duplicates (i.e., a parent cell listed alongside the four child cells that were rolled up into it). To get rid of these, we can use an anti-join to keep only the cells whose parent is not also listed:

all_cells = partition_adaptive_options.select(parent_cell=_.cell)
partitions_adaptive = partition_adaptive_options.anti_join(all_cells, "parent_cell").drop("parent_cell")
con.create_view("partitions_adaptive", partitions_adaptive, overwrite=True)
partitions_adaptive_pd = con.sql("""
SELECT *, s2_aswkb(cell::GEOGRAPHY).st_geomfromwkb() as geom
FROM partitions_adaptive
""").to_pandas()

This took quite a lot of fiddling to get right but ended up running in about 15 seconds. Let’s take a look at the map!

partitions_adaptive_pd.plot(column="n", legend=True)
[figure: map of adaptive S2 cell partitions colored by building count]

We get a fairly nice variation in cell size, with smaller cells in more populated places.

partitions_adaptive_pd.n.hist()
[figure: histogram of building counts per adaptive partition]

This does give a fairly wide variation in sizes: we get some very small partitions around the edges, and otherwise our number of rows can vary by a factor of at least four (since a cell with juuust over a million points gets subdivided into four children, and those children won’t necessarily contain 250,000 points each).
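
To put numbers on that spread, we can summarize the partition sizes directly (using the partitions_adaptive_pd frame from above):

# Summary statistics for rows-per-partition; expect min and max to differ
# by well over a factor of four
partitions_adaptive_pd.n.describe()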

Sorting by S2 Cell

A GeoHash sort is the way that Apache Sedona recommends in its documentation. This would be similar to sorting by S2 cell, although a GeoHash uses a slightly different mechanism for its space-filling curve. To keep the timing comparisons fair (and because I’m not quite as good at Sedona yet), I’m going to sort by S2 cell here.
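
For reference, the Sedona version might look something like this (a sketch I haven’t run, assuming a configured Spark session, a geom column, and Sedona’s documented ST_GeoHash(); the precision and paths are made up):

from pyspark.sql.functions import expr
from sedona.spark import SedonaContext

# A sketch of the GeoHash sort that Sedona's documentation recommends
# (assumes a Spark cluster is available; untested here)
spark = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(spark)

buildings = sedona.read.format("geoparquet").load("microsoft-buildings-point.parquet")
(
    buildings
    .withColumn("geohash", expr("ST_GeoHash(geom, 12)"))
    .sort("geohash")
    .drop("geohash")
    .write.format("geoparquet")
    .save("buildings-sorted")
)

With that noted, here’s the S2 version: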

partition_size = 1_000_000

partitions_sorted = con.sql(f"""
WITH enumerated AS (
    SELECT
        -- Number the rows in S2 cell order, then assign consecutive chunks
        -- of rows to each partition
        row_number() OVER (ORDER BY geom.st_aswkb().s2_arbitrarycellfromwkb()) AS row_id,
        row_id // {partition_size} AS partition_id,
        geom
    FROM 'microsoft-buildings-point.parquet'
)

SELECT partition_id, st_extent_agg(geom) AS geom FROM enumerated GROUP BY partition_id;
""")

partitions_sorted_pd = partitions_sorted.to_pandas()
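
If you’d like to double-check the partition sizes, a quick variation on the query above counts rows per partition (the first and last partitions hold whatever the division leaves over):

# Count rows per partition; nearly all partitions should contain exactly
# partition_size rows
con.sql(f"""
WITH enumerated AS (
    SELECT
        row_number() OVER (ORDER BY geom.st_aswkb().s2_arbitrarycellfromwkb()) AS row_id,
        row_id // {partition_size} AS partition_id
    FROM 'microsoft-buildings-point.parquet'
)
SELECT partition_id, count(*) AS n FROM enumerated GROUP BY partition_id ORDER BY n LIMIT 5
""")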

This sort took about a minute. We know in this case that there are exactly a million points in each partition, but let’s see how they look on a map:

partitions_sorted_pd.plot(edgecolor="black", facecolor="none")
[figure: map of bounding boxes for the sorted partitions]

While we do get a number of tiny boxes around populated areas, we also have a number of overlapping boxes for any given area, which unfortunately means that for a small search area one might still need to download a number of files.
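
One way to quantify that, assuming partitions_sorted_pd came back as a GeoDataFrame and using a made-up search box (roughly the Chicago area):

import shapely

# Count how many sorted-partition bounding boxes intersect a small,
# hypothetical search area
search_area = shapely.box(-87.9, 41.6, -87.5, 42.0)
print(partitions_sorted_pd.intersects(search_area).sum())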

KD-Tree

Jacob Wasserman of Meta/the Overture Maps Foundation has a very cool bit of code that creates a K-dimensional tree (KD tree) using a recursive common table expression in DuckDB. If we apply that here, we can get a basically constant number of rows AND perfectly non-overlapping boxes. It’s pretty amazing, with the one downside being that it takes (my computer) around 10 minutes.

iteration = 9
con.raw_sql(f"""
COPY(
    WITH RECURSIVE kdtree(iteration, x, y, partition_id) AS (
        SELECT
            0 AS iteration,
            ST_X(geom) AS x,
            ST_Y(geom) AS y,
            '0' AS partition_id
        FROM 'microsoft-buildings-point.parquet'

        UNION ALL

        SELECT
            iteration + 1 AS iteration,
            x,
            y,
            IF(
                IF(MOD(iteration, 2) = 0, x, y) < APPROX_QUANTILE(
                    IF(MOD(iteration, 2) = 0, x, y),
                    0.5
                ) OVER (
                    PARTITION BY partition_id
                ),
                partition_id || '0',
                partition_id || '1'
            ) AS partition_id
        FROM kdtree
        WHERE
            iteration < {iteration}
    )
    SELECT
        x,
        y,
        partition_id
    FROM kdtree
    WHERE
        iteration = {iteration}
)
TO 'partitions.parquet' (FORMAT PARQUET);
""")
partitions_kdtree = con.sql("""
SELECT
    partition_id,
    st_extent_agg(st_point(x, y)) AS geom,
    count(*) AS n
FROM "partitions.parquet"
GROUP BY partition_id
""")

partitions_kdtree_pd = partitions_kdtree.to_pandas()

From here, we can plot our boxes. This does seem to produce somewhat oblong boxes in some places, but they are undeniably denser around cities, which is what we’re going for!

partitions_kdtree_pd.plot(edgecolor="black", facecolor="none")
[figure: map of KD tree partition boxes]

Finally, we can check the distribution of rows in each partition. The spread here is very tiny (because we quite literally used the 50th percentile to split each box!).

partitions_kdtree_pd.n.hist()
[figure: histogram of rows per KD tree partition]

R-Tree

I played around a bit with trying to use the R tree generated by DuckDB Spatial to generate some partition options, but I wasn’t able to extract any parent/child information, and the splitting seemed to show a strong preference for north/south splits with quite a bit of overlap between boxes. The index took about a minute to build.

con.raw_sql(f"""
CREATE OR REPLACE TABLE buildings AS SELECT geom FROM 'microsoft-buildings-point.parquet';
CREATE INDEX buildings_idx ON buildings USING RTREE (geom) WITH (max_node_capacity = 16);
""")
con.sql("""
SELECT level, bounds::GEOMETRY as geom
FROM rtree_index_dump('buildings_idx')
WHERE level < 1
""").to_pandas().plot(edgecolor="black", facecolor="none")
[figure: map of top-level R-tree node bounds]
