Wrangling and joining 130M points with DuckDB + the open source spatial stack

A little over two years ago I wrote about point-in-polygon joins in R using an example of parking violations in Philadelphia. That example has been written about a few times since: there’s a GPU version, a post using it to tune PostGIS, a GeoPandas version for comparison, and a version comparing ArcPro, QGIS, PostGIS, and Manifold.

I don’t want to downplay the parking example because it’s a realistic representation of problems on the larger end of the normal GIS realm. For many general-purpose data engineers, though, 9 million parking tickets is a drop in the bucket and the point-in-polygon counting might just be one step in a long pipeline of transformations. In particular, the problem has a crucial simplification: it fits comfortably in memory on just about any recent laptop.

Let’s take the same problem up a notch but make it big enough that the naive approach will exhaust the memory of the average day-to-day laptop. By naive approach, I mean an approach that doesn’t involve any kind of explicit data format choices, chunking, indexing, or parallelism: these are the kinds of games you have to play when the size of your data approaches or gets bigger than memory. Let’s find a pretty big dataset and let the games begin!

The Data

For this post I’ll use the Microsoft U.S. buildings dataset as published on the GeoArrow Data page in FlatGeoBuf format. Roughly, this is 130 million points (the building centroids) from the “lower 48” (i.e., not Alaska or Hawaii). The zipcode polygons are ZIP Code Tabulation Areas from the U.S. Census Bureau, and there are about 33,000 of them. (At the end of this post there’s a recipe for how to get the data if you’re itching to give any of this a try.)

Basically, we’ll be doing the same thing as the Philadelphia parking violation example: count points by polygon (in this case, make a list of zip codes and how many buildings are contained within each).
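For a sense of what “naive” means here, the purely in-memory version of this task would look roughly like the sketch below (a sketch only: the file names are the ones set up at the end of the post, and on 130 million points this needs far more RAM than a typical laptop has).

import geopandas

# Read both layers fully into memory and spatially join: fine for 9 million
# parking tickets, not so fine for 130 million building centroids.
zipcodes = geopandas.read_file("us-zip-codes/cb_2020_us_zcta520_500k.shp")
buildings = geopandas.read_file("microsoft-buildings-point.fgb")

# Make sure both layers share a CRS before joining
zipcodes = zipcodes.to_crs(buildings.crs)

counts = (
    geopandas.sjoin(buildings, zipcodes, predicate="within")
    .groupby("ZCTA5CE20")
    .size()
)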

The Results

This is a pretty long post, so I’ll put the results up top. If you’d rather be surprised, feel free to skip ahead and scroll back later!

First of all, the winners here are really Ibis, DuckDB, Arrow, and GeoParquet. Those tools shaved huge quantities of time off every “wrangling” step. DuckDB provided a way to automatically parallelize most of the wrangling operations without me having to think about any of it; Ibis meant I didn’t have to remember all the SQL needed for common wrangling tasks; Arrow and GeoParquet were under the hood making sure Geometry stayed Geometry, so I didn’t have to think too hard about writing to/reading from a binary column every time. There are some geospatial types and operators that I wish Ibis knew about, and until DuckDB spatial can propagate a CRS it won’t be useful for the general case, but those are minor gripes about the huge amount of collective work it took to get to this wonderful place we are now in.

Next, the results. From fastest to slowest on a nice but sort of old laptop (total time, including wrangling): DuckDB Geography (5.5 minutes), GeoPandas the easy way (6.5 minutes), GeoPandas the hard way (11 minutes), PostGIS (13 minutes).

From fastest to slowest on a Windows machine with more cores and 128 GB of RAM: GeoPandas the easy way (2 minutes), DuckDB Geography (2.5 minutes), PostGIS (6.75 minutes), Apache Sedona (23 minutes), GeoPandas the hard way (>30 minutes, something about GDAL/Windows here is funky).

QGIS:

  • Wrangling time: 9 minutes (MacOS M1)
  • Run time: 17 minutes (MacOS M1)
  • Parallelism: Single core only
  • Memory requirement: 200 MB
  • Tweaking required: Pretty much none
  • Bonus points: For the accurate progress bar!

PostGIS:

  • Wrangling time: 7 minutes (MacOS M1), 5 minutes (beefy Windows)
  • Run time: 6 minutes (MacOS M1), 1.75 minutes (beefy Windows)
  • Parallelism: Runtime seems to be ~linear on number of logical cores
  • Memory requirement: Respects a set memory limit (but seems to be much faster on a machine with virtually unconstrained memory). The peak usage of the Docker container running PostGIS seemed to be about 16 GB if unconstrained.
  • Tweaking required: Requires knowing the wrangling, indexing, and global configuration tricks to force the optimal query plan.
  • Bonus points: For not requiring any fancy SQL tricks for the actual query!

GeoPandas (the hard way):

  • Wrangling time: 9 minutes (MacOS M1) (needs a GDAL-readable file format with an index)
  • Run time: 2 minutes (MacOS M1), 20 minutes (Windows)
  • Parallelism: Configurable (can scale to more cores)
  • Memory requirement: 200 MB
  • Tweaking required: Requires knowing the exact query plan you want to execute (and seems to take strangely long on Windows).
  • Bonus points: For the lowest memory requirement!

GeoPandas (the easy way):

  • Wrangling time: 5 minutes (MacOS M1), 1.5 minutes (beefy Windows)
  • Run time: 1.25-1.5 minutes (MacOS M1), 19 seconds (beefy Windows)
  • Parallelism: Configurable (scales to more cores subject to the limitation of a Python thread pool)
  • Memory requirement: 40 GB (but much of this can be virtual memory and it seems to run on a laptop with 16 GB physical memory)
  • Tweaking required: Requires knowing the exact query plan you want to execute
  • Bonus points: For being the fastest runtime with sufficient memory!

DuckDB Geography:

  • Wrangling time: Integrated into the runtime (although only a few seconds of explicit wrangling was required).
  • Run time: 5.5 minutes (MacOS M1), 2.5 minutes (beefy Windows)
  • Parallelism: DuckDB effectively saturated all cores for most of the join, and this probably scales such that it would be correspondingly faster for more cores.
  • Memory requirement: DuckDB respects the memory limit you set (although I am guessing this will take quite a long time to complete if there is little memory available based on observing the temporary storage in the .tmp directory get used a fair amount). On a machine with virtually unconstrained memory, the peak usage from the DuckDB process was about 16 GB.
  • Tweaking required: Requires knowing intimate details of S2 cell math (hopefully we can remove this requirement and do this under the hood in the future!)
  • Bonus points: For requiring zero (explicit) wrangling time!

Apache Sedona:

  • Wrangling time: 30 seconds (MacOS M1), 15 seconds (beefy Windows)
  • Run time: 23 minutes (beefy Windows)
  • Parallelism: Scales to infinity (because Sedona/Spark can use large distributed clusters)
  • Memory requirement: Peak usage seemed to be about 16 GB
  • Tweaking required: Requires knowing some things about Spark clusters if you want this to be fast/scale
  • Bonus points: For the ability to scale to more than one node in a compute cluster!

Ok…read on for the full story!

The Tools

For this post I’ll be using a few tools:

  • DuckDB, for reading/writing various file formats (via the spatial extension, which provides bindings to GDAL)
  • Ibis, so that we can write as little SQL as possible
  • The brand new/in-progress DuckDB Geography extension, because I just wrote it and am itching to give it a try on a real problem!
  • PostGIS, which is one of the go-to tools when data gets too unwieldy for a purely in-memory solution.
  • Apache Sedona, another go-to tool for scaling spatial to the next level.
  • GeoPandas, for a test of the purely in-memory solution.
  • QGIS’s command-line tools, which provide an implementation of the operation we’re about to try that doesn’t involve loading the whole dataset into memory.

The Hardware

I’ll discuss timings with respect to two different machines:

  • MacOS M1: My somewhat old day-to-day laptop, a first generation MacBook Air with 16 GB of RAM and 8 cores.
  • “Beefy Windows”: A Windows machine with an AMD Ryzen 5 with 12 physical/24 logical cores and 128 GB of RAM. Mostly this helped identify the engines that relied on loading everything into RAM under the hood.

Wrangling the data

One of the fun parts of dealing with a lot of points is that they work rather badly with spatial file formats that are mostly designed to store arbitrary “geometry” and aren’t optimized for the point use case. The buildings data is provided as a 9.7 GB FlatGeoBuf file (~2 GB zipped); however, many of the tools we’re about to use are faster if we start with Parquet.

First, I’ll set up the Ibis session as our database frontend (to DuckDB and other things later on).

import ibis
from ibis import _

ibis.options.interactive = True
# Because the geography extension isn't published yet and we'll use it!
con = ibis.duckdb.connect(allow_unsigned_extensions=True)
con.load_extension("spatial")

First we’ll prep the zipcodes table in FlatGeoBuf and Parquet. If we had an Arrow-enabled GDAL build here we could use ogr2ogr to do all of this, but for that to work you’d need a conda or Homebrew GDAL (i.e., not the run-of-the-mill version installed by most Linux distributions). DuckDB’s spatial extension can do this for us too, though, so we’ll just use that for now. This is pretty reasonable (it takes 1 second, and both files are about 100 MB).

con.raw_sql(
    """
    COPY (
        SELECT ZCTA5CE20 as zipcode, geom
        FROM 'us-zip-codes/cb_2020_us_zcta520_500k.shp'
    ) TO 'us-zip-codes.fgb' WITH (FORMAT GDAL, DRIVER FlatGeoBuf)
    """
)

con.raw_sql(
    """
    COPY (
        SELECT ZCTA5CE20 as zipcode, geom
        FROM 'us-zip-codes/cb_2020_us_zcta520_500k.shp'
    ) TO 'us-zip-codes.parquet'
    """
)

! ls -lh us-zip-codes.*
-rw-r--r--  1 deweydunnington  staff    95M Dec  1 22:07 us-zip-codes.fgb
-rw-r--r--  1 deweydunnington  staff    92M Dec  1 22:07 us-zip-codes.parquet

Next, we’ll make a Parquet file from the FlatGeoBuf of the building centroids.

con.raw_sql(
    """
    COPY (
        SELECT geom,
        FROM 'microsoft-buildings-point.fgb'
    ) TO 'microsoft-buildings-point.parquet'
    """
)
<duckdb.duckdb.DuckDBPyConnection at 0x10bcd66f0>

This takes about 2.5 minutes (9 minutes on Windows), in part because the way this is wired together seems to use only a single thread for most of the scanning. The slowness on Windows is, I think, GDAL-related, but I’m not entirely sure.

One particularly nice part about the GeoParquet output is that it’s significantly smaller than the FlatGeoBuf!

! ls -lh microsoft-buildings-point.*
-rw-r--r--  1 deweydunnington  staff   9.7G Sep 28  2023 microsoft-buildings-point.fgb
-rw-r--r--  1 deweydunnington  staff   2.0G Nov 28 22:42 microsoft-buildings-point.parquet

We’re not quite done with the wrangling (we’ll need to do some extra wrangling to get the most out of some engines), but for now we’re ready to get going!

QGIS

I’m going to start with the Desktop GIS solution because it’s point-and-shoot: we want to count points in polygons, and QGIS has a “point in polygon” counter. One slight hiccup is that we need to write some kind of file with a spatial index: the FlatGeoBuf we’ve been given doesn’t have one, and if you try to run the tool without an index on the points layer you’ll get a stern WARNING: No spatial index exists for points layer, performance will be severely degraded. DuckDB spatial can do this for us (so can ogr2ogr, using pretty much the same mechanism). Because GDAL’s default layer creation option for SPATIAL_INDEX is YES, we can just read and then write.

con.raw_sql(
    """
    COPY (
        SELECT geom from 'microsoft-buildings-point.parquet'
    ) TO 'microsoft-buildings-point-indexed.fgb' WITH (FORMAT GDAL, DRIVER FlatGeoBuf)
    """
)
<duckdb.duckdb.DuckDBPyConnection at 0x256c1d3f1f0>

This took DuckDB about 9 minutes (~17 minutes on Windows, I think for the same GDAL-related reason). One consequence is that the FlatGeoBuf output (i.e., with the index included) is 15 GB!

! ls -lh microsoft-buildings-point*
-rw-r--r--  1 deweydunnington  staff    15G Nov 28 23:30 microsoft-buildings-point-indexed.fgb
-rw-r--r--  1 deweydunnington  staff   9.7G Sep 28  2023 microsoft-buildings-point.fgb
-rw-r--r--  1 deweydunnington  staff   2.0G Dec  1 22:12 microsoft-buildings-point.parquet

You can run the tool from the Processing Toolbox too, but for reproducibility’s sake, we’ll use a shell command here:

qgis_process run native:countpointsinpolygon \
  --POLYGONS='us-zip-codes.fgb' \
  --POINTS='microsoft-buildings-point-indexed.fgb' \
  --OUTPUT='qgis/us-zip-codes-counts.fgb'

Let’s spot-check the result for one zip code to compare our later results for general consistency:

con.sql(
    """
    SELECT NUMPOINTS FROM 'qgis/us-zip-codes-counts.fgb' WHERE zipcode = '05753'
    """
)
┏━━━━━━━━━━━┓
┃ NUMPOINTS ┃
┡━━━━━━━━━━━┩
│ float64   │
├───────────┤
│    4954.0 │
└───────────┘
con.sql("""SELECT sum(NUMPOINTS::INTEGER) FROM 'qgis/us-zip-codes-counts.fgb'""")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ sum(CAST(NUMPOINTS AS INTEGER)) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ decimal(38, 0)                  │
├─────────────────────────────────┤
│                       120689758 │
└─────────────────────────────────┘

QGIS was able to complete the join in just over 17 minutes. There didn’t seem to be any indication that any more than a single core was being used and the memory usage seemed to hover around 200 MB.

PostGIS

Next, we’ll give PostGIS a shot. I drew heavily on a Paul Ramsey post and an Arthur Lembo post for the configuration (which I may still have wrong here!). We need some wrangling here, too, because PostGIS needs its geometries in a database. First we need to start up a PostGIS instance. I’m doing this via Docker (ensuring that my Docker settings allow for unconstrained use of my cores and memory):

docker run -it --rm \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_USER=postgres \
  -p 5432:5432 \
  postgis/postgis:latest

DuckDB does have a PostgreSQL integration but doesn’t have quite enough flexibility to copy GeoParquet directly into a PostgreSQL table (or stream the result of a query into a PostgreSQL table, or at least I can’t figure out how to do it!). GDAL’s ogr2ogr can do this [1], but it takes about half an hour to load the buildings file from a FlatGeoBuf and we can do quite a bit better.

The approach I used here was to use DuckDB’s postgres extension to write PostgreSQL COPY binary and insert it using psql. I like this approach because (1) it’s faster than anything else I tried and (2) it doesn’t load the whole buildings file into memory at any given time (i.e., this solution would scale to something that’s actually bigger than memory). It does use a sneaky trick: the COPY binary representation of a GEOMETRY is WKB, so even though DuckDB writes a BLOB column, we can copy it straight into a GEOMETRY column and get exactly what we need. If we didn’t do this, we’d need to issue a separate command to create the GEOMETRY column, which for me took an extra few minutes.

con.load_extension("postgres")
con.raw_sql(
    """
    COPY (
        SELECT zipcode, geom.st_aswkb() as geom from 'us-zip-codes.parquet'
    ) TO "zipcodes.bin" WITH (FORMAT POSTGRES_BINARY)
    """
)

! psql \
   "host=127.0.0.1 user=postgres password=password dbname=postgres" \
   -c "DROP TABLE IF EXISTS zipcodes; CREATE TABLE zipcodes (zipcode VARCHAR, geom GEOMETRY)"
! psql \
   "host=127.0.0.1 user=postgres password=password dbname=postgres" \
   -c "\copy zipcodes FROM 'zipcodes.bin' WITH (FORMAT BINARY)"
DROP TABLE
CREATE TABLE
COPY 33791
con.raw_sql(
    """
    COPY (
        SELECT geom.st_aswkb() as geom from 'microsoft-buildings-point.parquet'
    ) TO "buildings.bin" WITH (FORMAT POSTGRES_BINARY)
    """
)

! psql \
   "host=127.0.0.1 user=postgres password=password dbname=postgres" \
   -c "DROP TABLE IF EXISTS buildings; CREATE TABLE buildings (geom GEOMETRY)"
! psql \
   "host=127.0.0.1 user=postgres password=password dbname=postgres" \
   -c "\copy buildings FROM 'buildings.bin' WITH (FORMAT BINARY)"
DROP TABLE
CREATE TABLE


NOTICE:  table "buildings" does not exist, skipping


COPY 129735970

Collectively, these run for me in less than 2 minutes (same on Windows), with the zipcodes table accounting for less than 3 seconds of that.

I’ll use the Ibis PostgreSQL backend to issue queries to the PostgreSQL connection. First, to check if our zipcodes all made it:

pg_conn = ibis.postgres.connect(user="postgres", password="password", host="127.0.0.1")
pg_conn.sql("SELECT zipcode, st_astext(geom) FROM zipcodes").head(5)
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ zipcode  st_astext                                                                        ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstring                                                                           │
├─────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ 15301  POLYGON((-80.368598 40.218953,-80.354929 40.244065,-80.349107 40.248988,-80.350… │
│ 66112  POLYGON((-94.797741 39.116388,-94.797719 39.128597,-94.789669 39.128587,-94.788… │
│ 28645  POLYGON((-81.755792 35.92534,-81.745947 35.929242,-81.737449 35.941565,-81.7406… │
│ 07502  POLYGON((-74.204425 40.918754,-74.202744 40.920531,-74.201612 40.921717,-74.201… │
│ 15658  POLYGON((-79.34323 40.172011,-79.337496 40.177769,-79.331105 40.179871,-79.3329… │
└─────────┴──────────────────────────────────────────────────────────────────────────────────┘

Next, check the buildings table to make sure they all made it:

pg_conn.sql("SELECT st_astext(geom) FROM buildings").head(5)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ st_astext                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                                       │
├──────────────────────────────────────────────┤
│ POINT(-84.95972352431309 32.421832237128726) │
│ POINT(-84.9597329778881 32.420910155065535)  │
│ POINT(-84.9599375 32.2352595)                │
│ POINT(-84.96016955991914 32.422296538057)    │
│ POINT(-84.9616848201395 32.41927086747914)   │
└──────────────────────────────────────────────┘

For this to be fast we need to build an index on the points, just like we did with QGIS, which for me took ~5 minutes (3 minutes on a beefier Windows machine):

pg_conn.raw_sql(
    """
    CREATE INDEX buildings_idx ON buildings USING GIST (geom);
    """
)

We’re good to go! Let’s roll:

pg_conn.raw_sql("""SET min_parallel_table_scan_size = '1kB'""")
pg_conn.raw_sql("""SET max_parallel_workers_per_gather = 12""")
pg_conn.sql(
    """
    SELECT zipcodes.zipcode, count(*) as numpoints
    FROM zipcodes
    JOIN buildings ON ST_Contains(zipcodes.geom, buildings.geom)
    GROUP BY zipcodes.zipcode
    """
).to_csv("postgis/result.csv")

This completes for me in 6 minutes (1 minute 45 seconds on a beefier Windows machine). Even though I set the number of workers to 12, I never saw more than 300% CPU usage (suggesting each worker wasn’t saturating a core due to IO constraints, or that fewer than 12 workers were used). Let’s spot check a zipcode and the total count!

con.read_csv("postgis/result.csv")\
    .filter(_.zipcode == "05753")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ zipcode  numpoints ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ stringint64     │
├─────────┼───────────┤
│ 05753  4954 │
└─────────┴───────────┘
con.read_csv("postgis/result.csv")\
    .aggregate(sum=_.numpoints.sum())
┏━━━━━━━━━━━┓
┃ sum       ┃
┡━━━━━━━━━━━┩
│ int64     │
├───────────┤
│ 129603206 │
└───────────┘

Interestingly, we get a different total count. I’m wondering if QGIS uses intersection rather than containment here, which might account for the difference if there were building centroids close enough to the vertices/edges of the polygons.
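One way to put a number on how much the predicate could matter would be to count the centroids that intersect a zipcode without being contained by it (i.e., points sitting exactly on a boundary). This is just a sketch of that check, not something I ran for the post, and on 130 million points it would take a while:

pg_conn.sql(
    """
    SELECT count(*) AS boundary_points
    FROM zipcodes
    JOIN buildings
      ON ST_Intersects(zipcodes.geom, buildings.geom)
     AND NOT ST_Contains(zipcodes.geom, buildings.geom)
    """
)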

GeoPandas the hard way

I say “the hard way” here because we’re going to basically reconstruct the nice query plan that PostGIS calculated for us, but use GeoPandas to do the execution instead. The query plan is basically:

  • Split zipcodes into num_workers pieces
  • On each worker, loop through zipcode polygons one at a time, querying the point index for each polygon.
  • After getting a list of candidate points, check each of them to make sure they’re actually contained within the polygon (and then count the ones that really were contained).

This is rather specific to this exact task, but if you’re throwing a lot of data at a problem you might need a specialized solution.

This solution shares the requirement that we have a file format with a spatial index that GDAL understands. This is because GeoPandas reads using GDAL/OGR and lets us pass a spatial filter through (that makes use of the index). We’ll reuse microsoft-buildings-point-indexed.fgb but make sure to count the 9 minutes it took to get our data into a format that works here.

The heart of this solution is a single Python function that quite literally counts points for a single polygon. The gymnastics of writing the function to a file and reimporting it works around how the ProcessPoolExecutor interacts with Jupyter and probably isn’t necessary in a standalone Python script.

import geopandas

buildings_fun = """
import geopandas
import warnings
warnings.filterwarnings('ignore') # Ignore warnings about "read 0 features"

def count_buildings(zipcode_feat):
    points = geopandas.read_file(
        'microsoft-buildings-point-indexed.fgb',
        bbox=zipcode_feat.bounds
    )

    query = points.geometry.sindex.query(
        zipcode_feat,
        predicate="contains",
        sort=False
    )

    return len(query)
"""

with open("count_buildings.py", "w") as f:
    f.write(buildings_fun)

from count_buildings import count_buildings

zipcodes_df = geopandas.read_file("us-zip-codes.fgb")

from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor(max_workers=8) as executor:
    zipcodes_df["building_count"] = list(executor.map(count_buildings, zipcodes_df.geometry))

This finishes for me in 2 minutes (~20 minutes on Windows for some reason) and scales almost perfectly with the number of workers (provided you have enough cores to house them).

Let’s do our spot check:

zipcodes_df["building_count"][zipcodes_df.zipcode == "05753"]
10930    4954
Name: building_count, dtype: int64
zipcodes_df["building_count"].sum()
np.int64(129603206)

This exactly matches PostGIS!

GeoPandas the easy way

I’m calling this one “the easy way” because it’s a bit more point-and-shoot: we’re just going to load the microsoft-buildings-point file straight into memory and build the index there. That said, it’s easy only if you have a huge quantity of memory available to you. This doesn’t happen to crash on my laptop (16 GB physical memory), but I’ve had a slightly bigger example crash, and it does make the computer almost unusable while it runs (because it pushes a huge amount of memory out to swap).

First, we’ll load the files. I found a slight advantage to GeoParquet vs. FlatGeoBuf on the read (~15% faster), but this might have been due to the extreme memory pressure (after the read, the Python process was using 28 GB of memory and there’s only 16 GB on my laptop).

import geopandas

zipcodes_df = geopandas.read_file("us-zip-codes.fgb")
buildings_df = geopandas.read_parquet("microsoft-buildings-point.parquet")

The read takes 2.5 minutes (44 seconds on a beefy Windows machine), with 28 GB of memory usage on both.

sindex = buildings_df.geometry.sindex

The index build takes 2.5-3 minutes (48 seconds on a beefy Windows machine), with memory usage peaking around 40 GB. It’s just a guess, but I think this timing depends considerably on how much swapping has to happen, which can vary if you have less than 28 GB of physical memory.

Next, we’ll set up the spatial index query attempting to use a bit of parallelism. The approach here is to chunk the zipcodes into reasonably large chunks (12 here) and exploit the fact that sindex.query() can take a vector of geometries.

import pandas as pd

zipcode_chunks = []
chunk_size = len(zipcodes_df.geometry) // 11
n_chunks = 11 + 1
for i in range(n_chunks):
    start = i * chunk_size
    end = start + chunk_size
    chunk = zipcodes_df.geometry.iloc[start:end]
    zipcode_chunks.append(chunk)

def count_buildings_chunk(chunk):
    result = sindex.query(chunk, predicate="contains")
    return pd.Index(pd.Series(chunk.index).iloc[result[0]]).value_counts(sort=False)

Finally, we can set up the executor and run the join.

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    chunk_results = list(executor.map(count_buildings_chunk, zipcode_chunks))
    zipcodes_df["building_count"] = pd.concat(chunk_results)

This finishes for me in 1.25-1.5 minutes (19 seconds on a beefy Windows machine). Let’s spot check!

zipcodes_df["building_count"][zipcodes_df.zipcode == "05753"]
10930    4954.0
Name: building_count, dtype: float64
zipcodes_df["building_count"].sum()
np.float64(129603206.0)

DuckDB Geography

I have to admit that this is the one I’m the most excited about, but it’s also not entirely practical at the moment since the DuckDB Geography extension is brand new and not easily installable (e.g., as of this writing you’ll have to grab a binary from the latest CI job, extract/copy it into your working directory and install it from a local file).

con.raw_sql("FORCE INSTALL 'geography.duckdb_extension'")
con.load_extension("geography")

First, we’ll read the zipcodes into a table where the “geometry” column is actually a GEOGRAPHY type (as defined by the extension). We’ll use s2_prepare() to take advantage of a special “prepared” flavor of GEOGRAPHY, which re-uses quite a bit of work that would otherwise be repeated for every point-in-polygon test that’s about to happen.

The key part that will let this join compete with the explicitly indexed solutions is the “covering”. This “covering” is a collection of discrete S2 cells, each represented by a 64-bit integer, that together cover the polygon. It takes a bit of effort in advance to calculate these, but DuckDB can blow through all of it in 7 seconds.

con.raw_sql("""
    CREATE OR REPLACE TABLE zipcodes AS
    SELECT
        zipcode,
        geom.st_aswkb().s2_geogfromwkb().s2_prepare() as geog,
        geog.s2_covering() as covering
    FROM "us-zip-codes.parquet"
""")
<duckdb.duckdb.DuckDBPyConnection at 0x2093af5e6f0>

Next, we’ll prepare the two sides of the join as VIEWs. Creating them as VIEWs is free (it takes zero seconds) but lets the actual join be a bit more concise. One of the key parts of this is the “unnesting” of the covering, which gives one row for each candidate S2 cell, keeping a copy of the original geography in each row.

For the buildings layer, we’ll approximate each building location as an S2 cell as well. These cells are “leaf” cells, or cells at the highest level (30). Leaf S2 cells have a resolution of a few cm, which is fine for us since the accuracy of the source data was almost certainly not that high to begin with.

con.raw_sql("""
    CREATE OR REPLACE VIEW zipcodes_unnested AS
    SELECT
        zipcode,
        geog,
        unnest(covering) as cell
    FROM zipcodes
""")

con.raw_sql("""
    CREATE OR REPLACE VIEW buildings AS
    SELECT
        geom.st_aswkb().s2_cellfromwkb() as cell
    FROM "microsoft-buildings-point.parquet"
""")
<duckdb.duckdb.DuckDBPyConnection at 0x2093af5e6f0>

Finally, the join. This is a range inequality join doing something similar to the index queries above: instead of using an explicit spatial index, we exploit the fact that checking whether one S2 cell contains another is a simple range check on the cell IDs. That handles the “preselection”, so the expensive WHERE clause only has to check points that are likely to be inside the polygon.

con.sql("""
    SELECT
        zipcodes_unnested.zipcode,
        count(*) as numpoints
    FROM
        buildings
    LEFT JOIN zipcodes_unnested ON
        zipcodes_unnested.cell.s2_cell_range_min() <= buildings.cell
        AND zipcodes_unnested.cell.s2_cell_range_max() >= buildings.cell
    WHERE s2_contains(zipcodes_unnested.geog, buildings.cell)
    GROUP BY zipcode
""").to_csv("duckdb-geography/result.csv")

This runs for me in 5.5 minutes (2.5 minutes on a beefier Windows machine). Pretty much all of my cores were saturated the whole time, and DuckDB quite nicely respected the memory limit of my system (actively using the .tmp directory to spill to disk a number of times).
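By default DuckDB picks a memory limit based on system RAM, but if you want to cap it explicitly (or point the spill directory somewhere with more space), both are ordinary settings. A minimal sketch:

# Cap DuckDB's memory and choose where it spills when it hits the cap
con.raw_sql("SET memory_limit = '8GB'")
con.raw_sql("SET temp_directory = 'duckdb_spill.tmp'")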

Let’s spot check!

join_result = con.read_csv("duckdb-geography/result.csv")
join_result.filter(_.zipcode == "05753")
┏━━━━━━━━━┳━━━━━━━━━━━┓
┃ zipcode  numpoints ┃
┡━━━━━━━━━╇━━━━━━━━━━━┩
│ stringint64     │
├─────────┼───────────┤
│ 05753  4954 │
└─────────┴───────────┘
join_result.aggregate(_.numpoints.sum())
┏━━━━━━━━━━━━━━━━┓
┃ Sum(numpoints) ┃
┡━━━━━━━━━━━━━━━━┩
│ int64          │
├────────────────┤
│      129603207 │
└────────────────┘

This one is amusing to me because there’s apparently exactly one building out of 130 million where the approximation of each building centroid as an S2 leaf cell (whose resolution is on average about 1-2 cm) makes a difference. Barring that, it looks like we get the same result.
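If you were curious which zipcode picked up the extra building, one way would be to diff the two result files with Ibis. A sketch (the _pg/_geo renames are just for illustration, and it assumes the counts disagree within a zipcode that appears in both results):

pg = con.read_csv("postgis/result.csv").rename(numpoints_pg="numpoints")
geo = con.read_csv("duckdb-geography/result.csv").rename(numpoints_geo="numpoints")
pg.join(geo, "zipcode").filter(_.numpoints_pg != _.numpoints_geo)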

Apache Sedona

I recently wrote up my adventures getting started with Apache Sedona, and this post calls for revisiting that since the big join is the bread and butter of Sedona’s target workload. I’m not quite to the point where I know how to get Sedona up and running without Docker, so let’s spin up the image and get going!

docker run \
    -e DRIVER_MEM=10G -e EXECUTOR_MEM=50G \
    -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 \
    -v $(pwd):/data \
    --rm \
    apache/sedona:latest

The documentation is careful to mention that the number of partitions can have a huge impact on join performance. Repartitioning from within Spark/Sedona is somewhat expensive (it took me 7 minutes to repartition the zipcodes), so I’ll do it from DuckDB before we hop into Sedona. Here I’m using the geography extension to create the partitions because it’s very fast (30 seconds, or 15 seconds on a beefier Windows machine).

con.load_extension("geography")
con.sql(
    """
    SELECT
        geom.st_aswkb().s2_cellfromwkb().s2_cell_parent(4).s2_cell_token() as partition_cell,
        geom
    FROM 'microsoft-buildings-point.parquet'
    """
).to_parquet("buildings", partition_by="partition_cell", overwrite=True)

Let’s check that we have good spatial partitions!

con.sql(
    """
    SELECT partition_cell, st_extent_agg(geom)
    FROM "buildings/*/*.parquet"
    GROUP BY partition_cell
    """
).to_pandas().plot(edgecolor="black", facecolor="none")
<Axes: >

[Figure: bounding boxes of the building partitions]

Next, let’s hop into Spark/Sedona land and create our session.

from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/01 23:47:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Next, we’ll create our tables. I’m using createOrReplaceTempView() here so that we can use these from SQL in the next step.

sedona.read.format("geoparquet")\
    .load("/data/us-zip-codes.parquet")\
    .createOrReplaceTempView("zipcodes")
sedona.read.format("geoparquet")\
    .load("/data/buildings/*/*.parquet")\
    .createOrReplaceTempView("buildings")

Finally, we type the same join that we did in PostGIS. This builds the index for us automatically, which is very cool.

sedona.sql(
    """
    SELECT zipcodes.zipcode, count(*) as numpoints
    FROM zipcodes
    JOIN buildings ON ST_Contains(zipcodes.geom, buildings.geom)
    GROUP BY zipcodes.zipcode
    """
).write.mode("overwrite").csv("/data/sedona/results.csv")
24/12/02 02:38:01 WARN JoinQuery: UseIndex is true, but no index exists. Will build index on the fly.

This ran for me in 38 minutes with no partitioning and 20 minutes with the buildings partitioned nicely. Partitioning the zipcodes didn’t seem to have an effect.
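I didn’t spot-check this one above, but the same pattern works. Spark’s CSV writer produces a directory of headerless part files by default, so this sketch (assuming the default writer options) globs them and supplies column names and types explicitly:

sedona_result = con.sql(
    """
    SELECT * FROM read_csv(
        'sedona/results.csv/*.csv',
        header = false,
        names = ['zipcode', 'numpoints'],
        types = ['VARCHAR', 'BIGINT']
    )
    """
)
sedona_result.filter(_.zipcode == "05753")
sedona_result.aggregate(_.numpoints.sum())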

Some nitty gritty details

I mostly put this visualization together for myself: for our target zipcode, it shows the S2 cells that make up the covering (used by DuckDB Geography) and the bounding rectangle used by the RTree-based methods (i.e., everything else). In this case, the bounding rectangle is quite a good approximation because the zipcode’s area is squareish. We could configure the covering to contain more cells, or only cells of a certain size, which might improve our timing.

# The polygon we're after
ax = con.read_parquet("us-zip-codes.parquet")\
    .filter(_.zipcode == "05753")\
    .select(_.geom)\
    .to_pandas().plot(edgecolor="black", facecolor="none")

# The "covering" S2 cells (used by duckdb geography)
con.sql(
    """
    SELECT cell.s2_aswkb().st_geomfromwkb() as geom
    FROM zipcodes_unnested
    WHERE zipcode = '05753'
    """
).to_pandas().plot(edgecolor="red", facecolor="none", ax=ax)

# The bounding rectangle (used by everybody else)
con.read_parquet("us-zip-codes.parquet")\
    .filter(_.zipcode == "05753")\
    .mutate(geom=_.geom.envelope())\
    .select(_.geom)\
    .to_pandas().plot(edgecolor="purple", facecolor="none", ax=ax)
<Axes: >

[Figure: the target zipcode (black) with its S2 covering cells (red) and its bounding rectangle (purple)]

The Setup

This post uses data from the GeoArrow data page, which you can download with curl or similar (it’s a ~2 GB zipped download; unzipped it’s the 9.7 GB FlatGeoBuf we saw above).

curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/microsoft-buildings-point.fgb.zip \
    -o microsoft-buildings-point.fgb.zip
unzip microsoft-buildings-point.fgb.zip
rm microsoft-buildings-point.fgb.zip

It also uses a zipcode dataset distributed by the U.S. Census Bureau:

curl -L https://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_zcta520_500k.zip \
    -o cb_2020_us_zcta520_500k.zip
unzip cb_2020_us_zcta520_500k.zip -d us-zip-codes
rm cb_2020_us_zcta520_500k.zip

I used a Python virtual environment created with:

pip install 'ibis-framework[duckdb,postgres,geospatial]' geopandas matplotlib

To use DuckDB geography, you’ll unfortunately need to jump through some hoops until it is a community extension (by the time you read this hopefully it is!). You’ll need to download a version of the extension built for your DuckDB version and platform from the latest CI run, extract it, and then run the SQL INSTALL '/path/to/geography.duckdb_extension'.
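Putting that together with the unsigned-extension connection option from earlier, the local install looks roughly like this (a sketch; the path is wherever you extracted the CI artifact):

import ibis

# The extension isn't signed, so the connection has to allow unsigned extensions
con = ibis.duckdb.connect(allow_unsigned_extensions=True)
# Install from the locally extracted file, then load it
con.raw_sql("INSTALL '/path/to/geography.duckdb_extension'")
con.load_extension("geography")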


  [1] ! ogr2ogr -f PostgreSQL PG:"host=127.0.0.1 user=postgres password=password dbname=postgres" microsoft-buildings-point.parquet -nln buildings was what I timed for this (using Parquet instead of FlatGeoBuf just in case the format is the issue, and I don’t think that it is).
