Wrangling and joining 130M points...in 2026

About a year and a half ago I wrote a post on “wrangling” and joining 130 million points using various open source tools. Shortly thereafter I joined Wherobots where I get to spend most of my time building SedonaDB, a tool purpose-built to make exactly that sort of wranging and joining effortless. Until today I had never checked back to the query that got me thinking about all of this to see if it was true.

The TL;DR is that…it is, and this is a very short post.

# pip install "apache-sedona[db]"
import sedona.db

sd = sedona.db.connect()
sd.options.interactive = True

I checked the summary on my initial post, and it looks like the fastest version I was able to come up with was somewhere around 3-5 minutes with quite a bit of explicit wranging/tuning and 50 GB of memory involved.

If you have the files locally (as I did in the original post), SedonaDB 0.3.0 executes the join in 6 seconds on my laptop (Apple M4, 24G RAM) without any tuning or wrangling.

sd.read_parquet(
    "microsoft-buildings-point.parquet"
).to_view("buildings", overwrite=True)
sd.read_parquet(
    "https://github.com/paleolimbot/2025-02-25_duckdb-geography-geopython/raw/refs/heads/main/us-zip-codes.parquet"
).to_memtable().to_view("zipcodes", overwrite=True)
df = sd.sql("""
SELECT
    zipcodes.zipcode,
    COUNT(*) AS building_count
FROM buildings
JOIN zipcodes
  ON ST_Contains(zipcodes.geom, buildings.geom)
GROUP BY zipcodes.zipcode
ORDER BY building_count DESC
""").to_memtable()

res = df.to_memtable()
res.show(5)
┌─────────┬────────────────┐
│ zipcode ┆ building_count │
│   utf8  ┆      int64     │
╞═════════╪════════════════╡
│ 60629   ┆          39767 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77449   ┆          38128 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77494   ┆          37960 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 60617   ┆          37125 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 32162   ┆          34956 │
└─────────┴────────────────┘

You can even leave the files on the server and it still executes 2-3x faster than the next fasest version I was able to come up with in that post (38 seconds, although this is mostly dependent on your internet connection).

sd.read_parquet(
    "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point.parquet"
).to_view("buildings", overwrite=True)
sd.read_parquet(
    "https://github.com/paleolimbot/2025-02-25_duckdb-geography-geopython/raw/refs/heads/main/us-zip-codes.parquet"
).to_view("zipcodes", overwrite=True)

# With https:// remote buildings data: 38s
df = sd.sql("""
SELECT
    zipcodes.zipcode,
    COUNT(*) AS building_count
FROM buildings
JOIN zipcodes
  ON ST_Contains(zipcodes.geom, buildings.geometry)
GROUP BY zipcodes.zipcode
ORDER BY building_count DESC
""").to_memtable()

res = df.to_memtable()
res.show(5)
┌─────────┬────────────────┐
│ zipcode ┆ building_count │
│   utf8  ┆      int64     │
╞═════════╪════════════════╡
│ 60629   ┆          39767 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77449   ┆          38128 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77494   ┆          37960 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 60617   ┆          37125 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 32162   ┆          34956 │
└─────────┴────────────────┘

Since that post, DuckDB spatial (1.5.2 at this writing) has also come a long way and no longer needs any wrangling extra wrangling either. The ST_Contains() version of the query takes ~1m 30s; however, the ST_DWithin(..., 0) approximation is much faster (13 seconds).

import duckdb

duckdb.load_extension("spatial")

duckdb.read_parquet(
    "microsoft-buildings-point.parquet"
).to_view("buildings")
duckdb.read_parquet(
    "https://github.com/paleolimbot/2025-02-25_duckdb-geography-geopython/raw/refs/heads/main/us-zip-codes.parquet"
).to_table("zipcodes")

# Same query as above (1m 30s)
duckdb.sql("""
SELECT
    zipcodes.zipcode,
    COUNT(*) AS building_count
FROM buildings
JOIN zipcodes
  ON ST_Contains(zipcodes.geom, buildings.geom)
GROUP BY zipcodes.zipcode
ORDER BY building_count DESC
""").to_table("result")

duckdb.table("result").limit(5)

# DuckDB spatial has a faster version of DWithin
duckdb.sql("DROP TABLE result")
duckdb.sql("""
SELECT
    zipcodes.zipcode,
    COUNT(*) AS building_count
FROM buildings
JOIN zipcodes
  ON ST_DWithin(zipcodes.geom, buildings.geom, 0)
GROUP BY zipcodes.zipcode
ORDER BY building_count DESC
""").to_table("result")

duckdb.table("result").limit(5)
┌─────────┬────────────────┐
│ zipcode │ building_count │
│ varchar │     int64      │
├─────────┼────────────────┤
│ 60629   │          39767 │
│ 77449   │          38128 │
│ 77494   │          37960 │
│ 60617   │          37125 │
│ 32162   │          34956 │
└─────────┴────────────────┘

If you’re a nerd who cares about the details, this is probably because the ST_Contains join in DuckDB spatial still uses GEOS, and can’t reuse prepared geometries on the build side of the join becuase they are not thread safe. SedonaDB evaluates ST_Contains with tg, which can effectively query prepared geometries from multiple threads. This is a point-in-polygon check where the preparedness of the polygon matters quite a lot (and the ability to probe them in parallel without a mutex also has a huge impact on performance).

For completeness, the SedonaDB version of the DWithin join runs in 35s and uses a custom port of the rustgeo/geo distance calculation to use geo-traits, which runs the algorithms directly on WKB instead of making a copy first. This doens’t invoke any version of preparedness (I am not sure if DuckDB’s version does).

res = sd.sql("""
SELECT
    zipcodes.zipcode,
    COUNT(*) AS building_count
FROM buildings
JOIN zipcodes
  ON ST_DWithin(zipcodes.geom, buildings.geom, 0)
GROUP BY zipcodes.zipcode
ORDER BY building_count DESC
""").to_memtable()

res.show(5)
┌─────────┬────────────────┐
│ zipcode ┆ building_count │
│   utf8  ┆      int64     │
╞═════════╪════════════════╡
│ 60629   ┆          39767 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77449   ┆          38128 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77494   ┆          37960 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 60617   ┆          37125 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 32162   ┆          34956 │
└─────────┴────────────────┘

It’s an exciting time to be working in spatial!

Software Engineer & Geoscientist