Lazy GeoParquet reading in SedonaDB, DuckDB, GeoPandas, and GDAL
I’m an obvious fan of GeoParquet and SedonaDB, and it should be no surprise that I worked (and am still working!) pretty hard to make sure SedonaDB could take advantage of all GeoParquet has to offer. This post talks about one of those things: lazy reads (or, if you’re a database nut, “pruning”).
Basically, GeoParquet was designed to allow traditional GIS software to take advantage of a decade-plus of heavy investment in the Parquet format and the software that reads and writes it. Two spot examples: (1) software that reads traditional GIS formats is not particularly good at splitting the work up so that all the cores on your computer are put to good use, and (2) most formats are not particularly good at being plonked onto a web server and having sections of them selectively queried¹ (which can often save data producers from standing up a 24/7 web service).
FlatGeoBuf, of course, is the shining example of being great at being plonked onto a web server in this way, and it’s great as long as your queries are only spatial in nature. If you have a mix of spatial and non-spatial attributes you’d like to query on, the cloud-native features of FlatGeoBuf can’t help you. Parquet is great at being plonked onto a web server but lacks one important feature: a spatial index.
(All of the cloud native features help locally too…reading fewer bytes of data is reading fewer bytes of data no matter where it’s coming from!)
In this post I’ll show the nuts and bolts of how partial reads of Parquet files work for spatial queries and some of the unique awesome we were able to build into SedonaDB to make it magical. The premise is that we’ve got a URL for the amazing (and rather large) Overture Buildings table and a target polygon:
buildings_url = (
    "s3://overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/"
)
target_wkt = (
    "POLYGON ((-73.21 44.03, -73.21 43.98, -73.11 43.97, -73.12 44.03, -73.21 44.03))"
)
We’d like to do something with all of the buildings in that polygon. Easy, right?
tl;dr
The punch line here is that SedonaDB implements GeoParquet file and row group pruning to minimize the amount of data read (in addition to the regular Parquet pruning implemented by DataFusion, on which SedonaDB is built). This is done automatically for every query no matter where you put your ST_Intersects() (or any other spatial predicate) and works with GeoParquet 1.0, 1.1, or the new Parquet geometry/geography type as of the next release. Your files can be https://, s3://, gcs://, or local. The easy version of the query we’re going to deconstruct is:
# pip install "apache-sedona[db]"
import sedona.db
sd = sedona.db.connect()
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/",
    options={"aws.skip_signature": True, "aws.region": "us-west-2"},
).to_view("buildings")
target_buildings = sd.sql(f"""
SELECT * FROM buildings
WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{target_wkt}'), 4326))
""").to_memtable()
…or if you’re an R user, you can do it there too:
# install.packages('sedonadb', repos = c('https://apache.r-universe.dev', 'https://cloud.r-project.org'))
library(sedonadb)
Sys.setenv(
  AWS_SKIP_SIGNATURE = "true",
  AWS_DEFAULT_REGION = "us-west-2"
)
buildings_url <-
  "s3://overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/"
target_wkt <-
  "POLYGON ((-73.21 44.03, -73.21 43.98, -73.11 43.97, -73.12 44.03, -73.21 44.03))"
sd_read_parquet(buildings_url) |> sd_to_view("buildings")
target <- sd_sql(glue::glue("
  SELECT * FROM buildings
  WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{target_wkt}'), 4326))
")) |>
  sd_compute()
This post will do a deep dive into the mechanism used to do this and give examples of how to get this effect using tools like DuckDB, GeoPandas, and GDAL. Buckle up!
Pruning using GeoParquet 1.0 metadata
The Overture Buildings dataset is pretty amazing, and its team (notably Jacob Wasserman) spearheaded the effort to ensure pruning was as efficient as it could be in GeoParquet 1.1. To probe it with a bit more granularity we’ll use pyarrow’s dataset module, as it exposes the “fragments” (which in this case are the individual Parquet files).
from pyarrow import dataset
ds = dataset.dataset(buildings_url)
ds_fragments = list(ds.get_fragments())
len(ds_fragments)
237
As we can see, there are 237 of them. The fragment API also lets us peek into the Parquet metadata for each of them…first of note here is the GeoParquet metadata, which is what makes this file a GeoParquet file to begin with!
import json
json.loads(ds_fragments[0].metadata.metadata[b"geo"])
{'version': '1.1.0',
'primary_column': 'geometry',
'columns': {'geometry': {'encoding': 'WKB',
'geometry_types': ['Polygon', 'MultiPolygon'],
'bbox': [-179.9999966, -84.2945957, -2.8229824, -0.0031377],
'covering': {'bbox': {'xmin': ['bbox', 'xmin'],
'ymin': ['bbox', 'ymin'],
'xmax': ['bbox', 'xmax'],
'ymax': ['bbox', 'ymax']}}}}}
There are two pieces of information here we’ll use to do the filtering. The first is the bbox, and the second is the covering, which takes a bit more effort (but is worth it!).
import shapely
boxes = []
for f in ds_fragments:
    geo_metadata = json.loads(f.metadata.metadata[b"geo"])
    boxes.append(shapely.box(*geo_metadata["columns"]["geometry"]["bbox"]))
import geopandas
geopandas.GeoSeries(boxes).plot(facecolor="none", edgecolor="black")
<Axes: >

From these we can reduce the number of files we need to consider! It turns out there are only two files relevant to our query.
target_box = shapely.from_wkt(target_wkt).envelope
relevant_boxes = [b for b in boxes if target_box.intersects(b)]
ax = geopandas.GeoSeries(boxes).plot(facecolor="none", edgecolor="black")
geopandas.GeoSeries(relevant_boxes).plot(facecolor="none", edgecolor="red", ax=ax)
<Axes: >

Pruning using GeoParquet 1.1 metadata
If the Overture files had been written using GeoParquet 1.0 (which is to say, without the bbox covering column declared in the metadata), two files out of 237 would be the best we could do. We’d scan the geometry column from both files in their entirety, calculate the ST_Intersects() predicate for everything, and return the matches to the user.
Unfortunately, pruning at the file level only gets us so far. Severe performance issues can arise when too many small files are written, so just making the files smaller (such that each covered a smaller set of bounds) isn’t a great option. Fortunately, Parquet has a mechanism to read just part of a file: row groups, whose min/max column statistics have long been used to do similar optimizations for non-geometry data types. The bbox column (a struct with members xmin, ymin, xmax, and ymax) lets us do this for spatial filters because Parquet keeps min and max statistics for each of those members when writing files.
row_group_metadata = ds_fragments[0].row_groups[0].metadata
[
    (
        row_group_metadata.column(i).statistics.min,
        row_group_metadata.column(i).statistics.max,
    )
    for i in range(2, 6)
]
[(-180.0, -135.00344848632812),
(-179.99989318847656, -135.00332641601562),
(-84.29460906982422, -13.186408996582031),
(-84.2940902709961, -13.186331748962402)]
Using this, we can populate another set of boxes, this time one for each row group within the files that we’ve preselected.
row_group_boxes = []
relevant_fragments = [
    ds_fragments[i] for i, b in enumerate(boxes) if target_box.intersects(b)
]
for f in relevant_fragments:
    for row_group_metadata in f.row_groups:
        xmin_stats, xmax_stats, ymin_stats, ymax_stats = [
            (
                row_group_metadata.metadata.column(i).statistics.min,
                row_group_metadata.metadata.column(i).statistics.max,
            )
            for i in range(2, 6)
        ]
        box = shapely.box(xmin_stats[0], ymin_stats[0], xmax_stats[1], ymax_stats[1])
        row_group_boxes.append(box)
ax = geopandas.GeoSeries(relevant_boxes).plot(facecolor="none", edgecolor="red")
geopandas.GeoSeries(row_group_boxes).plot(facecolor="none", edgecolor="blue", ax=ax)
<Axes: >

From these we can select only those that intersect our target_box!
relevant_row_group_boxes = [b for b in row_group_boxes if target_box.intersects(b)]
ax = geopandas.GeoSeries(relevant_boxes).plot(facecolor="none", edgecolor="red")
geopandas.GeoSeries(row_group_boxes).plot(facecolor="none", edgecolor="blue", ax=ax)
geopandas.GeoSeries(relevant_row_group_boxes).plot(facecolor="yellow", edgecolor="black", ax=ax)
<Axes: >

Of all those independently fetchable chunks, we can fulfill our query by downloading only six of them if the producer and consumer both speak GeoParquet 1.1! Very cool.
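If you want to check that count yourself (or actually fetch just those chunks), pyarrow can subset each preselected fragment to only the row groups whose statistics overlap the query box. This is just a sketch reusing the variables defined above; it leans on pyarrow’s ParquetFileFragment.subset() and the row group id, and it is not how SedonaDB does this internally.
# A sketch: keep only the row groups whose bbox statistics overlap the query
# box, then read just those row groups from each preselected fragment.
# (Assumes `relevant_fragments`, `target_box`, and `shapely` from above.)
relevant_tables = []
for f in relevant_fragments:
    keep_ids = []
    for row_group in f.row_groups:
        xmin_stats, xmax_stats, ymin_stats, ymax_stats = [
            (
                row_group.metadata.column(i).statistics.min,
                row_group.metadata.column(i).statistics.max,
            )
            for i in range(2, 6)
        ]
        rg_box = shapely.box(xmin_stats[0], ymin_stats[0], xmax_stats[1], ymax_stats[1])
        if target_box.intersects(rg_box):
            keep_ids.append(row_group.id)
    if keep_ids:
        relevant_tables.append(f.subset(row_group_ids=keep_ids).to_table())
sum(t.num_rows for t in relevant_tables)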
GeoParquet 2.0 / Parquet Geometry
Since their release, GeoParquet 1.0 and 1.1 have gained wide adoption in the ecosystem and should still be considered the go-to way to write spatial data to Parquet. As of a few months ago, the Parquet file format has a native geometry type with native statistics. The main advantage of this is file size: in the Overture buildings dataset alone, the bbox columns occupy about 25 gigabytes. As a percentage of the whole dataset this is certainly worth the space to support selective reads; however, for local data with fewer attributes, the bbox column can approach 50% or more of the entire file! The native Parquet statistics solve that by storing just the part we need (the boxes for each row group) instead of including boxes for every single row.
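If you’re curious where a number like that comes from, you can get a rough, do-it-yourself estimate from the metadata we already have by summing the compressed sizes of the bbox column chunks. A sketch reusing ds_fragments from earlier; columns 2 through 5 are the bbox struct’s fields in this particular schema, as above.
# Rough estimate of the space taken by the bbox covering column: sum the
# compressed column-chunk sizes for bbox.{xmin,xmax,ymin,ymax} across every
# row group of every file. (A sketch; reuses `ds_fragments` from above.)
bbox_bytes = 0
for f in ds_fragments:
    for row_group in f.row_groups:
        for i in range(2, 6):
            bbox_bytes += row_group.metadata.column(i).total_compressed_size
print(f"bbox columns: {bbox_bytes / 1e9:.1f} GB")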
Pruning in SedonaDB
I went through the nuts and bolts of how pruning is implemented above, but there’s no world in which we expect any user to want to think about that. When you execute a query against one or more (Geo)Parquet files, the query optimizer will try as hard as it can to push any filters down to the data source. We didn’t actually have to do anything to make this happen…DataFusion (like most databases) does this for any filter, and we just happened to implement support for the spatial ones on top of all the others. As long as there’s a spatial filter somewhere in a WHERE clause it will get used to filter the data.
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/",
    options={"aws.skip_signature": True, "aws.region": "us-west-2"},
).to_view("buildings", overwrite=True)
sd.sql(f"""
SELECT * FROM buildings
WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{target_wkt}'), 4326))
""").to_memtable()
<sedonadb.dataframe.DataFrame object at 0x3246b57f0>
(I’m using to_memtable() to do the time-consuming part up front so I can spend subsequent cells exploring the results.)
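As an example of that exploration, the materialized result can be registered as a view and queried cheaply afterwards. A sketch, assuming the query result above was assigned to target_buildings as in the tl;dr, and that the DataFrame’s show() method behaves the way I expect:
# Cheap follow-up exploration once the expensive scan has been materialized.
# (A sketch; assumes the memtable above was assigned to `target_buildings`
# as in the tl;dr, and that DataFrame.show() exists as I recall.)
target_buildings.to_view("target_buildings", overwrite=True)
sd.sql("SELECT count(*) AS n FROM target_buildings").show()
sd.sql("SELECT * FROM target_buildings LIMIT 5").show()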
This example is a simple one and it’s fairly easy to replicate using other tools, as we’ll see below. Where SedonaDB (and Sedona Spark) shines is the ability to do this for any GeoParquet (or Parquet’s new Geometry type, starting in the next release) without a user having to think about any of it or precalculate a bounding box.
Pruning in Sedona Spark
Like SedonaDB, Sedona Spark implements the same sort of “don’t worry, we take care of the details” style of GeoParquet pruning. Our example there would look like:
from sedona.spark import *
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
sedona.read.format("geoparquet").load(
    "s3://overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/"
).createOrReplaceTempView("buildings")
target = sedona.sql(f"""
SELECT * FROM buildings
WHERE ST_Intersects(geometry, ST_GeomFromText('{target_wkt}'))
""").toPandas()
Pruning in DuckDB
As of this writing, DuckDB needs a little help to do pruning on GeoParquet files: ST_Intersects() alone isn’t enough to get DuckDB to automatically read fewer files/row groups. However, DuckDB can prune based on the bbox column statistics given hard-coded bbox values (this capability was actually added to DuckDB specifically to support GeoParquet!).
import duckdb
duckdb.load_extension("spatial")
duckdb.sql("SET s3_region='us-west-2';")
duckdb.read_parquet(f"{buildings_url}*").to_view("buildings")
target = duckdb.sql(f"""
SELECT * FROM buildings
WHERE
ST_Intersects(geometry, ST_GeomFromText('{target_wkt}')) AND
bbox.xmin <= -73.11 AND
bbox.ymin <= 44.03 AND
bbox.xmax >= -73.21 AND
bbox.ymax >= 43.97
""").df()
As far as I know, DuckDB is not able to automatically (or otherwise) prune using GeoParquet 1.0 metadata as of this writing.
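That said, you can do the GeoParquet 1.0 file-level pruning yourself from DuckDB by pulling each file’s geo metadata out of the Parquet footer with parquet_kv_metadata() and building the list of relevant files in Python. A sketch (not something DuckDB does automatically), reusing the DuckDB session, buildings_url, and target_wkt from above:
# A sketch of manual GeoParquet 1.0 file-level pruning with DuckDB: read the
# "geo" key/value metadata from each file's footer, then keep only the files
# whose file-level bbox intersects the query box.
import json
import shapely

rows = duckdb.sql(f"""
    SELECT file_name, decode(value) AS geo
    FROM parquet_kv_metadata('{buildings_url}*')
    WHERE decode(key) = 'geo'
""").fetchall()
target_box = shapely.from_wkt(target_wkt).envelope
relevant_files = [
    file_name
    for file_name, geo in rows
    if target_box.intersects(
        shapely.box(*json.loads(geo)["columns"]["geometry"]["bbox"])
    )
]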
Pruning in GeoPandas
GeoPandas supports GeoParquet 1.1 pruning using pretty much the same mechanism as DuckDB: Acero, the engine powering pyarrow datasets, knows how to filter on the bbox column statistics and GeoPandas wires that into the right place to make the pruning happen. This approach also requires an explicit bounding box (i.e., there’s no automagic detection of the query you’re about to run) and doesn’t prune using GeoParquet 1.0 metadata. As of this writing the native Parquet geometry type isn’t supported but there’s an issue to add support.
import geopandas
target_box = geopandas.GeoSeries.from_wkt([target_wkt]).total_bounds
target = geopandas.read_parquet(buildings_url, bbox=target_box)
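Under the hood, the bbox= argument becomes a filter on the bbox struct’s fields that pyarrow can evaluate against the row group statistics before reading any geometry data. Roughly like the following sketch, which is my reading of the mechanism rather than GeoPandas’ exact code, and which assumes a pyarrow version recent enough to support nested field references in dataset filters:
# Roughly the filter that bbox= wires up: compare the bbox struct's fields
# against the query window so pyarrow can prune row groups from statistics
# alone. (A sketch; assumes recent pyarrow with nested field references.)
import pyarrow.dataset as pads

xmin, ymin, xmax, ymax = target_box  # (-73.21, 43.97, -73.11, 44.03)
filt = (
    (pads.field("bbox", "xmin") <= xmax)
    & (pads.field("bbox", "xmax") >= xmin)
    & (pads.field("bbox", "ymin") <= ymax)
    & (pads.field("bbox", "ymax") >= ymin)
)
table = pads.dataset(buildings_url).to_table(filter=filt)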
Pruning with GDAL/ogr2ogr
With GDAL’s fantastic ogr2ogr tool (in a build with Arrow C++ support, such as the one on Homebrew), you can specify a target bounding box to do this type of pruning when reading one or more files. For our example, the command-line invocation would be:
AWS_NO_SIGN_REQUEST=YES AWS_REGION=us-west-2 \
ogr2ogr \
target.gpkg \
/vsis3/overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/ \
-spat -73.21 43.97 -73.11 44.03 -if Parquet
I know this works with GeoParquet 1.1 bounding box columns (e.g., Overture); however, I’m not sure if this will prune files using metadata bounding boxes from a directory of GeoParquet 1.0 files. The Parquet driver will also support Parquet Geometry columns and statistics as of the next release, although this won’t work when reading more than one Parquet file.
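The same operation is available from GDAL’s Python bindings if you’d rather stay in a notebook. A sketch; it assumes a GDAL build that includes the Parquet driver (like the Homebrew build mentioned above):
# The ogr2ogr invocation above, expressed through GDAL's Python bindings.
# (A sketch; assumes a GDAL build that includes the Parquet driver.)
from osgeo import gdal

gdal.UseExceptions()
gdal.SetConfigOption("AWS_NO_SIGN_REQUEST", "YES")
gdal.SetConfigOption("AWS_REGION", "us-west-2")
gdal.VectorTranslate(
    "target.gpkg",
    "/vsis3/overturemaps-us-west-2/release/2025-08-20.0/theme=buildings/type=building/",
    spatFilter=(-73.21, 43.97, -73.11, 44.03),
)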
Apache Iceberg?
For all of the data access above there is the unavoidable cost of fetching the little piece of each Parquet file that stores our lovely GeoParquet metadata and the column statistics we can use to possibly avoid reading the rest of the file. Some engines parallelize or cache this better than others, but there are still 237 tiny bits of file getting downloaded, which takes 30 seconds to a minute depending on the engine (and perhaps your internet connection). What if we could store all that metadata in one place so that it could be accessed with a single request?
Enter Apache Iceberg, which is a lot of things, including a way to reduce the number of requests it takes to access large tables like Overture buildings. The forthcoming release of Iceberg will support geometry and geography as first-class types, which comes with the benefit of first-class statistics support. DuckLake, DuckDB’s reimagining of Iceberg along the same lines, will also support geometry with queryable statistics in a forthcoming version. Stay tuned for exciting options for local and/or remote access to datasets like these!
Reflections
This post was a deep dive into the mechanism that enables GeoParquet to achieve reasonable performance when compared with indexed formats like PostGIS tables, FlatGeoBuf, GeoPackage, and Shapefile for many types of queries. While these indexed formats still excel with repeated queries that are expected to return tiny results, large result sets or queries that involve non-spatial filters benefit from using (Geo)Parquet. SedonaDB strives to make the difference between the two as seamless as possible and prune as many files and/or row groups as possible by inspecting the query so that users can fearlessly query large and small datasets alike.
¹ “Cloud Native”, if that’s your thing