Stream network traversal with SedonaDB

Last Wednesday the Apache Sedona project announced SedonaDB. There’s also a great post on the Wherobots blog with a bit more context, or if you’re like me and you just want to see the code, you can do that too.

I’ve been a sparse blog poster ever since I (1) got a job and (2) had kids, but if you’ve been vaguely following along over the past few years you’ll notice that I’ve mostly written about what happens when spatial data gets a little too big to be comfortable for the standard R and Python tools to handle (mostly sf and geopandas). By “too big to be comfortable” I mean anything where the key part of your iteration takes more than 10 seconds, which is roughly the amount of time the average person is willing to wait before trying to do something else. The blog posts and the software I worked on over those years were cool but never solved anything: I basically found some great workarounds that people could implement if they were willing to write a pile of low-level R, C, or Python.

SedonaDB is a solution: it aims to make the “too big to be comfortable” cases effortless while getting the spatial details right. We do CRSes. We do Z and M coordinates. We do Geography. We do indexed spatial and k-nearest-neighbour joins. We can query sf and GeoPandas data frames derived from all of the files and analyses you’ve already written. But we can also lazily query terabytes of Parquet files sitting on cloud storage from your laptop.

I hope to be writing quite a lot in the next few months as we evolve the first few releases of SedonaDB…for now I’ll give a quick introduction by way of one of my earliest too-big-to-be-comfortable posts: let’s explore some river and lake data and compute a stream network using our fancy new engine!

If you haven’t already, install SedonaDB:

pip install "apache-sedona[db]"

…then connect and set a few display options:

import sedona.db

sd = sedona.db.connect()
sd.options.interactive = True
sd.options.width = 80

Let’s load up our data. This is more like DuckDB and less like sf/geopandas: read_parquet() will read just enough of the file to find out what the fields are and leave it at that. These files don’t happen to be very big (~600 MB total), but they could be entire directories of local or remotely stored Parquet files and this would run in roughly the same amount of time.

sd.read_parquet("ns-water_water-poly_geo.parquet").to_view("lakes")

Let’s see what we’re up against:

sd.view("lakes").show(5)
┌──────────┬────────────┬───┬──────────────────────────────────────────────────┐
│ OBJECTID ┆ OBJECTID_1 ┆ … ┆                     geometry                     │
│   int64  ┆    int64   ┆   ┆                     geometry                     │
╞══════════╪════════════╪═══╪══════════════════════════════════════════════════╡
│   103678 ┆     108371 ┆ … ┆ POLYGON Z((255017.9274000004 4834135.7807 1.199… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│    46327 ┆      50917 ┆ … ┆ POLYGON Z((256922.12829999998 4821474.2807 0.89… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│   154367 ┆     149165 ┆ … ┆ POLYGON Z((257264.3273 4821808.1807 1.199999999… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│    25725 ┆      30360 ┆ … ┆ POLYGON Z((257456.37930000015 4821990.445699999… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│    58413 ┆      64881 ┆ … ┆ POLYGON Z((257101.62729999982 4821664.080700001… │
└──────────┴────────────┴───┴──────────────────────────────────────────────────┘

Ah yes…like most government-provided shapefiles, they have a bunch of fields of varying relevance. The .schema property will show all of them so you can find the one you’re probably looking for:

sd.view("lakes").schema
SedonaSchema with 14 fields:
  OBJECTID: int64<Int64>
  OBJECTID_1: int64<Int64>
  FEAT_CODE: utf8<Utf8View>
  ZVALUE: float64<Float64>
  MINZ: float64<Float64>
  MAXZ: float64<Float64>
  POLY_CLASS: int32<Int32>
  NAMEID_1: utf8<Utf8View>
  NAME_1: utf8<Utf8View>
  HID: utf8<Utf8View>
  SHAPE_LENG: float64<Float64>
  SHAPE_AREA: float64<Float64>
  SHAPE_LEN: float64<Float64>
  geometry: geometry<WkbView({...})>

If you’re a GIS nut you’ll also want to know what coordinate reference system you’re up against:

import pyproj

pyproj.CRS(sd.view("lakes").schema.field("geometry").type.crs).name
'NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height'

Fun! We have two coordinate reference systems here (one for the xy, one for the z), which is why a friendly authority/code wasn’t displayed in the schema (SedonaDB can handle pretty much any CRS but doesn’t know how to make all of them look pretty yet).
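If you want to pull a compound CRS like this apart, pyproj can list its components. A small sketch using a generic compound CRS (WGS 84 + EGM96 height, EPSG:4326+5773) as a stand-in for the Nova Scotia one; the inspection works the same way on any compound CRS:

```python
import pyproj

# A compound CRS pairs a horizontal CRS with a vertical one. This is a
# generic example, not the Nova Scotia CRS from the data above.
crs = pyproj.CRS("EPSG:4326+5773")

# sub_crs_list exposes the horizontal and vertical parts separately.
for sub in crs.sub_crs_list:
    print(sub.name)
```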

Next we’ll have to find East Lake. There’s a NAME_1 column that we can filter on:

sd.sql("""
    SELECT "NAME_1", geometry FROM lakes WHERE "NAME_1" = 'East Lake'
""")
┌───────────┬──────────────────────────────────────────────────────────────────┐
│   NAME_1  ┆                             geometry                             │
│    utf8   ┆                             geometry                             │
╞═══════════╪══════════════════════════════════════════════════════════════════╡
│ East Lake ┆ POLYGON Z((491430.12150000036 4961432.5792 17.60000000000582,49… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((460692.9227 4957887.4794 29.69999999999709,460693.92… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((466310.42260000017 4962958.8796 68.10000000000582,46… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((485608.6218999997 4969875.780200001 81.1000000000058… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((339556.52469999995 4956609.877599999 188.39999999999… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((501128.41949999984 5062277.793500001 66.089999999996… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((559271.1180999996 5016656.183599999 90,559276.118099… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((540415.4187000003 4998070.5822 144.89999999999418,54… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((531918.0190000003 4996563.081900001 128.399999999994… │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ East Lake ┆ POLYGON Z((493795.62239999976 4965289.8803 52.19999999999709,49… │
└───────────┴──────────────────────────────────────────────────────────────────┘

Right…I forgot that this is Nova Scotia we’re dealing with here, where there are a ton of waterbodies and the people who named them were often not very creative.¹

Let’s take a look at these to figure out which one I’m looking for:

import lonboard

df = sd.sql("""
    SELECT "OBJECTID", ST_Centroid(geometry) FROM lakes WHERE "NAME_1" = 'East Lake'
""")

lonboard.viz(df)

With some clicking and scrolling, the one I’m looking for has an OBJECTID of 1976. Let’s save that one:

sd.sql(
    """SELECT "OBJECTID" as lake_id, geometry AS lake FROM lakes WHERE "OBJECTID" = 1976"""
).to_view("east_lake")

Now that we have our lake, let’s find some inlets and outlets. We can load the Parquet file like we did for the lakes table:

sd.read_parquet("ns-water_water-line_geo.parquet").to_view("rivers")

…and use a join to find intersecting segments.

inlets_and_outlets = sd.sql("""
SELECT "OBJECTID", "FEAT_CODE", geometry
FROM rivers
JOIN east_lake ON ST_Intersects(east_lake.lake, rivers.geometry)
""")

Let’s take a look to see if this makes sense:

lonboard.viz(inlets_and_outlets.to_pandas())

Hmm…not quite right. Some of those segments are the lake boundary and “flow lines” through the lake. From clicking around a bit we can see that the segments we’re looking for all have "FEAT_CODE" = 'WARV50'. Let’s retry:

inlets_and_outlets = sd.sql("""
SELECT "OBJECTID", "FEAT_CODE", geometry
FROM rivers
JOIN east_lake ON ST_Intersects(east_lake.lake, rivers.geometry)
WHERE "FEAT_CODE" = 'WARV50'
""")

lonboard.viz(inlets_and_outlets.to_pandas())

Much better! The next part is the stream network traversal to build a drainage network. This was the crux of the original post, where the geos R package provides some primitives to effectively reuse a spatial index across iterations to look up the next upstream segment (geopandas provides these too). SedonaDB can’t reuse index structures between queries today, but let’s see if we can approximate the algorithm:

  • Start with some initial set of segments (here, our inlets to East Lake)
  • Find a set of segments whose endpoint matches the current set of start points
  • Iterate, adding a fresh set of newly found upstream segments on each iteration
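Stripped of the spatial machinery, those steps are a plain breadth-first fixed point. A toy sketch with made-up segment tuples, where exact coordinate tuples stand in for the ST_Intersects matching the real query does:

```python
# Toy sketch of the traversal, independent of SedonaDB. Each segment is
# id: (start_point, end_point); water flows from start to end, so the
# upstream neighbours of a segment are those whose end matches its start.
segments = {
    1: ((0, 1), (0, 0)),   # end (0, 0) touches the lake: our seed
    2: ((1, 2), (0, 1)),   # flows into segment 1
    3: ((-1, 2), (0, 1)),  # also flows into segment 1
    4: ((5, 5), (4, 4)),   # disconnected; should never be picked up
}

# Index segments by end point so each lookup is cheap (the role the
# spatial index plays in the real join).
by_end = {}
for seg_id, (start, end) in segments.items():
    by_end.setdefault(end, []).append(seg_id)

network = {1}   # segments found so far
frontier = {1}  # only "fresh" segments contribute start points
while frontier:
    starts = {segments[s][0] for s in frontier}
    found = {s for pt in starts for s in by_end.get(pt, ())}
    frontier = found - network  # drop anything we already have
    network |= frontier
```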

Because this involves a lot of repeated lookups, we’ll calculate and cache in memory (.to_memtable()) the segments and their start/end points. Normally ST_StartPoint() and ST_EndPoint() are what you want here, but I forgot to put those on the list for our initial version, so I’ll use ST_LineInterpolatePoint() instead. This is a place where SedonaDB (or Arrow generally) is quite nice: the memory required to collect the whole thing can be much less than reading it into a GeoPandas table.
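That substitution works because interpolating at fractions 0.0 and 1.0 of a line’s length recovers its first and last vertices. A quick sanity check with shapely, which exposes the same normalized interpolation:

```python
from shapely.geometry import LineString

line = LineString([(0, 0), (10, 0), (10, 5)])

# Interpolating at 0.0 and 1.0 of the normalized length stands in for
# ST_StartPoint/ST_EndPoint: it lands on the first and last vertices.
start = line.interpolate(0.0, normalized=True)
end = line.interpolate(1.0, normalized=True)
```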

sd.sql("""
SELECT
    "OBJECTID",
    "FEAT_CODE",
    geometry,
    ST_LineInterpolatePoint(geometry, 0.0) AS start,
    ST_LineInterpolatePoint(geometry, 1.0) AS end
FROM
    rivers
WHERE
    ST_GeometryType(geometry) = 'ST_LineString' AND
    "FEAT_CODE" IN ('WARV50', 'WARV55', 'WARV59', 'WALK59')
""").to_memtable().to_view("rivers_with_ends", overwrite=True)

Next, we’ll start off our river network with the endpoints intersecting the lake:

sd.sql("""
SELECT lake_id, "OBJECTID", geometry, start, end
FROM rivers_with_ends
JOIN east_lake ON ST_Intersects(east_lake.lake, rivers_with_ends.end)
WHERE "FEAT_CODE" IN ('WARV50', 'WARV55', 'WARV59')
""").to_view("network", overwrite=True)

Then, we’ll iterate, using a JOIN to fetch the next round of upstream segments from rivers_with_ends and stacking it on top of the previous network. We’re careful to set the previous network’s start to NULL so that we don’t fetch repeated segments: only the “fresh” segments keep a start point, so each iteration fetches only fresh upstream segments. SQL diehards will recognize this pattern as a recursive common table expression (CTE); here we’re managing the recursion in Python, which I far prefer (e.g., you have visibility into the size and number of iterations of the intermediate tables).
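The Python side of that recursion is just a generic fixed-point loop: apply a growth step, compare sizes, stop when nothing new appears. A sketch of only the control flow, where a set and a toy upstream map stand in for the network view and the SQL:

```python
def fixed_point(grow, state, max_iterations=100):
    """Apply `grow` until the state stops growing; return the final state
    and the number of growth iterations (mirroring the row-count check)."""
    iterations = 0
    for _ in range(max_iterations):
        new_state = grow(state)
        if len(new_state) == len(state):
            break
        state = new_state
        iterations += 1
    return state, iterations

# Toy upstream graph: each id maps to the ids that flow into it.
upstream = {1: [2], 2: [3, 4], 3: [], 4: []}
step = lambda s: s | {n for x in s for n in upstream.get(x, [])}
network, n = fixed_point(step, {1})
```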

Getting this done was an iterative process…I started with range(1) to make sure I was getting the first round, and I had to fiddle with the list of allowed FEAT_CODEs as I noticed that some segments weren’t getting picked up.

crs = sd.view("network").schema.field("geometry").type.crs
network_count = sd.view("network").count()
network_iterations = 0

for i in range(100):
    sd.sql(f"""
    SELECT
        lake_id,
        rivers_with_ends."OBJECTID",
        rivers_with_ends.geometry,
        rivers_with_ends.start,
        rivers_with_ends.end
    FROM rivers_with_ends
    JOIN network ON ST_Intersects(network.start, rivers_with_ends.end)
    WHERE rivers_with_ends."OBJECTID" NOT IN (SELECT "OBJECTID" FROM network)
    UNION ALL
    SELECT
        lake_id,
        "OBJECTID",
        geometry,
        ST_SetCrs(ST_GeomFromText(NULL), '{crs.to_json()}') AS start,
        end
    FROM network
    """).to_memtable().to_view("network", overwrite=True)

    new_network_count = sd.view("network").count()
    if new_network_count == network_count:
        break
    else:
        network_count = new_network_count
        network_iterations += 1

print(network_iterations)
45

Great! We finished in 45 iterations (~2.5 seconds). Let’s take a look in context with a relevant subset of the original network:

near = sd.sql("""
    SELECT rivers."FEAT_CODE", rivers.geometry FROM rivers
    JOIN east_lake ON ST_DWithin(rivers.geometry, east_lake.lake, 10000)
""")

lonboard.viz([near.to_pandas(), sd.sql("SELECT geometry from network").to_pandas()])

The cool part about using a join in this way is that it scales to multiple lakes. For example, we can do this for all of the famed “East Lake”s. The time cost here is mostly per iteration (where SedonaDB rebuilds an index), so doing multiple lakes at once is basically free.

sd.sql("""
    SELECT "OBJECTID" as lake_id, geometry as lake FROM lakes
    WHERE "NAME_1" = 'East Lake'
""").to_view("east_lakes")

sd.sql("""
    SELECT lake_id, "OBJECTID", geometry, start, end
    FROM rivers_with_ends
    JOIN east_lakes ON ST_Intersects(east_lakes.lake, rivers_with_ends.end)
    WHERE "FEAT_CODE" IN ('WARV50', 'WARV55', 'WARV59')
""").to_view("network", overwrite=True)

network_count = sd.view("network").count()
network_iterations = 0

for i in range(100):
    sd.sql(f"""
    SELECT
        lake_id,
        rivers_with_ends."OBJECTID",
        rivers_with_ends.geometry,
        rivers_with_ends.start,
        rivers_with_ends.end
    FROM rivers_with_ends
    JOIN network ON ST_Intersects(network.start, rivers_with_ends.end)
    WHERE rivers_with_ends."OBJECTID" NOT IN (SELECT "OBJECTID" FROM network)
    UNION ALL
    SELECT
        lake_id,
        "OBJECTID",
        geometry,
        ST_SetCrs(ST_GeomFromText(NULL), '{crs.to_json()}') AS start,
        end
    FROM network
    """).to_memtable().to_view("network", overwrite=True)

    new_network_count = sd.view("network").count()
    if new_network_count == network_count:
        break
    else:
        network_count = new_network_count
        network_iterations += 1

print(network_iterations)
70
lonboard.viz(sd.sql("SELECT geometry FROM network").to_pandas())

In my initial post I noted that I could analyze ~650 lakes in about 3 seconds. SedonaDB is about on par with that and doesn’t require anything like the specialized implementation. In other words, my previous solution was a custom job whose implementation didn’t really apply to any other problem. The techniques we used here (notably: spatial joins) are universal and far easier to apply to your next uncomfortable spatial workflow.

Reflections

I spent quite a lot of time alongside a great team building SedonaDB over the past few months and I’m an obvious fan…writing this post was fun for me, and I’m chuffed that all the things we hoped SedonaDB would work seamlessly with (here: GeoPandas, GeoParquet, VSCode + Jupyter, and lonboard) really did just work. The spatial joins were blazing fast and the tabular output was pretty. The spatial details I hoped we would get right we got right (the lonboard output showing up exactly where it should have on the basemap was the result of careful CRS propagation throughout the engine that we spent a lot of time on!).

I was worried that our initial release wouldn’t have enough of a DataFrame interface to be useful (or: that it would be too cumbersome to only use SQL). I was pleasantly surprised that I didn’t really mind…our interface let me write small pieces of named SQL (.to_view()). LLMs are great at small pieces of SQL and helped me out quite a bit (my SQL is not great). From a developer perspective, relying on SQL let us focus on making a great engine without inventing an interface we weren’t sure anybody actually wanted. While I’d still like to invest in that, I’m pleased with the user experience we were able to achieve with a mostly SQL-driven interface.

Bringing it back to the timings I quoted at the beginning of this post (0.1 s, 1 s, and 10 s): nothing I ran here took more than 3 seconds, and almost all of it ran in less than 0.1 seconds. The data I was working with wasn’t huge, but it also wasn’t trivial. While SedonaDB is in large part about speeding up tasks users already have, it also lets users work and iterate with bigger data while staying within those critical attention limits.

Stay tuned for more SedonaDB goodness!

Getting the data

The data can be downloaded from https://geoarrow.org/data:

curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-line_geo.parquet \
    -o ns-water_water-line_geo.parquet

curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-poly_geo.parquet \
    -o ns-water_water-poly_geo.parquet

  1. A quick LLM-aided SQL adventure suggests that the top three least creatively named waterbodies in Nova Scotia are Black Brook (87), Little River (62), and Long Lake (61). I had the misfortune of studying one of the Long Lakes for my undergraduate thesis, and this tracks with my constant struggle at the time to find relevant literature for my Long Lake.
