Day 6: Dimensions
Day 6 of the 30 Day Map Challenge is Dimensions, and when I read the prompt I immediately thought: I have to do something with M values.
M values, you say? It’s true that these don’t come up as often as Z, but many spatial specifications out there allow combinations of XY, XYZ, XYM, or XYZM values. The M stands for “measure” because I think the original motivation was that sometimes the linear distance along a feature is actually the easiest thing to measure (think: you’re mapping a rail line before reliable high-precision GPS and what you write down is the odometer reading of the train). The ability to store a fourth dimension is also helpful when communicating time. I’m not 100% on the history here (if you have a definitive history feel free to share and I’ll link it!) but I did once meet somebody from ESRI who told me their dad invented them and that M values are popular in South America. Who knew?
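To make the “measure” idea concrete, here’s a toy sketch of linear referencing in plain Python (the helper name and sample values are made up; this mimics what PostGIS calls ST_LocateAlong()): given a line whose vertices carry M values, we can recover the XY position at any measure by interpolating between the bracketing vertices.

```python
# Each vertex is (x, y, m): a position plus a measure (e.g., an odometer reading).
line_xym = [(0.0, 0.0, 0.0), (3.0, 4.0, 5.0), (3.0, 10.0, 11.0)]

def locate_along(vertices, m):
    """Interpolate the (x, y) position at measure m along an XYM line."""
    for (x0, y0, m0), (x1, y1, m1) in zip(vertices, vertices[1:]):
        if m0 <= m <= m1:
            t = (m - m0) / (m1 - m0)
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    raise ValueError("measure outside the line's M range")

print(locate_along(line_xym, 2.5))  # halfway along the first segment: (1.5, 2.0)
```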
From an implementation junkie’s perspective (me), M values (in particular the XYM combination) are bothersome because we can’t just count the dimensions anymore. In other words, three dimensions no longer automatically means “XYZ”. When putting together the Parquet GEOMETRY/GEOGRAPHY specification, this was difficult to explain to non-geo people; however, all four of you that know what M values are and use them regularly will be pleased to know that Parquet GEOMETRY/GEOGRAPHY columns contain M statistics if the data contain them.
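One place this shows up is ISO WKB, where dimensionality is baked into the geometry type code rather than inferred by counting coordinates: the base type gets +1000 for Z, +2000 for M, and +3000 for ZM. A rough sketch of the decoding (the helper is illustrative, not from any library):

```python
# ISO WKB geometry type codes: base type (1 = Point, 2 = LineString, ...)
# plus 1000 for Z, 2000 for M, 3000 for ZM. A reader can't just count values.
def wkb_dimensions(type_code):
    suffix = type_code // 1000
    return {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}[suffix]

print(wkb_dimensions(1))     # XY   (Point)
print(wkb_dimensions(1001))  # XYZ  (Point Z)
print(wkb_dimensions(2001))  # XYM  (Point M: three values per vertex, but no Z!)
```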
For this challenge I’d hoped to leverage M values to make a cool animation, but it turns out they’re not well-supported in the ecosystem (or I don’t know about the right tools that use them), so I’ll make the file and then do the animation the “old-fashioned” way (which is to just store date times in a separate column).
For this challenge I’ll use data from the International Argo Program. I worked for the better part of a year at Fisheries and Oceans Canada with a large part of my work funded by Argo Canada. Argo is the program that manages thousands of tiny floats that float around the ocean and take in-situ measurements, transmitting them by satellite as they go. One of my projects was to make that data accessible with projects like the argodata package for R and argopandas (with some quality time on the side translating Fortran into Python).
We’ll start this journey in R, since the argodata package is what I’m most familiar with in terms of getting the data. It’s organized as a gazillion tiny NetCDF files and most of what argodata does is search and resolve various collections of those into tables with a reasonable cache configuration to avoid saturating severely underprovisioned government servers.
# install.packages('sedonadb', repos = c('https://apache.r-universe.dev', 'https://cloud.r-project.org'))
# pak::pak("ArgoCanada/argodata")
library(tidyverse)
library(argodata)
library(sedonadb)
argo_set_cache_dir("argo-cache")
We’ll use floats that are vaguely off the coast of Nova Scotia. Floats in the Pacific can have bounding boxes that stretch across the entire globe (ocean floats rather rudely cross the 180/-180 longitude with very little regard for Cartesian math), so if we don’t filter out extra-wide bounding boxes we end up downloading a lot of data from the Pacific that we won’t end up using in our final map.
traj_df <- argo_global_traj() |>
  argo_filter_rect(28, 52, -75, -28) |>
  filter((longitude_max - longitude_min) < 90) |>
  argo_extract_path_info() |>
  filter(file_data_mode == "R")

traj_df |>
  as_sedonadb_dataframe() |>
  sd_write_parquet("trajectories.parquet")
Next, we’ll download the files. This takes a while, although one of the crowning achievements of the argodata package was the ability to run this using parallel HTTP requests (almost always a multi-x speedup when downloading tons of tiny files).
traj_df |> argo_download()
For anybody who has worked with oceanographers before, the predisposition to believe that their data is a multidimensional array runs deep¹, regardless of whether their data is effectively represented by a multidimensional array or not. Argo data is, unfortunately, actually a table (i.e., observations as rows, variables as columns)². NetCDF as a file format only stores multidimensional arrays but will never be pried from the clutches of those distributing the data, so argodata provides functions to do this conversion.
The conversion in argodata was much faster than any other way to do this at the time I wrote it (almost four years ago!), but it’s still very slow compared to Parquet. In the rest of my analysis I’ll be doing a bunch of ad-hoc querying to figure out how to process the data so I’ll do the conversion once.
This seemed like a good opportunity to try out the purrr package’s new in_parallel adverb plus mirai’s multiprocessing capability to use all my cores for the conversion. I’ll write GeoParquet using SedonaDB (of course!), roughly one file per core. Worked like a charm! This runs almost instantaneously.
mirai::daemons(16)

traj_df$file |>
  split(seq_len(16)) |>
  walk(in_parallel(~{
    argodata::argo_set_cache_dir("argo-cache")
    out <- file.path("parquet", paste0(rlang::hash(.x), ".parquet"))

    argodata::argo_traj_measurement(
      tibble::tibble(file = .x),
      vars = c("date", "longitude", "latitude", "position_qc")
    ) |>
      dplyr::filter(position_qc == "1") |>
      dplyr::transmute(
        file,
        n_measurement,
        date,
        geometry = wk::as_wkb(wk::xy(longitude, latitude, crs = wk::wk_crs_longlat()))
      ) |>
      sedonadb::as_sedonadb_dataframe() |>
      sedonadb::sd_write_parquet(out, geoparquet_version = "1.1")
  }), .progress = TRUE)
Now I’ll drop into Python, although the sedonadb R package can do a lot of this too.
# pip install "apache-sedona[db]"
import sedona.db
sd = sedona.db.connect()
sd.options.interactive = True
sd.read_parquet("parquet/*.parquet").to_view("measurements", overwrite=True)
sd.view("measurements").count()
1783292
sd.sql("SELECT * FROM measurements")
┌────────────────────┬───────────────┬────────────────────┬────────────────────┬───────────────────┐
│ file ┆ n_measurement ┆ date ┆ bbox ┆ geometry │
│ utf8 ┆ int32 ┆ timestamp ┆ struct ┆ geometry │
╞════════════════════╪═══════════════╪════════════════════╪════════════════════╪═══════════════════╡
│ meds/4901080/4901… ┆ 1333 ┆ 2009-11-21T23:57:… ┆ {xmin: -68.79301,… ┆ POINT(-68.792999… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1334 ┆ 2009-11-22T00:00:… ┆ {xmin: -68.797005… ┆ POINT(-68.796997… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1335 ┆ 2009-11-22T00:34:… ┆ {xmin: -68.79502,… ┆ POINT(-68.795013… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1336 ┆ 2009-11-22T00:36:… ┆ {xmin: -68.796, y… ┆ POINT(-68.795989… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1337 ┆ 2009-11-22T02:14:… ┆ {xmin: -68.77201,… ┆ POINT(-68.772003… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1338 ┆ 2009-11-22T03:20:… ┆ {xmin: -68.758, y… ┆ POINT(-68.757995… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1339 ┆ 2009-11-22T05:55:… ┆ {xmin: -68.717995… ┆ POINT(-68.717987… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1340 ┆ 2009-11-22T06:13:… ┆ {xmin: -68.721016… ┆ POINT(-68.721008… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1341 ┆ 2009-11-22T07:34:… ┆ {xmin: -68.69199,… ┆ POINT(-68.691986… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ meds/4901080/4901… ┆ 1342 ┆ 2009-11-22T07:54:… ┆ {xmin: -68.68702,… ┆ POINT(-68.687011… │
└────────────────────┴───────────────┴────────────────────┴────────────────────┴───────────────────┘
Because this is Dimensions day, let’s give these points M values and assign an appropriate CRS. To find the CRS specification I had to actually just read the OGC WKT2 spec as no amount of googling or AIing led me to reasonable examples. I am guessing this is because nobody has bothered to actually do this (or do this and write about it), so the real world applicability of doing this is approximately nil (at least today).
import pyproj
crs = pyproj.CRS("""
COMPOUNDCRS["WGS84 + time",
  GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
      ELLIPSOID["WGS 84",6378137,298.257223563]],
    CS[ellipsoidal,2],
      AXIS["latitude",north],
      AXIS["longitude",east],
      UNIT["degree",0.0174532925199433]],
  TIMECRS["Unix time",
    TDATUM["Unix epoch", TIMEORIGIN[1970-01-01T00:00:00Z]],
    CS[TemporalCount, 1], AXIS["Time", future, TIMEUNIT["second"]]]]
""")
sd.sql(f"""
SELECT
ST_SetCrs(
ST_PointM(
ST_X(geometry),
ST_Y(geometry),
arrow_cast(date, 'Timestamp(Second, None)')::BIGINT::DOUBLE
),
'{crs.to_json()}'
) as geometry
FROM measurements
""")
┌───────────────────────────────────────────────────────────┐
│ geometry │
│ geometry │
╞═══════════════════════════════════════════════════════════╡
│ POINT M(-68.79299926757813 36.73400115966797 1258847820) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.7969970703125 36.733001708984375 1258848000) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.79501342773438 36.733001708984375 1258850040) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.79598999023438 36.733001708984375 1258850160) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.77200317382813 36.73699951171875 1258856040) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.75799560546875 36.74700164794922 1258860000) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.71798706054688 36.74300003051758 1258869300) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.72100830078125 36.75 1258870380) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.69198608398438 36.74700164794922 1258875240) │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ POINT M(-68.68701171875 36.74599838256836 1258876499) │
└───────────────────────────────────────────────────────────┘
I’m exhausted…let’s work with these data in non-Mified format since SedonaDB needs these deconstructed to work with them anyway.
We’ve got about 2 million points, and to make the pretty animated map I have in mind we’ll need to grid them along the time dimension. The points we’ve been given are quite literally measurements…GPS ones, and some of them are garbage (despite our best efforts to cherry-pick position_qc == "1"). These floats also spend most of their time drifting at about 1000 meters depth, which means we only get their position once every two weeks.
My strategy is a simple one that is probably good enough for making an animation: group by float and calendar day and compute the geographical centroid (the geometry centroid is vaguely meaningless here).
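As an aside, the reason a plain geometry centroid is suspect here is that averaging longitudes goes wrong near the antimeridian; a geographic centroid instead averages unit vectors on the sphere and converts back. A rough sketch of that idea in plain Python (not necessarily how SedonaDB implements ST_Centroid on geographies):

```python
import math

def geographic_centroid(lonlat_points):
    """Average points as unit vectors on the sphere, then convert back to lon/lat."""
    x = y = z = 0.0
    for lon, lat in lonlat_points:
        lon_r, lat_r = math.radians(lon), math.radians(lat)
        x += math.cos(lat_r) * math.cos(lon_r)
        y += math.cos(lat_r) * math.sin(lon_r)
        z += math.sin(lat_r)
    n = len(lonlat_points)
    x, y, z = x / n, y / n, z / n
    return math.degrees(math.atan2(y, x)), math.degrees(math.atan2(z, math.hypot(x, y)))

# Two points straddling the antimeridian: a naive mean of longitudes gives 0
# (the wrong side of the planet); the spherical mean lands near 180.
print(geographic_centroid([(179.0, 0.0), (-179.0, 0.0)]))
```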
sd.sql("""
SELECT
  REGEXP_MATCH(file, '.*/([0-9]+)/.*')[1] AS float_id,
  date::DATE as date,
  geometry
FROM measurements
""").to_view("measurements_to_group", overwrite=True)
sd.sql("""
SELECT
  float_id, date,
  COUNT(*) as n,
  ST_Centroid(ST_GeogFromWkb(ST_AsBinary(ST_Collect(geometry)))) as geography
FROM measurements_to_group
GROUP BY float_id, date
ORDER BY float_id, date
""").to_view("measurements_by_day", overwrite=True)
sd.sql("SELECT * FROM measurements_by_day WHERE n >= 2").to_view(
    "floats_by_date", overwrite=True
)
sd.view("floats_by_date")
┌──────────┬────────────┬───────┬──────────────────────────────────────────────┐
│ float_id ┆ date ┆ n ┆ geography │
│ utf8 ┆ date32 ┆ int64 ┆ geography │
╞══════════╪════════════╪═══════╪══════════════════════════════════════════════╡
│ 3901605 ┆ 2019-07-10 ┆ 14 ┆ POINT(-72.06835728957104 31.758428656024137) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2019-09-08 ┆ 14 ┆ POINT(-75.03428543600577 30.65785725573738) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2020-03-16 ┆ 13 ┆ POINT(-72.61399994205586 24.4247695214107) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2020-12-01 ┆ 16 ┆ POINT(-75.12412501163695 24.354437644714665) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2021-04-30 ┆ 16 ┆ POINT(-74.09062507185162 25.136937538958353) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2021-12-26 ┆ 14 ┆ POINT(-74.01528521876831 26.51685725162998) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2022-02-24 ┆ 13 ┆ POINT(-72.16292203448143 27.222461716223144) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2022-05-15 ┆ 15 ┆ POINT(-71.82386652604556 29.6466669922378) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2022-11-01 ┆ 18 ┆ POINT(-71.23810999955224 31.262833707093865) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3901605 ┆ 2022-11-21 ┆ 15 ┆ POINT(-71.784134334966 30.89140031210049) │
└──────────┴────────────┴───────┴──────────────────────────────────────────────┘
Next, we’ll make segments. We need them for the gridding step: for each frame date we’ll select the matching segment and interpolate along it based on where that date falls between the segment’s endpoints.
sd.sql("""
WITH numbered_floats AS (
  SELECT
    float_id,
    date,
    geography,
    ROW_NUMBER() OVER (PARTITION BY float_id ORDER BY date) as rn
  FROM floats_by_date
)
SELECT
  a.float_id,
  a.date as start_date,
  b.date as end_date,
  ST_MakeLine(a.geography, b.geography) as geography
FROM numbered_floats a
JOIN numbered_floats b
  ON a.float_id = b.float_id
  AND b.rn = a.rn + 1
ORDER BY a.float_id, a.date
""").to_view("float_segments", overwrite=True)
sd.view("float_segments")
┌──────────┬────────────┬────────────┬─────────────────────────────────────────────────────────────┐
│ float_id ┆ start_date ┆ end_date ┆ geography │
│ utf8 ┆ date32 ┆ date32 ┆ geography │
╞══════════╪════════════╪════════════╪═════════════════════════════════════════════════════════════╡
│ 1900037 ┆ 2001-12-18 ┆ 2001-12-28 ┆ LINESTRING(-27.252008730644764 36.60083498750121,-27.71950… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2001-12-28 ┆ 2002-01-07 ┆ LINESTRING(-27.71950030136404 36.598628050262306,-28.19417… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-01-07 ┆ 2002-01-08 ┆ LINESTRING(-28.194178635126544 36.579175564453074,-28.2955… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-01-08 ┆ 2002-01-17 ┆ LINESTRING(-28.295500029062985 36.504500001043226,-28.9701… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-01-17 ┆ 2002-01-18 ┆ LINESTRING(-28.970165753666567 36.54550070613418,-28.98850… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-01-18 ┆ 2002-01-27 ┆ LINESTRING(-28.988500203782618 36.55050005114263,-29.22959… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-01-27 ┆ 2002-01-28 ┆ LINESTRING(-29.22959957653678 36.588200156245286,-29.17799… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-01-28 ┆ 2002-02-06 ┆ LINESTRING(-29.17799968229882 36.6115002047024,-29.4943338… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-02-06 ┆ 2002-02-07 ┆ LINESTRING(-29.494333857282083 36.647501768686034,-29.4699… │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1900037 ┆ 2002-02-07 ┆ 2002-02-16 ┆ LINESTRING(-29.469999961044604 36.64900000417924,-30.00616… │
└──────────┴────────────┴────────────┴─────────────────────────────────────────────────────────────┘
sd.sql("SELECT * FROM float_segments WHERE float_id = '1900037'").to_pandas().plot()
<Axes: >

Now we’ll do the gridding along the time dimension. For each frame in our animation we need the position of every float, so we’ll select the segments applicable to that date and interpolate along them based on the date. Because we’re using geography, our interpolations are vaguely realistic (although they don’t account for the fact that these things don’t always drift in a straight line!).
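To see why interpolating on the sphere is more realistic than interpolating on the plane, here’s a toy comparison in plain Python (a hand-rolled slerp, not SedonaDB’s implementation): the midpoint of a segment between two points at 60°N stays at 60°N on the plane, but bulges poleward on the great circle.

```python
import math

def slerp_lonlat(p0, p1, t):
    """Great-circle (spherical linear) interpolation between two (lon, lat) points."""
    def to_xyz(lon, lat):
        lon, lat = math.radians(lon), math.radians(lat)
        return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

    a, b = to_xyz(*p0), to_xyz(*p1)
    omega = math.acos(max(-1.0, min(1.0, sum(u * v for u, v in zip(a, b)))))
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    x, y, z = (s0 * u + s1 * v for u, v in zip(a, b))
    return math.degrees(math.atan2(y, x)), math.degrees(math.atan2(z, math.hypot(x, y)))

# The planar midpoint of (-80, 60) -> (-20, 60) is (-50, 60); the great-circle
# midpoint is (-50, ~63.4), closer to the path a drifting float would follow.
print(slerp_lonlat((-80.0, 60.0), (-20.0, 60.0), 0.5))
```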
sd.sql("SELECT MIN(date), MAX(date) FROM measurements_to_group")
┌─────────────────────────────────┬─────────────────────────────────┐
│ min(measurements_to_group.date) ┆ max(measurements_to_group.date) │
│ date32 ┆ date32 │
╞═════════════════════════════════╪═════════════════════════════════╡
│ 1999-07-04 ┆ 2025-11-04 │
└─────────────────────────────────┴─────────────────────────────────┘
I’ll grid once a week because that results in vaguely the length of animation I’m after (about 68 seconds at 20 fps). Pandas has a nice function for this one.
import pandas as pd

date_range = pd.date_range(start="1999-07-04", end="2025-11-04", freq="W").date
sd.create_data_frame(pd.DataFrame({"frame_date": date_range})).to_view(
    "date_range", overwrite=True
)
sd.view("date_range").head(5)
┌────────────┐
│ frame_date │
│ date32 │
╞════════════╡
│ 1999-07-04 │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1999-07-11 │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1999-07-18 │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1999-07-25 │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1999-08-01 │
└────────────┘
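As a quick sanity check that weekly frames give the advertised runtime (this is just arithmetic on the same date range):

```python
import pandas as pd

# One frame per week over the full record; at 20 fps this lands right
# around the ~68 second animation mentioned above.
frames = pd.date_range(start="1999-07-04", end="2025-11-04", freq="W")
print(len(frames), len(frames) / 20)  # 1375 frames -> 68.75 seconds
```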
The gridding can be done with a range join, which from a SQL perspective looks very similar to a regular join. AI did a great job writing this range join for me.
sd.sql("""
SELECT
  dr.frame_date,
  fs.float_id,
  fs.start_date,
  fs.end_date,
  fs.geography
FROM date_range dr
JOIN float_segments fs
  ON dr.frame_date >= fs.start_date
  AND dr.frame_date < fs.end_date
ORDER BY fs.float_id, dr.frame_date
""").to_view("frame_segments", overwrite=True)
sd.view("frame_segments").head(5)
┌────────────┬──────────┬────────────┬────────────┬────────────────────────────────────────────────┐
│ frame_date ┆ float_id ┆ start_date ┆ end_date ┆ geography │
│ date32 ┆ utf8 ┆ date32 ┆ date32 ┆ geography │
╞════════════╪══════════╪════════════╪════════════╪════════════════════════════════════════════════╡
│ 2001-12-23 ┆ 1900037 ┆ 2001-12-18 ┆ 2001-12-28 ┆ LINESTRING(-27.252008730644764 36.60083498750… │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001-12-30 ┆ 1900037 ┆ 2001-12-28 ┆ 2002-01-07 ┆ LINESTRING(-27.71950030136404 36.598628050262… │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2002-01-06 ┆ 1900037 ┆ 2001-12-28 ┆ 2002-01-07 ┆ LINESTRING(-27.71950030136404 36.598628050262… │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2002-01-13 ┆ 1900037 ┆ 2002-01-08 ┆ 2002-01-17 ┆ LINESTRING(-28.295500029062985 36.50450000104… │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2002-01-20 ┆ 1900037 ┆ 2002-01-18 ┆ 2002-01-27 ┆ LINESTRING(-28.988500203782618 36.55050005114… │
└────────────┴──────────┴────────────┴────────────┴────────────────────────────────────────────────┘
Now, we do the interpolation. The function here is ST_LineInterpolatePoint() with some math to get the 0..1 value we need as an argument.
sd.sql("""
SELECT
  float_id,
  frame_date,
  ST_LineInterpolatePoint(
    geography,
    (frame_date - start_date)::DOUBLE / (end_date - start_date)::DOUBLE
  ) AS geography
FROM frame_segments
""").to_memtable().to_view("frame_points", overwrite=True)
sd.view("frame_points").head(5)
┌──────────┬────────────┬──────────────────────────────────────────────┐
│ float_id ┆ frame_date ┆ geography │
│ utf8 ┆ date32 ┆ geography │
╞══════════╪════════════╪══════════════════════════════════════════════╡
│ 69019 ┆ 1999-10-03 ┆ POINT(-24.742166043488947 50.30561555647618) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 69019 ┆ 1999-10-10 ┆ POINT(-24.583762721526426 50.30349173957814) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 69020 ┆ 2000-02-06 ┆ POINT(-28.535781206665877 51.19808428806561) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 69020 ┆ 2000-02-13 ┆ POINT(-28.602207991381 51.16722990440236) │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 69020 ┆ 2000-04-30 ┆ POINT(-25.22081603506072 53.439014807784055) │
└──────────┴────────────┴──────────────────────────────────────────────┘
Next I’ll query some bounds in projected space to get me started on the animation:
sd.sql("""
SELECT ST_Envelope(ST_GeomFromWKB(ST_AsBinary(ST_Transform(geography, 4326, 3857))))
FROM frame_points
""").to_pandas().total_bounds
array([-10398764.05732582, -6285889.46017892, 1681334.76592725,
14851331.28146015])
…pull some country data from GeoArrow example data to use as a backdrop:
sd.read_parquet(
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet"
).to_view("countries", overwrite=True)
countries = sd.sql("SELECT ST_Transform(geometry, 3857) as geometry FROM countries").to_pandas()
…and loop through our date_range to make the animation.
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 600
plt.rcParams['savefig.dpi'] = 600

for date in date_range[200:]:
    frame_df = sd.sql(f"""
    SELECT float_id, frame_date, ST_Transform(geography, 4326, 3857) as geography
    FROM frame_points WHERE frame_date = '{date}' ORDER BY frame_date
    """).to_pandas()

    ax = countries.plot(color='lightgrey', edgecolor='none')
    frame_df.plot(markersize=4, color='black', ax=ax)
    ax.set_xlim(-9390000, -4390000)
    ax.set_ylim(2550000, 5850000)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    plt.savefig(f"anim/gen_{date}.png")
    plt.close()
The GIF code is the same as for the river network animation I did a few days ago:
import glob
from PIL import Image

# Get all PNG files in the anim directory
png_files = sorted(glob.glob("anim/gen*.png"))

# Open all images
images = []
for png_file in png_files:
    img = Image.open(png_file)
    images.append(img)

# Append the last frame a few times so the animation sticks at the end
for i in range(20):
    images.append(img)

# Save as animated GIF
images[0].save(
    "animation.gif",
    save_all=True,
    append_images=images[1:],
    duration=50,  # milliseconds per frame
    loop=0
)
And here it is!

1. In this case at a depth of roughly 1000 meters. ↩︎
2. If you are one of these people and find yourself completely fabricating dimensions named things like N_MEASUREMENT where the values of all variables are largely empty for huge swaths of the file, please consider a reasonable tabular distribution format like Parquet. ↩︎