Day 5: Earth

It’s day 5…earth! I was particular excited about this day in the 30 day map challenege because earth is my thing…I did a M.Sc. in Geology and taught Geomorphology at Acadia University for several years. While my first thought was mapping bedrock or surficial geology in Nova Scotia, I was foiled by the fact that Nova Scotia DNR distributes its files as self-extracting executable files and I’m on a Mac. Instead, I turned to the open data of my new home in Winnipeg, Manitoba for the earthiest thing they could come up with. Turns out: it’s LiDAR.

My initial attempt to map the whole city fell apart after it took took too long to download the files, so I constrained my problem to downtown Winnipeg where there was a fighting chance of observable elevation change and a truly excellent market with some of the best Fish & Chips out there. I took a look at the tile map and figured out the download URL was a bit of a pattern.

# pip install "apache-sedona[db]"
import sedona.db

sd = sedona.db.connect()

tiles = [301, 302, 303, 273, 274, 275, 245, 246, 247]
urls = [f"https://wpgopendata.blob.core.windows.net/open-data-lidar/2011-000{i:03d}.las" for i in tiles]

Then I hacked together a quick download script. I am sure some package has figured this out by now but it was one of my first complaints when typing Python code regularly that there’s no built-in way to do this well (R has a helpful download.file() because, as Hadley Wickham once noted once in a keynote, R was built to get sh*it done).

import urllib.request
from pathlib import Path
import shutil

out_dir = Path("las")
out_dir.mkdir(exist_ok=True)

for url in urls:
    fout_path = out_dir / Path(url).name
    if fout_path.exists():
        continue
    print(url)
    with urllib.request.urlopen(url) as fin, open(fout_path.with_name("tmp"), "wb") as fout:
        shutil.copyfileobj(fin, fout)
    fout_path.with_name("tmp").rename(fout_path)

I have to apologize for the rest of this post to anybody who has literally any experience processing LiDAR data because that is what I’m about to do and I have never looked at it at all. Like any file that I’ve never looked at before, these LAS files need some peeking at to see what’s in the tin.

import laspy

with laspy.open("las/2011-000245.las") as fh:
    las_data = fh.read()
    print("Header info:")
    print(f"Number of points: {las_data.header.point_count}")
    print(f"Point format: {las_data.header.point_format}")
    print(f"Available point dimensions: {list(las_data.point_format.dimension_names)}")
    for i in range(3):
        point = las_data.points[i]
        print(f"Point {i}: X={point.x}, Y={point.y}, Z={point.z}")
Header info:
Number of points: 6065622
Point format: <PointFormat(1, 0 bytes of extra dims)>
Available point dimensions: ['X', 'Y', 'Z', 'intensity', 'return_number', 'number_of_returns', 'scan_direction_flag', 'edge_of_flight_line', 'classification', 'synthetic', 'key_point', 'withheld', 'scan_angle_rank', 'user_data', 'point_source_id', 'gps_time']
Point 0: X=<ScaledArrayView(632906.54)>, Y=<ScaledArrayView(5529099.94)>, Z=<ScaledArrayView(234.68)>
Point 1: X=<ScaledArrayView(632907.0700000001)>, Y=<ScaledArrayView(5529099.84)>, Z=<ScaledArrayView(234.71)>
Point 2: X=<ScaledArrayView(632907.59)>, Y=<ScaledArrayView(5529099.74)>, Z=<ScaledArrayView(234.78)>

These look like UTM coordinates…let’s have a look at those dimensions since surely we’ll need at least some of that data.1

las_data.point_format.dimensions
[DimensionInfo(name='X', kind=<DimensionKind.SignedInteger: 0>, num_bits=32, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='Y', kind=<DimensionKind.SignedInteger: 0>, num_bits=32, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='Z', kind=<DimensionKind.SignedInteger: 0>, num_bits=32, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='intensity', kind=<DimensionKind.UnsignedInteger: 1>, num_bits=16, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='return_number', kind=<DimensionKind.BitField: 3>, num_bits=3, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='number_of_returns', kind=<DimensionKind.BitField: 3>, num_bits=3, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='scan_direction_flag', kind=<DimensionKind.BitField: 3>, num_bits=1, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='edge_of_flight_line', kind=<DimensionKind.BitField: 3>, num_bits=1, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='classification', kind=<DimensionKind.BitField: 3>, num_bits=5, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='synthetic', kind=<DimensionKind.BitField: 3>, num_bits=1, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='key_point', kind=<DimensionKind.BitField: 3>, num_bits=1, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='withheld', kind=<DimensionKind.BitField: 3>, num_bits=1, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='scan_angle_rank', kind=<DimensionKind.SignedInteger: 0>, num_bits=8, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='user_data', kind=<DimensionKind.UnsignedInteger: 1>, num_bits=8, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='point_source_id', kind=<DimensionKind.UnsignedInteger: 1>, num_bits=16, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None),
 DimensionInfo(name='gps_time', kind=<DimensionKind.FloatingPoint: 2>, num_bits=64, num_elements=1, is_standard=True, description='', offsets=None, scales=None, no_data=None)]

Ok, cool. I’ll need an Arrow schema to get this into SedonaDB so let’s make one. It took a bit of poking to figure out the bitfields but it seems like they come out of the data as int8s. I looked up the vertical CRS for the data and paired with the horizontal CRS so that the GeoParquet I’m about to write effectively communicates the locations.

import pyarrow as pa
import pyproj
import geoarrow.pyarrow as ga

crs = pyproj.CRS("EPSG:26915+EPSG:10588")

pa_names = []
pa_types = []
for d in las_data.point_format.dimensions:
    if d.name in ("X", "Y", "Z"):
        continue
    elif d.kind.name == "BitField":
        pa_types.append(pa.int8())
    elif d.kind.name == "SignedInteger" and d.num_bits == 8:
        pa_types.append(pa.int8())
    elif d.kind.name == "UnsignedInteger" and d.num_bits == 8:
        pa_types.append(pa.uint8())
    elif d.kind.name == "UnsignedInteger" and d.num_bits == 16:
        pa_types.append(pa.uint16())
    elif d.kind.name == "FloatingPoint" and d.num_bits == 64:
        pa_types.append(pa.float64())
    else:
        raise ValueError(f"Unsupported kind: {d}")

    pa_names.append(d.name)

attr_names = list(pa_names)

pa_types.append(ga.wkb().with_crs(crs))
pa_names.append("geometry")
pa_schema = pa.schema({nm: dt for nm, dt in zip(pa_names, pa_types)})

Next I’ll write a reader. You can read more about this strategy for getting files loaded into SedonaDB or other Arrow-compatible systems in my previous post…the idea is that if you can write it as an generator that spits out RecordBatches, pyarrow can make you a RecordBatchReader which SedonaDB will happily scan for you.

def read_las(lidar_file_paths, chunk_size=65536):
    for path in lidar_file_paths:
        with laspy.open(path) as fh:
            for chunk in fh.chunk_iterator(chunk_size):
                cols = [getattr(chunk, attr) for attr in attr_names]
                pt = ga.make_point(chunk.x, chunk.y, chunk.z)
                cols.append(ga.as_wkb(pt))

                yield pa.record_batch(cols, schema=pa_schema)

next(read_las(["las/2011-000245.las"], 5)).to_pandas()

Let’s read in all our input files and write them to GeoParquet. This is mostly because we’re reading the LAS in Python from a single thread and because of that it’s slow, whereas SedonaDB is great at reading GeoParquet from all my laptop’s CPU cores without me trying.

from pathlib import Path

lidar_file_paths = Path("las").glob("*.las")
reader = pa.RecordBatchReader.from_batches(pa_schema, read_las(lidar_file_paths))
sd.create_data_frame(reader).to_parquet("las/parquet")

This took about 4 minutes with all the attributes (about 1 minute the first time I tried it without any of the attributes). This was about 1.65 GB of LAS files that turned into 800 MB of GeoParquet files.

Let’s see what we’re up against:

sd.read_parquet("las/parquet/*.parquet").count()
58924130

Hmm…60 million points. That’s more than we’ll be able to plot at once. Let’s see some of the attributes.

sd.read_parquet("las/parquet/*.parquet").show(5, width=1000)
┌───────────┬───────────────┬───────────────────┬─────────────────────┬─────────────────────┬────────────────┬───────────┬───────────┬──────────┬─────────────────┬───────────┬─────────────────┬────────────────────┬─────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│ intensity ┆ return_number ┆ number_of_returns ┆ scan_direction_flag ┆ edge_of_flight_line ┆ classification ┆ synthetic ┆ key_point ┆ withheld ┆ scan_angle_rank ┆ user_data ┆ point_source_id ┆      gps_time      ┆                                 bbox                                ┆                   geometry                   │
│   uint16  ┆      int8     ┆        int8       ┆         int8        ┆         int8        ┆      int8      ┆    int8   ┆    int8   ┆   int8   ┆       int8      ┆   uint8   ┆      uint16     ┆       float64      ┆                                struct                               ┆                   geometry                   │
╞═══════════╪═══════════════╪═══════════════════╪═════════════════════╪═════════════════════╪════════════════╪═══════════╪═══════════╪══════════╪═════════════════╪═══════════╪═════════════════╪════════════════════╪═════════════════════════════════════════════════════════════════════╪══════════════════════════════════════════════╡
│         7 ┆             1 ┆                 1 ┆                   0 ┆                   0 ┆              3 ┆         0 ┆         0 ┆        0 ┆             -11 ┆         0 ┆           29415 ┆  500960.0611180067 ┆ {xmin: 633030.06, ymin: 5528471.0, xmax: 633030.2, ymax: 5528472.0} ┆ POINT Z(633030.15 5528471.58 232.53)         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│        15 ┆             1 ┆                 1 ┆                   0 ┆                   0 ┆              3 ┆         0 ┆         0 ┆        0 ┆             -11 ┆         0 ┆           29415 ┆  500960.0611280203 ┆ {xmin: 633029.4, ymin: 5528471.0, xmax: 633029.5, ymax: 5528472.0}  ┆ POINT Z(633029.4500000001 5528471.58 232.41) │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│        16 ┆             1 ┆                 1 ┆                   0 ┆                   0 ┆              2 ┆         0 ┆         0 ┆        0 ┆             -11 ┆         0 ┆           29415 ┆ 500960.06113803387 ┆ {xmin: 633028.7, ymin: 5528471.0, xmax: 633028.8, ymax: 5528472.0}  ┆ POINT Z(633028.74 5528471.57 232.26)         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│        16 ┆             1 ┆                 1 ┆                   0 ┆                   0 ┆              3 ┆         0 ┆         0 ┆        0 ┆             -11 ┆         0 ┆           29415 ┆ 500960.06114804745 ┆ {xmin: 633028.0, ymin: 5528471.0, xmax: 633028.1, ymax: 5528472.0}  ┆ POINT Z(633028.08 5528471.5600000005 232.38) │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│        14 ┆             1 ┆                 1 ┆                   0 ┆                   0 ┆              2 ┆         0 ┆         0 ┆        0 ┆             -11 ┆         0 ┆           29415 ┆   500960.061158061 ┆ {xmin: 633027.3, ymin: 5528471.0, xmax: 633027.44, ymax: 5528472.0} ┆ POINT Z(633027.38 5528471.5600000005 232.27) │
└───────────┴───────────────┴───────────────────┴─────────────────────┴─────────────────────┴────────────────┴───────────┴───────────┴──────────┴─────────────────┴───────────┴─────────────────┴────────────────────┴─────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────┘

The first time I tried this I ignored the classification but it’s particularly relevant to the prompt here…we’re supposed to map stuff under our feet and things like buildings are not in that category. (Also, in an incredibly flat place like Winnipeg, it throws off the colour ramp so you can’t see anything cool). Let’s see what values we have available to us:

sd.read_parquet("las/parquet/*.parquet").to_view("las")
sd.sql("""
SELECT
    classification,
    COUNT(*) as count
FROM las
GROUP BY classification
ORDER BY count DESC
""").show()
┌────────────────┬──────────┐
│ classification ┆   count  │
│      int8      ┆   int64  │
╞════════════════╪══════════╡
│              3 ┆ 22046813 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│              2 ┆ 14944896 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│              6 ┆ 10621892 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│              5 ┆ 10385574 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│              9 ┆   610799 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│             10 ┆   295615 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│              1 ┆    18541 │
└────────────────┴──────────┘

A quick search led me to a reference guide by ESRI that listed some of what these are.

Because there are way too many points to plot in Python at once in any reasonable amount of time, I had AI write me some SQL to grid this for me and it worked quite well (only takes ~0.5 seconds).

sd.sql("""
SELECT
    FLOOR(ST_X(geometry) / 10) * 10 AS x_grid,
    FLOOR(ST_Y(geometry) / 10) * 10 AS y_grid,
    MIN(ST_Z(geometry)) AS z,
FROM las
WHERE classification IN (2, 9)
GROUP BY
    FLOOR(ST_X(geometry) / 10),
    FLOOR(ST_Y(geometry) / 10)
""").to_view("las_thinner", overwrite=True)

df = sd.sql(
    "SELECT ST_SetCrs(ST_Point(x_grid, y_grid), 'EPSG:26920') as geometry, z FROM las_thinner"
).to_pandas()

Let’s make sure we have a reasonably small number of points:

len(df)
79525

Finally, we make the plot. AI wrote this plot code for me and I’m ever so grateful it did because I spent a decade loading ggplot2 into my brain and know none of it.

import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 600
plt.rcParams['savefig.dpi'] = 600

fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(df.geometry.x, df.geometry.y, c=df['z'], cmap='viridis', s=0.2)
plt.colorbar(scatter, ax=ax, label='Elevation (z)')
ax.set_xticks([])
ax.set_yticks([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.show()

png

And there we have it! There are some cool things here…you can see a historic river channel hidden under the streets and the high ground on which the Manitoba Parliament building resides. The rail lines are also very visible (as well as some misclassified buildings). In the middle, the Red River and Assiniboine river meet (this is where you can find the Fish & Chips, if that’s what you were hoping for).


  1. Don’t laugh at me LiDAR people, I can see your eyes rolling and hear a chorus of (“of course he’s going to need some of that data…”) ↩︎

Software Engineer & Geoscientist