Public Data Dive: 2018 Boeing 737 MAX flights

The recent grounding of almost all Boeing 737 MAX-series aircraft in the world is, according to a recent CBC commentator, unprecedented. I’m not an aircraft expert (or even a hobbyist), but I do love data and mining publicly-available datasets. Inspired by the nycflights13 R package (a dataset of all the flights in and out of New York City in 2013) and the FlightRadar24 blog post regarding Lion Air flight JT610, I thought I would see what information is accessible to the public about flights that used the 737 MAX-series aircraft.

The nycflights13 R package uses the Bureau of Transportation Statistics Reporting Carrier On-Time Performance and the FAA Releasable Aircraft datasets, both of which are downloadable for free. As per my usual MO, I’ll be using the tidyverse to script up the analysis. Because I’ll be reading in a lot of foreign CSVs with weird names, I’ll also make myself a name cleaner-upper to make nice_columns from BADLY FORMATTED-COLUMNS.

library(tidyverse)
nice_names <- . %>%
  str_to_lower() %>%
  str_replace_all("[- ]", "_")

I opted to get the plane information first. This code was heavily inspired by the scripts in the data-raw/ folder of the nycflights13 R package.

planes_src <- "http://registry.faa.gov/database/yearly/ReleasableAircraft.2018.zip"
planes_tmp <- tempfile(fileext = ".zip")
curl::curl_download(planes_src, planes_tmp)
dir.create("planes")
unzip(planes_tmp, exdir = "planes")

The “planes” folder now contains several oddly named files, of which we are interested in two: “ACFTREF.txt” (airplane model information), and “planes/MASTER.txt” (all registered planes in the US). According to the emergency order by the FAA, the models that were grounded were the Boeing 737-8 and 737-9, although for this analysis I’ll include all 737s for comparison. To do this, I’ll need to build a clean table of plane model info for the planes we’re interested in.

planes_models <- read_csv(
  "planes/ACFTREF.txt",
  col_types = cols(
    .default = col_character(),
    X12 = col_skip()
  ),
  # there are quotation marks as parts of values in this table
  quote = ''
) %>%
  rename_all(nice_names)

planes_models_keep <- planes_models %>%
  filter(mfr == "BOEING", str_detect(model, "^737")) %>%
  mutate(model_label = if_else(str_detect(model, "^737-[89]$"), "737 MAX", "737")) %>%
  select(mfr_mdl_code = code, model_label, model_mfr = mfr, model)

planes_models_keep
## # A tibble: 606 x 4
##    mfr_mdl_code model_label model_mfr model  
##    <chr>        <chr>       <chr>     <chr>  
##  1 1382526      737         BOEING    737-66N
##  2 1382527      737         BOEING    737-7K2
##  3 1382528      737         BOEING    737-9K2
##  4 1382529      737         BOEING    737-8DR
##  5 1382530      737         BOEING    737-8DV
##  6 1382531      737         BOEING    737-7HF
##  7 1382532      737         BOEING    737-46J
##  8 1382540      737         BOEING    737-6Q8
##  9 1383333      737         BOEING    737-2E7
## 10 1384400      737         BOEING    737-7FB
## # … with 596 more rows

Next, I’ll use this to create a table of planes that we’re interested in. I modify the n_number to paste0("N", n_number) here because the tail_num variable in the flights dataset below has the tail number in that form.

planes_master <- read_csv(
  "planes/MASTER.txt",
  col_types = cols(
    .default = col_character(),
    `YEAR MFR` = col_integer(),
    X35 = col_skip()
  )
) %>%
  rename_all(nice_names)

planes_keep <- planes_master %>%
  inner_join(planes_models_keep, by = "mfr_mdl_code") %>%
  mutate(tail_num = paste0("N", n_number)) %>%
  select(tail_num, model_label, model_mfr, model, model_year = year_mfr)

planes_keep
## # A tibble: 2,197 x 5
##    tail_num model_label model_mfr model   model_year
##    <chr>    <chr>       <chr>     <chr>        <int>
##  1 N103WT   737         BOEING    737-700         NA
##  2 N108MS   737         BOEING    737-7BC       2002
##  3 N108WT   737         BOEING    737-700         NA
##  4 N11206   737         BOEING    737-824       2000
##  5 N114JF   737         BOEING    737-4H6       1993
##  6 N12216   737         BOEING    737-824       1998
##  7 N12218   737         BOEING    737-824       1998
##  8 N12221   737         BOEING    737-824       1998
##  9 N12225   737         BOEING    737-824       1998
## 10 N12238   737         BOEING    737-824       1999
## # … with 2,187 more rows

Interestingly, you can use the tail_num to look up flight data from FlightRadar24 like this https://www.flightradar24.com/data/aircraft/<TAIL NUMBER>. You can see an example for one 737 MAX-8 here (note that there aren’t any flights after March 13, when the ban went into effect).

Unfortunately, the pre-zipped option for the on-time performance data was not working at the time of this writing, so I had to download that data by hand (this amounts to one click per month…not too bad for just 2018, but still annoying). I downloaded these files to “flight_zips/2018-{month}.zip”. The code to clean this up was also inspired by the nycflights13 R package cleaning scripts. Note that this is about 300 MB of compressed data and 2.5 GB of uncompressed data before filtering.

extract_month <- function(year = 2018, month) {
  zipfile <- sprintf("flight_zips/%s_%s.zip", year, month)

  files <- unzip(zipfile, list = TRUE)
  # Only extract biggest file
  csv <- files$Name[order(files$Length, decreasing = TRUE)[1]]

  unzip(zipfile, exdir = "flights", junkpaths = TRUE, files = csv)

  src <- paste0("flights/", csv)
  dst <- paste0("flights/", year, "-", month, ".csv")
  file.rename(src, dst)
  dst
}

flight_raw_csvs <- tibble(
  year = 2018,
  month = 1:12,
  csv_file = map2_chr(year, month, extract_month)
)

flight_data <- flight_raw_csvs$csv_file %>%
  map(read_csv, col_types = cols(
    .default = col_character(),
    OP_CARRIER_FL_NUM = col_integer(),
    FL_DATE = col_date(),
    DIVERTED = col_double(),
    AIR_TIME = col_double(),
    CRS_ELAPSED_TIME = col_double(),
    ACTUAL_ELAPSED_TIME = col_double()
  )) %>%
  bind_rows() %>%
  rename_all(nice_names)

flights_keep <- flight_data %>%
  mutate(
    fl_num = paste0(op_unique_carrier, op_carrier_fl_num),
    diverted = if_else(diverted == 1, TRUE, FALSE)
  ) %>%
  select(
    fl_date, fl_num,
    tail_num, origin, dest, diverted,
    schedule_elapsed = crs_elapsed_time,
    actual_elapsed = actual_elapsed_time,
    air_time
  ) %>%
  inner_join(planes_keep, by = "tail_num")

flights_keep

Now that we have a big ol’ data frame of all the flights that took place on a 737 in the US during 2018, we can answer at least one question: were more 737-MAX flights diverted than flights on other 737 aircraft?

flights_keep %>%
  group_by(model_label) %>%
  summarise(
    n = n(),
    n_diverted = sum(diverted),
    pct_diverted = mean(diverted) * 100
  )
## # A tibble: 2 x 4
##   model_label       n n_diverted pct_diverted
##   <chr>         <int>      <int>        <dbl>
## 1 737         2378407       5322        0.224
## 2 737 MAX       38472         98        0.255

The answer appears to be yes, but not by much: 0.25% of 737 MAX flights were diverted, whereas on other 737 flights, 0.22% of flights were diverted. I’m not much of a statistician, but I believe this is a classic case for a Chi-square test, although I think I remember that Chi-squared tests don’t perform well with respect to rare events (flight diversions are definitely rare events). Still, according to this test, even though there were slightly more diverted 737 MAX flights, it is reasonably likely that we would observe this by chance (1 in 5).

observed_diverted <- 98
(observed_not_diverted <- 38472 - 98)
## [1] 38374
(expected_diverted <- 0.2237632 / 100 * 38472)
## [1] 86.08618
(expected_not_diverted <- 38472 - expected_diverted)
## [1] 38385.91
(chi_sq <- sum(
  (observed_diverted - expected_diverted)^2 / expected_diverted,
  (observed_not_diverted - expected_not_diverted)^2 / expected_not_diverted
))
## [1] 1.652501
(p_value <- 1 - pchisq(chi_sq, df = 1))
## [1] 0.1986189

In the off chance this is an artifact of diverted flights being rare, we could also randomly sample the non-MAX flights to see how likely it is that 98 flights of 38472 would be diverted.

non_max_flights <- flights_keep %>%
  filter(model_label == "737") %>%
  select(fl_date, fl_num, diverted)

set.seed(4937)
non_max_flights_boot <- tibble(
  sample_number = 1:1000,
  sample = map(sample_number, ~sample_n(non_max_flights, 38472))
) %>%
  unnest() %>%
  group_by(sample_number) %>%
  summarise(n_diverted = sum(diverted))

ggplot(non_max_flights_boot, aes(n_diverted)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 98) +
  theme_bw()

This is a confirmation of our earlier calculation: it is fairly reasonable that 98 flights of 38472 could be diverted, assuming the two datasets are comparable.

It would be interesting to see similar data as that used by the FlightRadar24 blog post regarding Lion Air flight JT610 (elevation trajectory of the plane during the first 20 minutes of flight) but for all 737 MAX flights. Were there other flights that looked like pilots were fighting with the automatic control system? This data exists at 5-second resolution from FlightRadar24, although even the highest of the paid data plans only lets you download 100 or so of these per day (and there are 2.5 million rows in our flights_keep dataset to examine).

Software Engineer & Geoscientist