The Circumpolar Diatom Database using R, the tidyverse, and mudata2

It is an exciting time for the integration of limnological and paleolimnological datasets. The (US) National Water Quality Monitoring Council's Water Quality Portal has just made decades of state and federal water quality measurements available, the PAGES 2k project has collected hundreds of temperature proxy records for the last 2,000 (ish) years, and the Neotoma database provides access to a large number of paleoecological datasets. For a final project in a course last fall, I chose to analyze the Circumpolar Diatom Database (CDD), a collection of water chemistry and diatom assemblage data hosted by the Aquatic Paleoecology Laboratory at ULaval. In this post, we will (1) download and clean the data and metadata from the CDD website, and (2) use the mudata2 package to extract some data. Part 2 is more exciting, so I'll put that first…
The point of it all
This post takes a single webpage and uses it to create a mudata() object. The (Mostly) Universal Data format is a file format and R package that lets you store metadata (i.e., location, parameter, and dataset information) and data in a way that is easy to read, write, and subset. It's a good way to store clean data that will get used later, or to distribute data for others to use. With it, we can do things like this:
library(mudata2)
cdd <- read_mudata("cdd.mudata")
cdd %>%
  select_datasets(ALERT, BYL) %>%
  filter_params(param_type == "diatom_count") %>%
  filter_data(value > 5)
## A mudata object aligned along <none>
## distinct_datasets(): "ALERT", "BYL"
## distinct_locations(): "A-A", "A-AA" ... and 35 more
## distinct_params(): "ACsubato", "ACsuchla" ... and 56 more
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 x 5
## dataset location param value unit
## <chr> <chr> <chr> <dbl> <chr>
## 1 BYL BI-02 Navicula rynchocephala 6.82 % rel abund
## 2 BYL BI-02 Staurosira construens var. subrotunda 17.3 % rel abund
## 3 BYL BI-02 PIbalf 8.40 % rel abund
## 4 BYL BI-02 SApinnat 47.0 % rel abund
## 5 BYL BI-23 Undetermined diatom [CDD] 6.76 % rel abund
## 6 BYL BI-23 ACsuchla 8.75 % rel abund
The resulting object contains all the information of the original object, but only for diatom data from the ALERT and BYL datasets, where relative abundance is greater than 5%. Similarly, one could find all the locations where silica was measured:
cdd %>%
  select_params(SiO2)
## A mudata object aligned along <none>
## distinct_datasets(): "ALERT", "BYL" ... and 10 more
## distinct_locations(): "A-A", "A-AA" ... and 402 more
## distinct_params(): "SiO2"
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 x 5
## dataset location param value unit
## <chr> <chr> <chr> <dbl> <chr>
## 1 BYL BI-01 SiO2 0.0600 mg/L
## 2 BYL BI-02 SiO2 0.0700 mg/L
## 3 BYL BI-03 SiO2 0.270 mg/L
## 4 BYL BI-04 SiO2 0.660 mg/L
## 5 BYL BI-05 SiO2 1.44 mg/L
## 6 BYL BI-06 SiO2 1.02 mg/L
To get data out of the object, one can use the tbl_data() (or tbl_data_wide() for the classic parameter-wide version), tbl_params(), tbl_datasets(), and tbl_locations() functions.
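For example, a parameter-wide table of a couple of water chemistry parameters might be extracted like so (a sketch using two params we already know are in the CDD; output not shown):

# one column per parameter, one row per location
cdd %>%
  select_params(pH, SiO2) %>%
  tbl_data_wide()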
Where on earth is the CDD, anyway?
cdd %>%
  tbl_locations() %>%
  ggplot(aes(x = long_w, y = lat_n, col = dataset)) +
  annotation_map(map_data("world"), fill = "white", col = "grey50") +
  geom_point() +
  coord_map("ortho") +
  theme(panel.background = element_rect(fill = "lightblue"))
For some good old-fashioned relative abundance diagrams, we can use tbl_data() to extract ggplot-friendly data (in this case, only plotting taxa that reach >25% relative abundance somewhere).
cdd %>%
  select_datasets(ALERT) %>%
  filter_params(param_type == "diatom_count") %>%
  tbl_data() %>%
  group_by(param) %>%
  filter(any(value > 25)) %>%
  ungroup() %>%
  ggplot(aes(x = location, y = value)) +
  geom_col() +
  facet_grid(~param, space = "free_x", scales = "free_x") +
  coord_flip()
Convenient, right!? To get there, however, there is a long road of data cleaning ahead…
Cleaning the data
Welcome to the data cleaning section! It is a story of intrigue, excitement, web scraping, and mildly complicated regular expressions. This section of the post is for those who are somewhat familiar with data manipulation and data tidying using the tidyverse, and are looking to practice on a real-world dataset.
In this post I’ll use the tidyverse core packages to do the hard work, the lubridate package to do some date parsing, the curl package to download files, the rvest package to extract data from web pages, the readxl package to read the spreadsheets that contain the data, and the mudata2 package to organize the cleaned data at the end.
library(tidyverse)
I'll also use the rename_friendly() function below to make column names from the wild (in this case, Excel files that other people have created) easier to type in R. Essentially, it takes the current column names, makes them lowercase, and replaces any run of non-alphanumeric characters with an underscore. I promise this is useful…
rename_friendly <- function(df) {
  names(df) <- names(df) %>%
    str_to_lower() %>%
    # collapse runs of anything that isn't a lowercase letter,
    # digit, hyphen, or underscore into a single underscore
    str_replace_all("[^a-z0-9-_]+", "_") %>%
    str_remove("^_") %>%
    str_remove("_$") %>%
    # make sure the resulting names are unique
    tidy_names()
  df
}
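For example, here is what it does to column names like the ones we are about to encounter (made-up input, but the pattern is real):

tibble(`Lake #` = "BI-01", `Lat (°N)` = 73.13) %>%
  rename_friendly() %>%
  names()
## [1] "lake"  "lat_n"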
Extracting the dataset information
The CDD is one of the more accessible datasets on the web, in that the entire dataset is available for download, and the metadata is included on the website (notably in the datasets table). We could copy and paste the table into a spreadsheet and then do some modifying, but that would be no fun. Instead, we will use the rvest package to extract the relevant information about each dataset from the table. Below, the page HTML is read using read_html(), and the tables are extracted using html_table() (the datasets table is the first one on the page, hence the [[1]]).
library(rvest)
# scrape datasets info from datasets page
datasets_page_url <- "http://www.cen.ulaval.ca/CDD/DatasetList.aspx"
datasets_page <- read_html(datasets_page_url)
datasets_table <- html_table(datasets_page)[[1]] %>%
  rename_friendly() %>%
  select(-data)
The datasets table is great for display, but in R we will need that information in a machine-readable format. For example, the range of visit dates is a bit of a mess as extracted by html_table():
datasets_table %>%
  select(dataset_code, visit_date) %>%
  head()
dataset_code | visit_date |
---|---|
BYL | June 3, 2005to August 5, 2006 |
SAL | April 1st, 2002to September 30, 2004 |
SI | July 12, 2004to July 18, 2004 |
ALERT | July 24, 2000to August 7, 2000 |
MB | July 12, 1999to July 21, 1999 |
ABK | August 15, 1997to September 7, 1997 |
In R, these are more useful as two columns of Date objects: visit_date_start and visit_date_end. The strategy is to split the visit_date column on the text "to" using the separate() function, and then use the mdy() (month, day, year) function in the lubridate package to turn the pieces into Date objects.
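(As a quick check that mdy() copes with this date format:)

lubridate::mdy("June 3, 2005")
## [1] "2005-06-03"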
datasets_table <- datasets_table %>%
  separate(
    visit_date,
    into = c("visit_date_start", "visit_date_end"),
    sep = "to"
  ) %>%
  mutate(
    visit_date_start = lubridate::mdy(visit_date_start),
    visit_date_end = lubridate::mdy(visit_date_end)
  )

datasets_table %>%
  select(dataset_code, visit_date_start, visit_date_end) %>%
  head()
dataset_code | visit_date_start | visit_date_end |
---|---|---|
BYL | 2005-06-03 | 2006-08-05 |
SAL | 2002-04-01 | 2004-09-30 |
SI | 2004-07-12 | 2004-07-18 |
ALERT | 2000-07-24 | 2000-08-07 |
MB | 1999-07-12 | 1999-07-21 |
ABK | 1997-08-15 | 1997-09-07 |
The second column that is readable to humans but perhaps less useful in R is the coordinates column.
datasets_table %>%
  select(dataset_code, coordinates) %>%
  head()
dataset_code | coordinates |
---|---|
BYL | LAT: 72.51° to 73.12°LONG: -79.25° to -80.08° |
SAL | LAT: 60.86° to 62.18°LONG: -69.56° to -75.72° |
SI | LAT: 64.21° to 65.21°LONG: -82.52° to -84.2° |
ALERT | LAT: NA° to NA°LONG: NA° to NA° |
MB | LAT: NA° to NA°LONG: NA° to NA° |
ABK | LAT: 67.07° to 68.48°LONG: 23.52° to 17.67° |
This is a little more complicated to parse, but each string has the form "LAT: (a number or NA)° to (a number or NA)°LONG: (a number or NA)° to (a number or NA)°". Because the text is structured, we can use a regular expression and the extract() function to pull out the four numbers, and then apply the as.numeric() function to the columns we just extracted.
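Before committing to the pattern, we can test it against a single coordinates string using str_match() from stringr (the four capture groups should come back as the four numbers):

str_match(
  "LAT: 72.51° to 73.12°LONG: -79.25° to -80.08°",
  "LAT: ([0-9.-]+|NA)° to ([0-9.-]+|NA)°LONG: ([0-9.-]+|NA)° to ([0-9.-]+|NA)°"
)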
datasets_table <- datasets_table %>%
  extract(
    coordinates,
    into = c("lat_min", "lat_max", "lon_min", "lon_max"),
    regex = "LAT: ([0-9.-]+|NA)° to ([0-9.-]+|NA)°LONG: ([0-9.-]+|NA)° to ([0-9.-]+|NA)°"
  ) %>%
  mutate_at(vars(lat_min, lat_max, lon_min, lon_max), as.numeric)

datasets_table %>%
  select(dataset_code, lat_min, lat_max, lon_min, lon_max) %>%
  head()
dataset_code | lat_min | lat_max | lon_min | lon_max |
---|---|---|---|---|
BYL | 72.51 | 73.12 | -79.25 | -80.08 |
SAL | 60.86 | 62.18 | -69.56 | -75.72 |
SI | 64.21 | 65.21 | -82.52 | -84.20 |
ALERT | NA | NA | NA | NA |
MB | NA | NA | NA | NA |
ABK | 67.07 | 68.48 | 23.52 | 17.67 |
There are two columns in the online table that contain links, which aren't kept by html_table() (it extracts only the text). The links will be useful when we need to download the data for each dataset, but even if they weren't, they would be useful information to have as columns in our datasets_table. The approach I took was to create a data.frame (tibble) of all the links on the page, then filter out the ones I needed (the data link and the more-information link) separately. Essentially, it is looking for text in the web page that looks like <a href="...data_(dataset code).zip">...</a> for the data link, and <a href="...DatasetInfo2.aspx...">(dataset code)</a> for the dataset info link. Finally, it uses a left_join() to add this information to the datasets_table.
dataset_page_links <- tibble(
  node = html_nodes(datasets_page, "a"),
  href = html_attr(node, "href"),
  text = html_text(node)
)
dataset_links_tbl <- dataset_page_links %>%
  select(data_link = href) %>%
  extract(
    data_link,
    into = "dataset_code",
    regex = "data_([a-z]+)\\.zip$",
    remove = FALSE
  ) %>%
  mutate(dataset_code = str_to_upper(dataset_code)) %>%
  filter(!is.na(dataset_code))
dataset_info_links_tbl <- dataset_page_links %>%
  filter(str_detect(href, "DatasetInfo2.aspx")) %>%
  select(dataset_code = text, info_link = href) %>%
  mutate(info_link = paste0("http://www.cen.ulaval.ca/CDD/", info_link))
datasets_table <- datasets_table %>%
  left_join(dataset_links_tbl, by = "dataset_code") %>%
  left_join(dataset_info_links_tbl, by = "dataset_code")

datasets_table %>%
  select(dataset_code, data_link, info_link) %>%
  head()
Parsing the data
Now that we have the links for all the datasets, we can download them! Because they are ZIP files, we can extract them as well. To do this in one go, I used pmap() to take a tibble with the columns url and destfile and apply the curl_download() function from the curl package to each row. The curl_download() function returns the path of the downloaded file, so we can pipe the result to the unzip() function, which extracts each ZIP file into the working directory.
datasets_table %>%
  select(url = data_link) %>%
  mutate(destfile = basename(url)) %>%
  pmap(curl::curl_download, quiet = TRUE) %>%
  map(unzip)
If all went well, you should have a whole lot of Excel files in your working directory, named something like diatom_count_....xlsx and water_chemistry_....xlsx. These Excel files are the CDD, but what do they contain? We can use the readxl package to have a look.
library(readxl)
read_excel("diatom_count_byl.xlsx", skip = 1) %>%
  head()
Diatom taxon | Taxon code | BI-02 | BI-23 | BI-24 | BI-27 |
---|---|---|---|---|---|
Chamaepinnularia grandupii | NA | 0.2624672 | NA | NA | NA |
Diatoma monoliformis | NA | NA | NA | 6.0000000 | 1.4950166 |
Encyonema elginense | NA | NA | NA | 0.1980198 | NA |
Encyonema fogedii | NA | NA | NA | 0.1980198 | 0.8305648 |
Eunotia binularis | NA | NA | NA | 1.5841584 | NA |
Fragilaria construens var. subrotunda | NA | NA | 4.373757 | NA | NA |
read_excel("water_chemistry_byl.xlsx", skip = 1) %>%
  select(1:7) %>%
  head()
Lake # | Lat (°N) | Long (°W) | Alt (m) | Depth (m) | Area (ha) | pH |
---|---|---|---|---|---|---|
BI-01 | 73.13333 | -80.10000 | 8 | 2.8 | 84491 | 6.7 |
BI-02 | 73.05000 | -80.13333 | 10 | 2.1 | 18265 | 6.3 |
BI-03 | 73.18333 | -79.86667 | 10 | 10.0 | 17916 | 7.0 |
BI-04 | 73.18333 | -79.86667 | 10 | 4.0 | 30113 | 7.0 |
BI-05 | 73.20000 | -79.85000 | 10 | 9.5 | 47481 | 7.0 |
BI-06 | 73.20000 | -79.85000 | 10 | 10.5 | 70546 | 6.7 |
It looks like the diatom_count_... spreadsheets have one row for each taxon, with locations as columns, whereas the water_chemistry_... spreadsheets have one row for each location, with one column per water quality parameter. Additionally, the water_chemistry_... spreadsheets contain information about each location that is unlikely to change with time. This is perhaps a semantic difference, but an important one for structuring the data: water quality measurements and the relative abundances of diatom taxa are measurements from the CDD, while the location/altitude/depth/area information is not a measurement in quite the same way.
The two tables contain similar information structured in slightly different ways: one is "location-wide" and the other is "parameter-wide". One way to combine them is to make both into tables that have one row per measurement. From a tidy data standpoint, this means we can gather() values that are in columns into rows. Here is what that looks like for both spreadsheets (note that the negative numbers mean all the columns except those ones):
read_excel("diatom_count_byl.xlsx", skip = 1) %>%
  gather(-1, -2, key = "lake", value = "rel_abundance") %>%
  rename_friendly() %>%
  head()
diatom_taxon | taxon_code | lake | rel_abundance |
---|---|---|---|
Chamaepinnularia grandupii | NA | BI-02 | 0.2624672 |
Diatoma monoliformis | NA | BI-02 | NA |
Encyonema elginense | NA | BI-02 | NA |
Encyonema fogedii | NA | BI-02 | NA |
Eunotia binularis | NA | BI-02 | NA |
Fragilaria construens var. subrotunda | NA | BI-02 | NA |
read_excel("water_chemistry_byl.xlsx", skip = 1) %>%
  select(-(2:6)) %>%
  gather(-1, key = "param", value = "value") %>%
  rename_friendly() %>%
  head()
lake | param | value |
---|---|---|
BI-01 | pH | 6.7 |
BI-02 | pH | 6.3 |
BI-03 | pH | 7.0 |
BI-04 | pH | 7.0 |
BI-05 | pH | 7.0 |
BI-06 | pH | 6.7 |
As I mentioned above, some of the information in the water chemistry spreadsheet is more suited to a “location information” table (similar to our “dataset information” table above).
read_excel("water_chemistry_byl.xlsx", skip = 1) %>%
  select(1:6) %>%
  rename_friendly() %>%
  head()
lake | lat_n | long_w | alt_m | depth_m | area_ha |
---|---|---|---|---|---|
BI-01 | 73.13333 | -80.10000 | 8 | 2.8 | 84491 |
BI-02 | 73.05000 | -80.13333 | 10 | 2.1 | 18265 |
BI-03 | 73.18333 | -79.86667 | 10 | 10.0 | 17916 |
BI-04 | 73.18333 | -79.86667 | 10 | 4.0 | 30113 |
BI-05 | 73.20000 | -79.85000 | 10 | 9.5 | 47481 |
BI-06 | 73.20000 | -79.85000 | 10 | 10.5 | 70546 |
The code above works for one dataset, but we have 15! Because the files are well-named and all structured the same way, we can write a function to read a single dataset for each table type from above.
read_diatom_count <- function(dataset_code) {
  paste0("diatom_count_", str_to_lower(dataset_code), ".xlsx") %>%
    read_excel(skip = 1) %>%
    gather(-1, -2, key = "lake", value = "rel_abundance") %>%
    rename_friendly() %>%
    mutate(dataset_code = dataset_code)
}

read_water_chemistry <- function(dataset_code) {
  paste0("water_chemistry_", str_to_lower(dataset_code), ".xlsx") %>%
    read_excel(skip = 1) %>%
    select(-(2:6)) %>%
    gather(-1, key = "label", value = "value") %>%
    rename_friendly() %>%
    mutate(dataset_code = dataset_code)
}

read_lake_info <- function(dataset_code) {
  paste0("water_chemistry_", str_to_lower(dataset_code), ".xlsx") %>%
    read_excel(skip = 1) %>%
    select(1:6) %>%
    rename_friendly() %>%
    mutate(dataset_code = dataset_code)
}
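A quick spot-check on a single dataset to make sure the functions behave (output omitted):

read_diatom_count("BYL") %>% head()
read_water_chemistry("BYL") %>% head()
read_lake_info("BYL") %>% head()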
Then, we can apply each function along the vector of dataset codes (map_) and combine the resulting data frames by row (_dfr) to make one big table for each data type.
diatom_counts <- datasets_table %>%
  pull(dataset_code) %>%
  map_dfr(read_diatom_count)

water_chemistry <- datasets_table %>%
  pull(dataset_code) %>%
  map_dfr(read_water_chemistry)

lake_info <- datasets_table %>%
  pull(dataset_code) %>%
  map_dfr(read_lake_info)
Creating a mudata object
A mudata object (standing for (Mostly) Universal Data) is a way to store cleaned data so that it's amenable to reading, writing, documentation, and subsetting. It's implemented in the mudata2 package and consists of five tables: data (a one-measurement-per-row table), locations (a table containing location information), params (a table containing parameter information), datasets (a table containing dataset information), and columns (containing column documentation). For a demo of why this might be useful in this context, see the beginning of this post.
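(We already glimpsed the five table names in the print output at the top of the post; src_tbls() lists them directly. Exact output spacing may differ:)

cdd %>% src_tbls()
## [1] "data"      "locations" "params"    "datasets"  "columns"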
We already have the raw data for these tables, but the column names used by mudata2 (dataset, location, param, and value) are slightly different from what we have used so far. The datasets table is the closest to its final form, needing only the dataset_code column renamed to dataset.
datasets_table <- datasets_table %>%
  rename(dataset = dataset_code)
The locations table is close as well, but with the benefit of hindsight I can tell you that four locations included in the diatom counts are not documented in the lake info table. The mudata() function (which combines all these tables) is really picky about referential integrity, so we have to add those locations here to avoid an error later.
locations_table <- lake_info %>%
  rename(location = lake, dataset = dataset_code) %>%
  bind_rows(
    tibble(
      dataset = c("ALERT", "ALERT", "MB", "ISACH"),
      location = c("A-LDL2", "A-Self0cm", "MB-WS", "I-C")
    )
  )
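(For the curious, this is the kind of check that turns up those locations; after the bind_rows() above, this anti_join() should return zero rows:)

# locations present in the diatom counts but absent from the locations table
diatom_counts %>%
  distinct(dataset = dataset_code, location = lake) %>%
  anti_join(locations_table, by = c("dataset", "location"))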
The params table doesn't explicitly exist yet, but we can create it by finding distinct combinations of dataset and label in the water chemistry table. The params table is a good place to put units, although the param identifier itself is not, so I've extracted the unit (where one exists) into a separate column.
params_water_chemistry <- water_chemistry %>%
  rename(dataset = dataset_code) %>%
  filter(!is.na(value)) %>%
  distinct(dataset, label) %>%
  extract(
    label,
    into = c("param", "unit"),
    regex = "(.*?)\\s*\\((.*?)\\)",
    remove = FALSE
  ) %>%
  mutate(param = coalesce(param, label)) %>%
  mutate(param_type = "water_chemistry")
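(To illustrate the extract() step, here it is applied to two hypothetical labels; a label without a parenthesized unit doesn't match the regex, leaving NAs for coalesce() to fill:)

tibble(label = c("DOC (mg/L)", "pH")) %>%
  extract(label, into = c("param", "unit"), regex = "(.*?)\\s*\\((.*?)\\)", remove = FALSE) %>%
  mutate(param = coalesce(param, label)) %>%
  select(label, param, unit)
## # A tibble: 2 x 3
##   label      param unit
##   <chr>      <chr> <chr>
## 1 DOC (mg/L) DOC   mg/L
## 2 pH         pH    <NA>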
This is usually a good place to make sure that everything is measured in the same unit. Luckily, each parameter only has one unit!
params_water_chemistry %>%
  group_by(param) %>%
  summarise(n_units = n_distinct(unit)) %>%
  filter(n_units > 1) %>%
  pull(param)
## character(0)
The diatom counts are a little more complicated, since they are identified by both diatom_taxon and taxon_code. The mudata format needs one unique identifier for each parameter, and because not all diatom_taxon values have a taxon_code (which is consistent across the CDD), we can coalesce() the two columns. That is, we prefer the value of taxon_code, but if it is missing, we fall back on the value of diatom_taxon instead.
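(A tiny illustration of coalesce() with made-up values:)

coalesce(c("ACsuchla", NA), c("Achnanthes suchlandtii", "Navicula sp."))
## [1] "ACsuchla"     "Navicula sp."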
It isn’t explicit anywhere I could find on the CDD website, but the values associated with each taxon appear to be percent relative abundance values (in general, they all add up to 100 for each location). Because there’s a unit for the water chemistry parameters, I decided to add this tidbit of information to the diatom count parameters as well.
params_diatom_count <- diatom_counts %>%
  rename(dataset = dataset_code) %>%
  filter(!is.na(rel_abundance)) %>%
  distinct(dataset, diatom_taxon, taxon_code) %>%
  mutate(param = coalesce(na_if(taxon_code, "(blank)"), diatom_taxon)) %>%
  mutate(label = diatom_taxon) %>%
  mutate(param_type = "diatom_count") %>%
  mutate(unit = "% rel abund")
Finally, we can combine the two parameter tables to make the final table.
params_table <- bind_rows(params_water_chemistry, params_diatom_count)
All that's left is the data table! The water_chemistry and diatom_counts tables are almost in the right form, but we need to make sure that the param identifiers we assigned in the previous step are the same ones that get used in the data table. It's also nice to have the unit next to the value, which makes reading the table a bit easier.
data_water_chemistry <- water_chemistry %>%
  filter(!is.na(value)) %>%
  rename(dataset = dataset_code, location = lake) %>%
  left_join(params_table, by = c("dataset", "label")) %>%
  select(dataset, location, param, value, unit)

data_diatom_counts <- diatom_counts %>%
  filter(!is.na(rel_abundance)) %>%
  rename(dataset = dataset_code, location = lake, value = rel_abundance) %>%
  left_join(params_table, by = c("dataset", "diatom_taxon", "taxon_code")) %>%
  select(dataset, location, param, value, unit)

data_table <- bind_rows(data_water_chemistry, data_diatom_counts)
Now it's mudata() time! Usually the data table contains some kind of qualifying information about each measurement (e.g., time collected, or the depth in the water column at which it was collected), but because of how the original data were structured, there was no place for this information. The mudata() function calls these x_columns (they usually end up on the x axis of plots), but here there are none (hence the character(0)).
library(mudata2)
cdd_mudata <- mudata(
  data = data_table,
  locations = locations_table,
  params = params_table,
  datasets = datasets_table,
  x_columns = character(0)
)
cdd_mudata
## A mudata object aligned along <none>
## distinct_datasets(): "ABK", "ALERT" ... and 13 more
## distinct_locations(): "A-A", "A-AA" ... and 569 more
## distinct_params(): "ACaltaic", "ACamoena" ... and 1145 more
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 x 5
## dataset location param value unit
## <chr> <chr> <chr> <dbl> <chr>
## 1 BYL BI-01 pH 6.70 <NA>
## 2 BYL BI-02 pH 6.30 <NA>
## 3 BYL BI-03 pH 7.00 <NA>
## 4 BYL BI-04 pH 7.00 <NA>
## 5 BYL BI-05 pH 7.00 <NA>
## 6 BYL BI-06 pH 6.70 <NA>
The default print() method suggests a few ways to interact with the object, but what we're after here is write_mudata_dir(), which writes the collection of tables to a directory (there are also ZIP and JSON variants, but a directory is usually the most useful):
write_mudata_dir(cdd_mudata, "cdd.mudata")
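And that brings us full circle: reading the folder back with read_mudata("cdd.mudata") is exactly where this post started.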