The code here scrapes the Santa Barbara County Public Health Department’s COVID-19 status page for case counts. It’s hard to imagine how they could have made the data harder to extract, given that what is counted, how it is counted, and how counts are presented have all changed multiple times since the pandemic began. Our goal is to extract cumulative total confirmed cases (only), broken down by date and by geographic area within the county. Without significantly more work, this is possible only as far back as April 27, 2020.
The data frequency is daily, except for weekends and holidays.
library(tidyverse)
library(lubridate)
library(rvest)
# Unicode characters that appear in the page
nbsp <- intToUtf8(0x00A0)    # non-breaking space (shows up in some column names)
emdash <- intToUtf8(0x2014)  # em dash (appears in some case-count cells; treated as zero below)
url <- "https://publichealthsbc.org/status-reports/"
The page can be loaded live, but here we assume that it has already been downloaded via wget or a similar tool. In a few places, non-breaking spaces appear in column names; these make the columns effectively impossible to reference from R, so we strip the non-breaking spaces out entirely.
raw_page <- read_file("data/cache.html") %>% str_replace_all(nbsp, "")
page <- read_html(raw_page)
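For reference, here is one way to produce that cached copy from inside R, a sketch using base R’s download.file in place of wget; data/cache.html is simply the path the code above expects.
# Hypothetical one-time caching of the live page (wget or any other tool works equally well)
if (!file.exists("data/cache.html")) {
  download.file(url, "data/cache.html", quiet=TRUE)
}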
Each day’s data is contained in a pair of adjacent sibling <div>s: the first contains the date inside a hyperlink (a couple of different formats are used for the date), and the second contains the data table we want (mixed in with other tables and content in variable order). We identify the table we want by looking for the presence of key column names; the cases column name might be “Confirmed Cases” or “Total Confirmed Cases”. blocks is a list of HTML nodes representing the first <div> in each pair.
table_constraint <- paste("descendant::td//text()[contains(.,'Geographic Area')]",
                          "and",
                          "descendant::td//text()[contains(.,'Confirmed Cases')]")
blocks <- page %>%
  html_nodes(xpath=paste("//div[starts-with(@id,'elementor-tab-title')]",
                         "[following-sibling::div/table[", table_constraint, "]]"))
Extract the dates and tables. dates is a vector of Date objects (mdy() parses the extracted date strings); tables is a list of data frames.
dates <- blocks %>%
  html_node(xpath="a") %>%
  html_text %>%
  str_extract("\\w+ \\d{1,2}, \\d{4}") %>%
  mdy
tables <- blocks %>%
  html_node(xpath=paste("following-sibling::div/table[", table_constraint, "]")) %>%
  html_table(header=TRUE)
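Before combining these, a couple of defensive checks (additions for illustration, not part of the original pipeline) can confirm that every block yielded both a parseable date and a table.
# Hypothetical consistency checks: one date per table, and every date string parsed
stopifnot(length(dates) == length(tables))
stopifnot(!any(is.na(dates)))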
We need to join the dates to the data frames as an additional column, concatenate the data frames to form a single data frame, and perform other cleanups. This is most easily done all at once.
form_df <- function (table, date) {
  # Combine a single table (data frame) and a single date
  if ("Total Confirmed Cases" %in% colnames(table)) {
    colname <- "Total Confirmed Cases"
  } else {
    colname <- "Confirmed Cases"
  }
  table %>%
    select(all_of(c("Geographic Area", colname))) %>%
    rename(area="Geographic Area") %>%
    rename(cases=all_of(colname)) %>%
    mutate(cases=as.character(cases)) %>%                   # ensure type is chr for next line to work
    mutate(cases=if_else(cases == emdash, "0", cases)) %>%  # type must be chr here
    mutate(cases=as.integer(cases)) %>%
    mutate(date=date) %>%                                   # add a date column
    filter(!grepl("Total|Pending", area))                   # filter out non-areas
}
table <- map2_df(tables, dates, ~ form_df(.x, .y))
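At this point table should contain one row per geographic area per reporting day, with area (chr), cases (int), and date (Date) columns; a quick inspection (added here purely for illustration) confirms the shape.
# Illustrative look at the combined data frame (not part of the pipeline proper)
glimpse(table)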
A couple more cleanups are required. First, there are several duplicate dates; for each of these we keep only the data from the first occurrence of the date (i.e., what appears to be the most recently added data). Second, the status page was modified at some point to go back only as far as June 1, so we append a cache of historical data we obtained previously to form a single table going back to April 27.
table <- table %>% distinct(area, date, .keep_all=TRUE)
table <- rbind(table, read_csv("data/historical.csv", col_types="ciD"))
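Because the historical cache could conceivably overlap the scraped date range, a final guard (again an addition, not in the original code) can verify that each area/date pair still appears exactly once.
# Hypothetical guard against overlap between scraped and cached historical rows
stopifnot(nrow(table) == nrow(distinct(table, area, date)))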
Write the data out.
write_csv(table, "data/data.csv")