The code here scrapes the Santa Barbara County Public Health Department’s COVID-19 status page for case counts. It’s hard to imagine how they could have made the data harder to extract, given that what is counted, how it is counted, and how counts are presented have all changed multiple times since the pandemic began. Our goal is to extract cumulative total confirmed cases (only), broken down by date and by geographic area within the county. Without significantly more work, this is possible only as far back as April 27, 2020.
The data frequency is daily, except for weekends and holidays.
library(tidyverse)
library(lubridate)
library(rvest)
# Unicode characters that appear in the page
nbsp <- intToUtf8(0x00A0)    # non-breaking space (shows up in some column names)
emdash <- intToUtf8(0x2014)  # em dash (appears in some case-count cells; treated as zero below)
url <- "https://publichealthsbc.org/status-reports/"
The page can be loaded live, but here we assume that it has already been downloaded via wget or a similar tool. In a few places, non-breaking spaces appear in column names; these make the columns effectively impossible to reference from R, so we strip the non-breaking spaces out entirely.
raw_page <- read_file("data/cache.html") %>% str_replace_all(nbsp, "")
page <- read_html(raw_page)
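For reference, here is one way to produce that cached copy from inside R, a sketch using base R’s download.file in place of wget; data/cache.html is simply the path the code above expects.
# Hypothetical one-time caching of the live page (wget or any other tool works equally well)
if (!file.exists("data/cache.html")) {
  download.file(url, "data/cache.html", quiet=TRUE)
}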
Each day’s data is contained in a pair of adjacent sibling <div>s: the first contains the date inside a hyperlink (a couple of different formats are used for the date), and the second contains the data table we want (mixed in with other tables and content in variable order). We identify the table we want by looking for the presence of key column names; the cases column name might be “Confirmed Cases” or “Total Confirmed Cases”. blocks is a list of HTML nodes representing the first <div> in each pair.
table_constraint <- paste("descendant::td//text()[contains(.,'Geographic Area')]",
                          "and",
                          "descendant::td//text()[contains(.,'Confirmed Cases')]")
blocks <- page %>%
  html_nodes(xpath=paste("//div[starts-with(@id,'elementor-tab-title')]",
                         "[following-sibling::div/table[", table_constraint, "]]"))
Extract the dates and tables. dates is a vector of Date objects (mdy() parses the extracted date strings); tables is a list of data frames.
dates <- blocks %>%
  html_node(xpath="a") %>%
  html_text %>%
  str_extract("\\w+ \\d{1,2}, \\d{4}") %>%
  mdy
tables <- blocks %>%
  html_node(xpath=paste("following-sibling::div/table[", table_constraint, "]")) %>%
  html_table(header=TRUE)
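Before combining these, a couple of defensive checks (additions for illustration, not part of the original pipeline) can confirm that every block yielded both a parseable date and a table.
# Hypothetical consistency checks: one date per table, and every date string parsed
stopifnot(length(dates) == length(tables))
stopifnot(!any(is.na(dates)))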
We need to join the dates to the data frames as an additional column, concatenate the data frames to form a single data frame, and perform other cleanups. This is most easily done all at once.
form_df <- function (table, date) {
  # Combine a single table (data frame) and a single date
  if ("Total Confirmed Cases" %in% colnames(table)) {
    colname <- "Total Confirmed Cases"
  } else {
    colname <- "Confirmed Cases"
  }
  table %>%
    select(all_of(c("Geographic Area", colname))) %>%
    rename(area="Geographic Area") %>%
    rename(cases=all_of(colname)) %>%
    mutate(cases=as.character(cases)) %>%                   # ensure type is chr for next line to work
    mutate(cases=if_else(cases == emdash, "0", cases)) %>%  # type must be chr here
    mutate(cases=as.integer(cases)) %>%
    mutate(date=date) %>%                                   # add a date column
    filter(!grepl("Total|Pending", area))                   # filter out non-areas
}
table <- map2_df(tables, dates, ~ form_df(.x, .y))
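At this point table should contain one row per geographic area per reporting day, with area (chr), cases (int), and date (Date) columns; a quick inspection (added here purely for illustration) confirms the shape.
# Illustrative look at the combined data frame (not part of the pipeline proper)
glimpse(table)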
A couple more cleanups are required. First, there are several duplicate dates; for each of these we keep only the data from the first occurrence of the date (i.e., what appears to be the most recently added data). Second, the status page was modified at some point to go back only as far as June 1, so we append a cache of historical data we obtained previously to form a single table going back to April 27.
table <- table %>% distinct(area, date, .keep_all=TRUE)
table <- rbind(table, read_csv("data/historical.csv", col_types="ciD"))
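Because the historical cache could conceivably overlap the scraped date range, a final guard (again an addition, not in the original code) can verify that each area/date pair still appears exactly once.
# Hypothetical guard against overlap between scraped and cached historical rows
stopifnot(nrow(table) == nrow(distinct(table, area, date)))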
Write the data out.
write_csv(table, "data/data.csv")