Scraping SBCPHD

The code here scrapes the Santa Barbara County Public Health Department's COVID-19 status page for case counts. It's hard to imagine how they could have made the data harder to extract, given that what is counted, how it is counted, and how counts are presented have all changed multiple times since the pandemic began. Our goal is to extract cumulative total confirmed cases (only), broken down by date and by geographic area within the county. Without doing significantly more work, this is possible only as far back as April 27, 2020.

The data frequency is daily, except for weekends and holidays.

In [1]:
import re

import lxml.html
import pandas as pd

# Unicode characters that appear in the page's column names and values
nbsp = chr(0x00A0)    # non-breaking space
emdash = chr(0x2014)  # em dash (appears in place of a count and is treated as zero)

url = "https://publichealthsbc.org/status-reports/"

The page can be loaded live, but here we assume that it has already been downloaded, e.g. with wget. Non-breaking spaces appear in a few column names and values, making them awkward to work with, so we simply strip them out of the raw HTML before parsing.

In [2]:
raw_page = open("data/cache.html").read().replace(nbsp, "")
page = lxml.html.fromstring(raw_page)
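
Should you want to fetch the page live instead, something along these lines would work. This is just a sketch using requests, not part of the original workflow, and the timeout value is arbitrary.

In [ ]:
import requests

# Hypothetical live fetch; the rest of the notebook assumes a cached copy.
resp = requests.get(url, timeout=30)
resp.raise_for_status()
raw_page = resp.text.replace(nbsp, "")
page = lxml.html.fromstring(raw_page)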

Each day's data is contained in a pair of adjacent sibling <div>s. The first contains the date inside a hyperlink (a couple of different formats are used for the date); the second contains the data table we want, mixed in with other tables and content in variable order. We identify the table we want by looking for key column names: the cases column may be named either "Confirmed Cases" or "Total Confirmed Cases".

blocks is a list of HTML elements representing the first <div> in each pair.

In [3]:
# The table we want has both a "Geographic Area" column and a cases
# column whose name contains "Confirmed Cases".
table_constraint = (
    "descendant::td//text()[contains(.,'Geographic Area')]"
    " and "
    "descendant::td//text()[contains(.,'Confirmed Cases')]"
)

# Date <div>s: id starts with 'elementor-tab-title' and a following
# sibling <div> contains a matching table.
blocks = page.xpath(
    "//div[starts-with(@id,'elementor-tab-title')]"
    "[following-sibling::div/table[" + table_constraint + "]]"
)
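
Because the page layout has changed repeatedly, it can be worth confirming that the XPath actually matched something before going further. A minimal sanity check (not in the original notebook) might look like this:

In [ ]:
# How many daily blocks did the XPath find, and what does the first
# date link look like?
print(len(blocks))
if blocks:
    print(blocks[0].find("a").text)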

Extract the dates and tables. dates is a list of pandas Timestamps; tables is a list of pandas DataFrames.

In [4]:
dates = [
    # the link text begins with a date like "April 27, 2020"
    pd.to_datetime(re.match(r"\w+ \d{1,2}, \d{4}", b.find("a").text).group(0))
    for b in blocks
]

tables = [
    # re-serialize the matching table and let pandas parse it;
    # encoding="unicode" makes tostring() return a str rather than bytes
    pd.read_html(
        lxml.html.tostring(
            b.xpath("following-sibling::div/table[" + table_constraint + "]")[0],
            encoding="unicode"
        ),
        header=0
    )[0]
    for b in blocks
]
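
A quick peek at the parsed output helps confirm that the two lists line up; this inspection step is not part of the original notebook.

In [ ]:
# The first scraped date and the first few rows of its table.
print(dates[0])
tables[0].head()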

We need to join the dates to the dataframes as an additional column, concatenate the dataframes to form a single dataframe, and perform other cleanups. This is most easily done all at once.

In [5]:
def form_df(table, date):
    # combine a single table (dataframe) and a single date
    if "Total Confirmed Cases" in table.columns:
        colname = "Total Confirmed Cases"
    else:
        colname = "Confirmed Cases"
    table = table[["Geographic Area", colname]].rename(
        columns={"Geographic Area": "area", colname: "cases"}
    )
    table["cases"] = table["cases"].replace(emdash, "0").astype(int)
    table["date"] = date  # add a date column
    table = table[~table.area.str.contains(r"Total|Pending")]  # filter out non-areas
    return table

table = pd.concat(form_df(t, d) for t, d in zip(tables, dates))
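
To make the cleanup concrete, here is a tiny made-up table run through form_df. The area names and counts are invented purely for illustration; the point is that the columns are renamed, the em dash becomes 0, a date column is added, and the "Total" row is dropped.

In [ ]:
# Illustration only: a miniature stand-in for one day's table.
demo = pd.DataFrame({
    "Geographic Area": ["Area A", "Area B", "Total"],
    "Total Confirmed Cases": ["12", emdash, "24"],
})
form_df(demo, pd.Timestamp("2020-06-01"))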

A couple more cleanups are required. First, there are several duplicate dates; for each of these we keep only the data from the first occurrence (which appears to be the most recently added). Second, the status page was modified at some point to go back only as far as June 1, so we append a cache of historical data we obtained previously to form a single table going back to April 27.

In [6]:
table = pd.concat(
    [
        # drop_duplicates keeps the first occurrence of each (area, date) pair
        table.drop_duplicates(["area", "date"]),
        pd.read_csv("data/historical.csv", parse_dates=["date"])
    ]
)
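
If the scraped data and the historical cache happened to overlap, the combined table could still contain duplicate (area, date) pairs. A cheap sanity check (not in the original) would be:

In [ ]:
# There should be at most one row per (area, date) pair; if this fails,
# a further drop_duplicates on the combined table would be needed.
assert not table.duplicated(["area", "date"]).any()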

Write the data out.

In [7]:
table.to_csv("data/data.csv", index=False)
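
For a quick look at the final result, the long table can be pivoted so that each area becomes a column and each row is a date. This is only an inspection step, and it assumes the (area, date) pairs are unique (pivot raises an error otherwise).

In [ ]:
# Optional inspection: one column per area, one row per date.
wide = table.pivot(index="date", columns="area", values="cases").sort_index()
wide.tail()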