Intro to {rvest}

PUBLISHED ON JUL 24, 2020

TL;DR: {rvest} is awesome. Before yesterday, I had zero experience with web scraping and very little experience with HTML/CSS in general, and in a couple of hours, I managed to submit a form and scrape the resulting tables. It would have taken even less time if I weren’t a dummy and had remembered that NULL is a thing…more on that later.

Motivation

Yesterday, for work, I needed to get a list of all of the Family Day Homes (FDH) in Virginia to check against another dataset. For those not in the early childhood game, FDHs are childcare programs run out of the provider’s home. During the COVID-19 pandemic, there’s been even more attention on them because they typically have smaller group sizes than schools or childcare centers, and so they may be a more durable option for many families. But that’s not really the point of this post.

Anyway, I needed to get this list of FDHs. Normally, I’d reach out to someone at the Department of Social Services (DSS), which is responsible for licensing childcare in Virginia, to ask for this. But I needed it last night, outside of normal working hours. I knew their website has a search function, and so I decided to take a stab at scraping the webpage to get the info I needed. Since it worked out pretty well, I also figured I’d write a short blog post about it in case it can help others navigate web scraping for the first time.

Disclaimer

I am not a web developer. I know a handful of the more common HTML tags, and I know enough CSS to make Rmd reports look a little prettier, but that’s basically it. I’m not going to pretend I understand all of what {rvest} does under the hood, but I also think that’s kinda the point of the package. With that out of the way, onward and upward!

Setup

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

library(tidyverse)
library(rvest)

First, let’s go visit the web page itself. This page is where families can search for child care centers in the DSS database. Here’s the relevant part of the page, including the form we’ll be using to search:

I’m going to capture this using the read_html() function from {rvest}.

dss <- read_html("https://www.dss.virginia.gov/facility/search/cc2.cgi")
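If you’re curious what read_html() hands back, it’s a parsed document from {xml2} (the package {rvest} builds on), and that’s what all of the later functions operate on:

class(dss)
## [1] "xml_document" "xml_node"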

Next, I need to isolate this form in the page. I’m going to do this with a combination of the html_nodes() and html_form() functions, plus extract2() from {magrittr}.

dss_form <- dss %>% 
  html_nodes("form") %>% 
  magrittr::extract2(2) %>% 
  html_form()

Let’s walk through this. First, we pipe the dss object into html_nodes(), which (in this case) extracts all of the “form” elements from the page. Note that I’m using html_nodes() rather than html_node() here: the form I want is actually the 2nd one on the page, so I grab both of them and then pull out the second using magrittr::extract2(). Next, I pass that into html_form(), which does some voodoo to tell R that this is a form.
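As an aside, the %>% pipe’s . placeholder can do the same subsetting, so an equivalent way to write this without reaching for extract2() by name would be:

dss_form <- dss %>% 
  html_nodes("form") %>% 
  .[[2]] %>%   # same idea as magrittr::extract2(2)
  html_form()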

When we take a peek at the form, here’s what we see:

dss_form
## <form> '<unnamed>' (POST /facility/search/cc2.cgi)
##   <input hidden> 'rm': Search
##   <input text> 'search_keywords_name': 
##   <select> 'search_exact_fips' [0/135]
##   <input text> 'search_contains_zip': 
##   <input checkbox> 'search_modifiers_mod_cde': 1
##   <input checkbox> 'search_quality_rating_1': 1
##   <input checkbox> 'search_quality_rating_2': 2
##   <input checkbox> 'search_quality_rating_3': 3
##   <input checkbox> 'search_quality_rating_4': 4
##   <input checkbox> 'search_quality_rating_5': 5
##   <input checkbox> 'search_quality_rating_all': 1,2,3,4,5
##   <input checkbox> 'search_require_client_code-2101': 1
##   <input checkbox> 'search_require_client_code-2102': 1
##   <input checkbox> 'search_require_client_code-2106': 1
##   <input checkbox> 'search_require_client_code-2105': 1
##   <input checkbox> 'search_require_client_code-2201': 1
##   <input checkbox> 'search_require_client_code-2104': 1
##   <input checkbox> 'search_require_client_code-3001': 1
##   <input checkbox> 'search_require_client_code-3002': 1
##   <input checkbox> 'search_require_client_code-3003': 1
##   <input checkbox> 'search_require_client_code-3004': 1
##   <input submit> '': Search

If you look back up at the screenshot of the page (or if you visit the actual page), you’ll notice that the input elements here correspond to the things we can search by. Well, you might not notice right away, since “search_require_client_code-2102” doesn’t exactly scream “Family Day Home checkbox,” but that’s what it is.

What I did at this point was use the Inspector tool in Firefox to figure out which of these elements I wanted to select. This took more time (and more submissions that returned the wrong values) than I care to admit. It turns out that the relevant elements for selecting FDHs are 2102, 2201, and 3002.
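If you’d rather not dig through the browser inspector, you can also list the field names straight from the parsed form in R. A quick sketch, assuming the form object stores its inputs in dss_form$fields (which is how {rvest} represents forms as of this writing):

# each field in the form carries its own name
purrr::map_chr(dss_form$fields, "name")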

Cool, so now we need to set the values of the elements in this form. I’m not sure whether checkboxes default to checked in every form, but these do (which you can see from the value of 1 for each of them). This was not intuitive to me. Even less intuitive was how to uncheck them: it turns out the way to do it is to set the value to NULL. For whatever reason, that wasn’t obvious to me either (maybe because it was late at night, who knows). I tried 0s, empty strings, and lots of other stuff before NULL saved me.

Regardless, you can set the values in the form using the set_values() function. The code below will set the values of everything to NULL except for the checkboxes that correspond to family day homes. The object this returns is an updated form.

fdh_val <- set_values(dss_form,
                      `search_modifiers_mod_cde` = NULL,
                      `search_quality_rating_1` = NULL,
                      `search_quality_rating_2` = NULL,
                      `search_quality_rating_3` = NULL,
                      `search_quality_rating_4` = NULL,
                      `search_quality_rating_5` = NULL,
                      `search_quality_rating_all` = NULL,
                      `search_require_client_code-2101` = NULL,
                      `search_require_client_code-2102` = 1,
                      `search_require_client_code-2106` = NULL,
                      `search_require_client_code-2105` = NULL,
                      `search_require_client_code-2201` = 1,
                      `search_require_client_code-2104` = NULL,
                      `search_require_client_code-3001` = NULL,
                      `search_require_client_code-3002` = 1,
                      `search_require_client_code-3003` = NULL,
                      `search_require_client_code-3004` = NULL)
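As an aside, typing all of those NULLs out by hand gets repetitive. Here’s a sketch of a more programmatic version that builds a named list of NULLs and splices everything into set_values() via do.call(). It should yield the same updated form, though honestly the explicit version above is easier to read:

# names of the checkboxes we want cleared
uncheck <- c("search_modifiers_mod_cde",
             paste0("search_quality_rating_", c(1:5, "all")),
             paste0("search_require_client_code-",
                    c(2101, 2104, 2105, 2106, 3001, 3003, 3004)))

# a named list of NULLs (to uncheck), plus the three FDH boxes left checked
form_vals <- c(setNames(vector("list", length(uncheck)), uncheck),
               list(`search_require_client_code-2102` = 1,
                    `search_require_client_code-2201` = 1,
                    `search_require_client_code-3002` = 1))

fdh_val <- do.call(set_values, c(list(dss_form), form_vals))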

Next, we need a way to submit the form, which apparently requires an html session. My (rough) understanding is that a session mimics what a browser does: it holds onto state like cookies across requests until it’s closed. That’s pretty much all I know about them, but apparently we need one to submit a form, so here we go.

And once the session is started, I’m going to submit the form with submit_form(), using the updated values created just above, and save the output in subbed. The submit_form() function gives you the option to specify which submit button to use when a page has more than one, but that’s not an issue here.

dss_session <- html_session("https://www.dss.virginia.gov/facility/search/cc2.cgi")

subbed <- submit_form(dss_session, fdh_val)
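If you want a sanity check that the POST actually went through, I believe the session object keeps the underlying {httr} response around, so something like this should return 200 on success:

# assumes the session stores the httr response in $response
httr::status_code(subbed$response)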

After submitting the form, the results come back in some new tables, so the next step is to extract those from the subbed session object. I’m using the html_nodes() function again, this time with the “table” selector, to pull all of the tables from the session.

tbls <- subbed %>% html_nodes("table")

length(tbls)
## [1] 4
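To figure out which of these holds the data, you can peek at any one of them by converting it and looking at the first few rows:

tbls[[3]] %>% html_table() %>% head()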

There are 4 tables. I looked at each of these (peeking as above, and with View()) and figured out that the ones I want are the 3rd and 4th, which correspond to licensed and unlicensed FDHs, respectively. To get them out, I’m going to use the html_table() function on each, which creates a tibble from the html table.

fdh_tbl <- bind_rows(tbls[[3]] %>% html_table(),
                     tbls[[4]] %>% html_table())

glimpse(fdh_tbl)
## Rows: 2,093
## Columns: 3
## $ Test...1 <chr> "Abida Mufti\n\t\t\tVirginia Quality Level:", "Abida Munir...
## $ Test...2 <chr> "2201 Hunter Mill Road\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tVIEN...
## $ Test...3 <chr> "(703) 281-7860", "(703) 360-6783", "(703) 435-4591", "(70...

This gives us a three-column table. The first column has the name of the FDH (which is typically just the name of the person who owns it) as well as an indicator of the program’s Virginia Quality level, which is our quality rating system for childcare. The second column has the address, and the third has the phone number. With a little bit of cleaning and column naming, we can get a nicer tibble to work with.

names(fdh_tbl) <- c("name", "address", "phone")

fdh_tbl_clean <- fdh_tbl %>%
  mutate(
    # drop everything from the first newline on (the Virginia Quality text)
    name = str_remove_all(name, "\\\n.*$"),
    # swap the first newline for a space, strip the remaining newlines and
    # tabs, and title-case the result
    address = str_replace(address, "\\\n", " ") %>%
      str_remove_all("\\\n|\\\t") %>%
      str_to_title()
  )

glimpse(fdh_tbl_clean)
## Rows: 2,093
## Columns: 3
## $ name    <chr> "Abida Mufti", "Abida Munir", "Ada Lazo", "Adeela Saleem", ...
## $ address <chr> "2201 Hunter Mill Road Vienna, Va  22181", "1902 Sherwood H...
## $ phone   <chr> "(703) 281-7860", "(703) 360-6783", "(703) 435-4591", "(703...

Et voilà!

This got me to where I needed to be for my work project, but a natural next step would be to geocode the addresses and then make a map. I’ll probably do that as part of a future post, but I’ll leave off here for now. Hopefully this helps people with their first venture into web scraping! I know I’m excited to do more with it now that I’ve dipped my toes in.