Pulling YouTube Transcripts

PUBLISHED ON MAY 15, 2020

I’ve been a fan of the Your Mom’s House Podcast for a long time now, and I thought it would be interesting to do some analysis of their speech patterns. If you follow the show at all, you know that the conversations are…special (you can check here for a visualization I did of their word usage over time if you’re so inclined). Fortunately, it’s possible to get transcripts of YouTube videos. Getting transcripts for a single video using the {youtubecaption} R package is fairly straightforward; getting transcripts for a full playlist is a touch more involved, so I wanted to create a quick walkthrough illustrating my process for doing this. Hopefully this will help others who might want to analyze text data from YouTube.

Setup

First, let’s load the packages we need to pull our data. I’m going to use the following:

  • {tidyverse} for data wrangling
  • {youtubecaption} for fetching transcripts from YouTube
  • {janitor} pretty much just for the clean_names() function
  • {lubridate} to work with the published_date variable that’s part of the YT video data. (This is optional if you don’t plan to use that variable.)

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(youtubecaption)
library(janitor)
library(lubridate)

Getting Transcripts for a Single Video

Like I mentioned previously, getting transcripts for a single video is pretty easy thanks to the {youtubecaption} package. All we need is the URL for the video and the get_caption() function can go do its magic. I’ll illustrate that here using the most recent YMH podcast full episode.

ymh_new <- get_caption("https://www.youtube.com/watch?v=VMloBlnczzI")

glimpse(ymh_new)
## Rows: 3,157
## Columns: 5
## $ segment_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ text       <chr> "this episode of your mom's house is", "brought to you b...
## $ start      <dbl> 0.000, 1.140, 3.659, 7.859, 8.910, 14.820, 20.789, 27.19...
## $ duration   <dbl> 3.659, 6.719, 5.251, 6.961, 11.879, 9.080, 3.111, 3.029,...
## $ vid        <chr> "VMloBlnczzI", "VMloBlnczzI", "VMloBlnczzI", "VMloBlnczz...

We can see above that this gives us a tibble with the text (auto-transcribed by YouTube) broken apart into short segments and corresponding identifying information for each text segment.

One thing worth mentioning here is that the transcripts are automatically transcribed by a speech-to-text model. It seems really good, but it will make some mistakes, particularly around brand names and website addresses (in my limited experience).
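
If you’d rather work with one block of text per video than with short segments, the segments can be pasted back together. Here’s a minimal sketch on a made-up tibble shaped like the get_caption() output; toy_caption and its values (including the video ID) are assumptions for illustration, not real transcript data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for a get_caption() result; all values are made up
toy_caption <- tibble(
  segment_id = 1:3,
  text       = c("this episode of your mom's house is",
                 "brought to you by",
                 "our sponsors"),
  start      = c(0.00, 1.14, 3.66),
  duration   = c(3.66, 6.72, 5.25),
  vid        = "abc123"
)

# One row per video: paste the segments together in their original order
full_text <- toy_caption %>%
  arrange(vid, segment_id) %>%
  group_by(vid) %>%
  summarise(
    n_segments = n(),
    transcript = paste(text, collapse = " "),
    .groups    = "drop"
  )

full_text
```

Keeping the segment-level tibble around is still useful if you care about timing, since the start and duration columns are lost once the text is collapsed.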

Getting Transcripts for Several Videos

But what if we want to get transcripts for several videos? The get_caption() function requires the URL of each video that we want to get a caption for. If you want to analyze transcripts from more than a handful of videos, it would get really tedious really quickly to go and grab the individual URLs. And, more specifically, what if you wanted to get the transcripts for all videos from a single playlist?

Get URLs

I found this tool that will take a YouTube playlist ID and provide an Excel file with, among other information, the URL for each video in the playlist, which is exactly what we need for the get_caption() function.

I used the tool on 5/14/20 to get a file with the data for all of the videos in the YMH Podcast - Full Episodes playlist. I’ll go ahead and load the file, plus do some light cleaning, in the code below.

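ep_links <- read_csv("~/Data/YMH/Data/ymh_ep_links.csv") %>%
  clean_names() %>%
  # ep_num: three-digit episode number pulled from the title
  mutate(ep_num = str_replace_all(title, ".*Ep.*(\\d{3}).*", "\\1") %>%
           as.double(),
         # recode any episode number that parses as 19 to 532
         ep_num = if_else(ep_num == 19, 532, ep_num),
         # parse the month/day/year hour:minute strings into date-times
         published_date = mdy_hm(published_date),
         # vid: everything after the last "=" in the URL (the video ID)
         vid = str_replace_all(video_url, ".*=(.*)$", "\\1"))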
## Parsed with column specification:
## cols(
##   `Published Date` = col_character(),
##   `Video URL` = col_character(),
##   Channel = col_character(),
##   Title = col_character(),
##   Description = col_character()
## )
## Warning in function_list[[k]](value): NAs introduced by coercion

glimpse(ep_links)
## Rows: 225
## Columns: 7
## $ published_date <dttm> 2020-05-13 12:00:00, 2020-05-06 12:00:00, 2020-04-2...
## $ video_url      <chr> "https://www.youtube.com/watch?v=VMloBlnczzI", "http...
## $ channel        <chr> "YourMomsHousePodcast", "YourMomsHousePodcast", "You...
## $ title          <chr> "Your Mom's House Podcast - Ep. 551", "Your Mom's Ho...
## $ description    <chr> "Last week, 10mg Tom made his debut. This week, we h...
## $ ep_num         <dbl> 551, 550, 549, 548, 547, 546, 545, 544, 543, NA, 542...
## $ vid            <chr> "VMloBlnczzI", "JGNn-C_dxuY", "xw3KNj2ywVo", "_BVQvq...

We can see that this gives us the URLs for all 225 episodes in the playlist.

The cleaning steps for the published_date variable and the vid variable should be pretty universal. The step to get the episode number extracts it from the title of the video, so it’s specific to the playlist I’m using. Titles that don’t match the pattern end up as NA in ep_num, which is where the coercion warning above comes from.
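
To see what the two regex steps do in isolation, here’s a small sketch on made-up inputs. The second title is hypothetical, and the episode-number step swaps in str_match() from {stringr} instead of the str_replace_all() call used above: it returns NA directly when there’s no match, so as.double() has nothing to warn about. Note that it grabs the first three digits after “Ep”, which may not agree with the original pattern on every title.

```r
library(stringr)

# Made-up inputs: one title with an episode number, one without (hypothetical)
titles <- c("Your Mom's House Podcast - Ep. 551", "YMH Live special")
urls   <- c("https://www.youtube.com/watch?v=VMloBlnczzI",
            "https://www.youtube.com/watch?v=JGNn-C_dxuY")

# Column 2 of the str_match() matrix holds the first capture group;
# rows with no match come back as NA
ep_num <- as.double(str_match(titles, "Ep\\.?\\s*(\\d{3})")[, 2])

# Same video-ID extraction as in the post: keep everything after the last "="
vid <- str_replace_all(urls, ".*=(.*)$", "\\1")

ep_num
vid
```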

“Safely” Pull Transcripts

Now that we have all of the URLs, we can iterate through them with get_caption(). Before we do that, though, we want to make get_caption() robust to failure. Basically, we don’t want the whole series of iterations to fail if one call returns an error. In other words, we want the function to get all of the transcripts it can and let us know which ones it can’t, rather than failing outright.

To do this, we just wrap the get_caption() function in the safely() function from {purrr}.

safe_cap <- safely(get_caption)

You can read more about safely() in the {purrr} documentation, but it basically returns, for each call, a two-element list: one element with the “result” of the function and another with the “error.” If the function succeeds, “error” will be NULL and “result” will have the result of the function. If the function fails, “result” will be NULL and “error” will hold the error object, including its message.
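
Here’s a minimal sketch of that behavior using a toy function rather than get_caption(); root_positive() is a hypothetical helper made up for this example.

```r
library(purrr)

# Hypothetical function that errors on non-positive input
root_positive <- function(x) {
  if (x <= 0) stop("x must be positive")
  sqrt(x)
}

safe_root <- safely(root_positive)

ok  <- safe_root(16)   # succeeds
bad <- safe_root(-1)   # fails, but does not stop the script

ok$result                    # the function's return value
ok$error                     # NULL
bad$result                   # NULL
conditionMessage(bad$error)  # the error message as a string
```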

Now that we have our safe_cap() function, we can use map() from {purrr} to pull transcripts from all of the videos we have URLs for.

ymh_trans <- map(ep_links$video_url,
                 safe_cap)

glimpse(ymh_trans)
## List of 225
##  $ :List of 2
##   ..$ result: tibble [3,157 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,966 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,663 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,093 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,727 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,701 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,276 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,382 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,340 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: NULL
##   ..$ error :List of 3
##   .. ..$ message : chr "TranscriptsDisabled: \nCould not retrieve a transcript for the video https://www.youtube.com/watch?v=9vdtq_JcSg"| __truncated__
##   .. ..$ call    : language py_call_impl(callable, dots$args, dots$keywords)
##   .. ..$ cppstack: NULL
##   .. ..- attr(*, "class")= chr [1:4] "Rcpp::exception" "C++Error" "error" "condition"
##  $ :List of 2
##   ..$ result: tibble [3,355 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,510 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,127 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,684 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,200 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,158 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,812 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,918 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [1,960 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,268 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,960 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,034 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,986 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,846 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [1,783 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,765 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,408 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,949 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,028 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,162 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,589 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,935 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,280 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,637 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,440 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,756 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,226 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,958 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,063 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,323 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,297 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,922 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,339 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,118 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,387 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,629 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,279 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,652 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,728 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,101 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,145 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,197 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,473 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,073 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [5,351 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,649 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,868 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,213 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,180 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [5,320 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,196 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,557 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [5,006 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,154 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,959 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,873 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,422 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,698 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,810 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: NULL
##   ..$ error :List of 3
##   .. ..$ message : chr "TranscriptsDisabled: \nCould not retrieve a transcript for the video https://www.youtube.com/watch?v=yXtxWpteWU"| __truncated__
##   .. ..$ call    : language py_call_impl(callable, dots$args, dots$keywords)
##   .. ..$ cppstack: NULL
##   .. ..- attr(*, "class")= chr [1:4] "Rcpp::exception" "C++Error" "error" "condition"
##  $ :List of 2
##   ..$ result: tibble [3,623 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,907 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,035 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,054 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,906 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,227 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,314 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,188 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [1,579 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,417 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [4,062 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,159 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,661 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,580 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,415 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,654 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,987 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,519 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,041 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [1,965 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,563 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,832 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,795 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,583 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,698 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,170 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,342 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [2,657 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: tibble [3,039 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ error : NULL
##   [list output truncated]
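
Scrolling through output like this to spot failures gets old fast, so it can help to count them instead. Here’s a rough sketch on a toy list built in the same shape safely() produces; toy_trans and toy_urls are made-up stand-ins for ymh_trans and ep_links$video_url.

```r
library(purrr)

# Toy stand-in for the list of safely() outputs: two successes, one failure
toy_trans <- list(
  list(result = data.frame(vid = "a", text = "hello"), error = NULL),
  list(result = NULL, error = simpleError("TranscriptsDisabled")),
  list(result = data.frame(vid = "c", text = "world"), error = NULL)
)
toy_urls <- c("url_a", "url_b", "url_c")

# TRUE wherever the call failed (no result came back)
failed <- map_lgl(toy_trans, ~ is.null(.x$result))

sum(failed)        # how many calls failed
toy_urls[failed]   # which URLs they were

# The error message for each failed call
error_msgs <- map_chr(toy_trans[failed], ~ conditionMessage(.x$error))
error_msgs
```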

Format Data

The map() call above returns a list the same length as our vector of URLs (225 in this case), with each element in the two-part format just described. We want to get the “result” element from each of these lists. (You might also be interested in looking at the errors, but they’re all going to be the same here: a transcript isn’t available for that video.) To do that, we iterate over all elements of our transcript list (using map() again) and use the pluck() function from {purrr} to get the result object. We then use the compact() function to get rid of any NULL elements in this list (remember that the “result” element will be NULL if the function couldn’t get a transcript for the video). This gives us a list of the transcripts that were successfully fetched.
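
The pluck-then-compact step is easy to try on a toy list first. This is just a sketch; toy_trans is a made-up list in the same shape as the safely() output, not real transcript data.

```r
library(purrr)

# Toy list shaped like safely() output: two successes, one failure
toy_trans <- list(
  list(result = data.frame(vid = "a", text = "hello"), error = NULL),
  list(result = NULL, error = simpleError("TranscriptsDisabled")),
  list(result = data.frame(vid = "c", text = "world"), error = NULL)
)

# Pull the "result" element out of each entry; failures come back as NULL
results <- map(toy_trans, ~ pluck(.x, "result"))
length(results)

# compact() drops the NULL entries, leaving only the successful results
kept <- compact(results)
length(kept)
```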

Next, we use the bind_rows() function to take this list and turn it into a tibble. And finally, we can inner_join() this with our tibble that had the URLs so that metadata for each video and transcripts are in the same tibble.

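res <- map(seq_along(ymh_trans),
           ~pluck(ymh_trans, ., "result")) %>%  # grab each "result" element
  compact() %>%                                 # drop the NULLs (failed calls)
  bind_rows() %>%                               # stack into one tibble
  inner_join(x = ep_links,                      # attach video metadata by video ID
             y = .,
             by = "vid")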

glimpse(res)
## Rows: 445,429
## Columns: 11
## $ published_date <dttm> 2020-05-13 12:00:00, 2020-05-13 12:00:00, 2020-05-1...
## $ video_url      <chr> "https://www.youtube.com/watch?v=VMloBlnczzI", "http...
## $ channel        <chr> "YourMomsHousePodcast", "YourMomsHousePodcast", "You...
## $ title          <chr> "Your Mom's House Podcast - Ep. 551", "Your Mom's Ho...
## $ description    <chr> "Last week, 10mg Tom made his debut. This week, we h...
## $ ep_num         <dbl> 551, 551, 551, 551, 551, 551, 551, 551, 551, 551, 55...
## $ vid            <chr> "VMloBlnczzI", "VMloBlnczzI", "VMloBlnczzI", "VMloBl...
## $ segment_id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ text           <chr> "this episode of your mom's house is", "brought to y...
## $ start          <dbl> 0.000, 1.140, 3.659, 7.859, 8.910, 14.820, 20.789, 2...
## $ duration       <dbl> 3.659, 6.719, 5.251, 6.961, 11.879, 9.080, 3.111, 3....
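
One thing to keep in mind is that inner_join() silently drops any video that has no transcript rows. If you want a list of what got dropped, anti_join() from {dplyr} is one way to get it. Here’s a minimal sketch on made-up tibbles; toy_links and toy_segments are stand-ins for ep_links and the bound transcripts.

```r
library(dplyr)
library(tibble)

# Toy stand-ins: three videos, but transcripts for only two of them
toy_links <- tibble(vid = c("a", "b", "c"), ep_num = c(3, 2, 1))
toy_segments <- tibble(
  vid        = c("a", "a", "c"),
  segment_id = c(1L, 2L, 1L),
  text       = c("hi", "there", "hello")
)

# inner_join() keeps only videos present in both tibbles
joined <- inner_join(toy_links, toy_segments, by = "vid")

# anti_join() returns the videos that have no transcript rows at all
no_transcript <- anti_join(toy_links, toy_segments, by = "vid")

joined
no_transcript
```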

Hopefully this helps folks & best of luck with your text analyses!