# Scrantonicity - Part 3

###### PUBLISHED ON AUG 29, 2020

TL;DR:: In this blog, I use LASSO logistic regression and multilevel logistic regression to predict the speaker of lines of dialogue from The Office.

What feels like forever ago, I wrote two blog posts analyzing transcripts from The Office. The first was a basic EDA of the dialogue, and the second used k-means clustering to determine types of Office episodes based on who speaks to whom. At the end of that second blog, I mentioned that I might do some predictive analysis with that data in the future. Well, it’s four months later, and I’m declaring that the future is now!

Basically, the goal here is going to be, for a given line of dialogue from the show, to predict whether it’s Michael talking or someone else. At first blush, this seems like it shouldn’t be too hard. Many of Michael’s lines are iconic (e.g. see the above gif), but I feel like this might be more a function of the delivery than the actual words themselves, and I’m curious to see how well a model (or multiple models) could predict this just from the text.

In doing this, there are a couple of things I’m interested in doing here:

• Generally getting more practice with {tidymodels}
• Learning to use the {textrecipes} package
• Trying the {glmmTMB} package (not part of the {tidymodels} ecosystem)

Also, before getting too much further, I want to acknowledge that I looked at this blog by Julia Silge and this blog by Emil Hvitfeldt for some background on {textrecipes}. Both are really great for people interested in text analysis.

Anyway, without much further ado, let’s get into it. As has been the case in all of my “Scrantonicity” posts, the data I’m using here comes from the {schrute} package. First, I’ll load in libraries and set some defaults/options. I’m also going to read in the data, limiting the dialogue to the first seven seasons of the show (the Michael Scott era).

## Setup

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) library(tidyverse) library(eemisc) library(tidymodels) library(schrute) library(textrecipes) library(themis) library(vip) library(glmmTMB) herm <- harrypotter::hp(n = 1, option = "HermioneGranger") opts <- options( ggplot2.discrete.fill = list( harrypotter::hp(n = 3, option = "HermioneGranger"), harrypotter::hp(n = 5, option = "Always") ) ) theme_set(theme_ee()) office <- schrute::theoffice %>% filter(as.numeric(season) <= 7) %>% mutate(is_mike = if_else(character == "Michael", "Yes", "No")) ## Brief EDA and Data Preprocessing Before modeling data, I would typically do a more thorough EDA. But I’ve already explored this data pretty closely (albeit months ago) in two previous blog posts, so rather than re-doing that EDA, I’m just going to look at those posts. One thing I will include here, though, is a quick look at the number of lines spoken by Michael Scott vs other characters, since this is the outcome I’m interested in predicting here. office %>% count(character) %>% top_n(10) %>% ggplot(aes(x = n, y = fct_reorder(character, n))) + geom_col(fill = herm) + labs( title = "Lines by Character", subtitle = "First seven seasons", y = NULL, x = "Number of Lines" ) So, Michael has far and away the most lines of any character. But it’ll also be useful to look at Michael vs all of the others lumped together (since this is what I’m actually predicting). office %>% count(is_mike) %>% ggplot(aes(x = n, y = fct_reorder(is_mike, n))) + geom_col(fill = herm) + labs( title = "Mike vs Not Mike", y = "Is Michael?", x = "Number of Lines" ) Even though Michael speaks more than any other given character, he speaks about a third as many lines as all of the other characters combined. This is relevant here because it means I’ll want to downsample when I train my model to ensure the number of observations in each class are similar, which will help the model fit. ### Data Splitting & Preprocessing Next, I’m going to split my data into a training a testing set. set.seed(0408) office_split <- initial_split(office, strata = is_mike) tr <- training(office_split) te <- testing(office_split) Now that I’ve split my data, I’m going to preprocess the data using {recipes}, {textrecipes}, and {themis} (to handle class imbalance). One thing to clarify here: I’m building a model to predict whether the speaker of a given line of dialogue is Michael. In this analysis, I want to build this model using only the text data, although there are plenty of other text-based features I could include. More specifically, I am going to handle the preprocessing such that the model I end up fitting is a bag-of-words model. This means that I want my data to include a variable for each word* (not really each word, but I’ll show later) in the transcript, each row to represent a line of dialogue, and the value in each cell to represent the tf-idf of that word. From this data structure, I can build a model where each word has an individual effect on the odds that the line is spoken by Michael, although note that this model will have no sense of word order. I’ll specify this recipe and then walk through each step afterward. office_recipe <- recipe(is_mike ~ text + episode_name, data = tr) %>% themis::step_downsample(is_mike) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_tokenfilter(text, max_tokens = 200) %>% step_tfidf(text) %>% prep() tr_prepped <- juice(office_recipe) tr_prepped_noep <- tr_prepped %>% select(-episode_name) te_prepped <- bake(office_recipe, te) te_prepped_noep <- te_prepped %>% select(-episode_name) Let’s unpack this step-by-step: • step_downsample() will balance the data so that the number of cases where Michael is the speaker is equal to the number of cases where Michael is not the speaker. This is done by randomly dropping rows. • step_tokenize() will take the text column in the data and create a isolate each word per line. • step_stopwords() will remove stop words (e.g. “the”, “it”, “a”) that likely won’t contain much useful information. • step_tokenfilter(), as I’m using it here, will retain only the 200 most frequently used words. This is a pretty large number, but I’m going to fit a LASSO regression later, which can select out some of these if necessary. • step_tfidf() calculates the term frequency-inverse document frequency, which provides a metric for how important a word is to a given document (e.g. a line in this case). Another thing to note here is that I’m creating two versions of this preprocessed data for the training and test sets. The differences between “tr_prepped” and “tr_prepped_noep” (as well as their “te” counterparts) is that the “noep” versions do not have a variable identifying which line the episode came from (but are otherwise identical). This is because I don’t want to include the episode identifier in my single-level LASSO model but do want to include it in the multilevel model. I could also accomplish this by specifying the formula and having it not include the episode_number variable rather than creating two datasets. Moving along! Next, I’m going to specify my model. Since I have a binary outcomes (yes/no if the speaker is Michael), I’m going to run a logistic regression. I’m going to run this as a LASSO model, which will provide some feature selection and generally shrink coefficients. I’m going to tune the model to choose the best amount of penalty as well. reg_spec <- logistic_reg(mixture = 1, penalty = tune()) %>% set_engine("glmnet") reg_spec ## Logistic Regression Model Specification (classification) ## ## Main Arguments: ## penalty = tune() ## mixture = 1 ## ## Computational engine: glmnet Here, I’m creating some resamples of my training data to help with the tuning. I’m creating 10 bootstrap samples here. set.seed(0408) booties <- bootstraps(tr_prepped_noep, strata = is_mike, times = 10) ## LASSO Model Fitting & Examination Now it’s time to fit the LASSO model. I’m going to add the logistic regression specification that I just created to a workflow. Along with that model specification, I’m also going to add a formula where is_mike is regressed on all of the word features I just created. Then, I’m going to tune the model across 10 candidate values of the penalty parameter (i.e. how much regularization I’m adding). office_wf <- workflow() %>% add_model(reg_spec) %>% add_formula(is_mike ~ .) set.seed(0408) logreg_fit <- tune_grid( office_wf, resamples = booties, grid = 10 ) Great. Now that the models have been fit with various penalty values across the bootstrap resamples, I can check to see what the best penalty value is to move forward with & finalize a model. I’m going to choose the best by one standard error (which, in this case, happens also to be the best model). The one standard error rule will let me choose the most parsimonious model (in this case, the one with the most penalty) that is within one standard error of the best model. And once I choose the best penalty value, I’ll go ahead and finalize the model and refit on the training set. logreg_fit %>% show_best("accuracy") ## # A tibble: 5 x 6 ## penalty .metric .estimator mean n std_err ## <dbl> <chr> <chr> <dbl> <int> <dbl> ## 1 2.27e- 3 accuracy binary 0.579 10 0.00162 ## 2 1.02e-10 accuracy binary 0.578 10 0.00147 ## 3 1.27e- 9 accuracy binary 0.578 10 0.00147 ## 4 7.94e- 8 accuracy binary 0.578 10 0.00147 ## 5 4.46e- 7 accuracy binary 0.578 10 0.00147 best_params <- logreg_fit %>% select_by_one_std_err(metric = "accuracy", desc(penalty)) final_logreg <- office_wf %>% finalize_workflow(best_params) %>% fit(data = tr_prepped_noep) So, the best model here has an accuracy of ~58%. Not great, but better than just straight-up guessing. Remember that this is on the training set. Now, I’ll take a look at what the accuracy is on the test set. bind_cols( predict(final_logreg, te_prepped_noep), te_prepped_noep ) %>% accuracy(is_mike, .pred_class) ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.610 61% – not bad! It’s actually better than the training set accuracy, which means our training process didn’t lead to overfitting, which is great. Now, I’m going to take a look at which words are the most important to predicting whether the speaker of a line of dialogue is Michael or not. final_logreg %>% pull_workflow_fit() %>% vi() %>% slice_max(order_by = abs(Importance), n = 10) %>% ggplot(aes(x = abs(Importance), y = fct_reorder(Variable %>% str_remove("tfidf_text_"), abs(Importance)), fill = Sign)) + geom_col() + labs( title = "Most Important Words Identifying Michael Scott", subtitle = "Positive values more representative of MS, negative values more representative of others", y = NULL ) Not surprisingly, the word “Michael” is the strongest predictor, and has a negative effect – if a line has the word “Michael” in it, it is less likely to be spoken by Michael. Intuitively, this makes sense. Other people use Michael’s name when speaking to or about him. The rest of the effects in this chart make sense to me as well (except for “mifflin” and “dunder,” which I don’t really get). But Michael is certainly more likely to talk about Jan and David than are other characters, and “everybody” feels right to me as well… And the final thing I’m going to do with this logistic regression is to pull out names of the non-zero coefficients. Recall that the lasso penalty can (but doesn’t always) shrink coefficients to zero. These variables will have no effect on the outcome. The reason I’m doing this is because I want to fit a multilevel model next, but I’m not going to regularize that model. Instead, I’ll just specify a formula that doesn’t include the variables that got shrunk to zero in this model. keep_vars <- final_logreg %>% pull_workflow_fit() %>% vi() %>% filter(Importance != 0) %>% pull(Variable) ## Multilevel Model Fitting Now, I’m going to dive into fitting a multilevel model. To give a very brief overview of multilevel models, they are models that can take into account dependencies (nesting) within data. Recall that one of the assumptions of a linear regression is that each observation is independent. We often violate that assumption in the real world. In my work, for instance, students are often nested within classrooms (i.e. a common effect – their teacher – influences them & introduces a dependency). Another common case of nesting is when you have multiple observations over time from the same set of people. In the case of this current data, we can consider that each line is nested within an episode (terminology note: episode would be the “clustering variable” or “grouping variable” here). We could also go a step further and nest episodes within seasons to get a 3-level model rather than a 2-level model, but I’m not going to do that here. Fitting multilevel models allows for random effects, where the coefficient of a given term differs based on the clustering variable. Any term in the model can have a random effect, but the simplest form of a multilevel model – and the one I’m going to fit here – is a random intercept model, where the value of the intercept changes depending on the clustering variable. In the current dataset, this would mean that Michael might be more (or less) likely to speak overall in a given episode (when compared to all other episodes), and so the intercept value will change to reflect that. It’s also possible to fit random slopes, where the effect of a given non-intercept term differs from episode to episode. Contextualizing that in the current data, it might mean that the word “Jan” is more (or less) associated with being spoken by Michael depending on the episode. Usually, you want a pretty clear theoretical rationale for specifying random slopes, and I don’t really have that here. Plus, it would be unreasonable to try to estimate random slopes for all of the words in the dataset (even though I only have a subset of ~190). If you’re interested in learning more about multilevel models, Raudenbush & Bryk (2002) is a classic, and John Fox’s Applied Regression Analysis is just generally a really good book that has a chapter on MLMs. Anyway – onward and upward. First, I want to specify the formula of the model. I’m going to include all of the variables that had non-zero coefficients in the lasso model earlier, and I’m also going to add a term at the end to specify the random intercept for each episode – (1 | episode_name). glmm_formula <- as.formula(paste("is_mike ~ ", paste(keep_vars, collapse = " + "), " + (1 | episode_name)")) I’m going to fit this model using the {glmmTMB} package, which provides an interface for fitting all sort of generalized linear mixed models. I haven’t used this specific package before, but I have used {lme4}, which has similar syntax and is essentially the same thing for fitting linear models. I’m going to fit the model using the training data – note that I’m not tuning anything here – and I’m specifying the binomial family because this is a logistic regression. glmm_fit <- glmmTMB(glmm_formula, data = tr_prepped, family = binomial) I’m going to show the summary of the model here, but it’s going to be a biiig printout since we have so many terms in the model, so feel free to scroll on by. One thing you might want to check out, though, is the summary of the variance of the intercept, which summarizes the amount of randomness in that effect. summary(glmm_fit) ## Family: binomial ( logit ) ## Formula: ## is_mike ~ tfidf_text_jan + tfidf_text_everybody + tfidf_text_scott + ## tfidf_text_stanley + tfidf_text_dunder + tfidf_text_best + ## tfidf_text_life + tfidf_text_david + tfidf_text_thinking + ## tfidf_text_friend + tfidf_text_holly + tfidf_text_people + ## tfidf_text_find + tfidf_text_alright + tfidf_text_ah + tfidf_text_okay + ## tfidf_text_well + tfidf_text_somebody + tfidf_text_going + ## tfidf_text_god + tfidf_text_scranton + tfidf_text_feel + ## tfidf_text_fun + tfidf_text_may + tfidf_text_right + tfidf_text_business + ## tfidf_text_know + tfidf_text_give + tfidf_text_dwight + tfidf_text_second + ## tfidf_text_good + tfidf_text_come + tfidf_text_see + tfidf_text_yes + ## tfidf_text_ok + tfidf_text_ryan + tfidf_text_ever + tfidf_text_go + ## tfidf_text_just + tfidf_text_thing + tfidf_text_show + tfidf_text_big + ## tfidf_text_pam + tfidf_text_check + tfidf_text_friends + ## tfidf_text_point + tfidf_text_say + tfidf_text_toby + tfidf_text_look + ## tfidf_text_kind + tfidf_text_party + tfidf_text_head + tfidf_text_part + ## tfidf_text_bad + tfidf_text_name + tfidf_text_room + tfidf_text_need + ## tfidf_text_guy + tfidf_text_hold + tfidf_text_guys + tfidf_text_getting + ## tfidf_text_every + tfidf_text_phyllis + tfidf_text_never + ## tfidf_text_oscar + tfidf_text_wanted + tfidf_text_stop + ## tfidf_text_first + tfidf_text_today + tfidf_text_please + ## tfidf_text_actually + tfidf_text_meet + tfidf_text_wrong + ## tfidf_text_get + tfidf_text_told + tfidf_text_love + tfidf_text_work + ## tfidf_text_lot + tfidf_text_talk + tfidf_text_little + tfidf_text_tell + ## tfidf_text_oh + tfidf_text_time + tfidf_text_paper + tfidf_text_old + ## tfidf_text_around + tfidf_text_baby + tfidf_text_want + tfidf_text_happy + ## tfidf_text_day + tfidf_text_done + tfidf_text_company + tfidf_text_talking + ## tfidf_text_help + tfidf_text_call + tfidf_text_mm + tfidf_text_said + ## tfidf_text_kevin + tfidf_text_else + tfidf_text_something + ## tfidf_text_ask + tfidf_text_place + tfidf_text_things + tfidf_text_hello + ## tfidf_text_can + tfidf_text_keep + tfidf_text_trying + tfidf_text_wow + ## tfidf_text_believe + tfidf_text_year + tfidf_text_hear + ## tfidf_text_morning + tfidf_text_care + tfidf_text_hot + tfidf_text_um + ## tfidf_text_fine + tfidf_text_let + tfidf_text_jim + tfidf_text_idea + ## tfidf_text_manager + tfidf_text_hey + tfidf_text_take + tfidf_text_home + ## tfidf_text_try + tfidf_text_next + tfidf_text_much + tfidf_text_really + ## tfidf_text_went + tfidf_text_many + tfidf_text_nice + tfidf_text_thank + ## tfidf_text_make + tfidf_text_hmm + tfidf_text_five + tfidf_text_now + ## tfidf_text_listen + tfidf_text_back + tfidf_text_yeah + tfidf_text_anyone + ## tfidf_text_even + tfidf_text_another + tfidf_text_better + ## tfidf_text_uh + tfidf_text_made + tfidf_text_boss + tfidf_text_maybe + ## tfidf_text_everyone + tfidf_text_together + tfidf_text_wanna + ## tfidf_text_huh + tfidf_text_think + tfidf_text_us + tfidf_text_two + ## tfidf_text_long + tfidf_text_car + tfidf_text_got + tfidf_text_away + ## tfidf_text_put + tfidf_text_new + tfidf_text_great + tfidf_text_anything + ## tfidf_text_still + tfidf_text_one + tfidf_text_whole + tfidf_text_gonna + ## tfidf_text_always + tfidf_text_money + tfidf_text_cool + ## tfidf_text_thought + tfidf_text_real + tfidf_text_man + tfidf_text_thanks + ## tfidf_text_guess + tfidf_text_sure + tfidf_text_job + tfidf_text_sorry + ## tfidf_text_wait + tfidf_text_might + tfidf_text_hi + tfidf_text_probably + ## tfidf_text_three + tfidf_text_person + tfidf_text_andy + ## tfidf_text_coming + tfidf_text_someone + tfidf_text_angela + ## tfidf_text_mean + tfidf_text_cause + tfidf_text_pretty + ## tfidf_text_mifflin + tfidf_text_michael + (1 | episode_name) ## Data: tr_prepped ## ## AIC BIC logLik deviance df.resid ## 21674.6 23161.5 -10644.3 21288.6 16187 ## ## Random effects: ## ## Conditional model: ## Groups Name Variance Std.Dev. ## episode_name (Intercept) 0.2359 0.4857 ## Number of obs: 16380, groups: episode_name, 139 ## ## Conditional model: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.3029209 0.0544706 -5.561 2.68e-08 *** ## tfidf_text_jan 0.7372910 0.1507742 4.890 1.01e-06 *** ## tfidf_text_everybody 0.7635790 0.1601581 4.768 1.86e-06 *** ## tfidf_text_scott 0.7600609 0.1786215 4.255 2.09e-05 *** ## tfidf_text_stanley 0.5691587 0.1469049 3.874 0.000107 *** ## tfidf_text_dunder 0.7789319 0.6218075 1.253 0.210319 ## tfidf_text_best 0.5026417 0.1588389 3.164 0.001554 ** ## tfidf_text_life 0.5213888 0.1690371 3.084 0.002039 ** ## tfidf_text_david 0.3545679 0.1252182 2.832 0.004632 ** ## tfidf_text_thinking 0.3721628 0.1431924 2.599 0.009348 ** ## tfidf_text_friend 0.3437911 0.1418104 2.424 0.015338 * ## tfidf_text_holly 0.3810667 0.1233370 3.090 0.002004 ** ## tfidf_text_people 0.3518585 0.1116141 3.152 0.001619 ** ## tfidf_text_find 0.3880312 0.1402456 2.767 0.005661 ** ## tfidf_text_alright 0.3289620 0.0668870 4.918 8.74e-07 *** ## tfidf_text_ah 0.3266135 0.0857779 3.808 0.000140 *** ## tfidf_text_okay 0.3522780 0.0451175 7.808 5.81e-15 *** ## tfidf_text_well 0.3071124 0.0562277 5.462 4.71e-08 *** ## tfidf_text_somebody 0.2804234 0.1181741 2.373 0.017646 * ## tfidf_text_going 0.2918391 0.0674205 4.329 1.50e-05 *** ## tfidf_text_god 0.3001693 0.0802145 3.742 0.000183 *** ## tfidf_text_scranton 0.3339592 0.1275372 2.619 0.008831 ** ## tfidf_text_feel 0.2845247 0.1441670 1.974 0.048430 * ## tfidf_text_fun 0.3032541 0.1024671 2.960 0.003081 ** ## tfidf_text_may 0.2847724 0.1353545 2.104 0.035387 * ## tfidf_text_right 0.2883921 0.0557878 5.169 2.35e-07 *** ## tfidf_text_business 0.3290740 0.1122117 2.933 0.003361 ** ## tfidf_text_know 0.2754553 0.0555031 4.963 6.95e-07 *** ## tfidf_text_give 0.2555021 0.0895789 2.852 0.004341 ** ## tfidf_text_dwight 0.2521584 0.0638145 3.951 7.77e-05 *** ## tfidf_text_second 0.2489978 0.1584121 1.572 0.115989 ## tfidf_text_good 0.2657544 0.0564201 4.710 2.47e-06 *** ## tfidf_text_come 0.2212737 0.0706368 3.133 0.001733 ** ## tfidf_text_see 0.2564652 0.0767773 3.340 0.000837 *** ## tfidf_text_yes 0.2658820 0.0415529 6.399 1.57e-10 *** ## tfidf_text_ok 0.2123422 0.0521134 4.075 4.61e-05 *** ## tfidf_text_ryan 0.2216123 0.0988844 2.241 0.025018 * ## tfidf_text_ever 0.2643488 0.1047767 2.523 0.011637 * ## tfidf_text_go 0.2546152 0.0639061 3.984 6.77e-05 *** ## tfidf_text_just 0.2353517 0.0668502 3.521 0.000431 *** ## tfidf_text_thing 0.2206678 0.0982126 2.247 0.024650 * ## tfidf_text_show 0.2667401 0.1402265 1.902 0.057144 . ## tfidf_text_big 0.2245832 0.1055598 2.128 0.033375 * ## tfidf_text_pam 0.2058977 0.0649203 3.172 0.001516 ** ## tfidf_text_check 0.1942910 0.1044745 1.860 0.062928 . ## tfidf_text_friends 0.1624983 0.1625989 0.999 0.317610 ## tfidf_text_point 0.1908723 0.1698123 1.124 0.261005 ## tfidf_text_say 0.2092691 0.0705187 2.968 0.003002 ** ## tfidf_text_toby 0.1472744 0.0971435 1.516 0.129507 ## tfidf_text_look 0.1996459 0.0756503 2.639 0.008314 ** ## tfidf_text_kind 0.1833095 0.1140876 1.607 0.108111 ## tfidf_text_party 0.1610299 0.0917887 1.754 0.079370 . ## tfidf_text_head 0.1584521 0.1256650 1.261 0.207342 ## tfidf_text_part 0.1386087 0.1185380 1.169 0.242275 ## tfidf_text_bad 0.1745407 0.1017822 1.715 0.086374 . ## tfidf_text_name 0.1749990 0.0957496 1.828 0.067598 . ## tfidf_text_room 0.1868136 0.1168691 1.598 0.109935 ## tfidf_text_need 0.1620772 0.0848763 1.910 0.056189 . ## tfidf_text_guy 0.1670725 0.0948399 1.762 0.078132 . ## tfidf_text_hold 0.1547548 0.1203024 1.286 0.198310 ## tfidf_text_guys 0.1497446 0.0790940 1.893 0.058325 . ## tfidf_text_getting 0.1755995 0.1146826 1.531 0.125725 ## tfidf_text_every 0.1617291 0.1511232 1.070 0.284538 ## tfidf_text_phyllis 0.1361298 0.0972280 1.400 0.161481 ## tfidf_text_never 0.1604898 0.0887524 1.808 0.070562 . ## tfidf_text_oscar 0.1547884 0.0959862 1.613 0.106829 ## tfidf_text_wanted 0.1280130 0.1331137 0.962 0.336210 ## tfidf_text_stop 0.1337950 0.0711891 1.879 0.060186 . ## tfidf_text_first 0.2430160 0.1320562 1.840 0.065732 . ## tfidf_text_today 0.1088150 0.1044473 1.042 0.297497 ## tfidf_text_please 0.1114035 0.0716934 1.554 0.120211 ## tfidf_text_actually 0.1312115 0.1037242 1.265 0.205870 ## tfidf_text_meet 0.1252490 0.1256944 0.996 0.319028 ## tfidf_text_wrong 0.1763792 0.0997280 1.769 0.076960 . ## tfidf_text_get 0.1071142 0.0625835 1.712 0.086981 . ## tfidf_text_told 0.0868422 0.1119252 0.776 0.437811 ## tfidf_text_love 0.1217909 0.0725991 1.678 0.093429 . ## tfidf_text_work 0.1072141 0.0971309 1.104 0.269676 ## tfidf_text_lot 0.1067340 0.1094996 0.975 0.329688 ## tfidf_text_talk 0.1247306 0.0897013 1.391 0.164374 ## tfidf_text_little 0.1075946 0.0946355 1.137 0.255565 ## tfidf_text_tell 0.1362850 0.0824981 1.652 0.098539 . ## tfidf_text_oh 0.0939749 0.0482322 1.948 0.051369 . ## tfidf_text_time 0.1225423 0.0783078 1.565 0.117611 ## tfidf_text_paper 0.1562722 0.1184287 1.320 0.186986 ## tfidf_text_old 0.0673718 0.0977522 0.689 0.490691 ## tfidf_text_around 0.1188791 0.1461178 0.814 0.415883 ## tfidf_text_baby 0.1191267 0.1066110 1.117 0.263825 ## tfidf_text_want 0.0876257 0.0647476 1.353 0.175946 ## tfidf_text_happy 0.1048543 0.0955342 1.098 0.272398 ## tfidf_text_day 0.1037690 0.1037597 1.000 0.317267 ## tfidf_text_done 0.0859977 0.0884952 0.972 0.331161 ## tfidf_text_company 0.0797313 0.1347148 0.592 0.553949 ## tfidf_text_talking 0.0717903 0.0793967 0.904 0.365890 ## tfidf_text_help 0.0612079 0.1014742 0.603 0.546384 ## tfidf_text_call 0.0941603 0.0936287 1.006 0.314571 ## tfidf_text_mm 0.0752606 0.0841799 0.894 0.371298 ## tfidf_text_said 0.0822590 0.0652810 1.260 0.207642 ## tfidf_text_kevin 0.0832092 0.0795433 1.046 0.295521 ## tfidf_text_else 0.0848029 0.1334305 0.636 0.525064 ## tfidf_text_something 0.0501609 0.0945360 0.531 0.595695 ## tfidf_text_ask 0.0591700 0.1271760 0.465 0.641745 ## tfidf_text_place 0.0718600 0.1165014 0.617 0.537356 ## tfidf_text_things 0.0454318 0.1150841 0.395 0.693012 ## tfidf_text_hello 0.0536443 0.0640592 0.837 0.402358 ## tfidf_text_can 0.0682979 0.0692843 0.986 0.324250 ## tfidf_text_keep 0.0562764 0.1008415 0.558 0.576798 ## tfidf_text_trying 0.0697282 0.1168495 0.597 0.550684 ## tfidf_text_wow 0.0558013 0.0586582 0.951 0.341454 ## tfidf_text_believe 0.0694428 0.0975346 0.712 0.476477 ## tfidf_text_year 0.0198931 0.1200495 0.166 0.868387 ## tfidf_text_hear 0.0193085 0.0912039 0.212 0.832336 ## tfidf_text_morning 0.0827975 0.1001271 0.827 0.408280 ## tfidf_text_care 0.0726388 0.1560745 0.465 0.641637 ## tfidf_text_hot 0.0132303 0.1010992 0.131 0.895883 ## tfidf_text_um 0.0054918 0.0730398 0.075 0.940064 ## tfidf_text_fine 0.0146944 0.0882897 0.166 0.867815 ## tfidf_text_let 0.0169750 0.1053511 0.161 0.871993 ## tfidf_text_jim 0.0285969 0.0688487 0.415 0.677880 ## tfidf_text_idea 0.0066760 0.1021193 0.065 0.947875 ## tfidf_text_manager 0.0339253 0.1123951 0.302 0.762774 ## tfidf_text_hey 0.0307439 0.0436914 0.704 0.481644 ## tfidf_text_take 0.0127983 0.0870952 0.147 0.883175 ## tfidf_text_home 0.0300250 0.1459259 0.206 0.836982 ## tfidf_text_try 0.0127101 0.1149946 0.111 0.911991 ## tfidf_text_next 0.0326256 0.1289321 0.253 0.800234 ## tfidf_text_much 0.0144692 0.0838654 0.173 0.863022 ## tfidf_text_really 0.0118776 0.0555883 0.214 0.830804 ## tfidf_text_went 0.0335359 0.1190097 0.282 0.778103 ## tfidf_text_many 0.0035689 0.1223784 0.029 0.976735 ## tfidf_text_nice -0.0147389 0.0712816 -0.207 0.836189 ## tfidf_text_thank 0.0127512 0.0517098 0.247 0.805225 ## tfidf_text_make -0.0008785 0.0941629 -0.009 0.992557 ## tfidf_text_hmm -0.0032915 0.0656409 -0.050 0.960008 ## tfidf_text_five -0.0173667 0.0876847 -0.198 0.842999 ## tfidf_text_now -0.0264073 0.0868250 -0.304 0.761019 ## tfidf_text_listen -0.0234708 0.1497932 -0.157 0.875491 ## tfidf_text_back -0.0284046 0.0895100 -0.317 0.750990 ## tfidf_text_yeah -0.0356910 0.0398897 -0.895 0.370925 ## tfidf_text_anyone -0.0176885 0.1382204 -0.128 0.898170 ## tfidf_text_even -0.0499322 0.1045696 -0.478 0.633005 ## tfidf_text_another -0.0562043 0.1434155 -0.392 0.695134 ## tfidf_text_better -0.0114586 0.1217039 -0.094 0.924989 ## tfidf_text_uh -0.0483613 0.0630707 -0.767 0.443213 ## tfidf_text_made -0.0667750 0.1133116 -0.589 0.555657 ## tfidf_text_boss -0.0160220 0.1235471 -0.130 0.896817 ## tfidf_text_maybe -0.0651173 0.0794681 -0.819 0.412550 ## tfidf_text_everyone -0.0249679 0.1220456 -0.205 0.837901 ## tfidf_text_together -0.0412958 0.1430626 -0.289 0.772845 ## tfidf_text_wanna -0.0739256 0.1033942 -0.715 0.474617 ## tfidf_text_huh -0.0639522 0.0813526 -0.786 0.431802 ## tfidf_text_think -0.0418497 0.0614487 -0.681 0.495839 ## tfidf_text_us -0.0535433 0.0970478 -0.552 0.581139 ## tfidf_text_two -0.0402446 0.0879470 -0.458 0.647239 ## tfidf_text_long -0.0510322 0.1262587 -0.404 0.686075 ## tfidf_text_car -0.0266573 0.1093915 -0.244 0.807473 ## tfidf_text_got -0.0488823 0.0686154 -0.712 0.476211 ## tfidf_text_away -0.1182013 0.1164697 -1.015 0.310169 ## tfidf_text_put -0.0447067 0.0955902 -0.468 0.640005 ## tfidf_text_new -0.0549771 0.1196758 -0.459 0.645959 ## tfidf_text_great -0.1021129 0.0726217 -1.406 0.159696 ## tfidf_text_anything -0.0528189 0.0930687 -0.568 0.570357 ## tfidf_text_still -0.0774119 0.0866660 -0.893 0.371739 ## tfidf_text_one -0.0657405 0.0702777 -0.935 0.349562 ## tfidf_text_whole -0.0708486 0.1522132 -0.465 0.641605 ## tfidf_text_gonna -0.1042882 0.0835974 -1.248 0.212212 ## tfidf_text_always -0.1066373 0.1185158 -0.900 0.368241 ## tfidf_text_money -0.1155684 0.1286889 -0.898 0.369161 ## tfidf_text_cool -0.1082140 0.0829498 -1.305 0.192038 ## tfidf_text_thought -0.1016551 0.0938529 -1.083 0.278750 ## tfidf_text_real -0.1213314 0.1312351 -0.925 0.355208 ## tfidf_text_man -0.1112639 0.0807376 -1.378 0.168174 ## tfidf_text_thanks -0.1238876 0.0730649 -1.696 0.089965 . ## tfidf_text_guess -0.1313798 0.1257158 -1.045 0.295998 ## tfidf_text_sure -0.1097666 0.0747863 -1.468 0.142176 ## tfidf_text_job -0.0943828 0.1325235 -0.712 0.476343 ## tfidf_text_sorry -0.1202065 0.0760825 -1.580 0.114119 ## tfidf_text_wait -0.1560121 0.0805076 -1.938 0.052641 . ## tfidf_text_might -0.1657607 0.1282435 -1.293 0.196168 ## tfidf_text_hi -0.1425151 0.0727327 -1.959 0.050062 . ## tfidf_text_probably -0.1759289 0.1449074 -1.214 0.224718 ## tfidf_text_three -0.1696357 0.1087326 -1.560 0.118732 ## tfidf_text_person -0.1824426 0.1584761 -1.151 0.249637 ## tfidf_text_andy -0.1549839 0.0872749 -1.776 0.075764 . ## tfidf_text_coming -0.2454545 0.1444184 -1.700 0.089205 . ## tfidf_text_someone -0.2621048 0.1514974 -1.730 0.083613 . ## tfidf_text_angela -0.2743710 0.1276223 -2.150 0.031566 * ## tfidf_text_mean -0.3043742 0.0861024 -3.535 0.000408 *** ## tfidf_text_cause -0.3283995 0.1569350 -2.093 0.036386 * ## tfidf_text_pretty -0.3818384 0.1586936 -2.406 0.016122 * ## tfidf_text_mifflin -0.5893730 0.5956769 -0.989 0.322459 ## tfidf_text_michael -1.3116352 0.1181541 -11.101 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Right, so, the next logical step in my mind is to take a closer look at the random intercepts. We see some variance in the intercept (.23), which suggests that there are meaningful between-episode differences in the number of times Michael Scott speaks. Rather than looking at all of these, let’s take a look at the largest 10 effects (as a benchmark, recall that the mean intercept is -.3) ranef(glmm_fit) %>% as.data.frame() %>% select(grp, condval) %>% slice_max(order_by = abs(condval), n = 10) %>% ggplot(aes(x = abs(condval), y = fct_reorder(grp, abs(condval)), fill = if_else(condval > 0, "Pos", "Neg"))) + geom_col() + scale_fill_discrete(name = "Sign") + labs( y = NULL, title = "Top Random Intercepts" ) This plot shows the largest (in absolute value) intercepts. The way to interpret this is that, in these episodes, Michael is more or less likely to speak. The effects of each of the words remains the same across episodes (since I didn’t specify random slopes), but these change the assumed “base rate” that Michael speaks. What we see here makes sense, because Michael actually isn’t in the three episodes that have the highest values here (I should have addressed this in data cleaning – whoops!). Finally, I’ll take a look at the accuracy of the predictions from the multilevel model. glmm_preds_response <- predict(glmm_fit, te_prepped, type = "response") glmm_preds <- ifelse(glmm_preds_response < .5, "No", "Yes") %>% as_factor() %>% fct_relevel("No", "Yes") bind_cols(te_prepped$is_mike, glmm_preds) %>%
repair_names() %>%
accuracy(truth = ...1, estimate = ...2)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.593

It’s a little bit disappointing that the multilevel model isn’t more accurate than the single-level model I ran previously, but one thing to keep in mind is that the single level model was regularized, whereas the multilevel model wasn’t (beyond omitting the variables that got completely omitted from the single level model). So, even though our intercept seems to have a decent amount of variance – meaning random effects are probably warranted – the gains in predictive accuracy we’d get from that are more than offset by the regularization in the first model. There’s probably a way to regularize a multilevel model, but I might save that one for another day. I could also play around with changing the probability threshold for classifying a line as Michael by setting it to something higher than 50% (e.g. a line needs to have a 70% probability before being classified as spoken by Michael), but I’m also not going to go down that rabbit hole here.

So, I’m going to wrap it up for now. And who knows, maybe I’ll revisit this dataset in another 4 months.