library(tidyverse)
library(eemisc) #ggplot theme
library(jsonlite) #work with json data
library(tidyLPA)
library(harrypotter) #colors
library(tidytext) #for reorder_within function
library(gt) #make tables
herm <- harrypotter::hp(n = 1, option = "HermioneGranger")

opts <- options(
  ggplot2.discrete.fill = list(
    harrypotter::hp(n = 3, option = "HermioneGranger"),
    harrypotter::hp(n = 7, option = "Always")
  )
)
theme_set(theme_ee())
<- "https://www.dnd5eapi.co/api/monsters/"
dnd_base
#getting data from api -- see 1st d&d post
#for process explanation
fetch_monster <- function(monster) {
  dnd_url <- "https://www.dnd5eapi.co/api/monsters/"

  ret <- fromJSON(paste0(dnd_url, monster)) %>%
    enframe() %>%
    pivot_wider(names_from = name,
                values_from = value)

  return(ret)
}
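#example usage (assuming the API is reachable; "aboleth" is a valid
#monster index at the time of writing):
#fetch_monster("aboleth")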
compare_lens <- function(x, size = 1) {
  all(map_lgl(x, ~length(unlist(.x)) == size))
}

cond_unlist <- function(x) {
  if (compare_lens(x) == TRUE) {
    unlist(x)
  } else {
    x
  }
}
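#quick illustration of cond_unlist's behavior (hypothetical values):
#cond_unlist(list(1, 2, 3))    would return c(1, 2, 3)
#cond_unlist(list(1, c(2, 3))) would return the list unchanged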
mons <- fromJSON(dnd_base)$results %>%
  pull(index)

monster_lists <- purrr::map(mons, fetch_monster)

mons_bind <- bind_rows(monster_lists)

mons_df <- mons_bind %>%
  mutate(across(.cols = everything(), ~cond_unlist(x = .x)))
This post is the next installment in a mini-series I’m doing where I explore Dungeons and Dragons (D&D) data. In the first post, I showed how to wrangle JSON data from a D&D API, and in the second post, I explored monster statistics.
In this post, I’m going to use latent profile analysis (LPA) and monsters’ ability scores to try to classify monsters into different “classes.” For instance, we might suspect that there is a “class” of up-in-your-face monsters that have high strength and constitution but low intelligence, whereas there may be another class of spell casters that has high intelligence and charisma but low strength and constitution. LPA gives us a framework to estimate how many of these classes exist as well as which monsters fall into which classes.
Note: I’m probably going to use the terms profile, class, group, and cluster somewhat interchangeably throughout this post (because I’m undisciplined), so just as a warning ahead of time – these all mean pretty much the same thing.
Before getting into the meat of this post, I want to give a shoutout to the excellent {tidyLPA} package, which provides some functions to make doing LPA easier.
So let’s get into it, then.
Setup
First, I’ll load some packages that are useful here. Notably, we’ll load {tidyLPA} for doing the actual LPA. This calls the {mclust} package to fit the model, and I could use {mclust} directly, but {tidyLPA} makes the process somewhat more convenient. I’m also going to get a tibble of monster data in this setup chunk. I walk through this process in my first D&D blog post, so if you’re curious about that, feel free to check out that post.
What is LPA?
Before getting into actually fitting any models, I want to do a very brief and minimally technical toe-dip into what LPA actually is. Like I mentioned just above, LPA is an approach for estimating groups of people/things/monsters/whatever. You might think of it as akin to principal components analysis (PCA) or exploratory factor analysis (EFA), except instead of grouping variables together, LPA groups observations together. LPA is itself a special type of mixture model, and the basic idea of mixture modeling is that there are k latent (underlying) subpopulations (groups) within a population, where the number of groups k is something the researcher/analyst should determine either empirically or using past research.
If you’re familiar with k-means clustering, this probably sounds very similar. And it pretty much is. The main difference between LPA and k-means is that, since LPA estimates distributions for each subpopulation in the model, we get the probability that each observation belongs to each class, e.g. Monster 1 might have a 99% probability of belonging to class 1, whereas Monster 2 might have an 80% probability of belonging to class 1. And depending on how you intend to use your model, this information might be useful. Another potentially useful difference is that LPA provides a number of fit statistics that (at least as far as I know) aren’t available for k-means clustering.
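To make the “probability of membership” idea concrete, here’s a minimal sketch – totally separate from the monster data, using simulated values – that fits a 2-class mixture with {mclust} (the same engine {tidyLPA} wraps) and pulls out the posterior probability matrix:

library(mclust) #loaded here just for this toy example

set.seed(123)
toy <- data.frame(
  x = c(rnorm(100, mean = 0), rnorm(100, mean = 4)),
  y = c(rnorm(100, mean = 0), rnorm(100, mean = 4))
)

toy_fit <- Mclust(toy, G = 2)

#each row is an observation, each column a class; entries are the
#posterior probability that the observation belongs to that class
head(toy_fit$z)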
As with EFA (and other latent-variable models), the assumption of LPA is that the latent (unobserved) factor “causes” (I’m using the term loosely here) observed scores on the indicator variables. So, to refer back to my initial hypothetical example, a monster being a spell caster (the unobserved class) causes it to have high intelligence, low strength, etc. rather than the inverse. This is a worthwhile distinction to keep in mind, since it has implications for how the model is fit.
There are some other, more practical considerations regarding LPA that I’ll bring up once we get into the modeling, but this is enough background to get started.
Binning Monsters
I’m not really going to do any data exploration here, since I did that in a previous post. But I do want to quickly pull up a distribution of monster challenge rating because it’s relevant to a step we need to take before fitting our LPA.
mons_df %>%
  ggplot(aes(x = challenge_rating)) +
  geom_histogram(alpha = .8, fill = herm, color = herm) +
  labs(
    title = "Distribution of Monster Challenge Rating"
  )
For those not in the know, a monster’s challenge rating (CR) in D&D refers to how difficult it is to fight (you probably could have guessed that from the name), where higher CR = more difficult. Monsters can be difficult for many reasons, such as how they move and how many/which types of spells they have, but often, higher CR means higher ability scores. Since I’m going to be doing an LPA where ability scores are the profile indicators, I want to account for the fact that monsters with higher CRs will tend to have higher ability scores. Without accounting for this, CR will (likely) dominate our LPA, and we’d (probably) end up with profiles that represent “very low level monster,” “kinda low level monster,” “mid-level monster,” etc., rather than profiles that represent a monster archetype (e.g. spellcaster, beefcake).
The approach I’m going to take is to create sextile bins using monster CR to get 6 “bins” of monsters. Using this approach, monsters in each bin will have (roughly) the same CR. We can see this in the plot below.
#divide monsters into sextiles
mons_bin <- mons_df %>%
  mutate(cr_bin = ntile(x = challenge_rating, n = 6))
mons_bin %>%
  ggplot(aes(x = challenge_rating, y = cr_bin, color = cr_bin)) +
  geom_jitter() +
  scale_color_hp() +
  labs(
    y = "Challenge Bin",
    x = "Challenge Rating",
    title = "Binned CR by Actual CR"
  ) +
  theme(
    legend.position = "none"
  )
Since the distribution we saw above has a long right tail, the higher-CR bins are more spread out than the lower bins, but obviously each bin has less variance than the entire population.
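If you want to check that numerically rather than eyeballing the plot, a quick sketch:

#within-bin CR variance vs. variance across all monsters
mons_bin %>%
  group_by(cr_bin) %>%
  summarize(bin_var = var(challenge_rating)) %>%
  mutate(overall_var = var(mons_bin$challenge_rating))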
The next step is to group mean center each monster’s ability scores. To do this, I’ll subtract the bin-average ability score from the monster’s ability score. So, for example, to group mean center the strength score of a monster in bin 1, I’d subtract bin 1’s average strength score from the monster’s strength score. The resulting score tells me whether the monster’s score is higher or lower than average for its CR. Importantly, this removes the effect of the bin average on our LPA. I could go a step further and divide by the bin-specific standard deviation to remove the effect of bin-specific variance, but that makes the estimates somewhat harder to interpret, so I’m not going to.
<- c("strength", "dexterity", "constitution", "intelligence", "wisdom", "charisma")
ab_scores
<- mons_bin %>%
mons_bin group_by(cr_bin) %>%
mutate(across(.cols = ab_scores, .fns = mean, .names = "{.col}_bin_mean")) %>%
ungroup()
<- str_replace_all(ab_scores, "$", "_bin_mean")
ab_scores_grp
<- map2_dfc(mons_bin[, ab_scores], mons_bin[, ab_scores_grp],
mons_centered ~.x - .y) %>%
rename_with(.fn = ~str_replace_all(.x, "$", "_centered")) %>%
bind_cols(mons_bin, .) %>%
select(name, ends_with("centered"))
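As an aside, the same centering can be done in a single grouped mutate – a sketch that should give equivalent results, though I’ll stick with the version above for the rest of the post:

mons_centered_alt <- mons_bin %>%
  group_by(cr_bin) %>%
  mutate(across(all_of(ab_scores), ~.x - mean(.x), .names = "{.col}_centered")) %>%
  ungroup() %>%
  select(name, ends_with("centered"))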
Fit LPA & Select Best Model
The next step is to estimate our LPA. This is very easy using {tidyLPA}’s estimate_profiles() function. Here’s what I’m doing in the step below:

- Setting the n_profiles option to 1:5, which will fit models with 1 profile, 2 profiles, …, 5 profiles.
- Setting the variances option to "equal" and "varying". This will fit 1-5 profile models where the conditional variances of each indicator are constrained to be equal across profiles, as well as 1-5 profile models where the conditional variances are allowed to vary.
- Setting covariances equal to "zero" for both model types. This means that the indicator variables are conditionally independent.
- Telling the model to use only the centered ability score variables to estimate the profiles.
Also! Always remember to set your seed.
set.seed(0408)

lpa_fits <- mons_centered %>%
  estimate_profiles(1:5,
                    variances = c("equal", "varying"),
                    covariances = c("zero", "zero"),
                    select_vars = str_subset(names(mons_centered), "centered"))
Now that the models are fit, I want to select the best one. {tidyLPA} makes this pretty easy via the compare_solutions() function, which I’ll use later, but I also want to explore a few fit indices first – the Bayesian Information Criterion (BIC) and entropy. BIC balances model fit against model complexity, and it’s the fit index that people who know more about statistics than I do suggest we want to minimize in LPA. Entropy is a measure of how distinct classes are from one another and ranges from 0-1, where values closer to 1 represent more separation between classes (which is what we want). I’m going to pull these out and plot them.
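As an aside, my understanding is that the entropy value {tidyLPA} reports is a relative entropy computed from the posterior class probabilities. Here’s a hedged sketch of that calculation – relative_entropy is a hypothetical helper I’m defining for illustration, not part of {tidyLPA}’s API:

#given a matrix p of posterior class probabilities (n rows, K classes),
#relative entropy = 1 - sum(-p * log(p)) / (n * log(K))
relative_entropy <- function(p) {
  1 - sum(-p * log(p), na.rm = TRUE) / (nrow(p) * log(ncol(p)))
}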
mods <- names(lpa_fits)
#recall that model 1 corresponds to equal variances; model 2 corresponds to varying variances

#getting some fit indices
bics <- map_dbl(1:10, ~pluck(lpa_fits, .x, "fit", "BIC"))
entrops <- map_dbl(1:10, ~pluck(lpa_fits, .x, "fit", "Entropy"))

fit_indices <- bind_cols(mods, bics, entrops) %>%
  set_names(c("model", "bic", "entrop")) %>%
  pivot_longer(cols = c("bic", "entrop"),
               names_to = "metric",
               values_to = "val")
fit_indices %>%
  ggplot(aes(x = val, y = reorder_within(model, val, metric), fill = metric)) +
  geom_col() +
  geom_text(aes(label = if_else(val > 1, round(val, 0), round(val, 3)), x = val - .01), hjust = 1, color = "white") +
  facet_wrap(vars(metric), scales = "free") +
  scale_y_reordered() +
  labs(
    y = "Model",
    x = "Value",
    title = "Selected Fit Indices"
  ) +
  theme(
    legend.position = "none"
  )
A couple things to keep in mind here:
- Models with 1 class will always have an entropy of 1 since there’s no overlap between classes.
- Model 1, Class 5 (equal variances, 5 profiles) threw a warning indicating that < 1% of observations were classified into one profile, so we probably don’t want to use that one.
We can see in the above plot that Model 2, Class 5 (varying variances, 5 profiles) has the lowest BIC (other than the model we want to ignore anyway) and the highest entropy (other than the 1-profile models). So that’s the one we likely want to go with, but let’s see what compare_solutions() tells me.
lpa_fits %>%
  compare_solutions()
Compare tidyLPA solutions:
Model Classes BIC
1 1 10599.528
1 2 10366.807
1 3 10237.207
1 4 10145.513
1 5 10127.963
2 1 10599.528
2 2 10383.456
2 3 10230.918
2 4 10160.966
2 5 10127.499
Best model according to BIC is Model 2 with 5 classes.
An analytic hierarchy process, based on the fit indices AIC, AWE, BIC, CLC, and KIC (Akogul & Erisoglu, 2017), suggests the best solution is Model 2 with 5 classes.
Examine Profiles
Now that we’ve selected a model specification, let’s look at what each profile looks like. The plot below shows the average bin-centered ability scores for each profile.
prof_estimates <- get_estimates(lpa_fits) %>%
  filter(Model == 2,
         Classes == 5)
prof_estimates %>%
  filter(Category == "Means") %>%
  mutate(Class = str_replace_all(Class, "^", "Profile "),
         Parameter = str_remove_all(Parameter, "_centered")) %>%
  ggplot(aes(x = Estimate, y = Parameter)) +
  geom_col(aes(fill = if_else(Estimate > 0, TRUE, FALSE))) +
  facet_wrap(vars(Class)) +
  labs(
    y = NULL,
    x = "CR-Centered Ability Score",
    title = "Ability Score Means by Profile",
    caption = "5 profiles, varying variances"
  ) +
  theme(
    legend.position = "none"
  )
Profiles 1 and 4 aren’t terribly interesting – they represent monsters that have all ability scores either higher or lower than the bin average. This is probably an artifact of the challenge rating variation within each bin (especially the higher ones). Profiles 2, 3, and 5 are interesting, though.
- Profile 2 seems to represent spell-casting monsters with lower strength and constitution but higher intelligence, dexterity, and charisma.
- Likewise, Profile 3 looks like a different type of spellcaster, with very high intelligence and charisma.
- Profile 5 is more of a fighter-type monster, with higher strength and constitution but low intelligence and charisma.
Next, I’ll check how many monsters ended up in each class. I didn’t get a warning for this model suggesting that a class had dangerously low membership, but it’s still good practice to check this.
class_assigns <- get_data(lpa_fits) %>%
  filter(model_number == 2,
         classes_number == 5) %>%
  group_by(id) %>%
  filter(Probability == max(Probability)) %>%
  ungroup() %>%
  select(name, Class, Probability)

class_assigns %>%
  count(Class)
# A tibble: 5 × 2
Class n
<dbl> <int>
1 1 68
2 2 69
3 3 32
4 4 45
5 5 120
This feels reasonable to me. The smallest class (Class 3) still comprises ~10% of the total monsters in the dataset.
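If you want those percentages directly rather than doing the math in your head, a quick sketch:

class_assigns %>%
  count(Class) %>%
  mutate(pct = n / sum(n))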
Finally, let’s take a quick look at one example monster from each class, as well as the estimated probability that the monster belongs to that class (note that the assigned/estimated class for each monster is the class with the highest probability).
set.seed(0409)

class_assigns %>%
  group_by(Class) %>%
  sample_n(size = 1) %>%
  ungroup() %>%
  select(Class, name, Probability) %>%
  gt() %>%
  tab_header(
    title = "Example Monster for Each Class"
  ) %>%
  cols_label(
    Class = "Estimated Class",
    name = "Monster",
    Probability = "Estimated Class Prob"
  ) %>%
  fmt_percent(
    columns = vars(Probability)
  )
Example Monster for Each Class

| Estimated Class | Monster       | Estimated Class Prob |
|-----------------|---------------|----------------------|
| 1               | Grimlock      | 96.19%               |
| 2               | Drider        | 98.36%               |
| 3               | Veteran       | 99.14%               |
| 4               | Scorpion      | 64.04%               |
| 5               | Swarm of Rats | 89.60%               |
That’s it for now! Hopefully that was interesting to people who like D&D and helpful for anyone interested in LPA. I’ll probably do one more post in this series, and likely one that gets into some predictive modeling.
Citation
@online{ekholm2021,
author = {Ekholm, Eric},
title = {Dungeons and {Dragons} - {Part} 3},
date = {2021-02-03},
url = {https://www.ericekholm.com/posts/dungeons-and-dragons-part-3},
langid = {en}
}