How many groups of Star Wars characters are there? R-squared and cross-validation approaches

2017/07/02

Background

How many groups, or types, of Star Wars characters are there? I’ve been wanting to use the starwars dataset built into the dplyr package, and at the same time, have been working hard on an R package to carry out an analysis suited to doing this. Part of the challenge of using the approach in this R package is determining how many groups there are.

Many approaches (Latent Profile Analysis, for example) use Maximum Likelihood estimation, while the approach I’ve developed uses a two-step cluster analysis based around the geometric (and algebraic) idea of “distance”, or how close (similar) observations are. This is easy enough when we’re talking about something like length. If one thing is 4 long and another is 8, then what is their distance? Four! When we’re talking about more than just length - say, length and width - then it’s the exact same idea, except the distance represents how far apart two things are across both measures - length and width.
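As a small illustration of the distance idea, here is a minimal sketch using base R’s dist(), which computes Euclidean distance by default. The two objects and their measurements are made up for the example.

```r
# Two made-up objects measured on length only
lengths <- matrix(c(4, 8), ncol = 1,
                  dimnames = list(c("a", "b"), "length"))
d_length <- dist(lengths)   # Euclidean distance by default
as.numeric(d_length)        # 4

# The same two objects measured on length and width
both <- cbind(length = c(4, 8), width = c(1, 4))
rownames(both) <- c("a", "b")
d_both <- dist(both)
as.numeric(d_both)          # sqrt((8 - 4)^2 + (4 - 1)^2) = 5
```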

But back to groups of Star Wars characters. How many are there? Let’s see what data we have:

library(dplyr)

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

It looks like we only have three measures that are numbers (height, mass, and birth_year) - though there are others we could possibly turn into numbers (maybe), and there are other approaches (Latent Class Analysis) that can deal with non-numeric measures (such as hair_color). But we’ll have to stick to the three measures that are numbers, for better or worse, for now.
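One quick way to confirm which measures are stored as numbers is to test every column with is.numeric(). This is just a sketch of one way to check, using base R on the starwars data from dplyr.

```r
library(dplyr)  # provides the starwars data

# Names of the columns stored as numbers
numeric_cols <- names(starwars)[sapply(starwars, is.numeric)]
numeric_cols
```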

R-squared

Let’s first take a look at the plot of R-squared values, which are obtained from the second of the two steps of the cluster analysis - the k-means step (I say this because there are other, perhaps better, ways to calculate the R-squared values, such as from a MANOVA).
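To give a rough sense of what a two-step cluster analysis can look like, here is a minimal sketch in base R: hierarchical clustering (Ward’s method) supplies starting centers, and k-means then refines them. The data are made up and the helper name two_step_cluster is hypothetical; this illustrates the general idea and is not necessarily identical to what prcr does internally.

```r
# Hypothetical helper: hierarchical clustering supplies starting centers
# for k-means. An illustration only, not prcr's internal code.
two_step_cluster <- function(x, k) {
  x <- as.matrix(x)
  # Step 1: Ward's hierarchical clustering on Euclidean distances
  tree <- hclust(dist(x), method = "ward.D2")
  start_groups <- cutree(tree, k = k)
  # Mean of each variable within each starting group (k rows, one per group)
  start_centers <- apply(x, 2, function(col) tapply(col, start_groups, mean))
  # Step 2: k-means, started from the hierarchical centers
  kmeans(x, centers = start_centers)
}

# Made-up data: two well-separated blobs in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- two_step_cluster(toy, k = 2)
table(fit$cluster)
```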

We just list the name of the data and the variables we would like to use. Since birth_year is on a very different metric than the other two variables, we’ll set to_scale and to_center to TRUE. We’ll also return a table, instead of a plot.
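What centering and scaling do can be seen with base R’s scale(). This is a sketch with made-up numbers, on the assumption that to_center and to_scale standardize the variables in an analogous way.

```r
# Made-up values on very different metrics
toy <- data.frame(height = c(150, 170, 190),
                  birth_year = c(19, 112, 896))

# Center each column at 0 and scale it to a standard deviation of 1
toy_std <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(toy_std), 10)   # both columns now have mean 0
apply(toy_std, 2, sd)          # and standard deviation 1
```

After standardizing, a variable with a huge range (like birth_year) no longer dominates the distances between observations.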

library(prcr)

plot_r_squared(starwars, height, mass, birth_year, to_scale = TRUE, to_center = TRUE, r_squared_table = TRUE)
## ################################
## Clustering data for iteration 2
## Clustering data for iteration 3
## Clustering data for iteration 4
## Clustering data for iteration 5
## Clustering data for iteration 6
## Clustering data for iteration 7
## Clustering data for iteration 8
## Clustering data for iteration 9
## ################################
##   cluster r_squared_value
## 1       2           0.507
## 2       3              NA
## 3       4              NA
## 4       5              NA
## 5       6              NA
## 6       7              NA
## 7       8              NA
## 8       9              NA

Ooh! Not good. Before the second of the two steps settled on the groups, it ended up with a group with no observations. This is probably in part the result of a small sample, and possibly attributable to the measures we used - and maybe some missing data for some of the measures. Let’s take a look at the data:

starwars_ss <- select(starwars, height, mass, birth_year)
skimr::skim(starwars_ss)
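Table 1: Data summary

|                        |             |
|------------------------|-------------|
| Name                   | starwars_ss |
| Number of rows         | 87          |
| Number of columns      | 3           |
| Column type frequency: |             |
| numeric                | 3           |
| Group variables        | None        |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean   | sd     | p0 | p25   | p50 | p75   | p100 | hist  |
|---------------|-----------|---------------|--------|--------|----|-------|-----|-------|------|-------|
| height        | 6         | 0.93          | 174.36 | 34.77  | 66 | 167.0 | 180 | 191.0 | 264  | ▁▁▇▅▁ |
| mass          | 28        | 0.68          | 97.31  | 169.46 | 15 | 55.6  | 79  | 84.5  | 1358 | ▇▁▁▁▁ |
| birth_year    | 44        | 0.49          | 87.57  | 154.69 | 8  | 35.0  | 52  | 72.0  | 896  | ▇▁▁▁▁ |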

It looks like birth_year is missing for a lot - 44 - of the 87 Star Wars characters we have. We’re down to the bare-bones number of measures, but let’s try with just height and mass. We probably don’t strictly need to scale the data now, but the call below keeps to_scale and to_center set to TRUE.

plot_r_squared(starwars, height, mass, to_scale = TRUE, to_center = TRUE, r_squared_table = TRUE)
## ################################
## Clustering data for iteration 2
## Clustering data for iteration 3
## Clustering data for iteration 4
## Clustering data for iteration 5
## Clustering data for iteration 6
## Clustering data for iteration 7
## Clustering data for iteration 8
## Clustering data for iteration 9
## ################################
##   cluster r_squared_value
## 1       2           0.485
## 2       3           0.872
## 3       4              NA
## 4       5              NA
## 5       6           0.977
## 6       7              NA
## 7       8              NA
## 8       9              NA

That’s better - in a sense. We have two-, three-, and six-group solutions. I wouldn’t trust the six-group solution very much. The R-squared value does increase substantially between two and three groups. This suggests maybe there are three groups (when we use just the measures for height and mass).
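For intuition about what these R-squared values measure, here is a minimal sketch using one common definition: the between-cluster sum of squares divided by the total sum of squares, both of which kmeans() reports. The data are made up, and this definition is an assumption for illustration; prcr may calculate its values differently.

```r
# Made-up data: three blobs in two dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# R-squared here = between-cluster sum of squares / total sum of squares
r_squared <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})

data.frame(cluster = 2:6, r_squared_value = round(r_squared, 3))
```

A large jump followed by only small gains, like the one between two and three groups above, is the pattern to look for.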

Groups

two_profiles <- create_profiles(starwars, height, mass, n_profiles = 2, to_scale = TRUE, to_center = TRUE)
plot(two_profiles)

three_profiles <- create_profiles(starwars, height, mass, n_profiles = 3, to_scale = TRUE, to_center = TRUE)
plot(three_profiles)

The third group: Massive, not so tall

It looks like there is one very massive (literally) observation that makes up one profile in both the two and three profile solutions. Who is it?

three_profiles$.data %>% 
    filter(cluster == 3) %>% 
    knitr::kable()
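| name | height | mass | hair_color | skin_color | eye_color | birth_year | gender | homeworld | species | films | vehicles | starships | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jabba Desilijic Tiure | 175 | 1358 | NA | green-tan, brown | orange | 600 | hermaphrodite | Nal Hutta | Hutt | c("The Phantom Menace", "Return of the Jedi", "A New Hope") | character(0) | character(0) | 3 |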

Jabba. Of course. It looks like with two or three groups, Jabba ends up in a cluster of his own.

The second group: Less massive, small height

What about the seven - who seem to be less massive and with a small height - in the second group?

three_profiles$.data %>% 
    filter(cluster == 2) %>% 
    knitr::kable()
| name | height | mass | hair_color | skin_color | eye_color | birth_year | gender | homeworld | species | films | vehicles | starships | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R2-D2 | 96 | 32 | NA | white, blue | red | 33 | NA | Naboo | Droid | c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens") | character(0) | character(0) | 2 |
| R5-D4 | 97 | 32 | NA | white, red | red | NA | NA | Tatooine | Droid | A New Hope | character(0) | character(0) | 2 |
| Yoda | 66 | 17 | white | green | brown | 896 | male | NA | Yoda’s species | c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back") | character(0) | character(0) | 2 |
| Wicket Systri Warrick | 88 | 20 | brown | brown | brown | 8 | male | Endor | Ewok | Return of the Jedi | character(0) | character(0) | 2 |
| Sebulba | 112 | 40 | none | grey, red | orange | NA | male | Malastare | Dug | The Phantom Menace | character(0) | character(0) | 2 |
| Dud Bolt | 94 | 45 | none | blue, grey | yellow | NA | male | Vulpter | Vulptereen | The Phantom Menace | character(0) | character(0) | 2 |
| Ratts Tyerell | 79 | 15 | none | grey, blue | unknown | NA | male | Aleen Minor | Aleena | The Phantom Menace | character(0) | character(0) | 2 |

These seem to be droids, Yoda, and some other tiny characters.

(Some from) the first group: Above average height, below average mass

The 51 in the first group, with slightly above average height, and slightly below average mass? It’s a big group, so here are just the first six, with a lot of familiar characters:

three_profiles$.data %>% 
    filter(cluster == 1) %>% 
    head() %>% 
    knitr::kable()
| name | height | mass | hair_color | skin_color | eye_color | birth_year | gender | homeworld | species | films | vehicles | starships | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Luke Skywalker | 172 | 77 | blond | fair | blue | 19.0 | male | Tatooine | Human | c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens") | c("Snowspeeder", "Imperial Speeder Bike") | c("X-wing", "Imperial shuttle") | 1 |
| C-3PO | 167 | 75 | NA | gold | yellow | 112.0 | NA | Tatooine | Droid | c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope") | character(0) | character(0) | 1 |
| Darth Vader | 202 | 136 | none | white | yellow | 41.9 | male | Tatooine | Human | c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope") | character(0) | TIE Advanced x1 | 1 |
| Leia Organa | 150 | 49 | brown | light | brown | 19.0 | female | Alderaan | Human | c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens") | Imperial Speeder Bike | character(0) | 1 |
| Owen Lars | 178 | 120 | brown, grey | light | blue | 52.0 | male | Tatooine | Human | c("Attack of the Clones", "Revenge of the Sith", "A New Hope") | character(0) | character(0) | 1 |
| Beru Whitesun lars | 165 | 75 | brown | light | blue | 47.0 | female | Tatooine | Human | c("Attack of the Clones", "Revenge of the Sith", "A New Hope") | character(0) | character(0) | 1 |

Cross-validation

The other technique for determining the number of groups, cross-validation, may be folly because of how it works: Split the data into two, and see how well groups in one half can be reproduced in the other. This may be a problem due to the Jabba-group.

We’ll use the same arguments as before, minus r_squared_table, which we don’t need, plus one new argument, n_profiles, for the number of groups whose groupings we want to cross-validate. We have to deal with complete cases, which is what the first line is for (because starwars_ss still includes birth_year, rows missing it are dropped as well):

starwars_ss <- starwars_ss[complete.cases(starwars_ss), ]
cross_validate(starwars_ss, height, mass, n_profiles = 2, to_scale = TRUE, to_center = TRUE)

Not pretty. Convergence issues galore (I decided not to print the messages because there were so many). The Fleiss’ Kappa was close to 0; the percentage agreement 0.61.
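To make the split-half idea concrete, here is a minimal sketch in base R with made-up data: cluster each half separately, then label the second half using the first half’s centers and see how often the two sets of labels agree. The helper nearest_center is hypothetical, and this is not the cross_validate() implementation in prcr; it only computes a simple percentage agreement, not Fleiss’ Kappa.

```r
# Hypothetical helper: assign each row of x to its nearest center
nearest_center <- function(x, centers) {
  d <- as.matrix(dist(rbind(centers, x)))
  k <- nrow(centers)
  # Distances from every row of x (rows) to every center (columns)
  d_to_centers <- d[-seq_len(k), seq_len(k), drop = FALSE]
  unname(apply(d_to_centers, 1, which.min))
}

# Made-up data: two blobs, randomly split into halves
set.seed(7)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
in_first_half <- sample(nrow(toy)) <= nrow(toy) / 2
half_a <- toy[in_first_half, ]
half_b <- toy[!in_first_half, ]

# Cluster each half on its own
fit_a <- kmeans(half_a, centers = 2, nstart = 25)
fit_b <- kmeans(half_b, centers = 2, nstart = 25)

# Label half B twice: by its own clustering, and by half A's centers
own_labels <- fit_b$cluster
borrowed_labels <- nearest_center(half_b, fit_a$centers)

# Cluster numbers are arbitrary, so take the better of the two matchings
agreement <- max(mean(own_labels == borrowed_labels),
                 mean(own_labels == 3 - borrowed_labels))
agreement
```

With a one-member group like Jabba’s, a random split puts that member in only one half, which is one reason this kind of check can struggle here.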

Conclusion

Looking at height and mass, we seem to be able to identify three broad groups of Star Wars characters. However, we shouldn’t have a ton of confidence in how well these groups generalize to all Star Wars characters: Our sample is small, the measures we could use were limited, and our cross-validation did not provide us with much evidence to back up our three distinct groups.

On the other hand, we did have a starting point for how many groups to look for from our R-squared values, which was good, and the groups seem interpretable on the basis of those characters in our three groups.

Try it out

The prcr package used to create the groups and calculate the R-squared values is available in R using install.packages("prcr"). An in-development version with the function for cross-validation is available using the following two commands (if you have devtools installed already, then only the second command is needed):

install.packages("devtools")
devtools::install_github("jrosen48/prcr")

Thanks and credit to Rebecca Steingut, now at Teachers College, Columbia University, for contributing to the in-development version of the package and the cross-validation strategy implemented in it.