Background

How many groups, or types, of Star Wars characters are there? I’ve been wanting to use the starwars dataset built-in to the dplyr package, and at the same time, have been working hard on an R package to carry out an analysis suited to doing this. Part of the challenge of using the approach in this R package is determining how groups groups there are.

Many approaches (Latent Profile Analysis, for example) use Maximum Likelihood estimation (while the approach I’ve developed uses a two-step cluster analysis based around the geometric (and algebraic) idea of “distance”, or how close (similar) observations are). This is easy enough when we’re talking about something like length. If something is 4 long and another thing 8, then what is there distance (4!)? When we’re talking about more than just length - say, length and width - then it’s the exact same idea, except the distance represents how far two things are across both measures - length and width.

But back to groups of Star Wars characters. How many are there? Let’s see what data we have:

library(dplyr)

starwars

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

It looks like we only have three measures that are numbers (height, mass, and birth_year) - though there are others we could possibly turn into numbers (maybe), and there are other approaches (Latent Class Analysis) that can deal with non-numeric measures (such as hair_color). But we’ll have to stick to the three measures that are numbers, for better or worse, for now.

R²

Let’s first take a look at the plot of R² values, which are obtained from the second of the two steps of the cluster analysis - the k-means step (I say this because there are other, perhaps better, ways to calculate the R-squared values, such as from a MANOVA).

We just list the name of the data and the variables we would like to use. Since birth_year is on a very different metric than the other two variables, we’ll set to_scale and to_center to TRUE. We’ll also return a table, instead of a plot.

library(prcr)

plot_r_squared(starwars, height, mass, birth_year, to_scale = TRUE, to_center = TRUE, r_squared_table = T)

## ################################

## Clustering data for iteration 2

## Clustering data for iteration 3

## Clustering data for iteration 4

## Clustering data for iteration 5

## Clustering data for iteration 6

## Clustering data for iteration 7

## Clustering data for iteration 8

## Clustering data for iteration 9

## ################################

##   cluster r_squared_value
## 1       2           0.507
## 2       3              NA
## 3       4              NA
## 4       5              NA
## 5       6              NA
## 6       7              NA
## 7       8              NA
## 8       9              NA

Ooh! Not good. Before the second of the two steps settled on the groups, it ended up with a group with no observations. This is probably in part the result of a small sample, and possibly attributable to the measures we used - and maybe some missing data for some of the measures. Let’s take a look at the data:

starwars_ss <- select(starwars, height, mass, birth_year)
skimr::skim(starwars_ss)

Table 1: Data summary
Name	starwars_ss
Number of rows	87
Number of columns	3
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
height	6	0.93	174.36	34.77	66	167.0	180	191.0	264	▁▁▇▅▁
mass	28	0.68	97.31	169.46	15	55.6	79	84.5	1358	▇▁▁▁▁
birth_year	44	0.49	87.57	154.69	8	35.0	52	72.0	896	▇▁▁▁▁

It looks like the birth_year is missing for a lot - 44 - of the observations for the 87 Star Wars characters we have. We’re down to the bare-bones number of measures, but let’s try with just height and mass. We probably don’t need to scale the data.

plot_r_squared(starwars, height, mass, to_scale = TRUE, to_center = TRUE, r_squared_table = T)

## ################################

## Clustering data for iteration 2

## Clustering data for iteration 3

## Clustering data for iteration 4

## Clustering data for iteration 5

## Clustering data for iteration 6

## Clustering data for iteration 7

## Clustering data for iteration 8

## Clustering data for iteration 9

## ################################

##   cluster r_squared_value
## 1       2           0.485
## 2       3           0.872
## 3       4              NA
## 4       5              NA
## 5       6           0.977
## 6       7              NA
## 7       8              NA
## 8       9              NA

That’s better - in a sense. We have two, three, and six groups solutions. I wouldn’t trust the six group solution very much. The R² value does increase substantialy between two and three groups. This suggests maybe there are three groups (when we use just the measures for weight and mass).

Groups

two_profiles <- create_profiles(starwars, height, mass, n_profiles = 2, to_scale = TRUE, to_center = TRUE)
plot(two_profiles)

three_profiles <- create_profiles(starwars, height, mass, n_profiles = 3, to_scale = TRUE, to_center = TRUE)
plot(three_profiles)

The third group: Massive, not so tall

It looks like there is one very massive (literally) observation that makes up one profile in both the two and three profile solutions. Who is it?

three_profiles$.data %>% 
    filter(cluster == 3) %>% 
    knitr::kable()

name	height	mass	hair_color	skin_color	eye_color	birth_year	gender	homeworld	species	films	vehicles	starships	cluster
Jabba Desilijic Tiure	175	1358	NA	green-tan, brown	orange	600	hermaphrodite	Nal Hutta	Hutt	c(“The Phantom Menace”, “Return of the Jedi”, “A New Hope”)	character(0)	character(0)	3

Jabba. Of course. It looks like with two or three groups, Jabba ends up in one cluster.

The second group: Less massive, small height

What about the seven - who seem to be less massive and with a small height - in the second group?

three_profiles$.data %>% 
    filter(cluster == 2) %>% 
    knitr::kable()

name	height	mass	hair_color	skin_color	eye_color	birth_year	gender	homeworld	species	films	vehicles	starships	cluster
R2-D2	96	32	NA	white, blue	red	33	NA	Naboo	Droid	c(“Attack of the Clones”, “The Phantom Menace”, “Revenge of the Sith”, “Return of the Jedi”, “The Empire Strikes Back”, “A New Hope”, “The Force Awakens”)	character(0)	character(0)	2
R5-D4	97	32	NA	white, red	red	NA	NA	Tatooine	Droid	A New Hope	character(0)	character(0)	2
Yoda	66	17	white	green	brown	896	male	NA	Yoda’s species	c(“Attack of the Clones”, “The Phantom Menace”, “Revenge of the Sith”, “Return of the Jedi”, “The Empire Strikes Back”)	character(0)	character(0)	2
Wicket Systri Warrick	88	20	brown	brown	brown	8	male	Endor	Ewok	Return of the Jedi	character(0)	character(0)	2
Sebulba	112	40	none	grey, red	orange	NA	male	Malastare	Dug	The Phantom Menace	character(0)	character(0)	2
Dud Bolt	94	45	none	blue, grey	yellow	NA	male	Vulpter	Vulptereen	The Phantom Menace	character(0)	character(0)	2
Ratts Tyerell	79	15	none	grey, blue	unknown	NA	male	Aleen Minor	Aleena	The Phantom Menace	character(0)	character(0)	2

These seem to be droids, Yoda, and some other tiny characters.

(Some from) the first group: Above average height, below average mass

The 51 in the first group, with slightly above average height, and slightly below average mass? It’s a big group, so here are just the first six, with a lot of familiar characters:

three_profiles$.data %>% 
    filter(cluster == 1) %>% 
    head() %>% 
    knitr::kable()

name	height	mass	hair_color	skin_color	eye_color	birth_year	gender	homeworld	species	films	vehicles	starships	cluster
Luke Skywalker	172	77	blond	fair	blue	19.0	male	Tatooine	Human	c(“Revenge of the Sith”, “Return of the Jedi”, “The Empire Strikes Back”, “A New Hope”, “The Force Awakens”)	c(“Snowspeeder”, “Imperial Speeder Bike”)	c(“X-wing”, “Imperial shuttle”)	1
C-3PO	167	75	NA	gold	yellow	112.0	NA	Tatooine	Droid	c(“Attack of the Clones”, “The Phantom Menace”, “Revenge of the Sith”, “Return of the Jedi”, “The Empire Strikes Back”, “A New Hope”)	character(0)	character(0)	1
Darth Vader	202	136	none	white	yellow	41.9	male	Tatooine	Human	c(“Revenge of the Sith”, “Return of the Jedi”, “The Empire Strikes Back”, “A New Hope”)	character(0)	TIE Advanced x1	1
Leia Organa	150	49	brown	light	brown	19.0	female	Alderaan	Human	c(“Revenge of the Sith”, “Return of the Jedi”, “The Empire Strikes Back”, “A New Hope”, “The Force Awakens”)	Imperial Speeder Bike	character(0)	1
Owen Lars	178	120	brown, grey	light	blue	52.0	male	Tatooine	Human	c(“Attack of the Clones”, “Revenge of the Sith”, “A New Hope”)	character(0)	character(0)	1
Beru Whitesun lars	165	75	brown	light	blue	47.0	female	Tatooine	Human	c(“Attack of the Clones”, “Revenge of the Sith”, “A New Hope”)	character(0)	character(0)	1

Cross-validation

The other technique for determining the number of groups, cross-validation, may be folly because of how it works: Split the data into two, and see how well groups in one half can be reproduced in the other. This may be a problem due to the Jabba-group.

We’ll use the same arguments except for plot_r_squared, which we don’t need, and for one argument, n_profiles, for how many groups we want to cross-validate the groupings for (we have to deal with complete cases, which is what the first two lines are for), for the three group solution:

starwars_ss <- starwars_ss[complete.cases(starwars_ss), ]
cross_validate(starwars_ss, height, mass, n_profiles = 2, to_scale = TRUE, to_center = TRUE)

Not pretty. Convergence issues galore (I decided not to print the messages because there were so many). The Fleiss’ Kappa was close to 0; the percentage agreement 0.61.

Conclusion

Looking at height and weight, we seem to be able to identify three broad groups of Star Wars characters. However, we shouldn’t have a ton of confidence in howe well these groups generalize to all Star Wars characters: Our sample is small, the measures we could use were limited, and our cross-validation did not provide us with much evidence to back up our three distinct groups.

On the other hand, we did have a starting point for how many groups to look for from our R² values, which was good, and the groups seem interpretable on the basis of those characters in our three groups.

Try it out

The prcr package used to create the groups and calculate the R² values is available in R using install.packages("prcr"). An in-development version with the function for cross-validation is available using the following two commands (if you have devtools installed already then only the second command is needed:

install.packages("devtools")
devtools::install_github("jrosen48/prcr")

Thanks and credit to Rebecca Steingut now at Teachers’s College - Columbia University for contributing to the in-development version of the package and the cross validation strategy implemented in it.

How many groups of Star Wars characters are there? R-squared and cross-validation approaches

2017/07/02