Background

tidyLPA provides an interface to the powerful mclust package to easily carry out Latent Profile Analysis (LPA). Its main contribution is making it easy to specify the models (i.e., the covariance matrix parameterizations) that are commonly used when carrying out LPA. Its secondary contribution is to make it easier to use the output in subsequent analyses through a tidy user interface, in that the output is a tibble (closely related to a data.frame) that can subsequently be computed on.

The goal of tidyLPA is to provide tools that make it easier to use the R package mclust for Latent Profile Analysis.

tidyLPA has been benchmarked against Mplus, at least for a simple dataset (the iris dataset) and with three of the more common model specifications. You can find the results of that benchmarking, which showed the two to produce nearly identical results, here.

Example

Here is a very short example using the built-in dataset pisaUSA15.

First, install the package from GitHub:

install.packages("devtools")
devtools::install_github("jrosen48/tidyLPA")

Here is the simple example, using just a subset of the built-in pisaUSA15 dataset, which contains data from the 2015 PISA assessment (more details can be found here).

library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)
d <- pisaUSA15[1:100, ]
m3 <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 3, model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 3 profiles.
#> LogLik is 279.692
#> AIC is 593.384
#> CAIC is 653.62
#> BIC is 636.62
#> SABIC is 582.951
#> ICL is 681.845
#> Entropy is 0.798
plot_profiles_lpa(m3)

More in-depth examples

Here are more in-depth examples.

A data-first, tidy approach

tidyLPA is presently designed to be used in two ways:

  1. In a “data-first” (or tidy) way, akin to the interface popularized in the dplyr and tidyr (and other) packages - this approach is often termed a “tidy” approach in the R user community

  2. In a more conventional, object-oriented way - this approach will be familiar to those who have used functions such as the built-in lm() function for general linear models (i.e., regression and ANOVA)

This example makes use of the first, data-first, tidy approach, in which a data.frame (or a tibble to those who use packages such as dplyr and other “tidy” packages) is both input and output by the main function in this package.

Using the built-in pisaUSA15 dataset (with just 500 sampled observations for illustrative purposes) and variables for broad interest, enjoyment, and self-efficacy, we can explore a solution with, for example, 4 profiles, with varying means across profiles but equal variances and covariances (specified with model = 2, which happens to be the default model if none is specified). More information on the models is provided later in this vignette. Note that the number of profiles is typically chosen on the basis of evidence from multiple sources, including information criteria, statistical tests, and concerns of interpretability and parsimony.

library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)
d <- sample_n(pisaUSA15, 500)
create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
#> Warning in create_profiles_lpa(d, broad_interest, enjoyment,
#> self_efficacy, : Some profiles are associated with no assignments.
#> Interpret this solution with caution and consider other models.
#> # A tibble: 469 x 5
#>    broad_interest enjoyment self_efficacy profile posterior_prob
#>             <dbl>     <dbl>         <dbl>   <dbl>          <dbl>
#>  1           3.25      2.40          2.00    3.00          0.377
#>  2           2.40      3.20          2.00    2.00          0.378
#>  3           3.20      2.80          1.75    3.00          0.385
#>  4           1.00      1.00          2.25    1.00          0.999
#>  5           2.80      2.80          1.75    2.00          0.367
#>  6           3.40      3.00          2.38    3.00          0.438
#>  7           3.40      2.00          2.75    3.00          0.399
#>  8           2.40      2.80          1.00    2.00          0.340
#>  9           1.60      2.20          1.75    1.00          0.920
#> 10           2.00      1.00          1.00    1.00          0.994
#> # ... with 459 more rows

Note that if you get the warning “Some profiles are associated with no assignments. Interpret this solution with caution and consider other models”, then this is a sign that a simpler model (with fewer profiles) likely fits better.

You can see that the output is simply the same tibble that was input as the first argument to create_profiles_lpa(), but modified so that the profile classification and the posterior probability of that classification are added (and incomplete cases, with respect to the variables used to estimate the profiles, are removed).
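
Because the output is an ordinary tibble, it can be computed on directly with dplyr verbs. As a minimal sketch (the object name m_tbl is ours, not part of the package), we can count how many observations were assigned to each profile:

m_tbl <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2)
# count() is from dplyr; profile is one of the columns added by create_profiles_lpa()
count(m_tbl, profile)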

We can then plot this output using the plot_profiles_lpa() function and the pipe (%>%) from the dplyr package:

create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2) %>% 
    plot_profiles_lpa()
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
#> Warning in create_profiles_lpa(d, broad_interest, enjoyment,
#> self_efficacy, : Some profiles are associated with no assignments.
#> Interpret this solution with caution and consider other models.

You can, if you like, use the arguments center_raw_data (to center all of the clustering variables to have a mean of 0) or scale_raw_data (to scale all of the clustering variables to have a standard deviation of 1) in create_profiles_lpa().
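
A sketch of what that might look like (the argument names come from the sentence above; passing TRUE is our assumption about how they are switched on):

# Center and scale the clustering variables before the profiles are estimated
create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy,
                    n_profiles = 4, model = 2,
                    center_raw_data = TRUE, scale_raw_data = TRUE) %>% 
    plot_profiles_lpa()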

The code below is identical to the two lines of code above, except that instead of using the pipe, we save the output to the object m3 and then call plot_profiles_lpa() on it (here also passing to_center = TRUE to the plotting function).

m3 <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
plot_profiles_lpa(m3, to_center = TRUE)

We can also return all of the data. This returns a tibble with all of the variables in the original data, in this case d, so that subsequent analyses using variables that were not used to create the profiles can easily be carried out.

create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2, return_orig_df = TRUE)
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
#> Warning in create_profiles_lpa(d, broad_interest, enjoyment,
#> self_efficacy, : Some profiles are associated with no assignments.
#> Interpret this solution with caution and consider other models.
#> # A tibble: 469 x 6
#>    broad_interest enjoyment instrumental_mot self_efficacy profile poster…
#>             <dbl>     <dbl>            <dbl>         <dbl>   <dbl>   <dbl>
#>  1           3.25      2.40             4.00          2.00    3.00   0.377
#>  2           2.40      3.20             1.00          2.00    2.00   0.378
#>  3           3.20      2.80             2.00          1.75    3.00   0.385
#>  4           1.00      1.00             4.00          2.25    1.00   0.999
#>  5           2.80      2.80             2.25          1.75    2.00   0.367
#>  6           3.40      3.00             2.50          2.38    3.00   0.438
#>  7           3.40      2.00             2.25          2.75    3.00   0.399
#>  8           2.40      2.80             1.75          1.00    2.00   0.340
#>  9           1.60      2.20             2.00          1.75    1.00   0.920
#> 10           2.00      1.00             2.75          1.00    1.00   0.994
#> # ... with 459 more rows
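
Because all of the original variables are returned, a follow-up analysis can use variables that were not part of the profile estimation. As a minimal sketch (the object name d_out is ours; instrumental_mot appears in the returned tibble above but was not used to estimate the profiles), we can compare its mean across profiles:

d_out <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2, return_orig_df = TRUE)
d_out %>% 
    group_by(profile) %>%  # profile is the classification added by create_profiles_lpa()
    summarize(mean_instrumental_mot = mean(instrumental_mot, na.rm = TRUE))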

An object-oriented approach

In addition to being used as part of a “tidy” approach, tidyLPA can also be used as part of a more conventional, object-oriented approach. In this example, instead of outputting a tibble, we will output an object of class Mclust.

d <- sample_n(pisaUSA15, 500)
m3_mclust <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 2, model = 2, to_return = "mclust")
#> Fit varying means, equal variances and covariances (Model 2) model with 2 profiles.
#> LogLik is 1430.561
#> AIC is 2887.121
#> CAIC is 2954.326
#> BIC is 2941.326
#> SABIC is 2900.066
#> ICL is 3010.308
#> Entropy is 0.94

Any of the functions from the mclust package that work with this type of output will work on this output; the only difference is that the model is specified in this package (with the create_profiles_lpa() function) instead of the Mclust() function from the mclust package.
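
For example (a sketch; summary() and plot() here are the standard mclust methods, not tidyLPA functions):

summary(m3_mclust)                        # mclust's summary of the fitted mixture model
plot(m3_mclust, what = "classification")  # mclust's classification plot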

Models

As mentioned earlier, there are a number of different models, representing different covariance matrix parameterizations, that can be fit using tidyLPA. These are passed to the model argument of the create_profiles_lpa() function, e.g., create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 3, model = 2, to_return = "tibble").

In general, the approach to choosing the model is similar to choosing the number of profiles, requiring deciding on the basis of evidence from multiple sources, including information criteria, statistical tests, and concerns of interpretability and parsimony.

Here is more information on their specification, drawing from Pastor’s (2007) article.

A few notes:

  • The subscript p indexes profiles (or mixture components): parameters with a p subscript are estimated separately for each profile, while parameters without it are constrained to be equal across profiles.

  • Each covariance parameterization is shown below as a 4 x 4 covariance matrix, i.e., the parameterization for a model with four observed variables.

1. Varying means, equal variances, and covariances fixed to 0 (model 1)

corresponds to the mclust model “EEI”, “diagonal, equal volume and shape”

\[ \left[ \begin{matrix} { \sigma }_{ 1 }^{ 2 } & 0 & 0 & 0 \\ 0 & { \sigma }_{ 2 }^{ 2 } & 0 & 0 \\ 0 & 0 & { \sigma }_{ 3 }^{ 2 } & 0 \\ 0 & 0 & 0 & { \sigma }_{ 4 }^{ 2 } \end{matrix} \right] \]

2. Varying means, equal variances, and equal covariances (model 2)

corresponds to the mclust model “EEE”, “ellipsoidal, equal volume, shape, and orientation”

\[ \left[ \begin{matrix} { \sigma }_{ 1 }^{ 2 } & { \sigma }_{ 12 } & { \sigma }_{ 13 } & { \sigma }_{ 14 } \\ { \sigma }_{ 12 } & { \sigma }_{ 2 }^{ 2 } & { \sigma }_{ 23 } & { \sigma }_{ 24 } \\ { \sigma }_{ 13 } & { \sigma }_{ 23 } & { \sigma }_{ 3 }^{ 2 } & { \sigma }_{ 34 } \\ { \sigma }_{ 14 } & { \sigma }_{ 24 } & { \sigma }_{ 34 } & { \sigma }_{ 4 }^{ 2 } \end{matrix} \right] \]

3. Varying means, varying variances, and covariances fixed to 0 (model 3)

corresponds to the mclust model “VVI”, “diagonal, varying volume and shape”

\[ \left[ \begin{matrix} { \sigma }_{ 1p }^{ 2 } & 0 & 0 & 0 \\ 0 & { \sigma }_{ 2p }^{ 2 } & 0 & 0 \\ 0 & 0 & { \sigma }_{ 3p }^{ 2 } & 0 \\ 0 & 0 & 0 & { \sigma }_{ 4p }^{ 2 } \end{matrix} \right] \]

4. Varying means, varying variances, and equal covariances (model 4)

This model is available in Mplus but not in mclust.

\[ \left[ \begin{matrix} { \sigma }_{ 1p }^{ 2 } & { \sigma }_{ 12 } & { \sigma }_{ 13 } & { \sigma }_{ 14 } \\ { \sigma }_{ 12 } & { \sigma }_{ 2p }^{ 2 } & { \sigma }_{ 23 } & { \sigma }_{ 24 } \\ { \sigma }_{ 13 } & { \sigma }_{ 23 } & { \sigma }_{ 3p }^{ 2 } & { \sigma }_{ 34 } \\ { \sigma }_{ 14 } & { \sigma }_{ 24 } & { \sigma }_{ 34 } & { \sigma }_{ 4p }^{ 2 } \end{matrix} \right] \]

5. Varying means, equal variances, and varying covariances (model 5)

This model is available in Mplus but not in mclust.

\[ \left[ \begin{matrix} { \sigma }_{ 1 }^{ 2 } & { \sigma }_{ 12p } & { \sigma }_{ 13p } & { \sigma }_{ 14p } \\ { \sigma }_{ 12p } & { \sigma }_{ 2 }^{ 2 } & { \sigma }_{ 23p } & { \sigma }_{ 24p } \\ { \sigma }_{ 13p } & { \sigma }_{ 23p } & { \sigma }_{ 3 }^{ 2 } & { \sigma }_{ 34p } \\ { \sigma }_{ 14p } & { \sigma }_{ 24p } & { \sigma }_{ 34p } & { \sigma }_{ 4 }^{ 2 } \end{matrix} \right] \]

6. Varying means, varying variances, and varying covariances (model 6)

corresponds to the mclust model “VVV”, “ellipsoidal, varying volume, shape, and orientation”

\[ \left[ \begin{matrix} { \sigma }_{ 1p }^{ 2 } & { \sigma }_{ 12p } & { \sigma }_{ 13p } & { \sigma }_{ 14p } \\ { \sigma }_{ 12p } & { \sigma }_{ 2p }^{ 2 } & { \sigma }_{ 23p } & { \sigma }_{ 24p } \\ { \sigma }_{ 13p } & { \sigma }_{ 23p } & { \sigma }_{ 3p }^{ 2 } & { \sigma }_{ 34p } \\ { \sigma }_{ 14p } & { \sigma }_{ 24p } & { \sigma }_{ 34p } & { \sigma }_{ 4p }^{ 2 } \end{matrix} \right] \]

Comparing many models using information criteria

We can quickly explore a number of models, in terms of both the model specification and the number of profiles, using Bayesian Information Criterion (BIC) values or the Integrated Completed Likelihood (ICL). For illustration, the built-in (to R) iris dataset is used, because a number of the models using the PISA data do not reach convergence.

Note that although BIC and ICL statistics may appear for a particular combination of model and number of profiles, the output for these models may have profiles with no observations assigned to them, and so these statistics and figures should be considered starting points, not the sole criterion for deciding which model to select.

compare_models_lpa(iris, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, statistic = "BIC")

compare_models_lpa(iris, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, statistic = "ICL")

Bootstrapped likelihood-ratio test

To determine the number of profiles for a specified model (i.e., models 1-4 described above), we can carry out a bootstrapped likelihood-ratio test. Note that the code is shown but not run, because it can take substantial time, even for a small dataset.

bootstrap_lrt(d, broad_interest, enjoyment, self_efficacy, model = 3)

Run the same models through Mplus

There is an in-development function to generate the model syntax (and prepare the data) and run the same models through the Mplus software:

m1 <- create_profiles_mplus(iris,
                            Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                            n_profiles = 2,
                            model = 1)
#> This first list item is the model output and the second is the save data with class probabilities.

The object m1 contains the Mplus output, as returned by Mplus, but in an R list; for quick comparisons, create_profiles_mplus() prints the log-likelihood, BIC, and entropy statistics.
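
As the printed message above indicates, the first element of the list is the model output and the second is the saved data with class probabilities; a sketch of pulling these out (assuming only that two-element structure):

mplus_output <- m1[[1]]  # model output, as returned by Mplus
saved_data   <- m1[[2]]  # data with class (profile) probabilities appended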

Notes

  • This is a sister-project to prcr, for two-step cluster analysis.

  • Pastor (2007) provides an accessible introduction to LPA, with an applied example in educational research.

  • To contribute, file issues via GitHub here or get in touch via email or Twitter.