`tidyLPA` provides an interface to the powerful `mclust` package to easily carry out Latent Profile Analysis (LPA). Its main contribution is an interface corresponding to the models that are commonly specified when carrying out LPA. Its secondary contribution is to make it easier to use the output in subsequent analyses through a tidy user interface, in that the output is in the form of a `tibble` (closely related to a `data.frame`) that can subsequently be computed on.

The goal of `tidyLPA` is to provide tools that make it easier to use the `R` package `mclust` for Latent Profile Analysis.

`tidyLPA` has been benchmarked against MPlus, at least for a simple dataset (the `iris` dataset) and with three of the more common model specifications. You can find the results of that benchmarking, which showed the results to be nearly identical, here.


Here is a very short example using the built-in dataset `pisaUSA15`.

First, install the package from GitHub:

```
install.packages("devtools")
devtools::install_github("jrosen48/tidyLPA")
```

Here is the simple example, using just a subset of the built-in `pisaUSA15` dataset, which contains data from the 2015 PISA assessment (more details can be found here).

```
library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)
d <- pisaUSA15[1:100, ]
m3 <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 3, model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 3 profiles.
#> LogLik is 279.692
#> AIC is 593.384
#> CAIC is 653.62
#> BIC is 636.62
#> SABIC is 582.951
#> ICL is 681.845
#> Entropy is 0.798
plot_profiles_lpa(m3)
```

Here are more in-depth examples.

`tidyLPA` is presently designed to be used in two ways:

1. In a “data-first” (or tidy) way, akin to the interface popularized in the `dplyr` and `tidyr` (and other) packages; this approach is termed a “tidy” approach in large parts of the R user community
2. In a more conventional, object-oriented way; this approach will be familiar to those who have used functions such as the built-in `lm()` function for general linear models (i.e., regression and ANOVA)

This example makes use of the first, data-first, tidy approach, in which a `data.frame` (or a `tibble` to those who use packages such as `dplyr` and other “tidy” packages) is both input and output by the main function in this package.

Using the built-in `pisaUSA15` dataset (using just a subset of observations for illustrative purposes) and variables for broad interest, enjoyment, and self-efficacy, we can explore a solution with, for example, `4` profiles, and varying means across profiles, but equal variances and covariances (specified with model = `2`, which happens to be the default model if none is specified). More information on the models is provided later in this vignette. Note that the number of profiles is typically chosen on the basis of evidence from multiple sources, including information criteria, statistical tests, and concerns of interpretability and parsimony.

```
library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)
d <- sample_n(pisaUSA15, 500)
create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
#> Warning in create_profiles_lpa(d, broad_interest, enjoyment,
#> self_efficacy, : Some profiles are associated with no assignments.
#> Interpret this solution with caution and consider other models.
#> # A tibble: 469 x 5
#> broad_interest enjoyment self_efficacy profile posterior_prob
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.25 2.40 2.00 3.00 0.377
#> 2 2.40 3.20 2.00 2.00 0.378
#> 3 3.20 2.80 1.75 3.00 0.385
#> 4 1.00 1.00 2.25 1.00 0.999
#> 5 2.80 2.80 1.75 2.00 0.367
#> 6 3.40 3.00 2.38 3.00 0.438
#> 7 3.40 2.00 2.75 3.00 0.399
#> 8 2.40 2.80 1.00 2.00 0.340
#> 9 1.60 2.20 1.75 1.00 0.920
#> 10 2.00 1.00 1.00 1.00 0.994
#> # ... with 459 more rows
```

**Note that if you get the warning “Some profiles are associated with no assignments. Interpret this solution with caution and consider other models”, this is a sign that a simpler model (with fewer profiles) likely fits better.**

You can see the output is simply the same `tibble` that is input as the first argument to `create_profiles_lpa()`, but modified so that the classification and the posterior probability of the classification are added (and incomplete cases, with respect to the variables used to estimate the profiles, are removed).

We can then plot this output using the `plot_profiles_lpa()` function and the pipe (`%>%`) from the `dplyr` package:

```
create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2) %>%
plot_profiles_lpa()
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
#> Warning in create_profiles_lpa(d, broad_interest, enjoyment,
#> self_efficacy, : Some profiles are associated with no assignments.
#> Interpret this solution with caution and consider other models.
```

You can, if you like, use the arguments `center_raw_data` (to center all of the clustering variables to have a mean of 0) or `scale_raw_data` (to scale all of the clustering variables to have a standard deviation of 1) in `create_profiles_lpa()`.
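As a sketch of their use (the argument names are as described above; that they take logical values is an assumption here):

```r
library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)

d <- sample_n(pisaUSA15, 500)

# Estimate profiles on centered and scaled (i.e., standardized) variables
m3_std <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy,
                              n_profiles = 3, model = 2,
                              center_raw_data = TRUE, scale_raw_data = TRUE)
plot_profiles_lpa(m3_std)
```

Standardizing can make profiles easier to compare when the clustering variables are on different scales.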

Below is code identical to the two lines above, except that instead of using the pipe, we save the output to the object `m3`, and then call the `plot_profiles_lpa()` function on the output that it points to.

```
m3 <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
plot_profiles_lpa(m3, to_center = TRUE)
```

We can also return all of the data. This returns a `tibble` with all of the variables in the original data, in this case `d`, so that subsequent analyses using variables that are not used to create the profiles can be easily carried out.

```
create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 4, model = 2, return_orig_df = TRUE)
#> Fit varying means, equal variances and covariances (Model 2) model with 4 profiles.
#> LogLik is 1386.954
#> AIC is 2815.908
#> CAIC is 2924.071
#> BIC is 2903.071
#> SABIC is 2836.421
#> ICL is 3639.904
#> Entropy is 0.485
#> Warning in create_profiles_lpa(d, broad_interest, enjoyment,
#> self_efficacy, : Some profiles are associated with no assignments.
#> Interpret this solution with caution and consider other models.
#> # A tibble: 469 x 6
#> broad_interest enjoyment instrumental_mot self_efficacy profile poster…
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.25 2.40 4.00 2.00 3.00 0.377
#> 2 2.40 3.20 1.00 2.00 2.00 0.378
#> 3 3.20 2.80 2.00 1.75 3.00 0.385
#> 4 1.00 1.00 4.00 2.25 1.00 0.999
#> 5 2.80 2.80 2.25 1.75 2.00 0.367
#> 6 3.40 3.00 2.50 2.38 3.00 0.438
#> 7 3.40 2.00 2.25 2.75 3.00 0.399
#> 8 2.40 2.80 1.75 1.00 2.00 0.340
#> 9 1.60 2.20 2.00 1.75 1.00 0.920
#> 10 2.00 1.00 2.75 1.00 1.00 0.994
#> # ... with 459 more rows
```
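As a sketch of such a follow-up analysis (using base R's `aov()`; the object name `m4_full` is introduced here for illustration), we can compare a retained variable that was not used in the profile estimation, such as `instrumental_mot`, across profiles:

```r
library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)

d <- sample_n(pisaUSA15, 500)
m4_full <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy,
                               n_profiles = 4, model = 2, return_orig_df = TRUE)

# Compare a variable not used in profile estimation across the profiles
aov_out <- aov(instrumental_mot ~ as.factor(profile), data = m4_full)
summary(aov_out)
```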

In addition to being used as part of a “tidy” approach, there is also an option to use the package as part of a more conventional, object-oriented approach. In this example, instead of outputting a `tibble`, we will output an object of class `Mclust`.

```
d <- sample_n(pisaUSA15, 500)
m3_mclust <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 2, model = 2, to_return = "mclust")
#> Fit varying means, equal variances and covariances (Model 2) model with 2 profiles.
#> LogLik is 1430.561
#> AIC is 2887.121
#> CAIC is 2954.326
#> BIC is 2941.326
#> SABIC is 2900.066
#> ICL is 3010.308
#> Entropy is 0.94
```

Any of the functions from the `mclust` package that work with this type of output will work on this output; the only difference is that the model is specified in this package (with the `create_profiles_lpa()` function) instead of with the `Mclust()` function from the `mclust` package.
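For instance (a sketch using the `m3_mclust` object created above; `summary()` and the `$classification` element are standard for `Mclust` objects):

```r
library(tidyLPA)
library(dplyr, warn.conflicts = FALSE)

d <- sample_n(pisaUSA15, 500)
m3_mclust <- create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy,
                                 n_profiles = 2, model = 2, to_return = "mclust")

# Generic mclust tooling applies to the returned Mclust object
summary(m3_mclust)              # parameter and classification summary
head(m3_mclust$classification)  # hard class assignments, as in mclust
```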

As mentioned earlier, there are a number of different models, representing different covariance matrix parameterizations, that can be fit using `tidyLPA`. These are passed to the `model` argument of the `create_profiles_lpa()` function, e.g., `create_profiles_lpa(d, broad_interest, enjoyment, self_efficacy, n_profiles = 3, model = 2, to_return = "tibble")`.

In general, the approach to choosing the model is similar to choosing the number of profiles, requiring deciding on the basis of evidence from multiple sources, including information criteria, statistical tests, and concerns of interpretability and parsimony.

Here is more information on their specification, drawing from Pastor’s (2007) article.

A few notes:

- *p* represents different profiles (or mixture components).
- Each covariance parameterization is represented by a 4 x 4 covariance matrix and therefore represents the parameterization for a model with four variables.

**1. Varying means, equal variances, and covariances fixed to 0 (model 1)**

*corresponds to the mclust model “EEI”, “diagonal, equal volume and shape”*

\[ \left[ \begin{matrix} { \sigma }_{ 1 }^{ 2 } & 0 & 0 & 0 \\ 0 & { \sigma }_{ 2 }^{ 2 } & 0 & 0 \\ 0 & 0 & { \sigma }_{ 3 }^{ 2 } & 0 \\ 0 & 0 & 0 & { \sigma }_{ 4 }^{ 2 } \end{matrix} \right] \]

**2. Varying means, equal variances, and equal covariances (model 2)**

*corresponds to the mclust model “EEE”, “ellipsoidal, equal volume, shape, and orientation”*

\[ \left[ \begin{matrix} { \sigma }_{ 1 }^{ 2 } & { \sigma }_{ 12 } & { \sigma }_{ 13 } & { \sigma }_{ 14 } \\ { \sigma }_{ 12 } & { \sigma }_{ 2 }^{ 2 } & { \sigma }_{ 23 } & { \sigma }_{ 24 } \\ { \sigma }_{ 13 } & { \sigma }_{ 23 } & { \sigma }_{ 3 }^{ 2 } & { \sigma }_{ 34 } \\ { \sigma }_{ 14 } & { \sigma }_{ 24 } & { \sigma }_{ 34 } & { \sigma }_{ 4 }^{ 2 } \end{matrix} \right] \]

**3. Varying means, varying variances, and covariances fixed to 0 (model 3)**

*corresponds to the mclust model “VVI”, “diagonal, varying volume and shape”*

\[ \left[ \begin{matrix} { \sigma }_{ 1p }^{ 2 } & 0 & 0 & 0 \\ 0 & { \sigma }_{ 2p }^{ 2 } & 0 & 0 \\ 0 & 0 & { \sigma }_{ 3p }^{ 2 } & 0 \\ 0 & 0 & 0 & { \sigma }_{ 4p }^{ 2 } \end{matrix} \right] \]

**4. Varying means, varying variances, and equal covariances (model 4)**

*This model is available in MPlus but not Mclust*

\[ \left[ \begin{matrix} { \sigma }_{ 1p }^{ 2 } & { \sigma }_{ 12 } & { \sigma }_{ 13 } & { \sigma }_{ 14 } \\ { \sigma }_{ 12 } & { \sigma }_{ 2p }^{ 2 } & { \sigma }_{ 23 } & { \sigma }_{ 24 } \\ { \sigma }_{ 13 } & { \sigma }_{ 23 } & { \sigma }_{ 3p }^{ 2 } & { \sigma }_{ 34 } \\ { \sigma }_{ 14 } & { \sigma }_{ 24 } & { \sigma }_{ 34 } & { \sigma }_{ 4p }^{ 2 } \end{matrix} \right] \]

**5. Varying means, equal variances, and varying covariances (model 5)**

*This model is available in MPlus but not Mclust*

\[ \left[ \begin{matrix} { \sigma }_{ 1 }^{ 2 } & { \sigma }_{ 12p } & { \sigma }_{ 13p } & { \sigma }_{ 14p } \\ { \sigma }_{ 12p } & { \sigma }_{ 2 }^{ 2 } & { \sigma }_{ 23p } & { \sigma }_{ 24p } \\ { \sigma }_{ 13p } & { \sigma }_{ 23p } & { \sigma }_{ 3 }^{ 2 } & { \sigma }_{ 34p } \\ { \sigma }_{ 14p } & { \sigma }_{ 24p } & { \sigma }_{ 34p } & { \sigma }_{ 4 }^{ 2 } \end{matrix} \right] \]

**6. Varying means, varying variances, and varying covariances (model 6)**

*corresponds to the mclust model “VVV”, “ellipsoidal, varying volume, shape, and orientation”*

\[ \left[ \begin{matrix} { \sigma }_{ 1p }^{ 2 } & { \sigma }_{ 12p } & { \sigma }_{ 13p } & { \sigma }_{ 14p } \\ { \sigma }_{ 12p } & { \sigma }_{ 2p }^{ 2 } & { \sigma }_{ 23p } & { \sigma }_{ 24p } \\ { \sigma }_{ 13p } & { \sigma }_{ 23p } & { \sigma }_{ 3p }^{ 2 } & { \sigma }_{ 34p } \\ { \sigma }_{ 14p } & { \sigma }_{ 24p } & { \sigma }_{ 34p } & { \sigma }_{ 4p }^{ 2 } \end{matrix} \right] \]

We can quickly explore a number of models, both in terms of **the specification of the model** and **the number of profiles**, using the Bayesian Information Criterion (BIC) or the Integrated Completed Likelihood (ICL). For illustration, the built-in (to R) `iris` dataset is used, as a number of the models using the PISA data do not reach convergence.

**Note that although BIC and ICL statistics may appear for the combination of a particular model and number of profiles, the output for these models may have profiles with no observations assigned to them, and so these statistics and figures should be considered starting points, not the sole criterion on which to decide on which model to select.**

`compare_models_lpa(iris, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, statistic = "BIC")`

`compare_models_lpa(iris, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, statistic = "ICL")`

To determine the number of profiles for a specified model (i.e., models `1`-`4` described above), we can carry out a bootstrapped likelihood-ratio test. Note that the code is shown but not run, because it can take substantial time, even for a small dataset.

`bootstrap_lrt(d, broad_interest, enjoyment, self_efficacy, model = 3)`

There is an in-development function to generate the model syntax (and prepare the data) and run the same models through the Mplus software:

```
m1 <- create_profiles_mplus(iris,
                            Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                            n_profiles = 2,
                            model = 1)
#> The first list item is the model output and the second is the saved data with class probabilities.
```

The object `m1` contains the Mplus output, as returned by Mplus, but in an R list; for quick comparisons, `create_profiles_mplus()` prints the log-likelihood, BIC, and entropy statistics.

This is a sister project to `prcr`, for two-step cluster analysis.

Pastor (2007) provides an accessible introduction to LPA, with an applied example in educational research.

To contribute, file issues via GitHub here or get in touch via email or Twitter.