Comparing Mplus and mclust output

Joshua Rosenberg

2018-02-05

This document compares mclust and Mplus output, for the purpose of benchmarking the use of mclust, an open-source tool, to Mplus.estimate_profiles() uses the mclust R package, while estimate_profiles_mplus() generates Mplus model syntax which is then run with Mplus. Note that for the estimate_profiles_mplus() function to work, you must have Mplus installed (and purchased).

library(tidyLPA)
#> tidyLPA provides the functionality to carry out Latent Profile Analysis. Note that tidyLPA is still at the beta stage! 
#> Please report any bugs at https://github.com/jrosen48/tidyLPA or send an email to jrosen@msu.edu.

How functions that use MPlus work

The functions that use MPlus rely on the MplusAutomation R package (and MPlus) to run models from R. For example, let’s run the following function:

m1_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 2,
                                  model = 1)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 488.915
#> BIC is 1042.968
#> Entropy is 0.991

Running this generates four files necessary for running an LPA model through MPlus: 1. a data file (.dat) 1. an input / syntax file (.inp) 1. (from running the model through MPlus) an output file (.out) 1. (also from running the model through MPlus) a modified data file, with the class probabilities and class (profile) assignments (also .dat, but with a different file name than the data file used in the input)

Here, for instance, is the input file dynamically generated by tidyLPA by the function above:

TITLE: test
DATA: File is d.dat;
VARIABLE:
Names are Sepal_Length Sepal_Width Petal_Length Petal_Width;
Classes = c(2);
MODEL:
%overall%
[Sepal_Length Sepal_Width Petal_Length Petal_Width];
Sepal_Length Sepal_Width Petal_Length Petal_Width;
Sepal_Length WITH Sepal_Width@0;
Sepal_Length WITH Petal_Length@0;
Sepal_Length WITH Petal_Width@0;
Sepal_Width WITH Petal_Length@0;
Sepal_Width WITH Petal_Width@0;
Petal_Length WITH Petal_Width@0;
%c#1%
[Sepal_Length Sepal_Width Petal_Length Petal_Width];
Sepal_Length Sepal_Width Petal_Length Petal_Width(1-4);
Sepal_Length WITH Sepal_Width@0;
Sepal_Length WITH Petal_Length@0;
Sepal_Length WITH Petal_Width@0;
Sepal_Width WITH Petal_Length@0;
Sepal_Width WITH Petal_Width@0;
Petal_Length WITH Petal_Width@0;
%c#2%
[Sepal_Length Sepal_Width Petal_Length Petal_Width];
Sepal_Length Sepal_Width Petal_Length Petal_Width(1-4);
Sepal_Length WITH Sepal_Width@0;
Sepal_Length WITH Petal_Length@0;
Sepal_Length WITH Petal_Width@0;
Sepal_Width WITH Petal_Length@0;
Sepal_Width WITH Petal_Width@0;
Petal_Length WITH Petal_Width@0;
ANALYSIS:
Type is mixture;
start = 20 4;
miterations = 500;
stiterations = 10;
convergence = 1e-06;
processors = 1;
OUTPUT: TECH1 TECH11;
SAVEDATA: File is d-mod.dat;
SAVE = CPROBABILITIES;

Along with the data file that is also automatically generated, this input file is the sent to MPlus (you don’t have to open MPlus or do anything - the MplusAutomation package makes this both possible and easy), and then the output file is returned.

We can see how this input file dynamically changes if we change the function’s arguments by removing a variable, increasing the number of profiles, and changing the model that is specified:

m1_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, 
                                  n_profiles = 3,
                                  model = 2, remove_tmp_files=F)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 308.07
#> BIC is 701.32
#> Entropy is 0.917
TITLE: test
DATA: File is d.dat;
VARIABLE:
Names are Sepal_Length Sepal_Width Petal_Length;
Classes = c(3);
MODEL:
%overall%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length;
Sepal_Length WITH Sepal_Width;
Sepal_Length WITH Petal_Length;
Sepal_Width WITH Petal_Length;
%c#1%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length(1-3);
Sepal_Length WITH Sepal_Width(4);
Sepal_Length WITH Petal_Length(5);
Sepal_Width WITH Petal_Length(6);
%c#2%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length(1-3);
Sepal_Length WITH Sepal_Width(4);
Sepal_Length WITH Petal_Length(5);
Sepal_Width WITH Petal_Length(6);
%c#3%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length(1-3);
Sepal_Length WITH Sepal_Width(4);
Sepal_Length WITH Petal_Length(5);
Sepal_Width WITH Petal_Length(6);
ANALYSIS:
Type is mixture;
start = 20 4;
miterations = 500;
stiterations = 10;
convergence = 1e-06;
processors = 1;
OUTPUT: TECH1 TECH11;
SAVEDATA: File is d-mod.dat;
SAVE = CPROBABILITIES;

Before proceeding, I will note that the functions that use MPlus are still under-development; dynamically generating the input files is technically demanding, and it is possible that these functions will change significantly in later versions. Nevertheless, they work well with testing using both built-in data (like the iris and geyser datasets used in this document) as well as a range of other data.

In the rest of this document, Mplus and mclust are benchmarked to one another. For ease of comparison, the estimated log-likelihood; if these are identical it probably means all of the estimated paramteters are identical (other benchmarks we have done with Mplus and mclust have demonstrated this).

Using the built-in iris dataset

2 profiles

Mplus


m1_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 2,
                                  model = 1)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 488.915
#> BIC is 1042.968
#> Entropy is 0.991

m2_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 2,
                                  model = 2)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 296.448
#> BIC is 688.097
#> Entropy is 1

m3_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 2,
                                  model = 3)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 386.185
#> BIC is 857.551
#> Entropy is 1

m5_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 2,
                                  model = 6)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 214.355
#> BIC is 574.018
#> Entropy is 1

mclust

Note that model 4 can only be specified in Mplus, so is not included here.

m1_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 2,
                                 model = 1)
#> Fit varying means, equal variances, covariances fixed to 0 (Model 1) model with 2 profiles.
#> LogLik is 488.915
#> BIC is 1042.968
#> Entropy is 0.998

m2_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 2,
                                 model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 2 profiles.
#> LogLik is 296.448
#> BIC is 688.097
#> Entropy is 1

m3_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 2,
                                 model = 3)
#> Fit varying means and variances, covariances fixed to 0 (Model 3) model with 2 profiles.
#> LogLik is 386.185
#> BIC is 857.551
#> Entropy is 1

m5_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 2,
                                 model = 6)
#> Fit varying means, variances, and covariances (Model 4) model with 2 profiles.
#> LogLik is 214.355
#> BIC is 574.018
#> Entropy is 1

Here is a summary of the log-likelihood statistics:

library(tibble)

iris_log_lik <- tribble(
    ~model, ~mplus, ~mclust,
    1, 488.915, 488.915,
    2, 296.448, 296.448,
    3, 386.185, 386.185,
    6, 214.355, 214.355
)

# iris_log_lik$model <- as.integer(iris_log_lik$model)
#>   model   mplus  mclust
#> 1     1 488.915 488.915
#> 2     2 296.448 296.448
#> 3     3 386.185 386.185
#> 4     6 214.355 214.355

3 profiles

What if we specify 3 profiles?

Mplus

m1_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 3,
                                  model = 1)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 361.426
#> BIC is 813.042
#> Entropy is 0.957

m2_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 3,
                                  model = 2)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 256.354
#> BIC is 632.963
#> Entropy is 0.962

m3_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 3,
                                  model = 3)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 307.178
#> BIC is 744.632
#> Entropy is 0.948

m5_mplus <- estimate_profiles_mplus(iris,
                                  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                  n_profiles = 3,
                                  model = 6)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 180.185
#> BIC is 580.839
#> Entropy is 0.97

mclust

Note that model 4 can only be specified in Mplus, so is not included here.

m1_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 3,
                                 model = 1)
#> Fit varying means, equal variances, covariances fixed to 0 (Model 1) model with 3 profiles.
#> LogLik is 361.429
#> BIC is 813.05
#> Entropy is 0.979

m2_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 3,
                                 model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 3 profiles.
#> LogLik is 256.355
#> BIC is 632.965
#> Entropy is 0.985

m3_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 3,
                                 model = 3)
#> Fit varying means and variances, covariances fixed to 0 (Model 3) model with 3 profiles.
#> LogLik is 307.181
#> BIC is 744.638
#> Entropy is 0.979

m5_mclust <- estimate_profiles(iris,
                                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
                                 n_profiles = 3,
                                 model = 6)
#> Fit varying means, variances, and covariances (Model 4) model with 3 profiles.
#> LogLik is 180.186
#> BIC is 580.84
#> Entropy is 0.99
#>   model   mplus  mclust
#> 1     1 361.426 361.429
#> 2     2 256.354 256.355
#> 3     3 307.178 307.181
#> 4     6 180.185 180.186

Using the built-in old faithful geyser dataset

Mplus


m1_mplus <- estimate_profiles_mplus(faithful,
                                  eruptions, waiting,
                                  n_profiles = 2,
                                  model = 1)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 1157.68
#> BIC is 2354.601
#> Entropy is 0.993

m2_mplus <- estimate_profiles_mplus(faithful,
                                  eruptions, waiting,
                                  n_profiles = 2,
                                  model = 2)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 1140.187
#> BIC is 2325.22
#> Entropy is 0.993

m3_mplus <- estimate_profiles_mplus(faithful,
                                  eruptions, waiting,
                                  n_profiles = 2,
                                  model = 3)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 1147.806
#> BIC is 2346.065
#> Entropy is 0.999

m5_mplus <- estimate_profiles_mplus(faithful,
                                  eruptions, waiting,
                                  n_profiles = 2,
                                  model = 6)
#> Note that this (and other functions that use MPlus) is at the experimental stage! Please provide feedback at https://github.com/jrosen48/tidyLPA
#> LogLik is 1130.264
#> BIC is 2322.192
#> Entropy is 0.996

mclust

Note that model 4 can only be specified in Mplus, so is not included here.

m1_mclust <- estimate_profiles(faithful,
                                 eruptions, waiting,
                                 n_profiles = 2,
                                 model = 1)
#> Fit varying means, equal variances, covariances fixed to 0 (Model 1) model with 2 profiles.
#> LogLik is 1157.68
#> BIC is 2354.601
#> Entropy is 0.998

m2_mclust <- estimate_profiles(faithful,
                                 eruptions, waiting,
                                 n_profiles = 2,
                                 model = 2)
#> Fit varying means, equal variances and covariances (Model 2) model with 2 profiles.
#> LogLik is 1140.187
#> BIC is 2325.22
#> Entropy is 0.998

m3_mclust <- estimate_profiles(faithful,
                                 eruptions, waiting,
                                 n_profiles = 2,
                                 model = 3)
#> Fit varying means and variances, covariances fixed to 0 (Model 3) model with 2 profiles.
#> LogLik is 1147.806
#> BIC is 2346.065
#> Entropy is 1

m5_mclust <- estimate_profiles(faithful,
                                 eruptions, waiting,
                                 n_profiles = 2,
                                 model = 6)
#> Fit varying means, variances, and covariances (Model 4) model with 2 profiles.
#> LogLik is 1130.264
#> BIC is 2322.192
#> Entropy is 0.999

Here is a summary of the log-likelihood statistics:

#>   model    mplus   mclust
#> 1     1 1157.680 1157.680
#> 2     2 1140.187 1140.187
#> 3     3 1147.806 1147.806
#> 4     6 1130.264 1130.264

Next steps

Finding conditions when the models do differ, especially for large or complex datasets, is an important consideration. These differences, when encountered, are possibly due to two reasons; investigating these will be an important next step:

  1. Different initialization routines: Mplus uses random starts, and while mclust can use random starts, it defaults to using hierarchical clustering to initialize.
  2. Different estimation methods: Both use Expectation-Maximization (EM) maximum likelihood estimation methods, though Mplus may use an estimator which is more robust to non-normal distributions.