---
title: "Comparing Mplus and mclust output"
author: "Joshua Rosenberg"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to tidyLPA}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = F
)
```
This document compares mclust and Mplus output, for the purpose of benchmarking the use of mclust, an open-source tool, to Mplus.`estimate_profiles()` uses the mclust R package, while `estimate_profiles_mplus()` generates Mplus model syntax which is then run with Mplus. Note that for the `estimate_profiles_mplus()` function to work, you must have Mplus installed (and purchased).
```{r}
library(tidyLPA)
```
# How functions that use MPlus work
The functions that use MPlus rely on the MplusAutomation R package (and MPlus) to run models from R. For example, let's run the following function:
```{r}
m1_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 1)
```
Running this generates four files necessary for running an LPA model through MPlus:
1. a data file (`.dat`)
1. an input / syntax file (`.inp`)
1. (from running the model through MPlus) an output file (`.out`)
1. (also from running the model through MPlus) a modified data file, with the class probabilities and class (profile) assignments (also `.dat`, but with a different file name than the data file used in the input)
Here, for instance, is the input file dynamically generated by tidyLPA by the function above:
```{}
TITLE: test
DATA: File is d.dat;
VARIABLE:
Names are Sepal_Length Sepal_Width Petal_Length Petal_Width;
Classes = c(2);
MODEL:
%overall%
[Sepal_Length Sepal_Width Petal_Length Petal_Width];
Sepal_Length Sepal_Width Petal_Length Petal_Width;
Sepal_Length WITH Sepal_Width@0;
Sepal_Length WITH Petal_Length@0;
Sepal_Length WITH Petal_Width@0;
Sepal_Width WITH Petal_Length@0;
Sepal_Width WITH Petal_Width@0;
Petal_Length WITH Petal_Width@0;
%c#1%
[Sepal_Length Sepal_Width Petal_Length Petal_Width];
Sepal_Length Sepal_Width Petal_Length Petal_Width(1-4);
Sepal_Length WITH Sepal_Width@0;
Sepal_Length WITH Petal_Length@0;
Sepal_Length WITH Petal_Width@0;
Sepal_Width WITH Petal_Length@0;
Sepal_Width WITH Petal_Width@0;
Petal_Length WITH Petal_Width@0;
%c#2%
[Sepal_Length Sepal_Width Petal_Length Petal_Width];
Sepal_Length Sepal_Width Petal_Length Petal_Width(1-4);
Sepal_Length WITH Sepal_Width@0;
Sepal_Length WITH Petal_Length@0;
Sepal_Length WITH Petal_Width@0;
Sepal_Width WITH Petal_Length@0;
Sepal_Width WITH Petal_Width@0;
Petal_Length WITH Petal_Width@0;
ANALYSIS:
Type is mixture;
start = 20 4;
miterations = 500;
stiterations = 10;
convergence = 1e-06;
processors = 1;
OUTPUT: TECH1 TECH11;
SAVEDATA: File is d-mod.dat;
SAVE = CPROBABILITIES;
```
Along with the data file that is also automatically generated, this input file is the sent to MPlus (you don't have to open MPlus or do anything - the MplusAutomation package makes this both possible and easy), and then the output file is returned.
We can see how this input file dynamically changes if we change the function's arguments by removing a variable, increasing the number of profiles, and changing the model that is specified:
```{r}
m1_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length,
n_profiles = 3,
model = 2, remove_tmp_files=F)
```
```{}
TITLE: test
DATA: File is d.dat;
VARIABLE:
Names are Sepal_Length Sepal_Width Petal_Length;
Classes = c(3);
MODEL:
%overall%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length;
Sepal_Length WITH Sepal_Width;
Sepal_Length WITH Petal_Length;
Sepal_Width WITH Petal_Length;
%c#1%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length(1-3);
Sepal_Length WITH Sepal_Width(4);
Sepal_Length WITH Petal_Length(5);
Sepal_Width WITH Petal_Length(6);
%c#2%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length(1-3);
Sepal_Length WITH Sepal_Width(4);
Sepal_Length WITH Petal_Length(5);
Sepal_Width WITH Petal_Length(6);
%c#3%
[Sepal_Length Sepal_Width Petal_Length];
Sepal_Length Sepal_Width Petal_Length(1-3);
Sepal_Length WITH Sepal_Width(4);
Sepal_Length WITH Petal_Length(5);
Sepal_Width WITH Petal_Length(6);
ANALYSIS:
Type is mixture;
start = 20 4;
miterations = 500;
stiterations = 10;
convergence = 1e-06;
processors = 1;
OUTPUT: TECH1 TECH11;
SAVEDATA: File is d-mod.dat;
SAVE = CPROBABILITIES;
```
Before proceeding, I will note that the functions that use MPlus are still under-development; dynamically generating the input files is technically demanding, and it is possible that these functions will change significantly in later versions. Nevertheless, they work well with testing using both built-in data (like the iris and geyser datasets used in this document) as well as a range of other data.
In the rest of this document, Mplus and mclust are benchmarked to one another. For ease of comparison, the estimated log-likelihood; if these are identical it probably means all of the estimated paramteters are identical ([other benchmarks we have done](https://jrosen48.github.io/blog/comparing-mplus-and-mclust-output/) with Mplus and mclust have demonstrated this).
# Using the built-in iris dataset
### 2 profiles
**Mplus**
```{r}
m1_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 1)
m2_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 2)
m3_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 3)
m5_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 6)
```
**mclust**
Note that model 4 can only be specified in Mplus, so is not included here.
```{r}
m1_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 1)
m2_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 2)
m3_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 3)
m5_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 2,
model = 6)
```
Here is a summary of the log-likelihood statistics:
```{r}
library(tibble)
iris_log_lik <- tribble(
~model, ~mplus, ~mclust,
1, 488.915, 488.915,
2, 296.448, 296.448,
3, 386.185, 386.185,
6, 214.355, 214.355
)
# iris_log_lik$model <- as.integer(iris_log_lik$model)
```
```{r, echo = F}
as.data.frame(iris_log_lik)
```
### 3 profiles
What if we specify 3 profiles?
**Mplus**
```{r}
m1_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 1)
m2_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 2)
m3_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 3)
m5_mplus <- estimate_profiles_mplus(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 6)
```
**mclust**
Note that model 4 can only be specified in Mplus, so is not included here.
```{r}
m1_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 1)
m2_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 2)
m3_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 3)
m5_mclust <- estimate_profiles(iris,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
n_profiles = 3,
model = 6)
```
```{r, echo = F}
iris_log_lik_3 <- tribble(
~model, ~mplus, ~mclust,
1, 361.426, 361.429,
2, 256.354, 256.355,
3, 307.178, 307.181,
6, 180.185, 180.186
)
# iris_log_lik_3$model <- as.integer(iris_log_lik_3$model)
```
```{r, echo = F}
as.data.frame(iris_log_lik_3)
```
# Using the built-in old faithful geyser dataset
**Mplus**
```{r}
m1_mplus <- estimate_profiles_mplus(faithful,
eruptions, waiting,
n_profiles = 2,
model = 1)
m2_mplus <- estimate_profiles_mplus(faithful,
eruptions, waiting,
n_profiles = 2,
model = 2)
m3_mplus <- estimate_profiles_mplus(faithful,
eruptions, waiting,
n_profiles = 2,
model = 3)
m5_mplus <- estimate_profiles_mplus(faithful,
eruptions, waiting,
n_profiles = 2,
model = 6)
```
**mclust**
Note that model 4 can only be specified in Mplus, so is not included here.
```{r}
m1_mclust <- estimate_profiles(faithful,
eruptions, waiting,
n_profiles = 2,
model = 1)
m2_mclust <- estimate_profiles(faithful,
eruptions, waiting,
n_profiles = 2,
model = 2)
m3_mclust <- estimate_profiles(faithful,
eruptions, waiting,
n_profiles = 2,
model = 3)
m5_mclust <- estimate_profiles(faithful,
eruptions, waiting,
n_profiles = 2,
model = 6)
```
Here is a summary of the log-likelihood statistics:
```{r, echo = F}
geyser_log_lik <- tribble(
~model, ~mplus, ~mclust,
1, 1157.68, 1157.68,
2, 1140.187, 1140.187,
3, 1147.806, 1147.806,
6, 1130.264, 1130.264
)
# geyser_log_lik$model <- as.integer(geyser_log_lik$model)
```
```{r, echo = F}
as.data.frame(geyser_log_lik)
```
# Next steps
Finding conditions when the models do differ, especially for large or complex datasets, is an important consideration. These differences, when encountered, are possibly due to two reasons; investigating these will be an important next step:
1. Different initialization routines: Mplus uses random starts, and while mclust can use random starts, it defaults to using hierarchical clustering to initialize.
2. Different estimation methods: Both use Expectation-Maximization (EM) maximum likelihood estimation methods, though Mplus may use an estimator which is more robust to non-normal distributions.