This document aims at providing a first and basic introduction to the implementation of mixed models in R. We use an applicative example coming from the APEX project, studing the impact of a cognitive program. There might be many ways to study this dataset, and uncountable models that could be defined, keep in mind that the following are just illustrative examples. Although various R packages are available to fit mixed models, which are widely studied objects (still an active research topic), the most famous is probably lme4. The syntaxe is pretty straightforward and the implementation is powerfull. Actually, the full presentation of the package is way beyond the scope of the present workshop. You can find the full documentation here: https://cran.r-project.org/web/packages/lme4/lme4.pdf. Besides, here is an additional (and awesome) reference to go (way) further on mixed models with R: https://m-clark.github.io/mixed-models-with-R/.
As for any statistical analysis, let's first load the packages that we will need later on. The meta-package tidyverse provides a really convenient and unified framework to handle data management such as cleaning, tidying, and visualizing data. I must recommend you to use it as often as possible.
## Uncomment if you don't have installed those packages yet
## install.packages(c('tidyverse', 'lme4', 'BEST'))
library(tidyverse)
library(lme4)
Then, we may import the dataset and clean the data as needed. For our simple following examples, we only focus on the variable Stroop_Cog_diff that represents (maybe? probably? Nothing's clear, I'm way out of my safety zone here) a score to a cognitive test performed by children. Since, the variable is originally coded as a string of characters, I rename the variable output and transform it back as actual numbers.
raw_db = read_csv2("documents/APEXAdo_Glucose_dataset.csv")
db = raw_db %>% mutate(output = as.numeric(Stroop_Cog_diff))
For information let's display a quick teasing of the raw data. Notice, that we will use the variable groupe, which codes whether a subject was in the control or the intervention group for these tests. There are way more variables, but it's often better to start slowly, and avoid introducing too many cofounding varables and possibly many biases.
head(db)
## # A tibble: 6 x 17
## code_sujet groupe sexe age age_T0_mois revenus Time_num Time
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 01ZA001 CA F 16 197 2 1_pre pre
## 2 01LJ004 CA F 16 199 6 1_pre pre
## 3 01FL007 CA F 16 206 6 1_pre pre
## 4 01GI009 CA F 16 203 NA 1_pre pre
## 5 01AS011 CA F 16 202 5 1_pre pre
## 6 01ML012 CA F 16 197 4 1_pre pre
## # ... with 9 more variables: heure_aprox_bat_T0 <chr>, heure_1cat_gly <chr>,
## # glycemie_mg_dl_BAT <dbl>, gly2 <dbl>, gly_cat <chr>, Session <chr>,
## # Stroop_Cog_diff <chr>, Stroop_Emo_diff <chr>, output <dbl>
Using the lme4 package, defining a mixed model is pretty straightforward. We can merely use the function lmer() in which we precise the structure of the relationship as first argument, and the dataset as second. First, we chose to study the effect of the intervention alone and thus specify the variable groupe (coding whether we are in control or mindful meditation group) as a fixed effect and the suject as a random effect (since there are several observations, probably dependent, for each subject). Below, we would consider that the random effects only impacts the intercept, meaning that we consider each subject as having its own skills in performing thoses tasks, regarless of the intervention. The intercept is represented by the number 1, between the brackets, which them indicate the random effect.
model_simple = lmer(output ~ groupe + (1|code_sujet), data = db)
summary(model_simple)
## Linear mixed model fit by REML ['lmerMod']
## Formula: output ~ groupe + (1 | code_sujet)
## Data: db
##
## REML criterion at convergence: 2536.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.72910 -0.60567 -0.07112 0.48903 3.00458
##
## Random effects:
## Groups Name Variance Std.Dev.
## code_sujet (Intercept) 1595 39.93
## Residual 11902 109.10
## Number of obs: 207, groups: code_sujet, 53
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 82.069 12.908 6.358
## groupePC -9.407 18.771 -0.501
##
## Correlation of Fixed Effects:
## (Intr)
## groupePC -0.688
The function summary allows us to dig into the results of the estimation of the model. There are many different thing to look at, but the more important are the estimation of the coefficient and the analysis of there variance. The analysis of those results strongly depends on the context and the expert knowledge, thus we will remain careful in this document and let this aspect to the workshop comments and discussion.
If we acknoledge that the intervention should have an effect, we may introduce the variable Time coding the pre/post timing, to define something slightly more sophisticate. We could imagine that not only the individuals have their own natural skills, but they also present their indivudal effect of the intervention (more or less habits with a tablet for example), and thus specify the random effect on Time coefficient as well.
model = lmer(output ~ groupe + (1 + Time|code_sujet), data = db)
summary(model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: output ~ groupe + (1 + Time | code_sujet)
## Data: db
##
## REML criterion at convergence: 2528.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3324 -0.5011 -0.0318 0.4109 2.9337
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## code_sujet (Intercept) 5195 72.08
## Timepre 8090 89.94 -0.76
## Residual 9192 95.88
## Number of obs: 207, groups: code_sujet, 53
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 84.66 12.94 6.543
## groupePC -13.26 18.81 -0.705
##
## Correlation of Fixed Effects:
## (Intr)
## groupePC -0.688
As an illustration of the potential to define more (possibly too much) complex models, here is an example that seemed relevant to me, among the many other possibilities.
model = lmer(output ~ sexe + age + glycemie_mg_dl_BAT + Time + Session + (1 |groupe:code_sujet), data = db)
summary(model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: output ~ sexe + age + glycemie_mg_dl_BAT + Time + Session + (1 |
## groupe:code_sujet)
## Data: db
##
## REML criterion at convergence: 2082
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5626 -0.4915 -0.1031 0.5219 2.9550
##
## Random effects:
## Groups Name Variance Std.Dev.
## groupe:code_sujet (Intercept) 2358 48.56
## Residual 10889 104.35
## Number of obs: 173, groups: groupe:code_sujet, 52
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -140.0846 359.4520 -0.390
## sexeM -4.5971 22.3617 -0.206
## age 1.1767 21.9824 0.054
## glycemie_mg_dl_BAT 2.0233 0.8403 2.408
## Timepre 19.5856 16.8530 1.162
## SessionEnd -1.6245 15.8747 -0.102
##
## Correlation of Fixed Effects:
## (Intr) sexeM age g___BA Timepr
## sexeM 0.086
## age -0.974 -0.101
## glycm___BAT -0.113 -0.026 -0.110
## Timepre -0.015 -0.030 -0.054 0.224
## SessionEnd -0.020 -0.003 -0.004 0.012 -0.003
Once again, not being a specialist in cognitive sciences nor even in applying mixed models, I would restrain myself from over-interpreting such results. However, on another (but really important, I believe) topic, I would like to provide an additional package that would deserve to be way more widespread among the experimental science practitioners, desiring to use simple and meaningful statistical analyses. I am indeed devoted to a endless war, against the decision based on a t-test, a method that no statistician would ever use, altough the entire world seems to enjoy it more than anything (probably because it makes the concept of talking nonsense really straightfoward and appearing scientifically sound. (Am I sarcastic enough here ? Probably)).
For this ambitious task, we will use an awesome package developed by the brilliant John K. Kruschke, Professor of Psychological and Brain Sciences (cheers to your field!). Namely, BEST means 'Bayesian Estimation Superseds T-test' (see ? I'm not the only one who is sarcastic), and propose a really user-friendly framework to conduct a Bayesian analysis of the mean comparisons problem. Whether you want to compare the mean of two different groups (like pre-post test for example) or compare the mean a single group to 0, the syntax remains completly obious. We juste have to use the function BESTmcmc and provides as first (and second if necessary) argument, the vector of observation you want to test. But yet, you will not TEST! Stop testing. Testing is nonsense, it's hiding behind false objectivity and forgetting the experts knowledges of each field. And welcome in the world of estimation, probability, straightforward statements, uncertainty quantification, and correct measure of effect sizes. First of all, let's create two vectors of data, for illustration, to see if we could exhibit a difference in cognitive test based on the gender.
library(BEST)
db_h = raw_db %>%
filter(sexe == 'M') %>%
filter(!is.na(Stroop_Cog_diff)) %>%
pull(Stroop_Cog_diff) %>%
as.numeric
db_f = raw_db %>%
filter(sexe == 'F') %>%
filter(!is.na(Stroop_Cog_diff)) %>%
pull(Stroop_Cog_diff) %>%
as.numeric
We could visualize on the boxplot that there seems to be nothing really interesting here. Ok, but how much ? Maybe not a slight effect ? Let's see it more deeply !
ggplot(raw_db) + geom_boxplot(aes(x = sexe, y = as.numeric(Stroop_Cog_diff), color = sexe))
Below, we use the wunerfully simple function BESTmcmc function to analyze the difference between those 2 groups.
res_hf = BESTmcmc(db_h, db_f, numSavedSteps = 10000)
res_hf
## MCMC fit results for BEST analysis:
## 10002 simulations saved.
## mean sd median HDIlo HDIup Rhat n.eff
## mu1 78.28 15.051 78.17 48.552 107.67 1.000 5605
## mu2 73.52 9.756 73.57 54.765 93.07 1.002 5720
## nu 29.76 25.380 21.63 3.461 81.66 1.005 1420
## sigma1 120.10 12.398 119.88 95.660 144.52 1.001 3566
## sigma2 104.86 8.223 104.96 88.983 121.30 1.005 2840
##
## 'HDIlo' and 'HDIup' are the limits of a 95% HDI credible interval.
## 'Rhat' is the potential scale reduction factor (at convergence, Rhat=1).
## 'n.eff' is a crude measure of effective sample size.
On the results, we have really detailled analysis of the different quantities that have been estimated. However, for a one-look estimation of the gobal effect, let's display the graph of results !
plot(res_hf, 'mean')
As you can see. The difference (or the absence of difference here) is now perfectly quantified. No test needed here, just a nice statistical estimation and your brain that is already expert in making decision by itself! On a second example, we can study the effect of the intervention (this is want we want to know afterall isn't it ?)
db_pre = raw_db %>%
filter(groupe == 'PC') %>%
filter(Time == 'pre') %>%
filter(!is.na(Stroop_Cog_diff)) %>%
pull(Stroop_Cog_diff) %>%
as.numeric()
db_post = raw_db %>%
filter(groupe == 'PC') %>%
filter(Time == 'post') %>%
filter(!is.na(Stroop_Cog_diff)) %>%
pull(Stroop_Cog_diff) %>%
as.numeric()
raw_db %>%
filter(groupe == 'PC') %>%
ggplot() + geom_boxplot(aes(x = Time, y = as.numeric(Stroop_Cog_diff), color = Time))
We can then run the same framework.
res_pre_post = BESTmcmc(db_post, db_pre, numSavedSteps = 10000)
res_pre_post
## MCMC fit results for BEST analysis:
## 10002 simulations saved.
## mean sd median HDIlo HDIup Rhat n.eff
## mu1 89.40 15.22 89.50 59.361 118.51 1.001 5678
## mu2 56.55 14.22 56.43 29.985 85.89 1.000 5971
## nu 38.93 30.36 30.53 2.683 98.19 1.004 1977
## sigma1 97.98 11.27 97.04 77.029 120.92 1.002 4827
## sigma2 97.25 11.24 96.57 76.193 119.98 1.000 4612
##
## 'HDIlo' and 'HDIup' are the limits of a 95% HDI credible interval.
## 'Rhat' is the potential scale reduction factor (at convergence, Rhat=1).
## 'n.eff' is a crude measure of effective sample size.
And observe the nice results exhibiting a highly probable effect. High enough to mention ? It's up to you to tell now. Enjoy !
plot(res_pre_post, 'mean', showCurve = T)
And more results on the difference of standard errors between groups.
plot(res_pre_post, 'sd', showCurve = T)
And even on the effect size of the intervention! Life's beautiful when it's simple and pretty.
plot(res_pre_post, 'effect', showCurve = T)