Random Effects Testing

General issues

There are two or three issues in testing the significance/doing model selection with random effects. These are mentioned in passing in Bolker et al (2009). However, I would say that that discussion is slightly muddled, perhaps due to lack of space:

• boundary issues: see refs by Self and Liang (1987), Stram and Lee (1994), Goldman and Whelan (2000), and others. Fabian Scheipl's RLRsim package is good for doing this testing. In general ignoring these issues will lead to conservative tests (type I errors below nominal $\alpha$ levels. In the simplest case (testing addition of a single random effect) the distribution of the deviance under the null hypothesis is approximately a mixture of $\chi_0^2$ and $\chi_1^2$ distributions, which translates to simply dividing the p-value from a $\chi_1^2$ by 2.
• number-of-degrees-of-freedom issues: beyond the "denominator" degrees-of-freedom issues which plague mixed models, when testing significance/relative predictive capability of models that differ by random effects, you need to count the number of df/parameters associated with a random effect, which is somewhere between 1 (or <1, considering boundary issues) and n-1, where n is the number of levels (blocks). The former corresponds to fitting "just a variance parameter", the latter corresponds to fitting each block separately (i.e., converging on a fixed-effect fit). Vaida and Blanchard (2005) and Spiegelhalter et al (2002) have sensible things to say: if your "level of focus" is the population, i.e. you are trying to assess the overall pattern rather than predict responses from individual units, then you should be able to just count 1 (or <1) for the variance term.
• correspondence between models with/without random effects: if you are trying to compare a model with a single random effect to one with none, you may not be able to fit them both in the same modeling framework (e.g., lme or lmer will only fit models that contain at least one random effect: you have to use lm/gls to fit the model with no random effects). In this case you have to make sure that the log-likelihood calculations are commensurate (i.e., contain the same constant terms etc.) — for example, glmer fits are not commensurate with glm fits (the log-likelihoods do not converge in the limit as the random effect variances go to zero).

Simulation example

Here is an attempt to evaluate whether comparinglme to gls fits makes sense as a way to do significance testing on a single random effect, as suggested by Zuur et al (2009).

library(nlme)


A reasonably generic function to simulate "data" with a specified number of blocks, number of
treatments per block, number of samples per treatment, fixed-effect coefficients, intercept …

rdata <- function(nb=10,
n=100,
sd.b=0,sd.i=0.1,
fc=c(1,2),
int=0) {
d <- expand.grid(ind=factor(1:n),
ttt=factor(LETTERS[1:(length(fc)+1)]),
block=factor(1:nb))
x.b <- rnorm(nb,sd=sd.b)
mm <- cbind(model.matrix(~ttt,data=d),
model.matrix(~block,data=d))
data.frame(d,y=rnorm(nrow(d),mm %*% c(int,fc,x.b),sd=sd.i))
}


Generate data and fit it with ''lme'' and ''gls'':

d = rdata()
fit1 = lme(y~ttt,random=~1|block,data=d)
logLik(fit1)
fit0 = gls(y~ttt,data=d)
logLik(fit0)


It looks like there is at least not a suspicious offset here. (The simplest approach would be to construct a likelihood profile of the lme fit starting from random effects variance = 0, but I don't know how to do that with lme fits.)

A function to fit both to data and generate the difference in log-likelihoods:

ff0 <- function(...) {
d = rdata(...)
fit1 = lme(y~ttt,random=~1|block,data=d)
fit0 = gls(y~ttt,data=d)
c(logLik(fit1)-logLik(fit0))
}


Try it out, draw various pictures:

s = replicate(500,ff0())
hist(s,breaks=100,col="gray")
hist(log10(s),breaks=100,col="gray")  ## warning: some differences are negative
hist(log10(-s[s<0]),breaks=100,col="gray")  ## are they big? no.
s[s<0] = 0                                            ## set negative (small magnitude) values to zero
hist(s,breaks=100,col="gray",freq=FALSE)
## Look only at positive values:
hist(s[s>0],breaks=100,col="gray",freq=FALSE)
curve(dchisq(x,df=1),add=TRUE,col=2,from=0.001,to=4)


It seems that gls and lme are in fact commensurate (which is nice since they're in the same package); I draw this conclusion because the distribution of differences appears to start from zero (or very slightly negative), not to have a consistent offset.

A function to return p-values:

ff <- function(...) {
d = rdata(...)
fit1 = lme(y~ttt,random=~1|block,data=d)
fit0 = gls(y~ttt,data=d)
anova(fit1,fit0)$p-value[2] }  Replicate and calculate power: ## n.b. replicate doesn't apparently do the ## right thing with ... -- loses argument names powfun <- function(nrep,sd.b,alpha=0.05) { z = replicate(nrep,ff(sd.b=sd.b)) mean(z<alpha) }  Run for a series of random-effects standard deviations: sdvec = seq(0,0.03,length=21) nrep = 2000 alpha=0.05 ## do it again but save all the results dvec2 = matrix(nrow=nrep,ncol=21) powvec = numeric(length(sdvec)) for (i in seq_along(sdvec)) { cat(i,"...\n") dvec2[,i] = replicate(nrep,ff(sd.b=sdvec[i])) powvec[i] = mean(dvec2[,i]<alpha) cat(i,powvec[i],"\n") } colnames(dvec2) = paste("sd",round(sdvec,3),sep="_") save("powvec","dvec2","sdvec",file="randfit.RData")  (Load previous results, available here load("randfit.RData")  plot(sdvec,powvec,type="b",las=1, xlab="Random-effects sigma", ylab="Power")  Plotting the nominal-vs-actual p-value curve for the null case (where the observed p-value should be uniformly distributed): dvsc = scale(dvec2,center=apply(dvec2,2,min), scale=apply(dvec2,2,function(x) diff(range(x)))) par(mar=c(5,5,2,2)+0.1) plot((1:nrow(dvec2))/nrow(dvec2),sort(dvsc[,1]),type="l", log="xy",las=1, xlab="nominal p-value", ylab="") mtext(side=2,line=3.5,"observed p-value") abline(v=axTicks(1,log=TRUE),lty=3,col="gray") abline(h=axTicks(2,log=TRUE),lty=3,col="gray") abline(a=0,b=1,col=2) abline(a=log10(2),b=1,col=2,lty=2)  Bottom line: the observed type-I error rate is only 0.013 for a nominal level of$\alpha= 0.05\$; even the usual correction by a factor of 2 (dashed line) appears to undercorrect here.

Is the simple correction failing because we have a "denominator df" problem (i.e., LRT doesn't apply because the number of blocks isn't sufficiently large)? Try switching the number of blocks and number of replicates per treatment/block so that we have the same total number of observations but more blocks (100 vs 10):

dvec3 = replicate(nrep,ff(sd.b=0,nb=100,n=10))
dvsc3 = (dvec3-min(dvec3))/diff(range(dvec3))
library(sfsmisc)  ## for eaxis()
par(mar=c(5,5,2,2)+0.1)
plot((1:length(dvec3))/length(dvec3),sort(dvsc3),type="l",
log="xy",las=1,
xlab="nominal p-value",
ylab="",axes=FALSE)
eaxis(side=1)
eaxis(side=2)
mtext(side=2,line=4,"observed p-value")
abline(v=axTicks(1,log=TRUE),lty=3,col="gray")
abline(h=axTicks(2,log=TRUE),lty=3,col="gray")
abline(a=0,b=1,col=2)
abline(a=log10(2),b=1,col=2,lty=2)


Better, but still not perfect at the low end (below nominal p-values of about 0.02)? [update: re-running with nblock=500, nind=10, nrep=5000 gives a plot where the observed matches nominal*2 down to about p=0.002, which represents about the 10th-smallest point in the set.]

To do

• Is there a reasonable solution to the small-nblock problem: F-tests or some other adjusted LRT approach for finite size?
• test RLRsim: should work more or less exactly?
• figure out an appropriate test for AIC comparisons (cf Richards 2005,2008)