Abstract
The ‘DHARMa’ package uses a simulation-based approach to create readily interpretable scaled (quantile) residuals for fitted generalized linear (mixed) models. Currently supported are linear and generalized linear (mixed) models from ‘lme4’ (classeslmerMod,
glmerMod), ‘glmmTMB’, ‘GLMMadaptive’, ‘spaMM’, and ‘brms’
(simple models); phylogenetic linear models from ‘phylolm’ (classes
‘phylolm’ and ‘phyloglm’); generalized additive models (‘gam’ from
‘mgcv’); ‘glm’ (including ‘negbin’ from ‘MASS’, but excluding
quasi-distributions) and ‘lm’ model classes. Moreover, externally
created simulations, e.g. posterior predictive simulations from Bayesian
software such as ‘JAGS’, ‘STAN’, or ‘BUGS’ can be processed as well. The
resulting residuals are standardized to values between 0 and 1 and can
be interpreted as intuitively as residuals from a linear regression. The
package also provides a number of plot and test functions for typical
model misspecification problems, such as over/underdispersion,
zero-inflation, and residual spatial, temporal and phylogenetic
autocorrelation.
The interpretation of conventional residuals for generalized linear (mixed) and other hierarchical statistical models is often problematic. As an example, here the result of conventional Deviance, Pearson and raw residuals for two Poisson GLMMs, one that is lacking a quadratic effect, and one that fits the data perfectly. Could you tell which is the correct model?
Just for completeness - it was the second one. But don’t get too excited if you got it right. Probably you were just lucky - I can’t really tell a difference. But even so, would you have added a quadratic effect, instead of adding an overdispersion correction? The point here is that misspecifications in GL(M)Ms cannot reliably be diagnosed with standard residual plots, and thus GLMMs are often not as thoroughly checked as they should.
One reason why GL(M)Ms residuals are harder to interpret is that the expected distribution of the data (aka predictive distribution) changes with the fitted values. Reweighting with the expected dispersion, as done in Pearson residuals, or using deviance residuals, helps to some extent, but it does not lead to visually homogeneous residuals, even if the model is correctly specified. As a result, standard residual plots, when interpreted in the same way as for linear models, seem to show all kind of problems, such as non-normality, heteroscedasticity, even if the model is correctly specified. Questions on the R mailing lists and forums show that practitioners are regularly confused about whether such patterns in GL(M)M residuals are a problem or not.
But even experienced statistical analysts currently have few options to diagnose misspecification problems in GLMMs. In my experience, the current standard practice is to eyeball the residual plots for major misspecifications, potentially have a look at the random effect distribution, and then run a test for overdispersion, which is usually positive, after which the model is modified towards an overdispersed / zero-inflated distribution. This approach, however, has a number of drawbacks, notably:
Overdispersion is often the result of missing predictors or a misspecified model structure. Standard residual plots make it difficult to identify these problems by examining residual correlations or patterns of residuals against predictors.
Not all overdispersion is the same. For count data, the negative binomial creates a different distribution than adding observation-level random effects to the Poisson. Once overdispersion is corrected for, such violations of distributional assumptions are not detectable with standard overdispersion tests (because the tests only looks at total dispersion), and nearly impossible to see visually from standard residual plots.
Dispersion frequently varies with predictors (heteroscedasticity). This can have a significant effect on the inference. While it is standard to test for heteroscedasticity in linear regressions, heteroscedasticity is currently hardly ever tested for in GLMMs, although it is likely as frequent and influential.
Moreover, if residuals are checked, they are usually checked conditional on the fitted random effect estimates. Thus, standard checks only check the final level of the random structure in a GLMM. One can perform extra checks on the random effects, but it is somewhat unsatisfactory that there is no check on the entire model structure.
DHARMa aims at solving these problems by creating readily interpretable residuals for generalized linear (mixed) models that are standardized to values between 0 and 1, and that can be interpreted as intuitively as residuals for the linear model. This is achieved by a simulation-based approach, similar to the Bayesian p-value or the parametric bootstrap, that transforms the residuals to a standardized scale. The basic steps are:
Simulate new response data from the fitted model for each observation.
For each observation, calculate the empirical cumulative density function for the simulated observations, which describes the possible values (and their probability) at the predictor combination of the observed value, assuming the fitted model is correct.
The residual is then defined as the value of the empirical density function at the value of the observed data, so a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means half of the simulated values are larger than the observed value.
These steps are visualized in the following figure
The key advantage of this definition is that the so-defined residuals always have the same, known distribution, independent of the model that is fit, if the model is correctly specified. To see this, note that, if the observed data was created from the same data-generating process that we simulate from, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on).
I currently prepare a more exact statistical justification for the approach in an accompanying paper, but if you must provide a reference in the meantime, I would suggest citing
Dunn, K. P., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 1-10.
Gelman, A. & Hill, J. Data analysis using regression and multilevel/hierarchical models Cambridge University Press, 2006
p.s.: DHARMa stands for “Diagnostics for HierArchical Regression Models” - which, strictly speaking, would make DHARM. But in German, Darm means intestines; plus, the meaning of DHARMa in Hinduism makes the current abbreviation so much more suitable for a package that tests whether your model is in harmony with your data:
From Wikipedia, 28/08/16: In Hinduism, dharma signifies behaviours that are considered to be in accord with rta, the order that makes life and universe possible, and includes duties, rights, laws, conduct, virtues and ‘right way of living’.
If you haven’t installed the package yet, either run
Or follow the instructions on https://github.com/florianhartig/DHARMa to install a development version.
Loading and citation
##
## To cite package 'DHARMa' in publications use:
##
## Hartig F (2026). _DHARMa: Residual Diagnostics for Hierarchical
## (Multi-Level / Mixed) Regression Models_. R package version 0.5.0,
## <http://florianhartig.github.io/DHARMa/>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models},
## author = {Florian Hartig},
## year = {2026},
## note = {R package version 0.5.0},
## url = {http://florianhartig.github.io/DHARMa/},
## }
Let’s assume we have a fitted model that is supported by DHARMa.
testData = createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) ,
family = "poisson", data = testData)Most functions in DHARMa can be calculated directly on the fitted model object. For example, if you are only interested in testing for dispersion problems, you could run
In this case, the randomized quantile residuals are calculated on the
fly inside the function call. If you work in this way, however, residual
calculation will be repeated by every test / plot you call, and this can
take a while. It is therefore highly recommended to first calculate the
residuals once, using the simulateResiduals() function
which calculates randomized quantile residuals according to the
algorithm discussed above. The function returns an object of class
DHARMa, containing the simulations and the scaled
residuals, which can later be passed on to all other plots and test
functions. When specifying the optional argument
plot = TRUE, the standard DHARMa residual plot
is displayed directly. The interpretation of the plot will be discussed
below. Using the simulateResiduals function has the added
benefit that you can modify the way in which residuals are calculated.
For example, you may want to change the number of simulations, or the
REs to condition on. See ?simulateResiduals and section
“simulation options” below for details.
The calculated (scaled) residuals can be plotted and tested via a number of DHARMa functions (see below), or accessed directly via
To interpret the residuals, remember that a scaled residual value of 0.5 means that half of the simulated data are higher than the observed value, and half of them lower. A value of 0.99 would mean that nearly all simulated data are lower than the observed value. The minimum/maximum values for the residuals are 0 and 1. For a correctly specified model we would expect asymptotically
a uniform (flat) distribution of the scaled residuals
uniformity in y direction if we plot against any predictor.
Note: the uniform distribution is the only difference to “conventional” residuals as calculated for a linear regression. If you cannot get used to this, you can transform the uniform distribution to another distribution, for example normal, via
These normal residuals will behave exactly like the residuals of a
linear regression. However, for reasons of a) numeric stability with low
number of simulations, which makes it necessary to decide on which value
outliers are to be transformed and b) my conviction that it is much
easier to visually detect deviations from uniformity than normality,
DHARMa checks all residuals in the uniform space, and I
would personally advise against using the transformation.
The main plot function for the calculated DHARMa object
produced by simulateResiduals() is the
plot.DHARMa() function
The function creates two plots, which can also be called separately, and provide extended explanations / examples in the help
plotQQunif(simulationOutput) # left plot in plot.DHARMa()
plotResiduals(simulationOutput) # right plot in plot.DHARMa()plotQQunif (left panel) creates a qq-plot to detect
overall deviations from the expected distribution, by default with added
tests for correct distribution (KS test), dispersion and outliers. Note
that outliers in DHARMa are values that are by default defined as values
outside the simulation envelope, not in terms of a particular quantile.
Thus, which values will appear as outliers will depend on the number of
simulations. If you want outliers in terms of a particular quantile, you
can use the outliers() `function.
plotResiduals (right panel) produces a plot of the
residuals against the predicted value (or alternatively, other
variable). Simulation outliers (data points that are outside the range
of simulated values) are highlighted as red stars. These points should
be carefully interpreted, because we actually don’t know “how much”
these values deviate from the model expectation. Note also that the
probability of an outlier depends on the number of simulations, so
whether the existence of outliers is a reason for concern depends also
on the number of simulations.
To provide a visual aid in detecting deviations from uniformity in y-direction, the plot function calculates an (optional default) quantile regression, which compares the empirical 0.25, 0.5 and 0.75 quantiles in y direction (red solid lines) with the theoretical 0.25, 0.5 and 0.75 quantiles (dashed black line), and provides a p-value for the deviation from the expected quantile. The significance of the deviation to the expected quantiles is tested and displayed visually, and can be additionally extracted with the testQuantiles function.
By default, plotResiduals plots against predicted
values. However, you can also use it to plot residuals against a
specific other predictors (highly recommended).
Additionally, you can plot residuals against a specific predictor for a specific group, e.g. a random effect level.
If the predictor is a factor, or if there is just a small number of observations on the x axis, plotResiduals will plot a box plot with additional tests instead of a scatter plot.
See ?plotResiduals for details, but very shortly: under
H0 (perfect model), we would expect those boxes to range homogeneously
from 0.25-0.75. To see whether there are deviations from this
expectation, the plot calculates a test for uniformity per box, and a
test for homogeneity of variances between boxes. A positive test will be
highlighted in red.
NOTE on plots: The default color for highlighting
outliers and significant tests is red. However, it can
be changed by setting options(DHARMaSignalColor = "red") to
a different color. This is convenient for a color-blind friendly
display, since red and black are difficult for some people to
disentangle.
To support the visual inspection of the residuals, the
DHARMa package provides a number of specialized
goodness-of-fit tests on the simulated residuals:
testUniformity() - tests if the overall distribution
conforms to expectations.testOutliers() - tests if there are more simulation
outliers than expected.testDispersion() - tests if the simulated dispersion is
equal to the observed dispersion.testQuantiles() - fits a quantile regression or
residuals against a predictor (default predicted value), and tests of
this conforms to the expected quantile.testCategorical(simulationOutput, catPred = testData\$group)
tests residuals against a categorical predictor.testZeroinflation() - tests if there are more zeros in
the data than expected from the simulations.testGeneric() - test if a generic summary statistic
(user-defined) deviates from model expectations.testTemporalAutocorrelation() - tests for temporal
autocorrelation in the residuals.testSpatialAutocorrelation() - tests for spatial
autocorrelation in the residuals. Can also be used with a generic
distance function.testPhylogeneticAutocorrelation() - tests for
phylogenetic signal in the residuals.See the help of the functions and further comments below for a more detailed description.
There are a few important technical details regarding how the simulations are performed, in particular regarding the treatments of random effects and integer responses. It is strongly recommended to read the help of
if refit = FALSE (default), new datasets are
simulated from the fitted model, and residuals are calculated by
comparing the observed data to the new data
if refit = TRUE, a parametric bootstrap is
performed, meaning that the model is refit to all new datasets, and
residuals are created by comparing observed residuals against refitted
residuals
The second option is much much slower, and also seemed to have lower power in some tests I ran. It is therefore not recommended for standard residual diagnostics! I only recommend using it if you know what you are doing, and have particular reasons, for example if you estimate that the tested model is biased. A bias could, for example, arise in small data situations, or when estimating models with shrinkage estimators that include a purposeful bias, such as ridge/lasso, random effects or the splines in GAMs. My idea was then that simulated data would not fit to the observations, but that residuals for model fits on simulated data would have the same patterns/bias than model fits on the observed data.
Note also that refit = TRUE can sometimes run into
numerical problems, if the fitted model does not converge on the newly
simulated data.
The second option is the treatment of the stochastic hierarchy. In a hierarchical model, several layers of stochasticity are placed on top of each other. Specifically, in a GLMM, we have a lower level stochastic process (random effect), whose result enters into a higher level (e.g. Poisson distribution). For other hierarchical models, such as state-space models, similar considerations apply, but the hierarchy can be more complex. When simulating, we have to decide if we want to re-simulate all stochastic levels, or only a subset of those. For example, in a GLMM, it is common to only simulate the last stochastic level (e.g. Poisson) conditional on the fitted random effects, meaning that the random effects are set on the fitted values.
As of DHARMa 0.5.0, the default is conditional simulation on all
random effects of the fitted model. Setting
simulateREs = "user-specified" allows you to return to the
previous DHARMa default, which used the respective default setting for
the simulate function of your model class. Then the simulateResiduals
function allows users to pass on parameters to the simulate function of
the fitted model object, using its respective syntax. It further allows
to specify only a subset of random effects to be conditioned on, but
note that this is not possible for all model classes.
Summary of the syntax for supported packages (this is only relevant for simulateREs = “user-specified”):
| Package | Unconditional | Conditional on all REs | Conditional on specific REs |
| lme4 | re.form = NA (default) |
re.form = NULL |
re.form = ~(1 | group) |
| spaMM | re.form = NA (default) |
re.form = NULL |
re.form = ~(1 | group) |
| glmmTMB | set_simcodes(model$obj, val = "random", terms = "ALL"),
then simulate(model) (default) |
set_simcodes(model$obj, val = "fix", terms = "ALL"),
then simulate(model) |
not available |
| GLMMadaptive | type = "subject_specific",
new_RE = T |
type = "subject_specific" (default) |
not available |
| mgcv | not available | default | not available |
| brms | re.form = NA |
re.form = NULL (default) |
not available |
If the model is correctly specified and the fitting procedure is unbiased (disclaimer: GLMM estimators are not always unbiased), the simulated residuals should be flat regardless how many hierarchical levels we re-simulate. The most thorough procedure would be therefore to test all possible options. Re-simulating all levels (unconditional) can be advantageous because it tests the model structure as a whole. A potential drawback is that re-simulating the lower-level random effects creates more variability, which may reduce power for detecting problems in the upper-level stochastic processes, in particular overdispersion (see section on dispersion tests below). In particular dispersion tests may produce different results when switching from conditional to unconditional simulations, and often the conditional simulation is more sensitive.
Note: Although unconditional residuals implicitly also test the normal distribution of the REs, it is probably not a bad idea to additionally check for normality of the RE distribution. As this is not based on quantile residuals, there is no special DHARMa function for this, so you should just extract the REs, and then run e.g. a Shapiro test.
A third option is the treatment of integer responses. The background of this option is that, for integer-valued variables, some additional steps are necessary to make sure that the residual distribution becomes flat (essentially, we have to smooth away the integer nature of the data). The idea is explained in
DHARMa currently implements two procedures for randomization. The default procedure will randomize automatically. The second option requires knowledge about whether the model is integer-valued, which is usually implemented automatically. See ?simulateResiduals for details. Usually, these options should simply be kept at their defaults.
In many situations, it can be useful to look at residuals per group,
e.g. to see how much the model over / underpredicts per plot, year or
subject. To do this, use the recalculateResiduals()
function, together with a grouping variable (group) or a subsetting
variable (sel), which can also be used in combination.
Note, however, that you will have to change the selection of
variables that you provide to plots and tests (e.g. in
plotResiduals or testSpatialAutocorrelation)
accordingly when you group or subset residuals.
As DHARMa uses simulations to calculate the residuals, a naive
implementation of the algorithm would mean that residuals would look
slightly different each time a DHARMa calculation is executed. This
might both be confusing and bear the danger that a user would run the
simulation several times and take the result that looks better (which
would amount to multiple testing / p-hacking). By default, DHARMa
therefore fixes the random seed to the same value every time a
simulation is run, and afterwards restores the random state to the old
value. This means that you will get exactly the same residual plot each
time. If you want to avoid this behavior, for example for simulation
experiments on DHARMa, use seed = NULL -> no seed set,
but random state will be restored, or seed = FALSE -> no
seed set, and random state will not be restored. Whether or not you fix
the seed, the setting for the random seed and the random state are
stored in
If you want to reproduce simulations for such a run, set the variable
.Random.seed by hand, and simulate with
seed = NULL.
Moreover (general advice), to ensure reproducibility, it’s advisable
to add a set.seed() at the beginning, and a
sessionInfo() at the end of your script. The latter will
list the version number of R and all loaded packages.
So far, all shown DHARMa results were calculated for a correctly specified model, resulting in “perfect” residual plots and diagnostics. In this section, we discuss how to recognize and interpret diagnostics that indicate a misspecified model. Before going into the details, note, however, that
No residual pattern does not “prove” that the model is correct: The fact that none of the DHARMa tests indicate a problem does not “prove” that the model is correctly specified. For any model, there are likely a large number of structural problems that do not create a pattern in the DHARMa diagnostics. In good old Popper fashion, you should interpret no residual problems as your working hypothesis not being rejected in that particular test, which increases confidence in the model, but does not constitute a conclusive proof. So, keep your skepticism alive, and if you find the results fishy, keep searching and testing.
Once a residual effect is statistically significant, look at the magnitude to decide if there is a problem: It is crucial to note that significance is NOT a measure of the strength of the residual pattern, it is a measure of the signal/noise ratio, i.e. whether you are sure there is a pattern at all. Significance in hypothesis tests depends on at least 2 ingredients: the strength of the signal and the number of data points. If you have a lot of data points, residual diagnostics will nearly inevitably become significant, because having a perfectly fitting model is very unlikely. That, however, doesn’t necessarily mean that you need to change your model. The p-values confirm that there is a deviation from your null hypothesis. It is, however, in your discretion to decide whether this deviation is worth worrying about. For example, if you see a dispersion parameter of 1.01, I would not worry, even if the dispersion test is significant. A significant value of 5, however, is clearly a reason to move to a model that accounts for overdispersion.
A residual pattern does not indicate that the model is unusable: While a significant pattern in the residuals indicates with good reliability that the observed data did likely not originate from the fitted model, this doesn’t necessarily imply that the inferential results from this wrong model are not usable. There are many situations in statistics where it is common practice to work with “wrong models”. For example, many statistical models use shrinkage estimators, which purposefully bias parameter estimates to certain values. Random effects are a special case of this. If DHARMa residuals for these estimators are calculated, they will often show a slight pattern in the residuals even if the model is correctly specified, and tests for this can get significant for large sample sizes. For this reason, DHARMa is excluding RE estimates in the predictions when plotting res ~ pred. Another example is data that is missing at random (MAR). Since it is known that this phenomenon does not create a bias on the fixed effects estimates, it is common practice to fit these data with standard mixed models. Nevertheless, DHARMa recognizes that the observed data looks different from what would be expected from the model assumptions, and flags the model as problematic (see here).
Important conclusion: DHARMa only flags a difference between the observed and expected data - the user has to decide whether this difference is actually a problem for the analysis!
GL(M)Ms often display over/underdispersion, which means that residual variance is larger/smaller than expected under the fitted model. This phenomenon is most common for GLM families with constant (fixed) dispersion, in particular for Poisson and binomial models, but it can also occur in GLM families that adjust the variance (such as the beta or negative binomial) when distribution assumptions are violated. A few general rules of thumb about dealing with dispersion problems:
This is how overdispersion looks like in the DHARMa residuals. Note that we get more residuals around 0 and 1, which means that more residuals are in the tail of the distribution than would be expected under the fitted model.
testData = createData(sampleSize = 200, overdispersion = 1.5, family = poisson())
fittedModel <- glm(observedResponse ~ Environment1 , family = "poisson",
data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)If you see this pattern, note that overdispersion is often caused by
model misfit. Thus, before moving to a GLM with variable dispersion (for
count data this would typically be a negative binomial), you should
check your model for misfit, e.g. by plotting residuals against all
predictors using plotResiduals().
Next, this is an example of underdispersion. Here, we get too many residuals around 0.5, which means that we are not getting as many residuals as we would expect in the tail of the distribution from the fitted model.
testData = createData(sampleSize = 500, intercept=0, fixedEffects = 2,
overdispersion = 0, family = poisson(),
roundPoissonVariance = 0.001, randomEffectVariance = 0)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) ,
family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)If you see this pattern, note that a common reason for
underdispersion is overfitting, i.e. your model is too complex. Other
possible explanations to check for include zero-inflation (best to check
by comparing to a ZIP model, but see also
testZeroInflation), non-independence of the data
(e.g. temporal autocorrelation, check via
testTemporalAutocorrelation) that your predictors can use
to overfit, or that your data-generating process is simply not a Poisson
process.
From a technical side, underdispersion is not as concerning as overdispersion, as it will usually bias p-values to the conservative side, but if your goal is to get a good power, you may want to consider a simpler model. If that is not helping, you can move to a distribution that takes into account underdispersed count data (e.g. Conway-Maxwell-Poisson, generalized Poisson).
Although, as discussed above, over/underdispersion will show up in the residuals, and it’s possible to detect it with the testUniformity function, simulations show that this test is less powerful than more targeted tests. DHARMa contains several dispersion tests:
default: the simulation-based residual variance test is a non-parametric test that compares the variance of the observed raw residuals to the variances of the simulated residuals.
PearsonChisq: the Pearson-Chi2 test that is popular in
the literature, suggested in the glmm Wiki, and implemented in some
other R packages such as
performance::check_overdispersion.
refit: if residual simulations are done via refit and the test is PearsonChisq, DHARMa will compare the Pearson residuals of the re-fitted simulations to the original Pearson residuals. This is essentially a non-parametric version of test 2.
All of these tests are included in the testDispersion
function, see ?testDispersion for details.
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.062221, p-value < 2.2e-16
## alternative hypothesis: two.sided
IMPORTANT INFO: we have made extensive simulations (see Leite M.S., Rettelbach, D. and Hartig, F. (2025)), which have shown that the various tests have certain advantages and disadvantages. The figure below shows the general performance of the dispersion tests for Poisson and binomial models for: GLMs in general, GLMs with small sample size or intercept (“small data”), GLMMs with one random effect with few groups/levels, GLMMs with many groups/levels in a random effect, and computational time for calculating the test (speed). The symbols mean: “-” bad performance, “+” good performance, “++” very good performance.
The basic results are that:
alternative = "greater"), this
makes the test more conservative, but it also costs power.Note: Conditional simulations are the default as of DHARMa v0.5.0. The problem with unconditional simulations is that they introduce additional variance in the simulated response by also re-simulating the lower-level stochastic process (random effects). This variability strongly reduces the power of the dispersion test. Re-simulating conditional on the fitted random effects acts only at the higher-level stochastic process, the distribution (e.g. Poisson or binomial proportion). As over/underdispersion is a problem at the distribution-level, conditional simulations are the more adequate basis for dispersion tests, and results in substantially higher test power and sensitivity.
A common special case of overdispersion is zero-inflation, which is the situation when more zeros appear in the observation than expected under the fitted model. Zero-inflation requires special correction steps. More generally, we can also have too few zeros, or too much or too few of any other values. We’ll discuss that at the end of this section.
Here an example of a typical zero-inflated count dataset, plotted against the environmental predictor
testData = createData(sampleSize = 500, intercept = 2, fixedEffects = c(1),
overdispersion = 0, family = poisson(),
quadraticFixedEffects = c(-3), randomEffectVariance = 0,
pZeroInflation = 0.6)
par(mfrow = c(1,2))
plot(testData$Environment1, testData$observedResponse,
xlab = "Environmental Predictor",
ylab = "Response")
hist(testData$observedResponse, xlab = "Response", main = "")We see a hump-shaped dependence of the environment, but with too many zeros. In the normal DHARMa residual plots, zero-inflation will look pretty much like overdispersion
fittedModel <- glmer(observedResponse ~ Environment1 + I(Environment1^2) +
(1|group), family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)The reason is that the model will usually try to find a compromise between the zeros, and the other values, which will lead to excess variance in the residuals.
DHARMa also has a special test for zero-inflation, which compares the distribution of expected zeros in the data against the observed zeros
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 2.1544, p-value < 2.2e-16
## alternative hypothesis: two.sided
This test is likely better suited for detecting zero-inflation than the standard plot, but note that also overdispersion will lead to excess zeros, so only seeing too many zeros is not a reliable diagnostic for moving towards a zero-inflated model. A reliable differentiation between overdispersion and zero-inflation will usually only be possible when directly comparing alternative models, e.g. through residual comparison / model selection of a model with / without zero-inflation, or by simply fitting a model with zero-inflation and looking at the parameter estimate for the zero-inflation. A good option is the R package glmmTMB, which is also supported by DHARMa. We can use this to fit
To test for generic excess / deficits of particular values, we have the function testGeneric, which compares the values of a generic, user-provided summary statistics
Choose one of alternative = c(“greater”, “two.sided”, “less”) to test for inflation / deficit or both. Default is “greater” = inflation.
countOnes <- function(x) sum(x == 1) # testing for number of 1s
# 1-inflation:
testGeneric(simulationOutput, summary = countOnes, alternative = "greater") ##
## DHARMa generic simulation test
##
## data: simulationOutput
## ratioObsSim = 0.20614, p-value = 1
## alternative hypothesis: greater
So far, most of the things that we have tested could also have been detected with parametric tests. Here, we come to the first issue that is difficult to detect with current tests, and that is usually neglected.
Heteroscedasticity means that there is a systematic dependency of the dispersion / variance on another variable in the model. It is not sufficiently appreciated that also binomial or Poisson models can show heteroscedasticity. Basically, it means that the level of over/underdispersion depends on another parameter. Here an example where we create such data
testData = createData(sampleSize = 500, intercept = -1.5,
overdispersion = function(x){return(rnorm(length(x),
sd = 1 * abs(x)))},
family = poisson(), randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1 , family = "poisson",
data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)The exact p-values for the quantile lines in the plot can be displayed via
As mentioned above, the equivalent test for categorical predictors (plot function will switch automatically) would be
Adding a simple overdispersion correction will try to find a compromise between the different levels of dispersion in the model. The qq plot looks better now, but there is still a pattern in the residuals
testData = createData(sampleSize = 500, intercept = 0,
overdispersion = function(x){return(rnorm(length(x),
sd = 2*abs(x)))},
family = poisson(), randomEffectVariance = 0)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) + (1|ID),
family = "poisson", data = testData)
# plotConventionalResiduals(fittedModel)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)To remove this pattern, you would need to make the dispersion parameter dependent on a predictor (e.g. in JAGS), or apply a transformation on the data.
A second test that is typically run for LMs, but not for GL(M)Ms is to plot residuals against the predictors in the model (or potentially predictors that were not in the model) to detect possible misspecifications. Doing this is highly recommended. For that purpose, you can retrieve the residuals via
Note again that the residual values are scaled between 0 and 1. If you plot the residuals against predictors, space or time, the resulting plots should not only show no systematic dependency of those residuals on the covariates, but they should also again be flat for each fixed situation. That means that if you have, for example, a categorical predictor: treatment / control, the distribution of residuals for each predictor alone should be flat as well.
Here an example with a missing quadratic effect in the model and 2 predictors
testData = createData(sampleSize = 200, intercept = 1, fixedEffects = c(1,2),
overdispersion = 0, family = poisson(),
quadraticFixedEffects = c(-3,0))
fittedModel <- glmer(observedResponse ~ Environment1 + Environment2 + (1|group),
family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput, quantreg = TRUE)It is difficult to see that there is a problem at all in the general plot, but it becomes clear if we plot against the environment
par(mfrow = c(1,2))
plotResiduals(simulationOutput, form = ~ Environment1)
plotResiduals(simulationOutput, form = ~ Environment2)If a distance between residuals can be defined (temporal, spatial, phylogenetic), you should check if there is a distance-dependence in the residuals, which would suggest to move to a GLS (generalized least squares) structure for analysis.
The three functions to test for this in DHARMa are:
Moran.I function on package ape.Here a short example for the spatial case, see help of the functions for extended examples.
testData = createData(sampleSize = 100, family = poisson(),
spatialAutocorrelation = 3, numGroups = 1,
randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1 , data = testData,
family = poisson() )
simulationOutput <- simulateResiduals(fittedModel = fittedModel)
testSpatialAutocorrelation(simulationOutput = simulationOutput, x = ~x,
y= ~y)##
## DHARMa Moran's I test for distance-based autocorrelation
##
## data: simulationOutput
## observed = 0.089567, expected = -0.010101, sd = 0.016660, p-value =
## 2.197e-09
## alternative hypothesis: Distance-based autocorrelation
Note that all these tests are most sensitive against homogeneous residual structure, and might miss local and heterogeneous (non-stationary) residual structures. Additional visual checks can be useful.
However, standard DHARMa simulations from models with temporal /
spatial / phylogenetic conditional autoregressive terms will still have
the respective correlation in the DHARMa residuals, unless
the package you are using is modelling the autoregressive terms as
explicit REs and is able to simulate conditional on the fitted REs. It
means that the residuals will still show significant autocorrelation,
even if the model fully accounts for this structure, and other tests,
such as dispersion, uniformity, may have inflated type I error. See the
example below with a glmmTMB model with a spatial
autocorrelation structure:
library(glmmTMB)
testData$pos <- numFactor(testData$x, testData$y)
fittedModel2 <- glmmTMB(observedResponse ~ Environment1 + exp(pos + 0|group),
data = testData, family = poisson())
simulationOutput2 <- simulateResiduals(fittedModel = fittedModel2,
simulateREs = "unconditional")
testSpatialAutocorrelation(simulationOutput = simulationOutput2, x = ~x,
y= ~y)##
## DHARMa Moran's I test for distance-based autocorrelation
##
## data: simulationOutput2
## observed = 0.074021, expected = -0.010101, sd = 0.016668, p-value =
## 4.487e-07
## alternative hypothesis: Distance-based autocorrelation
Note that while glmmTMB supports conditional simulations in general,
this is not yet supported for certain (spatial) covariance structures.
Thus, we use simulateREs = "unconditional" here. Otherwise
DHARMa will give a warning and automatically fall back to unconditional
simulations. For further details please refer to the help of
glmmTMB.
One of the options to solve it and get correct tests and no residual
pattern (if the model is correct) is to rotate the residual space
according to the covariance structure of the fitted model, such that the
rotated residuals are conditionally independent. The argument rotation
in simulateResiduals does it (see also
?getQuantile for details about the rotation options):
# rotation of the residuals
simulationOutput3 <- simulateResiduals(fittedModel = fittedModel2,
simulateREs = "unconditional",
rotation = "estimated")
testSpatialAutocorrelation(simulationOutput = simulationOutput3, x = ~x,
y= ~y)##
## DHARMa Moran's I test for distance-based autocorrelation
##
## data: simulationOutput3
## observed = -0.023112, expected = -0.010101, sd = 0.016689, p-value =
## 0.4356
## alternative hypothesis: Distance-based autocorrelation
Note: More real-world examples can be found on the DHARMa GitHub repository.
This example comes from Jochen Fruend. Measured are the number of parasitized observations, with population density as a covariate:
plot(N_parasitized / (N_adult + N_parasitized ) ~ logDensity,
xlab = "Density", ylab = "Proportion infected", data = data)Let’s fit the data with a regular binomial n/k glm:
mod1 <- glm(cbind(N_parasitized, N_adult) ~ logDensity, data = data,
family = binomial)
simulationOutput <- simulateResiduals(fittedModel = mod1)
plot(simulationOutput)We see various signals of overdispersion:
OK, so let’s add overdispersion through an beta-binomial distribution
from glmmTMB package.
library(glmmTMB)
mod2 <- glmmTMB(cbind(N_parasitized, N_adult) ~ logDensity + (1|ID), data = data,
family=betabinomial)
simulationOutput <- simulateResiduals(fittedModel = mod2)
plot(simulationOutput)The overdispersion looks better, but you can see that the residuals still look a bit irregular (although tests are n.s.). The raw data looks a bit hump-shaped, so we might be tempted to add a quadratic effect.
mod3 <- glmmTMB(cbind(N_parasitized, N_adult) ~ logDensity + I(logDensity^2) +
(1|ID), data = data, family=betabinomial)
simulationOutput <- simulateResiduals(fittedModel = mod3)
plot(simulationOutput)The residuals look perfect now! That being said, we don’t have a lot of data, and we have to be sure we’re not overfitting. A likelihood ratio test tells us that the quadratic effect is not significantly supported.
## Data: data
## Models:
## mod2: cbind(N_parasitized, N_adult) ~ logDensity + (1 | ID), zi=~0, disp=~1
## mod3: cbind(N_parasitized, N_adult) ~ logDensity + I(logDensity^2) + , zi=~0, disp=~1
## mod3: (1 | ID), zi=~0, disp=~1
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mod2 4 217.21 221.58 -104.61 209.21
## mod3 5 215.82 221.27 -102.91 205.82 3.3943 1 0.06542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Also AIC differences are small, although slightly in favor of model 3:
## [1] 217.2116
## [1] 215.8173
I guess you could use either Model 2 or 3 - the broader point is: increasing model complexity will nearly always improve the residuals, but according to standard statistical arguments (power, bias-variance trade-off) it’s not always advisable to get them perfect, just good enough!
The next example uses the fairly well known Owl dataset which is
provided in glmmTMB (see ?Owls for more info
about the data). The following shows a sequence of models, all checked
with DHARMa. The example is discussed in a talk at ISEC 2018, see slides
here.
m1 <- glm(SiblingNegotiation ~ FoodTreatment*SexParent + offset(log(BroodSize)),
data = Owls, family = poisson)
res <- simulateResiduals(m1)
plot(res)OK, this is highly overdispersed. Let’s add a RE on nest:
m2 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent +
offset(log(BroodSize)) + (1|Nest), data=Owls , family = poisson)
res <- simulateResiduals(m2)
plot(res)Somewhat better, but not good. Move to neg binom, to adjust dispersion, and checking dispersion and residuals against FoodTreatment predictor:
m3 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent +
offset(log(BroodSize)) + (1|Nest), data=Owls , family = nbinom1)
res <- simulateResiduals(m3, plot = TRUE)##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.70386, p-value < 2.2e-16
## alternative hypothesis: two.sided
We see underdispersion now. In a model with variable dispersion, this is often the signal that some other distributional assumptions are violated. Let’s check for zero-inflation:
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.2579, p-value < 2.2e-16
## alternative hypothesis: two.sided
It looks as if there is some zero-inflation, although non-significant. Fitting a zero-inflated model:
m4 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent +
offset(log(BroodSize)) + (1|Nest),
ziformula = ~ FoodTreatment + SexParent, data=Owls,
family = nbinom1)
res <- simulateResiduals(m4, plot = TRUE)##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.90272, p-value = 0.272
## alternative hypothesis: two.sided
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.0526, p-value = 0.528
## alternative hypothesis: two.sided
This looks a lot better. Trying a slightly different model specification, adding a dispersion model as well:
m5 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent +
offset(log(BroodSize)) + (1|Nest),
dispformula = ~ FoodTreatment + SexParent ,
ziformula = ~ FoodTreatment + SexParent, data = Owls,
family = nbinom1)
res <- simulateResiduals(m5, plot = TRUE)##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.87257, p-value = 0.192
## alternative hypothesis: two.sided
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.0594, p-value = 0.352
## alternative hypothesis: two.sided
but that does not seem to make things better.
Both models would be acceptable in terms of their fit to the data. Which one should you prefer? This is not a question for residual checks. Residual checks tell you which models can be rejected with the data. Which of the typically many acceptable models you should fit must be decided by your scientific question, and/or possibly by model selection methods. In doubt, I would tend towards the simpler model though.
The main concern in Poisson data is dispersion. See comments in the section on the dispersion test, in particular regarding the advantage of conditional simulations in this case. To address overdispersion, I would recommend to prefer the negative binomial model over observation-level random effects, because this model will be easier to test in DHARMa and its dispersion can be easier modeled, e.g. with glmmTMB. The third option would be quasi models, which have shorter runtime but otherwise offer few advantages. Note also that quasi models cannot be tested with DHARMa.
Once dispersion is adjusted, you should check for heteroscedasticity (via standard plot, also against all predictors), and for zero-inflation. As noted, zero-inflation tests are often negative, and rather show up as underdispersion. Work through the owl example above.
Proportional data expressed as percentage or fractions of a whole, i.e. non-count-based proportions, is often modeled with beta regressions. Those can be tested with DHARMa. Note that beta regressions are often 0 or 1 inflated. Both should be tested with testZeroInflation or testGeneric.
Note: discrete proportions, i.e. count-based proportions, of the type k/n should NOT be modeled with the beta regression. Use the binomial (see below).
Binomial data behave slightly different depending on whether we have a 0/1 response (Bernoulli) or a k/n response (true binomial).
A k/n response, in particular when n is large, will behave similar to the Poisson, in that it approaches a normal distribution with fixed dispersion for large n and becomes more asymmetric at its borders (for k = 0 or n). You should check for dispersion and consider a beta-binomial in cases of overdispersion.
Things are a bit different for the 0/1 response. Let’s look at the residuals of a clearly misspecified binomial model (missing predictor) with 0/1 response data.
testData = createData(sampleSize = 500, overdispersion = 0, fixedEffects = 5,
family = binomial(), randomEffectVariance = 3,
numGroups = 25)
fittedModel <- glm(observedResponse ~ 1, family = "binomial", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)If you would do the same with a binomial k/n response or count data, such a misspecification would produce overdispersion (try it out). For the 0/1 response we see neither dispersion problems nor a misfit in the general res ~ predicted plot.
Even though the misfit is clearly visible if we plot the residuals against the missing predictor.
However, we can see the overdispersion arising from the misfit if we group our residuals, which basically transforms the 0/1 response to a k/n response.
To show this, let’s look at the dispersion test for the same model, once ungrouped (left), and grouped according to the variable group which was in the data (right).
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0023, p-value = 0.632
## alternative hypothesis: two.sided
simulationOutput2 = recalculateResiduals(simulationOutput, group = ~group)
testDispersion(simulationOutput2)##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 2.8481, p-value < 2.2e-16
## alternative hypothesis: two.sided
In general, you can group according to any variable that you like, including continuous variables or space. However, some variables make more sense than others. To understand this, consider a simulation where we create binomial data with true probabilities p, drawn from a uniform distribution:
The important thing to note here is that rbinom(n, 1, p)
will generate an identical overall distribution as
rbinom(1, n, mean(p)). What that means: a misfit of the
prediction will not show up unless they are wrong in the overall mean.
Consequently, if we fit:
fit <- glm(obs ~ 1, family = "binomial") # wrong model, assumes equal probabilities
library(DHARMa)
res <- simulateResiduals(fit, plot = TRUE) # nothing to seeWe see no misfit, and neither do we see a misfit when we group the data points randomly, as the mean of a random group is still fit well by the overall mean.
However, we see the misfit / overdispersion if we aggregate according to something that is correlated to the misfit. For our example, let’s just group according to the true means:
grouping = cut(p, breaks = quantile(p, seq(0,1,0.02)))
res3 <- recalculateResiduals(res, group = grouping)
plot(res3)In this case, the groups have a different mean than the fitted grand mean, and the misfit shows up in the residuals. Thus, what we are looking for in the grouping is variables that may correlate with the misfit.
If you don’t have a natural grouping variable, you can introduce arbitrary grouping variables, e.g. via discretizing a predictor or the response (usually preferable), and grouping according to that, or via discretizing space (e.g. group observation in spatial blocks). The pattern appears only, however, if the grouping variable correlates with the model error. Consider the following example:
We create data and fit a model with a missing predictor (Environment2):
set.seed(123)
testData = createData(sampleSize = 500, overdispersion = 0,
fixedEffects = c(0,3), family = binomial(),
randomEffectVariance = 3, numGroups = 50)Apparently, no problem with the residuals for the misfitted model:
res <- simulateResiduals(fittedModel = fittedModel)
fittedModel <- glm(observedResponse ~ Environment1, family = "binomial",
data = testData)
plot(res)grouping = as.factor(sample.int(50, 500, replace = TRUE))
res2 = recalculateResiduals(res , group = grouping)
plot(res2)x = predict(fittedModel)
grouping = cut(x, breaks = quantile(x, seq(0,1,0.02)))
res2 = recalculateResiduals(res , group = grouping)
plot(res2)x = testData$Environment2
grouping = cut(x, breaks = quantile(x, seq(0,1,0.02)))
res2 = recalculateResiduals(res , group = grouping)
plot(res2)x = testData$x
grouping = cut(x, breaks = quantile(x, seq(0,1,0.02)))
res3 = recalculateResiduals(res , group = grouping)
plot(res3)Conclusions: if you see overdispersion or a pattern after grouping, it highlights a model error that is structured by group. As the pattern usually highlights a model misfit, rather than a dispersion problem akin to what happens in an overdispersed binomial (which has major impacts on p-values and CIs), I view this binomial grouping pattern as less critical. Likely, most conclusions will not change if you ignore the problem. Nevertheless, you should try to understand the reason for it. When the group is spatial, it could be a sign of residual spatial autocorrelation which could be addressed by a spatial RE or a spatial model. When grouped by a continuous variable, it could be a sign of a nonlinear effect.
lm and glm and MASS::glm.nb are fully supported.
lme4 model classes are fully supported.
Possible to condition on specific REs via re.form, see help of predict.merMod, simulate.merMod and ?simulateResiduals.
When using mgcv with DHARMa, it is highly recommended to also install mgcViz. Since version 0.4.5, this will allow DHARMa to fall back on the simulate.gam function in mgcViz, which is more general than the default simulate function. For example, without mgcViz, it will not be possible to simulate from mgcv::gam objects fitted with extended families.
If you absolutely want to use DHARMa without mgcViz, you should make sure that simulate(model) works correctly for the model object for which you want to calculate DHARMa residuals.
Note that mgcv::gam models allow simulations only conditional on the fitted random effects; unconditional simulations are not possible.
Models fitted with gamm4 return a list that contains an lme4 object under the name “mer”. You can test this object like an lme4 model, so e.g. simulateResiduals(myGamm4Model$mer). All remarks regarding lme4 objects apply.
glmmTMB is nearly fully supported since DHARMa 0.2.7 and glmmTMB 1.0.0. As of DHARMa 0.5.0 there is an implemented option to adjust whether simulations should be conditional on the fitted random effects or not by using the simulateREs argument in simulateResiduals. Conditional simulations are, however, not available for glmmTMB models with certain (spatial) covariance structures.
spaMM is supported by DHARMa since 0.2.1
Possible to condition on specific REs via re.form, see help of simulate.HLfit and ?simulateResiduals.
GLMMadaptive is supported by DHARMa since 0.3.4.
phylom (version >= 2.6.5) is supported by DHARMa since 0.4.7 for both model classes phylolm and phyloglm.
DHARMa residuals work with phyr, but the correct implementation is not fully tested as of DHARMa 0.4.2. See also https://github.com/florianhartig/DHARMa/issues/235
Simple brms models are supported as of DHARMa version 0.5.0. This means that essentially, all models you could also fit with glmmTMB are possible to check with DHARMa. Multivariate models (multi-response, structural equation models, multinomial) are not supported. Note that predictions are based on the median of the posterior. See also vignette “DHARMa for Bayesians” and https://github.com/florianhartig/DHARMa/issues/33 As brms doesn’t provide Pearson residuals, testDispersion with type = “PearsonCisq” will not work. Furthermore think carefully if using simulateResiduals with refit = TRUE makes sense for brms models (for example, for testDispersion it doesn’t make sense because the test will be a nonparametric Pearson residuals test but brms doesn’t provide Pearson residuals.)
If confronted with an unsupported package, DHARMa will
try to use standard S3 functions such as coef(),
simulate() etc. to perform simulations. If no error occurs,
a residual object will be calculated, and a warning will be provided
that this package has not been checked for full functionality. In many
cases, the results can be used though (but no guarantee, maybe check
with null simulations if the results are OK). Other than that, see my
general comments about adding
new R packages to DHARMa
DHARMa can also import external simulations from a fitted model via
createDHARMa(), which will be interesting for unsupported
packages and for Bayesians.
Bayesians should note the extra Vignette on “DHARMa for Bayesians” regarding the interpretation of these residuals.
Here, an example for how to create simulations for a Poisson glm. Of
course, it doesn’t make sense to do this as glm is a supported model
class, but you could do the same in case you want to check a model class
that is currently not supported by DHARMa.
testData = createData(sampleSize = 200, overdispersion = 0.5, family = poisson())
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson",
data = testData)
simulatePoissonGLM <- function(fittedModel, n){
pred = predict(fittedModel, type = "response")
nObs = length(pred)
sim = matrix(nrow = nObs, ncol = n)
for(i in 1:n) sim[,i] = rpois(nObs, pred)
return(sim)
}
sim = simulatePoissonGLM(fittedModel, 100)
DHARMaRes = createDHARMa(simulatedResponse = sim,
observedResponse = testData$observedResponse,
fittedPredictedResponse = predict(fittedModel),
integerResponse = TRUE)
#plot(DHARMaRes, quantreg = FALSE)