---
title: "2 Automatic differentiation via RTMB"
author: "Jan-Ole Fischer"
output: rmarkdown::html_vignette
# output: pdf_document
vignette: >
  %\VignetteIndexEntry{LaMa_and_RTMB}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "85%",
  fig.align = "center",
  error = TRUE
)
```

> Before diving into this vignette, we recommend reading the vignette [**Introduction to LaMa**](https://janolefi.github.io/LaMa/articles/Intro_to_LaMa.html).

Automatic differentiation (AD) is available in `R` through the  `RTMB` package. Using `RTMB`, you can define the likelihood for any statistical model as plain `R` code and have access to fast and *exact* derivatives of any order (without doing any calculations by hand!). This enables the fitting of very complicated models, potentially with complex random effect structures.

`LaMa` is fully compatible with AD provided by `RTMB`. Estimation of latent Markov models with AD is much faster and more convenient, while model specification is very smooth and less prone to errors. Let's find out how this works!
We begin by loading the `LaMa` package, which automatically loads `RTMB` as well.

```{r setup}
library(LaMa)
```

For the purpose of this vignette, we will analyse the `trex` data set contained in the package. It contains hourly step lengths of a Tyrannosaurus rex, living 66 million years ago, and we aim to understand its behavioural process using HMMs.
```{r data}
head(trex, 5)
```

The workflow with `RTMB` is always very similar. We need to 

1) define the negative log-likelihood function, 
2) turn it into an AD objective function, and
3) fit the model by numerical optimisation of the latter. 

`RTMB` provides many functions that make the process of specifying a likelihood function and calculating standard deviations of estimates really simple.

We start by fitting a simple stationary HMM with state-dependent gamma distributions for the step lengths. As a first step, we define the initial parameter list `par` containing unconstrained (real-valued) parameters and a `dat` list that contains the data and potential hyperparameters -- here $N$, the number of hidden states. The latter is not strictly necessary but provides a neat way of wrapping everything needed for the model fit.

```{r parameters}
par = list(
  log_mu = log(c(0.3, 1)),      # initial means for step length (log-transformed)
  log_sigma = log(c(0.2, 0.7)), # initial sds for step length (log-transformed)
  eta = rep(-2, 2)              # initial t.p.m. parameters (on logit scale)
  )    

dat = list(
  step = trex$step,   # hourly step lengths
  N = 2
  )
```

Because `par` is a named list, accessing the parameters inside the likelihood is very convenient -- annoying and error-prone indexing of parameter vectors is gone!
We define the negative log-likelihood function in a similar fashion to basic numerical ML:

```{r mllk}
nll = function(par) {
  getAll(par, dat) # makes everything contained available without $
  Gamma = tpm(eta) # computes transition probability matrix from unconstrained eta
  delta = stationary(Gamma) # computes stationary distribution
  # Parameter transformations for strictly positive parameters
  mu = exp(log_mu)
  sigma = exp(log_sigma)
  # Reporting statements for later use
  REPORT(mu); ADREPORT(mu)
  REPORT(sigma); ADREPORT(sigma)
  # Calculating all state-dependent densities
  allprobs = matrix(1, nrow = length(step), ncol = N)
  ind = which(!is.na(step)) # only for non-NA obs.
  for(j in 1:N){
    allprobs[ind,j] = dgamma2(step[ind], mu[j], sigma[j])
  }
  -forward(delta, Gamma, allprobs) # simple forward algorithm
}
```

A few points should be made here:

- The likelihood is a function of the parameters to be estimated *only* while data and other parameters are not passed as an argument at this stage. This is something to get used to (I know), but just the way `RTMB` works.

- The `getAll()` function is very useful. It unpacks both the `par` and the `dat` list, making all elements available without the `$` operator.

> **Important:** `nll` reads `dat` directly from the global environment at the time `MakeADFun()` is called, which *bakes* a snapshot of `dat` into the compiled objective function. Any changes to `dat` after that point will have **no effect** on the optimisation. Always finalise `dat` before calling `MakeADFun()`.

- Parameter transformations are still necessary: all parameters in `par` should be unconstrained and transformed to their natural scale inside the likelihood.

- The `REPORT()` function offered by `RTMB` is really convenient. Any quantities calculated in the likelihood function (for which you have written the code anyway), if reported, will be available after optimisation, while the report statements are ignored during optimisation.

- For simple parameter transformations, `ADREPORT()` is also great. It calculates standard deviations for `ADREPORT()`ed quantities, based on the delta method. Just note that the delta method is not advisable for complex non-linear and multivariate transformations.

Having defined the negative log-likelihood, we can now create the automatically differentiable objective function. At this point, `RTMB` takes `nll` and generates its own (very fast) version of it, including a gradient. We set `silent = TRUE` to suppress printing of the optimisation process.

```{r ADfunction, message = FALSE}
obj = MakeADFun(nll, par, silent = TRUE) # creating the AD objective function
```

Let's check out `obj`:
```{r tmbobject}
names(obj)
```
It contains the initial parameter `par` (now transformed to a named vector), the objective function `fn` (which in this case just evaluates `nll` but faster), its gradient `gr` and Hessian `he`.
If we call these functions without any argument, we get the corresponding values at the initial parameter vector.
```{r tmbobject2}
obj$par
obj$fn()
obj$gr()
```
We are now ready to fit the model by optimising the objective function. The routine `nlminb()` is very robust and allows us to provide a gradient function.
While we also have access to the Hessian function, we do not supply it to `nlminb()` because optimisation is much faster if we use a quasi-Newton algorithm that approximates the current Hessian based on previous gradient evaluations, compared to using full Newton-Raphson.

```{r modelfit}
opt = nlminb(obj$par, obj$fn, obj$gr) # minimisation
```

We can check out the estimated parameter and function value by
```{r optpar}
opt$par
opt$objective
```
Note that the names `par` and `objective` are determined by `nlminb()`. If you use a different optimiser, these may be called differently.

However, `RTMB` provides a nicer way to get the estimates. `obj` is automatically updated after the optimisation. Note that calling `obj$gr()` after optimisation now gives the gradient at the optimum, while `obj$par` is not updated but still the initial parameter vector (kind of confusing).
To get our estimated parameters on their natural scale, we don't have to do the backtransformation manually. We can just run the reporting:

```{r MLe}
mod = report(obj) # runs the reporting from the negative log-likelihood once
(delta = mod$delta) # stationary distribution
(Gamma = mod$Gamma) # estimated tpm
(mu = mod$mu) # estimated state-dependent means
(sigma = mod$sigma) # estimated state-dependent sds
```
This works because of the `REPORT()` statements in the likelihood function. 
Note that `delta`, `Gamma` and `allprobs` are always reported by default when using `forward()` which is very useful for downstream tasks like state decoding or residuals calculation. The `viterbi` and `stateprobs` functions either take these as arguments or, more conveniently, can also take the whole reported `mod` object as an input. 
Because the state-dependent parameters depend on the specific model formulation, these need to be reported manually by the user specifying the likelihood. Having all the parameters, we can plot the decoded time series

```{r decoding, fig.width = 7, fig.height = 4}
# manually
mod$states = viterbi(mod$delta, mod$Gamma, mod$allprobs)

# or much simpler
mod$states = viterbi(mod = mod)

# defining color vector
color = c("orange", "deepskyblue")

plot(trex$step[1:200], type = "h", xlab = "time", ylab = "step length", 
     col = color[mod$states[1:200]], bty = "n")
legend("topright", col = color, lwd = 1, legend = c("state 1", "state 2"), bty = "n")
```

or the estimated state-dependent distributions.

```{r statedepdist, fig.width = 8, fig.height = 4}
hist(trex$step, prob = TRUE, breaks = 40, 
     bor = "white", main = "", xlab = "step length")
for(j in 1:2) curve(delta[j] * dgamma2(x, mu[j], sigma[j]), 
                    lwd = 2, add = T, col = color[j])
curve(delta[1]*dgamma2(x, mu[1], sigma[1]) + delta[2]*dgamma2(x, mu[2], sigma[2]), 
      lwd = 2, lty = 2, add = T)
legend("top", lwd = 2, col = color, legend = c("state 1", "state 2"), bty = "n")
```

Note that because we have used `report()`, `mod` now has class "LaMaModel", which permits the use of `AIC()` and `BIC()`:

```{r AICBIC}
AIC(mod)
BIC(mod)
```


We can also use the `sdreport()` function to directly give us standard errors for our unconstrained parameters and everything we `ADREPORT()`ed. 
```{r sdreport}
sdr = sdreport(obj)
```

We can then get an overview of the estimated parameters and `ADREPORT()`ed quantities as well as their standard errors by
```{r sdreport2}
summary(sdr) # see ?summary.sdreport for more info
```
To get the estimated parameters or their standard errors in list format, type
```{r sdreport3, eval = F}
# estimated parameter in list format
as.list(sdr, "Estimate")
# standard errors in list format
as.list(sdr, "Std")
```
and to get the estimates and standard errors for `ADREPORT()`ed quantities in list format, type
```{r sdreport4, eval = F}
# adreported parameters as list
as.list(sdr, "Estimate", report = TRUE)
# their standard errors
as.list(sdr, "Std", report = TRUE)
```

Lastly, the automatic reporting with `LaMa` and `RTMB` together makes calculating pseudo-residuals really convenient:

```{r pres, fig.width = 8, fig.height = 4}
pres_step = pseudo_res(obs = trex$step, # observation sequence
                       dist = "gamma2", # which parametric CDF to use
                       par = list(mean = mu, sd = sigma), # estimated pars for that CDF
                       mod = mod) # model object
plot(pres_step, hist = TRUE)
```


<!-- ### Covariate effects -->

<!-- We can generalise the previous model to include covariate effects. In our example, we might be interested how the T-rex's behaviour varies with the time of day. Hence, we add diel variation to the state process. For example, we can model the transition probabilities as a function of the time of day using a trigonometric basis expansion to ensure diurnal continuity. The transition probabilities are given by -->
<!-- $$ -->
<!-- \text{logit}(\gamma_{ij}^{(t)}) = \beta_0^{(ij)} + \beta_1^{(ij)} \sin \bigl(\frac{2 \pi t}{24}\bigr) + \beta_2^{(ij)} \cos \bigl(\frac{2 \pi t}{24}\bigr) + \beta_3^{(ij)} \sin \bigl(\frac{2 \pi t}{12}\bigr) + \beta_4^{(ij)} \cos \bigl(\frac{2 \pi t}{12}\bigr), -->
<!-- $$ -->
<!-- where $t$ is the time of day. -->

<!-- To practically achieve this, we compute the trigonometric basis design matrix `Z` corresponding to above predictor and add the time of day to the `dat` list for indexing inside the likelihood function. The `LaMa` function `cosinor()` does this very conveniently. We can either call it directly to build the design matrix, or use it in a `formula` passed to `make_matrices()`. The latter is preferable when dealing with more complicated models. Note that the first option does not include an intercept column, while the second does. When used with `tpm_g()` this does not matter as it automatically checks if an intercept column is included. -->

<!-- ```{r tod} -->
<!-- Z = cosinor(1:24, period = c(24, 12)) -->

<!-- modmat = make_matrices(~ cosinor(tod, period = c(24, 12)),  -->
<!--                        data = data.frame(tod = 1:24)) -->
<!-- Z = modmat$Z -->

<!-- # only compute the 24 unique values and index later for entire time series -->
<!-- dat$Z = Z # adding design matrix to dat -->
<!-- dat$tod = trex$tod # adding time of day to dat for indexing -->
<!-- ``` -->

<!-- We also need to change the parameter list `par` to include the regression parameters for the time of day. The regression parameters for the state process will typically have the form of a $N (N-1) \times p+1$ matrix, where $N$ is the number of states and $p$ is the number of regressors -- this format is also expected by `tpm_g()` which computes the array of transition matrices based on the design and parameter matrix. Another lovely convenience that `RTMB` allows for is that, in our parameter list, we can have matrices, making reshaping of vectors to matrices inside the likelihood function unnecessary. -->

<!-- ```{r todpar} -->
<!-- par = list(logmu = log(c(0.3, 1)),  -->
<!--            logsigma = log(c(0.2, 0.7)), -->
<!--            logkappa = log(c(0.2, 0.7)), -->
<!--            beta = matrix(c(rep(-2, 2),  -->
<!--                            rep(0, 2*4)), nrow = 2)) # 2 times 4+1 matrix -->
<!-- # replacing eta with regression parameter matrix, initializing slopes at zero -->
<!-- ``` -->

<!-- We can now define a more general likelihood function with the main difference being the use of `tpm_g()` instead of `tpm()` and the inclusion of the time of day in the transition matrix calculation. This leads to us using `stationary_p()` instead of `stationary()` to calculate the initial distribution and `forward_g()` instead of `forward()` to calculate the log-likelihood. -->

<!-- ```{r mllk2} -->
<!-- nll2 = function(par) { -->
<!--   getAll(par, dat) # makes everything contained available without $ -->
<!--   Gamma = tpm_g(Z, beta) # covariate-dependent tpms (in this case only 24 unique) -->
<!--   # tpm_g() automatically checks if intercept column is included -->
<!--   ADREPORT(Gamma) # adreporting -->
<!--   Delta = stationary_p(Gamma) # all periodically stationary distributions -->
<!--   ADREPORT(Delta) -->
<!--   delta = Delta[tod[1],] # initial periodically stationary distribution -->
<!--   # exponentiating because all parameters strictly positive -->
<!--   mu = exp(logmu); REPORT(mu) -->
<!--   sigma = exp(logsigma); REPORT(sigma) -->
<!--   kappa = exp(logkappa); REPORT(kappa) -->
<!--   # calculating all state-dependent densities -->
<!--   allprobs = matrix(1, nrow = length(step), ncol = N) -->
<!--   ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs. -->
<!--   for(j in 1:N){ -->
<!--     allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j]) * -->
<!--       dvm(angle[ind],0,kappa[j]) -->
<!--   } -->
<!--   -forward_g(delta, Gamma[,,tod], allprobs) # indexing 24 unique tpms by tod in data -->
<!-- } -->
<!-- ``` -->

<!-- Having done this, the model fit is then essentially the same: -->

<!-- ```{r modelfit2} -->
<!-- obj2 = MakeADFun(nll2, par, silent = TRUE) # creating the objective function -->
<!-- opt2 = nlminb(obj2$par, obj2$fn, obj2$gr) # optimisation -->
<!-- ``` -->

<!-- and we can look at the reported results. In this case, for simplicity I get standard errors for `Gamma` with the delta method while, in general, this is not advisable. -->

<!-- ```{r MLE2, fig.width = 8, fig.height = 5} -->
<!-- mod2 = report(obj2) -->

<!-- sdr = sdreport(obj2) -->
<!-- Gamma = as.list(sdr, "Estimate", report = TRUE)$Gamma -->
<!-- Gammasd = as.list(sdr, "Std", report = TRUE)$Gamma -->

<!-- Delta = as.list(sdr, "Estimate", report = TRUE)$Delta -->
<!-- Deltasd = as.list(sdr, "Std", report = TRUE)$Delta -->

<!-- tod_seq = seq(0, 24, length = 200) # sequence for plotting -->
<!-- Z_pred = predict(modmat, newdata = data.frame(tod = tod_seq)) -->

<!-- Gamma_plot = tpm_g(Z_pred, mod2$beta) # interpolating transition probs -->

<!-- plot(tod_seq, Gamma_plot[1,2,], type = "l", lwd = 2, ylim = c(0,1), -->
<!--      xlab = "time of day", ylab = "transition probability", bty = "n") -->
<!-- segments(x0 = 1:24, y0 = Gamma[1,2,]-1.96*Gammasd[1,2,],  -->
<!--          y1 = Gamma[1,2,]+1.96*Gammasd[1,2,]) -->
<!-- segments(x0 = 1:24, y0 = Gamma[2,1,]-1.96*Gammasd[2,1,],  -->
<!--          y1 = Gamma[2,1,]+1.96*Gammasd[2,1,]) -->
<!-- lines(tod_seq, Gamma_plot[2,1,], lwd = 2, lty = 3) -->
<!-- legend("topleft", lwd = 2, lty = c(1,3), bty = "n", -->
<!--        legend = c(expression(gamma[12]^(t)), expression(gamma[21]^(t)))) -->
<!-- plot(Delta[,2], type = "b", lwd = 2, xlab = "time of day", ylab = "Pr(active)",  -->
<!--      col = "deepskyblue", bty = "n", xaxt = "n") -->
<!-- segments(x0 = 1:24, y0 = Delta[,2]-1.96*Deltasd[,2], lwd = 2, -->
<!--          y1 = Delta[,2]+1.96*Deltasd[,2], col = "deepskyblue") -->
<!-- axis(1, at = seq(0,24,by=4), labels = seq(0,24,by=4)) -->
<!-- ``` -->

<!-- ### Penalized splines -->

<!-- We can go one step further and model the transition probabilities as smooth functions of the time of day using cyclic P-splines, i.e. -->
<!-- $$ -->
<!-- \text{logit}(\gamma_{ij}^{(t)}) = \beta_0^{(ij)} + s_{ij}(t), -->
<!-- $$ -->
<!-- where $s_{ij}(t)$ is a smooth periodic function of time of day. `LaMa` provides the function `make_matrices()` which creates design and penalty matrices based on the R package `mgcv` when provided with a formula and data. Hence, we can use standard `mgcv` syntax to create the matrices for cyclic P-splines (`cp`). We then append both to the `dat` list. -->

<!-- ```{r tod2} -->
<!-- modmat = make_matrices(~ s(tod, bs = "cp"),  -->
<!--                        data = data.frame(tod = 1:24), -->
<!--                        knots = list(tod = c(0,24))) # where to wrap the cyclic basis -->
<!-- Z = modmat$Z # spline design matrix -->
<!-- S = modmat$S # penalty matrix -->
<!-- dat$Z = Z -->
<!-- dat$S = S[[1]] # mgcv returns a list of penalty matrices (even if only one smooth) -->
<!-- ``` -->

<!-- We have to change our likelihood function slightly by adding the penalization. For this we use the `penalty()` function contained in `LaMa` that computes the sum of quadratic form penalties (the standard penalty used for penalized splines) based on the penalty matrices, the parameters to be estimated and the penalty strength parameters. -->

<!-- Importantly, we now have to separate the non-penalized intercept `beta0` from the penalized spline coefficients now called `betaspline`. The latter, we again conveniently initialize as a matrix, each row representing the coefficient vector for one off-diagonal element of the t.p.m. -->

<!-- ```{r mllk3} -->
<!-- pnll = function(par) { -->
<!--   getAll(par, dat) # makes everything contained available without $ -->
<!--   Gamma = tpm_g(Z, cbind(beta0, betaspline)); ADREPORT(Gamma) -->
<!--   Delta = stationary_p(Gamma); ADREPORT(Delta) -->
<!--   delta = Delta[tod[1],] -->
<!--   # exponentiating because all parameters strictly positive -->
<!--   mu = exp(logmu); REPORT(mu) -->
<!--   sigma = exp(logsigma); REPORT(sigma) -->
<!--   kappa = exp(logkappa); REPORT(kappa) -->
<!--   # calculating all state-dependent densities -->
<!--   allprobs = matrix(1, nrow = length(step), ncol = N) -->
<!--   ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs. -->
<!--   for(j in 1:N){ -->
<!--     allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j])*dvm(angle[ind],0,kappa[j]) -->
<!--   } -->
<!--   -forward_g(delta, Gamma[,,tod], allprobs) + -->
<!--     penalty(betaspline, S, lambda) # this does all the penalization work -->
<!-- } -->
<!-- ``` -->

<!-- We also have to append a `lambda` argument to our `dat` list, which is the initial penalty strength parameter vector. In this case of length two because our coefficient matrix has two rows. -->

<!-- ```{r todpar2} -->
<!-- par = list(logmu = log(c(0.3, 2.5)),  -->
<!--            logsigma = log(c(0.2, 1.5)), -->
<!--            logkappa = log(c(0.2, 1.5)), -->
<!--            beta0 = c(-2, 2), # intercept now separated! -->
<!--            betaspline = matrix(rep(0, 2*(ncol(Z)-1)), nrow = 2)) -->

<!-- dat$lambda = rep(100, 2) # adding initial penalty strength to the dat list -->
<!-- ``` -->

<!-- The model fit can then be conducted by using the `qreml()` function contained in `LaMa`. **qREML** stands for **quasi restricted maximum likelihood** and finds a good penalty strength by treating the spline coefficients as random effects. Under the hood, `qreml()` also constructs an AD function with `RTMB` but uses the **qREML** algorithm described in Koslik (2024) to fit the model. We have to tell the `qreml()` function which parameters are spline coefficients by providing the name of the corresponding list element of `par`. -->

<!-- There are some rules to follow when using `qreml()`: -->

<!-- 1. The likelihood function needs to be `RTMB`-compatible, i.e. have the same structure as all the likelihood functions in our vignette -- most importantly, it should only be a function of the parameter list. -->
<!-- 3. The penalty strength vector `lambda` needs its length to correspond to the *total* number of spline coefficient vectors used. In our case, this is the number of rows of betaspline, but if we additionally had a different spline coefficient in our parameter list (that may have a different length and a different penalty matrix), we would have needed more elements in `lambda`. -->
<!-- 4. The `penalty()` function can only be called *once* in the likelihood. If several spline coefficients are penalized, `penalty()` expects a list of coefficient matrices or vectors and a list of penalty matrices. -->
<!-- 5. When we summarise multiple spline coefficients in a matrix in our parameter list -- which is very useful when these are of same lengths and have the same penalty matrix -- this matrix must be arranged by row, i.e. each row is one spline coefficient vector. If it is arranged by column, `qreml()` will fail. -->

<!-- ```{r qreml, message = FALSE} -->
<!-- system.time( -->
<!--   mod3 <- qreml(pnll, par, dat, random = "betaspline") -->
<!-- ) -->
<!-- ``` -->
<!-- The `mod` object is now a list that contains everything that is reported by the likelihood function, but also the `RTMB` object created in the process. After fitting the model, we can also use the `LaMa` function `pred_matrix()`, that takes the `modmat` object we created earlier, to build a new interpolating design matrix using the exact same basis expansion specified above. This allows us to plot the estimated transition probabilities as a smooth function of time of day -- I now ignore confidence bands due to laziness. -->

<!-- ```{r results qreml, fig.width = 8, fig.height = 5} -->
<!-- sdr = sdreport(mod3$obj) -->
<!-- Gamma = as.list(sdr, "Estimate", report = TRUE)$Gamma -->
<!-- Delta = as.list(sdr, "Estimate", report = TRUE)$Delta -->

<!-- tod_seq = seq(0,24, length=200) -->
<!-- Z_pred = pred_matrix(modmat, data.frame(tod = tod_seq)) -->

<!-- Gamma_plot = tpm_g(Z_pred, mod3$beta) # interpolating transition probs -->

<!-- plot(tod_seq, Gamma_plot[1,2,], type = "l", lwd = 2, ylim = c(0,1), -->
<!--      xlab = "time of day", ylab = "transition probability", bty = "n") -->
<!-- lines(tod_seq, Gamma_plot[2,1,], lwd = 2, lty = 3) -->
<!-- legend("topleft", lwd = 2, lty = c(1,3), bty = "n", -->
<!--        legend = c(expression(gamma[12]^(t)), expression(gamma[21]^(t)))) -->
<!-- plot(Delta[,2], type = "b", lwd = 2, xlab = "time of day", ylab = "Pr(active)",  -->
<!--      col = "deepskyblue", bty = "n", xaxt = "n") -->
<!-- axis(1, at = seq(0,24,by=4), labels = seq(0,24,by=4)) -->

<!-- ``` -->

<!-- We see that by allowing for a more flexible relationship, the estimated time of day effect becomes stronger with even sharper peaks than we would have concluded using the trigonometric approach. -->

<!-- ### Full Laplace method -->

<!-- Lastly, we could have achieved a similar fit as above using the slightly more accurate full Laplace approximation method, which can be used to fit models via marginal maximum likelihood estimation by integrating out the random effects. This is natively supported by `RTMB` -- and actually one of its core selling points -- and the standard way we can now deal with all kinds of random effects.  -->

<!-- Indeed, the **qREML** algorithm above treats the spline coefficients as Gaussian random effects but exploits their relatively simple structure yielding a more efficient fitting method. The full Laplace method is much more general, allowing for very flexible random effects, but here, estimation slower because it does not exploit the simple structure of splines treated as random effects. -->

<!-- We have to alter our likelihood function slightly, because for the Laplace method, we need to implement the joint likelihood of the data and the random effect, the latter having a multivariate normal distribution. Specifically, if $b$ is our random effect for a spline, $b \sim N(0, \lambda^{-1} S^-)$. The likelihood of the data given $b$ (just our regular likelihood that treats $b$ as a parameter) is $f(x \mid b)$ and the density of $b$ is $f_{\lambda}(b)$. Hence the joint likelihood can be computed as -->
<!-- $$ -->
<!-- f(x, b) = f(x \mid b) f_{\lambda}(b) -->
<!-- $$ -->
<!-- and the joint negative log-likelihood becomes $- \log f(x \mid b) - \log f_{\lambda}(b)$ and this is what we implement below. -->

<!-- Most conveniently this is done by using the `dgmrf2()` function included in `LaMa` which provides the density function of the multivariate normal distribution reparametrized in terms of the (scaled) precision matrix, i.e. inverse covariance matrix, which in our case is $\lambda_i S$ for spline $i$. It allows evaluating at multiple points at once, each one possibly with its own penalty strength parameter `lambda`. It differs from `RTMB`'s `dgmrf()` by not expecting a *sparse* precision matrix and being more robust for rank-deficient penalty matrices, which are typical for penalized splines. -->

<!-- ```{r mllk4} -->
<!-- jnll = function(par) { -->
<!--   getAll(par, dat) # makes everything contained available without $ -->
<!--   Gamma = tpm_g(Z, cbind(beta0, betaspline)); ADREPORT(Gamma) -->
<!--   Delta = stationary_p(Gamma); ADREPORT(Delta) -->
<!--   delta = Delta[tod[1],] -->
<!--   # exponentiating because all parameters strictly positive -->
<!--   mu = exp(logmu); REPORT(mu) -->
<!--   sigma = exp(logsigma); REPORT(sigma) -->
<!--   kappa = exp(logkappa); REPORT(kappa) -->
<!--   # calculating all state-dependent densities -->
<!--   allprobs = matrix(1, nrow = length(step), ncol = N) -->
<!--   ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs. -->
<!--   for(j in 1:N){ -->
<!--     allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j])*dvm(angle[ind],0,kappa[j]) -->
<!--   } -->
<!--   -forward_g(delta, Gamma[,,tod], allprobs) - -->
<!--     sum(dgmrf2(betaspline, 0, S, exp(loglambda), log = TRUE)) # just like any other density in R -->
<!-- } -->
<!-- ``` -->

<!-- We also have to include the log of our penalty strength as a parameter now. -->

<!-- ```{r todpar3} -->
<!-- par$loglambda = log(rep(100, 2)) -->
<!-- ``` -->

<!-- To create the objective function, we need to tell `RTMB` that `betaspline` is a random effect such that it is integrated out and as our objective function we have the marginal likelihood -->
<!-- $$ -->
<!-- f(x) = \int f(x, b) \,db, -->
<!-- $$ -->
<!-- actually its negative log of course. -->

<!-- ```{r refit, message = FALSE, eval = FALSE} -->
<!-- obj4 = MakeADFun(jnll, par, random = "betaspline", silent = TRUE) -->
<!-- system.time( -->
<!--   opt4 <- nlminb(obj4$par, obj4$fn, obj4$gr) -->
<!-- ) -->
<!-- ``` -->

<!-- This more general algorithm takes more than ten times model as long to fit the model. Hence, the above code is not evaluated. The results however are very similar. -->

<!-- ```{r results refit, fig.width = 8, fig.height = 5}
mod4 = obj4$report()
mod4$Gamma = tpm_g(Z, mod4$beta) # calculating 24 tpms

Gamma_plot = tpm_g(Z_pred, mod4$beta)
plot(tod_seq, Gamma_plot[1,2,], type = "l", lwd = 2, ylim = c(0,1),
     xlab = "time of day", ylab = "transition probability", bty = "n")
lines(tod_seq, Gamma_plot[2,1,], lwd = 2, lty = 3)
legend("topleft", lwd = 2, lty = c(1,3), bty = "n",
       legend = c(expression(gamma[12]^(t)), expression(gamma[21]^(t))))
``` -->

## Tips and common issues

There are some commonly occuring issues with `RTMB` to keep in mind. I list the main ones I have encountered here -- please let me know if you encounter more so they can be added.

#### Non-standard distributions

If you need a non-standard probability distribution, existing implementations will likely fail due to AD incompatibility. Check whether the distribution is available in [`RTMBdist`](https://janolefi.github.io/RTMBdist/), which provides a library of AD-compatible distributions for use inside likelihood functions.


#### Operator overloading

A typical issue is that some operators may need to be overloaded to allow for automatic differentiation. In typical model setups `LaMa` functions handle this automatically, but if you take a more individualistic route and encounter an error like

```{r error, eval = FALSE}
stop("Invalid argument to 'advector' (lost class attribute?)")
```
you might have to overload the operator yourself. To do this put
```{r overloading}
"[<-" <- ADoverload("[<-")
```
as the first line of your likelihood function. If the error still prevails also add
```{r overloading2}
"c" <- ADoverload("c")
"diag<-" <- ADoverload("diag<-")
```
which should hopefully fix the error.


#### Initialising with `NA`

A common problem occurs when initialising objects with `NA` values and then filling them with `numeric` values. `NA` is logical, which causes type mismatches that break automatic differentiation. Always initialise with `numeric` or `NaN` values instead. For example, avoid
```{r NA, eval = FALSE}
X = array(dim = c(1,2,3))
# which is the same as
X = array(NA, dim = c(1,2,3))
```
and use
```{r NaN, eval = FALSE}
X = array(NaN, dim = c(1,2,3))
# or
X = array(0, dim = c(1,2,3))
```


#### Wrapping arrays in `AD()`

If you create an array and fill it with something parameter-dependent, wrap it in `AD()`:
```{r arrayfill, eval = FALSE}
X <- AD(array(...))
```
This ensures `X` is an AD object from the start and that classes are compatible. Wrapping in `AD()` is generally a good idea to resolve the above error, as it introduces no meaningful overhead when not needed.


#### No `if` statements on parameters

You cannot use `if` statements **on parameters directly**, as these are not differentiable. `RTMB` builds the *tape* (computational graph) at the initial parameter value, so `if` branches depending on a parameter will produce gradients that are wrong elsewhere. `RTMB` will fail, likely without a helpful error message. `if` statements that do not involve parameters are typically fine, since the branch is fixed throughout optimisation.

#### Byte compiler side effects

Some unfortunate side effects of `R`'s byte compiler (enabled by default) can interfere with `RTMB`. If you encounter an error not covered above, try disabling it with

```{r bytecompiler, message = FALSE}
compiler::enableJIT(0)
```

<!-- #### `expm::expm()` incompatibility -->

<!-- `expm::expm()` does not work with AD. Use `Matrix::expm()` instead. -->


#### Further reading

For more information on `RTMB`, see its [documentation](https://CRAN.R-project.org/package=RTMB) or the [TMB users Google group](https://groups.google.com/g/tmb-users).

> Continue reading with [**Penalised splines**](https://janolefi.github.io/LaMa/articles/Penalised_splines.html).
