---
title: "Introduction to mr.mash"
author: Peter Carbonetto & Fabio Morgante
date: "`r Sys.Date()`"
output:
  html_document:
    toc: no
    toc_float: no
    highlight: textmate
    theme: readable
vignette: >
  %\VignetteIndexEntry{Introduction to mr.mash}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The aim of this vignette is to introduce the basic steps of a
mr.mash analysis through a toy example. To learn more about
mr.mash, please see the [paper][mr-mash-biorxiv].

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,
                      fig.align = "center",dpi = 120)
```

First, we set the seed to make the results more easily reproducible,
and we load the "mr.mashr" package.

```{r load-pkgs}
library(mr.mashr)
set.seed(12)
```

We illustrate the application of mr.mash to a data set simulated from
a multivariate, multiple linear regression with 5 responses in which
the coefficients are the same for all responses. In the target
application considered in the paper---prediction of multi-tissue gene
expression from genotypes---this would correspond to the situation in
which we would like to predict expression of a single gene in 5
different tissues from genotype data at multiple SNPs, and the SNPs
have the same effects on gene expression in all 5 tissues. (In
multi-tissue gene expression we would normally like to predict
expression of many genes, but to simplify this vignette here we
illustrate the key ideas with a single gene.)

Although this simulation is not particularly realistic, this is
meant to illustrate the benefits of mr.mash: by modeling the sharing
of effects across tissues, mr.mash is able to more accurately estimate
the effects in multiple tissues, and therefore is able to obtain
better predictions.

We start by simulating 150 samples from a multivariate, multiple
linear regression model in which 5 out of the 800 variables (SNPs)
affect the 5 responses (expression levels).

```{r sim-data-1, results="hide"}
dat <- simulate_mr_mash_data(n = 150,p = 800,p_causal = 5,r = 5,pve = 0.5,
							 V_cor = 0.25)
```

Next we split the samples into a training set (with 100 samples) and
test set (with 50 samples).

```{r sim-data-2}
ntest  <- 50 
Ytrain <- dat$Y[-(1:ntest),]
Xtrain <- dat$X[-(1:ntest),]
Ytest  <- dat$Y[1:ntest,]
Xtest  <- dat$X[1:ntest,]
```

Define the mr.mash prior
------------------------

To run mr.mash, we need to first specify the covariances in the
mixture of normals prior. The idea is that the chosen collection of
covariance matrices should include a variety of potential effect
sharing patterns, and in the model fitting stage the prior should then
assign most weight to the sharing patterns that are present in the
data, and little or no weight on patterns that are inconsistent with
the data. In general, we recommend learning
["data-driven" covariance matrices][mashr-dd-vignette].  But here, for
simplicity, we instead use "canonical" covariances which are not
adaptive, but nonetheless well suited for this toy example since the
true effects are the same across responses/tissues.

```{r choose-prior-1}
S0 <- compute_canonical_covs(r = 5,singletons = TRUE,
                             hetgrid = seq(0,1,0.25))
```

This gives a mixture of 10 covariance matrices capturing a variety of
"canonical" effect-sharing patterns:

```{r choose-prior-2}
names(S0)
```

To illustrate the benefits of modeling a variety of effect-sharing
patterns, we also try out mr.mash with a simpler mixture of covariance
matrices in which the effects are effectively independent across
tissues. Although this may seem to be a very poor choice of prior,
particularly for this example, it turns out that several multivariate
regression methods assume, implicitly or explicitly, this prior.

```{r choose-prior-3}
S0_ind <- compute_canonical_covs(r = 5,singletons = FALSE,
                                 hetgrid = c(0,0.001,0.01))
names(S0_ind)
```

Regardless of the covariance matrices are chosen, it is recommended to
also consider a variety of effect scales in the prior. This is
normally achieved in mr.mash by expanding the mixture across a
specifed grid of scaling factors. Here we choose this grid in an
adaptive fashion based on the data:

```{r choose-prior-4}
univ_sumstats <- compute_univariate_sumstats(Xtrain,Ytrain,standardize = TRUE)
scaling_grid <- autoselect.mixsd(univ_sumstats,mult = sqrt(2))^2
S0 <- expand_covs(S0,scaling_grid)
S0_ind <- expand_covs(S0_ind,scaling_grid)
```

Fit a mr.mash model to the data
-------------------------------

Having specified the mr.mash prior, we are now ready to fit a mr.mash
model to the training data (this may take a few minutes to run). Given
that the majority of the SNPs have no effect, we initialize the mixture
weights with 99\% of the weight on the null component and the rest of the
weight split equally across the remaining components.

```{r fit-mr-mash-1, results="hide"}
null_weight <- 0.99
non_null_weight <- 1-null_weight
w0_init <- c(null_weight, rep(non_null_weight/(length(S0)-1), (length(S0)-1)))
fit <- mr.mash(X=Xtrain,Y=Ytrain,S0=S0, w0=w0_init, update_V = TRUE,
               nthreads = 1)
```

And for comparison we fit a second mr.mash model using the simpler and
less flexible prior:

```{r fit-mr-mash-2, results="hide"}
w0_init <- c(null_weight, rep(non_null_weight/(length(S0_ind)-1), (length(S0_ind)-1)))
fit_ind <- mr.mash(X=Xtrain,Y=Ytrain,S0=S0_ind,w0=w0_init,update_V = TRUE,
                   nthreads = 1)
```

(Notice that the less complex model also takes less time to fit.)

For prediction, the key output is the posterior mean estimates of the
regression coefficients, stored in the "mu1" output. Let's compare the
estimates to the ground truth:

```{r plot-coefs, fig.height=3.5, fig.width=6}
oldpar <- par(mfrow = c(1,2))
plot(dat$B,fit_ind$mu1,pch = 20,xlab = "true",ylab = "estimated",
     main = sprintf("cor = %0.3f",
	                cor(as.vector(dat$B),as.vector(fit_ind$mu1))))
abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
plot(dat$B,fit$mu1,pch = 20,xlab = "true",ylab = "estimated",
     main = sprintf("cor = %0.3f",
	                cor(as.vector(dat$B),as.vector(fit$mu1))))
abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
par(oldpar)
```

As expected, the coefficients on the left-hand side obtained using an
"independent effects" prior are not as accurate as the the
coefficients estimated using the more flexible prior (right-hand side).

While perhaps not of primary interest, for diagnostic purposes it is
often helpful to examine the estimated mixture weights in the prior as
well as the estimated residual covariance matrix.

Inspecting the top prior mixture weights from the better model, it is
helpful to see that the "null" and "shared1" components are among the
top components by weight. (The top component is the null component
because most of the SNPs have no effect on gene expression.)

```{r prior-mixture-weights}
head(sort(fit$w0,decreasing = TRUE),n = 10)
```

Also, reassuringly, the estimated residual variance-covariance matrix
is close to the matrix used to simulate the data:

```{r resid-var}
dat$V
fit$V
```

Use the fitted mr.mash model to make predictions
------------------------------------------------

We can use the fitted mr.mash model to predict gene expression from
a genotype sample, including a sample not included in the training
set. This is implemented by the "predict" method. Let's compare the
predictions from the two mr.mash models:

```{r plot-pred-test, fig.height=3.5, fig.width=6}
oldpar <- par(mfrow = c(1,2))
Ypred <- predict(fit,Xtest)
Ypred_ind <- predict(fit_ind,Xtest)
plot(Ytest,Ypred_ind,pch = 20,col = "darkblue",xlab = "true",
     ylab = "predicted",
     main = sprintf("cor = %0.3f",cor(as.vector(Ytest),as.vector(Ypred_ind))))
abline(a = 0,b = 1,col = "magenta",lty = "dotted")
plot(Ytest,Ypred,pch = 20,col = "darkblue",xlab = "true",
     ylab = "predicted",
     main = sprintf("cor = %0.3f",cor(as.vector(Ytest),as.vector(Ypred))))
abline(a = 0,b = 1,col = "magenta",lty = "dotted")
par(oldpar)
```

Indeed, mr.mash with the more flexible prior (right-hand plot)
produces more accurate predictions than mr.mash with the "independent
effects" prior.

[mr-mash-biorxiv]: https://doi.org/10.1101/2022.11.22.517471 
[mashr-dd-vignette]: https://stephenslab.github.io/mashr/articles/intro_mash_dd.html