%\VignetteIndexEntry{gaston package}
%\VignetteDepends{gaston}
%\VignettePackage{gaston}
%\VignetteEngine{knitr::knitr}

\documentclass{article}
%\usepackage[noae]{Sweave}
\usepackage[top=35mm, bottom=40mm, left=25mm , right=25mm]{geometry}
%\usepackage{moreverb}
\usepackage[utf8]{inputenc}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{etoolbox}

%\setkeys{Gin}{width=0.4\textwidth}
%\SweaveOpts{echo=TRUE, eps=FALSE, pdf=TRUE}

\raggedbottom
\pagestyle{empty}
\parindent0pt
\parskip8pt
\def\thesection{\arabic{section}}
\def\theequation{\arabic{equation}}
\let\epsilon\varepsilon

%<<echo=FALSE>>=
%options(continue=" ", prompt = " ", SweaveHooks=list(fig.mar=function() par(mar=c(5.1,4.1,3.1,2.1))), width=90)
%@
<<echo=FALSE, include=FALSE>>=
require(knitr)
options(width = 90, prompt="> ")
knit_hooks$set(fig.mar=function() par(mar=c(5.1,4.1,3.1,2.1)))
opts_chunk$set(out.width='0.4\\textwidth', fig.align='center', highlight=TRUE, comment=NA, fig.height=6, fig.width=6)
opts_knit$set(unnamed.chunk.label='gaston')
@

%<<prompton, echo=FALSE>>=
%options(prompt="> ", continue = " ");
%@
<<prompton, echo=FALSE>>=
opts_chunk$set(prompt=TRUE, continue = " ");
@

%<<promptoff, echo=FALSE>>=
%options(prompt=" ", continue=" ");
%@
<<promptoff, echo=FALSE>>=
opts_chunk$set(prompt=FALSE, continue=" ");
@

<<echo=FALSE>>=
<<prompton>>
@

<<desc, include=FALSE, echo=FALSE>>=
require(gaston)
desc <- packageDescription("gaston")
@

\title{{\bfseries Gaston}\\
       {\large Version \Sexpr{desc$Version}}}
\author{Hervé Perdry, Claire Dandine-Roulland}


%\DefineVerbatimEnvironment{Sinput}{Verbatim}{}
%\DefineVerbatimEnvironment{Soutput}{Verbatim}{}
%\DefineVerbatimEnvironment{Scode}{Verbatim}{}
%\fvset{listparameters={\setlength{\topsep}{-0.5em}}}
%\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\makeatletter
\preto{\@verbatim}{\topsep=2pt \partopsep=-10pt }
\preto{\alltt}{\topsep=-3pt \partopsep=-10pt }
\makeatother


\begin{document}
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Introduction}

  Gaston offers functions for efficient manipulation of 
  large genotype (SNP) matrices, and state-of-the-art implementation of algorithms
  to fit Linear Mixed Models, that can be used to compute heritability 
  estimates or to perform association tests.
  Thanks to the packages \verb!Rcpp!, \verb!RcppParallel!, \verb!RcppEigen!, Gaston
  functions are mainly written in C++. Many are multithreaded.

  For better performances, we recommend to build Gaston with \verb!clang++!.
  To do so, it is sufficient to create in your home directory
  a file \verb!~/.R/Makevars! containing \verb!CXX1X = clang++! 
  or if you use a version of R $>=$ 3.4.0, \verb!CXX11 = clang++!

  In this vignette, we illustrate Gaston using the included data sets \verb!AGT!, \verb!LCT!,
  and \verb!TTN! (see the corresponding manual pages for a description). 
  Gaston also includes some example files in the \verb!extdata! folder. 
  Not all options of the functions are described here, but rather their basic usage.
  The reader is advised to look at the manual pages for details.

  Note that the package name is written \verb!gaston! when dealing with R commands, 
  but Gaston (with a capital) in human language.

\section{Modifying Gaston's behaviour with global options}

\subsection{Number of threads}

  The number of threads used by multithreaded functions can be modified 
  through \verb!RcppParallel!  function \verb!setThreadOptions!.
  It is advised to try several values for the number of threads, as 
  using too many threads might be counterproductive due to an important
  overhead. The default value set by \verb!RcppParallel! is generally
  too high.

  Some functions have a \verb!verbose! argument, which controls the
  function verbosity. To mute all functions at once you can use 
  \verb!options(gaston.verbose = FALSE)!.

\subsection{Basic statistics computations}

  Since version 1.4, the behaviour of all functions that output a
  matrix of genotypes (a bed.matrix, described in the next section)
  can be modified by setting \verb!options(gaston.auto.set.stats = FALSE)!.
  The effects of this option is described in section \ref{stats}
  below. Note that some examples in the manual pages might not work if
  you use this option.

\subsection{Autosomes, gonosomes, mitochondria}

  Since version 1.4.7, some functions take into account wether a SNP 
  is autosomal, X or Y linked, or mitochondrial. The ids of the corresponding
  chromosomes are defined by options \verb!gaston.autosomes!, \verb!gaston.chr.x!,
  \verb!gaston.chr.y!, \verb!gaston.chr.mt!. Default values are \verb!c(1:22, 25)!,
  \verb!23!, \verb!24! and \verb!26!, following to Plink convention (in this
  convention, \verb!25! corresponds to the pseudo-autosomal XY region: we chose
  to include it in the autosomes).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Genotype matrices}

  An S4 class for genotype matrices is defined, named \verb!bed.matrix!.
  Each row correspond to an individual, and each column to a SNP. 

\subsection{Reading bed.matrices from files}

  Bed.matrices be read from files using \verb!read.bed.matrix!.
  The function \verb!read.vcf! reads VCF files; it relies on the package \verb!WhopGenome!.

  Gaston includes example files that can be used for illustration:

<<>>=
x <- read.bed.matrix( system.file("extdata", "LCT.bed", package="gaston") )
x
@
  
  The folder \verb!extdata/! contains files \verb!LCT.bed!, \verb!LCT.rds!,
  \verb!LCT.bim! and \verb!LCT.fam!. The \verb!.bed!, \verb!.bim! and \verb!.fam! files follow the
  PLINK specifications. The \verb!.rds! file is a R data file; if it is present, the
  \verb!.bim! and \verb!.fam! files are ignored. You can ignore the
  \verb!.rds! file using option \verb!rds = NULL!:

<<>>=
x <- read.bed.matrix( system.file("extdata", "LCT.bed", package="gaston"), rds = NULL )
x
@

  A bed.matrix can be saved using \verb!write.bed.matrix!.  
  The default behavior is to write \verb!.bed!, \verb!.bim!, \verb!.fam! 
  and \verb!.rds! files; see the manual page for more details.

\subsection{Conversion from and to R objects}

  A numerical matrix \verb!x! containing genotype counts (0, 1, 2 or \verb!NA!) can be 
  transformed in a bed.matrix with \verb!as(x, "bed.matrix")!. The resulting
  object will lack individual and SNP informations (if the rownames and colnames
  of \verb!x! are set, they will be used as SNP and individual ids respectively).

  Conversely, a numerical matrix can be retrieved from a bed.matrix using \verb!as.matrix!. 
  
  The function \verb!as.bed.matrix! allows to provide data frames corresponding to 
  the \verb!.fam! and \verb!.bim! files. They should have colnames \verb!famid!, 
  \verb!id!, \verb!father!, \verb!mother!, \verb!sex!, \verb!pheno!, and \verb!chr!, \verb!id!, 
  \verb!dist!, \verb!pos!, \verb!A1!, \verb!A2! respectively. This function is widely used in
  the examples included in manual pages.

<<>>=
data(TTN)
x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)
x
@

\subsection{The insides of a bed.matrix}

  In first approach, a bed.matrix behaves as a "read-only" matrix containing only 
  0, 1, 2 and NAs, unless the genotypes are standardized (use \verb!standardize<-!).
  They are stored in a compact form, each genotype being coded on 2 bits (hence
  4 genotypes per byte). 

  Bed.matrices are implemented using S4 classes and methods.
  Let's have a look on the slots names of the bed.matrix \verb!x! created above using the dataset \verb!LCT!.

<<>>=
data(TTN)
x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)
slotNames(x)
@

  The slot \verb!x@bed! is an external pointer, which indicates where the genetic data are stored in
  memory. It will be used by the C++ functions called by Gaston. 
<<>>=
x@bed
@

  Let's look at the contents of the slots \verb!x@ped! and \verb!x@snps!.
  The other slots will be commented later.

  The slot \verb!x@ped! gives informations on the individuals. 
  The first 6 columns correspond to the contents of a \verb!.fam! file, or to the first 6 columns of a \verb!.ped! file 
  (known as linkage format). The other columns are simple descriptive
  stats that are computed by Gaston, unless \verb!options(gaston.auto.set.stats = FALSE)!
  was set (see below section \ref{stats}).

<<>>= 
dim(x@ped)
head(x@ped)
@

  The slot \verb!x@snps! gives informations on the SNPs. Its first 6
  columns corresponds to the contents of a \verb!.bim! file. The other
  columns are simple descriptive stats that are computed by Gaston. Again, if 
  the global option 
  \verb!options(gaston.auto.set.stats = FALSE)! was set (see below), these
  columns will be absent (see below section \ref{stats}).

\pagebreak
<<>>=
dim(x@snps)
head(x@snps)
@

\textbf{Note:} We refer to the allele \verb!A1! as the \textit{reference allele} and
to \verb!A2! as the \textit{alternative allele}. The choice of the reference allele is
arbitrary, it doesn't to be e.g. the major allele. If the matrix is produced by 
\verb!read.bed.matrix!, Gaston will keep them in the order they appear in the 
\verb!.bim! file, their won"t be any reordering.

\subsection{Basic statistics included in a bed.matrix}\label{stats}

  Some simple descriptive statistics can be added to a bed.matrix with \verb!set.stats!.
  Since version 1.4 of gaston, this function is called by default by all functions that create 
  a bed.matrix, unless the global option \verb!options(gaston.auto.set.stats = FALSE)! was set.
  This option can be used to gain some time in some particular cases. 


  We illustrate here the effect of this option:
<<>>=
options(gaston.auto.set.stats = FALSE)
data(TTN)
x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)
head(x@ped)
head(x@snps)
@

  The reader is invited to compare with what we obtained in the previous section of this document.

  The function \verb!set.stats! can be called to add the missing descriptive statistics to the \verb!ped! 
  and \verb!snps! slots:
<<>>=
x <- set.stats(x)
head(x@ped)
@

The columns \verb!N0!, \verb!N1!, \verb!N2! and \verb!NAs! give for each individual the number of
autosomal SNPs with a genotype equal to 0, 1, 2 (corresponding to \verb!A1A1!, \verb!A1A2! and \verb!A2A2!) 
and missing, respectively. The homologous columns
with names ending with \verb!.x!, \verb!.y! and \verb!.mt! give the same for SNPs on the X, Y, and mitochondria
(this is why you get only 0s here).

The columns \verb!callrate!, \verb!callrate.x!, \verb!callrate.y!, \verb!callrate.mt!
give the individual callrate for autosomal, X, Y, mitochondrial SNPs, and similarly,
\verb!hz!, \verb!hz.x!, \verb!hz.y!, \verb!hz.mt! give the individual heterozygosity.

<<>>=
head(x@snps)
@

  The columns \verb!N0!, \verb!N1!, \verb!N2! and \verb!NAs! contain the number
  of genotypes 0, 1, 2 and \verb!NA! for each SNP. For X-linked SNPs only, the 
  homologous columns with names ending with \verb!.f! give the same, taking only 
  women into account.
  The callrate is the proportion of non-missing
  genotypes. In the \verb!snps! slot, \verb!maf! is the minor allele frequency, and 
  \verb!hz! it the heterozygosity rate. Note that the callrate for Y linked SNPs
  taking only men into account, while the heterozygosity rate for X linked SNPs
  is computed only on women.

  The default is too also also update the slots \verb!x@p!, \verb!x@mu! and \verb!x@sigma!:
<<>>=
str(x@p)
str(x@mu)
str(x@sigma)
@
  \verb!p! contains the alternative allele (\verb!A2!) frequency; \verb!mu! is equal to \verb!2*p!
  and is the expected value of the genotype (coded in 0, 1, 2); and \verb!sigma!
  is the genotype standard error. If the Hardy-Weinberg equilibrium holds, \verb!sigma!
  should be close to \verb!sqrt(2*p(1-p))!. This is illustrated on the figure below.

%\begin{center}
%\setkeys{Gin}{width=0.7\textwidth}
<<fig.mar=TRUE, fig.width=10, out.width='0.7\\textwidth'>>=
plot(x@p, x@sigma, xlim=c(0,1))
t <- seq(0,1,length=101);
lines(t, sqrt(2*t*(1-t)), col="red")
@
%\end{center}

  There are also functions \verb!set.stats.snps! and \verb!set.stats.ped! to
  update only \verb!x@snps! and \verb!x@ped!.

  The option \verb!gaston.auto.set.stats! can be useful when extracting 
  individuals from bed.matrices (seen next section): if
  \verb!gaston.auto.set.stats = FALSE!, the slots \verb!@p!, \verb!@mu! and \verb!@sigma!
  are left unaltered; in contrast, the default behavior is to call \verb!set.stats! 
  which recomputes the values of these slots. 
  Note that when using \verb!set.stats! ``manually'', one can use options to alter its behavior, 
  allowing to update only some slots. The following code illustrates this (remember that here 
  \verb!gaston.auto.set.stats = FALSE!) :

<<>>=
head(x@p)
y <- x[1:30,]
head(y@p)
y <- set.stats(y, set.p = FALSE)
head(y@p)
y <- set.stats(y)
head(y@p)
@

  Hardy-Weinberg Equilibrium can be tested for all SNPs simply by \verb!x <- set.hwe(x)!. 
  This command adds to the data frame \verb!x@snps! a column \verb!hwe!, containing the
  $p$-value of the test (both $\chi^2$ and ``exact'' tests are available, see the man
  page).

  Note that when writting a bed.matrix \verb!x! to disk with \verb!write.bed.matrix!, 
  the \verb!.rds! file contains all the slots of \verb!x!, and thus contains the additionnal
  variables added by \verb!set.stats! or \verb!set.hwe!, which are not saved in the
  \verb!.bim! and \verb!.fam! files.

  For the remaining of this vignette, we restore the default behaviour:
<<>>=
options(gaston.auto.set.stats = TRUE)
@

\pagebreak
\subsection{Subsetting bed.matrices}

  It is possible to subset bed.matrices just as normal matrices, writing e.g.
  \verb!x[1:100,]! to extract the first 100 individuals, or \verb!x[1:100,10:19]!
  to extract the SNPs 10 to 19 for these 100 individuals:

<<>>=
x[1:100,]
x[1:100,10:19]
@

  The use of logical vectors for subsetting is allowed too. The following code extracts 
  the SNPs with minor allele frequency $> 0.1$:

<<>>=
x[,x@snps$maf > 0.1]
@

  For this kind of selection, Gaston provides the functions \verb!select.inds! and 
  \verb!select.snps!, which have a nicer syntax. Hereafter 
  we use the same condition on the minor allele frequency, and we introduce also 
  a condition on the Hardy-Weinberg Equilibrium $p$-value:

<<>>=
x <- set.hwe(x)
select.snps(x, maf > 0.1 & hwe > 1e-3)
@
  The conditions in \verb!select.snps! can use any of the variables defined in the data frame \verb!x@snps!,
  as well as variables defined in the user session. 

  The function \verb!select.inds! is similar, using variables of the data frame \verb!x@ped!. 
  One can for example 
  select individuals with callrate greater than 95\% with \verb!select.inds(x, callrate > 0.5)!.

  These functions should make basic Quality Control easy. 

  \noindent{\bfseries Note:}\ \ If \verb!options(gaston.auto.set.stats = FALSE)!
  was set, subsetting will erase some or all statistics added by \verb!set.stats!,
  except the \verb!p!, \verb!mu! and \verb!sigma! slots. 

\subsection{Genomic sex}
  When dealing with genome wide data,
  the function \verb!set.genomic.sex! can be use to add a \verb!x@ped$genomic.sex! column,
  determined by clustering on individuals X heterozygosity rate and Y callrate.
 
  \verb!x <- select.inds(x, sex == genomic.sex)! will then allow to keep only individuals 
  whose sex and genomic sex are concordant.

  An other possibility is to use \verb!x@ped$sex <- x@ped$genomic.sex; x <- set.stats(x)! 
  to erase the original sex indication and use genomic sex in subsequent analyses.

\subsection{Merging bed.matrices}

  Gaston defines methods \verb!rbind! and \verb!cbind! which can be used to merge several
  bed.matrices. \verb!cbind! checks that individuals famids and ids are identical ; all remaining
  columns are taken from the first argument, without further control.

<<>>=
data(AGT)
x1 <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim)
x1
data(LCT)
x2 <- as.bed.matrix(LCT.gen, LCT.fam, LCT.bim)
x2
x <- cbind(x1,x2)
x
@

  \verb!rbind! similarly checks whether the SNP ids are identical, and also if the reference
  alleles are identical. If needed, it will perform reference allele inversions (changing a SNP
  A/G to G/A) or allele strand flips (changing A/G to T/C) or both.

<<>>=
x3 <- x[1:10, ]
x4 <- x[-(1:10), ]
rbind(x3,x4)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Standardized matrices}

  Gaston allows to ``standardize'' a genotype matrix, replacing each genotype $X_{ij}$ ($i$ is the
  individual index, $j$ is the SNP index) by 
\begin{equation}
X_{ij} - \mu_j \over \sigma_j
\end{equation}
  where $\mu_j = 2 p_j$ is the mean of the 0,1,2-coded genotype ($p_j$ being the alternative allele 
  frequency), and $\sigma_j$ is either its standard error, or the expected standard error under Hardy-Weinberg
  Equilibrium, that is $\sqrt{2p_j(1-p_j)}$.

  Consider the TTN data set. The unscaled matrix contains 0,1,2 values:
<<>>=
x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)
X <- as.matrix(x)
X[1:5,1:4]
@

  To scale it using the standard error, use \verb!standardize(x) <- "mu_sigma"! (or \verb!standardize(x) <- "mu"!,
  as the function uses \verb!match.arg!):
<<>>=
standardize(x) <- "mu"
as.matrix(x)[1:5, 1:4] 
@

  The result is similar to what would be obtained from the base function \verb!scale!:
<<>>=
scale(X)[1:5,1:4]
@

  To use $\sqrt{2p_j(1-p_j)}$ instead of the standard error, use \verb!standardize(x) <- "p"!; a similar
  result can again be obtained from \verb!scale!, with the right options:
<<>>=
standardize(x) <- "p"
as.matrix(x)[1:5, 1:4] 
scale(X, scale = sqrt(2*x@p*(1-x@p)))[1:5,1:4]
@

  To go back to the 0,1,2-coded genotypes, use \verb!standardize(x) <- "none"!.

  \noindent{\bfseries Note:}\ \ In standardized matrices, the \verb!NA! values are replaced by zeroes,
  which amount to impute the missing genotypes by the mean genotype.

\subsection{Matrix multiplication}

  Standardized matrices can be multiplied with numeric vectors or matrices with the operator \verb!%*%!.

  This feature can be used e.g. to simulate quantitative phenotypes with a genetic component. The following
  code simulates a phenotype with an effect of the SNP \#350:
<<>>=
y <- x %*% c(rep(0,350),0.25,rep(0,ncol(x)-351)) + rnorm(nrow(x), sd = 1)
@

\pagebreak
\subsection{Genetic Relationship Matrix and Principal Component Analysis}

  If $X_s$ is a standardized $n\times q$ genotype matrix, a Genetic Relationship Matrix 
  (GRM) of the individuals can be computed as
  \begin{equation*}
     GRM = {1\over q-1} X_sX_s'
  \end{equation*}
  where $q$ is the number of SNPs.  This computation is done by the function \verb!GRM!.  

  \textbf{NOTE:} Since version 1.4.7, \verb!GRM! has an argument \verb!which.snps! which allows 
  to give a logical vector specifying which SNPs to use in the computation. The default is to 
  use only autosomal SNPs (the above formula makes little to no sense for X linked SNPs, unless there 
  are only women in the sample).

  The Principal Component Analysis (PCA) of large genomic data sets is used to retrieve population
  stratification. It can be obtained from the eigen decomposition of the GRM. To illustrate this,
  we included in the \verb!extdata! folder a dataset extracted from the 1000 genomes project. 
  This data set comprehend the 503 individuals of european ascent, with 10025 SNPs on the chromosome
  2. These SNPs have been selected with the function \verb!LD.thin! so that they have very low 
  LD with each other ($r^2 < 0.05$). We added in the data frame \verb!x@ped! a variable \verb!population!
  which is a factor with levels \verb!CEU!, \verb!FIN!, \verb!GBR!, \verb!IBS! and \verb!TSI!.

  We can load this data set as follows:
<<>>=
x <- read.bed.matrix( system.file("extdata", "chr2.bed", package="gaston") )
x
head(x@ped)
table(x@ped$population)
@


We compute the Genetic Relationship Matrix, and its eigen decomposition (we don't
need to standardize \verb!x! explicitely: is \verb!x! isn't standardized, 
\verb!GRM! will use \verb!standardize(x) = "p"!):
<<>>=
K <- GRM(x)

eiK <- eigen(K)
# deal with a small negative eigen value
eiK$values[ eiK$values < 0 ] <- 0
@

  The eigenvectors are normalized. The Principal Components (PC) can be computed by 
  multiplying them by the square root of the associated eigenvalues. Here we plot
  the projection of the 503 individuals on the first two PCs, with colors corresponding
  to their population.

%\begin{center}
<<fig.mar=TRUE>>=
PC <- sweep(eiK$vectors, 2, sqrt(eiK$values), "*")
plot(PC[,1], PC[,2], col=x@ped$population)
legend("bottomleft", pch = 1, legend = levels(x@ped$population), col = 1:5)
@
%\end{center}

  As $K$ can be written
  \begin{equation*} 
    K = \left( {1 \over \sqrt{q-1}} X_s \right) \left( {1 \over \sqrt{q-1}} X_s \right)',
  \end{equation*}
  the PCs are the left singular vectors of ${1 \over \sqrt{q-1}} X_s$.
  The vectors of loadings are the right singlar vectors of this matrix. The vector of loadings 
  corresponding to a PC $u$ is the vector $v$ with unit norm, such that 
  $u = {1 \over \sqrt{q-1}} X_s v$.

  They can be retrieved with the function \verb!bed.loadings!:
<<>>=
# one can use PC[,1:2] instead of eiK$vectors[,1:2] as well
L <- bed.loadings(x, eiK$vectors[,1:2])
dim(L)
head(L)
@

We verify that the loadings have unit norm:
<<>>=
colSums(L**2)
@

And that the PCs are retrieved by right multiplying $X_s$ by the loadings (here we need 
to explicitely standardize \verb!x!):
<<>>=
standardize(x) <- 'p'
head( (x %*% L) / sqrt(ncol(x)-1) )
head( PC[,1:2] )
@


\vfill\eject
\subsection{Linkage Disequilibrium}

  Doing the crossproduct in the reverse order produces a moment estimate of the 
  Linkage Disequilibrium (LD):
  \begin{equation*} 
    LD = {1\over n-1} X_s'X_s 
  \end{equation*}
  where $n$ is the number of individuals. This computation is done by the function
  \verb!LD! (usually, only parts of the whole LD matrix is computed).  
  The fonction \verb!LD! can compute a square symmetric matrix giving the LD for a given 
  region, measured by $r^2$ (the default), $r$ or $D$. It can also compute the LD  between
  two different regions.

\begin{center}
\setkeys{Gin}{width=\textwidth}
<<>>=
data(AGT)
x <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim)

ld.x <- LD(x, c(1,ncol(x)))

LD.plot(ld.x, snp.positions = x@snps$pos, write.ld = NULL, 
        max.dist = 20e3, write.snp.id = FALSE, draw.chr = FALSE, 
        pdf = "LD_AGT.pdf")
@
\includegraphics[width=\textwidth]{LD_AGT.pdf}
\end{center}

  This method is 
  also used by \verb!LD.thin! to extract a set of SNPs in low linkage disequilibrium
  (it is often recommended to perform this operation before computing the GRM). We illustrate
  this function on the AGT data set:

<<>>=
y <- LD.thin(x, threshold = 0.4, max.dist = 500e3)
y
@

The argument \verb!max.dist = 500e3! is to specify that the LD won't be computed for
SNPs more than 500 kb appart. We verify that there is no SNP pair with LD $r^2 > 0.4$
(note that the LD matrix has ones on the diagonal):

<<>>=
ld.y <- LD( y, lim = c(1, ncol(y)) )
sum( ld.y > 0.4 )
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vfill\eject
\section{Linear Mixed Models}

\def\R{\mathbb{R}}
\def\N{\mathcal{N}}

  Linear Mixed Models are usually written under the form
  \begin{equation*}
    Y = X\beta + Z_1 u_1 + \cdots + Z_k u_k + \varepsilon
  \end{equation*}
  where $Y \in\R^{n}$ is the response vector, and $X\in \R^{n\times p}, Z_1 \in \R^{n\times q_1}, \dots, Z_k \in\R^{n\times q_k}$ are
  design matrices. The vector $\beta\in\R^p$ is the vector of fixed effects; the random effects are drawn in 
  normal distributions, $u_1\sim \N(0, \tau_1 I_{q_1}), \dots, u_k\sim \N(0, \tau_k I_{q_k})$, as the residual
  error $\epsilon\sim\N(0,\sigma^2 I_n)$.

  Here we will use the equivalent form 
  \begin{equation*}
    Y = X\beta + \omega_1 + \ldots + \omega_k + \varepsilon
  \end{equation*}
  where $K_i = Z_i Z_i'$ $(i = 1, \dots, k)$,
  the random terms are $\omega_i \sim N(0,\tau_i K_i)$ for $i \in 1, \dots,k$ and $\varepsilon \sim N(0,\sigma^2 I_n)$.

  Note that using the R function \verb!model.matrix! can help you to rewrite a model under this form.

  Gaston provides two functions for estimating the model parameters when the model is
  written in the second form. We are going to illustrate these functions on simulated
  data.

  \subsection{Data simulation}

  We will use the above notations. First generate some (random) design matrices:
 
<<>>=
set.seed(1)
n <- 100
q1 <- 20
Z1 <- matrix( rnorm(n*q1), nrow = n )
X <- cbind(1, 5*runif(n))
@
 
  Then the vector of random effects $u_1$ and a vector $y$ under the model $Y = X (1,2)' + Z u_1 + \varepsilon$
  with $u_1 \sim \N(0, 2 I_{q_1})$ and $\varepsilon \sim\N(0, 3 I_n)$:

<<>>=
u1 <- rnorm(q1, sd = sqrt(2))
y <- X %*% c(1,2) + Z1 %*% u1 + rnorm(n, sd = sqrt(3))
@

  To illustrate the case where there are several random effects vectors, we generate an 
  other matrix $Z_2$, the corresponding vector of random effects $u_2$, and a vector $y_2$ 
  under the model $Y = X (1,2)' + Z u_1 + Z_2 u_2 + \varepsilon$.

<<>>=
q2 <- 10
Z2 <- matrix( rnorm(n*q2), nrow = n ) 
u2 <- rnorm(q2, sd = 1)
y2 <- X %*% c(1,2) + Z1 %*% u1 + Z2 %*% u2 + rnorm(n, sd = sqrt(3))
@

  \subsection{Model fitting with the AIREML algorithm}

  \verb!lmm.aireml! is a function for linear mixed models parameter estimation
  and BLUP computations. 

  \subsubsection{One random effects vector}
  Let's start with the simple case (only one random effects vector).
  As  \verb!lmm.aireml! uses the second form of the linear mixed model,
  and we give it the matrix $K_1 = Z_1 Z_1'$.

<<>>=
K1 <- tcrossprod(Z1)
fit <- lmm.aireml(y, X, K = K1, verbose = FALSE)
str(fit)
@

  The result is a list giving all kind of information. Here we see that $\tau$ is 
  estimated to $\Sexpr{round(fit[["tau"]],2)}$ and $\sigma^2$ to $\Sexpr{round(fit[["sigma2"]],2)}$. 
  The Best Linear Unbiased Predictor (BLUP) for $\beta$ is in the component \verb!BLUP_beta!
  and here it is $(\Sexpr{round(fit[["BLUP_beta"]][1],2)}, \Sexpr{round(fit[["BLUP_beta"]][2],2)})$.
  
  The component \verb!BLUP_omega! holds the BLUP for $\omega_1 = Z u_1$. The BLUP for $u_1$
  can be retrieved by the formula $\widehat{u_1} = \tau_1 Z_1' P y$, the value of $Py$ being
  in the component \verb!Py!. The plots below compare the true values of $\omega_1$ and
  $u_1$ with their BLUPs.
   
%\setkeys{Gin}{width=0.7\textwidth}
%\begin{center}
<<fig.mar=TRUE, fig.width=12, out.width='0.7\\textwidth'>>=
par(mfrow = c(1, 2))
plot(Z1 %*% u1, fit$BLUP_omega); abline(0, 1, lty = 2, col = 3)
BLUP_u1 <- fit$tau * t(Z1) %*% fit$Py
plot(u1, BLUP_u1); abline(0, 1, lty = 2, col = 3)
@
%\end{center}

  \subsubsection{Several random effects vectors}

  It is sufficient to give to \verb!lmm.aireml! a list with the two matrices
  $K_1$ and $K_2$:

<<>>=
K2 <- tcrossprod(Z2)
fit2 <- lmm.aireml(y2, X, K = list(K1, K2), verbose = FALSE)
str(fit2)
@

  The component \verb!tau! now holds the two values $\tau_1$ and $\tau_2$.
  Note that there is only one \verb!BLUP_omega! vector. It corresponds to
  the BLUP for $\omega_1 + \omega_2$. The BLUPs for each $\omega_i$ can
  be retrieved using $\widehat{\omega_i} = \tau_i K_i Py$:

%\begin{center}
<<fig.mar=TRUE, fig.width=12, out.width='0.7\\textwidth'>>=
par(mfrow = c(1, 2))
omega1 <- fit2$tau[1] * K1 %*% fit2$Py
omega2 <- fit2$tau[2] * K2 %*% fit2$Py
plot(Z1 %*% u1, omega1); abline(0, 1, lty = 2, col = 3)
plot(Z2 %*% u2, omega2); abline(0, 1, lty = 2, col = 3)
@
%\end{center}

  The BLUPs for $u_1$ and $u_2$ can as above be retrieved using $\widehat{u_i} = \tau_i Z_i' P y$.
  
  \subsection{Model fitting with the diagonalization trick}

  In the case where there is only one vector of random effects ($k = 1)$, the
  diagonalization trick uses the eigen decomposition of $K_1$ to speed up the 
  computation of the restricted likelihood, which allows to use a generic 
  algorithm to maximize it. Of course, the eigen decomposition of $K_1$ needs 
  to be computed beforehand, which induces an overhead. 

  The trick relies on a transformation of the data which leads to a diagonal variance matrix for $Y$. 
  The computation of a the restricted likelihood
  involves the computation of the inverse of this matrix, which is then easy to compute.
  Write the eigen decomposition of $K_1$ as $K_1 = U \Sigma^2 U'$ with $\Sigma^2$
  a diagonal matrix of positive eigenvalues, and $U$ an orthogonal matrix.
  If we let $\tilde Y = U'Y$, $\tilde X = U'X$, $\tilde \omega_1 = U' \omega_1$, and
  $\tilde\varepsilon = U' \varepsilon$, we have
  $$ \tilde Y = \tilde X\beta + \tilde \omega_1 + \tilde \varepsilon $$
  with $\tilde \omega_1 \sim N\left(0,\tau \Sigma^2\right)$ and $ \tilde\varepsilon \sim N(0,\sigma^2 I_n) $.
  As stated above, $\mathrm{var}\left(\tilde Y\right) = \tau \Sigma^2 + \sigma^2 I_n$ is diagonal.

  We fit the first model again:
 
<<>>=
eiK1 <- eigen(K1)
fit.d <- lmm.diago(y, X, eiK1)
str(fit.d)
@
  You can check that the result is similar to the result obtained earlier with 
  \verb!lmm.aireml!.

  The likelihood computation with the diagonalization trick is fast enough to 
  plot the likelihood:

%\setkeys{Gin}{width=0.4\textwidth}
%\begin{center}
<<fig.mar=TRUE>>=
TAU <- seq(0.5,2.5,length=50)
S2 <- seq(2.5,4,length=50)
lik <- lmm.diago.likelihood(tau = TAU, s2 = S2, Y = y, X = X, eigenK = eiK1)
lik.contour(TAU, S2, lik, heat = TRUE, xlab = "tau", ylab = "sigma^2")
@
%\end{center}



\subsection{Genomic heritability estimation with Gaston}

  Heritability estimates based on Genetic Relationship Matrices (GRM)
  are computed under the following model:
  \begin{equation*}
  Y = \beta + \omega + \epsilon
  \end{equation*}
  where $\omega \sim \N(0, \tau K)$ and $\epsilon\sim \N(0, \sigma^2 I_n)$,
  $K$ being a GRM computed on whole genome data (e.g. by the function \verb!GRM!).
  The heritability is $h^2 = {\tau \over \tau + \sigma^2}$.

  Note that as $K = {1\over q-1} X_sX_s'$ with $X_s$ a standardized genotype matrix, 
  letting $Z = (q-1)^{-{1 \over 2}} X_s$, this model is equivalent to 
  \begin{equation*}
  Y = \beta + Zu + \epsilon
  \end{equation*}
  with $u \sim \N(0,\tau I_q)$.

  The function \verb!random.pm! generates random positive matrices with eigenvalues
  roughly similar to those typically observed when using whole genome data. It outputs a list
  with a member \verb!K!: a positive matrix, and a member \verb!eigen!: its eigen 
  decomposition (as the base function \verb!eigen! would output it).

  The function \verb!lmm.simu! can be used simulate data under the linear mixed model.
  It uses the eigen decomposition of $K$. Hereafter we use $\tau = 1$, $\sigma^2 = 2$,
  hence a $33.3\%$ heritability.

<<>>=
set.seed(1)
n <- 2000
R <- random.pm(n)
y <- 2 + lmm.simu(tau = 1, sigma2 = 2, eigenK = R$eigen)$y
@

  We can use \verb!lmm.diago! to estimates the parameters of the model.

<<>>=
fit <- lmm.diago(y, eigenK = R$eigen)
@

<<echo=FALSE>>=
h2 <- fit$tau/(fit$tau + fit$sigma)
@
  We have $\widehat\tau = \Sexpr{round(fit[["tau"]],3)}$ and $\widehat{\sigma^2} = \Sexpr{round(fit[["sigma2"]],2)}$,
  hence $h^2$ is estimated to $\Sexpr{round(h2*100,1)}\%$.

  The function \verb!lmm.diago.likelihood! allows to compute a profile likelihood with parameter 
  $h^2 = {\tau \over \tau + \sigma^2}$. It can be useful to get confidence intervals. Here,
  we simply plot it:

%\begin{center}
<<fig.mar=TRUE>>=
H2 <- seq(0,1,length=51)
lik <- lmm.diago.likelihood(h2 = H2, Y = y, eigenK = R$eigen)
plot(H2, exp(lik$likelihood-max(lik$likelihood)), type="l", yaxt="n", ylab="likelihood")
@
%\end{center}

  It is often advised to include the first few (10 or 20) Principal Components (PC) in the model as fixed effects,
  to account for a possible population stratification. The function \verb!lmm.diago! has an argument \verb!p!
  for the number of PCs to incoporate in the model with fixed effects. We simulate a trait with a large effect of
  the two first PCs:
<<>>=
PC <- sweep(R$eigen$vectors, 2, sqrt(R$eigen$values), "*")
y1 <- 2 + PC[,1:2] %*% c(5,5) + lmm.simu(tau = 1, sigma2 = 2, eigenK = R$eigen)$y
@

  Here are the heritability estimates with $p = 0$ (the default) and $p = 10$.
<<>>=
fit0 <- lmm.diago(y1, eigenK = R$eigen)
fit0$tau/(fit0$tau+fit0$sigma2)
fit10 <- lmm.diago(y1, eigenK = R$eigen, p = 10)
fit10$tau/(fit10$tau+fit10$sigma2)
@

  As expected, the estimate is inflated when no PCs are incorporated in the model.
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vfill\eject
\section{Association tests}

The function \verb!association.test! performs 
genome-wide association tests with either (generalized) linear models, 
or with generalized) linear (mixed) models, for quantitative or binary (0/1) traits.

\subsection{Quantitative trait}

  When the function \verb!association.test! is called with parameter \verb!method = "lm"!
  it performs a test of association between SNPs 
  and a quantitative trait $Y$ under the model
\begin{equation*}
  Y = (X|PC)\alpha + G\beta + \varepsilon 
\end{equation*}
  where $X$ is the matrix of covariables with fixed effects (including a column 
  of ones for the intercept), $G$ is the vector of genotypes at the SNP under
  consideratione. 
  A few PCs can be included in the model with fixed effects to correct for 
  population stratification (the parameter \verb!p! gives the number of PCs to include).
  The Wald test is performed to test for $\beta = 0$.

  When the function is called with parameter \verb!method = "lmm"! the following mixed model
  is used instead:
\begin{equation*}
  Y = (X|PC)\alpha + G\beta + \omega + \varepsilon 
\end{equation*}
  where $X$ and $G$ are as above, and $\omega\sim\N(0,\tau K)$ where $K$ is a Genetic Relationship
  Matrix (computed on the whole genome).

  Three tests are proposed for $\beta = 0$: a score test, a Wald test on $\widehat\beta$,
  or a Likelihood Ratio Test.

  We illustrate this function on a simple example, built on the AGT data set:
<<>>=
data(AGT)
x <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim)
standardize(x) <- 'p'
@

  As the whole genome is not available, we generate a random positive matrix to play the role of the GRM:
<<>>=
set.seed(27)
R <- random.pm(nrow(x))
@

  And we simulate a phenotype with an effect of the SNP \#351, and a polygenic component:
<<>>=
y <- 2 + x %*% c(rep(0,350),0.35,rep(0,ncol(x)-351)) +
     lmm.simu(tau = 0.3, sigma2 = 1, eigenK=R$eigen)$y
@

  The following code performs the association test with a score test and a wald test,
  and compares the resulting $p$-values:
%\begin{center}
<<fig.mar=TRUE>>=
t_wald <- association.test(x, y, K = R$K, method = "lm", test = "wald")
t_wald_mixed <- association.test(x, y, eigenK = R$eigen, method = "lmm", test = "wald")
plot( t_wald$p, t_wald_mixed$p, log = "xy", xlab = "lm (wald)", ylab = "lmm (wald)")
abline(0,1,lty=3)
@
%\end{center}

We can display the results under the form of a (mini) Manhattan plot:
%\begin{center}
%\setkeys{Gin}{width=0.7\textwidth}
<<fig.mar=TRUE, fig.width=12, out.width='0.7\\textwidth'>>=
manhattan(t_wald_mixed, col = ifelse(1:ncol(x) == 351, "red", "black"))
@
%\end{center}

  We colored the point corresponding to the SNP \#351 in red. Note that when
  used on genome-wide data, the function \verb!manhattan! will draw a classical
  Manhattan plot, alternating colors between chromosomes.

  There are other SNPs with low association $p$-values: these are the SNPs in LD with
  SNP \#351. We can confirm this by plotting the $p$-values (again on log scale) 
  as a function of the LD (measured by $r^2$):
%\begin{center}
<<fig.mar=TRUE>>=
lds <- LD(x, 351, c(1,ncol(x)))
plot(lds, -log10(t_wald_mixed$p), xlab="r^2", ylab="-log(p)")
@
%\end{center}

\subsection{Binary phenotype}

We use the quantitative phenotype previously generated to construct a binary phenotype,
and we perform the association test with a logistic model (Wald test) or logistic mixed 
model (score test). 

%\begin{center}
%\setkeys{Gin}{width=0.7\textwidth}
<<fig.mar=TRUE>>=
y1 <- as.numeric(y > mean(y))
t_binary <- association.test(x, y1, K = R$K, method = "lm", response = "binary", test="wald")
t_binary_mixed <- association.test(x, y1, K = R$K, method = "lmm", response = "binary")
plot( t_binary$p, t_binary_mixed$p, log = "xy", xlab = "lm (wald)", ylab = "lmm (score)" )
abline(0,1,lty=3)
@
%\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vfill\eject
\section{What you can hope for in later releases}

A few things that may be added to the package in the near future:

\begin{itemize}
\item Reading \verb!.ped! files
\item Maximum Likelihood Linkage Disequilibrium estimation
\item LMM models fitting written in equation form
\item More functions and models for association testing
\item Anything you asked the maintainer for
\end{itemize}

\end{document}




