\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{parskip}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{hyperref}
%% \VignetteIndexEntry{mmod-demo}

\begin{document}
\title{\textsc{mmod} vignette}
\author{David Winter\\
      david.winter@gmail.com}
\maketitle

\tableofcontents
\newpage

\section{Why use mmod (or what's wrong with $G_\text{ST}$?)}
 
Population geneticists, molecular ecologists and evolutionary biologists often
want to be able to determine the degree to which populations are divided into
smaller sub-populations. One very widely used approach to this question uses
``F analogues'' (measures based on Wrigtht's $F_\text{ST}$) to compare diversity
within and between predefined sub-populations. Until recently, the most widely 
used of these statistics has been Nei's $G_\text{ST}$. Unfortunately, the value 
of $G_\text{ST}$ is a at least partially dependent on the number of alleles at 
each locus and the number of populations sampled. This makes simple 
interpretations of $G_\text{ST}$ difficult, and comparisons between studies (or
even between loci in the same population) potentially misleading. 

A number of new $F_\text{ST}$ analogues have been developed that compensate for 
these short comings, and give values that can be compared between studies. 
\textsc{mmod} is a package that allows three of these statistics, 
$G''_\text{ST}$, $D_\text{est}$ and $\varphi'_\text{ST}$, to be calculated from
\texttt{genind} objects (the standard representation of genetic datasets in the
\texttt{adegenet} library)

\section{Which statistic should I use?}

With the proliferation of $F_\text{ST}$ analogues, it can be hard to decide on the
most appropriate measure to use for your study. I encourage you to read 
Meirmans and Hedrick (2011 doi:\href{http://dx.doi.org/10.1111/j.1755-0998.2010.02927.x}{10.1111/j.1755-0998.2010.02927.x}), 
which includes a discussion on this topic. As you'll see in the demonstration
below, the corrected statistics often tell a similar story. Interestingly, 
$G''_\text{ST}$ can be directly related to the rate of migration between 
populations while $D_\text{est}$ and $\varphi'_\text{ST}$ are about 
partitioning distances or diversity between genes. You may consider which 
approach is most appropriate for the specific questions you wish to ask.

\section{Which statistics can \textsc{mmod} not calculate}

There are at least two population genetic statistics related to the ones discussed above that
\textsc{mmod} can't calculate. $R_\text{ST}$ was developed for micorsattelite
data, and takes the relationship between alleles (and therefore the mutation
rate) into account when measuring between-allele distances. It is not clear how 
the maximum potential value of $R_\text{ST}$ for a given dataset can be 
calculated, so it is not possible to correct this statistic in a way similar 
to $G''_\text{ST}$ and $\varphi'_\text{ST}$. 

Similarly, the calculation of the maximum value of Weir and Cockerham's $\theta$
is complex (and not yet published). If you wish to calculate a corrected 
version of this statistic you can use RecodeData 
(\url{http://www.bentleydrummer.nl/software/software/})
to create a dataset in which all between-population differences are maximised. 
You can then calculate $\theta$ for each dataset using \texttt{Fst} from the
package \texttt{pegas}. If the statistic calculated form the recoded data is
${\theta_\text{max}}$ then the corrected statistic is simply 
$\frac{\theta}{\theta_\text{max}}$.

\section{An Example - differentiation in the \textsc{nancycats} data}


With the description out of the way, let's see how these functions work in 
practice. As an example, we are going to examine the \texttt{nancycats} data that comes
with \texttt{adegenet}. This dataset contains microsattelite genotypes
taken from feral cats in Nancy, France.
So let's start.

<<start>>=
library(mmod)
data(nancycats)
nancycats
@

The nancycats data comes in \texttt{adegenet}'s default class for genotypic
data, the \texttt{genind} class. The functions in \texttt{mmod} work on genind
 objects, so you would usually start by reading in your data using
 \texttt{read.genpop} or \texttt{read.fstat} depending on the format it's in.
 
 
Now that we have our data on hand, our goal is to see

\begin{itemize}
        \item Whether this population is substantially differentiated into
        smaller sub-populations
        \item Whether such differentiation can be explained by the geographical
        distance between sub-populations.
\end{itemize}


We can look at several statistics to ask answer the first question by using
 the \verb+diff_stats()+ function:

<<diffstat>>=
diff_stats(nancycats)
@

OK, so what is that telling us? The first table has statistics calculated individually
for each locus in the dataset. \texttt{Hs} and \texttt{Ht} are estimates of the
heterozygosity expected for this population with and without the sub-populations
defined in the \texttt{nancycats} data respectively. We need to use those to
calculate the measures of population divergence so we might as well display
them at the same time. \texttt{Gst} is the standard (Nei) $G_\text{ST}$,
\verb+Gprime_st+ is Hedrick's $G''_\text{ST}$ and \texttt{D} is Jost's $D_\text{est}$.
Because all of these statistics are estimated from estimators of $H_\text{S}$
and $H_\text{T}$, it's possible to get negative values for each of these
differentiation measures. Populations can't be negatively differentiated, so
you should think of these as estimates of a number close to zero (it's up to
you and your reviewers to decide if you report the negative numbers of just
zeros).
 
$D_\text{est}$ is the easiest statistic to interpret, as you expect to
find $D=0$ for populations with no differentiation and $D=1$ for completely
differentiated populations. As you can see, different loci give quite
different estimates of divergence but they range from $\sim$0.1--0.4.

\textsc{mmod} can calculate another statistic of differentiation called
$\varphi'_\text{ST}$. This statistic is based on the Analysis of Molecular
Variance (AMOVA) method, which partitions the variance in genetic distances
in a dataset to among-population and within-population components (it is 
possible to use this framework to partition variance using more than two 
levels of population structure, but that has not been implemented in 
\textsc{mmod} yet). Because $\varphi'_\text{ST}$ can take some time to 
calculate it's not included in \verb+diff_stat+ by default (but you can 
include it using \verb+diff_stat(x, phi_st=TRUE)+). 
 
You might want to see how all these different measures compare to each other
across the loci we've looked at. You can see the corrected measures (all those
other than $G_\text{ST}$) show a similar pattern, and $G_\text{ST}$ is a bit 
strange  (Figure~\ref{Comparison}):

<<label=fig1plot>>=
nc.diff_stats <- diff_stats(nancycats, phi_st=TRUE)
with(nc.diff_stats, pairs(per.locus[,3:6], upper.panel=panel.smooth))
@
\begin{figure}
\begin{center}
<<label=fig1, fig=TRUE, echo=FALSE>>=
<<fig1plot>>
@
\end{center}
\caption{Comparison of differentiation measures}
\label{Comparison}
\end{figure}

The second part of the list returned by \verb+diff_stat+ contains global
estimates of each of these statistics. For $G_\text{ST}$ and $G''_\text{ST}$ these are
based on the average of \texttt{Hs} and \texttt{Ht} across loci. For
$D_\text{est}$ you get two, the harmonic mean of the $D_\text{est}$ for each locus and,
because that method won't work if you end up with negative estimates of $D_\text{est}$,
one calculated as per $G_\text{ST}$ and $G''_\text{ST}$. The global estimate of
$\varphi'_\text{ST}$ is calculated from the average distance among individuals 
across all loci. 

Now that we have a point-estimate for how differentiated these populations are 
we will want to have some idea of how robust this result is. \textsc{mmod} has a
few functions for performing bootstrap samples of \texttt{genind} objects and
calculating statistics from those samples. Because some of these functions can
take a long time to run, we will create a very small (10 repetition) bootstrap
sample of the \texttt{nancycats} data, then calculate $D_\text{est}$ from that sample:

<<bs>>=
bs <- chao_bootstrap(nancycats, nreps=10)
bs.D <- summarise_bootstrap(bs, D_Jost)
bs.D
@

As you can see, printing a summarised bootstrap sample gives us shows a basic
overview of that data. In this case the confidence intervals are calculated
using the ``normal method'', which it to say the the intervals are the observed
value statistic +/- 1.96 x the standard error of the boostrap sample.There is more 
to these objects than gets printed --- use
\texttt{str(bs.D)}  to check it out. I don't think there is much point trying 
to interpret confidence intervals estimated from 10 samples, but the point 
estimates seem to show a population with some substantial differentiation.

Next, we want to know if geography can explain that differentiation.
The \texttt{nancycats} data comes with coordinates for each population. We can 
use these to get Euclidean distances:

<<label=show_coords>>=
head(nancycats@other$xy, 4)
nc.pop_dists <- dist(nancycats@other$xy, method="euclidean")
@

\texttt{mmod} provides functions to calculate pairwise versions of each of
the differentiation statistics. Because we want to perform a Mantel test, we'll
use the ``linearized'' version of $D_\text{est}$, which is just $x/(1-x)$ (each of the
pairwise stats has and argument to return this version).

<<label=pw>>=
nc.pw_D <- pairwise_D(nancycats, linearized=TRUE)
@

The library \texttt{ade4}, which is loaded with \texttt{mmod}, provides
functions to perform Mantel tests on distance matrices.

<<label=mantel>>=
mantel.rtest(nc.pw_D, log(nc.pop_dists), 999)
@

So, the geographic distance between these populations can't explain the
genetic divergences we see: the correlation is small and non-significant.
If you like, we can also visualize this relationship (Figure~\ref{Distances}).

<<label=fig2plot>>=
fit <- lm(as.vector(nc.pw_D) ~ as.vector(nc.pop_dists))
plot(as.vector(nc.pop_dists), as.vector(nc.pw_D),
     ylab="pairwise D", xlab="physical distance")
abline(fit)
@
\begin{figure}
\begin{center}
<<label=fig2, fig=TRUE, echo=FALSE>>=
<<fig2plot>>
@
\end{center}
\caption{Geographic distance does not explain genetic differentiation}
\label{Distances}
\end{figure}

There are a couple of other functions that are not used here, and a few of use
the functions we have used have help messages that guide interpretation 
of their ressults - use \texttt{help(package="mmod")} to see the full 
documentation.
\end{document}
