\documentclass[letterpaper]{article}
%\VignetteIndexEntry{Algorightms and Equations}
%\VignetteEngine{knitr::knitr}
\usepackage{graphicx}
\usepackage[colorlinks = true,
            urlcolor = blue,
            citecolor = blue,
            linkcolor = blue]{hyperref}
\usepackage{array}
\usepackage{color}
\usepackage[dvipsnames,svgnames,table]{xcolor}
\usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote()
\usepackage{fullpage}
\usepackage{mathtools}
\usepackage{makeidx}
\usepackage{longtable}
\usepackage{natbib}

% for bold symbols in mathmode
\usepackage{bm}
\newcommand{\R}{\mathbb{R}}
\newcommand{\m}[1]{\mathbf{#1}}
\newcommand{\tab}{\hspace*{1em}}
\newcolumntype{H}{>{\setbox0=\hbox\bgroup}c<{\egroup}@{}}
\newcommand{\cmdlink}[2]{%
  \texttt{\hyperref[#1]{#2}}%
}
\newcommand{\seclink}[2]{%
  \textsc{\hyperref[#1]{#2}}%
}

\newcommand{\poppr}{\textit{poppr}}
\newcommand{\Poppr}{\textit{Poppr}}
\newcommand{\adegenet}{\textit{adegenet}}
\newcommand{\Adegenet}{\textit{Adegenet}}
\newcommand{\tline}{
  \noindent
  \rule{\textwidth}{1pt}
  \par
}
\newcommand{\bline}{
  \noindent
  \rule{\textwidth}{1pt}
  \kern1pt
}

\newcommand{\jala}{
  \includegraphics[height = 5mm, keepaspectratio=true]{jalapeno-poppers}
}

\newcommand{\revjala}{
  \scalebox{-1}[1]{\jala{}}
}

\title{Algorithms and equations utilized in poppr version \Sexpr{packageVersion("poppr")}}
\author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1)
Department of Botany and Plant Pathology, Oregon State University, Corvallis,
OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS,
Corvallis, OR}}




\begin{document}

<<echo=FALSE, message=FALSE>>=
knitr::opts_knit$set(out.format = "latex")
thm <- knitr::knit_theme$get("acid")
knitr::knit_theme$set(thm)
knitr::opts_chunk$set(concordance=TRUE)
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
knitr::opts_chunk$set(out.width = '0.95\\linewidth', fig.align = "center", fig.show = 'asis')
@
\maketitle

\begin{abstract}
This vignette is focused on simply explaining the different algorithms utilized
in calculations such as the index of association and different distance measures.
Many of these are previously described in other papers and it would be prudent
to cite them properly if they are used.
\end{abstract}

\begin{figure}[b]   
  \centering
  \label{logo}   
  \includegraphics{jalapeno_logo} 
\end{figure} 

\newpage 
\begingroup
  \hypersetup{linkcolor=black} 
  \tableofcontents 
\endgroup 

\section{Mathematical representation of data in \adegenet{} and \poppr{}}

The sections dealing with the index of association and genetic distances will be
based on the same data structure, a matrix with samples in rows and alleles in
columns. The number of columns is equal to the total number of alleles observed
in the data set. Much of this description is derived from \adegenet{}'s
\texttt{dist.genpop} manual page.

\subsection{Caveat}

The information here describes the data as it would be passed into the distance
functions. Data represented in the genind objects are now counts of alleles
instead of frequencies of alleles. Otherwise, these equations still hold. 

\begin{quote}
Let \textbf{A} be a table containing allelic frequencies with $t$
samples\footnote{populations or individuals} (rows) and $m$ alleles (columns).\\
\end{quote}
The above statement describes the table present in genind or genpop object
where, instead of having the number of columns equal the number of loci, the
number of columns equals the number of observed alleles in the entire data set.

\begin{quote}
Let $\nu$ be the number of loci. The locus $j$ gets $m(j)$ alleles. 

\begin{equation}
  m=\sum_{j=1}^{\nu} m(j)
\end{equation}
\end{quote}

So, if you had a data set with 5 loci that had 2 alleles each, your table
would have ten columns. Of course, codominant loci like microsatellites have
varying numbers of alleles.


\begin{quote}
For the row $i$ and the modality $k$ of the variable $j$, notice the value

\begin{equation}
  a_{ijk}\\ (1 \leq i \leq t,\\ 1 \leq j \leq \nu,\\ 1 \leq k \leq m(j))
\end{equation}

\begin{equation}
  a_{ij\cdot}=\sum_{k=1}^{m(j)}a_{ijk} 
\end{equation}

\begin{equation}
  p_{ijk}=\frac{a_{ijk}}{a_{ij\cdot}}
\end{equation}
\end{quote}

The above couple of equations are basically defining the allele counts
($a_{ijk}$) and frequency ($p_{ijk}$). Remember that $i$ is individual, $j$ is
locus, and $k$ is allele. The following continues to describe properties of the
frequency table used for analysis:

\begin{quote}
\begin{equation}
 p_{ij\cdot}=\sum_{k=1}^{m(j)}p_{ijk}=1
\end{equation}
\end{quote}
The sum of all allele frequencies for a single population (or individual) at a 
single locus is one.

\begin{quote}
\begin{equation}
p_{i{\cdot}\cdot}=\sum_{j=1}^{\nu}p_{ij\cdot}=\nu
\end{equation}
\end{quote}
The sum of all allele frequencies over all loci is equal to the number of loci.

\begin{quote} 
\begin{equation}
p_{{\cdot}{\cdot}\cdot}=\sum_{j=1}^{\nu}p_{i{\cdot}\cdot}=t\nu
\end{equation}
\end{quote}
The the sum of the entire table is the sum of all loci multiplied by the number 
of populations (or individuals).


% %
% %-----------------------------------------------------------------------------%
% %


\section{The Index of Association}
\label{indexassoc}

The index of association was originally developed by A.H.D. Brown analyzing
population structure of wheat and has been widely used as a tool to detect
clonal reproduction within populations \citep{Brown:1980, Smith:1993}.
Populations whose members are undergoing sexual reproduction, whether it be
selfing or out-crossing, will produce gametes via meiosis, and thus have a
chance to shuffle alleles in the next generation. Populations whose members are
undergoing clonal reproduction, however, generally do so via mitosis.

The most likely mechanism for a change in genotype, for a clonal organism, is 
mutation. The rate of mutation varies from species to species, but it is rarely
sufficiently high to approximate a random shuffling of alleles. The index of
association is a calculation based on the ratio of the variance of the raw
number of differences between individuals and the sum of those variances over
each locus \citep{Smith:1993}. It can also be thought of as the observed variance
over the expected variance. If both variances are equal, then the index is zero
after subtracting one (from Maynard-Smith, 1993 \citep{Smith:1993}):
\begin{equation}
\label{eq:I_A}
I_A = \frac{V_O}{V_E}-1
\end{equation}
Any sort of marker can be used for this analysis as it only counts differences
between pairs of samples. This can be thought of as a distance whose maximum is
equal to the number of loci multiplied by the ploidy of the sample. This is
calculated using an absolute genetic distance.

Remember that in \poppr{}, genetic data is stored in a table where the rows
represent samples and the columns represent potential allelic states grouped by
locus. Notice also that the sum of the rows all equal one. \Poppr{} uses this to
calculate distances by simply taking the sum of the absolute values of the
differences between rows.

The calculation for the distance between two individuals at a single locus $j$
with $m(j)$ allelic states and a ploidy of $l$ is as follows\footnote{Individuals
with Presence / Absence data will have the $l/2$ term dropped.}:
\begin{equation}
\label{eq:ia_d}
  d(a,b)=\frac{l}{2} \sum_{k=1}^{m(j)}
  |p_{ajk} - p_{bjk}|
\end{equation}

% \begin{equation}

% d = \displaystyle \frac{k}{2}\sum_{i=1}^{a} \mid ind_{Ai} - ind_{Bi}\mid
% \end{equation}
\noindent
To find the total number of differences between two individuals over all loci,
you just take $d$ over $\nu$ loci, a value we'll call $D$:

\begin{equation}
\label{eq:ia_D}
D(a,b) = \displaystyle \sum_{i=1}^{\nu} d_i
\end{equation}
An interesting observation: $D(a,b)/(l\nu)$ is Prevosti's distance.

These values are calculated over all possible combinations of individuals in the
data set, ${n \choose 2}$ after which you end up with ${n \choose 2}\cdot{}\nu$
values of $d$ and ${n \choose 2}$ values of $D$. Calculating the observed
variances is fairly straightforward (modified from Agapow and Burt, 2001)
\citep{Agapow:2001}:

\begin{equation}
\label{eq:V_O}
V_O = \frac{\displaystyle \sum_{i=1}^{n \choose 2} D_{i}^2 - \frac{\left(\displaystyle\sum_{i=1}^{n \choose 2} D_{i}\right)^2}{{n \choose 2}}}{{n \choose 2}}
\end{equation}

Calculating the expected variance is the sum of each of the variances of the
individual loci. The calculation at a single locus, $j$ is the same as the
previous equation, substituting values of $D$ for $d$ \citep{Agapow:2001}:

\begin{equation}
\label{eq:var_j}
var_j = \frac{\displaystyle \sum_{i=1}^{n \choose 2} d_{i}^2 - \frac{\left(\displaystyle\sum_{i=1}^{n \choose 2} d_i\right)^2}{{n \choose 2}}}{{n \choose 2}}
\end{equation}

The expected variance is then the sum of all the variances over all $\nu$ loci
\citep{Agapow:2001}:

\begin{equation}
\label{eq:V_E}
V_E = \displaystyle \sum_{j=1}^{\nu} var_j
\end{equation}

Now you can plug the sums of equations (\ref{eq:V_O}) and (\ref{eq:V_E}) into
equation (\ref{eq:I_A}) to get the index of association. Of course, Agapow and
Burt showed that this index increases steadily with the number of loci, so they
came up with an approximation that is widely used, $\bar r_d$
\citep{Agapow:2001}. For the derivation, see the manual for \textit{multilocus}.
The equation is as follows, utilizing equations (\ref{eq:V_O}),
(\ref{eq:var_j}), and (\ref{eq:V_E}) \citep{Agapow:2001}:

\begin{equation}
\label{eq:r_d}
\bar{r}_d = \frac{V_O - V_E}
{2\displaystyle \sum_{j=1}^{\nu}\displaystyle \sum_{k \neq j}^{\nu}\sqrt{var_j\cdot{}var_k}}
\end{equation}

\section{Genetic distances}

Genetic distances are great tools for analyzing diversity in
populations as they are the basis for creating dendrograms with bootstrap
support and also for AMOVA. This section will simply present different genetic
distances along with a few notes about them. Most of these distances are derived
from the \textit{ade4} and \adegenet{} packages, where they were implemented as
distances between populations. \Poppr{} extends the implementation to individuals
as well (with the exception of Bruvo's distance).

\begin{table}[ht]
\centering
\caption{Distance measures and their respective assumptions}
\begin{tabular}{lllll}
  \hline
 Method & Function & Assumption & Euclidean & Citation\\ 
  \hline
Prevosti & \texttt{prevosti.dist} & - & No & \citep{prevosti1975distances}\\
 & \texttt{diss.dist} & & & \\
Nei & \texttt{nei.dist} & Infinite Alleles & No & \citep{nei1972genetic, nei1978estimation}\\
 & & Genetic Drift & & \\
Edwards & \texttt{edwards.dist} & Genetic Drift & Yes & \citep{edwards1971distances}\\
Reynolds & \texttt{reynolds.dist} & Genetic Drift & Yes & \citep{reynolds1983estimation}\\
Rogers & \texttt{rogers.dist} & - & Yes & \citep{rogers1972measures}\\
Bruvo & \texttt{bruvo.dist} & Stepwise Mutation & No & \citep{Bruvo:2004}\\
   \hline
\end{tabular}
\end{table}

\subsection{Distances that assume genetic drift}
\subsubsection{Nei's 1978 Distance}
\label{distance:nei}
\begin{equation}
  D_{Nei}(a,b)= -\ln\left(\frac{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)}
  p_{ajk} p_{bjk}}{\sqrt{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)}
  {(p_{ajk}) }^2}\sqrt{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)}
  {(p_{bjk})}^2}}\right)
\end{equation}

Note: if comparing individuals in \poppr{}, those that do not share any alleles
normally receive a distance of $\infty$. As you cannot draw a dendrogram with 
infinite branch lengths, infinite values are converted to a value that is equal
to an order of magnitude greater than the largest finite value.

\subsubsection{Edwards' angular distance}
\label{distance:edwards}
\begin{equation}
  D_{Edwards}(a,b)=\sqrt{1-\frac{1}{\nu} \sum_{k=1}^{\nu}
  \sum_{j=1}^{m(k)} \sqrt{p_{ajk}  p_{bjk}}}
\end{equation}

\subsubsection{Reynolds' coancestry distance}
\label{distance:reynolds}
\begin{equation}
  D_{Reynolds}(a,b)=\sqrt{\frac{\sum_{k=1}^{\nu}
  \sum_{j=1}^{m(k)}{(p_{ajk} - p_{bjk})}^2}{2 \sum_{k=1}^{\nu} \left(1-
  \sum_{j=1}^{m(k)}p_{ajk} p_{bjk}\right)}}
\end{equation}

\subsection{Distances without assumptions}
\subsubsection{Rogers' distance}
\label{distance:rogers}
\begin{equation}
  D_{Rogers}(a,b)=\frac{1}{\nu} \sum_{k=1}^{\nu} \sqrt{\frac{1}{2}
  \sum_{j=1}^{m(k)}{(p_{ajk} - p_{bjk})}^2}
\end{equation}


\subsubsection{Prevosti's absolute genetic distance}
\label{distance:prevosti}
\begin{equation}
  D_{Prevosti}(a,b)=\frac{1}{2{\nu}} \sum_{k=1}^{\nu} \sum_{j=1}^{m(k)}
  |p_{ajk} - p_{bjk}|
\end{equation}
Note: for AFLP data, the $2$ is dropped. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Bruvo's distance (stepwise mutation for microsatellites)}
\label{bruvo}

Bruvo's distance between two individuals calculates the minimum distance across
all combinations of possible pairs of alleles at a single locus and then
averaging that distance across all loci \citep{Bruvo:2004}. The distance between
each pair of alleles is calculated as\footnote{Notation presented unmodified
from Bruvo et al, 2004}\citep{Bruvo:2004}:

\begin{equation}
\label{eq:m_x}
m_x = 2^{-\mid x \mid}
\end{equation}

\begin{equation}
\label{eq:d_a}
d_a = 1 - m_x
\end{equation}

Where $x$ is the number of steps between each allele. So, let's say we were
comparing two haploid $(k = 1)$ individuals with alleles 228 and 244 at a locus
that had a tetranucleotide repeat pattern (CATG$)^n$. The number of steps for
each of these alleles would be $228/4 = 57$ and $244/4 =61$, respectively. The
number of steps between them is then $\mid 57 - 61 \mid = 4$. Bruvo's distance
at this locus between these two individuals is then $1-2^{-4} = 0.9375$. For
samples with higher ploidy ($k$), there would be $k$ such distances of which we
would need to take the sum \citep{Bruvo:2004}.

\begin{equation}
\label{eq:s_i}
s_i = \displaystyle \sum_{a=1}^{k} d_a
\end{equation}

Unfortunately, it's not as simple as that since we do not assume to know phase.
Because of this, we need to take all possible combinations of alleles into
account. This means that we will have $k^2$ values of $d_a$, when we only want
$k$. How do we know which $k$ distances we want? We will have to invoke
parsimony for this and attempt to take the minimum sum of the alleles, of which
there are $k!$ possibilities \citep{Bruvo:2004}:

\begin{equation}
\label{eq:d_l}
d_l = \frac{1}{k}\left(\min_{i \dotsc k!} s_i\right)
\end{equation}
\noindent
Finally, after all of this, we can get the average distance over all loci
\citep{Bruvo:2004}.

\begin{equation}
\label{eq:D}
D = \frac{1}{l}\sum_{i=1}^l d_i
\end{equation}
\noindent
This is calculated over all possible combinations of individuals and results in
a lower triangle distance matrix over all individuals.

\subsubsection{Special cases of Bruvo's distance}
\label{appendix:algorithm:bruvospecial}
As shown in the above section, ploidy is irrelevant with respect to
calculation of Bruvo's distance. However, since it makes a comparison between
all alleles at a locus, it only makes sense that the two loci need to have the
same ploidy level. Unfortunately for polyploids, it's often difficult to fully
separate distinct alleles at each locus, so you end up with genotypes that
appear to have a lower ploidy level than the organism \citep{Bruvo:2004}.

To help deal with these situations, Bruvo has suggested three methods for dealing
with these differences in ploidy levels \citep{Bruvo:2004}:
\begin{itemize}
  \item{Infinite Model -} The simplest way to deal with it is to count all
  missing alleles as infinitely large so that the distance between it and
  anything else is 1. Aside from this being computationally simple, it will tend
  to inflate distances between individuals.
  \item{Genome Addition Model -} If it is suspected that the organism has gone
  through a recent genome expansion, the missing alleles will be replace with
  all possible combinations of the observed alleles in the shorter genotype. For
  example, if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele,
  the possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], 
  [69, 70, 70, 69], and [69, 70, 70, 70]. The resulting distances are then 
  averaged over the number of comparisons.
  \item{Genome Loss Model -} This is similar to the genome addition model,
  except that it assumes that there was a recent genome reduction event and uses
  the observed values in the full genotype to fill the missing values in the
  short genotype. As with the Genome Addition Model, the resulting distances are
  averaged over the number of comparisons.
  \item{Combination Model -} Combine and average the genome addition and loss
  models.
\end{itemize}

As mentioned above, the infinite model is biased, but it is not nearly as
computationally intensive as either of the other models. The reason for this is
that both of the addition and loss models requires replacement of alleles and
recalculation of Bruvo's distance. The number of replacements required is $n^k$
where $n$ is the number of potential replacements and $k$ is the number of
alleles to be replaced.

To reduce the number of calculations and assumptions otherwise, Bruvo's distance
will be calculated using the largest observed ploidy in pairwise comparisons. 
This means that when comparing [69,70,71,0] and [59,60,0,0], they will be 
treated as triploids.

\subsubsection{Choosing a model}
\label{appendix:algorithm:bruvomodel}
By default, the implementation of Bruvo's distance in \poppr{} will utilize the
combination model. This is implemented by setting both the \texttt{add} and 
\texttt{loss} arguments to \texttt{TRUE}. For other models use the following 
table for reference:

\begin{table}[ht]
\centering
\begin{tabular}{ll}
  \hline
Model & Arguments \\
  \hline
Infinite & \texttt{add = FALSE, loss = FALSE} \\
Genome Addition & \texttt{add = TRUE, loss = FALSE} \\
Genome Loss & \texttt{add = FALSE, loss = TRUE} \\
Combination (default) & \texttt{add = TRUE, loss = TRUE} \\
  \hline
\end{tabular}
\end{table}

\subsection{Tree topology}

All of these distances were designed for analysis of populations. When applying
them to individuals, we must change our interpretations. For example, with Nei's
distance, branch lengths increase linearly with mutation
\citep{nei1972genetic,nei1978estimation}. When two populations share no alleles,
then the distance becomes infinite. However, we expect two individuals to
segregate for different alleles more often than entire populations, thus we
would expect exaggerated internal branch lengths separating clades. To
demonstrate the effect of the different distances on tree topology, we will use
5 diploid samples at a single locus demonstrating a range of possibilities:

\begin{table}[ht]
\centering
\begin{tabular}{c}
  \hline
Genotype \\ 
  \hline
1/1 \\ 
  1/2 \\ 
  2/3 \\ 
  3/4 \\ 
  4/4 \\ 
   \hline
\end{tabular}
\caption{Table of genotypes to be used for analysis} 
\end{table}

<<>>=
library("poppr")
dat.df <- data.frame(Genotype = c("1/1", "1/2", "2/3", "3/4", "4/4"))
dat <- as.genclone(df2genind(dat.df, sep = "/", ind.names = dat.df[[1]]))
@

We will now compute the distances and construct neighbor-joining dendrograms
using the package \textit{ape}. This allows us to see the effect of the
different distance measures on the tree topology.

<<>>=
distances <- c("Nei", "Rogers", "Edwards", "Reynolds", "Prevosti")
dists <- lapply(distances, function(x){
  DISTFUN <- match.fun(paste(tolower(x), "dist", sep = "."))
  DISTFUN(dat)
})
names(dists) <- distances

# Adding Bruvo's distance at the end because we need to specify repeat length.
dists$Bruvo <- bruvo.dist(dat, replen = 1)
library("ape")
par(mfrow = c(2, 3))
x <- lapply(names(dists), function(x){ 
  plot(nj(dists[[x]]), main = x, type = "unrooted") 
  add.scale.bar(lcol = "red", length = 0.1)
})
@

\section{AMOVA}

AMOVA in \poppr{} acts as a wrapper for the \textit{ade4} implementation, which
is an implementation of Excoffier's original formulation
\citep{excoffier1992analysis}. As the calculation relies on a genetic distance
matrix, \poppr{} calculates the distance matrix as the number of differing sites
between two genotypes using equation \ref{eq:ia_D}. This is equivalent to
Prevosti's distance multiplied by the ploidy and number of loci. This is also
equivalent to Kronecker's delta as presented in \citep{excoffier1992analysis}.
As is the practice from \textsc{Arlequin}, within sample variance is calculated
by default. Within sample variance can only be calculated on diploid data.

\section{Genotypic Diversity}

Many of the calculations of genotypic diversity exist within the package
\textit{vegan} in the \texttt{diversity} function. Descriptions of most
calculations can be found in the paper by \cite{Grunwald:2003}.

\subsection{Rarefaction, Shannon-Wiener, Stoddart and Taylor indices}

Stoddart and Taylor's index is also known as Inverse Simpson's index. A detailed
description of these can be found in the ``Diversity'' vignette in
\textit{vegan}. You can access it by typing \texttt{vignette("diversity-vegan")}

\subsection{Evenness ($E_{5}$)}
Evenness ($E_{5}$) is essentially the ratio of the number of abundant genotypes
to the number of rarer genotypes. For the function \texttt{locus\_table}, these
indices represent the abundance of alleles.

This is simply calculated as 

\begin{equation}
E_{5} = \frac{(1/\lambda) - 1}{e^{H} - 1}
\end{equation}

Where $1/\lambda$ is Stoddart and Taylor's index and $H$ is Shannon diversity
\citep{Stoddart:1988,Shannon:1948}. The reason for this formulation is because
$1/\lambda$ is weighted for more abundant genotypes and $H$ is weighted for 
rarer genotypes.

\subsection{Hexp}

$H_{exp}$ stands for Nei's 1978 expected heterozygosity, also known as gene
diversity \citep{nei1978estimation}. Previously, in \poppr{}, it was calculated
over multilocus genotypes, which was more analogous to a correction for
Simpson's index. This has been corrected in version 2.0 and is now correctly
calculated over alleles. The following is adapted from equations 2 and 3 in
\citep{nei1978estimation}:

\begin{equation}
h = \left(\frac{n}{n-1}\right) 1 - \sum_{i = 1}^k{p^{2}_{i}}
\end{equation}

\begin{equation}
H_{exp} = \sum_{l = 1}^m{h_l/m}
\end{equation}

where $p_i$ is the population frequency of the $i^{th}$ allele in a locus, $k$
is the number of alleles in a particular locus, $n$ represents the number of
\textbf{observed alleles} in the population, and $h_l$ is the value of $h$ at
the $l^{th}$ locus.

Nei's definition originally corrected the sum of squared allele frequencies with
$2N/(2N - 1)$, where $N$ is the number of diploid samples
\citep{nei1978estimation}. This definition is still congruent with that since we
would observe $2N$ alleles in a diploid locus. The only caveat is that
populations where allelic dosage is unknown will have an inflated estimate of
allelic diversity due to the inflation of rare alleles.

\section{Probability of Genotypes: $p_{gen}$}

The metric $p_{gen}$ represents the probability that a given multilocus genotype
will occur in a population \citep{parks1993study, arnaud2007standardizing}. For 
$\nu$ loci, it is calculated as:

\begin{equation}
h = \begin{cases} 1 \text{ if heterozygous},\\ 0 \text{ if homozygous or haploid}  \end{cases}
\end{equation}

\begin{equation}
p_{gen} = \prod_{j = 1}^\nu\left(2^{h}\prod_{i = 1}^{m(j)} p_{ij}\right) \\
\end{equation}

where $p_{ij}$ represents the allele frequency of the $i^{th}$ allele present in
the sample at the $j^{th}$ locus derived using the round-robin method
\citep{parks1993study, arnaud2007standardizing}. The round-robin allele
frequencies for a given locus are derived by first clone-correcting with the
genotypes found at all other loci \citep{parks1993study,
arnaud2007standardizing}. Any allele that ends up with a zero frequency are
assigned a value of $1/N$, where $N$ is the number of samples.

\section{Probability a genotype is derived from sexual reproduction: $p_{sex}$}

The probability of a genotype can provide information as to how many times we
expect to see that genotype repeated in our population. \citet{parks1993study}
defined the probability of encountering a genotype a second time from an
independent reproductive event as:

\begin{equation}
p_{sex} = 1 - (1 - p_{gen})^G
\end{equation}

Where $G$ is the number of multilocus genotypes in the data. Finding a genotype
$n$ times out of $N$ samples is derived from the binomial distribution 
\citep{parks1993study, arnaud2007standardizing}:

\begin{equation}
p_{sex} = \sum_{i = n}^N {N \choose i} \left(p_{gen}\right)^i\left(1 - p_{gen}\right)^{N - i}
\end{equation}

The \poppr{} function \texttt{psex()} can calculate both of these quantities. 

\bibliographystyle{authordate1}
\bibliography{the_bibliography}
\end{document}
