---
title: "Oracle Calculations"
author: "David Gerard"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Oracle Calculations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=4.5,
  fig.height=3.5
)
```

# Abstract

We provide example usage of the oracle calculations available in `updog`. These are particularly useful for choosing an appropriate read-depth. These calculations are described in detail in Gerard et al. (2018).

# Controlling Misclassification Error

Suppose we have a sample of tetraploid individuals derived from an S1 cross (a single generation of selfing). Using domain expertise (either from previous studies or a pilot analysis), we've determined that our sequencing technology will produce relatively clean data. That is, the sequencing error rate will not be too large (say, ~0.001), the bias will be moderate (say, ~0.7 at the most extreme), and the majority of SNPs will have reasonable levels of overdispersion (say, less than 0.01). We want to know how deep we need to sequence.

Using `oracle_mis`, we can see how deep we need to sequence under the worst-case scenario we want to control (sequencing error rate = 0.001, bias = 0.7, overdispersion = 0.01) in order to obtain a misclassification error rate of at most, say, 0.05.

```{r}
bias   <- 0.7
od     <- 0.01
seq    <- 0.001
maxerr <- 0.05
```

Before we can do this, we also need the distribution of the offspring genotypes. We can get this distribution under various parental genotypes using the `get_q_array` function. Typically, misclassification error rates are largest when the allele frequency is close to 0.5, so we'll assume the worst-case scenario where the parent has 2 copies of the reference allele (a short sketch below checks this claim).

```{r, message=FALSE}
library(updog)
ploidy <- 4
pgeno <- 2
gene_dist <- get_q_array(ploidy = ploidy)[pgeno + 1, pgeno + 1, ]
```

This is what the genotype distribution for the offspring looks like:
```{r, message=FALSE}
library(ggplot2)
distdf <- data.frame(x = 0:ploidy, y = 0, yend = gene_dist)
ggplot(distdf, mapping = aes(x = x, y = y, xend = x, yend = yend)) +
  geom_segment(lineend = "round", lwd = 2) +
  theme_bw() +
  xlab("Allele Dosage") +
  ylab("Probability")
```
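
As a quick check on the claim that genotyping is hardest when the parental genotype pushes the allele frequency toward 0.5, here is a small sketch (not part of the original workflow) that compares the oracle misclassification error rate across all possible S1 parental genotypes. The read-depth of 30 is an arbitrary choice for illustration.

```{r}
## Oracle misclassification error rate for each possible S1 parental
## genotype, at an arbitrary read-depth of 30 reads.
err_by_pgeno <- sapply(0:ploidy, function(pg) {
  dist_pg <- get_q_array(ploidy = ploidy)[pg + 1, pg + 1, ]
  oracle_mis(n = 30, ploidy = ploidy, seq = seq, bias = bias,
             od = od, dist = dist_pg)
})
names(err_by_pgeno) <- 0:ploidy
err_by_pgeno
```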

Now we are ready to iterate through read-depths until we reach one with an error rate of at most 0.05.

```{r}
err    <- Inf
depth  <- 0
while(err > maxerr) {
  depth <- depth + 1
  err <- oracle_mis(n = depth,
                    ploidy = ploidy,
                    seq = seq,
                    bias = bias,
                    od = od,
                    dist = gene_dist)
}
depth
```

It looks like we need a read-depth of `r depth` to get a misclassification error rate under `r maxerr`.


Note that `oracle_mis` returns the **best possible misclassification error rate** under these conditions (`ploidy` = `r ploidy`, `bias` = `r bias`, `seq` = `r seq`, `od` = `r od`, and `pgeno` = `r pgeno`). In an actual analysis, your misclassification error rate will be worse than the one returned by `oracle_mis`. However, if you have many individuals in your sample, the oracle rate is a reasonable approximation to the error rate you can achieve. In general, though, you should sequence a little deeper than `oracle_mis` suggests.
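
To see how much margin additional depth buys, here is a small sketch (not part of the original workflow) that evaluates `oracle_mis` over a grid of read-depths and plots the error rate against the 0.05 target. The grid of 1 to 200 reads and the dashed reference line are arbitrary illustrative choices.

```{r}
## Oracle misclassification error rate over a grid of read-depths.
## The grid of 1 to 200 reads is an arbitrary illustrative choice.
depth_grid <- 1:200
err_grid <- sapply(depth_grid, oracle_mis,
                   ploidy = ploidy, seq = seq, bias = bias,
                   od = od, dist = gene_dist)
errdf <- data.frame(depth = depth_grid, err = err_grid)
ggplot(errdf, mapping = aes(x = depth, y = err)) +
  geom_line() +
  geom_hline(yintercept = maxerr, linetype = "dashed") +
  theme_bw() +
  xlab("Read-depth") +
  ylab("Oracle misclassification error rate")
```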

# Visualizing the Joint Distribution

Suppose we only have the budget to sequence to a depth of 30. What errors can we expect? We can use `oracle_joint` and `oracle_plot` to visualize them.

```{r}
depth <- 30
jd <- oracle_joint(n = depth,
                   ploidy = ploidy,
                   seq = seq,
                   bias = bias,
                   od = od,
                   dist = gene_dist)
oracle_plot(jd)
```

Most of the errors come from confusing genotype 2 with genotype 3 and genotype 1 with genotype 2.
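
We can quantify this directly from the joint distribution. The short sketch below sums the joint probabilities for the adjacent dosage pairs 1/2 and 2/3 (in both directions, so the row/column orientation of `jd` does not matter) and compares them with the total misclassification probability. The only assumption is that the rows and columns of `jd` are indexed by dosage 0 through 4, as in the plot above.

```{r}
## Probability of confusing dosages 1 and 2, and dosages 2 and 3,
## summed in both directions so the orientation of jd does not matter,
## compared with the total misclassification probability.
p12 <- jd[2, 3] + jd[3, 2]
p23 <- jd[3, 4] + jd[4, 3]
total_err <- sum(jd) - sum(diag(jd))
c(p12 = p12, p23 = p23, total = total_err)
```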

```{r, echo=FALSE}
## Quantities used for the inline numbers in the paragraph below.
omiss <- oracle_mis(n = depth,
                    ploidy = ploidy,
                    seq = seq,
                    bias = bias,
                    od = od,
                    dist = gene_dist)

ocorr <- oracle_cor(n = depth,
                    ploidy = ploidy,
                    seq = seq,
                    bias = bias,
                    od = od,
                    dist = gene_dist)
```


Even though the misclassification error rate is pretty high (`r round(omiss, digits = 2)`), the correlation of the oracle estimator with the true genotype is still reasonably large (`r round(ocorr, digits = 2)`). You can obtain this correlation with the `oracle_cor` function.

```{r}
ocorr <- oracle_cor(n = depth,
                    ploidy = ploidy,
                    seq = seq,
                    bias = bias,
                    od = od,
                    dist = gene_dist)
ocorr
```

# References

Gerard, David, Luís Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. "Genotyping Polyploids from Messy Sequencing Data." *Genetics* 210 (3): 789–807. <https://doi.org/10.1534/genetics.118.301468>.






