---
title: 'Alakazam: Using sequencing quality scores'
author: "Susanna Marquez"
date: '`r Sys.Date()`'
output:
  pdf_document:
    dev: pdf
    fig_height: 4
    fig_width: 7.5
    highlight: pygments
    toc: yes  
    toc_depth: 3
  md_document:
    fig_height: 4
    fig_width: 7.5
    preserve_yaml: no
    toc: yes
    toc_depth: 3
  html_document:
    fig_height: 4
    fig_width: 7.5
    highlight: pygments
    theme: readable
    toc: yes
    toc_depth: 3
geometry: margin=1in
fontsize: 11pt
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Fastq}
  %\usepackage[utf8]{inputenc}
---

The `alakazam` package includes a set of functions to inspect the sequencing quality.

## Example data

Load example data:

```{r, eval=TRUE, warning=FALSE, message=FALSE}
library(alakazam)
library(dplyr)
library(airr)

db <- read_rearrangement(system.file("extdata", "example_quality.tsv", package="alakazam"))
fastq_file <- system.file("extdata", "example_quality.fastq", package="alakazam")
```

## Load quality scores

This method allows to add the quality scores to the repertoire `data.frame` as strings.

```{r}
original_cols <- colnames(db)
db <- readFastqDb(db, fastq_file, style="both", quality_sequence=TRUE)
new_cols <- setdiff(colnames(db), original_cols)
db[,new_cols] %>% head()
```

The function `readFastq` takes as main inputs a repertoire `data.frame` (`db`) and 
a path to the corresponding `.fastq` file (`fastq_file`). The sequencing quality scores will
be merged into the `data.frame` by `sequence_id`. The newly added columns are:
`r paste(new_cols, collapse=", ")`. The other fields, contain the ASCII quality scores in the 
form of a vector, where values are comma separated, and `-` or `.` positions 
have value `" "` (blank).

After loading the quality scores with `readFastqDb`,  `getPositionQuality`
can be used to generate a `data.frame` of sequencing quality values 
per position.

```{r}
quality <- getPositionQuality(db, sequence_id="sequence_id", 
                              sequence="sequence_alignment",
                              quality_num="quality_alignment_num")
head(quality)
```

```{r, fig.cap="Sequence quality per IMGT position for one sequence.", fig.asp=0.25}
min_pos <- min(quality$position)
max_pos <- max(quality$position)

ggplot(quality, aes(x=position,
                    y=quality_alignment_num,
                    color=nt)) +
  geom_point() +
  coord_cartesian(xlim=c(110,120)) +
  xlab("IMGT position") +
  ylab("Sequencing quality") +
  scale_fill_gradient(low = "light blue",  high = "dark red") +
  scale_x_continuous(breaks=c(min_pos:max_pos)) +
  alakazam::baseTheme()
```

You can add use the quality `data.frame` to complement analysis performed
with other tools from the Immcantation framework. For example, you could inspect
the sequencing quality of novel polymorphisms identified with `tigger`, or
the sequencing quality in mutated/unmutated regions.

## Mask low quality positions

Use `maskPositionsByQuality` to mask low quality positions. Positions with
a sequencing quality < `min_quality` will be replaced with an 'N'. A message
will show the number of sequences in `db` that had at least one position
masked.

```{r}
db <- maskPositionsByQuality(db, min_quality=70,
                             sequence="sequence_alignment",
                             quality="quality_alignment_num")
```
