---
title: "Introduction to the kmer R package"
author: "Shaun Wilkinson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: kmer.bib
csl: bioinformatics.csl
vignette: >
  %\VignetteIndexEntry{Introduction to the kmer package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
#knitr::opts_chunk$set(out.width='750px', dpi=200)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", dpi=500, out.width='500px')
```

--------------------------------------------------------------------------------

## Introduction
Agglomerative clustering methods that rely on a multiple sequence alignment and 
a matrix of pairwise distances can be computationally infeasible for large 
DNA and amino acid datasets.
Alternative k-mer based clustering methods involve enumerating all 
*k*-letter words in a sequence through a sliding window of length *k*.
The $n \times 4^k$ matrix of *k*-mer counts (where $n$ is the number of sequences) 
can then be used in place of a multiple sequence 
alignment to calculate distances and/or build a phylogenetic tree. 
**kmer** is an R package for clustering large 
sequence datasets using fast alignment-free *k*-mer counting. 
This can be achieved with or without a multiple sequence alignment, 
and with or without a matrix of pairwise distances.
These functions are detailed below with examples of their utility.


## Distance matrix computation
The function `kcount` is used to enumerate all 
*k*-mers within a sequence or set of sequences, 
by sliding a window of length *k* along each sequence and
counting the number of times each *k*-mer appears 
(for example, the $4^3 = 64$ possible DNA 3-mers: AAA, AAC, AAG, ..., TTT).
The `kdistance` function can then compute an alignment-free 
distance matrix, using a matrix of *k*-mer counts to derive 
the pairwise distances. 
The default distance metric used by `kdistance` is the 
*k*-mer (*k*-tuple) distance measure outlined in Edgar [-@Edgar2004c].
For two DNA sequences $a$ and $b$, the fractional common *k*-mer count over the 
$4^k$ possible words of length $k$ is calculated as:
$$F  = \sum\limits_{\tau}\frac{min (n_a(\tau), n_b (\tau))}{min (L_a , L_b ) - k + 1} \tag{1}$$

where $\tau$ represents each possible *k*-mer, 
$n_a(\tau)$ and $n_b(\tau)$ 
are the number of times $\tau$ appears in each sequence, 
$k$ is the *k*-mer length and $L$ is the sequence length. 
The pairwise distance between $a$ and $b$ is then calculated as: 

$$d = \frac{log(0.1 + F) - log(1.1)}{log(0.1)} \tag{2}$$

For $n$ sequences, the `kdistance` operation has time and memory 
complexity $O(n^2)$ and thus can become computationally infeasible 
when the sequence set is large (e.g. > 10,000 sequences).
As such, the **kmer** package also offers the function `mbed`, that 
only computes the distances from each sequence to a smaller (or equal) 
sized subset of 'seed' sequences [@Blackshields2010]. 
The default behavior of the `mbed` function is to select 
$t = (log_2n)^2$ seeds by clustering the sequences 
(*k*-means algorithm with $k = t$), and selecting one
representative sequence from each cluster. 
 
DNA and amino acid sequences can be passed to `kcount`, `kdistance` and `mbed` 
either as a list of non-aligned sequences or a matrix of aligned sequences, 
preferably in either the "DNAbin" or "AAbin" raw-byte format (see the 
**ape** package documentation for more information on these S3 classes). 
Character sequences are supported; however ambiguity 
codes may not be recognized or treated appropriately, since raw ambiguities 
are counted according to their underlying residue frequencies
(e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" 
and 0.5 to that of "ACGGT"). This excludes the ambiguity code "N", which 
is ignored.


#### Example 1: Compute k-mer distance matrices for the woodmouse dataset
The **ape** R package [@Paradis2004] contains a dataset of 15 aligned mitochondrial 
cytochrome *b* gene DNA sequences from the woodmouse *Apodemus sylvaticus*, 
originally published in Michaux et al. [-@Michaux2003].
While the **kmer** distance functions do not require sequences to 
be aligned, this example will enable us to compare the performance of the 
*k*-mer distances with the alignment-dependent distances produced by 
`ape::dist.dna`. First, load the dataset and view the first few rows and 
columns as follows:

```{r}
data(woodmouse, package = "ape")
ape::as.character.DNAbin(woodmouse[1:5, 1:5])
```


This is a semi-global ('glocal') alignment featuring some 
incomplete sequences, with unknown characters represented by 
the ambiguity code "n" (e.g. No305). 
To avoid artificially inflating the distances between these partial 
sequences and the others, we first trim the gappy ends by subsetting 
the global alignment (note that the **ape** function `dist.dna` 
also removes columns with ambiguity codes prior to distance computation 
by default).

```{r}
woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]
```

The following code first computes the full $n \times n$ distance matrix, 
and then the embedded distances of each sequence to three randomly selected 
seed sequences. 
In both cases the *k*-mer size is set to 6.

```{r}
### Compute the full distance matrix and print the first few rows and columns
library(kmer)
woodmouse.kdist <- kdistance(woodmouse, k = 6)
print(as.matrix(woodmouse.kdist)[1:7, 1:7], digits = 2)

### Compute and print the embedded distance matrix
suppressWarnings(RNGversion("3.5.0"))
set.seed(999)
seeds <- sample(1:15, size = 3)
woodmouse.mbed <- mbed(woodmouse, seeds = seeds, k = 6)
print(woodmouse.mbed[,], digits = 2)
```


#### Example 2: Alignment-free tree-building 
In this example the alignment-free *k*-mer distances calculated in 
Example 1 are compared with the Kimura [-@Kimura1980] 
distance metric as featured in the **ape** package examples.
The resulting neighbor-joining trees are visualized using the 
`tanglegram` function from the **dendextend** package.

```{r, message = FALSE, fig.height=4, fig.width=10, fig.align='left', out.width= '700px'}
## compute pairwise distance matrices
dist1 <- ape::dist.dna(woodmouse, model = "K80") 
dist2 <- kdistance(woodmouse, k = 7) 

## build neighbor-joining trees
phy1 <- ape::nj(dist1)
phy2 <- ape::nj(dist2)

## rearrange trees in ladderized fashion
phy1 <- ape::ladderize(phy1)
phy2 <- ape::ladderize(phy2)

## convert phylo objects to dendrograms
dnd1 <- as.dendrogram(phy1)
dnd2 <- as.dendrogram(phy2)

## plot the tanglegram
dndlist <- dendextend::dendlist(dnd1, dnd2)
dendextend::tanglegram(dndlist, fast = TRUE, margin_inner = 5)

```


**Figure 1:** Tanglegram comparing distance measures for the woodmouse sequences. 
Neighbor-joining trees derived from the alignment-dependent (left) and alignment-free (right) 
distances show congruent topologies.


\
\


##Clustering without a distance matrix
To avoid excessive time and memory use when building large trees 
(e.g. *n* > 10,000), the **kmer** package features the 
function `cluster` for fast divisive clustering, 
free of both alignment and distance matrix computation. 
This function first generates a matrix of *k*-mer counts, 
and then recursively partitions the matrix row-wise using successive 
k-means clustering (*k* = 2). 
While this method may not necessarily 
reconstruct sufficiently accurate phylogenetic trees for taxonomic 
purposes, it offers a fast and efficient means of producing large trees 
for a variety of other applications such as 
tree-based sequence weighting (e.g. Gerstein et al. [-@Gerstein1994]), 
guide trees for progressive multiple sequence alignment (e.g. Sievers 
et al. [-@Sievers2011]), and other recursive operations such as
classification and regression tree (CART) learning.

The package also features the function `otu` for rapid clustering of sequences
into operational taxonomic units based on a genetic distance (k-mer distance)
threshold. This function performs a similar operation to `cluster` in that it
recursively partitions a k-mer count matrix to assign sequences to groups. 
However, the top-down splitting only continues while the highest k-mer distance
within each cluster is above a defined threshold value.
Rather than returning a dendrogram, `otu` returns a named integer vector of 
cluster membership, with asterisks indicating the representative sequences
within each cluster. 


####Example 3: OTU clustering with *k*-mers
In this final example, the woodmouse dataset is clustered into operational 
taxonomic units (OTUs) with a maximum within-cluster *k*-mer distance of 0.03 
and with 20 random starts per k-means split (recommended for improved accuracy).

```{r}
suppressWarnings(RNGversion("3.5.0"))
set.seed(999)
woodmouse.OTUs <- otu(woodmouse, k = 5, threshold = 0.97, method = "farthest", nstart = 20)
woodmouse.OTUs
```

The function outputs a named integer vector of OTU membership, with
asterisks indicating the representative sequence from each cluster
(i.e. the most "central" sequence). 
In this case, three distinct OTUs were found, with No305 and N01114S 
forming one cluster (3), No0909S, No0912S, No1103S, No1007S and No1208S
forming another (2) and the remainder belonging to cluster 1 in
concordance with the consensus topology of Figure 1. 


##Concluding remarks
The **kmer** package is released under the GPL-3 license. Please direct bug 
reports to the GitHub issues page at <https://github.com/shaunpwilkinson/kmer/issues>.
Any feedback is greatly appreciated.


## Acknowledgements 
This software was developed with funding from a Rutherford Foundation Postdoctoral 
Research Fellowship from the Royal Society of New Zealand.

## References