---
title: "Introduction to santoku"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to santoku}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
set.seed(23479)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(digits = 4)
```


## Introduction

Santoku is a package for cutting data into intervals. It provides `chop()`, 
a replacement for base R's `cut()` function, as well as several convenience
functions to cut different kinds of intervals.

To install santoku, run:

``` r
install.packages("santoku")
```

## Basic usage

Use `chop()` like `cut()`, to cut numeric data into intervals between a set of
`breaks`.

```{r}
library(santoku)

x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
data.frame(x, chopped)
```

`chop()` returns a factor.

If data is beyond the limits of `breaks`, they will be extended automatically:

```{r}
chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
```

To chop a single number into a separate category, put the number twice in
`breaks`:

```{r}
x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
```


To quickly produce a table of chopped data, use `tab()`:

```{r}
tab(1:10, c(2, 5, 8))
```

## Chopping by width and number of elements

To chop into fixed-width intervals, starting at the minimum value, use
`chop_width()`:

```{r}
chopped <- chop_width(x, 2)
data.frame(x, chopped)
```

To chop into a fixed number of intervals, each with the same width, 
use `chop_evenly()`:

```{r}
chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
```

To chop into groups with a fixed number of elements, use `chop_n()`:

```{r}
chopped <- chop_n(x, 4)
table(chopped)
```


To chop into a fixed number of groups, each with the same number of elements,
use `chop_equally()`:

```{r}
chopped <- chop_equally(x, groups = 5)
table(chopped)
```

To chop data up by quantiles, use `chop_quantiles()`:

```{r}
chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
```

To chop data up by proportions of the data range, use `chop_proportions()`:

```{r}
chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
```

You can think of these six functions as logically arranged in a table.

To chop into...                | Sizing intervals by...    |   
:------------------------------|:--------------------------|:----------------------
&nbsp;                         | number of elements:       | interval width:
a specific number of equal intervals... | `chop_equally()`   | `chop_evenly()`
intervals of one specific size...          | `chop_n()`         | `chop_width()`
intervals of different specific sizes...   | `chop_quantiles()` | `chop_proportions()`

: Different ways to chop by size

## Even more ways to chop

To chop data by standard deviations around the mean, use `chop_mean_sd()`:

```{r}
chopped <- chop_mean_sd(x)
data.frame(x, chopped)
```

To chop data into attractive intervals, use `chop_pretty()`. This 
selects intervals which are a multiple of 2, 5 or 10. It's useful for producing
bar plots.

```{r}
chopped <- chop_pretty(x)
data.frame(x, chopped)
```


## Isolating common values

In exploratory work, it's sometimes useful to find common values and
treat them differently. You can use `dissect()` to do this:

```{r}
x_spike <- rnorm(100)
x_spike[1:50] <- x_spike[1]

chopped <- dissect(x_spike, -3:3, prop = 0.1)
table(chopped)

```

`prop = 0.2` will put any unique value of `x` into its own separate
category if it makes up at least 20% of the data.

Note that unlike all the other `chop_*` functions, `dissect()`
doesn't always categorize `x` into ordered, connected intervals. 
To remind you of this, it is named differently. If you want to create
separate intervals on the left and right of common elements, use
`chop_spikes()`:

```{r}
chopped <- chop_spikes(x_spike, -3:3, prop = 0.1)
table(chopped)
```

Compare this to the table before. There are two intervals on either
side of the common value, instead of one interval surrounding it.


## Quick tables

`tab_n()`, `tab_width()`, and friends act similarly to
`tab()`, calling the related `chop_*` function and then `table()` on the result.

```{r}
tab_n(x, 4)
tab_width(x, 2)
tab_evenly(x, 5)
tab_mean_sd(x)
```


## Specifying labels

By default, santoku labels intervals using mathematical notation:

* `[0, 1]` means all numbers between 0 and 1 inclusive.
* `(0, 1)` means all numbers _strictly_ between 0 and 1, not including the
  endpoints.
* `[0, 1)` means all numbers between 0 and 1, including 0 but not 1.
* `(0, 1]` means all numbers between 0 and 1, including 1 but not 0.
* `{0}` means just the number 0.


To override these labels, provide names to the `breaks` argument:

```{r}
chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
```

Or, you can specify factor labels with the `labels` argument:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
```

You need as many labels as there are intervals - one fewer than `length(breaks)`
if your data doesn't extend beyond `breaks`, one more than `length(breaks)` if
it does.

To label intervals with a dash, use `lbl_dash()`:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
```

To label integer data, use `lbl_discrete()`. It uses more informative right 
endpoints:

```{r}
chopped  <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
```

You can customize the first or last labels:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
```


To label intervals in order use `lbl_seq()`:

```{r}
chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
```

You can use numerals or even roman numerals:

```{r}
chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
```

Other labelling functions include:

* `lbl_endpoints()` - use left endpoints as labels
* `lbl_midpoints()` - use interval midpoints as labels
* `lbl_glue()` - specify labels flexibly with the `{glue}` package

## Specifying breaks

By default, `chop()` extends `breaks` if necessary. If you don't want
that, set `extend = FALSE`:

```{r}
chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
```

Data outside the range of `breaks` will become `NA`.

By default, intervals are closed on the left, i.e. they include their left
endpoints. If you want right-closed intervals, set `left = FALSE`:

```{r}
y <- 1:5
data.frame(
        y = y, 
        left_closed = chop(y, 1:5), 
        right_closed = chop(y, 1:5, left = FALSE)
      )
```

By default, the last interval is closed on both ends.
If you want to keep the last interval open at the end, 
set `close_end = FALSE`:

```{r}
data.frame(
  y = y,
  end_closed = chop(y, 1:5),
  end_open   = chop(y, 1:5, close_end = FALSE)
)

```


# Chopping dates, times and other vectors

You can chop many kinds of vectors with santoku, including Date objects...

```{r}
y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
  y2k = y2k,
  chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
```


... and POSIXct (date-time) objects:

```{r}
# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"), 
                     length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock    <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")

data.frame(
  crew_dragon = crew_dragon,
  chopped = chop(crew_dragon, c(liftoff, dock), 
                   labels = c("pre-flight", "flight", "docked"))
)

```

Note how santoku correctly handles the different timezones.

You can use `chop_width()` with objects from the `lubridate` package, 
to chop by irregular periods such as months:

```{r}
library(lubridate)
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1))
)
```


`lbl_date()` produces nicely formatted dates:

```{r}
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, days(28), labels = lbl_date())
)
```

You can also chop vectors with units, using the `units` package:

```{r}
library(units)

x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
  x = x,
  chopped = chop(x, br)
)
```

You should be able to chop anything that has a comparison operator. You can
even chop character data using lexical ordering. By default santoku emits a 
warning in this case, to avoid accidentally misinterpreting results:

```{r}
chop(letters[1:10], c("d", "f"))
```

If you find a type of data that you can't chop, please 
[file an issue](https://github.com/hughjonesd/santoku/issues).