---
title: "Explainable Outlier Detection in Titanic dataset"
author: "David Cortes"
output:
    rmarkdown::html_document:
        theme: "spacelab"
        highlight: "pygments"
        toc: true
        toc_float: true
vignette: >
    %\VignetteIndexEntry{Explainable Outlier Detection in Titanic dataset}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
### Don't overload CRAN servers
### https://stackoverflow.com/questions/28961431/computationally-heavy-r-vignettes
is_check <- ("CheckExEnv" %in% search()) || any(c("_R_CHECK_TIMINGS_",
             "_R_CHECK_LICENSE_") %in% names(Sys.getenv()))
```

# Explainable Outlier Detection in Titanic dataset

This short notebook illustrates basic usage of the [OutlierTree](https://github.com/david-cortes/outliertree) library for explainable outlier detection using the Titanic dataset. For more details, you can check the package's documentation [at CRAN](https://cran.r-project.org/package=outliertree) or through R's help (e.g. `?outliertree::outlier.tree`). For a more interesting and interactive example, see the documentation of the main function (`outlier.tree`), which uses a larger dataset.

The dataset is very popular and can be downloaded from different sources, such as Kaggle or many university webpages. This vignette took it from the following link: https://github.com/jbryer/CompStats/raw/master/Data/titanic3.csv

The data comes bundled in the package so there is no need to download it from the link above.

# Loading the raw data

```{r message=FALSE}
library(data.table)
library(kableExtra)
library(outliertree)
data("titanic")

titanic |>
    head(5) |>
    kable() |>
    kable_styling()
```

# Pre-processing the data

```{r}
## Capitalize column names and some values for easier reading
capitalize <- function(x) gsub("^(\\w)", "\\U\\1\\E", x, perl=TRUE)

titanic <- as.data.table(titanic)
titanic[
    , setnames(.SD, names(.SD), capitalize(names(.SD)))
][
    , setnames(.SD, "Sibsp", "SibSp")
][
    , Sex := capitalize(Sex)
] -> titanic

## Convert 'survived' to yes/no for easier reading
titanic[
    , Survived := ifelse(Survived, "Yes", "No")
]

## Some columns are not useful, such as name (an ID), ticket number (another ID),
## or destination (too many values, many non-repeated)
titanic[
    , !c("Name", "Ticket", "Home.dest")
] -> titanic

## Ordinal columns need to be passed as ordered factors
cols_ord <- c("Pclass", "Parch", "SibSp")
titanic[
    , (cols_ord) := lapply(.SD, function(x) factor(x, ordered = TRUE))
    , .SDcols = cols_ord
]

## A look at the processed data
titanic |>
	head(5) |>
	kable() |>
	kable_styling()
```

# Fitting a model

```{r, eval=FALSE}
library(outliertree)

## Fit model with default hyperparameters
otree <- outlier.tree(titanic)
otree
```
```{r, echo=FALSE, comment=NA}
library(outliertree)

## Fit model with default hyperparameters
otree <- outlier.tree(titanic, nthreads=1)
otree
```

# Examining the results more closely

```{r}
## Double-check the data (last 2 outliers)
titanic[c(1147, 1164), ]
```

```{r}
## Distribution of the group from which those two outliers were flagged
titanic[
    Pclass == 3 &
    SibSp == 0 &
    Embarked == "Q"
][
    , Fare
] |>
	hist(breaks = 100, col = "navy", xlab="Fare",
		 main="Distribution of Fare within cluster")
```

```{r comment=NA}
## Get the outliers in a manipulable format
predict(otree, titanic, outliers_print = 0)[[1147]]
```

```{r comment=NA}
## To programatically get all the outliers that were flagged
pred <- predict(otree, titanic, outliers_print = 0)
only_flagged <- pred[!is.na(sapply(pred, function(x) x$outlier_score))]
```


```{r comment=NA}
## To print selected rows only
print(pred, only_these_rows = 1147)
```
# Trying different hyperparameters

```{r, eval=FALSE}
## In order to flag more outliers, one can also experiment
## with lowering the threshold hyperparameters
outlier.tree(titanic, z_outlier = 6., outliers_print = 5)
```
```{r, echo=FALSE, comment=NA}
## In order to flag more outliers, one can also experiment
## with lowering the threshold hyperparameters
outlier.tree(titanic, z_outlier = 6., outliers_print = 5, nthreads=1)
```



```{r, eval=FALSE}
## One can also lower the gain threshold, but this tends
## to result in more spurious outliers which come from
## not-so-good splits (not recommended)
outlier.tree(titanic, z_outlier = 6., min_gain = 1e-6, outliers_print = 5)
```
```{r, echo=FALSE, comment=NA}
## One can also lower the gain threshold, but this tends
## to result in more spurious outliers which come from
## not-so-good splits (not recommended)
outlier.tree(titanic, z_outlier = 6., min_gain = 1e-6, outliers_print = 5, nthreads=1)
```

