---
title: "scipub vignette"
author: "David Pagliaccio"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{scipub vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(knitr)
library(ggplot2)
library(htmlTable)
```

The scipub package contains functions for summarizing data
    for scientific publication. This includes making a "Table 1"
    to summarize demographics across groups, correlation tables
    with significance indicated by stars, and extracting formatted
    statistical summarizes from simple tests for in-text notation.
    The package also includes functions for Winsorizing data based
    on a Z-statistic cutoff.  
The sample dataset of demographic and clinical data
 from 5,000 children is used for examples.  


***
We'll start by loading scipub:

```{r, message = FALSE,include = TRUE}
library(scipub)
data(psydat)
```


***
## `apastat`

The `apastat` function summarizes simple statistical tests to include 
in the text of an article, typically in a parenthetical. 
This is built for t-tests, correlations, ANOVA, and regression.
Regressions can be summarized by their overall model fit or the
parameter estimates for one predictor variable.
Effect sizes are calculated where possible (default: `es=TRUE`). 
For example:


There is a significant positive correlation between age and height. 
95% confidence intervals are requested.
```{r}
apastat(stats::cor.test(psydat$Age, psydat$Height), ci = TRUE)
```

There is no significant sex difference in height in the sample.
```{r}
apastat(stats::t.test(Height ~ Sex, data = psydat))
```


A linear regression model predicting height was highly significant, with the predictors (age and sex) accounting for about 17% of the variance in height. 
```{r}
apastat(stats::lm(data = psydat, Height ~ Age + Sex))
```

In this linear regression model, age was a highly significant predictor of height, controlling for sex. 
```{r}
apastat(stats::lm(data = psydat, Height ~ Age + Sex), var = "Age")
```


***
## `correltable`

The `correltable` function creates a summary correlation table with asterisks to indicate significance. 
Variables can be renamed as part of the function call. The full matrix or upper/lower triangle can be selected for output. For the selected triangle, the empty row/column can be kept or deleted as needed. 
The caption provides information on the statistics included, any missing data, and the * indications. 
For example:

The lower triangle of inter-correlation among the age, height, and iq variables are shown.
```{r results="asis"}
correltable(data = psydat, vars = c("Age", "Height", "iq"),tri="lower",html=TRUE)
```


These same variables can be relabeled in the output and, for conciseness, the columns can be indicated by corresponding number rather than variable name. 
```{r results="asis"}
correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), tri = "upper", colnum = TRUE,html=TRUE)
```


This can also be done with Spearman correlation. As well as using only complete data (list-wise deletion). And, the empty row/column can be removed if desired. 
```{r results="asis"}
correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), tri = "upper", method = "spearman", use = "complete", cutempty = TRUE, colnum = TRUE,html=TRUE)
```


The inter-correlation between two sets of variables can also be shown. 
```{r results="asis"}
correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), vars2 = c("depressT", "anxT"), var_names2 = c("Depression T", "Anxiety T"),html=TRUE)
```


The simplest call just correlates all variables in a dataset. Any non-numeric variables will be tested by t-test, chi-squared, or ANOVA as appropriate.
```{r results="asis"}
correltable(data = psydat, html=TRUE)
```


***
## `partial_correltable`

The `partial_correltable` function provides similar functionality to `correltable` but allows for covariates to be partialled out of all correlations. This function will allow for binary/factor covariates to be partialled out but only numeric variables can be correlated. 
This involves residualizing all `vars` by all `partialvars` via linear regression (`lm`).
For example:

The lower triangle of partial correlations among the age, height, and iq variables are shown, residualizing for sex and income as factor variables.
```{r results="asis"}
#partial_correltable(data = psydat, vars = c("Age", "Height", "iq"), partialvars = c("Sex", "Income"), tri = "lower", html = TRUE)
```


These same variables can be relabeled in the output and, for conciseness, the columns can be indicated by corresponding number rather than variable name and shown in the supper triangle. 
```{r results="asis"}
#partial_correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), partialvars = c("Sex", "Income"),tri = "upper", colnum = TRUE, html = TRUE)
```


This can also be done with Spearman correlation. As well as using only complete data (list-wise deletion). And, the empty row/column can be removed if desired. 
```{r results="asis"}
#partial_correltable(data = psydat, vars = c("Age", "Height", "iq"), var_names = c("Age (months)", "Height (inches)", "IQ"), partialvars = c("Sex", "Income"),tri = "upper", method = "spearman", use = "complete", cutempty = TRUE, colnum = TRUE, html = TRUE)
```


***
## `FullTable1`
A "Table 1" can be created to summarize data, i.e. the typical first table in a paper that describes the sample characteristics.
This can display information for a single group for the declared variables .  
```{r results="asis"}
FullTable1(data = psydat, vars = c("Age", "Sex","Height", "depressT"),  html=TRUE)
```

Or commonly this can be shown for two groups if interest including the tests of group difference for all variables.
```{r results="asis"}
FullTable1(data = psydat, vars = c("Age", "Height", "depressT"), strata = "Sex", html=TRUE)
```


This can also be created for more than two groups. As with `correltable` variables can be renamed in the call.
Also the significance stars can be moved to the statistic column or variable name (or removed). The p-value column can be removed as well (same for the effect size column, but why would you want to remove that?).
```{r results="asis"}
FullTable1(data = psydat, vars = c("Age", "Sex","Height", "depressT"), var_names = c("Age (months)", "Sex","Height (inches)", "Depression T"), strata = "Income", stars = "stat",p_col = FALSE, html=TRUE)
```


All variables will be summarized if none are declared
Shown with significance stars on variable names.
```{r results="asis"}
FullTable1(data = psydat, strata = "Sex",stars = "name",p_col = FALSE, html=TRUE)
```

You can also replace the caption with your own input and re-output to HTML.
```{r results="asis"}
tmp <- FullTable1(data = psydat,
   vars = c("Age", "Height", "depressT"), strata = "Sex")
   tmp$caption <- "Write your own caption"
   print(htmlTable::htmlTable(tmp$table, useViewer=T, rnames=F,caption=tmp$caption, pos.caption="bottom"))
```


***
## Winsorizing

The `winsorZ` function allows for Winsorizing outliers based on a Z-score cutoff, i.e.
  replacing extreme values with the next most extreme outlier value.
This is an alternative to other function, e.g. `DescTools::Winsorize` which
  identifies outlier based on quantile limits.
The `winsorZ` function can be used in combination with the `winsorZ_find` function,
  which identifies the outlier values (1=outlier, 0=non-outlier).


For example, in the example data, psydat$iq has 17 outliers that exceed a default
|Z|>3 limit test. Here, we create a temporary data frame with the original
iq scores, the Z-score winsorized iq values, and an indication of which scores
were winsorized. We can see the change in mix/max iq values in the summary
and the winsorized outliers are shown in blue in the plot.
```{r results="asis"}
temp <- data.frame(iq=psydat$iq, iq_winsor=winsorZ(psydat$iq), iq_outlier = winsorZ_find(psydat$iq))
summary(temp)
ggplot(temp[!is.na(temp$iq),], aes(x=iq, y=iq_winsor)) + geom_point(aes(color=iq_outlier),alpha=.7) + geom_line() + theme_bw()
```


***
## Group Difference Plot

The `gg_groupplot` function creates can be used to create group difference plots for scientific publication. This is intended to summarize a continuous outcome (`y`) based on a factor ('x') from an input dataset (`data`). The plot will include standard ggplot2::geom_boxplot indicating 25th, median, and 75th percentile for the box and 1.5 * IQR for the whiskers. Outliers are not highlighted. Raw data is displayed with standard ggplot2::geom_point and lateral but not vertical jittering. Histograms are shown with gghalves::geom_half_violin to the right of each boxplot.If meanline = = TRUE (default), gray dots will indicate the mean for each variable (vs. median in boxplot) connected by a gray line. This function will drop any NA values.


```{r results="asis"}
gg_groupplot(data=psydat, x=Sex, y=depressT)
```


This can be combined with other `ggplot` graphics functions, e.g. facetting.

```{r results="asis"}
gg_groupplot(data=psydat, x=Income, y=depressT) + facet_wrap(~Sex) + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
```


This function is a simpler wrapper for:

ggplot(data = data[!is.na(data[, x]) & !is.na(data[, y]), ],
       aes(x = get(x), y = get(y),
           color = get(x), fill = get(x), shape = get(x))) +
  geom_half_violin(position = position_nudge(x = .3, y = 0),
           alpha = .8, width = .5, side = "r", color = NA) +
  geom_point(position = position_jitterdodge(jitter.width = .5),
             alpha = .8, size = 1.5) +
  geom_boxplot(outlier.alpha = 0, width = .5, fill = NA, color = "black") +
  xlab("") + ylab("") +
  theme_bw(base_size = 12) +
  theme(legend.position = "none",
     panel.grid.minor = element_line(linetype = "dashed", size = .5),
     axis.title.x = element_text(face = "bold"),
     axis.title.y = element_text(face = "bold")) +
  scale_shape(solid = FALSE) +
  stat_summary(fun = mean, geom = "line",
                       color = "darkgray", size = 1, aes(group = 1)) +
  stat_summary(fun = mean, geom = "point",
               color = "darkgray", size = 2, shape = 16, aes(group = 1))