Getting started with toolero

Erwin Lares

Background and motivation

toolero grew out of a recurring observation made while teaching and supporting researchers at UW-Madison: the habits that make a project reproducible, shareable, and maintainable are easiest to adopt at the very beginning — and hardest to retrofit once a project is already underway.

The package is heavily influenced by the workflows taught in workshops run by The Carpentries and the UW-Madison Libraries. Those workshops emphasize consistent project organization, version control, and reproducible data practices as foundational skills — not advanced topics. toolero tries to operationalize those principles into a small set of functions that reduce the friction of doing the right thing from the start.

The theming and branding support in toolero is specifically tailored to UW-Madison’s Research Computing and Instrumentation (RCI) unit, whose Quarto-based reporting templates are baked into the package as defaults. If you are not at UW-Madison, the branding files are optional — the rest of the package works independently of them.


Who is this for?

toolero is designed for researchers and analysts who want their projects to be reproducible, shareable, and maintainable from the start, without having to reassemble those practices by hand for every new project.

The package is intentionally small. It does not try to be comprehensive. It tries to make the right defaults easy to reach for from the first line of code.


Installation

You can install toolero from CRAN:

install.packages("toolero")

Or install the development version from GitHub:

pak::pak("erwinlares/toolero")

Project setup: init_project() and create_qmd()

These two functions are designed to be used together, in order. init_project() creates the scaffold; create_qmd() populates it with a working Quarto document.

Starting with init_project()

Starting a new R project usually means the same manual steps every time: create a folder, set up an RStudio project, create subdirectories for data and scripts, initialize renv, initialize git. None of these steps is hard on its own, but skipping any of them — especially early on — tends to create friction later.

init_project() handles all of this in a single call:

library(toolero)

init_project(path = "~/Documents/my-project")

This creates a new RStudio project at the specified path with the following folder structure already in place:

my-project/
├── data/         # input data
├── data-raw/     # original, unprocessed data
├── R/            # reusable functions
├── scripts/      # analysis scripts
├── plots/        # generated visualizations
├── images/       # static images and assets
├── results/      # processed outputs and tables
└── docs/         # notes, manuscripts, Quarto documents

Why this structure? The folder layout is opinionated but not arbitrary. Separating data/ from data-raw/ makes it clear which files are original and which have been processed. Keeping R/ distinct from scripts/ encourages moving reusable logic into functions over time, which is a natural step toward more maintainable code.

By default, init_project() also initializes renv and git. This means the project is reproducible and version-controlled from the first commit.

Why renv and git by default? renv ensures that the packages your project depends on are recorded and reproducible. git provides a full history of changes. Both are much easier to set up at the start than to retrofit later.
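
For readers curious what that setup involves, the manual equivalent is roughly the following. This is an illustrative sketch, not toolero's actual implementation, and it assumes the renv and gert packages are installed:

```r
# Manual equivalent of what init_project() automates (illustrative sketch)
path <- "~/Documents/my-project"
folders <- c("data", "data-raw", "R", "scripts",
             "plots", "images", "results", "docs")

dir.create(path, recursive = TRUE, showWarnings = FALSE)
for (f in folders) {
  dir.create(file.path(path, f), showWarnings = FALSE)
}

renv::init(project = path)   # record and snapshot package dependencies
gert::git_init(path)         # initialize a git repository
```

init_project() folds all of this, plus the RStudio project file, into a single call.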

If your project needs folders beyond the defaults:

init_project(
  path          = "~/Documents/my-project",
  extra_folders = c("notebooks", "presentations")
)

To apply UW-Madison RCI branding assets to the project:

init_project(
  path        = "~/Documents/my-project",
  uw_branding = TRUE
)

This creates an assets/ folder and populates it with styles.css, header.html, and rci-banner.png — the same assets used in the Quarto template scaffolded by create_qmd().

Adding a Quarto document with create_qmd()

Once the project exists, create_qmd() adds a working Quarto document to it:

create_qmd(path = "~/Documents/my-project", filename = "analysis.qmd")

This creates the Quarto document itself, scaffolded from the package template, along with a purl hook that extracts a plain .R companion script from the document each time it renders.

Why the purl hook? Having a plain .R companion to your .qmd is useful for sharing the analysis as a script, running it on a remote cluster, or archiving the code independently of the document. The hook runs automatically so you never have to remember to extract it manually.
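
If you ever need to regenerate the companion script by hand, knitr's purl() function performs the same extraction directly (shown here as a standalone call, independent of the hook):

```r
# Extract the R code chunks from the Quarto document into a plain script
knitr::purl("analysis.qmd", output = "analysis.R", documentation = 1)
```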

To pre-populate the YAML header with your own metadata:

create_qmd(
  path      = "~/Documents/my-project",
  filename  = "analysis.qmd",
  yaml_data = "~/my-metadata.yml"
)

Where my-metadata.yml might look like:

title: "My Analysis"
author:
  - name: "Your Name"
    affiliation: "UW-Madison"
    email: "you@wisc.edu"

Any keys present in the YAML file overwrite the corresponding placeholders in the template. Keys not present are left as-is.
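
The override semantics behave like a shallow list merge in base R. A minimal sketch of the idea, using hypothetical placeholder values rather than toolero's actual template:

```r
# Placeholder metadata from the template (hypothetical values)
template_yaml <- list(title = "Untitled", author = "Author Name",
                      format = "html")

# User-supplied metadata, as read by yaml::read_yaml("~/my-metadata.yml")
user_yaml <- list(title = "My Analysis")

# Keys present in user_yaml overwrite the template; others are left as-is
merged <- modifyList(template_yaml, user_yaml)
merged$title   # "My Analysis"
merged$format  # "html" (untouched)
```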


Working with data: read_clean_csv() and write_by_group()

These two functions address common friction points in day-to-day data work. They are general-purpose utilities — useful in any R project, not just ones set up with toolero.

Reading data with read_clean_csv()

read_clean_csv() combines readr::read_csv() and janitor::clean_names() into a single call:

data <- read_clean_csv("data/my-file.csv")

Column names are automatically converted to lowercase with underscores — consistent, predictable, and tidyverse-friendly. A column called First Name becomes first_name. Q1 Revenue ($) becomes q1_revenue.
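
Under the hood this is equivalent to chaining the two underlying calls yourself, along these lines:

```r
library(readr)
library(janitor)

# Equivalent two-step version of read_clean_csv() (illustrative)
data <- read_csv("data/my-file.csv", show_col_types = FALSE) |>
  clean_names()
```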

By default, column type messages from readr are suppressed. Set verbose = TRUE to see them:

data <- read_clean_csv("data/my-file.csv", verbose = TRUE)

Splitting data by group with write_by_group()

When a data frame contains multiple groups that need to be written to separate files, write_by_group() handles the split and the write in a single call:

write_by_group(
  data       = penguins,
  group_col  = "species",
  output_dir = "results/by-species"
)

Output filenames are derived from the group values and sanitized for use as file names — converted to lowercase with spaces and special characters replaced by dashes. A group called Chinstrap becomes chinstrap.csv. Palmer Penguins would become palmer-penguins.csv.
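
The naming convention can be pictured as a small helper along these lines. This is an illustration of the rule, not the package's exact code:

```r
# Turn a group value into a safe file name: lowercase, with runs of
# spaces and special characters collapsed to single dashes (illustrative)
sanitize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "-", x)
  gsub("^-|-$", "", x)  # trim any stray leading/trailing dashes
}

sanitize_name("Palmer Penguins")  # "palmer-penguins"
```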

To also write a manifest listing the output files, group values, and row counts:

write_by_group(
  data       = penguins,
  group_col  = "species",
  output_dir = "results/by-species",
  manifest   = TRUE
)

Execution context: detect_execution_context()

R code often needs to behave differently depending on where it is running — interactively in RStudio, during a quarto render, or as a batch Rscript job on a remote cluster. detect_execution_context() identifies which of these three environments is active and returns one of "interactive", "quarto", or "rscript".

The canonical use case is resolving input file paths portably:

context <- detect_execution_context()

input_file <- switch(context,
  interactive = "data/sample.csv",
  quarto      = params$input_file,
  rscript     = commandArgs(trailingOnly = TRUE)[1]
)

This pattern is built into the template scaffolded by create_qmd(), so you get it for free without having to write it yourself.
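
Conceptually, the detection comes down to a couple of base-R checks. A rough sketch, under the assumption that Quarto exposes QUARTO_* environment variables during render; toolero's actual logic may differ:

```r
# Rough sketch of context detection (assumes Quarto sets QUARTO_* env vars
# during render; not necessarily toolero's implementation)
detect_context_sketch <- function() {
  if (interactive()) {
    "interactive"
  } else if (nzchar(Sys.getenv("QUARTO_DOCUMENT_PATH"))) {
    "quarto"
  } else {
    "rscript"
  }
}
```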


Knowledge Base export: generate_kb_xml()

This section is relevant only if you publish content to the UW-Madison Knowledge Base. If you do not, you can safely skip it.

The UW-Madison Knowledge Base requires content to be submitted as XML with all visual assets embedded in the HTML body. generate_kb_xml() automates this process entirely.

generate_kb_xml(
  html_path  = "docs/analysis.html",
  output_dir = "exports"
)

The function:

  1. Infers the source .qmd from the HTML path (or accepts it explicitly via qmd_path)
  2. Re-renders the document with embed-resources: true so all CSS, images, and JavaScript are self-contained
  3. Extracts metadata from the .qmd YAML header, mapping title to kb_title, description to kb_summary, and categories to kb_keywords
  4. Produces a .xml file ready for direct KB import

This is why the description and categories fields in the create_qmd() template matter — they flow through automatically into the KB article metadata without any extra work.

When importing into the KB, check the Decode HTML entity in body content option.


Citation

If you use toolero in your work, please cite it:

citation("toolero")