---
title: "Data Ingestion with tidylearn"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Ingestion with tidylearn}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  message = FALSE,
  warning = FALSE
)
```

## Overview

Every machine learning workflow starts with data. tidylearn's `tl_read()`
family provides a single consistent interface for loading data from files,
databases, cloud storage, and APIs into tidy tibbles — ready for
`tl_prepare_data()` and `tl_model()`.

All readers return a `tidylearn_data` object, a tibble subclass that carries
metadata about the source, format, and read timestamp.

```{r setup}
library(tidylearn)
library(dplyr)
```

---

## The `tl_read()` Dispatcher

`tl_read()` auto-detects the data format and dispatches to the appropriate
backend — just like `tl_model()` routes to the right algorithm:

```{r dispatcher-concept, eval = FALSE}
# Format is auto-detected from the file extension
data <- tl_read("sales.csv")
data <- tl_read("results.xlsx", sheet = "Q1")
data <- tl_read("experiment.parquet")
data <- tl_read("config.json")
data <- tl_read("model_data.rds")

# Override format detection when the extension is ambiguous
data <- tl_read("export.txt", format = "tsv")
```

The result always prints with a metadata header:

```{r print-demo}
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

data <- tl_read(tmp, .quiet = TRUE)
data
```

```{r cleanup-1, include = FALSE}
unlink(tmp)
```

---

## File Formats

### CSV and TSV

Uses [readr](https://readr.tidyverse.org/) when available for fast,
column-type-aware parsing. Falls back to base R automatically if readr is not
installed.

```{r csv-demo}
# Create example files
tmp_csv <- tempfile(fileext = ".csv")
tmp_tsv <- tempfile(fileext = ".tsv")
write.csv(iris, tmp_csv, row.names = FALSE)
write.table(iris, tmp_tsv, sep = "\t", row.names = FALSE)

csv_data <- tl_read_csv(tmp_csv)
tsv_data <- tl_read_tsv(tmp_tsv)
nrow(csv_data)
nrow(tsv_data)
```

```{r cleanup-csv, include = FALSE}
unlink(c(tmp_csv, tmp_tsv))
```
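
Since the fallback depends only on whether readr is installed, you can check
which parser will run on your machine:

```{r readr-check}
# TRUE means readr handles CSV/TSV parsing; FALSE means the base R fallback is used
has_readr <- requireNamespace("readr", quietly = TRUE)
has_readr
```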

### Excel

Reads `.xls`, `.xlsx`, and `.xlsm` files via
[readxl](https://readxl.tidyverse.org/). Select sheets by name or position:

```{r excel-demo}
library(readxl)

path <- readxl_example("datasets.xlsx")
excel_data <- tl_read_excel(path, sheet = "mtcars")
head(excel_data, 3)
```
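
Selecting a sheet by position works the same way:

```{r excel-position}
# The second sheet of the same workbook, selected by position
excel_pos <- tl_read_excel(path, sheet = 2)
nrow(excel_pos)
```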

### Parquet

Parquet is a compact, columnar storage format well suited to large datasets.
tidylearn reads it via
[nanoparquet](https://cran.r-project.org/package=nanoparquet), a fast,
dependency-free reader:

```{r parquet-demo}
library(nanoparquet)

tmp_pq <- tempfile(fileext = ".parquet")
write_parquet(iris, tmp_pq)

pq_data <- tl_read_parquet(tmp_pq)
nrow(pq_data)
```

```{r cleanup-pq, include = FALSE}
unlink(tmp_pq)
```

### JSON

Reads tabular JSON (array of objects) via
[jsonlite](https://cran.r-project.org/package=jsonlite). Nested structures
are automatically flattened:

```{r json-demo}
library(jsonlite)

tmp_json <- tempfile(fileext = ".json")
write_json(mtcars[1:5, ], tmp_json)

json_data <- tl_read_json(tmp_json)
json_data
```

```{r cleanup-json, include = FALSE}
unlink(tmp_json)
```
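
A small nested file makes the flattening visible; assuming the column names
follow jsonlite's dotted convention, the nested `stats` fields come back as
columns such as `stats.mean`:

```{r json-nested}
# A small file with one level of nesting
tmp_nested <- tempfile(fileext = ".json")
writeLines('[
  {"id": 1, "stats": {"mean": 2.5, "sd": 0.4}},
  {"id": 2, "stats": {"mean": 3.1, "sd": 0.6}}
]', tmp_nested)

nested_data <- tl_read_json(tmp_nested)
names(nested_data)
```

```{r cleanup-json-nested, include = FALSE}
unlink(tmp_nested)
```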

### RDS and RData

Native R serialisation formats — no extra packages needed:

```{r rds-demo}
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(iris, tmp_rds)

rds_data <- tl_read_rds(tmp_rds)
nrow(rds_data)
```

```{r rdata-demo}
tmp_rdata <- tempfile(fileext = ".rdata")
my_data <- mtcars
save(my_data, file = tmp_rdata)

# Name is auto-detected when there is a single data frame
rdata_data <- tl_read_rdata(tmp_rdata)
nrow(rdata_data)
```

```{r cleanup-rds, include = FALSE}
unlink(c(tmp_rds, tmp_rdata))
```

---

## Databases

All database readers use [DBI](https://dbi.r-dbi.org/) as the interface
layer. Each reader manages its own connection lifecycle — connect, query,
disconnect — so you only need to provide the path or credentials and a SQL
query.

### SQLite

The simplest database backend — no server required:

```{r sqlite-demo}
library(DBI)
library(RSQLite)

# Create an example database
tmp_db <- tempfile(fileext = ".sqlite")
conn <- dbConnect(SQLite(), tmp_db)
dbWriteTable(conn, "iris_tbl", iris)
dbDisconnect(conn)

# Read with tl_read_sqlite
db_data <- tl_read_sqlite(
  tmp_db,
  "SELECT * FROM iris_tbl WHERE Species = 'setosa'"
)
nrow(db_data)
```

```{r cleanup-sqlite, include = FALSE}
unlink(tmp_db)
```

### Using a Live Connection

If you already have a DBI connection, use `tl_read_db()` directly — it will
not close your connection:

```{r db-demo}
conn <- dbConnect(SQLite(), ":memory:")
dbWriteTable(conn, "mtcars_tbl", mtcars)

sql <- "SELECT mpg, wt, hp FROM mtcars_tbl WHERE mpg > 20"
db_result <- tl_read_db(conn, sql)
db_result

dbDisconnect(conn)
```
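
Because the connection stays open, several queries can share it before you
disconnect; a minimal sketch with an illustrative database file and table
names:

```{r db-reuse, eval = FALSE}
conn <- dbConnect(SQLite(), "warehouse.sqlite")

# The same open connection serves several reads
orders    <- tl_read_db(conn, "SELECT * FROM orders WHERE year = 2025")
customers <- tl_read_db(conn, "SELECT id, region FROM customers")

dbDisconnect(conn)
```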

### PostgreSQL, MySQL, and BigQuery

These require a running database server or cloud service. The API is the
same — provide connection details and a SQL query:

```{r remote-db, eval = FALSE}
# PostgreSQL
pg_data <- tl_read_postgres(
  dsn = "localhost",
  query = "SELECT * FROM sales WHERE year = 2025",
  dbname = "analytics",
  user = "myuser",
  password = "mypass"
)

# MySQL / MariaDB
mysql_data <- tl_read_mysql(
  dsn = "mysql://user:pass@host:3306/mydb",
  query = "SELECT * FROM customers LIMIT 1000"
)

# BigQuery
bq_data <- tl_read_bigquery(
  project = "my-gcp-project",
  query = "SELECT * FROM `dataset.table` LIMIT 1000"
)
```

---

## Cloud and API Sources

### Amazon S3

Downloads a file from S3 and auto-detects the format from the key's
extension. Requires valid AWS credentials:

```{r s3, eval = FALSE}
data <- tl_read_s3("s3://my-bucket/data/sales_2025.csv")
data <- tl_read_s3("s3://my-bucket/data/results.parquet", region = "eu-west-1")
```

### GitHub

Downloads raw files directly from a repository. Accepts full URLs or
`owner/repo` shorthand:

```{r github, eval = FALSE}
# Read a CSV from a public GitHub repository
data <- tl_read_github("tidyverse/dplyr",
  path = "data-raw/starwars.csv", ref = "main"
)
```
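
Full URLs work too; a sketch, assuming the raw-file URL is passed as the
first argument:

```{r github-url, eval = FALSE}
# The same file via a full raw-file URL (argument placement assumed)
data <- tl_read_github(
  "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
)
```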

### Kaggle

Downloads datasets via the official
[Kaggle API](https://github.com/Kaggle/kaggle-api) command-line tool. Install
it with `pip install kaggle` and configure your API credentials:

```{r kaggle, eval = FALSE}
data <- tl_read_kaggle("zillow/zecon", file = "Zip_time_series.csv")
data <- tl_read_kaggle("titanic", file = "train.csv", type = "competition")
```

---

## Multi-File Reading

Real-world data is often split across multiple files. tidylearn handles
three common patterns.

### Multiple Paths

Pass a character vector to `tl_read()` — each file is read and row-bound,
with a `source_file` column tracking origin:

```{r multi-path}
dir <- tempdir()
write.csv(iris[1:50, ], file.path(dir, "batch1.csv"), row.names = FALSE)
write.csv(iris[51:100, ], file.path(dir, "batch2.csv"), row.names = FALSE)

paths <- file.path(dir, c("batch1.csv", "batch2.csv"))
combined <- tl_read(paths, .quiet = TRUE)
table(combined$source_file)
```

```{r cleanup-multi, include = FALSE}
unlink(file.path(dir, c("batch1.csv", "batch2.csv")))
```

### Directory Scanning

Point `tl_read_dir()` at a directory. Filter by format, regex pattern, or
scan recursively:

```{r dir-demo}
dir <- tempfile(pattern = "tl_vignette_")
dir.create(dir)
write.csv(iris[1:50, ], file.path(dir, "jan.csv"), row.names = FALSE)
write.csv(iris[51:100, ], file.path(dir, "feb.csv"), row.names = FALSE)
write.csv(iris[101:150, ], file.path(dir, "mar.csv"), row.names = FALSE)

# Read all CSVs from the directory
all_data <- tl_read_dir(dir, format = "csv", .quiet = TRUE)
nrow(all_data)
table(all_data$source_file)
```

```{r dir-pattern}
# Filter with a regex pattern
subset <- tl_read_dir(dir, pattern = "^(jan|feb)", .quiet = TRUE)
nrow(subset)
```

```{r cleanup-dir, include = FALSE}
unlink(dir, recursive = TRUE)
```
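
Recursive scans walk subdirectories as well; a sketch, assuming the argument
is named `recursive`:

```{r dir-recursive, eval = FALSE}
# Read every CSV found under data/, including subdirectories
all_data <- tl_read_dir("data/", format = "csv", recursive = TRUE)
```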

Passing a directory path directly to `tl_read()` works too:

```{r dir-dispatch, eval = FALSE}
data <- tl_read("data/monthly_exports/")
```

### Zip Archives

`tl_read_zip()` extracts the archive, auto-detects the file format, and
reads the contents. Select a specific file or let it discover data files
automatically:

```{r zip-demo}
# Create an example zip
dir <- tempfile(pattern = "tl_zip_src_")
dir.create(dir)
write.csv(iris, file.path(dir, "iris.csv"), row.names = FALSE)
zip_path <- tempfile(fileext = ".zip")
old_wd <- getwd()
setwd(dir)
utils::zip(zip_path, "iris.csv")
setwd(old_wd)

zip_data <- tl_read_zip(zip_path, .quiet = TRUE)
nrow(zip_data)
attr(zip_data, "tl_format")
```

```{r cleanup-zip, include = FALSE}
unlink(c(dir, zip_path), recursive = TRUE)
```

Zip files are also auto-detected by `tl_read()`:

```{r zip-dispatch, eval = FALSE}
data <- tl_read("download.zip")
data <- tl_read("download.zip", file = "train.csv")
```

---

## The `tidylearn_data` Class

Every reader returns a `tidylearn_data` object — a tibble subclass with
three metadata attributes:

| Attribute | Description |
|-----------|-------------|
| `tl_source` | File path, URL, or description of the data source |
| `tl_format` | Detected or specified format (e.g., `"csv"`, `"zip+csv"`) |
| `tl_timestamp` | POSIXct timestamp of when the data was read |

Because `tidylearn_data` inherits from `tbl_df`, all dplyr verbs, ggplot2,
and tidylearn functions work transparently:

```{r class-demo}
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)
data <- tl_read(tmp, .quiet = TRUE)

# Check metadata
attr(data, "tl_format")

# Works with dplyr
data %>%
  filter(mpg > 20) %>%
  select(mpg, wt, hp) %>%
  head(3)
```
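
It drops into ggplot2 just as easily:

```{r ggplot-demo, eval = FALSE}
library(ggplot2)

# tidylearn_data behaves like any tibble inside ggplot2
ggplot(data, aes(wt, mpg)) +
  geom_point()
```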

```{r cleanup-class, include = FALSE}
unlink(tmp)
```

---

## Full Pipeline

Combining `tl_read()` with the rest of tidylearn gives you a complete
workflow from raw data to published results:

```{r pipeline}
# 1. Ingest
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
data <- tl_read(tmp, .quiet = TRUE)

# CSV does not store column types, so factors come back as character; convert as needed
data <- data %>% mutate(Species = as.factor(Species))

# 2. Split
split <- tl_split(data, prop = 0.7, stratify = "Species", seed = 42)

# 3. Model
model <- tl_model(split$train, Species ~ ., method = "forest")

# 4. Evaluate
eval_result <- tl_evaluate(model, new_data = split$test)
eval_result
```

```{r cleanup-pipeline, include = FALSE}
unlink(tmp)
```

---

## Supported Formats Reference

| Format | Function | Backend | Dependency |
|--------|----------|---------|------------|
| CSV | `tl_read_csv()` | readr / base R | Suggests (readr) |
| TSV | `tl_read_tsv()` | readr / base R | Suggests (readr) |
| Excel | `tl_read_excel()` | readxl | Suggests |
| Parquet | `tl_read_parquet()` | nanoparquet | Suggests |
| JSON | `tl_read_json()` | jsonlite | Suggests |
| RDS | `tl_read_rds()` | base R | None |
| RData | `tl_read_rdata()` | base R | None |
| SQLite | `tl_read_sqlite()` | DBI + RSQLite | Suggests |
| PostgreSQL | `tl_read_postgres()` | DBI + RPostgres | Suggests |
| MySQL | `tl_read_mysql()` | DBI + RMariaDB | Suggests |
| BigQuery | `tl_read_bigquery()` | bigrquery | Suggests |
| S3 | `tl_read_s3()` | paws.storage | Suggests |
| GitHub | `tl_read_github()` | base R | None |
| Kaggle | `tl_read_kaggle()` | Kaggle CLI | None (system) |
| Directory | `tl_read_dir()` | (dispatches) | — |
| Zip | `tl_read_zip()` | base R + (dispatches) | — |
| Multi-path | `tl_read(c(...))` | (dispatches) | — |
