---
title: "Introduction to pixieweb"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to pixieweb}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

> **New to pixieweb?** Start with `vignette("a-quickstart")` for a hands-on
> walkthrough. This vignette covers the design and advanced features.

## Background

PX-Web is the statistical database platform used by national statistics
agencies across the Nordic countries and beyond. Each agency runs its
own instance (Statistics Sweden at [scb.se](https://www.scb.se),
Statistics Norway at [ssb.no](https://www.ssb.no), etc.), but they all
share the same underlying API.

pixieweb provides a consistent, pipe-friendly R interface to all these
APIs. It follows the same design principles as
[rKolada](https://github.com/lchansson/rKolada): **tibbles
everywhere**, **search-then-fetch**, and **progressive disclosure**.

## Design principles

1. **Tibbles everywhere.** Every function returns a tibble (or a vector
   extracted from one).
2. **Pipe-friendly.** First argument is always a tibble or API object;
   output is always pipeable.
3. **Search-then-fetch.** Users discover metadata before downloading
   data.
4. **NULL on failure.** On API errors, functions return `NULL` with a
   warning instead of calling `stop()`.
5. **Progressive disclosure.** Simple things are simple; complex things
   are possible.
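
Principle 4 means calling code should check for `NULL` before using a
result. The base-R sketch below shows one way to write that guard.
`fetch_or_null()` is a hypothetical stand-in that only mimics the
behavior described above; it is not a pixieweb function.

```{r}
# Hypothetical stand-in: warns and returns NULL on failure,
# mimicking the behavior described in principle 4.
fetch_or_null <- function(ok) {
  if (!ok) {
    warning("API request failed")
    return(NULL)
  }
  data.frame(value = 1:3)
}

res <- suppressWarnings(fetch_or_null(ok = FALSE))

# Guard downstream code with is.null() rather than tryCatch()
status <- if (is.null(res)) "no data" else paste(nrow(res), "rows")
status
```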

## The data model

PX-Web tables are *multi-dimensional data cubes*. Unlike Kolada — where
the dimensions are always KPI, municipality, and period — each PX-Web
table defines its own set of dimensions. pixieweb calls these **variables**.

| pixieweb entity | What it represents              | rKolada analog        |
|-----------------|---------------------------------|-----------------------|
| **api**         | A PX-Web instance (SCB, SSB...) | *(implicit, single)*  |
| **table**       | A statistical table             | kpi                   |
| **variable**    | A dimension within a table      | *(municipality/year)* |
| **codelist**    | An aggregation/value set        | kpi_groups            |
| **data**        | Downloaded values               | values                |
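
To make the "data cube" idea concrete, here is a small base-R sketch
that builds every combination of values for three toy dimensions. The
dimension names and codes are illustrative only, and nothing here calls
pixieweb.

```{r}
# Toy dimensions; the codes are illustrative only
dims <- list(
  Region = c("0180", "1480"),
  Kon    = c("1", "2"),
  Tid    = c("2021", "2022", "2023")
)

# One row per combination of dimension values, i.e. one cell of the cube
cube <- expand.grid(dims, stringsAsFactors = FALSE)

# The number of cells is the product of the dimension sizes
n_cells <- nrow(cube)
n_cells == prod(lengths(dims))
```

Each additional variable multiplies the number of cells, which is why
the selections you make per variable matter for the size of a download.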

## Connecting to an API

```{r}
library(pixieweb)

# Known aliases
scb <- px_api("scb", lang = "en")
ssb <- px_api("ssb", lang = "en")

# Or a custom URL
custom <- px_api("https://my.statbank.example/api/v2/", lang = "en")

# See all known APIs
px_api_catalogue()
```

## Discovering tables

Tables are the central entity. `get_tables()` sends a server-side
search query and returns a tibble with rich metadata, which
`table_search()` can then narrow on the client side:

```{r}
tables <- get_tables(scb, query = "income") |>
  table_search("taxable")

tables |> table_describe(max_n = 3)
```

The table tibble includes subject path, time period range, time unit,
and data source — all of which are searchable by `table_search()`.

### Table helper functions

| Function              | Purpose                          |
|-----------------------|----------------------------------|
| `table_search()`      | Filter by regex (client-side)    |
| `table_describe()`    | Print human-readable summaries   |
| `table_minimize()`    | Remove constant columns          |
| `table_extract_ids()` | Extract ID vector for piping     |
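
Conceptually, a client-side regex filter followed by ID extraction
looks like the base-R sketch below. The toy data frame and its `id` and
`title` columns are assumptions made for illustration; they are not
necessarily the columns pixieweb returns.

```{r}
# Toy stand-in for a table listing; column names are assumptions
toy_tables <- data.frame(
  id    = c("T1", "T2", "T3"),
  title = c(
    "Taxable income by region",
    "Disposable income by age",
    "Taxable wealth by sex"
  ),
  stringsAsFactors = FALSE
)

# Client-side filtering: keep rows whose title matches the pattern
hits <- toy_tables[grepl("taxable", toy_tables$title, ignore.case = TRUE), ]

# Extract the ID vector for use in later calls
ids <- hits$id
ids
```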

## Exploring variables

Each table has its own set of variables (dimensions). The key
discovery step is `get_variables()`:

```{r}
vars <- get_variables(scb, "TAB638")
vars |> variable_describe()
```

Important variable properties:

- **elimination**: can this variable be left out of your `get_data()`
  call? If `TRUE`, omitting it means the API returns a pre-computed
  total (e.g., omitting "Sex" gives the total for all sexes). If
  `FALSE`, the variable is **mandatory** — you must include it.
- **time**: is this the time dimension?
- **values**: the available codes and their human-readable labels.
- **codelists**: alternative groupings (e.g. municipalities → counties).
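
The elimination rule can be applied mechanically to decide which
variables a query must name. The sketch below uses a toy data frame
whose `code` and `elimination` columns are assumptions for
illustration; check the actual output of `get_variables()` for the real
column names.

```{r}
# Toy variable metadata; column names and values are assumptions
toy_vars <- data.frame(
  code        = c("Region", "Kon", "ContentsCode", "Tid"),
  elimination = c(FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Variables that cannot be eliminated must appear in the query
mandatory <- toy_vars$code[!toy_vars$elimination]

# Variables that can be eliminated may be omitted to get a total
optional <- toy_vars$code[toy_vars$elimination]

mandatory
optional
```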

```{r}
# See what values a variable has
vars |> variable_values("Kon")

# Look up variable codes by name
variable_name_to_code(vars, "sex")
```

## Fetching data

### Direct approach

If you know exactly what you want:

```{r}
pop <- get_data(scb, "TAB638",
  Region = c("0180", "1480"),
  Kon = c("1", "2"),
  ContentsCode = "*",
  Tid = px_top(5)
)
```

Variables you omit are **eliminated** (aggregated) if the API allows
it. If a variable is mandatory, you must include it.

### Selection helpers

| Helper            | Meaning           | Notes                                  |
|-------------------|-------------------|----------------------------------------|
| `c("0180")`       | Specific values   | Item selection                         |
| `"*"`             | All values        | Wildcard                               |
| `px_top(5)`       | First N values    | Most recent periods on a time variable |
| `px_bottom(3)`    | Last N values     | v2 only                                |
| `px_from("2020")` | From value onward | v2 only                                |
| `px_to("2023")`   | Up to value       | v2 only                                |
| `px_range(a, b)`  | Inclusive range   | v2 only                                |
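
The value-based helpers can be pictured as filters over a variable's
ordered codes. The base-R sketch below emulates "from", "to", and
"range" on a toy vector of year codes. It only illustrates the
semantics listed in the table; it assumes that "up to" includes the end
value, and it does not show how pixieweb or the API resolves these
selections.

```{r}
# Toy time codes, in ascending order
periods <- as.character(2018:2024)

# "From value onward"
from_2020 <- periods[periods >= "2020"]

# "Up to value" (assumed inclusive in this sketch)
to_2023 <- periods[periods <= "2023"]

# "Inclusive range"
range_2020_2022 <- periods[periods >= "2020" & periods <= "2022"]

from_2020
to_2023
range_2020_2022
```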

### The `prepare_query()` shortcut

For interactive exploration, `prepare_query()` inspects the table
metadata and builds a query with sensible defaults:

```{r}
q <- prepare_query(scb, "TAB638")
```

Default strategy:

- **ContentsCode**: all values (`"*"`)
- **Time variable**: latest 10 periods (`px_top(10)`)
- **Eliminable variables**: omitted (API aggregates)
- **Small mandatory variables** (≤ 22 values): all (`"*"`)
- **Large mandatory variables**: first value (`px_top(1)`)
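
The rules above can be read as a small decision function. The sketch
below paraphrases them in base R for a toy set of variables. It is not
pixieweb's implementation: the function name, the column names, and the
order in which the rules are checked are all assumptions.

```{r}
# Hypothetical paraphrase of the listed defaults; returns the selection
# as text, or NA when the variable would be omitted from the query.
default_selection <- function(code, is_time, elimination, n_values) {
  if (code == "ContentsCode") return('"*"')
  if (is_time) return("px_top(10)")
  if (elimination) return(NA_character_)
  if (n_values <= 22) return('"*"')
  "px_top(1)"
}

# Toy variable metadata; values are invented
toy_vars <- data.frame(
  code        = c("Region", "Kon", "ContentsCode", "Tid"),
  is_time     = c(FALSE, FALSE, FALSE, TRUE),
  elimination = c(FALSE, TRUE, FALSE, FALSE),
  n_values    = c(312, 2, 4, 55),
  stringsAsFactors = FALSE
)

toy_vars$selection <- mapply(
  default_selection,
  toy_vars$code, toy_vars$is_time, toy_vars$elimination, toy_vars$n_values,
  USE.NAMES = FALSE
)
toy_vars[, c("code", "selection")]
```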

Override specific variables while letting defaults handle the rest:

```{r}
q <- prepare_query(scb, "TAB638",
  Region = c("0180", "1480"),
  maximize_selection = TRUE
)
```

With `maximize_selection = TRUE`, the function expands unspecified
variables to include as many values as possible while staying under
the API's cell limit.
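
The arithmetic behind a cell limit is a product of selection sizes. The
sketch below shows one possible greedy expansion under a limit. It is
an illustration only, not pixieweb's actual algorithm; the limit of
1000 cells, the variable `Alder`, and all counts are hypothetical.

```{r}
cell_limit <- 1000  # hypothetical limit for illustration

# Number of values currently selected and available per variable
selected  <- c(Region = 2, ContentsCode = 3, Tid = 10, Alder = 1, Kon = 1)
available <- c(Region = 2, ContentsCode = 3, Tid = 10, Alder = 100, Kon = 2)

# Try to expand each unspecified variable to all of its values,
# keeping the expansion only if the total stays within the limit
for (v in c("Kon", "Alder")) {
  trial <- selected
  trial[v] <- available[v]
  if (prod(trial) <= cell_limit) selected <- trial
}

selected
prod(selected)
```

Here `Kon` is expanded (120 cells in total), while expanding `Alder` as
well would require 12000 cells and is therefore skipped.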

Then fetch:

```{r}
pop <- get_data(scb, query = q)
```

## Advanced features

The sections below cover features you may not need on your first query,
but that become essential for complex tables or cross-country work.

## Codelists

Codelists provide alternative groupings of variable values. They are
useful when you want data at a different aggregation level than the
table's default. For example, a "Region" variable with 290
municipalities might have a codelist that groups them into 21 counties:

```{r}
cls <- get_codelists(scb, "TAB638", "Region")
cls |> codelist_describe(max_n = 5)

# Use a codelist in a query
get_data(scb, "TAB638",
  Region = "*",
  Tid = px_top(5),
  ContentsCode = "*",
  .codelist = list(Region = "vs_RegionLän07")
)
```
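
What a grouping codelist does conceptually can be sketched in base R:
map each fine-grained code to a group, then aggregate. The lookup table
and the values below are invented for illustration; with a real
codelist the grouping is defined by the API rather than computed
locally like this.

```{r}
# Toy municipality-level values (invented numbers)
toy <- data.frame(
  municipality = c("0180", "0114", "1480", "1401"),
  value        = c(10, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from municipality code to county code
to_county <- c("0180" = "01", "0114" = "01", "1480" = "14", "1401" = "14")
toy$county <- unname(to_county[toy$municipality])

# Aggregate to the coarser level
county_totals <- tapply(toy$value, toy$county, sum)
county_totals
```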

## Wide output and multiple contents

When a table has multiple content variables (e.g. both Population and
Deaths), use `.output = "wide"` to pivot them into separate columns.
This is useful when you want to *compute* with multiple measures (e.g.
death rate = Deaths / Population):

```{r}
demo <- get_data(scb, "TAB638",
  Region = "0180",
  Tid = px_top(5),
  ContentsCode = "*",
  .output = "wide"
)
demo
```
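
The reshaping that wide output performs can be pictured with a small
base-R sketch: one row per measure becomes one column per measure,
after which measures can be combined arithmetically. The numbers are
invented, and the `Tid`, `ContentsCode`, and `value` column names are
assumptions rather than a description of what `get_data()` returns.

```{r}
# Toy long-format data: one row per period and measure
long <- data.frame(
  Tid          = c("2022", "2022", "2023", "2023"),
  ContentsCode = c("Population", "Deaths", "Population", "Deaths"),
  value        = c(1000, 10, 1200, 9),
  stringsAsFactors = FALSE
)

# Pivot to wide: rows are periods, columns are measures
wide <- as.data.frame(
  tapply(long$value, list(long$Tid, long$ContentsCode), sum)
)

# With one column per measure, derived quantities are simple
wide$death_rate <- wide$Deaths / wide$Population
wide
```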

## Advanced: query composition

For full control over the HTTP request — useful for debugging or when
you need to inspect/modify the exact query before sending it — use the
low-level query composers:

```{r}
q <- compose_data_query(scb, "TAB638",
  Region = c("0180"),
  ContentsCode = "*",
  Tid = px_top(3)
)

# Inspect the query
q$url
q$body

# Execute the query (edit q$url or q$body first if needed)
raw <- execute_query(scb, q$url, q$body)
```

## Saved queries (v2 only)

PX-Web v2 supports server-side stored queries. Useful for recurring
reports — save a query once, then retrieve it by ID later:

```{r}
# Save a query
id <- save_query(scb, "TAB638",
  Region = "0180",
  Tid = px_top(5),
  ContentsCode = "*"
)

# Retrieve later
get_saved_query(scb, id)
```

## Citation

Always cite your data sources:

```{r}
px_cite(pop)
```
