---
title: "Metric and Non-Gaussian Variants of Archetypal Analysis"
bibliography: references.bib
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Metric and Non-Gaussian Variants of Archetypal Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 7,
    fig.height = 5.5,
    fig.align = "center",
    warning = FALSE
)
set.seed(42)
par(bty = "n")
library(yaap)
```

## One simplex, several geometries

Classical archetypal analysis (AA) represents each observation as a convex
mixture of a small number of extreme profiles [@Cutler1994]. In `yaap`, the main
objects keep the same interpretation across most variants:

- `coefficients` describe each archetype as a mixture of training observations.
- `compositions` describe each observation as a mixture of archetype profiles.
- `coordinates` are the fitted archetype profiles, when an input-space profile is available.

What changes is the notion of closeness. For Gaussian models, this is usually an
inner product that defines lengths and angles in feature space. For
non-Gaussian or directional models, it is a likelihood or angular objective.

- Ordinary AA uses squared Euclidean distance.
- Metric AA uses a metric matrix $G$ to weight feature-space residuals.
- Kernel AA implicitly maps the data into another feature space and uses that
  space's inner product.
- Probabilistic AA uses a likelihood objective, and directional AA uses an angular objective.

[@Seth2016; @Olsen2022].

| Variant              | Main function                               | What changes                         | Typical data                                |
| -------------------- | ------------------------------------------- | ------------------------------------ | ------------------------------------------- |
| Metric Gaussian AA   | `run_aa(..., scale = G)`                    | Feature inner product                | Curves, spectra, correlated features        |
| Kernel AA            | `run_aa(..., method = "kernel")`            | Implicit feature-space inner product | Nonlinear geometry                          |
| Probabilistic AA     | `run_aa(..., method = "paa", family = ...)` | Likelihood/objective                 | Binary, count, compositional observations   |
| Directional AA       | `run_aa(..., method = "directional")`       | Angular objective                    | Unit directions, polarity-invariant signals |

This vignette is organized as a set of recipes. Each section starts with the
modeling situation where the variant is useful, then shows the smallest code
pattern needed to fit the model and interpret the result. The examples are
small and synthetic so they can be rerun quickly.

```{r helpers, include = FALSE}
pal_aa <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7")
pal_data <- adjustcolor("black", 0.20)
pal_data_mid <- adjustcolor("black", 0.45)

sample_simplex <- function(n, k) {
    Z <- matrix(stats::rexp(n * k), nrow = n, ncol = k)
    Z / rowSums(Z)
}

final_loss <- function(fit) {
    tail(fit[["loss"]], 1)
}

unit_rows <- function(X) {
    X <- as.matrix(X)
    X / sqrt(rowSums(X * X))
}
```

## Metric Gaussian AA

Use this recipe when the rows are still numeric profiles, but ordinary
Euclidean distance is the wrong residual scale. Common cases include calibrated
measurement error, correlated features, spectra, and functional curves.

In `yaap`, the standard Gaussian objective uses the following matrices:

- $X$, the data matrix ($N \times M$)
- $S$, the composition matrix ($N \times K$)
- $A$, the archetype matrix ($K \times M$)
- $E = X - SA$, the residual matrix
- $W = \mathrm{diag}(w_1, \ldots, w_N)$, the diagonal matrix of sample weights
- $G$, the feature metric, a $M \times M$ positive-definite matrix

The fitted objective is a weighted and metric-adjusted sum of squared errors:

$$\mathcal{L}(S,A; W, G) = \mathrm{tr}\{W E G E^\top\}$$

Here, $W$ is used to adjust the contribution of each sample, and $G$ defines how
feature-space residuals are measured. If $G$ is diagonal, its entries say how
strongly each feature contributes to squared error. $G$'s off-diagonal entries
add cross-terms to the quadratic error, so residuals in nearby or related
features can be judged together.

The `weights` argument sets $W$; when it is omitted, $W$ is the identity
matrix. The `scale` argument sets $G$. Conceptually, every non-kernel
Gaussian use of `scale` is a choice of this feature metric:

- `scale = FALSE` uses the identity metric
- `scale = TRUE` uses the feature-wise sample variances as a diagonal metric
- a numeric vector of column divisors gives a diagonal metric through
  per-feature scale factors
- a positive-definite matrix supplies a full non-diagonal metric.

### Recipe: known measurement-error metric

A feature metric is useful when technical measurement error is known from
instrument calibration or prior domain knowledge. In the recipe below, the
true archetypal structure is a triangle in the first two measured features.
The last two features are nuisance measurements with large correlated technical
noise. For simplicity, we do not add any other irreducible noise. Plain
Euclidean AA treats all feature directions as having the same reliability.
Metric Gaussian AA uses the technical-noise precision matrix, so residuals are
judged relative to measurement reliability.

It is important that this metric describes measurement noise, not the total
covariance of `X_metric`. The total covariance of `X_metric` contains the
archetypal signal we want to model. Using its inverse would also downweight
meaningful variation between archetypes.

```{r metric-simplex-data}
# Define the signal archetypes.
A_signal_true <- rbind(
    left  = c(-1,      0, 0, 0),
    right = c(1,       0, 0, 0),
    top   = c(0, sqrt(3), 0, 0)
)

# Basic dimensions
K <- nrow(A_signal_true)
M <- ncol(A_signal_true)
N <- 220  # samples

# Archetype compositions for each sample
S_metric_true <- sample_simplex(N, K)
X_metric_mean <- S_metric_true %*% A_signal_true

# Build block-diagonal measurement-noise covariance
Sigma_noise <- matrix(0, nrow = M, ncol = M)
# Signal features: small, moderately correlated noise
Sigma_noise[1:2, 1:2] <- 0.03^2 * matrix(c(1, -0.5, -0.5, 1), nrow = 2)
# Nuisance features: large, weakly correlated noise
Sigma_noise[3:4, 3:4] <- 3^2 * matrix(c(1, 0.2, 0.2, 1), nrow = 2)
# Precision matrix used as metric for AA
G_noise <- solve(Sigma_noise)

# Correlated anisotropic noise
X_noise <- matrix(rnorm(N * M), ncol = M) %*% chol(Sigma_noise)
X_metric <- X_metric_mean + X_noise
colnames(X_metric) <- c("feature_1", "feature_2", "noise_1", "noise_2")
```

```{r metric-simplex-fit, warning=FALSE}
set.seed(42)
fit_plain <- run_aa(
    X_metric,
    K = K,
    scale = FALSE,
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_plain

set.seed(42)  # same starts
fit_metric <- run_aa(
    X_metric,
    K = K,
    scale = G_noise,
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_metric
```

The two fits use the same observations and the same simplex model. The metric
fit differs only in how residuals are judged: directions with larger technical
measurement noise count less, and the off-diagonal entries in `G_noise`
account for correlated errors.

```{r metric-simplex-loss, echo=FALSE}
match_to_truth <- function(fit, truth, G = diag(ncol(truth))) {
    perms <- rbind(
        c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
        c(2, 3, 1), c(3, 1, 2), c(3, 2, 1)
    )
    error <- apply(perms, 1, function(ix) {
        E <- coordinates(fit)[ix, , drop = FALSE] - truth
        sum(diag(E %*% G %*% t(E)))
    })
    perms[which.min(error), ]
}

noise_metric_archetype_error <- function(fit, truth, G) {
    ix <- match_to_truth(fit, truth, G)
    E <- coordinates(fit)[ix, , drop = FALSE] - truth
    sum(diag(E %*% G %*% t(E)))
}

clr <- function(x) {
    logx <- log2(pmax(x, 1e-6))
    logx - rowMeans(logx)
}

composition_rmse <- function(fit, truth, S_true, G) {
    ix <- match_to_truth(fit, truth, G)
    clr_true <- clr(S_true)
    clr_fit <- clr(compositions(fit)[, ix, drop = FALSE])
    sqrt(mean((clr_fit - clr_true)^2))
}

res <- rbind(
    plain = cbind(
        final_loss(fit_plain),
        data.frame(
            noise_metric_archetype_error = noise_metric_archetype_error(
                fit_plain, A_signal_true, G_noise
            ),
            composition_rmse = composition_rmse(
                fit_plain, A_signal_true, S_metric_true, G_noise
            )
        )
    ),
    metric = cbind(
        final_loss(fit_metric),
        data.frame(
            noise_metric_archetype_error = noise_metric_archetype_error(
                fit_metric, A_signal_true, G_noise
            ),
            composition_rmse = composition_rmse(
                fit_metric, A_signal_true, S_metric_true, G_noise
            )
        )
    )
)
knitr::kable(
    as.data.frame(res),
    digits = c(0, 3, 2, 2),
    col.names = c("Final loss", "R2", "Noise-metric archetype error", "Composition RMSE")
)
```

The loss values are on different scales because each fit uses its own residual
metric. The archetype error and composition RMSE are evaluated against the known
generating structure, using the noise precision matrix to match the fitted
archetypes to the truth and thus are comparable across fits.

```{r metric-simplex-plot, echo=FALSE, fig.cap = "Metric Gaussian AA with known correlated measurement noise."}
plot(
    X_metric[, 1:2], frame.plot = FALSE,
    asp = 1, pch = 19, cex = 0.45, col = pal_data,
    xlab = "feature 1", ylab = "feature 2", main = "Metric Gaussian AA"
)
polygon(A_signal_true[c(1, 3, 2), ], border = "black", lty = 3, lwd = 1.4)
points(A_signal_true, pch = 15, cex = 1.1, col = "black")
points(coordinates(fit_plain)[, 1:2], pch = 4, cex = 1.45, lwd = 1.7, col = pal_aa[2])
points(coordinates(fit_metric)[, 1:2], pch = 1, cex = 1.65, lwd = 1.8, col = pal_aa[1])
legend(
    "topright",
    legend = c("Data", "True archetypes", "Plain Euclidean", "Metric"),
    pch = c(19, 15, 4, 1),
    col = c(pal_data, "black", pal_aa[2], pal_aa[1]),
    pt.cex = c(0.7, 1.1, 1.3, 1.5),
    bty = "n"
)
```

### Functional AA

Functional data use the same metric idea, but the metric is usually not supplied
by hand. Use this recipe when analyzing functional data with the `fda` package.

In functional analysis, a curve is represented as a weighted sum of basis
functions, such as sine waves in Fourier analysis. The appropriate feature
metric is computed from the inner-product matrix of the basis functions.

The package `fda` provides a rich set of basis functions and tools for working
with functional data. `run_aa.fd()` works directly with `fda::fd` objects and
applies the basis inner-product metric (computed using `fda::eval.penalty()`) to
the Gaussian AA engine automatically, so users can fit the functional object
directly. The resulting archetypes and fitted values are also `fd` objects so
that they can be post-processed with the same tools as the input data.

Below we demonstrate functional AA on synthetic curves with known archetypal
structure. The true archetypes are smooth curves with peaks at different times.
The data are noisy samples from the convex hull of those archetypes.

```{r functional-aa-fit, eval=requireNamespace("fda", quietly = TRUE)}
N <- 90  # number of curves
t_fd <- seq(0, 1, length.out = 80)  # time points for functional data
A_fd_true <- rbind(
    early_peak = exp(-60 * (t_fd - 0.18)^2),
    late_peak  = exp(-60 * (t_fd - 0.82)^2),
    ramp       = 0.15 + 0.85 * t_fd
)
# Archetype compositions for each sample
S_fd_true <- sample_simplex(N, 3)
X_fd_noise <- matrix(rnorm(N * length(t_fd), sd = 0.025), nrow = N)
X_fd <- S_fd_true %*% A_fd_true + X_fd_noise

# Convert the sampled curves to an fda::fd object
basis_fd <- fda::create.bspline.basis(rangeval = c(0, 1), nbasis = 14)
fd_data <- fda::Data2fd(argvals = t_fd, y = t(X_fd), basisobj = basis_fd)

fit_fd <- run_aa(fd_data, K = 3, sd_threshold = 0)
fit_fd
```

The resulting archetypes are functional objects, and the fitted values are
curves in the same space. The plot below shows the fitted archetypes overlaid on
a sample of the data curves.

```{r functional-aa-plot, eval=requireNamespace("fda", quietly = TRUE), fig.cap = "Functional Gaussian AA fitted directly to an fda::fd object."}

A_fit_fd <- coordinates(fit_fd)
class(A_fit_fd)  # fitted archetype curves

sample_ix <- seq(1, nrow(X_fd), length.out = 25) # sample curves for plotting

# Plot sampled curves, then overlay archetype curves
matplot(
    t_fd,
    t(X_fd[sample_ix, ]),
    type = "l", lty = 1, col = pal_data,
    xlab = "t", ylab = "Value", main = "Functional AA"
)
invisible( # plot.fd prints a "done" message that we want to suppress
    plot(A_fit_fd, lty = 1, lwd = 2, col = pal_aa[1:3], add = TRUE)
)
legend(
    "top",
    legend = anames(fit_fd),
    lty = 1, lwd = 2, col = pal_aa[1:3],
    bty = "n", horiz = TRUE
)
```

## Kernel AA

Kernel AA is the nonlinear analog of changing the inner product. Instead of
choosing an explicit feature metric ($G$), it works with a $N \times N$
Gram matrix $K_{ij} = k(x_i, x_j)$ and fits archetypes in the kernel-induced
feature space [@Morup2012]. With the linear kernel, this recovers the ordinary
Euclidean geometry. With nonlinear kernels, the convex hull is formed after the
implicit feature map.

Use this recipe when the relevant extremes are defined by a nonlinear geometry
rather than by the convex hull in the original feature space. The example below
has a dense core and an outer ring. Linear AA sees the input space and places
proxy archetypes around broad outer directions. A Laplace kernel changes the
geometry so local neighborhoods are emphasized.

```{r kernel-data}
# Choose the sample sizes
n_outer <- 120
n_inner <- 80

# Simulate an outer ring
theta_outer <- runif(n_outer, 0, 2 * pi)
r_outer <- 1 + 0.12 * runif(n_outer)
X_outer <- cbind(r_outer * cos(theta_outer), r_outer * sin(theta_outer))

# Simulate an inner core
theta_inner <- runif(n_inner, 0, 2 * pi)
r_inner <- 0.15 * runif(n_inner)
X_inner <- cbind(r_inner * cos(theta_inner), r_inner * sin(theta_inner))

# Combine the two groups
X_ring <- rbind(X_outer, X_inner)
colnames(X_ring) <- c("x", "y")
```

```{r kernel-fit, warning=FALSE}
set.seed(42)
fit_linear <- run_aa(
    X_ring,
    K = 7,
    scale = FALSE,
    init = "random",
    nrep = 20
)
fit_linear

set.seed(42)
fit_kernel <- run_aa(
    X_ring,
    K = 7,
    method = "kernel",
    kernel = "laplace",
    kernel_args = list(sigma = 0.35),
    init = "random",
    nrep = 20
)
fit_kernel
```

The two loss metrics are not directly comparable because they are computed in
different spaces. Even the `r2` values are not directly comparable because they
are computed relative to different null models. There is no simple scalar
diagnostic that decides whether to use kernel AA or linear AA. Use domain
knowledge and visual inspection of the fitted profiles.

```{r kernel-plot, fig.cap = "Linear AA archetypes and kernel-AA input-space proxy archetypes."}
plot(
    X_ring, frame.plot = FALSE, axes = FALSE,
    col = pal_data, asp = 1, pch = 19, cex = 0.45,
    main = "Linear AA vs kernel-AA proxies", xlab = "", ylab = ""
)
points(coordinates(fit_linear), pch = 4, cex = 1.25, lwd = 1.5, col = pal_aa[1])
points(coordinates(fit_kernel), pch = 1, cex = 1.45, lwd = 1.6, col = pal_aa[2])
legend(
    "topleft",
    legend = c("Data", "Linear AA", "Kernel-AA"),
    pch = c(19, 4, 1),
    col = c(pal_data, pal_aa[1:2]),
    pt.cex = c(0.7, 1.2, 1.4),
    bty = "n"
)
```

For nonlinear kernels, `coordinates` are input-space proxy coordinates computed
as `coefficients %*% data`. They are useful for plotting, but they are not exact
preimages of the Hilbert-space archetypes. This distinction also explains why
`fitted()` and out-of-sample `predict()` for `type = "reconstruction"` are not
defined for `kernel_archetypes`. `predict()` for `type = "compositions"` on the
other hand is defined and returns the new composition weights using cross-Gram
evaluation (new samples vs every training sample).

In this particular example, the kernel-AA considers the inner core and outer
ring as two separate archetypal directions, while linear AA places archetypes
around the outer ring and considers the inner core as a uniformly mixed profile.

The kernel choice and bandwidth are substantive modeling decisions. Very local
kernels can make many points effectively extreme in feature space, a phenomenon
also discussed in kernel frame methods [@Mair2018Frame]. In practice, compare a
small grid of kernel parameters and inspect whether the resulting compositions
are stable.

## Probabilistic AA

Use PAA when the observed entries are generated from a distribution such as
binary responses, counts, or category counts. PAA changes the objective rather
than the inner product. It keeps the simplex structure, but archetypes are
profiles in the parameter space of an observation family [@Seth2016]. In `yaap`,
the supported families are:

| `family =`      | Input data                   | Parameter space                            |
| --------------- | ---------------------------- | ------------------------------------------ |
| `"gaussian"`    | Real-valued numeric matrix   | Mean profiles on the data scale            |
| `"binomial"`    | Binary 0/1 response matrix   | Response probabilities                     |
| `"poisson"`     | Non-negative count matrix    | Poisson rates, or expected counts          |
| `"multinomial"` | Rows of category counts      | Category probabilities that sum to one     |

PAA's main caveat is interpretive: the convex hull lives in parameter space.
For non-Gaussian families, `plot(fit, what = "coordinates")` is intentionally
unavailable because the observed data and fitted archetypes are not in the same
space. PAA also does not currently support sample `weights`, robust fitting, or
missing-data fitting.  For Gaussian data, PAA is equivalent to ordinary AA but
less versatile and slower. Prefer the standard Gaussian fitters unless you
specifically need the PAA interface.

### Recipe: binary responses

The following example represents observations as answers to six yes/no
items. The archetypes are probability profiles, not binary observations. For
your own data, use an `N x M` matrix of 0/1 responses.

```{r paa-binomial-data}
N <- 120
P_true <- rbind(
    broad_yes = c(0.85, 0.80, 0.75, 0.20, 0.15, 0.10),
    middle    = c(0.25, 0.75, 0.35, 0.70, 0.35, 0.65),
    broad_no  = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.75)
)
K <- nrow(P_true)
M <- ncol(P_true)
S_bin <- sample_simplex(N, K)
P_bin <- S_bin %*% P_true
X_bin <- matrix(rbinom(N * M, size = 1, prob = as.vector(P_bin)), nrow = N)
colnames(X_bin) <- paste0("item_", seq_len(M))
head(X_bin)
```

```{r paa-binomial-fit}
fit_binomial <- run_aa(
    X_bin,
    K = K,
    method = "paa",
    family = "binomial"
)

fit_binomial
```

```{r paa-profile, echo=FALSE, fig.cap = "Binomial PAA archetype response probabilities."}
old_par <- par(mar = c(5.1, 4.1, 4.1, 8.5), xpd = TRUE)
plot(
    fit_binomial,
    what = "profiles",
    col = pal_aa[1:3],
    ylim = c(0, 1),
    main = "Binomial archetype probabilities",
    args.legend = list(
        x = "topright",
        inset = c(-0.32, 0),
        bty = "n"
    )
)
par(old_par)
```

Because the archetypes are in parameter space, the fitted values are
family parameters. The compositions are still convex weights over archetypes,
but the reconstruction is not always a convex combination of observed data.
Instead, it is a convex combination of archetype profiles in parameter space.

The `predict()` method for PAA supports both types of prediction. However,
`type = "reconstruction"` returns fitted values in parameter space, not
observed data in the original sample space.

```{r paa-predict}
S_new <- predict(fit_binomial, X_bin[1:5, ], type = "compositions")
round(S_new, 3)

P_new <- predict(fit_binomial, X_bin[1:5, ], type = "reconstruction")
round(P_new, 2)
```

### Recipe: multinomial counts

Next, we fit a multinomial example. The data are counts of five terms in three
document types:

- "visual" documents have high counts of the first two terms and low counts
  of the last three.
- "text" documents have the opposite pattern.
- "mixed" documents have moderate counts of all terms.

The archetypes are term-probability profiles, not observed count vectors. The
fitted and predicted values are probabilities whose rows sum to one. If expected
counts are needed, multiply those probabilities by the document totals.

```{r paa-multinomial}
# Define term probabilities for three document types
Topic_true <- rbind(
    visual = c(0.55, 0.25, 0.10, 0.05, 0.05),
    text   = c(0.05, 0.10, 0.55, 0.25, 0.05),
    mixed  = c(0.15, 0.25, 0.15, 0.20, 0.25)
)
# Basic dimensions
K <- nrow(Topic_true)
M <- ncol(Topic_true)
N <- 80
# Archetype compositions
S_txt <- sample_simplex(N, K)
Theta <- S_txt %*% Topic_true

totals <- 39 + sample(40, N, replace = TRUE) # total term counts per document
X_txt <- matrix(0L, nrow = N, ncol = M)     # observed term counts
for (i in seq_len(N)) {
    X_txt[i, ] <- as.integer(rmultinom(1, totals[i], Theta[i, ]))
}
colnames(X_txt) <- paste0("term_", seq_len(ncol(X_txt)))

fit_multi <- run_aa(
    X_txt,
    K = K,
    method = "paa",
    family = "multinomial",
    tol = 1e-6
)
fit_multi

P_hat <- fitted(fit_multi)
head(round(P_hat, 2), 3)
rowSums(P_hat[1:3, ])

expected_counts <- rowSums(X_txt) * P_hat
head(round(expected_counts, 1), 3)
```

## Directional AA

Use directional AA when row direction matters more than magnitude, especially
when signs can be arbitrary. The implementation uses a Watson-style loss
that is insensitive to polarity after hemisphere handling [@Olsen2022].

The recipe below draws points from a spherical triangle, then flips some rows
by multiplying them by `-1`. Euclidean AA treats the flipped rows as opposite
points. Directional AA can model them as the same direction.

```{r directional-data}
A_dir_true <- rbind(
    c(1, 0, 0),
    c(0, 1, 0),
    c(0, 0.15, 1)
)
K <- nrow(A_dir_true)
N <- 120
S_dir <- sample_simplex(N, K)
X_dir <- unit_rows(S_dir %*% A_dir_true)
flip <- sample(c(-1, 1), nrow(X_dir), replace = TRUE)
X_dir_flip <- X_dir * flip
colnames(X_dir_flip) <- c("x", "y", "z")
```

```{r directional-fit}
set.seed(42)
fit_euclidean_dir <- run_aa(
    X_dir_flip,
    K = K,
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_euclidean_dir

set.seed(42)
fit_directional <- run_aa(
    X_dir_flip,
    K = K,
    method = "directional",
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_directional
```

```{r directional-directions}
round(fit_directional[["directions"]], 3)
round(rowSums(fitted(fit_directional)^2), 6)[1:6]
```

```{r directional-plot, echo=FALSE, fig.cap = "Orthographic view of polarity-flipped spherical data."}
grid_t <- seq(0, 2 * pi, length.out = 200)
grid_lat <- seq(-60, 60, by = 30) * pi / 180
grid_lon <- seq(-60, 60, by = 30) * pi / 180
view_tilt <- 25 * pi / 180
view_turn <- 15 * pi / 180
rot_x <- matrix(
    c(1, 0, 0,
      0, cos(view_tilt), -sin(view_tilt),
      0, sin(view_tilt), cos(view_tilt)),
    nrow = 3, byrow = TRUE
)
rot_z <- matrix(
    c(cos(view_turn), -sin(view_turn), 0,
      sin(view_turn), cos(view_turn), 0,
      0, 0, 1),
    nrow = 3, byrow = TRUE
)
project_sphere <- function(X) {
    as.matrix(X) %*% t(rot_z %*% rot_x)
}

plot(
    NA, NA, axes = FALSE,
    xlim = c(-1.05, 1.05), ylim = c(-1.05, 1.05), asp = 1,
    main = "Directional AA on the unit sphere", xlab = "", ylab = ""
)
polygon(
    cos(grid_t), sin(grid_t),
    border = adjustcolor("black", 0.35), col = adjustcolor("black", 0.03),
    lwd = 1.2
)
for (lat in grid_lat) {
    grid_line <- cbind(
        cos(lat) * cos(grid_t),
        cos(lat) * sin(grid_t),
        sin(lat)
    )
    grid_line <- project_sphere(grid_line)
    lines(grid_line[, 1:2], col = adjustcolor("black", 0.12), lwd = 0.8)
}
for (lon in grid_lon) {
    grid_lat_dense <- seq(-pi / 2, pi / 2, length.out = 200)
    grid_line <- cbind(
        cos(grid_lat_dense) * cos(lon),
        cos(grid_lat_dense) * sin(lon),
        sin(grid_lat_dense)
    )
    grid_line <- project_sphere(grid_line)
    lines(grid_line[, 1:2], col = adjustcolor("black", 0.12), lwd = 0.8)
}

X_dir_view <- project_sphere(X_dir_flip)
A_euclidean_view <- project_sphere(unit_rows(coordinates(fit_euclidean_dir)))
A_directional_view <- project_sphere(fit_directional[["directions"]])
front <- X_dir_view[, 3] >= 0
points(X_dir_view[!front, 1:2], pch = 19, cex = 0.48, col = pal_data)
points(X_dir_view[front, 1:2], pch = 19, cex = 0.52, col = pal_data_mid)
points(
    A_euclidean_view[, 1:2],
    pch = 4, cex = 1.45, lwd = 1.7, col = pal_aa[2]
)
points(
    A_directional_view[, 1:2],
    pch = 16, cex = 1.25, col = pal_aa[1]
)
legend(
    "bottomleft",
    legend = c("Back hemisphere", "Front hemisphere", "Euclidean AA", "Directional AA"),
    pch = c(19, 19, 4, 16),
    col = c(pal_data, pal_data_mid, pal_aa[2], pal_aa[1]),
    pt.cex = c(0.7, 0.8, 1.2, 1.1),
    bty = "n"
)
```

Directional AA returns `directional_archetypes`, which extends the ordinary
`archetypes` class. `fitted()` and out-of-sample `predict()` return
row-normalized reconstructions aligned to the original row polarity when the
training data are available. The directional loss is not comparable to Euclidean
SSE, so compare directional fits with directional diagnostics, not with ordinary
residual sums of squares.

Current limitations are stricter than for Gaussian AA: input must be a dense
numeric matrix with no zero rows; missing data, robust fitting, and custom
`scale` values are not supported.

## Practical checklist

Use the simplest model that matches the data-generating assumptions:

- If a weighted or correlated feature metric is enough, stay with Gaussian AA
  and supply `scale`.
- If the structure is nonlinear but still geometric, try kernel AA and inspect
  sensitivity to kernel parameters.
- If observations are binary, counts, or category frequencies, use PAA and
  interpret archetypes as distribution parameters.
- If only direction matters, use directional AA and remember that magnitudes are
  not the object of fit.

Across all variants, use multiple seeds or `nrep` where supported. The
optimization problems remain non-convex, so stable conclusions should not rely
on a single initialization.

## References