Scaling KRLS to larger n: the Nystrom approximation

Jens Hainmueller and Chad Hazlett

June 2026

Motivation

Exact KRLS solves a Tikhonov problem on the full \(n \times n\) Gaussian kernel. Time is \(O(n^3)\) and memory is \(O(n^2)\), so the practical ceiling on a laptop is around \(n \approx 5{,}000\). For larger samples the kernel allocation alone becomes prohibitive.

The Nystrom approximation replaces the full kernel with a low-rank representation anchored at \(m\) landmark points. The fitted model is \(\hat f(x) = K(x, Z)\, \alpha\) where \(Z\) are the landmarks and \(\alpha\) is a length-\(m\) coefficient vector solved in \(m\)-dimensional space. Time becomes \(O(n m^2 + m^3)\) and memory \(O(n m)\), so KRLS can now handle sample sizes well past the exact ceiling.

KRLS exposes this as an explicit approximation:

krls(X, y, approx = "nystrom", nystrom_m = 100)

Default landmark count is min(500, n). Inference (when vcov = TRUE) is conditional approximate – standard errors condition on the selected landmarks, fixed \(\lambda\), and the low-rank feature map; they are not equivalent to exact KRLS inference.

library(KRLS)
#> ## KRLS Package for Kernel-based Regularized Least Squares.
#> ## See Hainmueller and Hazlett (2014) for details.

A scaling comparison

We simulate a smooth function in three predictors plus a binary indicator, and compare exact KRLS to Nystrom across modest \(n\).

make_data <- function(n) {
  X <- cbind(rnorm(n), rnorm(n), rnorm(n), rbinom(n, 1, 0.4))
  colnames(X) <- c("x1", "x2", "x3", "x4")
  f <- sin(X[, 1]) + 0.3 * X[, 2] + 0.05 * X[, 3]^2 + 0.5 * X[, 4]
  list(X = X, y = f + rnorm(n, sd = 0.3), truth = f)
}

For each \(n\) we fit both paths, record wall-clock time, in-sample \(R^2\), and prediction RMSE on a held-out test set drawn from the same distribution. Predictions are evaluated against the true mean function (no noise on the test target) to focus on approximation error.

sizes <- c(500, 1000)
res <- vector("list", length(sizes))

for (i in seq_along(sizes)) {
  n <- sizes[i]
  tr <- make_data(n)
  te <- make_data(200)
  # Use the package default landmark count, min(500, n).
  m  <- min(500L, n)

  t_e <- system.time({
    fit_e <- krls(tr$X, tr$y, approx = "none",
                  derivative = FALSE, vcov = FALSE,
                  print.level = 0)
  })["elapsed"]
  t_n <- system.time({
    fit_n <- krls(tr$X, tr$y, approx = "nystrom",
                  derivative = FALSE, print.level = 0)
  })["elapsed"]

  pred_e <- predict(fit_e, newdata = te$X)$fit
  pred_n <- predict(fit_n, newdata = te$X)$fit

  res[[i]] <- data.frame(
    n          = n,
    m          = m,
    time_exact = round(as.numeric(t_e), 2),
    time_nys   = round(as.numeric(t_n), 3),
    speedup    = round(as.numeric(t_e) / as.numeric(t_n), 1),
    rmse_exact = round(sqrt(mean((pred_e - te$truth)^2)), 4),
    rmse_nys   = round(sqrt(mean((pred_n - te$truth)^2)), 4)
  )
}
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
do.call(rbind, res)
#>      n   m time_exact time_nys speedup rmse_exact rmse_nys
#> 1  500 500       0.84    0.568     1.5     0.1119   0.1119
#> 2 1000 500       6.26    1.187     5.3     0.0885   0.0885

The Nystrom path stays in milliseconds while exact KRLS grows cubically. Prediction RMSE on held-out data is comparable – Nystrom trades a small amount of approximation error for a large amount of runtime.

Scaling beyond what exact KRLS can do

Past \(n \approx 2{,}000\) the exact path becomes uncomfortable on a typical laptop. Nystrom keeps going:

n  <- 2000
tr <- make_data(n)
te <- make_data(200)
# Default is min(500, n) -- here 500 landmarks at n = 2000.
m  <- min(500L, n)

t_n <- system.time({
  fit_big <- krls(tr$X, tr$y, approx = "nystrom",
                  derivative = FALSE, print.level = 0)
})["elapsed"]
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.

pred_big <- predict(fit_big, newdata = te$X)$fit
sprintf("n = %d, m = %d, time = %.3fs, R2 = %.3f, RMSE = %.3f",
        n, m, t_n, fit_big$R2,
        sqrt(mean((pred_big - te$truth)^2)))
#> [1] "n = 2000, m = 500, time = 2.436s, R2 = 0.862, RMSE = 0.075"

The fit still recovers the truth at held-out RMSE comparable to the exact path on the smaller datasets, in a fraction of a second.

Reusing landmarks across fits

When running sensitivity checks – refitting on the same predictors with several outcome variants, or perturbing \(y\) – it is convenient to fix the landmark set so the approximation geometry is held constant. The fitted object stores landmarks in standardized \(X\)-space; to pass them back through krls() use get_landmarks(), which un-standardizes to the original units the function expects:

tr <- make_data(400)
fit <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 30,
            derivative = FALSE, print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.

Z <- get_landmarks(fit)              # original X-scale matrix
y_perturbed <- tr$y + rnorm(400, sd = 0.05)

fit2 <- krls(tr$X, y_perturbed, approx = "nystrom",
             landmarks = Z, derivative = FALSE, print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.

# Same landmarks under the hood
identical(unname(fit$landmarks), unname(fit2$landmarks))
#> [1] TRUE

The landmark coordinates are bit-identical, so any difference between the two fits is attributable to the change in \(y\) alone – not to re-selecting anchor points.

Choosing the lambda criterion

KRLS picks the regularization parameter by minimizing a closed-form objective. Two are available:

For the Nystrom path both are evaluated in \(O(nm)\) per candidate \(\lambda\) from the cached SVD, so the choice is essentially free.

tr <- make_data(500)
fit_loo <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 25,
                derivative = FALSE, lambda_method = "loo",
                print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
fit_gcv <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 25,
                derivative = FALSE, lambda_method = "gcv",
                print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
data.frame(
  method = c("loo", "gcv"),
  lambda = c(fit_loo$lambda, fit_gcv$lambda),
  R2     = c(fit_loo$R2, fit_gcv$R2)
)
#>   method    lambda        R2
#> 1    loo 0.1983744 0.8341407
#> 2    gcv 0.3202462 0.8134584

LOO and GCV typically pick lambdas within a small multiplicative factor of each other. GCV is sometimes preferred when one or two observations have disproportionate leverage; LOO has more historical use in KRLS applications.

What carries over from the exact path

A Nystrom fit is still a krls object, so the standard accessors work:

Limitations