querying data that won’t fit in memory
dplyr-style queries on larger-than-RAM data, backed by a pull-based columnar engine written in C11.
Point vectra at a file too big to load and query it with the verbs you already use. Data flows through the engine one row group at a time, so peak memory stays bounded no matter how large the file gets. Arrow needs compiled binaries that match your platform, DuckDB links a bundled library, Spark wants a JVM. vectra is a standard R extension with no external dependencies: it compiles where R compiles.
library(vectra)
# Lazy scan over a multi-GB CSV; nothing runs until collect()
tbl_csv("measurements.csv") |>
filter(temperature > 30, year >= 2020) |>
group_by(station) |>
summarise(avg_temp = mean(temperature), n = n()) |>
collect().vtr (vectra’s own columnar format), CSV, SQLite, and
GeoTIFF all open into the same lazy query nodes, so the same pipeline
runs against any of them:
# GeoTIFF climate raster as tidy data
tbl_tiff("worldclim_bio1.tif") |>
filter(band1 > 0) |>
mutate(temp_c = band1 / 10) |>
collect()
# SQLite without DBI
tbl_sqlite("survey.db", "responses") |>
filter(year == 2025) |>
left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
collect()Convert to .vtr once for repeated queries, then read it
back fast:
write_vtr(big_df, "data.vtr", batch_size = 100000)
tbl("data.vtr") |>
filter(x > 0, region == "EU") |>
group_by(region) |>
summarise(total = sum(value), n = n()) |>
collect()The optimizer rewrites the plan before any data moves, and execution streams the result rather than materializing it:
explain() shows the optimized plan, including which
columns and row groups were pruned:
tbl("data.vtr") |>
filter(x > 0) |>
select(id, x) |>
explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#> FilterNode [streaming]
#> ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]String distance runs inside the C engine, with no round-trip to R per row:
tbl("taxa.vtr") |>
filter(levenshtein(species, "Quercus robur") <= 2) |>
mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
arrange(desc(similarity)) |>
collect()levenshtein(), dl_dist()
(Damerau-Levenshtein), and jaro_winkler() are available in
filter() and mutate(), alongside
nchar(), substr(), grepl(),
gsub() and the rest of the string toolkit.
Same idea, applied to geometry: read and write layers larger than memory, running the GIS operations you would otherwise do in sf. Geometry rides through as WKB in an ordinary column, so the spatial verbs stream one batch at a time like the rest. Tag a billion-point stream with the polygon each point falls in, holding the polygons plus one batch in memory:
library(sf)
tbl("gps_pings.vtr") |>
spatial_join(zones_sf, join = st_intersects) |>
count(zone_id) |>
collect()st_* geometry functions evaluate inside
filter() and mutate(), computed in C on the
GEOS library straight off the geometry column:
tbl("parcels.vtr") |>
filter(st_area(geometry) > 1e6) |>
mutate(centroid = st_centroid(geometry)) |>
collect_sf()Raster operations stream strip by strip over the tiled
.vec format, so a grid larger than memory passes through
zonal(), focal(), terrain(),
warp(), and map algebra one tile-row at a time:
r <- vec_open_raster("dem.vec")
terrain(r, c("slope", "aspect", "hillshade")) # Horn's method over haloed strips
zonal(r, watersheds_sf, fun = c("mean", "sd")) # per-zone, one strip residentThe toolbox reaches further: vector-raster bridges
(rasterize(), polygonize(),
contours()), planar overlay and coverage cleaning
(spatial_overlay(), spatial_dissolve(),
spatial_eliminate()), and routing on a line network
(spatial_network(), spatial_route(),
spatial_service_area()). Geometry stays with GEOS through
libgeos; sf is needed only for vector format I/O and
reprojection.
Register dimension tables once, then pull columns from any of them; joins are built for you, and unmatched keys are reported:
s <- vtr_schema(
fact = tbl("observations.vtr"),
species = link("sp_id", tbl("species.vtr")),
site = link("site_id", tbl("sites.vtr"))
)
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)Append new rows as a new row group without rewriting the file, or take a key-based diff between two snapshots:
append_vtr(new_rows_df, "data.vtr")
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added) # rows present in new but not old
d$deleted # key values present in old but not newcollect() brings the whole result into memory.
collect_chunked() folds a function over a query one batch
at a time, holding a single batch plus the accumulator, so a result
larger than memory reduces to a small summary in one pass: a running
count, per-group sufficient statistics, or the cross-products behind a
linear fit.
# Accumulate X'X and X'y for an exact OLS fit, one streaming batch at a time
acc <- tbl("survey.vtr") |>
select(mpg, wt, hp) |>
collect_chunked(
function(acc, chunk) {
X <- cbind(1, chunk$wt, chunk$hp)
list(XtX = acc$XtX + crossprod(X),
Xty = acc$Xty + crossprod(X, chunk$mpg))
},
.init = list(XtX = matrix(0, 3, 3), Xty = matrix(0, 3, 1))
)
solve(acc$XtX, acc$Xty)For models that re-read the data on every iteration,
chunk_feeder() exposes a query as a resettable generator
that biglm::bigglm() drives directly, fitting a GLM on data
too large to hold in memory:
src <- function() tbl("occurrences.vtr") |> select(presence, bio1, bio12)
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(src),
family = binomial())offload() spills a prepared query to disk once and
streams it back, so each reweighted pass reads the prepared columns from
a file rather than rebuilding the pipeline. With a by key
it splits the query into per-key shards on disk, each small enough to
pull into memory on its own. group_map() then runs your
function on each shard, passing the shard as an ordinary data.frame
along with its group key, and returns the results keyed by shard. That
function is arbitrary R: a model fit, a clustering call, a custom
summary, or any computation that needs the whole group in memory at
once. group_modify() recombines per-shard data.frames into
one table.
# Prepare once, then let bigglm re-read the spill on every pass
s <- offload(tbl("occurrences.vtr") |> select(presence, bio1, bio12))
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(s),
family = binomial())
# Per-region: group_map runs any function on each in-memory shard (here a GLM)
p <- offload(tbl("occurrences.vtr"), by = "region")
fits <- group_map(p, function(d, region)
glm(presence ~ bio1 + bio12, data = d, family = binomial()))vectra covers the dplyr surface most analysis pipelines use:
filter(), select(), mutate(),
group_by() / summarise() with the common
aggregations, all the joins (left_join() through
cross_join(), plus fuzzy_join()),
arrange() and the slice_*() family, and window
functions (row_number(), rank(),
lag(), lead(), cumsum(),
ntile(), and more). Date/time and string functions evaluate
in the engine. select(), rename(),
relocate(), and across() accept the full
tidyselect helper set (starts_with(), where(),
all_of(), and the rest).
The Function Reference lists every verb and expression with examples.
install.packages("vectra") # CRAN
install.packages("pak") # development version
pak::pak("gcol33/vectra")Engine:
Spatial:
“Software is like sex: it’s better when it’s free.” — Linus Torvalds
I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.
If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.
MIT (see the LICENSE.md file)
@software{vectra,
author = {Colling, Gilles},
title = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
year = {2026},
url = {https://github.com/gcol33/vectra}
}