llamaR turns a local GGUF model into a chat backend for the R ecosystem. You can talk to it several ways, from lowest to highest level:
- llama_serve_openai() exposes an OpenAI-compatible API any client can hit (OpenCode, the openai SDK, curl). llama_serve_anthropic() exposes an Anthropic Messages API so Claude Code runs against local inference (see section 7).
- Chat: chat_llamar() returns an ellmer::Chat, so the whole ellmer / ragnar toolchain works against local inference.
- inst/examples/chat.R wraps both for quick use.

Both servers share a tool-aware chat layer (llama_chat_build() / llama_chat_parse()) so tool-calling models work end to end (section 6).

chat_llamar()

chat_llamar() returns an ellmer Chat. It has two modes, picked by which argument you pass, the same DBI-style choice as DBI::dbConnect() (connection parameters or a ready connection).

Give it a model file and it starts llama_serve_openai()
in a background process (via the callr package), waits
for it to come up, and points a Chat at it. The server’s
lifetime is tied to the returned object: when it is garbage-collected
(or R exits) the process is killed.
chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
chat$chat("Why is the sky blue?")
chat_llamar_stop(chat) # stop the spawned server (or just let GC do it)

Large models can take a while to load from disk; if a 14B at Q8 doesn’t come up in time, raise timeout above its 180 s default (for example, timeout = 300).
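
The wait-for-startup behaviour can be pictured as a polling loop with a deadline. The sketch below is only an illustration in base R: wait_until_ready() and fake_ready() are hypothetical names, not part of llamaR, and a real readiness check would presumably probe the server over HTTP rather than call a stand-in function.

```r
# Hypothetical helper: call `is_ready()` repeatedly until it returns TRUE
# or `timeout` seconds have elapsed. Returns TRUE on success, FALSE on timeout.
wait_until_ready <- function(is_ready, timeout = 180, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(is_ready())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in for a server that reports ready on the third probe.
probes <- 0
fake_ready <- function() {
  probes <<- probes + 1
  probes >= 3
}

ok <- wait_until_ready(fake_ready, timeout = 5, interval = 0.01)

# A check that never succeeds gives up once the deadline passes.
timed_out <- wait_until_ready(function() FALSE, timeout = 0.05, interval = 0.01)
```

A larger timeout only moves the deadline; it does not slow down a model that loads quickly, because the loop returns on the first successful probe.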

If you already run a server (in another process, or a pool of them), pass its URL as base_url. No process is spawned.

Pass system_prompt to set a system prompt:
chat <- chat_llamar(
model_path = "Ministral-3B-Instruct.gguf",
system_prompt = "You are a concise assistant. Answer in one sentence."
)
chat$chat("What is R?")

Under the hood. chat_llamar() wraps ellmer::chat_vllm(), which talks to the server’s /v1/chat/completions endpoint, the de-facto standard our server implements. (ellmer’s chat_openai() targets OpenAI’s newer /v1/responses API, which the server does not implement.)

llama_serve_openai()

chat_llamar(model_path=) is a convenience wrapper; you can run the server directly for non-R clients. It needs the optional drogonR package for the HTTP/SSE layer.

It blocks, serving:

- GET /v1/models
- POST /v1/chat/completions (both blocking and stream = true)

Point any OpenAI client at http://127.0.0.1:11434/v1:
curl http://127.0.0.1:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'

A runnable launcher lives at inst/examples/serve_openai.R.
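
With stream = true, OpenAI-style servers send the reply as server-sent events, one data: line per chunk. A minimal base-R sketch of how a client might separate those payloads is shown here. sse_payloads() is a hypothetical helper, the sample text is hand-written rather than captured from the server, and the closing data: [DONE] line follows OpenAI's convention, which this server is assumed to share.

```r
# Hypothetical helper: extract the `data:` payloads from raw SSE text,
# dropping the `[DONE]` sentinel that marks the end of the stream.
sse_payloads <- function(text) {
  lines <- strsplit(text, "\r?\n")[[1]]
  data_lines <- lines[startsWith(lines, "data:")]
  payloads <- trimws(sub("^data:", "", data_lines))
  payloads[payloads != "[DONE]"]
}

# Hand-written sample in the chat-completions streaming shape (illustrative only).
example_stream <- paste(
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
  "",
  sep = "\n"
)

chunks <- sse_payloads(example_stream)
```

Each element of chunks is a JSON string that a client would then decode (for instance with jsonlite::fromJSON) to read the delta content.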
Add an OpenAI-compatible provider in opencode.json (see
the one in this repo) with baseURL set to
http://127.0.0.1:11434/v1 and the model id matching what
/v1/models reports.
inst/examples/chat.R wraps both modes for the
terminal:
# Spawn a server for the model and open an interactive prompt
Rscript inst/examples/chat.R model.gguf
# Positional [port] [n_ctx], plus flags
Rscript inst/examples/chat.R model.gguf 11434 8192 \
--system "Be concise." --timeout 300
# One-shot: a trailing message prints a single reply and exits
Rscript inst/examples/chat.R model.gguf "Why is the sky blue?"
# Connect to a server you already started
Rscript inst/examples/chat.R --url http://127.0.0.1:11434/v1

In interactive mode, type a message and press Enter; a blank line or Ctrl-D quits. A spawned server is stopped automatically on exit.

Because chat_llamar() returns a real
ellmer::Chat, it plugs into ragnar. Pair it with
embed_llamar() (see
vignette("getting-started")) for a fully local RAG stack:
local embeddings for the store, local generation for the chat.
library(ragnar)
store <- ragnar_store_create(
location = "store.duckdb",
embed = embed_llamar(model = "embedding-model.gguf")
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)
chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
ragnar_register_tool_retrieve(chat, store)
chat$chat("What do the documents say about X?")

Note. Tool calling is mediated by the chat protocol. Both servers emit tool_calls (the tool-aware chat layer, see section 6), so a tool-calling model can autonomously invoke the registered retrieve tool. Tool-calling quality is model-dependent: capable models (e.g. Qwen3) drive retrieval reliably; very small models may need manual retrieval.

llama_chat_build() / llama_chat_parse()

The servers above call a lower-level, tool-aware chat layer you can also use directly. llama_chat_build() applies the model’s chat template to messages plus tool definitions and returns everything needed to constrain and parse a tool call; llama_chat_parse() turns the raw output back into structured tool calls.
tools <- list(list(
type = "function",
"function" = list(
name = "get_weather",
description = "Get the current weather for a city.",
parameters = list(
type = "object",
properties = list(city = list(type = "string")),
required = list("city")
)
)
))
messages <- list(list(role = "user", content = "What's the weather in Paris?"))
built <- llama_chat_build(model, messages, tools = tools)
# built$prompt — the formatted prompt to feed the model
# built$grammar — grammar that constrains tool-call output
# built$format — format id to pass to llama_chat_parse()
# built$grammar_lazy, built$trigger_patterns, built$trigger_tokens,
# built$additional_stops, built$preserved_tokens, built$parser

Generate against the returned prompt and grammar, then parse. Pass the lazy triggers only when built$grammar_lazy is TRUE:
lazy <- isTRUE(built$grammar_lazy)
raw <- llama_generate(
ctx, built$prompt,
grammar = built$grammar,
trigger_patterns = if (lazy) built$trigger_patterns,
trigger_tokens = if (lazy) built$trigger_tokens
)
parsed <- llama_chat_parse(raw, format = built$format, parser = built$parser)
parsed$content # assistant text (may be empty for a pure tool call)
parsed$tool_calls # data frame: name, arguments (JSON string), id

Tool-calling quality is model-dependent: capable models (e.g. Qwen3) emit clean tool calls; very small models are less reliable.
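
Once tool calls are parsed, something has to run them. The sketch below shows one way that could look, assuming the data frame has the name, arguments and id columns described above. dispatch_tool_calls() and fake_decode() are hypothetical, not llamaR functions; the demo uses a hand-made data frame and a stand-in decoder so that it runs in base R, where a real caller would decode the JSON string with a parser such as jsonlite::fromJSON.

```r
# Hypothetical dispatcher: run each parsed tool call through a matching R function.
# `decode` turns the JSON argument string into a named list. The default is only
# evaluated when no decoder is supplied, so jsonlite is optional for this demo.
dispatch_tool_calls <- function(tool_calls, handlers, decode = jsonlite::fromJSON) {
  results <- vector("list", nrow(tool_calls))
  names(results) <- tool_calls$id
  for (i in seq_len(nrow(tool_calls))) {
    name <- tool_calls$name[i]
    handler <- handlers[[name]]
    if (is.null(handler)) {
      # Unknown tool: report it as a result instead of raising an R error.
      results[i] <- list(paste("unknown tool:", name))
    } else {
      args <- as.list(decode(tool_calls$arguments[i]))
      results[i] <- list(do.call(handler, args))
    }
  }
  results
}

# Hand-made stand-in for parsed$tool_calls (illustrative values).
calls <- data.frame(
  name = c("get_weather", "get_time"),
  arguments = c('{"city":"Paris"}', "{}"),
  id = c("call_1", "call_2"),
  stringsAsFactors = FALSE
)

handlers <- list(get_weather = function(city) paste("Sunny in", city))
fake_decode <- function(json) list(city = "Paris")  # stand-in for a JSON parser

out <- dispatch_tool_calls(calls, handlers, decode = fake_decode)
```

Keying the results by call id makes it straightforward to send each result back to the model as the answer to the matching tool call.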

llama_serve_anthropic()

llama_serve_anthropic() exposes an Anthropic Messages API so Claude Code (or any Anthropic SDK client) runs against local inference. It uses the tool-aware layer from section 6 and streams the Anthropic SSE event sequence. Like the OpenAI server, it needs the optional drogonR package.

It serves POST /v1/messages and
GET /v1/models. Point Claude Code at it via environment
variables:
export ANTHROPIC_BASE_URL=http://127.0.0.1:11435
export ANTHROPIC_API_KEY=local # any non-empty value
claude

enable_thinking (default FALSE) toggles the chat template’s reasoning mode for hybrid thinking models (Qwen3.5, etc.). It is off by default so Claude Code gets direct answers and fast tool calls; set it TRUE (and raise max_tokens) to keep the reasoning trace. A runnable launcher lives at inst/examples/claude_code_launcher.sh.
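
For orientation, the Anthropic streaming format wraps a reply in a fixed sequence of named events. The sketch below lists that sequence for a reply with a single text block and formats one SSE frame. Both helper names are hypothetical, the payload is a hand-written example, and how llamaR builds these events internally is not shown here.

```r
# Event names, in order, for a streamed reply with one text block and
# `n_deltas` text chunks (Anthropic Messages streaming format).
# Real streams may also interleave `ping` events.
anthropic_event_names <- function(n_deltas) {
  c(
    "message_start",
    "content_block_start",
    rep("content_block_delta", n_deltas),
    "content_block_stop",
    "message_delta",
    "message_stop"
  )
}

# One SSE frame: an `event:` line, a `data:` line, then a blank line.
format_sse <- function(event, data_json) {
  sprintf("event: %s\ndata: %s\n\n", event, data_json)
}

events <- anthropic_event_names(2)
frame <- format_sse(
  "content_block_delta",
  '{"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hi"}}'
)
cat(frame)
```

A tool call follows the same outer sequence, with a tool_use content block in place of the text block.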
The server is single-sequence: it handles one
request at a time on the main R thread. That is enough for a single
local user or agent. For parallel sessions, run a pool of servers on
different ports and create one chat_llamar(base_url=) per
worker — the worker-pool architecture is described in
TODO.md.
ports <- c(11434L, 11435L, 11436L)
chats <- lapply(ports, function(p)
chat_llamar(base_url = sprintf("http://127.0.0.1:%d/v1", p)))

- vignette("getting-started"): the rest of the package.
- ?chat_llamar, ?llama_serve_openai, ?llama_serve_anthropic
- ?llama_chat_build, ?llama_chat_parse
- inst/examples/chat.R, inst/examples/serve_openai.R, inst/examples/claude_code_launcher.sh