Chat and Agents

llamaR turns a local GGUF model into a chat backend for the R ecosystem. You can talk to it several ways, from lowest to highest level:

Both servers (llama_serve_openai() and llama_serve_anthropic()) share a tool-aware chat layer (llama_chat_build() / llama_chat_parse()), so tool-calling models work end to end (section 5).

library(llamaR)

1. The chat object: chat_llamar()

chat_llamar() returns an ellmer Chat. It has two modes, selected by which argument you pass: model_path spawns a server, base_url connects to a running one. This mirrors DBI::dbConnect(), which accepts either connection parameters or a ready connection.

Mode A — spawn a server for a model

Give it a model file and it starts llama_serve_openai() in a background process (via the callr package), waits for it to come up, and points a Chat at it. The server’s lifetime is tied to the returned object: when it is garbage-collected (or R exits) the process is killed.

chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")

chat$chat("Why is the sky blue?")

chat_llamar_stop(chat)   # stop the spawned server (or just let GC do it)

Large models can take a while to load from disk. If one (say, a 14B model at Q8) does not come up in time, raise timeout (default 180 seconds):

chat <- chat_llamar(model_path = "Qwen3-14B-Q8_0.gguf", timeout = 300)
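The "wait for it to come up" step can be pictured as a polling loop with a deadline. The sketch below is only an illustration in base R: wait_for_port() is a hypothetical helper, not a llamaR function, and the package's actual readiness check may work differently (for example by querying an HTTP endpoint rather than probing the TCP port).

```r
# Hypothetical helper (not part of llamaR): poll a TCP port until it accepts
# a connection or the deadline passes. Returns TRUE on success, FALSE on timeout.
wait_for_port <- function(port, host = "127.0.0.1", timeout = 180, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    con <- tryCatch(
      suppressWarnings(
        socketConnection(host, port, open = "r+", blocking = TRUE, timeout = 1)
      ),
      error = function(e) NULL   # connection refused: server not up yet
    )
    if (!is.null(con)) {
      close(con)
      return(TRUE)
    }
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Example (commented out so the sketch has no side effects):
# wait_for_port(11434L, timeout = 300)
```

A longer timeout only extends the deadline; a server that comes up quickly is still detected on the first successful probe.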

Mode B — connect to a running server

If you already run a server (in another process, or a pool of them), pass its URL. No process is spawned.

# In another process / shell:
#   llama_serve_openai("model.gguf", port = 11434L)

chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1")
chat$chat("Hello!")
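The choice between the two modes comes down to which argument is non-NULL. A minimal sketch of that dispatch follows; pick_mode() is a hypothetical name, and the rule that supplying both (or neither) argument is an error is an assumption, not documented chat_llamar() behaviour.

```r
# Hypothetical sketch, not llamaR's actual code: decide the mode from
# which argument was supplied.
pick_mode <- function(model_path = NULL, base_url = NULL) {
  has_path <- !is.null(model_path)
  has_url  <- !is.null(base_url)
  # Assumption: exactly one of the two arguments must be given.
  if (has_path == has_url) {
    stop("Supply exactly one of `model_path` or `base_url`.", call. = FALSE)
  }
  if (has_path) "spawn" else "connect"
}

pick_mode(model_path = "model.gguf")                # Mode A
pick_mode(base_url = "http://127.0.0.1:11434/v1")   # Mode B
```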

System prompt

chat <- chat_llamar(
  model_path    = "Ministral-3B-Instruct.gguf",
  system_prompt = "You are a concise assistant. Answer in one sentence."
)
chat$chat("What is R?")

Under the hood. chat_llamar() wraps ellmer::chat_vllm(), which talks to the server’s /v1/chat/completions endpoint — the de-facto standard our server implements. (ellmer’s chat_openai() targets OpenAI’s newer /v1/responses API, which the server does not implement.)


2. The server: llama_serve_openai()

chat_llamar(model_path=) is a convenience wrapper; you can run the server directly for non-R clients. It needs the optional drogonR package for the HTTP/SSE layer.

llama_serve_openai("model.gguf", port = 11434L, n_ctx = 8192L)

It blocks while serving OpenAI-compatible endpoints, including POST /v1/chat/completions and GET /v1/models.

Point any OpenAI client at http://127.0.0.1:11434/v1:

curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'

A runnable launcher lives at inst/examples/serve_openai.R.

Connecting OpenCode

Add an OpenAI-compatible provider in opencode.json (see the one in this repo) with baseURL set to http://127.0.0.1:11434/v1 and the model id matching what /v1/models reports.


3. The command-line example

inst/examples/chat.R wraps both modes for the terminal:

# Spawn a server for the model and open an interactive prompt
Rscript inst/examples/chat.R model.gguf

# Positional [port] [n_ctx], plus flags
Rscript inst/examples/chat.R model.gguf 11434 8192 \
  --system "Be concise." --timeout 300

# One-shot: a trailing message prints a single reply and exits
Rscript inst/examples/chat.R model.gguf "Why is the sky blue?"

# Connect to a server you already started
Rscript inst/examples/chat.R --url http://127.0.0.1:11434/v1

In interactive mode, type a message and press Enter; a blank line or Ctrl-D quits. A spawned server is stopped automatically on exit.


4. ragnar: retrieval-augmented chat

Because chat_llamar() returns a real ellmer::Chat, it plugs into ragnar. Pair it with embed_llamar() (see vignette("getting-started")) for a fully local RAG stack: local embeddings for the store, local generation for the chat.

library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed    = embed_llamar(model = "embedding-model.gguf")
)
# `documents` is assumed to hold chunks prepared beforehand (see the ragnar docs)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)

chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
ragnar_register_tool_retrieve(chat, store)
chat$chat("What do the documents say about X?")

Note. Tool calling is mediated by the chat protocol. Both servers emit tool_calls (via the tool-aware chat layer, see section 5), so a tool-calling model can autonomously invoke the registered retrieve tool. Tool-calling quality is model-dependent: capable models (e.g. Qwen3) drive retrieval reliably; very small models may need manual retrieval.


5. Tool calling: llama_chat_build() / llama_chat_parse()

The servers above call a lower-level, tool-aware chat layer you can also use directly. llama_chat_build() applies the model’s chat template to messages plus tool definitions and returns everything needed to constrain and parse a tool call; llama_chat_parse() turns the raw output back into structured tool calls.

tools <- list(list(
  type = "function",
  "function" = list(
    name        = "get_weather",
    description = "Get the current weather for a city.",
    parameters  = list(
      type       = "object",
      properties = list(city = list(type = "string")),
      required   = list("city")
    )
  )
))

messages <- list(list(role = "user", content = "What's the weather in Paris?"))

# `model` is assumed to be an already loaded model handle (see vignette("getting-started"))
built <- llama_chat_build(model, messages, tools = tools)
# built$prompt   — the formatted prompt to feed the model
# built$grammar  — grammar that constrains tool-call output
# built$format   — format id to pass to llama_chat_parse()
# built$grammar_lazy, built$trigger_patterns, built$trigger_tokens,
# built$additional_stops, built$preserved_tokens, built$parser

Generate against the returned prompt and grammar, then parse. Pass the lazy triggers only when built$grammar_lazy is TRUE:

# `ctx` is assumed to be a generation context created from `model` beforehand
lazy <- isTRUE(built$grammar_lazy)
raw <- llama_generate(
  ctx, built$prompt,
  grammar          = built$grammar,
  trigger_patterns = if (lazy) built$trigger_patterns,
  trigger_tokens   = if (lazy) built$trigger_tokens
)

parsed <- llama_chat_parse(raw, format = built$format, parser = built$parser)
parsed$content        # assistant text (may be empty for a pure tool call)
parsed$tool_calls     # data frame: name, arguments (JSON string), id
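Once the calls are parsed, something has to run them. The sketch below shows one way to dispatch a tool_calls data frame to plain R functions. The names tool_registry and run_tool_calls() are hypothetical. The shape of the returned message (role = "tool" with a tool_call_id) follows the OpenAI convention and is an assumption about what the next llama_chat_build() call expects. The jsonlite package is also assumed to be installed, and it is only touched if the default parser is used.

```r
# Hypothetical registry of R functions the model is allowed to call.
tool_registry <- list(
  get_weather = function(city) sprintf("Sunny in %s", city)  # stub, no real lookup
)

# Run each parsed tool call and return one "tool" message per call.
# `tool_calls` is assumed to be a data frame with character columns
# name, arguments (a JSON object string) and id, as described above.
run_tool_calls <- function(tool_calls, registry, parse_json = jsonlite::fromJSON) {
  lapply(seq_len(nrow(tool_calls)), function(i) {
    fn <- registry[[tool_calls$name[i]]]
    if (is.null(fn)) {
      # Report unknown tools back to the model instead of failing.
      result <- sprintf("Unknown tool: %s", tool_calls$name[i])
    } else {
      args   <- as.list(parse_json(tool_calls$arguments[i]))
      result <- do.call(fn, args)
    }
    list(
      role         = "tool",
      tool_call_id = tool_calls$id[i],
      content      = as.character(result)
    )
  })
}

# Usage: append the results to `messages` and build the next prompt.
# messages <- c(messages, run_tool_calls(parsed$tool_calls, tool_registry))
```

Looking tools up by name in an explicit registry keeps the model from invoking arbitrary R functions.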

Tool-calling quality is model-dependent: capable models (e.g. Qwen3) emit clean tool calls; very small models are less reliable.


6. Claude Code on a local model: llama_serve_anthropic()

llama_serve_anthropic() exposes an Anthropic Messages API so Claude Code (or any Anthropic SDK client) runs against local inference. It uses the tool-aware layer from section 5 and streams the Anthropic SSE event sequence. Like the OpenAI server it needs the optional drogonR package.

llama_serve_anthropic("model.gguf", port = 11435L, n_ctx = 32768L)

It serves POST /v1/messages and GET /v1/models. Point Claude Code at it via environment variables:

export ANTHROPIC_BASE_URL=http://127.0.0.1:11435
export ANTHROPIC_API_KEY=local   # any non-empty value
claude

enable_thinking (default FALSE) toggles the chat template’s reasoning mode for hybrid thinking models (Qwen3.5, etc.). It is off by default so that Claude Code gets direct answers and fast tool calls; set it to TRUE (and raise max_tokens) to keep the reasoning trace. A runnable launcher lives at inst/examples/claude_code_launcher.sh.


7. Concurrency

The server is single-sequence: it handles one request at a time on the main R thread. That is enough for a single local user or agent. For parallel sessions, run a pool of servers on different ports and create one chat_llamar(base_url=) per worker — the worker-pool architecture is described in TODO.md.

ports <- c(11434L, 11435L, 11436L)
chats <- lapply(ports, function(p)
  chat_llamar(base_url = sprintf("http://127.0.0.1:%d/v1", p)))
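With a pool in place, work still has to be assigned to servers. A simple round-robin assignment is sketched below in base R; the prompts are made up for illustration, and this is not the worker-pool design from TODO.md.

```r
ports   <- c(11434L, 11435L, 11436L)
prompts <- c("Summarise A", "Summarise B", "Summarise C",
             "Summarise D", "Summarise E")   # illustrative inputs

# Worker index for each prompt: 1, 2, 3, 1, 2, ...
worker <- (seq_along(prompts) - 1L) %% length(ports) + 1L

# One batch of prompts per server, keyed by that server's base URL.
batches <- split(prompts, worker)
names(batches) <- sprintf("http://127.0.0.1:%d/v1",
                          ports[as.integer(names(batches))])
batches
```

Each batch would then be sent through the chat object for its URL. Because every server handles one request at a time, real parallelism requires issuing the batches from separate R processes or an async framework, which this sketch leaves out.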

See also