Local inference command center. Manage llama-server and vLLM backends, hot-swap models, monitor GPU, run agents, and browse models — all from one daemon + CLI + live dashboard.
```sh
rookery status                 # server state + uptime
rookery gpu                    # VRAM, temp, power, processes
rookery start                  # start default profile
rookery swap qwen_thinking     # hot-swap to another model profile
rookery bench                  # quick PP + gen speed benchmark
rookery agent start my_agent   # start a managed agent
```

Open the dashboard at your configured address (default http://localhost:3000) — live GPU gauges, profile switcher, agent controls, chat playground, model browser.
See Installation below for setup instructions.
More screenshots
Settings — profile switcher and sampling param editor
Agents — agent cards, controls, filtered logs
Models — hardware profile, HuggingFace search, cached models
Logs — live streaming log viewer
- Multi-backend — manage llama-server (GGUF) and vLLM (safetensors, AWQ, GPTQ, NVFP4) from the same config
- Hot-swap — switch between model profiles without restarting the daemon
- Live dashboard — Leptos WASM frontend with 7 tabs: Overview, Settings, Agents, Chat, Bench, Logs, Models
- GPU monitoring — real-time VRAM, temperature, utilization, power draw, per-process memory via NVML
- Agent management — spawn, stop, update, and watchdog external processes like Hermes (multi-platform AI agent with tool calling, web browsing, vision, and voice), coding assistants, or any service that depends on inference
- Model discovery — search HuggingFace, browse quants, VRAM-aware recommendations, one-click download
- Upstream release monitor — background polling of llama.cpp and vLLM releases with version comparison, dashboard banner, and `rookery releases` CLI
- Auto-sleep — unloads the model after idle timeout, wakes transparently on next request
- Inference canary — periodic health checks detect CUDA zombies and auto-restart
- Prometheus metrics — `/metrics` endpoint for GPU, server, agent, and canary telemetry
- Optional API key auth — single bearer token protects API and SSE data routes (dashboard shell is public, data requires auth)
- systemd integration — OOM protection, journal logging, graceful shutdown
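As a sketch of how the auth split works in practice (the port and the `ROOKERY_API_KEY` variable here are illustrative, not part of rookery's documented interface):

```shell
# Health and metrics are always reachable without a token:
curl -s http://localhost:3000/api/health
curl -s http://localhost:3000/metrics | head

# When api_key is configured, data routes require the bearer token:
curl -s -H "Authorization: Bearer $ROOKERY_API_KEY" \
  http://localhost:3000/api/status
```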
| Feature | rookery | llama-swap | GPUStack | LocalAI |
|---|---|---|---|---|
| Hot-swap profiles | Yes | Yes | No | No |
| Multi-backend (llama.cpp + vLLM) | Yes | No | Partial | Yes |
| Live dashboard | Yes (WASM) | No | Yes | No |
| Agent lifecycle management | Yes | No | No | No |
| Model browser + download | Yes | No | Yes | Yes |
| VRAM-aware recommendations | Yes | No | Yes | No |
| Auto-sleep / wake-on-request | Yes | Yes | No | No |
| Inference canary + auto-restart | Yes | No | Yes | No |
| Prometheus metrics | Yes | No | Yes | Yes |
| Single binary + embedded dashboard | Yes | Yes | No | No |
Run Hermes with a dense model for reliable tool calling. Rookery manages the full lifecycle — auto-starts on boot, restarts on crash, bounces on model swap:
```sh
rookery start qwen_dense       # 27B Q6 for best tool accuracy
rookery agent start hermes     # AI agent with crash watchdog
rookery agent describe hermes  # check health, uptime, restarts
```

Hot-swap between models without restarting anything:
```sh
rookery start qwen_fast        # MoE at ~196 tok/s
rookery bench                  # measure performance
rookery swap qwen_dense        # switch to dense 27B
rookery bench                  # compare
```

Run 24/7 with minimal power draw when idle:
```toml
auto_start = true
idle_timeout = 1800   # unload after 30 min idle
```

The model unloads after inactivity. Next API request wakes it transparently.
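For instance, a hypothetical wake-on-request call through the daemon's chat proxy (the request body shape is an assumption, modeled on OpenAI-style messages; adjust the address to your config):

```shell
# A request to the chat proxy wakes a sleeping model transparently;
# the first token is simply delayed by model load time.
curl -s -X POST http://localhost:3000/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```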
Find the best quant for your GPU without leaving the terminal:
```sh
rookery models search Qwen3.5-27B
rookery models quants Qwen3.5-27B  # shows VRAM fit + estimated tok/s
rookery models pull Qwen3.5-27B    # downloads best-fit quant
```

Note: Review scripts before piping to sh. See the install script source.
```sh
curl -fsSL https://raw.githubusercontent.com/lance0/rookery/main/install.sh | sh
```

Installs binaries to /usr/local/bin and seeds a default config at ~/.config/rookery/config.toml.
Download from GitHub Releases:
| Platform | Target |
|---|---|
| Linux x86_64 | rookery-x86_64-unknown-linux-gnu.tar.gz |
| Linux ARM64 | rookery-aarch64-unknown-linux-gnu.tar.gz |
```sh
curl -LO https://github.com/lance0/rookery/releases/latest/download/rookery-x86_64-unknown-linux-gnu.tar.gz
tar xzf rookery-*.tar.gz
sudo mv rookeryd rookery /usr/local/bin/
```

Requires Rust 1.88+ and an NVIDIA GPU with CUDA drivers.
```sh
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Build and install
git clone https://github.com/lance0/rookery.git
cd rookery
sudo make install
```

This builds both binaries, installs them to /usr/local/bin, and sets up a systemd unit. Customize with:

```sh
sudo make install PREFIX=/opt/rookery SERVICE_USER=myuser HF_HOME=/mnt/models
```

Config file: ~/.config/rookery/config.toml
Models define what to run, profiles define how to run it. Multiple profiles can share a model.
```toml
llama_server = "/path/to/llama-server"
default_profile = "qwen_fast"
auto_start = true
idle_timeout = 1800

[models.qwen35]
source = "hf"
repo = "unsloth/Qwen3.5-35B-A3B-GGUF"
file = "UD-Q4_K_XL"
estimated_vram_mb = 25800

[profiles.qwen_fast]
model = "qwen35"
aliases = ["fast", "moe"]
port = 8081

[profiles.qwen_fast.llama_server]
ctx_size = 262144
flash_attention = true
reasoning_budget = 0
temp = 0.7
top_p = 0.8
```

Agents are external processes managed alongside the server:
```toml
[agents.hermes]
command = "/path/to/hermes"
args = ["gateway", "run", "--replace"]
auto_start = true
restart_on_swap = true
restart_on_crash = true
depends_on_port = 8081
restart_on_error_patterns = ["ConnectionError", "ReadTimeout"]
```

See config.example.toml for all options including vLLM backend, KV cache tuning, and API key auth.
Full reference: docs/configuration.md
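Because profiles reference a model by name, a second profile can reuse the same weights with different serving parameters. A sketch, reusing the `qwen35` model defined earlier (the profile name and values are illustrative):

```toml
# Hypothetical second profile sharing the qwen35 model,
# trading context size for different sampling settings.
[profiles.qwen_thinking]
model = "qwen35"
aliases = ["thinking"]
port = 8082

[profiles.qwen_thinking.llama_server]
ctx_size = 65536
flash_attention = true
temp = 0.6
```

`rookery swap qwen_thinking` then switches between the two without reloading config or restarting the daemon.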
The embedded dashboard runs at your configured listen address. Seven tabs with keyboard shortcuts:
| Tab | Key | Purpose |
|---|---|---|
| Overview | 1 | GPU gauges, server status, model info, agent summary |
| Settings | 2 | Profile switcher, sampling param editor |
| Agents | 3 | Agent cards, controls, watchdog state, filtered logs |
| Chat | 4 | Streaming chat playground with abort |
| Bench | 5 | PP + gen speed benchmark |
| Logs | 6 | Live log viewer |
| Models | 7 | Search HF, browse quants, download |
Additional shortcuts: s start, x stop, t toggle theme.
```sh
rookery status                 # server state, profile, PID, uptime
rookery gpu                    # VRAM, temp, utilization, power, processes
rookery start [profile]        # start server (or default profile)
rookery stop                   # stop server
rookery sleep                  # unload model, keep profile for fast wake
rookery wake                   # wake sleeping profile
rookery swap <profile>         # hot-swap to another profile
rookery profiles               # list available profiles
rookery bench                  # PP + gen speed benchmark
rookery logs [-f] [-n N]       # fetch or follow log lines

rookery agent start <name>     # start a managed agent
rookery agent stop <name>      # stop a managed agent
rookery agent update <name>    # stop, update, restart
rookery agent status           # list agents
rookery agent describe <name>  # detailed health, watchdog, errors

rookery models search <q>      # search HuggingFace
rookery models quants <repo>   # list quants with VRAM fit
rookery models pull <repo>     # download best-fit quant
rookery models list            # locally cached models
rookery models hardware        # GPU/CPU/RAM profile

rookery releases               # upstream release status (llama.cpp, vLLM)
rookery config                 # validate config
rookery auth generate          # generate a random API key
rookery completions <shell>    # generate shell completions
```
Most commands support --json for scripting.
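The --json output composes with standard tooling. A sketch (the `.state` field name is an assumption; check the actual output of your version):

```shell
# Start the agent only once the server reports a running state
# (field name assumed for illustration).
state=$(rookery status --json | jq -r '.state')
if [ "$state" = "running" ]; then
  rookery agent start hermes
fi
```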
The daemon exposes a REST API. When api_key is configured, all /api/* data routes and SSE require Authorization: Bearer <key>. Exempt: /api/health, /metrics, and the dashboard HTML shell (which loads but shows an auth prompt before fetching data).
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Daemon health check (always open) |
| /api/status | GET | Server state, profile, PID, uptime |
| /api/gpu | GET | GPU stats (VRAM, temp, utilization, power, processes) |
| /api/start | POST | Start server `{ "profile": "name" }` |
| /api/stop | POST | Stop server |
| /api/sleep | POST | Put server into sleeping state |
| /api/wake | POST | Wake sleeping profile |
| /api/swap | POST | Hot-swap `{ "profile": "name" }` |
| /api/profiles | GET | List available profiles |
| /api/bench | GET | Run PP + gen benchmark |
| /api/logs | GET | Fetch log lines `?n=50` |
| /api/events | GET | SSE stream (gpu, state, log events) |
| /api/chat | POST | Streaming chat proxy (auto-wakes sleeping backends) |
| /api/agents | GET | List agents with health metrics |
| /api/agents/start | POST | Start agent `{ "name": "..." }` |
| /api/agents/stop | POST | Stop agent |
| /api/agents/{name}/update | POST | Stop, update, restart agent |
| /api/agents/{name}/health | GET | Detailed health (watchdog, backoff, deps) |
| /api/config | GET | Full config (secrets redacted) |
| /api/config/profile/{name} | PUT | Update profile sampling params |
| /api/model-info | GET | Model ID, context window |
| /api/server-stats | GET | Slot status, request count |
| /api/hardware | GET | Hardware profile (GPU, CPU, RAM) |
| /api/models/search | GET | Search HuggingFace `?q=query` |
| /api/models/quants | GET | List quants `?repo=name` |
| /api/models/cached | GET | Locally cached models |
| /api/models/pull | POST | Download model `{ "repo": "...", "quant": "..." }` |
| /metrics | GET | Prometheus/OpenMetrics (always open) |
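A minimal scripted hot-swap against this API might look like the following sketch (localhost:3000 and the `API_KEY` variable are illustrative):

```shell
# Swap profiles over the REST API; the Authorization header is only
# needed when api_key is set in the config.
curl -s -X POST http://localhost:3000/api/swap \
  -H "Authorization: Bearer $API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"profile":"qwen_dense"}'
```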
```
crates/
  rookery-core/       # config, state machine, shared types
  rookery-engine/     # process manager, GPU monitor, health checker, agent manager
  rookery-daemon/     # axum REST API, SSE, auth middleware, embedded dashboard
  rookery-dashboard/  # Leptos WASM frontend (built with trunk, embedded into daemon)
  rookery-cli/        # clap CLI client
```
Two binaries:

- `rookeryd` — long-running daemon (axum REST API + embedded dashboard)
- `rookery` — thin CLI that talks to the daemon over HTTP
The daemon reconciles persisted state on startup, adopts orphan processes, auto-starts configured agents, and cleans up stale llama-servers. The InferenceBackend trait abstracts over llama-server and vLLM backends.
| Platform | Status |
|---|---|
| Linux x86_64 + NVIDIA GPU | Supported |
| Linux ARM64 + NVIDIA GPU | Supported (Jetson, etc.) |
| AMD GPUs (ROCm) | Not tested |
| macOS (Metal) | Not supported (no NVML) |
- Quick Start — build, configure, run
- Configuration — full config reference
- Agent Management — watchdog, crash recovery, update
- API Reference — REST endpoints
- CLI Reference — all commands and flags
- Models — model sources, discovery, download
- GPU Guide — model recommendations by GPU tier
- Compatible Agents — OpenClaw, SillyTavern, AnythingLLM, and more
- Dashboard — tabs, shortcuts, features
- Architecture — crate structure, design decisions
- vLLM Integration — Docker backend
See CONTRIBUTING.md for development setup, code style, and PR guidelines.
Licensed under either of:
at your option.
