Lightweight LLM access layer with provider normalization, caching, and observability. It uses a unified endpoint format (model@provider) so application code can switch providers without rewriting call sites, while still allowing developers to choose the model provider or compatible endpoint they want to use.
UniLLM is the model-access layer in the wider Unity stack:
User (Console/Phone/SMS/Email)
│
┌─────────────────┴──────────────────┐
│ Communication │
│ (Webhooks, Voice, SMS, Email) │
└────┬───────────────────────────────┘
│
┌────┴────┐ ┌─────────┐ ┌─────────┐
│ Unity │ │ Unify │ │Orchestra│
│ (Brain) │───▶│ (SDK) │───▶│ (API) │
│ │ │ │ │ (DB) │
└────┬────┘ └────┬────┘ └────┬────┘
│ ▲ ▲
│ │ │
│ ┌─────────┴─┐ ┌────┴───────┐
└───▶│ UniLLM │ │ Console │
│ (LLM API) │ │(Interfaces)│
└───────────┘ └────────────┘
This repo (UniLLM) handles LLM inference for Unity. It normalizes requests across providers (OpenAI, Anthropic, Vertex AI, etc.), provides response caching for test determinism, and can integrate with Unify for logging and billing context.
If you're here from the Unity quickstart, this is the layer that talks to model providers. OpenAI and Anthropic are the simplest documented paths, but the point of UniLLM is that the provider choice is yours, including other supported providers and compatible local endpoints.
Related repositories:
- Unity — AI assistant brain (primary consumer)
- Unify — Python SDK for logging and persistence
- Orchestra — Backend API and database
```bash
pip install unifyai-unillm
```

Or with uv:

```bash
uv add unifyai-unillm
```

Set credentials for whichever providers you want to use:

```bash
export OPENAI_API_KEY=<your-key>
export ANTHROPIC_API_KEY=<your-key>
# ... other provider keys
```

You do not need a Unify account for basic inference. UNIFY_KEY only matters for optional logging, credit, and observability features that integrate with the wider Unify stack.
For Vertex AI models (Gemini, Claude on Vertex, etc.), authenticate using Google Cloud Application Default Credentials:

```bash
# One-time setup: authenticate with your Google Cloud account
gcloud auth application-default login

# Set your GCP project and location
export VERTEXAI_PROJECT=<your-project-id>
export VERTEXAI_LOCATION=<your-location>  # e.g., us-central1, europe-west1
```

Alternatively, use a service account JSON file:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

Basic usage with the sync and async clients:

```python
import unillm

# Sync client
client = unillm.Unify("gpt-4o@openai")
response = client.generate(
messages=[{"role": "user", "content": "Hello!"}]
)
# Async client
async_client = unillm.AsyncUnify("claude-sonnet-4-20250514@anthropic")
response = await async_client.generate(
    messages=[{"role": "user", "content": "Hello!"}]
)
```

All models use a consistent `model@provider` format:

```python
client = unillm.Unify("gpt-4o@openai")
client = unillm.Unify("claude-sonnet-4-20250514@anthropic")
client = unillm.Unify("gemini-2.0-flash@vertexai")Automatic handling of provider quirks (message format normalization, parameter translation, etc.) before requests are sent.
Built-in caching to avoid redundant LLM calls:
client = unillm.Unify("gpt-4o@openai", cache=True)
# Cache modes
client.generate(..., cache="read") # Read from cache only if available
client.generate(..., cache="write") # Write to cache only
client.generate(..., cache="both") # Read and write
client.generate(..., cache="read-only") # Must be in cache, else errorTrack cache hit/miss status for observability:
from unillm import capture_cache_events
with capture_cache_events() as events:
    client.generate(messages=[...])

print(events[0]["cache_status"])  # "hit" or "miss"
```

Responses can also be streamed:

```python
client = unillm.Unify("gpt-4o@openai", stream=True)
for chunk in client.generate(messages=[...]):
print(chunk, end="")client = unillm.Unify("gpt-4o@openai", stateful=True)
client.generate(user_message="What is 2+2?")
client.generate(user_message="And what is that times 3?") # Maintains historytools = [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {"type": "object", "properties": {...}}
}
}]

response = client.generate(messages=[...], tools=tools)
```
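
The parameters schema above is abbreviated. Written out in full it is a standard JSON Schema object; the get_weather fields below are illustrative only, not something UniLLM defines:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                # Hypothetical arguments for the example tool
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]
```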

Structured outputs can be requested by passing a Pydantic model as response_format:

```python
from pydantic import BaseModel

class Answer(BaseModel):
    value: int
    explanation: str

response = client.generate(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    response_format=Answer
)
```

Terminal and file logging are independently controlled:

```bash
# Terminal (console) output (default: true)
export UNILLM_TERMINAL_LOG=true
# File-based traces (independent of terminal)
export UNILLM_LOG_DIR=/path/to/logs
```

When UNILLM_LOG_DIR is set, structured log files are written:

- During call (cache enabled): `{base}.cache_pending.txt`
- During call (cache disabled): `{base}.pending.txt`
- After completion: `{base}.cache_hit.txt` or `{base}.cache_miss.txt` (cache enabled), or `{base}.txt` (cache disabled)

Pending files remain as evidence if an LLM call hangs or crashes.

```bash
# Enable OTel tracing
export UNILLM_OTEL=true
# OTLP endpoint (optional)
export UNILLM_OTEL_ENDPOINT=http://localhost:4317
# File-based span export (optional)
export UNILLM_OTEL_LOG_DIR=/path/to/traces
```

LLM calls create OTel spans that can be correlated with parent application spans and propagated to child services.
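
If your application already creates spans with the OpenTelemetry SDK, running a UniLLM call inside an active span is what ties the two together. A minimal sketch, assuming UNILLM_OTEL=true and a tracer configured elsewhere in the application (the tracer and span names here are made up):

```python
from opentelemetry import trace

import unillm

tracer = trace.get_tracer("my-app")  # hypothetical tracer name
client = unillm.Unify("gpt-4o@openai")

# The span UniLLM creates for this call becomes a child of
# "handle-user-request", so both appear in the same trace.
with tracer.start_as_current_span("handle-user-request"):
    client.generate(messages=[{"role": "user", "content": "Hello!"}])
```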
When UNILLM_LOG_DIR is set, full request and response payloads are written to disk. This includes user messages, tool arguments, and model responses — which may contain PII or sensitive data. Be mindful of this when enabling file logging in production environments.
- OpenAI
- Anthropic
- Vertex AI (Google)
- Bedrock (AWS)
- DeepSeek
- Groq
- Mistral
- Replicate
- Together AI
- xAI
unillm/
├── __init__.py # Public API exports
├── settings.py # Configuration via env vars
├── helpers.py # Utility functions
├── costs.py # Provider cost computation
├── cost_tracker.py # Per-call cost event capture
├── tokens.py # Token counting and context window utilities
├── cache_events.py # Cache hit/miss event capture
├── llm_events.py # LLM event hooks for observability
├── limit_hooks.py # Spending limit check callbacks
├── logger.py # File logging and OTel tracing
├── clients/ # LLM client implementations
│ ├── base.py # Base client class
│ ├── uni_llm.py # Unify (sync) and AsyncUnify clients
│ ├── provider_preprocessing.py # Provider-specific request normalization
│ ├── provider_postprocessing.py # Response validation and retries
│ └── shared_session.py # Shared aiohttp session management
├── caching/ # Response caching system
│ ├── base_cache.py # Abstract cache backend
│ ├── local_cache.py # File-based NDJSON cache
│ └── local_separate_cache.py # Split read/write cache (for CI)
├── endpoints/ # Provider-specific model mappings
│ ├── openai.py
│ ├── anthropic.py
│ └── ...
└── types/ # Type definitions
├── cache.py
├── prompt.py
└── prompt_caching.py
This project uses uv for dependency management. By default, local development installs the published unifyai package for the persistence/logging dependency.

```bash
git clone https://github.com/unifyai/unillm.git
cd unillm
uv sync
```

If you're iterating on a sibling checkout of unify as well, override the installed dependency with an editable install after `uv sync`:

```bash
uv pip install -e ../unify
```

Tests require at minimum an OPENAI_API_KEY or ANTHROPIC_API_KEY set in your environment (or a .env file). With a populated .cache.ndjson, cached LLM responses are replayed, so tests run fast and deterministically without making real LLM calls.

```bash
uv run pytest tests/ -v
```

UNIFY_KEY is optional. If set, credit deduction runs against the Unify API. If unset, credit deduction is skipped with a warning and tests continue normally.
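
As a sketch of how the cache keeps CI deterministic, a test can force replay with the read-only mode described earlier, so a missing cache entry fails loudly instead of silently hitting the provider (this assumes generate returns the completion text for non-streaming calls):

```python
import unillm

def test_greeting_replays_from_cache():
    # "read-only" never contacts the real provider: it errors on a cache
    # miss, so the test stays deterministic while .cache.ndjson is populated.
    client = unillm.Unify("gpt-4o@openai", cache=True)
    response = client.generate(
        messages=[{"role": "user", "content": "Hello!"}],
        cache="read-only",
    )
    assert response  # the replayed completion
```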
Tests are opt-in to reduce GitHub Actions costs and only run when explicitly requested:

- Commit message: Include `[run-tests]` in your commit message
- PR title: Include `[run-tests]` in your pull request title
- Manual trigger: Use the "Run workflow" button in GitHub Actions
Examples:

```bash
# Run tests on this commit
git commit -m "Fix caching logic [run-tests]"

# No tests (default)
git commit -m "Update README"
```

Note: The Black formatting check always runs on every push.
Some CI steps (local Orchestra deployment, GCP authentication) are internal infrastructure for the Unify team and are automatically skipped on external forks.
Pre-commit hooks run automatically on git commit (Black, isort, autoflake). If a commit fails due to auto-formatting, re-run the commit.

```bash
pre-commit install
```

MIT — see LICENSE for details.