FastAPI server providing CTranslate2 inference with an OpenAI-compatible API.
- Speech-to-text: Whisper models via faster-whisper, using the same openai/whisper-* model IDs as the OpenAI API.
- LLMs: any model compatible with CTranslate2 (Gemma 3, Qwen, LLaMA, Mistral, and more). Models are converted from HuggingFace to CTranslate2 format (int8) on first use.
- Docker
- NVIDIA Container Toolkit (GPU only)
Images are published to the GitHub Container Registry on every push to main.
Pull and run the CPU image:
docker pull ghcr.io/jordimas/ctranslate2-web-server-cpu:latest
docker run --rm -p 8015:8015 -e HF_TOKEN=$HF_TOKEN ghcr.io/jordimas/ctranslate2-web-server-cpu:latest

For GPU:
docker pull ghcr.io/jordimas/ctranslate2-web-server-gpu:latest
docker run --rm --gpus all -p 8015:8015 -e HF_TOKEN=$HF_TOKEN ghcr.io/jordimas/ctranslate2-web-server-gpu:latest

Note: HF_TOKEN is required to download gated models such as Gemma from HuggingFace. Set it in your environment (export HF_TOKEN=your_token) or pass it directly with -e HF_TOKEN=your_token. You can create a token at huggingface.co/settings/tokens after accepting the model's license.
Chat completion:
curl http://localhost:8015/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-270m-it", "messages": [{"role": "user", "content": "Hello!"}]}'

Audio transcription:
curl http://localhost:8015/v1/audio/transcriptions \
-F "model=openai/whisper-large-v3" \
-F "file=@speech.mp3" \
-F "language=en"

Chat completion with the OpenAI Python client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8015/v1", api_key="unused")
response = client.chat.completions.create(
    model="google/gemma-3-270m-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Audio transcription: see sample/transcribe.py, which works against both OpenAI and this server via --url:

uv run sample/transcribe.py --url http://localhost:8015/v1 --model openai/whisper-large-v3 speech.mp3

sample/eval_flores_en_ca.py evaluates English-to-Catalan translation on FLORES-200, scored with BLEU. Switching to this server requires only a --url flag:
python sample/eval_flores_en_ca.py --model gpt-4o-mini
python sample/eval_flores_en_ca.py --url http://localhost:8015/v1 --model google/gemma-3-4b-it

| File | Base | Use case |
|---|---|---|
| Dockerfile.cpu | python:3.14-slim | Standard CPU image |
| Dockerfile.gpu | nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04 | NVIDIA GPU image |
make build-cpu # CPU image
make build-gpu # GPU image
make build # both

To bake one or more models into the image at build time, pass BUILD_MODELS as a space-separated list of HuggingFace model IDs:
make build-cpu BUILD_MODELS="google/gemma-3-270m-it"
make build-cpu BUILD_MODELS="google/gemma-3-270m-it google/gemma-3-4b-it"

The models are downloaded, converted to CTranslate2 int8 format, and stored inside the image under /models. When the container starts, those models are available immediately, with no conversion step on the first request.
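As a rough illustration of the bake step, the sketch below splits a BUILD_MODELS value and builds one ct2-transformers-converter command per model. The helper names and the per-model directory naming under /models are assumptions made for this example; the image's actual build script may work differently.

```python
import shlex


def parse_build_models(value: str) -> list[str]:
    """Split a space-separated BUILD_MODELS string into model IDs."""
    return value.split()


def converter_command(model_id: str, models_dir: str = "/models") -> list[str]:
    """Build the argv for CTranslate2's Transformers converter CLI."""
    # Hypothetical layout: one directory per model, with "/" replaced by "--".
    output_dir = f"{models_dir}/{model_id.replace('/', '--')}"
    return [
        "ct2-transformers-converter",
        "--model", model_id,
        "--output_dir", output_dir,
        "--quantization", "int8",
    ]


if __name__ == "__main__":
    build_models = "google/gemma-3-270m-it google/gemma-3-4b-it"
    for model_id in parse_build_models(build_models):
        # Print the command instead of running it, so the sketch has no side effects.
        print(shlex.join(converter_command(model_id)))
```

Printing the commands rather than executing them keeps the example safe to run on a machine without CTranslate2 installed.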
You can also pass the build arg directly to Docker:
docker build -f Dockerfile.cpu \
--build-arg BUILD_MODELS="google/gemma-3-270m-it" \
-t ctranslate2-web-server-cpu .

Note: HF_TOKEN must be set in the environment if the models require authentication (e.g. gated models). Pass it with --build-arg HF_TOKEN=$HF_TOKEN.
make run-cpu # CPU
make run-gpu # GPU (requires NVIDIA runtime)

Models are stored inside the image under /models.
GET /v1/models
Returns all models available on HuggingFace that are supported by CTranslate2 (e.g. Gemma 3, Qwen, LLaMA, Mistral).
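A minimal Python sketch for calling this endpoint with the standard library follows. It assumes the response uses the OpenAI list shape (a top-level data array whose entries carry an id field); that shape is an assumption based on the OpenAI-compatible API, and the function names are illustrative only.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Pull model IDs out of a models response.

    Assumes the OpenAI list shape: {"object": "list", "data": [{"id": ...}, ...]}.
    """
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8015/v1", timeout: float = 30.0) -> list[str]:
    """Fetch GET /v1/models from a running server and return the model IDs."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return extract_model_ids(json.load(response))
```

With the server running locally, list_models() would return the IDs as a plain list of strings.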
curl http://localhost:8015/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-270m-it", "prompt": "Once upon a time", "max_tokens": 100}'

curl http://localhost:8015/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-270m-it", "messages": [{"role": "user", "content": "Hello!"}]}'

The first request for a model triggers an automatic download and conversion. Subsequent requests use the cached converted model.
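Because the first request for a model includes the download and conversion, a client may want a much longer timeout than usual. The sketch below sends a text completion with the standard library; the 30-minute timeout is an arbitrary assumption, the function names are illustrative, and the response parsing assumes the OpenAI completions shape (choices[0].text).

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 100
) -> urllib.request.Request:
    """Build a POST request for the /completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, timeout: float = 1800.0) -> str:
    """Send the request and return the generated text.

    The generous default timeout (an assumption) leaves room for the
    download and conversion that the first request for a model triggers.
    """
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumes the OpenAI completions response shape.
    return payload["choices"][0]["text"]
```

Once the converted model is cached, the same call can use a much shorter timeout.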
curl http://localhost:8015/v1/audio/transcriptions \
-F "model=openai/whisper-large-v3" \
-F "file=@speech.mp3" \
-F "language=en"

Supported model IDs: openai/whisper-large-v3, openai/whisper-large-v2, openai/whisper-large, openai/whisper-medium, openai/whisper-small, openai/whisper-base, openai/whisper-tiny. The alias whisper-1 maps to openai/whisper-large-v3.
Models are downloaded automatically on first use.
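The alias rule can be mirrored on the client side to validate a model name before uploading audio. This is only a sketch of the mapping described above, not the server's actual code; the constant and function names are hypothetical.

```python
# Model IDs listed as supported in this README.
SUPPORTED_WHISPER_MODELS = (
    "openai/whisper-large-v3",
    "openai/whisper-large-v2",
    "openai/whisper-large",
    "openai/whisper-medium",
    "openai/whisper-small",
    "openai/whisper-base",
    "openai/whisper-tiny",
)

# The OpenAI API name maps to the largest supported model.
WHISPER_ALIASES = {"whisper-1": "openai/whisper-large-v3"}


def resolve_whisper_model(model: str) -> str:
    """Return the full model ID for a name or alias, or raise ValueError."""
    resolved = WHISPER_ALIASES.get(model, model)
    if resolved not in SUPPORTED_WHISPER_MODELS:
        raise ValueError(f"unsupported Whisper model: {model}")
    return resolved
```

Checking the name locally avoids sending a large audio file only to have the request rejected.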
| Variable | Default | Description |
|---|---|---|
| MODELS_DIR | /models | Directory to store converted models |
| DEVICE | cpu / cuda | Inference device (set by image) |
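For illustration, a process could read these two variables as in the sketch below. The Settings class, the load_settings name, and the fallback to cpu when DEVICE is unset are assumptions made for this example; the server's own configuration code is not shown in this README.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    models_dir: str
    device: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read MODELS_DIR and DEVICE, applying the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: fall back to "cpu" when the image has not set DEVICE.
    device = env.get("DEVICE", "cpu")
    if device not in ("cpu", "cuda"):
        raise ValueError(f"DEVICE must be 'cpu' or 'cuda', got {device!r}")
    return Settings(models_dir=env.get("MODELS_DIR", "/models"), device=device)
```

Passing a plain dictionary instead of os.environ makes the function easy to exercise without touching the real environment.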