ctranslate2-web-server

FastAPI server providing CTranslate2 inference with an OpenAI-compatible API.

Speech-to-text: Whisper models via faster-whisper, addressed by their HuggingFace openai/whisper-* model IDs; the OpenAI alias whisper-1 is also accepted.
LLMs: any model compatible with CTranslate2 (Gemma 3, Qwen, LLaMA, Mistral, and more). Models are downloaded from HuggingFace and converted to CTranslate2 int8 format on first use.

Requirements

Docker
NVIDIA Container Toolkit (GPU only)

Pre-built images (quick start)

Images are published to the GitHub Container Registry on every push to main.

Pull and run the CPU image:

docker pull ghcr.io/jordimas/ctranslate2-web-server-cpu:latest
docker run --rm -p 8015:8015 -e HF_TOKEN=$HF_TOKEN ghcr.io/jordimas/ctranslate2-web-server-cpu:latest

For GPU:

docker pull ghcr.io/jordimas/ctranslate2-web-server-gpu:latest
docker run --rm --gpus all -p 8015:8015 -e HF_TOKEN=$HF_TOKEN ghcr.io/jordimas/ctranslate2-web-server-gpu:latest

Note: HF_TOKEN is required to download gated models such as Gemma from HuggingFace. Set it in your environment (export HF_TOKEN=your_token) or pass it directly with -e HF_TOKEN=your_token. Create a token at huggingface.co/settings/tokens, and accept the model's license on its HuggingFace page before the first download.

Using curl

Chat completion:

curl http://localhost:8015/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-270m-it", "messages": [{"role": "user", "content": "Hello!"}]}'

Audio transcription:

curl http://localhost:8015/v1/audio/transcriptions \
  -F "model=openai/whisper-large-v3" \
  -F "file=@speech.mp3" \
  -F "language=en"

Using the OpenAI Python SDK

Chat completion:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8015/v1", api_key="unused")

response = client.chat.completions.create(
    model="google/gemma-3-270m-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Audio transcription: see sample/transcribe.py, which works against both OpenAI and this server via --url:

uv run sample/transcribe.py --url http://localhost:8015/v1 --model openai/whisper-large-v3 speech.mp3
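
For reference, a transcription call through the OpenAI Python SDK could look roughly like the sketch below. It is not the content of sample/transcribe.py: the helper name transcribe_file is hypothetical, and it assumes the server accepts the same model, file, and language fields as the curl example and returns a JSON body with a text field.

```python
from pathlib import Path


def transcribe_file(client, audio_path, model="openai/whisper-large-v3", language="en"):
    """Send one audio file to /v1/audio/transcriptions and return the text.

    `client` is expected to be an openai.OpenAI instance, or any object that
    exposes the same audio.transcriptions.create method. Hypothetical helper.
    """
    with Path(audio_path).open("rb") as audio:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio,
            language=language,
        )
    # With the default JSON response format the SDK exposes the text as .text.
    return result.text


# Usage (needs the openai package and a running server):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8015/v1", api_key="unused")
#   print(transcribe_file(client, "speech.mp3"))
```

Passing the client in as a parameter keeps the helper usable against both OpenAI and this server, since only the client's base_url differs.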

Evaluation samples

sample/eval_flores_en_ca.py evaluates English-to-Catalan translation on FLORES-200, scored with BLEU. The first command below targets the OpenAI API; the second targets this server. Apart from the model name, switching requires only the --url flag:

python sample/eval_flores_en_ca.py --model gpt-4o-mini
python sample/eval_flores_en_ca.py --url http://localhost:8015/v1 --model google/gemma-3-4b-it
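
One way a script can support such a --url switch is to pass an optional base URL to the client constructor. The sketch below is illustrative only: parse_args and client_kwargs are hypothetical names and are not taken from sample/eval_flores_en_ca.py.

```python
import argparse


def parse_args(argv=None):
    """Illustrative CLI with the two flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Illustrative CLI, not the real script")
    parser.add_argument("--url", default=None, help="Base URL of an OpenAI-compatible server")
    parser.add_argument("--model", required=True)
    return parser.parse_args(argv)


def client_kwargs(url):
    """Return keyword arguments for openai.OpenAI()."""
    if url is None:
        # SDK defaults: the OpenAI endpoint and the OPENAI_API_KEY variable.
        return {}
    # Local server: same placeholder key as the SDK example in this README.
    return {"base_url": url, "api_key": "unused"}


# Usage: client = OpenAI(**client_kwargs(parse_args().url))
```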

Dockerfiles

File	Base	Use case
`Dockerfile.cpu`	`python:3.14-slim`	Standard CPU image
`Dockerfile.gpu`	`nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04`	NVIDIA GPU image

Build

make build-cpu   # CPU image
make build-gpu   # GPU image
make build       # both

To bake one or more models into the image at build time, pass BUILD_MODELS as a space-separated list of HuggingFace model IDs:

make build-cpu BUILD_MODELS="google/gemma-3-270m-it"
make build-cpu BUILD_MODELS="google/gemma-3-270m-it google/gemma-3-4b-it"

The models are downloaded, converted to CTranslate2 int8 format, and stored inside the image under /models. When the container starts, those models are available immediately, with no conversion step on the first request.
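
To make the build-time step concrete, the sketch below expands a BUILD_MODELS string into invocations of CTranslate2's ct2-transformers-converter CLI with int8 quantization. This is an assumption about how such a step could be done, not a copy of what the Dockerfiles run; the function name and the /models/<org>/<name> layout are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def conversion_commands(build_models, models_dir="/models"):
    """Map a space-separated BUILD_MODELS string to converter command lines."""
    commands = []
    for model_id in build_models.split():
        # Assumed layout: /models/<org>/<name>; the real image may differ.
        output_dir = PurePosixPath(models_dir) / model_id
        commands.append([
            "ct2-transformers-converter",
            "--model", model_id,
            "--output_dir", str(output_dir),
            "--quantization", "int8",
        ])
    return commands


# Print the commands instead of running them, so the sketch has no side effects.
for cmd in conversion_commands("google/gemma-3-270m-it google/gemma-3-4b-it"):
    print(shlex.join(cmd))
```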

You can also pass the build arg directly to Docker:

docker build -f Dockerfile.cpu \
  --build-arg BUILD_MODELS="google/gemma-3-270m-it" \
  -t ctranslate2-web-server-cpu .

Note: HF_TOKEN must be set in the environment if the models require authentication (e.g. gated models). Pass it with --build-arg HF_TOKEN=$HF_TOKEN.

Run

make run-cpu     # CPU
make run-gpu     # GPU (requires NVIDIA runtime)

Models are stored inside the image under /models.

API

List models

GET /v1/models

Returns all models available on HuggingFace that are supported by CTranslate2 (e.g. Gemma 3, Qwen, LLaMA, Mistral).
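
A minimal Python sketch for reading this endpoint with the standard library follows. It assumes the response uses OpenAI's list shape, an object with a data array whose entries carry an id field; the function names are illustrative.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model IDs from an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8015/v1", timeout=30):
    """GET /v1/models and return the model IDs (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```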

Text completion

curl http://localhost:8015/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-270m-it", "prompt": "Once upon a time", "max_tokens": 100}'

Chat completion

curl http://localhost:8015/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-270m-it", "messages": [{"role": "user", "content": "Hello!"}]}'

The first request for a model triggers an automatic download and conversion. Subsequent requests use the cached converted model.
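
Because the first request can block while the model is downloaded and converted, a client may want a generous timeout. The sketch below uses only the standard library; the 30-minute default is an arbitrary assumption, not a value documented by the server, and the function names are illustrative.

```python
import json
import urllib.request


def chat_request(base_url, model, user_message):
    """Build a POST request for /v1/chat/completions."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=1800):
    """Send the request and return the assistant's reply text."""
    # Arbitrary allowance for first-use download and conversion; tune it for
    # your model size and network speed.
    request = chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage: chat("http://localhost:8015/v1", "google/gemma-3-270m-it", "Hello!")
```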

Audio transcription

curl http://localhost:8015/v1/audio/transcriptions \
  -F "model=openai/whisper-large-v3" \
  -F "file=@speech.mp3" \
  -F "language=en"

Supported model IDs: openai/whisper-large-v3, openai/whisper-large-v2, openai/whisper-large, openai/whisper-medium, openai/whisper-small, openai/whisper-base, openai/whisper-tiny. The alias whisper-1 maps to openai/whisper-large-v3.

Models are downloaded automatically on first use.
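
The model list and alias above amount to a small lookup. A sketch of that mapping, useful for validating a model name on the client side before uploading audio; the names are illustrative and this is not the server's own code:

```python
SUPPORTED_WHISPER_MODELS = (
    "openai/whisper-large-v3",
    "openai/whisper-large-v2",
    "openai/whisper-large",
    "openai/whisper-medium",
    "openai/whisper-small",
    "openai/whisper-base",
    "openai/whisper-tiny",
)

# Alias documented in this README.
WHISPER_ALIASES = {"whisper-1": "openai/whisper-large-v3"}


def resolve_whisper_model(requested):
    """Return the full model ID for a requested name, or raise ValueError."""
    model_id = WHISPER_ALIASES.get(requested, requested)
    if model_id not in SUPPORTED_WHISPER_MODELS:
        raise ValueError(f"unsupported Whisper model: {requested!r}")
    return model_id
```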

Configuration

Variable	Default	Description
`MODELS_DIR`	`/models`	Directory to store converted models
`DEVICE`	`cpu` / `cuda`	Inference device (set by image)
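
As an illustration of how these two variables could be consumed, the sketch below reads them with the defaults from the table. The Settings class is hypothetical, and the "cpu" fallback for DEVICE is an assumption for running outside the images, which set the variable themselves.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    models_dir: str
    device: str

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping, defaulting to os.environ."""
        env = os.environ if env is None else env
        return cls(
            models_dir=env.get("MODELS_DIR", "/models"),
            # Assumed fallback; the CPU and GPU images set DEVICE themselves.
            device=env.get("DEVICE", "cpu"),
        )


# Usage: settings = Settings.from_env()
```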
