jordimas/ctranslate2-web-server

ctranslate2-web-server

FastAPI server providing CTranslate2 inference with an OpenAI-compatible API.

  • Speech-to-text: Whisper models via faster-whisper, using the same openai/whisper-* model IDs as the OpenAI API.
  • LLMs: any model compatible with CTranslate2 (Gemma 3, Qwen, LLaMA, Mistral, and more). Models are downloaded from HuggingFace and converted to CTranslate2 format (int8) on first use.

Requirements

  • Docker
  • NVIDIA Container Toolkit (GPU only)

Pre-built images (quick start)

Images are published to the GitHub Container Registry on every push to main.

Pull and run the CPU image:

docker pull ghcr.io/jordimas/ctranslate2-web-server-cpu:latest
docker run --rm -p 8015:8015 -e HF_TOKEN=$HF_TOKEN ghcr.io/jordimas/ctranslate2-web-server-cpu:latest

For GPU:

docker pull ghcr.io/jordimas/ctranslate2-web-server-gpu:latest
docker run --rm --gpus all -p 8015:8015 -e HF_TOKEN=$HF_TOKEN ghcr.io/jordimas/ctranslate2-web-server-gpu:latest

Note: HF_TOKEN is required to download gated models such as Gemma from HuggingFace. Set it in your environment (export HF_TOKEN=your_token) or pass it directly with -e HF_TOKEN=your_token. You can create a token at huggingface.co/settings/tokens after accepting the model's license.

Using curl

Chat completion:

curl http://localhost:8015/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-270m-it", "messages": [{"role": "user", "content": "Hello!"}]}'

Audio transcription:

curl http://localhost:8015/v1/audio/transcriptions \
  -F "model=openai/whisper-large-v3" \
  -F "file=@speech.mp3" \
  -F "language=en"

Using the OpenAI Python SDK

Chat completion:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8015/v1", api_key="unused")

response = client.chat.completions.create(
    model="google/gemma-3-270m-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Audio transcription: see sample/transcribe.py, which works against both OpenAI and this server via --url:

uv run sample/transcribe.py --url http://localhost:8015/v1 --model openai/whisper-large-v3 speech.mp3
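
For a quick check without the sample script, the same call can be sketched with the OpenAI Python SDK. The helper name transcribe_file is made up for this example and is not part of the repository. It assumes client is an OpenAI instance created as in the chat example above, and that the server returns the SDK's default JSON response with a text field.

```python
from pathlib import Path


def transcribe_file(client, audio_path, model="openai/whisper-large-v3", language="en"):
    """Send one audio file to /v1/audio/transcriptions and return the text.

    `client` is an openai.OpenAI instance (or any object exposing the same
    audio.transcriptions.create method). Hypothetical helper for illustration.
    """
    with Path(audio_path).open("rb") as audio:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio,
            language=language,
        )
    return result.text


# Usage against a running server (not executed here):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8015/v1", api_key="unused")
# print(transcribe_file(client, "speech.mp3"))
```

Passing the client in as a parameter keeps the helper usable against both OpenAI and this server, since only base_url differs.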

Evaluation samples

sample/eval_flores_en_ca.py evaluates English-to-Catalan translation on FLORES-200, scored with BLEU. To run it against this server instead of OpenAI, add the --url flag and pick a model the server can load:

python sample/eval_flores_en_ca.py --model gpt-4o-mini
python sample/eval_flores_en_ca.py --url http://localhost:8015/v1 --model google/gemma-3-4b-it

Dockerfiles

File            Base                                        Use case
Dockerfile.cpu  python:3.14-slim                            Standard CPU image
Dockerfile.gpu  nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04  NVIDIA GPU image

Build

make build-cpu   # CPU image
make build-gpu   # GPU image
make build       # both

To bake one or more models into the image at build time, pass BUILD_MODELS as a space-separated list of HuggingFace model IDs:

make build-cpu BUILD_MODELS="google/gemma-3-270m-it"
make build-cpu BUILD_MODELS="google/gemma-3-270m-it google/gemma-3-4b-it"

The models are downloaded, converted to CTranslate2 int8 format, and stored inside the image under /models. When the container starts, those models are available immediately, with no conversion step on the first request.
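
As a rough illustration of that build step, the sketch below splits a BUILD_MODELS string and prints one conversion command per model. It uses CTranslate2's ct2-transformers-converter CLI with int8 quantization; the function name and the output directory layout ("/" replaced by "--") are assumptions for this example, and the repository's actual build script may do this differently.

```python
import shlex


def conversion_commands(build_models, models_dir="/models"):
    """Return one ct2-transformers-converter command per model ID.

    `build_models` is the space-separated BUILD_MODELS string. The output
    directory naming is an assumption made for this sketch, not necessarily
    what the image uses.
    """
    commands = []
    for model_id in build_models.split():
        output_dir = f"{models_dir}/{model_id.replace('/', '--')}"
        commands.append([
            "ct2-transformers-converter",
            "--model", model_id,
            "--output_dir", output_dir,
            "--quantization", "int8",
        ])
    return commands


# Print the commands instead of running them.
for cmd in conversion_commands("google/gemma-3-270m-it google/gemma-3-4b-it"):
    print(shlex.join(cmd))
```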

You can also pass the build arg directly to Docker:

docker build -f Dockerfile.cpu \
  --build-arg BUILD_MODELS="google/gemma-3-270m-it" \
  -t ctranslate2-web-server-cpu .

Note: HF_TOKEN must be set in the environment if the models require authentication (e.g. gated models). Pass it with --build-arg HF_TOKEN=$HF_TOKEN.

Run

make run-cpu     # CPU
make run-gpu     # GPU (requires NVIDIA runtime)

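Models are stored under /models: baked-in models live in the image, and models converted at runtime are written to the same directory inside the container.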

API

List models

GET /v1/models

Returns all models available on HuggingFace that are supported by CTranslate2 (e.g. Gemma 3, Qwen, LLaMA, Mistral).
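
A client will usually want just the model IDs. The sketch below assumes the response follows the OpenAI list format (an object with a data array whose entries carry an id), which this README implies but does not spell out; model_ids is a hypothetical helper.

```python
import json


def model_ids(response_body):
    """Extract model IDs from a /v1/models response body (JSON text)."""
    payload = json.loads(response_body)
    return [entry["id"] for entry in payload.get("data", [])]


# Sample body in the OpenAI list format; a real response may carry more fields.
sample = '{"object": "list", "data": [{"id": "google/gemma-3-270m-it", "object": "model"}]}'
print(model_ids(sample))
```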

Text completion

curl http://localhost:8015/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-270m-it", "prompt": "Once upon a time", "max_tokens": 100}'
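
The same request can be built from Python with only the standard library. This is a minimal sketch: completion_request is a made-up helper name, and the commented-out response handling assumes the server returns the OpenAI completions shape (choices[0].text).

```python
import json
import urllib.request


def completion_request(prompt, model="google/gemma-3-270m-it", max_tokens=100,
                       base_url="http://localhost:8015/v1"):
    """Build (but do not send) a POST request for /v1/completions."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = completion_request("Once upon a time")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```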

Chat completion

curl http://localhost:8015/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-270m-it", "messages": [{"role": "user", "content": "Hello!"}]}'

The first request for a model triggers an automatic download and conversion. Subsequent requests use the cached converted model.
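
The convert-once behaviour can be pictured as a check for an existing output directory. The sketch below is not the server's code: ensure_converted, the convert callback, and the directory naming are all assumptions chosen to illustrate the idea.

```python
from pathlib import Path


def ensure_converted(model_id, models_dir, convert):
    """Return the converted model directory, converting only if it is missing.

    `convert(model_id, output_dir)` is a caller-supplied function that performs
    the download and conversion. The directory naming is an assumption.
    """
    output_dir = Path(models_dir) / model_id.replace("/", "--")
    if not output_dir.exists():
        # First use: run the expensive download and conversion.
        convert(model_id, output_dir)
    # Later calls skip straight to the cached directory.
    return output_dir
```

In practice this means the first request for a large model can take much longer than later ones, so clients may want a generous timeout on that first call.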

Audio transcription

curl http://localhost:8015/v1/audio/transcriptions \
  -F "model=openai/whisper-large-v3" \
  -F "file=@speech.mp3" \
  -F "language=en"

Supported model IDs: openai/whisper-large-v3, openai/whisper-large-v2, openai/whisper-large, openai/whisper-medium, openai/whisper-small, openai/whisper-base, openai/whisper-tiny. The alias whisper-1 maps to openai/whisper-large-v3.
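
A client can validate a model name before uploading audio. The sketch below mirrors the list and alias above; resolve_whisper_model and the two constants are hypothetical names, and the server's own lookup may be implemented differently.

```python
# Model IDs and alias copied from the list above.
WHISPER_MODELS = (
    "openai/whisper-large-v3",
    "openai/whisper-large-v2",
    "openai/whisper-large",
    "openai/whisper-medium",
    "openai/whisper-small",
    "openai/whisper-base",
    "openai/whisper-tiny",
)
WHISPER_ALIASES = {"whisper-1": "openai/whisper-large-v3"}


def resolve_whisper_model(name):
    """Map an alias to its full model ID and reject unsupported names."""
    resolved = WHISPER_ALIASES.get(name, name)
    if resolved not in WHISPER_MODELS:
        raise ValueError(f"Unsupported Whisper model: {name}")
    return resolved


print(resolve_whisper_model("whisper-1"))
```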

Models are downloaded automatically on first use.

Configuration

Variable    Default     Description
MODELS_DIR  /models     Directory to store converted models
DEVICE      cpu / cuda  Inference device (set by image)
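
One way to read these settings with their documented defaults is sketched below. load_config and its image_device parameter are assumptions for illustration; the server may read the variables differently.

```python
import os


def load_config(environ=os.environ, image_device="cpu"):
    """Read MODELS_DIR and DEVICE, falling back to the documented defaults.

    `image_device` stands in for the default set by the image ("cpu" or "cuda").
    """
    return {
        "models_dir": environ.get("MODELS_DIR", "/models"),
        "device": environ.get("DEVICE", image_device),
    }


print(load_config({"MODELS_DIR": "/data/models"}, image_device="cuda"))
```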
