# ONNX Model Generator Service

A Docker container for generating ONNX models from HuggingFace models using onnxruntime-genai. This service provides a REST API for converting HuggingFace models to ONNX format with various optimization options.
## Features

- REST API: Easy-to-use endpoints for model generation
- Multiple Formats: Support for different precision levels (fp32, fp16, int4, int8)
- Multiple Targets: CPU and GPU optimization support
- Security: Non-root user, proper permissions, token-based authentication
- Health Monitoring: Health check endpoint for container monitoring
## Prerequisites

- Docker installed on your system
- HuggingFace account and access token (for downloading models)

### Getting a HuggingFace Token

1. Go to HuggingFace Settings
2. Create a new token with "Read" permissions
3. Save the token securely (you'll need it for API calls)
If you want to create a local token file for testing:

1. Copy the template file:

   ```bash
   cp myhftoken.template myhftoken
   ```

2. Edit `myhftoken` and replace `YOUR_TOKEN_HERE` with your actual token
3. The token file is already in `.gitignore` to prevent accidental commits

Important: Never commit your actual token to version control!
## Quick Start

Build the image:

```bash
docker build -t onnx-model-generator-service:latest .
```

Run the container:

```bash
docker run -d \
  --name onnx-generator \
  -p 8080:8080 \
  onnx-model-generator-service:latest
```

Check if the service is running:

```bash
curl http://127.0.0.1:8080/health
```

Expected response:

```json
{
  "status": "healthy",
  "timestamp": "2024-01-01T12:00:00Z"
}
```

## API Endpoints

- `GET /health` - Returns service status
- `GET /models` - Returns the list of supported model architectures
- `POST /generate` - Converts a HuggingFace model to ONNX format
Request body for `POST /generate`:

```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "precision": "fp32",
  "execution_provider": "cpu",
  "token": "hf_your_token_here"
}
```

Parameters:

- `model` (required): HuggingFace model identifier
- `precision` (optional): Model precision - `fp32`, `fp16`, `int4`, or `int8` (default: `fp32`)
- `execution_provider` (optional): Target execution provider - `cpu`, `cuda`, or `dml` (default: `cpu`)
- `token` (required): HuggingFace access token
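The parameter constraints above can be checked client-side before sending a request, which avoids a round trip for an obviously invalid payload. A minimal sketch, assuming the allowed values documented above; the `build_generate_payload` helper is hypothetical, not part of the service:

```python
# Allowed values, taken from the /generate parameter documentation
PRECISIONS = {"fp32", "fp16", "int4", "int8"}
PROVIDERS = {"cpu", "cuda", "dml"}


def build_generate_payload(model: str, token: str,
                           precision: str = "fp32",
                           execution_provider: str = "cpu") -> dict:
    """Validate arguments and build a /generate request body."""
    if not model:
        raise ValueError("model is required")
    if not token:
        raise ValueError("token is required")
    if precision not in PRECISIONS:
        raise ValueError(f"precision must be one of {sorted(PRECISIONS)}")
    if execution_provider not in PROVIDERS:
        raise ValueError(f"execution_provider must be one of {sorted(PROVIDERS)}")
    return {
        "model": model,
        "precision": precision,
        "execution_provider": execution_provider,
        "token": token,
    }
```

The resulting dict can be passed directly as the `json=` argument to `requests.post`, as in the test script further below.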
Generate an fp32 model:

```bash
curl -X POST http://127.0.0.1:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "precision": "fp32",
    "execution_provider": "cpu",
    "token": "hf_your_token_here"
  }' -o tinyllama_fp32_model.zip
```

Generate an int4 model:

```bash
curl -X POST http://127.0.0.1:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "precision": "int4",
    "execution_provider": "cpu",
    "token": "hf_your_token_here"
  }' -o tinyllama_int4_model.zip
```

Note: The model is returned as a downloadable zip file. Use the `-o filename.zip` option to save it to your local machine. int4 precision is recommended for smaller file sizes (~636MB vs ~2.5GB for fp32) and faster generation times.
List the supported model architectures:

```bash
curl http://127.0.0.1:8080/models
```

## Testing the Service

1. Start the container (if not already running):

   ```bash
   docker run -d --name onnx-generator -p 8080:8080 onnx-model-generator-service:latest
   ```

2. Test the health endpoint:

   ```bash
   curl http://127.0.0.1:8080/health
   ```

3. Test model generation:

   ```bash
   TOKEN=$(cat myhftoken) && curl -X POST http://127.0.0.1:8080/generate -H "Content-Type: application/json" -d "{\"model\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\", \"precision\": \"int4\", \"execution_provider\": \"cpu\", \"token\": \"$TOKEN\"}" -o tinyllama_int4_model.zip
   ```

Expected result: a ~636MB zip file containing the ONNX model is downloaded to your local machine. Generation takes approximately 2-3 minutes.
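The container may take a few seconds to become ready after `docker run`, so test scripts often poll the health endpoint before calling `/generate`. A minimal retry sketch; the `wait_until` helper is an illustrative assumption, not part of the service:

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0,
               interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False


# In practice, `check` would wrap the health endpoint, e.g.:
#   import requests
#   wait_until(lambda: requests.get("http://127.0.0.1:8080/health").ok)
```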
Create a test script:
import requests
import json
# Test configuration
BASE_URL = "http://127.0.0.1:8080"
TOKEN = "hf_your_token_here" # Replace with your token
# Test health endpoint
response = requests.get(f"{BASE_URL}/health")
print(f"Health check: {response.status_code} - {response.json()}")
# Test model generation
payload = {
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"precision": "int4",
"execution_provider": "cpu",
"token": TOKEN
}
response = requests.post(f"{BASE_URL}/generate", json=payload)
if response.status_code == 200:
# Save the model zip file
with open("tinyllama_int4_model.zip", "wb") as f:
f.write(response.content)
print(f"Model generation successful! Downloaded {len(response.content)} bytes")
else:
print(f"Model generation failed: {response.status_code} - {response.text}")When you generate a model, it's returned as a zip file containing:
- ONNX model files (`.onnx`)
- Configuration files
- Tokenizer files
- All necessary components to run the model
- int4 precision: ~636MB (recommended)
- fp32 precision: ~2.5GB
- fp16 precision: ~1.3GB
- NOT stored in container: Models are temporary during generation
- Downloaded to your machine: The complete model comes as a zip download
- Ready to use: Extract the zip and use with onnxruntime
## Container Management

View logs:

```bash
docker logs onnx-generator
```

Stop the container:

```bash
docker stop onnx-generator
```

Remove the container:

```bash
docker rm onnx-generator
```

Open a shell inside the running container:

```bash
docker exec -it onnx-generator /bin/bash
```

## Security Considerations

- Token Security: Never embed tokens in the Docker image. Always pass them via API parameters.
- Non-root User: The container runs as a non-root user (`appuser`) for security.
- File Permissions: Proper file permissions are set to prevent unauthorized access.
- Network: The service only exposes necessary ports (8080 for the API).
## Supported Models

The container supports models compatible with onnxruntime-genai. Popular models include:

- TinyLlama models
- Microsoft Phi models
- Llama models
- And many others from HuggingFace

Note: Not all HuggingFace models are supported. The `/models` endpoint can help identify supported architectures.
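Before submitting a long-running `/generate` request, a client could compare a model name against the architecture list returned by `/models`. A rough sketch; the substring-matching heuristic and the `is_supported` helper are illustrative assumptions, not the service's own logic:

```python
def is_supported(model_id: str, supported_architectures: list[str]) -> bool:
    """Heuristically check a model ID against the /models architecture list.

    `supported_architectures` would come from GET /models; matching the
    architecture name as a substring of the model name is a simplification
    for illustration.
    """
    name = model_id.split("/")[-1].lower()
    return any(arch.lower() in name for arch in supported_architectures)
```

For example, `is_supported("TinyLlama/TinyLlama-1.1B-Chat-v1.0", ["llama", "phi"])` would report the model as supported.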
## Dependencies

- Python 3.13
- onnxruntime-genai 0.8.2
- onnxruntime 1.22.0
- torch 2.4.1
- transformers 4.52.4
- onnx 1.18.0
## Container Details

- Base Image: `python:3.13-slim`
- Working Directory: `/app`
- User: `appuser` (non-root)
- Exposed Port: 8080
- Health Check: Built-in endpoint monitoring
## Troubleshooting

1. Token Authentication Errors:
   - Ensure your HuggingFace token has the correct permissions
   - Verify the token is valid and not expired

2. Model Not Supported:
   - Check whether the model architecture is supported by onnxruntime-genai
   - Try a known working model such as TinyLlama

3. Out of Memory or Worker Crashes:
   - Use int4 precision: much smaller memory footprint and faster generation
   - Try smaller models: some models may be too large for the available memory
   - Increase Docker memory limits if you need fp32 precision
   - Check the logs with `docker logs onnx-generator` for specific error messages

4. Model Generation Fails:
   - Large models: fp32 precision may cause memory issues; try int4 instead
   - Download timeout: generation can take 2-5 minutes, so don't cancel early
   - Always specify an output file for curl downloads with `-o filename.zip`

5. Container Won't Start:
   - Check the Docker logs: `docker logs onnx-generator`
   - Verify that port 8080 is not already in use

6. Connection Refused or Broken:
   - If using podman, try `127.0.0.1:8080` instead of `localhost:8080`
   - Check whether the container is actually running: `docker ps`
## License

This project is licensed under the MIT License - see the LICENSE file for details.

This project is provided as-is for educational and development purposes. Please ensure compliance with HuggingFace's terms of service and the licenses of any models you convert.