docs: add Docker installation guide for OpenJudge and training enviro… #84
Changes from all commits
bfbf81f
b60be30
b309a5b
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # OpenJudge Dockerfile | ||
| # Base image for running OpenJudge evaluation tasks | ||
| # For training Judge models with verl/sglang/vllm, use Dockerfile.train instead | ||
|
|
||
| FROM dsw-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch:2.8.0-gpu-py312-cu128-ubuntu24.04-3995b779-1764359181 | ||
|
|
||
| # Set environment variables | ||
| ENV PYTHONUNBUFFERED=1 | ||
| ENV DEBIAN_FRONTEND=noninteractive | ||
|
|
||
| # Set working directory | ||
| WORKDIR /workspace | ||
|
|
||
| # Install system dependencies | ||
| RUN apt-get update && apt-get install -y --no-install-recommends \ | ||
| git \ | ||
| curl \ | ||
| build-essential \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
|
|
||
| # Install OpenJudge dependencies | ||
| RUN pip install --no-cache-dir \ | ||
| "pandas>=2.2.3,<3.0.0" \ | ||
| "loguru>=0.7.3,<0.8.0" \ | ||
| "json_repair>=0.54.0,<1.0.0" \ | ||
| "pydantic>=2.11.5,<3.0.0" \ | ||
| "openai>=1.85.0,<2.0.0" \ | ||
| "tenacity>=9.1.0,<10.0.0" \ | ||
| "math-verify>=0.7.0,<0.8.0" \ | ||
| "tqdm>=4.66.0,<5.0.0" \ | ||
| "fire" \ | ||
| "numpy>=1.22.0,<2.0.0" \ | ||
| "dashscope>=1.19.0" \ | ||
| "tiktoken>=0.7.0" \ | ||
| "nltk>=3.8.1" \ | ||
| "jieba>=0.42.1" \ | ||
| "sacrebleu>=2.0.0" \ | ||
| "rouge-score>=0.1.2" \ | ||
| "python-Levenshtein>=0.20.0" \ | ||
| "scikit-learn>=1.0.0" | ||
|
|
||
| # Install OpenJudge from GitHub | ||
| RUN pip install --no-cache-dir git+https://github.com/agentscope-ai/OpenJudge.git | ||
|
Contributor
Installing from a git repository URL without pinning to a specific commit hash or tag can lead to non-reproducible builds, because pip will pull the latest commit from the repository's default branch at build time. |
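The example attached to this comment did not survive on this page. As a minimal sketch of what pinning could look like, the fragment below passes the revision in as a build argument. The name `OPENJUDGE_REF` is hypothetical and not part of this PR, and no particular tag or commit is assumed to exist; it only relies on pip's `git+<url>@<ref>` syntax.

```dockerfile
# OPENJUDGE_REF is a hypothetical build argument, not part of this PR.
# Supply a real tag or full commit SHA at build time, for example:
#   docker build --build-arg OPENJUDGE_REF=<tag-or-commit-sha> ...
ARG OPENJUDGE_REF

# Fail the build early if no ref was supplied, then install that exact revision.
RUN test -n "${OPENJUDGE_REF}" && \
    pip install --no-cache-dir \
        "git+https://github.com/agentscope-ai/OpenJudge.git@${OPENJUDGE_REF}"
```

A full commit SHA is the stricter choice, since a tag can be moved after the fact; hard-coding the ref in the Dockerfile instead of using a build argument would work equally well.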
||
|
|
||
| # Clean up | ||
| RUN pip cache purge | ||
|
|
||
| # Set default command | ||
| CMD ["/bin/bash"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| # ============================================================================= | ||
| # OpenJudge Training Dockerfile | ||
| # ============================================================================= | ||
| # Training image for Judge model SFT/RL with: | ||
| # - PyTorch + CUDA support | ||
| # - Inference frameworks: sglang, vllm | ||
| # - Training frameworks: verl, transformers, accelerate | ||
| # - FlashAttention, FlashInfer | ||
| # - OpenJudge | ||
| # | ||
| # For basic installation (evaluation only), use Dockerfile instead | ||
| # ============================================================================= | ||
|
|
||
| # Base image: PAI PyTorch | ||
| FROM dsw-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch:2.8.0-gpu-py312-cu128-ubuntu24.04-3995b779-1764359181 | ||
|
|
||
| # Set environment variables | ||
| ENV USE_MEGATRON=0 | ||
| ENV USE_SGLANG=1 | ||
| ENV MAX_JOBS=32 | ||
| ENV DEBIAN_FRONTEND=noninteractive | ||
|
|
||
| # Set working directory | ||
| WORKDIR /workspace | ||
|
|
||
| # ============================================================================= | ||
| # 1. Inference Frameworks (sglang, vllm) | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir "sglang[all]==0.5.2" && \ | ||
| pip install --no-cache-dir torch-memory-saver && \ | ||
| pip install --no-cache-dir "vllm==0.11.0" | ||
|
Comment on lines +29 to +31
Contributor
|
||
|
|
||
| # ============================================================================= | ||
| # 2. Training & ML Packages | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir \ | ||
| "transformers[hf_xet]>=4.51.0" \ | ||
| accelerate \ | ||
| datasets \ | ||
| peft \ | ||
| hf-transfer \ | ||
| "numpy<2.0.0" \ | ||
| "pyarrow>=15.0.0" \ | ||
| pandas \ | ||
| "tensordict>=0.8.0,<=0.10.0,!=0.9.0" \ | ||
| torchdata \ | ||
| "ray[default]" \ | ||
| codetiming \ | ||
| hydra-core \ | ||
| pylatexenc \ | ||
| qwen-vl-utils \ | ||
| wandb \ | ||
| swanlab \ | ||
| dill \ | ||
| pybind11 \ | ||
| liger-kernel \ | ||
| mathruler \ | ||
| pytest \ | ||
| py-spy \ | ||
| pre-commit \ | ||
| ruff \ | ||
| tensorboard | ||
|
|
||
| # ============================================================================= | ||
| # 3. Additional Dependencies | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir \ | ||
| "nvidia-ml-py>=12.560.30" \ | ||
| "fastapi[standard]>=0.115.0" \ | ||
| "optree>=0.13.0" \ | ||
| "pydantic>=2.9" \ | ||
| "grpcio>=1.62.1" | ||
|
|
||
| # ============================================================================= | ||
| # 4. FlashAttention & FlashInfer (Python 3.12 + CUDA 12) | ||
| # ============================================================================= | ||
| RUN curl -L -O "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl" && \ | ||
| pip install --no-cache-dir flash_attn-2.8.1+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl && \ | ||
| rm -f flash_attn-2.8.1+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl | ||
|
|
||
| RUN pip install --no-cache-dir flashinfer-python==0.3.1 | ||
|
|
||
| # ============================================================================= | ||
| # 5. OpenCV Fix | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir opencv-python opencv-fixer && \ | ||
| python -c "from opencv_fixer import AutoFix; AutoFix()" | ||
|
|
||
| # ============================================================================= | ||
| # 6. verl (RL Training Framework) | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir --no-deps git+https://github.com/volcengine/verl.git@v0.6.1 | ||
|
|
||
| # ============================================================================= | ||
| # 7. OpenJudge Dependencies | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir \ | ||
| "loguru>=0.7.3,<0.8.0" \ | ||
| "json_repair>=0.54.0,<1.0.0" \ | ||
| "openai>=1.85.0,<2.0.0" \ | ||
| "tenacity>=9.1.0,<10.0.0" \ | ||
| "math-verify>=0.7.0,<0.8.0" \ | ||
| "tqdm>=4.66.0,<5.0.0" \ | ||
| fire \ | ||
| "dashscope>=1.19.0" \ | ||
| "tiktoken>=0.7.0" \ | ||
| "nltk>=3.8.1" \ | ||
| "jieba>=0.42.1" \ | ||
| "sacrebleu>=2.0.0" \ | ||
| "rouge-score>=0.1.2" \ | ||
| "python-Levenshtein>=0.20.0" \ | ||
| "scikit-learn>=1.0.0" | ||
|
|
||
| # ============================================================================= | ||
| # 8. OpenJudge | ||
| # ============================================================================= | ||
| RUN pip install --no-cache-dir --no-deps git+https://github.com/agentscope-ai/OpenJudge.git | ||
|
Contributor
Installing from a git repository URL without pinning to a specific commit hash or tag can lead to non-reproducible builds, because pip will pull the latest commit from the repository's default branch at build time. |
||
|
|
||
| # Clean up cache | ||
| RUN pip cache purge | ||
|
|
||
| # Set default command | ||
| CMD ["/bin/bash"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,84 @@ | ||
| # OpenJudge Docker Guide | ||
|
|
||
| This document describes how to deploy OpenJudge using Docker. We provide two images: | ||
|
|
||
| | Image | Purpose | Dockerfile | | ||
| |-------|---------|------------| | ||
| | **Base Image** | Running evaluation tasks (API calls) | `Dockerfile.base` | | ||
| | **Training Image** | Judge model SFT/RL training | `Dockerfile.train` | | ||
|
|
||
| --- | ||
|
|
||
| ## Option 1: Base Installation (Evaluation) | ||
|
|
||
| For scenarios using OpenJudge for evaluation (calling LLMs via API). | ||
|
|
||
| ### 1.1 Build Image | ||
|
|
||
| ```bash | ||
| cd OpenJudge | ||
| docker build -f docker/Dockerfile.base -t openjudge:latest . | ||
| ``` | ||
|
|
||
| ### 1.2 Start Container | ||
|
|
||
| ```bash | ||
| docker run -it \ | ||
| -v $(pwd):/workspace/OpenJudge \ | ||
| -e OPENAI_API_KEY=your_api_key \ | ||
| --name openjudge \ | ||
| openjudge:latest | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Option 2: Training Environment | ||
|
|
||
| For scenarios using the [verl](https://github.com/volcengine/verl) framework for Judge model SFT/RL training. | ||
|
|
||
| ### Environment Details | ||
|
|
||
| - **Base Image**: PAI PyTorch 2.8.0 + CUDA 12.8 + Python 3.12 | ||
| - **Training Framework**: verl v0.6.1 (FSDP distributed training) | ||
| - **Inference Frameworks**: vLLM 0.11.0, SGLang 0.5.2 | ||
|
|
||
| ### 2.1 Build Image | ||
|
|
||
| ```bash | ||
| cd OpenJudge | ||
| docker build -f docker/Dockerfile.train -t openjudge-train:latest . | ||
| ``` | ||
|
|
||
| ### 2.2 Start Container | ||
|
|
||
| ```bash | ||
| docker run --gpus all -it \ | ||
| --shm-size=64g \ | ||
| -v $(pwd):/workspace/OpenJudge \ | ||
| -v /path/to/your/models:/models \ | ||
| -v /path/to/your/data:/data \ | ||
| --name openjudge-train \ | ||
| openjudge-train:latest | ||
| ``` | ||
|
|
||
| **Parameter Description:** | ||
|
|
||
| | Parameter | Description | | ||
| |-----------|-------------| | ||
| | `--gpus all` | Use all GPUs | | ||
| | `--shm-size=64g` | Set shared memory to 64GB (required for training) | | ||
| | `-v $(pwd):/workspace/OpenJudge` | Mount current directory to container | | ||
| | `-v /path/to/your/models:/models` | Mount model directory (modify path as needed) | | ||
| | `-v /path/to/your/data:/data` | Mount data directory (modify path as needed) | | ||
| | `--name` | Container name | | ||
|
|
||
| ### 2.3 Run Training | ||
|
|
||
| After entering the container: | ||
|
|
||
| ```bash | ||
| cd /workspace/OpenJudge/cookbooks/training_judge_model/sft | ||
| bash run_sft_rm.sh | ||
| ``` | ||
|
|
||
| --- |
To optimize the Docker image size and reduce the number of layers, it's good practice to chain related `RUN` commands. You can combine this `apt-get` installation with the subsequent `pip install` and `pip cache purge` commands into a single `RUN` instruction. This creates a single layer, making the image more compact and potentially speeding up builds.
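A rough sketch of the chained form, assuming the system packages and pip flags from the evaluation Dockerfile in this PR. The dependency list is shortened to two entries for brevity; the real list would be carried over unchanged.

```dockerfile
# Sketch only: system packages, Python dependencies, and cleanup in one layer.
# The pip package list is abbreviated here; keep the full list from the original file.
RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        curl \
        build-essential \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir \
        "pandas>=2.2.3,<3.0.0" \
        "loguru>=0.7.3,<0.8.0" \
    && pip cache purge
```

Cleanup only shrinks the image when it runs in the same `RUN` instruction that created the files, because a later layer cannot remove data already stored in an earlier one. The trade-off is coarser build caching: changing any one dependency in the chained instruction reruns the whole step, including `apt-get`.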