Complete guide to all environment variables and configuration options for worker-vllm.
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
MODEL_NAME |
'facebook/opt-125m' | str |
Name or path of the Hugging Face model to use. |
MODEL_REVISION |
'main' | str |
Model revision to load (default: main). |
TOKENIZER |
None | str |
Name or path of the Hugging Face tokenizer to use. |
SKIP_TOKENIZER_INIT |
False | bool |
Skip initialization of tokenizer and detokenizer. |
TOKENIZER_MODE |
'auto' | ['auto', 'slow'] | The tokenizer mode. |
TRUST_REMOTE_CODE |
False |
bool |
Trust remote code from Hugging Face. |
DOWNLOAD_DIR |
None | str |
Directory to download and load the weights. |
LOAD_FORMAT |
'auto' | str |
The format of the model weights to load. |
HF_TOKEN |
- | str |
Hugging Face token for private and gated models. |
DTYPE |
'auto' | ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'] | Data type for model weights and activations. |
KV_CACHE_DTYPE |
'auto' | ['auto', 'fp8'] | Data type for KV cache storage. |
QUANTIZATION_PARAM_PATH |
None | str |
Path to the JSON file containing the KV cache scaling factors. |
MAX_MODEL_LEN |
None | int |
Model context length. |
GUIDED_DECODING_BACKEND |
'outlines' | ['outlines', 'lm-format-enforcer'] | Which engine will be used for guided decoding by default. |
DISTRIBUTED_EXECUTOR_BACKEND |
None | ['ray', 'mp'] | Backend to use for distributed serving. |
WORKER_USE_RAY |
False | bool |
Deprecated, use --distributed-executor-backend=ray. |
PIPELINE_PARALLEL_SIZE |
1 | int |
Number of pipeline stages. |
TENSOR_PARALLEL_SIZE |
1 | int |
Number of tensor parallel replicas. |
MAX_PARALLEL_LOADING_WORKERS |
None | int |
Load model sequentially in multiple batches. |
RAY_WORKERS_USE_NSIGHT |
False | bool |
If specified, use nsight to profile Ray workers. |
ENABLE_PREFIX_CACHING |
False | bool |
Enables automatic prefix caching. |
DISABLE_SLIDING_WINDOW |
False | bool |
Disables sliding window, capping to sliding window size. |
NUM_LOOKAHEAD_SLOTS |
0 | int |
Experimental scheduling config necessary for speculative decoding. |
SEED |
0 | int |
Random seed for operations. |
NUM_GPU_BLOCKS_OVERRIDE |
None | int |
If specified, ignore GPU profiling result and use this number of GPU blocks. |
MAX_NUM_BATCHED_TOKENS |
None | int |
Maximum number of batched tokens per iteration. |
MAX_NUM_SEQS |
256 | int |
Maximum number of sequences per iteration. |
MAX_LOGPROBS |
20 | int |
Max number of log probs to return when logprobs is specified in SamplingParams. |
DISABLE_LOG_STATS |
False | bool |
Disable logging statistics. |
QUANTIZATION |
None | ['awq', 'squeezellm', 'gptq', 'bitsandbytes'] | Method used to quantize the weights. |
ROPE_SCALING |
None | dict |
RoPE scaling configuration in JSON format. |
ROPE_THETA |
None | float |
RoPE theta. Use with rope_scaling. |
TOKENIZER_POOL_SIZE |
0 | int |
Size of tokenizer pool to use for asynchronous tokenization. |
TOKENIZER_POOL_TYPE |
'ray' | str |
Type of tokenizer pool to use for asynchronous tokenization. |
TOKENIZER_POOL_EXTRA_CONFIG |
None | dict |
Extra config for tokenizer pool. |
| Variable | Default | Type | Description |
|---|---|---|---|
ENABLE_LORA |
False | bool |
If True, enable handling of LoRA adapters. |
MAX_LORAS |
1 | int |
Max number of LoRAs in a single batch. |
MAX_LORA_RANK |
16 | int |
Max LoRA rank. |
LORA_EXTRA_VOCAB_SIZE |
256 | int |
Maximum size of extra vocabulary for LoRA adapters. |
LORA_DTYPE |
'auto' | ['auto', 'float16', 'bfloat16', 'float32'] | Data type for LoRA. |
LONG_LORA_SCALING_FACTORS |
None | tuple |
Specify multiple scaling factors for LoRA adapters. |
MAX_CPU_LORAS |
None | int |
Maximum number of LoRAs to store in CPU memory. |
FULLY_SHARDED_LORAS |
False | bool |
Enable fully sharded LoRA layers. |
LORA_MODULES |
[] |
list[dict] |
Add lora adapters from Hugging Face [{"name": "xx", "path": "xxx/xxxx", "base_model_name": "xxx/xxxx"}] |
Note (Serverless): When LoRA adapters are configured via
LORA_MODULES, initialization is deferred to the first request to ensure compatibility with RunPod Serverless. This means the first request will include LoRA loading time. Subsequent requests are unaffected. Check logs for "LoRA mode: X adapter(s) will load on first request" at startup.
Speculative decoding can be configured in two ways:
Set SPECULATIVE_CONFIG to a JSON string with your full speculative decoding configuration:
SPECULATIVE_CONFIG='{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'| Variable | Default | Type/Choices | Description |
|---|---|---|---|
SPECULATIVE_METHOD |
None | ['draft_model', 'ngram', 'eagle', 'eagle3', 'medusa', 'mlp_speculator'] | Speculative decoding method to use. |
SPECULATIVE_MODEL |
None | str |
The name of the draft model to be used in speculative decoding. |
NUM_SPECULATIVE_TOKENS |
None | int |
The number of speculative tokens to sample from the draft model. |
SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE |
None | int |
Number of tensor parallel replicas for the draft model. |
SPECULATIVE_MAX_MODEL_LEN |
None | int |
The maximum sequence length supported by the draft model. |
SPECULATIVE_DISABLE_BY_BATCH_SIZE |
None | int |
Disable speculative decoding if the number of enqueue requests is larger than this value. |
NGRAM_PROMPT_LOOKUP_MAX |
None | int |
Max size of window for ngram prompt lookup in speculative decoding. |
NGRAM_PROMPT_LOOKUP_MIN |
None | int |
Min size of window for ngram prompt lookup in speculative decoding. |
If SPECULATIVE_CONFIG is set, it takes priority over individual env vars. When using individual env vars without SPECULATIVE_METHOD, the method is auto-detected from the model name or configuration.
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
GPU_MEMORY_UTILIZATION |
0.95 |
float |
Sets GPU VRAM utilization. |
MAX_PARALLEL_LOADING_WORKERS |
None |
int |
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models. |
BLOCK_SIZE |
16 |
8, 16, 32 |
Token block size for contiguous chunks of tokens. |
SWAP_SPACE |
4 |
int |
CPU swap space size (GiB) per GPU. |
ENFORCE_EAGER |
False | bool |
Always use eager-mode PyTorch. If False(0), will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. |
MAX_SEQ_LEN_TO_CAPTURE |
8192 |
int |
Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
DISABLE_CUSTOM_ALL_REDUCE |
0 |
int |
Enables or disables custom all reduce. |
ENABLE_EXPERT_PARALLEL |
False |
bool |
Enable Expert Parallel for MoE models. |
ATTENTION_BACKEND |
None |
str |
Attention backend to use (e.g., FLASH_ATTN, FLASHINFER, TRITON_FLASH_ATTN). Replaces deprecated VLLM_ATTENTION_BACKEND. |
ASYNC_SCHEDULING |
None |
bool |
Enable async scheduling (overlaps engine scheduling with GPU execution). Default: enabled in vLLM 0.14.0+. Set to false to disable. |
STREAM_INTERVAL |
1 |
int |
Controls how often to yield streaming results. Lower = more frequent updates. |
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
TOKENIZER_NAME |
None |
str |
Tokenizer repository to use a different tokenizer than the model's default. |
TOKENIZER_REVISION |
None |
str |
Tokenizer revision to load. |
CUSTOM_CHAT_TEMPLATE |
None |
str of single-line jinja template |
Custom chat jinja template. More Info |
The way this works is that the first request will have a batch size of DEFAULT_MIN_BATCH_SIZE, and each subsequent request will have a batch size of previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR. This will continue until the batch size reaches DEFAULT_BATCH_SIZE. E.g. for the default values, the batch sizes will be 1, 3, 9, 27, 50, 50, 50, .... You can also specify this per request, with inputs max_batch_size, min_batch_size, and batch_size_growth_factor. This has nothing to do with vLLM's internal batching, but rather the number of tokens sent in each HTTP request from the worker.
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
DEFAULT_BATCH_SIZE |
50 |
int |
Default and Maximum batch size for token streaming to reduce HTTP calls. |
DEFAULT_MIN_BATCH_SIZE |
1 |
int |
Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
DEFAULT_BATCH_SIZE_GROWTH_FACTOR |
3 |
float |
Growth factor for dynamic batch size. |
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
RAW_OPENAI_OUTPUT |
1 |
boolean as int |
Enables raw OpenAI SSE format string output when streaming. Required to be enabled (which it is by default) for OpenAI compatibility. |
OPENAI_SERVED_MODEL_NAME_OVERRIDE |
None |
str |
Overrides the name of the served model from model repo/path to specified name, which you will then be able to use the value for the model parameter when making OpenAI requests |
OPENAI_RESPONSE_ROLE |
assistant |
str |
Role of the LLM's Response in OpenAI Chat Completions. |
ENABLE_AUTO_TOOL_CHOICE |
false |
bool |
Enables automatic tool selection for supported models. Set to true to activate. |
TOOL_CALL_PARSER |
None |
str |
Specifies the parser for tool calls. Options: mistral, hermes, llama3_json, llama4_json, llama4_pythonic, granite, granite-20b-fc, deepseek_v3, internlm, jamba, phi4_mini_json, pythonic |
REASONING_PARSER |
None |
str |
Parser for reasoning-capable models (enables reasoning mode). Examples: deepseek_r1, qwen3, granite, hunyuan_a13b. Leave unset to disable. |
TRUST_REQUEST_CHAT_TEMPLATE |
false |
bool |
Allow clients to send custom chat templates in API requests. Security consideration: Only enable if you trust your API clients. |
RETURN_TOKENS_AS_TOKEN_IDS |
false |
bool |
Return token IDs instead of decoded text strings in responses. |
EXCLUDE_TOOLS_WHEN_TOOL_CHOICE_NONE |
false |
bool |
Exclude tool definitions from the prompt when tool_choice is set to none. |
ENABLE_PROMPT_TOKENS_DETAILS |
false |
bool |
Include detailed prompt token information in API responses. |
ENABLE_FORCE_INCLUDE_USAGE |
false |
bool |
Always include usage statistics in API responses, even when not requested. |
ENABLE_LOG_OUTPUTS |
false |
bool |
Log model outputs for debugging purposes. |
LOG_ERROR_STACK |
false |
bool |
Include full stack traces in error responses for debugging. |
| Variable | Default | Type/Choices | Description |
|---|---|---|---|
MAX_CONCURRENCY |
30 |
int |
Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM, this is for improving scaling/load balancing efficiency |
DISABLE_LOG_STATS |
False | bool |
Enables or disables vLLM stats logging. |
ENABLE_LOG_REQUESTS |
False | bool |
Enables vLLM request logging. (Replaces deprecated DISABLE_LOG_REQUESTS in vLLM 0.15.0) |
| Variable | Default | Type | Description |
|---|---|---|---|
MODEL_LOADER_EXTRA_CONFIG |
None | dict |
Extra config for model loader. |
PREEMPTION_MODE |
None | str |
If 'recompute', the engine performs preemption-aware recomputation. If 'save', the engine saves activations into the CPU memory as preemption happens. |
PREEMPTION_CHECK_PERIOD |
1.0 | float |
How frequently the engine checks if a preemption happens. |
PREEMPTION_CPU_CAPACITY |
2 | float |
The percentage of CPU memory used for the saved activations. |
DISABLE_LOGGING_REQUEST |
False | bool |
Disable logging requests. |
MAX_LOG_LEN |
None | int |
Max number of prompt characters or prompt ID numbers being printed in log. |
Any vLLM AsyncEngineArgs field can be set via an environment variable using the UPPERCASED field name (the same names vLLM uses). The worker auto-discovers all fields from env — no prefix.
Format: <FIELD_NAME_UPPERCASED>=<value> (e.g. MAX_MODEL_LEN=4096)
Examples:
| Environment Variable | vLLM Engine Arg | Value Example |
|---|---|---|
MAX_MODEL_LEN |
max_model_len |
4096 |
ENFORCE_EAGER |
enforce_eager |
true |
ENABLE_CHUNKED_PREFILL |
enable_chunked_prefill |
true |
NUM_SCHEDULER_STEPS |
num_scheduler_steps |
8 |
TOKENIZER_POOL_SIZE |
tokenizer_pool_size |
4 |
Backward-compat aliases: MODEL_NAME → model, TOKENIZER_NAME → tokenizer, MAX_CONTEXT_LEN_TO_CAPTURE → max_seq_len_to_capture, MODEL_REVISION → revision.
Notes:
- Only valid
AsyncEngineArgsfields are applied. Unknown keys are silently ignored. - Values are automatically cast to the correct type (
int,float,bool,str, or JSON fordict/list/tuple). - For a full list of available engine args, see the vLLM AsyncEngineArgs documentation.
These variables are used when building custom Docker images with models baked in:
| Variable | Default | Type | Description |
|---|---|---|---|
BASE_PATH |
/runpod-volume |
str |
Storage directory for huggingface cache and model |
WORKER_CUDA_VERSION |
12.1.0 |
str |
CUDA version for the worker image |
| Old Variable | New Variable | Note |
|---|---|---|
MAX_CONTEXT_LEN_TO_CAPTURE |
MAX_SEQ_LEN_TO_CAPTURE |
Use new variable name |
kv_cache_dtype=fp8_e5m2 |
kv_cache_dtype=fp8 |
Simplified fp8 format |
USE_V2_BLOCK_MANAGER |
(removed) | V2 block manager is now the default in vLLM 0.13.0, setting ignored |
VLLM_ATTENTION_BACKEND |
ATTENTION_BACKEND |
Use new env var name (old still works with deprecation warning) |
DISABLE_LOG_REQUESTS |
ENABLE_LOG_REQUESTS |
Inverted logic in vLLM 0.15.0 (old still works with deprecation warning) |