This document provides comprehensive documentation for the SPNL command-line interface.
## Usage

```
spnl <COMMAND>
```

### Commands

- `run` - Run a query
- `list` - List available local models with pretty names (requires the `local` feature)
- `vllm` - Bring up vLLM in a Kubernetes cluster or on Google Compute Engine
- `help` - Print help message or help for a specific subcommand

### Global Options

- `-h, --help` - Print help
- `-V, --version` - Print version
## spnl run

Execute a span query with various configuration options.
```
spnl run [OPTIONS]
```

### Options

- `-f, --file <FILE>` - File to process. Required unless `--builtin` is present.
- `-b, --builtin <BUILTIN>` - Builtin query to run (env: `SPNL_BUILTIN`). Required unless `--file` is present. Possible values: `bulkmap`, `email`, `email2`, `email3`, `sweagent`, `gsm8k`, `rag`, `spans`
- `-m, --model <MODEL>` - Generative model to use (env: `SPNL_MODEL`). Default: `ollama/granite3.3:2b`
- `-e, --embedding-model <EMBEDDING_MODEL>` - Embedding model to use (env: `SPNL_EMBEDDING_MODEL`). Default: `ollama/mxbai-embed-large:335m`
- `-t, --temperature <TEMPERATURE>` - Temperature for generation. Default: `0.5`
- `-l, --max-tokens <MAX_TOKENS>` - Maximum completion/generated tokens. Default: `100`
- `-n, --n <N>` - Number of candidates to consider. Default: `5`
- `-p, --prompt <PROMPT>` - Question to pose
- `-d, --document <DOCUMENT>` - Document(s) that will augment the question
- `-x, --max-aug <MAX_AUG>` - Maximum augmentations to add to the query (env: `SPNL_RAG_MAX_MATCHES`)
- `-i, --indexer <INDEXER>` - The RAG indexing scheme. Possible values:
  - `simple-embed-retrieve` - Only perform the initial embedding without any further knowledge graph formation
  - `raptor` - Use the RAPTOR algorithm (https://github.com/parthsarthi03/raptor)
- `-k, --chunk-size <CHUNK_SIZE>` - Chunk size for document processing
- `--vecdb-uri <VECDB_URI>` - Vector database URL. Default: `data/spnl`
- `--shuffle` - Randomly shuffle order of fragments
- `-r, --reverse` - Reverse order
- `--dry-run` - Dry run (do not execute the query)
- `-s, --show-query` - Re-emit the compiled query
- `--time` - Report timing metrics (TTFT and ITL) to stdout
- `-v, --verbose` - Verbose output
### Examples

```shell
# Run a builtin example
spnl run --builtin email2 --verbose

# Run a query from a file
spnl run --file query.json --model ollama/granite3.3:2b

# RAG query with custom settings
spnl run --prompt "What is the main topic?" --document paper.pdf --max-aug 5 --indexer raptor
```

## spnl list

List all available local models with their pretty names and cache status. This command requires the `local` feature to be enabled.
```
spnl list
```

This command displays a table with the following columns:
- NAME - Human-readable pretty name for the model
- CACHED - Whether the model is cached locally (✓ for cached, - for not cached)
- ID - The HuggingFace model identifier
```
NAME          CACHED  ID
gemma2:2b     ✓       unsloth/gemma-2-2b-it-GGUF
llama3.2:1b   -       unsloth/Llama-3.2-1B-Instruct-GGUF
qwen2.5:0.5b  ✓       unsloth/Qwen2.5-0.5B-Instruct-GGUF
```
Once you've identified a model from the list, you can use its pretty name with the `--model` flag:
```shell
# Use a pretty name instead of the full HuggingFace ID
spnl run --builtin email2 --model llama3.2:1b

# The pretty name is more concise than the full ID
spnl run --builtin email2 --model unsloth/Llama-3.2-1B-Instruct-GGUF
```

## spnl vllm

Deploy and manage vLLM inference servers on Kubernetes or Google Compute Engine.
```
spnl vllm <COMMAND>
```

- `up` - Bring up a vLLM server
- `down` - Tear down a vLLM server
- `image` - Manage custom images with vLLM pre-installed (GCE only)
- `patchfile` - Emit vLLM patchfile to stdout
- `help` - Print help message
### spnl vllm up

Deploy a vLLM inference server with a model from HuggingFace.
```
spnl vllm up [OPTIONS] --hf-token <HF_TOKEN> <NAME>
```

- `<NAME>` - Name of the deployment/instance (required)
- `--target <TARGET>` - Target platform. Possible values: `k8s` (Kubernetes), `gce` (Google Compute Engine). Default: `k8s`
- `-n, --namespace <NAMESPACE>` - Namespace for the Kubernetes deployment
- `-m, --model <MODEL>` - Model to serve from HuggingFace (env: `SPNL_MODEL`)
- `-t, --hf-token <HF_TOKEN>` - HuggingFace token for pulling model weights (env: `HF_TOKEN`) (required)
- `--gpus <GPUS>` - Number of GPUs to request. Default: `1`
- `-p, --local-port <LOCAL_PORT>` - Local port for port forwarding. Default: `8000`
- `-r, --remote-port <REMOTE_PORT>` - Remote port for port forwarding. Default: `8000`
When using `--target gce`, the following options are available:

- `--project <PROJECT>` - GCP project ID (env: `GCP_PROJECT` or `GOOGLE_CLOUD_PROJECT`) (required)
- `--service-account <SERVICE_ACCOUNT>` - GCP service account email without `@PROJECT.iam.gserviceaccount.com` (env: `GCP_SERVICE_ACCOUNT`) (required)
- `--region <REGION>` - GCE region (env: `GCE_REGION`). Default: `us-west1`
- `--zone <ZONE>` - GCE zone (env: `GCE_ZONE`). Default: `us-west1-a`
- `--machine-type <MACHINE_TYPE>` - GCE machine type (env: `GCE_MACHINE_TYPE`). Default: `g2-standard-4`
- `--gcs-bucket <GCS_BUCKET>` - GCS bucket for storing artifacts (env: `GCS_BUCKET`). Default: `spnl-test`
- `--spnl-github <SPNL_GITHUB>` - SPNL GitHub repository URL for dev mode (env: `SPNL_GITHUB`)
- `--github-sha <GITHUB_SHA>` - SPNL GitHub commit SHA (env: `GITHUB_SHA`)
- `--github-ref <GITHUB_REF>` - SPNL GitHub ref (branch/tag) (env: `GITHUB_REF`)
- `--vllm-org <VLLM_ORG>` - vLLM organization on GitHub (env: `VLLM_ORG`). Default: `neuralmagic`
- `--vllm-repo <VLLM_REPO>` - vLLM repository name (env: `VLLM_REPO`). Default: `vllm`
- `--vllm-branch <VLLM_BRANCH>` - vLLM branch to use (env: `VLLM_BRANCH`). Default: `llm-d-release-0.4`
When using `--target gce`, you must set:

- `GCP_PROJECT` or `GOOGLE_CLOUD_PROJECT` - Your GCP project ID (required)
- `GCP_SERVICE_ACCOUNT` - Service account name for the instance (required)
- `GOOGLE_APPLICATION_CREDENTIALS` - Path to your service account key file (optional, only needed if not already logged in via `gcloud auth login`)
See GCP authentication docs for more information.
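A minimal GCE environment setup might look like the following sketch; the project name, service account, and key-file path are placeholders, and whether you need the key file at all depends on how you authenticated:

```shell
# Required for --target gce
export GCP_PROJECT=my-project
export GCP_SERVICE_ACCOUNT=my-service-account

# Optional: only needed if not already logged in via `gcloud auth login`
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/spnl-sa.json"
```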
```shell
# Deploy on Kubernetes with default model
spnl vllm up my-deployment --target k8s --hf-token YOUR_HF_TOKEN

# Deploy with custom model and multiple GPUs
spnl vllm up my-deployment --target k8s --model meta-llama/Llama-3.1-8B-Instruct --hf-token YOUR_HF_TOKEN --gpus 2

# Deploy on Google Compute Engine
export GCP_PROJECT=my-project
export GCP_SERVICE_ACCOUNT=my-service-account
spnl vllm up my-deployment --target gce --hf-token YOUR_HF_TOKEN

# Deploy with custom ports
spnl vllm up my-deployment --target k8s --hf-token YOUR_HF_TOKEN --local-port 8080 --remote-port 8000

# Deploy on GCE with custom configuration
spnl vllm up my-deployment --target gce \
  --hf-token YOUR_HF_TOKEN \
  --project my-gcp-project \
  --service-account my-sa \
  --region us-central1 \
  --zone us-central1-a \
  --machine-type g2-standard-8 \
  --gpus 2
```

### spnl vllm down

Remove a vLLM deployment and clean up resources.
```
spnl vllm down [OPTIONS] <NAME>
```

- `<NAME>` - Name of the deployment/instance to tear down (required)
- `--target <TARGET>` - Target platform. Possible values: `k8s` (Kubernetes), `gce` (Google Compute Engine). Default: `k8s`
- `-n, --namespace <NAMESPACE>` - Namespace of the Kubernetes deployment
When using `--target gce`, the same GCE configuration options from `vllm up` are available.
```shell
# Tear down Kubernetes deployment
spnl vllm down my-deployment --target k8s

# Tear down GCE deployment
spnl vllm down my-deployment --target gce

# Tear down with specific namespace
spnl vllm down my-deployment --target k8s --namespace my-namespace
```

### spnl vllm image

Create and manage custom GCE images with vLLM pre-installed.
```
spnl vllm image <COMMAND>
```

- `create` - Create a custom image with vLLM pre-installed
#### spnl vllm image create

Create a custom GCE image with vLLM pre-installed for faster instance startup.
```
spnl vllm image create [OPTIONS]
```

- `--target <TARGET>` - Target platform (only `gce` is supported). Default: `gce`
- `-f, --force` - Force overwrite of an existing image with the same name
- `--image-name <IMAGE_NAME>` - Custom image name (defaults to a name auto-generated from a hash)
- `--image-family <IMAGE_FAMILY>` - Image family. Default: `vllm-spnl`
- `--llmd-version <LLMD_VERSION>` - LLM-D version for the patch file. Default: `0.4.0`
The same GCE configuration options from `vllm up` are available.
```shell
# Create a custom image with default settings
spnl vllm image create --project my-project --service-account my-sa

# Create with custom image name and force overwrite
spnl vllm image create --project my-project --service-account my-sa \
  --image-name my-vllm-image --force

# Create with a custom vLLM branch and LLM-D version
spnl vllm image create --project my-project --service-account my-sa \
  --vllm-branch main --llmd-version 0.5.0
```

### spnl vllm patchfile

Output the vLLM patchfile to stdout.

```
spnl vllm patchfile
```

This command outputs the patchfile used to modify vLLM for SPNL integration.
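Since the patchfile is emitted as plain text on stdout, it can be captured to a file; applying it to a local vLLM checkout with `git apply` is an assumption about how the patch is consumed, not a documented workflow:

```shell
# Capture the patchfile, then apply it to a local vLLM working tree
spnl vllm patchfile > spnl-vllm.patch
git -C vllm apply spnl-vllm.patch
```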
## Environment Variables

The following environment variables can be used to configure SPNL:
- `SPNL_BUILTIN` - Default builtin query to run
- `SPNL_MODEL` - Default generative model
- `SPNL_EMBEDDING_MODEL` - Default embedding model
- `SPNL_RAG_MAX_MATCHES` - Default maximum RAG augmentations
- `HF_TOKEN` - HuggingFace token for model access
- `GCP_PROJECT` or `GOOGLE_CLOUD_PROJECT` - GCP project ID (required for GCE)
- `GCP_SERVICE_ACCOUNT` - GCP service account name (required for GCE)
- `GOOGLE_APPLICATION_CREDENTIALS` - Path to GCP service account key file
- `GCE_REGION` - GCE region (default: `us-west1`)
- `GCE_ZONE` - GCE zone (default: `us-west1-a`)
- `GCE_MACHINE_TYPE` - GCE machine type (default: `g2-standard-4`)
- `GCS_BUCKET` - GCS bucket for artifacts (default: `spnl-test`)
- `SPNL_GITHUB` - SPNL GitHub repository URL (for dev mode)
- `GITHUB_SHA` - SPNL GitHub commit SHA
- `GITHUB_REF` - SPNL GitHub ref (branch/tag)
- `VLLM_ORG` - vLLM organization on GitHub (default: `neuralmagic`)
- `VLLM_REPO` - vLLM repository name (default: `vllm`)
- `VLLM_BRANCH` - vLLM branch to use (default: `llm-d-release-0.4`)
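Assuming the usual precedence in clap-based CLIs, where an explicit flag overrides its environment variable (an assumption here, not a documented guarantee), these variables are convenient for pinning project-wide defaults once, e.g. in a shell profile:

```shell
# Pin defaults once (values shown are illustrative)
export SPNL_MODEL=ollama/granite3.3:2b
export SPNL_EMBEDDING_MODEL=ollama/mxbai-embed-large:335m
export SPNL_RAG_MAX_MATCHES=5

# Subsequent runs pick these up without repeating the flags
spnl run --builtin rag --prompt "What is the main topic?" --document paper.pdf
```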
Some CLI options may require specific features to be enabled at compile time. To build with all features:
```shell
cargo build --all-features
```

Refer to the project's `Cargo.toml` for a complete list of available features.