This software enables you to join multiple local LLM API servers to the KoboldAI Horde as a Scribe worker performing distributed text generation. Obtain an API Key here to accumulate kudos, the virtual currency of the Horde, which can be spent on tasks such as image generation and interrogation.
It is the successor to LlamaCpp-Horde-Bridge, rewritten in NodeJS. If upgrading, note that the names of some configuration arguments have changed.
- Multi-threaded processing: generate multiple jobs in parallel when the backend supports it
- Asynchronous job submission: pop a new job as soon as the previous one has finished generating, and submit results in the background
- Context-length enforcement: incoming requests with prompts larger than the context size are automatically shrunk to fit, as sketched below
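A minimal sketch of the idea in plain JavaScript (the four-characters-per-token estimate, the helper names, and the front-trimming strategy are illustrative assumptions, not the bridge's actual implementation):

// Shrink a prompt to fit the advertised context window, leaving
// room for the requested generation length.
function estimateTokens(text) {
  return Math.ceil(text.length / 4); // rough heuristic, not a real tokenizer
}

function shrinkPrompt(prompt, ctx, maxLength) {
  const budget = ctx - maxLength; // tokens available for the prompt itself
  if (estimateTokens(prompt) <= budget) return prompt;
  return prompt.slice(-budget * 4); // keep the most recent text
}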
Supported inference REST API servers:
- llama.cpp server
- koboldcpp
- While native Horde support exists in KoboldCpp, it does not support the throughput-enhancing features of this bridge.
- vllm
- sglang backend
- tabbyAPI
- Supported both natively and via KoboldAI-compatible endpoint
- While native Horde support exists, it does not support the throughput-enhancing features of this bridge.
- aphrodite
- Supported via the KoboldAI-compatible endpoint
See below for example configurations for each engine.
Medusa-Bridge requires NodeJS v22; installation via nvm is recommended.
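For example, with nvm installed:
nvm install 22
nvm use 22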
Execute npm ci to install dependencies.
Run node index.js to see the default configuration:
┌───────────────────┬────────────────────────────────┐
│ (index) │ Values │
├───────────────────┼────────────────────────────────┤
│ clusterUrl │ 'https://stablehorde.net' │
│ workerName │ 'Automated Instance #73671757' │
│ apiKey │ '0000000000' │
│ priorityUsernames │ │
│ serverUrl │ 'http://localhost:8000' │
│ serverEngine │ null │
│ model │ null │
│ ctx │ null │
│ maxLength │ '512' │
│ threads │ '1' │
└───────────────────┴────────────────────────────────┘
Run node index.js --help to see the command line equivalents and descriptions for each option:
Usage: index [options]
Options:
-f, --config-file <file> Load config from .json file
-c, --cluster-url <url> Set the Horde cluster URL (default: "https://stablehorde.net")
-w, --worker-name <name> Set the Horde worker name (default: "Automated Instance #37508138")
-a, --api-key <key> Set the Horde API key (default: "0000000000")
-p, --priority-usernames <usernames> Set priority usernames, comma-separated (default: [])
-s, --server-url <url> Set the REST Server URL (default: "http://localhost:8000")
-e, --server-engine <engine> Set the REST Server API type (default: null)
-sm, --server-model <server-model> Set the model requested from API server (default: null)
-m, --model <model> Set the model name offered to Horde (default: null)
-x, --ctx <ctx> Set the context length (default: null)
-l, --max-length <length> Set the max generation length (default: "512")
-t, --threads <threads> Number of parallel threads (default: "1")
--timeout <timeout> How long to wait for generation to complete (sec) (default: "60")
-h, --help display help for command
The -f / --config-file option lets you group configuration into a named JSON file while still allowing command-line overrides, as in the example below.
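For example, given a file worker.json (the filename and values are illustrative):
{
  "apiKey": "<your api key>",
  "workerName": "MyWorker",
  "serverEngine": "llamacpp",
  "serverUrl": "http://localhost:8000"
}
Running node index.js -f worker.json -t 2 loads these settings but runs with two parallel threads, since flags passed on the command line take precedence.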
See this reddit post: using this trick, older Pascal GPUs (GTX 10x0, P40, K80) can be almost twice as fast, particularly at long contexts.
Compile llama.cpp with make LLAMA_CUBLAS=1 LLAMA_CUDA_FORCE_MMQ=1 to get a Pascal-optimized server binary.
Example server command: ./server ~/models/openhermes-2.5-mistral-7b.Q5_0.gguf -ngl 99 -c 4096
Example configuration file:
{
"apiKey": "<your api key>",
"workerName": "<your worker name>",
"serverEngine": "llamacpp",
"serverUrl": "http://localhost:8000",
"model": "llamacpp/openhermes-2.5-mistral-7b.Q5_0",
"ctx": 4096
}
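Save the file as, e.g., llamacpp.json (any filename works) and start the bridge with node index.js -f llamacpp.json.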
Example server command: ./koboldcpp-linux-x64 ~/models/openhermes-2.5-mistral-7b.Q5_0.gguf --usecublas 0 mmq --gpulayers 99 --contextsize 4096 --quiet
Example configuration file:
{
"apiKey": "<your api key>",
"workerName": "<your worker name>",
"serverEngine": "koboldcpp",
"serverUrl": "http://localhost:5001",
"model": "koboldcpp/openhermes-2.5-mistral-7b.Q5_0",
"ctx": 4096
}
TODO (vllm example)
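In the meantime, a hedged sketch: the "vllm" engine identifier and the serverModel value are assumptions, not confirmed by the project; vLLM's OpenAI-compatible server listens on port 8000 by default.
Example server command: python3 -m vllm.entrypoints.openai.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ --max-model-len 8192
{
  "apiKey": "<your api key>",
  "workerName": "<your worker name>",
  "serverEngine": "vllm",
  "serverUrl": "http://localhost:8000",
  "serverModel": "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ",
  "model": "vllm/OpenHermes-2.5-Mistral-7B-GPTQ",
  "ctx": 8192
}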
TODO (sglang example)
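Likewise for sglang, a hedged sketch (the "sglang" engine identifier and model name are assumptions; sglang's launch_server listens on port 30000 by default):
Example server command: python3 -m sglang.launch_server --model-path teknium/OpenHermes-2.5-Mistral-7B --port 30000
{
  "apiKey": "<your api key>",
  "workerName": "<your worker name>",
  "serverEngine": "sglang",
  "serverUrl": "http://localhost:30000",
  "model": "sglang/OpenHermes-2.5-Mistral-7B",
  "ctx": 8192
}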
TODO (tabbyAPI example)
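And for tabbyAPI, a hedged sketch (the "tabbyapi" engine identifier and model name are assumptions; tabbyAPI listens on port 5000 by default):
{
  "apiKey": "<your api key>",
  "workerName": "<your worker name>",
  "serverEngine": "tabbyapi",
  "serverUrl": "http://localhost:5000",
  "model": "tabbyapi/OpenHermes-2.5-Mistral-7B-exl2",
  "ctx": 8192
}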
Example server command: python3 -m aphrodite.endpoints.kobold.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ --max-length 512 --max-model-len 8192
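Example configuration file, a hedged sketch: since aphrodite is reached through its KoboldAI-compatible endpoint, the "koboldcpp" engine identifier is assumed here, and the serverUrl port must match wherever your aphrodite server actually listens (2242 is a common default):
{
  "apiKey": "<your api key>",
  "workerName": "<your worker name>",
  "serverEngine": "koboldcpp",
  "serverUrl": "http://localhost:2242",
  "model": "aphrodite/OpenHermes-2.5-Mistral-7B-GPTQ",
  "ctx": 8192
}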