KVSwitch offloads prefix-aware routing for distributed LLM inference from centralized layer-7 routers into the network fabric itself. A client SDK tokenizes each prompt, computes a cumulative prefix hash chain, and embeds it in a compact shim header. Programmable switches perform hierarchical TCAM matching and per-prefix weighted ECMP routing at line rate, coordinated by a cache-event-driven SDN controller that keeps the forwarding state synchronized with distributed KV cache state.
Architecture overview of KVSwitch.
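The prefix hash chain and the switch-side lookup described in the overview can be sketched in a few lines of Python. This is an illustrative sketch only, not the repository's SDK or P4 program: the block size, the 8-byte BLAKE2b digest, the function names, and the use of a dictionary in place of the TCAM are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple

BLOCK_SIZE = 4  # hypothetical: number of tokens covered by each chain entry


def prefix_hash_chain(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one digest per full block; entry i covers tokens [0, (i + 1) * block_size)."""
    chain: List[bytes] = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        h = hashlib.blake2b(digest_size=8)
        h.update(prev)  # chaining makes each digest depend on the whole prefix
        for tok in token_ids[start:start + block_size]:
            h.update(int(tok).to_bytes(4, "big"))
        prev = h.digest()
        chain.append(prev)
    return chain


def deepest_match(chain: Sequence[bytes], table: Dict[bytes, object]) -> Optional[int]:
    """Index of the longest prefix present in the table (a dict stands in for the TCAM)."""
    for depth in range(len(chain) - 1, -1, -1):
        if chain[depth] in table:
            return depth
    return None


def pick_worker(weighted: Sequence[Tuple[str, int]], flow_key: int) -> str:
    """Deterministic weighted choice: map flow_key onto weight-sized slots."""
    total = sum(weight for _, weight in weighted)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    slot = flow_key % total
    for worker, weight in weighted:
        if slot < weight:
            return worker
        slot -= weight
    raise AssertionError("unreachable for non-negative weights")
```

Two prompts that share their first block produce the same first chain entry, so a table entry installed for that digest sends both to the same weighted group of workers. Hashing over the previous digest rather than over each block independently is what ties an entry to the full prefix instead of to a block that merely appears somewhere in the prompt.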
This repository contains a research prototype built on BMv2 and Mininet with a trace-driven LLM serving simulator. On its evaluation topology, KVSwitch reduces median TTFT by up to 27% and tail TTFT by up to 76% relative to a state-of-the-art layer-7 prefix-aware router.
Experiments run inside Docker with a pre-built Mininet + BMv2 image. No local dependency installation is required — Docker pulls the image automatically on first run.
- Docker (with --privileged support)
- Compiled P4 artifacts in build/p4/kvswitch/ (included in the repo)
- ShareGPT dataset at data/ShareGPT_V3_unfiltered_cleaned_split.json (download)
- Access to meta-llama/Llama-3.2-3B-Instruct on HuggingFace (request access)
- HuggingFace model cache (for tokenizer; downloaded on first run)
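Before launching a long run, it can help to confirm that the file-based prerequisites listed above are in place. The following is a minimal sketch, not a script shipped with the repository: the function names are hypothetical, and it only checks the two paths and the presence of a docker executable. It cannot verify --privileged support or HuggingFace model access.

```python
import shutil
from pathlib import Path
from typing import List

# Paths taken from the prerequisites list, relative to the repository root.
REQUIRED_PATHS = [
    "build/p4/kvswitch",
    "data/ShareGPT_V3_unfiltered_cleaned_split.json",
]


def missing_paths(repo_root: str = ".") -> List[str]:
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


def missing_prerequisites(repo_root: str = ".") -> List[str]:
    """Missing paths, plus a note if no docker executable is found on PATH."""
    missing = missing_paths(repo_root)
    if shutil.which("docker") is None:
        missing.append("docker (not found on PATH)")
    return missing


if __name__ == "__main__":
    for item in missing_prerequisites():
        print(f"missing: {item}")
```

Run it from the repository root; no output means the checked items were found.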
We provide evaluation results (download) containing all raw results, profiling traces, experimental logs, and generated figures.
To use the pre-computed results, extract the archive into the repository root:
unzip kvswitch-results.zip -d .
This populates results/ with experiment data and figures. You can then rerun the analysis notebooks (see notebooks) without running experiments:
jupyter notebook notebooks/result_analysis.ipynb
To run the experiments yourself:
bash exp/run_exp.sh
Results are saved to results/exp/. To run individual experiments, pass their identifiers:
bash exp/run_exp.sh 1 # Microbenchmark: routing overhead
bash exp/run_exp.sh 2 # End-to-end: rate sweep
bash exp/run_exp.sh 3a # Ablation: ECMP vs pinning
bash exp/run_exp.sh 3b # Ablation: warm-up impact
bash exp/run_exp.sh 4a # Sensitivity: prefix sharing ratio
bash exp/run_exp.sh 4b # Sensitivity: KV cache capacity
bash exp/run_exp.sh 4c # Sensitivity: number of workers
bash exp/run_exp.sh 2 4b # Multiple experiments
For a custom evaluation, invoke run_eval.sh directly with the baselines and load to test:
bash exp/run_eval.sh --baselines l4_ecmp,l7_rr,l7_pa,kvswitch \
    --num-requests 200 --request-rate 10
Pass --build to run_eval.sh to recompile the P4 program and rebuild the Docker image:
bash exp/run_eval.sh --build --baselines kvswitch --num-requests 50
Pre-compiled artifacts are committed in the repo for zero-setup reproduction. If you modify p4/, recompile and recommit:
bash scripts/compile_p4.sh p4/kvswitch.p4 build/p4/kvswitch
The script uses a locally installed p4c if available, otherwise falls back to a p4c Docker image (built from p4lang/p4c, tagged as p4c:latest).
Local installation is only needed for development (editing code, running tests, profiling). This is not required to run experiments, which use Docker.
bash scripts/install.sh
This creates a Python 3.12 virtual environment in .venv and installs all dependencies (including vLLM for GPU profiling). It may take a long time due to the vLLM build.
To run the tests and format the code:
uv run pytest tests/ -q
uv run bash scripts/format.sh --all