kaiitunnz/kvswitch

KVSwitch: Accelerating Distributed LLM Inference with In-Network Prefix-Aware Routing

KVSwitch offloads prefix-aware routing for distributed LLM inference from centralized layer-7 routers into the network fabric itself. A client SDK tokenizes each prompt, computes a cumulative prefix hash chain, and embeds it in a compact shim header. Programmable switches perform hierarchical TCAM matching and per-prefix weighted ECMP routing at line rate, coordinated by a cache-event-driven SDN controller that keeps the forwarding state synchronized with distributed KV cache state.
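The client-side step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the KVSwitch SDK: the block size of 16 tokens, the 32-bit BLAKE2b digest, the cap of four hashes, the header layout (one count byte followed by fixed-width hashes), and the function names `prefix_hash_chain` and `pack_shim_header`.

```python
import hashlib
import struct

BLOCK_SIZE = 16   # tokens per hashed block (assumed, not from the repo)
MAX_HASHES = 4    # hashes carried in the header (assumed, not from the repo)


def prefix_hash_chain(token_ids, block_size=BLOCK_SIZE):
    """Return one 32-bit hash per full block; hash i covers blocks 0..i."""
    chain = []
    running = hashlib.blake2b(digest_size=4)
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Feed the block into the running hash so each digest is cumulative.
        running.update(struct.pack(f"!{len(block)}I", *block))
        chain.append(int.from_bytes(running.digest(), "big"))
    return chain


def pack_shim_header(chain, max_hashes=MAX_HASHES):
    """Pack a count byte followed by max_hashes 32-bit hashes, zero-padded."""
    kept = chain[:max_hashes]
    padded = kept + [0] * (max_hashes - len(kept))
    return struct.pack(f"!B{max_hashes}I", len(kept), *padded)
```

Because each hash covers the whole prefix up to its block, two prompts that share their first N blocks produce identical first N hashes, which is the property a prefix-matching table can exploit.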

Figure: Architecture overview of KVSwitch.
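As a rough software model of the switch-side behavior described above (deepest-prefix match followed by weighted selection among workers), the sketch below may help. It is not the P4 program in this repository: the table keyed by `(depth, hash)`, the `(worker, weight)` group format, and the name `select_worker` are all hypothetical.

```python
def select_worker(chain, table, flow_hash, default_group):
    """Pick a worker for the deepest prefix hash that has a table entry.

    chain:         cumulative prefix hashes, shallowest first
    table:         dict mapping (depth, hash) -> list of (worker, weight)
    flow_hash:     integer derived from the flow, used for a deterministic pick
    default_group: list of (worker, weight) used when nothing matches
    """
    group = default_group
    for depth in range(len(chain) - 1, -1, -1):  # try the deepest prefix first
        entry = table.get((depth, chain[depth]))
        if entry is not None:
            group = entry
            break

    # Weighted choice: each worker owns `weight` consecutive slots.
    total = sum(weight for _, weight in group)
    slot = flow_hash % total
    for worker, weight in group:
        if slot < weight:
            return worker
        slot -= weight
```

A worker with weight 3 in a group of total weight 4 receives three of every four slot values, so the controller could shift load simply by rewriting the weights of an entry.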

This repository contains a research prototype built on BMv2 and Mininet with a trace-driven LLM serving simulator. On its evaluation topology, KVSwitch reduces median TTFT by up to 27% and tail TTFT by up to 76% relative to a state-of-the-art layer-7 prefix-aware router.
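For readers who want to recompute such reductions from raw per-request TTFT samples, a minimal sketch follows. It assumes that "tail" means the 99th percentile and uses a nearest-rank percentile; the analysis notebooks in this repository may define either differently. The helper names are hypothetical.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def relative_reduction(baseline, candidate):
    """Fraction by which candidate improves on baseline (0.27 means 27%)."""
    return (baseline - candidate) / baseline


def ttft_reductions(baseline_ttfts, candidate_ttfts, tail_pct=99):
    """Return (median reduction, tail reduction) between two TTFT samples."""
    median = relative_reduction(
        statistics.median(baseline_ttfts), statistics.median(candidate_ttfts)
    )
    tail = relative_reduction(
        percentile(baseline_ttfts, tail_pct), percentile(candidate_ttfts, tail_pct)
    )
    return median, tail
```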

Running Experiments

Experiments run inside Docker with a pre-built Mininet + BMv2 image. No local dependency installation is required: Docker pulls the image automatically on first run.

Prerequisites

  • Docker (with --privileged support)
  • Compiled P4 artifacts in build/p4/kvswitch/ (included in the repo)
  • ShareGPT dataset at data/ShareGPT_V3_unfiltered_cleaned_split.json (download)
  • Access to meta-llama/Llama-3.2-3B-Instruct on HuggingFace (request access)
  • HuggingFace model cache (for tokenizer; downloaded on first run)

Pre-computed results

We provide evaluation results (download) containing all raw results, profiling traces, experimental logs, and generated figures.

To use the pre-computed results, extract the archive into the repository root:

unzip kvswitch-results.zip -d .

This populates results/ with experiment data and figures. You can then rerun the analysis notebooks (see notebooks) without running experiments:

jupyter notebook notebooks/result_analysis.ipynb

Run all experiments

bash exp/run_exp.sh

Results are saved to results/exp/.

Run specific experiments

bash exp/run_exp.sh 1          # Microbenchmark: routing overhead
bash exp/run_exp.sh 2          # End-to-end: rate sweep
bash exp/run_exp.sh 3a         # Ablation: ECMP vs pinning
bash exp/run_exp.sh 3b         # Ablation: warm-up impact
bash exp/run_exp.sh 4a         # Sensitivity: prefix sharing ratio
bash exp/run_exp.sh 4b         # Sensitivity: KV cache capacity
bash exp/run_exp.sh 4c         # Sensitivity: number of workers
bash exp/run_exp.sh 2 4b       # Multiple experiments

Run a single evaluation

bash exp/run_eval.sh --baselines l4_ecmp,l7_rr,l7_pa,kvswitch \
  --num-requests 200 --request-rate 10
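To run the same evaluation at several request rates, the command can be generated programmatically. The sketch below uses only the flags shown above; the helper names and the example rates are hypothetical, and it prints the commands rather than executing them.

```python
import shlex


def eval_command(baselines, num_requests, request_rate):
    """Build the argv for one run_eval.sh invocation."""
    return [
        "bash", "exp/run_eval.sh",
        "--baselines", ",".join(baselines),
        "--num-requests", str(num_requests),
        "--request-rate", str(request_rate),
    ]


def sweep_commands(baselines, num_requests, rates):
    """Build one command per request rate."""
    return [eval_command(baselines, num_requests, rate) for rate in rates]


if __name__ == "__main__":
    for cmd in sweep_commands(["l7_pa", "kvswitch"], 200, [5, 10, 20]):
        # Swap print for subprocess.run(cmd, check=True) to execute each run.
        print(shlex.join(cmd))
```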

Rebuild the Docker image

Pass --build to run_eval.sh to recompile the P4 program and rebuild the Docker image:

bash exp/run_eval.sh --build --baselines kvswitch --num-requests 50

Recompile P4 artifacts

Pre-compiled artifacts are committed in the repo for zero-setup reproduction. If you modify p4/, recompile and recommit:

bash scripts/compile_p4.sh p4/kvswitch.p4 build/p4/kvswitch

The script uses a locally installed p4c if available, otherwise falls back to a p4c Docker image (built from p4lang/p4c, tagged as p4c:latest).
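The fallback logic can be pictured with the following sketch. It is not the contents of scripts/compile_p4.sh: the `p4c` flags (`--target bmv2 --arch v1model -o`), the Docker mount layout, and the function name `p4c_command` are assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def p4c_command(source, out_dir, local_p4c=shutil.which("p4c")):
    """Return the argv that compiles `source` into `out_dir`.

    Uses the local p4c binary when one is found, otherwise wraps the same
    invocation in a container started from the p4c:latest image.
    """
    # Compiler arguments are an assumption (BMv2 target, v1model architecture).
    args = ["--target", "bmv2", "--arch", "v1model", "-o", out_dir, source]
    if local_p4c:
        return [local_p4c, *args]
    return [
        "docker", "run", "--rm",
        "-v", f"{Path.cwd()}:/work",  # assumed mount point inside the container
        "-w", "/work",
        "p4c:latest", "p4c", *args,
    ]
```

Returning the command instead of running it keeps the decision easy to inspect; pass the result to `subprocess.run(..., check=True)` to compile.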

Development Setup

Local installation is only needed for development (editing code, running tests, profiling). Experiments run in Docker and do not require it.

bash scripts/install.sh

This creates a Python 3.12 virtual environment in .venv and installs all dependencies, including vLLM for GPU profiling. Installation may take a long time because of the vLLM build.

Run tests

uv run pytest tests/ -q

Lint and format

uv run bash scripts/format.sh --all
