X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Authors: Peiwen Sun*¹, Xudong Lu*¹, Huadai Liu*³, Yang Bo², Dongming Wu¹, Huankang Guan², Minghong Cai¹, Jinpeng Chen², Xintong Guo², Shuhan Li², Fang Liu², Rui Liu², and Xiangyu Yue†¹.

Affiliations: ¹MMLab, The Chinese University of Hong Kong; ²Huawei Inc.; ³Independent.

Official inference and evaluation code for X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding. This package runs online multi-stream video QA with local vLLM checkpoints or hosted API models.

[Figure: X-Stream multi-stream scenarios]

Introduction

X-Stream is a multi-stream streaming understanding benchmark for evaluating how multimodal large language models handle concurrent video streams. It contains 4,220 curated QA pairs across 932 videos and covers 11 subtasks in multi-window, multi-view, and multi-device scenarios. The paper frames current MLLMs as naive multiplexers and studies spatial, temporal, and semantic ways to combine multiple streams into one model-consumable token sequence.

The inference/ package keeps the runtime simple: most users only need run.sh.

Pipeline

[Figure: X-Stream multiplexing pipeline]

X-Stream evaluates online inference where synchronized streams are multiplexed into a single model-consumable sequence under a fixed average video-token rate. The runner supports spatial division for tiled video inputs, time division for stream-wise interleaving, and semantic division or token-level pruning for reducing redundant visual content before or inside the model call.

Supported multi-stream modes:

| Mode | Multiplexing term | Meaning | Input file |
| --- | --- | --- | --- |
| pixel | Spatial Division Multiplexing | Uses the pre-merged video input and sends each tiled visual stream as one spatial canvas without multi-stream segment expansion. | eval_relative_merged_phostream_type.jsonl |
| time | Time Division Multiplexing | Splits step-based video placeholders into segments and interleaves them as Stream 1: A1, Stream 2: B1, Stream 1: A2, Stream 2: B2, and so on. | eval_relative_multi_phostream_type.jsonl |
| code, code_adaptive | Extra Exploration | code keeps the stream segment with the larger video-change score and marks the others as unchanged, while code_adaptive scales each changed stream's FPS between 0x and 2x based on that score. | eval_relative_multi_phostream_type.jsonl |
| cdpruner | Semantic Division Multiplexing (Dropping frames) | Reuses time-style interleaving, then applies client-side media selection with CDPruner-style instruction relevance and diversity before the model call. | eval_relative_multi_phostream_type.jsonl |
| surge | Extra Exploration | Reuses time-style interleaving, then applies client-side SURGE-style temporal surprise selection before the model call. | eval_relative_multi_phostream_type.jsonl |
| cdpruner_token | Semantic Division Multiplexing | Reuses time-style interleaving and forwards pruning metadata to the local vLLM worker, where the X-Stream pruner performs patch-level CDPruner token selection inside video frames. | eval_relative_multi_phostream_type.jsonl |
| surge_token | Extra Exploration | Reuses time-style interleaving and forwards pruning metadata to the local vLLM worker, where the X-Stream pruner performs patch-level SURGE token selection inside video frames. | eval_relative_multi_phostream_type.jsonl |
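
The interleaving order used by the time mode can be illustrated with a minimal sketch. This is not the runner's implementation: the function name interleave_streams and the use of plain strings as segments are assumptions made for illustration only.

```python
from itertools import zip_longest


def interleave_streams(streams: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Round-robin segments across streams: A1, B1, A2, B2, ...

    `streams` maps a stream label to its ordered list of segments.
    Shorter streams simply stop contributing once they run out.
    """
    names = list(streams)
    ordered: list[tuple[str, str]] = []
    # zip_longest pads exhausted streams with None so longer ones keep going.
    for step in zip_longest(*(streams[name] for name in names)):
        for name, segment in zip(names, step):
            if segment is not None:
                ordered.append((name, segment))
    return ordered


example = interleave_streams(
    {"Stream 1": ["A1", "A2"], "Stream 2": ["B1", "B2"]}
)
print(example)
```

The result keeps every segment tagged with its stream label, which mirrors the "Stream 1: A1, Stream 2: B1" ordering described in the table.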

cdpruner_token and surge_token are only available with local vLLM. Hosted API models cannot run patch-level token pruning because the pruning hook must be installed inside the vLLM worker.
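
The mode, input-file, and backend constraints above can be checked before launching a run. The following is a hedged sketch of such a preflight check; check_mode and its uses_local_vllm argument are hypothetical helpers and are not part of run.sh.

```python
MERGED_INPUT = "eval_relative_merged_phostream_type.jsonl"
MULTI_INPUT = "eval_relative_multi_phostream_type.jsonl"

# Mode -> input file, copied from the table of supported modes.
INPUT_FOR_MODE = {
    "pixel": MERGED_INPUT,
    "time": MULTI_INPUT,
    "code": MULTI_INPUT,
    "code_adaptive": MULTI_INPUT,
    "cdpruner": MULTI_INPUT,
    "surge": MULTI_INPUT,
    "cdpruner_token": MULTI_INPUT,
    "surge_token": MULTI_INPUT,
}

# Patch-level pruning needs the hook inside the local vLLM worker.
TOKEN_MODES = {"cdpruner_token", "surge_token"}


def check_mode(mode: str, uses_local_vllm: bool) -> str:
    """Return the expected input file for `mode`, or raise ValueError."""
    if mode not in INPUT_FOR_MODE:
        raise ValueError(f"unknown --multi-stream mode: {mode!r}")
    if mode in TOKEN_MODES and not uses_local_vllm:
        raise ValueError(f"{mode} requires local vLLM, not a hosted API model")
    return INPUT_FOR_MODE[mode]


print(check_mode("pixel", uses_local_vllm=False))
```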

Environment Setup

1. Common Base Environment

Use this base setup before running inference.

Requirements:

  • Linux.
  • Python >=3.12,<3.13.
  • uv >= 0.4.
  • ffmpeg and ffprobe on PATH for video probing and segment-cache generation.
  • NVIDIA GPU and CUDA-compatible drivers for local vLLM runs. API-only runs and cache prewarming can run without GPUs.

Install the project environment:

git clone https://github.com/PeiwenSun2000/X-Stream.git
cd X-Stream
uv sync --extra local

If you are working from a monorepo checkout where this package lives under an inference/ subdirectory, run the same uv sync --extra local command from that inference/ directory instead. If your default python3 is not Python 3.12, point uv at a 3.12 interpreter explicitly:

UV_PYTHON=/path/to/python3.12 uv sync --extra local
uv run python --version

Use uv run for commands:

uv run bash run.sh --help

Or activate the environment manually:

source .venv/bin/activate
bash run.sh --help

Create a local model configuration:

cp configs/models.example.json configs/models.json

For reproducible environment comparisons, prefer uv over invoking pip directly because uv-created virtual environments may not include the pip Python module:

uv pip freeze --python .venv/bin/python > env.freeze.txt

2. Download Data

Download the X-Stream dataset from Hugging Face. The dataset is distributed as JSONL manifests plus compressed video archives. In a monorepo checkout, the examples place the downloaded dataset root directly at the repository-level data/, so commands launched from inference/ can refer to it as ../data. In a standalone public X-Stream checkout, either place the dataset as a sibling directory and keep using ../data, or place it at X-Stream/data and change the example --input and --video-root paths from ../data to data.

cd X-Stream
pip install -U huggingface_hub
huggingface-cli download spw2000/X-stream \
  --repo-type dataset \
  --local-dir data

If you also download video archives, install zstd and extract the archives from that dataset root:

sudo apt-get update
sudo apt-get install -y zstd
python data/scripts/extract_archives.py --dataset-root data

To extract only the evaluation split or only the lightweight 2 fps model-input videos:

python data/scripts/extract_archives.py --dataset-root data --splits eval
python data/scripts/extract_archives.py --dataset-root data --kinds reencoded
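
After downloading, it can help to confirm that the two JSONL task files used by the run.sh examples are present at the dataset root. This is a minimal sketch; missing_manifests is a hypothetical helper, and it assumes the files sit directly under the dataset root, as the example --input paths suggest.

```python
from pathlib import Path

# Task files referenced by the run.sh examples in this README.
EXPECTED_MANIFESTS = (
    "eval_relative_merged_phostream_type.jsonl",
    "eval_relative_multi_phostream_type.jsonl",
)


def missing_manifests(dataset_root: str) -> list[str]:
    """List expected JSONL task files that are absent from `dataset_root`."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_MANIFESTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_manifests("data")
    print("missing:", missing if missing else "none")
```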

3. Local Runtime Variables

For local vLLM checkpoint runs:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json
export STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507

If STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507, provide a Qwen-compatible judge endpoint and key:

export QWEN_ENDPOINT=https://<your-qwen-compatible-endpoint>/v1/chat/completions
export QWEN_API_KEY=<your-qwen-api-key>

For hosted API models, export only the provider credentials used by the selected model:

export OPENROUTER_API_KEY=<your-openrouter-api-key>
export OPENAI_API_KEY=<your-openai-api-key>
export GEMINI_API_KEY=<your-gemini-api-key>

For a quick smoke test or inference-only run, add --no-stream-eval to avoid judge credentials.

4. Model Checkpoints

Local vLLM runs need a downloaded checkpoint. The examples below use Qwen3-Omni-30B-A3B-Instruct, whose logical model name must match the key in configs/models.json.

Download the checkpoint with the Hugging Face CLI:

cd X-Stream/inference
pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --local-dir checkpoints/Qwen3-Omni-30B-A3B-Instruct

Then use the checkpoint root as --vllm-model-path:

--vllm-model-path ./checkpoints

The expected structure is:

inference/
`-- checkpoints/
    `-- Qwen3-Omni-30B-A3B-Instruct/
        |-- config.json
        |-- tokenizer_config.json
        |-- generation_config.json
        |-- model-00001-of-*.safetensors
        `-- ...

pipeline.sh also supports pointing --vllm-model-path directly at one checkpoint directory if that directory contains config.json:

--vllm-model-path ./checkpoints/Qwen3-Omni-30B-A3B-Instruct

For other local models, keep the same rule: the directory name under checkpoints/ should match the logical model key passed through --model, or --vllm-model-path should point directly to a checkpoint directory with config.json.
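
The lookup rule described above can be written out as a short sketch. It restates the rule from this section and is not the code in pipeline.sh; resolve_checkpoint is a hypothetical helper.

```python
from pathlib import Path


def resolve_checkpoint(vllm_model_path: str, model: str) -> Path:
    """Apply the README rule for locating a checkpoint directory.

    1. If the path itself contains config.json, use it directly.
    2. Otherwise look for <path>/<model>/config.json.
    """
    root = Path(vllm_model_path)
    if (root / "config.json").is_file():
        return root
    candidate = root / model
    if (candidate / "config.json").is_file():
        return candidate
    raise FileNotFoundError(
        f"no config.json in {root} or {candidate}; "
        "check --vllm-model-path and --model"
    )
```

For example, resolve_checkpoint("./checkpoints", "Qwen3-Omni-30B-A3B-Instruct") would return ./checkpoints/Qwen3-Omni-30B-A3B-Instruct when that directory holds config.json.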

5. (Optional) CLIP Weights For Pruning Modes

pip install -U huggingface_hub
huggingface-cli download openai/clip-vit-large-patch14-336

If you use a custom Hugging Face cache location, set it before downloading and before running inference:

export HF_HOME=/path/to/hf-cache
huggingface-cli download openai/clip-vit-large-patch14-336

Segment-level cdpruner and surge fall back to the default media-limit behavior if CLIP cannot be loaded, so the run may continue but it will not use the intended pruning strategy.

Usage

Start with the API-free smoke test, then add vLLM, hosted API models, or evaluation as needed.

0. (Optional) CPU Cache Prewarming

Long multi-stream videos are split into cached MP4 segments before model inference. This MoviePy and ffmpeg stage is CPU-bound. Prewarm the cache on a CPU machine before a GPU run.

cd X-Stream/inference
uv run bash run.sh \
  --input ../data/eval_relative_multi_phostream_type.jsonl \
  --warm-cache-only \
  --workers 64 \
  --cache-warm-workers 64 \
  --cache-dir ./cache \
  --run-id prewarm_multi \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data

--warm-cache-only does not start vLLM and does not call any model. It only resolves {{video:...}} placeholders and writes the segment cache. Keep --cache-dir, --input, --video-root, and video placeholder parameters identical between prewarming and the later GPU run.

For time, cdpruner, surge, cdpruner_token, and surge_token, --multi-stream can be omitted during prewarming because the same base video segments are generated. For code_adaptive, pass the same --multi-stream code_adaptive as the later run because it may create additional fps-scaled segments.

1. API-Free Smoke Test

This command verifies the Python environment, CLI path, input parsing, output writing, and video-root resolution. It does not start vLLM and does not call any hosted API.

cd X-Stream/inference
uv run bash run.sh \
  --model echo \
  --no-vllm \
  --no-stream-eval \
  --input ../data/eval_relative_merged_phostream_type.jsonl \
  --multi-stream pixel \
  --workers 2 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id smoke_echo_pixel

2. Local vLLM With Merged Pixel Input

Use this when your checkpoint is available on the same machine. /path/to/checkpoints is a placeholder; replace it with your local checkpoint root.

cd X-Stream/inference
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json
export STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507

uv run bash run.sh \
  --model Qwen3-Omni-30B-A3B-Instruct \
  --vllm-model-path /path/to/checkpoints \
  --input ../data/eval_relative_merged_phostream_type.jsonl \
  --multi-stream pixel \
  --tp 2 \
  --workers 4 \
  --max-model-len 200000 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id qwen3omni_pixel

3. Local vLLM With Multi-Stream Input

Switch to eval_relative_multi_phostream_type.jsonl for temporal, semantic, and token-reduction modes.

cd X-Stream/inference
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json
export STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507

uv run bash run.sh \
  --model Qwen3-Omni-30B-A3B-Instruct \
  --vllm-model-path /path/to/checkpoints \
  --input ../data/eval_relative_multi_phostream_type.jsonl \
  --multi-stream time \
  --tp 2 \
  --workers 4 \
  --max-model-len 200000 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id qwen3omni_time

To run another non-token mode, change only --multi-stream and --run-id, for example code, code_adaptive, cdpruner, or surge.

4. Local vLLM With Token-Level Pruning

surge_token and cdpruner_token require a local vLLM backend. Do not pass --no-vllm, and do not use these modes with hosted API models.

cd X-Stream/inference
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json

uv run bash run.sh \
  --model Qwen3-Omni-30B-A3B-Instruct \
  --vllm-model-path /path/to/checkpoints \
  --input ../data/eval_relative_multi_phostream_type.jsonl \
  --multi-stream cdpruner_token \
  --xstream-rho 0.25 \
  --tp 2 \
  --workers 1 \
  --max-model-len 200000 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id qwen3omni_cdpruner_token

5. Hosted API Model

Hosted API models do not start vLLM. Pass --no-vllm and make sure the relevant provider key exists in the environment. Patch-level token pruning modes (cdpruner_token and surge_token) are not supported for API models.

cd X-Stream/inference
export FLOW_CONFIG=configs/models.json
export OPENROUTER_API_KEY=<your-openrouter-api-key>

uv run bash run.sh \
  --model qwen3-vl-30b-a3b-instruct \
  --no-vllm \
  --input ../data/eval_relative_merged_phostream_type.jsonl \
  --multi-stream pixel \
  --workers 8 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id api_qwen3vl_pixel

Outputs And Evaluation

Each run writes to:

outputs/<RUN_ID>_<YYYYMMDD-HHMMSS>/

Typical contents:

run_env.json
models.json
output_<input>.jsonl
eval.sh
eval.json
vllm_pids.txt
vllmlogs/
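
Because run directories follow the <RUN_ID>_<YYYYMMDD-HHMMSS> pattern, the most recent run for a given run ID can be located by parsing the timestamp suffix. This is a minimal sketch; parse_run_dir and latest_run are hypothetical helpers, not part of the package.

```python
import re
from datetime import datetime
from pathlib import Path

# <RUN_ID>_<YYYYMMDD-HHMMSS>; the run ID itself may contain underscores.
RUN_DIR = re.compile(r"^(?P<run_id>.+)_(?P<stamp>\d{8}-\d{6})$")


def parse_run_dir(name: str):
    """Split a run directory name into (run_id, datetime), or return None."""
    match = RUN_DIR.match(name)
    if match is None:
        return None
    try:
        when = datetime.strptime(match["stamp"], "%Y%m%d-%H%M%S")
    except ValueError:  # digits present but not a valid date/time
        return None
    return match["run_id"], when


def latest_run(output_root: str, run_id: str):
    """Return the newest directory for `run_id` under `output_root`, or None."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    best = None
    for child in root.iterdir():
        parsed = parse_run_dir(child.name) if child.is_dir() else None
        if parsed and parsed[0] == run_id and (best is None or parsed[1] > best[0]):
            best = (parsed[1], child)
    return best[1] if best else None
```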

Useful flags:

  • --resume: continue a compatible incomplete run.
  • --no-stream-eval: skip stream-eval and write only raw model outputs.
  • --stream-eval-judger MODEL: choose the judge model.
  • --output-dir DIR: change the output root.
  • --warm-cache-only: pre-generate video segment cache on CPU and exit.
  • --cache-warm-workers N: set CPU prewarming concurrency.

run_env.json records resolved runtime paths and options. Use it to reproduce a run or inspect which config, input, cache directory, and multi-stream mode were used.
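
To compare two runs, for instance a cache-prewarming run and the later GPU run, the two run_env.json files can be diffed. The sketch below assumes run_env.json is a single JSON object with top-level keys; the key names are not documented here, so the helper compares whatever keys it finds unless the caller passes an explicit list. load_run_env and diff_options are hypothetical helpers.

```python
import json
from pathlib import Path


def load_run_env(run_dir: str) -> dict:
    """Read run_env.json from a run directory (assumed to be a JSON object)."""
    return json.loads((Path(run_dir) / "run_env.json").read_text(encoding="utf-8"))


def diff_options(first: dict, second: dict, keys=None) -> dict:
    """Map each differing key to its (first, second) values.

    With keys=None every top-level key from either run is compared.
    """
    if keys is None:
        keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }
```

An empty result means the compared options match; any entry points at a setting that changed between the two runs.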

Directory Structure

inference/
|-- README.md
|-- LICENSE
|-- assets/
|   |-- logo.png                    # X-Stream logo used in this README
|   |-- teaser.png                  # Scenario overview figure
|   `-- multiplexing_pipeline.png   # Online inference and multiplexing pipeline
|-- run.sh                         # Main entrypoint for inference runs
|-- pipeline.sh                    # vLLM startup, resume, evaluation, cleanup
|-- pyproject.toml                 # uv environment and dependency pins
|-- configs/
|   |-- models.example.json        # Public model-config template
|   `-- models.json                # Local model config
|-- prompts/
|   |-- streaming_prompt/
|   |   `-- system_prompt.txt      # Streaming QA prompt resolved by {{file:system_prompt.txt}}
|   `-- general_prompt/
|       `-- system_prompt.txt      # General-purpose prompt for custom runs
|-- tools/
|-- third_party/
|   |-- MLLMFlow
|   |-- ModelHub
|   |-- stream-eval
|   `-- xstream_vllm_pruner
|-- outputs/                       # Generated runs
`-- cache/                         # Generated video segment cache

Token Rate Guidelines

Different model providers account for video tokens differently. If a run exceeds the model's video-token or token-per-second budget, reduce the input load by lowering the resolution, lowering the FPS, shortening clips, or changing playback speed according to the model family:

  1. Gemini: Fixed 258 tokens/sec (independent of resolution/FPS).
  2. GPT: 85 tokens/frame + 170 tokens per 512 $\times$ 512 tile.
  3. Qwen3+: 28 $\times$ 28 pixel patches per token with token merging.

Use these rules to estimate the effective token rate for your target model, then choose the resolution, FPS, clip length, or playback-speed adjustment that keeps the input within that model's limit.
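
The three rules can be turned into a rough estimator. This is a sketch under stated assumptions: it ignores any provider-side resizing, treats fps as the sampled frame rate, reads the 28 x 28 figure as the per-token footprint after merging, and does not model temporal merging across frames. The function names are illustrative.

```python
import math


def gemini_tokens_per_second() -> int:
    """Gemini rule: fixed rate, independent of resolution and FPS."""
    return 258


def gpt_tokens_per_second(width: int, height: int, fps: float) -> float:
    """GPT rule: 85 tokens per frame plus 170 per 512 x 512 tile."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return (85 + 170 * tiles) * fps


def qwen_tokens_per_second(width: int, height: int, fps: float) -> float:
    """Qwen3+ rule: one token per 28 x 28 pixel patch (temporal merging ignored)."""
    tokens_per_frame = math.ceil(width / 28) * math.ceil(height / 28)
    return tokens_per_frame * fps


print(gemini_tokens_per_second())            # 258
print(gpt_tokens_per_second(1024, 1024, 1))  # (85 + 170 * 4) * 1 = 765
print(qwen_tokens_per_second(448, 448, 2))   # 16 * 16 * 2 = 512
```

Treat the outputs as upper-level estimates for choosing resolution and FPS, not as exact billing or context-length figures.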

FAQ

Which dataset file should I use?

Use ../data/eval_relative_merged_phostream_type.jsonl for pixel. Use ../data/eval_relative_multi_phostream_type.jsonl for all multi-stream modes. Use ../data/eval_relative.json when you need to inspect or validate the dataset manifest itself.

Why does eval_relative.json not appear in run.sh examples?

eval_relative.json is the release manifest. The inference runner consumes the MLLMFlow-ready JSONL task files, so the executable examples use eval_relative_merged_phostream_type.jsonl or eval_relative_multi_phostream_type.jsonl.

ValueError: t:1 must be larger than temporal_factor:2

Change the value of FLOW_VIDEO_URL_FPS from 1 to 2 in each *.jsonl input file.
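
The edit can be scripted line by line. The sketch below makes a loud assumption: it does not know how FLOW_VIDEO_URL_FPS is written inside the task files, so it only rewrites a value of exactly 1 that follows the name within a few punctuation characters (for example = or ": "). bump_fps is a hypothetical helper; inspect a few lines of your JSONL file and review the rewritten output before replacing the original.

```python
import re

# Name, up to four non-word characters (=, quotes, colon, space), then a lone "1".
# The lookahead leaves values such as 10 or 1.5 untouched.
FPS_ONE = re.compile(r"(FLOW_VIDEO_URL_FPS\W{0,4})1(?![\d.])")


def bump_fps(line: str) -> str:
    """Rewrite FLOW_VIDEO_URL_FPS from 1 to 2 in one line of text."""
    return FPS_ONE.sub(r"\g<1>2", line)


# Illustrative inputs only; the real syntax in the task files may differ.
print(bump_fps("FLOW_VIDEO_URL_FPS=1"))
print(bump_fps('"FLOW_VIDEO_URL_FPS": 1'))
```

Writing the result to a new file and passing that file to --input keeps the original task file intact.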

Discussion

  1. Drawback of semantic multiplexing.
In a typical streaming setting, the question is provided only after the frames have already appeared. This means that, when a frame is first observed, the question cannot be used as a query to determine which salient tokens should be retained.

However, most existing methods for identifying salient tokens rely on question-based importance ranking and keep only the tokens deemed important. As a result, they cannot fundamentally address this limitation. We leave this issue for the community to further explore.

Acknowledgements

This inference package builds on ideas and components from the following open-source projects:

  • PhoStream and AURA for streaming video understanding infrastructure and evaluation design.
  • CDPruner and SURGE for visual token pruning.

Citation

@article{sun2026x,
  title={X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding},
  author={Sun, Peiwen and Lu, Xudong and Liu, Huadai and Bo, Yang and Wu, Dongming and Guan, Huankang and Cai, Minghong and Chen, Jinpeng and Guo, Xintong and Li, Shuhan and others},
  journal={arXiv preprint arXiv:2606.02482},
  year={2026}
}

License

This inference package is released under the MIT License. Third-party packages under third_party/ keep their original licenses and notices.
