X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Authors: Peiwen Sun^*1, Xudong Lu^*1, Huadai Liu^*3, Yang Bo², Dongming Wu¹, Huankang Guan², Minghong Cai¹, Jinpeng Chen², Xintong Guo², Shuhan Li², Fang Liu², Rui Liu², and Xiangyu Yue^†1.

Affiliations: ¹MMLab, The Chinese University of Hong Kong; ²Huawei Inc.; ³Independent.

Official inference and evaluation code for X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding. This package runs online multi-stream video QA with local vLLM checkpoints or hosted API models.

Introduction

X-Stream is a multi-stream streaming understanding benchmark for evaluating how multimodal large language models handle concurrent video streams. It contains 4,220 curated QA pairs across 932 videos and covers 11 subtasks in multi-window, multi-view, and multi-device scenarios. The paper frames current MLLMs as naive multiplexers and studies spatial, temporal, and semantic ways to combine multiple streams into one model-consumable token sequence.

The inference/ package keeps the runtime simple: most users only need run.sh.

Pipeline

X-Stream evaluates online inference where synchronized streams are multiplexed into a single model-consumable sequence under a fixed average video-token rate. The runner supports spatial division for tiled video inputs, time division for stream-wise interleaving, and semantic division or token-level pruning for reducing redundant visual content before or inside the model call.

Supported multi-stream modes:

Mode	Multiplexing term	Meaning	Input file
`pixel`	Spatial Division Multiplexing	Uses the pre-merged video input and sends each tiled visual stream as one spatial canvas without multi-stream segment expansion.	`eval_relative_merged_ phostream_type.jsonl`
`time`	Time Division Multiplexing	Splits step-based video placeholders into segments and interleaves them as `Stream 1: A1`, `Stream 2: B1`, `Stream 1: A2`, `Stream 2: B2`, and so on.	`eval_relative_multi_ phostream_type.jsonl`
`code` `code_adaptive`	Extra Exploration	`code` keeps the stream segment with the larger video-change score and marks the others as unchanged, while `code_adaptive` scales each changed stream's FPS between 0x and 2x based on that score.	`eval_relative_multi_ phostream_type.jsonl`
`cdpruner`	Semantic Division Multiplexing (Dropping frames)	Reuses time-style interleaving, then applies client-side media selection with CDPruner-style instruction relevance and diversity before the model call.	`eval_relative_multi_ phostream_type.jsonl`
`surge`	Extra Exploration	Reuses time-style interleaving, then applies client-side SURGE-style temporal surprise selection before the model call.	`eval_relative_multi_ phostream_type.jsonl`
`cdpruner_token`	Semantic Division Multiplexing	Reuses time-style interleaving and forwards pruning metadata to the local vLLM worker, where the X-Stream pruner performs patch-level CDPruner token selection inside video frames.	`eval_relative_multi_ phostream_type.jsonl`
`surge_token`	Extra Exploration	Reuses time-style interleaving and forwards pruning metadata to the local vLLM worker, where the X-Stream pruner performs patch-level SURGE token selection inside video frames.	`eval_relative_multi_ phostream_type.jsonl`

cdpruner_token and surge_token are only available with local vLLM. Hosted API models cannot run patch-level token pruning because the pruning hook must be installed inside the vLLM worker.

Environment Setup

1. Common Base Environment

Use this base setup before running inference.

Requirements:

Linux.
Python >=3.12,<3.13.
uv >= 0.4.
ffmpeg and ffprobe on PATH for video probing and segment-cache generation.
NVIDIA GPU and CUDA-compatible drivers for local vLLM runs. API-only runs and cache prewarming can run without GPUs.

Install the project environment:

git clone https://github.com/PeiwenSun2000/X-Stream.git
cd X-Stream
uv sync --extra local

If you are working from a monorepo checkout where this package lives under an inference/ subdirectory, run the same uv sync --extra local command from that inference/ directory instead. If your default python3 is not Python 3.12, point uv at a 3.12 interpreter explicitly:

UV_PYTHON=/path/to/python3.12 uv sync --extra local
uv run python --version

Use uv run for commands:

uv run bash run.sh --help

Or activate the environment manually:

source .venv/bin/activate
bash run.sh --help

Create a local model configuration:

cp configs/models.example.json configs/models.json

For reproducible environment comparisons, prefer uv over invoking pip directly because uv-created virtual environments may not include the pip Python module:

uv pip freeze --python .venv/bin/python > env.freeze.txt

2. Download Data

Download the X-Stream dataset from Hugging Face. The dataset is distributed as JSONL manifests plus compressed video archives. In a monorepo checkout, the examples place the downloaded dataset root directly at the repository-level data/, so commands launched from inference/ can refer to it as ../data. In a standalone public X-Stream checkout, either place the dataset as a sibling directory and keep using ../data, or place it at X-Stream/data and change the example --input and --video-root paths from ../data to data.

cd X-Stream
pip install -U huggingface_hub
huggingface-cli download spw2000/X-stream \
  --repo-type dataset \
  --local-dir data

If you also download video archives, install zstd and extract the archives from that dataset root:

sudo apt-get update
sudo apt-get install -y zstd
python data/scripts/extract_archives.py --dataset-root data

To extract only the evaluation split or only the lightweight 2 fps model-input videos:

python data/scripts/extract_archives.py --dataset-root data --splits eval
python data/scripts/extract_archives.py --dataset-root data --kinds reencoded

3. Local Runtime Variables

For local vLLM checkpoint runs:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json
export STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507

If STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507, provide a Qwen-compatible judge endpoint and key:

export QWEN_ENDPOINT=https://<your-qwen-compatible-endpoint>/v1/chat/completions
export QWEN_API_KEY=<your-qwen-api-key>

For hosted API models, export only the provider credentials used by the selected model:

export OPENROUTER_API_KEY=<your-openrouter-api-key>
export OPENAI_API_KEY=<your-openai-api-key>
export GEMINI_API_KEY=<your-gemini-api-key>

For a quick smoke test or inference-only run, add --no-stream-eval to avoid judge credentials.

4. Model Checkpoints

Local vLLM runs need a downloaded checkpoint. The examples below use Qwen3-Omni-30B-A3B-Instruct, whose logical model name must match the key in configs/models.json.

Download the checkpoint with the Hugging Face CLI:

cd X-Stream/inference
pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --local-dir checkpoints/Qwen3-Omni-30B-A3B-Instruct

Then use the checkpoint root as --vllm-model-path:

--vllm-model-path ./checkpoints

The expected structure is:

inference/
`-- checkpoints/
    `-- Qwen3-Omni-30B-A3B-Instruct/
        |-- config.json
        |-- tokenizer_config.json
        |-- generation_config.json
        |-- model-00001-of-*.safetensors
        `-- ...

pipeline.sh also supports pointing --vllm-model-path directly at one checkpoint directory if that directory contains config.json:

--vllm-model-path ./checkpoints/Qwen3-Omni-30B-A3B-Instruct

For other local models, keep the same rule: the directory name under checkpoints/ should match the logical model key passed through --model, or --vllm-model-path should point directly to a checkpoint directory with config.json.

5. (Optional) CLIP Weights For Pruning Modes

pip install -U huggingface_hub
huggingface-cli download openai/clip-vit-large-patch14-336

If you use a custom Hugging Face cache location, set it before downloading and before running inference:

export HF_HOME=/path/to/hf-cache
huggingface-cli download openai/clip-vit-large-patch14-336

Segment-level cdpruner and surge fall back to the default media-limit behavior if CLIP cannot be loaded, so the run may continue but it will not use the intended pruning strategy.

Usage

Start with the API-free smoke test, then add vLLM, hosted API models, or evaluation as needed.

0. (Optional) CPU Cache Prewarming

Long multi-stream videos are split into cached MP4 segments before model inference. This MoviePy and ffmpeg stage is CPU-bound. Prewarm the cache on a CPU machine before a GPU run.

cd X-Stream/inference
uv run bash run.sh \
  --input ../data/eval_relative_multi_phostream_type.jsonl \
  --warm-cache-only \
  --workers 64 \
  --cache-warm-workers 64 \
  --cache-dir ./cache \
  --run-id prewarm_multi \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data

--warm-cache-only does not start vLLM and does not call any model. It only resolves {{video:...}} placeholders and writes the segment cache. Keep --cache-dir, --input, --video-root, and video placeholder parameters identical between prewarming and the later GPU run.

For time, cdpruner, surge, cdpruner_token, and surge_token, --multi-stream can be omitted during prewarming because the same base video segments are generated. For code_adaptive, pass the same --multi-stream code_adaptive as the later run because it may create additional fps-scaled segments.

1. API-Free Smoke Test

This command verifies the Python environment, CLI path, input parsing, output writing, and video-root resolution. It does not start vLLM and does not call any hosted API.

cd X-Stream/inference
uv run bash run.sh \
  --model echo \
  --no-vllm \
  --no-stream-eval \
  --input ../data/eval_relative_merged_phostream_type.jsonl \
  --multi-stream pixel \
  --workers 2 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id smoke_echo_pixel

2. Local vLLM With Merged Pixel Input

Use this when your checkpoint is available on the same machine. Keep /path/to/checkpoints as a placeholder for the local checkpoint root.

cd X-Stream/inference
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json
export STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507

uv run bash run.sh \
  --model Qwen3-Omni-30B-A3B-Instruct \
  --vllm-model-path /path/to/checkpoints \
  --input ../data/eval_relative_merged_phostream_type.jsonl \
  --multi-stream pixel \
  --tp 2 \
  --workers 4 \
  --max-model-len 200000 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id qwen3omni_pixel

3. Local vLLM With Multi-Stream Input

Switch to eval_relative_multi_phostream_type.jsonl for temporal, semantic, and token-reduction modes.

cd X-Stream/inference
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json
export STREAM_EVAL_JUDGER=qwen3-235b-a22b-instruct-2507

uv run bash run.sh \
  --model Qwen3-Omni-30B-A3B-Instruct \
  --vllm-model-path /path/to/checkpoints \
  --input ../data/eval_relative_multi_phostream_type.jsonl \
  --multi-stream time \
  --tp 2 \
  --workers 4 \
  --max-model-len 200000 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id qwen3omni_time

To run another non-token mode, change only --multi-stream and --run-id, for example code, code_adaptive, cdpruner, or surge.

4. Local vLLM With Token-Level Pruning

surge_token and cdpruner_token require a local vLLM backend. Do not pass --no-vllm, and do not use these modes with hosted API models.

cd X-Stream/inference
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export FLOW_CONFIG=configs/models.json

uv run bash run.sh \
  --model Qwen3-Omni-30B-A3B-Instruct \
  --vllm-model-path /path/to/checkpoints \
  --input ../data/eval_relative_multi_phostream_type.jsonl \
  --multi-stream cdpruner_token \
  --xstream-rho 0.25 \
  --tp 2 \
  --workers 1 \
  --max-model-len 200000 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id qwen3omni_cdpruner_token

5. Hosted API Model

Hosted API models do not start vLLM. Pass --no-vllm and make sure the relevant provider key exists in the environment. Patch-level token pruning modes (cdpruner_token and surge_token) are not supported for API models.

cd X-Stream/inference
export FLOW_CONFIG=configs/models.json
export OPENROUTER_API_KEY=<your-openrouter-api-key>

uv run bash run.sh \
  --model qwen3-vl-30b-a3b-instruct \
  --no-vllm \
  --input ../data/eval_relative_merged_phostream_type.jsonl \
  --multi-stream pixel \
  --workers 8 \
  --prompt-root prompts/streaming_prompt \
  --video-root ../data \
  --run-id api_qwen3vl_pixel

Outputs And Evaluation

Each run writes to:

outputs/<RUN_ID>_<YYYYMMDD-HHMMSS>/

Typical contents:

run_env.json
models.json
output_<input>.jsonl
eval.sh
eval.json
vllm_pids.txt
vllmlogs/

Useful flags:

--resume: continue a compatible incomplete run.
--no-stream-eval: skip stream-eval and write only raw model outputs.
--stream-eval-judger MODEL: choose the judge model.
--output-dir DIR: change the output root.
--warm-cache-only: pre-generate video segment cache on CPU and exit.
--cache-warm-workers N: set CPU prewarming concurrency.

run_env.json records resolved runtime paths and options. Use it to reproduce a run or inspect which config, input, cache directory, and multi-stream mode were used.

Directory Structure

inference/
|-- README.md
|-- LICENSE
|-- assets/
|   |-- logo.png                    # X-Stream logo used in this README
|   |-- teaser.png                  # Scenario overview figure
|   `-- multiplexing_pipeline.png   # Online inference and multiplexing pipeline
|-- run.sh                         # Main entrypoint for inference runs
|-- pipeline.sh                    # vLLM startup, resume, evaluation, cleanup
|-- pyproject.toml                 # uv environment and dependency pins
|-- configs/
|   |-- models.example.json        # Public model-config template
|   `-- models.json                # Local model config
|-- prompts/
|   |-- streaming_prompt/
|   |   `-- system_prompt.txt      # Streaming QA prompt resolved by {{file:system_prompt.txt}}
|   `-- general_prompt/
|       `-- system_prompt.txt      # General-purpose prompt for custom runs
|-- tools/
|-- third_party/
|   |-- MLLMFlow
|   |-- ModelHub
|   |-- stream-eval
|   `-- xstream_vllm_pruner
|-- outputs/                       # Generated runs
`-- cache/                         # Generated video segment cache

Token Rate Guidelines

Different model providers account for video tokens differently. If a run exceeds the model's video-token or token-per-second budget, reduce the input load by lowering the resolution, lowering the FPS, shortening clips, or changing playback speed according to the model family:

Gemini: Fixed 258 tokens/sec (independent of resolution/FPS).
GPT: 85 tokens/frame + 170 tokens per 512 $\times$ 512 tile.
Qwen3+: 28 $\times$ 28 pixel patches per token with token merging.

Use these rules to estimate the effective token rate for your target model, then choose the resolution, FPS, clip length, or playback-speed adjustment that keeps the input within that model's limit.

FAQ

Which dataset file should I use?

Use ../data/eval_relative_merged_phostream_type.jsonl for pixel. Use ../data/eval_relative_multi_phostream_type.jsonl for all multi-stream modes. Use ../data/eval_relative.json when you need to inspect or validate the dataset manifest itself.

Why does `eval_relative.json` not appear in `run.sh` examples?

eval_relative.json is the release manifest. The inference runner consumes the MLLMFlow-ready JSONL task files, so the executable examples use eval_relative_merged_phostream_type.jsonl or eval_relative_multi_phostream_type.jsonl.

ValueError: t:1 must be larger than temgoral_factor:2

Change the value of FLOW_VIDEO_URL_FPS from 1 to 2 in each *.jsonl input file.

Discussion

Drawback of semantic multiplexing.

In a typical streaming setting, the question is provided only after the frames have already appeared. This means that, when a frame is first observed, the question cannot be used as a query to determine which salient tokens should be retained.

However, most existing methods for identifying salient tokens rely on question-based importance ranking and keep only the tokens deemed important. As a result, they cannot fundamentally address this limitation. We leave this issue for the community to further explore.

Acknowledgements

This inference package builds on ideas and components from the following open-source projects:

PhoStream and AURA for streaming video understanding infrastructure and evaluation design.
CDPruner and SURGE for visual token pruning.

Citation

@article{sun2026x,
  title={X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding},
  author={Sun, Peiwen and Lu, Xudong and Liu, Huadai and Bo, Yang and Wu, Dongming and Guan, Huankang and Cai, Minghong and Chen, Jinpeng and Guo, Xintong and Li, Shuhan and others},
  journal={arXiv preprint arXiv:2606.02482},
  year={2026}
}

License

This inference package is released under the MIT License. Third-party packages under third_party/ keep their original licenses and notices.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Introduction

Pipeline

Environment Setup

1. Common Base Environment

2. Download Data

3. Local Runtime Variables

4. Model Checkpoints

5. (Optional) CLIP Weights For Pruning Modes

Usage

0. (Optional) CPU Cache Prewarming

1. API-Free Smoke Test

2. Local vLLM With Merged Pixel Input

3. Local vLLM With Multi-Stream Input

4. Local vLLM With Token-Level Pruning

5. Hosted API Model

Outputs And Evaluation

Directory Structure

Token Rate Guidelines

FAQ

Which dataset file should I use?

Why does `eval_relative.json` not appear in `run.sh` examples?

ValueError: t:1 must be larger than temgoral_factor:2

Discussion

Acknowledgements

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
assets		assets
configs		configs
prompts		prompts
third_party		third_party
tools		tools
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
pipeline.sh		pipeline.sh
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run.sh		run.sh
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Introduction

Pipeline

Environment Setup

1. Common Base Environment

2. Download Data

3. Local Runtime Variables

4. Model Checkpoints

5. (Optional) CLIP Weights For Pruning Modes

Usage

0. (Optional) CPU Cache Prewarming

1. API-Free Smoke Test

2. Local vLLM With Merged Pixel Input

3. Local vLLM With Multi-Stream Input

4. Local vLLM With Token-Level Pruning

5. Hosted API Model

Outputs And Evaluation

Directory Structure

Token Rate Guidelines

FAQ

Which dataset file should I use?

Why does eval_relative.json not appear in run.sh examples?

ValueError: t:1 must be larger than temgoral_factor:2

Discussion

Acknowledgements

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Why does `eval_relative.json` not appear in `run.sh` examples?

Packages