🏆 Best Short Paper Award — ICLR 2026 Workshop AIWILD
Yunbei Zhang, Yingqiang Ge, Weijie Xu, Yuhui Xu, Jihun Hamm, Chandan K. Reddy
- Apr 15, 2026 — 🏆 Received the Best Short Paper Award at the ICLR 2026 AIWILD Workshop!
- Apr 10, 2026 — Code and VE-Safety dataset released.
- Apr 06, 2026 — Selected as an Oral Presentation at the ICLR 2026 AIWILD Workshop.
- Mar 24, 2026 — Selected as a Spotlight at the ICLR 2026 AIWILD Workshop.
- Mar 01, 2026 — Accepted to the ICLR 2026 Workshop on Agents in the Wild (AIWILD): Safety, Security, and Beyond.
- Feb 05, 2026 — Paper released on arXiv.
- Release VE-Safety dataset on HuggingFace (under DUA)
- Release code
Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise — attacks that are structurally brittle once standard defenses expose the payload. We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis. MM-Plan trains an attacker planner via Group Relative Policy Optimization (GRPO) to self-discover effective multi-turn strategies without human supervision. We also introduce VE-Safety, a human-curated benchmark of 440 instances across 15 safety categories. MM-Plan achieves 46.3% ASR against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2–5×.
Unlike prior "wrapper-based" attacks where images merely conceal text payloads, Visual Exclusivity (VE) exploits the model's own visual reasoning capabilities. The text query appears innocuous (e.g., "How do I assemble this?"), the image contains no adversarial noise or hidden typography, and harm materializes only when the model correctly interprets spatial/functional relationships in the image. This makes standard defenses (OCR, caption-based screening) largely ineffective.
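To make the threat model concrete, a single VE instance pairs an innocuous text query with a clean image, while the harmful goal is held out and used only for evaluation. The sketch below is purely illustrative; the field names are hypothetical, not the actual VE-Safety schema.

```python
# Illustrative Visual Exclusivity (VE) instance. Field names are hypothetical;
# see the VE-Safety dataset page for the real schema.
ve_instance = {
    "text_query": "How do I assemble this?",  # innocuous on its own
    "image_path": "schematics/device.png",    # clean schematic: no typography,
                                              # no adversarial noise
    "harmful_goal": "<held-out objective>",   # harm emerges only when the model
                                              # reasons over the image
    "safety_category": "<one of 15 categories>",
}
```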
MM-Plan reformulates multimodal jailbreaking as agentic planning. Given a harmful goal and image, the Attacker Planner generates complete multi-turn strategies in a single pass. Plans are sampled and executed against victim MLLMs, with rewards collected from a judge model. The policy is updated via GRPO based on relative plan performance.
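For intuition, here is a minimal sketch of the plan-level GRPO update described above. All names (`planner`, `victim`, `judge`, and their methods) are placeholders rather than the repo's actual API; the real training loop lives in veRL.

```python
import torch

def grpo_step(planner, victim, judge, goal, image, group_size=8, clip_eps=0.2):
    """One GRPO update: sample a group of plans, score them, and reinforce
    plans that beat the group average. Placeholder API, sketch only."""
    # 1. Sample a group of complete multi-turn attack plans, one pass each.
    plans = [planner.sample(goal, image) for _ in range(group_size)]

    # 2. Execute every plan against the victim MLLM; a judge scores outcomes.
    rewards = torch.tensor([judge.score(victim.run(plan, image), goal)
                            for plan in plans], dtype=torch.float32)

    # 3. Group-relative advantage: normalize rewards within the group,
    #    so no learned value function (critic) is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 4. Clipped policy-gradient loss on plan log-probabilities. Detaching
    #    the "old" log-probs makes this a single on-policy step (ratio == 1),
    #    which keeps the sketch simple; full GRPO reuses samples off-policy.
    logp_new = torch.stack([planner.log_prob(p, goal, image) for p in plans])
    ratio = torch.exp(logp_new - logp_new.detach())
    loss = -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
    loss.backward()
    return loss.item()
```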
```bash
git clone https://github.com/yunbeizhang/MM-Plan.git
cd MM-Plan
conda create -n mm_plan python=3.10 -y
conda activate mm_plan
bash verl/scripts/install_mm_plan.sh
```

The VE-Safety benchmark is available on HuggingFace under a Data Use Agreement (DUA). Please follow the instructions on the dataset page to request access.
After approval, download and place the parquet files under data/VE-Safety/:
```python
from datasets import load_dataset

dataset = load_dataset("zybeich/VE-Safety")
```

Update the `data_path` in the training scripts accordingly.
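If you need to write the parquet files to the paths the scripts expect, one option is the short script below. The split names are an assumption (the evaluation command references `test.parquet`); check the dataset card for the actual splits.

```python
import os

from datasets import load_dataset

# Write each split to the location the training/eval scripts expect.
# Split names are assumed; check the dataset card for the actual ones.
os.makedirs("data/VE-Safety", exist_ok=True)
dataset = load_dataset("zybeich/VE-Safety")
for split, ds in dataset.items():
    ds.to_parquet(f"data/VE-Safety/{split}.parquet")
```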
Train the attacker planner with GRPO on VE-Safety:

```bash
bash verl/examples/mm_plan/Qwen3/vebench.sh
```

After training, merge the FSDP-sharded checkpoint into HuggingFace format:
```bash
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir /path/to/checkpoint/actor \
    --target_dir /path/to/checkpoint/merged_hf_model
```

Evaluate on the test set:
```bash
python -m verl.utils.reward_score.mm_plan.evaluate_attacker_agent \
    --parquet ./data/VE-Safety/test.parquet \
    --out-jsonl ./logs_eval/vesafety \
    --model qwen3-vl-4b \
    --model-weights /path/to/merged_hf_model \
    --target-model Qwen3-VL-8B \
    --judge-model Claude-Sonnet-4.5
```

This repo is built upon the following prior works:
- veRL — RL training framework for LLMs.
- Qwen-VL — Base vision-language model for the attacker planner.
We sincerely thank the authors for making their code and data publicly available.
If you find this work useful, please cite our paper:
```bibtex
@article{zhang2026mmplan,
  title={Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning},
  author={Zhang, Yunbei and Ge, Yingqiang and Xu, Weijie and Xu, Yuhui and Hamm, Jihun and Reddy, Chandan K.},
  journal={arXiv preprint arXiv:2603.20198},
  year={2026}
}
```

This project is licensed under the Apache License 2.0.
This repository contains research on AI safety vulnerabilities. The content is intended solely for academic research and responsible disclosure to improve the safety of multimodal AI systems.

