ODE treats multimodal deep-search data construction as an adaptive loop over the same visual-native harness used by the target policy.
The left half shows the visual-native harness: the original task image and every image returned by search, browsing, visual search, or image manipulation are registered as reusable <image:N> handles, so later tools can crop, search, rotate, or inspect evidence produced by earlier steps.
The right half shows ODE: a forward pipeline proposes seeds, explores web evidence, organizes multimodal evidence graphs, and curates verifiable tasks, while a backward pipeline rolls out the teacher or current policy, diagnoses the traces with SFT/RL rubrics, and updates the next epoch's generation configuration.
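As a concrete picture of the reference protocol, here is a minimal sketch of an in-memory handle registry, assuming a simple dict-backed store; the class and method names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ImageBank:
    """Registry that turns every image seen during a rollout into an
    addressable handle, so later tool calls can reference it by name."""
    _images: dict = field(default_factory=dict)
    _counter: int = 0

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        """Store an image and return its reusable handle, e.g. '<image:3>'."""
        self._counter += 1
        handle = f"<image:{self._counter}>"
        self._images[handle] = {"data": image_bytes, "source": source_tool}
        return handle

    def resolve(self, handle: str) -> bytes:
        """Fetch the raw image behind a handle for an image-consuming tool."""
        return self._images[handle]["data"]
```

Because the task image is registered first, it naturally becomes <image:1>, and every later search result, crop, or rotation appends a new handle instead of overwriting prior visual state.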
Images are no longer transient observations. They become addressable visual state that can be reused by later tool calls.
The example chains zoom-in, visual search, web search, and another zoom-in to identify and verify the location.
Rollout feedback edits the generator configuration, rather than scaling up a curation recipe that never changes.
Six components behind visual-native interaction, closed-loop data evolution, and policy-facing training.
Every initial or tool-returned image receives an addressable handle, making intermediate visual evidence reusable across later tool calls.
Search, browsing, visual search, image manipulation, and Python computation operate in one shared multimodal workspace.
Candidate tasks are rolled out by the target policy, diagnosed by trace rubrics, and used to update the next data-generation configuration.
ODE selects teacher trajectories that are visually grounded, tool-effective, and diverse enough to teach useful agent behavior.
For RL, ODE shifts task difficulty toward the current policy's learning frontier rather than scaling a fixed recipe.
Across eight benchmarks, ODE raises Qwen3-VL agents from 24.9 to 39.0 at 8B and from 30.6 to 41.5 at 30B.
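The learning-frontier idea in the RL component above is often realized by keeping tasks the current policy solves at an intermediate rate, since near-certain successes and near-certain failures carry little training signal. The sketch below shows that reading; the rollout budget and pass-rate band are assumptions for illustration, not reported settings.

```python
def select_frontier_tasks(tasks, rollout_succeeds, n_rollouts=8,
                          low=0.2, high=0.8):
    """Keep tasks the current policy sometimes solves and sometimes fails.

    `rollout_succeeds(task)` is assumed to run one rollout and return a
    bool; the (low, high) band is an illustrative choice, not a paper value.
    """
    frontier = []
    for task in tasks:
        passes = sum(rollout_succeeds(task) for _ in range(n_rollouts))
        rate = passes / n_rollouts
        if low <= rate <= high:
            frontier.append((task, rate))
    return frontier
```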
Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Existing tool-use harnesses often treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Training data is also usually built by fixed curation recipes that cannot track the target agent's evolving capability.
We introduce a visual-native agent harness centered on an Image Bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained.
The method combines a visual-native workspace with data construction driven by feedback from the policy being trained.
The harness unifies nine core tools in a shared workspace. The Image Bank reference protocol stores the original task image and every tool-returned image as reusable visual state, enabling crop-conditioned retrieval, iterative zoom-and-search, and evidence accumulation.
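Building on the registry sketch above, a hypothetical trace shows how one tool's output becomes another tool's input. The zoom_in stub, file name, and visual_search placeholder are illustrative stand-ins for the harness's actual tools.

```python
from io import BytesIO
from PIL import Image

def zoom_in(image_bytes: bytes, box: tuple) -> bytes:
    """Stand-in zoom-in tool: crop a region into a new image."""
    buf = BytesIO()
    Image.open(BytesIO(image_bytes)).crop(box).save(buf, format="PNG")
    return buf.getvalue()

bank = ImageBank()
task = bank.register(open("task.png", "rb").read(), "task_input")  # <image:1>

# The crop is registered too, so it becomes addressable visual state
# rather than a transient observation.
crop = bank.register(zoom_in(bank.resolve(task), (120, 40, 480, 360)),
                     "zoom_in")                                    # <image:2>

# Crop-conditioned retrieval: a later image-consuming tool (whatever
# visual-search backend the harness exposes) can now take <image:2>.
# hits = visual_search(bank.resolve(crop))
```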
The data generator proposes seeds, explores web evidence, organizes multimodal evidence graphs, and curates verifiable tasks with capability and difficulty annotations.
Rollout traces are judged and analyzed with shared and mode-specific rubrics. Diagnoses are aggregated into configuration updates that repair the generator stage responsible for each failure.
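What "diagnoses are aggregated into configuration updates" could look like in code is sketched below; the failure labels, stage names, and repair_pressure field are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from rubric diagnoses to the generator stage
# that should be repaired in the next epoch.
STAGE_FOR_FAILURE = {
    "answer_leaks_in_seed": "seed_proposal",
    "evidence_not_on_web":  "web_exploration",
    "image_never_needed":   "evidence_graph",
    "unverifiable_answer":  "task_curation",
}

def update_generator_config(config: dict, diagnoses: list) -> dict:
    """Aggregate per-trace diagnoses and flag the responsible stages."""
    for failure, n in Counter(diagnoses).most_common():
        stage = STAGE_FOR_FAILURE.get(failure)
        if stage is not None:
            # In practice this might tighten a filter, reweight a sampler,
            # or rewrite a stage prompt; here we just record the pressure.
            config.setdefault(stage, {})["repair_pressure"] = n
    return config
```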
ODE is evaluated on eight multimodal deep-search and multimodal reasoning benchmarks.
The table compares direct-answer MLLMs, general agent workflows with external tools, dedicated multimodal search agents, and ODE-trained Qwen3-VL agents. ODE raises the 8B agent from 24.9 to 39.0 average accuracy and the 30B agent from 30.6 to 41.5, with gains concentrated on benchmarks that require iterative evidence gathering, visual grounding, and search-oriented reasoning. This separates tool access from tool competence: the model benefits from training traces that teach when to search, when to inspect visual evidence, and how to synthesize retrieved evidence into a grounded answer.
The gains are measured under the same tool environment, so the difference comes from training data and policy learning.
ODE-8B-RL reaches 39.0 average accuracy, exceeding the Gemini-2.5 Pro agent-workflow average reported in the table.
The larger Qwen3-VL backbone also benefits, improving to 41.5 average accuracy after SFT and RL.
Reusable Image Bank references make intermediate visual evidence actionable across later tool calls.
The ablated harness still shows tool-returned images to the model, but removes their reusable <image:N> references, so those images cannot become inputs to later image-consuming tools.
The full harness performs better on visually grounded and search-heavy benchmarks, with the largest gains appearing where secondary image reuse is most active. The downstream tool distribution further shows that reused images are mainly consumed by zoom-in and visual search, indicating that the Image Bank supports iterative visual refinement rather than passive image viewing.
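To make the ablation concrete, here is one hypothetical rendering of the same tool result under both conditions; the observation strings are invented for illustration.

```python
def render_observation(image_handle: str, ablated: bool) -> str:
    """Show the same tool-returned image with or without a reusable handle."""
    if ablated:
        # The model still sees the image content, but no reference exists
        # that a later zoom-in or visual-search call could consume.
        return "[tool returned an image]"
    # Full harness: the handle makes the image valid input to later tools.
    return f"[tool returned {image_handle}; usable by image-consuming tools]"
```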
The full harness improves over the no-reuse ablation on key benchmarks such as MMBC, HLE-VL, and MMSearch+.
Benchmarks with more secondary image-use calls tend to show larger accuracy gains from the full harness.
Intermediate images are cropped and searched again, turning tool outputs into working visual evidence.
ODE is compared with a matched static recipe to isolate the value of rollout-feedback evolution.
The static baseline uses the initial ODE configuration and runs only the forward generation pipeline, without rollout-based analysis or configuration optimization. Under matched data scale, evolved SFT traces outperform the static recipe and contain stronger supervision patterns: more traces with tool-produced images, more high-density visual traces with 4+ tool images, more multi-step tool calls, more visual+search strategies, and broader raw tool-chain and abstract strategy diversity. For RL, both datasets start from the same ODE-8B-SFT checkpoint, but the evolved RL tasks produce stronger downstream gains, showing that policy-facing data needs feedback from the current policy rather than only verifiable task generation.
Evolution improves imitation data by shifting supervision toward visual evidence, multi-step solving, and mixed tool strategies.
Tool-chain diversity counts raw tool-call sequences, while strategy diversity groups them into higher-level solving families.
Evolved RL tasks are better matched to the policy's learning frontier than tasks from the static initial recipe.
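The two diversity measures described above can be sketched as follows; the tool-to-family mapping is an illustrative guess rather than the paper's exact abstraction.

```python
# Assumed grouping of raw tools into higher-level solving families.
FAMILY = {
    "web_search": "search", "browse": "search", "visual_search": "search",
    "zoom_in": "visual", "rotate": "visual", "crop": "visual",
    "python": "compute",
}

def tool_chain_diversity(traces):
    """Count distinct raw tool-call sequences across traces."""
    return len({tuple(t) for t in traces})

def strategy_diversity(traces):
    """Count distinct solving families after abstracting each raw tool."""
    return len({tuple(FAMILY.get(tool, "other") for tool in t)
                for t in traces})
```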
Trace statistics show how ODE changes the data distribution across evolution.
The score trends and radar profiles show that the same evolution loop moves in mode-specific directions. In SFT mode, ODE improves teacher-trace dimensions such as visual dependency, step appropriateness, and tool-pattern diversity while keeping verifiability high. The behavior statistics clarify the mechanism: evolved SFT traces use fewer tool calls overall, but produce more dynamic images and more image-input calls, meaning they are not simply longer traces; more supervision is carried by intermediate visual evidence. In RL mode, evolution increases information complexity, capability requirement, difficulty match, and learning utility, and the resulting rollouts use more tools, more dynamic images, and more image-input calls.
Dynamic images: tool-produced images acquired during rollout, excluding the original task image.
Image-input calls: tool invocations that take an image reference as input, measuring whether visual state is re-consumed.
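Under an assumed per-step logging schema, the two statistics could be computed as below; the step fields are invented for this sketch.

```python
def trace_stats(trace):
    """Compute the two behavior statistics above for one trace.

    Each step is assumed to look like
        {"tool": str, "inputs": list, "returned_image": bool},
    an illustrative schema rather than the paper's logging format.
    """
    dynamic_images = sum(step["returned_image"] for step in trace)
    image_input_calls = sum(
        any(str(x).startswith("<image:") for x in step["inputs"])
        for step in trace
    )
    return {"dynamic_images": dynamic_images,
            "image_input_calls": image_input_calls}
```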
Rollout feedback steers the generator toward data that exposes missing capabilities and gives stronger training signal.
@article{huang2026towards,
title={Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents},
author={Huang, Shijue and Guo, Hangyu and Li, Chenxin and Lu, Junting and Geng, Xinyu and Su, Zhaochen and Li, Zhenyu and Chen, Shuang and Wang, Hongru and Fung, Yi R.},
journal={arXiv preprint arXiv:2605.10832},
year={2026}
}