ODE treats multimodal deep-search data construction as an adaptive loop over the same visual-native harness used by the target policy.
The left half shows the visual-native harness: the original task image and every image returned by search, browsing, visual search, or image manipulation are registered as reusable <image:N> handles, so later tools can crop, search, rotate, or inspect evidence produced by earlier steps.
The right half shows ODE: a forward pipeline proposes seeds, explores web evidence, organizes multimodal evidence graphs, and curates verifiable tasks, while a backward pipeline rolls out the teacher or current policy, diagnoses the traces with SFT/RL rubrics, and updates the next epoch's generation configuration.
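As a concrete picture of the reference protocol, here is a minimal sketch of an in-memory handle registry, assuming a simple dict-backed store; the class and method names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ImageBank:
    """Registry that turns every image seen during a rollout into an
    addressable handle, so later tool calls can reference it by name."""
    _images: dict = field(default_factory=dict)
    _counter: int = 0

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        """Store an image and return its reusable handle, e.g. '<image:3>'."""
        self._counter += 1
        handle = f"<image:{self._counter}>"
        self._images[handle] = {"data": image_bytes, "source": source_tool}
        return handle

    def resolve(self, handle: str) -> bytes:
        """Fetch the raw image behind a handle for an image-consuming tool."""
        return self._images[handle]["data"]
```

Because the task image is registered first, it naturally becomes <image:1>, and every later search result, crop, or rotation appends a new handle instead of overwriting prior visual state.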
Images are no longer transient observations. They become addressable visual state that can be reused by later tool calls.
The example chains zoom-in, visual search, web search, and another zoom-in to identify and verify the location.
Rollout feedback edits the generator configuration, rather than scaling up a curation recipe that never changes.
Six components behind visual-native interaction, closed-loop data evolution, and policy-facing training.
Every initial or tool-returned image receives an addressable handle, making intermediate visual evidence reusable across later tool calls.
Search, browsing, visual search, image manipulation, and Python computation operate in one shared multimodal workspace.
Candidate tasks are rolled out by the target policy, diagnosed by trace rubrics, and used to update the next data-generation configuration.
ODE selects teacher trajectories that are visually grounded, tool-effective, and diverse enough to teach useful agent behavior.
For RL, ODE shifts task difficulty toward the current policy's learning frontier rather than scaling a fixed recipe.
Across eight benchmarks, ODE raises Qwen3-VL agents from 24.9 to 39.0 at 8B and from 30.6 to 41.5 at 30B.
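The learning-frontier idea in the RL component above is often realized by keeping tasks the current policy solves at an intermediate rate, since near-certain successes and near-certain failures carry little training signal. The sketch below shows that reading; the rollout budget and pass-rate band are assumptions for illustration, not reported settings.

```python
def select_frontier_tasks(tasks, rollout_succeeds, n_rollouts=8,
                          low=0.2, high=0.8):
    """Keep tasks the current policy sometimes solves and sometimes fails.

    `rollout_succeeds(task)` is assumed to run one rollout and return a
    bool; the (low, high) band is an illustrative choice, not a paper value.
    """
    frontier = []
    for task in tasks:
        passes = sum(rollout_succeeds(task) for _ in range(n_rollouts))
        rate = passes / n_rollouts
        if low <= rate <= high:
            frontier.append((task, rate))
    return frontier
```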
Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Existing tool-use harnesses often treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Training data is also usually built by fixed curation recipes that cannot track the target agent's evolving capability.
We introduce a visual-native agent harness centered on an Image Bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained.
The method combines a visual-native workspace with data construction driven by feedback from the policy being trained.
The harness unifies nine core tools in a shared workspace. The Image Bank reference protocol stores the original task image and every tool-returned image as reusable visual state, enabling crop-conditioned retrieval, iterative zoom-and-search, and evidence accumulation.
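Building on the registry sketch above, a hypothetical trace shows how one tool's output becomes another tool's input. The zoom_in stub, file name, and visual_search placeholder are illustrative stand-ins for the harness's actual tools.

```python
from io import BytesIO
from PIL import Image

def zoom_in(image_bytes: bytes, box: tuple) -> bytes:
    """Stand-in zoom-in tool: crop a region into a new image."""
    buf = BytesIO()
    Image.open(BytesIO(image_bytes)).crop(box).save(buf, format="PNG")
    return buf.getvalue()

bank = ImageBank()
task = bank.register(open("task.png", "rb").read(), "task_input")  # <image:1>

# The crop is registered too, so it becomes addressable visual state
# rather than a transient observation.
crop = bank.register(zoom_in(bank.resolve(task), (120, 40, 480, 360)),
                     "zoom_in")                                    # <image:2>

# Crop-conditioned retrieval: a later image-consuming tool (whatever
# visual-search backend the harness exposes) can now take <image:2>.
# hits = visual_search(bank.resolve(crop))
```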
The data generator proposes seeds, explores web evidence, organizes multimodal evidence graphs, and curates verifiable tasks with capability and difficulty annotations.
Rollout traces are judged and analyzed with shared and mode-specific rubrics. Diagnoses are aggregated into configuration updates that repair the generator stage responsible for each failure.
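What "diagnoses are aggregated into configuration updates" could look like in code is sketched below; the failure labels, stage names, and repair_pressure field are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from rubric diagnoses to the generator stage
# that should be repaired in the next epoch.
STAGE_FOR_FAILURE = {
    "answer_leaks_in_seed": "seed_proposal",
    "evidence_not_on_web":  "web_exploration",
    "image_never_needed":   "evidence_graph",
    "unverifiable_answer":  "task_curation",
}

def update_generator_config(config: dict, diagnoses: list) -> dict:
    """Aggregate per-trace diagnoses and flag the responsible stages."""
    for failure, n in Counter(diagnoses).most_common():
        stage = STAGE_FOR_FAILURE.get(failure)
        if stage is not None:
            # In practice this might tighten a filter, reweight a sampler,
            # or rewrite a stage prompt; here we just record the pressure.
            config.setdefault(stage, {})["repair_pressure"] = n
    return config
```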
ODE is evaluated on eight multimodal deep-search and multimodal reasoning benchmarks.
The table compares direct-answer MLLMs, general agent workflows with external tools, dedicated multimodal search agents, and ODE-trained Qwen3-VL agents. ODE raises the 8B agent from 24.9 to 39.0 average accuracy and the 30B agent from 30.6 to 41.5, with gains concentrated on benchmarks that require iterative evidence gathering, visual grounding, and search-oriented reasoning. This separates tool access from tool competence: the model benefits from training traces that teach when to search, when to inspect visual evidence, and how to synthesize retrieved evidence into a grounded answer.
The gains are measured under the same tool environment, so the difference comes from training data and policy learning.
ODE-8B-RL reaches 39.0 average accuracy, exceeding the Gemini-2.5 Pro agent-workflow average reported in the table.
The larger Qwen3-VL backbone also benefits, improving to 41.5 average accuracy after SFT and RL.
Reusable Image Bank references make intermediate visual evidence actionable across later tool calls.
The ablated harness still shows tool-returned images to the model, but removes their reusable <image:N> references, so those images cannot become inputs to later image-consuming tools.
The full harness performs better on visually grounded and search-heavy benchmarks, with the largest gains appearing where secondary image reuse is most active. The downstream tool distribution further shows that reused images are mainly consumed by zoom-in and visual search, indicating that the Image Bank supports iterative visual refinement rather than passive image viewing.
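To make the ablation concrete, here is one hypothetical rendering of the same tool result under both conditions; the observation strings are invented for illustration.

```python
def render_observation(image_handle: str, ablated: bool) -> str:
    """Show the same tool-returned image with or without a reusable handle."""
    if ablated:
        # The model still sees the image content, but no reference exists
        # that a later zoom-in or visual-search call could consume.
        return "[tool returned an image]"
    # Full harness: the handle makes the image valid input to later tools.
    return f"[tool returned {image_handle}; usable by image-consuming tools]"
```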
The full harness improves over the no-reuse ablation on key benchmarks such as MMBC, HLE-VL, and MMSearch+.
Benchmarks with more secondary image-use calls tend to show larger accuracy gains from the full harness.
Intermediate images are cropped and searched again, turning tool outputs into working visual evidence.
ODE is compared with a matched static recipe to isolate the value of rollout-feedback evolution.
The static baseline uses the initial ODE configuration and runs only the forward generation pipeline, without rollout-based analysis or configuration optimization. Under matched data scale, evolved SFT traces outperform the static recipe and contain stronger supervision patterns: more traces with tool-produced images, more high-density visual traces with 4+ tool images, more multi-step tool calls, more visual+search strategies, and broader raw tool-chain and abstract strategy diversity. For RL, both datasets start from the same ODE-8B-SFT checkpoint, but the evolved RL tasks produce stronger downstream gains, showing that policy-facing data needs feedback from the current policy rather than only verifiable task generation.
Evolution improves imitation data by shifting supervision toward visual evidence, multi-step solving, and mixed tool strategies.
Tool-chain diversity counts raw tool-call sequences, while strategy diversity groups them into higher-level solving families.
Evolved RL tasks are better matched to the policy's learning frontier than tasks from the static initial recipe.
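The two diversity measures described above can be sketched as follows; the tool-to-family mapping is an illustrative guess rather than the paper's exact abstraction.

```python
# Assumed grouping of raw tools into higher-level solving families.
FAMILY = {
    "web_search": "search", "browse": "search", "visual_search": "search",
    "zoom_in": "visual", "rotate": "visual", "crop": "visual",
    "python": "compute",
}

def tool_chain_diversity(traces):
    """Count distinct raw tool-call sequences across traces."""
    return len({tuple(t) for t in traces})

def strategy_diversity(traces):
    """Count distinct solving families after abstracting each raw tool."""
    return len({tuple(FAMILY.get(tool, "other") for tool in t)
                for t in traces})
```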
Trace statistics show how ODE changes the data distribution across evolution.
The score trends and radar profiles show that the same evolution loop moves in mode-specific directions. In SFT mode, ODE improves teacher-trace dimensions such as visual dependency, step appropriateness, and tool-pattern diversity while keeping verifiability high. The behavior statistics clarify the mechanism: evolved SFT traces use fewer tool calls overall, but produce more dynamic images and more image-input calls, meaning they are not simply longer traces; more supervision is carried by intermediate visual evidence. In RL mode, evolution increases information complexity, capability requirement, difficulty match, and learning utility, and the resulting rollouts use more tools, more dynamic images, and more image-input calls.
Dynamic images: tool-produced images acquired during rollout, excluding the original task image.
Image-input calls: tool invocations that take an image reference as input, measuring whether visual state is re-consumed.
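Under an assumed per-step logging schema, the two statistics could be computed as below; the step fields are invented for this sketch.

```python
def trace_stats(trace):
    """Compute the two behavior statistics above for one trace.

    Each step is assumed to look like
        {"tool": str, "inputs": list, "returned_image": bool},
    an illustrative schema rather than the paper's logging format.
    """
    dynamic_images = sum(step["returned_image"] for step in trace)
    image_input_calls = sum(
        any(str(x).startswith("<image:") for x in step["inputs"])
        for step in trace
    )
    return {"dynamic_images": dynamic_images,
            "image_input_calls": image_input_calls}
```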
Rollout feedback steers the generator toward data that exposes missing capabilities and gives stronger training signal.
@article{huang2026towards,
title={Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents},
author={Huang, Shijue and Guo, Hangyu and Li, Chenxin and Lu, Junting and Geng, Xinyu and Su, Zhaochen and Li, Zhenyu and Chen, Shuang and Wang, Hongru and Fung, Yi R.},
journal={arXiv preprint arXiv:2605.10832},
year={2026}
}