OmniAgent-SFT-7B

OmniAgent-SFT-7B is the cold-start checkpoint of OmniAgent, the first native omni-modal agent for active perception in video understanding (ICML 2026). It is produced by Agentic SFT — best-of-N trajectory synthesis (58K trajectories) with dual-stage quality control — on top of Qwen2.5-Omni-7B. It bootstraps native active perception and serves as the initialization for Agentic RL. For the best-performing model, use OmniAgent-RL-7B instead.

📄 Paper: Native Active Perception as Reasoning for Omni-Modal Understanding
💻 Code: https://github.com/HarryHsing/OmniAgent
🤗 Models: OmniAgent-RL-7B · OmniAgent-SFT-7B

What it does

Instead of consuming every frame, OmniAgent runs an Observation–Thought–Action (OTA) loop: a single omni model decides what to look at (get_frames), listen to (get_audio), or watch (get_clip) on demand, distills each percept into a compact textual memory, and answers when it has enough evidence. The environment only returns raw media — all perception and reasoning are done by this model.
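The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repo's implementation: only the tool names get_frames, get_audio, and get_clip come from this card, while Action, Memory, run_ota_loop, the (start, end) argument shape, and the toy stubs are hypothetical. In OmniAgent the deciding and perceiving roles are played by the same omni model; they are separate callables here only so the sketch can run without weights.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Tool names are taken from the model card; their argument shapes are assumed.
TOOLS = ("get_frames", "get_audio", "get_clip")


@dataclass
class Action:
    kind: str                                  # one of TOOLS, or "answer"
    args: Tuple[float, float] = (0.0, 0.0)     # hypothetical (start_s, end_s) window
    answer: str = ""


@dataclass
class Memory:
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


def run_ota_loop(
    question: str,
    decide: Callable[[str, List[str]], Action],
    perceive: Callable[[Action, bytes], str],
    environment: Dict[str, Callable[[float, float], bytes]],
    max_steps: int = 8,
) -> Tuple[str, List[str]]:
    """Run a toy Observation-Thought-Action loop and return (answer, memory notes)."""
    memory = Memory()
    for _ in range(max_steps):
        # Thought -> Action: choose the next tool call or decide to answer.
        action = decide(question, memory.notes)
        if action.kind == "answer":
            return action.answer, memory.notes
        if action.kind not in TOOLS:
            raise ValueError(f"unknown tool: {action.kind}")
        # The environment returns raw media only; it does no perception itself.
        raw = environment[action.kind](*action.args)
        # Observation: distill the raw percept into a compact textual note.
        memory.add(perceive(action, raw))
    return "", memory.notes  # step budget exhausted without an answer


# Toy stand-ins so the sketch runs end to end.
def toy_decide(question: str, notes: List[str]) -> Action:
    if not notes:
        return Action("get_frames", (0.0, 10.0))
    if len(notes) == 1:
        return Action("get_audio", (0.0, 10.0))
    return Action("answer", answer="B")


def toy_perceive(action: Action, raw: bytes) -> str:
    return f"{action.kind}{action.args}: {len(raw)} bytes inspected"


toy_env = {name: (lambda start, end: b"\x00" * int(end - start)) for name in TOOLS}

answer, notes = run_ota_loop(
    "What happens after the door opens?", toy_decide, toy_perceive, toy_env
)
print(answer)
print(notes)
```

The point of the structure is that the memory holds only short text notes, so the context grows with the number of tool calls rather than with the number of frames in the video.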

How to use

⚠️ This is an agent checkpoint. The weights follow the Qwen2.5-Omni-7B architecture, but to reproduce OmniAgent's active perception you must run it inside the OTA environment in the GitHub repo, not via a plain transformers call.

This checkpoint is mainly intended for:

Initializing Agentic RL — point MODEL_BASE_PATH at it to run TAURA training.
Evaluating the SFT-only stage (e.g., for ablations).

# After setup (see the repo README):
bash demo/launch_inference.sh checkpoints/OmniAgent-SFT-7B assets/example_video_mcq.mp4

Model size & components

OmniAgent uses only the Qwen2.5-Omni thinker for agent reasoning; the talker (audio generation) is never loaded or used, in either training or inference. The text-to-audio tag is auto-derived from the Qwen2.5-Omni architecture in config.json and does not reflect how OmniAgent is used.

Which checkpoint should I use?

OmniAgent-SFT-7B (this model) — the cold-start checkpoint; use it to re-run Agentic RL yourself or to study the SFT-only stage.
OmniAgent-RL-7B — the final, best-performing checkpoint; use for inference, evaluation, and deployment.

Results

Agentic SFT alone already lifts the Qwen2.5-Omni-7B base substantially (e.g., LVBench 43.0 → 48.7, DailyOmni 60.1 → 63.3). Agentic RL with TAURA improves on this further, reaching the open-source state of the art reported in the paper. See OmniAgent-RL-7B and the paper for the full tables.
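For a quick read of the two quoted numbers, the short sketch below computes the absolute and relative gains of the SFT stage over the base model. The scores are the ones quoted in this card; the helper name gains and the one-decimal rounding are illustrative choices, and the unit is assumed to be benchmark score points.

```python
# Scores quoted in this card: (Qwen2.5-Omni-7B base, OmniAgent-SFT-7B).
scores = {"LVBench": (43.0, 48.7), "DailyOmni": (60.1, 63.3)}


def gains(base: float, sft: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(sft - base, 1)
    relative = round(100.0 * (sft - base) / base, 1)
    return absolute, relative


for name, (base, sft) in scores.items():
    absolute, relative = gains(base, sft)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```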

Limitations

The sequential OTA loop adds inference latency compared to a single forward pass; reducing this via parallel exploration is left to future work. As the cold-start stage, this checkpoint underperforms the final OmniAgent-RL-7B.

Citation

@inproceedings{xing2026omniagent,
  title={Native Active Perception as Reasoning for Omni-Modal Understanding},
  author={Zhenghao Xing and Ruiyang Xu and Yuxuan Wang and Jinzheng He and Ziyang Ma and Qize Yang and Yunfei Chu and Jin Xu and Junyang Lin and Chi-Wing Fu and Pheng-Ann Heng},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}