OmniAgent-SFT-7B

OmniAgent-SFT-7B is the cold-start checkpoint of OmniAgent, the first native omni-modal agent for active perception in video understanding (ICML 2026). It is produced by Agentic SFT on top of Qwen2.5-Omni-7B, using best-of-N trajectory synthesis (58K trajectories) with dual-stage quality control. It bootstraps native active perception and serves as the initialization for Agentic RL. For the best-performing model, use OmniAgent-RL-7B instead.

What it does

Instead of consuming every frame, OmniAgent runs an Observation–Thought–Action (OTA) loop: a single omni model decides what to look at (get_frames), listen to (get_audio), or watch (get_clip) on demand, distills each percept into a compact textual memory, and answers when it has enough evidence. The environment returns only raw media; all perception and reasoning are done by this model.
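The control flow of the OTA loop can be sketched in a few lines of Python. This is a minimal illustration, not the repo's implementation: the tool names get_frames, get_audio, and get_clip come from this card, but the Action type, the argument names, the step budget, and the stub environment are all assumptions. In OmniAgent the policy and the summarizer are the same omni model; here they are scripted stand-ins so the loop runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Tool names are taken from the model card; their arguments are assumed.
TOOLS = ("get_frames", "get_audio", "get_clip")


@dataclass
class Action:
    name: str                                   # one of TOOLS, or "answer"
    args: Dict[str, float] = field(default_factory=dict)
    answer: str = ""


def fake_environment(action: Action) -> str:
    """Hypothetical environment stub: returns a placeholder for raw media only."""
    return f"<raw media from {action.name} {action.args}>"


def run_ota_loop(
    question: str,
    policy: Callable[[str, List[str]], Action],
    summarize: Callable[[Action, str], str],
    env: Callable[[Action], str] = fake_environment,
    max_steps: int = 8,                         # assumed step budget
) -> Tuple[str, List[str]]:
    memory: List[str] = []                      # compact textual memory
    for _ in range(max_steps):
        action = policy(question, memory)       # Thought -> Action
        if action.name == "answer":
            return action.answer, memory
        if action.name not in TOOLS:
            raise ValueError(f"unknown tool: {action.name}")
        observation = env(action)               # Observation: raw media
        memory.append(summarize(action, observation))
    return "", memory                           # budget exhausted, no answer


def scripted_policy(question: str, memory: List[str]) -> Action:
    """Stand-in for the model: look, then listen, then answer."""
    plan = [
        Action("get_frames", {"start": 0.0, "end": 30.0}),
        Action("get_audio", {"start": 10.0, "end": 20.0}),
    ]
    if len(memory) < len(plan):
        return plan[len(memory)]
    return Action("answer", answer="B")


def toy_summarize(action: Action, observation: str) -> str:
    """Stand-in for the model distilling a percept into text."""
    return f"{action.name}: noted {len(observation)} chars of media"


answer, memory = run_ota_loop(
    "What happens after the bell rings?", scripted_policy, toy_summarize
)
print(answer, len(memory))  # prints "B 2"
```

The point of the sketch is the division of labor: the environment hands back media and nothing else, while every decision and every memory entry comes from the policy side.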

How to use

⚠️ This is an agent checkpoint. The weights follow the Qwen2.5-Omni-7B architecture, but to reproduce OmniAgent's active perception you must run it inside the OTA environment in the GitHub repo, not via a plain transformers call.

This checkpoint is mainly intended for:

  1. Initializing Agentic RL — point MODEL_BASE_PATH at it to run TAURA training.
  2. Evaluating the SFT-only stage (e.g., for ablations).

# After setup (see the repo README):
bash demo/launch_inference.sh checkpoints/OmniAgent-SFT-7B assets/example_video_mcq.mp4

Model size & components

OmniAgent uses only the Qwen2.5-Omni thinker for agent reasoning. The talker (audio generation) is never loaded or used, either in training or in inference. The text-to-audio tag is auto-derived from the Qwen2.5-Omni architecture in config.json and does not reflect how OmniAgent is used.

Which checkpoint should I use?

  • OmniAgent-SFT-7B (this model) — the cold-start checkpoint; use it to re-run Agentic RL yourself or to study the SFT-only stage.
  • OmniAgent-RL-7B — the final, best-performing checkpoint; use for inference, evaluation, and deployment.

Results

Agentic SFT alone already lifts the Qwen2.5-Omni-7B base substantially (e.g., LVBench 43.0 → 48.7, DailyOmni 60.1 → 63.3). Agentic RL with TAURA improves on this further, reaching the open-source state of the art reported in the paper; see OmniAgent-RL-7B and the paper for the full tables.
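To put the two quoted SFT gains on a common footing, the short calculation below derives the absolute and relative improvements from the numbers above. It uses only the four scores stated in this card; the function name is illustrative, and treating the scores as points on a 0-100 scale is an assumption.

```python
# Scores quoted in this card: (Qwen2.5-Omni-7B base, after Agentic SFT).
scores = {"LVBench": (43.0, 48.7), "DailyOmni": (60.1, 63.3)}


def gains(base: float, sft: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded."""
    absolute = round(sft - base, 1)
    relative = round(100 * (sft - base) / base, 1)
    return absolute, relative


for name, (base, sft) in scores.items():
    absolute, relative = gains(base, sft)
    print(f"{name}: +{absolute} points ({relative}% relative)")
# LVBench: +5.7 points (13.3% relative)
# DailyOmni: +3.2 points (5.3% relative)
```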

Limitations

The sequential OTA loop adds inference latency compared to a single forward pass; reducing this via parallel exploration is left to future work. As the cold-start stage, this checkpoint underperforms the final OmniAgent-RL-7B.

Citation

@inproceedings{xing2026omniagent,
  title={Native Active Perception as Reasoning for Omni-Modal Understanding},
  author={Zhenghao Xing and Ruiyang Xu and Yuxuan Wang and Jinzheng He and Ziyang Ma and Qize Yang and Yunfei Chu and Jin Xu and Junyang Lin and Chi-Wing Fu and Pheng-Ann Heng},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}