OmniAgent-SFT-7B
OmniAgent-SFT-7B is the cold-start checkpoint of OmniAgent, the first native omni-modal agent for active perception in video understanding (ICML 2026). It is produced by Agentic SFT — best-of-N trajectory synthesis (58K trajectories) with dual-stage quality control — on top of Qwen2.5-Omni-7B. It bootstraps native active perception and serves as the initialization for Agentic RL. For the best-performing model, use OmniAgent-RL-7B instead.
- 📄 Paper: Native Active Perception as Reasoning for Omni-Modal Understanding
- 💻 Code: https://github.com/HarryHsing/OmniAgent
- 🤗 Models: OmniAgent-RL-7B · OmniAgent-SFT-7B
What it does
Instead of consuming every frame, OmniAgent runs an Observation–Thought–Action (OTA) loop: a single omni model decides what to look at (get_frames), listen to (get_audio), or watch (get_clip) on demand, distills each percept into a compact textual memory, and answers when it has enough evidence. The environment only returns raw media — all perception and reasoning are done by this model.
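The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the repository's implementation: the tool names (get_frames, get_audio, get_clip) come from this card, while `Action`, `StubEnvironment`, `ScriptedModel`, `run_ota_loop`, the `start`/`end` arguments, and the `max_steps` cap are all hypothetical names introduced here. The stub model follows a fixed plan in place of real omni-modal reasoning.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Tool names are taken from the model card; their argument shapes are assumed.
TOOLS = ("get_frames", "get_audio", "get_clip")


@dataclass
class Action:
    name: str                                   # one of TOOLS, or "answer"
    args: Dict[str, float] = field(default_factory=dict)
    text: str = ""                              # final answer when name == "answer"


class StubEnvironment:
    """Hypothetical environment: returns raw media only (here, a placeholder string)."""

    def step(self, action: Action) -> str:
        if action.name not in TOOLS:
            raise ValueError(f"unknown tool: {action.name}")
        return f"<raw {action.name} {action.args['start']}-{action.args['end']}s>"


class ScriptedModel:
    """Hypothetical stand-in for the omni model: follows a fixed plan, then answers."""

    def __init__(self) -> None:
        self._plan = [
            Action("get_frames", {"start": 0.0, "end": 10.0}),
            Action("get_audio", {"start": 10.0, "end": 20.0}),
        ]

    def decide(self, question: str, memory: List[str]) -> Action:
        # Thought + Action: request more evidence or commit to an answer.
        if len(memory) < len(self._plan):
            return self._plan[len(memory)]
        return Action("answer", text="B")

    def distill(self, question: str, percept: str) -> str:
        # The same model compresses each raw percept into a short text note.
        return f"note on {percept}"


def run_ota_loop(model, env, question: str,
                 max_steps: int = 8) -> Tuple[Optional[str], List[str]]:
    """Run Observation-Thought-Action steps until the model answers or steps run out."""
    memory: List[str] = []
    for _ in range(max_steps):
        action = model.decide(question, memory)
        if action.name == "answer":
            return action.text, memory
        percept = env.step(action)                        # Observation: raw media
        memory.append(model.distill(question, percept))   # compact textual memory
    return None, memory                                   # step budget exhausted


if __name__ == "__main__":
    answer, memory = run_ota_loop(ScriptedModel(), StubEnvironment(), "Which option is correct?")
    print(answer, memory)
```

Note that the environment never interprets the media: both the decision and the distillation calls go to the same model object, which mirrors the card's statement that all perception and reasoning are done by a single model.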
How to use
⚠️ This is an agent checkpoint. The weights follow the Qwen2.5-Omni-7B architecture, but to reproduce OmniAgent's active perception you must run it inside the OTA environment in the GitHub repo, not via a plain transformers call.
This checkpoint is mainly intended for:
- Initializing Agentic RL: point MODEL_BASE_PATH at it to run TAURA training.
- Evaluating the SFT-only stage (e.g., for ablations).
# After setup (see the repo README):
bash demo/launch_inference.sh checkpoints/OmniAgent-SFT-7B assets/example_video_mcq.mp4
Model size & components
OmniAgent uses only the Qwen2.5-Omni thinker for agent reasoning: only the thinker is loaded during training and inference, and the talker (audio generation) is never used. The text-to-audio tag is auto-derived from the Qwen2.5-Omni architecture in config.json and does not reflect how OmniAgent is used.
Which checkpoint should I use?
- OmniAgent-SFT-7B (this model) — the cold-start checkpoint; use it to re-run Agentic RL yourself or to study the SFT-only stage.
- OmniAgent-RL-7B — the final, best-performing checkpoint; use for inference, evaluation, and deployment.
Results
Agentic SFT alone already lifts the Qwen2.5-Omni-7B base substantially (e.g., LVBench 43.0 → 48.7, DailyOmni 60.1 → 63.3). Agentic RL with TAURA improves on this further, reaching what the paper reports as the open-source state of the art; see OmniAgent-RL-7B and the paper for the full tables.
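For a quick sense of scale, the two quoted pairs can be turned into absolute and relative gains. The snippet below uses only the four numbers given above; the `gains` helper is a hypothetical name, and treating the scores as points on a 0-100 scale is an assumption.

```python
# Scores quoted in the Results paragraph: (Qwen2.5-Omni-7B base, after Agentic SFT).
scores = {
    "LVBench": (43.0, 48.7),
    "DailyOmni": (60.1, 63.3),
}


def gains(base: float, sft: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(sft - base, 1)
    relative = round(100 * (sft - base) / base, 1)
    return absolute, relative


for name, (base, sft) in scores.items():
    absolute, relative = gains(base, sft)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```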
Limitations
The sequential OTA loop adds inference latency compared to a single forward pass; reducing this via parallel exploration is left to future work. As the cold-start stage, this checkpoint underperforms the final OmniAgent-RL-7B.
Citation
@inproceedings{xing2026omniagent,
title={Native Active Perception as Reasoning for Omni-Modal Understanding},
author={Zhenghao Xing and Ruiyang Xu and Yuxuan Wang and Jinzheng He and Ziyang Ma and Qize Yang and Yunfei Chu and Jin Xu and Junyang Lin and Chi-Wing Fu and Pheng-Ann Heng},
booktitle={International Conference on Machine Learning (ICML)},
year={2026}
}