OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation

Author: Md Rezwan Haque – CPAMI Lab, University of Waterloo

Overview

OmniAgent is a unified multimodal agent that accepts text, image, audio, and video as input and can produce any of those modalities as output. The model uses a three-tier architecture:

  • Encoder: ImageBind-Huge (frozen, 632M params)
  • Backbone: Vicuna-7B-v1.5 + LoRA (r=64, α=128)
  • Decoders: Stable Diffusion 2.1 (image), AudioLDM-L (audio), ZeroScope v2 (video)

Total parameters: 7.4B. Trainable parameters: 660M (8.9%).
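The trainable fraction quoted above follows from one line of arithmetic:

```python
# Sanity check of the parameter budget quoted above.
total_params = 7.4e9       # full model: encoder + backbone + decoders
trainable_params = 660e6   # LoRA adapter + projectors
fraction = trainable_params / total_params
print(f"{fraction:.1%}")   # 8.9%
```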

Training Pipeline

| Stage | What Is Trained | Data | Steps | Final Loss |
|---|---|---|---|---|
| 1: Encoding Alignment | Input Projector | CC3M + AudioCaps (100K) | 15,620 | 16.55 |
| 1.5: Re-alignment | Input Projector | Synthetic captions (20K) | 6,000 | 16.55 |
| 2: Decoding Alignment | Output Projectors | Decoder embeddings | 150 | 0.86 |
| 3: Agentic SFT | LoRA + Projectors + Action Tokens | MAgenIT (10K) | 1,560 | 0.015 |
| 4: MM-DPO | LoRA + Projectors | Preferences (2K) | 750 | 0.20 |
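The LoRA setup used in stages 3 and 4 has rank r=64 with scaling α=128, so the adapter output is scaled by α/r = 2. As a rough sketch of the adapted forward pass, y = Wx + (α/r)·BAx (this is illustrative only, not the training code; the 4096-d hidden size assumes Vicuna-7B):

```python
import numpy as np

# Illustrative LoRA forward pass with r=64, alpha=128 (scaling factor alpha/r = 2.0).
r, alpha = 64, 128
d = 4096                                  # assumed Vicuna-7B hidden size
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) * 0.02    # frozen base weight
A = rng.standard_normal((r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialized
x = rng.standard_normal(d)

y = W @ x + (alpha / r) * (B @ (A @ x))   # equals W @ x at initialization, since B is zero
print(alpha / r)  # 2.0
```

Because B starts at zero, the adapter initially leaves the frozen backbone's output unchanged and only perturbs it as training updates A and B.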

Evaluation

23/24 functional tests pass (95.8%) across 9 categories, including:

  • Text QA: 3/4 (75%)
  • Image/Audio/Video Understanding: 12/12 (100%)
  • Image Generation: 1/1 (100%)
  • Audio Generation: 1/1 (100%)
  • Multi-turn Dialogue: 1/1 (100%)
  • Intent Detection: 4/4 (100%)
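The headline pass rate follows directly from the counts:

```python
passed, total = 23, 24
print(f"{passed / total:.1%}")  # 95.8%
```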

Files

  • backbone/ – LoRA adapter for Vicuna-7B-v1.5
  • input_projector.pt – Trained input projector (1024→4096)
  • output_projectors.pt – Trained output projectors (image, audio, video)
  • tokenizer/ – Extended tokenizer with special tokens

Usage

import torch

from omniagent.model.omniagent_arch import OmniAgentModel

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = model.chat(text="What is the capital of Japan?")
print(output.text)  # "Tokyo"

Hardware

Trained on 2× NVIDIA RTX A6000 GPUs (48 GB each) in under 48 hours total.

Citation

@techreport{haque2026omniagent,
  title={OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation and Agentic Capabilities},
  author={Haque, Md Rezwan},
  institution={University of Waterloo},
  year={2026}
}