---
license: apache-2.0
tags:
- multimodal
- any-to-any
- agent
- image-generation
- audio-generation
- vicuna
- imagebind
- lora
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation

**Author:** Md Rezwan Haque — CPAMI Lab, University of Waterloo

## Overview

OmniAgent is a unified multimodal agent that accepts text, image, audio, and video as input and can generate any of those same modalities as output.

The model uses a three-tier architecture:

- **Encoder:** ImageBind-Huge (frozen, 632M parameters)
- **Backbone:** Vicuna-7B-v1.5 with a LoRA adapter (r=64, α=128)
- **Decoders:** Stable Diffusion 2.1 (image), AudioLDM-L (audio), ZeroScope v2 (video)

Total parameters: 7.4B. Trainable parameters: 660M (8.9%).

## Training Pipeline

| Stage | What Is Trained | Data | Steps | Final Loss |
|-------|-----------------|------|-------|------------|
| 1: Encoding Alignment | Input projector | CC3M + AudioCaps (100K) | 15,620 | 16.55 |
| 1.5: Re-alignment | Input projector | Synthetic captions (20K) | 6,000 | 16.55 |
| 2: Decoding Alignment | Output projectors | Decoder embeddings | 150 | 0.86 |
| 3: Agentic SFT | LoRA + projectors + action tokens | MAgenIT (10K) | 1,560 | 0.015 |
| 4: MM-DPO | LoRA + projectors | Preferences (2K) | 750 | 0.20 |

## Evaluation

23/24 functional tests pass (95.8%) across 9 categories:

- Text QA: 3/4 (75%)
- Image/Audio/Video Understanding: 12/12 (100%)
- Image Generation: 1/1 (100%)
- Audio Generation: 1/1 (100%)
- Multi-turn Dialogue: 1/1 (100%)
- Intent Detection: 4/4 (100%)

## Files

- `backbone/` — LoRA adapter for Vicuna-7B-v1.5
- `input_projector.pt` — Trained input projector (1024→4096)
- `output_projectors.pt` — Trained output projectors (image, audio, video)
- `tokenizer/` — Extended tokenizer with special tokens

## Usage

```python
import torch

from omniagent.model.omniagent_arch import OmniAgentModel

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = model.chat(text="What is the capital of Japan?")
print(output.text)  # "Tokyo"
```

## Hardware

Trained on 2× NVIDIA RTX A6000 GPUs (48 GB each) in under 48 hours total.

## Citation

```bibtex
@techreport{haque2026omniagent,
  title={OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation and Agentic Capabilities},
  author={Haque, Md Rezwan},
  institution={University of Waterloo},
  year={2026}
}
```
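The glue between the frozen ImageBind encoder and the Vicuna backbone is the input projector shipped as `input_projector.pt`, which maps 1024-dimensional ImageBind embeddings into Vicuna-7B's 4096-dimensional hidden space. Below is a minimal PyTorch sketch of that mapping; the single `nn.Linear` layer and the class name `InputProjector` are illustrative assumptions, as the released projector may use a deeper MLP.

```python
import torch
import torch.nn as nn


class InputProjector(nn.Module):
    """Sketch of a 1024 -> 4096 projector from ImageBind embeddings
    into the Vicuna-7B hidden space (a single linear layer is assumed
    here for illustration)."""

    def __init__(self, in_dim: int = 1024, out_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_modality_tokens, 1024) ImageBind embeddings
        return self.proj(x)


projector = InputProjector()
dummy = torch.randn(2, 8, 1024)  # batch of 2, 8 modality tokens each
out = projector(dummy)
print(out.shape)  # torch.Size([2, 8, 4096])
```

The projected tokens can then be prepended to the text token embeddings before the backbone's forward pass, which is the standard pattern for this kind of encoder-to-LLM alignment.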