---
license: apache-2.0
tags:
- multimodal
- any-to-any
- agent
- image-generation
- audio-generation
- vicuna
- imagebind
- lora
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation

**Author:** Md Rezwan Haque – CPAMI Lab, University of Waterloo

## Overview

OmniAgent is a unified multimodal agent that accepts text, image, audio, and video as input and can generate any of those four modalities as output. The model uses a three-tier architecture:

- **Encoder:** ImageBind-Huge (frozen, 632M params)
- **Backbone:** Vicuna-7B-v1.5 + LoRA (r=64, α=128)
- **Decoders:** Stable Diffusion 2.1 (image), AudioLDM-L (audio), ZeroScope v2 (video)

Total parameters: 7.4B. Trainable parameters: 660M (8.9%).
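
The trainable fraction can be sanity-checked with a few lines of arithmetic (the 7.4B and 660M figures come from this card; the percentage is simply their ratio):

```python
# Sanity-check the trainable-parameter fraction quoted above.
total_params = 7.4e9      # frozen encoder + backbone + decoders + trainable parts
trainable_params = 660e6  # LoRA adapter plus input/output projectors

fraction = trainable_params / total_params
print(f"{fraction:.1%}")  # ~8.9% of all parameters are updated during training
```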

## Training Pipeline

| Stage | What Is Trained | Data | Steps | Final Loss |
|-------|----------------|------|-------|------------|
| 1: Encoding Alignment | Input Projector | CC3M + AudioCaps (100K) | 15,620 | 16.55 |
| 1.5: Re-alignment | Input Projector | Synthetic captions (20K) | 6,000 | 16.55 |
| 2: Decoding Alignment | Output Projectors | Decoder embeddings | 150 | 0.86 |
| 3: Agentic SFT | LoRA + Projectors + Action Tokens | MAgenIT (10K) | 1,560 | 0.015 |
| 4: MM-DPO | LoRA + Projectors | Preferences (2K) | 750 | 0.20 |
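
As a rough sketch, the schedule above can be written down as plain data and tallied (stage names and step counts are taken from the table; the tuple layout itself is illustrative, not the repository's actual config format):

```python
# Training stages in execution order, with optimizer steps from the table above.
stages = [
    ("1: Encoding Alignment", "Input Projector",                   15_620),
    ("1.5: Re-alignment",     "Input Projector",                    6_000),
    ("2: Decoding Alignment", "Output Projectors",                    150),
    ("3: Agentic SFT",        "LoRA + Projectors + Action Tokens",  1_560),
    ("4: MM-DPO",             "LoRA + Projectors",                    750),
]

total_steps = sum(steps for _, _, steps in stages)
print(total_steps)  # 24080 optimizer steps across the whole pipeline
```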

## Evaluation

23/24 functional tests pass (95.8%) across 9 categories, including:

- Text QA: 3/4 (75%)
- Image/Audio/Video Understanding: 12/12 (100%)
- Image Generation: 1/1 (100%)
- Audio Generation: 1/1 (100%)
- Multi-turn Dialogue: 1/1 (100%)
- Intent Detection: 4/4 (100%)

## Files

- `backbone/` – LoRA adapter for Vicuna-7B-v1.5
- `input_projector.pt` – Trained input projector (1024 → 4096)
- `output_projectors.pt` – Trained output projectors (image, audio, video)
- `tokenizer/` – Extended tokenizer with special tokens
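
For intuition, the input projector's job can be sketched as a single linear map from ImageBind's 1024-d embedding space into Vicuna's 4096-d hidden space (a toy stand-in with random weights; the real `input_projector.pt` holds trained weights and may contain additional layers):

```python
import random

IN_DIM, OUT_DIM = 1024, 4096  # ImageBind embedding size -> Vicuna hidden size

# Toy weight matrix standing in for the trained projector in input_projector.pt.
random.seed(0)
weight = [[random.gauss(0.0, 0.02) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

def project(embedding):
    """Map a 1024-d multimodal embedding into the LLM's 4096-d input space."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight]

llm_token = project([0.1] * IN_DIM)
print(len(llm_token))  # 4096
```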

## Usage

```python
import torch

from omniagent.model.omniagent_arch import OmniAgentModel

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = model.chat(text="What is the capital of Japan?")
print(output.text)  # "Tokyo"
```

## Hardware

Trained on 2× NVIDIA RTX A6000 GPUs (48 GB each) in under 48 hours total.

## Citation

```bibtex
@techreport{haque2026omniagent,
  title={OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation and Agentic Capabilities},
  author={Haque, Md Rezwan},
  year={2026},
  institution={University of Waterloo}
}
```
|