---
license: apache-2.0
tags:
- multimodal
- any-to-any
- agent
- image-generation
- audio-generation
- vicuna
- imagebind
- lora
language:
- en
pipeline_tag: text-generation
library_name: transformers
---
# OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation
**Author:** Md Rezwan Haque, CPAMI Lab, University of Waterloo
## Overview
OmniAgent is a unified multimodal agent that accepts text, image, audio, and video as input and can produce any of those modalities as output. The model uses a three-tier architecture:
- **Encoder:** ImageBind-Huge (frozen, 632M params)
- **Backbone:** Vicuna-7B-v1.5 + LoRA (r=64, α=128)
- **Decoders:** Stable Diffusion 2.1 (image), AudioLDM-L (audio), ZeroScope v2 (video)
Total parameters: 7.4B. Trainable parameters: 660M (8.9%).
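The only trained bridge between the frozen encoder and the backbone is the input projector. Below is a minimal sketch of that tier, assuming a single linear layer: the 1024→4096 dimensions come from this card, while the class name and the commented data flow are illustrative assumptions, not the shipped code.
```python
import torch
import torch.nn as nn

class InputProjector(nn.Module):
    """Linear bridge from ImageBind's 1024-d embeddings to Vicuna's 4096-d hidden size."""
    def __init__(self, in_dim: int = 1024, out_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Conceptual data flow (module names are illustrative, not the actual API):
#   embed  = imagebind_huge(media)             # frozen encoder -> (B, 1024)
#   prefix = InputProjector()(embed)           # -> (B, 4096), prepended to text tokens
#   hidden = vicuna_lora(prefix, text_tokens)  # LoRA-tuned backbone
#   image  = sd21(output_projector(hidden))    # one decoder per output modality
```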
## Training Pipeline
| Stage | What Is Trained | Data | Steps | Final Loss |
|-------|----------------|------|-------|------------|
| 1: Encoding Alignment | Input Projector | CC3M + AudioCaps (100K) | 15,620 | 16.55 |
| 2: Decoding Alignment | Output Projectors | Decoder embeddings | 150 | 0.86 |
| 3: Agentic SFT | LoRA + Projectors + Action Tokens | MAgenIT (10K) | 1,560 | 0.015 |
| 4: MM-DPO | LoRA + Projectors | Preferences (2K) | 750 | 0.20 |
| 1.5: Re-alignment | Input Projector | Synthetic captions (20K) | 6,000 | 16.55 |
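The LoRA hyperparameters used in Stages 3-4 (r=64, α=128, from the Overview) translate directly into a `peft` configuration, and Stage 4's MM-DPO optimizes a preference objective. The sketch below assumes the standard DPO loss; the target modules, dropout, and β value are assumptions, not taken from the OmniAgent repository.
```python
import torch
import torch.nn.functional as F
from peft import LoraConfig

# LoRA settings match r=64, alpha=128 from this card; target_modules and
# dropout are assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,                                        # assumed
    task_type="CAUSAL_LM",
)

def dpo_loss(pi_chosen: torch.Tensor, pi_rejected: torch.Tensor,
             ref_chosen: torch.Tensor, ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over sequence log-probs; beta=0.1 is an assumed value."""
    pi_logratio = pi_chosen - pi_rejected
    ref_logratio = ref_chosen - ref_rejected
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```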
## Evaluation
23/24 functional tests pass (95.8%) across 9 categories; per-category results include:
- Text QA: 3/4 (75%)
- Image/Audio/Video Understanding: 12/12 (100%)
- Image Generation: 1/1 (100%)
- Audio Generation: 1/1 (100%)
- Multi-turn Dialogue: 1/1 (100%)
- Intent Detection: 4/4 (100%)
## Files
- `backbone/` – LoRA adapter for Vicuna-7B-v1.5
- `input_projector.pt` – Trained input projector (1024→4096; see the loading sketch below)
- `output_projectors.pt` – Trained output projectors (image, audio, video)
- `tokenizer/` – Extended tokenizer with special tokens
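To inspect the projector weights without the full agent stack, here is a minimal loading sketch. It assumes `input_projector.pt` is a plain PyTorch state dict for a single 1024→4096 linear layer; the checkpoint's actual key layout is not documented here.
```python
import torch
import torch.nn as nn

# Assumes a plain state dict for one linear layer; adjust if the
# checkpoint nests its keys differently.
projector = nn.Linear(1024, 4096)
state = torch.load("input_projector.pt", map_location="cpu")
projector.load_state_dict(state)

embedding = torch.randn(1, 1024)   # stand-in for a frozen ImageBind embedding
token = projector(embedding)       # (1, 4096): ready to prepend to the backbone
print(token.shape)
```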
## Usage
```python
import torch
from omniagent.model.omniagent_arch import OmniAgentModel

# Load the checkpoint (LoRA adapter, projectors, extended tokenizer).
model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = model.chat(text="What is the capital of Japan?")
print(output.text)  # "Tokyo"
```
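The snippet above covers text-only chat. For any-to-any generation, a hypothetical follow-up is sketched below; it assumes the result of `chat` exposes an `image` attribute alongside `text`, which is not confirmed by this card.
```python
# Hypothetical: `output.image` is an assumed counterpart to `output.text`.
output = model.chat(text="Generate an image of a red fox in the snow.")
if getattr(output, "image", None) is not None:
    output.image.save("fox.png")  # assumes a PIL-style image object
```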
## Hardware
Trained on 2× NVIDIA RTX A6000 (48 GB each) in under 48 hours total.
## Citation
```bibtex
@article{haque2026omniagent,
  title       = {OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation and Agentic Capabilities},
  author      = {Haque, Md Rezwan},
  year        = {2026},
  institution = {University of Waterloo}
}
```