---
license: apache-2.0
tags:
- multimodal
- any-to-any
- agent
- image-generation
- audio-generation
- vicuna
- imagebind
- lora
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation

**Author:** Md Rezwan Haque – CPAMI Lab, University of Waterloo

## Overview

OmniAgent is a unified multimodal agent that accepts text, image, audio, and video as input and can generate any of those four modalities as output. The model uses a three-tier architecture:

- **Encoder:** ImageBind-Huge (frozen, 632M params)
- **Backbone:** Vicuna-7B-v1.5 + LoRA (r=64, α=128)
- **Decoders:** Stable Diffusion 2.1 (image), AudioLDM-L (audio), ZeroScope v2 (video)

Total parameters: 7.4B. Trainable parameters: 660M (8.9%).
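
The trainable fraction can be sanity-checked with a few lines of arithmetic (the 7.4B and 660M figures come from this card; the percentage is simply their ratio):

```python
# Sanity-check the trainable-parameter fraction quoted above.
total_params = 7.4e9      # frozen encoder + backbone + decoders + trainable parts
trainable_params = 660e6  # LoRA adapter plus input/output projectors

fraction = trainable_params / total_params
print(f"{fraction:.1%}")  # ~8.9% of all parameters are updated during training
```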

## Training Pipeline

| Stage | What Is Trained | Data | Steps | Final Loss |
|-------|----------------|------|-------|------------|
| 1: Encoding Alignment | Input Projector | CC3M + AudioCaps (100K) | 15,620 | 16.55 |
| 1.5: Re-alignment | Input Projector | Synthetic captions (20K) | 6,000 | 16.55 |
| 2: Decoding Alignment | Output Projectors | Decoder embeddings | 150 | 0.86 |
| 3: Agentic SFT | LoRA + Projectors + Action Tokens | MAgenIT (10K) | 1,560 | 0.015 |
| 4: MM-DPO | LoRA + Projectors | Preferences (2K) | 750 | 0.20 |
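
As a rough sketch, the schedule above can be written down as plain data and tallied (stage names and step counts are taken from the table; the tuple layout itself is illustrative, not the repository's actual config format):

```python
# Training stages in execution order, with optimizer steps from the table above.
stages = [
    ("1: Encoding Alignment", "Input Projector",                   15_620),
    ("1.5: Re-alignment",     "Input Projector",                    6_000),
    ("2: Decoding Alignment", "Output Projectors",                    150),
    ("3: Agentic SFT",        "LoRA + Projectors + Action Tokens",  1_560),
    ("4: MM-DPO",             "LoRA + Projectors",                    750),
]

total_steps = sum(steps for _, _, steps in stages)
print(total_steps)  # 24080 optimizer steps across the whole pipeline
```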

## Evaluation

23/24 functional tests pass (95.8%) across 9 categories, including:

- Text QA: 3/4 (75%)
- Image/Audio/Video Understanding: 12/12 (100%)
- Image Generation: 1/1 (100%)
- Audio Generation: 1/1 (100%)
- Multi-turn Dialogue: 1/1 (100%)
- Intent Detection: 4/4 (100%)

## Files

- `backbone/` – LoRA adapter for Vicuna-7B-v1.5
- `input_projector.pt` – Trained input projector (1024 → 4096)
- `output_projectors.pt` – Trained output projectors (image, audio, video)
- `tokenizer/` – Extended tokenizer with special tokens
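
For intuition, the input projector's job can be sketched as a single linear map from ImageBind's 1024-d embedding space into Vicuna's 4096-d hidden space (a toy stand-in with random weights; the real `input_projector.pt` holds trained weights and may contain additional layers):

```python
import random

IN_DIM, OUT_DIM = 1024, 4096  # ImageBind embedding size -> Vicuna hidden size

# Toy weight matrix standing in for the trained projector in input_projector.pt.
random.seed(0)
weight = [[random.gauss(0.0, 0.02) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

def project(embedding):
    """Map a 1024-d multimodal embedding into the LLM's 4096-d input space."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight]

llm_token = project([0.1] * IN_DIM)
print(len(llm_token))  # 4096
```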

## Usage

```python
import torch

from omniagent.model.omniagent_arch import OmniAgentModel

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = model.chat(text="What is the capital of Japan?")
print(output.text)  # "Tokyo"
```

## Hardware

Trained on 2× NVIDIA RTX A6000 GPUs (48 GB each) in under 48 hours total.

## Citation

```bibtex
@techreport{haque2026omniagent,
  title={OmniAgent: A Unified Multimodal Agent with Any-to-Any Generation and Agentic Capabilities},
  author={Haque, Md Rezwan},
  year={2026},
  institution={University of Waterloo}
}
```
|