micro-Omni (uOmni): Tiny Multimodal AI
A from-scratch multimodal AI that handles text + images + audio (in and out) on a single GPU. Inspired by Qwen3 Omni's Thinker-Talker architecture.
3.4M params | Qwen3.5-aligned | Trained on synthetic data | MIT License
Architecture
Key Features
- GQA (Grouped Query Attention) with 2:1 Q:KV ratio
- Multi-Token Prediction (predict t+2, t+3 during training)
- SwiGLU FFN with 8/3 ratio (Qwen3.5 standard)
- Sliding Window Attention infrastructure (configurable)
- YaRN RoPE for context extension beyond training length
- Label Smoothing (0.1) for better calibration
- Flash Attention via PyTorch scaled_dot_product_attention
- HiFi-GAN vocoder + Griffin-Lim fallback for speech synthesis
- OCR model for text extraction from images
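As a rough illustration of two of the features above, the sketch below shows how a 2:1 Q:KV ratio maps query heads onto shared key/value heads, and how an 8/3 SwiGLU ratio turns into a hidden width. The head counts (4 query, 2 KV), the model width (128), and the rounding multiple (32) are hypothetical values picked for the example; they are not taken from the micro-Omni config, and the function names are illustrative only.

```python
def kv_head_for_query(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head shared by a given query head under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # 2 for a 2:1 Q:KV ratio
    return q_head // group_size


def swiglu_hidden_dim(d_model: int, ratio: float = 8 / 3, multiple_of: int = 32) -> int:
    """Hidden width of a SwiGLU FFN: ratio * d_model, rounded up to a multiple.

    Rounding up to `multiple_of` is a common convention and an assumption here.
    """
    raw = int(ratio * d_model)
    return ((raw + multiple_of - 1) // multiple_of) * multiple_of


# Hypothetical sizes for illustration only.
N_Q_HEADS, N_KV_HEADS, D_MODEL = 4, 2, 128

mapping = [kv_head_for_query(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)]
hidden = swiglu_hidden_dim(D_MODEL)

print("query head -> KV head:", mapping)
print("SwiGLU hidden width:", hidden)
```

The 8/3 factor is the usual choice for SwiGLU because the block has three weight matrices instead of two: 3 × d × (8/3)d = 8d², which matches the 2 × d × 4d parameters of a conventional 4x FFN.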
Performance (Synthetic Data, 2000 samples)
| Component | Metric | Score |
|---|---|---|
| Thinker (GQA+MTP) | Top-1 Accuracy | 65.09% |
| | Top-5 Accuracy | 92.92% |
| | Perplexity | 2.71 |
| Audio Encoder (12.5Hz) | Val Loss | 0.0000202 |
| Vision Encoder (CLIP) | Diversity | 0.93 |
| Talker (TTS) | Top-5 Accuracy | 92-93% |
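To help read the perplexity figure, the sketch below converts between perplexity and mean per-token cross-entropy. It assumes the reported perplexity was computed as the exponential of the mean cross-entropy in nats (natural log) without label smoothing; the card does not state this, so treat the implied loss as an estimate.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_nll_nats)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean cross-entropy in nats implied by a perplexity."""
    return math.log(ppl)


reported_ppl = 2.71  # Thinker perplexity from the table above
implied_loss = loss_from_perplexity(reported_ppl)

print(f"implied cross-entropy: {implied_loss:.3f} nats/token")
```

Under that assumption, a perplexity of 2.71 corresponds to a cross-entropy of just under 1 nat per token, meaning the model is on average about as uncertain as a uniform choice among roughly 2.7 tokens.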
Quick Start
Text Generation
Full Multimodal (Image + Audio + Text)
Model Components
| Component | Params | File Prefix |
|---|---|---|
| Thinker (LLM) | 792K | |
| Audio Encoder | 998K | |
| Vision Encoder | 744K | |
| Talker (TTS) | 776K | |
| RVQ Codec | 33K | |
| Projectors | 33K | , |
| Total | 3.4M | |
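The total can be checked directly from the per-component counts. The short sketch below sums the rounded figures from the table (in thousands of parameters) and reports each component's share; because the inputs are rounded, the result is approximate.

```python
# Per-component counts in thousands of parameters, copied from the table above.
PARAMS_K = {
    "Thinker (LLM)": 792,
    "Audio Encoder": 998,
    "Vision Encoder": 744,
    "Talker (TTS)": 776,
    "RVQ Codec": 33,
    "Projectors": 33,
}

total_k = sum(PARAMS_K.values())
share = {name: count / total_k for name, count in PARAMS_K.items()}

print(f"total: {total_k}K (~{total_k / 1000:.1f}M)")
for name, frac in share.items():
    print(f"{name:15s} {frac:6.1%}")
```

The four main networks are of similar size, each between roughly a fifth and a third of the total, while the codec and projectors together account for about 2%.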
Files
- HF-compatible text model (flat keys, 3.3MB)
- Full multimodal model (prefixed keys, 51MB)
- Self-contained HF model classes (no external dependencies)
- HuggingFace config with auto_map
- SentencePiece BPE tokenizer
Training
Trained on RTX 5070 Ti (16GB VRAM) in ~90 minutes across 7 stages:
- Thinker LLM (text, cross-entropy + MTP)
- Audio Encoder (CTC loss, 12.5Hz)
- Vision Encoder (CLIP contrastive)
- Talker + RVQ (speech codes)
- Multimodal SFT (all modalities)
- HiFi-GAN Vocoder (optional)
- OCR Model (optional)
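The first stage combines ordinary next-token cross-entropy with Multi-Token Prediction. As a minimal sketch of how such targets could be built, the code below pairs each position t with the tokens at t+1, t+2, and t+3 and combines per-offset losses with a weight. The function names, the auxiliary weight of 0.3, and the example loss values are hypothetical; the actual micro-Omni training code may organize this differently.

```python
from typing import Dict, List, Tuple


def mtp_targets(
    tokens: List[int], offsets: Tuple[int, ...] = (1, 2, 3)
) -> Dict[int, List[Tuple[int, int]]]:
    """For each offset k, pair every input position t with the token at t + k.

    Positions whose target would fall past the end of the sequence are dropped.
    """
    return {
        k: [(t, tokens[t + k]) for t in range(len(tokens) - k)]
        for k in offsets
    }


def combine_losses(per_offset_loss: Dict[int, float], aux_weight: float = 0.3) -> float:
    """Full weight on the next-token loss (k == 1), aux_weight on further offsets.

    aux_weight is a hypothetical value, not taken from the project.
    """
    return sum(
        loss if k == 1 else aux_weight * loss
        for k, loss in per_offset_loss.items()
    )


tokens = [10, 11, 12, 13, 14]  # toy token ids
targets = mtp_targets(tokens)
total = combine_losses({1: 1.0, 2: 2.0, 3: 3.0})  # made-up per-offset losses

for k, pairs in targets.items():
    print(f"offset +{k}: {pairs}")
print("combined loss:", total)
```

Larger offsets yield fewer training pairs per sequence, since the last k positions have no target k steps ahead.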
Links
- GitHub: github.com/prskid1000/micro-Omni
- Study Guide: 25 chapters + 5 appendices, zero-to-master (in folder)
- License: MIT