micro-Omni (uOmni) — Tiny Multimodal AI

A from-scratch multimodal model that handles text, images, and audio (as both input and output) on a single GPU, inspired by Qwen3 Omni's Thinker-Talker architecture.

3.4M params | Qwen3.5-aligned | Trained on synthetic data | MIT License

Architecture

Key Features

  • GQA (Grouped Query Attention) with 2:1 Q:KV ratio
  • Multi-Token Prediction (predict t+2, t+3 during training)
  • SwiGLU FFN with 8/3 ratio (Qwen3.5 standard)
  • Sliding Window Attention infrastructure (configurable)
  • YaRN RoPE for context extension beyond training length
  • Label Smoothing (0.1) for better calibration
  • Flash Attention via PyTorch scaled_dot_product_attention
  • HiFi-GAN vocoder + Griffin-Lim fallback for speech synthesis
  • OCR model for text extraction from images
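A minimal sketch of the first two features above: grouped query attention with a 2:1 Q:KV head ratio, routed through PyTorch's `scaled_dot_product_attention` (which dispatches to Flash Attention when available). All shapes and head counts here are illustrative, not the model's actual configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only (not uOmni's real config).
B, T, n_q_heads, n_kv_heads, head_dim = 2, 16, 4, 2, 32

q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

# GQA: each KV head is shared by (n_q_heads // n_kv_heads) query heads,
# so expand K and V along the head axis to match the queries.
group = n_q_heads // n_kv_heads          # 2 for a 2:1 Q:KV ratio
k = k.repeat_interleave(group, dim=1)    # (B, n_q_heads, T, head_dim)
v = v.repeat_interleave(group, dim=1)

# SDPA uses a fused Flash Attention kernel when the backend supports it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```

The memory win is that the KV cache stores only `n_kv_heads` heads, half as many as a standard multi-head layout at this ratio.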

Performance (Synthetic Data, 2000 samples)

Component                Metric           Score
Thinker (GQA+MTP)        Top-1 Accuracy   65.09%
                         Top-5 Accuracy   92.92%
                         Perplexity       2.71
Audio Encoder (12.5Hz)   Val Loss         0.0000202
Vision Encoder (CLIP)    Diversity        0.93
Talker (TTS)             Top-5 Accuracy   92-93%
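As a sanity check on these numbers: perplexity is just the exponential of the mean per-token cross-entropy, so the Thinker's reported 2.71 corresponds to a loss of roughly 1.0 nats per token.

```python
import math

# Perplexity = exp(mean cross-entropy); invert the reported value.
ppl = 2.71
ce = math.log(ppl)      # mean CE loss in nats per token
print(round(ce, 2))     # 1.0
```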

Quick Start

Text Generation
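The repo's actual entry point isn't shown here, so below is a generic greedy-decoding loop over a stand-in causal LM to illustrate the shape of text generation. `TinyLM`, the vocabulary size, and the prompt token ids are all hypothetical placeholders, not uOmni's API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyLM(nn.Module):
    """Stand-in causal LM (hypothetical, not the uOmni Thinker)."""
    def __init__(self, vocab=64, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                 # ids: (B, T)
        return self.head(self.emb(ids))    # logits: (B, T, vocab)

model = TinyLM().eval()
ids = torch.tensor([[1, 2, 3]])            # hypothetical prompt token ids

with torch.no_grad():
    for _ in range(5):                     # generate 5 tokens greedily
        next_id = model(ids)[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)

print(ids.shape)  # torch.Size([1, 8])
```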

Full Multimodal (Image + Audio + Text)

Model Components

Component        Params   File Prefix
Thinker (LLM)    792K
Audio Encoder    998K
Vision Encoder   744K
Talker (TTS)     776K
RVQ Codec        33K
Projectors       33K
Total            3.4M

Files

  • — HF-compatible text model (flat keys, 3.3MB)
  • — Full multimodal model (prefixed keys, 51MB)
  • — Self-contained HF model classes (no external dependencies)
  • — HuggingFace config with auto_map
  • — SentencePiece BPE tokenizer

Training

Trained on a single RTX 5070 Ti (16GB VRAM) in ~90 minutes across 7 stages:

  1. Thinker LLM (text, cross-entropy + MTP)
  2. Audio Encoder (CTC loss, 12.5Hz)
  3. Vision Encoder (CLIP contrastive)
  4. Talker + RVQ (speech codes)
  5. Multimodal SFT (all modalities)
  6. HiFi-GAN Vocoder (optional)
  7. OCR Model (optional)
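The stage-1 objective above (cross-entropy + MTP) can be sketched as follows: a standard next-token loss plus auxiliary losses for the t+2 and t+3 heads, using the 0.1 label smoothing noted in the feature list. The tensor shapes and the 0.5 auxiliary weight are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 10, 64                         # illustrative sizes
targets = torch.randint(0, V, (B, T))

# One logit tensor per prediction head: offsets 1 (next token), 2, and 3.
logits = {off: torch.randn(B, T, V) for off in (1, 2, 3)}

def shifted_ce(head_logits, targets, offset):
    # Align position t's prediction with the token at t + offset.
    pred = head_logits[:, : T - offset].reshape(-1, V)
    gold = targets[:, offset:].reshape(-1)
    return F.cross_entropy(pred, gold, label_smoothing=0.1)

main = shifted_ce(logits[1], targets, 1)
aux = sum(shifted_ce(logits[o], targets, o) for o in (2, 3))
loss = main + 0.5 * aux                     # aux weight is illustrative
print(loss.item() > 0)  # True
```

The auxiliary heads are used only during training; at inference the model decodes one token at a time as usual.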

Links
