micro-Omni (uOmni): Tiny Multimodal AI
A from-scratch multimodal AI that handles text + images + audio (in and out) on a single GPU. Inspired by Qwen3 Omni's Thinker-Talker architecture.
3.4M params | Qwen3.5-aligned | Trained on synthetic data | MIT License
Architecture
Key Features
- GQA (Grouped Query Attention) with 2:1 Q:KV ratio
- Multi-Token Prediction (predict t+2, t+3 during training)
- SwiGLU FFN with 8/3 ratio (Qwen3.5 standard)
- Sliding Window Attention infrastructure (configurable)
- YaRN RoPE for context extension beyond training length
- Label Smoothing (0.1) for better calibration
- Flash Attention via PyTorch scaled_dot_product_attention
- HiFi-GAN vocoder + Griffin-Lim fallback for speech synthesis
- OCR model for text extraction from images
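A few of the features above reduce to small pieces of arithmetic, sketched here in plain Python. This is an illustrative sketch and not the repository's code: the function names, the head counts (4 query heads, 2 KV heads), the model width of 128, and the round-up-to-a-multiple-of-8 rule are all assumptions. Only the 2:1 ratio, the 8/3 ratio, the smoothing value 0.1, and the t+2/t+3 offsets come from the list. The label-smoothing formula follows the common convention of mixing the one-hot target with a uniform distribution over all classes.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """GQA: map a query head index to the KV head its group shares."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # 2 for a 2:1 Q:KV ratio
    return q_head // group_size


def swiglu_hidden_dim(d_model, multiple_of=8):
    """SwiGLU FFN width at an 8/3 ratio, rounded up to a multiple.

    The rounding rule is an assumption; the source only states the ratio.
    """
    # Smallest multiple of `multiple_of` that is >= 8 * d_model / 3.
    return -(-8 * d_model // (3 * multiple_of)) * multiple_of


def smoothed_targets(true_class, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the true class plus eps spread uniformly."""
    uniform = eps / num_classes
    return [
        (1.0 - eps) + uniform if c == true_class else uniform
        for c in range(num_classes)
    ]


def mtp_targets(tokens, offsets=(1, 2, 3)):
    """Multi-token prediction: for offset k, position t is paired with tokens[t + k].

    Offset 1 is the usual next-token target; 2 and 3 are the extra heads.
    """
    return {k: tokens[k:] for k in offsets}


# Hypothetical sizes, chosen only to make the mapping visible.
print([kv_head_for_query_head(h, 4, 2) for h in range(4)])  # [0, 0, 1, 1]
print(swiglu_hidden_dim(128))                               # 344
print(mtp_targets([10, 11, 12, 13, 14])[3])                 # [13, 14]
```

With a 2:1 ratio, each pair of query heads reads the same key and value projections, which halves the KV cache relative to full multi-head attention.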
Performance (Synthetic Data, 2000 samples)
| Component | Metric | Score |
|---|---|---|
| Thinker (GQA+MTP) | Top-1 Accuracy | 65.09% |
| | Top-5 Accuracy | 92.92% |
| | Perplexity | 2.71 |
| Audio Encoder (12.5Hz) | Val Loss | 0.0000202 |
| Vision Encoder (CLIP) | Diversity | 0.93 |
| Talker (TTS) | Top-5 Accuracy | 92-93% |
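For readers who want to relate these numbers to each other, here is a minimal sketch of how such metrics are conventionally computed. It assumes perplexity is the exponential of the mean per-token cross-entropy in nats, which is the standard definition but is not confirmed by the source; the function names and toy inputs are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood), with NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def top_k_accuracy(scores, targets, k):
    """Fraction of examples whose target is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)


# Under the assumption above, a perplexity of 2.71 corresponds to a mean
# cross-entropy of about 0.997 nats per token.
implied_loss = math.log(2.71)

# Toy scores for two examples over three classes (not model output).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
toy_targets = [2, 1]
print(top_k_accuracy(toy_scores, toy_targets, k=1))  # 0.0
print(top_k_accuracy(toy_scores, toy_targets, k=2))  # 1.0
```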
Quick Start
Text Generation
Full Multimodal (Image + Audio + Text)
Model Components
| Component | Params | File Prefix |
|---|---|---|
| Thinker (LLM) | 792K | |
| Audio Encoder | 998K | |
| Vision Encoder | 744K | |
| Talker (TTS) | 776K | |
| RVQ Codec | 33K | |
| Projectors | 33K | , |
| Total | 3.4M | |
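As a quick arithmetic check, the per-component counts in the table can be summed to confirm the rounded total. The dictionary below simply restates the table; the variable names are illustrative.

```python
# Parameter counts in thousands, copied from the table above.
component_params_k = {
    "Thinker (LLM)": 792,
    "Audio Encoder": 998,
    "Vision Encoder": 744,
    "Talker (TTS)": 776,
    "RVQ Codec": 33,
    "Projectors": 33,
}

total_k = sum(component_params_k.values())
total_m = total_k / 1000

print(f"{total_k}K = {total_m:.1f}M")  # 3376K = 3.4M
```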
Files
- HF-compatible text model (flat keys, 3.3MB)
- Full multimodal model (prefixed keys, 51MB)
- Self-contained HF model classes (no external dependencies)
- HuggingFace config with auto_map
- SentencePiece BPE tokenizer
Training
Trained on an RTX 5070 Ti (16GB VRAM) in ~90 minutes across 7 stages:
- Thinker LLM (text, cross-entropy + MTP)
- Audio Encoder (CTC loss, 12.5Hz)
- Vision Encoder (CLIP contrastive)
- Talker + RVQ (speech codes)
- Multimodal SFT (all modalities)
- HiFi-GAN Vocoder (optional)
- OCR Model (optional)
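The Talker stage produces speech codes through an RVQ (residual vector quantization) codec. The sketch below shows the general RVQ idea in plain Python: each stage quantizes whatever the previous stages left over. It is not the repository's implementation; the function names, the two tiny codebooks, and the 2-dimensional input are hypothetical and chosen only so the steps can be followed by hand.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
toy_codes = rvq_encode([1.2, 0.8], toy_codebooks)
print(toy_codes)                             # [1, 1]
print(rvq_decode(toy_codes, toy_codebooks))  # [1.25, 0.75]
```

Each added stage refines the reconstruction, so a frame of audio features becomes a short tuple of integer codes that a language-model-style decoder can predict.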
Links
- GitHub: github.com/prskid1000/micro-Omni
- Study Guide: 25 chapters + 5 appendices, zero-to-master (in folder)
- License: MIT