Tiny Audio
A speech recognition model trained in about 24 hours on a single NVIDIA A40 GPU for roughly $12. Built with Tiny Audio, a minimal, hackable ASR framework.
Architecture
Audio (16 kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
Only the projector is trained (~12M params). The encoder and decoder remain frozen.
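The "train only the projector" setup can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the Tiny Audio implementation: the encoder and decoder are replaced by small linear stand-ins, and the two-layer MLP, all dimensions, and the names `MLPProjector` and `freeze` are hypothetical choices made for the example.

```python
import torch
from torch import nn


class MLPProjector(nn.Module):
    """Hypothetical two-layer MLP mapping encoder features to the LLM embedding size."""

    def __init__(self, enc_dim: int, hidden_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so the module is never updated."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


# Stand-ins for the real frozen components (illustrative sizes only).
encoder = freeze(nn.Linear(80, 64))       # placeholder for the audio encoder
decoder = freeze(nn.Linear(128, 1000))    # placeholder for the language model
projector = MLPProjector(enc_dim=64, hidden_dim=256, llm_dim=128)

# Only the projector's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

features = torch.randn(2, 50, 80)         # (batch, frames, feature_dim)
with torch.no_grad():                     # no graph needed through the frozen encoder
    encoded = encoder(features)
logits = decoder(projector(encoded))      # gradients reach the projector only

loss = logits.mean()                      # dummy loss, just to drive one update
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
frozen = sum(p.numel() for m in (encoder, decoder) for p in m.parameters())
print(f"trainable: {trainable}, frozen: {frozen}")
```

Because the frozen modules have `requires_grad=False`, the backward pass still flows through the decoder stand-in to reach the projector, but no gradients are stored for the frozen weights.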
Training
| Setting | Value |
|---|---|
| Dataset | LoquaciousSet (25,000 hours) |
| Hardware | Single NVIDIA A40 |
| Time | ~24 hours |
| Cost | ~$12 |
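As a quick sanity check on these figures, the arithmetic below derives the implied hourly GPU price and the spend per hour of dataset audio. Both inputs are the approximate values from the table, so the results are rough estimates rather than billed amounts.

```python
# Approximate figures taken from the training table.
cost_usd = 12.0
train_hours = 24.0
dataset_hours = 25_000.0

hourly_rate = cost_usd / train_hours             # implied GPU rental price
cost_per_audio_hour = cost_usd / dataset_hours   # spend per hour of dataset audio

print(f"Implied GPU rate: ${hourly_rate:.2f}/hour")
print(f"Cost per dataset hour: ${cost_per_audio_hour:.5f}")
```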
Usage
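from transformers import pipeline

# trust_remote_code=True lets Transformers run the custom model code shipped in the repository.
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Transcribe a local audio file; the result is a dict with the transcript under "text".
result = pipe("audio.wav")
print(result["text"])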
Limitations
- English only
- Expects 16 kHz audio (other sample rates are resampled automatically)
- May degrade on accented speech, noisy audio, or domain-specific terms
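Although resampling is handled automatically, it can be useful to control it explicitly, for example when audio is already in memory. The sketch below downmixes to mono and resamples to 16 kHz with SciPy's polyphase resampler. The helper name `to_16k_mono` and the `(samples, channels)` layout are assumptions for illustration. The dict at the end is the in-memory input form of the standard Transformers ASR pipeline; whether this model's custom pipeline code accepts it unchanged has not been verified here.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sample rate the model expects


def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz with a polyphase filter."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:                  # assumed layout: (samples, channels)
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        g = gcd(TARGET_SR, sr)
        # Upsample by TARGET_SR/g and downsample by sr/g, with anti-alias filtering.
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    return audio.astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz as stand-in input.
sr = 44_100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
resampled = to_16k_mono(tone, sr)

# Dict form used by the standard Transformers ASR pipeline for in-memory audio.
pipe_input = {"raw": resampled, "sampling_rate": TARGET_SR}
```

With a pipeline object built as in the Usage section, this input would be passed as `pipe(pipe_input)`.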