---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- smollm
- whisper
- mlp
---
# Tiny Audio
A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase—a minimal, hackable framework for training ASR models.
## Architecture
```
Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
```
**MLP Projector:**
- Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- Output normalization: RMSNorm
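
To make the 4x sequence compression concrete, here is a small shape calculation. It is only a sketch: it assumes each stride-2 convolution uses kernel size 3 and padding 1 (neither value is stated above) and that the Whisper encoder emits 1,500 frames for a 30-second window. `conv_out_len` and `projector_tokens` are illustrative names, not functions from the Tiny Audio codebase.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def projector_tokens(encoder_frames: int, num_convs: int = 2) -> int:
    """Number of embeddings handed to the language model after downsampling."""
    length = encoder_frames
    for _ in range(num_convs):
        # Assumed kernel/padding; only the stride of 2 comes from the description above.
        length = conv_out_len(length)
    return length


# Whisper encodes a 30-second window into 1500 frames (50 per second).
print(projector_tokens(1500))  # 375, i.e. 4x fewer than 1500
```

Under these assumptions, 30 seconds of audio costs the language model 375 input positions instead of 1,500.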
## Training Details
| Setting | Value |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 (48 GB) |
| **Training Time** | ~24 hours |
| **Cost** | ~$12 |
| **Trainable Parameters** | ~12M (projector only) |
## Performance
**Word Error Rate (WER): 12.14%** on the LoquaciousSet test set.
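
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below shows the standard definition only. It is not the evaluation script used for the figure above, and it applies no text normalization (casing, punctuation), which real evaluations usually do. `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
```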
## Usage
```python
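from transformers import pipeline

# trust_remote_code=True lets transformers load the custom model code shipped with this repo.
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("path/to/audio.wav")  # path to a local audio file
print(result["text"])  # the transcription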
```
## Limitations
- English only
- Optimized for 16kHz audio; other sample rates are resampled automatically
- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- Maximum audio length is limited by the model's context window
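
One common way to work around a length limit is to split a long recording into overlapping windows and transcribe each window separately. The sketch below only computes the window boundaries in samples. The 30-second window and 5-second overlap are illustrative assumptions, not values taken from this model, and `chunk_bounds` is a hypothetical helper rather than part of the Tiny Audio codebase.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16_000,
                 window_s: float = 30.0, overlap_s: float = 5.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than window_s")
    bounds = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step
    return bounds


# A 70-second recording at 16 kHz yields windows 0-30 s, 25-55 s, and 50-70 s.
for start, end in chunk_bounds(70 * 16_000):
    print(start / 16_000, end / 16_000)
```

Each window can then be sliced from the waveform and transcribed on its own. Merging the text from the overlapping regions is left out of this sketch.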
## Learn More
- **[Train your own model](https://github.com/alexkroman/tiny-audio)** — The full codebase with training scripts
- **[Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** — Build your own ASR system from scratch