---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- smollm
- whisper
- mlp
---

# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase—a minimal, hackable framework for training ASR models.

## Architecture

```
Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
```

**MLP Projector:**
- Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- Output normalization: RMSNorm
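
As a rough check on the 4x compression, the sketch below computes how many frames reach the language model for one 30-second window. It assumes each stride-2 convolution uses kernel size 3 and padding 1 (neither is stated here), and relies on the Whisper encoder emitting 1500 frames per 30-second window. `conv_out_len` and `projector_out_len` are hypothetical helper names, not part of the Tiny Audio codebase.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def projector_out_len(encoder_frames: int, num_convs: int = 2) -> int:
    """Apply the assumed stride-2 convolutions in sequence."""
    for _ in range(num_convs):
        encoder_frames = conv_out_len(encoder_frames)
    return encoder_frames


# Whisper encoders emit 1500 frames for a 30-second window.
print(projector_out_len(1500))
```

Under these assumptions a 30-second clip becomes 375 audio tokens in the language model's input. A different kernel size or padding would shift that count slightly.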

## Training Details

| Setting | Value |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 48GB |
| **Training Time** | ~24 hours |
| **Cost** | ~$12 |
| **Trainable Parameters** | ~12M (projector only) |

## Performance

**Word Error Rate (WER): 12.14%** on LoquaciousSet test set.

See the [community leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard) for comparisons.
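
For readers who want to score their own transcripts, here is a minimal sketch of how WER is commonly computed: word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the text normalization behind the reported figure (casing, punctuation) is not stated here, so numbers from this sketch may not be directly comparable.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```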

## Usage

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

result = pipe("path/to/audio.wav")
print(result["text"])
```

## Limitations

- English only
- Optimized for 16kHz audio; other sample rates are resampled automatically
- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- Maximum audio length is limited by the model's context window
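
One possible workaround for the length limit is to split long recordings into fixed windows and transcribe each one separately. The sketch below only computes the sample boundaries. The 30-second window is an assumption borrowed from Whisper's fixed input window, not a documented limit of this model, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16_000,
    chunk_seconds: float = 30.0,
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for consecutive fixed-size chunks."""
    chunk = int(sample_rate * chunk_seconds)
    if chunk <= 0:
        raise ValueError("chunk length must be positive")
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# A 70-second clip at 16 kHz yields two full windows and a 10-second remainder.
print(chunk_bounds(16_000 * 70))
```

Cutting at fixed boundaries can split a word in two, so overlapping windows or silence-based splitting may give cleaner transcripts.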

## Learn More

- **[Train your own model](https://github.com/alexkroman/tiny-audio)** — The full codebase with training scripts
- **[Free 3-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** — Build your own ASR system from scratch
- **[Submit to leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard)** — Share your trained model