---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- smollm
- whisper
- mlp
---

# Tiny Audio

A speech recognition model trained in about 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase, a minimal, hackable framework for training ASR models.

## Architecture

```
Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
```

**MLP Projector:**
- Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- Output normalization: RMSNorm
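
To make the projector's shapes concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not taken from the Tiny Audio codebase: the conv kernel size and padding (3 and 1), the presence of bias terms in the linear layers, and the 1500-frame encoder output (Whisper's usual 30-second window) are assumptions for illustration, and the helper names are hypothetical.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1D convolution (standard formula, dilation 1).

    kernel=3 and padding=1 are assumed values; the model card only
    states that each conv layer uses stride 2.
    """
    return (length + 2 * padding - kernel) // stride + 1


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer (bias assumed present)."""
    return in_features * out_features + (out_features if bias else 0)


# Assumption: the Whisper encoder emits 1500 frames for a 30-second window.
encoder_frames = 1500

# Two stride-2 convolutions give roughly 4x sequence compression.
after_conv1 = conv_out_len(encoder_frames)
projected_frames = conv_out_len(after_conv1)

# Linear (1280 -> 2048) -> GELU -> Linear (2048 -> 2048); GELU has no parameters.
mlp_params = linear_params(1280, 2048) + linear_params(2048, 2048)

print(f"frames: {encoder_frames} -> {after_conv1} -> {projected_frames}")
print(f"linear-layer parameters: {mlp_params:,}")
```

Under these assumptions, the language model sees a quarter as many audio positions as the encoder produces. The count above covers only the two linear layers; the conv layers and RMSNorm add to it.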

## Training Details

| | |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 (48GB) |
| **Training Time** | ~24 hours |
| **Cost** | ~$12 |
| **Trainable Parameters** | ~12M (projector only) |

## Performance

**Word Error Rate (WER): 12.14%** on LoquaciousSet test set.
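
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation script used for the number above; the function name is hypothetical, and it applies no text normalization (casing, punctuation), which real evaluations usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One deleted word out of six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {score:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.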


## Usage

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

result = pipe("path/to/audio.wav")
print(result["text"])
```

## Limitations

- English only
- Optimized for 16kHz audio; other sample rates are resampled automatically
- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- Maximum audio length is limited by the model's context window

## Learn More

- **[Train your own model](https://github.com/alexkroman/tiny-audio)** — The full codebase with training scripts
- **[Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** — Build your own ASR system from scratch