---

license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---


# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |
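The projector is the only trained component: it maps Whisper encoder states into the language model's embedding space. Below is a minimal sketch of what an MLP projector of this kind looks like; the dimensions are assumptions (1280 for the whisper-large-v3-turbo hidden size, 2048 for the SmolLM3 embedding size), and the actual Tiny Audio projector may use a different hidden width to reach its 11.7M parameters.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Hypothetical two-layer MLP bridging encoder and LLM hidden sizes."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, frames, encoder_dim) -> (batch, frames, llm_dim)
        return self.net(x)

proj = MLPProjector()
features = torch.randn(1, 1500, 1280)  # Whisper emits ~1500 frames per 30 s window
print(proj(features).shape)  # torch.Size([1, 1500, 2048])
```

During training, only these linear layers receive gradients; the encoder and decoder weights stay frozen.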

## Training Details

### Infrastructure
- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters
- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled
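Given the warmup and step counts above, the learning-rate trajectory under a standard linear-warmup/linear-decay schedule can be sketched as below. That this is the exact schedule Tiny Audio uses is an assumption; it is the `transformers` default "linear" scheduler shape.

```python
def lr_at(step: int, base_lr: float = 3e-4, warmup: int = 50, max_steps: int = 500) -> float:
    # Linear warmup to base_lr over the first 50 steps,
    # then linear decay to zero at step 500.
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (max_steps - step) / (max_steps - warmup)

print(lr_at(25))   # 0.00015 (halfway through warmup)
print(lr_at(50))   # 0.0003  (peak)
print(lr_at(500))  # 0.0     (end of training)
```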

### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on H100.

## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```

## Example Results

**Input Audio**: Sample from LoquaciousSet evaluation set

**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```
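Up to casing, the model output matches the ground truth word for word, so the sample's word error rate is zero after normalization. A minimal pure-Python WER check (not the scoring code used during evaluation, just an illustration) looks like this:

```python
def wer(ref: str, hyp: str) -> float:
    """Case-insensitive word error rate via word-level Levenshtein distance."""
    r, h = ref.lower().split(), hyp.lower().split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,           # deletion
                      d[j - 1] + 1,       # insertion
                      prev + (rw != hw))  # substitution
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)

print(wer("HELLO WORLD", "hello world"))  # 0.0
```

Standard WER tooling also strips punctuation before scoring; the sketch above only normalizes case.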

## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training with 50,000+ steps recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio

## Citation

### Tiny Audio Framework
```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset
```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper
```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```

### SmolLM
```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

## License

Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.

## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure