---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---
# Tiny Audio ASR - LoquaciousSet Training
A speech-to-text model trained with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.
## Model Description
This model uses an encoder-projector-decoder architecture for automatic speech recognition:
| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |
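For orientation, the projector is conceptually a small MLP that maps encoder frames into the language model's embedding space. The sketch below is a plausible shape only: the 1280 and 2048 dimensions match whisper-large-v3-turbo and SmolLM3-3B hidden sizes, but the layer count, activation, and frame-stacking factor are assumptions, not the actual Tiny Audio implementation.

```python
import torch
from torch import nn

class MLPProjector(nn.Module):
    """Illustrative projector: maps Whisper encoder frames into the
    decoder's embedding space. Dimensions and the frame-stacking factor
    are assumptions; only the 'MLP projector' role comes from this card."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048, stack: int = 2):
        super().__init__()
        self.stack = stack  # concatenate adjacent frames to shorten the sequence
        self.net = nn.Sequential(
            nn.Linear(encoder_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, encoder_dim) -> (batch, frames // stack, llm_dim)
        b, t, d = x.shape
        t = t - t % self.stack  # drop trailing frames that don't fill a stack
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.net(x)
```

With these assumed dimensions the layer sizes land in the same order of magnitude as the 11.7M trainable parameters reported above.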
## Training Details
### Infrastructure
- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3
### Hyperparameters
- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled
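The settings above map onto standard Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the Tiny Audio training script: the output directory and the 100-step evaluation cadence (inferred from the metrics table) are assumptions.

```python
from transformers import TrainingArguments

# Sketch of the run configuration above; field names are standard
# transformers arguments, not necessarily what tiny-audio's script uses.
training_args = TrainingArguments(
    output_dir="tiny-audio-asr",      # assumed name
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    max_steps=500,
    warmup_steps=50,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=100,                   # matches the eval cadence in the metrics table
)
```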
### Training Metrics
| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |
Training time: ~18 minutes on H100.
## Usage
```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz rate Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)
with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)
transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```
## Example Results
**Input Audio**: Sample from LoquaciousSet evaluation set
**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```
**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```
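The output differs from the reference only in casing, which standard WER scoring normalizes away before comparison. A minimal, self-contained word error rate sketch (word-level edit distance; the `wer` metric listed in this card's metadata is typically computed this way after text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance. Inputs are
    lower-cased so casing differences (as in the example above)
    don't count as errors."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # classic dynamic-programming edit distance over words
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]   # diagonal value d[i-1][j-1]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (or match)
            prev = cur
    return d[-1] / max(len(ref), 1)
```

For the ground-truth/output pair above, the lower-cased word sequences match exactly, so this score is 0.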
## Limitations
- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training (50,000+ steps) is recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio
## Citation
### Tiny Audio Framework
```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```
### LoquaciousSet Dataset
```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```
### Whisper
```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```
### SmolLM
```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```
## License
Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.
## Acknowledgments
- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure