---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---

# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |
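
The only trained component above is the projector, a small MLP that maps encoder frames into the language model's embedding space. A minimal sketch under assumed dimensions (1280-d Whisper-large-v3-turbo frames, 2048-d SmolLM3 embeddings; the checkpoint's exact layer sizes and activation are not shown here):

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP bridging the audio encoder and the language model.

    Dimensions are illustrative assumptions: Whisper-large-v3-turbo emits
    1280-d frames, and SmolLM3-3B uses 2048-d token embeddings.
    """

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, frames, encoder_dim) -> (batch, frames, llm_dim)
        return self.net(audio_features)


projector = MLPProjector()
frames = torch.randn(1, 1500, 1280)  # Whisper produces 1500 frames per 30 s window
embeddings = projector(frames)
print(embeddings.shape)  # torch.Size([1, 1500, 2048])
```

The projected frames can then be concatenated with text token embeddings and fed to the frozen decoder, which is what keeps the trainable parameter count at roughly 0.3% of the total.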
## Training Details

### Infrastructure

- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters

- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled
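
Together, the warmup and max-step settings define the learning-rate trajectory. A minimal sketch assuming linear warmup to the peak rate followed by linear decay to zero (the exact scheduler used in this run is not stated):

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 50
MAX_STEPS = 500


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp from 0 up to PEAK_LR over the first WARMUP_STEPS steps
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from PEAK_LR down to 0 at MAX_STEPS
    return PEAK_LR * (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)


print(lr_at(50))   # 0.0003 (peak, reached at the end of warmup)
print(lr_at(500))  # 0.0
```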
### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on a single H100.
## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```
## Example Results

**Input Audio**: Sample from the LoquaciousSet evaluation set

**Ground Truth**:

```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:

```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```
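
The WER metric listed in the card metadata scores exactly this kind of comparison. A minimal sketch of a case-insensitive word error rate (the framework's actual evaluation pipeline likely uses a library such as `jiwer`; this standalone version is for illustration only):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # One-row dynamic-programming Levenshtein distance over words
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                               # deletion
                dp[j - 1] + 1,                           # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
            prev_diag = cur
    return dp[len(hyp)] / len(ref)


truth = ("THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER "
         "BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME")
output = ("These are reforms that will discipline and constrain the exercise of power "
          "by the government and any other economic or political actor for generations to come")
print(wer(truth, output))  # 0.0
```

Since the ground truth and model output above differ only in casing, the normalized WER on this sample is 0.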
## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training with 50,000+ steps is recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio
## Citation

### Tiny Audio Framework

```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title  = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year   = {2025},
  url    = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset

```bibtex
@misc{speechbrain2024loquaciousset,
  author    = {{SpeechBrain Team}},
  title     = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper

```bibtex
@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}
```

### SmolLM

```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title  = {SmolLM: Smaller Language Models for Efficient Inference},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```
## License

Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.
## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure