---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---

# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework. It combines a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |

## Training Details

### Infrastructure
- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters
- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled
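
The hyperparameters above imply that each training sample was seen several times. The following is a quick arithmetic sketch, assuming a single GPU and no gradient accumulation (neither is stated in this card), so that every optimizer step consumes exactly the listed batch size of 8:

```python
# Values copied from the hyperparameter list above.
train_samples = 1_000
batch_size = 8
max_steps = 500
warmup_steps = 50

# Assumption: one GPU and no gradient accumulation, so every optimizer
# step consumes exactly `batch_size` samples.
samples_seen = batch_size * max_steps
epochs = samples_seen / train_samples
warmup_fraction = warmup_steps / max_steps

print(f"samples seen: {samples_seen}")            # 4000
print(f"approx. epochs: {epochs:.1f}")            # 4.0
print(f"warmup fraction: {warmup_fraction:.0%}")  # 10%
```

Under these assumptions the run covers roughly four passes over the 1,000-sample subset. With gradient accumulation or multiple GPUs the number of passes would be proportionally higher.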

### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on H100.
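
In the table above, validation loss is lowest at step 400 and rises slightly at step 500 while training loss keeps falling. This pattern is often associated with overfitting on a small training set. The sketch below is one way to read those numbers; it only reuses the values from the table and is not part of the Tiny Audio training code:

```python
# (step, training loss, validation loss) rows copied from the table above.
metrics = [
    (100, 3.078, 3.165),
    (200, 2.543, 3.163),
    (300, 0.500, 0.813),
    (400, 0.140, 0.728),
    (500, 0.101, 0.764),
]

# Checkpoint with the lowest validation loss.
best_step, _, best_val = min(metrics, key=lambda row: row[2])

# Generalization gap (validation minus training loss) at each step.
gaps = {step: round(val - train, 3) for step, train, val in metrics}

print(f"best step: {best_step} (validation loss {best_val})")  # best step: 400 (validation loss 0.728)
print(f"gap at step 500: {gaps[500]}")                         # gap at step 500: 0.663
```

If intermediate checkpoints were saved, the step-400 checkpoint would be the natural candidate by this criterion.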

## Usage

```python
import torch
import torchaudio

# These imports assume the Tiny Audio repository root is on the Python path.
from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

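# Load audio, downmix to mono, and resample to the 16 kHz rate Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()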

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```

## Example Results

**Input Audio**: Sample from LoquaciousSet evaluation set

**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER 
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power 
by the government and any other economic or political actor for generations to come
```
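
The output above matches the reference word for word; only the casing differs. This card lists WER as its metric but does not say how it is computed, so the following is a minimal sketch under an assumption: text is lowercased and split on whitespace before scoring. The `normalize` and `word_error_rate` helpers are hypothetical names written for this example, not part of Tiny Audio.

```python
def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = (
    "THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER "
    "BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME"
)
hypothesis = (
    "These are reforms that will discipline and constrain the exercise of power "
    "by the government and any other economic or political actor for generations to come"
)

print(word_error_rate(reference, hypothesis))  # 0.0
```

Without the lowercasing step, every word that differs in case would count as a substitution, so a case-insensitive normalizer matters when the references are uppercase. Evaluation pipelines commonly also strip punctuation and normalize numerals, which this sketch does not do.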

## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training with 50,000+ steps recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio

## Citation

### Tiny Audio Framework
```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset
```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper
```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```

### SmolLM
```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

## License

Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.

## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure