---

license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---


# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |
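The projector is the only trained component: it maps Whisper encoder states into the language model's embedding space. Below is a minimal sketch of what an MLP projector of this kind looks like; the dimensions are assumptions (1280 for the whisper-large-v3-turbo hidden size, 2048 for the SmolLM3 embedding size), and the actual Tiny Audio projector may use a different hidden width to reach its 11.7M parameters.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Hypothetical two-layer MLP bridging encoder and LLM hidden sizes."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, frames, encoder_dim) -> (batch, frames, llm_dim)
        return self.net(x)

proj = MLPProjector()
features = torch.randn(1, 1500, 1280)  # Whisper emits ~1500 frames per 30 s window
print(proj(features).shape)  # torch.Size([1, 1500, 2048])
```

During training, only these linear layers receive gradients; the encoder and decoder weights stay frozen.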

## Training Details

### Infrastructure
- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters
- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled
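Given the warmup and step counts above, the learning-rate trajectory under a standard linear-warmup/linear-decay schedule can be sketched as below. That this is the exact schedule Tiny Audio uses is an assumption; it is the `transformers` default "linear" scheduler shape.

```python
def lr_at(step: int, base_lr: float = 3e-4, warmup: int = 50, max_steps: int = 500) -> float:
    # Linear warmup to base_lr over the first 50 steps,
    # then linear decay to zero at step 500.
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (max_steps - step) / (max_steps - warmup)

print(lr_at(25))   # 0.00015 (halfway through warmup)
print(lr_at(50))   # 0.0003  (peak)
print(lr_at(500))  # 0.0     (end of training)
```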

### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on H100.

## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```

## Example Results

**Input Audio**: Sample from LoquaciousSet evaluation set

**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```
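Up to casing, the model output matches the ground truth word for word, so the sample's word error rate is zero after normalization. A minimal pure-Python WER check (not the scoring code used during evaluation, just an illustration) looks like this:

```python
def wer(ref: str, hyp: str) -> float:
    """Case-insensitive word error rate via word-level Levenshtein distance."""
    r, h = ref.lower().split(), hyp.lower().split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,           # deletion
                      d[j - 1] + 1,       # insertion
                      prev + (rw != hw))  # substitution
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)

print(wer("HELLO WORLD", "hello world"))  # 0.0
```

Standard WER tooling also strips punctuation before scoring; the sketch above only normalizes case.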

## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training with 50,000+ steps recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio

## Citation

### Tiny Audio Framework
```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset
```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper
```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```

### SmolLM
```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

## License

Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.

## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure