---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---

# Tiny Audio ASR - LoquaciousSet Training

A Speech-to-Text model trained using the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder with a trained MLP projector and frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |

## Training Details

### Infrastructure

- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters

- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled

### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on H100.
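
One way to read the Training Metrics table is to look for the step with the lowest validation loss and to track the gap between validation and training loss. The sketch below does this with the numbers copied from the table; the helper names `best_checkpoint` and `generalization_gaps` are hypothetical and are not part of the Tiny Audio framework.

```python
# Rows copied from the Training Metrics table: (step, training_loss, validation_loss).
METRICS = [
    (100, 3.078, 3.165),
    (200, 2.543, 3.163),
    (300, 0.500, 0.813),
    (400, 0.140, 0.728),
    (500, 0.101, 0.764),
]


def best_checkpoint(metrics):
    """Return the row with the lowest validation loss (hypothetical helper)."""
    return min(metrics, key=lambda row: row[2])


def generalization_gaps(metrics):
    """Map each step to validation loss minus training loss (hypothetical helper)."""
    return {step: round(val - train, 3) for step, train, val in metrics}


if __name__ == "__main__":
    step, train_loss, val_loss = best_checkpoint(METRICS)
    print(f"Lowest validation loss: {val_loss} at step {step}")
    for s, gap in generalization_gaps(METRICS).items():
        print(f"step {s}: val - train = {gap}")
```

On these numbers, validation loss is lowest at step 400 and rises slightly at step 500 while training loss keeps falling, which is consistent with the projector starting to overfit the 1,000-sample subset.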

## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz rate Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
# Average over channels so stereo files also yield a 1-D mono array
audio_array = waveform.mean(dim=0).numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```

## Example Results

**Input Audio**: Sample from LoquaciousSet evaluation set

**Ground Truth**:
```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:
```
These are reforms that will discipline and constrain the exercise of power by the government and any other economic or political actor for generations to come
```

## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training (50,000+ steps) is recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio

## Citation

### Tiny Audio Framework

```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year = {2025},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset

```bibtex
@misc{speechbrain2024loquaciousset,
  author = {{SpeechBrain Team}},
  title = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper

```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```

### SmolLM

```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title = {SmolLM: Smaller Language Models for Efficient Inference},
  year = {2024},
  url = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```

## License

Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.

## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure