---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
- asr
- speech-to-text
- whisper
- tiny-audio
base_model:
- openai/whisper-large-v3-turbo
- HuggingFaceTB/SmolLM3-3B
datasets:
- speechbrain/LoquaciousSet
metrics:
- wer
---

# Tiny Audio ASR - LoquaciousSet Training

A speech-to-text model trained with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) framework, combining a frozen Whisper encoder, a trained MLP projector, and a frozen SmolLM3-3B decoder.

## Model Description

This model uses an encoder-projector-decoder architecture for automatic speech recognition:

| Component | Model | Parameters | Training Status |
|-----------|-------|------------|-----------------|
| Audio Encoder | openai/whisper-large-v3-turbo | ~800M | Frozen |
| Projector | MLP | 11.7M | **Trained** |
| Language Model | HuggingFaceTB/SmolLM3-3B | 3B | Frozen |
| **Total** | - | **3.72B** | 0.32% trainable |
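
The only trained component above is the projector, a small MLP that maps encoder frames into the language model's embedding space. A minimal sketch under assumed dimensions (1280-d Whisper-large-v3-turbo frames, 2048-d SmolLM3 embeddings; the checkpoint's exact layer sizes and activation are not shown here):

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP bridging the audio encoder and the language model.

    Dimensions are illustrative assumptions: Whisper-large-v3-turbo emits
    1280-d frames, and SmolLM3-3B uses 2048-d token embeddings.
    """

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, frames, encoder_dim) -> (batch, frames, llm_dim)
        return self.net(audio_features)


projector = MLPProjector()
frames = torch.randn(1, 1500, 1280)  # Whisper produces 1500 frames per 30 s window
embeddings = projector(frames)
print(embeddings.shape)  # torch.Size([1, 1500, 2048])
```

The projected frames can then be concatenated with text token embeddings and fed to the frozen decoder, which is what keeps the trainable parameter count at roughly 0.3% of the total.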
## Training Details

### Infrastructure

- **GPU**: NVIDIA H100 80GB HBM3
- **Cloud Provider**: E2E Networks
- **Framework**: PyTorch 2.8.0, Transformers 4.57.3

### Hyperparameters

- **Dataset**: speechbrain/LoquaciousSet (small subset)
- **Train Samples**: 1,000
- **Evaluation Samples**: 100
- **Batch Size**: 8
- **Learning Rate**: 3e-4
- **Max Steps**: 500
- **Warmup Steps**: 50
- **Precision**: BF16
- **Gradient Checkpointing**: Enabled
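
Together, the warmup and max-step settings define the learning-rate trajectory. A minimal sketch assuming linear warmup to the peak rate followed by linear decay to zero (the exact scheduler used in this run is not stated):

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 50
MAX_STEPS = 500


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp from 0 up to PEAK_LR over the first WARMUP_STEPS steps
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from PEAK_LR down to 0 at MAX_STEPS
    return PEAK_LR * (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)


print(lr_at(50))   # 0.0003 (peak, reached at the end of warmup)
print(lr_at(500))  # 0.0
```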
### Training Metrics

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 3.078 | 3.165 |
| 200 | 2.543 | 3.163 |
| 300 | 0.500 | 0.813 |
| 400 | 0.140 | 0.728 |
| 500 | 0.101 | 0.764 |

Training time: ~18 minutes on a single H100.
## Usage

```python
import torch
import torchaudio

from src.asr_config import ASRConfig
from src.asr_modeling import ASRModel

# Initialize model
config = ASRConfig(
    audio_model_id="openai/whisper-large-v3-turbo",
    text_model_id="HuggingFaceTB/SmolLM3-3B",
    projector_type="mlp",
    attn_implementation="sdpa",
)
model = ASRModel(config)

# Load audio and resample to the 16 kHz Whisper expects
waveform, sample_rate = torchaudio.load("audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
audio_array = waveform.squeeze().numpy()

# Transcribe
inputs = model.feature_extractor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device).to(model.dtype)

with torch.no_grad():
    output = model.generate(input_features=inputs, max_new_tokens=256)

transcription = model.tokenizer.decode(output[0], skip_special_tokens=True)
print(transcription)
```
## Example Results

**Input Audio**: Sample from the LoquaciousSet evaluation set

**Ground Truth**:

```
THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER
BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME
```

**Model Output**:

```
These are reforms that will discipline and constrain the exercise of power
by the government and any other economic or political actor for generations to come
```
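
The WER metric listed in the card metadata scores exactly this kind of comparison. A minimal sketch of a case-insensitive word error rate (the framework's actual evaluation pipeline likely uses a library such as `jiwer`; this standalone version is for illustration only):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # One-row dynamic-programming Levenshtein distance over words
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                               # deletion
                dp[j - 1] + 1,                           # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
            prev_diag = cur
    return dp[len(hyp)] / len(ref)


truth = ("THESE ARE REFORMS THAT WILL DISCIPLINE AND CONSTRAIN THE EXERCISE OF POWER "
         "BY THE GOVERNMENT AND ANY OTHER ECONOMIC OR POLITICAL ACTOR FOR GENERATIONS TO COME")
output = ("These are reforms that will discipline and constrain the exercise of power "
          "by the government and any other economic or political actor for generations to come")
print(wer(truth, output))  # 0.0
```

Since the ground truth and model output above differ only in casing, the normalized WER on this sample is 0.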
## Limitations

- Trained on a small subset (1,000 samples) for demonstration purposes
- Full training with 50,000+ steps is recommended for production use
- English language only
- Optimized for clean speech; performance may degrade on noisy audio
## Citation

### Tiny Audio Framework

```bibtex
@software{kroman2025tinyaudio,
  author = {Kroman, Alex},
  title  = {Tiny Audio: Train Your Own Speech Recognition Model in 24 Hours},
  year   = {2025},
  url    = {https://github.com/alexkroman/tiny-audio}
}
```

### LoquaciousSet Dataset

```bibtex
@misc{speechbrain2024loquaciousset,
  author    = {{SpeechBrain Team}},
  title     = {LoquaciousSet: 25,000 Hours of Transcribed English Speech},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/speechbrain/LoquaciousSet}
}
```

### Whisper

```bibtex
@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}
```

### SmolLM

```bibtex
@misc{smollm2024,
  author = {{Hugging Face}},
  title  = {SmolLM: Smaller Language Models for Efficient Inference},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```
## License

Apache 2.0 - See the [Tiny Audio repository](https://github.com/alexkroman/tiny-audio) for details.
## Acknowledgments

- [Alex Kroman](https://github.com/alexkroman) for the Tiny Audio framework
- [SpeechBrain](https://speechbrain.github.io/) for the LoquaciousSet dataset
- [OpenAI](https://openai.com/) for Whisper
- [Hugging Face](https://huggingface.co/) for SmolLM3 and infrastructure
- [E2E Networks](https://www.e2enetworks.com/) for GPU cloud infrastructure