	---
	license: mit
	language:
	- en
	datasets:
	- speechbrain/LoquaciousSet
	base_model:
	- openai/whisper-large-v3-turbo
	- HuggingFaceTB/SmolLM3-3B
	pipeline_tag: automatic-speech-recognition
	tags:
	- asr
	- speech-recognition
	- audio
	- smollm
	- whisper
	- mlp
	---

	# Tiny Audio

	A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase, a minimal, hackable framework for training ASR models.

	## Architecture

	```
	Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
	```

	MLP Projector:
	- Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
	- Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
	- Output normalization: RMSNorm
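To make the projector's shapes concrete, the sketch below estimates how many frames reach the language model and how many parameters the two linear layers hold. Several values are assumptions for illustration and are not confirmed by this card: a conv kernel size of 3 with padding 1, bias terms on the linear layers, and a 1500-frame encoder output for a 30-second window.

```python
def conv1d_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Assumed: the Whisper encoder emits 1500 frames for a 30-second window.
encoder_frames = 1500

# Two stride-2 conv layers give roughly 4x sequence compression.
projected_frames = conv1d_out_len(conv1d_out_len(encoder_frames))

# Linear (1280 -> 2048) followed by Linear (2048 -> 2048), biases assumed.
mlp_params = linear_params(1280, 2048) + linear_params(2048, 2048)

print(projected_frames)  # 375
print(mlp_params)        # 6819840
```

Under these assumptions the linear layers alone account for about 6.8M of the trainable parameters. The remainder depends on the conv layer configuration, which this card does not specify.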

	## Training Details

	| | |
	|---|---|
	| Dataset | LoquaciousSet (25,000 hours) |
	| Hardware | Single NVIDIA A40 48GB |
	| Training Time | ~24 hours |
	| Cost | ~$12 |
	| Trainable Parameters | ~12M (projector only) |
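As a quick sanity check on the budget, the rounded figures in the table imply the following hourly GPU rate. Both inputs are the approximate values quoted above, so the result is an estimate and not a quoted price.

```python
training_hours = 24  # "~24 hours" from the table
total_cost_usd = 12  # "~$12" from the table

# Implied rental rate for the single GPU used in training.
hourly_rate = total_cost_usd / training_hours

print(f"${hourly_rate:.2f} per GPU-hour")  # $0.50 per GPU-hour
```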

	## Performance

	Word Error Rate (WER): 12.14% on the LoquaciousSet test set.
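For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition. Its text normalization (lowercasing and whitespace splitting) is an assumption and may differ from the normalization used to produce the score reported here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{score:.2%}")  # 33.33%
```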


	## Usage

	```python
	from transformers import pipeline

	pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

	result = pipe("path/to/audio.wav")
	print(result["text"])
	```

	## Limitations

	- English only
	- Optimized for 16kHz audio; other sample rates are resampled automatically
	- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
	- Maximum audio length is limited by the model's context window
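Two of the limitations above (sample rate and audio length) can be checked before transcription. The sketch below uses only Python's standard `wave` module, so it applies to PCM WAV files only. The helper names and the 30-second default threshold are hypothetical; this card does not state an exact maximum duration.

```python
import wave


def describe_wav(path: str) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return sample_rate, duration


def needs_attention(path: str, max_seconds: float = 30.0) -> list[str]:
    """List reasons a file may need preprocessing.

    max_seconds is a hypothetical threshold, not a documented model limit.
    """
    sample_rate, duration = describe_wav(path)
    notes = []
    if sample_rate != 16_000:
        notes.append(f"sample rate is {sample_rate} Hz and will be resampled to 16000 Hz")
    if duration > max_seconds:
        notes.append(f"duration {duration:.1f}s exceeds the assumed {max_seconds:.0f}s limit")
    return notes
```

Calling `needs_attention("path/to/audio.wav")` returns an empty list when the file is already 16kHz and within the assumed length limit.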

	## Learn More

	- [Train your own model](https://github.com/alexkroman/tiny-audio) — The full codebase with training scripts
	- [Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) — Build your own ASR system from scratch