---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- zai-org/GLM-ASR-Nano-2512
- Qwen/Qwen3-0.6B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- qwen
- glm-asr
library_name: transformers
---

# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio), a minimal, hackable ASR framework.

## Quick Start

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```

## Usage Examples

### Basic Transcription

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From a local file
result = pipe("audio.wav")
print(result["text"])

# From a URL
result = pipe("https://example.com/audio.mp3")

# From a numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32)  # 1 second
result = pipe(audio)
```

### Batch Processing

```python
# Process multiple files in batches
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```

### Word-Level Timestamps

```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#   "text": "hello world",
#   "chunks": [
#     {"text": "hello", "timestamp": (0.0, 0.5)},
#     {"text": "world", "timestamp": (0.6, 1.0)}
#   ]
# }
```

### Streaming Inference

```python
from tiny_audio import ASRModel, ASRProcessor
import librosa

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio and resample to 16kHz
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream decoded tokens as they are generated
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```

### Using torch Directly

```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio at 16kHz
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate token IDs
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
    )

# Decode to text
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```

### GPU Inference

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda",  # or device=0
)
```

### Half Precision

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda",
)
```

## Architecture

```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```

Only the projector is trained (~12M parameters). The encoder and decoder remain frozen, leveraging their pretrained knowledge.

| Component | Model | Parameters | Status |
|-----------|-------|------------|--------|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |
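
You can verify the trainable/frozen split after loading the model; a quick sketch that assumes only the standard torch parameter API:

```python
from tiny_audio import ASRModel

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")

# Tally parameters by whether they receive gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable / 1e6:.0f}M, frozen: {frozen / 1e6:.0f}M")
# Expect roughly 12M trainable (the projector) and ~1.2B frozen
```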

### How It Works

1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`, i.e. non-overlapping windows of 5 frames. A sketch of this step follows below.
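
A minimal sketch of such a projector, assuming 768-dim encoder frames and a hypothetical LM embedding width (the actual layer sizes may differ):

```python
import torch.nn as nn

class FrameStackProjector(nn.Module):
    """2-layer MLP over non-overlapping stacks of 5 encoder frames (a sketch)."""

    def __init__(self, encoder_dim=768, lm_dim=1024, stack=5):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stack, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):  # x: (batch, input_len, encoder_dim)
        b, t, d = x.shape
        t = (t // self.stack) * self.stack  # drop any ragged tail
        # output_len = (input_len - 5) // 5 + 1 for the kept frames
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.mlp(x)
```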

## Model Specifications

| Specification | Value |
|---------------|-------|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |
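
Greedy decoding is the default, but generation arguments can be overridden per call through the pipeline's `generate_kwargs` (the values below restate the defaults, plus the `max_new_tokens=256` used in the torch example):

```python
result = pipe(
    "audio.wav",
    generate_kwargs={"num_beams": 1, "do_sample": False, "max_new_tokens": 256},
)
print(result["text"])
```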

## Training Details

| | |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 |
| **Time** | ~24 hours |
| **Cost** | ~$12 |
| **Optimizer** | AdamW |
| **Learning Rate** | 1e-4 |
| **Batch Size** | 4 |
| **Steps** | 50,000 |
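
The recipe boils down to optimizing only the projector while everything else stays frozen. A minimal sketch of that loop (attribute names like `model.encoder` and `model.projector`, and the dataloader, are illustrative, not the project's actual code):

```python
import torch

# Freeze the encoder and the language model (names are illustrative)
for p in model.encoder.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False

# AdamW over the projector only, lr=1e-4
optimizer = torch.optim.AdamW(model.projector.parameters(), lr=1e-4)

for step, batch in enumerate(train_loader):  # batches of 4 examples
    loss = model(**batch).loss  # cross-entropy on transcript tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step + 1 >= 50_000:
        break
```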

## Limitations

- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio (other rates are resampled automatically; see the snippet below for resampling manually)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
  - Heavily accented speech
  - Noisy or low-quality audio
  - Domain-specific terminology
  - Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default
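
If you prefer to resample explicitly rather than rely on the automatic path, torchaudio (already in the requirements) can do it; `audio_44k.wav` is just a placeholder file name:

```python
import torchaudio

waveform, sr = torchaudio.load("audio_44k.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

result = pipe(waveform.squeeze(0).numpy())  # pipeline expects 16kHz mono
print(result["text"])
```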

## Requirements

```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:
```
librosa
soundfile
```

## Files

| File | Description |
|------|-------------|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |

Note: Only the projector weights are stored here. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective Hugging Face repos.

## Citation

If you use this model, please cite:

```bibtex
@misc{tinyaudio2024,
  author = {Alex Kroman},
  title = {Tiny Audio: Minimal ASR Training},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

## Links

- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser

## Acknowledgments

- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for the training data

## License

MIT