---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- zai-org/GLM-ASR-Nano-2512
- Qwen/Qwen3-0.6B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- qwen
- glm-asr
library_name: transformers
---

# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio), a minimal, hackable ASR framework.

## Quick Start

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```

## Usage Examples

### Basic Transcription

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32)  # 1 second
result = pipe(audio)
```

### Batch Processing

```python
# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```

### Word-Level Timestamps

```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#   "text": "hello world",
#   "chunks": [
#     {"text": "hello", "timestamp": (0.0, 0.5)},
#     {"text": "world", "timestamp": (0.6, 1.0)}
#   ]
# }
```

### Streaming Inference

```python
from tiny_audio import ASRModel, ASRProcessor

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio
import librosa
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```

### Using PyTorch Directly

```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```

### GPU Inference

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda"  # or device=0
)
```

### Half Precision

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda"
)
```

## Architecture

```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```

Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.

| Component | Model | Parameters | Status |
|-----------|-------|------------|--------|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |

### How It Works

1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1` (see the sketch below).
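To make the frame-stacking arithmetic concrete, here is a minimal sketch of a stacking projector: it concatenates 5 consecutive 768-dim encoder frames (stride 5) and passes them through a 2-layer MLP. The 768-dim frames, stack size of 5, and length formula come from the description above; the class name, LM dimension (1024), hidden width (2048), and GELU activation are illustrative assumptions, not this repository's actual implementation.

```python
import torch
import torch.nn as nn


class FrameStackProjector(nn.Module):
    """Toy 2-layer MLP projector with frame stacking (stack k frames, stride k)."""

    def __init__(self, encoder_dim=768, lm_dim=1024, k=5, hidden_dim=2048):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * k, hidden_dim),  # k stacked frames -> hidden
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),           # hidden -> LM embedding space
        )

    def forward(self, x):
        # x: (batch, input_len, encoder_dim)
        b, t, d = x.shape
        t_out = (t - self.k) // self.k + 1   # the length formula from above
        x = x[:, : t_out * self.k, :]        # drop frames that don't fill a window
        x = x.reshape(b, t_out, d * self.k)  # concatenate k consecutive frames
        return self.mlp(x)                   # (batch, t_out, lm_dim)


proj = FrameStackProjector()
frames = torch.randn(1, 100, 768)  # 100 encoder frames
print(proj(frames).shape)          # torch.Size([1, 20, 1024]): (100 - 5) // 5 + 1 = 20
```

Because the windows are non-overlapping, the audio sequence the language model conditions on is roughly 5x shorter than the encoder output, which keeps generation cheap.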
## Model Specifications

| Specification | Value |
|---------------|-------|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |

## Training Details

| | |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 |
| **Time** | ~24 hours |
| **Cost** | ~$12 |
| **Optimizer** | AdamW |
| **Learning Rate** | 1e-4 |
| **Batch Size** | 4 |
| **Steps** | 50,000 |

## Limitations

- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio (other rates are resampled automatically)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
  - Heavily accented speech
  - Noisy or low-quality audio
  - Domain-specific terminology
  - Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default

## Requirements

```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```

## Files

| File | Description |
|------|-------------|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |

Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective Hugging Face repos.

## Citation

If you use this model, please cite:

```bibtex
@misc{tinyaudio2024,
  author = {Alex Kroman},
  title = {Tiny Audio: Minimal ASR Training},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

## Links

- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser

## Acknowledgments

- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data

## License

MIT