---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- zai-org/GLM-ASR-Nano-2512
- Qwen/Qwen3-0.6B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- qwen
- glm-asr
library_name: transformers
---
# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio), a minimal, hackable ASR framework.
## Quick Start

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```
## Usage Examples

### Basic Transcription

```python
from transformers import pipeline
import numpy as np

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From numpy array (must be 16kHz)
audio = np.random.randn(16000).astype(np.float32)  # 1 second
result = pipe(audio)
```
### Batch Processing

```python
# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```
### Word-Level Timestamps

```python
result = pipe("audio.wav", return_timestamps="word")

# Returns:
# {
#     "text": "hello world",
#     "chunks": [
#         {"text": "hello", "timestamp": (0.0, 0.5)},
#         {"text": "world", "timestamp": (0.6, 1.0)}
#     ]
# }
```
### Streaming Inference

```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```
### Using with torch directly

```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```
### GPU Inference

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda"  # or device=0
)
```
### Half Precision

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda"
)
```
## Architecture

```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```

Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.

| Component | Model | Parameters | Status |
|-----------|-------|------------|--------|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |
### How It Works

1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
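To make the trained component concrete, here is a minimal sketch of a frame-stacking MLP projector consistent with the formula above. The hidden width, activation, and the 1024-dim LLM embedding size are assumptions chosen for illustration; the actual module lives in the [Tiny Audio](https://github.com/alexkroman/tiny-audio) repo and may differ in detail.

```python
import torch
import torch.nn as nn

class FrameStackingProjector(nn.Module):
    """Stacks groups of 5 encoder frames, then maps them into the LLM embedding space."""

    def __init__(self, encoder_dim=768, hidden_dim=2048, llm_dim=1024, stack=5):
        super().__init__()
        self.stack = stack
        # Two-layer MLP over the stacked frames (widths are illustrative assumptions)
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stack, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, frames):  # frames: (batch, input_len, encoder_dim)
        b, t, d = frames.shape
        # Non-overlapping windows of `stack` frames -> output_len = (t - stack) // stack + 1
        out_len = (t - self.stack) // self.stack + 1
        frames = frames[:, : out_len * self.stack, :]
        stacked = frames.reshape(b, out_len, d * self.stack)
        return self.mlp(stacked)  # (batch, output_len, llm_dim)

# Example: 100 encoder frames collapse to 20 projected embeddings
proj = FrameStackingProjector()
audio_frames = torch.randn(1, 100, 768)
print(proj(audio_frames).shape)  # torch.Size([1, 20, 1024])
```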
## Model Specifications

| Specification | Value |
|---------------|-------|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |
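Greedy decoding is the default, but decoding settings can be adjusted per call through the pipeline's standard `generate_kwargs` argument. The snippet below simply restates the defaults from the table, with an assumed `max_new_tokens` cap, for illustration.

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Explicitly request greedy decoding and cap the output length
result = pipe(
    "audio.wav",
    generate_kwargs={"num_beams": 1, "do_sample": False, "max_new_tokens": 256},
)
print(result["text"])
```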
## Training Details

| | |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 |
| **Time** | ~24 hours |
| **Cost** | ~$12 |
| **Optimizer** | AdamW |
| **Learning Rate** | 1e-4 |
| **Batch Size** | 4 |
| **Steps** | 50,000 |
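As a rough sketch, the hyperparameters above map onto `transformers.TrainingArguments` as follows. The scheduler, warmup, and other settings are not specified in this card, so treat this as an illustrative mapping rather than the actual training configuration, which is defined in the Tiny Audio repo.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the table above onto TrainingArguments
args = TrainingArguments(
    output_dir="tiny-audio-projector",
    optim="adamw_torch",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    max_steps=50_000,
)
```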
## Limitations

- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio (file inputs are resampled automatically; raw arrays must already be 16kHz, see the sketch below)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
  - Heavily accented speech
  - Noisy or low-quality audio
  - Domain-specific terminology
  - Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default
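If your audio arrives as a raw array at a different sample rate, resample it to 16kHz before calling the pipeline. A minimal sketch using librosa (already listed as an optional dependency); the 44.1kHz input here is just an example.

```python
import numpy as np
import librosa
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Example: a 44.1kHz array resampled to the 16kHz the model expects
audio_44k = np.random.randn(44100).astype(np.float32)  # 1 second at 44.1kHz
audio_16k = librosa.resample(audio_44k, orig_sr=44100, target_sr=16000)

result = pipe(audio_16k)
print(result["text"])
```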
## Requirements

```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```
## Files

| File | Description |
|------|-------------|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |

Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
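To confirm what the checkpoint contains, you can download `model.safetensors` with `huggingface_hub` and list its tensors with `safetensors`. The exact key names depend on how the projector module is named, so the output below is not guaranteed.

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download only the projector checkpoint and list its tensors
path = hf_hub_download("mazesmazes/tiny-audio", "model.safetensors")
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))
```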
## Citation

If you use this model, please cite:

```bibtex
@misc{tinyaudio2024,
  author = {Alex Kroman},
  title = {Tiny Audio: Minimal ASR Training},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/alexkroman/tiny-audio}
}
```
## Links

- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
## Acknowledgments

- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
## License

MIT