mazesmazes committed (verified)
Commit 842b6ba · Parent: 2b1609e

Update custom model files, README, and requirements

Files changed (3):
1. README.md (+222, -14)
2. asr_modeling.py (+37, -14)
3. asr_pipeline.py (+12, -0)
README.md CHANGED
@@ -14,21 +14,177 @@ tags:
 - audio
 - qwen
 - glm-asr
+library_name: transformers
 ---
 
 # Tiny Audio
 
 A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio)—a minimal, hackable ASR framework.
 
+## Quick Start
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+result = pipe("audio.wav")
+print(result["text"])
+```
+
+## Usage Examples
+
+### Basic Transcription
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+
+# From file
+result = pipe("audio.wav")
+print(result["text"])
+
+# From URL
+result = pipe("https://example.com/audio.mp3")
+
+# From numpy array (must be 16kHz)
+import numpy as np
+audio = np.random.randn(16000).astype(np.float32)  # 1 second
+result = pipe(audio)
+```
+
+### Batch Processing
+
+```python
+# Process multiple files
+files = ["audio1.wav", "audio2.wav", "audio3.wav"]
+results = pipe(files, batch_size=4)
+for r in results:
+    print(r["text"])
+```
+
+### Word-Level Timestamps
+
+```python
+result = pipe("audio.wav", return_timestamps="word")
+# Returns:
+# {
+#     "text": "hello world",
+#     "chunks": [
+#         {"text": "hello", "timestamp": (0.0, 0.5)},
+#         {"text": "world", "timestamp": (0.6, 1.0)}
+#     ]
+# }
+```
+
+### Streaming Inference
+
+```python
+from tiny_audio import ASRModel, ASRProcessor
+import torch
+
+model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+
+# Load and process audio
+import librosa
+audio, sr = librosa.load("audio.wav", sr=16000)
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+
+# Stream tokens
+for token in model.generate_streaming(inputs["input_features"]):
+    print(token, end="", flush=True)
+```
+
+### Using torch Directly
+
+```python
+from tiny_audio import ASRModel, ASRProcessor
+import torch
+import librosa
+
+# Load model and processor
+model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+
+# Load audio (16kHz)
+audio, sr = librosa.load("audio.wav", sr=16000)
+
+# Process
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+
+# Generate
+with torch.no_grad():
+    output = model.generate(
+        input_features=inputs["input_features"],
+        attention_mask=inputs["attention_mask"],
+        max_new_tokens=256,
+    )
+
+# Decode
+text = processor.batch_decode(output, skip_special_tokens=True)[0]
+print(text)
+```
+
+### GPU Inference
+
+```python
+import torch
+from transformers import pipeline
+
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="mazesmazes/tiny-audio",
+    trust_remote_code=True,
+    device="cuda",  # or device=0
+)
+```
+
+### Half Precision
+
+```python
+import torch
+from transformers import pipeline
+
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="mazesmazes/tiny-audio",
+    trust_remote_code=True,
+    torch_dtype=torch.float16,
+    device="cuda",
+)
+```
+
 ## Architecture
 
 ```
 Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
 ```
 
-Only the projector is trained (~12M params). The encoder and decoder remain frozen.
+Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
+
+| Component | Model | Parameters | Status |
+|-----------|-------|------------|--------|
+| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
+| Projector | 2-layer MLP | ~12M | Trained |
+| Language Model | Qwen3-0.6B | ~600M | Frozen |
+
+### How It Works
+
+1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
+2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
+3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
+
+The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
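+
+As a quick sanity check of that formula (an illustrative sketch; the ~100 encoder frames per second rate is an assumption, not a documented spec):
+
+```python
+# Frame stacking: every 5 encoder frames collapse into one projected
+# embedding, matching output_len = (input_len - 5) // 5 + 1.
+def projected_len(input_len: int, stack: int = 5) -> int:
+    return (input_len - stack) // stack + 1
+
+assert projected_len(1000) == 200  # e.g. ~10 s of audio at ~100 frames/sec
+assert projected_len(5) == 1       # shortest input that yields a single frame
+```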
+
+## Model Specifications
+
+| Specification | Value |
+|---------------|-------|
+| Input | Audio (16kHz mono) |
+| Output | Text transcription |
+| Max Audio Length | ~30 seconds (limited by encoder) |
+| Vocabulary | Qwen3 tokenizer |
+| Languages | English only |
+| Generation | Greedy decoding (num_beams=1, do_sample=False) |
+
-## Training
+## Training Details
 
 | | |
 |---|---|
@@ -36,24 +192,76 @@ Only the projector is trained (~12M params). The encoder and decoder remain froz
 | **Hardware** | Single NVIDIA A40 |
 | **Time** | ~24 hours |
 | **Cost** | ~$12 |
+| **Optimizer** | AdamW |
+| **Learning Rate** | 1e-4 |
+| **Batch Size** | 4 |
+| **Steps** | 50,000 |
 
-## Usage
-
-```python
-from transformers import pipeline
-
-pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
-result = pipe("audio.wav")
-print(result["text"])
-```
-
-## Limitations
-
-- English only
-- 16kHz audio (other sample rates resampled automatically)
-- May degrade on accented speech, noisy audio, or domain-specific terms
+## Limitations
+
+- **English only**: Not trained on other languages
+- **Sample rate**: Expects 16kHz audio (other rates resampled automatically; see the sketch after this list)
+- **Audio length**: Best for clips under 30 seconds
+- **Accuracy**: May degrade on:
+  - Heavily accented speech
+  - Noisy or low-quality audio
+  - Domain-specific terminology
+  - Overlapping speakers
+- **No punctuation**: Output is lowercase without punctuation by default
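+
+If you prefer to resample yourself rather than rely on the automatic handling (an illustrative sketch using librosa, listed under the optional requirements below):
+
+```python
+import librosa
+
+# librosa resamples to the requested rate on load
+audio, sr = librosa.load("audio_44k.wav", sr=16000)
+assert sr == 16000
+```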
 
+## Requirements
+
+```
+transformers>=4.40.0
+torch>=2.0.0
+torchaudio>=2.0.0
+```
+
+Optional for streaming:
+```
+librosa
+soundfile
+```
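+
+A quick way to verify the pinned environment (an illustrative check, not part of the package):
+
+```python
+# Print installed versions to compare against the pins above
+import torch
+import torchaudio
+import transformers
+
+print("transformers:", transformers.__version__)
+print("torch:", torch.__version__)
+print("torchaudio:", torchaudio.__version__)
+```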
 
+## Files
+
+| File | Description |
+|------|-------------|
+| `config.json` | Model configuration |
+| `model.safetensors` | Projector weights (~48MB) |
+| `preprocessor_config.json` | Audio preprocessing config |
+| `tokenizer.json` | Tokenizer |
+| `tokenizer_config.json` | Tokenizer config |
+| `special_tokens_map.json` | Special tokens |
+
+Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
+
+## Citation
+
+If you use this model, please cite:
+
+```bibtex
+@misc{tinyaudio2024,
+  author = {Alex Kroman},
+  title = {Tiny Audio: Minimal ASR Training},
+  year = {2024},
+  publisher = {GitHub},
+  url = {https://github.com/alexkroman/tiny-audio}
+}
+```
 
 ## Links
 
-- [Train your own](https://github.com/alexkroman/tiny-audio)
-- [Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)
+- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
+- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
+- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
+
+## Acknowledgments
+
+- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
+- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
+- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
+
+## License
+
+MIT
asr_modeling.py CHANGED
@@ -89,13 +89,27 @@ class ASRModel(PreTrainedModel, GenerationMixin):
             if adapter_config_file is not None:
                 # Load saved adapter weights using the original repo_id/path
                 # PEFT handles Hub downloads and caching internally
-                from peft import PeftModel
+                from peft import LoraConfig, PeftModel
+
+                # Pre-load and fix the adapter config to avoid the str(None) -> "None"
+                # bug: some PEFT/transformers versions convert null to the string
+                # "None", which causes HF to try loading a model called "None".
+                with open(adapter_config_file) as f:
+                    adapter_config_dict = json.load(f)
+
+                # Fix base_model_name_or_path if it's None/null
+                if adapter_config_dict.get("base_model_name_or_path") is None:
+                    adapter_config_dict["base_model_name_or_path"] = ""
+
+                # Create a LoraConfig from the fixed dict
+                peft_config = LoraConfig(**adapter_config_dict)
 
                 # language_model is bare (not PEFT-wrapped) since we skipped _setup_lora
                 model.language_model = PeftModel.from_pretrained(
                     model.language_model,
                     pretrained_model_name_or_path,  # Use original repo_id, not cache path
                     is_trainable=True,
+                    config=peft_config,  # Use our fixed config
                     **cache_kwargs,
                 )
             else:
@@ -113,8 +127,8 @@ class ASRModel(PreTrainedModel, GenerationMixin):
                 model.language_model = get_peft_model(model.language_model, lora_config)
 
                 # Clear base_model_name_or_path so PEFT doesn't save a reference
-                # to the base LLM. See _setup_lora for details.
-                model.language_model.peft_config["default"].base_model_name_or_path = None
+                # to the base LLM. Use an empty string to avoid the str(None) -> "None" bug.
+                model.language_model.peft_config["default"].base_model_name_or_path = ""
 
             return model
         finally:
@@ -295,8 +309,8 @@ class ASRModel(PreTrainedModel, GenerationMixin):
 
         # Clear base_model_name_or_path so PEFT doesn't save a reference to the
        # base LLM (e.g. Qwen). This prevents pipeline() from redirecting to the
-        # wrong model. The correct path gets set during save_pretrained/push_to_hub.
-        self.language_model.peft_config["default"].base_model_name_or_path = None
+        # wrong model. Use an empty string to avoid the str(None) -> "None" bug.
+        self.language_model.peft_config["default"].base_model_name_or_path = ""
 
     def _init_tokenizer(self, config: ASRConfig):
         """Initialize tokenizer with audio token."""
@@ -738,23 +752,25 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         if hasattr(self.language_model, "peft_config"):
             self.language_model.save_pretrained(save_dir, save_embedding_layers=False)
 
-        # Fix adapter_config.json to point base_model_name_or_path to the repo itself.
-        # This prevents transformers pipeline() from redirecting to the base LLM repo
-        # (like Qwen), which breaks feature extractor loading for multimodal models.
-        # See: https://huggingface.co/ibm-granite/granite-speech-3.3-2b for reference
+        # Clear base_model_name_or_path in adapter_config.json to prevent the HF
+        # pipeline from redirecting to the base LLM repo (like Qwen), which breaks
+        # feature extractor loading for multimodal models. If a repo_id is provided,
+        # use it so the model can be loaded directly from the Hub.
         adapter_config_path = save_dir / "adapter_config.json"
         if adapter_config_path.exists():
             with adapter_config_path.open() as f:
                 adapter_config = json.load(f)
 
-            # Use repo_id from kwargs or config - never use checkpoint directory name
+            # Use repo_id if available, otherwise clear to prevent a redirect.
+            # Use an empty string instead of None to avoid the str(None) -> "None"
+            # bug in some transformers/PEFT versions.
            repo_id = (
                kwargs.get("repo_id")
                or kwargs.get("push_to_hub_model_id")
                or getattr(self.config, "pretrained_model_path", None)
+               or ""  # Use empty string instead of None
            )
-            if repo_id:
-                adapter_config["base_model_name_or_path"] = repo_id
+            adapter_config["base_model_name_or_path"] = repo_id
 
            with adapter_config_path.open("w") as f:
                json.dump(adapter_config, f, indent=2)
@@ -785,8 +801,15 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
 
     def push_to_hub(self, repo_id: str, **kwargs) -> str:
-        """Push model to HuggingFace Hub, ensuring adapter_config points to repo."""
-        # Call parent's push_to_hub with repo_id in kwargs so save_pretrained can use it
+        """Push model to HuggingFace Hub, ensuring adapter_config points to the repo.
+
+        IMPORTANT: Sets base_model_name_or_path in adapter_config.json to repo_id
+        so that transformers pipeline() can load the model correctly. Without this,
+        the pipeline tries to load from "None", which fails.
+        """
+        # Store repo_id in config so save_pretrained can access it
+        self.config.pretrained_model_path = repo_id
+        # Call parent's push_to_hub with repo_id in kwargs
         return super().push_to_hub(repo_id, repo_id=repo_id, **kwargs)
 
     def create_or_update_model_card(self, output_dir: Union[str, Path]) -> None:
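
The str(None) -> "None" failure mode that the comments above keep referencing is easy to reproduce in isolation; a minimal sketch (illustrative, not code from this repo):

```python
# A config field holding None gets stringified somewhere during
# serialization, so downstream loaders receive the literal string "None"
# and then try to resolve a Hub repo named "None".
base_model_name_or_path = None
assert str(base_model_name_or_path) == "None"

# Storing an empty string instead of None sidesteps the stringification
# entirely, which is what this commit does.
safe = base_model_name_or_path or ""
assert safe == ""
```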
asr_pipeline.py CHANGED
@@ -521,12 +521,19 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
         Returns:
             Dict with 'text' key containing transcription
         """
+        # DEBUG: Track which code path we're using
+        import sys
+        print(f"[DEBUG postprocess] type(model_outputs)={type(model_outputs).__name__}", file=sys.stderr)
+
         # Handle list of outputs (from chunking)
         if isinstance(model_outputs, list):
+            print(f"[DEBUG postprocess] list len={len(model_outputs)}", file=sys.stderr)
             model_outputs = model_outputs[0] if model_outputs else {}
 
         tokens = model_outputs.get("tokens")
+        print(f"[DEBUG postprocess] tokens is None: {tokens is None}", file=sys.stderr)
         if tokens is None:
+            print("[DEBUG postprocess] FALLING BACK TO SUPER", file=sys.stderr)
             return super().postprocess(model_outputs, **kwargs)
 
         if torch.is_tensor(tokens):
@@ -537,15 +544,20 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
         text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
         # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
         text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+        print(f"[DEBUG postprocess] BEFORE truncation: {len(text.split())} words", file=sys.stderr)
         # Post-process prediction
         text = self._post_process_prediction(text)
+        print(f"[DEBUG postprocess] AFTER truncation: {len(text.split())} words", file=sys.stderr)
         return {"text": text}
 
     # Known hallucination patterns that should be deleted entirely
     HALLUCINATION_PATTERNS = frozenset(
         [
             "and gt and gt",
+            "and gt",
+            "gt and gt",
             "n",  # Single character noise
+            "and",  # Common short hallucination
         ]
     )
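
For context on how a frozenset of known junk strings can be applied during post-processing, a minimal sketch (illustrative; `_post_process_prediction` itself is not shown in this diff):

```python
HALLUCINATION_PATTERNS = frozenset(
    ["and gt and gt", "and gt", "gt and gt", "n", "and"]
)

def drop_hallucinations(text: str) -> str:
    # Delete transcriptions that consist entirely of a known junk pattern
    return "" if text.strip().lower() in HALLUCINATION_PATTERNS else text

assert drop_hallucinations("and gt and gt") == ""
assert drop_hallucinations("hello world") == "hello world"
```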