mazesmazes/tiny-audio

@@ -1,4 +1,3 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 tokenizer_config.json -filter -diff -merge text
-tokenizer.json filter=lfs diff=lfs merge=lfs -text

 *.safetensors filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 tokenizer_config.json -filter -diff -merge text

README.md CHANGED

@@ -1,87 +1,267 @@
 ---
-library_name: transformers
 tags:
-- generated_from_trainer
-model-index:
-- name: tiny-audio
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# tiny-audio
-This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 1.8002
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 0.002
-- train_batch_size: 14
-- eval_batch_size: 14
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: polynomial
-- lr_scheduler_warmup_steps: 1000
-- num_epochs: 4
-- label_smoothing_factor: 0.1
-### Training results
-| Training Loss | Epoch  | Step   | Validation Loss |
-|:-------------:|:------:|:------:|:---------------:|
-| 2.1624        | 0.1303 | 10000  | 1.8803          |
-| 2.1100        | 0.2607 | 20000  | 1.8542          |
-| 2.0734        | 0.3910 | 30000  | 1.8479          |
-| 2.1233        | 0.5214 | 40000  | 1.8361          |
-| 2.1015        | 0.6517 | 50000  | 1.8280          |
-| 2.0839        | 0.7820 | 60000  | 1.8288          |
-| 2.0971        | 0.9124 | 70000  | 1.8219          |
-| 2.0907        | 1.0427 | 80000  | 1.8218          |
-| 2.0599        | 1.1731 | 90000  | 1.8167          |
-| 2.0747        | 1.3034 | 100000 | 1.8171          |
-| 2.0713        | 1.4337 | 110000 | 1.8152          |
-| 2.0866        | 1.5641 | 120000 | 1.8133          |
-| 2.0904        | 1.6944 | 130000 | 1.8104          |
-| 2.0554        | 1.8248 | 140000 | 1.8092          |
-| 2.0968        | 1.9551 | 150000 | 1.8100          |
-| 2.0644        | 2.0855 | 160000 | 1.8077          |
-| 2.0499        | 2.2158 | 170000 | 1.8054          |
-| 2.0570        | 2.3461 | 180000 | 1.8056          |
-| 2.0432        | 2.4765 | 190000 | 1.8066          |
-| 2.0413        | 2.6068 | 200000 | 1.8050          |
-| 2.0373        | 2.7372 | 210000 | 1.8039          |
-| 2.0117        | 2.8675 | 220000 | 1.8036          |
-| 2.0437        | 2.9978 | 230000 | 1.8036          |
-| 2.0454        | 3.1282 | 240000 | 1.8032          |
-| 2.0181        | 3.2585 | 250000 | 1.8022          |
-| 2.0266        | 3.3889 | 260000 | 1.8015          |
-| 2.0451        | 3.5192 | 270000 | 1.8018          |
-| 2.0308        | 3.6495 | 280000 | 1.8019          |
-| 2.0419        | 3.7799 | 290000 | 1.8005          |
-| 2.0172        | 3.9102 | 300000 | 1.8002          |
-### Framework versions
-- Transformers 5.0.0.dev0
-- Pytorch 2.8.0+cu128
-- Datasets 3.6.0
-- Tokenizers 0.22.2

 ---
+license: mit
+language:
+- en
+datasets:
+- speechbrain/LoquaciousSet
+base_model:
+- zai-org/GLM-ASR-Nano-2512
+- Qwen/Qwen3-0.6B
+pipeline_tag: automatic-speech-recognition
 tags:
+- asr
+- speech-recognition
+- audio
+- qwen
+- glm-asr
+library_name: transformers
 ---
+# Tiny Audio
+A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio)—a minimal, hackable ASR framework.
+## Quick Start
+```python
+from transformers import pipeline
+pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+result = pipe("audio.wav")
+print(result["text"])
+```
+## Usage Examples
+### Basic Transcription
+```python
+from transformers import pipeline
+pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+# From file
+result = pipe("audio.wav")
+print(result["text"])
+# From URL
+result = pipe("https://example.com/audio.mp3")
+# From a NumPy array (raw arrays are assumed to be sampled at 16kHz)
+import numpy as np
+audio = np.random.randn(16000).astype(np.float32)  # 1 second
+result = pipe(audio)
+```
+### Batch Processing
+```python
+# Process multiple files
+files = ["audio1.wav", "audio2.wav", "audio3.wav"]
+results = pipe(files, batch_size=4)
+for r in results:
+    print(r["text"])
+```
+### Word-Level Timestamps
+```python
+result = pipe("audio.wav", return_timestamps="word")
+# Returns:
+# {
+#   "text": "hello world",
+#   "chunks": [
+#     {"text": "hello", "timestamp": (0.0, 0.5)},
+#     {"text": "world", "timestamp": (0.6, 1.0)}
+#   ]
+# }
+```
+### Streaming Inference
+```python
+from tiny_audio import ASRModel, ASRProcessor
+import torch
+model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+# Load and process audio
+import librosa
+audio, sr = librosa.load("audio.wav", sr=16000)
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+# Stream tokens
+for token in model.generate_streaming(inputs["input_features"]):
+    print(token, end="", flush=True)
+```
+### Using torch Directly
+```python
+from tiny_audio import ASRModel, ASRProcessor
+import torch
+import librosa
+# Load model and processor
+model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+# Load audio (16kHz)
+audio, sr = librosa.load("audio.wav", sr=16000)
+# Process
+inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+# Generate
+with torch.no_grad():
+    output = model.generate(
+        input_features=inputs["input_features"],
+        attention_mask=inputs["attention_mask"],
+        max_new_tokens=256
+    )
+# Decode
+text = processor.batch_decode(output, skip_special_tokens=True)[0]
+print(text)
+```
+### GPU Inference
+```python
+from transformers import pipeline
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="mazesmazes/tiny-audio",
+    trust_remote_code=True,
+    device="cuda"  # or device=0
+)
+```
+### Half Precision
+```python
+import torch
+from transformers import pipeline
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="mazesmazes/tiny-audio",
+    trust_remote_code=True,
+    torch_dtype=torch.float16,
+    device="cuda"
+)
+```
+## Architecture
+```
+Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
+```
+Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
+| Component | Model | Parameters | Status |
+|-----------|-------|------------|--------|
+| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
+| Projector | 2-layer MLP | ~12M | Trained |
+| Language Model | Qwen3-0.6B | ~600M | Frozen |
+### How It Works
+1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
+2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
+3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
+The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
+## Model Specifications
+| Specification | Value |
+|---------------|-------|
+| Input | Audio (16kHz mono) |
+| Output | Text transcription |
+| Max Audio Length | ~30 seconds (limited by encoder) |
+| Vocabulary | Qwen3 tokenizer |
+| Languages | English only |
+| Generation | Greedy decoding (num_beams=1, do_sample=False) |
+## Training Details
+| | |
+|---|---|
+| **Dataset** | LoquaciousSet (25,000 hours) |
+| **Hardware** | Single NVIDIA A40 |
+| **Time** | ~24 hours |
+| **Cost** | ~$12 |
+| **Optimizer** | AdamW |
+| **Learning Rate** | 1e-4 |
+| **Batch Size** | 4 |
+| **Steps** | 50,000 |
+## Limitations
+- **English only**: Not trained on other languages
+- **Sample rate**: Expects 16kHz audio (other rates resampled automatically)
+- **Audio length**: Best for clips under 30 seconds
+- **Accuracy**: May degrade on:
+  - Heavily accented speech
+  - Noisy or low-quality audio
+  - Domain-specific terminology
+  - Overlapping speakers
+- **No punctuation**: Output is lowercase without punctuation by default
+## Requirements
+```
+transformers>=4.40.0
+torch>=2.0.0
+torchaudio>=2.0.0
+```
+Optional, for the streaming and direct torch examples:
+```
+librosa
+soundfile
+```
+## Files
+| File | Description |
+|------|-------------|
+| `config.json` | Model configuration |
+| `model.safetensors` | Projector weights (~48MB) |
+| `preprocessor_config.json` | Audio preprocessing config |
+| `tokenizer.json` | Tokenizer |
+| `tokenizer_config.json` | Tokenizer config |
+| `special_tokens_map.json` | Special tokens |
+Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective Hugging Face repos.
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{tinyaudio2024,
+  author = {Alex Kroman},
+  title = {Tiny Audio: Minimal ASR Training},
+  year = {2024},
+  publisher = {GitHub},
+  url = {https://github.com/alexkroman/tiny-audio}
+}
+```
+## Links
+- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
+- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
+- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
+## Acknowledgments
+- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
+- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
+- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
+## License
+MIT
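
The frame-stacking formula in the README's "How It Works" section, `output_len = (input_len - 5) // 5 + 1`, can be checked with a small sketch. The stack factor of 5 comes from that formula; the grouping strategy (concatenate each run of 5 consecutive frames and drop an incomplete tail) and the names `stacked_len` and `stack_frames` are assumptions made for illustration, not the repository's implementation.

```python
def stacked_len(input_len: int, k: int = 5) -> int:
    """Sequence length after stacking, using the formula from the README."""
    return (input_len - k) // k + 1


def stack_frames(frames: list[list[float]], k: int = 5) -> list[list[float]]:
    """Concatenate each run of k consecutive frames; drop any incomplete tail.

    Hypothetical helper: one plausible way to implement frame stacking.
    """
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


# 23 toy frames with 2 features each (made-up data).
frames = [[float(t), float(t) + 0.5] for t in range(23)]
stacked = stack_frames(frames)

# The stacked length matches the README formula, and each stacked
# frame is k times wider than an input frame.
print(len(stacked), stacked_len(len(frames)), len(stacked[0]))
```

With Python's floor division the formula is equivalent to `input_len // 5`, so inputs shorter than 5 frames yield a length of 0.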

asr_modeling.py CHANGED

@@ -703,6 +703,57 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         thread.join()
     def save_pretrained(self, save_directory: Union[str, Path], **kwargs) -> None:
         """Save model, tokenizer, and processor."""
         import shutil
@@ -796,8 +847,8 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         """
         # Store repo_id in config so save_pretrained can access it
         self.config.pretrained_model_path = repo_id
-        # Call parent's push_to_hub with repo_id in kwargs
-        return super().push_to_hub(repo_id, repo_id=repo_id, **kwargs)
     def create_or_update_model_card(self, output_dir: Union[str, Path]) -> None:
         """No-op for model card creation - we use MODEL_CARD.md in repo instead."""

         thread.join()
+    @torch.no_grad()
+    def generate_text_only(
+        self,
+        messages: list[dict[str, str]],
+        max_new_tokens: int = 256,
+        **generate_kwargs,
+    ) -> str:
+        """Generate text using only the LLM (no audio encoding).
+        Used for SIFT-style response generation from metadata prompts.
+        Args:
+            messages: List of chat messages [{"role": "user", "content": "..."}]
+            max_new_tokens: Maximum tokens to generate
+            **generate_kwargs: Additional generation arguments
+        Returns:
+            Generated text response
+        """
+        device = next(self.language_model.parameters()).device
+        # Apply chat template
+        input_ids = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_tensors="pt",
+            enable_thinking=False,
+        ).to(device)
+        if input_ids.dim() == 1:
+            input_ids = input_ids.unsqueeze(0)
+        attention_mask = torch.ones_like(input_ids)
+        # Generate using language model directly
+        output = self.language_model.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+            pad_token_id=self.tokenizer.pad_token_id,
+            eos_token_id=self.tokenizer.eos_token_id,
+            **generate_kwargs,
+        )
+        # Decode only the new tokens
+        new_tokens = output[0, input_ids.shape[1] :]
+        response = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+        return response.strip()
     def save_pretrained(self, save_directory: Union[str, Path], **kwargs) -> None:
         """Save model, tokenizer, and processor."""
         import shutil
         """
         # Store repo_id in config so save_pretrained can access it
         self.config.pretrained_model_path = repo_id
+        # Call parent's push_to_hub
+        return super().push_to_hub(repo_id, **kwargs)
     def create_or_update_model_card(self, output_dir: Union[str, Path]) -> None:
         """No-op for model card creation - we use MODEL_CARD.md in repo instead."""
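
The `push_to_hub` change above drops the duplicated `repo_id=repo_id` keyword. The diff does not state the reason, but a likely one is that supplying the same parameter both positionally and by keyword raises a `TypeError` in Python. The sketch below reproduces that with hypothetical stand-in classes (`HubMixinStub`, `ModelStub`); it does not call the real transformers API.

```python
class HubMixinStub:
    """Hypothetical stand-in for the parent class; not the transformers API."""

    def push_to_hub(self, repo_id: str, **kwargs) -> str:
        return f"pushed to {repo_id}"


class ModelStub(HubMixinStub):
    def push_old(self, repo_id: str, **kwargs) -> str:
        # Old call: repo_id is supplied twice, positionally and by keyword.
        return super().push_to_hub(repo_id, repo_id=repo_id, **kwargs)

    def push_new(self, repo_id: str, **kwargs) -> str:
        # New call: repo_id is supplied exactly once.
        return super().push_to_hub(repo_id, **kwargs)


model = ModelStub()
try:
    old_result = model.push_old("user/repo")
except TypeError as exc:
    old_result = f"TypeError: {exc}"
new_result = model.push_new("user/repo")

print(old_result)
print(new_result)
```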

asr_pipeline.py CHANGED

@@ -418,4 +418,57 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
         text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
         # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
         text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
         return {"text": text}

         text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
         # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
         text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+        # Truncate repetitions at end of text
+        text = _truncate_repetitions(text)
         return {"text": text}
+def _truncate_repetitions(text: str, min_repeats: int = 3) -> str:
+    """Truncate repeated words/phrases/characters at end of text.
+    Detects patterns like:
+    - Repeated words: "the the the the" -> "the"
+    - Repeated phrases: "i am sorry i am sorry i am sorry" -> "i am sorry"
+    - Repeated characters: "444444" -> "4"
+    Args:
+        text: Input text to process
+        min_repeats: Minimum repetitions to trigger truncation (default 3)
+    Returns:
+        Text with trailing repetitions removed
+    """
+    if not text:
+        return text
+    # 1. Truncate repeated characters at end (e.g., "444444" -> "4")
+    char_pattern = re.compile(r"(.)\1{" + str(min_repeats - 1) + r",}$")
+    text = char_pattern.sub(r"\1", text)
+    # 2. Truncate repeated words at end (e.g., "the the the" -> "the")
+    word_pattern = re.compile(r"\b(\w+)(?:\s+\1){" + str(min_repeats - 1) + r",}\s*$", re.IGNORECASE)
+    while word_pattern.search(text):
+        text = word_pattern.sub(r"\1", text)
+    # 3. Truncate repeated phrases (2-20 words) at end
+    # e.g., "i am sorry i am sorry i am sorry" -> "i am sorry"
+    words = text.split()
+    if len(words) >= min_repeats * 2:
+        # Try phrase lengths from 2 to 20 words
+        for phrase_len in range(2, min(21, len(words) // min_repeats + 1)):
+            # Check if the last phrase_len words repeat
+            phrase = " ".join(words[-phrase_len:])
+            # Build pattern to match repeated phrases at end
+            phrase_escaped = re.escape(phrase)
+            phrase_pattern = re.compile(
+                r"(^|.*?\s)(" + phrase_escaped + r")(?:\s+" + phrase_escaped + r"){" + str(min_repeats - 1) + r",}\s*$",
+                re.IGNORECASE,
+            )
+            match = phrase_pattern.match(text)
+            if match:
+                # Keep prefix + one instance of the phrase
+                text = (match.group(1) + match.group(2)).strip()
+                words = text.split()
+                break
+    return text