Model save

- .gitattributes +1 -0
- README.md +83 -263
.gitattributes CHANGED
@@ -1,3 +1,4 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 tokenizer_config.json -filter -diff -merge text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,267 +1,87 @@
 ---
-license: mit
-language:
-- en
-datasets:
-- speechbrain/LoquaciousSet
-base_model:
-- zai-org/GLM-ASR-Nano-2512
-- Qwen/Qwen3-0.6B
-pipeline_tag: automatic-speech-recognition
-tags:
-- asr
-- speech-recognition
-- audio
-- qwen
-- glm-asr
 library_name: transformers
+tags:
+- generated_from_trainer
+model-index:
+- name: tiny-audio
+  results: []
 ---
 
-
-###
-
-### Using with torch directly
-
-```python
-from tiny_audio import ASRModel, ASRProcessor
-import torch
-import librosa
-
-# Load model and processor
-model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
-processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
-
-# Load audio (16kHz)
-audio, sr = librosa.load("audio.wav", sr=16000)
-
-# Process
-inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-
-# Generate
-with torch.no_grad():
-    output = model.generate(
-        input_features=inputs["input_features"],
-        attention_mask=inputs["attention_mask"],
-        max_new_tokens=256
-    )
-
-# Decode
-text = processor.batch_decode(output, skip_special_tokens=True)[0]
-print(text)
-```
-
-### GPU Inference
-
-```python
-import torch
-from transformers import pipeline
-
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model="mazesmazes/tiny-audio",
-    trust_remote_code=True,
-    device="cuda"  # or device=0
-)
-```
-
-### Half Precision
-
-```python
-import torch
-from transformers import pipeline
-
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model="mazesmazes/tiny-audio",
-    trust_remote_code=True,
-    torch_dtype=torch.float16,
-    device="cuda"
-)
-```
-
-## Architecture
-
-```
-Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
-```
-
-Only the projector (~12M parameters) is trained. The encoder and decoder remain frozen, leveraging their pretrained knowledge.
-
-| Component | Model | Parameters | Status |
-|-----------|-------|------------|--------|
-| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
-| Projector | 2-layer MLP | ~12M | Trained |
-| Language Model | Qwen3-0.6B | ~600M | Frozen |
-
-### How It Works
-
-1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
-2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
-3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
-
-The projector reduces the sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
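To make the frame-stacking arithmetic concrete, here is an illustrative projector sketch. It is not the repository's actual code: the stack size of 5 follows from the length formula above, while the hidden width and GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class StackingProjector(nn.Module):
    """Illustrative 2-layer MLP projector with frame stacking.

    Stacks k consecutive encoder frames into one vector, shrinking the
    sequence by ~k while widening features, then maps the result into
    the LM embedding space. Sizes here are placeholders, not the real config.
    """

    def __init__(self, enc_dim=768, lm_dim=1024, k=5):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * k, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):                       # x: (batch, T, enc_dim)
        # unfold takes windows of k frames at stride k, so the output
        # length is (T - k) // k + 1, matching the formula above
        windows = x.unfold(1, self.k, self.k)   # (batch, T', enc_dim, k)
        stacked = windows.transpose(2, 3).flatten(2)  # (batch, T', k*enc_dim)
        return self.mlp(stacked)

frames = torch.randn(1, 100, 768)          # 100 encoder frames
print(StackingProjector()(frames).shape)   # torch.Size([1, 20, 1024])
```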
-
-## Model Specifications
-
-| Specification | Value |
-|---------------|-------|
-| Input | Audio (16kHz mono) |
-| Output | Text transcription |
-| Max Audio Length | ~30 seconds (limited by encoder) |
-| Vocabulary | Qwen3 tokenizer |
-| Languages | English only |
-| Generation | Greedy decoding (num_beams=1, do_sample=False); see the sketch below |
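The greedy settings in the table correspond one-to-one with `generate()` arguments; a minimal sketch, reusing the names from the torch example above:

```python
# Greedy decoding as specified in the table (num_beams=1, do_sample=False).
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
        num_beams=1,      # no beam search
        do_sample=False,  # take the argmax token at every step
    )
```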
-
-## Training Details
-
-| Setting | Value |
-|---------|-------|
-| **Dataset** | LoquaciousSet (25,000 hours) |
-| **Hardware** | Single NVIDIA A40 |
-| **Time** | ~24 hours |
-| **Cost** | ~$12 |
-| **Optimizer** | AdamW |
-| **Learning Rate** | 1e-4 |
-| **Batch Size** | 4 |
-| **Steps** | 50,000 |
-
-## Limitations
-
-- **English only**: Not trained on other languages
-- **Sample rate**: Expects 16kHz audio (other rates are resampled automatically)
-- **Audio length**: Best for clips under 30 seconds
-- **Accuracy**: May degrade on:
-  - Heavily accented speech
-  - Noisy or low-quality audio
-  - Domain-specific terminology
-  - Overlapping speakers
-- **No punctuation**: Output is lowercase without punctuation by default
-
-## Requirements
-
-```
-transformers>=4.40.0
-torch>=2.0.0
-torchaudio>=2.0.0
-```
-
-Optional for streaming:
-```
-librosa
-soundfile
-```
-
-## Files
-
-| File | Description |
-|------|-------------|
-| `config.json` | Model configuration |
-| `model.safetensors` | Projector weights (~48MB) |
-| `preprocessor_config.json` | Audio preprocessing config |
-| `tokenizer.json` | Tokenizer |
-| `tokenizer_config.json` | Tokenizer config |
-| `special_tokens_map.json` | Special tokens |
-
-Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
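A hedged sketch of what that assembly amounts to. This is illustrative only: the real wiring lives in the repo's `trust_remote_code` classes, and whether `AutoModel` resolves the GLM-ASR encoder this way is an assumption.

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModel, AutoModelForCausalLM

# Frozen halves come from their original repos (names from the card above);
# loading the encoder via AutoModel + trust_remote_code is an assumption.
encoder = AutoModel.from_pretrained("zai-org/GLM-ASR-Nano-2512", trust_remote_code=True)
decoder = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# This repo itself only ships the trained projector (~48MB of weights).
path = hf_hub_download("mazesmazes/tiny-audio", "model.safetensors")
projector_state = load_file(path)  # 2-layer MLP weights for the projector module
```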
-
-## Citation
-
-If you use this model, please cite:
-
-```bibtex
-@misc{tinyaudio2024,
-  author = {Alex Kroman},
-  title = {Tiny Audio: Minimal ASR Training},
-  year = {2024},
-  publisher = {GitHub},
-  url = {https://github.com/alexkroman/tiny-audio}
-}
-```
-
-## Links
-
-- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
-- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
-- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
-
-## Acknowledgments
-
-- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
-- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
-- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for the training data
-
-## License
-
-MIT
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
+# tiny-audio
+
+This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+It achieves the following results on the evaluation set:
+- Loss: 1.8002
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
+- learning_rate: 0.002
+- train_batch_size: 14
+- eval_batch_size: 14
+- seed: 42
+- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
+- lr_scheduler_type: polynomial
+- lr_scheduler_warmup_steps: 1000
+- num_epochs: 4
+- label_smoothing_factor: 0.1
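As a rough translation, these values map onto Hugging Face `TrainingArguments` as sketched below. This is illustrative, not the actual training script; `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Sketch only: field values mirror the hyperparameter list above.
args = TrainingArguments(
    output_dir="tiny-audio",          # placeholder, not from the run
    learning_rate=2e-3,
    per_device_train_batch_size=14,
    per_device_eval_batch_size=14,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="polynomial",
    warmup_steps=1000,
    num_train_epochs=4,
    label_smoothing_factor=0.1,
)
```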
+
+### Training results
+
+| Training Loss | Epoch | Step | Validation Loss |
+|:-------------:|:------:|:------:|:---------------:|
+| 2.1624 | 0.1303 | 10000 | 1.8803 |
+| 2.1100 | 0.2607 | 20000 | 1.8542 |
+| 2.0734 | 0.3910 | 30000 | 1.8479 |
+| 2.1233 | 0.5214 | 40000 | 1.8361 |
+| 2.1015 | 0.6517 | 50000 | 1.8280 |
+| 2.0839 | 0.7820 | 60000 | 1.8288 |
+| 2.0971 | 0.9124 | 70000 | 1.8219 |
+| 2.0907 | 1.0427 | 80000 | 1.8218 |
+| 2.0599 | 1.1731 | 90000 | 1.8167 |
+| 2.0747 | 1.3034 | 100000 | 1.8171 |
+| 2.0713 | 1.4337 | 110000 | 1.8152 |
+| 2.0866 | 1.5641 | 120000 | 1.8133 |
+| 2.0904 | 1.6944 | 130000 | 1.8104 |
+| 2.0554 | 1.8248 | 140000 | 1.8092 |
+| 2.0968 | 1.9551 | 150000 | 1.8100 |
+| 2.0644 | 2.0855 | 160000 | 1.8077 |
+| 2.0499 | 2.2158 | 170000 | 1.8054 |
+| 2.0570 | 2.3461 | 180000 | 1.8056 |
+| 2.0432 | 2.4765 | 190000 | 1.8066 |
+| 2.0413 | 2.6068 | 200000 | 1.8050 |
+| 2.0373 | 2.7372 | 210000 | 1.8039 |
+| 2.0117 | 2.8675 | 220000 | 1.8036 |
+| 2.0437 | 2.9978 | 230000 | 1.8036 |
+| 2.0454 | 3.1282 | 240000 | 1.8032 |
+| 2.0181 | 3.2585 | 250000 | 1.8022 |
+| 2.0266 | 3.3889 | 260000 | 1.8015 |
+| 2.0451 | 3.5192 | 270000 | 1.8018 |
+| 2.0308 | 3.6495 | 280000 | 1.8019 |
+| 2.0419 | 3.7799 | 290000 | 1.8005 |
+| 2.0172 | 3.9102 | 300000 | 1.8002 |
+
+### Framework versions
+
+- Transformers 5.0.0.dev0
+- Pytorch 2.8.0+cu128
+- Datasets 3.6.0
+- Tokenizers 0.22.2