mazesmazes committed
Commit 9749d46 · verified · 1 parent: 0c9fde7

Update custom model files, README, and requirements

Files changed (3):
  1. README.md +67 -80
  2. asr_config.py +1 -1
  3. asr_pipeline.py +4 -1
README.md CHANGED
@@ -1,84 +1,71 @@
  ---
- library_name: transformers
  tags:
- - generated_from_trainer
- model-index:
- - name: tiny-audio
- results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # tiny-audio
-
- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3030
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0002
- - train_batch_size: 20
- - eval_batch_size: 20
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 40
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 500
- - num_epochs: 1
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:------:|:-----:|:---------------:|
- | 0.5056 | 0.0372 | 1000 | 0.4229 |
- | 0.4725 | 0.0745 | 2000 | 0.3813 |
- | 0.4602 | 0.1117 | 3000 | 0.3660 |
- | 0.4293 | 0.1490 | 4000 | 0.3542 |
- | 0.4113 | 0.1862 | 5000 | 0.3468 |
- | 0.4153 | 0.2234 | 6000 | 0.3404 |
- | 0.3996 | 0.2607 | 7000 | 0.3337 |
- | 0.4089 | 0.2979 | 8000 | 0.3294 |
- | 0.3743 | 0.3352 | 9000 | 0.3250 |
- | 0.4082 | 0.3724 | 10000 | 0.3214 |
- | 0.3679 | 0.4096 | 11000 | 0.3193 |
- | 0.3792 | 0.4469 | 12000 | 0.3185 |
- | 0.3723 | 0.4841 | 13000 | 0.3140 |
- | 0.3696 | 0.5214 | 14000 | 0.3109 |
- | 0.3920 | 0.5586 | 15000 | 0.3104 |
- | 0.3827 | 0.5958 | 16000 | 0.3070 |
- | 0.3593 | 0.6331 | 17000 | 0.3054 |
- | 0.3832 | 0.6703 | 18000 | 0.3056 |
- | 0.3772 | 0.7076 | 19000 | 0.3038 |
- | 0.3887 | 0.7448 | 20000 | 0.3035 |
- | 0.3666 | 0.7821 | 21000 | 0.3034 |
- | 0.3690 | 0.8193 | 22000 | 0.3033 |
- | 0.3568 | 0.8565 | 23000 | 0.3031 |
- | 0.3809 | 0.8938 | 24000 | 0.3029 |
- | 0.3736 | 0.9310 | 25000 | 0.3032 |
- | 0.3553 | 0.9683 | 26000 | 0.3030 |
-
- ### Framework versions
-
- - Transformers 5.0.0.dev0
- - Pytorch 2.8.0+cu128
- - Datasets 3.6.0
- - Tokenizers 0.22.1
  ---
+ license: mit
+ language:
+ - en
+ datasets:
+ - speechbrain/LoquaciousSet
+ base_model:
+ - openai/whisper-large-v3-turbo
+ - HuggingFaceTB/SmolLM3-3B
+ pipeline_tag: automatic-speech-recognition
  tags:
+ - asr
+ - speech-recognition
+ - audio
+ - smollm
+ - whisper
+ - mlp
  ---

+ # Tiny Audio
+
+ A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase, a minimal, hackable framework for training ASR models.
+
+ ## Architecture
+
+ ```
+ Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
+ ```
+
+ **MLP Projector:**
+ - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
+ - Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
+ - Output normalization: RMSNorm
+
+ ## Training Details
+
+ | | |
+ |---|---|
+ | **Dataset** | LoquaciousSet (25,000 hours) |
+ | **Hardware** | Single NVIDIA A40 40GB |
+ | **Training Time** | ~24 hours |
+ | **Cost** | ~$12 |
+ | **Trainable Parameters** | ~12M (projector only) |
+
+ ## Performance
+
+ **Word Error Rate (WER): 12.14%** on the LoquaciousSet test set.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+
+ result = pipe("path/to/audio.wav")
+ print(result["text"])
+ ```
+
+ ## Limitations
+
+ - English only
+ - Optimized for 16kHz audio; other sample rates are resampled automatically
+ - Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
+ - Maximum audio length limited by the language model's context window
+
+ ## Learn More
+
+ - **[Train your own model](https://github.com/alexkroman/tiny-audio)**: the full codebase with training scripts
+ - **[Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)**: build your own ASR system from scratch
asr_config.py CHANGED
@@ -71,7 +71,7 @@ class ASRConfig(transformers.PretrainedConfig):
      "min_new_tokens": 0,
      "repetition_penalty": 1.0,
      "length_penalty": 1.0,
-     "no_repeat_ngram_size": 0,
+     "no_repeat_ngram_size": 3,  # Prevent repeating 3-grams like "so so so"
      "use_cache": True,
  }
 
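The config change above enables `no_repeat_ngram_size=3`: during decoding, any token that would complete a 3-gram already present in the generated output is banned. A toy sketch of that ban rule in plain Python (not the actual transformers logits-processor implementation):

```python
def banned_next_tokens(generated, n=3):
    """Return tokens that would repeat an n-gram already in `generated`.

    A token is banned if appending it would recreate an n-gram whose
    first n-1 tokens match the current tail of the sequence.
    """
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])  # last n-1 generated tokens
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

print(banned_next_tokens(["so", "so", "so"]))            # {'so'}
print(banned_next_tokens(["a", "b", "c", "a", "b"]))     # {'c'}
```

With the old value of 0 this check was disabled entirely, so a looping decoder could emit "so so so …" indefinitely.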
asr_pipeline.py CHANGED
@@ -486,6 +486,7 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
         return ""

     original_len = len(text.split())
+    original_text = text  # Keep for debug

     # 1. LOWERCASE
     text = text.lower()
@@ -505,8 +506,10 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
         words = words[: idx + n]
         text = " ".join(words)
         print(
-            f"[DEBUG] Truncated repetition: {original_len} -> {len(words)} words (n={n}, repeats={repeat_count})"
+            f"[POSTPROCESS] Truncated repetition: {original_len} -> {len(words)} words (n={n}, repeats={repeat_count})"
         )
+        print(f"[POSTPROCESS] Before: {original_text[:100]}...")
+        print(f"[POSTPROCESS] After: {text[:100]}...")
         break

  # 3. COMBINE ACRONYMS
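The truncation branch shown in this hunk cuts the transcript once a trailing n-gram repeats enough times, catching loops that slip past the generation-time n-gram ban. A standalone sketch of the idea (the thresholds and scan order here are assumptions, not copied from asr_pipeline.py):

```python
def truncate_repetition(text, max_n=3, min_repeats=3):
    """Cut a transcript after the first occurrence of an n-gram that
    repeats back-to-back min_repeats or more times."""
    words = text.split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngram = words[i:i + n]
            repeats, j = 1, i + n
            # Count consecutive copies of the n-gram.
            while words[j:j + n] == ngram:
                repeats += 1
                j += n
            if repeats >= min_repeats:
                # Keep everything up to and including the first copy.
                return " ".join(words[:i + n])
    return text

print(truncate_repetition("i think so so so so so"))  # -> "i think so"
print(truncate_repetition("no loops in this one"))    # unchanged
```

Mirroring the pipeline code, the cut is made at `words[:i + n]`, so the first occurrence of the repeated phrase survives while the runaway copies are dropped.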