mazesmazes committed
Commit 5fb5dbf · verified · 1 Parent(s): e626577

Model save

Files changed (1):
  1. README.md +80 -67
README.md CHANGED
@@ -1,71 +1,84 @@
  ---
- license: mit
- language:
- - en
- datasets:
- - speechbrain/LoquaciousSet
- base_model:
- - openai/whisper-large-v3-turbo
- - HuggingFaceTB/SmolLM3-3B
- pipeline_tag: automatic-speech-recognition
  tags:
- - asr
- - speech-recognition
- - audio
- - smollm
- - whisper
- - mlp
  ---
 
- # Tiny Audio
-
- A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase, a minimal, hackable framework for training ASR models.
-
- ## Architecture
-
- ```
- Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
- ```
-
- **MLP Projector:**
- - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- - Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- - Output normalization: RMSNorm
-
- ## Training Details
-
- | | |
- |---|---|
- | **Dataset** | LoquaciousSet (25,000 hours) |
- | **Hardware** | Single NVIDIA A40 40GB |
- | **Training Time** | ~24 hours |
- | **Cost** | ~$12 |
- | **Trainable Parameters** | ~12M (projector only) |
-
- ## Performance
-
- **Word Error Rate (WER): 12.14%** on the LoquaciousSet test set.
-
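For reference, WER is the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the hypothesis, divided by the number of reference words. A minimal pure-Python sketch of the metric, not the repository's actual scoring script:

```python
# Word Error Rate via dynamic-programming edit distance over words.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sit"))  # 0.3333333333333333
```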
- ## Usage
-
- ```python
- from transformers import pipeline
-
- pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
-
- result = pipe("path/to/audio.wav")
- print(result["text"])
- ```
-
- ## Limitations
-
- - English only
- - Optimized for 16kHz audio; other sample rates are resampled automatically
- - Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- - Maximum audio length limited by context window
-
- ## Learn More
-
- - **[Train your own model](https://github.com/alexkroman/tiny-audio)**: the full codebase with training scripts
- - **[Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)**: build your own ASR system from scratch
  ---
+ library_name: transformers
  tags:
+ - generated_from_trainer
+ model-index:
+ - name: tiny-audio
+   results: []
  ---
 
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # tiny-audio
+
+ This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.3030
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0002
+ - train_batch_size: 20
+ - eval_batch_size: 20
+ - seed: 42
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 40
+ - optimizer: adamw_torch_fused with betas=(0.9, 0.95) and epsilon=1e-08; no additional optimizer arguments
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 1
+
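The schedule in the hyperparameter list above (500 linear warmup steps, then cosine decay of the 2e-4 peak learning rate) can be sketched in pure Python. The `total_steps` value here is hypothetical; the run's actual step count is not stated in the card.

```python
# Sketch of linear warmup followed by cosine decay to zero.
# warmup=500 and peak=2e-4 come from the card; total_steps is a placeholder.
import math

def lr_at(step: int, peak: float = 2e-4, warmup: int = 500,
          total_steps: int = 26_000) -> float:
    if step < warmup:
        return peak * step / warmup  # linear warmup from 0 to peak
    # cosine decay from peak down to 0 over the remaining steps
    progress = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))    # start of warmup: 0.0
print(lr_at(500))  # end of warmup, peak LR: 0.0002
```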
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:------:|:-----:|:---------------:|
+ | 0.5056 | 0.0372 | 1000 | 0.4229 |
+ | 0.4725 | 0.0745 | 2000 | 0.3813 |
+ | 0.4602 | 0.1117 | 3000 | 0.3660 |
+ | 0.4293 | 0.1490 | 4000 | 0.3542 |
+ | 0.4113 | 0.1862 | 5000 | 0.3468 |
+ | 0.4153 | 0.2234 | 6000 | 0.3404 |
+ | 0.3996 | 0.2607 | 7000 | 0.3337 |
+ | 0.4089 | 0.2979 | 8000 | 0.3294 |
+ | 0.3743 | 0.3352 | 9000 | 0.3250 |
+ | 0.4082 | 0.3724 | 10000 | 0.3214 |
+ | 0.3679 | 0.4096 | 11000 | 0.3193 |
+ | 0.3792 | 0.4469 | 12000 | 0.3185 |
+ | 0.3723 | 0.4841 | 13000 | 0.3140 |
+ | 0.3696 | 0.5214 | 14000 | 0.3109 |
+ | 0.3920 | 0.5586 | 15000 | 0.3104 |
+ | 0.3827 | 0.5958 | 16000 | 0.3070 |
+ | 0.3593 | 0.6331 | 17000 | 0.3054 |
+ | 0.3832 | 0.6703 | 18000 | 0.3056 |
+ | 0.3772 | 0.7076 | 19000 | 0.3038 |
+ | 0.3887 | 0.7448 | 20000 | 0.3035 |
+ | 0.3666 | 0.7821 | 21000 | 0.3034 |
+ | 0.3690 | 0.8193 | 22000 | 0.3033 |
+ | 0.3568 | 0.8565 | 23000 | 0.3031 |
+ | 0.3809 | 0.8938 | 24000 | 0.3029 |
+ | 0.3736 | 0.9310 | 25000 | 0.3032 |
+ | 0.3553 | 0.9683 | 26000 | 0.3030 |
+
+ ### Framework versions
+
+ - Transformers 5.0.0.dev0
+ - Pytorch 2.8.0+cu128
+ - Datasets 3.6.0
+ - Tokenizers 0.22.1