mazesmazes committed on
Commit 2dd8b08 · verified · 1 Parent(s): 4ca0417

Update custom model files, README, and requirements

Files changed (2)
  1. README.md +47 -62
  2. handler.py +3 -21
README.md CHANGED
@@ -1,82 +1,67 @@
  ---
- library_name: transformers
- tags:
- - generated_from_trainer
- - trl
- - sft
  datasets:
- - generator
- model-index:
- - name: tiny-audio
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # tiny-audio

- This model is a fine-tuned version of [](https://huggingface.co/) on the generator dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.4543

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 8
- - eval_batch_size: 32
- - seed: 123
- - gradient_accumulation_steps: 3
- - total_train_batch_size: 24
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 1000
- - training_steps: 20000

- ### Training results

- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 8.8546        | 0.05  | 1000  | 3.8075          |
- | 0.8286        | 0.1   | 2000  | 0.5193          |
- | 0.7909        | 0.15  | 3000  | 0.4701          |
- | 0.6955        | 0.2   | 4000  | 0.4581          |
- | 0.599         | 0.25  | 5000  | 0.4434          |
- | 0.6159        | 0.3   | 6000  | 0.4353          |
- | 0.5764        | 0.35  | 7000  | 0.4260          |
- | 0.602         | 0.05  | 8000  | 0.4298          |
- | 0.5363        | 0.1   | 9000  | 0.4430          |
- | 0.5643        | 0.15  | 10000 | 0.4636          |
- | 0.5135        | 0.2   | 11000 | 0.4423          |
- | 0.4419        | 0.25  | 12000 | 0.4473          |
- | 0.4848        | 0.3   | 13000 | 0.4539          |
- | 0.4692        | 0.35  | 14000 | 0.4481          |
- | 0.5154        | 0.05  | 15000 | 0.4482          |
- | 0.4736        | 0.1   | 16000 | 0.4522          |
- | 0.5097        | 0.15  | 17000 | 0.4537          |
- | 0.4729        | 0.2   | 18000 | 0.4542          |
- | 0.4142        | 0.25  | 19000 | 0.4543          |
- | 0.4718        | 0.3   | 20000 | 0.4543          |

- ### Framework versions

- - Transformers 4.57.1
- - Pytorch 2.8.0+cu128
- - Datasets 4.4.1
- - Tokenizers 0.22.1
 
  ---
+ license: mit
+ language:
+ - en
  datasets:
+ - speechbrain/LoquaciousSet
+ base_model:
+ - openai/whisper-large-v3-turbo
+ - HuggingFaceTB/SmolLM3-3B
+ pipeline_tag: automatic-speech-recognition
+ tags:
+ - asr
+ - speech-recognition
+ - audio
+ - smollm
+ - whisper
+ - moe
  ---

+ # Tiny Audio Model Card
+
+ This model was born from a simple idea: what if anyone could train a powerful, modern speech recognition model for the price of a few coffees? It is the result of the [Tiny Audio course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md), a free, hands-on guide to building your own ASR system from scratch.
+
+ ## The Story of This Model
+
+ This model isn't the product of a massive research lab with an unlimited budget. It's the result of a 24-hour training run on a single GPU, made possible by an efficient projector-only training approach. By combining the strengths of OpenAI's Whisper encoder (`openai/whisper-large-v3-turbo`) with a powerful language model (`HuggingFaceTB/SmolLM3-3B`), and training only a Mixture of Simple Adapters (MOSA) projector between them, we can create a high-quality ASR model with minimal resources.
+
+ This model is a testament to the power of open source and the incredible tools and models now available to everyone.

+ ## Architecture
+
+ ```
+ Audio (16kHz) → Whisper Encoder (frozen) → MoE Projector (trainable) → SmolLM3-3B (frozen) → Text
+ ```
+
+ **MoE Projector (MOSA):**
+ - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
+ - Router: Linear→ReLU→Linear with a dense softmax over 4 experts
+ - Experts: 4 adapters, each Linear→ReLU→Linear (2048→4096→2048)
+ - Output normalization: RMSNorm
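To make the bullet points above concrete, here is a minimal PyTorch sketch of a MOSA-style projector. This is an illustration, not the repository's actual code: the encoder width (1280 for `whisper-large-v3-turbo`), the router's hidden width, and the GELU between the conv layers are assumptions.

```python
# Illustrative MOSA-style projector sketch; the dims and details flagged below are assumptions.
import torch
import torch.nn as nn


class MOSAProjector(nn.Module):
    def __init__(self, enc_dim=1280, llm_dim=2048, expert_hidden=4096, n_experts=4):
        super().__init__()
        # Two stride-2 convs give 4x sequence compression and map enc_dim -> llm_dim
        # (the GELU between them is an assumption).
        self.downsample = nn.Sequential(
            nn.Conv1d(enc_dim, llm_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(llm_dim, llm_dim, kernel_size=3, stride=2, padding=1),
        )
        # Router: Linear -> ReLU -> Linear, dense softmax over the experts
        # (the router's hidden width is an assumption).
        self.router = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.ReLU(), nn.Linear(llm_dim, n_experts)
        )
        # Experts: simple Linear -> ReLU -> Linear adapters (2048 -> 4096 -> 2048).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(llm_dim, expert_hidden),
                nn.ReLU(),
                nn.Linear(expert_hidden, llm_dim),
            )
            for _ in range(n_experts)
        )
        self.norm = nn.RMSNorm(llm_dim)  # requires PyTorch >= 2.4

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, enc_dim) features from the frozen Whisper encoder.
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (batch, frames/4, llm_dim)
        weights = self.router(x).softmax(dim=-1)                # (batch, frames/4, n_experts)
        outs = torch.stack([expert(x) for expert in self.experts], dim=-1)
        x = (outs * weights.unsqueeze(-2)).sum(dim=-1)          # dense mix of all experts
        return self.norm(x)  # audio tokens now live in the LLM's embedding space
```

Because the routing softmax is dense rather than top-k, every expert receives gradient on every token, which keeps projector-only training simple and stable at this scale.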
 
+ ## Intended Use
+
+ This model is for you. It's for the curious, the builders, the learners. It's for anyone who wants to understand how modern AI works by getting their hands dirty. Use it to transcribe your podcasts, your meetings, your voice memos. But more importantly, use it as a starting point. Fork it, fine-tune it, break it, and make it your own.

+ ## Performance
+
+ This model achieves a Word Error Rate (WER) of **12.14%** on the LoquaciousSet test set. It's not perfect, but it's a solid baseline that you can build on. See how it compares to other models on the [community leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard).
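As a sanity check on numbers like this: WER counts word-level substitutions, insertions, and deletions against a reference transcript. A minimal sketch using the `jiwer` package (one common WER implementation, shown as an assumption rather than the course's actual tooling):

```python
# pip install jiwer
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

# WER = (substitutions + insertions + deletions) / words in the reference
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 11.11%: 1 substitution / 9 words
```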
 
+ ## How to Use
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+
+ result = pipe("path/to/audio.wav")
+ print(result["text"])
+ ```
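Since the updated `handler.py` below forwards request parameters directly to the pipeline, generation options can also be passed as keyword arguments when calling it locally. A small sketch (the specific values are illustrative, and which options apply depends on the custom pipeline code):

```python
# Generation options are forwarded to the underlying generate() call.
result = pipe(
    "path/to/audio.wav",
    max_new_tokens=128,  # cap on generated tokens
    num_beams=4,         # beam search instead of greedy decoding
)
print(result["text"])
```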
 
+ ## How to Get Involved
+
+ This project is more than just a model; it's a community. Here's how you can get involved:
+
+ - **Take the course**: The best way to start is to go through the [free 6-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) and train your own model.
+ - **Share your results**: Add your model to the [leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard) and share what you've learned.
+ - **Join the conversation**: Ask questions, share your ideas, and connect with other builders in the [GitHub Discussions](https://github.com/alexkroman/tiny-audio/discussions).
handler.py CHANGED
@@ -108,25 +108,7 @@ class EndpointHandler:
  if inputs is None:
      raise ValueError("Missing 'inputs' in request data")

  params = data.get("parameters", {})
- max_new_tokens = params.get("max_new_tokens", 128)
- num_beams = params.get("num_beams", 1)
- do_sample = params.get("do_sample", False)
- length_penalty = params.get("length_penalty", 1.0)
- repetition_penalty = params.get("repetition_penalty", 1.05)
- no_repeat_ngram_size = params.get("no_repeat_ngram_size", 0)
- early_stopping = params.get("early_stopping", True)
- default_diversity = self.pipe.model.config.inference_diversity_penalty
- diversity_penalty = params.get("diversity_penalty", default_diversity)
-
- return self.pipe(
-     inputs,
-     max_new_tokens=max_new_tokens,
-     num_beams=num_beams,
-     do_sample=do_sample,
-     length_penalty=length_penalty,
-     repetition_penalty=repetition_penalty,
-     no_repeat_ngram_size=no_repeat_ngram_size,
-     early_stopping=early_stopping,
-     diversity_penalty=diversity_penalty,
- )
 
  if inputs is None:
      raise ValueError("Missing 'inputs' in request data")

+ # Pass through any parameters from request, let model config provide defaults
  params = data.get("parameters", {})
+
+ return self.pipe(inputs, **params)
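For completeness, a sketch of what a client request to an Inference Endpoint backed by this handler might look like. The URL, token, and base64 audio encoding are assumptions; check your deployment for the exact input format the handler expects. Only the `inputs`/`parameters` payload shape follows directly from the handler code above.

```python
import base64

import requests

# Placeholders: substitute your own endpoint URL and access token.
API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer hf_..."}

# The encoding expected for "inputs" depends on the deployment;
# base64-encoded audio bytes are shown here as an assumption.
with open("path/to/audio.wav", "rb") as f:
    audio = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": audio,
    # Everything under "parameters" is now passed through verbatim;
    # omit it to fall back to the model config's defaults.
    "parameters": {"max_new_tokens": 128, "num_beams": 4},
}

print(requests.post(API_URL, headers=HEADERS, json=payload).json())
```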