iamanishx committed on
Commit
dc28dd7
·
verified ·
1 Parent(s): f9d7942

Upload model via docker model push

Files changed (2)
  1. README.md +230 -696
  2. model.gguf +2 -2
README.md CHANGED
@@ -1,723 +1,257 @@
1
  ---
2
- base_model:
3
- - google/gemma-3-270m-it
4
- license: gemma
5
- tags:
6
- - gemma3
7
- - unsloth
8
- - gemma
9
- - google
10
- pipeline_tag: text-generation
11
- library_name: transformers
12
  ---
13
- > [!NOTE]
14
- > Please use the correct settings: `temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0`
15
- >
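The four sampling settings in the note above can be read as successive filters on the next-token distribution. The following is a minimal sketch of that reading, not the implementation of any particular runtime: the function name `filter_probs` and the order in which the filters are applied are assumptions made for illustration, and real inference engines may order or normalize these steps differently.

```python
import math

# Values quoted in the note above.
SETTINGS = {"temperature": 1.0, "top_k": 64, "top_p": 0.95, "min_p": 0.0}


def filter_probs(logits, temperature, top_k, top_p, min_p):
    """Return {token_index: probability} after the four filters (hypothetical helper)."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top_k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # min_p: drop tokens below min_p times the most probable token's probability.
    threshold = min_p * probs[ranked[0]]
    kept = [i for i in kept if probs[i] >= threshold]

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


example = filter_probs([2.0, 1.0, 0.0, -5.0], **SETTINGS)
```

With `min_p = 0.0` the last filter removes nothing, and with `temperature = 1.0` the logits are left unscaled, so in this sketch only `top_k` and `top_p` actually narrow the distribution.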
16
- <div>
17
- <p style="margin-bottom: 0; margin-top: 0;">
18
- <strong>See <a href="https://huggingface.co/collections/unsloth/gemma-3-67d12b7e8816ec6efa7e4e5b">our collection</a> for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats.</strong>
19
- </p>
20
- <p style="margin-bottom: 0;">
21
- <em><a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-gemma-3-effectively">Read our Guide</a> to see how to Run Gemma 3 correctly.</em>
22
- </p>
23
- <div style="display: flex; gap: 5px; align-items: center; ">
24
- <a href="https://github.com/unslothai/unsloth/">
25
- <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
26
- </a>
27
- <a href="https://discord.gg/unsloth">
28
- <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
29
- </a>
30
- <a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-r1-on-your-own-local-device">
31
- <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
32
- </a>
33
  </div>
34
- <h1 style="margin-top: 0rem;">✨ Fine-tune Gemma 3 with Unsloth!</h1>
35
  </div>
36
 
37
- - Fine-tune Gemma 3 (270M) for free using our Google [Colab notebook here](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(270M).ipynb)!
38
- - Read our Blog about Gemma 3 support: [unsloth.ai/blog/gemma3](https://unsloth.ai/blog/gemma3)
39
- - View the rest of our notebooks in our [docs here](https://docs.unsloth.ai/get-started/unsloth-notebooks).
40
-
41
-
42
- | Unsloth supports | Free Notebooks | Performance | Memory use |
43
- |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
44
- | **Gemma 3 (4B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B).ipynb) | 2x faster | 80% less |
45
- | **Gemma-3n-E4B** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3N_(4B)-Conversational.ipynb) | 2x faster | 60% less |
46
- | **Gemma-3n-E4B (Audio)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3N_(4B)-Audio.ipynb) | 2x faster | 60% less |
47
- | **GRPO with Gemma 3 (1B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(1B)-GRPO.ipynb) | 2x faster | 80% less |
48
- | **Gemma 3 (4B) Vision** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb) | 2x faster | 60% less |
49
 
50
- # Gemma 3 model card
 
51
 
52
- **Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)
53
 
54
- **Resources and Technical Documentation**:
55
 
56
- * [Gemma 3 Technical Report][g3-tech-report]
57
- * [Responsible Generative AI Toolkit][rai-toolkit]
58
- * [Gemma on Kaggle][kaggle-gemma]
59
- * [Gemma on Vertex Model Garden][vertex-mg-gemma3]
60
 
61
- **Terms of Use**: [Terms][terms]
62
 
63
- **Authors**: Google DeepMind
64
-
65
- ## Model Information
66
-
67
- Summary description and brief definition of inputs and outputs.
68
-
69
- ### Description
70
-
71
- Gemma is a family of lightweight, state-of-the-art open models from Google,
72
- built from the same research and technology used to create the Gemini models.
73
- Gemma 3 models are multimodal, handling text and image input and generating text
74
- output, with open weights for both pre-trained variants and instruction-tuned
75
- variants. Gemma 3 has a large, 128K context window, multilingual support in over
76
- 140 languages, and is available in more sizes than previous versions. Gemma 3
77
- models are well-suited for a variety of text generation and image understanding
78
- tasks, including question answering, summarization, and reasoning. Their
79
- relatively small size makes it possible to deploy them in environments with
80
- limited resources such as laptops, desktops or your own cloud infrastructure,
81
- democratizing access to state-of-the-art AI models and helping foster innovation
82
- for everyone.
83
-
84
- ### Inputs and outputs
85
-
86
- - **Input:**
87
- - Text string, such as a question, a prompt, or a document to be summarized
88
- - Images, normalized to 896 x 896 resolution and encoded to 256 tokens
89
- each
90
- - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
91
- 32K tokens for the 1B and 270M sizes.
92
-
93
- - **Output:**
94
- - Generated text in response to the input, such as an answer to a
95
- question, analysis of image content, or a summary of a document
96
- - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes,
97
- and 32K tokens for the 1B and 270M sizes per request, subtracting the
98
- request input tokens
99
-
100
- ### Citation
101
-
102
- ```none
103
- @article{gemma_2025,
104
- title={Gemma 3},
105
- url={https://arxiv.org/abs/2503.19786},
106
- publisher={Google DeepMind},
107
- author={Gemma Team},
108
- year={2025}
109
- }
110
  ```
111
 
112
- ## Model Data
113
-
114
- Data used for model training and how the data was processed.
115
-
116
- ### Training Dataset
117
-
118
- These models were trained on a dataset of text data that includes a wide variety
119
- of sources. The 27B model was trained with 14 trillion tokens, the 12B model was
120
- trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens,
121
- the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The
122
- knowledge cutoff date for the training data was August 2024. Here are the key
123
- components:
124
-
125
- - Web Documents: A diverse collection of web text ensures the model is
126
- exposed to a broad range of linguistic styles, topics, and vocabulary. The
127
- training dataset includes content in over 140 languages.
128
- - Code: Exposing the model to code helps it to learn the syntax and
129
- patterns of programming languages, which improves its ability to generate
130
- code and understand code-related questions.
131
- - Mathematics: Training on mathematical text helps the model learn logical
132
- reasoning, symbolic representation, and to address mathematical queries.
133
- - Images: A wide range of images enables the model to perform image
134
- analysis and visual data extraction tasks.
135
-
136
- The combination of these diverse data sources is crucial for training a powerful
137
- multimodal model that can handle a wide variety of different tasks and data
138
- formats.
139
-
140
- ### Data Preprocessing
141
-
142
- Here are the key data cleaning and filtering methods applied to the training
143
- data:
144
-
145
- - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering
146
- was applied at multiple stages in the data preparation process to ensure
147
- the exclusion of harmful and illegal content.
148
- - Sensitive Data Filtering: As part of making Gemma pre-trained models
149
- safe and reliable, automated techniques were used to filter out certain
150
- personal information and other sensitive data from training sets.
151
- - Additional methods: Filtering based on content quality and safety in
152
- line with [our policies][safety-policies].
153
-
154
- ## Implementation Information
155
-
156
- Details about the model internals.
157
-
158
- ### Hardware
159
-
160
- Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p,
161
- TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant
162
- computational power. TPUs, designed specifically for matrix operations common in
163
- machine learning, offer several advantages in this domain:
164
-
165
- - Performance: TPUs are specifically designed to handle the massive
166
- computations involved in training VLMs. They can speed up training
167
- considerably compared to CPUs.
168
- - Memory: TPUs often come with large amounts of high-bandwidth memory,
169
- allowing for the handling of large models and batch sizes during training.
170
- This can lead to better model quality.
171
- - Scalability: TPU Pods (large clusters of TPUs) provide a scalable
172
- solution for handling the growing complexity of large foundation models.
173
- You can distribute training across multiple TPU devices for faster and more
174
- efficient processing.
175
- - Cost-effectiveness: In many scenarios, TPUs can provide a more
176
- cost-effective solution for training large models compared to CPU-based
177
- infrastructure, especially when considering the time and resources saved
178
- due to faster training.
179
- - These advantages are aligned with
180
- [Google's commitments to operate sustainably][sustainability].
181
-
182
- ### Software
183
-
184
- Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
185
 
186
- JAX allows researchers to take advantage of the latest generation of hardware,
187
- including TPUs, for faster and more efficient training of large models. ML
188
- Pathways is Google's latest effort to build artificially intelligent systems
189
- capable of generalizing across multiple tasks. This is especially suitable for
190
- foundation models, including large language models like these ones.
191
 
192
- Together, JAX and ML Pathways are used as described in the
193
- [paper about the Gemini family of models][gemini-2-paper]; *"the 'single
194
- controller' programming model of Jax and Pathways allows a single Python
195
- process to orchestrate the entire training run, dramatically simplifying the
196
- development workflow."*
197
 
198
- ## Evaluation
 
 
199
 
200
- Model evaluation metrics and results.
201
 
202
- ### Benchmark Results
 
203
 
204
- These models were evaluated against a large collection of different datasets and
205
- metrics to cover different aspects of text generation. Evaluation results marked
206
- with **IT** are for instruction-tuned models. Evaluation results marked with
207
- **PT** are for pre-trained models.
208
 
 
 
209
 
 
 
210
 
211
- # Gemma 3 model card
212
-
213
- **Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)
214
-
215
- **Resources and Technical Documentation**:
216
-
217
- * [Gemma 3 Technical Report][g3-tech-report]
218
- * [Responsible Generative AI Toolkit][rai-toolkit]
219
- * [Gemma on Kaggle][kaggle-gemma]
220
- * [Gemma on Vertex Model Garden][vertex-mg-gemma3]
221
-
222
- **Terms of Use**: [Terms][terms]
223
-
224
- **Authors**: Google DeepMind
225
-
226
- ## Model Information
227
-
228
- Summary description and brief definition of inputs and outputs.
229
-
230
- ### Description
231
 
232
- Gemma is a family of lightweight, state-of-the-art open models from Google,
233
- built from the same research and technology used to create the Gemini models.
234
- Gemma 3 models are multimodal, handling text and image input and generating text
235
- output, with open weights for both pre-trained variants and instruction-tuned
236
- variants. Gemma 3 has a large, 128K context window, multilingual support in over
237
- 140 languages, and is available in more sizes than previous versions. Gemma 3
238
- models are well-suited for a variety of text generation and image understanding
239
- tasks, including question answering, summarization, and reasoning. Their
240
- relatively small size makes it possible to deploy them in environments with
241
- limited resources such as laptops, desktops or your own cloud infrastructure,
242
- democratizing access to state-of-the-art AI models and helping foster innovation
243
- for everyone.
244
 
245
- ### Inputs and outputs
246
-
247
- - **Input:**
248
- - Text string, such as a question, a prompt, or a document to be summarized
249
- - Images, normalized to 896 x 896 resolution and encoded to 256 tokens
250
- each, for the 4B, 12B, and 27B sizes.
251
- - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
252
- 32K tokens for the 1B and 270M sizes.
253
 
254
- - **Output:**
255
- - Generated text in response to the input, such as an answer to a
256
- question, analysis of image content, or a summary of a document
257
- - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes,
258
- and 32K tokens for the 1B and 270M sizes per request, subtracting the
259
- request input tokens
260
-
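The input and output limits above amount to a shared token budget: the output allowance is the context window minus whatever the request input consumes, and each image costs 256 tokens. A minimal sketch of that arithmetic follows. The helper name `remaining_output_budget` is hypothetical, and reading "128K" and "32K" as multiples of 1024 tokens is an assumption, not something the card states.

```python
# Context windows from the card. Reading "128K" and "32K" as multiples of
# 1024 tokens is an assumption made for this sketch.
CONTEXT_WINDOW = {
    "270M": 32 * 1024,
    "1B": 32 * 1024,
    "4B": 128 * 1024,
    "12B": 128 * 1024,
    "27B": 128 * 1024,
}

# Each image is encoded to 256 tokens, for the sizes that accept images.
TOKENS_PER_IMAGE = 256
IMAGE_SIZES = {"4B", "12B", "27B"}


def remaining_output_budget(size, text_tokens, num_images=0):
    """Tokens left for generation after the request input (hypothetical helper)."""
    if num_images and size not in IMAGE_SIZES:
        raise ValueError(f"the {size} size does not take image input")
    used = text_tokens + num_images * TOKENS_PER_IMAGE
    window = CONTEXT_WINDOW[size]
    if used > window:
        raise ValueError("request input exceeds the context window")
    return window - used


# A 1,000-token prompt on the 270M size, and the same prompt plus two images on 4B.
budget_270m = remaining_output_budget("270M", 1000)
budget_4b = remaining_output_budget("4B", 1000, num_images=2)
```

Under these assumptions, a text-only request on the 270M size always leaves the full 32K window minus the prompt length for generation.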
261
- ### Citation
262
-
263
- ```none
264
- @article{gemma_2025,
265
- title={Gemma 3},
266
- url={https://arxiv.org/abs/2503.19786},
267
- publisher={Google DeepMind},
268
- author={Gemma Team},
269
- year={2025}
270
- }
271
  ```
272
 
273
- ## Model Data
274
-
275
- Data used for model training and how the data was processed.
276
-
277
- ### Training Dataset
278
-
279
- These models were trained on a dataset of text data that includes a wide variety
280
- of sources. The 27B model was trained with 14 trillion tokens, the 12B model was
281
- trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens,
282
- the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The
283
- knowledge cutoff date for the training data was August 2024. Here are the key
284
- components:
285
-
286
- - Web Documents: A diverse collection of web text ensures the model is
287
- exposed to a broad range of linguistic styles, topics, and vocabulary. The
288
- training dataset includes content in over 140 languages.
289
- - Code: Exposing the model to code helps it to learn the syntax and
290
- patterns of programming languages, which improves its ability to generate
291
- code and understand code-related questions.
292
- - Mathematics: Training on mathematical text helps the model learn logical
293
- reasoning, symbolic representation, and to address mathematical queries.
294
- - Images: A wide range of images enables the model to perform image
295
- analysis and visual data extraction tasks.
296
-
297
- The combination of these diverse data sources is crucial for training a powerful
298
- multimodal model that can handle a wide variety of different tasks and data
299
- formats.
300
-
301
- ### Data Preprocessing
302
-
303
- Here are the key data cleaning and filtering methods applied to the training
304
- data:
305
-
306
- - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering
307
- was applied at multiple stages in the data preparation process to ensure
308
- the exclusion of harmful and illegal content.
309
- - Sensitive Data Filtering: As part of making Gemma pre-trained models
310
- safe and reliable, automated techniques were used to filter out certain
311
- personal information and other sensitive data from training sets.
312
- - Additional methods: Filtering based on content quality and safety in
313
- line with [our policies][safety-policies].
314
-
315
- ## Implementation Information
316
-
317
- Details about the model internals.
318
-
319
- ### Hardware
320
-
321
- Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p,
322
- TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant
323
- computational power. TPUs, designed specifically for matrix operations common in
324
- machine learning, offer several advantages in this domain:
325
-
326
- - Performance: TPUs are specifically designed to handle the massive
327
- computations involved in training VLMs. They can speed up training
328
- considerably compared to CPUs.
329
- - Memory: TPUs often come with large amounts of high-bandwidth memory,
330
- allowing for the handling of large models and batch sizes during training.
331
- This can lead to better model quality.
332
- - Scalability: TPU Pods (large clusters of TPUs) provide a scalable
333
- solution for handling the growing complexity of large foundation models.
334
- You can distribute training across multiple TPU devices for faster and more
335
- efficient processing.
336
- - Cost-effectiveness: In many scenarios, TPUs can provide a more
337
- cost-effective solution for training large models compared to CPU-based
338
- infrastructure, especially when considering the time and resources saved
339
- due to faster training.
340
- - These advantages are aligned with
341
- [Google's commitments to operate sustainably][sustainability].
342
-
343
- ### Software
344
-
345
- Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
346
-
347
- JAX allows researchers to take advantage of the latest generation of hardware,
348
- including TPUs, for faster and more efficient training of large models. ML
349
- Pathways is Google's latest effort to build artificially intelligent systems
350
- capable of generalizing across multiple tasks. This is especially suitable for
351
- foundation models, including large language models like these ones.
352
-
353
- Together, JAX and ML Pathways are used as described in the
354
- [paper about the Gemini family of models][gemini-2-paper]; *"the 'single
355
- controller' programming model of Jax and Pathways allows a single Python
356
- process to orchestrate the entire training run, dramatically simplifying the
357
- development workflow."*
358
-
359
- ## Evaluation
360
-
361
- Model evaluation metrics and results.
362
-
363
- ### Benchmark Results
364
-
365
- These models were evaluated against a large collection of different datasets and
366
- metrics to cover different aspects of text generation. Evaluation results marked
367
- with **IT** are for instruction-tuned models. Evaluation results marked with
368
- **PT** are for pre-trained models.
369
-
370
- #### Gemma 3 270M
371
-
372
- | **Benchmark** | **n-shot** | **Gemma 3 PT 270M** |
373
- | :------------------------ | :-----------: | ------------------: |
374
- | [HellaSwag][hellaswag] | 10-shot | 40.9 |
375
- | [BoolQ][boolq] | 0-shot | 61.4 |
376
- | [PIQA][piqa] | 0-shot | 67.7 |
377
- | [TriviaQA][triviaqa] | 5-shot | 15.4 |
378
- | [ARC-c][arc] | 25-shot | 29.0 |
379
- | [ARC-e][arc] | 0-shot | 57.7 |
380
- | [WinoGrande][winogrande] | 5-shot | 52.0 |
381
-
382
- [hellaswag]: https://arxiv.org/abs/1905.07830
383
- [boolq]: https://arxiv.org/abs/1905.10044
384
- [piqa]: https://arxiv.org/abs/1911.11641
385
- [triviaqa]: https://arxiv.org/abs/1705.03551
386
- [arc]: https://arxiv.org/abs/1911.01547
387
- [winogrande]: https://arxiv.org/abs/1907.10641
388
-
389
- | **Benchmark** | **n-shot** | **Gemma 3 IT 270m** |
390
- | :------------------------ | :-----------: | ------------------: |
391
- | [HellaSwag][hellaswag] | 0-shot | 37.7 |
392
- | [PIQA][piqa] | 0-shot | 66.2 |
393
- | [ARC-c][arc] | 0-shot | 28.2 |
394
- | [WinoGrande][winogrande] | 0-shot | 52.3 |
395
- | [BIG-Bench Hard][bbh] | few-shot | 26.7 |
396
- | [IF Eval][ifeval] | 0-shot | 51.2 |
397
-
398
- [hellaswag]: https://arxiv.org/abs/1905.07830
399
- [piqa]: https://arxiv.org/abs/1911.11641
400
- [arc]: https://arxiv.org/abs/1911.01547
401
- [winogrande]: https://arxiv.org/abs/1907.10641
402
- [bbh]: https://paperswithcode.com/dataset/bbh
404
- [ifeval]: https://arxiv.org/abs/2311.07911
405
-
406
- #### Gemma 3 1B, 4B, 12B & 27B
407
-
408
- ##### Reasoning and factuality
409
-
410
- | Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
411
- |--------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
412
- | [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 |
413
- | [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 |
414
- | [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 |
415
- | [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 |
416
- | [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 |
417
- | [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 |
418
-
419
- | Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
420
- | ------------------------------ |----------|:--------------:|:-------------:|:--------------:|:--------------:|
421
- | [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
422
- | [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
423
- | [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
424
- | [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
425
- | [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
426
- | [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
427
- | [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
428
- | [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
429
- | [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
430
- | [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
431
- | [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |
432
-
433
- [gpqa]: https://arxiv.org/abs/2311.12022
434
- [simpleqa]: https://arxiv.org/abs/2411.04368
435
- [facts-grdg]: https://goo.gle/FACTS_paper
436
- [bbeh]: https://github.com/google-deepmind/bbeh
437
- [ifeval]: https://arxiv.org/abs/2311.07911
438
- [hellaswag]: https://arxiv.org/abs/1905.07830
439
- [boolq]: https://arxiv.org/abs/1905.10044
440
- [piqa]: https://arxiv.org/abs/1911.11641
441
- [socialiqa]: https://arxiv.org/abs/1904.09728
442
- [triviaqa]: https://arxiv.org/abs/1705.03551
443
- [naturalq]: https://github.com/google-research-datasets/natural-questions
444
- [arc]: https://arxiv.org/abs/1911.01547
445
- [winogrande]: https://arxiv.org/abs/1907.10641
446
- [bbh]: https://paperswithcode.com/dataset/bbh
447
- [drop]: https://arxiv.org/abs/1903.00161
448
-
449
- ##### STEM and code
450
-
451
- | Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
452
- |----------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
453
- | [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 |
454
- | [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 |
455
- | [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 |
456
- | [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 |
457
- | HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 |
458
- | [MBPP][mbpp] | 3-shot | 35.2 | 63.2 | 73.0 | 74.4 |
459
- | [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 |
460
- | [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 |
461
- | [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 |
462
-
463
- | Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
464
- | ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
465
- | [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
466
- | [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
467
- | [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
468
- | [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
469
- | [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
470
- | [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
471
- | [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
472
- | [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |
473
-
474
- [mmlu]: https://arxiv.org/abs/2009.03300
475
- [agieval]: https://arxiv.org/abs/2304.06364
476
- [math]: https://arxiv.org/abs/2103.03874
477
- [gsm8k]: https://arxiv.org/abs/2110.14168
478
- [gpqa]: https://arxiv.org/abs/2311.12022
479
- [mbpp]: https://arxiv.org/abs/2108.07732
480
- [humaneval]: https://arxiv.org/abs/2107.03374
481
- [lcb]: https://arxiv.org/abs/2403.07974
482
- [bird-sql]: https://arxiv.org/abs/2305.03111
483
- [nat2code]: https://arxiv.org/abs/2405.04520
484
-
485
- ##### Multilingual
486
-
487
- | Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
488
- |--------------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
489
- | [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 |
490
- | [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 |
491
- | [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 |
492
-
493
- | Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
494
- | ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
495
- | [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
496
- | [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
497
- | [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
498
- | [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
499
- | [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
500
- | [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
501
- | [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |
502
-
503
- [mgsm]: https://arxiv.org/abs/2210.03057
504
- [flores]: https://arxiv.org/abs/2106.03193
505
- [xquad]: https://arxiv.org/abs/1910.11856v3
506
- [global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
507
- [wmt24pp]: https://arxiv.org/abs/2502.12404v1
508
- [eclektic]: https://arxiv.org/abs/2502.21228
509
- [indicgenbench]: https://arxiv.org/abs/2404.16816
510
-
511
- ##### Multimodal
512
-
513
- | Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
514
- |-----------------------------------|:-------------:|:--------------:|:--------------:|
515
- | [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 |
516
- | [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 |
517
- | [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 |
518
- | [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 |
519
- | [AI2D][ai2d] | 74.8 | 84.2 | 84.5 |
520
- | [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 |
521
- | [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 |
522
- | [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 |
523
-
524
- | Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
525
- | ------------------------------ |:-------------:|:--------------:|:--------------:|
526
- | [COCOcap][coco-cap] | 102 | 111 | 116 |
527
- | [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
528
- | [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
529
- | [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
530
- | [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
531
- | [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
532
- | [ReMI][remi] | 27.3 | 38.5 | 44.8 |
533
- | [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
534
- | [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
535
- | [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
536
- | [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
537
- | [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
538
- | [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
539
- | [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
540
- | [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |
541
-
542
- [coco-cap]: https://cocodataset.org/#home
543
- [docvqa]: https://www.docvqa.org/
544
- [info-vqa]: https://arxiv.org/abs/2104.12756
545
- [mmmu]: https://arxiv.org/abs/2311.16502
546
- [textvqa]: https://textvqa.org/
547
- [realworldqa]: https://paperswithcode.com/dataset/realworldqa
548
- [remi]: https://arxiv.org/html/2406.09175v1
549
- [ai2d]: https://allenai.org/data/diagrams
550
- [chartqa]: https://arxiv.org/abs/2203.10244
551
- [vqav2]: https://visualqa.org/index.html
552
- [blinkvqa]: https://arxiv.org/abs/2404.12390
553
- [okvqa]: https://okvqa.allenai.org/
554
- [tallyqa]: https://arxiv.org/abs/1810.12440
555
- [ss-vqa]: https://arxiv.org/abs/1908.02660
556
- [countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
557
- [mathvista]: https://arxiv.org/abs/2310.02255
558
-
559
- ## Ethics and Safety
560
-
561
- Ethics and safety evaluation approach and results.
562
-
563
- ### Evaluation Approach
564
-
565
- Our evaluation methods include structured evaluations and internal red-teaming
566
- testing of relevant content policies. Red-teaming was conducted by a number of
567
- different teams, each with different goals and human evaluation metrics. These
568
- models were evaluated against a number of different categories relevant to
569
- ethics and safety, including:
570
-
571
- - **Child Safety**: Evaluation of text-to-text and image-to-text prompts
572
- covering child safety policies, including child sexual abuse and
573
- exploitation.
574
- - **Content Safety**: Evaluation of text-to-text and image-to-text prompts
575
- covering safety policies, including harassment, violence and gore, and hate
576
- speech.
577
- - **Representational Harms**: Evaluation of text-to-text and image-to-text
578
- prompts covering safety policies including bias, stereotyping, and harmful
579
- associations or inaccuracies.
580
-
581
- In addition to development level evaluations, we conduct "assurance
582
- evaluations" which are our 'arm's-length' internal evaluations for responsibility
583
- governance decision making. They are conducted separately from the model
584
- development team, to inform decision making about release. High level findings
585
- are fed back to the model team, but prompt sets are held-out to prevent
586
- overfitting and preserve the results' ability to inform decision making.
587
- Assurance evaluation results are reported to our Responsibility & Safety Council
588
- as part of release review.
589
-
590
- ### Evaluation Results
591
-
592
- For all areas of safety testing, we saw major improvements in the categories of
593
- child safety, content safety, and representational harms relative to previous
594
- Gemma models. All testing was conducted without safety filters to evaluate the
595
- model capabilities and behaviors. For both text-to-text and image-to-text, and
596
- across all model sizes, the model produced minimal policy violations, and showed
597
- significant improvements over previous Gemma models' performance with respect
598
- to ungrounded inferences. A limitation of our evaluations was they included only
599
- English language prompts.
600
-
601
- ## Usage and Limitations
602
-
603
- These models have certain limitations that users should be aware of.
604
-
605
- ### Intended Usage
606
-
607
- Open vision-language models (VLMs) have a wide range of applications
608
- across various industries and domains. The following list of potential uses is
609
- not comprehensive. The purpose of this list is to provide contextual information
610
- about the possible use-cases that the model creators considered as part of model
611
- training and development.
612
-
613
- - Content Creation and Communication
614
- - Text Generation: These models can be used to generate creative text
615
- formats such as poems, scripts, code, marketing copy, and email drafts.
616
- - Chatbots and Conversational AI: Power conversational interfaces
617
- for customer service, virtual assistants, or interactive applications.
618
- - Text Summarization: Generate concise summaries of a text corpus,
619
- research papers, or reports.
620
- - Image Data Extraction: These models can be used to extract,
621
- interpret, and summarize visual data for text communications.
622
- - Research and Education
623
- - Natural Language Processing (NLP) and VLM Research: These
624
- models can serve as a foundation for researchers to experiment with VLM
625
- and NLP techniques, develop algorithms, and contribute to the
626
- advancement of the field.
627
- - Language Learning Tools: Support interactive language learning
628
- experiences, aiding in grammar correction or providing writing practice.
629
- - Knowledge Exploration: Assist researchers in exploring large
630
- bodies of text by generating summaries or answering questions about
631
- specific topics.
632
-
633
- ### Limitations
634
-
635
- - Training Data
636
- - The quality and diversity of the training data significantly
637
- influence the model's capabilities. Biases or gaps in the training data
638
- can lead to limitations in the model's responses.
639
- - The scope of the training dataset determines the subject areas
640
- the model can handle effectively.
641
- - Context and Task Complexity
642
- - Models are better at tasks that can be framed with clear
643
- prompts and instructions. Open-ended or highly complex tasks might be
644
- challenging.
645
- - A model's performance can be influenced by the amount of context
646
- provided (longer context generally leads to better outputs, up to a
647
- certain point).
648
- - Language Ambiguity and Nuance
649
- - Natural language is inherently complex. Models might struggle
650
- to grasp subtle nuances, sarcasm, or figurative language.
651
- - Factual Accuracy
652
- - Models generate responses based on information they learned
653
- from their training datasets, but they are not knowledge bases. They
654
- may generate incorrect or outdated factual statements.
655
- - Common Sense
656
- - Models rely on statistical patterns in language. They might
657
- lack the ability to apply common sense reasoning in certain situations.
658
-
659
- ### Ethical Considerations and Risks
660
-
661
- The development of vision-language models (VLMs) raises several ethical
662
- concerns. In creating an open model, we have carefully considered the following:
663
-
664
- - Bias and Fairness
665
- - VLMs trained on large-scale, real-world text and image data can
666
- reflect socio-cultural biases embedded in the training material. These
667
- models underwent careful scrutiny, with input data pre-processing described
668
- and posterior evaluations reported in this card.
669
- - Misinformation and Misuse
670
- - VLMs can be misused to generate text that is false, misleading,
671
- or harmful.
672
- - Guidelines are provided for responsible use with the model, see the
673
- [Responsible Generative AI Toolkit][rai-toolkit].
674
- - Transparency and Accountability:
675
- - This model card summarizes details on the models' architecture,
676
- capabilities, limitations, and evaluation processes.
677
- - A responsibly developed open model offers the opportunity to
678
- share innovation by making VLM technology accessible to developers and
679
- researchers across the AI ecosystem.
680
-
681
- Risks identified and mitigations:
682
-
683
- - **Perpetuation of biases**: It's encouraged to perform continuous
684
- monitoring (using evaluation metrics, human review) and the exploration of
685
- de-biasing techniques during model training, fine-tuning, and other use
686
- cases.
687
- - **Generation of harmful content**: Mechanisms and guidelines for content
688
- safety are essential. Developers are encouraged to exercise caution and
689
- implement appropriate content safety safeguards based on their specific
690
- product policies and application use cases.
691
- - **Misuse for malicious purposes**: Technical limitations and developer
692
- and end-user education can help mitigate against malicious applications of
693
- VLMs. Educational resources and reporting mechanisms for users to flag
694
- misuse are provided. Prohibited uses of Gemma models are outlined in the
695
- [Gemma Prohibited Use Policy][prohibited-use].
696
- - **Privacy violations**: Models were trained on data filtered for removal
697
- of certain personal information and other sensitive data. Developers are
698
- encouraged to adhere to privacy regulations with privacy-preserving
699
- techniques.
700
-
701
- ### Benefits
702
-
703
- At the time of release, this family of models provides high-performance open
704
- vision-language model implementations designed from the ground up for
705
- responsible AI development compared to similarly sized models.
706
-
707
- Using the benchmark evaluation metrics described in this document, these models
708
- have been shown to provide superior performance to other, comparably-sized open model
709
- alternatives.
710
-
711
- [g3-tech-report]: https://arxiv.org/abs/2503.19786
712
- [rai-toolkit]: https://ai.google.dev/responsible
713
- [kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
714
- [vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
715
- [terms]: https://ai.google.dev/gemma/terms
716
- [safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
717
- [prohibited-use]: https://ai.google.dev/gemma/prohibited_use_policy
718
- [tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
719
- [sustainability]: https://sustainability.google/operating-sustainably/
720
- [jax]: https://github.com/jax-ml/jax
721
- [ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
723
- [gemini-2-paper]: https://arxiv.org/abs/2312.11805
 
1
  ---
2
+ license: cc-by-4.0
3
+ datasets:
4
+ - facebook/multilingual_librispeech
5
+ - parler-tts/libritts_r_filtered
6
+ language:
7
+ - en
8
+ pipeline_tag: text-to-speech
 
 
 
9
  ---
10
+ <style>
11
+ table {
12
+ border-collapse: collapse;
13
+ width: 100%;
14
+ margin-bottom: 20px;
15
+ }
16
+ th, td {
17
+ border: 1px solid #ddd;
18
+ padding: 8px;
19
+ text-align: center;
20
+ }
21
+ .best {
22
+ font-weight: bold;
23
+ text-decoration: underline;
24
+ }
25
+ .badges {
26
+ display: flex;
27
+ justify-content: center;
28
+ gap: 10px;
29
+ flex-wrap: wrap;
30
+ margin-top: 10px;
31
+ }
32
+ .badge {
33
+ text-decoration: none;
34
+ display: inline-block;
35
+ padding: 4px 8px;
36
+ border-radius: 5px;
37
+ color: #fff;
38
+ font-size: 12px;
39
+ font-weight: bold;
40
+ width: 250px;
41
+ }
42
+ .badge-hf-blue {
43
+ background-color: #767b81;
44
+ }
45
+ .badge-hf-pink {
46
+ background-color: #7b768a;
47
+ }
48
+ .badge-github {
49
+ background-color: #2c2b2b;
50
+ }
51
+ </style>
52
+
53
+ <div style="text-align: center; margin: 20px auto; padding: 10px; border: 2px solid #ddd; border-radius: 10px;">
54
+ <div style="margin-bottom: 20px;">
55
+ <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
56
+ <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a>
57
+ <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">🤝 Join our Discord</a>
58
+ <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
59
+ </div>
60
+ <div class="badges">
61
+ <a href="https://huggingface.co/OuteAI/OuteTTS-0.1-350M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.1 350M</a>
62
+ <a href="https://huggingface.co/OuteAI/OuteTTS-0.1-350M-GGUF" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.1 350M GGUF</a>
63
+ <a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.1-350M-Demo" target="_blank" class="badge badge-hf-pink">🤗 Hugging Face - Demo</a>
64
+ <a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a>
65
  </div>
 
66
  </div>
67
 
68
+ ## Model Description
69
 
70
+ > [!IMPORTANT]
71
+ > A newer version of this model is available: [OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M)
72
 
73
+ OuteTTS-0.1-350M is a novel text-to-speech synthesis model that leverages pure language modeling without external adapters or complex architectures. Built upon the LLaMa architecture using our Oute3-350M-DEV base model, it demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.
74
 
75
+ ## Key Features
76
 
77
+ - Pure language modeling approach to TTS
78
+ - Voice cloning capabilities
79
+ - LLaMa architecture
80
+ - Compatible with llama.cpp and GGUF format
81
 
82
+ ## Technical Details
83
 
84
+ The model utilizes a three-step approach to audio processing:
85
+ 1. Audio tokenization using WavTokenizer (processing 75 tokens per second)
86
+ 2. CTC forced alignment for precise word-to-audio token mapping
87
+ 3. Structured prompt creation following the format:
88
+ ```
89
+ [full transcription]
90
+ [word] [duration token] [audio tokens]
91
  ```
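
The prompt layout above can be sketched in a few lines of Python. This is only an illustration of the structure (one transcription line, then one line per aligned word), not the tokenizer or prompt builder shipped with `outetts`. The `AlignedWord` class, the `[t=...]` duration spelling, and the integer audio codes are hypothetical placeholders. The only value taken from this card is the WavTokenizer rate of 75 tokens per second.

```python
from dataclasses import dataclass
from typing import List

# WavTokenizer rate stated in this card.
TOKENS_PER_SECOND = 75


@dataclass
class AlignedWord:
    """One word after forced alignment (hypothetical container)."""
    word: str
    duration_s: float
    audio_tokens: List[int]


def expected_token_count(duration_s: float) -> int:
    """Approximate number of audio tokens for a span of the given length."""
    return round(duration_s * TOKENS_PER_SECOND)


def build_prompt(words: List[AlignedWord]) -> str:
    """Emit the full transcription, then one line per word with its
    duration marker and audio codes. The marker spelling is a placeholder."""
    lines = [" ".join(w.word for w in words)]
    for w in words:
        codes = " ".join(str(t) for t in w.audio_tokens)
        lines.append(f"{w.word} [t={w.duration_s:.2f}] {codes}")
    return "\n".join(lines)
```

Assuming `max_length` counts tokens, the same rate gives a rough budget: 4096 tokens at 75 tokens per second is at most about 54 seconds of audio, and less once the text portion of the prompt is counted.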
92
 
93
+ ## Technical Blog
94
+ https://www.outeai.com/blog/OuteTTS-0.1-350M
95
+
96
+ ## Limitations
97
+ As an experimental v0.1 release, the model has some known issues:
98
+
99
+ - Vocabulary constraints due to training data limitations
100
+ - String-only input support
101
+ - Given its compact 350M parameter size, the model may frequently alter, insert, or omit words, leading to variations in output quality.
102
+ - Variable temperature sensitivity depending on use case
103
+ - Performs best with shorter sentences, as accuracy may decrease with longer inputs
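
Since accuracy may drop on longer inputs, one possible workaround is to split the text into sentences and synthesize each one separately. The sketch below is an assumption about how a caller might do that, not a feature of `outetts`. `split_sentences` and `synthesize_in_chunks` are hypothetical helpers, and the `generate` arguments mirror the usage example in this card for `outetts==0.1.7`.

```python
import re


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_in_chunks(interface, text: str, temperature: float = 0.1) -> list:
    """Call interface.generate once per sentence and collect the outputs.

    `interface` is expected to behave like the InterfaceHF / InterfaceGGUF
    objects shown in the Usage section of this card.
    """
    outputs = []
    for sentence in split_sentences(text):
        outputs.append(
            interface.generate(
                text=sentence,
                temperature=temperature,
                repetition_penalty=1.1,
                max_length=4096,
            )
        )
    return outputs
```

The regular expression is deliberately simple and will mis-split abbreviations such as "e.g."; a proper sentence tokenizer would be a reasonable replacement.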
104
+
105
+ ### Speech Samples
106
+
107
+ Listen to samples generated by OuteTTS-0.1-350M:
108
+
109
+ <div style="margin-top: 20px;">
110
+ <table style="width: 100%; border-collapse: collapse;">
111
+ <thead>
112
+ <tr>
113
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Input</th>
114
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Audio</th>
115
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Notes</th>
116
+ </tr>
117
+ </thead>
118
+ <tbody>
119
+ <tr>
120
+ <td style="border: 1px solid #ddd; padding: 8px;">Hello, I can speak pretty well, but sometimes I make some mistakes.</td>
121
+ <td style="border: 1px solid #ddd; padding: 8px;">
122
+ <audio controls style="width: 100%;">
123
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav" type="audio/wav">
124
+ Your browser does not support the audio element.
125
+ </audio>
126
+ <audio controls style="width: 100%;">
127
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/1.wav" type="audio/wav">
128
+ Your browser does not support the audio element.
129
+ </audio>
130
+ </td>
131
+ <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
132
+ </tr>
133
+ <tr>
134
+ <td style="border: 1px solid #ddd; padding: 8px;">Once upon a time, there was a</td>
135
+ <td style="border: 1px solid #ddd; padding: 8px;">
136
+ <audio controls style="width: 100%;">
137
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/3.wav" type="audio/wav">
138
+ Your browser does not support the audio element.
139
+ </audio>
140
+ </td>
141
+ <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
142
+ </tr>
143
+ <tr>
144
+ <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
145
+ <td style="border: 1px solid #ddd; padding: 8px;">
146
+ <audio controls style="width: 100%;">
147
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/6.wav" type="audio/wav">
148
+ Your browser does not support the audio element.
149
+ </audio>
150
+ </td>
151
+ <td style="border: 1px solid #ddd; padding: 8px;">Using the Q4_K_M quantized model. (temperature=0.7, repetition_penalty=1.1)</td>
152
+ </tr>
153
+ <tr>
154
+ <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
155
+ <td style="border: 1px solid #ddd; padding: 8px;">
156
+ <audio controls style="width: 100%;">
157
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/4.wav" type="audio/wav">
158
+ Your browser does not support the audio element.
159
+ </audio>
160
+ </td>
161
+ <td style="border: 1px solid #ddd; padding: 8px;">The model partially failed to follow the input text. (temperature=0.1, repetition_penalty=1.1) </td>
162
+ </tr>
163
+ <tr>
164
+ <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
165
+ <td style="border: 1px solid #ddd; padding: 8px;">
166
+ <audio controls style="width: 100%;">
167
+ <source src="https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/5.wav" type="audio/wav">
168
+ Your browser does not support the audio element.
169
+ </audio>
170
+ </td>
171
+ <td style="border: 1px solid #ddd; padding: 8px;">In this case, raising the temperature from 0.1 to 0.7 produces more consistent output. (temperature=0.7, repetition_penalty=1.1)</td>
172
+ </tr>
173
+ </tbody>
174
+ </table>
175
+ </div>
176
 
177
+ ## Installation
 
 
 
 
178
 
179
+ [![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)
 
 
 
 
180
 
181
+ ```bash
182
+ pip install outetts
183
+ ```
184
 
185
+ ## Usage
186
 
187
+ > [!WARNING]
188
+ > The example below works with the older `outetts` version (`==0.1.7`). The newer version (`>=0.2.0`) changes the interface. Please refer to the [GitHub Usage Example](https://github.com/edwko/OuteTTS?tab=readme-ov-file#usage) for updated examples.
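
Because the example depends on the installed package version, a quick check before importing can save some confusion. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the version parsing assumes a plain `major.minor.patch` string such as `0.1.7`.

```python
from importlib.metadata import PackageNotFoundError, version


def uses_legacy_interface(version_string: str) -> bool:
    """True when the version is older than 0.2.0, the range this card's
    usage example targets. Assumes numeric major and minor components."""
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) < (0, 2)


def installed_outetts_version():
    """Return the installed outetts version string, or None if absent."""
    try:
        return version("outetts")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_outetts_version()
    if found is None:
        print("outetts is not installed")
    elif uses_legacy_interface(found):
        print(f"outetts {found}: the example in this card should apply")
    else:
        print(f"outetts {found}: see the GitHub usage example instead")
```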
189
 
190
+ ### Interface Usage
191
+ ```python
+ from outetts.v0_1.interface import InterfaceHF, InterfaceGGUF
+
+ # Initialize the interface with the Hugging Face model
+ interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")
+
+ # Or initialize the interface with a GGUF model
+ # interface = InterfaceGGUF("path/to/model.gguf")
+
+ # Generate TTS output
+ # Without a speaker reference, the model generates speech with random speaker characteristics
+ output = interface.generate(
+     text="Hello, am I working?",
+     temperature=0.1,
+     repetition_penalty=1.1,
+     max_length=4096
+ )
+
+ # Play the generated audio
+ output.play()
+
+ # Save the generated audio to a file
+ output.save("output.wav")
+ ```
 
 
 
 
 
215
 
216
+ ### Voice Cloning
217
+ ```python
+ # Reuses `interface` from the Interface Usage example above.
+ # Create a custom speaker from an audio file
+ speaker = interface.create_speaker(
+     "path/to/reference.wav",
+     "reference text matching the audio"
+ )
+
+ # Generate TTS with the custom voice
+ output = interface.generate(
+     text="This is a cloned voice speaking",
+     speaker=speaker,
+     temperature=0.1,
+     repetition_penalty=1.1,
+     max_length=4096
+ )
  ```
233
 
234
+ ## Model Details
235
+ - **Model Type:** LLaMa-based language model
236
+ - **Size:** 350M parameters
237
+ - **Language Support:** English
238
+ - **License:** CC BY 4.0
239
+ - **Speech Datasets Used:**
240
+ - LibriTTS-R (CC BY 4.0)
241
+ - Multilingual LibriSpeech (MLS) (CC BY 4.0)
242
+
243
+ ## Future Improvements
244
+ - Scaling up parameters and training data
245
+ - Exploring alternative alignment methods for better character compatibility
246
+ - Potential expansion into speech-to-speech assistant models
247
+
248
+ ## Credits
249
+
250
+ - WavTokenizer: https://github.com/jishengpeng/WavTokenizer
251
+ - CTC Forced Alignment: https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html
252
+
253
+ ## Disclaimer
254
+ By using this model, you acknowledge that you understand and assume the risks associated with its use.
255
+ You are solely responsible for ensuring compliance with all applicable laws and regulations.
256
+ We disclaim any liability for problems arising from the use of this open-source model, including but not limited to direct, indirect, incidental, consequential, or punitive damages.
257
+ We make no warranties, express or implied, regarding the model's performance, accuracy, or fitness for a particular purpose. Your use of this model is at your own risk, and you agree to hold harmless and indemnify us, our affiliates, and our contributors from any claims, damages, or expenses arising from your use of the model.

model.gguf CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b1baabd6b729e4041822220d3e648e00d99cac5df86b10dffb77bcccf0688e39
3
- size 253115424
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3a8e13d7e6e44907ce6cb414c8ea64664e3b1c38ff6f30c191b8660890ce4f3c
3
+ size 275778592