cocktailpeanut committed
Commit be35f06 · verified · 1 Parent(s): 244aedd
Files changed (6)
  1. README.md +92 -3
  2. codec.pth +3 -0
  3. config.json +32 -0
  4. model.pth +3 -0
  5. special_tokens.json +0 -0
  6. tokenizer.tiktoken +0 -0
README.md CHANGED
@@ -1,3 +1,92 @@
- ---
- license: apache-2.0
- ---
+ ---
+ tags:
+ - text-to-speech
+ license: cc-by-nc-sa-4.0
+ language:
+ - zh
+ - en
+ - de
+ - ja
+ - fr
+ - es
+ - ko
+ - ar
+ - nl
+ - ru
+ - it
+ - pl
+ - pt
+ pipeline_tag: text-to-speech
+ inference: false
+ extra_gated_prompt: >-
+   You agree to not use the model to generate content that violates DMCA or local
+   laws.
+ extra_gated_fields:
+   Country: country
+   Specific date: date_picker
+   I agree to use this model for non-commercial use ONLY: checkbox
+ ---
+
+
+ # OpenAudio S1
+
+ **OpenAudio S1** is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages.
+
+ Supported languages:
+ - English (en)
+ - Chinese (zh)
+ - Japanese (ja)
+ - German (de)
+ - French (fr)
+ - Spanish (es)
+ - Korean (ko)
+ - Arabic (ar)
+ - Russian (ru)
+ - Dutch (nl)
+ - Italian (it)
+ - Polish (pl)
+ - Portuguese (pt)
+
+ Please refer to [Fish Speech GitHub](https://github.com/fishaudio/fish-speech) for more info.
+ A demo is available at [Fish Audio Playground](https://fish.audio).
+ Visit the [OpenAudio website](https://openaudio.com) for the blog and tech report.
+
+ ## Emotion and Tone Support
+
+ OpenAudio S1 supports a variety of emotion, tone, and special markers to enhance speech synthesis:
+
+ **1. Emotional markers:**
+ (angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)
+
+ **2. Tone markers:**
+ (in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+
+ **3. Special markers:**
+ (laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
+
+ **Special markers with corresponding onomatopoeia:**
+ - Laughing: Ha,ha,ha
+ - Chuckling: Hmm,hmm
+
+ ## Model Variants and Performance
+
+ OpenAudio S1 includes the following models:
+ - **S1 (4B, proprietary):** The full-sized model.
+ - **S1-mini (0.5B):** A distilled version of S1.
+
+ Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
+
+ **Seed TTS Eval Metrics (English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM):**
+
+ - **S1:**
+   - WER (Word Error Rate): **0.008**
+   - CER (Character Error Rate): **0.004**
+   - Distance: **0.332**
+ - **S1-mini:**
+   - WER (Word Error Rate): **0.011**
+   - CER (Character Error Rate): **0.005**
+   - Distance: **0.380**
+
+ ## License
+
+ This model is licensed under CC-BY-NC-SA-4.0, which permits non-commercial use only.
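The marker lists in the README diff above can be turned into a small input-building helper. The sketch below is illustrative only: the README does not say where a marker must sit relative to the text it affects, so placing markers before the sentence is an assumption, and `add_markers` is a hypothetical helper, not part of Fish Speech. Only a subset of the emotional markers is listed for brevity.

```python
# Marker vocabulary copied from the README (emotional markers: subset only).
EMOTION_MARKERS = {"angry", "sad", "excited", "surprised", "joyful", "sarcastic"}
TONE_MARKERS = {"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"}
SPECIAL_MARKERS = {
    "laughing", "chuckling", "sobbing", "crying loudly", "sighing", "panting",
    "groaning", "crowd laughing", "background laughter", "audience laughing",
}
KNOWN_MARKERS = EMOTION_MARKERS | TONE_MARKERS | SPECIAL_MARKERS


def add_markers(text: str, *markers: str) -> str:
    """Prefix `text` with parenthesized markers, rejecting unknown ones.

    Assumption: markers go before the text they modify.
    """
    unknown = [m for m in markers if m not in KNOWN_MARKERS]
    if unknown:
        raise ValueError(f"unknown markers: {unknown}")
    prefix = " ".join(f"({m})" for m in markers)
    return f"{prefix} {text}" if prefix else text


line = add_markers("I can't believe you did that!", "angry", "shouting")
print(line)  # (angry) (shouting) I can't believe you did that!
```

Validating against a fixed set catches misspelled markers before synthesis; whether the model tolerates markers outside the documented list is not stated in the README.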
codec.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74fc41c5a7151c6f350af8bd7e5d6e3accfcc7f3dfbfac23afd35af07052bb2f
+ size 1871099728
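The three added lines are a Git LFS pointer: the repository stores only the spec version, a SHA-256 object ID, and the byte size, while the weights themselves live in LFS storage. As a rough sketch of how a downloaded file could be checked against such a pointer, the helpers below parse the pointer text and compare size and hash. `parse_lfs_pointer` and `matches_pointer` are hypothetical names introduced here, not part of any Git LFS tooling.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(pointer_text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and hash."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a wrong-sized file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so multi-gigabyte checkpoints are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


CODEC_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:74fc41c5a7151c6f350af8bd7e5d6e3accfcc7f3dfbfac23afd35af07052bb2f
size 1871099728
"""

pointer = parse_lfs_pointer(CODEC_POINTER)
print(pointer["algo"], pointer["size"])
```

With a local copy of the file, `matches_pointer(Path("codec.pth"), pointer)` would report whether the download is complete and uncorrupted; the local path is an assumption about where the file was saved.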
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "attention_o_bias": false,
+   "attention_qk_norm": true,
+   "attention_qkv_bias": false,
+   "codebook_size": 4096,
+   "dim": 1024,
+   "dropout": 0.0,
+   "fast_attention_o_bias": false,
+   "fast_attention_qk_norm": false,
+   "fast_attention_qkv_bias": false,
+   "fast_dim": 1024,
+   "fast_head_dim": 64,
+   "fast_intermediate_size": 3072,
+   "fast_n_head": 16,
+   "fast_n_local_heads": 8,
+   "head_dim": 128,
+   "initializer_range": 0.03125,
+   "intermediate_size": 3072,
+   "max_seq_len": 8192,
+   "model_type": "dual_ar",
+   "n_fast_layer": 4,
+   "n_head": 16,
+   "n_layer": 28,
+   "n_local_heads": 8,
+   "norm_eps": 1e-06,
+   "norm_eps": 1e-06,
+   "num_codebooks": 10,
+   "rope_base": 1000000,
+   "scale_codebook_embeddings": true,
+   "tie_word_embeddings": false,
+   "use_gradient_checkpointing": true,
+   "vocab_size": 155776
+ }
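Some attention shapes can be derived from the fields in this config. The sketch below assumes that `n_local_heads` is the number of key/value heads (a grouped-query attention layout) and that the `fast_` prefix refers to a second, smaller transformer stack; both readings are inferred from the field names, not confirmed by this commit. `attention_shapes` is a hypothetical helper, and only the fields it needs are copied from config.json.

```python
import json

# Subset of config.json from this commit; values copied verbatim.
CONFIG_JSON = """
{
  "dim": 1024, "head_dim": 128, "n_head": 16, "n_local_heads": 8,
  "fast_dim": 1024, "fast_head_dim": 64, "fast_n_head": 16, "fast_n_local_heads": 8
}
"""


def attention_shapes(cfg: dict, prefix: str = "") -> dict:
    """Derive projection widths, assuming n_local_heads counts key/value heads."""
    n_head = cfg[f"{prefix}n_head"]
    n_kv_head = cfg[f"{prefix}n_local_heads"]
    head_dim = cfg[f"{prefix}head_dim"]
    if n_head % n_kv_head != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    return {
        "q_width": n_head * head_dim,        # total width of the query projection
        "kv_width": n_kv_head * head_dim,    # width of each of the K and V projections
        "queries_per_kv_head": n_head // n_kv_head,
    }


cfg = json.loads(CONFIG_JSON)
print("slow:", attention_shapes(cfg))
print("fast:", attention_shapes(cfg, prefix="fast_"))
```

Under these assumptions the main stack's query projection (16 × 128 = 2048) is wider than `dim` (1024), while the fast stack's (16 × 64 = 1024) equals `fast_dim`.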
model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e59be7dc6714040dce3cde1f41e730c2f0daa5339785b1cd3b60041208c35e6
+ size 1735122974
special_tokens.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.tiktoken ADDED
The diff for this file is too large to render. See raw diff