upload
- README.md +92 -3
- codec.pth +3 -0
- config.json +32 -0
- model.pth +3 -0
- special_tokens.json +0 -0
- tokenizer.tiktoken +0 -0
README.md
CHANGED
|
@@ -1,3 +1,92 @@
|
|
| 1 |
-
---
|
| 2 |
-
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- text-to-speech
|
| 4 |
+
license: cc-by-nc-sa-4.0
|
| 5 |
+
language:
|
| 6 |
+
- zh
|
| 7 |
+
- en
|
| 8 |
+
- de
|
| 9 |
+
- ja
|
| 10 |
+
- fr
|
| 11 |
+
- es
|
| 12 |
+
- ko
|
| 13 |
+
- ar
|
| 14 |
+
- nl
|
| 15 |
+
- ru
|
| 16 |
+
- it
|
| 17 |
+
- pl
|
| 18 |
+
- pt
|
| 19 |
+
pipeline_tag: text-to-speech
|
| 20 |
+
inference: false
|
| 21 |
+
extra_gated_prompt: >-
|
| 22 |
+
You agree not to use the model to generate content that violates the DMCA or local
|
| 23 |
+
laws.
|
| 24 |
+
extra_gated_fields:
|
| 25 |
+
Country: country
|
| 26 |
+
Specific date: date_picker
|
| 27 |
+
I agree to use this model for non-commercial use ONLY: checkbox
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
# OpenAudio S1
|
| 32 |
+
|
| 33 |
+
**OpenAudio S1** is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages.
|
| 34 |
+
|
| 35 |
+
Supported languages:
|
| 36 |
+
- English (en)
|
| 37 |
+
- Chinese (zh)
|
| 38 |
+
- Japanese (ja)
|
| 39 |
+
- German (de)
|
| 40 |
+
- French (fr)
|
| 41 |
+
- Spanish (es)
|
| 42 |
+
- Korean (ko)
|
| 43 |
+
- Arabic (ar)
|
| 44 |
+
- Russian (ru)
|
| 45 |
+
- Dutch (nl)
|
| 46 |
+
- Italian (it)
|
| 47 |
+
- Polish (pl)
|
| 48 |
+
- Portuguese (pt)
|
| 49 |
+
|
| 50 |
+
Please refer to [Fish Speech GitHub](https://github.com/fishaudio/fish-speech) for more information.
|
| 51 |
+
A demo is available at the [Fish Audio Playground](https://fish.audio).
|
| 52 |
+
Visit the [OpenAudio website](https://openaudio.com) for the blog and technical report.
|
| 53 |
+
|
| 54 |
+
## Emotion and Tone Support
|
| 55 |
+
|
| 56 |
+
OpenAudio S1 supports a variety of emotion, tone, and special markers to control speech synthesis:
|
| 57 |
+
|
| 58 |
+
**1. Emotional markers:**
|
| 59 |
+
(angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)
|
| 60 |
+
|
| 61 |
+
**2. Tone markers:**
|
| 62 |
+
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
|
| 63 |
+
|
| 64 |
+
**3. Special markers:**
|
| 65 |
+
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
|
| 66 |
+
|
| 67 |
+
**Special markers with corresponding onomatopoeia:**
|
| 68 |
+
- Laughing: Ha,ha,ha
|
| 69 |
+
- Chuckling: Hmm,hmm
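The markers above are plain parenthesized tags embedded in the input text. The following is a minimal sketch of building such a prompt in Python. The helper name `with_markers`, the choice to place markers at the start of the sentence, and the validation against a fixed set are assumptions made for illustration; the README does not specify where markers must appear, and the emotion set below is only a subset of the full list.

```python
# Subsets of the marker lists from the README; extend as needed.
EMOTION = {"angry", "sad", "excited", "surprised", "sarcastic"}
TONE = {"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"}
SPECIAL = {
    "laughing", "chuckling", "sobbing", "crying loudly", "sighing",
    "panting", "groaning", "crowd laughing", "background laughter",
    "audience laughing",
}
KNOWN = EMOTION | TONE | SPECIAL


def with_markers(text: str, *markers: str) -> str:
    """Prefix `text` with parenthesized markers (hypothetical helper).

    Placing markers before the sentence is an assumption, not a rule
    stated in the README.
    """
    unknown = [m for m in markers if m not in KNOWN]
    if unknown:
        raise ValueError(f"unknown markers: {unknown}")
    prefix = " ".join(f"({m})" for m in markers)
    return f"{prefix} {text}" if prefix else text


print(with_markers("I can't believe you did that.", "angry", "shouting"))
```

Rejecting unknown markers early is a design choice for this sketch: a misspelled tag would otherwise be passed to the model as literal text.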
|
| 70 |
+
|
| 71 |
+
## Model Variants and Performance
|
| 72 |
+
|
| 73 |
+
OpenAudio S1 includes the following models:
|
| 74 |
+
- **S1 (4B, proprietary):** The full-sized model.
|
| 75 |
+
- **S1-mini (0.5B):** A distilled version of S1.
|
| 76 |
+
|
| 77 |
+
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
|
| 78 |
+
|
| 79 |
+
**Seed TTS Eval Metrics (English, automatic evaluation; transcription with OpenAI gpt-4o-transcribe, speaker distance computed with Revai/pyannote-wespeaker-voxceleb-resnet34-LM):**
|
| 80 |
+
|
| 81 |
+
- **S1:**
|
| 82 |
+
- WER (Word Error Rate): **0.008**
|
| 83 |
+
- CER (Character Error Rate): **0.004**
|
| 84 |
+
- Distance: **0.332**
|
| 85 |
+
- **S1-mini:**
|
| 86 |
+
- WER (Word Error Rate): **0.011**
|
| 87 |
+
- CER (Character Error Rate): **0.005**
|
| 88 |
+
- Distance: **0.380**
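For readers unfamiliar with the metrics, WER and CER are edit distances between a reference text and a transcript of the synthesized audio, divided by the reference length. The sketch below shows only that definition in Python. It is not the evaluation pipeline used for the numbers above: the transcription step is omitted, and the normalization (lowercasing, whitespace splitting) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; normalization here is an illustrative assumption."""
    ref = reference.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.lower().split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over lowercased characters, spaces included."""
    ref = list(reference.lower())
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, list(hypothesis.lower())) / len(ref)


# One substituted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Read this way, a WER of 0.008 corresponds to roughly 8 word errors per 1,000 reference words.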
|
| 89 |
+
|
| 90 |
+
## License
|
| 91 |
+
|
| 92 |
+
This model is licensed under CC-BY-NC-SA-4.0, which permits non-commercial use only.
|
codec.pth
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:74fc41c5a7151c6f350af8bd7e5d6e3accfcc7f3dfbfac23afd35af07052bb2f
|
| 3 |
+
size 1871099728
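The three lines added for `codec.pth` are a Git LFS pointer, not the weights themselves: the pointer records the SHA-256 digest and byte size of the real file. A minimal Python sketch of parsing such a pointer and checking a downloaded file against it follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of any Git LFS tooling.

```python
import hashlib
import os

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:74fc41c5a7151c6f350af8bd7e5d6e3accfcc7f3dfbfac23afd35af07052bb2f
size 1871099728
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Check a local file's size and SHA-256 digest against the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


info = parse_lfs_pointer(POINTER)
print(f"{info['size'] / 2**30:.2f} GiB, {info['algo']}")
```

Comparing the size first is a cheap way to reject a truncated download before hashing the full file.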
|
config.json
ADDED
|
@@ -0,0 +1,32 @@
|
| 1 |
+
{
|
| 2 |
+
"attention_o_bias": false,
|
| 3 |
+
"attention_qk_norm": true,
|
| 4 |
+
"attention_qkv_bias": false,
|
| 5 |
+
"codebook_size": 4096,
|
| 6 |
+
"dim": 1024,
|
| 7 |
+
"dropout": 0.0,
|
| 8 |
+
"fast_attention_o_bias": false,
|
| 9 |
+
"fast_attention_qk_norm": false,
|
| 10 |
+
"fast_attention_qkv_bias": false,
|
| 11 |
+
"fast_dim": 1024,
|
| 12 |
+
"fast_head_dim": 64,
|
| 13 |
+
"fast_intermediate_size": 3072,
|
| 14 |
+
"fast_n_head": 16,
|
| 15 |
+
"fast_n_local_heads": 8,
|
| 16 |
+
"head_dim": 128,
|
| 17 |
+
"initializer_range": 0.03125,
|
| 18 |
+
"intermediate_size": 3072,
|
| 19 |
+
"max_seq_len": 8192,
|
| 20 |
+
"model_type": "dual_ar",
|
| 21 |
+
"n_fast_layer": 4,
|
| 22 |
+
"n_head": 16,
|
| 23 |
+
"n_layer": 28,
|
| 24 |
+
"n_local_heads": 8,
|
| 25 |
+
"norm_eps": 1e-06,
|
| 26 |
+
"num_codebooks": 10,
|
| 27 |
+
"rope_base": 1000000,
|
| 28 |
+
"scale_codebook_embeddings": true,
|
| 29 |
+
"tie_word_embeddings": false,
|
| 30 |
+
"use_gradient_checkpointing": true,
|
| 31 |
+
"vocab_size": 155776
|
| 32 |
+
}
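The attention fields in this config can be sanity-checked by deriving the projection widths they imply. The Python sketch below does that for the main (`n_*`) and fast (`fast_*`) stacks. It assumes that `n_local_heads` is the number of key/value heads, as in grouped-query attention; this reading follows a common naming convention and is not stated in the config itself. The function name `attention_shapes` is hypothetical.

```python
import json

# Subset of the config.json values shown above.
CONFIG_JSON = """
{
  "dim": 1024, "head_dim": 128, "n_head": 16, "n_local_heads": 8,
  "fast_dim": 1024, "fast_head_dim": 64, "fast_n_head": 16,
  "fast_n_local_heads": 8, "n_layer": 28, "n_fast_layer": 4
}
"""
cfg = json.loads(CONFIG_JSON)


def attention_shapes(n_head: int, n_local_heads: int, head_dim: int) -> dict:
    """Derive projection widths, assuming n_local_heads counts K/V heads."""
    if n_head % n_local_heads:
        raise ValueError("n_head must be a multiple of n_local_heads")
    return {
        "q_width": n_head * head_dim,          # total query projection width
        "kv_width": n_local_heads * head_dim,  # width of each of K and V
        "queries_per_kv_head": n_head // n_local_heads,
    }


slow = attention_shapes(cfg["n_head"], cfg["n_local_heads"], cfg["head_dim"])
fast = attention_shapes(
    cfg["fast_n_head"], cfg["fast_n_local_heads"], cfg["fast_head_dim"]
)
print("main stack:", slow)
print("fast stack:", fast)
```

Under this assumption, the main stack's query projection (16 × 128) is wider than `dim`, while the fast stack's (16 × 64) equals `fast_dim`.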
|
model.pth
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9e59be7dc6714040dce3cde1f41e730c2f0daa5339785b1cd3b60041208c35e6
|
| 3 |
+
size 1735122974
|
special_tokens.json
ADDED
|
The diff for this file is too large to render.
See raw diff
tokenizer.tiktoken
ADDED
|
The diff for this file is too large to render.
See raw diff