upload

Files changed (6)

README.md +92 -3
codec.pth +3 -0
config.json +32 -0
model.pth +3 -0
special_tokens.json +0 -0
tokenizer.tiktoken +0 -0

README.md CHANGED

@@ -1,3 +1,92 @@
----
-license: apache-2.0
----

+---
+tags:
+- text-to-speech
+license: cc-by-nc-sa-4.0
+language:
+- zh
+- en
+- de
+- ja
+- fr
+- es
+- ko
+- ar
+- nl
+- ru
+- it
+- pl
+- pt
+pipeline_tag: text-to-speech
+inference: false
+extra_gated_prompt: >-
+  You agree to not use the model to generate contents that violate DMCA or local
+  laws.
+extra_gated_fields:
+  Country: country
+  Specific date: date_picker
+  I agree to use this model for non-commercial use ONLY: checkbox
+---
+# OpenAudio S1
+**OpenAudio S1** is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages.
+Supported languages:
+- English (en)
+- Chinese (zh)
+- Japanese (ja)
+- German (de)
+- French (fr)
+- Spanish (es)
+- Korean (ko)
+- Arabic (ar)
+- Russian (ru)
+- Dutch (nl)
+- Italian (it)
+- Polish (pl)
+- Portuguese (pt)
+Please refer to [Fish Speech GitHub](https://github.com/fishaudio/fish-speech) for more info.
+Demo available at [Fish Audio Playground](https://fish.audio).
+Visit the [OpenAudio website](https://openaudio.com) for blog & tech report.
+## Emotion and Tone Support
+OpenAudio S1 supports a variety of emotional, tone, and special markers to enhance speech synthesis:
+**1. Emotional markers:**
+(angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)
+**2. Tone markers:**
+(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
+**3. Special markers:**
+(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
+**Special markers with corresponding onomatopoeia:**
+- Laughing: Ha,ha,ha
+- Chuckling: Hmm,hmm
+## Model Variants and Performance
+OpenAudio S1 includes the following models:
+-   **S1 (4B, proprietary):** The full-sized model.
+-   **S1-mini (0.5B):** A distilled version of S1.
+Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
+**Seed TTS Eval Metrics (English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM):**
+-   **S1:**
+    -   WER (Word Error Rate): **0.008**
+    -   CER (Character Error Rate): **0.004**
+    -   Distance: **0.332**
+-   **S1-mini:**
+    -   WER (Word Error Rate): **0.011**
+    -   CER (Character Error Rate): **0.005**
+    -   Distance: **0.380**
+## License
+This model is licensed under CC-BY-NC-SA-4.0, which allows non-commercial use only and requires attribution and share-alike distribution of derivatives.
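
Reviewer note on the "Emotion and Tone Support" section of the README above: the README lists the markers but does not show how they combine with input text. The sketch below is one possible way to build and check tagged input. It assumes markers are written in parentheses ahead of the sentence they affect, which the README does not state explicitly; `tag` and `find_markers` are hypothetical helpers, not part of Fish Speech, and only a subset of the emotional markers is reproduced.

```python
import re

# Marker names copied from the README lists (emotional markers: subset only).
EMOTION_MARKERS = {"angry", "sad", "excited", "surprised", "sarcastic"}
TONE_MARKERS = {"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"}
SPECIAL_MARKERS = {
    "laughing", "chuckling", "sobbing", "crying loudly", "sighing", "panting",
    "groaning", "crowd laughing", "background laughter", "audience laughing",
}
KNOWN_MARKERS = EMOTION_MARKERS | TONE_MARKERS | SPECIAL_MARKERS


def tag(text, *markers):
    """Prefix text with parenthesized markers, rejecting unknown names.

    Assumption: markers go before the text they apply to.
    """
    unknown = [m for m in markers if m not in KNOWN_MARKERS]
    if unknown:
        raise ValueError(f"unknown markers: {unknown}")
    prefix = " ".join(f"({m})" for m in markers)
    return f"{prefix} {text}" if prefix else text


def find_markers(text):
    """Return the parenthesized spans in text that match a known marker."""
    return [m for m in re.findall(r"\(([^()]+)\)", text) if m in KNOWN_MARKERS]
```

Validating against a fixed set catches typos such as `(wispering)` before synthesis; whether the model silently ignores or reads out an unrecognized marker is not documented in this commit.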

codec.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:74fc41c5a7151c6f350af8bd7e5d6e3accfcc7f3dfbfac23afd35af07052bb2f
+size 1871099728
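
Reviewer note: the three added lines are a Git LFS pointer, not the weights themselves. A minimal sketch of parsing such a pointer and checking a downloaded file against it follows; `parse_lfs_pointer` and `verify_file` are hypothetical helper names, and the pointer text is copied from the hunk above.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the codec.pth hunk.
CODEC_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:74fc41c5a7151c6f350af8bd7e5d6e3accfcc7f3dfbfac23afd35af07052bb2f
size 1871099728
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_file(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path matches the pointer's size and digest."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated download
    h = hashlib.new(pointer["algo"])
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]
```

For example, `verify_file("codec.pth", parse_lfs_pointer(CODEC_POINTER))` would check a local copy, assuming it was saved under that name.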

config.json ADDED

	@@ -0,0 +1,32 @@

+{
+    "attention_o_bias": false,
+    "attention_qk_norm": true,
+    "attention_qkv_bias": false,
+    "codebook_size": 4096,
+    "dim": 1024,
+    "dropout": 0.0,
+    "fast_attention_o_bias": false,
+    "fast_attention_qk_norm": false,
+    "fast_attention_qkv_bias": false,
+    "fast_dim": 1024,
+    "fast_head_dim": 64,
+    "fast_intermediate_size": 3072,
+    "fast_n_head": 16,
+    "fast_n_local_heads": 8,
+    "head_dim": 128,
+    "initializer_range": 0.03125,
+    "intermediate_size": 3072,
+    "max_seq_len": 8192,
+    "model_type": "dual_ar",
+    "n_fast_layer": 4,
+    "n_head": 16,
+    "n_layer": 28,
+    "n_local_heads": 8,
+    "norm_eps": 1e-06,
+    "num_codebooks": 10,
+    "rope_base": 1000000,
+    "scale_codebook_embeddings": true,
+    "tie_word_embeddings": false,
+    "use_gradient_checkpointing": true,
+    "vocab_size": 155776
+}
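
Reviewer note: `head_dim` is set independently of `dim` here, so the attention projection widths are not simply `dim`. The sketch below derives them from the values in the hunk above. It assumes `n_local_heads` is the number of key/value heads (grouped-query attention) and that the `fast_` prefix denotes the second, smaller transformer of the `dual_ar` model; neither is stated in this commit, and `attention_shapes` is a hypothetical helper.

```python
import json

# Attention-related fields copied from config.json in this commit.
CONFIG_SUBSET = json.loads("""
{
    "dim": 1024, "head_dim": 128, "n_head": 16, "n_local_heads": 8,
    "fast_dim": 1024, "fast_head_dim": 64, "fast_n_head": 16,
    "fast_n_local_heads": 8
}
""")


def attention_shapes(cfg, prefix=""):
    """Derive projection widths for one transformer stack.

    Assumption: n_local_heads counts key/value heads shared across query heads.
    """
    n_head = cfg[f"{prefix}n_head"]
    n_kv = cfg[f"{prefix}n_local_heads"]
    head_dim = cfg[f"{prefix}head_dim"]
    return {
        "model_dim": cfg[f"{prefix}dim"],
        "q_width": n_head * head_dim,       # total width of the query projection
        "kv_width": n_kv * head_dim,        # width of each of the K and V projections
        "queries_per_kv_head": n_head // n_kv,
    }


slow = attention_shapes(CONFIG_SUBSET)
fast = attention_shapes(CONFIG_SUBSET, prefix="fast_")
```

Under these assumptions the main stack projects a 1024-wide hidden state to 2048-wide queries and 1024-wide keys and values, while the `fast_` stack uses 1024-wide queries and 512-wide keys and values; both share each key/value head between two query heads.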

model.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9e59be7dc6714040dce3cde1f41e730c2f0daa5339785b1cd3b60041208c35e6
+size 1735122974

special_tokens.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer.tiktoken ADDED

The diff for this file is too large to render. See raw diff