Upload README.md
README.md
CHANGED
|
@@ -14,14 +14,17 @@ pipeline_tag: text-to-speech
|
|
| 14 |
|
| 15 |
On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2.0 license.
|
| 16 |
|
| 17 |
-
|
| 18 |
-
- XTTS v2: 467M, CPML, >10k hours
|
| 19 |
-
- Edge TTS: Microsoft, proprietary
|
| 20 |
-
- MetaVoice: 1.2B, Apache, 100k hours
|
| 21 |
-
- Parler Mini: 880M, Apache, 45k hours
|
| 22 |
-
- Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
|
| 23 |
|
| 24 |
-
Kokoro
|
| 25 |
|
| 26 |
You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
|
| 27 |
|
|
@@ -40,21 +43,30 @@ from models import build_model
|
|
| 40 |
import torch
|
| 41 |
device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
| 42 |
MODEL = build_model('kokoro-v0_19.pth', device)
|
| 43 |
-
|
| 44 |
|
| 45 |
# 3️⃣ Call generate, which returns a 24khz audio waveform and a string of output phonemes
|
| 46 |
from kokoro import generate
|
| 47 |
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
|
| 48 |
-
audio, out_ps = generate(MODEL, text, VOICEPACK)
|
| 49 |
|
| 50 |
# 4️⃣ Display the 24khz audio and print the output phonemes
|
| 51 |
from IPython.display import display, Audio
|
| 52 |
display(Audio(data=audio, rate=24000, autoplay=True))
|
| 53 |
print(out_ps)
|
| 54 |
```
|
| 55 |
-
|
| 56 |
|
| 57 |
-
### Model
|
| 58 |
|
| 59 |
No affiliation can be assumed between parties on different lines.
|
| 60 |
|
|
@@ -67,15 +79,16 @@ No affiliation can be assumed between parties on different lines.
|
|
| 67 |
|
| 68 |
**Trained by**: `@rzvzn` on Discord
|
| 69 |
|
| 70 |
-
**Supported Languages:** English
|
| 71 |
|
| 72 |
**Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`
|
| 73 |
|
| 74 |
-
|
| 75 |
- 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
|
| 76 |
- 26 Dec 2024: `am_adam`, `am_michael`
|
| 77 |
|
| 78 |
-
|
| 79 |
- Apache 2.0 weights in this repository
|
| 80 |
- MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
|
| 81 |
- GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)
|
|
@@ -117,12 +130,14 @@ assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
|
|
| 117 |
|
| 118 |
### Limitations
|
| 119 |
|
| 120 |
-
Kokoro v0.19 is limited in some ways,
|
| 121 |
- [Data] Lacks voice cloning capability, likely due to small <100h training set
|
| 122 |
- [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
|
| 123 |
- [Data] Training dataset is mostly long-form reading and narration, not conversation
|
| 124 |
- [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
|
| 125 |
-
- [Data] Multilingual capability is architecturally feasible, but training data is
|
| 126 |
|
| 127 |
**Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
|
| 128 |
|
|
|
|
| 14 |
|
| 15 |
On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2.0 license.
|
| 16 |
|
| 17 |
+
As of 28 Dec 2024, **8 unique Voicepacks have been released**: 2 female and 2 male voices each for American and British English.
|
| 18 |
|
| 19 |
+
At the time of release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena). In this single-voice Arena setting, Kokoro achieved a higher Elo than the other models while using fewer parameters and less data:
|
| 20 |
+
1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio, for <20 epochs**
|
| 21 |
+
2. XTTS v2: 467M, CPML, >10k hours
|
| 22 |
+
3. Edge TTS: Microsoft, proprietary
|
| 23 |
+
4. MetaVoice: 1.2B, Apache, 100k hours
|
| 24 |
+
5. Parler Mini: 880M, Apache, 45k hours
|
| 25 |
+
6. Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
|
| 26 |
+
|
| 27 |
+
Kokoro's ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
|
| 28 |
|
| 29 |
You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
|
| 30 |
|
|
|
|
| 43 |
import torch
|
| 44 |
device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
| 45 |
MODEL = build_model('kokoro-v0_19.pth', device)
|
| 46 |
+
VOICE_NAME = [
|
| 47 |
+
'af', # Default voice is a 50-50 mix of af_bella & af_sarah
|
| 48 |
+
'af_bella', 'af_sarah', 'am_adam', 'am_michael',
|
| 49 |
+
'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
|
| 50 |
+
][0]
|
| 51 |
+
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
|
| 52 |
+
print(f'Loaded voice: {VOICE_NAME}')
|
| 53 |
|
| 54 |
# 3️⃣ Call generate, which returns a 24khz audio waveform and a string of output phonemes
|
| 55 |
from kokoro import generate
|
| 56 |
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
|
| 57 |
+
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
|
| 58 |
+
# Language is determined by the first letter of the VOICE_NAME:
|
| 59 |
+
# 🇺🇸 'a' => American English => en-us
|
| 60 |
+
# 🇬🇧 'b' => British English => en-gb
|
| 61 |
|
| 62 |
# 4️⃣ Display the 24khz audio and print the output phonemes
|
| 63 |
from IPython.display import display, Audio
|
| 64 |
display(Audio(data=audio, rate=24000, autoplay=True))
|
| 65 |
print(out_ps)
|
| 66 |
```
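The comments in the snippet above say that the language is determined by the first letter of `VOICE_NAME`. A minimal sketch of that convention follows. `LANG_BY_PREFIX` and `lang_code_for_voice` are hypothetical names introduced for illustration and are not part of the Kokoro code. Note that the snippet itself passes the single letter (`lang=VOICE_NAME[0]`) to `generate`; the helper below only makes the letter-to-language mapping explicit.

```python
# Mapping taken from the comments in the usage snippet:
#   'a' => American English => en-us
#   'b' => British English  => en-gb
LANG_BY_PREFIX = {"a": "en-us", "b": "en-gb"}


def lang_code_for_voice(voice_name: str) -> str:
    """Return the language code implied by the first letter of a voice name.

    Hypothetical helper for illustration only.
    """
    if not voice_name:
        raise ValueError("voice_name must be a non-empty string")
    prefix = voice_name[0]
    if prefix not in LANG_BY_PREFIX:
        raise ValueError(f"Unrecognized voice prefix {prefix!r} in {voice_name!r}")
    return LANG_BY_PREFIX[prefix]


# Voice names as listed in the usage snippet.
VOICE_NAMES = [
    "af", "af_bella", "af_sarah", "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
]

for name in VOICE_NAMES:
    print(f"{name}: {lang_code_for_voice(name)}")
```

Raising on an unknown prefix, rather than silently defaulting to one accent, is a design choice of this sketch; it surfaces typos in the voice name early.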
|
| 67 |
+
The inference code was quickly hacked together on Christmas Day. It is not clean code and leaves a lot of room for improvement. If you'd like to contribute, feel free to open a PR.
|
| 68 |
|
| 69 |
+
### Model Facts
|
| 70 |
|
| 71 |
No affiliation can be assumed between parties on different lines.
|
| 72 |
|
|
|
|
| 79 |
|
| 80 |
**Trained by**: `@rzvzn` on Discord
|
| 81 |
|
| 82 |
+
**Supported Languages:** American English, British English
|
| 83 |
|
| 84 |
**Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`
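One way to check a downloaded checkpoint against the hash above is sketched below, using only Python's standard `hashlib`. The file name `kokoro-v0_19.pth` comes from the usage snippet; the function names and the 1 MiB chunk size are illustrative choices, not something the repository prescribes.

```python
import hashlib
from pathlib import Path

# Hash published in the model card.
EXPECTED_SHA256 = "3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a"


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large checkpoint is never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    model_path = Path("kokoro-v0_19.pth")  # file name from the usage snippet
    if model_path.exists():
        print("OK" if matches_expected(model_path) else "MISMATCH")
    else:
        print(f"{model_path} not found; download it first")
```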
|
| 85 |
|
| 86 |
+
### Releases
|
| 87 |
- 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
|
| 88 |
- 26 Dec 2024: `am_adam`, `am_michael`
|
| 89 |
+
- 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`
|
| 90 |
|
| 91 |
+
### Licenses
|
| 92 |
- Apache 2.0 weights in this repository
|
| 93 |
- MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
|
| 94 |
- GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)
|
|
|
|
| 130 |
|
| 131 |
### Limitations
|
| 132 |
|
| 133 |
+
Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
|
| 134 |
- [Data] Lacks voice cloning capability, likely due to small <100h training set
|
| 135 |
- [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
|
| 136 |
- [Data] Training dataset is mostly long-form reading and narration, not conversation
|
| 137 |
- [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
|
| 138 |
+
- [Data] Multilingual capability is architecturally feasible, but training data is mostly English
|
| 139 |
+
|
| 140 |
+
Refer to the [Philosophy discussion](https://huggingface.co/hexgrad/Kokoro-82M/discussions/5) to better understand these limitations.
|
| 141 |
|
| 142 |
**Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
|
| 143 |
|