Update README.md
README.md
CHANGED
@@ -16,7 +16,7 @@ datasets:

 # CSM-1B Georgian

-A fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) for **Georgian text-to-speech**. This is

 ## Model Details

@@ -76,15 +76,12 @@ sf.write("output.wav", audio[0].cpu().numpy(), 24000)

 ### Speaker Selection

-The model supports 12 speakers (IDs

 | Speaker | CER | Recommended |
 |---------|-----|-------------|
 | 7 | 0.0042 | Best overall |
 | 3 | 0.0174 | Best speaker similarity |
-| 8 | 0.0142 | Good |
-| 11 | 0.0129 | Good |
-| 14 | 0.0117 | Good |

 Use the speaker ID in the text prefix: `[7]your text here`.

@@ -135,12 +132,8 @@ for i, audio in enumerate(audios):
 - Trained on 12 speakers from Common Voice Georgian (limited speaker diversity)
 - Long sentences (>10s of audio) may produce hallucinations or truncations
 - 4.1% of FLEURS samples had CER > 50% (failure cases on complex text)
-- No emotion or prosody control
 - Georgian only

-## Part of the Georgian TTS Benchmark
-
-This model was trained as part of the first Georgian TTS benchmark, a comparative study of 6 open-source TTS architectures. See the full project: [github.com/NMikaa/TTS_pipelines](https://github.com/NMikaa/TTS_pipelines)

 ## Citation


 # CSM-1B Georgian

+A fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) for **Georgian text-to-speech**. This is an open-source Georgian TTS model based on the CSM architecture.

 ## Model Details


 ### Speaker Selection

+The model supports 12 speakers with non-contiguous IDs: 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, and 14 (the numbering will be fixed in future models). Speaker 7 produces the most intelligible output (0.42% CER in-domain). Per-speaker quality varies:

 | Speaker | CER | Recommended |
 |---------|-----|-------------|
 | 7 | 0.0042 | Best overall |
 | 3 | 0.0174 | Best speaker similarity |

 Use the speaker ID in the text prefix: `[7]your text here`.
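
Because the speaker IDs are non-contiguous, it can help to validate an ID before building the prompt. The following is a minimal sketch of that idea, not part of the model's API: `VALID_SPEAKER_IDS` and `build_prompt` are hypothetical names introduced here, and the ID set is copied from the list in this README.

```python
# Speaker IDs as listed in this README (note that 9 and 13 are absent).
VALID_SPEAKER_IDS = {1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14}


def build_prompt(text: str, speaker_id: int = 7) -> str:
    """Return `text` prefixed with a speaker tag such as '[7]'.

    Hypothetical helper for illustration; defaults to speaker 7,
    which the README reports as the most intelligible.
    """
    if speaker_id not in VALID_SPEAKER_IDS:
        raise ValueError(
            f"Unknown speaker ID {speaker_id}; "
            f"expected one of {sorted(VALID_SPEAKER_IDS)}"
        )
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    # The README's prefix format has no space between the tag and the text.
    return f"[{speaker_id}]{cleaned}"


prompt = build_prompt("გამარჯობა", speaker_id=7)  # "[7]გამარჯობა"
```

The resulting string would then be passed to the model in place of a hand-written prompt; rejecting IDs such as 9 or 13 early avoids silently requesting a speaker the model was not trained on.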

 - Trained on 12 speakers from Common Voice Georgian (limited speaker diversity)
 - Long sentences (>10s of audio) may produce hallucinations or truncations
 - 4.1% of FLEURS samples had CER > 50% (failure cases on complex text)
 - Georgian only
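
For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between a reference text and a transcript, divided by the reference length. The sketch below implements that standard definition in plain Python; the README does not state which tool, ASR system, or text normalization produced its figures, so the function names and the absence of normalization here are assumptions for illustration only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def is_failure(reference: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Flag a sample whose CER exceeds the 50% threshold used in the list above."""
    return cer(reference, hypothesis) > threshold
```

Under this definition a CER of 0.0042 means roughly 4 character errors per 1,000 reference characters, and CER can exceed 1.0 when the transcript is much longer than the reference.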

 ## Citation