pevers committed
Commit 86fc289 · 1 Parent(s): 3df7ade

update readme

Files changed (1): README.md +48 -6
README.md CHANGED
@@ -16,9 +16,10 @@ Open-weights Dutch TTS based on the [Parakeet](https://jordandarefsky.com/blog/2
 
 | Text | File |
 |---|---|
-| [S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo (laughs). | [intro](https://github.com/pevers/parkiet/raw/refs/heads/main/samples/intro.mp3)
+| [S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo op Git Hub of Hugging Face. | [intro](https://github.com/pevers/parkiet/raw/refs/heads/main/samples/intro.mp3)
 | [S1] hoeveel stemmen worden er ondersteund? [S2] nou, uhm, ik denk toch wel meer dan twee. [S3] ja, ja, d dat is het mooie aan dit model. [S4] ja klopt, het ondersteund tot vier verschillende stemmen per prompt. | [multi](https://github.com/pevers/parkiet/raw/refs/heads/main/samples/multi.mp3)
 | [S1] h h et is dus ook mogelijk, om eh ... uhm, heel veel t te st stotteren in een prompt. | [stutter](https://github.com/pevers/parkiet/raw/refs/heads/main/samples/stutter.mp3) |
+| [S1] (laughs) luister, ik heb een mop, wat uhm, drinkt een webdesigner het liefst? [S2] nou ... ? [S1] Earl Grey (laughs) . [S2] (laughs) heel goed. | [laughs](https://github.com/pevers/parkiet/raw/refs/heads/main/samples/laughs.mp3)
 | [S1] je hebt maar weinig audio nodig om een stem te clonen de rest van deze tekst is uitgesproken door een computer. [S2] wauw, dat klinkt wel erg goed. [S1] ja, ik hoop dat je er wat aan hebt. | [clone](https://github.com/pevers/parkiet/raw/refs/heads/main/samples/voice_out.mp3)
 
 
@@ -27,11 +28,14 @@ Open-weights Dutch TTS based on the [Parakeet](https://jordandarefsky.com/blog/2
 * Use `[S1]`, `[S2]`, `[S3]`, `[S4]` to indicate the different speakers. Always start with `[S1]` and always alternate between [S1] and [S2] (i.e. [S1]... [S1]... is not good).
 * Prefer lower capital text prompts with punctuation. Write out digits as words. Even though the model should be able to handle some variety, it is better to stick close to the output of [WhisperD-NL](https://huggingface.co/pevers/whisperd-nl).
 * Slowing down can be encouraged by using `...` in the prompt.
-* Stuttering and disfluencies can be encouraged by using `eh`, `ehm`, `uh` or `uhm`.
+* Stuttering and disfluencies can be encouraged by using `uh`, `uhm`, `mmm`.
 * Laughter can be added with the `(laughs)` tag. However, use it sparingly because the model quickly derails for too many events.
+* Reduce hallucination by tuning the text prompts. The model can be brittle for unexpected events or tokens. Take a look at the example sentences and mimick the style.
 
 ## Quickstart
 
+The JAX model has the best performance in terms of quality, but requires a bit more setup, and is (for the moment) a little bit slower. The model is also ported back to PyTorch. However, I suspect that due to small differences in the attention kernel between PyTorch and JAX, the PyTorch model hallucinates more and generates strange artifacts more than the JAX model.
+
 ```bash
 uv sync # For CPU
 uv sync --extra tpu # For TPU
@@ -49,8 +53,46 @@ unzip weights/dia-nl-v1.zip -d weights
 uv run python src/parkiet/jax/inference.py
 ```
 
-Notes:
-- We use the JAX model by default. The model is also ported back to PyTorch. However, I suspect that due to small differences in the attention kernel between PyTorch and JAX, the PyTorch model hallucinates more and generates strange artifacts more than the JAX model. You can download the PyTorch model from [HuggingFace](https://huggingface.co/pevers/parkiet/blob/main/dia-nl-v1.pth) and use it in the [Dia](https://github.com/nari-labs/dia) pipeline.
+<details>
+
+<summary>PyTorch</summary>
+
+```bash
+uv sync # For CPU
+uv sync --extra cuda # For CUDA
+
+wget https://huggingface.co/pevers/parkiet/blob/main/dia-nl-v1.pth -O weights/dia-nl-v1.pth
+uv run python src/parkiet/dia/inference.py
+```
+
+</details>
+
+<details>
+
+<summary>Dia Plug-and-Play Transformers</summary>
+
+NOTE: Tune the `cfg_scale` option and temperature to reduce hallucinations.
+
+```python
+from transformers import AutoProcessor, DiaForConditionalGeneration
+
+torch_device = "cuda"
+model_checkpoint = "pevers/parkiet/v1/"
+
+text = [
+    "[S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo op Git Hub of Hugging Face."
+]
+processor = AutoProcessor.from_pretrained(model_checkpoint)
+inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)
+
+model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
+outputs = model.generate(**inputs, max_new_tokens=3072, guidance_scale=3.0, temperature=1.8, top_p=0.90, top_k=50)
+
+outputs = processor.batch_decode(outputs)
+processor.save_audio(outputs, "example.mp3")
+```
+
+</details>
 
 ## ⚠️ Disclaimer
 This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:
@@ -62,7 +104,7 @@ By using this model, you agree to uphold relevant legal standards and ethical re
 
 ## Training
 
-For a full guide on data preparation, model conversion and the TPU setup to train this model for any language, see [TRAINING.md](https://github.com/pevers/parkiet/blob/main/TRAINING.md).
+For a full guide on data preparation, model conversion and the TPU setup to train this model for any language, see [TRAINING.md](TRAINING.md).
 
 ## Acknowledgements
 
@@ -73,4 +115,4 @@ For a full guide on data preparation, model conversion and the TPU setup to trai
 
 ## License
 
-Repository code is licensed under the [MIT License](LICENSE). The TTS model itself is licensed as [RAIL-M](MODEL_LICENSE).
+Repository code is licensed under the [MIT License](LICENSE). The TTS model itself is licensed as [RAIL-M](MODEL_LICENSE).
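The speaker-tag rules from the updated guidelines (start with `[S1]`, only `[S1]`–`[S4]`, never the same speaker twice in a row) can be checked mechanically before sending a prompt to the model. A minimal sketch; the `validate_prompt` helper and its messages are illustrative, not part of the repo:

```python
import re

ALLOWED_TAGS = ["[S1]", "[S2]", "[S3]", "[S4]"]

def validate_prompt(prompt: str) -> list[str]:
    """Return guideline violations for a Parkiet-style prompt.

    Rules sketched from the README: the prompt must start with [S1],
    may only use tags [S1]..[S4], and the same speaker should not
    take two turns in a row (e.g. [S1]... [S1]... is not good).
    """
    problems = []
    tags = re.findall(r"\[S\d+\]", prompt)
    if not tags or tags[0] != "[S1]":
        problems.append("prompt must start with [S1]")
    for tag in tags:
        if tag not in ALLOWED_TAGS:
            problems.append(f"unsupported speaker tag {tag}")
    for prev, cur in zip(tags, tags[1:]):
        if prev == cur:
            problems.append(f"speaker {cur} takes two turns in a row")
    return problems
```

For example, `validate_prompt("[S1] hallo daar. [S2] hoi!")` returns an empty list, while a prompt opening with `[S2]` or repeating a speaker produces a violation message.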
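A caveat on the PyTorch quickstart added in this commit: fetching a Hugging Face `/blob/` URL with `wget` returns the HTML file-viewer page, while the raw checkpoint is served under `/resolve/`. A small sketch of the rewrite, using the weights URL from the diff:

```python
# Hugging Face /blob/ URLs point at the HTML file viewer;
# raw files are served under /resolve/ instead.
blob_url = "https://huggingface.co/pevers/parkiet/blob/main/dia-nl-v1.pth"
raw_url = blob_url.replace("/blob/", "/resolve/", 1)  # rewrite first match only
print(raw_url)
```

The download line would then be `wget "$raw_url" -O weights/dia-nl-v1.pth` with the rewritten URL.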