pevers committed
Commit 57ef297 (verified) · 1 parent: 9b999cd

Update README.md

Files changed (1): README.md (+39 −28)

README.md CHANGED
@@ -32,25 +32,26 @@ Open-weights Dutch TTS based on the [Parakeet](https://jordandarefsky.com/blog/2
  * Laughter can be added with the `(laughs)` tag. However, use it sparingly, because the model quickly derails when given too many events.
  * Reduce hallucination by tuning the text prompts. The model can be brittle with unexpected events or tokens. Take a look at the example sentences and mimic their style.

  ## Quickstart

- The JAX model has the best output quality, but it requires a bit more setup and is (for the moment) slightly slower. The model has also been ported back to PyTorch. However, I suspect that small differences in the attention kernels between PyTorch and JAX cause the PyTorch model to hallucinate and generate strange artifacts more often than the JAX model.

  ```bash
  uv sync # For CPU
- uv sync --extra tpu # For TPU
  uv sync --extra cuda # For CUDA
-
- # Create the checkpoint folder, then download and unzip the checkpoint
- mkdir -p weights
- wget "https://huggingface.co/pevers/parkiet/resolve/main/dia-nl-v1.zip?download=true" -O weights/dia-nl-v1.zip
- unzip weights/dia-nl-v1.zip -d weights
-
- # Run the inference demo
- # NOTE: Inference can take a while because of JAX compilation. Subsequent calls are cached and much faster. I'm working on performance improvements.
- uv run python src/parkiet/jax/inference.py
  ```
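The example prompts elsewhere in this README follow a simple convention: alternating `[S1]`/`[S2]` speaker tags, with event tags such as `(laughs)` used sparingly inside a turn. A minimal sketch of building such a prompt; the `format_dialogue` helper is hypothetical and not part of the parkiet codebase:

```python
# Hypothetical helper, NOT part of parkiet: joins (speaker, text) pairs
# into one prompt string in the tagged style the examples use.
def format_dialogue(turns):
    """Build a '[S1] ... [S2] ...' prompt from (speaker, text) pairs."""
    return " ".join(f"[{speaker}] {text.strip()}" for speaker, text in turns)

prompt = format_dialogue([
    ("S1", "denk je dat je een open source model kan trainen?"),
    ("S2", "(laughs) ja ik denk het wel."),
])
print(prompt)
```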

  <details>
@@ -58,6 +59,9 @@ uv run python src/parkiet/jax/inference.py
  <summary>PyTorch</summary>

  ```bash
  uv sync # For CPU
  uv sync --extra cuda # For CUDA

@@ -69,31 +73,38 @@ uv run python src/parkiet/dia/inference.py

  <details>

- <summary>Dia Plug-and-Play Transformers</summary>

- NOTE: Tune the `cfg_scale` option and the temperature to reduce hallucinations.
-
- ```python
- from transformers import AutoProcessor, DiaForConditionalGeneration
-
- torch_device = "cuda"
- model_checkpoint = "pevers/parkiet"
-
- text = [
-     "[S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo op Git Hub of Hugging Face."
- ]
- processor = AutoProcessor.from_pretrained(model_checkpoint)
- inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)
-
- model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
- outputs = model.generate(**inputs, max_new_tokens=3072, guidance_scale=3.0, temperature=1.8, top_p=0.90, top_k=50)
-
- outputs = processor.batch_decode(outputs)
- processor.save_audio(outputs, "example.mp3")
  ```

  </details>
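The `guidance_scale` / `cfg_scale` knob mentioned above applies classifier-free guidance: the model is run with and without the text condition, and the two sets of logits are blended so that larger scales tie generation more tightly to the prompt. A toy sketch of the blend on plain lists; illustrative only, the real implementation operates on logits tensors inside generation:

```python
# Toy classifier-free guidance blend: push conditional logits away from
# the unconditional ones by `scale`. Larger scale = stronger prompt
# adherence (at some cost to diversity/stability).
def apply_cfg(uncond_logits, cond_logits, scale):
    return [u + scale * (c - u) for u, c in zip(uncond_logits, cond_logits)]

print(apply_cfg([0.0, 1.0], [1.0, 1.0], 3.0))  # [3.0, 1.0]
```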

  ## ⚠️ Disclaimer
  This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:

  * Laughter can be added with the `(laughs)` tag. However, use it sparingly, because the model quickly derails when given too many events.
  * Reduce hallucination by tuning the text prompts. The model can be brittle with unexpected events or tokens. Take a look at the example sentences and mimic their style.

+ ## News
+
+ **September 28, 2025**: Added safetensors format support, allowing the model to run directly in the Dia pipeline without conversion.

  ## Quickstart

+ There are three flavours of the model: the HF Transformers version (recommended), the original JAX model, and the back-ported PyTorch model. The HF Transformers version is the easiest to use and integrates seamlessly with the Hugging Face ecosystem.
+
+ ### HF Transformers (Recommended)

  ```bash
+ # Make sure you have the runtime dependencies installed for JAX
+ # (alternatively, extract the HF inference code and depend only on transformers)
+ sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev
+
  uv sync # For CPU
  uv sync --extra cuda # For CUDA

+ # Run the inference demo with HF transformers
+ uv run python src/parkiet/dia/inference_hf.py
  ```
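Hallucinations can also be tamed through the sampling parameters used in the examples in this README (`temperature`, `top_p`, `top_k`). A toy sketch of top-k followed by top-p (nucleus) filtering over a plain probability list; illustrative only, real implementations filter logits tensors:

```python
# Toy top-k / top-p filtering, as in generate(..., top_p=0.90, top_k=50):
# keep the k most probable tokens, then the smallest prefix of those
# whose cumulative probability reaches p. Returns kept token indices.
def filter_top_k_top_p(probs, k, p):
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:k]
    kept, total = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        total += prob
        if total >= p:
            break
    return kept

print(filter_top_k_top_p([0.5, 0.3, 0.1, 0.1], k=3, p=0.8))  # [0, 1]
```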

  <details>
  <summary>PyTorch</summary>

  ```bash
+ # Make sure you have the runtime dependencies installed for JAX
+ sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev
+
  uv sync # For CPU
  uv sync --extra cuda # For CUDA


  <details>
+ <summary>JAX</summary>

+ ```bash
+ # Make sure you have the runtime dependencies installed for JAX
+ sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev
+
+ uv sync --extra tpu # For TPU
+ uv sync --extra cuda # For CUDA
+
+ # Create the checkpoint folder, then download and unzip the checkpoint
+ mkdir -p weights
+ wget "https://huggingface.co/pevers/parkiet/resolve/main/dia-nl-v1.zip?download=true" -O weights/dia-nl-v1.zip
+ unzip weights/dia-nl-v1.zip -d weights
+
+ # Run the inference demo
+ # NOTE: Inference can take a while because of JAX compilation. Subsequent calls are cached and much faster. I'm working on performance improvements.
+ uv run python src/parkiet/jax/inference.py
  ```

  </details>
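The NOTE about slow first inference reflects JAX's compile-then-cache execution model: `jax.jit` traces and compiles a function for a given input signature on the first call, then reuses the compiled version on subsequent calls. A stand-in sketch of that pattern with a plain dict cache (no JAX required; `fake_jit` is illustrative, not a real API):

```python
# Stand-in for jax.jit's compile cache: the first call for a given
# signature pays a one-off "compile" cost; later calls hit the cache.
_cache = {}
compile_count = 0

def fake_jit(fn):
    def wrapper(x):
        global compile_count
        key = (fn.__name__, type(x))  # JAX keys on shapes/dtypes instead
        if key not in _cache:
            compile_count += 1        # stand-in for the expensive XLA compile
            _cache[key] = fn
        return _cache[key](x)
    return wrapper

@fake_jit
def square(x):
    return x * x

print(square(3), square(4), compile_count)  # 9 16 1
```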

+ ## Hardware Requirements
+
+ | Framework | float32 VRAM | bfloat16 VRAM |
+ |---|---:|---:|
+ | JAX | ≥19 GB | ≥10 GB |
+ | PyTorch | ≥15 GB | ≥10 GB |
+
+ Note: `bfloat16` typically reduces VRAM usage versus `float32` to about 10 GB on supported hardware. However, converting the full model to `bfloat16` causes more instability and hallucinations. Setting just the `compute_dtype` to `bfloat16` is a good compromise, and this is also what is done during training. We would like to reduce the VRAM requirements in a future training run.
+
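The per-dtype gap in the table follows directly from bytes per parameter (`float32` = 4, `bfloat16` = 2); activations, the KV cache, and framework overhead account for the rest. A back-of-the-envelope weight-memory estimate; the parameter count below is hypothetical, chosen only to illustrate the halving, and is not the real model size:

```python
# Back-of-the-envelope weight memory: parameters × bytes per parameter.
# float32 uses 4 bytes/param, bfloat16 uses 2, so weights halve in size.
def weight_gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

n = 1_600_000_000  # hypothetical 1.6B parameters, for illustration only
print(f"float32: {weight_gib(n, 4):.1f} GiB, bfloat16: {weight_gib(n, 2):.1f} GiB")
```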
  ## ⚠️ Disclaimer
  This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden: