pevers committed
Commit 57ef297 (verified) · 1 parent: 9b999cd

Update README.md

Files changed (1): README.md (+39 −28)

README.md CHANGED
@@ -32,25 +32,26 @@ Open-weights Dutch TTS based on the [Parakeet](https://jordandarefsky.com/blog/2
  * Laughter can be added with the `(laughs)` tag. However, use it sparingly, because the model quickly derails when given too many events.
  * Reduce hallucination by tuning the text prompts. The model can be brittle with unexpected events or tokens. Take a look at the example sentences and mimic their style.

  ## Quickstart

- The JAX model has the best output quality, but it requires a bit more setup and is (for the moment) slightly slower. The model has also been ported back to PyTorch. However, I suspect that small differences in the attention kernels between PyTorch and JAX cause the PyTorch model to hallucinate and generate strange artifacts more often than the JAX model.

  ```bash
  uv sync # For CPU
- uv sync --extra tpu # For TPU
  uv sync --extra cuda # For CUDA
-
- # Create the checkpoint folder, then download and unzip the checkpoint
- mkdir -p weights
- wget "https://huggingface.co/pevers/parkiet/resolve/main/dia-nl-v1.zip?download=true" -O weights/dia-nl-v1.zip
- unzip weights/dia-nl-v1.zip -d weights
-
- # Run the inference demo
- # NOTE: Inference can take a while because of JAX compilation. Subsequent calls are cached and much faster. I'm working on performance improvements.
- uv run python src/parkiet/jax/inference.py
  ```
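The example prompts elsewhere in this README follow a simple convention: alternating `[S1]`/`[S2]` speaker tags, with event tags such as `(laughs)` used sparingly inside a turn. A minimal sketch of building such a prompt; the `format_dialogue` helper is hypothetical and not part of the parkiet codebase:

```python
# Hypothetical helper, NOT part of parkiet: joins (speaker, text) pairs
# into one prompt string in the tagged style the examples use.
def format_dialogue(turns):
    """Build a '[S1] ... [S2] ...' prompt from (speaker, text) pairs."""
    return " ".join(f"[{speaker}] {text.strip()}" for speaker, text in turns)

prompt = format_dialogue([
    ("S1", "denk je dat je een open source model kan trainen?"),
    ("S2", "(laughs) ja ik denk het wel."),
])
print(prompt)
```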

  <details>
@@ -58,6 +59,9 @@ uv run python src/parkiet/jax/inference.py
  <summary>PyTorch</summary>

  ```bash
  uv sync # For CPU
  uv sync --extra cuda # For CUDA

@@ -69,31 +73,38 @@ uv run python src/parkiet/dia/inference.py

  <details>

- <summary>Dia Plug-and-Play Transformers</summary>

- NOTE: Tune the `cfg_scale` option and the temperature to reduce hallucinations.
-
- ```python
- from transformers import AutoProcessor, DiaForConditionalGeneration
-
- torch_device = "cuda"
- model_checkpoint = "pevers/parkiet"
-
- text = [
-     "[S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo op Git Hub of Hugging Face."
- ]
- processor = AutoProcessor.from_pretrained(model_checkpoint)
- inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)
-
- model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
- outputs = model.generate(**inputs, max_new_tokens=3072, guidance_scale=3.0, temperature=1.8, top_p=0.90, top_k=50)
-
- outputs = processor.batch_decode(outputs)
- processor.save_audio(outputs, "example.mp3")
  ```

  </details>
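The `guidance_scale` / `cfg_scale` knob mentioned above applies classifier-free guidance: the model is run with and without the text condition, and the two sets of logits are blended so that larger scales tie generation more tightly to the prompt. A toy sketch of the blend on plain lists; illustrative only, the real implementation operates on logits tensors inside generation:

```python
# Toy classifier-free guidance blend: push conditional logits away from
# the unconditional ones by `scale`. Larger scale = stronger prompt
# adherence (at some cost to diversity/stability).
def apply_cfg(uncond_logits, cond_logits, scale):
    return [u + scale * (c - u) for u, c in zip(uncond_logits, cond_logits)]

print(apply_cfg([0.0, 1.0], [1.0, 1.0], 3.0))  # [3.0, 1.0]
```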

  ## ⚠️ Disclaimer
  This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:

  * Laughter can be added with the `(laughs)` tag. However, use it sparingly, because the model quickly derails when given too many events.
  * Reduce hallucination by tuning the text prompts. The model can be brittle with unexpected events or tokens. Take a look at the example sentences and mimic their style.

+ ## News
+
+ **September 28, 2025**: Added safetensors format support, allowing the model to run directly in the Dia pipeline without conversion.

  ## Quickstart

+ There are three flavours of the model: the HF Transformers version (recommended), the original JAX model, and the back-ported PyTorch model. The HF Transformers version is the easiest to use and integrates seamlessly with the Hugging Face ecosystem.
+
+ ### HF Transformers (Recommended)

  ```bash
+ # Make sure you have the runtime dependencies installed for JAX
+ # (alternatively, extract the HF inference code and depend only on transformers)
+ sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev
+
  uv sync # For CPU
  uv sync --extra cuda # For CUDA

+ # Run the inference demo with HF transformers
+ uv run python src/parkiet/dia/inference_hf.py
  ```
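Hallucinations can also be tamed through the sampling parameters used in the examples in this README (`temperature`, `top_p`, `top_k`). A toy sketch of top-k followed by top-p (nucleus) filtering over a plain probability list; illustrative only, real implementations filter logits tensors:

```python
# Toy top-k / top-p filtering, as in generate(..., top_p=0.90, top_k=50):
# keep the k most probable tokens, then the smallest prefix of those
# whose cumulative probability reaches p. Returns kept token indices.
def filter_top_k_top_p(probs, k, p):
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:k]
    kept, total = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        total += prob
        if total >= p:
            break
    return kept

print(filter_top_k_top_p([0.5, 0.3, 0.1, 0.1], k=3, p=0.8))  # [0, 1]
```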

  <details>
  <summary>PyTorch</summary>

  ```bash
+ # Make sure you have the runtime dependencies installed for JAX
+ sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev
+
  uv sync # For CPU
  uv sync --extra cuda # For CUDA


  <details>
+ <summary>JAX</summary>

+ ```bash
+ # Make sure you have the runtime dependencies installed for JAX
+ sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev
+
+ uv sync --extra tpu # For TPU
+ uv sync --extra cuda # For CUDA
+
+ # Create the checkpoint folder, then download and unzip the checkpoint
+ mkdir -p weights
+ wget "https://huggingface.co/pevers/parkiet/resolve/main/dia-nl-v1.zip?download=true" -O weights/dia-nl-v1.zip
+ unzip weights/dia-nl-v1.zip -d weights
+
+ # Run the inference demo
+ # NOTE: Inference can take a while because of JAX compilation. Subsequent calls are cached and much faster. I'm working on performance improvements.
+ uv run python src/parkiet/jax/inference.py
  ```

  </details>
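The NOTE about slow first inference reflects JAX's compile-then-cache execution model: `jax.jit` traces and compiles a function for a given input signature on the first call, then reuses the compiled version on subsequent calls. A stand-in sketch of that pattern with a plain dict cache (no JAX required; `fake_jit` is illustrative, not a real API):

```python
# Stand-in for jax.jit's compile cache: the first call for a given
# signature pays a one-off "compile" cost; later calls hit the cache.
_cache = {}
compile_count = 0

def fake_jit(fn):
    def wrapper(x):
        global compile_count
        key = (fn.__name__, type(x))  # JAX keys on shapes/dtypes instead
        if key not in _cache:
            compile_count += 1        # stand-in for the expensive XLA compile
            _cache[key] = fn
        return _cache[key](x)
    return wrapper

@fake_jit
def square(x):
    return x * x

print(square(3), square(4), compile_count)  # 9 16 1
```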

+ ## Hardware Requirements
+
+ | Framework | float32 VRAM | bfloat16 VRAM |
+ |---|---:|---:|
+ | JAX | ≥19 GB | ≥10 GB |
+ | PyTorch | ≥15 GB | ≥10 GB |
+
+ Note: `bfloat16` typically reduces VRAM usage versus `float32` to about 10 GB on supported hardware. However, converting the full model to `bfloat16` causes more instability and hallucinations. Setting just the `compute_dtype` to `bfloat16` is a good compromise, and this is also what is done during training. We would like to reduce the VRAM requirements in a future training run.
+
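The per-dtype gap in the table follows directly from bytes per parameter (`float32` = 4, `bfloat16` = 2); activations, the KV cache, and framework overhead account for the rest. A back-of-the-envelope weight-memory estimate; the parameter count below is hypothetical, chosen only to illustrate the halving, and is not the real model size:

```python
# Back-of-the-envelope weight memory: parameters × bytes per parameter.
# float32 uses 4 bytes/param, bfloat16 uses 2, so weights halve in size.
def weight_gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

n = 1_600_000_000  # hypothetical 1.6B parameters, for illustration only
print(f"float32: {weight_gib(n, 4):.1f} GiB, bfloat16: {weight_gib(n, 2):.1f} GiB")
```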
  ## ⚠️ Disclaimer
  This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden: