docs: update README - rebrand to Prat-9B-NF4, MIT license, remove LoRA details, fix links
README.md
CHANGED
@@ -1,32 +1,26 @@
 ---
-license:
 language:
-
-
 tags:
-- text-to-speech
 - tts
 - speech-synthesis
 - norwegian
-
 - bitsandbytes
 - 4bit
 - quantized
-datasets:
-- heiertech/vibevoice-norwegian-mcv
 pipeline_tag: text-to-speech
 ---

-# Prat-
-
-A 4-bit quantized version of Prat-9b-nob fine-tuned for Norwegian text-to-speech synthesis.

-
-which was fine-tuned from [vibevoice/VibeVoice-7b](https://huggingface.co/aoi-ot/VibeVoice-Large) on Norwegian speech data.
-
-### Quantization Details

 - **Method**: bitsandbytes NF4 (4-bit NormalFloat)
 - **Double quantization**: Enabled
@@ -36,43 +30,11 @@ which was fine-tuned from [vibevoice/VibeVoice-7b](https://huggingface.co/aoi-ot

 ## Training Details

-
-
-
-
-
-| Validation samples | 216 |
-| Training steps | 1,000 |
-| Epochs | ~2.24 |
-| Effective batch size | 4 (1 x 4 gradient accumulation) |
-| Optimizer | Adafactor |
-| Learning rate | 2.5e-4 |
-| LR scheduler | Cosine |
-| Warmup ratio | 3% |
-| Training time | ~33 minutes (RTX 3090) |
-
-### LoRA Configuration
-
-| Parameter | Value |
-|-----------|-------|
-| Rank (r) | 32 |
-| Alpha | 128 |
-| Dropout | 0.05 |
-| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
-
-### Loss Weights
-
-| Loss | Weight |
-|------|--------|
-| Diffusion loss | 1.4 |
-| Cross-entropy loss | 0.04 |
-| Voice prompt drop rate | 0.2 |
-
-### Training Metrics
-
-- **Initial loss**: 4.97 (step 10)
-- **Final loss**: 4.72
-- **Final train loss (avg)**: 5.33

 ## Usage

@@ -91,7 +53,7 @@ bnb_config = BitsAndBytesConfig(
 )

 model = VibeVoiceForConditionalGenerationInference.from_pretrained(
-    "heiertech/Prat-
     quantization_config=bnb_config,
     device_map="auto",
     torch_dtype=torch.bfloat16,

@@ -99,7 +61,7 @@ model = VibeVoiceForConditionalGenerationInference.from_pretrained(
 model.eval()
 model.set_ddpm_inference_steps(num_steps=10)

-processor = VibeVoiceProcessor.from_pretrained("heiertech/

 # Generate Norwegian speech
 text = "Speaker 0: Hei, jeg heter Maria og jeg kommer fra Norge."

@@ -115,4 +77,13 @@ with torch.no_grad():
 )

 audio = outputs.speech_outputs[0]  # 24kHz audio
-```
@@ -1,32 +1,26 @@
 ---
+license: mit
+base_model: vibevoice/VibeVoice-7B
 language:
+- "no"
+- nb
 tags:
 - tts
+- text-to-speech
 - speech-synthesis
 - norwegian
+- bokmal
 - bitsandbytes
 - 4bit
 - quantized
 pipeline_tag: text-to-speech
 ---

+# Prat-9B-NF4

+A 4-bit (NF4) quantized Norwegian (Bokmål) text-to-speech model fine-tuned for the Østnorsk/Oslo dialect.

+## Quantization Details

 - **Method**: bitsandbytes NF4 (4-bit NormalFloat)
 - **Double quantization**: Enabled
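The two quantization bullets above correspond to a `transformers` `BitsAndBytesConfig` along the following lines. This is a sketch, not taken from the card itself; the compute dtype is an assumption, chosen to match the `torch.bfloat16` used in the card's usage snippet.

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 (4-bit NormalFloat) with double quantization enabled,
# matching the bullets above. The compute dtype is an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weight quantization
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for matmuls at runtime
)
```

An object like this is what the usage section passes to `from_pretrained` as `quantization_config`.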
@@ -36,43 +30,11 @@

 ## Training Details

+This model was trained using a progressive 3-stage fine-tuning approach:
+
+1. **Stage 1**: Initial Norwegian (Bokmål) training on Mozilla Common Voice
+2. **Stage 2**: Continued training on broader Norwegian data
+3. **Stage 3**: Dialect-specific fine-tuning for the Østnorsk/Oslo dialect

 ## Usage

@@ -91,7 +53,7 @@
 )

 model = VibeVoiceForConditionalGenerationInference.from_pretrained(
+    "heiertech/Prat-9B-NF4",
     quantization_config=bnb_config,
     device_map="auto",
     torch_dtype=torch.bfloat16,

@@ -99,7 +61,7 @@
 model.eval()
 model.set_ddpm_inference_steps(num_steps=10)

+processor = VibeVoiceProcessor.from_pretrained("heiertech/Prat-9B-NF4")

 # Generate Norwegian speech
 text = "Speaker 0: Hei, jeg heter Maria og jeg kommer fra Norge."
@@ -115,4 +77,13 @@
 )

 audio = outputs.speech_outputs[0]  # 24kHz audio
+```
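The updated usage section ends with `audio`, a 24 kHz waveform, and does not cover saving it to disk. A dependency-free sketch using only the standard library, assuming `audio` can be converted to a flat Python sequence of floats in [-1, 1] (for a torch tensor, something like `audio.cpu().float().numpy().tolist()`):

```python
import struct
import wave

def save_wav(samples, path, rate=24000):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)     # mono
        wf.setsampwidth(2)     # 16-bit samples
        wf.setframerate(rate)  # the model outputs 24 kHz audio
        # Clamp to [-1, 1] and scale to signed 16-bit range.
        wf.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))

# Example: write 0.1 s of silence at 24 kHz
save_wav([0.0] * 2400, "out.wav")
```

`save_wav` is a hypothetical helper, not part of the VibeVoice API; libraries such as soundfile or torchaudio would do the same job in one call.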
+
+## Base Model
+
+This model is a fine-tune of [VibeVoice-7B](https://huggingface.co/vibevoice/VibeVoice-7B). Note that, despite the name, VibeVoice-7B is actually a 9B-parameter model.
+
+## Acknowledgments
+
+- Base model: [vibevoice/VibeVoice-7B](https://huggingface.co/vibevoice/VibeVoice-7B)
+- Training data: Mozilla Common Voice Norwegian