heiertech/Prat-9B-NF4

+---
+license: apache-2.0
+language:
+- "no"
+- en
+tags:
+- text-to-speech
+- tts
+- speech-synthesis
+- norwegian
+- vibevoice
+- bitsandbytes
+- 4bit
+- quantized
+base_model: aoi-ot/VibeVoice-Large
+datasets:
+- heiertech/vibevoice-norwegian-mcv
+pipeline_tag: text-to-speech
+---
+# VibeVoice-7B Norwegian (4-bit Quantized)
+A 4-bit quantized version of VibeVoice-7B fine-tuned for Norwegian text-to-speech synthesis.
+## Model Description
+This model is a bitsandbytes 4-bit (NF4) quantization of [heiertech/vibevoice-7b-nob](https://huggingface.co/heiertech/vibevoice-7b-nob), a LoRA fine-tune of [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) on Norwegian speech data.
+### Quantization Details
+- **Method**: bitsandbytes NF4 (4-bit NormalFloat)
+- **Double quantization**: Enabled
+- **Compute dtype**: bfloat16
+- **Model size**: ~6.2 GB (vs ~19 GB for bf16)
+- **VRAM usage**: ~7 GB
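As a rough sanity check on the size figures above, the sketch below derives the implied parameter count and the effective bits per parameter. It assumes 1 GB = 10^9 bytes and that the ~19 GB bf16 checkpoint consists entirely of 2-byte weights. Neither assumption comes from the model card, so treat the results as estimates only.

```python
# Back-of-envelope check of the size figures quoted above.
# Assumptions (not from the model card): 1 GB = 1e9 bytes, and the
# ~19 GB bf16 checkpoint consists entirely of 2-byte weights.
BF16_SIZE_GB = 19.0
NF4_SIZE_GB = 6.2

params = BF16_SIZE_GB * 1e9 / 2                   # implied parameter count
bits_per_param = NF4_SIZE_GB * 1e9 * 8 / params   # effective storage per weight
compression = BF16_SIZE_GB / NF4_SIZE_GB          # bf16 size relative to NF4 size

print(f"implied parameters: {params / 1e9:.1f}B")
print(f"effective bits per parameter: {bits_per_param:.2f}")
print(f"compression ratio: {compression:.2f}x")
```

An effective rate above 4 bits per parameter is plausible here, since bitsandbytes stores quantization constants next to the 4-bit values and typically leaves some layers in higher precision.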
+## Training Details
+| Parameter | Value |
+|-----------|-------|
+| Base model | aoi-ot/VibeVoice-Large |
+| Dataset | heiertech/vibevoice-norwegian-mcv |
+| Training samples | 1,784 (43 speakers) |
+| Validation samples | 216 |
+| Training steps | 1,000 |
+| Epochs | ~2.24 |
+| Effective batch size | 4 (batch size 1 × 4 gradient accumulation steps) |
+| Optimizer | Adafactor |
+| Learning rate | 2.5e-4 |
+| LR scheduler | Cosine |
+| Warmup ratio | 3% |
+| Training time | ~33 minutes (RTX 3090) |
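The figures in the table are mutually consistent, which the following sketch checks: 1,000 steps at an effective batch size of 4 over 1,784 samples gives about 2.24 epochs. The schedule function is an assumption for illustration: it uses linear warmup followed by cosine decay to zero, a common form of a "cosine" scheduler, and the training code may have used a different variant.

```python
import math

# Values taken from the training table above.
TRAIN_SAMPLES = 1784
TOTAL_STEPS = 1000
EFFECTIVE_BATCH = 1 * 4          # batch size x gradient accumulation steps
PEAK_LR = 2.5e-4
WARMUP_STEPS = round(0.03 * TOTAL_STEPS)

# Number of passes over the training set.
epochs = TOTAL_STEPS * EFFECTIVE_BATCH / TRAIN_SAMPLES


def lr_at(step: int) -> float:
    """Learning rate at a given step.

    Assumed shape: linear warmup, then cosine decay to zero.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"epochs: {epochs:.2f}")
print(f"warmup steps: {WARMUP_STEPS}")
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```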
+### LoRA Configuration
+| Parameter | Value |
+|-----------|-------|
+| Rank (r) | 32 |
+| Alpha | 128 |
+| Dropout | 0.05 |
+| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
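To make the rank and alpha values concrete, here is a minimal sketch based on the standard LoRA formulation, where the low-rank update is scaled by alpha / r. The layer shape of 4096 x 4096 is hypothetical and chosen only for illustration; it is not taken from the model's actual configuration.

```python
# LoRA hyperparameters from the table above.
R = 32
ALPHA = 128
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

# In standard LoRA the update B @ A is multiplied by alpha / r.
scaling = ALPHA / R


def lora_added_params(d_in: int, d_out: int, r: int = R) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Hypothetical layer shape, for illustration only.
D_IN, D_OUT = 4096, 4096
added = lora_added_params(D_IN, D_OUT)
full = D_IN * D_OUT

print(f"scaling factor: {scaling}")
print(f"added params per layer: {added:,} ({100 * added / full:.2f}% of the full weight)")
```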
+### Loss Weights
+| Parameter | Value |
+|-----------|-------|
+| Diffusion loss weight | 1.4 |
+| Cross-entropy loss weight | 0.04 |
+| Voice prompt drop rate | 0.2 |
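One plausible reading of these settings is sketched below: the training objective is a weighted sum of the two losses, and the voice prompt is dropped from a training example with probability 0.2. Both function names are hypothetical, and the actual training code may combine or apply these values differently.

```python
import random

# Values from the table above.
DIFFUSION_WEIGHT = 1.4
CE_WEIGHT = 0.04
VOICE_PROMPT_DROP_RATE = 0.2


def combined_loss(diffusion_loss: float, ce_loss: float) -> float:
    """Weighted sum of the two losses (assumed form of the objective)."""
    return DIFFUSION_WEIGHT * diffusion_loss + CE_WEIGHT * ce_loss


def keep_voice_prompt(rng: random.Random) -> bool:
    """Return False with probability VOICE_PROMPT_DROP_RATE."""
    return rng.random() >= VOICE_PROMPT_DROP_RATE


rng = random.Random(0)
kept = sum(keep_voice_prompt(rng) for _ in range(10_000))
print(f"combined loss for (1.0, 10.0): {combined_loss(1.0, 10.0):.2f}")
print(f"voice prompts kept: {kept / 10_000:.1%}")
```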
+### Training Metrics
+- **Initial loss**: 4.97 (step 10)
+- **Final loss**: 4.72
+- **Final train loss (avg)**: 5.33
+## Usage
+```python
+import torch
+from transformers import BitsAndBytesConfig
+from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
+from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
+
+# 4-bit NF4 settings matching the quantization details above
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type="nf4",
+)
+
+# Load the quantized model and switch to inference mode
+model = VibeVoiceForConditionalGenerationInference.from_pretrained(
+    "heiertech/vibevoice-7b-nob-bnb-4bit",
+    quantization_config=bnb_config,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+model.eval()
+model.set_ddpm_inference_steps(num_steps=10)  # diffusion denoising steps
+
+processor = VibeVoiceProcessor.from_pretrained("heiertech/vibevoice-7b-nob-bnb-4bit")
+
+# Generate Norwegian speech (the script uses a "Speaker 0:" label)
+text = "Speaker 0: Hei, jeg heter Maria og jeg kommer fra Norge."
+inputs = processor(text=[text], padding=True, return_tensors="pt", return_attention_mask=True)
+# Move tensor inputs to the model's device (non-tensor entries are dropped)
+inputs = {k: v.to(model.device) for k, v in inputs.items() if torch.is_tensor(v)}
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        cfg_scale=1.3,  # classifier-free guidance scale
+        tokenizer=processor.tokenizer,
+        generation_config={"do_sample": False},  # disable sampling
+    )
+
+audio = outputs.speech_outputs[0]  # 24 kHz audio
+```
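The usage snippet stops at an in-memory `audio` value. As a minimal sketch, the helper below writes mono samples to a 24 kHz, 16-bit WAV file using only the Python standard library. `write_wav` is a hypothetical helper, not part of the VibeVoice package, and it assumes the samples are floats in the range [-1.0, 1.0]. A short sine tone stands in for model output so the block runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))            # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)                           # mono
        wf.setsampwidth(2)                           # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in for model output: 0.1 s of a 440 Hz tone at 24 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
demo_path = os.path.join(tempfile.gettempdir(), "vibevoice_demo.wav")
write_wav(tone, demo_path)
```

With real model output, a call along the lines of `write_wav(audio.float().cpu().flatten().tolist(), "output.wav")` should work, assuming `audio` is a PyTorch tensor holding a single mono waveform.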
+## Related Models
+- [heiertech/vibevoice-7b-nob](https://huggingface.co/heiertech/vibevoice-7b-nob) - LoRA adapter
+- [heiertech/vibevoice-7b-nob-lora-merged](https://huggingface.co/heiertech/vibevoice-7b-nob-lora-merged) - Full bf16 merged model
+- [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) - Original base model
+## License
+Apache 2.0