marksverdhai committed on
Commit
42bb1dc
·
verified ·
1 Parent(s): 650660e

Upload README.md with huggingface_hub

  1. README.md +128 -0
README.md ADDED
@@ -0,0 +1,128 @@
+ ---
+ license: apache-2.0
+ language:
+ - "no"
+ - en
+ tags:
+ - text-to-speech
+ - tts
+ - speech-synthesis
+ - norwegian
+ - vibevoice
+ - bitsandbytes
+ - 4bit
+ - quantized
+ base_model: aoi-ot/VibeVoice-Large
+ datasets:
+ - heiertech/vibevoice-norwegian-mcv
+ pipeline_tag: text-to-speech
+ ---
+
+ # VibeVoice-7B Norwegian (4-bit Quantized)
+
+ A 4-bit quantized version of VibeVoice-7B fine-tuned for Norwegian text-to-speech synthesis.
+
+ ## Model Description
+
+ This model is a bitsandbytes 4-bit (NF4) quantized version of [heiertech/vibevoice-7b-nob](https://huggingface.co/heiertech/vibevoice-7b-nob), which was fine-tuned from [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) on Norwegian speech data.
+
+ ### Quantization Details
+
+ - **Method**: bitsandbytes NF4 (4-bit NormalFloat)
+ - **Double quantization**: Enabled
+ - **Compute dtype**: bfloat16
+ - **Model size**: ~6.2 GB (vs ~19 GB for bf16)
+ - **VRAM usage**: ~7 GB
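
As a rough illustration of where the size reduction comes from, the sketch below estimates the storage cost per quantized weight for NF4 with and without double quantization. The block sizes (64 weights per block, 256 blocks per second-level group) are assumed defaults from QLoRA-style quantization, not values stated in this README, and the function name `nf4_bits_per_param` is hypothetical. The estimate covers only quantized weights, and the README does not say how the remaining ~6.2 GB is distributed.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, group_size=256):
    """Approximate storage cost per quantized weight, in bits.

    block_size and group_size are assumed defaults, not values from the README.
    """
    weight_bits = 4.0
    if not double_quant:
        # One fp32 scaling constant per block of weights.
        return weight_bits + 32.0 / block_size
    # Double quantization: an 8-bit constant per block, plus one fp32
    # constant shared by each group of blocks.
    return weight_bits + 8.0 / block_size + 32.0 / (block_size * group_size)


print(round(nf4_bits_per_param(double_quant=False), 3))  # 4.5
print(round(nf4_bits_per_param(double_quant=True), 3))   # 4.127
```

Under these assumptions, double quantization saves a little under 0.4 bits per quantized weight compared with plain NF4.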
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | aoi-ot/VibeVoice-Large |
+ | Dataset | heiertech/vibevoice-norwegian-mcv |
+ | Training samples | 1,784 (43 speakers) |
+ | Validation samples | 216 |
+ | Training steps | 1,000 |
+ | Epochs | ~2.24 |
+ | Effective batch size | 4 (1 x 4 gradient accumulation) |
+ | Optimizer | Adafactor |
+ | Learning rate | 2.5e-4 |
+ | LR scheduler | Cosine |
+ | Warmup ratio | 3% |
+ | Training time | ~33 minutes (RTX 3090) |
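
The figures in the table are mutually consistent, which can be checked with a few lines of arithmetic. This is a sketch that assumes a single training device (the table mentions one RTX 3090) and that the warmup length is simply the warmup ratio times the total step count. The variable names are illustrative.

```python
train_samples = 1784
steps = 1000
per_device_batch = 1
grad_accum = 4
warmup_ratio = 0.03

# Assumes one device, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_batch * grad_accum
samples_seen = steps * effective_batch
epochs = samples_seen / train_samples
warmup_steps = round(warmup_ratio * steps)

print(effective_batch, samples_seen, round(epochs, 2), warmup_steps)
# 4 4000 2.24 30
```

So 1,000 optimizer steps at an effective batch size of 4 cover 4,000 samples, or about 2.24 passes over the 1,784 training samples, with roughly 30 warmup steps.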
+
+ ### LoRA Configuration
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Rank (r) | 32 |
+ | Alpha | 128 |
+ | Dropout | 0.05 |
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
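
To make the rank and alpha values more concrete, the sketch below computes the LoRA scaling factor and the number of trainable parameters one adapter adds to a single linear layer. It assumes standard LoRA scaling (alpha divided by rank); the README does not say whether a variant such as rank-stabilized scaling was used. The layer dimensions in the example are hypothetical and are not taken from the model.

```python
def lora_scaling(alpha, r):
    """Standard LoRA scaling applied to the low-rank update (assumed, not confirmed by the README)."""
    return alpha / r


def lora_params(d_in, d_out, r):
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


print(lora_scaling(alpha=128, r=32))               # 4.0
print(lora_params(d_in=4096, d_out=4096, r=32))    # 262144, for a hypothetical 4096 x 4096 layer
```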
+
+ ### Loss Weights
+
+ | Loss | Weight |
+ |------|--------|
+ | Diffusion loss | 1.4 |
+ | Cross-entropy loss | 0.04 |
+ | Voice prompt drop rate | 0.2 |
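
The README lists the weights but not the formula that combines them. The sketch below assumes the training objective is a plain weighted sum of the two loss terms, and that the voice prompt drop rate is the probability of omitting the voice prompt for a given training example. Both are assumptions, and the function names are hypothetical.

```python
import random

DIFFUSION_WEIGHT = 1.4
CE_WEIGHT = 0.04
VOICE_PROMPT_DROP_RATE = 0.2


def combined_loss(diffusion_loss, ce_loss):
    """Weighted sum of the two terms (assumed form of the objective)."""
    return DIFFUSION_WEIGHT * diffusion_loss + CE_WEIGHT * ce_loss


def keep_voice_prompt(rng):
    """Return False for roughly 20% of calls, meaning the voice prompt is dropped."""
    return rng.random() >= VOICE_PROMPT_DROP_RATE


rng = random.Random(0)
kept = sum(keep_voice_prompt(rng) for _ in range(10_000))
print(combined_loss(1.0, 1.0))
print(kept / 10_000)
```

With these weights the diffusion term dominates: equal raw losses contribute 35 times more through the diffusion term than through the cross-entropy term.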
+
+ ### Training Metrics
+
+ - **Initial loss**: 4.97 (step 10)
+ - **Final loss**: 4.72
+ - **Final train loss (avg)**: 5.33
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import BitsAndBytesConfig
+ from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
+ from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
+
+ # Load with 4-bit quantization
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.bfloat16,
+     bnb_4bit_use_double_quant=True,
+     bnb_4bit_quant_type="nf4",
+ )
+
+ model = VibeVoiceForConditionalGenerationInference.from_pretrained(
+     "heiertech/vibevoice-7b-nob-bnb-4bit",
+     quantization_config=bnb_config,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+ )
+ model.eval()
+ model.set_ddpm_inference_steps(num_steps=10)
+
+ processor = VibeVoiceProcessor.from_pretrained("heiertech/vibevoice-7b-nob-bnb-4bit")
+
+ # Generate Norwegian speech
+ text = "Speaker 0: Hei, jeg heter Maria og jeg kommer fra Norge."
+ inputs = processor(text=[text], padding=True, return_tensors="pt", return_attention_mask=True)
+ inputs = {k: v.to(model.device) for k, v in inputs.items() if torch.is_tensor(v)}
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         cfg_scale=1.3,
+         tokenizer=processor.tokenizer,
+         generation_config={"do_sample": False},
+     )
+
+ audio = outputs.speech_outputs[0] # 24kHz audio
+ ```
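
The snippet above leaves the generated audio in memory. As a minimal sketch of one way to store it, the helper below writes mono float samples to a 16-bit PCM WAV file using only the Python standard library. It assumes the samples are floats in the range [-1.0, 1.0] at 24 kHz (the rate noted in the snippet's comment). The helper name `write_wav_mono` and the output filename are hypothetical, and a synthetic tone stands in for model output so the example runs without the model.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # taken from the "24kHz audio" comment in the usage snippet


def write_wav_mono(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as a 16-bit PCM mono WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against out-of-range values
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Stand-in for model output: 0.5 seconds of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 2)]
write_wav_mono("sample.wav", tone)
```

For real model output, the tensor would first need to be converted to a flat sequence of Python floats, for example via `audio.float().cpu().numpy().flatten().tolist()`, assuming `audio` is a PyTorch tensor.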
+
+ ## Related Models
+
+ - [heiertech/vibevoice-7b-nob](https://huggingface.co/heiertech/vibevoice-7b-nob) - LoRA adapter
+ - [heiertech/vibevoice-7b-nob-lora-merged](https://huggingface.co/heiertech/vibevoice-7b-nob-lora-merged) - Full bf16 merged model
+ - [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) - Original base model
+
+ ## License
+
+ Apache 2.0