12labs committed on
Commit 026659d · verified · 1 Parent(s): 70ff92e

Upload 3 files

Files changed (3)
  1. README.md +121 -0
  2. app.py +110 -0
  3. requirements.txt +11 -0
README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ title: Hindi Voice Cloning (VibeVoice)
+ emoji: 🎙️
+ colorFrom: red
+ colorTo: purple
+ sdk: gradio
+ sdk_version: "4.44.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ # 🇮🇳 Hindi Voice Cloning with Emotion
+
+ This Hugging Face Space provides **high-quality Hindi Text-to-Speech with voice cloning and expressive emotion**.
+
+ Users can upload a short reference voice sample and generate Hindi speech in the **same voice, tone, and emotional style**.
+
+ The system is powered by **VibeVoice-7B** with **Hindi LoRA fine-tuning**, optimized for natural prosody and long-form speech.
+
+ ---
+
+ ## ✨ Features
+
+ - 🎙️ Voice cloning from uploaded reference audio
+ - 🎭 Emotion & speaking style transfer
+ - 🗣️ Natural-sounding Hindi TTS
+ - 📄 Long-form narration support
+ - 🚀 GPU-accelerated inference
+ - 🎚️ Expression strength control (CFG scale)
+
+ ---
+
+ ## 🧪 How to Use
+
+ 1. Enter Hindi text in the text box
+ 2. Upload a **reference voice (WAV format)**
+ 3. Adjust **Expression Strength (CFG Scale)**
+ 4. Click **🚀 Generate Voice**
+ 5. Listen to or download the generated audio
+
+ ---
+
+ ## 🎧 Reference Voice Guidelines (Very Important)
+
+ For best quality voice cloning:
+
+ - WAV format only
+ - 10–30 seconds duration recommended
+ - Single speaker
+ - Clear audio, minimal background noise
+ - Natural emotion (happy, calm, sad, etc.)
+
+ > ⚠️ Emotion is copied from the **reference voice**, not from the text.
+
+ ---
+
+ ## 🎭 Expression Control (CFG Scale)
+
+ | CFG Scale | Effect |
+ |---------|------|
+ | 0.8 – 1.0 | Calm / neutral |
+ | 1.2 – 1.4 | Natural & expressive (recommended) |
+ | 1.5 – 2.0 | Strong emotion (may distort if too high) |
+
+ ---
+
+ ## ⚠️ System Requirements
+
+ - ✅ GPU required
+ - Recommended: A10 / A100 / H100
+ - ❌ CPU-only Spaces will not work
+ - ⏳ First run may take time due to model loading
+
+ ---
+
+ ## 🔐 Privacy & Data Handling
+
+ - Uploaded voice files are used **only for generation**
+ - Voice files are overwritten per request
+ - No permanent storage or reuse of user voices
+
+ ---
+
+ ## 🚫 Responsible Use Policy
+
+ This Space is intended for **research and demonstration purposes only**.
+
+ - ❌ Do NOT clone voices of real individuals without **explicit consent**
+ - ❌ Do NOT use for impersonation, fraud, or misinformation
+ - ❌ Do NOT present generated audio as real recordings
+
+ ✔ Always disclose AI-generated audio when sharing publicly
+
+ ---
+
+ ## 🧠 Model Information
+
+ - **Base Model:** VibeVoice-7B
+ - **Hindi Fine-Tuning:** Hindi LoRA adapters
+ - **Architecture:** LLM + acoustic & semantic tokenizers + diffusion head
+ - **Technique:** LoRA (parameter-efficient fine-tuning)
+
+ ---
+
+ ## 📜 License
+
+ MIT License
+ (Same as the base VibeVoice model and adapters)
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ - Microsoft Research – VibeVoice
+ - VibeVoice Community
+ - Hugging Face Open-Source Ecosystem
+
+ ---
+
+ ### ⚡ Note
+ This is a **research/demo Space**, not recommended for production or real-time applications.
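
The README's reference voice guidelines (WAV only, 10–30 seconds recommended) are not enforced anywhere in app.py. Below is a minimal sketch of a pre-check, assuming the upload is an uncompressed PCM WAV that Python's standard `wave` module can read. `check_reference_wav` and the two duration constants are hypothetical names, not part of this commit.

```python
import wave

# Duration bounds taken from the README guideline (10-30 seconds recommended).
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def check_reference_wav(path, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Return (ok, message) for a candidate reference recording.

    Hypothetical helper for illustration; it only checks container and length.
    """
    try:
        with wave.open(path, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
    except (wave.Error, EOFError, OSError) as exc:
        # Raised for non-WAV data, truncated files, or unreadable paths.
        return False, f"Not a readable PCM WAV file: {exc}"

    if rate <= 0:
        return False, "WAV header reports an invalid sample rate"

    duration = frames / rate
    if duration < min_seconds:
        return False, f"Too short: {duration:.1f}s (recommended minimum {min_seconds:.0f}s)"
    if duration > max_seconds:
        return False, f"Too long: {duration:.1f}s (recommended maximum {max_seconds:.0f}s)"
    return True, f"OK: {duration:.1f}s"
```

In `generate_voice`, a failed check could be surfaced with `raise gr.Error(message)` before the file is copied. The single-speaker and background-noise guidelines cannot be verified this way.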
app.py ADDED
@@ -0,0 +1,110 @@
+ import gradio as gr
+ import subprocess
+ import sys
+ import uuid
+ import os
+ import shutil
+
+ BASE_MODEL = "vibevoice/VibeVoice-7B"
+ CHECKPOINT = "tarun7r/vibevoice-hindi-lora"
+
+ VOICES_DIR = "demo/voices"
+ OUTPUT_DIR = "outputs"
+
+ os.makedirs(VOICES_DIR, exist_ok=True)
+ os.makedirs(OUTPUT_DIR, exist_ok=True)
+
+
+ def generate_voice(text, voice_file, cfg_scale, seed):
+     if not text or not text.strip():
+         raise gr.Error("❌ Hindi text is empty")
+
+     if voice_file is None:
+         raise gr.Error("❌ Please upload a reference voice (WAV)")
+
+     speaker_name = "user_voice"
+     speaker_path = os.path.join(VOICES_DIR, f"{speaker_name}.wav")
+
+     # Overwrite the previous reference voice (one shared file for all requests)
+     shutil.copy(voice_file, speaker_path)
+
+     out_file = os.path.join(OUTPUT_DIR, f"out_{uuid.uuid4().hex}.wav")
+
+     # Invoke the VibeVoice demo script with the interpreter that runs this app
+     cmd = [
+         sys.executable, "demo/inference_from_file.py",
+         "--model_path", BASE_MODEL,
+         "--checkpoint_path", CHECKPOINT,
+         "--speaker_names", speaker_name,
+         "--txt", text,
+         "--cfg_scale", str(cfg_scale),
+         "--seed", str(seed),
+         "--output_path", out_file
+     ]
+
+     try:
+         subprocess.run(cmd, check=True)
+     except subprocess.CalledProcessError as exc:
+         raise gr.Error("❌ Generation failed (check GPU / logs)") from exc
+
+     return out_file
+
+
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     gr.Markdown(
+         """
+         # 🇮🇳 Hindi Voice Cloning (VibeVoice)
+         **High-quality Hindi TTS with emotion & voice cloning**
+         Upload a reference voice (10–30 sec) and generate expressive speech.
+         """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             text = gr.Textbox(
+                 label="📝 Hindi Text",
+                 placeholder="नमस्ते, आज हम आर्टिफिशियल इंटेलिजेंस के बारे में बात करेंगे...",  # "Hello, today we will talk about artificial intelligence..."
+                 lines=6
+             )
+
+             voice = gr.Audio(
+                 label="🎙️ Reference Voice (WAV only)",
+                 type="filepath"
+             )
+
+             cfg = gr.Slider(
+                 0.8, 2.0, value=1.3, step=0.1,
+                 label="🎭 Expression Strength (CFG Scale)"
+             )
+
+             seed = gr.Number(
+                 value=42,
+                 precision=0,
+                 label="🎲 Seed (same seed = same style)"
+             )
+
+             btn = gr.Button("🚀 Generate Voice")
+
+         with gr.Column(scale=1):
+             output = gr.Audio(
+                 label="🔊 Generated Audio",
+                 type="filepath"
+             )
+
+     btn.click(
+         generate_voice,
+         inputs=[text, voice, cfg, seed],
+         outputs=output
+     )
+
+     gr.Markdown(
+         """
+         ### ℹ️ Tips for Best Quality
+         - Use **clean WAV** (10–30 sec)
+         - Emotion comes from the reference voice
+         - Higher CFG = more expressive (but too high = distortion)
+         - GPU required (A10 / A100 / H100 recommended)
+         """
+     )
+
+ demo.launch()
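
app.py copies every upload to the same `demo/voices/user_voice.wav`, so two overlapping requests could end up sharing one reference voice. One possible way around this is to stage each upload under a unique name. The sketch assumes the inference script resolves `--speaker_names` to `<name>.wav` inside `demo/voices`, as app.py already relies on. `stage_reference_voice` and `build_command` are hypothetical helpers; the flag names are copied from app.py and have not been checked against the script.

```python
import os
import shutil
import sys
import uuid

VOICES_DIR = "demo/voices"  # same directory name as in app.py


def stage_reference_voice(voice_file, voices_dir=VOICES_DIR):
    """Copy the upload to a unique per-request file and return (name, path)."""
    os.makedirs(voices_dir, exist_ok=True)
    speaker_name = f"user_{uuid.uuid4().hex}"
    speaker_path = os.path.join(voices_dir, f"{speaker_name}.wav")
    shutil.copy(voice_file, speaker_path)
    return speaker_name, speaker_path


def build_command(text, speaker_name, cfg_scale, seed, out_file, base_model, checkpoint):
    """Build the argument list for the inference script.

    Flag names mirror app.py verbatim; whether the script accepts them
    is an assumption carried over from that file.
    """
    return [
        sys.executable, "demo/inference_from_file.py",
        "--model_path", base_model,
        "--checkpoint_path", checkpoint,
        "--speaker_names", speaker_name,
        "--txt", text,
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
        "--output_path", out_file,
    ]
```

Unique names are no longer overwritten by the next request, so the caller would need to delete the staged file (for example in a `finally` block after `subprocess.run`) to keep the README's no-permanent-storage statement true.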
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ torch
+ torchaudio
+ transformers
+ gradio
+ peft
+ diffusers
+ accelerate
+ sentencepiece
+ soundfile
+ uv
+ git+https://github.com/vibevoice-community/VibeVoice.git
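
Because none of the requirements are pinned, a missing or failed install tends to surface only when the app starts. Below is a small sketch of a startup check using the standard library's `importlib.metadata`. `missing_distributions` is a hypothetical helper, and the `git+` VibeVoice line is left out because the file does not state its installed distribution name.

```python
from importlib import metadata

# Distribution names copied from requirements.txt (git+ URL entry omitted).
REQUIRED = (
    "torch", "torchaudio", "transformers", "gradio", "peft",
    "diffusers", "accelerate", "sentencepiece", "soundfile", "uv",
)


def missing_distributions(names=REQUIRED):
    """Return the subset of `names` with no installed distribution metadata."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_distributions()` before `demo.launch()` and logging the result would give a clearer failure than an import error raised from inside the inference subprocess.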