---
title: Conference Generator VibeVoice
emoji: 
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: "5.44.1"
app_file: app.py
pinned: false
---

<p align="center">
  <img src="public/images/banner.png" alt="VibeVoice Conference Generator" width="100%"/>
</p>

# Conference Generator — powered by VibeVoice

Generate realistic multi-speaker conference calls, meetings, and podcasts from a single text prompt. The app uses an LLM to write a natural-sounding script, then synthesizes long-form multi-speaker audio with Microsoft's [VibeVoice](https://huggingface.co/microsoft/VibeVoice-1.5B) model.

**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/ACloudCenter/Conference-Generator-VibeVoice)

### Listen to the Demo

https://github.com/user-attachments/assets/cfe5397f-7aad-4662-b0b8-e62d7546a9fb

_A 3-speaker example — Wizard (Chicago), Orc (Janus), and Mom (Cherry) — generated from a single sentence prompt. Audio visualizer created separately._

---

## Features

- **Prompt-to-audio in one step** — describe the scenario ("a 4-person product meeting about pricing") and get a full generated conversation
- **1–4 speakers** with 6 distinct voice presets (male/female tagged)
- **Up to ~90 minutes of continuous speech** via VibeVoice's long-form generation
- **Editable turn-by-turn script** — tweak speaker assignments or dialogue before rendering
- **Title generation** — the LLM names each script automatically
- **Two model sizes** — VibeVoice-1.5B (fast) and VibeVoice-7B (higher quality)
- **Gender-aware voice casting** — female characters get female voices automatically (Mom → Cherry, Wizard → Chicago, etc.) with one-click override
- **Voice preview** — sample any of the 6 voices before committing to a long generation

---

## Walkthrough

### 1. Describe your scenario

Type any scenario — a meeting, podcast, argument, TED talk — and the LLM writes the full script.

<p align="center">
  <img src="public/images/Screenshot-1.png" alt="Step 1: Prompt input" width="100%"/>
</p>

### 2. Review the script and pick voices

Speaker tags auto-assign by gender. Every voice dropdown stays in sync with the tags above. Preview any voice before generating.

<p align="center">
  <img src="public/images/Screenshot-2.png" alt="Step 2: Script editor with voice sync" width="100%"/>
</p>
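The script editor works over a turn-by-turn structure. As a rough sketch of how a generated script could be split into editable turns (the `Speaker N:` tag format and the parser below are assumptions for illustration; the actual parsing lives in `app.py` and may differ):

```python
import re

# Assumed turn format: "Speaker 1: some dialogue" — one turn per line.
TURN_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a generated script into (speaker_index, dialogue) turns."""
    turns = []
    for line in script.splitlines():
        match = TURN_RE.match(line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2).strip()))
    return turns

demo = """Speaker 1: Welcome to the quarterly pricing call.
Speaker 2: Thanks, let's start with the enterprise tier."""
print(parse_script(demo))
# [(1, 'Welcome to the quarterly pricing call.'), (2, "Thanks, let's start with the enterprise tier.")]
```

Parsing into `(speaker, text)` pairs is what lets the UI reassign a voice per speaker without touching the dialogue itself.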

### 3. Generate the audio

Kick off the GPU job on Modal. A funny parody narration keeps you entertained during the wait.

<p align="center">
  <img src="public/images/Screenshot-3.png" alt="Step 3: Generating" width="85%"/>
</p>

### 4. Listen and download

Full-length multi-speaker audio, ready to play or download as a WAV.

<p align="center">
  <img src="public/images/Screenshot-4.png" alt="Step 4: Complete" width="85%"/>
</p>

---

## About VibeVoice

VibeVoice is Microsoft's open-source long-form, multi-speaker TTS model. It uses a frozen LLM backbone with acoustic + semantic tokenizers and a diffusion head to produce up to 90 minutes of natural conversational audio with up to 4 distinct speakers.

<p align="center">
  <img src="public/images/diagram.jpg" alt="VibeVoice architecture" width="85%"/>
</p>

<p align="center">
  <em>Speaker voice prompts and a plain text script are fed into the VibeVoice backbone, which streams audio chunks through per-turn diffusion heads.</em>
</p>

### Benchmark performance

<p align="center">
  <img src="public/images/chart.png" alt="VibeVoice benchmark comparison" width="75%"/>
</p>

VibeVoice leads on preference, realism, and richness among long-form multi-speaker TTS models.

---

## Architecture

This project separates the lightweight Gradio frontend (hosted on HF Spaces) from the GPU-heavy model backend (hosted on [Modal](https://modal.com)).

```
┌──────────────────────┐      ┌─────────────────────────┐
│  HF Space (Gradio)   │      │   Modal (GPU backend)   │
│  ─────────────────   │      │   ───────────────────   │
│  • Prompt UI         │ ───► │  • VibeVoice-1.5B / 7B  │
│  • Script editor     │      │  • Voice prompt loader  │
│  • Qwen2.5-Coder 32B │      │  • Long-form synthesis  │
│    (script writing)  │ ◄─── │  • Returns WAV bytes    │
└──────────────────────┘      └─────────────────────────┘
```

- **Frontend** (`app.py`): Gradio UI, script generation via HF Inference API (Qwen2.5-Coder-32B), script parsing, playback.
- **Backend** (`backend_modal/`, not included in this repo): deployed separately on Modal as a class-based GPU service exposing `generate_podcast`.
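To make the split concrete, here is a hypothetical sketch of the payload the frontend might assemble for the backend's `generate_podcast` call. The field names, the Modal app/class names in the comment, and `build_request` itself are illustrative assumptions; the real backend lives in `backend_modal/` and is not part of this repo:

```python
# Hypothetical payload builder — field names are assumptions, not the
# actual generate_podcast signature.
def build_request(turns, voice_map, model="VibeVoice-1.5B"):
    """Package parsed turns and per-speaker voice picks for the backend."""
    return {
        "model": model,
        "segments": [
            {"voice": voice_map[speaker], "text": text}
            for speaker, text in turns
        ],
    }

payload = build_request(
    turns=[(1, "Welcome, everyone."), (2, "Glad to be here.")],
    voice_map={1: "Cherry", 2: "Chicago"},
)

# The dispatch to Modal would then look roughly like (names assumed):
#   svc = modal.Cls.from_name("vibevoice-backend", "Generator")()
#   wav_bytes = svc.generate_podcast.remote(**payload)
```

Keeping the request a plain dict of segments is what lets the Gradio frontend stay GPU-free: all model weights and synthesis stay behind the Modal boundary.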

---

## Voices

| Voice     | Gender |
| --------- | :----: |
| Cherry    |   F    |
| Chicago   |   M    |
| Janus     |   M    |
| Mantis    |   F    |
| Sponge    |   M    |
| Starchild |   F    |

Voice samples live in `public/voices/` and are loaded as short reference clips by the VibeVoice backend.
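The gender-aware casting from the Features list can be sketched as a simple greedy assignment over this table. The keyword heuristic and fallback below are assumptions for illustration, not the exact logic in `app.py`:

```python
# Preset table from the README; the casting heuristic is an assumption.
VOICES = {
    "Cherry": "F", "Chicago": "M", "Janus": "M",
    "Mantis": "F", "Sponge": "M", "Starchild": "F",
}
FEMALE_HINTS = {"mom", "mother", "queen", "witch", "grandma", "aunt"}

def cast_voice(character: str, taken: set[str]) -> str:
    """Pick the first unused voice matching the character's inferred gender."""
    gender = "F" if any(h in character.lower() for h in FEMALE_HINTS) else "M"
    for voice, g in VOICES.items():
        if g == gender and voice not in taken:
            taken.add(voice)
            return voice
    # Gendered pool exhausted: fall back to any remaining voice.
    for voice in VOICES:
        if voice not in taken:
            taken.add(voice)
            return voice
    raise ValueError("more speakers than available voices")

taken: set[str] = set()
print(cast_voice("Mom", taken))     # Cherry
print(cast_voice("Wizard", taken))  # Chicago
```

Under this sketch, the demo's cast falls out naturally: Mom takes the first free female voice (Cherry), Wizard the first free male voice (Chicago), with the UI's one-click override replacing any pick.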

---

## Running locally

```bash
git clone https://github.com/Josh-E-S/vibevoice-conference-generator.git
cd vibevoice-conference-generator
pip install -r requirements.txt

# Set your HF token (used for the script-writing LLM)
export HF_TOKEN=your_hf_token_here

# Deploy the Modal backend separately (not in this repo)
# modal deploy backend_modal/modal_runner.py

python app.py
```

Required env:

- `HF_TOKEN` — Hugging Face token with Inference API access

---

## Repo layout

```
.
├── app.py                # Gradio frontend + script generation
├── requirements.txt      # gradio, modal, huggingface_hub
├── public/
│   ├── images/              # Banner, architecture diagram, screenshots
│   ├── voices/              # Voice reference clips (Cherry, Chicago, ...)
│   └── sample-generations/  # Example generations
├── text_examples/           # Example scripts (1p, 2p, 3p, 4p scenarios)
├── tests/                   # Parser tests + example prompts
└── README.md
```

---

## Credits

- **[VibeVoice](https://github.com/microsoft/VibeVoice)** — Microsoft Research's long-form multi-speaker TTS model
- **[Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)** — script generation
- **[Modal](https://modal.com)** — GPU compute for inference
- **[Gradio](https://gradio.app)** + **[Hugging Face Spaces](https://huggingface.co/spaces)** — frontend hosting

---