---
title: Conference Generator VibeVoice
emoji: ⭐
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: "5.44.1"
app_file: app.py
pinned: false
---
<p align="center">
<img src="public/images/banner.png" alt="VibeVoice Conference Generator" width="100%"/>
</p>
# Conference Generator — powered by VibeVoice
Generate realistic multi-speaker conference calls, meetings, and podcasts from a single text prompt. The app uses an LLM to write a natural-sounding script, then synthesizes long-form multi-speaker audio with Microsoft's [VibeVoice](https://huggingface.co/microsoft/VibeVoice-1.5B) model.
**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/ACloudCenter/Conference-Generator-VibeVoice)
### Listen to the Demo
https://github.com/user-attachments/assets/cfe5397f-7aad-4662-b0b8-e62d7546a9fb
_A 3-speaker example — Wizard (Chicago), Orc (Janus), and Mom (Cherry) — generated from a single sentence prompt. Audio visualizer created separately._
---
## Features
- **Prompt-to-audio in one step** — describe the scenario ("a 4-person product meeting about pricing") and get a full generated conversation
- **1–4 speakers** with 6 distinct voice presets (male/female tagged)
- **Up to ~90 minutes of continuous speech** via VibeVoice's long-form generation
- **Editable turn-by-turn script** — tweak speaker assignments or dialogue before rendering
- **Title generation** — the LLM names each script automatically
- **Two model sizes** — VibeVoice-1.5B (fast) and VibeVoice-7B (higher quality)
- **Gender-aware voice casting** — female characters get female voices automatically (Mom → Cherry, Wizard → Chicago, etc.) with one-click override
- **Voice preview** — sample any of the 6 voices before committing to a long generation
---
## Walkthrough
### 1. Describe your scenario
Type any scenario — a meeting, podcast, argument, TED talk — and the LLM writes the full script.
<p align="center">
<img src="public/images/Screenshot-1.png" alt="Step 1: Prompt input" width="100%"/>
</p>
### 2. Review the script and pick voices
Speaker tags auto-assign by gender. Every voice dropdown stays in sync with the tags above. Preview any voice before generating.
<p align="center">
<img src="public/images/Screenshot-2.png" alt="Step 2: Script editor with voice sync" width="100%"/>
</p>
### 3. Generate the audio
Kick off the GPU job on Modal. A funny parody narration keeps you entertained during the wait.
<p align="center">
<img src="public/images/Screenshot-3.png" alt="Step 3: Generating" width="85%"/>
</p>
### 4. Listen and download
Full-length multi-speaker audio, ready to play or download as a WAV.
<p align="center">
<img src="public/images/Screenshot-4.png" alt="Step 4: Complete" width="85%"/>
</p>
---
## About VibeVoice
VibeVoice is Microsoft's open-source long-form, multi-speaker TTS model. It uses a frozen LLM backbone with acoustic + semantic tokenizers and a diffusion head to produce up to 90 minutes of natural conversational audio with up to 4 distinct speakers.
<p align="center">
<img src="public/images/diagram.jpg" alt="VibeVoice architecture" width="85%"/>
</p>
<p align="center">
<em>Speaker voice prompts and a plain text script are fed into the VibeVoice backbone, which streams audio chunks through per-turn diffusion heads.</em>
</p>
### Benchmark performance
<p align="center">
<img src="public/images/chart.png" alt="VibeVoice benchmark comparison" width="75%"/>
</p>
In Microsoft's published subjective evaluations, VibeVoice leads on preference, realism, and richness among long-form multi-speaker TTS models.
---
## Architecture
This project separates the lightweight Gradio frontend (hosted on HF Spaces) from the GPU-heavy model backend (hosted on [Modal](https://modal.com)).
```
┌──────────────────────┐ ┌─────────────────────────┐
│ HF Space (Gradio) │ │ Modal (GPU backend) │
│ ───────────────── │ │ ─────────────────── │
│ • Prompt UI │ ───► │ • VibeVoice-1.5B / 7B │
│ • Script editor │ │ • Voice prompt loader │
│ • Qwen2.5-Coder 32B │ │ • Long-form synthesis │
│ (script writing) │ ◄─── │ • Returns WAV bytes │
└──────────────────────┘ └─────────────────────────┘
```
- **Frontend** (`app.py`): Gradio UI, script generation via HF Inference API (Qwen2.5-Coder-32B), script parsing, playback.
- **Backend** (`backend_modal/`, not included in this repo): deployed separately on Modal as a class-based GPU service exposing `generate_podcast`.
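The script-parsing step in the frontend can be sketched roughly as follows. This is a minimal illustration assuming the common `Speaker N: dialogue` line convention that VibeVoice expects; the `parse_script` helper is hypothetical, and the real parser in `app.py` may handle more edge cases:

```python
import re

def parse_script(script: str) -> list[tuple[str, str]]:
    """Split an LLM-written script into (speaker, dialogue) turns.

    Hypothetical sketch: assumes each turn sits on its own line in the
    form 'Speaker N: dialogue'; lines that don't match are skipped.
    """
    turns = []
    for raw in script.splitlines():
        match = re.match(r"^(Speaker \d+):\s*(.+)$", raw.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns

script = """Speaker 1: Welcome to the quarterly pricing review.
Speaker 2: Thanks! Let's start with the enterprise tier."""
print(parse_script(script))
```

Turns parsed this way can then be edited in the UI before being sent to the Modal backend for synthesis.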
---
## Voices
| Voice | Gender |
| --------- | :----: |
| Cherry | F |
| Chicago | M |
| Janus | M |
| Mantis | F |
| Sponge | M |
| Starchild | F |
Voice samples live in `public/voices/` and are loaded as short reference clips by the VibeVoice backend.
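The gender-aware casting described in the features list can be sketched like this. The preset table matches the README, but `cast_voices` and its first-available assignment logic are illustrative assumptions, not the actual implementation in `app.py`:

```python
# Preset table from the README; the casting logic below is a sketch.
VOICE_GENDERS = {
    "Cherry": "F", "Chicago": "M", "Janus": "M",
    "Mantis": "F", "Sponge": "M", "Starchild": "F",
}

def cast_voices(speaker_genders: list[str]) -> list[str]:
    """Assign each speaker a distinct preset matching their gender.

    Hypothetical helper: picks the first unused voice of the right
    gender, so every speaker gets a unique preset.
    """
    available = dict(VOICE_GENDERS)
    casting = []
    for gender in speaker_genders:
        voice = next(v for v, g in available.items() if g == gender)
        casting.append(voice)
        del available[voice]  # no two speakers share a voice
    return casting

print(cast_voices(["F", "M", "M"]))  # → ['Cherry', 'Chicago', 'Janus']
```

The one-click override in the UI simply replaces an auto-assigned entry with whatever the dropdown selects.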
---
## Running locally
```bash
git clone https://github.com/Josh-E-S/vibevoice-conference-generator.git
cd vibevoice-conference-generator
pip install -r requirements.txt
# Set your HF token (used for the script-writing LLM)
export HF_TOKEN=your_hf_token_here
# Deploy the Modal backend separately (not in this repo)
# modal deploy backend_modal/modal_runner.py
python app.py
```
Required env:
- `HF_TOKEN` — Hugging Face token with Inference API access
---
## Repo layout
```
.
├── app.py # Gradio frontend + script generation
├── requirements.txt # gradio, modal, huggingface_hub
├── public/
│ ├── images/ # Banner, architecture diagram, screenshots
│ ├── voices/ # Voice reference clips (Cherry, Chicago, ...)
│ └── sample-generations/ # Example generations
├── text_examples/ # Example scripts (1p, 2p, 3p, 4p scenarios)
├── tests/ # Parser tests + example prompts
└── README.md
```
---
## Credits
- **[VibeVoice](https://github.com/microsoft/VibeVoice)** — Microsoft Research's long-form multi-speaker TTS model
- **[Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)** — script generation
- **[Modal](https://modal.com)** — GPU compute for inference
- **[Gradio](https://gradio.app)** + **[Hugging Face Spaces](https://huggingface.co/spaces)** — frontend hosting
---