---
title: Conference Generator VibeVoice
emoji: 
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: "5.44.1"
app_file: app.py
pinned: false
---

<p align="center">
  <img src="public/images/banner.png" alt="VibeVoice Conference Generator" width="100%"/>
</p>

# Conference Generator — powered by VibeVoice

Generate realistic multi-speaker conference calls, meetings, and podcasts from a single text prompt. The app uses an LLM to write a natural-sounding script, then synthesizes long-form multi-speaker audio with Microsoft's [VibeVoice](https://huggingface.co/microsoft/VibeVoice-1.5B) model.

**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/ACloudCenter/Conference-Generator-VibeVoice)

### Listen to the Demo

https://github.com/user-attachments/assets/cfe5397f-7aad-4662-b0b8-e62d7546a9fb

_A 3-speaker example — Wizard (Chicago), Orc (Janus), and Mom (Cherry) — generated from a single sentence prompt. Audio visualizer created separately._

---

## Features

- **Prompt-to-audio in one step** — describe the scenario ("a 4-person product meeting about pricing") and get a full generated conversation
- **1–4 speakers** with 6 distinct voice presets (male/female tagged)
- **Up to ~90 minutes of continuous speech** via VibeVoice's long-form generation
- **Editable turn-by-turn script** — tweak speaker assignments or dialogue before rendering
- **Title generation** — the LLM names each script automatically
- **Two model sizes** — VibeVoice-1.5B (fast) and VibeVoice-7B (higher quality)
- **Gender-aware voice casting** — female characters get female voices automatically (Mom → Cherry, Wizard → Chicago, etc.) with one-click override
- **Voice preview** — sample any of the 6 voices before committing to a long generation

---

## Walkthrough

### 1. Describe your scenario

Type any scenario — a meeting, podcast, argument, TED talk — and the LLM writes the full script.

<p align="center">
  <img src="public/images/Screenshot-1.png" alt="Step 1: Prompt input" width="100%"/>
</p>

### 2. Review the script and pick voices

Speaker tags auto-assign by gender. Every voice dropdown stays in sync with the tags above. Preview any voice before generating.

<p align="center">
  <img src="public/images/Screenshot-2.png" alt="Step 2: Script editor with voice sync" width="100%"/>
</p>
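The script editor works over a turn-by-turn structure. As a rough sketch of how a generated script could be split into editable turns (the `Speaker N:` tag format and the parser below are assumptions for illustration; the actual parsing lives in `app.py` and may differ):

```python
import re

# Assumed turn format: "Speaker 1: some dialogue" — one turn per line.
TURN_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a generated script into (speaker_index, dialogue) turns."""
    turns = []
    for line in script.splitlines():
        match = TURN_RE.match(line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2).strip()))
    return turns

demo = """Speaker 1: Welcome to the quarterly pricing call.
Speaker 2: Thanks, let's start with the enterprise tier."""
print(parse_script(demo))
# [(1, 'Welcome to the quarterly pricing call.'), (2, "Thanks, let's start with the enterprise tier.")]
```

Parsing into `(speaker, text)` pairs is what lets the UI reassign a voice per speaker without touching the dialogue itself.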

### 3. Generate the audio

Kick off the GPU job on Modal. A funny parody narration keeps you entertained during the wait.

<p align="center">
  <img src="public/images/Screenshot-3.png" alt="Step 3: Generating" width="85%"/>
</p>

### 4. Listen and download

Full-length multi-speaker audio, ready to play or download as a WAV.

<p align="center">
  <img src="public/images/Screenshot-4.png" alt="Step 4: Complete" width="85%"/>
</p>

---

## About VibeVoice

VibeVoice is Microsoft's open-source long-form, multi-speaker TTS model. It uses a frozen LLM backbone with acoustic + semantic tokenizers and a diffusion head to produce up to 90 minutes of natural conversational audio with up to 4 distinct speakers.

<p align="center">
  <img src="public/images/diagram.jpg" alt="VibeVoice architecture" width="85%"/>
</p>

<p align="center">
  <em>Speaker voice prompts and a plain text script are fed into the VibeVoice backbone, which streams audio chunks through per-turn diffusion heads.</em>
</p>

### Benchmark performance

<p align="center">
  <img src="public/images/chart.png" alt="VibeVoice benchmark comparison" width="75%"/>
</p>

VibeVoice leads on preference, realism, and richness among long-form multi-speaker TTS models.

---

## Architecture

This project separates the lightweight Gradio frontend (hosted on HF Spaces) from the GPU-heavy model backend (hosted on [Modal](https://modal.com)).

```
┌──────────────────────┐      ┌─────────────────────────┐
│  HF Space (Gradio)   │      │   Modal (GPU backend)   │
│  ─────────────────   │      │   ───────────────────   │
│  • Prompt UI         │ ───► │  • VibeVoice-1.5B / 7B  │
│  • Script editor     │      │  • Voice prompt loader  │
│  • Qwen2.5-Coder 32B │      │  • Long-form synthesis  │
│    (script writing)  │ ◄─── │  • Returns WAV bytes    │
└──────────────────────┘      └─────────────────────────┘
```

- **Frontend** (`app.py`): Gradio UI, script generation via HF Inference API (Qwen2.5-Coder-32B), script parsing, playback.
- **Backend** (`backend_modal/`, not included in this repo): deployed separately on Modal as a class-based GPU service exposing `generate_podcast`.
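To make the split concrete, here is a hypothetical sketch of the payload the frontend might assemble for the backend's `generate_podcast` call. The field names, the Modal app/class names in the comment, and `build_request` itself are illustrative assumptions; the real backend lives in `backend_modal/` and is not part of this repo:

```python
# Hypothetical payload builder — field names are assumptions, not the
# actual generate_podcast signature.
def build_request(turns, voice_map, model="VibeVoice-1.5B"):
    """Package parsed turns and per-speaker voice picks for the backend."""
    return {
        "model": model,
        "segments": [
            {"voice": voice_map[speaker], "text": text}
            for speaker, text in turns
        ],
    }

payload = build_request(
    turns=[(1, "Welcome, everyone."), (2, "Glad to be here.")],
    voice_map={1: "Cherry", 2: "Chicago"},
)

# The dispatch to Modal would then look roughly like (names assumed):
#   svc = modal.Cls.from_name("vibevoice-backend", "Generator")()
#   wav_bytes = svc.generate_podcast.remote(**payload)
```

Keeping the request a plain dict of segments is what lets the Gradio frontend stay GPU-free: all model weights and synthesis stay behind the Modal boundary.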

---

## Voices

| Voice     | Gender |
| --------- | :----: |
| Cherry    |   F    |
| Chicago   |   M    |
| Janus     |   M    |
| Mantis    |   F    |
| Sponge    |   M    |
| Starchild |   F    |

Voice samples live in `public/voices/` and are loaded as short reference clips by the VibeVoice backend.
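The gender-aware casting from the Features list can be sketched as a simple greedy assignment over this table. The keyword heuristic and fallback below are assumptions for illustration, not the exact logic in `app.py`:

```python
# Preset table from the README; the casting heuristic is an assumption.
VOICES = {
    "Cherry": "F", "Chicago": "M", "Janus": "M",
    "Mantis": "F", "Sponge": "M", "Starchild": "F",
}
FEMALE_HINTS = {"mom", "mother", "queen", "witch", "grandma", "aunt"}

def cast_voice(character: str, taken: set[str]) -> str:
    """Pick the first unused voice matching the character's inferred gender."""
    gender = "F" if any(h in character.lower() for h in FEMALE_HINTS) else "M"
    for voice, g in VOICES.items():
        if g == gender and voice not in taken:
            taken.add(voice)
            return voice
    # Gendered pool exhausted: fall back to any remaining voice.
    for voice in VOICES:
        if voice not in taken:
            taken.add(voice)
            return voice
    raise ValueError("more speakers than available voices")

taken: set[str] = set()
print(cast_voice("Mom", taken))     # Cherry
print(cast_voice("Wizard", taken))  # Chicago
```

Under this sketch, the demo's cast falls out naturally: Mom takes the first free female voice (Cherry), Wizard the first free male voice (Chicago), with the UI's one-click override replacing any pick.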

---

## Running locally

```bash
git clone https://github.com/Josh-E-S/vibevoice-conference-generator.git
cd vibevoice-conference-generator
pip install -r requirements.txt

# Set your HF token (used for the script-writing LLM)
export HF_TOKEN=your_hf_token_here

# Deploy the Modal backend separately (not in this repo)
# modal deploy backend_modal/modal_runner.py

python app.py
```

Required env:

- `HF_TOKEN` — Hugging Face token with Inference API access

---

## Repo layout

```
.
├── app.py                # Gradio frontend + script generation
├── requirements.txt      # gradio, modal, huggingface_hub
├── public/
│   ├── images/              # Banner, architecture diagram, screenshots
│   ├── voices/              # Voice reference clips (Cherry, Chicago, ...)
│   └── sample-generations/  # Example generations
├── text_examples/           # Example scripts (1p, 2p, 3p, 4p scenarios)
├── tests/                   # Parser tests + example prompts
└── README.md
```

---

## Credits

- **[VibeVoice](https://github.com/microsoft/VibeVoice)** — Microsoft Research's long-form multi-speaker TTS model
- **[Qwen2.5-Coder-32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)** — script generation
- **[Modal](https://modal.com)** — GPU compute for inference
- **[Gradio](https://gradio.app)** + **[Hugging Face Spaces](https://huggingface.co/spaces)** — frontend hosting

---