---
title: Voice Tools - Speaker Separation & Audio Extraction
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
hardware: zero-gpu
tags:
- audio
- audio-classification
- speech-processing
- speaker-diarization
- speaker-recognition
- audio-to-audio
- voice-activity-detection
- noise-reduction
- gradio
- pytorch
- pyannote-audio
- audio-extraction
- voice-isolation
- podcast
- interview
- video-editing
- content-creation
models:
- pyannote/speaker-diarization-3.1
- pyannote/segmentation-3.0
- MIT/ast-finetuned-audioset-10-10-0.4593
- facebook/demucs
- silero/silero-vad
---

# Voice Tools

**AI-powered speaker separation, voice extraction, and audio denoising for podcasts, interviews, and video editing**

Voice Tools helps content creators, podcasters, and video editors extract specific voices from mixed audio, separate multi-speaker conversations, and remove background noise. It uses open-source models (pyannote.audio, Silero VAD, Demucs) for speaker diarization, speaker identification, voice activity detection, and noise reduction, and it runs either locally on your own hardware or accelerated with HuggingFace ZeroGPU.

## Features

### Three Powerful Workflows

1. **Speaker Separation**: Automatically separate multi-speaker conversations into individual speaker tracks
   - Detects and isolates each speaker in an audio file
   - Outputs separate M4A files for each speaker
   - Ideal for interviews, podcasts, or conversations

2. **Speaker Extraction**: Extract a specific speaker using a reference voice clip
   - Provide a reference clip of the target speaker
   - Finds and extracts all segments matching that speaker
   - High-accuracy speaker identification (90%+ in testing; see Performance)

3. **Voice Denoising**: Remove silence and background noise from audio
   - Automatic voice activity detection (VAD)
   - Background noise reduction
   - Removes long silent gaps while preserving natural pauses
   - Smooth segment transitions with crossfade
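
The gap handling in the denoising workflow can be pictured with a small sketch. This is not the project's actual implementation: the function name `cap_silent_gaps`, the `max_gap` parameter, and the `(start, end)` segment format in seconds are assumptions made for illustration.

```python
def cap_silent_gaps(segments, max_gap=1.0):
    """Re-time speech segments so that no silent gap exceeds max_gap seconds.

    segments: list of (start, end) tuples in seconds, sorted by start time
              (for example, the output of a voice activity detector).
    Returns a list of (source_start, source_end, output_start) tuples.
    Silence before the first segment is dropped entirely.
    """
    placed = []
    cursor = 0.0      # current write position on the output timeline
    prev_end = None   # end of the previous segment on the source timeline
    for start, end in segments:
        if prev_end is not None:
            # Short pauses are kept as-is; long gaps are shortened to max_gap.
            cursor += min(start - prev_end, max_gap)
        placed.append((start, end, cursor))
        cursor += end - start
        prev_end = end
    return placed


if __name__ == "__main__":
    voiced = [(0.0, 2.0), (2.5, 4.0), (10.0, 12.0)]
    for src_start, src_end, out_start in cap_silent_gaps(voiced, max_gap=1.0):
        print(f"source {src_start:.1f}-{src_end:.1f}s -> output at {out_start:.1f}s")
```

In this sketch the 0.5-second pause is preserved, while the 6-second gap is shortened to the cap.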

### Additional Features

- **Speech/Nonverbal Classification**: Extract only speech or capture emotional sounds separately
- **Batch Processing**: Process multiple files at once with progress tracking
- **Dual Interface**: Use via command-line (CLI) for automation or web interface (Gradio) for visual interaction
- **GPU Acceleration**: Supports HuggingFace ZeroGPU for 10-20x faster processing on Spaces
- **Local & Private**: Can run entirely on your computer - no cloud services, no data transmission
- **Flexible Deployment**: Works on CPU-only hardware or with GPU acceleration

## Requirements

- Python 3.11 or higher
- 4-8 core CPU recommended
- ~2GB RAM for processing
- ~600MB for AI model downloads (one-time)
- FFmpeg (for audio format conversion)

## Installation

### 1. Install FFmpeg

**macOS** (via Homebrew):
```bash
brew install ffmpeg
```

**Linux** (Ubuntu/Debian):
```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

**Windows**:
Download a build from [ffmpeg.org](https://ffmpeg.org/download.html) and add the folder containing `ffmpeg.exe` to your `PATH`.

### 2. Install Voice Tools

```bash
# Clone the repository
git clone <repository-url>
cd voice-tools

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .
```

### 3. HuggingFace Authentication (Required)

Some AI models require HuggingFace authentication:

```bash
# Install HuggingFace CLI
pip install huggingface-hub

# Login (follow prompts)
huggingface-cli login

# Accept model licenses at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
# https://huggingface.co/pyannote/embedding
```

## Quick Start

### Web Interface (Recommended for Beginners)

The easiest way to use Voice Tools is through the web interface:

```bash
voice-tools web
```

This opens a browser-based UI at http://localhost:7860 where you can:
- Upload your reference voice and audio files
- Configure extraction settings with sliders
- See real-time progress
- Download results as individual files or a ZIP

**Web Interface Options:**
```bash
# Run on different port
voice-tools web --port 8080

# Create public share link (accessible from other devices)
voice-tools web --share

# Custom host
voice-tools web --host 127.0.0.1 --port 7860
```

### CLI Usage

#### 1. Speaker Separation

**Separate speakers in a conversation:**

```bash
voice-tools separate conversation.m4a
```

**Custom output directory:**

```bash
voice-tools separate interview.m4a --output-dir ./speakers
```

**Specify speaker count:**

```bash
voice-tools separate podcast.m4a --min-speakers 2 --max-speakers 3
```

#### 2. Speaker Extraction

**Extract specific speaker using reference clip:**

```bash
voice-tools extract-speaker reference.m4a target.m4a
```

**Adjust matching sensitivity:**

```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --threshold 0.35 \
  --min-confidence 0.25
```

**Save as separate segments instead of concatenated:**

```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --no-concatenate \
  --output ./segments
```

#### 3. Voice Denoising

**Remove silence and background noise:**

```bash
voice-tools denoise noisy_audio.m4a
```

**Aggressive noise removal:**

```bash
voice-tools denoise noisy_audio.m4a \
  --vad-threshold 0.7 \
  --silence-threshold 1.0
```

**Custom output and format:**

```bash
voice-tools denoise noisy_audio.m4a \
  --output clean_audio.wav \
  --output-format wav
```

#### Legacy Voice Extraction

**Extract speech from single file:**

```bash
voice-tools extract reference.m4a input.m4a
```

**Extract from multiple files:**

```bash
voice-tools extract reference.m4a file1.m4a file2.m4a file3.m4a
```

**Process entire directory:**

```bash
voice-tools extract reference.m4a ./audio_files/
```

**Extract nonverbal sounds:**

```bash
voice-tools extract reference.m4a input.m4a --mode nonverbal
```

**Scan file for voice activity:**

```bash
voice-tools scan input.m4a
```

## HuggingFace Spaces Deployment

Voice Tools supports deployment to HuggingFace Spaces with GPU acceleration using ZeroGPU. This provides 10-20x faster processing compared to CPU-only execution.

### Prerequisites

1. **HuggingFace Account**: Sign up at [huggingface.co](https://huggingface.co/)
2. **HuggingFace Token**: Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) with "read" access
3. **Model Access**: Accept the licenses for required models:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
   - [pyannote/embedding](https://huggingface.co/pyannote/embedding)

### Deployment Steps

1. **Fork this repository** to your GitHub account

2. **Create a new Space** at [huggingface.co/new-space](https://huggingface.co/new-space):
   - Choose a name for your Space
   - Select "Gradio" as the SDK
   - Choose "ZeroGPU" as the hardware (or "CPU basic" for free tier)
   - Make it Public or Private based on your preference

3. **Connect your GitHub repository**:
   - In Space settings, go to "Files and versions"
   - Click "Connect to GitHub"
   - Select your forked repository

4. **Configure environment variables**:
   - Go to Space settings → "Variables and secrets"
   - Add a new secret:
     - Name: `HF_TOKEN`
     - Value: Your HuggingFace token (from prerequisites)

5. **Wait for deployment**: The Space will automatically build and deploy. This takes 5-10 minutes for the first deployment.

6. **Access your Space**: Once deployed, your Space will be available at `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`

### Configuration

The deployment is configured through these files:

- **`.space/README.md`**: Space metadata (title, emoji, SDK version, hardware)
- **`requirements.txt`**: Python dependencies
- **`Dockerfile`**: Custom Docker build for optimized layer caching
- **`app.py`**: Entry point that launches the Gradio interface

### GPU Acceleration

GPU acceleration is automatically enabled when running on ZeroGPU hardware:

- **Speaker Separation**: 90 seconds GPU duration per request
- **Speaker Extraction**: 60 seconds GPU duration per request
- **Voice Denoising**: 45 seconds GPU duration per request

The application automatically detects the ZeroGPU environment and applies the `@spaces.GPU()` decorator to inference functions. Models are loaded on CPU and moved to GPU only during processing for efficient resource usage.
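
The decorator pattern described above can be sketched as follows. This is a minimal illustration, not the project's code: the alias `gpu` and the function `separate_speakers` are hypothetical, and only `spaces.GPU` and its `duration` argument come from the `spaces` package.

```python
# On ZeroGPU, `spaces` has to be imported before torch (or anything else
# that initializes CUDA), so this import sits at the very top of the module.
try:
    import spaces

    gpu = spaces.GPU
except ImportError:
    # Fallback for local runs without the `spaces` package: a no-op decorator
    # that works both as `@gpu` and as `@gpu(duration=...)`.
    def gpu(func=None, *, duration=60):
        def decorate(f):
            return f

        return decorate if func is None else decorate(func)


@gpu(duration=90)  # 90 seconds, matching the speaker separation budget above
def separate_speakers(audio_path):
    # Hypothetical inference entry point. A real implementation would move
    # the model to the GPU here, run it, and move it back to the CPU.
    return f"separated:{audio_path}"


if __name__ == "__main__":
    print(separate_speakers("conversation.m4a"))
```

With the fallback in place, the same module can be imported on a laptop without the `spaces` package and on a ZeroGPU Space without code changes.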

### Troubleshooting Deployment

**Build fails with "No module named 'src.models'":**
- Ensure `.gitignore` uses `/models/` and `/lib/` (with leading slash) to avoid excluding `src/models/` and `src/lib/` directories
- Verify `src/models/` and `src/lib/` are committed to git

**Runtime error: "CUDA has been initialized before importing spaces":**
- Ensure `app.py` imports `spaces` before any PyTorch imports
- Check that service files use a fallback decorator when the `spaces` package is not available

**Models not loading:**
- Verify `HF_TOKEN` secret is set correctly in Space settings
- Ensure you've accepted the model licenses in prerequisites
- Check Space logs for authentication errors
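
Space secrets are exposed to the app as environment variables, so a quick sanity check at startup can narrow down token problems. The helper below is a hypothetical sketch (`check_hf_token` is not part of Voice Tools); it assumes the token is a current user access token, which normally starts with `hf_`.

```python
import os


def check_hf_token(env=None):
    """Return a list of human-readable problems with the HF_TOKEN variable."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    problems = []
    if not token:
        problems.append("HF_TOKEN is not set")
    elif token != token.strip():
        problems.append("HF_TOKEN has leading or trailing whitespace")
    elif not token.startswith("hf_"):
        # Assumption: current user access tokens use the "hf_" prefix.
        problems.append("HF_TOKEN does not start with 'hf_'")
    return problems


if __name__ == "__main__":
    for problem in check_hf_token():
        print(f"warning: {problem}")
```

A passing check only means the variable looks plausible; an expired token or an unaccepted model license will still show up as an authentication error in the Space logs.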

**Out of GPU memory:**
- Reduce audio file duration (split long files)
- Adjust GPU duration parameters in service decorators if needed
- Consider using the CPU hardware tier for very long files

### Local Testing with ZeroGPU

To test ZeroGPU compatibility locally:

```bash
# Set environment variable to simulate Spaces environment
export SPACES_ZERO_GPU=1

# Install spaces package
pip install "spaces>=0.28.3"

# Run the app
python app.py
```

Note: GPU decorators will be applied but actual GPU allocation requires HuggingFace infrastructure.

## How It Works

1. **Voice Activity Detection**: Scans audio to find voice regions and skips silence, which cuts processing time by roughly half
2. **Speaker Identification**: Uses AI to match voice patterns against your reference clip
3. **Speech Classification**: Separates speech from nonverbal sounds using audio classification
4. **Quality Filtering**: Automatically filters out low-quality segments below threshold
5. **Extraction**: Saves only the target voice segments as clean audio files
6. **Optional Noise Removal**: Reduces background music and ambient noise
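
Steps 2 and 4 can be illustrated with a small sketch of embedding-based matching. It rests on assumptions: that each segment is represented by a speaker embedding vector, that similarity is measured with cosine similarity, and that the threshold is a minimum similarity. The names `cosine_similarity` and `match_segments` and the default of 0.5 are illustrative, and the CLI's `--threshold` flag may use different semantics.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_segments(reference, segments, threshold=0.5):
    """Keep the segments whose embedding is close enough to the reference.

    reference: embedding of the reference clip (list of floats).
    segments:  list of (start, end, embedding) tuples.
    Returns a list of (start, end, score) for segments at or above threshold.
    """
    matches = []
    for start, end, embedding in segments:
        score = cosine_similarity(reference, embedding)
        if score >= threshold:
            matches.append((start, end, score))
    return matches


if __name__ == "__main__":
    ref = [1.0, 0.0]
    segs = [(0, 1, [1.0, 0.0]), (1, 2, [0.0, 1.0]), (2, 3, [1.0, 1.0])]
    for start, end, score in match_segments(ref, segs):
        print(f"{start}-{end}s matched with score {score:.2f}")
```

Raising the threshold in a scheme like this trades yield for precision: fewer segments are kept, but those that remain are more likely to belong to the target speaker.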

## Performance

Based on real-world testing with mostly inaudible audio files:

- **Single 45-minute file**: 4-6 minutes processing (with VAD optimization)
- **Batch of 13 files (9.5 hours)**: ~60-90 minutes total
- **Typical yield**: 5-10 minutes of usable audio per 45-minute file
- **Quality**: 90%+ voice identification accuracy, 60%+ noise reduction
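
The processing times above can be restated as a real-time speedup. The calculation below is only arithmetic on the numbers quoted in this section; the helper name `speedup_range` is made up for illustration.

```python
def speedup_range(audio_minutes, fastest_minutes, slowest_minutes):
    """Return the (low, high) real-time speedup for a processing-time range."""
    return audio_minutes / slowest_minutes, audio_minutes / fastest_minutes


single = speedup_range(45, 4, 6)          # one 45-minute file in 4-6 minutes
batch = speedup_range(9.5 * 60, 60, 90)   # 9.5 hours of audio in 60-90 minutes

if __name__ == "__main__":
    print(f"single file: {single[0]:.2f}x to {single[1]:.2f}x real time")
    print(f"batch:       {batch[0]:.2f}x to {batch[1]:.2f}x real time")
```

This works out to roughly 7.5x to 11x real time for a single file and about 6x to 9.5x for the batch.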

## Output Format

- **Format**: M4A (AAC codec)
- **Sample Rate**: 48 kHz (video standard)
- **Bit Rate**: 192 kbps
- **Channels**: Mono
- **Compatible with**: Adobe Premiere, Final Cut Pro, DaVinci Resolve, most video generation tools
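
To produce a file with the same settings from other audio, an FFmpeg invocation along these lines should work. This is a sketch that assumes standard FFmpeg options; the helper `ffmpeg_encode_command` is hypothetical and is not necessarily how Voice Tools encodes its output.

```python
import subprocess


def ffmpeg_encode_command(src, dst):
    """Build an FFmpeg command that encodes src as 48 kHz mono AAC at 192 kbps."""
    return [
        "ffmpeg",
        "-y",             # overwrite dst if it already exists
        "-i", src,        # input file
        "-vn",            # ignore any video stream
        "-ac", "1",       # one audio channel (mono)
        "-ar", "48000",   # 48 kHz sample rate
        "-c:a", "aac",    # AAC codec
        "-b:a", "192k",   # 192 kbps bit rate
        dst,              # output file, e.g. ending in .m4a
    ]


if __name__ == "__main__":
    # Requires FFmpeg on PATH and an existing input.wav.
    subprocess.run(ffmpeg_encode_command("input.wav", "output.m4a"), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.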

## Architecture

- **Models**: Open-source AI models from HuggingFace
  - pyannote/speaker-diarization-3.1 (speaker identification)
  - MIT/ast-finetuned-audioset-10-10-0.4593 (speech/nonverbal classification)
  - Silero VAD v4.0 (voice activity detection)
  - Demucs (music separation)
- **License**: All models use permissive licenses (MIT, Apache 2.0)
- **Processing**: Runs fully locally with no cloud APIs; a CPU is sufficient, and GPU acceleration is optional

## Troubleshooting

**Models not downloading:**
- Ensure HuggingFace authentication is complete
- Check internet connection for first-time download
- Models are cached in the `./models/` directory

**Low extraction yield:**
- This is normal for mostly inaudible source files
- Check extraction statistics in output logs
- Try different reference clips for better voice matching

**Out of memory errors:**
- Process fewer files at once
- Close other applications
- Split long files into shorter segments before processing

**FFmpeg not found:**
- Verify FFmpeg is installed: `ffmpeg -version`
- Add FFmpeg to system PATH

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- [pyannote.audio](https://github.com/pyannote/pyannote-audio) for speaker diarization
- [Silero VAD](https://github.com/snakers4/silero-vad) for voice activity detection
- [Hugging Face](https://huggingface.co/) for model hosting
- [Gradio](https://gradio.app/) for web interface framework