---
title: Voice Tools - Speaker Separation & Audio Extraction
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
hardware: zero-gpu
tags:
- audio
- audio-classification
- speech-processing
- speaker-diarization
- speaker-recognition
- audio-to-audio
- voice-activity-detection
- noise-reduction
- gradio
- pytorch
- pyannote-audio
- audio-extraction
- voice-isolation
- podcast
- interview
- video-editing
- content-creation
models:
- pyannote/speaker-diarization-3.1
- pyannote/segmentation-3.0
- MIT/ast-finetuned-audioset-10-10-0.4593
- facebook/demucs
- silero/silero-vad
---
# Voice Tools
**AI-powered speaker separation, voice extraction, and audio denoising for podcasts, interviews, and video editing**
Voice Tools is an audio processing toolkit that helps content creators, podcasters, and video editors extract specific voices from mixed audio, separate multi-speaker conversations, and remove background noise. It uses open-source AI models (pyannote.audio, Silero VAD, Demucs) for speaker diarization, speaker identification, voice activity detection, and noise reduction, and it runs either locally on your hardware or accelerated with HuggingFace ZeroGPU.
## Features
### Three Powerful Workflows
1. **Speaker Separation**: Automatically separate multi-speaker conversations into individual speaker tracks
   - Detects and isolates each speaker in an audio file
   - Outputs separate M4A files for each speaker
   - Ideal for interviews, podcasts, or conversations
2. **Speaker Extraction**: Extract a specific speaker using a reference voice clip
   - Provide a reference clip of the target speaker
   - Finds and extracts all segments matching that speaker
   - High-accuracy speaker identification (90%+)
3. **Voice Denoising**: Remove silence and background noise from audio
   - Automatic voice activity detection (VAD)
   - Background noise reduction
   - Removes long silent gaps while preserving natural pauses
   - Smooth segment transitions with crossfade
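The crossfade in the last bullet can be pictured with a small example. This is a minimal sketch, assuming a linear fade over plain lists of samples; the function name `crossfade` and the overlap length are illustrative and are not taken from the Voice Tools source.

```python
def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample sequences, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using a linear ramp.

    Hypothetical helper for illustration only.
    """
    if overlap <= 0:
        return a + b
    # Never blend more samples than either segment actually has.
    overlap = min(overlap, len(a), len(b))
    head = a[:len(a) - overlap]
    tail = b[overlap:]
    blended = []
    for i in range(overlap):
        # The weight shifts from favouring `a` to favouring `b`.
        w = (i + 1) / (overlap + 1)
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + blended + tail


if __name__ == "__main__":
    loud = [1.0, 1.0, 1.0, 1.0]
    quiet = [0.0, 0.0, 0.0, 0.0]
    print(crossfade(loud, quiet, overlap=3))
```

Blending the overlap instead of butting segments together avoids the audible click that an abrupt level change can produce at a cut point.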
### Additional Features
- **Speech/Nonverbal Classification**: Extract only speech or capture emotional sounds separately
- **Batch Processing**: Process multiple files at once with progress tracking
- **Dual Interface**: Use via command-line (CLI) for automation or web interface (Gradio) for visual interaction
- **GPU Acceleration**: Supports HuggingFace ZeroGPU for 10-20x faster processing on Spaces
- **Local & Private**: Can run entirely on your computer - no cloud services, no data transmission
- **Flexible Deployment**: Works on CPU-only hardware or with GPU acceleration
## Requirements
- Python 3.11 or higher
- 4-8 core CPU recommended
- ~2GB RAM for processing
- ~600MB for AI model downloads (one-time)
- FFmpeg (for audio format conversion)
## Installation
### 1. Install FFmpeg
**macOS** (via Homebrew):
```bash
brew install ffmpeg
```
**Linux** (Ubuntu/Debian):
```bash
sudo apt-get update
sudo apt-get install ffmpeg
```
**Windows**:
Download from [ffmpeg.org](https://ffmpeg.org/download.html)
### 2. Install Voice Tools
```bash
# Clone the repository
git clone <repository-url>
cd voice-tools
# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e .
```
### 3. HuggingFace Authentication (Required)
Some AI models require HuggingFace authentication:
```bash
# Install HuggingFace CLI
pip install huggingface-hub
# Login (follow prompts)
huggingface-cli login
# Accept model licenses at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/embedding
```
## Quick Start
### Web Interface (Recommended for Beginners)
The easiest way to use Voice Tools is through the web interface:
```bash
voice-tools web
```
This opens a browser-based UI at http://localhost:7860 where you can:
- Upload your reference voice and audio files
- Configure extraction settings with sliders
- See real-time progress
- Download results as individual files or a ZIP
**Web Interface Options:**
```bash
# Run on different port
voice-tools web --port 8080
# Create public share link (accessible from other devices)
voice-tools web --share
# Custom host
voice-tools web --host 127.0.0.1 --port 7860
```
### CLI Usage
#### 1. Speaker Separation
**Separate speakers in a conversation:**
```bash
voice-tools separate conversation.m4a
```
**Custom output directory:**
```bash
voice-tools separate interview.m4a --output-dir ./speakers
```
**Specify speaker count:**
```bash
voice-tools separate podcast.m4a --min-speakers 2 --max-speakers 3
```
#### 2. Speaker Extraction
**Extract specific speaker using reference clip:**
```bash
voice-tools extract-speaker reference.m4a target.m4a
```
**Adjust matching sensitivity:**
```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --threshold 0.35 \
  --min-confidence 0.25
```
**Save as separate segments instead of concatenated:**
```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --no-concatenate \
  --output ./segments
```
#### 3. Voice Denoising
**Remove silence and background noise:**
```bash
voice-tools denoise noisy_audio.m4a
```
**Aggressive noise removal:**
```bash
voice-tools denoise noisy_audio.m4a \
  --vad-threshold 0.7 \
  --silence-threshold 1.0
```
**Custom output and format:**
```bash
voice-tools denoise noisy_audio.m4a \
  --output clean_audio.wav \
  --output-format wav
```
#### Legacy Voice Extraction
**Extract speech from single file:**
```bash
voice-tools extract reference.m4a input.m4a
```
**Extract from multiple files:**
```bash
voice-tools extract reference.m4a file1.m4a file2.m4a file3.m4a
```
**Process entire directory:**
```bash
voice-tools extract reference.m4a ./audio_files/
```
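For readers scripting their own batch runs, the following sketch shows one way a mix of file and directory arguments could be expanded into a flat list of audio files. It is an assumption about the general approach, not the actual Voice Tools implementation: `collect_audio_files` and the `AUDIO_EXTENSIONS` set are hypothetical names.

```python
from pathlib import Path

# Assumed set of extensions; the real tool may accept a different list.
AUDIO_EXTENSIONS = {".m4a", ".wav", ".mp3", ".flac"}


def collect_audio_files(paths: list[str]) -> list[Path]:
    """Expand file and directory arguments into a list of audio files.

    Directories are scanned one level deep and their matches are sorted;
    individual files keep the order in which they were given.
    """
    files: list[Path] = []
    for raw in paths:
        p = Path(raw)
        if p.is_dir():
            matches = (
                f for f in p.iterdir()
                if f.is_file() and f.suffix.lower() in AUDIO_EXTENSIONS
            )
            files.extend(sorted(matches))
        elif p.suffix.lower() in AUDIO_EXTENSIONS:
            files.append(p)
    return files


if __name__ == "__main__":
    for path in collect_audio_files(["./audio_files/", "extra.m4a"]):
        print(path)
```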
**Extract nonverbal sounds:**
```bash
voice-tools extract reference.m4a input.m4a --mode nonverbal
```
**Scan file for voice activity:**
```bash
voice-tools scan input.m4a
```
## HuggingFace Spaces Deployment
Voice Tools supports deployment to HuggingFace Spaces with GPU acceleration using ZeroGPU. This provides 10-20x faster processing compared to CPU-only execution.
### Prerequisites
1. **HuggingFace Account**: Sign up at [huggingface.co](https://huggingface.co/)
2. **HuggingFace Token**: Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) with "read" access
3. **Model Access**: Accept the licenses for required models:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/embedding](https://huggingface.co/pyannote/embedding)
### Deployment Steps
1. **Fork this repository** to your GitHub account
2. **Create a new Space** at [huggingface.co/new-space](https://huggingface.co/new-space):
   - Choose a name for your Space
   - Select "Gradio" as the SDK
   - Choose "ZeroGPU" as the hardware (or "CPU basic" for free tier)
   - Make it Public or Private based on your preference
3. **Connect your GitHub repository**:
   - In Space settings, go to "Files and versions"
   - Click "Connect to GitHub"
   - Select your forked repository
4. **Configure environment variables**:
   - Go to Space settings → "Variables and secrets"
   - Add a new secret:
     - Name: `HF_TOKEN`
     - Value: Your HuggingFace token (from prerequisites)
5. **Wait for deployment**: The Space will automatically build and deploy. This takes 5-10 minutes for the first deployment.
6. **Access your Space**: Once deployed, your Space will be available at `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
### Configuration
The deployment is configured through these files:
- **`.space/README.md`**: Space metadata (title, emoji, SDK version, hardware)
- **`requirements.txt`**: Python dependencies
- **`Dockerfile`**: Custom Docker build for optimized layer caching
- **`app.py`**: Entry point that launches the Gradio interface
### GPU Acceleration
GPU acceleration is automatically enabled when running on ZeroGPU hardware:
- **Speaker Separation**: 90 seconds GPU duration per request
- **Speaker Extraction**: 60 seconds GPU duration per request
- **Voice Denoising**: 45 seconds GPU duration per request
The application automatically detects the ZeroGPU environment and applies the `@spaces.GPU()` decorator to inference functions. Models are loaded on CPU and moved to GPU only during processing for efficient resource usage.
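A common way to make the same code run with and without the `spaces` package is a no-op fallback decorator. The sketch below illustrates that pattern under stated assumptions: the name `gpu` and the function `separate_speakers` are hypothetical, and the actual service files may be organized differently.

```python
# Import `spaces` before any PyTorch import so ZeroGPU can manage CUDA setup.
try:
    import spaces  # present on HuggingFace Spaces; may be missing locally

    gpu = spaces.GPU
except ImportError:

    def gpu(func=None, **_kwargs):
        """No-op stand-in for spaces.GPU.

        Supports both the bare form (@gpu) and the parameterized form
        (@gpu(duration=90)) by returning the function unchanged.
        """
        if func is None:
            return lambda f: f
        return func


@gpu(duration=90)  # seconds of GPU time requested per call on ZeroGPU
def separate_speakers(path: str) -> str:
    # Hypothetical placeholder for the real inference code.
    return f"processed {path}"


if __name__ == "__main__":
    print(separate_speakers("conversation.m4a"))
```

With this arrangement the decorated functions behave like ordinary functions on a laptop, while on ZeroGPU hardware the real decorator requests a GPU for the duration of each call.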
### Troubleshooting Deployment
**Build fails with "No module named 'src.models'":**
- Ensure `.gitignore` uses `/models/` and `/lib/` (with leading slash) to avoid excluding `src/models/` and `src/lib/` directories
- Verify `src/models/` and `src/lib/` are committed to git
**Runtime error: "CUDA has been initialized before importing spaces":**
- Ensure `app.py` imports `spaces` before any PyTorch imports
- Check that service files use a fallback decorator when the `spaces` package is not available
**Models not loading:**
- Verify `HF_TOKEN` secret is set correctly in Space settings
- Ensure you've accepted the model licenses in prerequisites
- Check Space logs for authentication errors
**Out of GPU memory:**
- Reduce audio file duration (split long files)
- Adjust GPU duration parameters in service decorators if needed
- Consider using CPU hardware tier for very long files
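When splitting long files, it helps to plan the chunk boundaries first. The following is a minimal sketch, assuming fixed-length chunks measured in seconds; `plan_chunks` is a hypothetical helper and not part of the Voice Tools CLI.

```python
def plan_chunks(total_seconds: float, max_chunk_seconds: float) -> list[tuple[float, float]]:
    """Return (start, end) pairs that cover the file in chunks no longer
    than `max_chunk_seconds`. The final chunk may be shorter."""
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    chunks: list[tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    # A 45-minute file split into chunks of at most 10 minutes.
    for start, end in plan_chunks(45 * 60, 10 * 60):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Fixed boundaries can fall in the middle of a word, so in practice it may be worth nudging each cut to the nearest silent gap before splitting.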
### Local Testing with ZeroGPU
To test ZeroGPU compatibility locally:
```bash
# Set environment variable to simulate Spaces environment
export SPACES_ZERO_GPU=1
# Install spaces package
pip install "spaces>=0.28.3"  # quoted so the shell does not treat >= as a redirect
# Run the app
python app.py
```
Note: GPU decorators will be applied, but actual GPU allocation requires HuggingFace infrastructure.
## How It Works
1. **Voice Activity Detection**: Scans audio to find voice regions, skipping silence (saves 50% processing time)
2. **Speaker Identification**: Uses AI to match voice patterns against your reference clip
3. **Speech Classification**: Separates speech from nonverbal sounds using audio classification
4. **Quality Filtering**: Automatically filters out low-quality segments below threshold
5. **Extraction**: Saves only the target voice segments as clean audio files
6. **Optional Noise Removal**: Reduces background music and ambient noise
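The quality filtering and extraction steps above can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the `Segment` type, its `confidence` field, and both helper functions are hypothetical, and the threshold simply mirrors the `--min-confidence 0.25` value used in the CLI examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A candidate voice segment (hypothetical structure)."""
    start: float       # seconds
    end: float         # seconds
    confidence: float  # match score against the reference clip, 0.0 to 1.0


def filter_segments(segments: list[Segment], min_confidence: float) -> list[Segment]:
    """Keep only segments whose confidence meets the threshold."""
    return [s for s in segments if s.confidence >= min_confidence]


def total_duration(segments: list[Segment]) -> float:
    """Total length in seconds of the given segments."""
    return sum(s.end - s.start for s in segments)


if __name__ == "__main__":
    candidates = [
        Segment(0.0, 2.0, 0.90),
        Segment(3.0, 4.0, 0.10),  # dropped: below the threshold
        Segment(5.0, 8.0, 0.30),
    ]
    kept = filter_segments(candidates, min_confidence=0.25)
    print(f"kept {len(kept)} segments, {total_duration(kept):.1f}s of audio")
```

Raising the threshold trades yield for precision: fewer segments survive, but those that do are more likely to belong to the target speaker.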
## Performance
Based on real-world testing with mostly inaudible audio files:
- **Single 45-minute file**: 4-6 minutes processing (with VAD optimization)
- **Batch of 13 files (9.5 hours)**: ~60-90 minutes total
- **Typical yield**: 5-10 minutes of usable audio per 45-minute file
- **Quality**: 90%+ voice identification accuracy, 60%+ noise reduction
## Output Format
- **Format**: M4A (AAC codec)
- **Sample Rate**: 48kHz (video standard)
- **Bit Rate**: 192kbps
- **Channels**: Mono
- **Compatible with**: Adobe Premiere, Final Cut Pro, DaVinci Resolve, most video generation tools
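To convert other audio to the same settings, the format above maps directly onto standard FFmpeg options. The sketch below builds such a command; it is one plausible invocation, not necessarily the exact one Voice Tools runs, and `build_ffmpeg_command` is a hypothetical helper.

```python
def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build an FFmpeg argument list for mono 48 kHz AAC at 192 kbps."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", src,       # input file
        "-vn",           # ignore any video stream
        "-c:a", "aac",   # AAC codec (M4A container is chosen by extension)
        "-b:a", "192k",  # audio bit rate
        "-ar", "48000",  # sample rate in Hz
        "-ac", "1",      # mono
        dst,
    ]


if __name__ == "__main__":
    cmd = build_ffmpeg_command("input.wav", "output.m4a")
    print(" ".join(cmd))
    # To actually run it (requires FFmpeg on PATH):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```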
## Architecture
- **Models**: Open-source AI models from HuggingFace
  - pyannote/speaker-diarization-3.1 (speaker identification)
  - MIT/ast-finetuned-audioset-10-10-0.4593 (speech/nonverbal classification)
  - Silero VAD v4.0 (voice activity detection)
  - Demucs (music separation)
- **License**: All models use permissive licenses (MIT, Apache 2.0)
- **Processing**: Runs fully locally on CPU with no cloud APIs; GPU acceleration is optional
## Troubleshooting
**Models not downloading:**
- Ensure HuggingFace authentication is complete
- Check internet connection for first-time download
- Models are cached in the `./models/` directory
**Low extraction yield:**
- This is normal for mostly inaudible source files
- Check extraction statistics in output logs
- Try different reference clips for better voice matching
**Out of memory errors:**
- Process fewer files at once
- Close other applications
- Split long files into shorter ones before processing
**FFmpeg not found:**
- Verify FFmpeg is installed: `ffmpeg -version`
- Add FFmpeg to system PATH
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.
## License
MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) for speaker diarization
- [Silero VAD](https://github.com/snakers4/silero-vad) for voice activity detection
- [Hugging Face](https://huggingface.co/) for model hosting
- [Gradio](https://gradio.app/) for web interface framework