---
title: Voice Tools - Speaker Separation & Audio Extraction
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
hardware: zero-gpu
tags:
  - audio
  - audio-classification
  - speech-processing
  - speaker-diarization
  - speaker-recognition
  - audio-to-audio
  - voice-activity-detection
  - noise-reduction
  - gradio
  - pytorch
  - pyannote-audio
  - audio-extraction
  - voice-isolation
  - podcast
  - interview
  - video-editing
  - content-creation
models:
  - pyannote/speaker-diarization-3.1
  - pyannote/segmentation-3.0
  - MIT/ast-finetuned-audioset-10-10-0.4593
  - facebook/demucs
  - silero/silero-vad
---

# Voice Tools

**AI-powered speaker separation, voice extraction, and audio denoising for podcasts, interviews, and video editing**

Voice Tools is an audio processing toolkit that helps content creators, podcasters, and video editors extract specific voices from mixed audio, separate multi-speaker conversations, and remove background noise. Using state-of-the-art open-source AI models (pyannote.audio, Silero VAD, Demucs), it performs speaker diarization, speaker identification, voice activity detection, and noise reduction, all running locally on your hardware or accelerated with HuggingFace ZeroGPU.

## Features

### Three Powerful Workflows

1. **Speaker Separation**: Automatically separate multi-speaker conversations into individual speaker tracks
   - Detects and isolates each speaker in an audio file
   - Outputs separate M4A files for each speaker
   - Ideal for interviews, podcasts, or conversations

2. **Speaker Extraction**: Extract a specific speaker using a reference voice clip
   - Provide a reference clip of the target speaker
   - Finds and extracts all segments matching that speaker
   - High-accuracy speaker identification (90%+)
3. **Voice Denoising**: Remove silence and background noise from audio
   - Automatic voice activity detection (VAD)
   - Background noise reduction
   - Removes long silent gaps while preserving natural pauses
   - Smooth segment transitions with crossfade

### Additional Features

- **Speech/Nonverbal Classification**: Extract only speech, or capture emotional sounds separately
- **Batch Processing**: Process multiple files at once with progress tracking
- **Dual Interface**: Use the command-line interface (CLI) for automation or the web interface (Gradio) for visual interaction
- **GPU Acceleration**: Supports HuggingFace ZeroGPU for 10-20x faster processing on Spaces
- **Local & Private**: Can run entirely on your computer - no cloud services, no data transmission
- **Flexible Deployment**: Works on CPU-only hardware or with GPU acceleration

## Requirements

- Python 3.11 or higher
- 4-8 core CPU recommended
- ~2GB RAM for processing
- ~600MB for AI model downloads (one-time)
- FFmpeg (for audio format conversion)

## Installation

### 1. Install FFmpeg

**macOS** (via Homebrew):

```bash
brew install ffmpeg
```

**Linux** (Ubuntu/Debian):

```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

**Windows**: Download from [ffmpeg.org](https://ffmpeg.org/download.html)

### 2. Install Voice Tools

```bash
# Clone the repository
git clone <repository-url>
cd voice-tools

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .
```
### 3. HuggingFace Authentication (Required)

Some AI models require HuggingFace authentication:

```bash
# Install HuggingFace CLI
pip install huggingface-hub

# Login (follow prompts)
huggingface-cli login

# Accept model licenses at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/embedding
```

## Quick Start

### Web Interface (Recommended for Beginners)

The easiest way to use Voice Tools is through the web interface:

```bash
voice-tools web
```

This opens a browser-based UI at http://localhost:7860 where you can:

- Upload your reference voice and audio files
- Configure extraction settings with sliders
- See real-time progress
- Download results as individual files or a ZIP

**Web Interface Options:**

```bash
# Run on a different port
voice-tools web --port 8080

# Create a public share link (accessible from other devices)
voice-tools web --share

# Custom host
voice-tools web --host 127.0.0.1 --port 7860
```

### CLI Usage

#### 1. Speaker Separation

**Separate speakers in a conversation:**

```bash
voice-tools separate conversation.m4a
```

**Custom output directory:**

```bash
voice-tools separate interview.m4a --output-dir ./speakers
```

**Specify speaker count:**

```bash
voice-tools separate podcast.m4a --min-speakers 2 --max-speakers 3
```

#### 2. Speaker Extraction

**Extract a specific speaker using a reference clip:**

```bash
voice-tools extract-speaker reference.m4a target.m4a
```

**Adjust matching sensitivity:**

```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --threshold 0.35 \
  --min-confidence 0.25
```

**Save as separate segments instead of a concatenated file:**

```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --no-concatenate \
  --output ./segments
```
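Speaker matching of this kind typically compares embedding vectors for each audio segment against the reference clip's embedding. A minimal sketch of threshold-based matching, illustrating the idea behind `--threshold` (the function names here are hypothetical, not the project's actual code):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embeddings: 0.0 for identical direction."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_reference(segment_emb: np.ndarray,
                      reference_emb: np.ndarray,
                      threshold: float = 0.35) -> bool:
    """Keep a segment when its embedding lies within `threshold` of the reference voice."""
    return cosine_distance(segment_emb, reference_emb) <= threshold
```

Lowering the threshold keeps only segments very close to the reference voice (fewer false positives, more missed segments); raising it does the opposite.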
#### 3. Voice Denoising

**Remove silence and background noise:**

```bash
voice-tools denoise noisy_audio.m4a
```

**Aggressive noise removal:**

```bash
voice-tools denoise noisy_audio.m4a \
  --vad-threshold 0.7 \
  --silence-threshold 1.0
```

**Custom output and format:**

```bash
voice-tools denoise noisy_audio.m4a \
  --output clean_audio.wav \
  --output-format wav
```

#### Legacy Voice Extraction

**Extract speech from a single file:**

```bash
voice-tools extract reference.m4a input.m4a
```

**Extract from multiple files:**

```bash
voice-tools extract reference.m4a file1.m4a file2.m4a file3.m4a
```

**Process an entire directory:**

```bash
voice-tools extract reference.m4a ./audio_files/
```

**Extract nonverbal sounds:**

```bash
voice-tools extract reference.m4a input.m4a --mode nonverbal
```

**Scan a file for voice activity:**

```bash
voice-tools scan input.m4a
```

## HuggingFace Spaces Deployment

Voice Tools supports deployment to HuggingFace Spaces with GPU acceleration using ZeroGPU, which provides 10-20x faster processing than CPU-only execution.

### Prerequisites

1. **HuggingFace Account**: Sign up at [huggingface.co](https://huggingface.co/)
2. **HuggingFace Token**: Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) with "read" access
3. **Model Access**: Accept the licenses for the required models:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/embedding](https://huggingface.co/pyannote/embedding)

### Deployment Steps

1. **Fork this repository** to your GitHub account
2. **Create a new Space** at [huggingface.co/new-space](https://huggingface.co/new-space):
   - Choose a name for your Space
   - Select "Gradio" as the SDK
   - Choose "ZeroGPU" as the hardware (or "CPU basic" for the free tier)
   - Make it Public or Private based on your preference
3. **Connect your GitHub repository**:
   - In Space settings, go to "Files and versions"
   - Click "Connect to GitHub"
   - Select your forked repository
4. **Configure environment variables**:
   - Go to Space settings → "Variables and secrets"
   - Add a new secret:
     - Name: `HF_TOKEN`
     - Value: your HuggingFace token (from the prerequisites)
5. **Wait for deployment**: The Space builds and deploys automatically; the first deployment takes 5-10 minutes.
6. **Access your Space**: Once deployed, your Space will be available at `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`

### Configuration

The deployment is configured through these files:

- **`.space/README.md`**: Space metadata (title, emoji, SDK version, hardware)
- **`requirements.txt`**: Python dependencies
- **`Dockerfile`**: Custom Docker build for optimized layer caching
- **`app.py`**: Entry point that launches the Gradio interface

### GPU Acceleration

GPU acceleration is enabled automatically when running on ZeroGPU hardware:

- **Speaker Separation**: 90-second GPU duration per request
- **Speaker Extraction**: 60-second GPU duration per request
- **Voice Denoising**: 45-second GPU duration per request

The application detects the ZeroGPU environment and applies the `@spaces.GPU()` decorator to inference functions. Models are loaded on CPU and moved to GPU only during processing for efficient resource usage.
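A common way to make the same code run both on Spaces and on machines without the `spaces` package is a try/except fallback around the decorator. The sketch below is illustrative (the helper name `gpu` and the placeholder `separate_speakers` are hypothetical, not the project's actual code), using the 90-second speaker-separation budget as the example duration:

```python
import functools

try:
    import spaces  # provided on HuggingFace Spaces

    def gpu(duration: int = 60):
        """Allocate a ZeroGPU slot for `duration` seconds around each call."""
        return spaces.GPU(duration=duration)
except ImportError:
    def gpu(duration: int = 60):
        """No-op fallback so the same code runs where `spaces` is unavailable."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                return fn(*args, **kwargs)
            return wrapper
        return decorator

@gpu(duration=90)  # matches the speaker-separation budget
def separate_speakers(audio_path: str) -> str:
    """Placeholder; a real service would run diarization here."""
    return f"processed {audio_path}"
```

Because `spaces` must be imported before PyTorch on Spaces (see the troubleshooting note below about CUDA initialization), this fallback block belongs at the very top of the entry-point module.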
### Troubleshooting Deployment

**Build fails with "No module named 'src.models'":**

- Ensure `.gitignore` uses `/models/` and `/lib/` (with a leading slash) so the `src/models/` and `src/lib/` directories are not excluded
- Verify `src/models/` and `src/lib/` are committed to git

**Runtime error: "CUDA has been initialized before importing spaces":**

- Ensure `app.py` imports `spaces` before any PyTorch imports
- Check that service files use the fallback decorator when `spaces` is not available

**Models not loading:**

- Verify the `HF_TOKEN` secret is set correctly in Space settings
- Ensure you have accepted the model licenses listed in the prerequisites
- Check the Space logs for authentication errors

**Out of GPU memory:**

- Reduce audio file duration (split long files)
- Adjust the GPU duration parameters in the service decorators if needed
- Consider the CPU hardware tier for very long files

### Local Testing with ZeroGPU

To test ZeroGPU compatibility locally:

```bash
# Set environment variable to simulate the Spaces environment
export SPACES_ZERO_GPU=1

# Install the spaces package (quoted so the shell does not treat > as a redirect)
pip install "spaces>=0.28.3"

# Run the app
python app.py
```

Note: GPU decorators will be applied, but actual GPU allocation requires HuggingFace infrastructure.

## How It Works

1. **Voice Activity Detection**: Scans audio to find voice regions, skipping silence (saves roughly 50% of processing time)
2. **Speaker Identification**: Uses AI to match voice patterns against your reference clip
3. **Speech Classification**: Separates speech from nonverbal sounds using audio classification
4. **Quality Filtering**: Automatically filters out low-quality segments below the threshold
5. **Extraction**: Saves only the target voice segments as clean audio files
6. **Optional Noise Removal**: Reduces background music and ambient noise

## Performance

Based on real-world testing with mostly inaudible audio files:

- **Single 45-minute file**: 4-6 minutes processing (with VAD optimization)
- **Batch of 13 files (9.5 hours)**: ~60-90 minutes total
- **Typical yield**: 5-10 minutes of usable audio per 45-minute file
- **Quality**: 90%+ voice identification accuracy, 60%+ noise reduction

## Output Format

- **Format**: M4A (AAC codec)
- **Sample Rate**: 48kHz (video standard)
- **Bit Rate**: 192kbps
- **Channels**: Mono
- **Compatible with**: Adobe Premiere, Final Cut Pro, DaVinci Resolve, and most video generation tools

## Architecture

- **Models**: Open-source AI models from HuggingFace
  - pyannote/speaker-diarization-3.1 (speaker identification)
  - MIT/ast-finetuned-audioset (speech/nonverbal classification)
  - Silero VAD v4.0 (voice activity detection)
  - Demucs (music separation)
- **License**: All models use permissive licenses (MIT, Apache 2.0)
- **Processing**: Fully local on CPU - no cloud APIs

## Troubleshooting

**Models not downloading:**

- Ensure HuggingFace authentication is complete
- Check your internet connection for the first-time download
- Models are cached in the `./models/` directory

**Low extraction yield:**

- This is normal for mostly inaudible source files
- Check the extraction statistics in the output logs
- Try different reference clips for better voice matching

**Out of memory errors:**

- Process fewer files at once
- Close other applications
- Split long files into shorter ones

**FFmpeg not found:**

- Verify FFmpeg is installed: `ffmpeg -version`
- Add FFmpeg to your system PATH

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments

- [pyannote.audio](https://github.com/pyannote/pyannote-audio) for speaker diarization
- [Silero VAD](https://github.com/snakers4/silero-vad) for voice activity detection
- [Hugging Face](https://huggingface.co/) for model hosting
- [Gradio](https://gradio.app/) for the web interface framework