---
title: Voice Tools - Speaker Separation & Audio Extraction
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
hardware: zero-gpu
tags:
  - audio
  - audio-classification
  - speech-processing
  - speaker-diarization
  - speaker-recognition
  - audio-to-audio
  - voice-activity-detection
  - noise-reduction
  - gradio
  - pytorch
  - pyannote-audio
  - audio-extraction
  - voice-isolation
  - podcast
  - interview
  - video-editing
  - content-creation
models:
  - pyannote/speaker-diarization-3.1
  - pyannote/segmentation-3.0
  - MIT/ast-finetuned-audioset-10-10-0.4593
  - facebook/demucs
  - silero/silero-vad
---
# Voice Tools

**AI-powered speaker separation, voice extraction, and audio denoising for podcasts, interviews, and video editing**

Voice Tools is an AI-powered audio processing toolkit that helps content creators, podcasters, and video editors extract specific voices from mixed audio, separate multi-speaker conversations, and remove background noise. Using state-of-the-art open-source AI models (pyannote.audio, Silero VAD, Demucs), it performs speaker diarization, speaker identification, voice activity detection, and noise reduction - all running locally on your hardware or accelerated with HuggingFace ZeroGPU.

## Features

### Three Powerful Workflows

1. **Speaker Separation**: Automatically separate multi-speaker conversations into individual speaker tracks
   - Detects and isolates each speaker in an audio file
   - Outputs a separate M4A file for each speaker
   - Ideal for interviews, podcasts, or conversations
2. **Speaker Extraction**: Extract a specific speaker using a reference voice clip
   - Provide a reference clip of the target speaker
   - Finds and extracts all segments matching that speaker
   - High-accuracy speaker identification (90%+)
3. **Voice Denoising**: Remove silence and background noise from audio
   - Automatic voice activity detection (VAD)
   - Background noise reduction
   - Removes long silent gaps while preserving natural pauses
   - Smooth segment transitions with crossfade
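The crossfade step can be sketched in a few lines of pure Python. This is a toy illustration on raw sample lists with a linear ramp; the helper name and ramp shape are assumptions, not the project's actual implementation:

```python
def crossfade(a, b, fade):
    """Blend the tail of segment `a` into the head of segment `b`
    over `fade` samples using a linear ramp (a fades out, b fades in)."""
    assert fade <= len(a) and fade <= len(b)
    head = a[:len(a) - fade]
    mixed = [
        a[len(a) - fade + i] * (1 - (i + 1) / fade) + b[i] * ((i + 1) / fade)
        for i in range(fade)
    ]
    return head + mixed + b[fade:]

# Joining two 8-sample segments with a 4-sample crossfade yields
# len(a) + len(b) - fade = 12 samples.
out = crossfade([1.0] * 8, [0.0] * 8, 4)
print(len(out))  # → 12
```

A short overlap like this avoids the audible clicks that hard cuts between extracted segments would otherwise produce.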
### Additional Features

- **Speech/Nonverbal Classification**: Extract only speech, or capture emotional sounds separately
- **Batch Processing**: Process multiple files at once with progress tracking
- **Dual Interface**: Use the command line (CLI) for automation or the web interface (Gradio) for visual interaction
- **GPU Acceleration**: Supports HuggingFace ZeroGPU for 10-20x faster processing on Spaces
- **Local & Private**: Can run entirely on your computer - no cloud services, no data transmission
- **Flexible Deployment**: Works on CPU-only hardware or with GPU acceleration
## Requirements

- Python 3.11 or higher
- 4-8 core CPU recommended
- ~2GB RAM for processing
- ~600MB for AI model downloads (one-time)
- FFmpeg (for audio format conversion)
## Installation

### 1. Install FFmpeg

**macOS** (via Homebrew):

```bash
brew install ffmpeg
```

**Linux** (Ubuntu/Debian):

```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

**Windows**:
Download from [ffmpeg.org](https://ffmpeg.org/download.html)

### 2. Install Voice Tools

```bash
# Clone the repository
git clone <repository-url>
cd voice-tools

# Create a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .
```

### 3. HuggingFace Authentication (Required)

Some AI models require HuggingFace authentication:

```bash
# Install the HuggingFace CLI
pip install huggingface-hub

# Log in (follow the prompts)
huggingface-cli login

# Accept the model licenses at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/embedding
```
## Quick Start

### Web Interface (Recommended for Beginners)

The easiest way to use Voice Tools is through the web interface:

```bash
voice-tools web
```

This opens a browser-based UI at http://localhost:7860 where you can:

- Upload your reference voice and audio files
- Configure extraction settings with sliders
- See real-time progress
- Download results as individual files or a ZIP

**Web Interface Options:**

```bash
# Run on a different port
voice-tools web --port 8080

# Create a public share link (accessible from other devices)
voice-tools web --share

# Custom host
voice-tools web --host 127.0.0.1 --port 7860
```
### CLI Usage

#### 1. Speaker Separation

**Separate speakers in a conversation:**

```bash
voice-tools separate conversation.m4a
```

**Custom output directory:**

```bash
voice-tools separate interview.m4a --output-dir ./speakers
```

**Specify speaker count:**

```bash
voice-tools separate podcast.m4a --min-speakers 2 --max-speakers 3
```

#### 2. Speaker Extraction

**Extract a specific speaker using a reference clip:**

```bash
voice-tools extract-speaker reference.m4a target.m4a
```

**Adjust matching sensitivity:**

```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --threshold 0.35 \
  --min-confidence 0.25
```
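Conceptually, `--threshold` governs how close a segment's voice embedding must be to the reference speaker's embedding. A minimal sketch, assuming the threshold is an upper bound on cosine distance between embeddings (the `matches` helper and the toy vectors are hypothetical, not the project's actual API):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def matches(reference_emb, segment_emb, threshold=0.35):
    """Accept a segment when its embedding is close enough to the reference."""
    return cosine_distance(reference_emb, segment_emb) < threshold

ref = [0.9, 0.1, 0.4]
same_speaker = [0.85, 0.15, 0.38]   # nearly the same direction -> small distance
other_speaker = [0.1, 0.9, -0.2]    # very different direction -> large distance

print(matches(ref, same_speaker))   # → True
print(matches(ref, other_speaker))  # → False
```

Lowering the threshold makes matching stricter (fewer false positives, more missed segments); raising it does the opposite.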
**Save as separate segments instead of one concatenated file:**

```bash
voice-tools extract-speaker reference.m4a target.m4a \
  --no-concatenate \
  --output ./segments
```

#### 3. Voice Denoising

**Remove silence and background noise:**

```bash
voice-tools denoise noisy_audio.m4a
```

**Aggressive noise removal:**

```bash
voice-tools denoise noisy_audio.m4a \
  --vad-threshold 0.7 \
  --silence-threshold 1.0
```

**Custom output and format:**

```bash
voice-tools denoise noisy_audio.m4a \
  --output clean_audio.wav \
  --output-format wav
```

#### Legacy Voice Extraction

**Extract speech from a single file:**

```bash
voice-tools extract reference.m4a input.m4a
```

**Extract from multiple files:**

```bash
voice-tools extract reference.m4a file1.m4a file2.m4a file3.m4a
```

**Process an entire directory:**

```bash
voice-tools extract reference.m4a ./audio_files/
```

**Extract nonverbal sounds:**

```bash
voice-tools extract reference.m4a input.m4a --mode nonverbal
```

**Scan a file for voice activity:**

```bash
voice-tools scan input.m4a
```
## HuggingFace Spaces Deployment

Voice Tools supports deployment to HuggingFace Spaces with GPU acceleration using ZeroGPU. This provides 10-20x faster processing compared to CPU-only execution.

### Prerequisites

1. **HuggingFace Account**: Sign up at [huggingface.co](https://huggingface.co/)
2. **HuggingFace Token**: Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) with "read" access
3. **Model Access**: Accept the licenses for the required models:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/embedding](https://huggingface.co/pyannote/embedding)

### Deployment Steps

1. **Fork this repository** to your GitHub account
2. **Create a new Space** at [huggingface.co/new-space](https://huggingface.co/new-space):
   - Choose a name for your Space
   - Select "Gradio" as the SDK
   - Choose "ZeroGPU" as the hardware (or "CPU basic" for the free tier)
   - Make it Public or Private based on your preference
3. **Connect your GitHub repository**:
   - In the Space settings, go to "Files and versions"
   - Click "Connect to GitHub"
   - Select your forked repository
4. **Configure environment variables**:
   - Go to Space settings → "Variables and secrets"
   - Add a new secret:
     - Name: `HF_TOKEN`
     - Value: your HuggingFace token (from the prerequisites)
5. **Wait for deployment**: The Space builds and deploys automatically; the first deployment takes 5-10 minutes.
6. **Access your Space**: Once deployed, your Space will be available at `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`

### Configuration

The deployment is configured through these files:

- **`.space/README.md`**: Space metadata (title, emoji, SDK version, hardware)
- **`requirements.txt`**: Python dependencies
- **`Dockerfile`**: Custom Docker build for optimized layer caching
- **`app.py`**: Entry point that launches the Gradio interface
### GPU Acceleration

GPU acceleration is automatically enabled when running on ZeroGPU hardware:

- **Speaker Separation**: 90 seconds of GPU duration per request
- **Speaker Extraction**: 60 seconds of GPU duration per request
- **Voice Denoising**: 45 seconds of GPU duration per request

The application automatically detects the ZeroGPU environment and applies the `@spaces.GPU()` decorator to inference functions. Models are loaded on CPU and moved to GPU only during processing for efficient resource usage.
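The fallback pattern that lets the same code run both on Spaces and locally might look roughly like this. It is a sketch, not the project's actual code; `gpu_task` and `separate_speakers` are hypothetical names:

```python
try:
    import spaces  # only available on HuggingFace Spaces
    gpu_task = spaces.GPU
except ImportError:
    # Local / CPU-only fallback: a no-op decorator with the same shape,
    # so @gpu_task(duration=...) works whether or not `spaces` is installed.
    def gpu_task(duration=60):
        def wrap(fn):
            return fn
        return wrap

@gpu_task(duration=90)  # matches the 90-second budget for speaker separation
def separate_speakers(audio_path):
    # ... load models, run diarization, write per-speaker tracks ...
    return f"processed {audio_path}"

print(separate_speakers("conversation.m4a"))
```

Because the decorator degrades to a no-op, the same service module can be imported on a laptop, a CPU Space, or a ZeroGPU Space without conditional code at every call site.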
### Troubleshooting Deployment

**Build fails with "No module named 'src.models'":**

- Ensure `.gitignore` uses `/models/` and `/lib/` (with a leading slash) so it does not exclude the `src/models/` and `src/lib/` directories
- Verify `src/models/` and `src/lib/` are committed to git

**Runtime error: "CUDA has been initialized before importing spaces":**

- Ensure `app.py` imports `spaces` before any PyTorch imports
- Check that service files use the fallback decorator when `spaces` is not available

**Models not loading:**

- Verify the `HF_TOKEN` secret is set correctly in the Space settings
- Ensure you have accepted the model licenses listed in the prerequisites
- Check the Space logs for authentication errors

**Out of GPU memory:**

- Reduce audio file duration (split long files)
- Adjust the GPU duration parameters in the service decorators if needed
- Consider using the CPU hardware tier for very long files
### Local Testing with ZeroGPU

To test ZeroGPU compatibility locally:

```bash
# Set an environment variable to simulate the Spaces environment
export SPACES_ZERO_GPU=1

# Install the spaces package (quote the spec so the shell
# does not treat ">" as a redirect)
pip install "spaces>=0.28.3"

# Run the app
python app.py
```

Note: GPU decorators will be applied, but actual GPU allocation requires the HuggingFace infrastructure.
## How It Works

1. **Voice Activity Detection**: Scans audio to find voice regions, skipping silence (saves ~50% of processing time)
2. **Speaker Identification**: Uses AI to match voice patterns against your reference clip
3. **Speech Classification**: Separates speech from nonverbal sounds using audio classification
4. **Quality Filtering**: Automatically filters out low-quality segments below the confidence threshold
5. **Extraction**: Saves only the target voice segments as clean audio files
6. **Optional Noise Removal**: Reduces background music and ambient noise
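Step 4 can be illustrated with a minimal sketch, assuming each candidate segment carries a start time, an end time, and a confidence score (the function name and default values are illustrative, not the project's API):

```python
def filter_segments(segments, min_confidence=0.25, min_duration=0.5):
    """Keep only segments that are confident enough and long enough.
    Each segment is a (start, end, confidence) tuple, times in seconds."""
    return [
        (start, end, conf)
        for start, end, conf in segments
        if conf >= min_confidence and (end - start) >= min_duration
    ]

segments = [
    (0.0, 3.2, 0.91),   # clear target speech -> kept
    (3.2, 3.5, 0.88),   # too short -> dropped
    (5.0, 9.0, 0.10),   # low confidence -> dropped
]
print(filter_segments(segments))  # → [(0.0, 3.2, 0.91)]
```

Dropping very short, confident blips alongside low-confidence stretches is what keeps the final output free of single-word fragments.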
## Performance

Based on real-world testing with mostly inaudible audio files:

- **Single 45-minute file**: 4-6 minutes of processing (with VAD optimization)
- **Batch of 13 files (9.5 hours)**: ~60-90 minutes total
- **Typical yield**: 5-10 minutes of usable audio per 45-minute file
- **Quality**: 90%+ voice identification accuracy, 60%+ noise reduction
## Output Format

- **Format**: M4A (AAC codec)
- **Sample Rate**: 48 kHz (video standard)
- **Bit Rate**: 192 kbps
- **Channels**: Mono
- **Compatible with**: Adobe Premiere, Final Cut Pro, DaVinci Resolve, and most video generation tools
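This output spec corresponds to a standard FFmpeg invocation. The sketch below only builds the argument list rather than running FFmpeg, and the exact flags voice-tools uses internally may differ:

```python
def m4a_export_cmd(src, dst):
    """FFmpeg arguments matching the documented output spec:
    AAC in an M4A container, 48 kHz, 192 kbps, mono."""
    return [
        "ffmpeg", "-i", src,
        "-c:a", "aac",      # AAC codec
        "-ar", "48000",     # 48 kHz sample rate
        "-b:a", "192k",     # 192 kbps bit rate
        "-ac", "1",         # mono
        dst,
    ]

print(" ".join(m4a_export_cmd("clean.wav", "clean.m4a")))
```

Passing the list to `subprocess.run` (rather than a shell string) avoids quoting issues with file names containing spaces.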
## Architecture

- **Models**: Open-source AI models from HuggingFace
  - pyannote/speaker-diarization-3.1 (speaker identification)
  - MIT/ast-finetuned-audioset (speech/nonverbal classification)
  - Silero VAD v4.0 (voice activity detection)
  - Demucs (music separation)
- **License**: All models use permissive licenses (MIT, Apache 2.0)
- **Processing**: Fully local - no cloud APIs (runs on CPU, with optional ZeroGPU acceleration on Spaces)
## Troubleshooting

**Models not downloading:**

- Ensure HuggingFace authentication is complete
- Check your internet connection for the first-time download
- Models are cached in the `./models/` directory

**Low extraction yield:**

- This is normal for mostly inaudible source files
- Check the extraction statistics in the output logs
- Try different reference clips for better voice matching

**Out of memory errors:**

- Process fewer files at once
- Close other applications
- Split long files into shorter ones

**FFmpeg not found:**

- Verify FFmpeg is installed: `ffmpeg -version`
- Add FFmpeg to your system PATH
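For the last case, a stdlib-only check can confirm whether FFmpeg is visible to Python-based tools on the current PATH:

```python
import shutil

def ffmpeg_available():
    """Return the resolved path of the ffmpeg binary, or None when it
    is missing from PATH (the "FFmpeg not found" case above)."""
    return shutil.which("ffmpeg")

path = ffmpeg_available()
if path:
    print("ffmpeg found at:", path)
else:
    print("ffmpeg missing from PATH - install it or extend PATH")
```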
## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- [pyannote.audio](https://github.com/pyannote/pyannote-audio) for speaker diarization
- [Silero VAD](https://github.com/snakers4/silero-vad) for voice activity detection
- [Hugging Face](https://huggingface.co/) for model hosting
- [Gradio](https://gradio.app/) for the web interface framework