	---
	title: Voice Tools - Speaker Separation & Audio Extraction
	emoji: 🎤
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 5.49.1
	app_file: app.py
	pinned: false
	license: mit
	hardware: zero-gpu
	tags:
	- audio
	- audio-classification
	- speech-processing
	- speaker-diarization
	- speaker-recognition
	- audio-to-audio
	- voice-activity-detection
	- noise-reduction
	- gradio
	- pytorch
	- pyannote-audio
	- audio-extraction
	- voice-isolation
	- podcast
	- interview
	- video-editing
	- content-creation
	models:
	- pyannote/speaker-diarization-3.1
	- pyannote/segmentation-3.0
	- MIT/ast-finetuned-audioset-10-10-0.4593
	- facebook/demucs
	- silero/silero-vad
	---

	# Voice Tools

	AI-powered speaker separation, voice extraction, and audio denoising for podcasts, interviews, and video editing.

	Voice Tools is an audio processing toolkit that helps content creators, podcasters, and video editors extract specific voices from mixed audio, separate multi-speaker conversations, and remove background noise. Using open-source AI models (pyannote.audio, Silero VAD, Demucs), it performs speaker diarization, speaker identification, voice activity detection, and noise reduction. It can run locally on your own hardware or be accelerated with HuggingFace ZeroGPU.

	## Features

	### Three Powerful Workflows

	1. Speaker Separation: Automatically separate multi-speaker conversations into individual speaker tracks
	- Detects and isolates each speaker in an audio file
	- Outputs separate M4A files for each speaker
	- Ideal for interviews, podcasts, or conversations

	2. Speaker Extraction: Extract a specific speaker using a reference voice clip
	- Provide a reference clip of the target speaker
	- Finds and extracts all segments matching that speaker
	- Speaker identification accuracy of 90%+ in the author's testing (see Performance)

	3. Voice Denoising: Remove silence and background noise from audio
	- Automatic voice activity detection (VAD)
	- Background noise reduction
	- Removes long silent gaps while preserving natural pauses
	- Smooth segment transitions with crossfade
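
The crossfade used for smooth segment transitions can be illustrated with a short sketch. This is not the project's implementation: it assumes mono audio held as plain Python lists of float samples and a linear fade, and the function name `crossfade` and its `overlap` parameter are hypothetical.

```python
def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the last `overlap`
    samples of `a` with the first `overlap` samples of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)  # nothing to blend, plain concatenation

    head = list(a[:len(a) - overlap])  # untouched part of the first segment
    tail = list(b[overlap:])           # untouched part of the second segment

    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)    # weight of `b` rises across the overlap
        blended.append(a[len(a) - overlap + i] * (1.0 - w) + b[i] * w)
    return head + blended + tail


# Fade from a constant 1.0 signal into silence over 3 samples.
joined = crossfade([1.0] * 4, [0.0] * 4, overlap=3)
```

A longer overlap gives a gentler transition at the cost of blending more of each segment; real implementations often use an equal-power curve instead of a linear one.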

	### Additional Features

	- Speech/Nonverbal Classification: Extract only speech or capture emotional sounds separately
	- Batch Processing: Process multiple files at once with progress tracking
	- Dual Interface: Use via command-line (CLI) for automation or web interface (Gradio) for visual interaction
	- GPU Acceleration: Supports HuggingFace ZeroGPU for 10-20x faster processing on Spaces compared to CPU-only execution
	- Local & Private: Can run entirely on your computer - no cloud services, no data transmission
	- Flexible Deployment: Works on CPU-only hardware or with GPU acceleration

	## Requirements

	- Python 3.11 or higher
	- 4-8 core CPU recommended
	- ~2GB RAM for processing
	- ~600MB for AI model downloads (one-time)
	- FFmpeg (for audio format conversion)

	## Installation

	### 1. Install FFmpeg

	macOS (via Homebrew):
	```bash
	brew install ffmpeg
	```

	Linux (Ubuntu/Debian):
	```bash
	sudo apt-get update
	sudo apt-get install ffmpeg
	```

	Windows:
	Download from [ffmpeg.org](https://ffmpeg.org/download.html)

	### 2. Install Voice Tools

	```bash
	# Clone the repository
	git clone <repository-url>
	cd voice-tools

	# Create virtual environment
	python3.11 -m venv .venv
	source .venv/bin/activate # On Windows: .venv\Scripts\activate

	# Install dependencies
	pip install -e .
	```

	### 3. HuggingFace Authentication (Required)

	Some AI models require HuggingFace authentication:

	```bash
	# Install HuggingFace CLI
	pip install huggingface-hub

	# Login (follow prompts)
	huggingface-cli login

	# Accept model licenses at:
	# https://huggingface.co/pyannote/speaker-diarization-3.1
	# https://huggingface.co/pyannote/segmentation-3.0
	# https://huggingface.co/pyannote/embedding
	```

	## Quick Start

	### Web Interface (Recommended for Beginners)

	The easiest way to use Voice Tools is through the web interface:

	```bash
	voice-tools web
	```

	This opens a browser-based UI at http://localhost:7860 where you can:
	- Upload your reference voice and audio files
	- Configure extraction settings with sliders
	- See real-time progress
	- Download results as individual files or a ZIP

	Web Interface Options:
	```bash
	# Run on different port
	voice-tools web --port 8080

	# Create public share link (accessible from other devices)
	voice-tools web --share

	# Custom host
	voice-tools web --host 127.0.0.1 --port 7860
	```

	### CLI Usage

	#### 1. Speaker Separation

	Separate speakers in a conversation:

	```bash
	voice-tools separate conversation.m4a
	```

	Custom output directory:

	```bash
	voice-tools separate interview.m4a --output-dir ./speakers
	```

	Specify speaker count:

	```bash
	voice-tools separate podcast.m4a --min-speakers 2 --max-speakers 3
	```

	#### 2. Speaker Extraction

	Extract specific speaker using reference clip:

	```bash
	voice-tools extract-speaker reference.m4a target.m4a
	```

	Adjust matching sensitivity:

	```bash
	voice-tools extract-speaker reference.m4a target.m4a \
	--threshold 0.35 \
	--min-confidence 0.25
	```

	Save as separate segments instead of concatenated:

	```bash
	voice-tools extract-speaker reference.m4a target.m4a \
	--no-concatenate \
	--output ./segments
	```

	#### 3. Voice Denoising

	Remove silence and background noise:

	```bash
	voice-tools denoise noisy_audio.m4a
	```

	Aggressive noise removal:

	```bash
	voice-tools denoise noisy_audio.m4a \
	--vad-threshold 0.7 \
	--silence-threshold 1.0
	```

	Custom output and format:

	```bash
	voice-tools denoise noisy_audio.m4a \
	--output clean_audio.wav \
	--output-format wav
	```
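
To make the idea of removing long silent gaps while preserving natural pauses concrete, here is a minimal sketch. It assumes that voice activity detection yields `(start, end)` speech segments in seconds and that `--silence-threshold` is a maximum gap length in seconds; neither assumption is confirmed by this README, and `compress_gaps` is a hypothetical name.

```python
def compress_gaps(segments, max_gap=1.0):
    """Map speech segments onto an output timeline where every silent gap
    longer than `max_gap` seconds is shortened to exactly `max_gap`.

    Returns (source_start, source_end, output_start) triples."""
    placed = []
    cursor = 0.0       # current position on the output timeline
    prev_end = None
    for start, end in sorted(segments):
        if prev_end is not None:
            gap = max(0.0, start - prev_end)
            cursor += min(gap, max_gap)  # keep short pauses, cap long ones
        placed.append((start, end, cursor))
        cursor += end - start
        prev_end = end
    return placed


# A 0.5 s pause is kept as-is; a 6 s gap is cut down to 1 s.
timeline = compress_gaps([(0.0, 2.0), (2.5, 4.0), (10.0, 11.0)], max_gap=1.0)
```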

	#### Legacy Voice Extraction

	Extract speech from single file:

	```bash
	voice-tools extract reference.m4a input.m4a
	```

	Extract from multiple files:

	```bash
	voice-tools extract reference.m4a file1.m4a file2.m4a file3.m4a
	```

	Process entire directory:

	```bash
	voice-tools extract reference.m4a ./audio_files/
	```

	Extract nonverbal sounds:

	```bash
	voice-tools extract reference.m4a input.m4a --mode nonverbal
	```

	Scan file for voice activity:

	```bash
	voice-tools scan input.m4a
	```
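
For automation, the CLI can be driven from a script. The sketch below only builds and runs the `voice-tools separate ... --output-dir ...` invocation shown earlier; the helper names and the one-output-folder-per-file layout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_separate_commands(input_dir, output_root):
    """Build one `voice-tools separate` command per .m4a file in input_dir."""
    commands = []
    for audio in sorted(Path(input_dir).glob("*.m4a")):
        out_dir = Path(output_root) / audio.stem  # e.g. ./speakers/interview
        commands.append(
            ["voice-tools", "separate", str(audio), "--output-dir", str(out_dir)]
        )
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

Calling `run_all(build_separate_commands("./audio_files", "./speakers"))` would process every M4A file in the folder sequentially.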

	## HuggingFace Spaces Deployment

	Voice Tools supports deployment to HuggingFace Spaces with GPU acceleration using ZeroGPU. This provides 10-20x faster processing compared to CPU-only execution.

	### Prerequisites

	1. HuggingFace Account: Sign up at [huggingface.co](https://huggingface.co/)
	2. HuggingFace Token: Create a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) with "read" access
	3. Model Access: Accept the licenses for required models:
	- [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
	- [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
	- [pyannote/embedding](https://huggingface.co/pyannote/embedding)

	### Deployment Steps

	1. Fork this repository to your GitHub account

	2. Create a new Space at [huggingface.co/new-space](https://huggingface.co/new-space):
	- Choose a name for your Space
	- Select "Gradio" as the SDK
	- Choose "ZeroGPU" as the hardware (or "CPU basic" for free tier)
	- Make it Public or Private based on your preference

	3. Connect your GitHub repository:
	- In Space settings, go to "Files and versions"
	- Click "Connect to GitHub"
	- Select your forked repository

	4. Configure environment variables:
	- Go to Space settings → "Variables and secrets"
	- Add a new secret named `HF_TOKEN` whose value is your HuggingFace token (from the prerequisites)

	5. Wait for deployment: The Space will automatically build and deploy. This takes 5-10 minutes for the first deployment.

	6. Access your Space: Once deployed, your Space will be available at `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`

	### Configuration

	The deployment is configured through these files:

	- `.space/README.md`: Space metadata (title, emoji, SDK version, hardware)
	- `requirements.txt`: Python dependencies
	- `Dockerfile`: Custom Docker build for optimized layer caching
	- `app.py`: Entry point that launches the Gradio interface

	### GPU Acceleration

	GPU acceleration is automatically enabled when running on ZeroGPU hardware:

	- Speaker Separation: 90 seconds GPU duration per request
	- Speaker Extraction: 60 seconds GPU duration per request
	- Voice Denoising: 45 seconds GPU duration per request

	The application automatically detects the ZeroGPU environment and applies the `@spaces.GPU()` decorator to inference functions. Models are loaded on CPU and moved to GPU only during processing for efficient resource usage.
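
The decorator pattern described above might look roughly like the following sketch. It assumes the fallback is a simple no-op decorator used when the `spaces` package is not installed; `separate_speakers` is a hypothetical placeholder, not the project's actual service function.

```python
try:
    import spaces  # on ZeroGPU this import must come before any PyTorch import
    gpu = spaces.GPU
except ImportError:
    def gpu(func=None, *, duration=60):
        """No-op stand-in for spaces.GPU when running outside Spaces."""
        if func is None:
            return lambda f: f  # used as @gpu(duration=...)
        return func             # used as bare @gpu


@gpu(duration=90)  # 90 seconds of GPU time per request, as listed above
def separate_speakers(path):
    # Placeholder body: a real function would move the model to the GPU,
    # run inference, and move it back to the CPU.
    return f"separated:{path}"
```

Keeping the fallback in one place lets the same service code run unchanged on a laptop and on ZeroGPU hardware.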

	### Troubleshooting Deployment

	Build fails with "No module named 'src.models'":
	- Ensure `.gitignore` uses `/models/` and `/lib/` (with leading slash) to avoid excluding `src/models/` and `src/lib/` directories
	- Verify `src/models/` and `src/lib/` are committed to git

	Runtime error: "CUDA has been initialized before importing spaces":
	- Ensure `app.py` imports `spaces` before any PyTorch imports
	- Check that service files use fallback decorator when spaces is not available

	Models not loading:
	- Verify `HF_TOKEN` secret is set correctly in Space settings
	- Ensure you've accepted the model licenses in prerequisites
	- Check Space logs for authentication errors

	Out of GPU memory:
	- Reduce audio file duration (split long files)
	- Adjust GPU duration parameters in service decorators if needed
	- Consider using CPU hardware tier for very long files
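
Splitting a long file before upload can be scripted. This sketch computes fixed-length chunks and builds standard FFmpeg commands (`-ss`, `-t`, `-c copy`) for each one; the 10-minute default and the `_partNNN` naming scheme are arbitrary assumptions, and the helper names are hypothetical.

```python
from pathlib import Path


def chunk_spans(total_seconds, chunk_seconds=600.0):
    """Return (start, length) pairs covering the file in fixed-size chunks."""
    spans = []
    start = 0.0
    while start < total_seconds:
        spans.append((start, min(chunk_seconds, total_seconds - start)))
        start += chunk_seconds
    return spans


def ffmpeg_split_commands(src, total_seconds, chunk_seconds=600.0):
    """Build one ffmpeg command per chunk, copying the audio stream as-is."""
    src_path = Path(src)
    commands = []
    for i, (start, length) in enumerate(chunk_spans(total_seconds, chunk_seconds)):
        out = src_path.with_name(f"{src_path.stem}_part{i:03d}.m4a")
        commands.append(
            ["ffmpeg", "-i", str(src_path), "-ss", str(start),
             "-t", str(length), "-c", "copy", str(out)]
        )
    return commands


# A 25-minute file becomes two 10-minute chunks and one 5-minute chunk.
commands = ffmpeg_split_commands("talk.m4a", total_seconds=1500)
```

Stream copying avoids re-encoding, so cuts land on the nearest packet boundary rather than at the exact timestamp.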

	### Local Testing with ZeroGPU

	To test ZeroGPU compatibility locally:

	```bash
	# Set environment variable to simulate Spaces environment
	export SPACES_ZERO_GPU=1

	# Install spaces package
	pip install "spaces>=0.28.3"

	# Run the app
	python app.py
	```

	Note: GPU decorators will be applied but actual GPU allocation requires HuggingFace infrastructure.

	## How It Works

	1. Voice Activity Detection: Scans audio to find voice regions and skips silence (cuts processing time by about 50%)
	2. Speaker Identification: Uses AI to match voice patterns against your reference clip
	3. Speech Classification: Separates speech from nonverbal sounds using audio classification
	4. Quality Filtering: Automatically filters out low-quality segments below threshold
	5. Extraction: Saves only the target voice segments as clean audio files
	6. Optional Noise Removal: Reduces background music and ambient noise
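
Steps 2 and 4 can be sketched as a similarity check followed by a threshold filter. This is an illustration under stated assumptions, not the project's code: it assumes each segment already has a speaker embedding (a list of floats, which in practice would come from a model such as pyannote), uses cosine similarity as the matching score, and picks 0.8 as an arbitrary cutoff. All names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def select_segments(reference, segments, min_similarity=0.8):
    """Keep (start, end, score) for segments whose embedding is close
    enough to the reference embedding.

    segments: iterable of (start, end, embedding) tuples."""
    kept = []
    for start, end, embedding in segments:
        score = cosine_similarity(reference, embedding)
        if score >= min_similarity:  # quality filter: drop weak matches
            kept.append((start, end, score))
    return kept


reference = [1.0, 0.0]
segments = [
    (0, 1, [2.0, 0.0]),  # same direction as the reference
    (1, 2, [0.0, 1.0]),  # orthogonal, clearly a different voice
    (2, 3, [1.0, 1.0]),  # partial match, below the cutoff
]
matches = select_segments(reference, segments)
```

Raising the cutoff trades yield for precision: fewer segments are kept, but those that remain are more likely to be the target speaker.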

	## Performance

	Based on real-world testing with mostly inaudible audio files:

	- Single 45-minute file: 4-6 minutes processing (with VAD optimization)
	- Batch of 13 files (9.5 hours): ~60-90 minutes total
	- Typical yield: 5-10 minutes of usable audio per 45-minute file
	- Quality: 90%+ voice identification accuracy, 60%+ noise reduction
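
As a quick sanity check, the figures above can be converted into speed relative to real time and into yield fractions. The snippet uses only the numbers listed in this section; the helper name is hypothetical.

```python
def realtime_factor(audio_minutes, processing_minutes):
    """How many minutes of audio are processed per minute of compute."""
    return audio_minutes / processing_minutes


# Single 45-minute file processed in 4-6 minutes.
single = (realtime_factor(45, 6), realtime_factor(45, 4))

# Batch of 9.5 hours (570 minutes) processed in 60-90 minutes.
batch = (realtime_factor(9.5 * 60, 90), realtime_factor(9.5 * 60, 60))

# 5-10 usable minutes out of each 45-minute file.
yield_fraction = (5 / 45, 10 / 45)
```

This works out to about 7.5x to 11x real time for a single file, about 6x to 9.5x for the batch, and a usable yield of roughly 11% to 22% of the source audio.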

	## Output Format

	- Format: M4A (AAC codec)
	- Sample Rate: 48kHz (video standard)
	- Bit Rate: 192kbps
	- Channels: Mono
	- Compatible with: Adobe Premiere, Final Cut Pro, DaVinci Resolve, most video generation tools
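
To reproduce these output settings for other audio, an equivalent FFmpeg invocation can be built as below. The flags are standard FFmpeg options; whether Voice Tools calls FFmpeg with exactly these arguments is an assumption, and `m4a_encode_command` is a hypothetical helper.

```python
def m4a_encode_command(src, dst):
    """Build an ffmpeg command matching the output format listed above."""
    return [
        "ffmpeg",
        "-i", src,
        "-c:a", "aac",    # AAC codec
        "-ar", "48000",   # 48 kHz sample rate
        "-b:a", "192k",   # 192 kbps bit rate
        "-ac", "1",       # mono
        dst,
    ]


command = m4a_encode_command("input.wav", "output.m4a")
```

The list can be passed directly to `subprocess.run(command, check=True)`.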

	## Architecture

	- Models: Open-source AI models from HuggingFace
	  - pyannote/speaker-diarization-3.1 (speaker identification)
	  - MIT/ast-finetuned-audioset-10-10-0.4593 (speech/nonverbal classification)
	  - Silero VAD v4.0 (voice activity detection)
	  - Demucs (music separation)
	- License: All models use permissive licenses (MIT, Apache 2.0)
	- Processing: Runs fully locally on CPU by default, with no cloud APIs; GPU acceleration through ZeroGPU is optional

	## Troubleshooting

	Models not downloading:
	- Ensure HuggingFace authentication is complete
	- Check internet connection for first-time download
	- Models are cached in the `./models/` directory

	Low extraction yield:
	- This is normal for mostly inaudible source files
	- Check extraction statistics in output logs
	- Try different reference clips for better voice matching

	Out of memory errors:
	- Process fewer files at once
	- Close other applications
	- Split long files into shorter segments before processing

	FFmpeg not found:
	- Verify FFmpeg is installed: `ffmpeg -version`
	- Add FFmpeg to system PATH

	## Contributing

	See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

	## License

	MIT License - see [LICENSE](LICENSE) for details.

	## Acknowledgments

	- [pyannote.audio](https://github.com/pyannote/pyannote-audio) for speaker diarization
	- [Silero VAD](https://github.com/snakers4/silero-vad) for voice activity detection
	- [Hugging Face](https://huggingface.co/) for model hosting
	- [Gradio](https://gradio.app/) for web interface framework