---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# 🎙️ Real-Time VAD + Speaker Diarization System

A production-ready system for Voice Activity Detection (VAD) and speaker diarization, combining real-time performance with state-of-the-art accuracy.
## ✨ Features

- **Real-Time VAD**: <100 ms latency using Silero VAD (40 MB model)
- **Speaker Diarization**: State-of-the-art accuracy with pyannote.audio 3.1/4.0+
- **Interactive Demo**: Gradio web interface with visualizations
- 🎤 **Live Recording**: Record audio directly in the browser with your microphone
- 👥 **Custom Speaker Names**: Replace generic labels with real names (e.g., "Alice", "Bob")
- 📥 **Audio Download**: Download processed recordings with results
- **Production Ready**: Fully containerized with Docker
- **GPU Accelerated**: CUDA 12.1+ support for faster processing
- **Multiple Formats**: Export results as JSON, RTTM, or plain text
- **Modular Architecture**: Clean, maintainable, and extensible code
## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg
- Hugging Face account with access to `pyannote/speaker-diarization-3.1`

### Installation
#### Option 1: Conda (Recommended)

```bash
# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt
```
#### Option 2: Virtual Environment

```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt
```
#### Option 3: Automated Setup

```bash
# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh
```
### Hugging Face Token Setup

1. **Get your token**: visit https://huggingface.co/settings/tokens
2. **Accept the model conditions**: visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
3. **Set the environment variable**:

```bash
export HF_TOKEN='your_token_here'
```
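The app reads the token from this environment variable at startup. If you want to fail fast with a clear message before models start loading, a small check like the following can help (an illustrative snippet, not part of the package):

```python
import os

def require_hf_token() -> str:
    """Return the Hugging Face token from the environment, or fail fast."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise SystemExit("HF_TOKEN is not set; export it as shown above.")
    return token
```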
### Running the Demo

Launch the Gradio web interface:

```bash
export HF_TOKEN='your_token_here'
python app.py
```

Then open http://localhost:7860 in your browser.

Or use the helper script:

```bash
./run_app.sh
```
## 🎤 New Features

**Live Audio Recording**:

- Click the "🎤 Record Live" tab in the interface
- Record audio directly from your microphone
- Processing starts automatically when recording stops
- Download your recording together with the results

**Custom Speaker Names**:

- Open "⚙️ Advanced Settings"
- Add speaker name mappings, e.g. `SPEAKER_00: Alice`, `SPEAKER_01: Bob`
- See the custom names in all outputs and visualizations

For detailed feature documentation, see `FEATURES.md`.
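Applying a speaker-name mapping like the one above amounts to a dictionary substitution over the segment list. A minimal sketch (illustrative, not the app's internal code; segment keys follow the JSON format shown under Output Formats):

```python
def rename_speakers(segments, name_map):
    """Replace generic SPEAKER_XX labels with user-provided names where mapped."""
    return [
        {**seg, "speaker": name_map.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
renamed = rename_speakers(segments, {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"})
print([seg["speaker"] for seg in renamed])  # ['Alice', 'Bob']
```

Unmapped labels pass through unchanged, so a partial mapping is safe.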
**Command Line Usage**:

```python
from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))
```
## 📁 Project Structure

```text
VAD+SD/
├── src/
│   ├── __init__.py            # Package initialization
│   ├── vad.py                 # Silero VAD wrapper
│   ├── diarization.py         # Pyannote diarization wrapper
│   ├── pipeline.py            # Integrated pipeline
│   └── utils.py               # Utility functions
├── tests/                     # Unit tests
│   ├── test_vad.py
│   ├── test_pipeline.py
│   └── __init__.py
├── notebooks/                 # Jupyter notebooks
│   └── demo.ipynb
├── benchmarks/                # Benchmark scripts
│   └── run_benchmarks.py
├── app.py                     # Gradio web interface
├── vad_diarization.py         # CLI demo script
├── requirements.txt           # Python dependencies
├── environment.yml            # Conda environment file
├── Dockerfile                 # Container configuration
├── docker-compose.yml         # Docker Compose config
├── .dockerignore              # Docker ignore patterns
├── .gitignore                 # Git ignore patterns
├── setup.sh                   # Automated setup script
├── run_app.sh                 # App launcher script
├── verify_installation.py     # Installation verification
└── README.md                  # This file
```
## 🐳 Docker Deployment

### Build and Run

```bash
# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest
```

### Docker Compose

```bash
# Set your token in a .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up
```
## 📊 Performance Benchmarks

### VAD Performance

- **Latency**: ~9.73 ms per second of audio ✅
- **Model Size**: 40 MB
- **Real-time Factor**: ~0.01x (about 100x faster than real time)
- **Accuracy**: High precision on speech detection
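For reference, the real-time factor is simply processing time divided by audio duration, so it can be computed from the timing metadata the pipeline reports. A small illustrative helper (not part of the package):

```python
def real_time_factor(processing_ms: float, audio_duration_s: float) -> float:
    """Ratio of processing time to audio duration; < 1.0 means faster than real time."""
    return (processing_ms / 1000.0) / audio_duration_s

# ~9.73 ms to process each second of audio gives an RTF of roughly 0.01
rtf = real_time_factor(processing_ms=9.73, audio_duration_s=1.0)
print(f"RTF: {rtf:.4f} ({1 / rtf:.0f}x faster than real time)")
```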
### Diarization Performance

- **DER on FEARLESS STEPS**: ~19-20%
- **Processing Speed**: Depends on audio length and hardware
- **GPU Memory**: ~2-4 GB for typical audio
- **Supported Speakers**: 2-10 (configurable)

### System Requirements

- **Minimum**: 4 GB RAM, CPU-only
- **Recommended**: 8 GB+ RAM, NVIDIA GPU with 4 GB+ VRAM
- **Optimal**: 16 GB+ RAM, RTX 3060 or better
## 🔧 Configuration

### VAD Parameters

```python
from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,                # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,          # Audio sample rate
    min_speech_duration_ms=250,   # Minimum speech segment duration
    min_silence_duration_ms=100,  # Minimum silence between segments
    use_onnx=False                # Use ONNX runtime for speed
)
```
### Diarization Parameters

```python
from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,   # Fixed number (if known)
    min_speakers=None,   # Minimum speakers
    max_speakers=None    # Maximum speakers
)
```
### Pipeline Configuration

```python
from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,   # VAD sensitivity
    token='your_token',  # HF token
    num_speakers=None,   # Auto-detect speakers
    use_onnx_vad=False   # Use ONNX for VAD
)
```
## 📖 Usage Examples

### Basic Processing

```python
from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")
```
### Batch Processing

```python
# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')
```
### Custom Configuration

```python
# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,   # More sensitive VAD
    num_speakers=3,      # Fixed 3 speakers
    use_onnx_vad=True    # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2   # Override to 2 speakers for this file
)
```
### VAD Only

```python
from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")
```
### Diarization Only

```python
from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```
## 🧪 Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test a specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py
```
## 📄 Output Formats

### JSON Format

```json
{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
```
### RTTM Format

```text
SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.500 4.200 <NA> <NA> SPEAKER_01 <NA> <NA>
```
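Each RTTM `SPEAKER` line carries the file ID, channel, onset, and duration (not end time), with `<NA>` placeholders for unused fields. A minimal sketch of how one JSON segment maps to such a line (illustrative; the pipeline's own exporter handles this):

```python
def to_rttm_line(segment: dict, file_id: str = "audio") -> str:
    """Map one speaker segment (start/end in seconds) to an RTTM SPEAKER line."""
    duration = segment["end"] - segment["start"]
    return (f"SPEAKER {file_id} 1 {segment['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {segment['speaker']} <NA> <NA>")

seg = {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}
print(to_rttm_line(seg))
# SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
```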
### Text Timeline

```text
[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00
```
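The timeline is a direct string rendering of the speaker segments; the formatting can be sketched as follows (illustrative, using the JSON segment keys above):

```python
def to_timeline(segments) -> str:
    """Render speaker segments as a bracketed text timeline, one per line."""
    return "\n".join(
        f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['speaker']}"
        for seg in segments
    )

print(to_timeline([{"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}]))
# [0.50s - 3.20s] SPEAKER_00
```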
## 🎯 Use Cases

- **Meeting Transcription**: Identify who spoke when in recordings
- **Podcast Analysis**: Track speaker segments and statistics
- **Call Center Analytics**: Analyze customer-agent interactions
- **Video Production**: Generate speaker labels for editing
- **Research**: Speaker diarization for linguistic studies
- **Interview Processing**: Separate interviewer and interviewee
- **Broadcast Media**: Analyze news programs and talk shows
## 🔍 Troubleshooting

### Common Issues

**1. HF Token Error**

```text
Error: Invalid token or model access denied
```

Solution:

- Get a token from https://huggingface.co/settings/tokens
- Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Set the environment variable: `export HF_TOKEN='your_token'`

**2. CUDA Out of Memory**

```text
RuntimeError: CUDA out of memory
```

Solution:

- Process shorter audio segments
- Use CPU mode: `device='cpu'`
- Reduce the batch size

**3. Audio Format Not Supported**

```text
Error loading audio
```

Solution: convert to WAV format using FFmpeg:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

**4. DiarizeOutput Error**

```text
'DiarizeOutput' object has no attribute 'itertracks'
```

Solution: this is fixed in the current version; make sure you have the latest code.

**5. Import Errors**

```text
ModuleNotFoundError: No module named 'torch'
```

Solution:

- Activate your environment: `conda activate vad_diarization`
- Reinstall dependencies: `pip install -r requirements.txt`
## 🔌 API Compatibility

This project supports both:

- **pyannote.audio 3.x**: returns `Annotation` objects
- **pyannote.audio 4.0+**: returns `DiarizeOutput` objects

The code automatically detects and handles both formats.
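Such detection can be done by duck typing rather than version checks; a rough sketch under stated assumptions (3.x `Annotation` exposes `itertracks()`; the 4.0+ wrapper attribute name used here is an assumption for illustration, not the project's actual code):

```python
def extract_annotation(output):
    """Return an Annotation-like object regardless of pyannote.audio version.

    If the result already exposes itertracks() (3.x Annotation), use it
    directly; otherwise fall back to a wrapped attribute (4.0+; attribute
    name assumed here for illustration).
    """
    if hasattr(output, "itertracks"):
        return output  # pyannote.audio 3.x: already an Annotation
    if hasattr(output, "speaker_diarization"):
        return output.speaker_diarization  # assumed 4.0+ wrapper attribute
    raise TypeError(f"Unsupported diarization output: {type(output).__name__}")
```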
## 🚀 Deployment Options

### Local Development

```bash
python app.py
```

### Docker

```bash
docker-compose up
```

### Cloud Platforms

**Hugging Face Spaces**:

1. Fork this repository
2. Create a new Space
3. Connect the repository
4. Set the `HF_TOKEN` secret
5. Deploy!

**AWS/GCP/Azure**:

- Use the provided Dockerfile
- Deploy as a container service
- Configure GPU instances for best performance
## 🤝 Contributing

Contributions are welcome! Please feel free to submit a pull request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📜 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- **Silero VAD** - fast and accurate VAD
- **Pyannote.audio** - speaker diarization toolkit
- **Gradio** - web interface framework
- **PyTorch** - deep learning framework

## 📧 Support

For questions or issues:

- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section above

---

Built with ❤️ for the speech processing community