---
title: VAD Speaker Diarization
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# πŸŽ™οΈ Real-Time VAD + Speaker Diarization System
Production-ready system for **Voice Activity Detection (VAD)** and **Speaker Diarization** with real-time performance and state-of-the-art accuracy.
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
## ✨ Features
- **Real-Time VAD**: <100ms latency using Silero VAD (40MB model)
- **Speaker Diarization**: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
- **Interactive Demo**: Beautiful Gradio web interface with visualizations
- **🎀 Live Recording**: Record audio directly in the browser with microphone
- **πŸ‘₯ Custom Speaker Names**: Replace generic labels with real names (e.g., "Alice", "Bob")
- **πŸ“₯ Audio Download**: Download processed recordings with results
- **Production Ready**: Fully containerized with Docker
- **GPU Accelerated**: CUDA 12.1+ support for faster processing
- **Multiple Formats**: Export results as JSON, RTTM, or text
- **Modular Architecture**: Clean, maintainable, and extensible code
## πŸš€ Quick Start
### Prerequisites
- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg
- Hugging Face account with access to [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
### Installation
#### Option 1: Conda (Recommended)
```bash
# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization
# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install dependencies
pip install -r requirements.txt
```
#### Option 2: Virtual Environment
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install other dependencies
pip install -r requirements.txt
```
#### Option 3: Automated Setup
```bash
# For conda users
conda activate vad_diarization
./setup.sh

# For venv users
source venv/bin/activate
./setup.sh
```
### Hugging Face Token Setup
1. **Get your token**: Visit https://huggingface.co/settings/tokens
2. **Accept model conditions**: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
3. **Set environment variable**:
```bash
export HF_TOKEN='your_token_here'
```
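The demo and pipeline code pick this token up from the environment; a minimal sketch of that pattern in Python (not necessarily the repo's exact code):
```python
import os

# Pick up the token exported above; fail early with a clear message if absent
token = os.environ.get("HF_TOKEN")
if not token:
    raise RuntimeError("HF_TOKEN is not set - see the token setup steps above.")
```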
### Running the Demo
**Launch Gradio Web Interface:**
```bash
export HF_TOKEN='your_token_here'
python app.py
```
Then open http://localhost:7860 in your browser.
**Or use the helper script:**
```bash
./run_app.sh
```
### 🎀 New Features
**Live Audio Recording:**
- Click the "🎀 Record Live" tab in the interface
- Record audio directly from your microphone
- Automatic processing when recording stops
- Download your recording with results
**Custom Speaker Names:**
- Open "βš™οΈ Advanced Settings"
- Add speaker name mappings:
```
SPEAKER_00: Alice
SPEAKER_01: Bob
```
- See custom names in all outputs and visualizations! (A minimal renaming sketch follows below.)
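Under the hood this is just a rename pass over the diarization segments; a minimal sketch (the `rename_speakers` helper is illustrative, not the app's actual implementation):
```python
def rename_speakers(segments, name_map):
    """Replace generic SPEAKER_XX labels with user-provided names."""
    # Segments without a mapping keep their original label
    return [
        {**seg, "speaker": name_map.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

# Example: apply the mapping from the Advanced Settings box above
segments = [{"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}]
print(rename_speakers(segments, {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}))
# [{'start': 0.5, 'end': 3.2, 'speaker': 'Alice'}]
```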
For detailed feature documentation, see [FEATURES.md](FEATURES.md)
**Python API Usage:**
```python
from src.pipeline import VADDiarizationPipeline
# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)
# Process audio file
result = pipeline.process_file('audio.wav')
# Print results
print(pipeline.format_output(result))
```
## πŸ“ Project Structure
```
VAD+SD/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py            # Package initialization
β”‚   β”œβ”€β”€ vad.py                 # Silero VAD wrapper
β”‚   β”œβ”€β”€ diarization.py         # Pyannote diarization wrapper
β”‚   β”œβ”€β”€ pipeline.py            # Integrated pipeline
β”‚   └── utils.py               # Utility functions
β”œβ”€β”€ tests/                     # Unit tests
β”‚   β”œβ”€β”€ test_vad.py
β”‚   β”œβ”€β”€ test_pipeline.py
β”‚   └── __init__.py
β”œβ”€β”€ notebooks/                 # Jupyter notebooks
β”‚   └── demo.ipynb
β”œβ”€β”€ benchmarks/                # Benchmark scripts
β”‚   └── run_benchmarks.py
β”œβ”€β”€ app.py                     # Gradio web interface
β”œβ”€β”€ vad_diarization.py         # CLI demo script
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ environment.yml            # Conda environment file
β”œβ”€β”€ Dockerfile                 # Container configuration
β”œβ”€β”€ docker-compose.yml         # Docker Compose config
β”œβ”€β”€ .dockerignore              # Docker ignore patterns
β”œβ”€β”€ .gitignore                 # Git ignore patterns
β”œβ”€β”€ setup.sh                   # Automated setup script
β”œβ”€β”€ run_app.sh                 # App launcher script
β”œβ”€β”€ verify_installation.py     # Installation verification
└── README.md                  # This file
```
## 🐳 Docker Deployment
### Build and Run
```bash
# Build image
docker build -t vad-diarization:latest .
# Run container
docker run -p 7860:7860 \
    -e HF_TOKEN='your_token_here' \
    --gpus all \
    vad-diarization:latest
```
### Docker Compose
```bash
# Set your token in .env file
echo "HF_TOKEN=your_token_here" > .env
# Start services
docker-compose up
```
## πŸ“Š Performance Benchmarks
### VAD Performance
- **Latency**: ~9.73ms per second of audio βœ…
- **Model Size**: 40MB
- **Real-time Factor**: ~0.01x (100x faster than real-time)
- **Accuracy**: High precision on speech detection
### Diarization Performance
- **DER on FEARLESS STEPS**: ~19-20%
- **Processing Speed**: Depends on audio length and hardware
- **GPU Memory**: ~2-4GB for typical audio
- **Supports**: 2-10 speakers (configurable)
### System Requirements
- **Minimum**: 4GB RAM, CPU-only
- **Recommended**: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
- **Optimal**: 16GB+ RAM, RTX 3060+ or better
## πŸ”§ Configuration
### VAD Parameters
```python
from src.vad import SileroVAD
vad = SileroVAD(
    threshold=0.5,                # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,          # Audio sample rate
    min_speech_duration_ms=250,   # Minimum speech segment duration
    min_silence_duration_ms=100,  # Minimum silence between segments
    use_onnx=False                # Use ONNX runtime for speed
)
```
### Diarization Parameters
```python
from src.diarization import SpeakerDiarization
diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,  # Fixed number (if known)
    min_speakers=None,  # Minimum speakers
    max_speakers=None   # Maximum speakers
)
```
### Pipeline Configuration
```python
from src.pipeline import VADDiarizationPipeline
pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,   # VAD sensitivity
    token='your_token',  # HF token
    num_speakers=None,   # Auto-detect speakers
    use_onnx_vad=False   # Use ONNX for VAD
)
```
## πŸ“ˆ Usage Examples
### Basic Processing
```python
from src.pipeline import VADDiarizationPipeline
# Initialize
pipeline = VADDiarizationPipeline(token='your_token')
# Process file
result = pipeline.process_file('meeting.wav')
# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")
# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")
```
### Batch Processing
```python
# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)
# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')
```
### Custom Configuration
```python
# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,  # More sensitive VAD
    num_speakers=3,     # Fixed 3 speakers
    use_onnx_vad=True   # Faster VAD inference
)
# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)
```
### VAD Only
```python
from src.vad import SileroVAD
vad = SileroVAD(threshold=0.5)
# Process audio
timestamps = vad.process_file('audio.wav')
# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")
```
### Diarization Only
```python
from src.diarization import SpeakerDiarization
diarizer = SpeakerDiarization(token='your_token')
# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')
# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```
## πŸ§ͺ Testing
```bash
# Run all tests
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
# Test specific module
python -m pytest tests/test_vad.py -v
# Verify installation
python verify_installation.py
# Run benchmarks
python benchmarks/run_benchmarks.py
```
## πŸ“ Output Formats
### JSON Format
```json
{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
```
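With this schema, downstream analysis takes only a few lines; for example, a sketch computing per-speaker talk time from a saved result (the output path here is hypothetical):
```python
import json
from collections import defaultdict

# Load a result previously written with save_results(..., format='json')
with open("outputs/meeting.json") as f:
    result = json.load(f)

# Sum the duration of each speaker's segments
talk_time = defaultdict(float)
for seg in result["speaker_segments"]:
    talk_time[seg["speaker"]] += seg["duration"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s")
```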
### RTTM Format
```
SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.500 4.200 <NA> <NA> SPEAKER_01 <NA> <NA>
```
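Each RTTM line carries the segment type, file ID, channel, onset, duration, and speaker label, with `<NA>` filling the unused fields. A minimal writer that produces lines like the ones above (a sketch, not the repo's actual `save_results` code):
```python
def write_rttm(segments, file_id, path):
    # RTTM columns: type, file-id, channel, onset, duration,
    # orthography, subtype, speaker, confidence, lookahead
    with open(path, "w") as f:
        for seg in segments:
            duration = seg["end"] - seg["start"]
            f.write(
                f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
                f"<NA> <NA> {seg['speaker']} <NA> <NA>\n"
            )

write_rttm(
    [{"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}],
    file_id="audio",
    path="audio.rttm",
)
```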
### Text Timeline
```
[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00
```
## 🎯 Use Cases
- **Meeting Transcription**: Identify who spoke when in recordings
- **Podcast Analysis**: Track speaker segments and statistics
- **Call Center Analytics**: Analyze customer-agent interactions
- **Video Production**: Generate speaker labels for editing
- **Research**: Speaker diarization for linguistic studies
- **Interview Processing**: Separate interviewer and interviewee
- **Broadcast Media**: Analyze news programs and talk shows
## πŸ› Troubleshooting
### Common Issues
#### 1. HF Token Error
```
Error: Invalid token or model access denied
```
**Solution**:
- Get token from https://huggingface.co/settings/tokens
- Accept model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Set environment variable: `export HF_TOKEN='your_token'`
#### 2. CUDA Out of Memory
```
RuntimeError: CUDA out of memory
```
**Solution**:
- Process shorter audio segments
- Use CPU mode: `device='cpu'`
- Reduce batch size
#### 3. Audio Format Not Supported
```
Error loading audio
```
**Solution**: Convert to WAV format using FFmpeg:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```
#### 4. DiarizeOutput Error
```
'DiarizeOutput' object has no attribute 'itertracks'
```
**Solution**: This is fixed in the current version. Make sure you have the latest code.
#### 5. Import Errors
```
ModuleNotFoundError: No module named 'torch'
```
**Solution**:
- Activate your environment: `conda activate vad_diarization`
- Reinstall dependencies: `pip install -r requirements.txt`
## πŸ”„ API Compatibility
This project supports both:
- **Pyannote.audio 3.x**: Returns `Annotation` objects
- **Pyannote.audio 4.0+**: Returns `DiarizeOutput` objects
The code automatically detects and handles both formats.
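A sketch of how such detection can work via duck typing; `itertracks` is the real pyannote `Annotation` API, while the `speaker_diarization` attribute on the 4.0 output is an assumption for illustration:
```python
def to_annotation(output):
    """Normalize a pyannote 3.x/4.x pipeline result to an Annotation."""
    if hasattr(output, "itertracks"):
        # pyannote 3.x: the pipeline already returned an Annotation
        return output
    if hasattr(output, "speaker_diarization"):
        # pyannote 4.0+: unwrap the DiarizeOutput (attribute name assumed)
        return output.speaker_diarization
    raise TypeError(f"Unsupported diarization output: {type(output)!r}")

# Usage: iterate speaker turns regardless of pyannote version
# for segment, _, speaker in to_annotation(result).itertracks(yield_label=True):
#     print(segment.start, segment.end, speaker)
```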
## πŸš€ Deployment Options
### Local Development
```bash
python app.py
```
### Docker
```bash
docker-compose up
```
### Cloud Platforms
**Hugging Face Spaces:**
- Fork this repository
- Create new Space
- Connect repository
- Set `HF_TOKEN` secret
- Deploy!
**AWS/GCP/Azure:**
- Use provided Dockerfile
- Deploy as container service
- Configure GPU instances for best performance
## 🀝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## πŸ“„ License
This project is licensed under the MIT License.
## πŸ™ Acknowledgments
- [Silero VAD](https://github.com/snakers4/silero-vad) - Fast and accurate VAD
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio) - Speaker diarization toolkit
- [Gradio](https://gradio.app/) - Web interface framework
- [PyTorch](https://pytorch.org/) - Deep learning framework
## πŸ“§ Support
For questions or issues:
- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section
---
**Built with ❀️ for the speech processing community**