---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# 🎙️ Real-Time VAD + Speaker Diarization System

Production-ready system for **Voice Activity Detection (VAD)** and **Speaker Diarization** with real-time performance and state-of-the-art accuracy.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## ✨ Features

- **Real-Time VAD**: <100ms latency using Silero VAD (40MB model)
- **Speaker Diarization**: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
- **Interactive Demo**: Beautiful Gradio web interface with visualizations
- **🎤 Live Recording**: Record audio directly in the browser with a microphone
- **👥 Custom Speaker Names**: Replace generic labels with real names (e.g., "Alice", "Bob")
- **📥 Audio Download**: Download processed recordings with results
- **Production Ready**: Fully containerized with Docker
- **GPU Accelerated**: CUDA 12.1+ support for faster processing
- **Multiple Formats**: Export results as JSON, RTTM, or text
- **Modular Architecture**: Clean, maintainable, and extensible code

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg
- Hugging Face account with access to [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

### Installation

#### Option 1: Conda (Recommended)

```bash
# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt
```

#### Option 2: Virtual Environment

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt
```

#### Option 3: Automated Setup

```bash
# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh
```

### Hugging Face Token Setup

1. **Get your token**: Visit https://huggingface.co/settings/tokens
2. **Accept model conditions**: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
3. **Set environment variable**:
   ```bash
   export HF_TOKEN='your_token_here'
   ```

### Running the Demo

**Launch Gradio Web Interface:**

```bash
export HF_TOKEN='your_token_here'
python app.py
```

Then open http://localhost:7860 in your browser.

**Or use the helper script:**

```bash
./run_app.sh
```

### 🎤 New Features

**Live Audio Recording:**

- Click the "🎤 Record Live" tab in the interface
- Record audio directly from your microphone
- Automatic processing when recording stops
- Download your recording with results

**Custom Speaker Names:**

- Open "⚙️ Advanced Settings"
- Add speaker name mappings:
  ```
  SPEAKER_00: Alice
  SPEAKER_01: Bob
  ```
- See custom names in all outputs and visualizations!
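Applying such a mapping to diarization output takes only a few lines of plain Python. The sketch below is illustrative, not the app's internal implementation: `parse_name_mapping` and `apply_names` are hypothetical helper names, and the segment dicts follow the `speaker_segments` shape used elsewhere in this README.

```python
# Illustrative sketch: rename generic diarization labels using a
# "SPEAKER_00: Alice"-style mapping. Helper names are hypothetical,
# not part of this project's API.

def parse_name_mapping(text: str) -> dict:
    """Parse 'SPEAKER_XX: Name' lines into a {label: name} dict."""
    mapping = {}
    for line in text.strip().splitlines():
        if ":" in line:
            label, name = line.split(":", 1)
            mapping[label.strip()] = name.strip()
    return mapping

def apply_names(segments: list, mapping: dict) -> list:
    """Return copies of the segments with labels replaced by custom names."""
    return [
        {**seg, "speaker": mapping.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
names = parse_name_mapping("SPEAKER_00: Alice\nSPEAKER_01: Bob")
renamed = apply_names(segments, names)  # "SPEAKER_00" becomes "Alice", etc.
```

Unmapped labels pass through unchanged, so a partial mapping is safe.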
For detailed feature documentation, see [FEATURES.md](FEATURES.md).

**Python API Usage:**

```python
from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))
```

## 📁 Project Structure

```
VAD+SD/
├── src/
│   ├── __init__.py            # Package initialization
│   ├── vad.py                 # Silero VAD wrapper
│   ├── diarization.py         # Pyannote diarization wrapper
│   ├── pipeline.py            # Integrated pipeline
│   └── utils.py               # Utility functions
├── tests/                     # Unit tests
│   ├── test_vad.py
│   ├── test_pipeline.py
│   └── __init__.py
├── notebooks/                 # Jupyter notebooks
│   └── demo.ipynb
├── benchmarks/                # Benchmark scripts
│   └── run_benchmarks.py
├── app.py                     # Gradio web interface
├── vad_diarization.py         # CLI demo script
├── requirements.txt           # Python dependencies
├── environment.yml            # Conda environment file
├── Dockerfile                 # Container configuration
├── docker-compose.yml         # Docker Compose config
├── .dockerignore              # Docker ignore patterns
├── .gitignore                 # Git ignore patterns
├── setup.sh                   # Automated setup script
├── run_app.sh                 # App launcher script
├── verify_installation.py     # Installation verification
└── README.md                  # This file
```

## 🐳 Docker Deployment

### Build and Run

```bash
# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest
```

### Docker Compose

```bash
# Set your token in a .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up
```

## 📊 Performance Benchmarks

### VAD Performance

- **Latency**: ~9.73ms per second of audio ✅
- **Model Size**: 40MB
- **Real-time Factor**: ~0.01x (100x faster than real-time)
- **Accuracy**: High precision on speech detection

### Diarization Performance

- **DER on FEARLESS STEPS**: ~19-20%
- **Processing Speed**: Depends on audio length and hardware
- **GPU Memory**: ~2-4GB for typical audio
- **Supports**: 2-10 speakers (configurable)

### System Requirements

- **Minimum**: 4GB RAM, CPU-only
- **Recommended**: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
- **Optimal**: 16GB+ RAM, RTX 3060 or better

## 🔧 Configuration

### VAD Parameters

```python
from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,               # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,         # Audio sample rate
    min_speech_duration_ms=250,  # Minimum speech segment duration
    min_silence_duration_ms=100, # Minimum silence between segments
    use_onnx=False               # Use ONNX runtime for speed
)
```

### Diarization Parameters

```python
from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,  # Fixed number (if known)
    min_speakers=None,  # Minimum speakers
    max_speakers=None   # Maximum speakers
)
```

### Pipeline Configuration

```python
from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,   # VAD sensitivity
    token='your_token',  # HF token
    num_speakers=None,   # Auto-detect speakers
    use_onnx_vad=False   # Use ONNX for VAD
)
```

## 📈 Usage Examples

### Basic Processing

```python
from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")
```

### Batch Processing

```python
# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')
```

### Custom Configuration

```python
# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,  # More sensitive VAD
    num_speakers=3,     # Fixed 3 speakers
    use_onnx_vad=True   # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)
```

### VAD Only

```python
from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")
```

### Diarization Only

```python
from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```

## 🧪 Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test a specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py
```

## 📝 Output Formats

### JSON Format

```json
{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
```

### RTTM Format

```
SPEAKER audio 1 0.500 2.700 SPEAKER_00
SPEAKER audio 1 3.500 4.200 SPEAKER_01
```

### Text Timeline

```
[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00
```

## 🎯 Use Cases

- **Meeting Transcription**: Identify who spoke when in recordings
- **Podcast Analysis**: Track speaker segments and statistics
- **Call Center Analytics**: Analyze customer-agent interactions
- **Video Production**: Generate speaker labels for editing
- **Research**: Speaker diarization for linguistic studies
- **Interview Processing**: Separate interviewer and interviewee
- **Broadcast Media**: Analyze news programs and talk shows

## 🐛 Troubleshooting

### Common Issues

#### 1. HF Token Error

```
Error: Invalid token or model access denied
```

**Solution**:
- Get a token from https://huggingface.co/settings/tokens
- Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Set the environment variable: `export HF_TOKEN='your_token'`

#### 2. CUDA Out of Memory

```
RuntimeError: CUDA out of memory
```

**Solution**:
- Process shorter audio segments
- Use CPU mode: `device='cpu'`
- Reduce the batch size

#### 3. Audio Format Not Supported

```
Error loading audio
```

**Solution**: Convert to WAV format using FFmpeg:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

#### 4. DiarizeOutput Error

```
'DiarizeOutput' object has no attribute 'itertracks'
```

**Solution**: This is fixed in the current version. Make sure you have the latest code.

#### 5. Import Errors

```
ModuleNotFoundError: No module named 'torch'
```

**Solution**:
- Activate your environment: `conda activate vad_diarization`
- Reinstall dependencies: `pip install -r requirements.txt`

## 🔄 API Compatibility

This project supports both:

- **Pyannote.audio 3.x**: Returns `Annotation` objects
- **Pyannote.audio 4.0+**: Returns `DiarizeOutput` objects

The code automatically detects and handles both formats.

## 🚀 Deployment Options

### Local Development

```bash
python app.py
```

### Docker

```bash
docker-compose up
```

### Cloud Platforms

**Hugging Face Spaces:**
- Fork this repository
- Create a new Space
- Connect the repository
- Set the `HF_TOKEN` secret
- Deploy!

**AWS/GCP/Azure:**
- Use the provided Dockerfile
- Deploy as a container service
- Configure GPU instances for best performance

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- [Silero VAD](https://github.com/snakers4/silero-vad) - Fast and accurate VAD
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio) - Speaker diarization toolkit
- [Gradio](https://gradio.app/) - Web interface framework
- [PyTorch](https://pytorch.org/) - Deep learning framework

## 📧 Support

For questions or issues:

- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section

---

**Built with ❤️ for the speech processing community**
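As a footnote to the API compatibility notes, dual-format handling can be sketched in a few lines. This is a hedged illustration, not this project's exact code: the `speaker_diarization` attribute name on 4.0+ result objects is an assumption here, while `itertracks(yield_label=True)` is the standard pyannote `Annotation` iterator.

```python
# Sketch: normalize pyannote.audio 3.x (Annotation) vs 4.0+ (DiarizeOutput)
# results into plain segment dicts. The `speaker_diarization` attribute
# name is an assumption; adjust it to your installed pyannote version.

def extract_segments(result):
    """Accept either an Annotation or an object wrapping one."""
    # 4.0+ wraps the Annotation; 3.x returns it directly (no wrapper attr).
    annotation = getattr(result, "speaker_diarization", result)
    return [
        {"start": turn.start, "end": turn.end, "speaker": speaker}
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
```

Because `getattr` falls back to the object itself, the same helper works unchanged across both major versions.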