---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

πŸŽ™οΈ Real-Time VAD + Speaker Diarization System

A production-ready system for Voice Activity Detection (VAD) and Speaker Diarization, combining real-time performance with state-of-the-art accuracy.

Python 3.10+ · PyTorch · License: MIT

✨ Features

  • Real-Time VAD: <100ms latency using Silero VAD (40MB model)
  • Speaker Diarization: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
  • Interactive Demo: Beautiful Gradio web interface with visualizations
  • 🎤 Live Recording: Record audio directly in the browser with your microphone
  • 👥 Custom Speaker Names: Replace generic labels with real names (e.g., "Alice", "Bob")
  • 📥 Audio Download: Download processed recordings with results
  • Production Ready: Fully containerized with Docker
  • GPU Accelerated: CUDA 12.1+ support for faster processing
  • Multiple Formats: Export results as JSON, RTTM, or text
  • Modular Architecture: Clean, maintainable, and extensible code

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • A Hugging Face access token (required for the pyannote models)
  • Optional: NVIDIA GPU with CUDA 12.1+ for faster processing

Installation

Option 1: Conda (Recommended)

# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt

Option 2: Virtual Environment

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt

Option 3: Automated Setup

# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh

Hugging Face Token Setup

  1. Get your token: Visit https://huggingface.co/settings/tokens
  2. Accept model conditions: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
  3. Set environment variable:
    export HF_TOKEN='your_token_here'
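
In Python code, the same check can be done up front so the app fails with a clear message when the token is missing. A minimal sketch (`get_hf_token` is a hypothetical helper, not part of the project API):

```python
import os

def get_hf_token() -> str:
    """Return the Hugging Face token from the environment, or fail early."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Run: export HF_TOKEN='your_token_here'"
        )
    return token
```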
    

Running the Demo

Launch Gradio Web Interface:

export HF_TOKEN='your_token_here'
python app.py

Then open http://localhost:7860 in your browser.

Or use the helper script:

./run_app.sh

🎤 New Features

Live Audio Recording:

  • Click the "🎤 Record Live" tab in the interface
  • Record audio directly from your microphone
  • Automatic processing when recording stops
  • Download your recording with results

Custom Speaker Names:

  • Open "⚙️ Advanced Settings"
  • Add speaker name mappings:
    SPEAKER_00: Alice
    SPEAKER_01: Bob
    
  • See custom names in all outputs and visualizations!
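
Under the hood, this kind of renaming is just a post-processing pass over the segment list. A minimal sketch (the `apply_speaker_names` helper and segment dicts are illustrative, not the app's actual implementation):

```python
def apply_speaker_names(segments, name_map):
    """Replace generic SPEAKER_XX labels with user-provided names.

    Labels missing from the mapping are left unchanged.
    """
    return [
        {**seg, "speaker": name_map.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
named = apply_speaker_names(segments, {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"})
print([seg["speaker"] for seg in named])  # ['Alice', 'Bob']
```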

For detailed feature documentation, see FEATURES.md

Command Line Usage:

from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))

πŸ“ Project Structure

VAD+SD/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py          # Package initialization
β”‚   β”œβ”€β”€ vad.py               # Silero VAD wrapper
β”‚   β”œβ”€β”€ diarization.py       # Pyannote diarization wrapper
β”‚   β”œβ”€β”€ pipeline.py          # Integrated pipeline
β”‚   └── utils.py             # Utility functions
β”œβ”€β”€ tests/                   # Unit tests
β”‚   β”œβ”€β”€ test_vad.py
β”‚   β”œβ”€β”€ test_pipeline.py
β”‚   └── __init__.py
β”œβ”€β”€ notebooks/               # Jupyter notebooks
β”‚   └── demo.ipynb
β”œβ”€β”€ benchmarks/              # Benchmark scripts
β”‚   └── run_benchmarks.py
β”œβ”€β”€ app.py                   # Gradio web interface
β”œβ”€β”€ vad_diarization.py       # CLI demo script
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ environment.yml          # Conda environment file
β”œβ”€β”€ Dockerfile               # Container configuration
β”œβ”€β”€ docker-compose.yml       # Docker Compose config
β”œβ”€β”€ .dockerignore           # Docker ignore patterns
β”œβ”€β”€ .gitignore              # Git ignore patterns
β”œβ”€β”€ setup.sh                # Automated setup script
β”œβ”€β”€ run_app.sh              # App launcher script
β”œβ”€β”€ verify_installation.py  # Installation verification
└── README.md               # This file

🐳 Docker Deployment

Build and Run

# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest

Docker Compose

# Set your token in .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up

📊 Performance Benchmarks

VAD Performance

  • Latency: ~9.73ms per second of audio ✅
  • Model Size: 40MB
  • Real-time Factor: ~0.01x (100x faster than real-time)
  • Accuracy: High precision on speech detection
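
The real-time factor follows directly from the measured latency: about 9.73 ms of processing per 1000 ms of audio. A quick check of the arithmetic:

```python
latency_ms_per_audio_second = 9.73          # measured VAD latency per 1 s of audio
rtf = latency_ms_per_audio_second / 1000.0  # processing time / audio duration
speedup = 1.0 / rtf                         # how many times faster than real time
print(f"RTF = {rtf:.4f} (~{speedup:.0f}x faster than real time)")
```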

Diarization Performance

  • DER on FEARLESS STEPS: ~19-20%
  • Processing Speed: Depends on audio length and hardware
  • GPU Memory: ~2-4GB for typical audio
  • Supports: 2-10 speakers (configurable)

System Requirements

  • Minimum: 4GB RAM, CPU-only
  • Recommended: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
  • Optimal: 16GB+ RAM, RTX 3060+ or better

🔧 Configuration

VAD Parameters

from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,              # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,        # Audio sample rate
    min_speech_duration_ms=250, # Minimum speech segment duration
    min_silence_duration_ms=100, # Minimum silence between segments
    use_onnx=False             # Use ONNX runtime for speed
)

Diarization Parameters

from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,          # Fixed number (if known)
    min_speakers=None,          # Minimum speakers
    max_speakers=None           # Maximum speakers
)

Pipeline Configuration

from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,          # VAD sensitivity
    token='your_token',         # HF token
    num_speakers=None,          # Auto-detect speakers
    use_onnx_vad=False         # Use ONNX for VAD
)

📈 Usage Examples

Basic Processing

from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")

Batch Processing

# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')

Custom Configuration

# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,          # More sensitive VAD
    num_speakers=3,             # Fixed 3 speakers
    use_onnx_vad=True          # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)

VAD Only

from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")

Diarization Only

from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")

🧪 Testing

# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py

πŸ“ Output Formats

JSON Format

{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
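
A saved JSON result can be consumed with the standard library alone, for example to total speech time per speaker. The abbreviated payload below follows the schema above:

```python
import json

# Abbreviated result following the JSON schema above
raw = """{
  "speaker_segments": [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00", "duration": 2.7},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01", "duration": 4.2}
  ],
  "metadata": {"num_speakers": 2, "num_segments": 2, "total_speech_time": 6.9}
}"""
result = json.loads(raw)

# Accumulate total speech duration for each speaker label
per_speaker = {}
for seg in result["speaker_segments"]:
    per_speaker[seg["speaker"]] = per_speaker.get(seg["speaker"], 0.0) + seg["duration"]

print(per_speaker)  # {'SPEAKER_00': 2.7, 'SPEAKER_01': 4.2}
```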

RTTM Format

SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.500 4.200 <NA> <NA> SPEAKER_01 <NA> <NA>
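
The RTTM fields are file id, channel, onset, and duration (not end time), padded with `<NA>` placeholders. A small sketch that emits the lines above from segment dicts (`to_rttm` is an illustrative helper; in the project itself, export goes through the pipeline's `save_results`):

```python
def to_rttm(segments, file_id="audio", channel=1):
    """Render speaker segments as RTTM lines (onset + duration, not end time)."""
    lines = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} {channel} {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
print(to_rttm(segments))
```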

Text Timeline

[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00

🎯 Use Cases

  • Meeting Transcription: Identify who spoke when in recordings
  • Podcast Analysis: Track speaker segments and statistics
  • Call Center Analytics: Analyze customer-agent interactions
  • Video Production: Generate speaker labels for editing
  • Research: Speaker diarization for linguistic studies
  • Interview Processing: Separate interviewer and interviewee
  • Broadcast Media: Analyze news programs and talk shows

πŸ› Troubleshooting

Common Issues

1. HF Token Error

Error: Invalid token or model access denied

Solution:

  • Verify the token is set: echo $HF_TOKEN
  • Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
  • If needed, regenerate the token at https://huggingface.co/settings/tokens

2. CUDA Out of Memory

RuntimeError: CUDA out of memory

Solution:

  • Process shorter audio segments
  • Use CPU mode: device='cpu'
  • Reduce batch size

3. Audio Format Not Supported

Error loading audio

Solution: Convert to WAV format using FFmpeg:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

4. DiarizeOutput Error

'DiarizeOutput' object has no attribute 'itertracks'

Solution: This is fixed in the current version. Make sure you have the latest code.

5. Import Errors

ModuleNotFoundError: No module named 'torch'

Solution:

  • Activate your environment: conda activate vad_diarization
  • Reinstall dependencies: pip install -r requirements.txt

🔄 API Compatibility

This project supports both:

  • Pyannote.audio 3.x: Returns Annotation objects
  • Pyannote.audio 4.0+: Returns DiarizeOutput objects

The code automatically detects and handles both formats.
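
One common way to implement such detection is duck typing on the returned object. A sketch, under the assumption that the 4.0+ wrapper exposes the annotation via a `speaker_diarization` attribute (verify the attribute name against your installed pyannote.audio version):

```python
def extract_annotation(output):
    """Normalize a diarization result across pyannote.audio versions."""
    if hasattr(output, "itertracks"):
        return output  # 3.x: pipeline already returned an Annotation
    # 4.0+: result is wrapped; attribute name assumed for this sketch
    return getattr(output, "speaker_diarization", output)
```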

🚀 Deployment Options

Local Development

python app.py

Docker

docker-compose up

Cloud Platforms

Hugging Face Spaces:

  • Fork this repository
  • Create new Space
  • Connect repository
  • Set HF_TOKEN secret
  • Deploy!

AWS/GCP/Azure:

  • Use provided Dockerfile
  • Deploy as container service
  • Configure GPU instances for best performance

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License.

πŸ™ Acknowledgments

📧 Support

For questions or issues:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review the troubleshooting section

Built with ❤️ for the speech processing community