---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

πŸŽ™οΈ Real-Time VAD + Speaker Diarization System

A production-ready system for Voice Activity Detection (VAD) and Speaker Diarization, combining real-time performance with state-of-the-art accuracy.

Python 3.10+ · PyTorch · License: MIT

✨ Features

  • Real-Time VAD: <100ms latency using Silero VAD (40MB model)
  • Speaker Diarization: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
  • Interactive Demo: Beautiful Gradio web interface with visualizations
  • 🎤 Live Recording: Record audio directly in the browser with your microphone
  • 👥 Custom Speaker Names: Replace generic labels with real names (e.g., "Alice", "Bob")
  • 📥 Audio Download: Download processed recordings with results
  • Production Ready: Fully containerized with Docker
  • GPU Accelerated: CUDA 12.1+ support for faster processing
  • Multiple Formats: Export results as JSON, RTTM, or text
  • Modular Architecture: Clean, maintainable, and extensible code

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • A Hugging Face access token (required for the pyannote models)
  • Optional: NVIDIA GPU with CUDA 12.1+ for faster processing

Installation

Option 1: Conda (Recommended)

# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt

Option 2: Virtual Environment

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt

Option 3: Automated Setup

# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh

Hugging Face Token Setup

  1. Get your token: Visit https://huggingface.co/settings/tokens
  2. Accept model conditions: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
  3. Set environment variable:
    export HF_TOKEN='your_token_here'
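
In Python code, the same check can be done up front so the app fails with a clear message when the token is missing. A minimal sketch (`get_hf_token` is a hypothetical helper, not part of the project API):

```python
import os

def get_hf_token() -> str:
    """Return the Hugging Face token from the environment, or fail early."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Run: export HF_TOKEN='your_token_here'"
        )
    return token
```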
    

Running the Demo

Launch Gradio Web Interface:

export HF_TOKEN='your_token_here'
python app.py

Then open http://localhost:7860 in your browser.

Or use the helper script:

./run_app.sh

🎤 New Features

Live Audio Recording:

  • Click the "🎤 Record Live" tab in the interface
  • Record audio directly from your microphone
  • Automatic processing when recording stops
  • Download your recording with results

Custom Speaker Names:

  • Open "⚙️ Advanced Settings"
  • Add speaker name mappings:
    SPEAKER_00: Alice
    SPEAKER_01: Bob
    
  • See custom names in all outputs and visualizations!
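
Under the hood, this kind of renaming is just a post-processing pass over the segment list. A minimal sketch (the `apply_speaker_names` helper and segment dicts are illustrative, not the app's actual implementation):

```python
def apply_speaker_names(segments, name_map):
    """Replace generic SPEAKER_XX labels with user-provided names.

    Labels missing from the mapping are left unchanged.
    """
    return [
        {**seg, "speaker": name_map.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
named = apply_speaker_names(segments, {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"})
print([seg["speaker"] for seg in named])  # ['Alice', 'Bob']
```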

For detailed feature documentation, see FEATURES.md

Command Line Usage:

from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))

πŸ“ Project Structure

VAD+SD/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py          # Package initialization
β”‚   β”œβ”€β”€ vad.py               # Silero VAD wrapper
β”‚   β”œβ”€β”€ diarization.py       # Pyannote diarization wrapper
β”‚   β”œβ”€β”€ pipeline.py          # Integrated pipeline
β”‚   └── utils.py             # Utility functions
β”œβ”€β”€ tests/                   # Unit tests
β”‚   β”œβ”€β”€ test_vad.py
β”‚   β”œβ”€β”€ test_pipeline.py
β”‚   └── __init__.py
β”œβ”€β”€ notebooks/               # Jupyter notebooks
β”‚   └── demo.ipynb
β”œβ”€β”€ benchmarks/              # Benchmark scripts
β”‚   └── run_benchmarks.py
β”œβ”€β”€ app.py                   # Gradio web interface
β”œβ”€β”€ vad_diarization.py       # CLI demo script
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ environment.yml          # Conda environment file
β”œβ”€β”€ Dockerfile               # Container configuration
β”œβ”€β”€ docker-compose.yml       # Docker Compose config
β”œβ”€β”€ .dockerignore           # Docker ignore patterns
β”œβ”€β”€ .gitignore              # Git ignore patterns
β”œβ”€β”€ setup.sh                # Automated setup script
β”œβ”€β”€ run_app.sh              # App launcher script
β”œβ”€β”€ verify_installation.py  # Installation verification
└── README.md               # This file

🐳 Docker Deployment

Build and Run

# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest

Docker Compose

# Set your token in .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up

📊 Performance Benchmarks

VAD Performance

  • Latency: ~9.73ms per second of audio ✅
  • Model Size: 40MB
  • Real-time Factor: ~0.01x (100x faster than real-time)
  • Accuracy: High precision on speech detection
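
The real-time factor follows directly from the measured latency: about 9.73 ms of processing per 1000 ms of audio. A quick check of the arithmetic:

```python
latency_ms_per_audio_second = 9.73          # measured VAD latency per 1 s of audio
rtf = latency_ms_per_audio_second / 1000.0  # processing time / audio duration
speedup = 1.0 / rtf                         # how many times faster than real time
print(f"RTF = {rtf:.4f} (~{speedup:.0f}x faster than real time)")
```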

Diarization Performance

  • DER on FEARLESS STEPS: ~19-20%
  • Processing Speed: Depends on audio length and hardware
  • GPU Memory: ~2-4GB for typical audio
  • Supports: 2-10 speakers (configurable)

System Requirements

  • Minimum: 4GB RAM, CPU-only
  • Recommended: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
  • Optimal: 16GB+ RAM, RTX 3060+ or better

🔧 Configuration

VAD Parameters

from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,              # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,        # Audio sample rate
    min_speech_duration_ms=250, # Minimum speech segment duration
    min_silence_duration_ms=100, # Minimum silence between segments
    use_onnx=False             # Use ONNX runtime for speed
)

Diarization Parameters

from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,          # Fixed number (if known)
    min_speakers=None,          # Minimum speakers
    max_speakers=None           # Maximum speakers
)

Pipeline Configuration

from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,          # VAD sensitivity
    token='your_token',         # HF token
    num_speakers=None,          # Auto-detect speakers
    use_onnx_vad=False         # Use ONNX for VAD
)

📈 Usage Examples

Basic Processing

from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")

Batch Processing

# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')

Custom Configuration

# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,          # More sensitive VAD
    num_speakers=3,             # Fixed 3 speakers
    use_onnx_vad=True          # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)

VAD Only

from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")

Diarization Only

from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")

🧪 Testing

# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py

πŸ“ Output Formats

JSON Format

{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
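
A saved JSON result can be consumed with the standard library alone, for example to total speech time per speaker. The abbreviated payload below follows the schema above:

```python
import json

# Abbreviated result following the JSON schema above
raw = """{
  "speaker_segments": [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00", "duration": 2.7},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01", "duration": 4.2}
  ],
  "metadata": {"num_speakers": 2, "num_segments": 2, "total_speech_time": 6.9}
}"""
result = json.loads(raw)

# Accumulate total speech duration for each speaker label
per_speaker = {}
for seg in result["speaker_segments"]:
    per_speaker[seg["speaker"]] = per_speaker.get(seg["speaker"], 0.0) + seg["duration"]

print(per_speaker)  # {'SPEAKER_00': 2.7, 'SPEAKER_01': 4.2}
```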

RTTM Format

SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.500 4.200 <NA> <NA> SPEAKER_01 <NA> <NA>
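
The RTTM fields are file id, channel, onset, and duration (not end time), padded with `<NA>` placeholders. A small sketch that emits the lines above from segment dicts (`to_rttm` is an illustrative helper; in the project itself, export goes through the pipeline's `save_results`):

```python
def to_rttm(segments, file_id="audio", channel=1):
    """Render speaker segments as RTTM lines (onset + duration, not end time)."""
    lines = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} {channel} {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
print(to_rttm(segments))
```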

Text Timeline

[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00

🎯 Use Cases

  • Meeting Transcription: Identify who spoke when in recordings
  • Podcast Analysis: Track speaker segments and statistics
  • Call Center Analytics: Analyze customer-agent interactions
  • Video Production: Generate speaker labels for editing
  • Research: Speaker diarization for linguistic studies
  • Interview Processing: Separate interviewer and interviewee
  • Broadcast Media: Analyze news programs and talk shows

πŸ› Troubleshooting

Common Issues

1. HF Token Error

Error: Invalid token or model access denied

Solution:

  • Verify the token is set: echo $HF_TOKEN
  • Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
  • If needed, regenerate the token at https://huggingface.co/settings/tokens

2. CUDA Out of Memory

RuntimeError: CUDA out of memory

Solution:

  • Process shorter audio segments
  • Use CPU mode: device='cpu'
  • Reduce batch size

3. Audio Format Not Supported

Error loading audio

Solution: Convert to WAV format using FFmpeg:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

4. DiarizeOutput Error

'DiarizeOutput' object has no attribute 'itertracks'

Solution: This is fixed in the current version. Make sure you have the latest code.

5. Import Errors

ModuleNotFoundError: No module named 'torch'

Solution:

  • Activate your environment: conda activate vad_diarization
  • Reinstall dependencies: pip install -r requirements.txt

🔄 API Compatibility

This project supports both:

  • Pyannote.audio 3.x: Returns Annotation objects
  • Pyannote.audio 4.0+: Returns DiarizeOutput objects

The code automatically detects and handles both formats.
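
One common way to implement such detection is duck typing on the returned object. A sketch, under the assumption that the 4.0+ wrapper exposes the annotation via a `speaker_diarization` attribute (verify the attribute name against your installed pyannote.audio version):

```python
def extract_annotation(output):
    """Normalize a diarization result across pyannote.audio versions."""
    if hasattr(output, "itertracks"):
        return output  # 3.x: pipeline already returned an Annotation
    # 4.0+: result is wrapped; attribute name assumed for this sketch
    return getattr(output, "speaker_diarization", output)
```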

🚀 Deployment Options

Local Development

python app.py

Docker

docker-compose up

Cloud Platforms

Hugging Face Spaces:

  • Fork this repository
  • Create new Space
  • Connect repository
  • Set HF_TOKEN secret
  • Deploy!

AWS/GCP/Azure:

  • Use provided Dockerfile
  • Deploy as container service
  • Configure GPU instances for best performance

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License.

πŸ™ Acknowledgments

📧 Support

For questions or issues:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review the troubleshooting section

Built with ❤️ for the speech processing community