---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# 🎙️ Real-Time VAD + Speaker Diarization System

Production-ready system for **Voice Activity Detection (VAD)** and **Speaker Diarization** with real-time performance and state-of-the-art accuracy.

[Python 3.10+](https://www.python.org/downloads/) · [PyTorch](https://pytorch.org/) · [MIT License](https://opensource.org/licenses/MIT)

## ✨ Features

- **Real-Time VAD**: <100ms latency using Silero VAD (40MB model)
- **Speaker Diarization**: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
- **Interactive Demo**: Beautiful Gradio web interface with visualizations
- **🎤 Live Recording**: Record audio directly in the browser with your microphone
- **👥 Custom Speaker Names**: Replace generic labels with real names (e.g., "Alice", "Bob")
- **📥 Audio Download**: Download processed recordings with results
- **Production Ready**: Fully containerized with Docker
- **GPU Accelerated**: CUDA 12.1+ support for faster processing
- **Multiple Formats**: Export results as JSON, RTTM, or text
- **Modular Architecture**: Clean, maintainable, and extensible code

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg
- Hugging Face account with access to [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

### Installation

#### Option 1: Conda (Recommended)

```bash
# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt
```

#### Option 2: Virtual Environment

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt
```

#### Option 3: Automated Setup

```bash
# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh
```

### Hugging Face Token Setup

1. **Get your token**: Visit https://huggingface.co/settings/tokens
2. **Accept model conditions**: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
3. **Set environment variable**:

```bash
export HF_TOKEN='your_token_here'
```

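In Python code, it is safer to read the token from the environment than to hard-code it. A minimal sketch using the pipeline constructor documented below:

```python
import os

from src.pipeline import VADDiarizationPipeline

# Fail early with a clear message if the token was not exported.
token = os.environ.get("HF_TOKEN")
if not token:
    raise RuntimeError("HF_TOKEN is not set; run: export HF_TOKEN='your_token_here'")

pipeline = VADDiarizationPipeline(token=token)
```
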
### Running the Demo

**Launch Gradio Web Interface:**

```bash
export HF_TOKEN='your_token_here'
python app.py
```

Then open http://localhost:7860 in your browser.

**Or use the helper script:**

```bash
./run_app.sh
```

### 🎤 New Features

**Live Audio Recording:**
- Click the "🎤 Record Live" tab in the interface
- Record audio directly from your microphone
- Automatic processing when recording stops
- Download your recording with results

**Custom Speaker Names:**
- Open "⚙️ Advanced Settings"
- Add speaker name mappings (for a programmatic equivalent, see the sketch below):
  ```
  SPEAKER_00: Alice
  SPEAKER_01: Bob
  ```
- See custom names in all outputs and visualizations!

For detailed feature documentation, see [FEATURES.md](FEATURES.md)

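Outside the UI, you can apply the same renaming to pipeline results in code. A minimal sketch; the `parse_name_map` helper and the inline dummy `result` are illustrative, not part of the project API:

```python
def parse_name_map(text: str) -> dict[str, str]:
    """Parse 'SPEAKER_00: Alice' style lines into a {label: name} dict."""
    mapping = {}
    for line in text.strip().splitlines():
        label, _, name = line.partition(":")
        if name:
            mapping[label.strip()] = name.strip()
    return mapping

name_map = parse_name_map("SPEAKER_00: Alice\nSPEAKER_01: Bob")

# In practice `result` is the dict returned by pipeline.process_file(...);
# dummy segments are used here so the sketch runs standalone.
result = {"speaker_segments": [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]}

for seg in result["speaker_segments"]:
    seg["speaker"] = name_map.get(seg["speaker"], seg["speaker"])

print(result["speaker_segments"])
```
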
**Command Line Usage:**

```python
from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))
```

## 📁 Project Structure

```
VAD+SD/
├── src/
│   ├── __init__.py             # Package initialization
│   ├── vad.py                  # Silero VAD wrapper
│   ├── diarization.py          # Pyannote diarization wrapper
│   ├── pipeline.py             # Integrated pipeline
│   └── utils.py                # Utility functions
├── tests/                      # Unit tests
│   ├── test_vad.py
│   ├── test_pipeline.py
│   └── __init__.py
├── notebooks/                  # Jupyter notebooks
│   └── demo.ipynb
├── benchmarks/                 # Benchmark scripts
│   └── run_benchmarks.py
├── app.py                      # Gradio web interface
├── vad_diarization.py          # CLI demo script
├── requirements.txt            # Python dependencies
├── environment.yml             # Conda environment file
├── Dockerfile                  # Container configuration
├── docker-compose.yml          # Docker Compose config
├── .dockerignore               # Docker ignore patterns
├── .gitignore                  # Git ignore patterns
├── setup.sh                    # Automated setup script
├── run_app.sh                  # App launcher script
├── verify_installation.py      # Installation verification
└── README.md                   # This file
```

## 🐳 Docker Deployment

### Build and Run

```bash
# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest
```

### Docker Compose

```bash
# Set your token in .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up
```

## 📊 Performance Benchmarks

### VAD Performance

- **Latency**: ~9.73ms per second of audio ✅
- **Model Size**: 40MB
- **Real-time Factor**: ~0.01x (100x faster than real-time)
- **Accuracy**: High precision on speech detection

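To reproduce the latency and real-time-factor numbers on your own hardware, here is a minimal timing sketch (assuming a 16 kHz WAV file and the `SileroVAD` wrapper shown under Configuration):

```python
import time

import torchaudio

from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Audio duration from file metadata
info = torchaudio.info("audio.wav")
audio_seconds = info.num_frames / info.sample_rate

start = time.perf_counter()
vad.process_file("audio.wav")
elapsed = time.perf_counter() - start

# A real-time factor below 1.0 means faster than real time.
print(f"Latency: {1000 * elapsed / audio_seconds:.2f} ms per second of audio")
print(f"Real-time factor: {elapsed / audio_seconds:.3f}x")
```
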
### Diarization Performance

- **DER on FEARLESS STEPS**: ~19-20%
- **Processing Speed**: Depends on audio length and hardware
- **GPU Memory**: ~2-4GB for typical audio
- **Supports**: 2-10 speakers (configurable)

### System Requirements

- **Minimum**: 4GB RAM, CPU-only
- **Recommended**: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
- **Optimal**: 16GB+ RAM, RTX 3060 or better

## 🔧 Configuration

### VAD Parameters

```python
from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,               # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,         # Audio sample rate
    min_speech_duration_ms=250,  # Minimum speech segment duration
    min_silence_duration_ms=100, # Minimum silence between segments
    use_onnx=False               # Use ONNX runtime for speed
)
```

### Diarization Parameters

```python
from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,  # Fixed number (if known)
    min_speakers=None,  # Minimum speakers
    max_speakers=None   # Maximum speakers
)
```

### Pipeline Configuration

```python
from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,   # VAD sensitivity
    token='your_token',  # HF token
    num_speakers=None,   # Auto-detect speakers
    use_onnx_vad=False   # Use ONNX for VAD
)
```

## 📚 Usage Examples

### Basic Processing

```python
from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")
```

### Batch Processing

```python
# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')
```

### Custom Configuration

```python
# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,  # More sensitive VAD
    num_speakers=3,     # Fixed 3 speakers
    use_onnx_vad=True   # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)
```

### VAD Only

```python
from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")
```

### Diarization Only

```python
from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```

## 🧪 Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py
```

## 📄 Output Formats

### JSON Format

```json
{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
```

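Because results are plain JSON, downstream analysis is simple. For example, a short sketch that totals talk time per speaker from a saved result file (the output path is illustrative):

```python
import json
from collections import defaultdict

# Load a result saved with pipeline.save_results(..., format='json')
with open("outputs/audio.json") as f:
    result = json.load(f)

talk_time = defaultdict(float)
for seg in result["speaker_segments"]:
    talk_time[seg["speaker"]] += seg["duration"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s")
```
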
### RTTM Format

```
SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.500 4.200 <NA> <NA> SPEAKER_01 <NA> <NA>
```

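The RTTM columns are `SPEAKER <file-id> <channel> <onset> <duration> <ortho> <subtype> <speaker> <confidence> <lookahead>`, with `<NA>` for unused fields. If you ever need to write RTTM yourself, a minimal sketch over the `speaker_segments` structure shown above:

```python
def to_rttm(result: dict, file_id: str = "audio") -> str:
    """Render speaker segments as RTTM lines (one per segment)."""
    lines = []
    for seg in result["speaker_segments"]:
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)
```
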
### Text Timeline

```
[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00
```

## 🎯 Use Cases

- **Meeting Transcription**: Identify who spoke when in recordings
- **Podcast Analysis**: Track speaker segments and statistics
- **Call Center Analytics**: Analyze customer-agent interactions
- **Video Production**: Generate speaker labels for editing
- **Research**: Speaker diarization for linguistic studies
- **Interview Processing**: Separate interviewer and interviewee
- **Broadcast Media**: Analyze news programs and talk shows

## 🐛 Troubleshooting

### Common Issues

#### 1. HF Token Error

```
Error: Invalid token or model access denied
```

**Solution**:
- Get a token from https://huggingface.co/settings/tokens
- Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Set the environment variable: `export HF_TOKEN='your_token'`

#### 2. CUDA Out of Memory

```
RuntimeError: CUDA out of memory
```

**Solution**:
- Process shorter audio segments
- Use CPU mode: `device='cpu'`
- Reduce batch size

#### 3. Audio Format Not Supported

```
Error loading audio
```

**Solution**: Convert to WAV format using FFmpeg:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

#### 4. DiarizeOutput Error

```
'DiarizeOutput' object has no attribute 'itertracks'
```

**Solution**: This is fixed in the current version; make sure you are running the latest code.

#### 5. Import Errors

```
ModuleNotFoundError: No module named 'torch'
```

**Solution**:
- Activate your environment: `conda activate vad_diarization`
- Reinstall dependencies: `pip install -r requirements.txt`

## 🔄 API Compatibility

This project supports both:

- **Pyannote.audio 3.x**: Returns `Annotation` objects
- **Pyannote.audio 4.0+**: Returns `DiarizeOutput` objects

The code automatically detects and handles both formats.

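For reference, a sketch of the kind of version shim this involves. The `speaker_diarization` attribute on the 4.0+ side is an assumption for illustration; check the attribute name against your pyannote.audio version:

```python
def iter_speaker_turns(output):
    """Yield (start, end, speaker) from either pyannote 3.x or 4.0+ output."""
    annotation = output  # pyannote.audio 3.x returns an Annotation directly
    if not hasattr(output, "itertracks"):
        # pyannote.audio 4.0+ wraps the annotation in an output object;
        # the attribute name below is assumed, adjust for your version.
        annotation = output.speaker_diarization
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        yield turn.start, turn.end, speaker
```
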
## 🚀 Deployment Options

### Local Development

```bash
python app.py
```

### Docker

```bash
docker-compose up
```

### Cloud Platforms

**Hugging Face Spaces:**
- Fork this repository
- Create a new Space
- Connect the repository
- Set the `HF_TOKEN` secret
- Deploy!

**AWS/GCP/Azure:**
- Use the provided Dockerfile
- Deploy as a container service
- Configure GPU instances for best performance

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📝 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- [Silero VAD](https://github.com/snakers4/silero-vad) - Fast and accurate VAD
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio) - Speaker diarization toolkit
- [Gradio](https://gradio.app/) - Web interface framework
- [PyTorch](https://pytorch.org/) - Deep learning framework

## 📧 Support

For questions or issues:

- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section

---

**Built with ❤️ for the speech processing community**