---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# 🎙️ Real-Time VAD + Speaker Diarization System

Production-ready system for **Voice Activity Detection (VAD)** and **Speaker Diarization** with real-time performance and state-of-the-art accuracy.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## ✨ Features

- **Real-Time VAD**: <100ms latency using Silero VAD (40MB model)
- **Speaker Diarization**: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
- **Interactive Demo**: Beautiful Gradio web interface with visualizations
- **🎤 Live Recording**: Record audio directly in the browser with a microphone
- **👥 Custom Speaker Names**: Replace generic labels with real names (e.g., "Alice", "Bob")
- **📥 Audio Download**: Download processed recordings with results
- **Production Ready**: Fully containerized with Docker
- **GPU Accelerated**: CUDA 12.1+ support for faster processing
- **Multiple Formats**: Export results as JSON, RTTM, or text
- **Modular Architecture**: Clean, maintainable, and extensible code

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg
- Hugging Face account with access to [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

### Installation

#### Option 1: Conda (Recommended)

```bash
# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt
```

#### Option 2: Virtual Environment

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt
```

#### Option 3: Automated Setup

```bash
# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh
```

### Hugging Face Token Setup

1. **Get your token**: Visit https://huggingface.co/settings/tokens
2. **Accept model conditions**: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
3. **Set environment variable**:
   ```bash
   export HF_TOKEN='your_token_here'
   ```

### Running the Demo

**Launch Gradio Web Interface:**

```bash
export HF_TOKEN='your_token_here'
python app.py
```

Then open http://localhost:7860 in your browser.

**Or use the helper script:**

```bash
./run_app.sh
```

### 🎤 New Features

**Live Audio Recording:**

- Click the "🎤 Record Live" tab in the interface
- Record audio directly from your microphone
- Automatic processing when recording stops
- Download your recording with results

**Custom Speaker Names:**

- Open "⚙️ Advanced Settings"
- Add speaker name mappings:
  ```
  SPEAKER_00: Alice
  SPEAKER_01: Bob
  ```
- See custom names in all outputs and visualizations!
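Applying such a mapping to diarization output takes only a few lines of plain Python. The sketch below is illustrative, not the app's internal implementation: `parse_name_mapping` and `apply_names` are hypothetical helper names, and the segment dicts follow the `speaker_segments` shape used elsewhere in this README.

```python
# Illustrative sketch: rename generic diarization labels using a
# "SPEAKER_00: Alice"-style mapping. Helper names are hypothetical,
# not part of this project's API.

def parse_name_mapping(text: str) -> dict:
    """Parse 'SPEAKER_XX: Name' lines into a {label: name} dict."""
    mapping = {}
    for line in text.strip().splitlines():
        if ":" in line:
            label, name = line.split(":", 1)
            mapping[label.strip()] = name.strip()
    return mapping

def apply_names(segments: list, mapping: dict) -> list:
    """Return copies of the segments with labels replaced by custom names."""
    return [
        {**seg, "speaker": mapping.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]
names = parse_name_mapping("SPEAKER_00: Alice\nSPEAKER_01: Bob")
renamed = apply_names(segments, names)  # "SPEAKER_00" becomes "Alice", etc.
```

Unmapped labels pass through unchanged, so a partial mapping is safe.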
For detailed feature documentation, see [FEATURES.md](FEATURES.md).

**Python API Usage:**

```python
from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))
```

## 📁 Project Structure

```
VAD+SD/
├── src/
│   ├── __init__.py            # Package initialization
│   ├── vad.py                 # Silero VAD wrapper
│   ├── diarization.py         # Pyannote diarization wrapper
│   ├── pipeline.py            # Integrated pipeline
│   └── utils.py               # Utility functions
├── tests/                     # Unit tests
│   ├── test_vad.py
│   ├── test_pipeline.py
│   └── __init__.py
├── notebooks/                 # Jupyter notebooks
│   └── demo.ipynb
├── benchmarks/                # Benchmark scripts
│   └── run_benchmarks.py
├── app.py                     # Gradio web interface
├── vad_diarization.py         # CLI demo script
├── requirements.txt           # Python dependencies
├── environment.yml            # Conda environment file
├── Dockerfile                 # Container configuration
├── docker-compose.yml         # Docker Compose config
├── .dockerignore              # Docker ignore patterns
├── .gitignore                 # Git ignore patterns
├── setup.sh                   # Automated setup script
├── run_app.sh                 # App launcher script
├── verify_installation.py     # Installation verification
└── README.md                  # This file
```

## 🐳 Docker Deployment

### Build and Run

```bash
# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest
```

### Docker Compose

```bash
# Set your token in a .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up
```

## 📊 Performance Benchmarks

### VAD Performance

- **Latency**: ~9.73ms per second of audio ✅
- **Model Size**: 40MB
- **Real-time Factor**: ~0.01x (100x faster than real-time)
- **Accuracy**: High precision on speech detection

### Diarization Performance

- **DER on FEARLESS STEPS**: ~19-20%
- **Processing Speed**: Depends on audio length and hardware
- **GPU Memory**: ~2-4GB for typical audio
- **Supports**: 2-10 speakers (configurable)

### System Requirements

- **Minimum**: 4GB RAM, CPU-only
- **Recommended**: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
- **Optimal**: 16GB+ RAM, RTX 3060 or better

## 🔧 Configuration

### VAD Parameters

```python
from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,               # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,         # Audio sample rate
    min_speech_duration_ms=250,  # Minimum speech segment duration
    min_silence_duration_ms=100, # Minimum silence between segments
    use_onnx=False               # Use ONNX runtime for speed
)
```

### Diarization Parameters

```python
from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,  # Fixed number (if known)
    min_speakers=None,  # Minimum speakers
    max_speakers=None   # Maximum speakers
)
```

### Pipeline Configuration

```python
from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,   # VAD sensitivity
    token='your_token',  # HF token
    num_speakers=None,   # Auto-detect speakers
    use_onnx_vad=False   # Use ONNX for VAD
)
```

## 📈 Usage Examples

### Basic Processing

```python
from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")
```

### Batch Processing

```python
# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')
```

### Custom Configuration

```python
# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,  # More sensitive VAD
    num_speakers=3,     # Fixed 3 speakers
    use_onnx_vad=True   # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)
```

### VAD Only

```python
from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")
```

### Diarization Only

```python
from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```

## 🧪 Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test a specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py
```

## 📝 Output Formats

### JSON Format

```json
{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
```

### RTTM Format

```
SPEAKER audio 1 0.500 2.700 SPEAKER_00
SPEAKER audio 1 3.500 4.200 SPEAKER_01
```

### Text Timeline

```
[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00
```

## 🎯 Use Cases

- **Meeting Transcription**: Identify who spoke when in recordings
- **Podcast Analysis**: Track speaker segments and statistics
- **Call Center Analytics**: Analyze customer-agent interactions
- **Video Production**: Generate speaker labels for editing
- **Research**: Speaker diarization for linguistic studies
- **Interview Processing**: Separate interviewer and interviewee
- **Broadcast Media**: Analyze news programs and talk shows

## 🐛 Troubleshooting

### Common Issues

#### 1. HF Token Error

```
Error: Invalid token or model access denied
```

**Solution**:
- Get a token from https://huggingface.co/settings/tokens
- Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Set the environment variable: `export HF_TOKEN='your_token'`

#### 2. CUDA Out of Memory

```
RuntimeError: CUDA out of memory
```

**Solution**:
- Process shorter audio segments
- Use CPU mode: `device='cpu'`
- Reduce the batch size

#### 3. Audio Format Not Supported

```
Error loading audio
```

**Solution**: Convert to WAV format using FFmpeg:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

#### 4. DiarizeOutput Error

```
'DiarizeOutput' object has no attribute 'itertracks'
```

**Solution**: This is fixed in the current version. Make sure you have the latest code.

#### 5. Import Errors

```
ModuleNotFoundError: No module named 'torch'
```

**Solution**:
- Activate your environment: `conda activate vad_diarization`
- Reinstall dependencies: `pip install -r requirements.txt`

## 🔄 API Compatibility

This project supports both:

- **Pyannote.audio 3.x**: Returns `Annotation` objects
- **Pyannote.audio 4.0+**: Returns `DiarizeOutput` objects

The code automatically detects and handles both formats.

## 🚀 Deployment Options

### Local Development

```bash
python app.py
```

### Docker

```bash
docker-compose up
```

### Cloud Platforms

**Hugging Face Spaces:**
- Fork this repository
- Create a new Space
- Connect the repository
- Set the `HF_TOKEN` secret
- Deploy!

**AWS/GCP/Azure:**
- Use the provided Dockerfile
- Deploy as a container service
- Configure GPU instances for best performance

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- [Silero VAD](https://github.com/snakers4/silero-vad) - Fast and accurate VAD
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio) - Speaker diarization toolkit
- [Gradio](https://gradio.app/) - Web interface framework
- [PyTorch](https://pytorch.org/) - Deep learning framework

## 📧 Support

For questions or issues:

- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section

---

**Built with ❤️ for the speech processing community**
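As a footnote to the API compatibility notes, dual-format handling can be sketched in a few lines. This is a hedged illustration, not this project's exact code: the `speaker_diarization` attribute name on 4.0+ result objects is an assumption here, while `itertracks(yield_label=True)` is the standard pyannote `Annotation` iterator.

```python
# Sketch: normalize pyannote.audio 3.x (Annotation) vs 4.0+ (DiarizeOutput)
# results into plain segment dicts. The `speaker_diarization` attribute
# name is an assumption; adjust it to your installed pyannote version.

def extract_segments(result):
    """Accept either an Annotation or an object wrapping one."""
    # 4.0+ wraps the Annotation; 3.x returns it directly (no wrapper attr).
    annotation = getattr(result, "speaker_diarization", result)
    return [
        {"start": turn.start, "end": turn.end, "speaker": speaker}
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]
```

Because `getattr` falls back to the object itself, the same helper works unchanged across both major versions.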