---
title: VAD Speaker Diarization
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# 🎙️ Real-Time VAD + Speaker Diarization System

Production-ready system for **Voice Activity Detection (VAD)** and **Speaker Diarization** with real-time performance and state-of-the-art accuracy.

[Python 3.10+](https://www.python.org/downloads/) · [PyTorch](https://pytorch.org/) · [MIT License](https://opensource.org/licenses/MIT)

## ✨ Features

- **Real-Time VAD**: <100ms latency using Silero VAD (40MB model)
- **Speaker Diarization**: State-of-the-art accuracy with Pyannote.audio 3.1/4.0+
- **Interactive Demo**: Beautiful Gradio web interface with visualizations
- **🎤 Live Recording**: Record audio directly in the browser with your microphone
- **👥 Custom Speaker Names**: Replace generic labels with real names (e.g., "Alice", "Bob")
- **📥 Audio Download**: Download processed recordings with results
- **Production Ready**: Fully containerized with Docker
- **GPU Accelerated**: CUDA 12.1+ support for faster processing
- **Multiple Formats**: Export results as JSON, RTTM, or text
- **Modular Architecture**: Clean, maintainable, and extensible code

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- CUDA 12.1+ (optional, for GPU acceleration)
- FFmpeg
- Hugging Face account with access to [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)

### Installation

#### Option 1: Conda (Recommended)

```bash
# Create and activate conda environment
conda create -n vad_diarization python=3.10 -y
conda activate vad_diarization

# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install dependencies
pip install -r requirements.txt
```

#### Option 2: Virtual Environment

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch with CUDA support (for GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install -r requirements.txt
```

#### Option 3: Automated Setup

```bash
# For conda users (activate environment first)
conda activate vad_diarization
./setup.sh

# For venv users
./setup.sh
```

### Hugging Face Token Setup

1. **Get your token**: Visit https://huggingface.co/settings/tokens
2. **Accept model conditions**: Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and click "Agree and access repository"
3. **Set environment variable**:

```bash
export HF_TOKEN='your_token_here'
```

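In Python code, it is safer to read the token from the environment than to hard-code it. A minimal sketch using the pipeline constructor documented below:

```python
import os

from src.pipeline import VADDiarizationPipeline

# Fail early with a clear message if the token was not exported.
token = os.environ.get("HF_TOKEN")
if not token:
    raise RuntimeError("HF_TOKEN is not set; run: export HF_TOKEN='your_token_here'")

pipeline = VADDiarizationPipeline(token=token)
```
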
### Running the Demo

**Launch Gradio Web Interface:**

```bash
export HF_TOKEN='your_token_here'
python app.py
```

Then open http://localhost:7860 in your browser.

**Or use the helper script:**

```bash
./run_app.sh
```

### 🎤 New Features

**Live Audio Recording:**
- Click the "🎤 Record Live" tab in the interface
- Record audio directly from your microphone
- Automatic processing when recording stops
- Download your recording with results

**Custom Speaker Names:**
- Open "⚙️ Advanced Settings"
- Add speaker name mappings (for a programmatic equivalent, see the sketch below):
  ```
  SPEAKER_00: Alice
  SPEAKER_01: Bob
  ```
- See custom names in all outputs and visualizations!

For detailed feature documentation, see [FEATURES.md](FEATURES.md)

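Outside the UI, you can apply the same renaming to pipeline results in code. A minimal sketch; the `parse_name_map` helper and the inline dummy `result` are illustrative, not part of the project API:

```python
def parse_name_map(text: str) -> dict[str, str]:
    """Parse 'SPEAKER_00: Alice' style lines into a {label: name} dict."""
    mapping = {}
    for line in text.strip().splitlines():
        label, _, name = line.partition(":")
        if name:
            mapping[label.strip()] = name.strip()
    return mapping

name_map = parse_name_map("SPEAKER_00: Alice\nSPEAKER_01: Bob")

# In practice `result` is the dict returned by pipeline.process_file(...);
# dummy segments are used here so the sketch runs standalone.
result = {"speaker_segments": [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 7.7, "speaker": "SPEAKER_01"},
]}

for seg in result["speaker_segments"]:
    seg["speaker"] = name_map.get(seg["speaker"], seg["speaker"])

print(result["speaker_segments"])
```
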
**Command Line Usage:**

```python
from src.pipeline import VADDiarizationPipeline

# Initialize pipeline
pipeline = VADDiarizationPipeline(
    token='your_hf_token',
    vad_threshold=0.5
)

# Process audio file
result = pipeline.process_file('audio.wav')

# Print results
print(pipeline.format_output(result))
```

## 📁 Project Structure

```
VAD+SD/
├── src/
│   ├── __init__.py             # Package initialization
│   ├── vad.py                  # Silero VAD wrapper
│   ├── diarization.py          # Pyannote diarization wrapper
│   ├── pipeline.py             # Integrated pipeline
│   └── utils.py                # Utility functions
├── tests/                      # Unit tests
│   ├── test_vad.py
│   ├── test_pipeline.py
│   └── __init__.py
├── notebooks/                  # Jupyter notebooks
│   └── demo.ipynb
├── benchmarks/                 # Benchmark scripts
│   └── run_benchmarks.py
├── app.py                      # Gradio web interface
├── vad_diarization.py          # CLI demo script
├── requirements.txt            # Python dependencies
├── environment.yml             # Conda environment file
├── Dockerfile                  # Container configuration
├── docker-compose.yml          # Docker Compose config
├── .dockerignore               # Docker ignore patterns
├── .gitignore                  # Git ignore patterns
├── setup.sh                    # Automated setup script
├── run_app.sh                  # App launcher script
├── verify_installation.py      # Installation verification
└── README.md                   # This file
```

## 🐳 Docker Deployment

### Build and Run

```bash
# Build image
docker build -t vad-diarization:latest .

# Run container
docker run -p 7860:7860 \
  -e HF_TOKEN='your_token_here' \
  --gpus all \
  vad-diarization:latest
```

### Docker Compose

```bash
# Set your token in .env file
echo "HF_TOKEN=your_token_here" > .env

# Start services
docker-compose up
```

## 📊 Performance Benchmarks

### VAD Performance

- **Latency**: ~9.73ms per second of audio ✅
- **Model Size**: 40MB
- **Real-time Factor**: ~0.01x (100x faster than real-time)
- **Accuracy**: High precision on speech detection

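To reproduce the latency and real-time-factor numbers on your own hardware, here is a minimal timing sketch (assuming a 16 kHz WAV file and the `SileroVAD` wrapper shown under Configuration):

```python
import time

import torchaudio

from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Audio duration from file metadata
info = torchaudio.info("audio.wav")
audio_seconds = info.num_frames / info.sample_rate

start = time.perf_counter()
vad.process_file("audio.wav")
elapsed = time.perf_counter() - start

# A real-time factor below 1.0 means faster than real time.
print(f"Latency: {1000 * elapsed / audio_seconds:.2f} ms per second of audio")
print(f"Real-time factor: {elapsed / audio_seconds:.3f}x")
```
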
### Diarization Performance

- **DER on FEARLESS STEPS**: ~19-20%
- **Processing Speed**: Depends on audio length and hardware
- **GPU Memory**: ~2-4GB for typical audio
- **Supports**: 2-10 speakers (configurable)

### System Requirements

- **Minimum**: 4GB RAM, CPU-only
- **Recommended**: 8GB+ RAM, NVIDIA GPU with 4GB+ VRAM
- **Optimal**: 16GB+ RAM, RTX 3060 or better

## 🔧 Configuration

### VAD Parameters

```python
from src.vad import SileroVAD

vad = SileroVAD(
    threshold=0.5,               # Speech probability threshold (0.0-1.0)
    sampling_rate=16000,         # Audio sample rate
    min_speech_duration_ms=250,  # Minimum speech segment duration
    min_silence_duration_ms=100, # Minimum silence between segments
    use_onnx=False               # Use ONNX runtime for speed
)
```

### Diarization Parameters

```python
from src.diarization import SpeakerDiarization

diarization = SpeakerDiarization(
    model_name="pyannote/speaker-diarization-3.1",
    token='your_token',
    num_speakers=None,  # Fixed number (if known)
    min_speakers=None,  # Minimum speakers
    max_speakers=None   # Maximum speakers
)
```

### Pipeline Configuration

```python
from src.pipeline import VADDiarizationPipeline

pipeline = VADDiarizationPipeline(
    vad_threshold=0.5,   # VAD sensitivity
    token='your_token',  # HF token
    num_speakers=None,   # Auto-detect speakers
    use_onnx_vad=False   # Use ONNX for VAD
)
```

## 📚 Usage Examples

### Basic Processing

```python
from src.pipeline import VADDiarizationPipeline

# Initialize
pipeline = VADDiarizationPipeline(token='your_token')

# Process file
result = pipeline.process_file('meeting.wav')

# Access results
print(f"Speakers: {result['metadata']['num_speakers']}")
print(f"Segments: {result['metadata']['num_segments']}")

# Print timeline
for seg in result['speaker_segments']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s: {seg['speaker']}")
```

### Batch Processing

```python
# Process multiple files
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
results = pipeline.process_batch(audio_files)

# Export results
for result in results:
    pipeline.save_results(result, 'outputs/', format='json')
```

### Custom Configuration

```python
# Initialize with custom settings
pipeline = VADDiarizationPipeline(
    vad_threshold=0.3,  # More sensitive VAD
    num_speakers=3,     # Fixed 3 speakers
    use_onnx_vad=True   # Faster VAD inference
)

# Process with overrides
result = pipeline.process_file(
    'audio.wav',
    num_speakers=2  # Override to 2 speakers for this file
)
```

### VAD Only

```python
from src.vad import SileroVAD

vad = SileroVAD(threshold=0.5)

# Process audio
timestamps = vad.process_file('audio.wav')

# Print speech segments
for ts in timestamps:
    print(f"Speech: {ts['start']:.2f}s - {ts['end']:.2f}s")
```

### Diarization Only

```python
from src.diarization import SpeakerDiarization

diarizer = SpeakerDiarization(token='your_token')

# Process audio
segments, time_ms, metadata = diarizer.process_file('audio.wav')

# Print speaker segments
for seg in segments:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```

## 🧪 Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html

# Test specific module
python -m pytest tests/test_vad.py -v

# Verify installation
python verify_installation.py

# Run benchmarks
python benchmarks/run_benchmarks.py
```

## 📄 Output Formats

### JSON Format

```json
{
  "audio_path": "audio.wav",
  "speaker_segments": [
    {
      "start": 0.5,
      "end": 3.2,
      "speaker": "SPEAKER_00",
      "duration": 2.7
    }
  ],
  "vad_segments": [
    {
      "start": 0.5,
      "end": 3.2
    }
  ],
  "metadata": {
    "num_speakers": 2,
    "num_segments": 15,
    "total_speech_time": 45.3
  },
  "processing_time": {
    "vad_ms": 150.2,
    "diarization_ms": 3200.5,
    "total_ms": 3350.7
  }
}
```

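Because results are plain JSON, downstream analysis is simple. For example, a short sketch that totals talk time per speaker from a saved result file (the output path is illustrative):

```python
import json
from collections import defaultdict

# Load a result saved with pipeline.save_results(..., format='json')
with open("outputs/audio.json") as f:
    result = json.load(f)

talk_time = defaultdict(float)
for seg in result["speaker_segments"]:
    talk_time[seg["speaker"]] += seg["duration"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s")
```
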
### RTTM Format

```
SPEAKER audio 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 3.500 4.200 <NA> <NA> SPEAKER_01 <NA> <NA>
```

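The RTTM columns are `SPEAKER <file-id> <channel> <onset> <duration> <ortho> <subtype> <speaker> <confidence> <lookahead>`, with `<NA>` for unused fields. If you ever need to write RTTM yourself, a minimal sketch over the `speaker_segments` structure shown above:

```python
def to_rttm(result: dict, file_id: str = "audio") -> str:
    """Render speaker segments as RTTM lines (one per segment)."""
    lines = []
    for seg in result["speaker_segments"]:
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)
```
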
### Text Timeline

```
[0.50s - 3.20s] SPEAKER_00
[3.50s - 7.70s] SPEAKER_01
[8.00s - 10.50s] SPEAKER_00
```

## 🎯 Use Cases

- **Meeting Transcription**: Identify who spoke when in recordings
- **Podcast Analysis**: Track speaker segments and statistics
- **Call Center Analytics**: Analyze customer-agent interactions
- **Video Production**: Generate speaker labels for editing
- **Research**: Speaker diarization for linguistic studies
- **Interview Processing**: Separate interviewer and interviewee
- **Broadcast Media**: Analyze news programs and talk shows

## 🐛 Troubleshooting

### Common Issues

#### 1. HF Token Error

```
Error: Invalid token or model access denied
```

**Solution**:
- Get a token from https://huggingface.co/settings/tokens
- Accept the model conditions at https://huggingface.co/pyannote/speaker-diarization-3.1
- Set the environment variable: `export HF_TOKEN='your_token'`

#### 2. CUDA Out of Memory

```
RuntimeError: CUDA out of memory
```

**Solution**:
- Process shorter audio segments
- Use CPU mode: `device='cpu'`
- Reduce batch size

#### 3. Audio Format Not Supported

```
Error loading audio
```

**Solution**: Convert to WAV format using FFmpeg:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

#### 4. DiarizeOutput Error

```
'DiarizeOutput' object has no attribute 'itertracks'
```

**Solution**: This is fixed in the current version; make sure you are running the latest code.

#### 5. Import Errors

```
ModuleNotFoundError: No module named 'torch'
```

**Solution**:
- Activate your environment: `conda activate vad_diarization`
- Reinstall dependencies: `pip install -r requirements.txt`

## 🔄 API Compatibility

This project supports both:

- **Pyannote.audio 3.x**: Returns `Annotation` objects
- **Pyannote.audio 4.0+**: Returns `DiarizeOutput` objects

The code automatically detects and handles both formats.

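For reference, a sketch of the kind of version shim this involves. The `speaker_diarization` attribute on the 4.0+ side is an assumption for illustration; check the attribute name against your pyannote.audio version:

```python
def iter_speaker_turns(output):
    """Yield (start, end, speaker) from either pyannote 3.x or 4.0+ output."""
    annotation = output  # pyannote.audio 3.x returns an Annotation directly
    if not hasattr(output, "itertracks"):
        # pyannote.audio 4.0+ wraps the annotation in an output object;
        # the attribute name below is assumed, adjust for your version.
        annotation = output.speaker_diarization
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        yield turn.start, turn.end, speaker
```
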
## 🚀 Deployment Options

### Local Development

```bash
python app.py
```

### Docker

```bash
docker-compose up
```

### Cloud Platforms

**Hugging Face Spaces:**
- Fork this repository
- Create a new Space
- Connect the repository
- Set the `HF_TOKEN` secret
- Deploy!

**AWS/GCP/Azure:**
- Use the provided Dockerfile
- Deploy as a container service
- Configure GPU instances for best performance

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📝 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- [Silero VAD](https://github.com/snakers4/silero-vad) - Fast and accurate VAD
- [Pyannote.audio](https://github.com/pyannote/pyannote-audio) - Speaker diarization toolkit
- [Gradio](https://gradio.app/) - Web interface framework
- [PyTorch](https://pytorch.org/) - Deep learning framework

## 📧 Support

For questions or issues:

- Open an issue on GitHub
- Check existing issues for solutions
- Review the troubleshooting section

---

**Built with ❤️ for the speech processing community**