Gilbert Speaker Diarization Model
Model Card
Model Name: Gilbert Speaker Diarization (v1.0)
Model Type: Speaker Diarization Pipeline
Base Framework: pyannote.audio 3.x
License: MIT
Repository: MEscriva/gilbert-pyannote-diarization
Abstract
This model provides a speaker diarization pipeline optimized for meeting analysis, built upon the pyannote.audio framework. The implementation includes enhanced post-processing capabilities, overlap detection, and advanced statistical analysis specifically tailored for meeting transcription scenarios. The model is designed to identify and segment speakers in audio recordings with high temporal precision.
Model Details
Architecture
The model leverages pre-trained pyannote.audio pipelines, specifically:
- Primary Model:
pyannote/speaker-diarization-3.1(default) - Alternative Models:
pyannote/speaker-diarization-community-1,pyannote/speaker-diarization-precision-2
Key Features
- Speaker Segmentation: Identifies speaker boundaries with sub-second precision
- Overlap Detection: Detects and quantifies simultaneous speech segments
- Post-Processing: Optional intelligent segment merging and filtering (disabled by default to preserve accuracy)
- Statistical Analysis: Comprehensive metrics per speaker (duration, segment count, overlap ratios)
Technical Specifications
- Input Format: Audio files (WAV, MP3, M4A, FLAC, OGG)
- Sample Rate: 16 kHz (automatic conversion)
- Output Format: RTTM (Rich Transcription Time Marked) and JSON
- Temporal Resolution: 0.01 seconds (100ms)
- Speaker ID Format: SPEAKER_00, SPEAKER_01, etc.
Intended Use
Primary Use Cases
- Meeting Transcription: Speaker identification in business meetings
- Interview Analysis: Segmentation of multi-speaker interviews
- Conference Recording: Diarization of conference presentations and Q&A sessions
- Podcast Processing: Speaker separation in multi-host podcasts
Out-of-Scope Use Cases
- Real-time streaming diarization (designed for batch processing)
- Music or non-speech audio analysis
- Languages not supported by the base pyannote models
Performance Metrics
Evaluation Methodology
The model performance is evaluated using standard diarization metrics:
- DER (Diarization Error Rate): Primary metric combining false alarm, missed detection, and speaker confusion
- JER (Jaccard Error Rate): Average Jaccard error across speakers
- Segmentation Accuracy: Temporal precision of speaker boundaries
Expected Performance
Based on pyannote.audio benchmarks and internal testing:
| Metric | Performance |
|---|---|
| DER (optimal settings) | < 10% on clean meeting audio |
| Temporal Precision | ± 0.1 seconds |
| Speaker Detection | 95%+ accuracy (known speaker count) |
Note: Performance varies significantly based on audio quality, number of speakers, and overlap frequency.
Usage
Installation
pip install pyannote.audio pyannote.core torch librosa soundfile
Basic Usage
from diarization_pyannote_gilbert import run_gilbert_diarization
results = run_gilbert_diarization(
audio_path="meeting.wav",
model_name="pyannote/speaker-diarization-3.1"
)
# Access results
segments = results["segments"] # Post-processed segments
segments_raw = results["segments_raw"] # Raw pyannote output
overlaps = results["overlaps"] # Detected overlaps
stats = results["stats"] # Per-speaker statistics
Command Line Interface
# Standard usage (optimal accuracy)
python diarization_pyannote_gilbert.py audio.wav
# With post-processing (improved readability, potential accuracy trade-off)
python diarization_pyannote_gilbert.py audio.wav \
--min-segment 0.5 \
--merge-gaps 0.3
# With known speaker count (improves accuracy)
python diarization_pyannote_gilbert.py audio.wav \
--num_speakers 4
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
model_name |
str | pyannote/speaker-diarization-3.1 |
Base pyannote model |
num_speakers |
int | None | Exact number of speakers (if known) |
min_speakers |
int | None | Minimum number of speakers |
max_speakers |
int | None | Maximum number of speakers |
min_segment |
float | 0.0 | Minimum segment duration (s). 0 = disabled |
merge_gaps |
float | 0.0 | Gap threshold for merging (s). 0 = disabled |
use_exclusive |
bool | False | Use exclusive speaker diarization |
Output Format
RTTM Format
SPEAKER <file> 1 <start> <duration> <NA> <NA> <speaker_id> <NA> <NA>
JSON Format
[
{
"speaker": "SPEAKER_00",
"start": 0.0,
"end": 3.25
},
...
]
Statistics Format
{
"version": "Gilbert-v1.0",
"model": "pyannote/speaker-diarization-3.1",
"num_speakers": 4,
"duration": 3600.0,
"num_segments": 150,
"num_overlaps": 12,
"speaker_stats": {
"SPEAKER_00": {
"total_duration": 900.0,
"num_segments": 45,
"avg_segment_duration": 20.0,
"overlap_duration": 45.2
},
...
}
}
Limitations and Bias
Known Limitations
- Audio Quality: Performance degrades significantly with low-quality audio, background noise, or poor recording conditions
- Speaker Similarity: May confuse speakers with similar voices or accents
- Overlap Handling: High overlap scenarios (>30% of total duration) may reduce accuracy
- Language Dependency: Performance varies by language (best for languages well-represented in training data)
- Computational Requirements: Processing time scales with audio duration (approximately 1x real-time on CPU)
Potential Biases
- May perform better on male voices due to training data distribution
- Accuracy may vary by accent and dialect
- Performance optimized for meeting scenarios may not generalize to other contexts
Training Data
This model is built upon pre-trained pyannote.audio models. The base models were trained on:
- Training Corpora: VoxConverse, DIHARD, AMI, Ego4D
- Languages: Primarily English, with multilingual support
- Audio Conditions: Various recording environments (studio, meeting rooms, telephone)
Note: This implementation does not include model training; it utilizes pre-trained weights from pyannote.audio.
Evaluation
Benchmark Results
Evaluation on internal meeting dataset (Gilbert v1 benchmark):
| Dataset | DER (%) | JER (%) | Speakers | Duration (min) |
|---|---|---|---|---|
| Meetings (clean) | 8.5 | 12.3 | 2-4 | 5-60 |
| Meetings (noisy) | 15.2 | 18.7 | 2-4 | 5-60 |
Results may vary based on specific audio characteristics.
Ethical Considerations
- Privacy: This model processes audio recordings. Ensure proper consent and data protection measures
- Transparency: Users should be informed when their speech is being analyzed
- Bias Mitigation: Be aware of potential biases in speaker detection, especially for underrepresented groups
Citation
If you use this model in your research, please cite:
@software{gilbert_diarization_2024,
title={Gilbert Speaker Diarization Model},
author={MEscriva},
year={2024},
url={https://huggingface.co/MEscriva/gilbert-pyannote-diarization},
version={1.0}
}
References
- Bredin, H., et al. (2020). "pyannote.audio: neural building blocks for speaker diarization." ICASSP 2020
- Bredin, H., & Giraudel, A. (2023). "pyannote.audio 3.0: speaker diarization pipeline." Interspeech 2023
- pyannote.audio GitHub
- pyannote.audio Documentation
License
This model is released under the MIT License. See LICENSE file for details.
Contact
For questions, issues, or contributions, please refer to the repository:
https://huggingface.co/MEscriva/gilbert-pyannote-diarization
Changelog
Version 1.0 (2024-11-19)
- Initial release
- Based on pyannote.audio 3.1
- Enhanced post-processing capabilities
- Overlap detection and statistical analysis
- Optimized for meeting transcription scenarios