Gilbert Speaker Diarization Model

Model Card

Model Name: Gilbert Speaker Diarization (v1.0)
Model Type: Speaker Diarization Pipeline
Base Framework: pyannote.audio 3.x
License: MIT
Repository: MEscriva/gilbert-pyannote-diarization

Abstract

This model provides a speaker diarization pipeline optimized for meeting analysis, built on the pyannote.audio framework. On top of the base pipelines it adds optional post-processing, overlap detection, and per-speaker statistical analysis tailored to meeting transcription. The pipeline identifies and segments speakers in audio recordings with high temporal precision.

Model Details

Architecture

The model leverages pre-trained pyannote.audio pipelines, specifically:

  • Primary Model: pyannote/speaker-diarization-3.1 (default)
  • Alternative Models: pyannote/speaker-diarization-community-1, pyannote/speaker-diarization-precision-2

Key Features

  1. Speaker Segmentation: Identifies speaker boundaries with sub-second precision
  2. Overlap Detection: Detects and quantifies simultaneous speech segments
  3. Post-Processing: Optional segment merging and filtering, disabled by default to preserve accuracy (see the sketch after this list)
  4. Statistical Analysis: Comprehensive metrics per speaker (duration, segment count, overlap ratios)
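
As a concrete illustration of the post-processing step in item 3, here is a minimal sketch, assuming segments use the JSON format shown under Output Format below and that the thresholds correspond to the min_segment and merge_gaps parameters (this is not the exact implementation):

def postprocess(segments, merge_gaps=0.3, min_segment=0.5):
    # Merge same-speaker segments separated by short gaps,
    # then drop very short segments.
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if (merged
                and seg["speaker"] == merged[-1]["speaker"]
                and seg["start"] - merged[-1]["end"] <= merge_gaps):
            # Same speaker and gap below threshold: extend the previous segment.
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            merged.append(dict(seg))
    # Drop segments shorter than min_segment seconds.
    return [s for s in merged if s["end"] - s["start"] >= min_segment]

Merging runs before filtering so that short fragments of a longer turn are absorbed rather than discarded.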

Technical Specifications

  • Input Format: Audio files (WAV, MP3, M4A, FLAC, OGG)
  • Sample Rate: 16 kHz (input is converted automatically; see the sketch after this list)
  • Output Format: RTTM (Rich Transcription Time Marked) and JSON
  • Temporal Resolution: 0.01 seconds (10 ms)
  • Speaker ID Format: SPEAKER_00, SPEAKER_01, etc.
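
The pipeline resamples input audio to 16 kHz internally, so no manual conversion is required. If you want to pre-convert files yourself (for example, to inspect exactly the audio the model sees), a minimal sketch using librosa and soundfile, both listed under Installation:

import librosa
import soundfile as sf

# Load the input (any supported format) and resample to 16 kHz mono,
# matching the pipeline's expected input format.
waveform, sr = librosa.load("meeting.mp3", sr=16000, mono=True)
sf.write("meeting_16k.wav", waveform, sr)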

Intended Use

Primary Use Cases

  • Meeting Transcription: Speaker identification in business meetings
  • Interview Analysis: Segmentation of multi-speaker interviews
  • Conference Recording: Diarization of conference presentations and Q&A sessions
  • Podcast Processing: Speaker separation in multi-host podcasts

Out-of-Scope Use Cases

  • Real-time streaming diarization (designed for batch processing)
  • Music or non-speech audio analysis
  • Languages not supported by the base pyannote models

Performance Metrics

Evaluation Methodology

Model performance is evaluated using standard diarization metrics (a computation sketch follows the list):

  • DER (Diarization Error Rate): Primary metric combining false alarm, missed detection, and speaker confusion
  • JER (Jaccard Error Rate): Average Jaccard error across speakers
  • Segmentation Accuracy: Temporal precision of speaker boundaries
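
Both DER and JER can be computed with the pyannote.metrics package (pip install pyannote.metrics), which is not part of this pipeline but shares the pyannote.core data structures. A minimal sketch with toy annotations; in practice the reference comes from ground-truth labels and the hypothesis from the pipeline output:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate, JaccardErrorRate

# Toy reference (ground truth) and hypothesis (pipeline output).
reference = Annotation()
reference[Segment(0.0, 10.0)] = "alice"
reference[Segment(10.0, 20.0)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 9.5)] = "SPEAKER_00"
hypothesis[Segment(9.5, 20.0)] = "SPEAKER_01"

der = DiarizationErrorRate()(reference, hypothesis)
jer = JaccardErrorRate()(reference, hypothesis)
print(f"DER: {der:.3f}  JER: {jer:.3f}")

Note that DER is permutation-invariant: the metric finds the optimal mapping between reference and hypothesis speaker labels before scoring, so the anonymous SPEAKER_NN labels are not a problem.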

Expected Performance

Based on pyannote.audio benchmarks and internal testing:

Metric                     Performance
DER (optimal settings)     < 10% on clean meeting audio
Temporal Precision         ± 0.1 seconds
Speaker Detection          95%+ accuracy (known speaker count)

Note: Performance varies significantly based on audio quality, number of speakers, and overlap frequency.

Usage

Installation

pip install pyannote.audio pyannote.core torch librosa soundfile
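
Note that the base pyannote models are gated on Hugging Face: you must accept their user conditions and authenticate with an access token before the weights can be downloaded. For reference, a minimal sketch of the underlying pyannote.audio call that this wrapper builds on (the token string is a placeholder):

from pyannote.audio import Pipeline

# Gated model: requires accepting the model's conditions on Hugging Face
# and passing a personal access token ("hf_..." is a placeholder).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",
)

diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f} - {turn.end:.2f}: {speaker}")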

Basic Usage

from diarization_pyannote_gilbert import run_gilbert_diarization

results = run_gilbert_diarization(
    audio_path="meeting.wav",
    model_name="pyannote/speaker-diarization-3.1"
)

# Access results
segments = results["segments"]  # Post-processed segments
segments_raw = results["segments_raw"]  # Raw pyannote output
overlaps = results["overlaps"]  # Detected overlaps
stats = results["stats"]  # Per-speaker statistics

Command Line Interface

# Standard usage (optimal accuracy)
python diarization_pyannote_gilbert.py audio.wav

# With post-processing (improved readability, potential accuracy trade-off)
python diarization_pyannote_gilbert.py audio.wav \
    --min-segment 0.5 \
    --merge-gaps 0.3

# With known speaker count (improves accuracy)
python diarization_pyannote_gilbert.py audio.wav \
    --num_speakers 4

Parameters

Parameter       Type    Default                            Description
model_name      str     pyannote/speaker-diarization-3.1   Base pyannote model
num_speakers    int     None                               Exact number of speakers (if known)
min_speakers    int     None                               Minimum number of speakers
max_speakers    int     None                               Maximum number of speakers
min_segment     float   0.0                                Minimum segment duration (s); 0 disables filtering
merge_gaps      float   0.0                                Gap threshold for merging (s); 0 disables merging
use_exclusive   bool    False                              Use exclusive speaker diarization
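
Assuming these parameters map to keyword arguments of the same names on run_gilbert_diarization (an assumption; check the function signature in the repository), the CLI examples above translate to Python as:

from diarization_pyannote_gilbert import run_gilbert_diarization

results = run_gilbert_diarization(
    audio_path="meeting.wav",
    model_name="pyannote/speaker-diarization-3.1",
    num_speakers=4,    # exact speaker count, if known
    min_segment=0.5,   # drop segments shorter than 0.5 s
    merge_gaps=0.3,    # merge same-speaker segments separated by <= 0.3 s
)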

Output Format

RTTM Format

SPEAKER <file> 1 <start> <duration> <NA> <NA> <speaker_id> <NA> <NA>
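
To make the field layout concrete, a minimal sketch that serializes segments in the JSON format below into RTTM lines (file_id is the recording name without extension):

def to_rttm(segments, file_id, path):
    # One line per segment: type, file, channel, onset, duration,
    # two placeholders, speaker ID, and two trailing placeholders.
    with open(path, "w") as f:
        for seg in segments:
            duration = seg["end"] - seg["start"]
            f.write(
                f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
                f"<NA> <NA> {seg['speaker']} <NA> <NA>\n"
            )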

JSON Format

[
  {
    "speaker": "SPEAKER_00",
    "start": 0.0,
    "end": 3.25
  },
  ...
]

Statistics Format

{
  "version": "Gilbert-v1.0",
  "model": "pyannote/speaker-diarization-3.1",
  "num_speakers": 4,
  "duration": 3600.0,
  "num_segments": 150,
  "num_overlaps": 12,
  "speaker_stats": {
    "SPEAKER_00": {
      "total_duration": 900.0,
      "num_segments": 45,
      "avg_segment_duration": 20.0,
      "overlap_duration": 45.2
    },
    ...
  }
}
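
The per-speaker fields can be derived directly from the segment list. A minimal sketch, assuming segments follow the JSON format above (overlap_duration is omitted here, since it additionally requires the overlap list):

from collections import defaultdict

def speaker_stats(segments):
    # Accumulate total speaking time and segment count per speaker.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
        counts[seg["speaker"]] += 1
    return {
        spk: {
            "total_duration": round(totals[spk], 2),
            "num_segments": counts[spk],
            "avg_segment_duration": round(totals[spk] / counts[spk], 2),
        }
        for spk in totals
    }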

Limitations and Bias

Known Limitations

  1. Audio Quality: Performance degrades significantly with low-quality audio, background noise, or poor recording conditions
  2. Speaker Similarity: May confuse speakers with similar voices or accents
  3. Overlap Handling: High overlap scenarios (>30% of total duration) may reduce accuracy
  4. Language Dependency: Performance varies by language (best for languages well-represented in training data)
  5. Computational Requirements: Processing time scales with audio duration (approximately 1x real-time on CPU)

Potential Biases

  • May perform better on male voices due to training data distribution
  • Accuracy may vary by accent and dialect
  • Performance optimized for meeting scenarios may not generalize to other contexts

Training Data

This model is built upon pre-trained pyannote.audio models. The base models were trained on:

  • Training Corpora: VoxConverse, DIHARD, AMI, Ego4D
  • Languages: Primarily English, with multilingual support
  • Audio Conditions: Various recording environments (studio, meeting rooms, telephone)

Note: This implementation does not include model training; it utilizes pre-trained weights from pyannote.audio.

Evaluation

Benchmark Results

Evaluation on internal meeting dataset (Gilbert v1 benchmark):

Dataset            DER (%)   JER (%)   Speakers   Duration (min)
Meetings (clean)   8.5       12.3      2-4        5-60
Meetings (noisy)   15.2      18.7      2-4        5-60

Results may vary based on specific audio characteristics.

Ethical Considerations

  • Privacy: This model processes audio recordings. Ensure proper consent and data protection measures
  • Transparency: Users should be informed when their speech is being analyzed
  • Bias Mitigation: Be aware of potential biases in speaker detection, especially for underrepresented groups

Citation

If you use this model in your research, please cite:

@software{gilbert_diarization_2024,
  title={Gilbert Speaker Diarization Model},
  author={MEscriva},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-pyannote-diarization},
  version={1.0}
}

References

  • Bredin, H., et al. (2020). "pyannote.audio: neural building blocks for speaker diarization." ICASSP 2020.
  • Bredin, H. (2023). "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe." Interspeech 2023.
  • pyannote.audio repository and documentation: https://github.com/pyannote/pyannote-audio

License

This model is released under the MIT License. See LICENSE file for details.

Contact

For questions, issues, or contributions, please refer to the repository:
https://huggingface.co/MEscriva/gilbert-pyannote-diarization

Changelog

Version 1.0 (2024-11-19)

  • Initial release
  • Based on pyannote.audio 3.1
  • Enhanced post-processing capabilities
  • Overlap detection and statistical analysis
  • Optimized for meeting transcription scenarios