---
license: mit
tags:
- automatic-speech-recognition
- asr
- whisper
- french
- speech-recognition
- stt
- multilingual
- research
- baseline
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
---
# Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition
## Overview
**Gilbert-FR-Source** is the foundational baseline model for the **Gilbert research project**, a comprehensive initiative focused on developing state-of-the-art automatic speech recognition (ASR) systems optimized for French language applications. This model serves as the **frozen reference point** for all subsequent research, fine-tuning, and development work within the Gilbert ecosystem.
**Important Notice on Intellectual Property:**
- This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the MIT License, allowing research and commercial use.
- **All derivative models, fine-tuned variants, and specialized models developed from this baseline as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- While this baseline can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.
---
## Research Context
The Gilbert project is a systematic research and development effort aimed at creating highly specialized ASR systems for:
- **Professional meeting transcription** (hybrid and remote meetings)
- **Long-form multi-speaker discourse** (30-120 minute sessions)
- **Institutional environments** (education, public sector, healthcare)
- **Constrained audio conditions** (telephony, VoIP, low signal-to-noise ratio)
- **Sociolinguistic diversity** (African, Canadian, Belgian, and other French accents)
This baseline model provides the **controlled starting point** for all experimental work, ensuring reproducibility and enabling fair comparison across different research directions.
---
## Model Details
### Architecture
- **Base Model:** OpenAI Whisper Large V3
- **Fine-tuning:** French-focused fine-tuning of Whisper Large V3 (see Acknowledgments)
- **Framework:** Compatible with Hugging Face Transformers, OpenAI Whisper, CTranslate2, ONNX Runtime, and MLX (a faster-whisper sketch follows this list)
- **Model Size:** ~1.55B parameters (roughly 3.1 GB in float16, 6.2 GB in float32)
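For CTranslate2-based inference, a common route is to convert the checkpoint and load it with faster-whisper. The sketch below assumes such a conversion; the `gilbert-fr-ct2` output directory is illustrative and not a published artifact of this repository.

```python
# Illustrative sketch: converting this checkpoint to CTranslate2 and running it with faster-whisper.
# Prerequisites (shell), assuming no pre-converted weights are published:
#   pip install faster-whisper ctranslate2
#   ct2-transformers-converter --model MEscriva/gilbert-fr-source \
#       --output_dir gilbert-fr-ct2 --copy_files tokenizer.json preprocessor_config.json
from faster_whisper import WhisperModel

# "gilbert-fr-ct2" is the assumed output directory from the conversion above
model = WhisperModel("gilbert-fr-ct2", device="cuda", compute_type="float16")

segments, _info = model.transcribe("audio.wav", language="fr", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```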
### Key Characteristics
- **Language:** French (primary), with multilingual capabilities
- **Context Length:** 30-second attention window; long-form audio is handled through chunked or sequential decoding
- **Output:** Text transcription with word-level timestamps
- **Performance:** Optimized for French speech recognition accuracy
---
## Intended Use
### Research and Development
This model is intended for:
1. **Research Baseline:** Use as a reference point for ASR research and experimentation
2. **Comparative Studies:** Benchmark against this baseline when evaluating new architectures or training strategies
3. **Fine-tuning Foundation:** Use as a starting point for domain-specific fine-tuning (subject to Gilbert project IP terms)
4. **Educational Purposes:** Learning and understanding ASR model behavior
### Production Use
While this baseline model can be used directly, **production deployments should use specialized Gilbert models** that are optimized for specific use cases and domains. Contact the Gilbert team for production-grade models.
---
## Performance Benchmarks
### Reference Results
The following WER (Word Error Rate) scores serve as **baseline reference** for future Gilbert model development:
| Dataset | WER | Notes |
|---------|-----|-------|
| MLS (FR) | 3.98% | Multilingual LibriSpeech French |
| Common Voice FR (v13.0) | 7.28% | Diverse French speech |
| VoxPopuli (FR) | 8.91% | European Parliament speeches |
| FLEURS (FR) | 4.84% | FLEURS French subset |
| African Accented French | 4.20% | Regional accent evaluation |
**Note:** These results are the reference figures before targeted fine-tuning; future Gilbert variants will be evaluated against them to measure improvement. A sketch of how such an evaluation can be reproduced follows.
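The sketch below shows how a comparable WER figure could be computed on a local evaluation set. The `eval_manifest.csv` file (with `audio_path` and `reference` columns) is an illustrative assumption, and the normalization is deliberately simplified compared with the protocol described under Evaluation Standards.

```python
# Minimal sketch of aggregating WER over a local evaluation manifest (illustrative file layout)
import csv
import torch
import jiwer
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MEscriva/gilbert-fr-source",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

references, hypotheses = [], []
with open("eval_manifest.csv", newline="") as f:
    for row in csv.DictReader(f):  # expected columns: audio_path, reference
        out = asr(row["audio_path"], generate_kwargs={"language": "fr", "task": "transcribe"})
        references.append(row["reference"].lower())
        hypotheses.append(out["text"].lower())

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```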
---
## Usage
### Installation
```bash
pip install transformers torch torchaudio librosa soundfile
```
### Basic Usage with Transformers
```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

model_id = "MEscriva/gilbert-fr-source"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True
)
model.to(device)

# Load the audio as a 16 kHz mono waveform (the processor expects an array, not a file path)
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs["input_features"].to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe"
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]
print(transcription)
```
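For long-form recordings and word-level timestamps, the Transformers `pipeline` API handles chunking directly. This is a minimal sketch; the file name and batch size are placeholders.

```python
# Chunked long-form decoding with word-level timestamps via the ASR pipeline
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="MEscriva/gilbert-fr-source",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
    chunk_length_s=30,  # Whisper's native 30-second window
    batch_size=8,       # decode several chunks in parallel
)

result = asr(
    "long_meeting.wav",
    return_timestamps="word",
    generate_kwargs={"language": "fr", "task": "transcribe"},
)
print(result["text"])
for chunk in result["chunks"][:5]:
    print(chunk["timestamp"], chunk["text"])
```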
### Usage with OpenAI Whisper
```python
import whisper
# The openai-whisper package loads the upstream Whisper Large V3 checkpoint;
# it does not pull this Hugging Face repository's weights directly
model = whisper.load_model("large-v3")

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe"
)
print(result["text"])
```
---
## Research Methodology
### Baseline Purpose
This model serves as:
1. **Frozen Reference:** Weights remain unchanged to ensure consistent baseline comparisons
2. **Reproducibility Anchor:** All experiments reference this exact checkpoint
3. **Version Control:** Future Gilbert models explicitly reference this baseline version for traceability
### Evaluation Standards
- **WER Calculation:** Standard normalization (lowercasing, punctuation removal); see the sketch after this list
- **Metrics:** Word Error Rate (WER), Character Error Rate (CER), BLEU score
- **Advanced Metrics:** Speaker-attributed WER (SA-WER), long-context stability (internal research)
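As an illustration of the normalization step, the following sketch computes WER with the `jiwer` package after lowercasing and punctuation removal; it mirrors the standard described above but is not the project's internal implementation.

```python
import re
import jiwer

def normalize(text: str) -> str:
    """Lowercase, strip punctuation (keeping apostrophes), collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reference = "Bonjour, comment allez-vous aujourd'hui ?"
hypothesis = "bonjour comment allez vous aujourd'hui"

print(f"WER: {jiwer.wer(normalize(reference), normalize(hypothesis)):.2%}")
```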
### Versioning
- **Current Version:** 0.1 (Research Baseline)
- **Future Versions:** All Gilbert model variants will reference this baseline version
---
## Limitations
This baseline model inherits known limitations from Whisper and the underlying training data:
1. **Overlapping Speech:** Sensitivity to simultaneous speakers
2. **Long-form Decoding:** Occasional hallucinations in very long audio segments
3. **Domain Shift:** Suboptimal performance on spontaneous dialogue without fine-tuning
4. **Accent Distribution:** Potential biases related to accent representation in training data
5. **Telephony Bandwidth:** Suboptimal performance on narrowband (8 kHz) audio without adaptation (see the resampling sketch below)
**Understanding and quantifying these limitations is a core objective of the Gilbert research roadmap.**
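Regarding the telephony limitation: Whisper's feature extractor expects 16 kHz input, so narrowband recordings need to be resampled before transcription (this restores the sampling rate, not the missing high-frequency content). A minimal sketch, assuming `librosa` and `soundfile` are installed:

```python
import librosa
import soundfile as sf

# Load a hypothetical 8 kHz call-center recording at its native rate
audio, sr = librosa.load("call_center_recording.wav", sr=None)

# Upsample to the 16 kHz rate expected by the Whisper feature extractor
audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
sf.write("call_center_recording_16k.wav", audio_16k, 16000)
```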
---
## Future Research Directions
The following specialized models will be developed as independent checkpoints from this baseline:
### Planned Gilbert Models
1. **Gilbert-FR-Longform-v1**
- Optimized for long meetings (30-120 minutes)
- Multi-speaker interaction handling
- Discourse-level context stability
2. **Gilbert-FR-Accents-v1**
- Robustness to regional and international French accents
- African, Canadian, Belgian accent optimization
3. **Gilbert-FR-Telephone-v1**
- Optimized for 8 kHz VoIP/call-center speech
- Narrowband audio adaptation
4. **Gilbert-Multilingual-v1**
- Extended cross-lingual performance
- Optimized French anchors with multilingual support
**All future Gilbert models are the exclusive intellectual property of Lexia France** and will include detailed evaluation reports adhering to research reproducibility standards.
---
## Intellectual Property and Licensing
### License for This Baseline
This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the **MIT License**, allowing:
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ✅ Patent use
See the `LICENSE` file for full terms.
### Intellectual Property Notice
**Important:** While this baseline model is available under MIT License:
- **All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- Use of this baseline for Gilbert project development implies acceptance of these IP terms.
- Commercial use of Gilbert project derivatives requires separate licensing agreements.
For licensing inquiries regarding Gilbert project models, contact: **mathis@lexiapro.fr**
---
## Citation
If you use this baseline model in your research, please cite:
```bibtex
@software{gilbert_fr_source_2024,
title={Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition},
author={MEscriva and Lexia France},
year={2024},
url={https://huggingface.co/MEscriva/gilbert-fr-source},
version={0.1},
note={Research baseline for the Gilbert project}
}
```
---
## Acknowledgments
This baseline model is based on:
- **OpenAI Whisper Large V3** (MIT License)
- **bofenghuang/whisper-large-v3-french** (French fine-tuning)
We acknowledge the contributions of the open-source community and the original Whisper research team.
---
## Contact
For research collaboration, evaluation access, or technical inquiries:
- **Website:** [https://gilbert-assistant.fr](https://gilbert-assistant.fr)
- **Email:** mathis@lexiapro.fr
- **Repository:** [https://huggingface.co/MEscriva/gilbert-fr-source](https://huggingface.co/MEscriva/gilbert-fr-source)
---
## Changelog
### Version 0.1 (2024-12-19)
- Initial research baseline release
- Based on Whisper Large V3 with French optimization
- Established as frozen reference point for Gilbert project
- Documentation of baseline performance metrics
---
**© 2024 Lexia France. All rights reserved for Gilbert project derivatives.**