---
license: mit
tags:
- automatic-speech-recognition
- asr
- whisper
- french
- speech-recognition
- stt
- multilingual
- research
- baseline
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
---

# Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition

## Overview

**Gilbert-FR-Source** is the foundational baseline model for the **Gilbert research project**, a comprehensive initiative focused on developing state-of-the-art automatic speech recognition (ASR) systems optimized for French language applications. This model serves as the **frozen reference point** for all subsequent research, fine-tuning, and development work within the Gilbert ecosystem.

**Important Notice on Intellectual Property:**

- This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the MIT License, allowing research and commercial use.
- **All derivative models, fine-tuned variants, and specialized models developed from this baseline as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- While this baseline can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

---
## Research Context

The Gilbert project is a systematic research and development effort aimed at creating highly specialized ASR systems for:

- **Professional meeting transcription** (hybrid and remote meetings)
- **Long-form multi-speaker discourse** (30-120 minute sessions)
- **Institutional environments** (education, public sector, healthcare)
- **Constrained audio conditions** (telephony, VoIP, low signal-to-noise ratio)
- **Sociolinguistic diversity** (African, Canadian, Belgian, and other French accents)

This baseline model provides the **controlled starting point** for all experimental work, ensuring reproducibility and enabling fair comparison across different research directions.

---
## Model Details

### Architecture

- **Base Model:** OpenAI Whisper Large V3
- **Fine-tuning:** Optimized for French language performance
- **Framework:** Compatible with Hugging Face Transformers, OpenAI Whisper, CTranslate2, ONNX Runtime, and MLX
- **Model Size:** ~3.2 GB, which corresponds to half-precision (FP16) weights for the roughly 1.55B-parameter Whisper Large V3 architecture

### Key Characteristics

- **Language:** French (primary), with multilingual capabilities
- **Context Length:** Long-form audio support through chunked or sequential decoding; like other Whisper checkpoints, the model itself processes audio in 30-second windows
- **Output:** Text transcription with word-level timestamps
- **Performance:** Optimized for French speech recognition accuracy

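
Whisper-family models consume fixed 30-second windows, so long recordings are usually split into overlapping chunks before decoding. The sketch below shows only the boundary arithmetic for such a split. The function name `chunk_bounds` and the 5-second overlap are illustrative assumptions, not part of this model's API or a tuned setting of the Gilbert project.

```python
def chunk_bounds(total_s, window_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds for overlapping windows covering the audio.

    window_s and overlap_s are illustrative defaults (assumptions), not project settings.
    """
    if total_s <= 0:
        return []
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be non-negative and smaller than window_s")

    step = window_s - overlap_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


# A 70-second recording is covered by three overlapping windows
for start, end in chunk_bounds(70.0):
    print(f"{start:6.1f}s -> {end:6.1f}s")
```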
---
## Intended Use

### Research and Development

This model is intended for:

1. **Research Baseline:** Use as a reference point for ASR research and experimentation
2. **Comparative Studies:** Benchmark against this baseline when evaluating new architectures or training strategies
3. **Fine-tuning Foundation:** Use as a starting point for domain-specific fine-tuning (subject to Gilbert project IP terms)
4. **Educational Purposes:** Learning and understanding ASR model behavior

### Production Use

While this baseline model can be used directly, **production deployments should use specialized Gilbert models** that are optimized for specific use cases and domains. Contact the Gilbert team for production-grade models.

---
## Performance Benchmarks

### Reference Results

The following WER (Word Error Rate) scores serve as the **baseline reference** for future Gilbert model development:

| Dataset | WER | Notes |
|---------|-----|-------|
| MLS (FR) | 3.98% | Multilingual LibriSpeech French |
| Common Voice FR (v13.0) | 7.28% | Diverse French speech |
| VoxPopuli (FR) | 8.91% | European Parliament speeches |
| Fleurs (FR) | 4.84% | FLORES evaluation |
| African Accented French | 4.20% | Regional accent evaluation |

**Note:** These results are the **reference point** before targeted fine-tuning. Because WER is an error rate (lower is better), they act as an upper bound that future Gilbert variants are expected to improve on.

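
One common way to express improvement over a baseline is relative WER reduction. The sketch below applies that arithmetic to the reference values from the table above. The helper name `relative_wer_reduction` and the candidate score of 8.02% are hypothetical and serve only to illustrate the calculation; they are not reported results.

```python
# Baseline WERs (in percent) copied from the reference table above
BASELINE_WER = {
    "MLS (FR)": 3.98,
    "Common Voice FR (v13.0)": 7.28,
    "VoxPopuli (FR)": 8.91,
    "Fleurs (FR)": 4.84,
    "African Accented French": 4.20,
}


def relative_wer_reduction(dataset, candidate_wer):
    """Relative WER reduction in percent; positive values mean the candidate is better."""
    baseline = BASELINE_WER[dataset]
    return 100.0 * (baseline - candidate_wer) / baseline


# Hypothetical candidate score, used only to illustrate the arithmetic
reduction = relative_wer_reduction("VoxPopuli (FR)", 8.02)
print(f"Relative WER reduction: {reduction:.2f}%")
```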
---
## Usage

### Installation

```bash
# accelerate is needed because the example below passes low_cpu_mem_usage=True
pip install transformers accelerate torch torchaudio librosa soundfile
```

### Basic Usage with Transformers

```python
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "MEscriva/gilbert-fr-source"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the audio as a 16 kHz mono waveform; the processor expects an array, not a file path
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

# Extract log-mel features and match the model's device and dtype
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe",
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(transcription)
```

### Usage with OpenAI Whisper

```python
import whisper

# Load the model.
# Note: "large-v3" loads the stock OpenAI Whisper weights, not the Gilbert-FR-Source checkpoint.
model = whisper.load_model("large-v3")

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe",
)

print(result["text"])
```

---
## Research Methodology

### Baseline Purpose

This model serves as:

1. **Frozen Reference:** Weights remain unchanged to ensure consistent baseline comparisons
2. **Reproducibility Anchor:** All experiments reference this exact checkpoint
3. **Version Control:** Future Gilbert models explicitly reference this baseline version for traceability

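
One way to check that a local copy of the weights still matches a frozen reference is to compare a SHA-256 digest against a recorded value. The sketch below is a minimal illustration under stated assumptions: the helper names are hypothetical, and the expected digest must come from your own records, since this card does not publish one.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, block_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_reference(path, expected_hex):
    """Return True if the file's digest equals the recorded reference digest.

    expected_hex is an assumption: a digest you recorded when the baseline was frozen.
    """
    return sha256_of_file(path) == expected_hex.lower()
```

When loading through Hugging Face Transformers, `from_pretrained` also accepts a `revision` argument, which can pin a specific commit of the repository.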
### Evaluation Standards

- **WER Calculation:** Standard normalization (lowercasing, punctuation removal)
- **Metrics:** Word Error Rate (WER), Character Error Rate (CER), BLEU score
- **Advanced Metrics:** Speaker-attributed WER (SA-WER), long-context stability (internal research)

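
The WER and CER calculations above can be sketched with a plain edit distance over normalized text. This is a minimal illustration, not the project's evaluation code: the `normalize` function is an assumption that lowercases and replaces every Unicode punctuation character with a space, so `l'école` becomes two tokens. A different normalizer will give different scores.

```python
import unicodedata


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace (assumed scheme)."""
    lowered = text.lower()
    kept = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in lowered
    )
    return " ".join(kept.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    if not ref_words:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by reference length."""
    ref_chars = normalize(reference)
    hyp_chars = normalize(hypothesis)
    if not ref_chars:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted word out of four reference words
print(wer("Bonjour, tout le monde !", "bonjour tout le mode"))
```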
### Versioning

- **Current Version:** 0.1 (Research Baseline)
- **Future Versions:** All Gilbert model variants will reference this baseline version

---
## Limitations

This baseline model inherits known limitations from Whisper and the underlying training data:

1. **Overlapping Speech:** Sensitivity to simultaneous speakers
2. **Long-form Decoding:** Occasional hallucinations in very long audio segments
3. **Domain Shift:** Suboptimal performance on spontaneous dialogue without fine-tuning
4. **Accent Distribution:** Potential biases related to accent representation in training data
5. **Telephony Bandwidth:** Suboptimal performance on narrowband (8 kHz) audio without adaptation

**Understanding and quantifying these limitations is a core objective of the Gilbert research roadmap.**

---
## Future Research Directions

The following specialized models will be developed as independent checkpoints from this baseline:

### Planned Gilbert Models

1. **Gilbert-FR-Longform-v1**
   - Optimized for long meetings (30-120 minutes)
   - Multi-speaker interaction handling
   - Discourse-level context stability

2. **Gilbert-FR-Accents-v1**
   - Robustness to regional and international French accents
   - African, Canadian, Belgian accent optimization

3. **Gilbert-FR-Telephone-v1**
   - Optimized for 8 kHz VoIP/call-center speech
   - Narrowband audio adaptation

4. **Gilbert-Multilingual-v1**
   - Extended cross-lingual performance
   - Optimized French anchors with multilingual support

**All future Gilbert models are the exclusive intellectual property of Lexia France** and will include detailed evaluation reports adhering to research reproducibility standards.

---
## Intellectual Property and Licensing

### License for This Baseline

This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the **MIT License**, allowing:

- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ✅ Patent use

See the `LICENSE` file for full terms.

### Intellectual Property Notice

**Important:** While this baseline model is available under MIT License:

- **All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- Use of this baseline for Gilbert project development implies acceptance of these IP terms.
- Commercial use of Gilbert project derivatives requires separate licensing agreements.

For licensing inquiries regarding Gilbert project models, contact: **mathis@lexiapro.fr**

---
## Citation

If you use this baseline model in your research, please cite:

```bibtex
@software{gilbert_fr_source_2024,
  title={Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition},
  author={MEscriva and Lexia France},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-fr-source},
  version={0.1},
  note={Research baseline for the Gilbert project}
}
```

---
## Acknowledgments

This baseline model is based on:

- **OpenAI Whisper Large V3** (MIT License)
- **bofenghuang/whisper-large-v3-french** (French fine-tuning)

We acknowledge the contributions of the open-source community and the original Whisper research team.

---
## Contact

For research collaboration, evaluation access, or technical inquiries:

- **Website:** [https://gilbert-assistant.fr](https://gilbert-assistant.fr)
- **Email:** mathis@lexiapro.fr
- **Repository:** [https://huggingface.co/MEscriva/gilbert-fr-source](https://huggingface.co/MEscriva/gilbert-fr-source)

---
## Changelog

### Version 0.1 (2024-12-19)

- Initial research baseline release
- Based on Whisper Large V3 with French optimization
- Established as frozen reference point for Gilbert project
- Documentation of baseline performance metrics

---
**© 2024 Lexia France. All rights reserved for Gilbert project derivatives.**