---
license: mit
tags:
  - automatic-speech-recognition
  - asr
  - whisper
  - french
  - speech-recognition
  - stt
  - multilingual
  - research
  - baseline
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
---

# Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition

## Overview

**Gilbert-FR-Source** is the foundational baseline model for the **Gilbert research project**, an initiative focused on developing state-of-the-art automatic speech recognition (ASR) systems for French. This model serves as the **frozen reference point** for all subsequent research, fine-tuning, and development work within the Gilbert ecosystem.

**Important Notice on Intellectual Property:**

- This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the MIT License, allowing research and commercial use.
- **All derivative models, fine-tuned variants, and specialized models developed from this baseline as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- While this baseline can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

---

## Research Context

The Gilbert project is a systematic research and development effort aimed at creating highly specialized ASR systems for:

- **Professional meeting transcription** (hybrid and remote meetings)
- **Long-form multi-speaker discourse** (30-120 minute sessions)
- **Institutional environments** (education, public sector, healthcare)
- **Constrained audio conditions** (telephony, VoIP, low signal-to-noise ratio)
- **Sociolinguistic diversity** (African, Canadian, Belgian, and other French accents)

This baseline model provides the **controlled starting point** for all experimental work, ensuring reproducibility and enabling fair comparison across research directions.

---

## Model Details

### Architecture

- **Base Model:** OpenAI Whisper Large V3
- **Fine-tuning:** French-optimized weights derived from `bofenghuang/whisper-large-v3-french` (see Acknowledgments)
- **Framework:** Compatible with Hugging Face Transformers, OpenAI Whisper, CTranslate2, ONNX Runtime, and MLX
- **Model Size:** ~3.2 GB (FP16 weights)

### Key Characteristics

- **Language:** French (primary), with multilingual capabilities inherited from Whisper
- **Context Length:** Whisper processes audio in 30-second windows; longer recordings are handled via chunked or sequential decoding (see the sketch below)
- **Output:** Text transcription, with optional word-level timestamps
- **Performance:** Tuned for French speech recognition accuracy
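### Long-form Decoding Example

Since the characteristics above mention long-form support and word-level timestamps, here is a minimal sketch of how both are typically obtained through the Transformers chunked ASR pipeline. It assumes this checkpoint keeps the standard Whisper processor layout; `meeting.wav` is a placeholder path.

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Build an ASR pipeline directly from this checkpoint.
pipe = pipeline(
    "automatic-speech-recognition",
    model="MEscriva/gilbert-fr-source",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

# chunk_length_s splits long recordings into 30-second windows (matching
# Whisper's receptive field); return_timestamps="word" requests word-level
# timestamps in the output.
result = pipe(
    "meeting.wav",  # placeholder path
    chunk_length_s=30,
    batch_size=8,
    return_timestamps="word",
    generate_kwargs={"language": "fr", "task": "transcribe"},
)
print(result["text"])
```

Chunked decoding trades some cross-chunk context for bounded memory use, which is why the Gilbert roadmap treats long-form stability as a dedicated research direction.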
---

## Intended Use

### Research and Development

This model is intended for:

1. **Research Baseline:** Use as a reference point for ASR research and experimentation
2. **Comparative Studies:** Benchmark new architectures or training strategies against this baseline
3. **Fine-tuning Foundation:** Use as a starting point for domain-specific fine-tuning (subject to Gilbert project IP terms)
4. **Educational Purposes:** Learning and understanding ASR model behavior

### Production Use

While this baseline model can be used directly, **production deployments should use specialized Gilbert models** optimized for specific use cases and domains. Contact the Gilbert team for production-grade models.

---

## Performance Benchmarks

### Reference Results

The following Word Error Rate (WER) scores serve as the **baseline reference** for future Gilbert model development:

| Dataset | WER | Notes |
|---------|-----|-------|
| MLS (FR) | 3.98% | Multilingual LibriSpeech, French subset |
| Common Voice FR (v13.0) | 7.28% | Diverse French speech |
| VoxPopuli (FR) | 8.91% | European Parliament speeches |
| FLEURS (FR) | 4.84% | FLEURS French test set |
| African Accented French | 4.20% | Regional accent evaluation |

**Note:** These results are the reference point before targeted fine-tuning: future Gilbert variants will be evaluated against these baselines to measure improvement. A sketch for reproducing this kind of evaluation appears at the end of the Usage section below.

---

## Usage

### Installation

```bash
pip install transformers torch torchaudio librosa soundfile
```

### Basic Usage with Transformers

```python
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "MEscriva/gilbert-fr-source"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the audio as a 16 kHz waveform; the processor expects an audio
# array, not a file path.
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe",
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(transcription)
```

### Usage with OpenAI Whisper

```python
import whisper

# Note: whisper.load_model("large-v3") downloads the upstream OpenAI
# checkpoint, not this fine-tuned baseline; use the Transformers example
# above to run gilbert-fr-source itself.
model = whisper.load_model("large-v3")

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe",
)
print(result["text"])
```
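### Usage with faster-whisper (CTranslate2)

The Architecture section lists CTranslate2 among the compatible runtimes. A minimal sketch, assuming the checkpoint is first converted with the standard `ct2-transformers-converter` tool shipped with CTranslate2 (the output directory name is illustrative):

```bash
pip install faster-whisper ctranslate2
ct2-transformers-converter --model MEscriva/gilbert-fr-source \
    --output_dir gilbert-fr-source-ct2 --quantization float16
```

```python
from faster_whisper import WhisperModel

# Load the converted CTranslate2 directory produced above.
model = WhisperModel("gilbert-fr-source-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", language="fr", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```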
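### Reproducing the Baseline WER

A hedged, illustrative sketch of how WER figures like those in the benchmarks table can be computed, using the `jiwer` package (`pip install jiwer`, not included in the installation line above) and the normalization described under Evaluation Standards below. The exact normalizer behind the reported figures is not published here, so the numbers you obtain may differ.

```python
import string

import jiwer


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, per the Evaluation Standards."""
    text = text.lower().replace("-", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# In a real run, `references` come from the dataset transcripts and
# `hypotheses` from the model's output (e.g. the pipeline sketch above).
references = ["Bonjour à tous et bienvenue."]
hypotheses = ["bonjour a tous et bienvenu"]

wer = jiwer.wer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
cer = jiwer.cer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
print(f"WER: {wer:.2%}  CER: {cer:.2%}")
```

Note that this normalizer does not fold accents ("à" vs "a" counts as a substitution), which is one of the details a full evaluation recipe would need to pin down.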
---

## Research Methodology

### Baseline Purpose

This model serves as:

1. **Frozen Reference:** Weights remain unchanged to ensure consistent baseline comparisons
2. **Reproducibility Anchor:** All experiments reference this exact checkpoint
3. **Version Control:** Future Gilbert models explicitly reference this baseline version for traceability

### Evaluation Standards

- **WER Calculation:** WER = (S + D + I) / N over normalized text (lowercasing, punctuation removal); see the reproduction sketch in the Usage section above
- **Metrics:** Word Error Rate (WER), Character Error Rate (CER), BLEU score
- **Advanced Metrics:** Speaker-attributed WER (SA-WER), long-context stability (internal research)

### Versioning

- **Current Version:** 0.1 (Research Baseline)
- **Future Versions:** All Gilbert model variants will reference this baseline version

---

## Limitations

This baseline model inherits known limitations from Whisper and its training data:

1. **Overlapping Speech:** Sensitive to simultaneous speakers
2. **Long-form Decoding:** Occasional hallucinations in very long audio segments
3. **Domain Shift:** Suboptimal performance on spontaneous dialogue without fine-tuning
4. **Accent Distribution:** Potential biases related to accent representation in the training data
5. **Telephony Bandwidth:** Suboptimal performance on narrowband (8 kHz) audio without adaptation

**Understanding and quantifying these limitations is a core objective of the Gilbert research roadmap.**

---

## Future Research Directions

The following specialized models will be developed as independent checkpoints from this baseline:

### Planned Gilbert Models

1. **Gilbert-FR-Longform-v1**
   - Optimized for long meetings (30-120 minutes)
   - Multi-speaker interaction handling
   - Discourse-level context stability

2. **Gilbert-FR-Accents-v1**
   - Robustness to regional and international French accents
   - African, Canadian, and Belgian accent optimization

3. **Gilbert-FR-Telephone-v1**
   - Optimized for 8 kHz VoIP/call-center speech
   - Narrowband audio adaptation

4. **Gilbert-Multilingual-v1**
   - Extended cross-lingual performance
   - French-anchored models with multilingual support

**All future Gilbert models are the exclusive intellectual property of Lexia France** and will ship with detailed evaluation reports adhering to research reproducibility standards.

---

## Intellectual Property and Licensing

### License for This Baseline

This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the **MIT License**, allowing:

- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ✅ Patent use

See the `LICENSE` file for full terms.

### Intellectual Property Notice

**Important:** While this baseline model is available under the MIT License:

- **All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- Use of this baseline for Gilbert project development implies acceptance of these IP terms.
- Commercial use of Gilbert project derivatives requires separate licensing agreements.

For licensing inquiries regarding Gilbert project models, contact: **mathis@lexiapro.fr**

---

## Citation

If you use this baseline model in your research, please cite:

```bibtex
@software{gilbert_fr_source_2024,
  title={Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition},
  author={MEscriva and Lexia France},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-fr-source},
  version={0.1},
  note={Research baseline for the Gilbert project}
}
```

---

## Acknowledgments

This baseline model builds on:

- **OpenAI Whisper Large V3** (MIT License)
- **bofenghuang/whisper-large-v3-french** (French fine-tuning)

We acknowledge the contributions of the open-source community and the original Whisper research team.

---

## Contact

For research collaboration, evaluation access, or technical inquiries:

- **Website:** [https://gilbert-assistant.fr](https://gilbert-assistant.fr)
- **Email:** mathis@lexiapro.fr
- **Repository:** [https://huggingface.co/MEscriva/gilbert-fr-source](https://huggingface.co/MEscriva/gilbert-fr-source)

---

## Changelog

### Version 0.1 (2024-12-19)

- Initial research baseline release
- Based on Whisper Large V3 with French optimization
- Established as the frozen reference point for the Gilbert project
- Documented baseline performance metrics

---

**© 2024 Lexia France. All rights reserved for Gilbert project derivatives.**