---
license: mit
tags:
- automatic-speech-recognition
- asr
- whisper
- french
- speech-recognition
- stt
- multilingual
- research
- baseline
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
---

# Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition

## Overview

**Gilbert-FR-Source** is the foundational baseline model for the **Gilbert research project**, a research initiative focused on developing state-of-the-art automatic speech recognition (ASR) systems optimized for French language applications. This model serves as the **frozen reference point** for all subsequent research, fine-tuning, and development work within the Gilbert ecosystem.

**Important Notice on Intellectual Property:**

- This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the MIT License, allowing research and commercial use.
- **All derivative models, fine-tuned variants, and specialized models developed from this baseline as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- While this baseline can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

---

## Research Context

The Gilbert project is a systematic research and development effort aimed at creating highly specialized ASR systems for:

- **Professional meeting transcription** (hybrid and remote meetings)
- **Long-form multi-speaker discourse** (30-120 minute sessions)
- **Institutional environments** (education, public sector, healthcare)
- **Constrained audio conditions** (telephony, VoIP, low signal-to-noise ratio)
- **Sociolinguistic diversity** (African, Canadian, Belgian, and other French accents)

This baseline model provides the **controlled starting point** for all experimental work, ensuring reproducibility and enabling fair comparison across different research directions.

---

## Model Details

### Architecture

- **Base Model:** OpenAI Whisper Large V3
- **Fine-tuning:** French-focused fine-tuning (see Acknowledgments)
- **Framework:** Compatible with Hugging Face Transformers, OpenAI Whisper, CTranslate2, ONNX Runtime, and MLX
- **Model Size:** ~1.54 B parameters (~3.1 GB in float16; ~6.2 GB in float32)

### Key Characteristics

- **Language:** French (primary), with multilingual capabilities
- **Context Length:** Fixed 30-second encoder windows; long-form audio is supported via chunked decoding
- **Output:** Text transcription with word-level timestamps
- **Performance:** Optimized for French speech recognition accuracy
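Whisper's encoder consumes audio in fixed 30-second windows, so long-form recordings are transcribed by splitting them into overlapping windows and stitching the results. A minimal sketch of that window arithmetic (the 5-second overlap is an illustrative assumption, not a Gilbert parameter):

```python
def chunk_windows(duration_s: float, window: float = 30.0, overlap: float = 5.0):
    """Return (start, end) offsets in seconds for chunked long-form decoding.

    window=30 s matches Whisper's fixed encoder input; the overlap between
    consecutive windows is an illustrative choice.
    """
    step = window - overlap
    chunks = []
    start = 0.0
    while start < duration_s:
        chunks.append((start, min(start + window, duration_s)))
        if start + window >= duration_s:
            break
        start += step
    return chunks

print(chunk_windows(70.0))  # → [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each window is decoded independently, and text in the overlapped region is used to merge neighboring transcripts.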

---

## Intended Use

### Research and Development

This model is intended for:

1. **Research Baseline:** Use as a reference point for ASR research and experimentation
2. **Comparative Studies:** Benchmark against this baseline when evaluating new architectures or training strategies
3. **Fine-tuning Foundation:** Use as a starting point for domain-specific fine-tuning (subject to Gilbert project IP terms)
4. **Educational Purposes:** Learning and understanding ASR model behavior

### Production Use

While this baseline model can be used directly, **production deployments should use specialized Gilbert models** that are optimized for specific use cases and domains. Contact the Gilbert team for production-grade models.

---

## Performance Benchmarks

### Reference Results

The following Word Error Rate (WER) scores serve as the **baseline reference** for future Gilbert model development:

| Dataset | WER | Notes |
|---------|-----|-------|
| MLS (FR) | 3.98% | Multilingual LibriSpeech, French subset |
| Common Voice FR (v13.0) | 7.28% | Diverse crowd-sourced French speech |
| VoxPopuli (FR) | 8.91% | European Parliament speeches |
| FLEURS (FR) | 4.84% | FLEURS French test set |
| African Accented French | 4.20% | Regional accent evaluation |

**Note:** Lower WER is better; these scores are the reference point before targeted fine-tuning, and future Gilbert variants will be evaluated against them to measure improvement.

---

## Usage

### Installation

```bash
pip install transformers torch torchaudio librosa soundfile
```

### Basic Usage with Transformers

```python
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "MEscriva/gilbert-fr-source"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True
)
model.to(device)

# Load the audio as a 16 kHz waveform; the processor expects an array,
# not a file path.
audio, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs["input_features"].to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe"
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]
print(transcription)
```
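For recordings longer than a single 30-second window, the Transformers `pipeline` wrapper handles chunking, batching, and timestamp merging automatically. A sketch, with illustrative (not project-specified) chunk and batch settings; the heavy imports and model download are kept inside a function so nothing loads at import time:

```python
MODEL_ID = "MEscriva/gilbert-fr-source"

# Illustrative long-form settings, not project-specified values.
CHUNK_LENGTH_S = 30   # Whisper's fixed encoder window
BATCH_SIZE = 8        # chunks decoded per forward pass

def transcribe_long(audio_path: str):
    """Transcribe a long French recording with the chunked ASR pipeline."""
    import torch
    from transformers import pipeline  # heavy imports kept inside the function

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=CHUNK_LENGTH_S,
        batch_size=BATCH_SIZE,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device=0 if torch.cuda.is_available() else -1,
    )
    return asr(
        audio_path,
        return_timestamps="word",
        generate_kwargs={"language": "fr", "task": "transcribe"},
    )
```

`return_timestamps="word"` yields the word-level timestamps mentioned under Key Characteristics.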

### Usage with OpenAI Whisper

```python
import whisper

# NOTE: whisper.load_model("large-v3") downloads the upstream OpenAI
# checkpoint, not the gilbert-fr-source weights; it is shown here because
# the model follows the same Whisper Large V3 format.
model = whisper.load_model("large-v3")

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe"
)

print(result["text"])
```
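Since the checkpoint follows the Whisper format, it can also be served through CTranslate2 via `faster-whisper`. A sketch under stated assumptions: the model directory is a hypothetical local path produced beforehand with `ct2-transformers-converter`, and `faster-whisper` must be installed separately (`pip install faster-whisper`):

```python
def transcribe_ct2(audio_path: str, model_dir: str = "gilbert-fr-source-ct2"):
    """Transcribe via a CTranslate2 conversion of the checkpoint.

    `model_dir` is a hypothetical local path holding the converted model.
    """
    from faster_whisper import WhisperModel  # third-party dependency

    model = WhisperModel(model_dir, device="auto")
    # transcribe() returns a lazy generator of segments plus metadata
    segments, info = model.transcribe(audio_path, language="fr", task="transcribe")
    return " ".join(segment.text.strip() for segment in segments)
```

CTranslate2 inference typically trades a one-time conversion step for lower latency and memory use than the full PyTorch stack.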

---

## Research Methodology

### Baseline Purpose

This model serves as:

1. **Frozen Reference:** Weights remain unchanged to ensure consistent baseline comparisons
2. **Reproducibility Anchor:** All experiments reference this exact checkpoint
3. **Version Control:** Future Gilbert models explicitly reference this baseline version for traceability

### Evaluation Standards

- **WER Calculation:** Standard normalization (lowercasing, punctuation removal)
- **Metrics:** Word Error Rate (WER), Character Error Rate (CER), BLEU score
- **Advanced Metrics:** Speaker-attributed WER (SA-WER), long-context stability (internal research)
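The normalization and WER computation described above can be sketched in plain Python (a minimal reference implementation of the standard definition, not the project's internal evaluation script):

```python
import re
import string

def normalize(text: str) -> list[str]:
    """Standard ASR normalization: lowercase, strip punctuation, split into words."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return text.split()

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein distance over normalized word sequences."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Bonjour, tout le monde !", "bonjour tout le Monde"))  # → 0.0
```

Because casing and punctuation are stripped first, the two sentences above count as an exact match.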

### Versioning

- **Current Version:** 0.1 (Research Baseline)
- **Future Versions:** All Gilbert model variants will reference this baseline version

---

## Limitations

This baseline model inherits known limitations from Whisper and the underlying training data:

1. **Overlapping Speech:** Sensitivity to simultaneous speakers
2. **Long-form Decoding:** Occasional hallucinations in very long audio segments
3. **Domain Shift:** Suboptimal performance on spontaneous dialogue without fine-tuning
4. **Accent Distribution:** Potential biases related to accent representation in training data
5. **Telephony Bandwidth:** Suboptimal performance on narrowband (8 kHz) audio without adaptation

**Understanding and quantifying these limitations is a core objective of the Gilbert research roadmap.**

---

## Future Research Directions

The following specialized models will be developed as independent checkpoints from this baseline:

### Planned Gilbert Models

1. **Gilbert-FR-Longform-v1**
   - Optimized for long meetings (30-120 minutes)
   - Multi-speaker interaction handling
   - Discourse-level context stability

2. **Gilbert-FR-Accents-v1**
   - Robustness to regional and international French accents
   - African, Canadian, Belgian accent optimization

3. **Gilbert-FR-Telephone-v1**
   - Optimized for 8 kHz VoIP/call-center speech
   - Narrowband audio adaptation

4. **Gilbert-Multilingual-v1**
   - Extended cross-lingual performance
   - Optimized French anchors with multilingual support

**All future Gilbert models are the exclusive intellectual property of Lexia France** and will include detailed evaluation reports adhering to research reproducibility standards.

---

## Intellectual Property and Licensing

### License for This Baseline

This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the **MIT License**, allowing:

- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ✅ Patent use

See the `LICENSE` file for full terms.

### Intellectual Property Notice

**Important:** While this baseline model is available under the MIT License:

- **All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- Use of this baseline for Gilbert project development implies acceptance of these IP terms.
- Commercial use of Gilbert project derivatives requires separate licensing agreements.

For licensing inquiries regarding Gilbert project models, contact: **mathis@lexiapro.fr**

---

## Citation

If you use this baseline model in your research, please cite:

```bibtex
@software{gilbert_fr_source_2024,
  title={Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition},
  author={MEscriva and Lexia France},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-fr-source},
  version={0.1},
  note={Research baseline for the Gilbert project}
}
```

---

## Acknowledgments

This baseline model is based on:

- **OpenAI Whisper Large V3** (MIT License)
- **bofenghuang/whisper-large-v3-french** (French fine-tuning)

We acknowledge the contributions of the open-source community and the original Whisper research team.

---

## Contact

For research collaboration, evaluation access, or technical inquiries:

- **Website:** [https://gilbert-assistant.fr](https://gilbert-assistant.fr)
- **Email:** mathis@lexiapro.fr
- **Repository:** [https://huggingface.co/MEscriva/gilbert-fr-source](https://huggingface.co/MEscriva/gilbert-fr-source)

---

## Changelog

### Version 0.1 (2024-12-19)

- Initial research baseline release
- Based on Whisper Large V3 with French optimization
- Established as frozen reference point for Gilbert project
- Documentation of baseline performance metrics

---

**© 2024 Lexia France. All rights reserved for Gilbert project derivatives.**