---
license: mit
tags:
  - automatic-speech-recognition
  - asr
  - whisper
  - french
  - speech-recognition
  - stt
  - multilingual
  - research
  - baseline
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
---

# Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition

## Overview

**Gilbert-FR-Source** is the foundational baseline model for the **Gilbert research project**, an initiative focused on developing state-of-the-art automatic speech recognition (ASR) systems for French. This model serves as the **frozen reference point** for all subsequent research, fine-tuning, and development work within the Gilbert ecosystem.

**Important Notice on Intellectual Property:**

- This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the MIT License, allowing research and commercial use.
- **All derivative models, fine-tuned variants, and specialized models developed from this baseline as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- While this baseline can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

---

## Research Context

The Gilbert project is a systematic research and development effort aimed at creating highly specialized ASR systems for:

- **Professional meeting transcription** (hybrid and remote meetings)
- **Long-form multi-speaker discourse** (30-120 minute sessions)
- **Institutional environments** (education, public sector, healthcare)
- **Constrained audio conditions** (telephony, VoIP, low signal-to-noise ratio)
- **Sociolinguistic diversity** (African, Canadian, Belgian, and other French accents)

This baseline model provides the **controlled starting point** for all experimental work, ensuring reproducibility and enabling fair comparison across research directions.

---

## Model Details

### Architecture

- **Base Model:** OpenAI Whisper Large V3
- **Fine-tuning:** French-optimized weights derived from `bofenghuang/whisper-large-v3-french` (see Acknowledgments)
- **Framework:** Compatible with Hugging Face Transformers, OpenAI Whisper, CTranslate2, ONNX Runtime, and MLX
- **Model Size:** ~3.2 GB (FP16 weights)

### Key Characteristics

- **Language:** French (primary), with multilingual capabilities inherited from Whisper
- **Context Length:** Whisper processes audio in 30-second windows; longer recordings are handled via chunked or sequential decoding (see the sketch below)
- **Output:** Text transcription, with optional word-level timestamps
- **Performance:** Tuned for French speech recognition accuracy
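### Long-form Decoding Example

Since the characteristics above mention long-form support and word-level timestamps, here is a minimal sketch of how both are typically obtained through the Transformers chunked ASR pipeline. It assumes this checkpoint keeps the standard Whisper processor layout; `meeting.wav` is a placeholder path.

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Build an ASR pipeline directly from this checkpoint.
pipe = pipeline(
    "automatic-speech-recognition",
    model="MEscriva/gilbert-fr-source",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

# chunk_length_s splits long recordings into 30-second windows (matching
# Whisper's receptive field); return_timestamps="word" requests word-level
# timestamps in the output.
result = pipe(
    "meeting.wav",  # placeholder path
    chunk_length_s=30,
    batch_size=8,
    return_timestamps="word",
    generate_kwargs={"language": "fr", "task": "transcribe"},
)
print(result["text"])
```

Chunked decoding trades some cross-chunk context for bounded memory use, which is why the Gilbert roadmap treats long-form stability as a dedicated research direction.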
---

## Intended Use

### Research and Development

This model is intended for:

1. **Research Baseline:** Use as a reference point for ASR research and experimentation
2. **Comparative Studies:** Benchmark new architectures or training strategies against this baseline
3. **Fine-tuning Foundation:** Use as a starting point for domain-specific fine-tuning (subject to Gilbert project IP terms)
4. **Educational Purposes:** Learning and understanding ASR model behavior

### Production Use

While this baseline model can be used directly, **production deployments should use specialized Gilbert models** optimized for specific use cases and domains. Contact the Gilbert team for production-grade models.

---

## Performance Benchmarks

### Reference Results

The following Word Error Rate (WER) scores serve as the **baseline reference** for future Gilbert model development:

| Dataset | WER | Notes |
|---------|-----|-------|
| MLS (FR) | 3.98% | Multilingual LibriSpeech, French subset |
| Common Voice FR (v13.0) | 7.28% | Diverse French speech |
| VoxPopuli (FR) | 8.91% | European Parliament speeches |
| FLEURS (FR) | 4.84% | FLEURS French test set |
| African Accented French | 4.20% | Regional accent evaluation |

**Note:** These results are the reference point before targeted fine-tuning: future Gilbert variants will be evaluated against these baselines to measure improvement. A sketch for reproducing this kind of evaluation appears at the end of the Usage section below.

---

## Usage

### Installation

```bash
pip install transformers torch torchaudio librosa soundfile
```

### Basic Usage with Transformers

```python
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "MEscriva/gilbert-fr-source"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the audio as a 16 kHz waveform; the processor expects an audio
# array, not a file path.
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe",
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(transcription)
```

### Usage with OpenAI Whisper

```python
import whisper

# Note: whisper.load_model("large-v3") downloads the upstream OpenAI
# checkpoint, not this fine-tuned baseline; use the Transformers example
# above to run gilbert-fr-source itself.
model = whisper.load_model("large-v3")

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe",
)
print(result["text"])
```
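### Usage with faster-whisper (CTranslate2)

The Architecture section lists CTranslate2 among the compatible runtimes. A minimal sketch, assuming the checkpoint is first converted with the standard `ct2-transformers-converter` tool shipped with CTranslate2 (the output directory name is illustrative):

```bash
pip install faster-whisper ctranslate2
ct2-transformers-converter --model MEscriva/gilbert-fr-source \
    --output_dir gilbert-fr-source-ct2 --quantization float16
```

```python
from faster_whisper import WhisperModel

# Load the converted CTranslate2 directory produced above.
model = WhisperModel("gilbert-fr-source-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", language="fr", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```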
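### Reproducing the Baseline WER

A hedged, illustrative sketch of how WER figures like those in the benchmarks table can be computed, using the `jiwer` package (`pip install jiwer`, not included in the installation line above) and the normalization described under Evaluation Standards below. The exact normalizer behind the reported figures is not published here, so the numbers you obtain may differ.

```python
import string

import jiwer


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, per the Evaluation Standards."""
    text = text.lower().replace("-", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# In a real run, `references` come from the dataset transcripts and
# `hypotheses` from the model's output (e.g. the pipeline sketch above).
references = ["Bonjour à tous et bienvenue."]
hypotheses = ["bonjour a tous et bienvenu"]

wer = jiwer.wer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
cer = jiwer.cer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
print(f"WER: {wer:.2%}  CER: {cer:.2%}")
```

Note that this normalizer does not fold accents ("à" vs "a" counts as a substitution), which is one of the details a full evaluation recipe would need to pin down.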
---

## Research Methodology

### Baseline Purpose

This model serves as:

1. **Frozen Reference:** Weights remain unchanged to ensure consistent baseline comparisons
2. **Reproducibility Anchor:** All experiments reference this exact checkpoint
3. **Version Control:** Future Gilbert models explicitly reference this baseline version for traceability

### Evaluation Standards

- **WER Calculation:** WER = (S + D + I) / N over normalized text (lowercasing, punctuation removal); see the reproduction sketch in the Usage section above
- **Metrics:** Word Error Rate (WER), Character Error Rate (CER), BLEU score
- **Advanced Metrics:** Speaker-attributed WER (SA-WER), long-context stability (internal research)

### Versioning

- **Current Version:** 0.1 (Research Baseline)
- **Future Versions:** All Gilbert model variants will reference this baseline version

---

## Limitations

This baseline model inherits known limitations from Whisper and its training data:

1. **Overlapping Speech:** Sensitive to simultaneous speakers
2. **Long-form Decoding:** Occasional hallucinations in very long audio segments
3. **Domain Shift:** Suboptimal performance on spontaneous dialogue without fine-tuning
4. **Accent Distribution:** Potential biases related to accent representation in the training data
5. **Telephony Bandwidth:** Suboptimal performance on narrowband (8 kHz) audio without adaptation

**Understanding and quantifying these limitations is a core objective of the Gilbert research roadmap.**

---

## Future Research Directions

The following specialized models will be developed as independent checkpoints from this baseline:

### Planned Gilbert Models

1. **Gilbert-FR-Longform-v1**
   - Optimized for long meetings (30-120 minutes)
   - Multi-speaker interaction handling
   - Discourse-level context stability

2. **Gilbert-FR-Accents-v1**
   - Robustness to regional and international French accents
   - African, Canadian, and Belgian accent optimization

3. **Gilbert-FR-Telephone-v1**
   - Optimized for 8 kHz VoIP/call-center speech
   - Narrowband audio adaptation

4. **Gilbert-Multilingual-v1**
   - Extended cross-lingual performance
   - French-anchored models with multilingual support

**All future Gilbert models are the exclusive intellectual property of Lexia France** and will ship with detailed evaluation reports adhering to research reproducibility standards.

---

## Intellectual Property and Licensing

### License for This Baseline

This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the **MIT License**, allowing:

- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ✅ Patent use

See the `LICENSE` file for full terms.

### Intellectual Property Notice

**Important:** While this baseline model is available under the MIT License:

- **All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
- Use of this baseline for Gilbert project development implies acceptance of these IP terms.
- Commercial use of Gilbert project derivatives requires separate licensing agreements.

For licensing inquiries regarding Gilbert project models, contact: **mathis@lexiapro.fr**

---

## Citation

If you use this baseline model in your research, please cite:

```bibtex
@software{gilbert_fr_source_2024,
  title={Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition},
  author={MEscriva and Lexia France},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-fr-source},
  version={0.1},
  note={Research baseline for the Gilbert project}
}
```

---

## Acknowledgments

This baseline model builds on:

- **OpenAI Whisper Large V3** (MIT License)
- **bofenghuang/whisper-large-v3-french** (French fine-tuning)

We acknowledge the contributions of the open-source community and the original Whisper research team.

---

## Contact

For research collaboration, evaluation access, or technical inquiries:

- **Website:** [https://gilbert-assistant.fr](https://gilbert-assistant.fr)
- **Email:** mathis@lexiapro.fr
- **Repository:** [https://huggingface.co/MEscriva/gilbert-fr-source](https://huggingface.co/MEscriva/gilbert-fr-source)

---

## Changelog

### Version 0.1 (2024-12-19)

- Initial research baseline release
- Based on Whisper Large V3 with French optimization
- Established as the frozen reference point for the Gilbert project
- Documented baseline performance metrics

---

**© 2024 Lexia France. All rights reserved for Gilbert project derivatives.**