MEscriva committed · verified
Commit 6f5550e · 1 Parent(s): 52eebe2

Copy from MEscriva/gilbert-fr-source - Baseline model for Gilbert research

Files changed (1): README.md (+222 -117)

README.md CHANGED
@@ -1,195 +1,300 @@
  ---
  license: mit
- datasets:
- - google/fleurs
- - facebook/voxpopuli
- - facebook/multilingual_librispeech
- - mozilla-foundation/common_voice_13_0
- - mozilla-foundation/common_voice_17_0
- language:
- - fr
- - en
- metrics:
- - wer
- base_model:
- - openai/whisper-large-v3
- pipeline_tag: automatic-speech-recognition
- library_name: transformers
  tags:
- - speech-recognition
  - whisper
  - french
  - stt
  - multilingual
  - research
- - gilbert
  ---

  # Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition

- `Gilbert-FR-Source` is a French automatic speech recognition (ASR) model used as the **research foundation** for the Gilbert project.
- It is designed as an internal scientific baseline enabling controlled experimentation, reproducible evaluation, and rigorous comparison across ASR architectures, datasets, and adaptation methods.

- This model is not a fine-tuned derivative, but a **curated research anchor** used to support systematic studies in:

- - domain adaptation,
- - robustness to spontaneous and long-form speech,
- - accented and low-resource linguistic profiles,
- - telephony and bandwidth-constrained speech,
- - multi-speaker and meeting transcription.

  ---

- ## 1. Research Motivation

- The Gilbert project aims to build highly specialized ASR systems optimized for:

- - professional meeting transcription (hybrid/remote),
- - long-form multi-speaker discourse,
- - institutional environments (education, public sector),
- - constrained audio conditions (telephony, VoIP, low SNR),
- - sociolinguistic diversity (African, Canadian, Belgian and other French accents).

- While Whisper Large V3 provides strong baseline performance, its behavior under domain shifts (spontaneous interactions, overlapping speech, degraded microphones) requires systematic study.
- `Gilbert-FR-Source` provides the **frozen starting point** for this line of research, ensuring controlled comparisons between experiments.

  ---

- ## 2. Scientific Goals and Research Questions

- This model is used to answer a series of research questions:

- ### **Q1. Long-form modeling**
- How does Whisper-L3 behave on meetings lasting 30–120 minutes, with natural topic shifts, interruptions, and pragmatic markers?

- ### **Q2. Accent robustness**
- Which classes of French accents induce the strongest WER degradation?
- How does robustness vary across FLEURS, African French, and Common Voice subsets?

- ### **Q3. Telephony adaptation**
- What is the degradation curve when downsampling to 16 kHz / 8 kHz / μ-law compressed audio?

- ### **Q4. Domain adaptation efficiency**
- What is the marginal gain of targeted fine-tuning on professional meeting datasets (education, administration, healthcare)?

- ### **Q5. Multilingual side-effects**
- To what extent does French fine-tuning affect cross-lingual generalization?

- These research axes structure the development of future specialized Gilbert models.

  ---

- ## 3. Benchmark Reference Results

- The following WER scores originate from established open benchmarks and serve as a *reference baseline* for future experiments:

- | Dataset | WER |
- |--------|-----|
- | MLS (FR) | 3.98 % |
- | Common Voice FR (v13.0) | 7.28 % |
- | VoxPopuli (FR) | 8.91 % |
- | Fleurs (FR) | 4.84 % |
- | African Accented French | 4.20 % |

- These results provide **upper bounds** before targeted fine-tuning.
- Future Gilbert variants will be evaluated using:

- - internal meeting datasets,
- - domain-specific corpora (administration, higher education, healthcare),
- - accented speech corpora,
- - telephony datasets,
- - long-form evaluation methods (> 1 hour audio).

  ---

- ## 4. Architecture

- The model is based on the **Whisper Large V3** encoder–decoder architecture, offering:

- - large multilingual pretraining,
- - long-context modeling capacity,
- - robust cross-lingual alignment,
- - stable decoding for long outputs,
- - strong zero-shot performance on French.

- It is compatible with:

- - Hugging Face Transformers,
- - CTranslate2,
- - ONNX Runtime,
- - MLX (Apple Silicon),
- - quantization-based acceleration pipelines.

  ---

- ## 5. Methodology and Reproducibility

- `Gilbert-FR-Source` is used in strict research settings emphasizing:

- ### **Reproducible training protocols**
- - frozen weights for baseline comparison,
- - controlled hyperparameter schedules,
- - consistent evaluation datasets,
- - deterministic decoding configurations.

- ### **Evaluation methodology**
- WER is computed with standard normalization (lowercasing, punctuation removal).
- More advanced metrics (diarization error rate, long-context drift) are included in internal research pipelines.

- ### **Versioning policy**
- This repository represents version `0.1` of the research baseline.
- All future fine-tuned models will explicitly reference this version for traceability.

  ---

- ## 6. Limitations

- This baseline inherits the known limitations of Whisper and of the underlying datasets:

- - sensitivity to overlapping speech,
- - occasional hallucinations in long-form decoding,
- - domain shift on spontaneous dialogue,
- - potential biases related to accent distribution in training data,
- - suboptimal performance in telephony bandwidth.

- Understanding and quantifying these limitations is one of the core objectives of the Gilbert research roadmap.

  ---

- ## 7. Future Work (Planned Research Directions)

- The following models will be developed as independent checkpoints:

- - **Gilbert-FR-Longform-v1**
-   Long meetings, multi-speaker interaction, discourse-level context stability.

- - **Gilbert-FR-Accents-v1**
-   Robustness to regional and international French accents.

- - **Gilbert-FR-Telephone-v1**
-   Optimized for 8 kHz VoIP/call-center speech.

- - **Gilbert-Multilingual-v1**
-   Extended cross-lingual performance with optimized French anchors.

- Each model will include detailed evaluation reports and will adhere to research reproducibility standards.

  ---

- ## 8. License

- This repository includes files distributed under the MIT License.

- > A copy of the MIT License is included.
- > Some files were originally released under MIT.

- All future Gilbert models built on top of this baseline are the exclusive property of Lexia France.

  ---

- ## 9. Contact

  For research collaboration, evaluation access, or technical inquiries:

- - Website: https://gilbert-assistant.fr
- - Email: mathis@lexiapro.fr
  ---
  license: mit
  tags:
+ - automatic-speech-recognition
+ - asr
  - whisper
  - french
+ - speech-recognition
  - stt
  - multilingual
  - research
+ - baseline
+ library_name: transformers
+ pipeline_tag: automatic-speech-recognition
+ base_model: openai/whisper-large-v3
  ---

  # Gilbert-FR-Source — Research Baseline for French Automatic Speech Recognition

+ ## Overview

+ **Gilbert-FR-Source** is the foundational baseline model for the **Gilbert research project**, a comprehensive initiative focused on developing state-of-the-art automatic speech recognition (ASR) systems optimized for French language applications. This model serves as the **frozen reference point** for all subsequent research, fine-tuning, and development work within the Gilbert ecosystem.

+ **Important Notice on Intellectual Property:**
+ - This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the MIT License, allowing research and commercial use.
+ - **All derivative models, fine-tuned variants, and specialized models developed from this baseline as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
+ - While this baseline can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

  ---

+ ## Research Context

+ The Gilbert project is a systematic research and development effort aimed at creating highly specialized ASR systems for:

+ - **Professional meeting transcription** (hybrid and remote meetings)
+ - **Long-form multi-speaker discourse** (30-120 minute sessions)
+ - **Institutional environments** (education, public sector, healthcare)
+ - **Constrained audio conditions** (telephony, VoIP, low signal-to-noise ratio)
+ - **Sociolinguistic diversity** (African, Canadian, Belgian, and other French accents)

+ This baseline model provides the **controlled starting point** for all experimental work, ensuring reproducibility and enabling fair comparison across different research directions.

  ---

+ ## Model Details
+
+ ### Architecture
+
+ - **Base Model:** OpenAI Whisper Large V3
+ - **Fine-tuning:** Optimized for French language performance
+ - **Framework:** Compatible with Hugging Face Transformers, OpenAI Whisper, CTranslate2, ONNX Runtime, and MLX
+ - **Model Size:** ~1.55 B parameters (~3.1 GB in float16; see the sanity check below)
+
+ ### Key Characteristics
+
+ - **Language:** French (primary), with multilingual capabilities
+ - **Context Length:** 30-second encoder windows; long-form audio is handled through chunked decoding
+ - **Output:** Text transcription with optional word-level timestamps
+ - **Performance:** Optimized for French speech recognition accuracy
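+
+ As a quick sanity check of the size figures above, the parameter count can be verified directly; a minimal sketch (the float16 footprint is estimated as two bytes per parameter):
+
+ ```python
+ from transformers import AutoModelForSpeechSeq2Seq
+ import torch
+
+ # Load in float16 and count parameters to verify the size figures above
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     "MEscriva/gilbert-fr-source", torch_dtype=torch.float16
+ )
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e9:.2f}B parameters, ~{n_params * 2 / 1e9:.1f} GB in float16")
+ ```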
+
+ ---
+
+ ## Intended Use
+
+ ### Research and Development
+
+ This model is intended for:
+
+ 1. **Research Baseline:** Use as a reference point for ASR research and experimentation
+ 2. **Comparative Studies:** Benchmark against this baseline when evaluating new architectures or training strategies
+ 3. **Fine-tuning Foundation:** Use as a starting point for domain-specific fine-tuning (subject to Gilbert project IP terms)
+ 4. **Educational Purposes:** Learning and understanding ASR model behavior
+
+ ### Production Use
+
+ While this baseline model can be used directly, **production deployments should use specialized Gilbert models** that are optimized for specific use cases and domains. Contact the Gilbert team for production-grade models.

  ---

+ ## Performance Benchmarks
+
+ ### Reference Results
+
+ The following WER (Word Error Rate) scores serve as a **baseline reference** for future Gilbert model development:
+
+ | Dataset | WER | Notes |
+ |---------|-----|-------|
+ | MLS (FR) | 3.98% | Multilingual LibriSpeech French |
+ | Common Voice FR (v13.0) | 7.28% | Diverse crowd-sourced French speech |
+ | VoxPopuli (FR) | 8.91% | European Parliament speeches |
+ | FLEURS (FR) | 4.84% | FLEURS benchmark (speech counterpart of FLORES) |
+ | African Accented French | 4.20% | Regional accent evaluation |
+
+ **Note:** These results represent the **upper bound** on error before targeted fine-tuning. Future Gilbert variants will be evaluated against these baselines to measure improvement.
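+
+ For spot-checking these numbers, the public benchmark splits can be pulled straight from the Hub. A minimal sketch for the FLEURS French test split (the `fr_fr` config and `transcription` field follow the public `google/fleurs` dataset card; adapt as needed):
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream a few FLEURS French test examples for a quick qualitative check
+ fleurs_fr = load_dataset("google/fleurs", "fr_fr", split="test", streaming=True)
+ for example in fleurs_fr.take(3):
+     print(example["transcription"])
+ ```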
 
 
 
 

  ---

+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install transformers torch torchaudio librosa soundfile
+ ```
+
+ ### Basic Usage with Transformers
+
+ ```python
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+ import librosa
+ import torch
+
+ model_id = "MEscriva/gilbert-fr-source"
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if device == "cuda" else torch.float32
+
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id,
+     torch_dtype=torch_dtype,
+     low_cpu_mem_usage=True
+ )
+ model.to(device)
+
+ # Load audio as a 16 kHz mono waveform; the processor expects raw samples, not a file path
+ audio, _ = librosa.load("your_audio.wav", sr=16000, mono=True)
+ inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+ input_features = inputs["input_features"].to(device, dtype=torch_dtype)
+
+ with torch.no_grad():
+     generated_ids = model.generate(
+         input_features,
+         language="fr",
+         task="transcribe"
+     )
+
+ transcription = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True
+ )[0]
+ print(transcription)
+ ```
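+
+ For long-form files, the high-level `pipeline` API is a convenient alternative: it chunks audio automatically and can return the word-level timestamps mentioned above. A minimal sketch; the chunk length and device settings are illustrative defaults, not tuned recommendations:
+
+ ```python
+ from transformers import pipeline
+ import torch
+
+ # Chunked long-form decoding with word-level timestamps
+ asr = pipeline(
+     "automatic-speech-recognition",
+     model="MEscriva/gilbert-fr-source",
+     torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+     device=0 if torch.cuda.is_available() else -1,
+ )
+
+ result = asr(
+     "your_audio.wav",
+     chunk_length_s=30,  # Whisper's native 30-second window
+     return_timestamps="word",
+     generate_kwargs={"language": "fr", "task": "transcribe"},
+ )
+ print(result["text"])
+ ```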
+
+ ### Usage with OpenAI Whisper
+
+ ```python
+ import whisper
+
+ # Note: whisper.load_model("large-v3") downloads the stock OpenAI checkpoint;
+ # it does not pull the weights hosted in this repository.
+ model = whisper.load_model("large-v3")
+
+ # Transcribe French audio
+ result = model.transcribe(
+     "audio.wav",
+     language="fr",
+     task="transcribe"
+ )
+
+ print(result["text"])
+ ```
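+
+ Since CTranslate2 is listed among the compatible runtimes, a `faster-whisper` sketch may also be useful. This assumes the checkpoint has first been converted with `ct2-transformers-converter`; the output directory name below is a placeholder:
+
+ ```python
+ from faster_whisper import WhisperModel
+
+ # Placeholder path to a CTranslate2 conversion of this checkpoint, e.g. produced by:
+ #   ct2-transformers-converter --model MEscriva/gilbert-fr-source --output_dir gilbert-fr-ct2
+ model = WhisperModel("gilbert-fr-ct2", device="cuda", compute_type="float16")
+
+ segments, info = model.transcribe("audio.wav", language="fr", task="transcribe")
+ for segment in segments:
+     print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
+ ```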

  ---

+ ## Research Methodology
+
+ ### Baseline Purpose
+
+ This model serves as:
+
+ 1. **Frozen Reference:** Weights remain unchanged to ensure consistent baseline comparisons
+ 2. **Reproducibility Anchor:** All experiments reference this exact checkpoint (see the pinning sketch below)
+ 3. **Version Control:** Future Gilbert models explicitly reference this baseline version for traceability
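+
+ One way to make "this exact checkpoint" concrete is to pin the Hugging Face revision at load time. A minimal sketch; the commit hash is a placeholder, not a real revision of this repository:
+
+ ```python
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+
+ # Pin the baseline to one commit so all experiments compare against the same weights
+ REVISION = "0000000000000000000000000000000000000000"  # placeholder commit hash
+
+ processor = AutoProcessor.from_pretrained("MEscriva/gilbert-fr-source", revision=REVISION)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained("MEscriva/gilbert-fr-source", revision=REVISION)
+ ```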
+
+ ### Evaluation Standards
+
+ - **WER Calculation:** Standard normalization (lowercasing, punctuation removal); reproduced in the sketch after this list
+ - **Metrics:** Word Error Rate (WER), Character Error Rate (CER), BLEU score
+ - **Advanced Metrics:** Speaker-attributed WER (SA-WER), long-context stability (internal research)
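+
+ The normalization described in the first item can be reproduced with the `jiwer` library, as sketched below (`pip install jiwer`; the reference and hypothesis strings are illustrative):
+
+ ```python
+ import jiwer
+
+ # Lowercase and strip punctuation before scoring, per the WER calculation above
+ normalize = jiwer.Compose([
+     jiwer.ToLowerCase(),
+     jiwer.RemovePunctuation(),
+     jiwer.RemoveMultipleSpaces(),
+     jiwer.Strip(),
+ ])
+
+ reference = "Bonjour, comment ça va ?"
+ hypothesis = "bonjour comment ça va"
+
+ wer = jiwer.wer(normalize(reference), normalize(hypothesis))
+ cer = jiwer.cer(normalize(reference), normalize(hypothesis))
+ print(f"WER: {wer:.3f}  CER: {cer:.3f}")  # both 0.0 once normalization is applied
+ ```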
 
 
+
+ ### Versioning
+
+ - **Current Version:** 0.1 (Research Baseline)
+ - **Future Versions:** All Gilbert model variants will reference this baseline version

  ---

+ ## Limitations
+
+ This baseline model inherits known limitations from Whisper and the underlying training data:
+
+ 1. **Overlapping Speech:** Sensitivity to simultaneous speakers
+ 2. **Long-form Decoding:** Occasional hallucinations in very long audio segments
+ 3. **Domain Shift:** Suboptimal performance on spontaneous dialogue without fine-tuning
+ 4. **Accent Distribution:** Potential biases related to accent representation in training data
+ 5. **Telephony Bandwidth:** Degraded accuracy on narrowband (8 kHz) audio without adaptation (see the simulation sketch below)
+
+ **Understanding and quantifying these limitations is a core objective of the Gilbert research roadmap.**
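+
+ The narrowband limitation can be probed by simulating a telephone channel on wideband test audio, as sketched below with `torchaudio` (the file names are illustrative, and the 8 kHz / μ-law chain is a simple approximation of G.711-style coding, not a full channel model):
+
+ ```python
+ import torchaudio
+ import torchaudio.functional as F
+
+ waveform, sr = torchaudio.load("meeting_16k.wav")  # illustrative input file
+
+ # Downsample to 8 kHz telephone bandwidth
+ narrowband = F.resample(waveform, orig_freq=sr, new_freq=8000)
+
+ # 8-bit mu-law companding round-trip to mimic telephony quantization
+ companded = F.mu_law_encoding(narrowband, quantization_channels=256)
+ degraded = F.mu_law_decoding(companded, quantization_channels=256)
+
+ # Resample back to 16 kHz, the input rate Whisper expects
+ degraded_16k = F.resample(degraded, orig_freq=8000, new_freq=16000)
+ torchaudio.save("meeting_telephony_sim.wav", degraded_16k, 16000)
+ ```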

  ---

+ ## Future Research Directions
+
+ The following specialized models will be developed as independent checkpoints from this baseline:
+
+ ### Planned Gilbert Models
+
+ 1. **Gilbert-FR-Longform-v1**
+    - Optimized for long meetings (30-120 minutes)
+    - Multi-speaker interaction handling
+    - Discourse-level context stability
+
+ 2. **Gilbert-FR-Accents-v1**
+    - Robustness to regional and international French accents
+    - African, Canadian, Belgian accent optimization
+
+ 3. **Gilbert-FR-Telephone-v1**
+    - Optimized for 8 kHz VoIP/call-center speech
+    - Narrowband audio adaptation
+
+ 4. **Gilbert-Multilingual-v1**
+    - Extended cross-lingual performance
+    - Optimized French anchors with multilingual support
+
+ **All future Gilbert models are the exclusive intellectual property of Lexia France** and will include detailed evaluation reports adhering to research reproducibility standards.
+
+ ---
+
+ ## Intellectual Property and Licensing
+
+ ### License for This Baseline
+
+ This baseline model (`MEscriva/gilbert-fr-source`) is distributed under the **MIT License**, allowing:
+
+ - ✅ Commercial use
+ - ✅ Modification
+ - ✅ Distribution
+ - ✅ Private use
+ - ✅ Patent use
+
+ See the `LICENSE` file for full terms.
+
+ ### Intellectual Property Notice
+
+ **Important:** While this baseline model is available under MIT License:
+
+ - **All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.**
+ - Use of this baseline for Gilbert project development implies acceptance of these IP terms.
+ - Commercial use of Gilbert project derivatives requires separate licensing agreements.
+
+ For licensing inquiries regarding Gilbert project models, contact: **mathis@lexiapro.fr**

  ---
 
+ ## Citation
+
+ If you use this baseline model in your research, please cite:
+
+ ```bibtex
+ @software{gilbert_fr_source_2024,
+   title={Gilbert-FR-Source: Research Baseline for French Automatic Speech Recognition},
+   author={MEscriva and Lexia France},
+   year={2024},
+   url={https://huggingface.co/MEscriva/gilbert-fr-source},
+   version={0.1},
+   note={Research baseline for the Gilbert project}
+ }
+ ```
+
+ ---
+
+ ## Acknowledgments
+
+ This baseline model is based on:
+ - **OpenAI Whisper Large V3** (MIT License)
+ - **bofenghuang/whisper-large-v3-french** (French fine-tuning)
+
+ We acknowledge the contributions of the open-source community and the original Whisper research team.

  ---

+ ## Contact

  For research collaboration, evaluation access, or technical inquiries:

+ - **Website:** [https://gilbert-assistant.fr](https://gilbert-assistant.fr)
+ - **Email:** mathis@lexiapro.fr
+ - **Repository:** [https://huggingface.co/MEscriva/gilbert-fr-source](https://huggingface.co/MEscriva/gilbert-fr-source)
+
+ ---
+
+ ## Changelog
+
+ ### Version 0.1 (2024-12-19)
+ - Initial research baseline release
+ - Based on Whisper Large V3 with French optimization
+ - Established as frozen reference point for Gilbert project
+ - Documentation of baseline performance metrics
+
+ ---
+
+ **© 2024 Lexia France. All rights reserved for Gilbert project derivatives.**