	---
	license: mit
	base_model:
	- facebook/nllb-200-distilled-600M
	tags:
	- Ghomala
	- Français
	- Bandjoun
	- Cameroun
	- Cameroon
	datasets:
	- stfotso/french-ghomala-bandjoun
	language:
	- fr
	metrics:
	- bleu
	- chrf
	---


	# Model Overview

	Turaco-mt-fr-gh is a specialized neural machine translation model fine-tuned for high-quality translation from French to Ghomálá.

	This model is part of the Turaco family, an initiative focused on advancing translation capabilities for low-resource and underrepresented African languages. While large-scale multilingual models provide strong general foundations, they often lack depth and fluency when applied to specific low-resource languages. This project addresses that gap through targeted fine-tuning on curated parallel data.

	Built on top of NLLB-200, Turaco-mt-fr-gh leverages multilingual transfer learning to produce more accurate, fluent, and context-aware translations into Ghomálá.


	## Model Details

	* Developed by: fotiecodes
	* Model type: Sequence-to-Sequence Transformer (Multilingual NMT)
	* License: MIT
	* Base model: facebook/nllb-200-distilled-600M
	* Task: Machine Translation (French → Ghomálá)
	* Language(s): French (`fr`), Ghomálá (ISO 639-3 `bbj`; abbreviated `gh` in the model name)

	## Intended Use

	This model is designed for:

	* Translating French text into Ghomálá
	* Supporting localization for Cameroonian and regional applications
	* Experimentation with low-resource language translation
	* Research on multilingual transfer learning and adaptation
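
A minimal inference sketch for the first use case, assuming the checkpoint loads through the standard `transformers` seq2seq classes like its NLLB-200 base. The repository ID comes from the citation URL on this card, and `fra_Latn` is NLLB's code for French. The target-language token used during fine-tuning is not documented here, so `tgt_lang_token` is a hypothetical parameter: leave it as `None` to rely on the checkpoint's saved generation config, or pass the token the model was actually trained with. Running `translate` requires `transformers` and `torch`.

```python
MODEL_ID = "fotiecodes/Turaco-mt-fr-gh"  # repository ID from the citation URL
SRC_LANG = "fra_Latn"  # NLLB-200 language code for French


def forced_bos_kwargs(tokenizer, tgt_lang_token=None):
    """Build generate() kwargs that force the first decoded token.

    tgt_lang_token is an assumption: pass the target-language token the
    checkpoint was fine-tuned with, or None to use the saved generation config.
    """
    if tgt_lang_token is None:
        return {}
    token_id = tokenizer.convert_tokens_to_ids(tgt_lang_token)
    if token_id is None or token_id == tokenizer.unk_token_id:
        raise ValueError(f"{tgt_lang_token!r} is not in the tokenizer vocabulary")
    return {"forced_bos_token_id": token_id}


def translate(sentences, tgt_lang_token=None, max_new_tokens=128):
    """Translate a list of French sentences with the fine-tuned checkpoint."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, src_lang=SRC_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **forced_bos_kwargs(tokenizer, tgt_lang_token),
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `translate(["Bonjour, comment allez-vous ?"])` downloads the checkpoint on first use and returns a list with one output string per input sentence. Given the scores reported on this card, outputs should be reviewed by a Ghomálá speaker before use.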

	<!-- ### Example

	Input:

	```
	Jésus Christ est le même, hier, et aujourd’hui, et éternellement.
	```

	Output:

	```
	Yeso Kristo tə ŋkwyəpnyə pɑ. E bɑ pɑ yʉ yə e lə bɑ a, yə e bɑ cwəlɔ a, biŋ bɑ gɔ pɑ pɑ yʉ́ guŋ mcwə awɛ.
	```

	--- -->

	## Training Data

	The model was fine-tuned on a parallel dataset of French–Ghomálá sentence pairs.

	Key characteristics:

	* High-quality aligned sentence pairs
	* Focus on conversational and general-purpose language
	* Cleaned and normalized text to reduce noise
	* Balanced examples to improve consistency in output

	Given the low-resource nature of Ghomálá, dataset quality and consistency were prioritized over sheer size.
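
The cleaning steps listed above might look something like the sketch below. It is illustrative only, not the pipeline used for this dataset: the function names, the NFC normalization choice, and the `max_len_ratio` threshold are all assumptions. NFC normalization is included because Ghomálá orthography relies on diacritics that can be encoded either as precomposed characters or as combining marks.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs, max_len_ratio=3.0):
    """Drop empty, duplicate, and badly length-mismatched sentence pairs.

    max_len_ratio is a hypothetical threshold; tune it on real data.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize_text(src), normalize_text(tgt)
        if not src or not tgt:
            continue  # one side is empty after normalization
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # likely misaligned pair
        if (src, tgt) in seen:
            continue  # exact duplicate after normalization
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

A character-length ratio is a crude alignment check; it is used here only because it needs no language-specific resources.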

	## Training Procedure

	The model was fine-tuned using supervised learning on parallel translation data.

	Key aspects:

	* Initialized from NLLB-200
	* Standard sequence-to-sequence training with source-target pairs
	* Tokenization handled using the pretrained NLLB tokenizer
	* Optimization focused on adapting the model to:
	  * Ghomálá vocabulary and structure
	  * French → Ghomálá alignment
	  * Improved fluency and coherence

	The training process leverages NLLB’s multilingual representations, allowing the model to generalize better despite limited data.

	## Evaluation

	Evaluation was primarily qualitative, focusing on:

	* Fluency in Ghomálá
	* Semantic correctness of translations
	* Consistency in maintaining the target language

	Preliminary automatic scores:

	* French → Ghomálá: BLEU = 4.9, chrF2 = 19.1
	* Ghomálá → French: BLEU = 10.8, chrF2 = 30.5
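
For readers unfamiliar with chrF2, here is a simplified sentence-level sketch of the metric: character n-gram precision and recall are averaged over orders 1 to 6 and combined into an F-score with beta = 2, which weights recall more heavily than precision. This is an illustration of the idea, not the tool used to produce the scores above; the function names are hypothetical, and corpus-level implementations such as sacreBLEU will not give identical numbers.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it works on characters rather than words, chrF gives partial credit for near-miss word forms, which is one reason it is often reported alongside BLEU for low-resource languages.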

	### Note

	The model scores higher when translating from Ghomálá to French than in the reverse direction. Both BLEU and chrF2 remain low, however, especially for French → Ghomálá, so translation quality is still limited. These results suggest the model is better at understanding Ghomálá than generating it; more training data or further fine-tuning would be needed for production-level performance.

	## Limitations

	* Performance depends heavily on dataset size and diversity
	* May struggle with:
	  * Technical or domain-specific vocabulary
	  * Rare linguistic constructions
	* Not optimized for reverse translation (Ghomálá → French)
	* As with most neural MT systems, outputs may occasionally:
	  * Be inconsistent
	  * Contain minor hallucinations or approximations

	## Future Work

	* Expand the French–Ghomálá dataset with more diverse domains
	* Explore parameter-efficient fine-tuning (LoRA, adapters)
	* Benchmark against other multilingual MT systems
	* Incorporate human evaluation from native speakers

	## Ethical Considerations

	This model contributes to improving representation of under-resourced African languages in AI.

	Care should be taken to:

	* Respect linguistic and cultural nuances of Ghomálá
	* Validate outputs in sensitive or critical contexts
	* Involve native speakers in evaluation and feedback loops
	* Avoid over-reliance in high-stakes applications without verification

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{turaco_mt_fr_gh,
	  author = {fotiecodes},
	  title = {Turaco-mt-fr-gh},
	  year = {2026},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/fotiecodes/Turaco-mt-fr-gh}
	}
	```