---
tags:
- translation
- low-resource-language
- marian-mt
- fulfulde
- fula
datasets:
- custom-en-ff-parallel
license: cc-by-4.0
---

# MarianMT-en-to-ff (English to Fula)

## 📝 Overview

**MarianMT-en-to-ff** is a fine-tuned machine translation model specializing in translating text from **English to Fula** (also known as Fulfulde or Pulaar). It is based on the [MarianMT framework by Helsinki-NLP](https://huggingface.co/Helsinki-NLP) and was trained on a carefully curated but small parallel corpus, with the aim of serving the low-resource language community.

The model provides a baseline for machine translation in a language pair where high-quality resources are scarce.
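The exact corpus format is not distributed with this card. As an illustration only, parallel data for a pair like en–ff is commonly stored as tab-separated sentence pairs; a minimal loader sketch (the file layout is an assumption, and the sample pair reuses the example translation shown later in this card) might look like:

```python
import csv
import io

# Hypothetical tab-separated corpus: one "english<TAB>fula" pair per line.
sample_tsv = (
    "The community needs clean water for health and agriculture.\t"
    "Yimɓe ɓee ɗaɓɓi ndiyam laaɓɗam ngam cellal e ndema.\n"
)

def load_parallel_pairs(handle):
    """Read (english, fula) sentence pairs from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) == 2]

pairs = load_parallel_pairs(io.StringIO(sample_tsv))
print(pairs[0][0])  # → "The community needs clean water for health and agriculture."
```

In practice the same loader would be pointed at the full corpus file before tokenization and fine-tuning.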

## 🧠 Model Architecture

* **Base Model:** Initialized from a related language pair (e.g., `opus-mt-en-fr`) and fine-tuned.
* **Architecture:** Sequence-to-sequence Transformer (encoder-decoder).
* **Tokenizer:** A custom SentencePiece tokenizer trained on the combined English and Fula corpus.
* **Parameters:** Standard MarianMT configuration with 6 encoder and 6 decoder layers.
* **Translation Direction:** English → Fula (en → ff).
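The layer counts above can be expressed as a concrete configuration. A sketch using `transformers.MarianConfig` (only the 6+6 layer counts come from this card; the vocabulary size is a placeholder):

```python
from transformers import MarianConfig

# Standard MarianMT-style setup: 6 encoder and 6 decoder layers.
# vocab_size is a placeholder; the real value is determined by the
# custom SentencePiece tokenizer shipped with the checkpoint.
config = MarianConfig(
    vocab_size=32000,
    encoder_layers=6,
    decoder_layers=6,
)
print(config.encoder_layers, config.decoder_layers)  # → 6 6
```

Loading the published checkpoint with `MarianMTModel.from_pretrained` supplies the actual configuration, so this sketch is only useful for inspecting or reproducing the architecture.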

## 🚀 Intended Use

* **Digital Inclusion:** Facilitating access to English-language content for Fula speakers.
* **Academic Research:** A foundational model for further research in low-resource NMT.
* **Basic Communication:** Providing draft translations for non-critical text.

## ⚠️ Limitations

* **Low-Resource Quality:** Due to the limited size of the parallel corpus, translation quality may be inconsistent, especially for domain-specific, complex, or highly idiomatic English phrases.
* **Dialect Variation:** Fula has several regional dialects. The training data primarily reflects a West African dialect, and translation quality may degrade for texts in other dialects.
* **Domain Specificity:** The model is trained on general and news-domain text. Technical or highly specific vocabulary may not be handled correctly.

## 💻 Example Code

```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "Your-HF-Username/MarianMT-en-to-ff"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Sample English text
english_text = [
    "The community needs clean water for health and agriculture.",
    "We are going to visit the capital city next week.",
]

# Tokenize and generate translation
encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated_tokens = model.generate(**encoded_input)

# Decode and print
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

print("--- English to Fula Translation ---")
for en, ff in zip(english_text, translated_text):
    print(f"EN: {en}")
    print(f"FF: {ff}\n")

# Note: Fula translations will vary based on training data.
# Expected FF example: "Yimɓe ɓee ɗaɓɓi ndiyam laaɓɗam ngam cellal e ndema."
```