---
library_name: transformers
license: cc-by-4.0
datasets:
- nwu-ctext/autshumato
- Helsinki-NLP/opus-100
- WMT22
- gsarti/flores_101
language:
- en
- zu
- xh
metrics:
- bleu
- f1
- G-Eva
base_model:
- mistralai/Mistral-7B-v0.1
---
# Model Card for dsfsi/OMT-LR-Mistral7b
<!-- Provide a quick summary of what the model is/does. -->
The model is the result of fine-tuning Mistral-7B-v0.1 on a downstream machine-translation task in a low-resource setting. It translates English sentences into Zulu and Xhosa.
## Model Details
- **Authors**: Pitso Walter Khoboko, Vukosi Marivate, Joseph Sefara
- **Affiliation**: University of Pretoria, Data Science for Social Impact
- **License**: CC-BY-4.0
- **Base model**: [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Languages**: English → Zulu, English → Xhosa
- **Model type**: Causal LLM with prompt-based translation fine-tuning
### Model Description
<!-- Provide a longer summary of what this model is. -->
dsfsi/OMT-LR-Mistral7b was fine-tuned for 31 GPU days from the base model mistralai/Mistral-7B-v0.1. The fine-tuning used a custom prompt format and aimed to improve large-language-model translation for low-resource, morphologically rich African languages.
- **Developed by:** Pitso Walter Khoboko, Vukosi Marivate and Joseph Sefara
- **Funded by [optional]:** University of Pretoria and Data Science for Social Impact
- **Shared by [optional]:** Pitso Walter Khoboko
- **Model type:** Causal language model fine-tuned for translation via custom prompts
- **Language(s) (NLP):** English to Zulu and Xhosa
- **License:** cc-by-4.0
- **Finetuned from model [optional]:** [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/PKhoboko/MSc-Thesis
- **Paper [optional]:** https://www.sciencedirect.com/science/article/pii/S2666827025000325
- **Demo [optional]:** [More Information Needed]
## Uses
The model can be used to translate English to Zulu and Xhosa. With further improvement, it could translate domain-specific information from English into Zulu and Xhosa: for example, agricultural research written in English could be made accessible to small-scale farmers who speak Zulu or Xhosa. It could also be used in education to teach core subjects in native South African languages, which can improve pupils' performance in those subjects.
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
You can download dsfsi/OMT-LR-Mistral7b and prompt it to translate English sentences into Zulu or Xhosa (see "How to Get Started with the Model" below).
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- Translating full documents or complex legal/medical content.
- Any politically sensitive, sexually biased, or harmful content generation.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- Training data includes English intrusions in target languages (Zulu/Xhosa).
- May hallucinate or degrade performance on domain-specific or long-form content.
- Not tested extensively for dialectal variations or colloquial expressions.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
- For critical use (e.g., government or education), fine-tune further with clean, domain-specific parallel corpora.
- Avoid deploying this model in zero-review production pipelines.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "dsfsi/OMT-LR-Mistral7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model continues the prompt with the translation
translator = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = translator("Translate to Zulu: The cow is eating grass.", max_new_tokens=64)
print(result[0]["generated_text"])
```
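If GPU memory is tight, the model can also be loaded with 4-bit quantization. This is a minimal sketch, not part of the original setup, and assumes `bitsandbytes` is installed:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "dsfsi/OMT-LR-Mistral7b"

# NF4 4-bit quantization keeps the 7B weights within a single ~24 GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```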
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [nwu-ctext/autshumato](https://huggingface.co/datasets/nwu-ctext/autshumato)
- [Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
- [WMT22](https://huggingface.co/datasets/wmt22)
- [gsarti/flores_101](https://huggingface.co/datasets/gsarti/flores_101)
**Note**: The above datasets were collected individually and combined into a single multilingual dataset of English–Zulu and English–Xhosa sentence pairs; a sketch of one way to assemble such a corpus follows.
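For illustration only, here is a minimal sketch of assembling a prompt-formatted multilingual corpus with the `datasets` library. The OPUS-100 config names (`en-zu`, `en-xh`) and the prompt template are assumptions, not the authors' exact pipeline:
```python
from datasets import load_dataset, concatenate_datasets

# Config names are assumptions; check each dataset card for the exact ones
en_zu = load_dataset("Helsinki-NLP/opus-100", "en-zu", split="train")
en_xh = load_dataset("Helsinki-NLP/opus-100", "en-xh", split="train")

def to_prompt(example, lang_code, lang_name):
    # Flatten each pair into one prompt/completion string for causal-LM training
    src = example["translation"]["en"]
    tgt = example["translation"][lang_code]
    return {"text": f"Translate to {lang_name}: {src}\n{tgt}"}

# Drop the original columns so both datasets share the same schema
en_zu = en_zu.map(lambda ex: to_prompt(ex, "zu", "Zulu"), remove_columns=en_zu.column_names)
en_xh = en_xh.map(lambda ex: to_prompt(ex, "xh", "Xhosa"), remove_columns=en_xh.column_names)

combined = concatenate_datasets([en_zu, en_xh]).shuffle(seed=42)
```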
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing [optional]
See the repository linked above for the dataset cleaning and preparation code.
#### Training Hyperparameters
- **Training regime:**
```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration applied to all attention and MLP projections
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
)

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder; point this at your own checkpoint directory
    optim="paged_adamw_8bit",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    log_level="debug",
    save_steps=400,
    logging_steps=10,
    learning_rate=4e-4,
    num_train_epochs=2,
    warmup_steps=100,
    lr_scheduler_type="linear",
)
```
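The repository above contains the full training script. As a rough sketch only, the two configs might be wired together with `peft` and the standard `Trainer` as follows; the `max_length` value is an assumption, and `combined` refers to the corpus sketch under "Training Data":
```python
from peft import get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = get_peft_model(base, peft_config)  # attach the LoRA adapters defined above
model.print_trainable_parameters()         # only adapter weights are trainable

# `combined` is the prompt-formatted corpus from the sketch under "Training Data"
tokenized_train = combined.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```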
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
- [nwu-ctext/autshumato](https://huggingface.co/datasets/nwu-ctext/autshumato)
- [Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
- [WMT22](https://huggingface.co/datasets/wmt22)
- [gsarti/flores_101](https://huggingface.co/datasets/gsarti/flores_101)
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
- bleu: checks whether the model translates Zulu and Xhosa words properly when compared to the ground truth (see the example below).
- f1: evaluates larger linguistic units such as grammatical chunks and syntactic frames, making it more suitable for languages with complex syntactic structures.
- G-Eva: uses embeddings to capture the contextual and semantic similarity between hypothesis and reference translations.
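As an illustration (the sentence pair here is invented, not drawn from the test set), corpus-level BLEU can be computed with the `evaluate` library:
```python
import evaluate

bleu = evaluate.load("sacrebleu")  # corpus-level BLEU, as reported below

predictions = ["Inkomo idla utshani."]   # model output (illustrative)
references = [["Inkomo idla utshani."]]  # one list of references per prediction

print(bleu.compute(predictions=predictions, references=references)["score"])
```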
### Results
| Language Pair | BLEU | F1 | G-Eva |
|--------------------|------|----|--------|
| English → Zulu | 20 | 42 | 92% |
| English → Xhosa | 14 | 38 | 91% |
#### Summary
**OMT-LR-Mistral7b** fine-tunes **Mistral-7B-v0.1** using custom prompt engineering for **low-resource African languages**, specifically **English to Zulu and Xhosa** translation. It was trained for 31 GPU days using a multilingual dataset to improve translation accuracy for morphologically rich languages.
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware**: 2 A100 48GB GPUs
- **Cloud Provider**: Google Cloud Platform
- **Compute Region**: europe-west4-a
- **Training Time**: ~31 GPU days
- **Carbon Emissions**: [More Information Needed]
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```bibtex
@article{khoboko2025optimizing,
  title={Optimizing translation for low-resource languages: Efficient fine-tuning with custom prompt engineering in large language models},
  author={Khoboko, Pitso Walter and Marivate, Vukosi and Sefara, Joseph},
  journal={Machine Learning with Applications},
  volume={20},
  pages={100649},
  year={2025},
  publisher={Elsevier}
}
```
**APA:**
Khoboko, P. W., Marivate, V., & Sefara, J. (2025). Optimizing translation for low-resource languages: Efficient fine-tuning with custom prompt engineering in large language models. Machine Learning with Applications, 20, 100649.
## Model Card Authors [optional]
- Pitso Walter Khoboko
## Model Card Contact
- Pitso Walter Khoboko: u21824772@tuks.co.za
- Vukosi Marivate: vukosi.marivate@cs.up.ac.za
- Joseph Sefara: tsefara@csir.co.za |