---
library_name: transformers
license: cc-by-4.0
datasets:
- nwu-ctext/autshumato
- Helsinki-NLP/opus-100
- WMT22
- gsarti/flores_101
language:
- en
- zu
- xh
metrics:
- bleu
- f1
- G-Eva
base_model:
- mistralai/Mistral-7B-v0.1
---
# Model Card for dsfsi/OMT-LR-Mistral7b
<!-- Provide a quick summary of what the model is/does. -->
The model is the result of fine-tuning Mistral-7B-v0.1 on a downstream machine-translation task in a low-resource setting. It translates English sentences into Zulu and Xhosa.
## Model Details
- **Authors**: Pitso Walter Khoboko, Vukosi Marivate, Joseph Sefara
- **Affiliation**: University of Pretoria, Data Science for Social Impact
- **License**: CC-BY-4.0
- **Base model**: [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Languages**: English → Zulu, English → Xhosa
- **Model type**: Causal LLM with prompt-based translation fine-tuning
### Model Description
<!-- Provide a longer summary of what this model is. -->
dsfsi/OMT-LR-Mistral7b was fine-tuned for 31 GPU days from the base model mistralai/Mistral-7B-v0.1. The fine-tuning used a custom prompt format and aimed to improve large-language-model translation for low-resource, morphologically rich African languages.
- **Developed by:** Pitso Walter Khoboko, Vukosi Marivate and Joseph Sefara
- **Funded by [optional]:** University of Pretoria and Data Science for Social Impact
- **Shared by [optional]:** Pitso Walter Khoboko
- **Model type:** Causal language model fine-tuned for translation via custom prompts
- **Language(s) (NLP):** English to Zulu and Xhosa
- **License:** cc-by-4.0
- **Finetuned from model [optional]:** [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/PKhoboko/MSc-Thesis
- **Paper [optional]:** https://www.sciencedirect.com/science/article/pii/S2666827025000325
- **Demo [optional]:** [More Information Needed]
## Uses
The model can be used to translate English to Zulu and Xhosa. With further improvement, it could translate domain-specific information from English into Zulu and Xhosa: for example, agricultural research written in English could be made accessible to small-scale farmers who speak Zulu or Xhosa. It could also be used in education to teach core subjects in native South African languages, which can improve pupils' performance in those subjects.
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
You can download dsfsi/OMT-LR-Mistral7b and prompt it to translate English sentences into Zulu or Xhosa (see "How to Get Started with the Model" below).
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- Translating full documents or complex legal/medical content.
- Any politically sensitive, sexually biased, or harmful content generation.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- Training data includes English intrusions in target languages (Zulu/Xhosa).
- May hallucinate or degrade performance on domain-specific or long-form content.
- Not tested extensively for dialectal variations or colloquial expressions.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
- For critical use (e.g., government or education), fine-tune further with clean, domain-specific parallel corpora.
- Avoid deploying this model in zero-review production pipelines.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "dsfsi/OMT-LR-Mistral7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model continues the prompt with the translation
translator = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = translator("Translate to Zulu: The cow is eating grass.", max_new_tokens=64)
print(result[0]["generated_text"])
```
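If GPU memory is tight, the model can also be loaded with 4-bit quantization. This is a minimal sketch, not part of the original setup, and assumes `bitsandbytes` is installed:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "dsfsi/OMT-LR-Mistral7b"

# NF4 4-bit quantization keeps the 7B weights within a single ~24 GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```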
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [nwu-ctext/autshumato](https://huggingface.co/datasets/nwu-ctext/autshumato)
- [Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
- [WMT22](https://huggingface.co/datasets/wmt22)
- [gsarti/flores_101](https://huggingface.co/datasets/gsarti/flores_101)
**Note**: The above datasets were collected individually and combined into a single multilingual dataset of English–Zulu and English–Xhosa sentence pairs; a sketch of one way to assemble such a corpus follows.
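For illustration only, here is a minimal sketch of assembling a prompt-formatted multilingual corpus with the `datasets` library. The OPUS-100 config names (`en-zu`, `en-xh`) and the prompt template are assumptions, not the authors' exact pipeline:
```python
from datasets import load_dataset, concatenate_datasets

# Config names are assumptions; check each dataset card for the exact ones
en_zu = load_dataset("Helsinki-NLP/opus-100", "en-zu", split="train")
en_xh = load_dataset("Helsinki-NLP/opus-100", "en-xh", split="train")

def to_prompt(example, lang_code, lang_name):
    # Flatten each pair into one prompt/completion string for causal-LM training
    src = example["translation"]["en"]
    tgt = example["translation"][lang_code]
    return {"text": f"Translate to {lang_name}: {src}\n{tgt}"}

# Drop the original columns so both datasets share the same schema
en_zu = en_zu.map(lambda ex: to_prompt(ex, "zu", "Zulu"), remove_columns=en_zu.column_names)
en_xh = en_xh.map(lambda ex: to_prompt(ex, "xh", "Xhosa"), remove_columns=en_xh.column_names)

combined = concatenate_datasets([en_zu, en_xh]).shuffle(seed=42)
```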
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing [optional]
See the repository linked above for the dataset cleaning and preparation code.
#### Training Hyperparameters
- **Training regime:**
```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration applied to all attention and MLP projections
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
)

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder; point this at your own checkpoint directory
    optim="paged_adamw_8bit",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    log_level="debug",
    save_steps=400,
    logging_steps=10,
    learning_rate=4e-4,
    num_train_epochs=2,
    warmup_steps=100,
    lr_scheduler_type="linear",
)
```
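The repository above contains the full training script. As a rough sketch only, the two configs might be wired together with `peft` and the standard `Trainer` as follows; the `max_length` value is an assumption, and `combined` refers to the corpus sketch under "Training Data":
```python
from peft import get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = get_peft_model(base, peft_config)  # attach the LoRA adapters defined above
model.print_trainable_parameters()         # only adapter weights are trainable

# `combined` is the prompt-formatted corpus from the sketch under "Training Data"
tokenized_train = combined.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```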
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
- [nwu-ctext/autshumato](https://huggingface.co/datasets/nwu-ctext/autshumato)
- [Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
- [WMT22](https://huggingface.co/datasets/wmt22)
- [gsarti/flores_101](https://huggingface.co/datasets/gsarti/flores_101)
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
- bleu: checks whether the model translates Zulu and Xhosa words properly when compared to the ground truth (see the example below).
- f1: evaluates larger linguistic units such as grammatical chunks and syntactic frames, making it more suitable for languages with complex syntactic structures.
- G-Eva: uses embeddings to capture the contextual and semantic similarity between hypothesis and reference translations.
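As an illustration (the sentence pair here is invented, not drawn from the test set), corpus-level BLEU can be computed with the `evaluate` library:
```python
import evaluate

bleu = evaluate.load("sacrebleu")  # corpus-level BLEU, as reported below

predictions = ["Inkomo idla utshani."]   # model output (illustrative)
references = [["Inkomo idla utshani."]]  # one list of references per prediction

print(bleu.compute(predictions=predictions, references=references)["score"])
```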
### Results
| Language Pair | BLEU | F1 | G-Eva |
|--------------------|------|----|--------|
| English → Zulu | 20 | 42 | 92% |
| English → Xhosa | 14 | 38 | 91% |
#### Summary
**OMT-LR-Mistral7b** fine-tunes **Mistral-7B-v0.1** using custom prompt engineering for **low-resource African languages**, specifically **English to Zulu and Xhosa** translation. It was trained for 31 GPU days using a multilingual dataset to improve translation accuracy for morphologically rich languages.
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware**: 2 A100 48GB GPUs
- **Cloud Provider**: Google Cloud Platform
- **Compute Region**: europe-west4-a
- **Training Time**: ~31 GPU days
- **Carbon Emissions**: [More Information Needed]
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```bibtex
@article{khoboko2025optimizing,
  title={Optimizing translation for low-resource languages: Efficient fine-tuning with custom prompt engineering in large language models},
  author={Khoboko, Pitso Walter and Marivate, Vukosi and Sefara, Joseph},
  journal={Machine Learning with Applications},
  volume={20},
  pages={100649},
  year={2025},
  publisher={Elsevier}
}
```
**APA:**
Khoboko, P. W., Marivate, V., & Sefara, J. (2025). Optimizing translation for low-resource languages: Efficient fine-tuning with custom prompt engineering in large language models. Machine Learning with Applications, 20, 100649.
## Model Card Authors [optional]
- Pitso Walter Khoboko
## Model Card Contact
- Pitso Walter Khoboko: u21824772@tuks.co.za
- Vukosi Marivate: vukosi.marivate@cs.up.ac.za
- Joseph Sefara: tsefara@csir.co.za |