|
|
--- |
|
|
license: gemma |
|
|
language: |
|
|
- sl |
|
|
- en |
|
|
- hr |
|
|
- sr |
|
|
- bs |
|
|
base_model: |
|
|
- cjvt/GaMS-9B-Instruct
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# Model Card for GaMS-DPO-Translator |
|
|
|
|
|
GaMS-DPO-Translator is a fine-tuned version of GaMS-9B-Instruct, obtained by applying Direct Preference Optimization (DPO) to the original model. The training dataset was synthetically generated using GaMS-9B-Instruct and EuroLLM-9B-Instruct.
|
|
|
|
|
 |
|
|
|
|
|
## Basic information |
|
|
|
|
|
- **Developed by:** a team of researchers at the University of Ljubljana, Faculty of Computer and Information Science. Team members: Dario Vajda, Domen Vreš and Marko Robnik-Šikonja.
|
|
- **Languages:** Slovene, English (primary), Croatian, Bosnian and Serbian (secondary). The model might also work for other languages supported by Gemma 2, even though it was not continually pretrained on them. |
|
|
- **Base model:** [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct)
|
|
- **License:** [Gemma](https://ai.google.dev/gemma/terms) |
|
|
|
|
|
## Usage |
|
|
|
|
|
The model can be run through the `pipeline` API using the following code:
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
model_id = "DarioVajda/GaMS-DPO-Translator" |
|
|
|
|
|
pline = pipeline( |
|
|
"text-generation", |
|
|
model=model_id, |
|
|
device_map="cuda" # replace with "mps" to run on a Mac device |
|
|
) |
|
|
|
|
|
# Example of response generation |
|
|
message = [{"role": "user", "content": "Prevedi naslednje angleško besedilo v slovenščino.\nToday is a nice day."}] |
|
|
response = pline(message, max_new_tokens=512) |
|
|
print("Translation:", response[0]["generated_text"][-1]["content"]) |
|
|
``` |
|
|
|
|
|
For multi-GPU inference, set `device_map` to `auto`:
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
model_id = "DarioVajda/GaMS-DPO-Translator" |
|
|
|
|
|
pline = pipeline( |
|
|
"text-generation", |
|
|
model=model_id, |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
# Example of response generation |
|
|
message = [{"role": "user", "content": "Prevedi naslednje angleško besedilo v slovenščino.\nToday is a nice day."}] |
|
|
response = pline(message, max_new_tokens=512) |
|
|
print("Model's response:", response[0]["generated_text"][-1]["content"]) |
|
|
|
|
|
# Example of conversation chain |
|
|
new_message = response[0]["generated_text"] |
|
|
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"}) |
|
|
response = pline(new_message, max_new_tokens=1024) |
|
|
print("Model's response:", response[0]["generated_text"][-1]["content"]) |
|
|
``` |
|
|
|
|
|
## Data |
|
|
|
|
|
The fine-tuning data was obtained by translating a large corpus of Wikipedia articles with two models, GaMS-9B-Instruct and EuroLLM-9B-Instruct. The resulting candidate translations were then ranked by automatic metrics for translation quality and reliability to form preference pairs.
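
The exact ranking pipeline is not specified in this card. Below is a minimal sketch of how such preference pairs could be assembled, assuming the reference-free COMET quality-estimation model `Unbabel/wmt20-comet-qe-da` as the ranking metric (the metric choice, record fields, and prompt format are illustrative assumptions, not the authors' exact setup):

```python
# Illustrative sketch of building DPO preference pairs from two candidate
# translations per source text. The ranking metric and record fields are
# assumptions; the card does not specify the authors' exact pipeline.
from comet import download_model, load_from_checkpoint

comet_model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-qe-da"))

def build_dpo_record(src_text: str, candidate_a: str, candidate_b: str) -> dict:
    """Rank two candidate translations of src_text and return a DPO record."""
    scores = comet_model.predict(
        [{"src": src_text, "mt": candidate_a},
         {"src": src_text, "mt": candidate_b}],
        batch_size=2,
        gpus=1,
    ).scores
    chosen, rejected = (
        (candidate_a, candidate_b) if scores[0] >= scores[1] else (candidate_b, candidate_a)
    )
    return {
        "prompt": f"Prevedi naslednje angleško besedilo v slovenščino.\n{src_text}",
        "chosen": chosen,
        "rejected": rejected,
    }
```

Each record can then feed directly into a DPO trainer, which treats `chosen` and `rejected` as the preferred and dispreferred completions of `prompt`.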
|
|
|
|
|
## Training |
|
|
|
|
|
The model was trained on the [Vega HPC](https://izum.si/vega_slv/) system.
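
Further training details are not given here. As a rough illustration, DPO fine-tuning of this kind is commonly run with TRL's `DPOTrainer`; the sketch below assumes a recent TRL version and a JSONL preference dataset with `prompt`/`chosen`/`rejected` fields, with placeholder hyperparameters rather than the values actually used for this model:

```python
# A minimal DPO fine-tuning sketch with Hugging Face TRL (assumed tooling;
# the card does not state the actual training setup or hyperparameters).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "cjvt/GaMS-9B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Preference records of the form {"prompt": ..., "chosen": ..., "rejected": ...}
train_dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="gams-dpo-translator",
    beta=0.1,                        # placeholder KL-penalty strength
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,              # placeholder
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # named `tokenizer=` in older TRL versions
)
trainer.train()
```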
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The model was evaluated on SloBench, and we extended the evaluation to measure additional qualities of the model that we care about.
|
|
|
|
|
### SloBench evaluation:
|
|
|
|
|
|
|
|
| Model | BERT score | BLEU (avg) | METEOR (avg) | CHRF (avg) | BLEU (corpus) | CHRF (corpus) | |
|
|
|--------------------------------|-----------:|-----------:|-------------:|-----------:|--------------:|--------------:| |
|
|
| EuroLLM-9B-Instruct | 0.8741 | 0.2927 | 0.5792 | 0.6055 | 0.3273 | 0.6055 | |
|
|
| GaMS-27B-Instruct | 0.8734 | 0.2866 | 0.5688 | 0.5986 | 0.3246 | 0.5986 | |
|
|
| **GaMS-9B-DPO-Translator** | **0.8726** | **0.2810** | **0.5663** | **0.5967** | **0.3252** | **0.5967** | |
|
|
| GaMS-9B-Instruct | 0.8713 | 0.2773 | 0.5616 | 0.5928 | 0.3209 | 0.5928 | |
|
|
| GPT 4o-mini | 0.8690 | 0.2619 | 0.5456 | 0.5839 | 0.3021 | 0.5839 | |
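
For context on the column pairs: the (avg) scores are presumably sentence-level scores averaged over the test set, while the (corpus) scores are computed once over all segments. The sketch below illustrates that distinction with `sacrebleu` on placeholder data (whether SloBench uses exactly this configuration is an assumption; also note that sacrebleu reports scores on a 0–100 scale, while the table above uses 0–1):

```python
# Illustrative distinction between corpus-level and averaged sentence-level
# BLEU/CHRF using sacrebleu. Hypotheses/references are placeholder data.
import sacrebleu

hypotheses = ["Danes je lep dan.", "Model prevaja besedila."]
references = ["Danes je lep dan.", "Model prevaja besedilo."]

# Corpus-level: a single score computed over the whole test set.
corpus_bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
corpus_chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score

# Averaged sentence-level: score each segment, then take the mean.
avg_bleu = sum(
    sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hypotheses, references)
) / len(hypotheses)

print(f"corpus BLEU={corpus_bleu:.2f}  corpus CHRF={corpus_chrf:.2f}  avg BLEU={avg_bleu:.2f}")
```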
|
|
|
|
|
### Wikipedia evaluation: |
|
|
This evaluation was performed on data that was not seen during training. We measured how often each model makes a fatal error (producing output in the wrong language or truncating the translation) and then compared COMET scores.
|
|
|
|
|
Error rates: |
|
|
|
|
|
| Model | Language Error | Truncation Error | Combined | |
|
|
|-----------------|---------------:|-----------------:|---------:| |
|
|
| EuroLLM | 1% | 0.4% | 1.4% | |
|
|
| GaMS | 9.5% | 3.5% | 13% | |
|
|
| **GaMS-DPO** | **0.6%** | **0.2%** | **0.8%** | |
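
The card does not describe how these errors were detected. Below is a hypothetical sketch of such checks, using the `langdetect` package for the language check and a simple length heuristic for truncation (both are assumptions, not the authors' tooling):

```python
# Hypothetical error checks (the exact detection procedure is not specified).
# Language error: the output is not detected as Slovene.
# Truncation error: the output is far shorter than the source suggests.
from langdetect import detect

def language_error(translation: str) -> bool:
    """Flag translations whose detected language is not Slovene."""
    try:
        return detect(translation) != "sl"
    except Exception:  # langdetect raises on empty or undecidable input
        return True

def truncation_error(source: str, translation: str, min_ratio: float = 0.5) -> bool:
    """Flag translations much shorter than the source (heuristic threshold)."""
    return len(translation) < min_ratio * len(source)
```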
|
|
|
|
|
COMET scoring results: |
|
|
|
|
|
| Model | Average COMET score | |
|
|
|-------------------------------|--------------------:| |
|
|
| EuroLLM-9B-Instruct | 0.755 | |
|
|
| GaMS-9B-Instruct | 0.736 | |
|
|
| **GaMS-9B-DPO-Translator**    | **0.771**           |
|
|
|
|
|
|
|
|
|
|
|
|