---
language:
  - en
license: mit
tags:
  - translation
  - yanomami
  - indigenous-languages
  - language-preservation
  - offline-translation
datasets:
  - custom
metrics:
  - perplexity
  - loss
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: yanomami-english-translation
    results:
      - task:
          type: translation
          name: Yanomami-English Translation
        metrics:
          - type: perplexity
            value: 2.87
          - type: loss
            value: 1.0554
---

# Yanomami-English Translation Model

This model is a fine-tuned GPT-2 Small (124M parameters) for bidirectional translation between Yanomami and English. It was developed to provide offline translation capabilities for Yanomami, an indigenous language spoken in northern Brazil and southern Venezuela.

GPT-2 Small turned out to be ill-suited for this task. I also tried NLLB, but it missed the conversational style, so I am now training a Llama 3.1 8B (int8) model: https://github.com/renantrendt/yanomami_llama

In the meantime, until the Llama 3.1 fine-tuning is finished, we have deployed a ChatGPT-like app that runs RAG over the Yanomami dictionary: https://yanomami.bernardoserrano.com/


## Model Description

- **Model Type:** GPT-2 Small (124M parameters)
- **Language(s):** Yanomami ↔ English
- **License:** MIT
- **Developed by:** Renan Serrano

## Training Data

The model was trained on 29,011 examples drawn from six JSONL files (a loading sketch follows the list):
- translations.jsonl (17,009 examples)
- yanomami-to-english.jsonl (1,822 examples)
- phrases.jsonl (2,322 examples)
- grammar.jsonl (200 examples)
- comparison.jsonl (2,072 examples)
- how-to.jsonl (5,586 examples)
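
To inspect the data locally, here is a minimal sketch, assuming each file is standard JSONL with one example per line (the record fields are not documented here):

```python
# Count examples per training file (file names from the list above).
files = [
    "translations.jsonl",
    "yanomami-to-english.jsonl",
    "phrases.jsonl",
    "grammar.jsonl",
    "comparison.jsonl",
    "how-to.jsonl",
]

for path in files:
    with open(path, encoding="utf-8") as f:
        count = sum(1 for line in f if line.strip())
    print(f"{path}: {count} examples")
```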

## Training Metrics
- Final training loss: 1.0554 (Epoch 3)
- Final validation loss: 1.0557
- Overall average training loss: 1.2102
- Perplexity: 2.87
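
The reported perplexity is the exponential of the cross-entropy loss (perplexity = exp(loss)), which makes the numbers easy to sanity-check:

```python
import math

final_loss = 1.0554                    # final training loss (Epoch 3)
print(round(math.exp(final_loss), 2))  # 2.87 -- matches the reported perplexity
```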

## Usage

### Direct Translation

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("renanserrano/yanomami-finetuning")
model = GPT2LMHeadModel.from_pretrained("renanserrano/yanomami-finetuning")

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else 
                     "mps" if torch.backends.mps.is_available() else 
                     "cpu")
model.to(device)

# Function for translation
def translate(text, direction="english_to_yanomami"):
    # Add appropriate prefix based on translation direction
    if direction == "english_to_yanomami":
        prompt = f"English: {text} => Yanomami:"
    else:
        prompt = f"Yanomami: {text} => English:"
    
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate translation
    outputs = model.generate(
        **inputs,
        max_length=100,
        num_return_sequences=1,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        num_beams=4,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Decode translation
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract the actual translation part (after the prompt)
    if "=>" in translation:
        # Split only once so any "=>" inside the generated text is preserved
        translation = translation.split("=>", 1)[1].strip()
    
    return translation

# Examples
# English to Yanomami
print(translate("What does 'aheprariyo' mean in Yanomami?", "english_to_yanomami"))

# Yanomami to English
print(translate("ahetoimi", "yanomami_to_english"))
```

### Using with RAG (Retrieval-Augmented Generation)

For more advanced use cases, this model can be integrated with a RAG system to provide context-enhanced translations and comprehensive linguistic information.
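
A minimal sketch of the idea, assuming a toy in-memory glossary as the retrieval source; the entries and helper names here are hypothetical, and a real system would retrieve from the Yanomami dictionary dataset:

```python
# Toy retriever: exact-match lookup in a tiny in-memory glossary.
# GLOSSARY contents are placeholders, not real dictionary data.
GLOSSARY = {
    "ahetoimi": "ahetoimi: (placeholder gloss from the dictionary)",
}

def retrieve(text):
    """Return glossary entries for any known word in the input."""
    return [GLOSSARY[w] for w in text.lower().split() if w in GLOSSARY]

def build_rag_prompt(text, direction="yanomami_to_english"):
    """Prepend retrieved dictionary context to the translation prompt."""
    if direction == "yanomami_to_english":
        prompt = f"Yanomami: {text} => English:"
    else:
        prompt = f"English: {text} => Yanomami:"
    context = "\n".join(retrieve(text))
    return f"{context}\n{prompt}" if context else prompt

# The augmented prompt is then passed to the same generation code as above.
print(build_rag_prompt("ahetoimi"))
```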

## Limitations

- The model shows promising results for translating Yanomami words to English definitions but has limitations with more complex translations and conversational phrases.
- Performance varies based on the complexity of the input and its similarity to the training data.
- The model may not capture all cultural nuances and context-specific meanings.

## Ethical Considerations

This model is intended to support language preservation and cross-cultural communication. When using this model, please be respectful of the Yanomami culture and language.

## Offline Usage

This model was designed to function completely offline, ensuring accessibility in remote areas without internet connectivity. All components can be downloaded and used locally.
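
For example, the model can be saved to disk once while online and then loaded with `local_files_only=True`, so no network access is attempted (the local path is illustrative):

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# One-time download while online, saved to a local directory.
GPT2Tokenizer.from_pretrained("renanserrano/yanomami-finetuning").save_pretrained("./yanomami-model")
GPT2LMHeadModel.from_pretrained("renanserrano/yanomami-finetuning").save_pretrained("./yanomami-model")

# Later, fully offline: load from the local directory without contacting the Hub.
tokenizer = GPT2Tokenizer.from_pretrained("./yanomami-model", local_files_only=True)
model = GPT2LMHeadModel.from_pretrained("./yanomami-model", local_files_only=True)
```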

## Related Resources

### Repositories & Datasets
- **GitHub Repository**: [renantrendt/yanomami-finetuning](https://github.com/renantrendt/yanomami-finetuning)
- **Dataset (Hugging Face)**: [renanserrano/yanomami](https://huggingface.co/datasets/renanserrano/yanomami)
- **Dataset Generator (NPM)**: [ai-dataset-generator](https://www.npmjs.com/package/ai-dataset-generator)

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{yanomami-english-translator,
  author = {Renan Serrano},
  title = {Yanomami-English Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/renanserrano/yanomami-finetuning}}
}
```