---
language:
  - en
  - adl
license: mit
tags:
  - translation
  - transformer
  - nmt
  - low-resource
  - galo
  - english
  - bible
  - pytorch
pipeline_tag: translation
metrics:
  - bleu
  - chrf
  - ter
model-index:
  - name: GaloNMT
    results:
      - task:
          type: translation
          name: Machine Translation
        dataset:
          type: custom
          name: Galo Bible Parallel Corpus
        metrics:
          - type: bleu
            value: 16.61
          - type: chrf
            value: 15.26
          - type: ter
            value: 150.04
---

# GaloNMT — English → Galo Neural Machine Translation

**GaloNMT** is a vanilla Transformer neural machine translation model, trained from scratch, that translates **English** text into **Galo**, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is a low-resource language with very limited digital representation, which makes this one of the first dedicated NMT systems for the language.

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Vanilla Transformer (from scratch) |
| **Translation Direction** | English → Galo |
| **Framework** | PyTorch |
| **Model Size** | ~34.7 MB (`model.pt`) |
| **Tokenizer** | Byte-Pair Encoding (BPE) via HuggingFace `tokenizers` |
| **Source Vocab Size** | 5,000 |
| **Target Vocab Size** | 5,000 |

### Architecture Hyperparameters

| Hyperparameter | Value |
|---|---|
| `d_model` | 128 |
| `n_heads` | 4 |
| `n_layers` | 2 |
| `d_ff` | 256 |
| `dropout` | 0.3 |
| `max_seq_length` | 64 |

### Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Batch Size | 16 |
| Epochs | 30 |
| Loss Function | CrossEntropyLoss (ignoring PAD) |
| Hardware | Apple M4 Silicon (MPS) |

## Training Data

The model was trained on the **Galo Bible Parallel Corpus**, a sentence-aligned English–Galo parallel corpus derived from Bible translations.

| Split | Sentences |
|---|---|
| Train | 6,144 |
| Validation | 768 |
| Test | 768 |
| **Total** | **7,680** |

The dataset was split using an **80 : 10 : 10** ratio (train / validation / test) with a fixed random seed of 42 for reproducibility.
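
The split described above can be reproduced along the following lines. This is a minimal sketch, not the original preprocessing script: the `split_corpus` helper and the placeholder `pairs` list are hypothetical, and the exact shuffling routine used for the released splits is an assumption, so the resulting sentence assignment may differ even with the same seed.

```python
import random


def split_corpus(pairs, seed=42, parts=(8, 1, 1)):
    """Shuffle sentence pairs with a fixed seed and cut them 80:10:10."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    total = sum(parts)
    n_train = len(shuffled) * parts[0] // total
    n_val = len(shuffled) * parts[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Placeholder data standing in for the 7,680 aligned English-Galo pairs.
pairs = [(f"en sentence {i}", f"galo sentence {i}") for i in range(7680)]
train, val, test = split_corpus(pairs)
print(len(train), len(val), len(test))  # 6144 768 768
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible regardless of other random calls in the surrounding script.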

## Evaluation Results

Evaluation was performed on **100 randomly sampled sentences** from the held-out test set using [SacreBLEU](https://github.com/mjpost/sacrebleu).

| Metric | Score |
|---|---|
| **BLEU** | 16.61 |
| **chrF** | 15.26 |
| **TER** | 150.04 |
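
A TER above 100 can look surprising. TER divides the number of word edits by the reference length, so a hypothesis much longer than its reference can need more edits than the reference has words. The sketch below illustrates this with a simplified, shift-free approximation; it is not SacreBLEU's implementation (which also allows block shifts and may therefore score lower), and `simple_ter` and the placeholder tokens are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits per reference word, times 100 (no block shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


# A repetitive 10-word hypothesis against a 4-word reference:
# 6 deletions + 2 substitutions = 8 edits over 4 reference words.
print(simple_ter("a b a b a b a b a b", "a b c d"))  # 200.0
```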

## Sample Translations

| English Input | Galo Output |
|---|---|
| The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? |
| Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , |
| Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . |

> **Note:** The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See [Limitations](#limitations) for details.
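
One rough way to quantify this repetition is the share of n-grams in an output that duplicate an earlier n-gram. The helper below is a hypothetical diagnostic, not part of the released code or of the reported metrics; it splits on whitespace and is case-sensitive, which is an arbitrary choice for this sketch.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of n-grams that repeat an n-gram seen earlier in the text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())  # copies beyond the first
    return repeated / len(ngrams)


# Second sample output from the table above.
sample = "Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu ,"
print(repeated_ngram_ratio(sample))  # 0.5
```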

## How to Use

### Requirements

```bash
pip install torch tokenizers
```

### Inference

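```python
import json

import torch
from tokenizers import Tokenizer

with open("GaloNMT/config.json", "r") as f:
    config = json.load(f)

en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json")
galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json")

# Special-token IDs are read from the English tokenizer and reused on the
# target side, which assumes both tokenizers assign them the same IDs.
PAD_IDX = en_tokenizer.token_to_id("[PAD]")
SOS_IDX = en_tokenizer.token_to_id("[SOS]")
EOS_IDX = en_tokenizer.token_to_id("[EOS]")

# Prefer Apple MPS (the training hardware), then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")


def translate(sentence, model, max_len=64):
    """Greedy-decode a Galo translation of an English sentence.

    `model` must be the training-time Transformer with the weights from
    `model.pt` already loaded and moved to `device`; its class definition
    is not part of this snippet.
    """
    model.eval()
    # Truncate so that the source, with [SOS] and [EOS], fits in max_len.
    ids = en_tokenizer.encode(sentence).ids[: max_len - 2]
    src = torch.tensor([SOS_IDX] + ids + [EOS_IDX]).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    trg_indexes = [SOS_IDX]
    for _ in range(max_len - 1):  # keep the target within max_len tokens
        trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device)
        # Causal mask: each position attends only to itself and earlier ones.
        trg_mask = torch.tril(
            torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device)
        ).bool()
        with torch.no_grad():
            output = model(src, trg_tensor, src_mask, trg_mask)
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == EOS_IDX:
            break

    return galo_tokenizer.decode(trg_indexes)
```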

## Intended Use

- **Primary use:** Research and experimentation in low-resource neural machine translation for the Galo language.
- **Secondary use:** Supporting language documentation and digital preservation efforts for the Galo community.
- **Not intended for:** Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical.

## Limitations

- **Small training corpus:** The model is trained on only 6,144 sentence pairs (7,680 including the validation and test splits) from a single domain (Bible text), which limits its vocabulary coverage and generalization to other domains.
- **Repetitive outputs:** Due to the low-resource setting and small model size, the decoder occasionally produces repetitive n-grams, a well-known issue in autoregressive NMT.
- **Single domain:** Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics.
- **No beam search:** The current inference uses greedy decoding. Beam search or sampling strategies may improve output quality.
- **No back-translation or data augmentation:** The model was trained on parallel data only, without synthetic data augmentation techniques.
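
As a starting point for the beam search mentioned above, the following sketch shows the algorithm in a model-agnostic form. It is an illustration under stated assumptions, not code from this repository: `step_fn` is a hypothetical callback that would wrap the model's forward pass and return next-token log-probabilities for a given prefix, and `toy_step` is a made-up distribution used only to show that beam search can find a sequence that greedy decoding misses.

```python
import math


def beam_search(step_fn, sos, eos, beam_size=4, max_len=64):
    """Return the best token sequence found by length-normalised beam search.

    step_fn(prefix) must return a dict {token_id: log_prob} for the next token.
    """
    beams = [([sos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))  # hypothesis is complete
            else:
                beams.append((tokens, score))     # keep expanding
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len without EOS
    # Normalise by length so longer hypotheses are not unfairly penalised.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]


# Toy next-token distribution (hypothetical): 0 = SOS, 1 = EOS.
TABLE = {
    (0,): {2: math.log(0.55), 3: math.log(0.45)},
    (0, 2): {1: math.log(0.6), 2: math.log(0.4)},
}


def toy_step(prefix):
    return TABLE.get(tuple(prefix), {1: 0.0})  # otherwise emit EOS


print(beam_search(toy_step, sos=0, eos=1, beam_size=1))  # [0, 2, 1]
print(beam_search(toy_step, sos=0, eos=1, beam_size=2))  # [0, 3, 1]
```

With `beam_size=1` the search reduces to greedy decoding and returns a sequence with probability 0.33; with `beam_size=2` it recovers the alternative with probability 0.45.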

## Ethical Considerations

- The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts.
- Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights.
- This model should not be used to generate content that misrepresents the Galo language or culture.

## Training Loss Curve

The model trained for 30 epochs with the following loss progression:

| Epoch | Loss | Epoch | Loss | Epoch | Loss |
|---|---|---|---|---|---|
| 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 |
| 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 |
| 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 |
| 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 |
| 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 |
| 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 |
| 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 |
| 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 |
| 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 |
| 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 |
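
The loss values are easier to interpret as perplexity. Assuming the reported loss is the mean per-token cross-entropy in nats (the default behaviour of PyTorch's `CrossEntropyLoss`; the card does not state this explicitly), perplexity is simply `exp(loss)`. The sketch below converts the first and last epoch from the table.

```python
import math


def perplexity(loss_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss_nats)


first_epoch_loss = 7.0211   # epoch 1, from the table above
final_epoch_loss = 4.5867   # epoch 30, from the table above

print(f"epoch 1:  perplexity ~ {perplexity(first_epoch_loss):.0f}")
print(f"epoch 30: perplexity ~ {perplexity(final_epoch_loss):.0f}")
```

Under that assumption, perplexity falls from roughly 1,120 to roughly 98 over the 30 epochs, against a 5,000-token target vocabulary.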

## Model Files

```
GaloNMT/
├── config.json            # Model architecture configuration
├── model.pt               # Trained model weights (~34.7 MB)
├── en_tokenizer.json      # English BPE tokenizer
├── galo_tokenizer.json    # Galo BPE tokenizer
└── README.md              # This model card
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{galonmt2026,
  title        = {GaloNMT: Neural Machine Translation for English to Galo},
  author       = {Jurist Dupit},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/GaloNMT}},
  note         = {Vanilla Transformer trained on the Galo Bible Parallel Corpus},
  institution  = {Rajiv Gandhi University, Rono Hills, Doimukh}
}
```

## Acknowledgements

This work contributes to the digital preservation of, and computational-linguistic support for, the **Galo** language. We thank the Galo-speaking community for the linguistic resources that made this project possible.