| --- |
| language: |
| - en |
| - adl |
| license: mit |
| tags: |
| - translation |
| - transformer |
| - nmt |
| - low-resource |
| - galo |
| - english |
| - bible |
| - pytorch |
| pipeline_tag: translation |
| metrics: |
| - bleu |
| - chrf |
| - ter |
| model-index: |
| - name: GaloNMT |
|   results: |
|   - task: |
|       type: translation |
|       name: Machine Translation |
|     dataset: |
|       type: custom |
|       name: Galo Bible Parallel Corpus |
|     metrics: |
|     - type: bleu |
|       value: 16.61 |
|     - type: chrf |
|       value: 15.26 |
|     - type: ter |
|       value: 150.04 |
| --- |
| |
| # GaloNMT — English → Galo Neural Machine Translation |
|
|
| **GaloNMT** is a vanilla Transformer-based neural machine translation model that translates **English** text into **Galo**, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is classified as a low-resource language with very limited digital representation, making this one of the first dedicated NMT systems for the language. |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Architecture** | Vanilla Transformer (from scratch) | |
| | **Translation Direction** | English → Galo | |
| | **Framework** | PyTorch | |
| | **Model Size** | ~34.7 MB (`model.pt`) | |
| | **Tokenizer** | Byte-Pair Encoding (BPE) via HuggingFace `tokenizers` | |
| | **Source Vocab Size** | 5,000 | |
| | **Target Vocab Size** | 5,000 | |
|
|
| ### Architecture Hyperparameters |
|
|
| | Hyperparameter | Value | |
| |---|---| |
| | `d_model` | 128 | |
| | `n_heads` | 4 | |
| | `n_layers` | 2 | |
| | `d_ff` | 256 | |
| | `dropout` | 0.3 | |
| | `max_seq_length` | 64 | |
|
|
| ### Training Configuration |
|
|
| | Parameter | Value | |
| |---|---| |
| | Optimizer | Adam | |
| | Learning Rate | 1e-4 | |
| | Batch Size | 16 | |
| | Epochs | 30 | |
| | Loss Function | CrossEntropyLoss (ignoring PAD) | |
| | Hardware | Apple M4 Silicon (MPS) | |
|
|
| ## Training Data |
|
|
| The model was trained on the **Galo Bible Parallel Corpus**, a sentence-aligned English–Galo parallel corpus derived from Bible translations. |
|
|
| | Split | Sentences | |
| |---|---| |
| | Train | 6,144 | |
| | Validation | 768 | |
| | Test | 768 | |
| | **Total** | **7,680** | |
|
|
| The dataset was split using an **80 : 10 : 10** ratio (train / validation / test) with a fixed random seed of 42 for reproducibility. |
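The split sizes follow directly from the ratio. The sketch below is a minimal illustration, assuming a plain shuffle-and-slice over the sentence pairs; the actual splitting code is not part of this card, and `split_corpus` and the placeholder data are hypothetical.

```python
import random


def split_corpus(pairs, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle sentence pairs with a fixed seed and cut them 80/10/10."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = round(len(pairs) * train_frac)
    n_val = round(len(pairs) * val_frac)
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test


# Placeholder data standing in for the 7,680 English-Galo sentence pairs.
dummy_pairs = [(f"en {i}", f"galo {i}") for i in range(7680)]
train, val, test = split_corpus(dummy_pairs)
print(len(train), len(val), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.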
|
|
| ## Evaluation Results |
|
|
| Evaluation was performed on **100 randomly sampled sentences** from the held-out test set using [SacreBLEU](https://github.com/mjpost/sacrebleu). |
|
|
| | Metric | Score | |
| |---|---| |
| | **BLEU** | 16.61 | |
| | **chrF** | 15.26 | |
| | **TER** | 150.04 | |
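A TER above 100 is possible because the metric divides the number of edits by the reference length, so a hypothesis much longer than its reference can need more edits than the reference has words. The sketch below illustrates this with a simplified, WER-style edit rate that ignores TER's block-shift operation. `edit_distance`, `edit_rate`, and the toy token lists are illustrative assumptions, not the SacreBLEU implementation.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_rate(hyp, ref):
    """Edits per reference word, as a percentage (no shift operation)."""
    return 100.0 * edit_distance(hyp, ref) / len(ref)


# Toy example: the hypothesis repeats itself and is longer than the reference.
reference = ["a", "b"]
hypothesis = ["a", "b", "a", "b", "a"]
print(edit_rate(hypothesis, reference))
```

Here three extra words must be removed to match a two-word reference, so the rate exceeds 100.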
|
|
| ## Sample Translations |
|
|
| | English Input | Galo Output | |
| |---|---| |
| | The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? | |
| | Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , | |
| | Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . | |
|
|
| > **Note:** The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See [Limitations](#limitations) for details. |
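One rough way to quantify the repetition is the share of n-grams in an output that duplicate an earlier n-gram. The sketch below is an illustrative measure applied to the second sample output; `repeated_ngram_rate` is a hypothetical helper, and whitespace tokenization with lowercasing is an assumption made for simplicity.

```python
def repeated_ngram_rate(text, n=2):
    """Fraction of n-grams that duplicate an earlier n-gram in the same text."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


sample = "Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu ,"
print(round(repeated_ngram_rate(sample), 3))
```

A value near 0 means almost every bigram is new; values well above 0.5, as for this sample, indicate that the decoder is looping.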
|
|
| ## How to Use |
|
|
| ### Requirements |
|
|
| ```bash |
| pip install torch tokenizers |
| ``` |
|
|
| ### Inference |
|
|
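| ```python |
| import json |
| |
| import torch |
| from tokenizers import Tokenizer |
| |
| # Use the Apple-silicon GPU (MPS) when available, otherwise fall back to the CPU. |
| device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") |
| |
| # Architecture hyperparameters, needed to rebuild the model before loading weights. |
| with open("GaloNMT/config.json", "r") as f: |
|     config = json.load(f) |
| |
| en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json") |
| galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json") |
| |
| # Special-token ids are read from the English tokenizer and reused on the |
| # target side, which assumes both tokenizers assign them the same ids. |
| PAD_IDX = en_tokenizer.token_to_id("[PAD]") |
| SOS_IDX = en_tokenizer.token_to_id("[SOS]") |
| EOS_IDX = en_tokenizer.token_to_id("[EOS]") |
| |
| # NOTE: `model` is not created here. Instantiate the Transformer class from the |
| # training code with the values in `config`, load the weights from |
| # "GaloNMT/model.pt", and move the model to `device` before calling translate(). |
| |
| |
| def translate(sentence, model, max_len=64): |
|     """Greedy-decode a Galo translation of one English sentence.""" |
|     model.eval() |
|     tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX] |
|     src = torch.tensor(tokens).unsqueeze(0).to(device) |
|     # Mask out padding positions: shape (1, 1, 1, src_len). |
|     src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2) |
| |
|     trg_indexes = [SOS_IDX] |
|     for _ in range(max_len): |
|         trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device) |
|         # Causal mask so each position attends only to earlier positions. |
|         trg_mask = torch.tril( |
|             torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device) |
|         ).bool() |
|         with torch.no_grad(): |
|             output = model(src, trg_tensor, src_mask, trg_mask) |
|         # Greedy step: take the most likely token at the last position. |
|         pred_token = output.argmax(2)[:, -1].item() |
|         trg_indexes.append(pred_token) |
|         if pred_token == EOS_IDX: |
|             break |
| |
|     return galo_tokenizer.decode(trg_indexes) |
| ``` |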
|
|
| ## Intended Use |
|
|
| - **Primary use:** Research and experimentation in low-resource neural machine translation for the Galo language. |
| - **Secondary use:** Supporting language documentation and digital preservation efforts for the Galo community. |
| - **Not intended for:** Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical. |
|
|
| ## Limitations |
|
|
| - **Small training corpus:** The corpus contains only 7,680 sentence pairs (6,144 of them used for training), all from a single domain (Bible text), which limits vocabulary coverage and generalization to other domains. |
| - **Repetitive outputs:** Due to the low-resource setting and small model size, the decoder occasionally produces repetitive n-grams — a well-known issue in autoregressive NMT. |
| - **Single domain:** Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics. |
| - **No beam search:** The current inference uses greedy decoding. Beam search or sampling strategies may improve output quality. |
| - **No back-translation or data augmentation:** The model was trained on parallel data only, without synthetic data augmentation techniques. |
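One common mitigation for the repetition noted above is n-gram blocking during decoding: before each greedy step, any token that would complete an n-gram already present in the output is excluded. The sketch below is not part of the released inference code; `banned_next_tokens` and `pick_token` are hypothetical helpers, and `scores` stands for one score per vocabulary id at the current step.

```python
def banned_next_tokens(prefix, n=3):
    """Token ids that would complete an n-gram already present in `prefix` (n >= 2)."""
    if len(prefix) < n:
        return set()
    context = tuple(prefix[len(prefix) - (n - 1):])  # last n-1 generated tokens
    banned = set()
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned


def pick_token(scores, prefix, n=3):
    """Greedy choice over `scores`, skipping ids that would repeat an n-gram."""
    banned = banned_next_tokens(prefix, n)
    allowed = [tok for tok in range(len(scores)) if tok not in banned]
    return max(allowed, key=lambda tok: scores[tok])


# Toy example: the prefix ends in (5, 6), which was previously followed by 7.
prefix = [5, 6, 7, 5, 6]
scores = [0.0] * 8
scores[7] = 0.9  # the model's top choice would repeat the trigram (5, 6, 7)
scores[2] = 0.5
print(banned_next_tokens(prefix), pick_token(scores, prefix))
```

In a greedy loop, `scores` could be taken from the last position of the model output in place of the plain `argmax`; whether this improves BLEU or chrF for this model has not been measured.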
|
|
| ## Ethical Considerations |
|
|
| - The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts. |
| - Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights. |
| - This model should not be used to generate content that misrepresents the Galo language or culture. |
|
|
| ## Training Loss Curve |
|
|
| The model trained for 30 epochs with the following loss progression: |
|
|
| | Epoch | Loss | Epoch | Loss | Epoch | Loss | |
| |---|---|---|---|---|---| |
| | 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 | |
| | 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 | |
| | 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 | |
| | 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 | |
| | 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 | |
| | 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 | |
| | 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 | |
| | 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 | |
| | 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 | |
| | 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 | |
|
|
| ## Model Files |
|
|
| ``` |
| GaloNMT/ |
| ├── config.json # Model architecture configuration |
| ├── model.pt # Trained model weights (~34.7 MB) |
| ├── en_tokenizer.json # English BPE tokenizer |
| ├── galo_tokenizer.json # Galo BPE tokenizer |
| └── README.md # This model card |
| ``` |
|
|
| ## Citation |
|
|
| If you use this model in your research, please cite: |
|
|
| ```bibtex |
| @misc{galonmt2026, |
| title = {GaloNMT: Neural Machine Translation for English to Galo}, |
| author = {Jurist Dupit}, |
| year = {2026}, |
| howpublished = {\url{https://huggingface.co/GaloNMT}}, |
| note = {Vanilla Transformer trained on the Galo Bible Parallel Corpus}, |
| institution = {Rajiv Gandhi University, Rono Hills, Doimukh} |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| This work contributes to the digital preservation of, and computational-linguistic support for, the **Galo** language. We thank the Galo-speaking community for the linguistic resources that made this project possible. |
|
|