---
language:
- en
- adl
license: mit
tags:
- translation
- transformer
- nmt
- low-resource
- galo
- english
- bible
- pytorch
pipeline_tag: translation
metrics:
- bleu
- chrf
- ter
model-index:
- name: GaloNMT
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      type: custom
      name: Galo Bible Parallel Corpus
    metrics:
    - type: bleu
      value: 16.61
    - type: chrf
      value: 15.26
    - type: ter
      value: 150.04
---

# GaloNMT: English → Galo Neural Machine Translation

**GaloNMT** is a vanilla Transformer-based neural machine translation model that translates **English** text into **Galo**, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is a low-resource language with very limited digital representation; GaloNMT is one of the first dedicated NMT systems for it.

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Vanilla Transformer (from scratch) |
| **Translation Direction** | English → Galo |
| **Framework** | PyTorch |
| **Model Size** | ~34.7 MB (`model.pt`) |
| **Tokenizer** | Byte-Pair Encoding (BPE) via HuggingFace `tokenizers` |
| **Source Vocab Size** | 5,000 |
| **Target Vocab Size** | 5,000 |

### Architecture Hyperparameters

| Hyperparameter | Value |
|---|---|
| `d_model` | 128 |
| `n_heads` | 4 |
| `n_layers` | 2 |
| `d_ff` | 256 |
| `dropout` | 0.3 |
| `max_seq_length` | 64 |

### Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Batch Size | 16 |
| Epochs | 30 |
| Loss Function | CrossEntropyLoss (ignoring PAD) |
| Hardware | Apple M4 Silicon (MPS) |

## Training Data

The model was trained on the **Galo Bible Parallel Corpus**, a sentence-aligned English–Galo parallel corpus derived from Bible translations.

| Split | Sentences |
|---|---|
| Train | 6,144 |
| Validation | 768 |
| Test | 768 |
| **Total** | **7,680** |

The dataset was split using an **80 : 10 : 10** ratio (train / validation / test) with a fixed random seed of 42 for reproducibility.

## Evaluation Results

Evaluation was performed on **100 randomly sampled sentences** from the held-out test set using [SacreBLEU](https://github.com/mjpost/sacrebleu).

| Metric | Score |
|---|---|
| **BLEU** | 16.61 |
| **chrF** | 15.26 |
| **TER** | 150.04 |

## Sample Translations

| English Input | Galo Output |
|---|---|
| The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? |
| Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , |
| Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . |

> **Note:** The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See [Limitations](#limitations) for details.
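
The repetition visible in the samples above can be quantified. The following is a minimal sketch and not part of the released code: `repeated_ngram_fraction` is a hypothetical helper, and the whitespace tokenization, lowercasing, and default of trigrams are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams in `text` that duplicate an earlier n-gram."""
    # Assumption: whitespace tokenization is good enough for a rough measure.
    tokens: List[str] = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


# Second sample output from the table above.
sample = "Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu ,"
print(f"{repeated_ngram_fraction(sample):.2f}")
```

On the second sample translation this returns 0.6: six of its ten trigrams repeat an earlier one. A score near 0 would indicate little or no trigram repetition.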

## How to Use

### Requirements

```bash
pip install torch tokenizers
```

### Inference

```python
import json

import torch
from tokenizers import Tokenizer

# Use Apple's MPS backend when available (the training hardware), else CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

with open("GaloNMT/config.json", "r") as f:
    config = json.load(f)

en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json")
galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json")

# Special-token IDs are read from the English tokenizer and reused on the
# target side, so both tokenizers must assign them the same IDs.
PAD_IDX = en_tokenizer.token_to_id("[PAD]")
SOS_IDX = en_tokenizer.token_to_id("[SOS]")
EOS_IDX = en_tokenizer.token_to_id("[EOS]")


def translate(sentence, model, max_len=64):
    """Greedily decode a Galo translation of one English sentence.

    `model` is the Transformer with the weights from `model.pt` loaded,
    already moved to `device`; building it is not shown here.
    """
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    trg_indexes = [SOS_IDX]
    for _ in range(max_len):
        trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device)
        # Causal mask: each position attends only to itself and earlier ones.
        trg_mask = torch.tril(
            torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device)
        ).bool()
        with torch.no_grad():
            output = model(src, trg_tensor, src_mask, trg_mask)
        # Take the most likely token at the last position.
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == EOS_IDX:
            break

    return galo_tokenizer.decode(trg_indexes)
```

## Intended Use

- **Primary use:** Research and experimentation in low-resource neural machine translation for the Galo language.
- **Secondary use:** Supporting language documentation and digital preservation efforts for the Galo community.
- **Not intended for:** Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical.

## Limitations

- **Small training corpus:** The model is trained on only 6,144 sentence pairs (7,680 across all splits) from a single domain (Bible text), which limits its vocabulary coverage and generalization to other domains.
- **Repetitive outputs:** Due to the low-resource setting and small model size, the decoder occasionally produces repetitive n-grams, a well-known issue in autoregressive NMT.
- **Single domain:** Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics.
- **No beam search:** The current inference uses greedy decoding. Beam search or sampling strategies may improve output quality.
- **No back-translation or data augmentation:** The model was trained on parallel data only, without synthetic data augmentation techniques.

## Ethical Considerations

- The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts.
- Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights.
- This model should not be used to generate content that misrepresents the Galo language or culture.

## Training Loss Curve

The model was trained for 30 epochs with the following loss progression:

| Epoch | Loss | Epoch | Loss | Epoch | Loss |
|---|---|---|---|---|---|
| 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 |
| 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 |
| 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 |
| 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 |
| 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 |
| 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 |
| 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 |
| 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 |
| 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 |
| 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 |

## Model Files

```
GaloNMT/
├── config.json          # Model architecture configuration
├── model.pt             # Trained model weights (~34.7 MB)
├── en_tokenizer.json    # English BPE tokenizer
├── galo_tokenizer.json  # Galo BPE tokenizer
└── README.md            # This model card
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{galonmt2026,
  title        = {GaloNMT: Neural Machine Translation for English to Galo},
  author       = {Jurist Dupit},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/GaloNMT}},
  note         = {Vanilla Transformer trained on the Galo Bible Parallel Corpus},
  institution  = {Rajiv Gandhi University, Rono Hills, Doimukh}
}
```

## Acknowledgements

This work contributes to the digital preservation of, and computational linguistic support for, the **Galo** language. We thank the Galo-speaking community for the linguistic resources that made this project possible.
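
## Appendix: Beam Search Sketch

The Limitations section notes that inference uses greedy decoding. As a rough illustration of the alternative, the sketch below implements a fixed-width beam search in plain Python. It is not part of the released code: `beam_search` and the `step_fn` interface (a function mapping a target prefix to next-token log-probabilities) are hypothetical, and the length normalization at the end is one common choice among several.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps a target prefix (token IDs) to a dict of
# {next_token_id: log_probability}.
StepFn = Callable[[List[int]], Dict[int, float]]


def beam_search(step_fn: StepFn, sos_idx: int, eos_idx: int,
                beam_size: int = 4, max_len: int = 64) -> List[int]:
    """Return the best token sequence found with a fixed-width beam."""
    beams: List[Tuple[float, List[int]]] = [(0.0, [sos_idx])]
    finished: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates = [
            (score + logp, seq + [token])
            for score, seq in beams
            for token, logp in step_fn(seq).items()
        ]
        if not candidates:
            break
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos_idx:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break

    # Fall back to unfinished hypotheses if none reached EOS.
    pool = finished or beams
    # Assumption: normalize by length so short outputs are not favored.
    return max(pool, key=lambda c: c[0] / len(c[1]))[1]
```

To use this with the model, `step_fn` would wrap one forward pass and return the log-softmax over the vocabulary at the last target position. With `beam_size=1` the procedure reduces to greedy decoding.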