---
language:
- en
- adl
license: mit
tags:
- translation
- transformer
- nmt
- low-resource
- galo
- english
- bible
- pytorch
pipeline_tag: translation
metrics:
- bleu
- chrf
- ter
model-index:
- name: GaloNMT
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      type: custom
      name: Galo Bible Parallel Corpus
    metrics:
    - type: bleu
      value: 16.61
    - type: chrf
      value: 15.26
    - type: ter
      value: 150.04
---

	# GaloNMT — English → Galo Neural Machine Translation

	GaloNMT is a vanilla Transformer-based neural machine translation model that translates English text into Galo, a Tibeto-Burman language spoken by the Galo community in Arunachal Pradesh, India. Galo is classified as a low-resource language with very limited digital representation, making this one of the first dedicated NMT systems for the language.

	## Model Details

| Property | Value |
|---|---|
| Architecture | Vanilla Transformer (from scratch) |
| Translation Direction | English → Galo |
| Framework | PyTorch |
| Model Size | ~34.7 MB (`model.pt`) |
| Tokenizer | Byte-Pair Encoding (BPE) via HuggingFace `tokenizers` |
| Source Vocab Size | 5,000 |
| Target Vocab Size | 5,000 |
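The table lists a BPE tokenizer with a 5,000-entry vocabulary per language. A minimal sketch of how such a tokenizer could be trained with the `tokenizers` library follows. The `[UNK]` token, the whitespace pre-tokenizer, and the helper name `train_bpe` are assumptions for illustration; only `[PAD]`, `[SOS]`, and `[EOS]` appear in this card.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# [PAD], [SOS] and [EOS] are used by the inference code in this card;
# [UNK] is an assumption added so unseen symbols can still be encoded.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[SOS]", "[EOS]"]


def train_bpe(sentences, vocab_size=5000):
    """Train a BPE tokenizer on an iterable of sentences (hypothetical helper)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()  # assumed pre-tokenization
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer


# Usage: one tokenizer per language, each trained on its side of the corpus.
# en_tokenizer = train_bpe(english_sentences)
# en_tokenizer.save("GaloNMT/en_tokenizer.json")
```

On a corpus this small the learned vocabulary may end up below the 5,000 cap, since BPE stops merging once no pairs remain.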

	### Architecture Hyperparameters

| Hyperparameter | Value |
|---|---|
| `d_model` | 128 |
| `n_heads` | 4 |
| `n_layers` | 2 |
| `d_ff` | 256 |
| `dropout` | 0.3 |
| `max_seq_length` | 64 |

	### Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning Rate | 1e-4 |
| Batch Size | 16 |
| Epochs | 30 |
| Loss Function | CrossEntropyLoss (ignoring PAD) |
| Hardware | Apple M4 Silicon (MPS) |
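The configuration above maps onto a short PyTorch training step. The sketch below is not the original training script: it assumes teacher forcing, a model with the same `model(src, trg, src_mask, trg_mask)` call signature as the inference code in this card, and a target mask that combines padding and causal masking. The names `make_training_objects` and `train_step` are hypothetical.

```python
import torch
import torch.nn as nn


def make_training_objects(model, pad_idx, lr=1e-4):
    """Build the loss and optimizer listed in the table (hypothetical helper)."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)  # PAD positions add no loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    return criterion, optimizer


def train_step(model, optimizer, criterion, src, trg, pad_idx):
    """Run one teacher-forced update and return the batch loss as a float."""
    model.train()
    trg_in, trg_out = trg[:, :-1], trg[:, 1:]  # predict token t+1 from tokens <= t
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    seq_len = trg_in.size(1)
    causal = torch.tril(
        torch.ones((1, 1, seq_len, seq_len), device=trg.device)
    ).bool()
    # Assumed masking: hide padding and future positions on the target side.
    trg_mask = (trg_in != pad_idx).unsqueeze(1).unsqueeze(2) & causal
    logits = model(src, trg_in, src_mask, trg_mask)  # (batch, seq_len, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), trg_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With a batch size of 16 and 6,144 training pairs, one epoch would be 384 such steps.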

	## Training Data

	The model was trained on the Galo Bible Parallel Corpus, a sentence-aligned English–Galo parallel corpus derived from Bible translations.

| Split | Sentences |
|---|---|
| Train | 6,144 |
| Validation | 768 |
| Test | 768 |
| Total | 7,680 |

	The dataset was split using an 80 : 10 : 10 ratio (train / validation / test) with a fixed random seed of 42 for reproducibility.
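A seeded 80 : 10 : 10 split can be sketched with the standard library alone. The card does not say which shuffling routine was used, so `split_corpus` and the placeholder data below are illustrative assumptions; only the ratio, the seed, and the resulting sizes come from this card.

```python
import random


def split_corpus(pairs, seed=42, train_pct=80, val_pct=10):
    """Shuffle sentence pairs with a fixed seed and cut them into three splits."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(pairs) * train_pct // 100
    n_val = len(pairs) * val_pct // 100
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]  # the remainder
    return train, val, test


# Placeholder pairs standing in for the 7,680 aligned sentences.
pairs = [(f"en {i}", f"galo {i}") for i in range(7680)]
train, val, test = split_corpus(pairs)
print(len(train), len(val), len(test))  # 6144 768 768
```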

	## Evaluation Results

	Evaluation was performed on 100 randomly sampled sentences from the held-out test set using [SacreBLEU](https://github.com/mjpost/sacrebleu).

| Metric | Score |
|---|---|
| BLEU | 16.61 |
| chrF | 15.26 |
| TER | 150.04 |

	## Sample Translations

| English Input | Galo Output |
|---|---|
| The elder to Gaius the beloved, | Yo lëga ëmrëm nyi gaddë nyi gaddë , yo go mendudü ëgum nyi gaddë yo go mendudü dü ? |
| Beloved, I personally am praying for you, | Ngo nonnuëm mendu , ngo nonnuëm mendu , ngo nonnuëm mendu , |
| Do not love the world, nor the things that are in the world. | Ëmbë rünamë , tani mooko sokë tani mooko sokë tani mooko sokë nyi ë , okkë tani mooko sokë nyi ë tani mooko sokë aken ë . |
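The sample outputs in the table repeat short phrases. One rough way to quantify this is the share of n-grams that duplicate an earlier n-gram in the same output. The function below is an illustrative sketch, not part of the released code, and it splits on whitespace instead of using the Galo tokenizer.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of n-grams in `text` that duplicate an earlier n-gram."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())  # every occurrence past the first
    return repeats / len(ngrams)


print(repeated_ngram_ratio("a b c a b c"))  # 0.25
print(repeated_ngram_ratio("a b c d"))      # 0.0
```

Averaging this ratio over the test outputs gives a simple number to track when comparing decoding strategies.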

	> Note: The model shows signs of repetition in some outputs, a common phenomenon in low-resource NMT settings. See [Limitations](#limitations) for details.

	## How to Use

	### Requirements

	```bash
	pip install torch tokenizers
	```

	### Inference

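```python
import json

import torch
from tokenizers import Tokenizer

# Pick the best available device; the model was trained on Apple Silicon (MPS).
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Architecture configuration used to rebuild the model.
with open("GaloNMT/config.json", "r") as f:
    config = json.load(f)

en_tokenizer = Tokenizer.from_file("GaloNMT/en_tokenizer.json")
galo_tokenizer = Tokenizer.from_file("GaloNMT/galo_tokenizer.json")

# Special-token IDs are read from the English tokenizer and reused on the
# Galo side, which assumes both tokenizers assign them the same IDs.
PAD_IDX = en_tokenizer.token_to_id("[PAD]")
SOS_IDX = en_tokenizer.token_to_id("[SOS]")
EOS_IDX = en_tokenizer.token_to_id("[EOS]")


def translate(sentence, model, max_len=64):
    """Greedy-decode an English sentence into Galo.

    `model` is the trained Transformer, already built from `config`,
    loaded with the weights in `model.pt`, and moved to `device`.
    """
    model.eval()
    tokens = [SOS_IDX] + en_tokenizer.encode(sentence).ids + [EOS_IDX]
    src = torch.tensor(tokens).unsqueeze(0).to(device)
    # Padding mask, shape (1, 1, 1, src_len).
    src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)

    trg_indexes = [SOS_IDX]
    for _ in range(max_len):
        trg_tensor = torch.tensor(trg_indexes).unsqueeze(0).to(device)
        # Causal mask so each position attends only to earlier positions.
        trg_mask = torch.tril(
            torch.ones((1, 1, len(trg_indexes), len(trg_indexes)), device=device)
        ).bool()
        with torch.no_grad():
            output = model(src, trg_tensor, src_mask, trg_mask)
        # Take the most likely token at the last position.
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == EOS_IDX:
            break

    return galo_tokenizer.decode(trg_indexes)
```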

	## Intended Use

	- Primary use: Research and experimentation in low-resource neural machine translation for the Galo language.
	- Secondary use: Supporting language documentation and digital preservation efforts for the Galo community.
	- Not intended for: Production-grade translation systems, legal or medical translation, or any high-stakes application where translation accuracy is critical.

	## Limitations

- Small training corpus: The corpus contains only 7,680 sentence pairs (6,144 used for training) from a single domain (Bible text), which limits vocabulary coverage and generalization to other domains.
- Repetitive outputs: Due to the low-resource setting and small model size, the decoder is prone to repeating n-grams, a well-known failure mode of autoregressive NMT.
- Single domain: Performance on out-of-domain text (news, conversational, technical) is expected to be significantly lower than the reported metrics.
- No beam search: The inference code uses greedy decoding. Beam search or sampling strategies may improve output quality.
- No back-translation or data augmentation: The model was trained on parallel data only, without synthetic data augmentation techniques.
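To illustrate the beam search point, here is a minimal, model-agnostic sketch. It is not part of the released code. `step_fn` is a hypothetical callback that returns next-token log-probabilities for a prefix; for GaloNMT it would wrap a forward pass followed by `log_softmax`. The toy distribution is made up to show a case where greedy decoding misses the most probable sequence.

```python
import math


def beam_search(step_fn, sos, eos, beam_size=4, max_len=64):
    """Return the token sequence with the highest cumulative log-probability.

    step_fn(prefix) must return {token_id: log_prob} for the next token.
    """
    beams = [([sos], 0.0)]  # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy next-token probabilities keyed by prefix; token 9 plays the role of EOS.
TOY = {
    (0,): {1: 0.6, 2: 0.4},
    (0, 1): {3: 0.5, 4: 0.5},
    (0, 1, 3): {9: 1.0},
    (0, 1, 4): {9: 1.0},
    (0, 2): {9: 1.0},
}


def toy_step(prefix):
    return {tok: math.log(p) for tok, p in TOY[tuple(prefix)].items()}


print(beam_search(toy_step, sos=0, eos=9, beam_size=1))  # [0, 1, 3, 9]
print(beam_search(toy_step, sos=0, eos=9, beam_size=2))  # [0, 2, 9]
```

With `beam_size=1` the search is greedy and ends on a sequence of probability 0.3, while a beam of 2 finds the sequence of probability 0.4. The sketch applies no length normalization, so it favors shorter outputs.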

	## Ethical Considerations

	- The training data is derived from publicly available Bible translations. Care should be taken when using the model in culturally sensitive contexts.
	- Galo is a language spoken by an indigenous community. Any deployment or public-facing use of this model should involve community consultation and respect for indigenous language rights.
	- This model should not be used to generate content that misrepresents the Galo language or culture.

	## Training Loss Curve

The model was trained for 30 epochs. The training loss after each epoch was:

| Epoch | Loss | Epoch | Loss | Epoch | Loss |
|---|---|---|---|---|---|
| 1 | 7.0211 | 11 | 5.3566 | 21 | 4.8699 |
| 2 | 6.3616 | 12 | 5.2930 | 22 | 4.8339 |
| 3 | 6.1726 | 13 | 5.2337 | 23 | 4.7986 |
| 4 | 6.0124 | 14 | 5.1815 | 24 | 4.7632 |
| 5 | 5.8844 | 15 | 5.1299 | 25 | 4.7345 |
| 6 | 5.7708 | 16 | 5.0777 | 26 | 4.7034 |
| 7 | 5.6739 | 17 | 5.0343 | 27 | 4.6699 |
| 8 | 5.5823 | 18 | 4.9872 | 28 | 4.6412 |
| 9 | 5.5018 | 19 | 4.9482 | 29 | 4.6122 |
| 10 | 5.4271 | 20 | 4.9081 | 30 | 4.5867 |
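Cross-entropy values are easier to interpret as perplexity. Assuming the reported loss is the mean per-token cross-entropy in nats (the default behavior of PyTorch's `CrossEntropyLoss`), perplexity is the exponential of the loss:

```python
import math


def perplexity(cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


for epoch, loss in [(1, 7.0211), (30, 4.5867)]:
    print(f"epoch {epoch}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```

Under that assumption, training perplexity falls from roughly 1,120 after the first epoch to roughly 98 after the last, against a target vocabulary of 5,000 tokens.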

	## Model Files

	```
	GaloNMT/
	├── config.json # Model architecture configuration
	├── model.pt # Trained model weights (~34.7 MB)
	├── en_tokenizer.json # English BPE tokenizer
	├── galo_tokenizer.json # Galo BPE tokenizer
	└── README.md # This model card
	```

	## Citation

	If you use this model in your research, please cite:

```bibtex
@misc{galonmt2026,
  title        = {GaloNMT: Neural Machine Translation for English to Galo},
  author       = {Jurist Dupit},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/GaloNMT}},
  note         = {Vanilla Transformer trained on the Galo Bible Parallel Corpus},
  institution  = {Rajiv Gandhi University, Rono Hills, Doimukh}
}
```

	## Acknowledgements

	This work contributes to the digital preservation and computational linguistic support for the Galo language. We thank the Galo-speaking community for the linguistic resources that made this project possible.