license: cc-by-nc-4.0
language:
  - en
  - km
base_model:
  - facebook/nllb-200-distilled-600M
pipeline_tag: translation
tags:
  - legal
  - khmer
  - translation
  - nllb
  - refugees
  - humanitarian
  - denoising

Khmer Legal Bridge - NLLB Fine-tuned for Legal Translation

English-Khmer Bidirectional Translation Model for Legal and Humanitarian Documents

Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M optimized for legal document translation between English and Khmer. It was developed to support Cambodian refugees, asylum seekers, and legal professionals who need accurate translations of legal materials.

Intended Use

Translation of legal documents (court documents, asylum applications, legal handbooks)
Refugee and immigration documentation
Juvenile justice materials
Human rights reports and policy documents

Languages

English (eng_Latn)
Khmer (khm_Khmr)

Evaluation Results

Direction	chrF	BLEU
EN to KM	53.28	29.38
KM to EN	59.68	34.78

Comparison with Base Model

Direction	Metric	Base NLLB	Fine-tuned	Change
KM to EN	chrF	55.98	59.68	+3.70
KM to EN	BLEU	28.35	34.78	+6.43
EN to KM	chrF	54.48	53.28	-1.20

Key Results

Balanced bidirectional performance: Both directions score above 53 chrF (EN to KM 53.28, KM to EN 59.68)
KM to EN significantly improved: +3.70 chrF, +6.43 BLEU over the base model
EN to KM chrF slightly lower: -1.20 chrF relative to the base model, a small trade-off for better overall balance
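
The Change column in the comparison table is the fine-tuned score minus the base score. A minimal sketch that recomputes it from the table values; the `scores` layout and the `score_change` helper are illustrative names, not part of the model's tooling.

```python
# Scores copied from the comparison table: (base NLLB, fine-tuned).
scores = {
    ("KM to EN", "chrF"): (55.98, 59.68),
    ("KM to EN", "BLEU"): (28.35, 34.78),
    ("EN to KM", "chrF"): (54.48, 53.28),
}

def score_change(base, tuned):
    """Absolute change in metric points, rounded to two decimals."""
    return round(tuned - base, 2)

for (direction, metric), (base, tuned) in scores.items():
    print(f"{direction} {metric}: {score_change(base, tuned):+.2f}")
```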

Training Pipeline

Phase 1.5: Denoising Pre-training

Before translation fine-tuning, we strengthened the model's Khmer understanding using a denoising autoencoder task on 88,000+ monolingual Khmer examples:

Dataset	Size	Source	Content
khPOS	~12,000	Khmer POS Corpus	News, politics, economics - professionally segmented with POS tags
Khmer Dictionary 44K	~44,700	Royal Academy of Cambodia (2022)	Curated definitions, formal register

Denoising Task:

Input: Corrupted Khmer text (15% noise: masking, deletion, token shuffling)
Output: Clean original text
Both encoder AND decoder trained
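
The corruption step can be sketched as follows. This is an illustration under stated assumptions, not the training code: the text is assumed to be pre-segmented into space-separated words (as in khPOS), each noisy position gets one of the three operations chosen uniformly, "shuffling" is approximated by swapping a token with its right-hand neighbour, and the `<mask>` string and function names are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: placeholder used for masked positions

def corrupt(tokens, noise_ratio=0.15, seed=None):
    """Corrupt a token list with masking, deletion, and local shuffling.

    About `noise_ratio` of the positions are selected; each selected
    position gets one operation chosen uniformly (an assumption).
    """
    rng = random.Random(seed)
    n = len(tokens)
    if n == 0:
        return []
    n_noisy = min(n, max(1, round(n * noise_ratio)))
    noisy_positions = set(rng.sample(range(n), n_noisy))

    out = []
    shuffle_at = []  # indices in `out` of tokens marked for shuffling
    for i, tok in enumerate(tokens):
        if i not in noisy_positions:
            out.append(tok)
            continue
        op = rng.choice(("mask", "delete", "shuffle"))
        if op == "mask":
            out.append(MASK_TOKEN)
        elif op == "shuffle":
            shuffle_at.append(len(out))
            out.append(tok)
        # "delete": the token is simply dropped

    # Swap each marked token with its right-hand neighbour, if it has one.
    for idx in shuffle_at:
        if idx + 1 < len(out):
            out[idx], out[idx + 1] = out[idx + 1], out[idx]
    return out

def make_denoising_example(text, seed=None):
    """Build one (corrupted input, clean target) pair from segmented text."""
    tokens = text.split()
    return {"input": " ".join(corrupt(tokens, seed=seed)), "target": text}
```

Passing a fixed `seed` makes the corruption reproducible, which helps when comparing runs.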

Phase 1: Bidirectional Translation Fine-tuning

Using the denoising-pretrained model, we fine-tuned on ~389,000 parallel examples (bidirectional):

Dataset	Pairs	Bidirectional Examples
ALT Corpus	18,088	36,176
OPUS-100	~112,000	~224,000
ParaCrawl	~65,000	~130,000
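
Each parallel pair yields two training examples, one per direction, which is why 18,088 ALT pairs become 36,176 examples. A minimal sketch of that expansion; the dictionary field names are hypothetical and the Khmer strings are placeholders.

```python
def to_bidirectional(pairs):
    """Expand (english, khmer) pairs into one example per translation direction."""
    examples = []
    for en, km in pairs:
        examples.append(
            {"src_lang": "eng_Latn", "tgt_lang": "khm_Khmr", "src": en, "tgt": km}
        )
        examples.append(
            {"src_lang": "khm_Khmr", "tgt_lang": "eng_Latn", "src": km, "tgt": en}
        )
    return examples

pairs = [
    ("The hearing is adjourned.", "<khmer_sentence_1>"),
    ("You have the right to a lawyer.", "<khmer_sentence_2>"),
]
examples = to_bidirectional(pairs)
print(len(pairs), "pairs ->", len(examples), "examples")
```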

Training Results:

Phase	Metric	Start	End	Improvement
1.5 Denoising	Val Loss	~4.9	~2.5	-49%
1 Translation	Val Loss	3.804	2.586	-32%
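
The Improvement column is the relative change in validation loss from start to end. A short sketch of the arithmetic; `relative_change` is an illustrative helper, and the Phase 1.5 inputs are the approximate values from the table.

```python
def relative_change(start, end):
    """Percentage change from start to end (negative means the loss went down)."""
    return 100.0 * (end - start) / start

print(f"Phase 1.5 denoising: {relative_change(4.9, 2.5):.0f}%")
print(f"Phase 1 translation: {relative_change(3.804, 2.586):.0f}%")
```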

Training Configuration:

Epochs: 3
Effective batch size: 32
Learning rate: 2e-5
Training time: ~13.5 hours on Google Colab (A100)
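
An effective batch size of 32 is the product of the per-device batch size, the gradient accumulation steps, and the number of devices. The sketch below shows one possible split and the resulting number of optimizer steps for ~389,000 examples. The 8 x 4 split and the single-GPU setting are assumptions, and `TrainConfig` is a hypothetical container, not the training script's configuration.

```python
import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    epochs: int = 3                        # from the configuration above
    learning_rate: float = 2e-5            # from the configuration above
    per_device_batch_size: int = 8         # assumption
    gradient_accumulation_steps: int = 4   # assumption
    num_devices: int = 1                   # assumption: one A100

    @property
    def effective_batch_size(self):
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    def optimizer_steps(self, num_examples):
        """Approximate total optimizer steps, counting a partial final batch."""
        steps_per_epoch = math.ceil(num_examples / self.effective_batch_size)
        return steps_per_epoch * self.epochs

cfg = TrainConfig()
print(cfg.effective_batch_size, cfg.optimizer_steps(389_000))
```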

Usage

from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast
import torch

# Load model and tokenizer
model_id = "ClaudBarbara/Open_Access_Khmer"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = NllbTokenizerFast.from_pretrained(model_id)

def translate(text, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=512,
            num_beams=4
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# English to Khmer
result = translate("The court finds the defendant guilty.", "eng_Latn", "khm_Khmr")

# Khmer to English  
result = translate("<your_khmer_text>", "khm_Khmr", "eng_Latn")

Known Limitations

Temporal bias: The model tends to add past-tense markers in Khmer when translating present-tense English. Tense augmentation (Phase 3 of the roadmap) is planned to address this.
Domain specificity: Best results on legal and formal documents.
Length: Optimized for sentence-level translation. With the settings in the usage example, inputs longer than 512 tokens are truncated.
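
Because the model works best at sentence level, longer documents can be split into sentences and translated one at a time. The sketch below is a naive splitter, not part of the model: it breaks on `.`, `!`, `?` followed by whitespace and on the Khmer full stop `។`, so abbreviations such as "Art. 3" will be split incorrectly. `translate_fn` is assumed to have the same signature as the `translate` function in the Usage section.

```python
import re

# Split after Latin sentence punctuation followed by whitespace, or after the
# Khmer full stop (khan), which is often not followed by a space.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=។)\s*")

def split_sentences(text):
    """Naive sentence splitter for English or Khmer text."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s and s.strip()]

def translate_document(text, translate_fn, src_lang, tgt_lang):
    """Translate sentence by sentence and join the results with spaces."""
    return " ".join(
        translate_fn(sentence, src_lang, tgt_lang)
        for sentence in split_sentences(text)
    )
```

For legal text, a dedicated sentence segmenter is likely a better choice than this regular expression.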

Ethical Considerations

This model is intended for humanitarian purposes. It should NOT replace certified human translators in official legal proceedings.

Privacy: Stateless processing - no input text is stored or logged.

Roadmap

Phase 1.5: Denoising pre-training on 88K Khmer examples
Phase 1: Bidirectional translation fine-tuning on 389K examples
Phase 2: LoRA fine-tuning on legal glossary (~450 terms)
Phase 3: Tense augmentation to address temporal bias

Technical References

Datasets:

khPOS: Ye Kyaw Thu et al., "Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus" (ONA 2017)
Khmer Dictionary 44K: Royal Academy of Cambodia, 2022
ALT Corpus: Asian Language Treebank

Model:

NLLB-200: Costa-jussà et al., "No Language Left Behind" (2022)

Methodology:

Denoising: Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training" (2019)

Citation

@misc{khmer-legal-bridge-2024,
  title={Khmer Legal Bridge: Fine-tuned NLLB for Legal Translation},
  author={ClaudBarbara},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/ClaudBarbara/Open_Access_Khmer}
}

Acknowledgments

Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. Retrieved from https://arxiv.org/abs/2103.16801
Loem, M. (2021, May 4). Joint Khmer Word Segmentation and POS tagging. Medium. Retrieved from https://towardsdatascience.com/joint-khmer-word-segmentation-and-pos-tagging-cad650e78d30
Ye, K. T., Vichet, C., & Yoshinori, S. (2017). Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus. First Regional Conference on Optical Character Recognition and Natural Language Processing Technologies for ASEAN Languages (ONA 2017). Retrieved from https://github.com/ye-kyaw-thu/khPOS/blob/master/khpos.pdf

License

CC-BY-NC-4.0

Khmer Legal Bridge - Open Source Legal Translation for Refugee Communities