---
license: cc-by-nc-4.0
language:
- en
- km
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
tags:
- legal
- khmer
- translation
- nllb
- refugee
- humanitarian
- denoising
---
# Khmer Legal Bridge - NLLB Fine-tuned for Legal Translation
English-Khmer bidirectional translation model for legal and humanitarian documents.
## Model Description
This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) optimized for legal document translation between English and Khmer. It was developed to support Cambodian refugees, asylum seekers, and legal professionals who need accurate translations of legal materials.
### Intended Use
- Translation of legal documents (court documents, asylum applications, legal handbooks)
- Refugee and immigration documentation
- Juvenile justice materials
- Human rights reports and policy documents
### Languages
- English (eng_Latn)
- Khmer (khm_Khmr)
## Evaluation Results
| Direction | chrF | BLEU |
|-----------|------|------|
| EN to KM | 53.28 | 29.38 |
| KM to EN | 59.68 | 34.78 |
### Comparison with Base Model
| Direction | Metric | Base NLLB | Fine-tuned | Change |
|-----------|--------|-----------|------------|--------|
| KM to EN | chrF | 55.98 | 59.68 | **+3.70** |
| KM to EN | BLEU | 28.35 | 34.78 | **+6.43** |
| EN to KM | chrF | 54.48 | 53.28 | -1.20 |
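The Change column is the fine-tuned score minus the base score. A minimal sketch that recomputes it from the table above; the dictionary layout and function name are illustrative and not part of the released code:

```python
# Scores copied from the comparison table: (base NLLB, fine-tuned).
scores = {
    ("KM to EN", "chrF"): (55.98, 59.68),
    ("KM to EN", "BLEU"): (28.35, 34.78),
    ("EN to KM", "chrF"): (54.48, 53.28),
}


def score_change(base, fine_tuned):
    """Absolute change in metric points, rounded to two decimals."""
    return round(fine_tuned - base, 2)


changes = {key: score_change(*pair) for key, pair in scores.items()}
for (direction, metric), delta in changes.items():
    print(f"{direction} {metric}: {delta:+.2f}")
```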
### Key Results
- **Balanced bidirectional performance**: Both directions score above 53 chrF and 29 BLEU
- **KM to EN significantly improved**: +3.70 chrF, +6.43 BLEU over the base model
- **EN to KM chrF slightly lower**: -1.20 chrF versus the base model, a small trade-off for better overall balance
## Training Pipeline
### Phase 1.5: Denoising Pre-training
Before translation fine-tuning, we strengthened the model's Khmer understanding using a **denoising autoencoder task** on **88,000+ monolingual Khmer examples**:
| Dataset | Size | Source | Content |
|---------|------|--------|---------|
| khPOS | ~12,000 | Khmer POS Corpus | News, politics, economics - professionally segmented with POS tags |
| Khmer Dictionary 44K | ~44,700 | Royal Academy of Cambodia (2022) | Curated definitions, formal register |
**Denoising Task:**
- Input: Corrupted Khmer text (15% noise: masking, deletion, token shuffling)
- Output: Clean original text
- Both encoder AND decoder trained
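The denoising task above can be sketched as a function that corrupts a token list and pairs it with the clean original. This is only an illustration: the equal split between the three noise types, the neighbour-swap form of shuffling, whitespace tokenization, and the `<mask>` string are assumptions, not the documented training code.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: replace with your tokenizer's mask token


def add_noise(tokens, noise_ratio=0.15, seed=None):
    """Corrupt roughly `noise_ratio` of the tokens by masking, deletion, or shuffling."""
    rng = random.Random(seed)
    tokens = list(tokens)
    if not tokens:
        return []
    n_noisy = max(1, round(len(tokens) * noise_ratio))
    chosen = rng.sample(range(len(tokens)), n_noisy)
    # Assumption: each chosen position gets one of the three operations uniformly.
    ops = {pos: rng.choice(("mask", "delete", "shuffle")) for pos in chosen}

    # Shuffling: swap the chosen token with its right-hand neighbour.
    for pos, op in sorted(ops.items()):
        if op == "shuffle" and pos + 1 < len(tokens):
            tokens[pos], tokens[pos + 1] = tokens[pos + 1], tokens[pos]

    corrupted = []
    for pos, token in enumerate(tokens):
        op = ops.get(pos)
        if op == "mask":
            corrupted.append(MASK_TOKEN)
        elif op == "delete":
            continue  # drop the token entirely
        else:
            corrupted.append(token)
    return corrupted


# Training pair: the encoder sees the corrupted text, the decoder learns the clean text.
clean = "សិទ្ធិ មនុស្ស ត្រូវ បាន ការពារ ដោយ ច្បាប់ ។".split()
noisy = add_noise(clean, seed=0)
pair = {"input": " ".join(noisy), "target": " ".join(clean)}
```

Passing a fixed `seed` makes the corruption reproducible, which helps when comparing validation loss across runs.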
### Phase 1: Bidirectional Translation Fine-tuning
Using the denoising-pretrained model, we fine-tuned on **~389,000 parallel examples** (bidirectional):
| Dataset | Pairs | Bidirectional Examples |
|---------|-------|------------------------|
| ALT Corpus | 18,088 | 36,176 |
| OPUS-100 | ~112,000 | ~224,000 |
| ParaCrawl | ~65,000 | ~130,000 |
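Each parallel pair contributes two training examples, one per direction, which is why the last column is twice the pair count. A minimal sketch, assuming pairs are stored as `(english, khmer)` tuples; the field names are hypothetical:

```python
def make_bidirectional(pairs, en_code="eng_Latn", km_code="khm_Khmr"):
    """Turn each (english, khmer) pair into two directed training examples."""
    examples = []
    for en_text, km_text in pairs:
        # English -> Khmer
        examples.append(
            {"src_lang": en_code, "tgt_lang": km_code, "source": en_text, "target": km_text}
        )
        # Khmer -> English
        examples.append(
            {"src_lang": km_code, "tgt_lang": en_code, "source": km_text, "target": en_text}
        )
    return examples


pairs = [("Human rights are protected by law.", "សិទ្ធិ មនុស្ស ត្រូវ បាន ការពារ ដោយ ច្បាប់ ។")]
examples = make_bidirectional(pairs)
print(len(examples))  # two examples per pair
```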
**Training Results:**
| Phase | Metric | Start | End | Improvement |
|-------|--------|-------|-----|-------------|
| 1.5 Denoising | Val Loss | ~4.9 | ~2.5 | -49% |
| 1 Translation | Val Loss | 3.804 | 2.586 | -32% |
**Training Configuration:**
- Epochs: 3
- Batch size: 32 effective
- Learning rate: 2e-5
- Training time: ~13.5 hours on Google Colab (A100)
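An effective batch size of 32 is typically reached by combining a smaller per-device batch with gradient accumulation. The sketch below shows one such combination; the 8 x 4 split on a single GPU is an assumption, and the dictionary keys simply mirror the parameter names of `transformers.Seq2SeqTrainingArguments`:

```python
# Hyperparameters stated in the model card.
EPOCHS = 3
LEARNING_RATE = 2e-5
EFFECTIVE_BATCH_SIZE = 32

# Assumption: one GPU, 8 examples per forward pass, 4 accumulation steps.
per_device_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1


def effective_batch_size(per_device, accumulation_steps, devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device * accumulation_steps * devices


training_kwargs = {
    "num_train_epochs": EPOCHS,
    "learning_rate": LEARNING_RATE,
    "per_device_train_batch_size": per_device_batch_size,
    "gradient_accumulation_steps": gradient_accumulation_steps,
}

assert (
    effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    == EFFECTIVE_BATCH_SIZE
)
```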
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast
import torch

# Load model and tokenizer
model_id = "ClaudBarbara/Open_Access_Khmer"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = NllbTokenizerFast.from_pretrained(model_id)


def translate(text, src_lang, tgt_lang):
    """Translate text between NLLB language codes (eng_Latn, khm_Khmr)."""
    tokenizer.src_lang = src_lang  # source-language tag added to the input
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # Force decoding to start with the target-language tag
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=512,
            num_beams=4,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# English to Khmer
result = translate("The court finds the defendant guilty.", "eng_Latn", "khm_Khmr")
print(result)

# Khmer to English
result = translate("<your_khmer_text>", "khm_Khmr", "eng_Latn")
print(result)
```
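## Translation Examples
Each example shows the source sentence, the model's translation, and a back-translation of that output into the source language (labelled **Backwards**).
### English → Khmer Tests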
**🇬🇧 EN**: The court finds the defendant guilty.<br>
**🇰🇭 KM**: តុលាការ បាន រក ឃើញ ថា ជន ជាប់ ចោទ មាន កំហុស ។<br>
**Backwards**: The court found the accused guilty.<br><br>
**🇬🇧 EN**: In addition, authorities, especially provincial authorities, are not aware of the law and apply outdated provisions to restrict NGO meetings and peaceful community demonstrations.<br>
**🇰🇭 KM**: លើស ពី នេះ ទៅ ទៀត អាជ្ញាធរ ជា ពិសេស អាជ្ញាធរ ខេត្ត មិន បាន ដឹង ពី ច្បាប់ នេះ ទេ និង អនុវត្ត សេចក្តី ព្រាង ចាស់ ៗ ដើម្បី ដាក់ កម្រិត កិច្ច ប្រជុំ របស់ អង្គ ការ មិនមែន រដ្ឋាភិបាល និង ការ ធ្វើ បាតុកម្ម ដោយ សន្តិវិធី របស់ សហគមន៍ ។<br>
**Backwards**: In addition, authorities, especially provincial authorities, are unaware of the law and apply the old regulations to restrict meetings of non-governmental organizations and peaceful demonstrations of communities.<br><br>
**🇬🇧 EN**: Human rights are protected by law.<br>
**🇰🇭 KM**: សិទ្ធិ មនុស្ស ត្រូវ បាន ការពារ ដោយ ច្បាប់ ។<br>
**Backwards**: Human rights are protected by law.<br><br>
**🇬🇧 EN**: The refugee seeks asylum in Australia.<br>
**🇰🇭 KM**: ជនភៀសខ្លួននេះបានស្វែងរកសិទ្ធិជ្រកកោននៅប្រទេសអូស្ត្រាលី។<br>
**Backwards**: The refugee sought asylum in Australia.<br><br>
**🇬🇧 EN**: Lesson 8: Ensuring that children receive information and guidance, hygiene and sanitation, nutrition, care for children, providing food and drinks, receiving vaccines, maintaining personal hygiene, learning from experiences, fostering social interactions, observing surroundings, participating in community activities, protecting children from danger and harm, and handling various situations.<br>
**🇰🇭 KM**: Lesson 8: ការធានាថាកុមារទទួលបានព័ត៌មាននិងការណែនាំ, អនាម័យនិងអនាម័យ, អាហារូបត្ថម្ភ, ការថែទាំកុមារ, ការផ្តល់អាហារនិងភេសជ្ជៈ, ការទទួលវ៉ាក់សាំង, ការថែទាំអនាម័យផ្ទាល់ខ្លួន, ការរៀនពីបទពិសោធន៍, ការលើកកម្ពស់ទំនាក់ទំនងសង្គម, ការសង្កេតមើលបរិស្ថាន, ការចូលរួមនៅក្នុងសកម្មភាពសហគមន៍, ការការពារកុមារពីគ្រោះថ្នាក់និងការប៉ះពាល់, និងការដោះស្រាយស្ថានភាពផ្សេងៗ<br>
**Backwards**: Lesson 8: Ensuring children receive information and guidance, sanitation and hygiene, nutrition, childcare, food and beverage provision, vaccination, personal hygiene, learning from experience, social communication promotion, environmental observation, participation in community activities, protecting children from harm and exposure, and addressing various situations<br><br>
### Khmer → English Tests
**🇰🇭 KM**: អនីតិជនដែលប្រព្រឹត្តបទល្មើសត្រូវទទួលបានការការពារពីមេធាវី និងត្រូវបានជំនុំជម្រះក្នុងតុលាការអនីតិជន។<br>
**🇬🇧 EN**: A minor who commits a crime is protected by a lawyer and is tried in a juvenile court.<br>
**Backwards**: យុវជន ដែល ប្រព្រឹត្ត បទ ឧក្រិដ្ឋ ត្រូវ បាន ការពារ ដោយ មេធាវី ហើយ ត្រូវ បាន កាត់ ទោស នៅ តុលាការ យុវជន ។<br><br>
**🇰🇭 KM**: ជនភៀសខ្លួនដែលមានការភ័យខ្លាចដែលមានមូលដ្ឋានត្រឹមត្រូវអំពីការធ្វើទុក្ខបុកម្នេញមានសិទ្ធិស្នើសុំសិទ្ធិជ្រកកោន។<br>
**🇬🇧 EN**: Refugee with a well-founded fear of torture is eligible for asylum.<br>
**Backwards**: ជនភៀសខ្លួនដែលមានការភ័យខ្លាចនៃការធ្វើទារុណកម្មដែលមានមូលដ្ឋានល្អមានសិទ្ធិទទួលសិទ្ធិជ្រកកោន។<br><br>
**🇰🇭 KM**: ជនជាប់ចោទមានសិទ្ធិមិនទទួលស្គាល់កំហុស និងមានសិទ្ធិទទួលបានការជំនុំជម្រះយុត្តិធម៌។<br>
**🇬🇧 EN**: Defendants have the right not to plead guilty and the right to a fair trial.<br>
**Backwards**: ជន ជាប់ ចោទ មាន សិទ្ធិ មិន ទទួល ខុស ត្រូវ និង សិទ្ធិ ទទួល បាន ការ កាត់ ទោស ដោយ យុត្តិធម៌ ។<br><br>
**🇰🇭 KM**: គោលការណ៍មិនបញ្ជូនត្រឡប់ទៅវិញហាមឃាត់រដ្ឋមិនឱ្យបណ្តេញបុគ្គលទៅប្រទេសដែលពួកគេអាចប្រឈមនឹងការធ្វើទារុណកម្ម។<br>
**🇬🇧 EN**: The non-refoulement policy prohibits states from deporting individuals to countries where they may face torture.<br>
**Backwards**: គោល នយោបាយ មិន ត្រឡប់ មក វិញ នេះ ហាម ឃាត់ រដ្ឋ មិន ឲ្យ បណ្តេញ បុគ្គល ទៅ កាន់ ប្រទេស ដែល ពួក គេ អាច ប្រឈម មុខ នឹង ការ ធ្វើ ទារុណ កម្ម ។<br><br>
**🇰🇭 KM**: គ្រប់សេចក្តីសម្រេចដែលប៉ះពាល់ដល់កុមារត្រូវគិតគូរពីអត្ថប្រយោជន៍កុមារល្អបំផុតជាចម្បង។<br>
**🇬🇧 EN**: Any decision that affects children must be based on the best interests of the child.<br>
**Backwards**: ការ សម្រេច ចិត្ត ណា មួយ ដែល ប៉ះ ពាល់ ដល់ កុមារ ត្រូវ តែ ផ្អែក លើ ផល ប្រយោជន៍ ល្អ បំផុត របស់ កុមារ ។<br><br>
**🇰🇭 KM**: ការឃុំខ្លួនមុនការជំនុំជម្រះត្រូវប្រើជាវិធានការចុងក្រោយសម្រាប់អនីតិជន។<br>
**🇬🇧 EN**: Pre-trial detention shall be used as a last resort for minors.<br>
**Backwards**: ការ ឃុំ ខ្លួន មុន ពេល កាត់ ក្តី នឹង ត្រូវ បាន ប្រើប្រាស់ ជា ដំណោះ ស្រាយ ចុង ក្រោយ សម្រាប់ អនីតិជន ។<br><br>
**🇰🇭 KM**: តុលាការត្រូវពិចារណាភស្តុតាងទាំងអស់មុនពេលសម្រេចចិត្ត ហើយត្រូវផ្តល់ហេតុផលច្បាស់លាស់សម្រាប់សាលក្រម។<br>
**🇬🇧 EN**: The court must consider all the evidence before making a decision and must give a clear reason for the verdict.<br>
**Backwards**: តុលាការ ត្រូវ តែ ពិចារណា លើ ភស្តុតាង ទាំង អស់ មុន ពេល ធ្វើ ការ សម្រេច ចិត្ត និង ត្រូវ តែ ផ្តល់ ហេតុ ផល ច្បាស់លាស់ សំរាប់ ការ កាត់ ក្តី ។<br><br>
**🇰🇭 KM**: ក្រសួងយុត្តិធម៌ត្រូវធានាថាអ្នកជាប់ឃុំទាំងអស់មានលទ្ធភាពទទួលបានជំនួយផ្លូវច្បាប់។<br>
**🇬🇧 EN**: The Justice Department must ensure that all detainees have access to legal aid.<br>
**Backwards**: ក្រសួង យុត្តិធម៌ ត្រូវ តែ ធានា ថា អ្នក ជាប់ ឃុំ ទាំង អស់ មាន សិទ្ធិ ទទួល បាន ជំនួយ ផ្លូវ ច្បាប់ ។<br>
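
A round-trip check like the one in these examples can be sketched as below, assuming the **Backwards** lines come from translating the model output back into the source language. `round_trip` and `echo_translate` are hypothetical names; in practice `translate_fn` would be the `translate` helper from the Usage section.

```python
def round_trip(text, src_lang, tgt_lang, translate_fn):
    """Translate text into tgt_lang, then translate the result back into src_lang."""
    forward = translate_fn(text, src_lang, tgt_lang)
    backward = translate_fn(forward, tgt_lang, src_lang)
    return {"source": text, "translation": forward, "backwards": backward}


# Stand-in translator so the sketch runs without downloading the model.
def echo_translate(text, src_lang, tgt_lang):
    return f"[{src_lang}->{tgt_lang}] {text}"


example = round_trip(
    "Human rights are protected by law.", "eng_Latn", "khm_Khmr", echo_translate
)
print(example["backwards"])
```

Comparing `source` with `backwards` is only a rough sanity check: a faithful back-translation does not guarantee that the intermediate Khmer is correct.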
## Known Limitations
1. **Temporal bias**: The model tends to add past-tense markers in Khmer when translating present-tense English. Phase 3 of the roadmap is intended to address this.
2. **Domain specificity**: Best results on legal and formal documents.
3. **Length**: Optimized for sentence-level translation (max 512 tokens).
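
Because the model works best at sentence level, longer documents can be split before translation. A minimal sketch, assuming sentences end with `.`, `!`, `?` or the Khmer full stop `។`; the function names are hypothetical, and this naive splitter will also break on abbreviations such as "Art. 5", so real legal text needs a more careful segmenter:

```python
import re

# A run of non-terminator characters followed by any closing punctuation.
_SENTENCE = re.compile(r"[^.!?។]+[.!?។]*")


def split_sentences(text):
    """Split text into sentences on ., !, ? and the Khmer full stop."""
    return [part.strip() for part in _SENTENCE.findall(text) if part.strip()]


def translate_document(text, src_lang, tgt_lang, translate_fn):
    """Translate a document one sentence at a time and rejoin the results."""
    sentences = split_sentences(text)
    return " ".join(translate_fn(s, src_lang, tgt_lang) for s in sentences)


print(split_sentences("The court sits. The defendant appears!"))
```

Here `translate_fn` is expected to have the same signature as the `translate` helper in the Usage section.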
## Ethical Considerations
This model is intended for humanitarian purposes. It should NOT replace certified human translators in official legal proceedings.
**Privacy**: Stateless processing - no input text is stored or logged.
## Roadmap
- [x] Phase 1.5: Denoising pre-training on 88K Khmer examples
- [x] Phase 1: Bidirectional translation fine-tuning on 389K examples
- [ ] Phase 2: LoRA fine-tuning on a legal glossary (~5,000 pairs, plus 450 terms that could serve as hard constraints)
- [ ] Phase 3: Tense augmentation to address temporal bias
## Technical References
**Datasets:**
- khPOS: Ye Kyaw Thu et al., "Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus" (ONA 2017)
- Khmer Dictionary 44K: Royal Academy of Cambodia, 2022
- ALT Corpus: Asian Language Treebank
**Model:**
- NLLB-200: Costa-jussà et al., "No Language Left Behind" (2022)
**Methodology:**
- Denoising: Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training" (2019)
## Citation
```bibtex
@misc{khmer-legal-bridge-2024,
title={Khmer Legal Bridge: Fine-tuned NLLB for Legal Translation},
author={ClaudBarbara},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/ClaudBarbara/Open_Access_Khmer}
}
```
## Acknowledgments
Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. Retrieved from https://arxiv.org/abs/2103.16801
Loem, M. (2021, May 4). Joint Khmer Word Segmentation and POS tagging. Medium. Retrieved from https://towardsdatascience.com/joint-khmer-word-segmentation-and-pos-tagging-cad650e78d30
Ye, K. T., Vichet, C., & Yoshinori, S. (2017). Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus. First Regional Conference on Optical character recognition and Natural language processing technologies for ASEAN languages (ONA 2017). Retrieved from https://github.com/ye-kyaw-thu/khPOS/blob/master/khpos.pdf
## License
CC-BY-NC-4.0
---
*Khmer Legal Bridge - Open Source Legal Translation for Refugee Communities*