Update README.md
README.md
CHANGED
|
@@ -1,3 +1,203 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: cc-by-nc-4.0
|
| 3 |
-
|
| 16 |
+
---
|
| 17 |
+
license: cc-by-nc-4.0
|
| 18 |
+
language:
|
| 19 |
+
- en
|
| 20 |
+
- km
|
| 21 |
+
base_model: facebook/nllb-200-distilled-600M
|
| 22 |
+
pipeline_tag: translation
|
| 23 |
+
tags:
|
| 24 |
+
- legal
|
| 25 |
+
- khmer
|
| 26 |
+
- translation
|
| 27 |
+
- nllb
|
| 28 |
+
- refugee
|
| 29 |
+
- humanitarian
|
| 30 |
+
- denoising
|
| 31 |
+
---
|
| 32 |
+
|
| 33 |
+
# Khmer Legal Bridge - NLLB Fine-tuned for Legal Translation
|
| 34 |
+
|
| 35 |
+
English-Khmer bidirectional translation model for legal and humanitarian documents
|
| 36 |
+
|
| 37 |
+
## Model Description
|
| 38 |
+
|
| 39 |
+
This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) optimized for legal document translation between English and Khmer. It was developed to support Cambodian refugees, asylum seekers, and legal professionals who need accurate translations of legal materials.
|
| 40 |
+
|
| 41 |
+
### Intended Use
|
| 42 |
+
|
| 43 |
+
- Translation of legal documents (court documents, asylum applications, legal handbooks)
|
| 44 |
+
- Refugee and immigration documentation
|
| 45 |
+
- Juvenile justice materials
|
| 46 |
+
- Human rights reports and policy documents
|
| 47 |
+
|
| 48 |
+
### Languages
|
| 49 |
+
|
| 50 |
+
- English (eng_Latn)
|
| 51 |
+
- Khmer (khm_Khmr)
|
| 52 |
+
|
| 53 |
+
## Evaluation Results
|
| 54 |
+
|
| 55 |
+
| Direction | chrF | BLEU |
|
| 56 |
+
|-----------|------|------|
|
| 57 |
+
| EN to KM | 53.28 | 29.38 |
|
| 58 |
+
| KM to EN | 59.68 | 34.78 |
|
| 59 |
+
|
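For readers unfamiliar with the chrF column, the following is a minimal sketch of a character n-gram F-score in the style of chrF (orders 1 to 6, beta = 2). It is an illustration only and an assumption about the metric's general shape, not the script that produced the scores above; standard tools such as sacreBLEU handle whitespace and averaging differently, so the numbers will not match exactly. The name `simple_chrf` is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it works on characters rather than words, a score of this kind does not depend on word segmentation, which is convenient for Khmer, where words are not separated by spaces.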
| 60 |
+
### Comparison with Base Model
|
| 61 |
+
|
| 62 |
+
| Direction | Metric | Base NLLB | Fine-tuned | Change |
|
| 63 |
+
|-----------|--------|-----------|------------|--------|
|
| 64 |
+
| KM to EN | chrF | 55.98 | 59.68 | **+3.70** |
|
| 65 |
+
| KM to EN | BLEU | 28.35 | 34.78 | **+6.43** |
|
| 66 |
+
| EN to KM | chrF | 54.48 | 53.28 | -1.20 |
|
| 67 |
+
|
| 68 |
+
### Key Results
|
| 69 |
+
|
| 70 |
+
- **Balanced bidirectional performance**: chrF is 53.28 for EN to KM and 59.68 for KM to EN
|
| 71 |
+
- **KM to EN significantly improved**: +3.7 chrF, +6.4 BLEU
|
| 72 |
+
- **EN to KM chrF slightly lower**: 53.28 versus 54.48 for base NLLB (-1.20), a small trade-off for the KM to EN gains
|
| 73 |
+
|
| 74 |
+
## Training Pipeline
|
| 75 |
+
|
| 76 |
+
### Phase 1.5: Denoising Pre-training
|
| 77 |
+
|
| 78 |
+
Before translation fine-tuning, we strengthened the model's Khmer understanding using a **denoising autoencoder task** on **88,000+ monolingual Khmer examples**:
|
| 79 |
+
|
| 80 |
+
| Dataset | Size | Source | Content |
|
| 81 |
+
|---------|------|--------|---------|
|
| 82 |
+
| khPOS | ~12,000 | Khmer POS Corpus | News, politics, economics - professionally segmented with POS tags |
|
| 83 |
+
| Khmer Dictionary 44K | ~44,700 | Royal Academy of Cambodia (2022) | Curated definitions, formal register |
|
| 84 |
+
|
| 85 |
+
**Denoising Task:**
|
| 86 |
+
- Input: Corrupted Khmer text (15% noise: masking, deletion, token shuffling)
|
| 87 |
+
- Output: Clean original text
|
| 88 |
+
- Both the encoder and the decoder are trained
|
| 89 |
+
|
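The three noise types listed above can be sketched roughly as follows. This is a minimal illustration, not the training code: how the 15% budget is split between operations, the `<mask>` placeholder, and the shuffle window size are all assumptions, and `add_noise` and `make_denoising_example` are hypothetical names.

```python
import random

MASK = "<mask>"  # assumed placeholder; the real mask token depends on the tokenizer

def add_noise(tokens, noise_rate=0.15, shuffle_window=3, seed=None):
    """Return a corrupted copy of `tokens` using masking, deletion, and local shuffling."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if rng.random() < noise_rate:
            # Assumption: split the noise evenly between masking and deletion
            if rng.random() < 0.5:
                noisy.append(MASK)
            # otherwise the token is dropped (deletion)
        else:
            noisy.append(tok)
    # Token shuffling: permute one short window with probability `noise_rate`
    if len(noisy) > shuffle_window and rng.random() < noise_rate:
        start = rng.randrange(len(noisy) - shuffle_window + 1)
        window = noisy[start:start + shuffle_window]
        rng.shuffle(window)
        noisy[start:start + shuffle_window] = window
    return noisy

def make_denoising_example(tokens, seed=None):
    """Pair the corrupted input with the clean target text."""
    return {"input": add_noise(tokens, seed=seed), "target": list(tokens)}
```

The model is then trained to map `input` back to `target`, so the target side is always clean Khmer text.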
| 90 |
+
### Phase 1: Bidirectional Translation Fine-tuning
|
| 91 |
+
|
| 92 |
+
Using the denoising-pretrained model, we fine-tuned on **~389,000 parallel examples** (bidirectional):
|
| 93 |
+
|
| 94 |
+
| Dataset | Pairs | Bidirectional Examples |
|
| 95 |
+
|---------|-------|------------------------|
|
| 96 |
+
| ALT Corpus | 18,088 | 36,176 |
|
| 97 |
+
| OPUS-100 | ~112,000 | ~224,000 |
|
| 98 |
+
| ParaCrawl | ~65,000 | ~130,000 |
|
| 99 |
+
|
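The "Bidirectional Examples" column is twice the pair count because each parallel pair is used once in each direction. A minimal sketch of that expansion, assuming a simple list of `(english, khmer)` tuples; the dictionary field names are hypothetical:

```python
def make_bidirectional(pairs, en_code="eng_Latn", km_code="khm_Khmr"):
    """Turn each (english, khmer) pair into two training examples, one per direction."""
    examples = []
    for en, km in pairs:
        examples.append(
            {"src_lang": en_code, "tgt_lang": km_code, "source": en, "target": km}
        )
        examples.append(
            {"src_lang": km_code, "tgt_lang": en_code, "source": km, "target": en}
        )
    return examples

# Placeholder strings stand in for real sentences
pairs = [("<english sentence>", "<khmer sentence>")] * 18_088  # ALT Corpus pair count
print(len(make_bidirectional(pairs)))  # 36176
```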
| 100 |
+
**Training Results:**
|
| 101 |
+
|
| 102 |
+
| Phase | Metric | Start | End | Improvement |
|
| 103 |
+
|-------|--------|-------|-----|-------------|
|
| 104 |
+
| 1.5 Denoising | Val Loss | ~4.9 | ~2.5 | -49% |
|
| 105 |
+
| 1 Translation | Val Loss | 3.804 | 2.586 | -32% |
|
| 106 |
+
|
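The "Improvement" column is the relative change in validation loss. As a quick check of the arithmetic (the helper name `relative_change` is hypothetical):

```python
def relative_change(start, end):
    """Relative change from `start` to `end`, in percent."""
    return (end - start) / start * 100

print(f"{relative_change(4.9, 2.5):.1f}%")      # -49.0%
print(f"{relative_change(3.804, 2.586):.1f}%")  # -32.0%
```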
| 107 |
+
**Training Configuration:**
|
| 108 |
+
- Epochs: 3
|
| 109 |
+
- Batch size: 32 effective
|
| 110 |
+
- Learning rate: 2e-5
|
| 111 |
+
- Training time: ~13.5 hours on Google Colab (A100)
|
| 112 |
+
|
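An "effective" batch size of 32 usually means the per-device batch size multiplied by the number of gradient accumulation steps and devices. The sketch below shows that relationship and a rough optimizer step count. The per-device batch size, accumulation steps, and device count are assumptions chosen only to multiply to 32, and the step count assumes all ~389,000 examples are used for training.

```python
import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    epochs: int = 3                 # from the configuration above
    learning_rate: float = 2e-5     # from the configuration above
    per_device_batch_size: int = 8  # assumption
    grad_accum_steps: int = 4       # assumption
    num_devices: int = 1            # assumption: a single Colab A100

    @property
    def effective_batch_size(self):
        return self.per_device_batch_size * self.grad_accum_steps * self.num_devices

    def optimizer_steps(self, num_examples):
        """Total optimizer updates across all epochs."""
        return math.ceil(num_examples / self.effective_batch_size) * self.epochs

cfg = TrainConfig()
print(cfg.effective_batch_size)      # 32
print(cfg.optimizer_steps(389_000))  # 36471
```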
| 113 |
+
## Usage
|
| 114 |
+
|
| 115 |
+
```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast
import torch

# Load model and tokenizer
model_id = "ClaudBarbara/Open_Access_Khmer"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = NllbTokenizerFast.from_pretrained(model_id)

def translate(text, src_lang, tgt_lang):
    # Tell the tokenizer which language code to attach to the source text
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # Force the decoder to start with the target language code
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=512,
            num_beams=4
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# English to Khmer
result = translate("The court finds the defendant guilty.", "eng_Latn", "khm_Khmr")

# Khmer to English
result = translate("<your_khmer_text>", "khm_Khmr", "eng_Latn")
```
|
| 144 |
+
|
| 145 |
+
## Known Limitations
|
| 146 |
+
|
| 147 |
+
1. **Temporal bias**: The model tends to add past-tense markers in Khmer when translating present-tense English. Phase 3 of the roadmap (tense augmentation) is intended to address this.
|
| 148 |
+
|
| 149 |
+
2. **Domain specificity**: Best results on legal and formal documents.
|
| 150 |
+
|
| 151 |
+
3. **Length**: Optimized for sentence-level translation (max 512 tokens).
|
| 152 |
+
|
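Because the model is optimized for sentence-level input, longer documents may translate better when split into sentences first. The following is a naive sketch, not part of the model's tooling: it splits after `.`, `!`, `?`, and the Khmer full stop `។` (U+17D4), and will mis-split abbreviations and decimals such as "e.g." or "3.5". `translate_fn` is assumed to have the same signature as the `translate` function in the Usage section; `split_sentences` and `translate_document` are hypothetical names.

```python
import re

# Split right after an English terminator or the Khmer full stop (U+17D4).
# The whitespace is optional because Khmer text often has no space after the stop.
_SENTENCE_END = re.compile(r"(?<=[.!?\u17d4])\s*")

def split_sentences(text):
    """Naively split text into sentences, dropping empty fragments."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]

def translate_document(text, translate_fn, src_lang, tgt_lang):
    """Translate a document one sentence at a time and return the list of outputs."""
    return [translate_fn(s, src_lang, tgt_lang) for s in split_sentences(text)]
```

Translating sentence by sentence loses cross-sentence context (for example, pronoun references), so the output should still be reviewed as a whole.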
| 153 |
+
## Ethical Considerations
|
| 154 |
+
|
| 155 |
+
This model is intended for humanitarian purposes. It should NOT replace certified human translators in official legal proceedings.
|
| 156 |
+
|
| 157 |
+
**Privacy**: Stateless processing - no input text is stored or logged.
|
| 158 |
+
|
| 159 |
+
## Roadmap
|
| 160 |
+
|
| 161 |
+
- [x] Phase 1.5: Denoising pre-training on 88K Khmer examples
|
| 162 |
+
- [x] Phase 1: Bidirectional translation fine-tuning on 389K examples
|
| 163 |
+
- [ ] Phase 2: LoRA fine-tuning on legal glossary (~450 terms)
|
| 164 |
+
- [ ] Phase 3: Tense augmentation to address temporal bias
|
| 165 |
+
|
| 166 |
+
## Technical References
|
| 167 |
+
|
| 168 |
+
**Datasets:**
|
| 169 |
+
- khPOS: Ye Kyaw Thu et al., "Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus" (ONA 2017)
|
| 170 |
+
- Khmer Dictionary 44K: Royal Academy of Cambodia, 2022
|
| 171 |
+
- ALT Corpus: Asian Language Treebank
|
| 172 |
+
|
| 173 |
+
**Model:**
|
| 174 |
+
- NLLB-200: Costa-jussà et al., "No Language Left Behind" (2022)
|
| 175 |
+
|
| 176 |
+
**Methodology:**
|
| 177 |
+
- Denoising: Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training" (2019)
|
| 178 |
+
|
| 179 |
+
## Citation
|
| 180 |
+
|
| 181 |
+
```bibtex
|
| 182 |
+
@misc{khmer-legal-bridge-2024,
|
| 183 |
+
title={Khmer Legal Bridge: Fine-tuned NLLB for Legal Translation},
|
| 184 |
+
author={ClaudBarbara},
|
| 185 |
+
year={2024},
|
| 186 |
+
publisher={HuggingFace},
|
| 187 |
+
url={https://huggingface.co/ClaudBarbara/Open_Access_Khmer}
|
| 188 |
+
}
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
## Acknowledgments
|
| 192 |
+
|
| 193 |
+
Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. Retrieved from https://arxiv.org/abs/2103.16801
|
| 194 |
+
Loem, M. (2021, May 4). Joint Khmer Word Segmentation and POS tagging. Medium. Retrieved from https://towardsdatascience.com/joint-khmer-word-segmentation-and-pos-tagging-cad650e78d30
|
| 195 |
+
Ye, K. T., Vichet, C., & Yoshinori, S. (2017). Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus. First Regional Conference on Optical character recognition and Natural language processing technologies for ASEAN languages (ONA 2017). Retrieved from https://github.com/ye-kyaw-thu/khPOS/blob/master/khpos.pdf
|
| 196 |
+
|
| 197 |
+
## License
|
| 198 |
+
|
| 199 |
+
CC-BY-NC-4.0
|
| 200 |
+
|
| 201 |
+
---
|
| 202 |
+
|
| 203 |
+
*Khmer Legal Bridge - Open Source Legal Translation for Refugee Communities*
|