GitHub | Model Download | Dataset Download | Inference Space |
A Compact Bidirectional Encoder-Decoder Transformer-Based Model for English-Khmer Translation
Inference Speed Benchmark on CPU Using Greedy, and Beam Search Decoding Strategy
1. Abstract
This repository present Netra-NMT a 90M-parameter encoder-decoder transformer-based model trained on 220 million tokens of English-Khmer parallel text (4.2M bidirectional examples). The encoder uses bidirectional self-attention, much like BERT, to capture global contextual representation. The decoder perform autoregressive generation through causal self-attention and encoder-decoder cross attention.
Unlike traditional transformer block, Netra-NMT incorporates several architectural improvements, including Pre-Layer Normalization (Pre-LN) for stable optimization, SwiGLU feed-forward networks for enhanced representational capacity, and weight tying between the decoder embedding layer and output projection head to reduce parameter redundancy.
2. Dataset
Netra-NMT was trained on 220 million tokens drawn from approximately 2.4 million unique English-Khmer sentence pairs (4.2 million examples after bidirectional augmentation). The corpus combines LLM-generated synthetic data with web-crawled parallel text, spanning legal, literary, medical, technical, and conversational domains.
2.1 Sources
| Dataset | Type | Pairs | Domains |
|---|---|---|---|
| Darayut/khmer-english-pairs-raw | Synthetic | 200K | Legal, Literary, Governmental |
| lyfeyvutha/nllb-en-km-316K | Synthetic | 316K | General |
| KrorngAI/ParaCrawl-English-Khmer-v2 | Web crawl (ParaCrawl) | 1.5M | Web / general |
| SeyhaLite/Translate-English-Khmer-All | --- | 366K | General |
| Total | 2.4M |
2.2 Preprocessing
Raw data was cleaned through the following pipeline:
- Deduplication: exact duplicate pairs removed across all sources.
- Length filtering: pairs with extreme source/target length mismatches were discarded.
- Empty/null removal: pairs where either side was empty or below a minimum token count were dropped.
After cleaning, each surviving pair is duplicated in both directions (ENβKM and KMβEN) with a direction prefix token (<2km> / <2en>), yielding ~4.2 million training examples.
3. Model Architecture
Figure 1: Overview of the Netra-NMT encoder-decoder architecture. The encoder (left) processes the source sentence with bidirectional self-attention; the decoder (right) generates the target sentence autoregressively via causal self-attention and cross-attention over the encoder output. Both sides share a 32K SentencePiece tokenizer.
Netra-NMT follows a standard encoder-decoder transformer architecture with several modifications for training stability and parameter efficiency.
Encoder takes the source sentence tokenized by the shared 32K SentencePiece tokenizer, adds learned positional embeddings, and passes the sequence through 6 transformer layers with bidirectional self-attention (every token attends to every other token, similar to BERT). A final Pre-LN layer norm is applied to the encoder output before it is passed to the decoder via cross-attention.
Decoder takes the (partially generated) target sentence through the same tokenizer, adds positional embeddings, and passes it through 6 transformer layers. Each decoder layer applies three sub-layers in order: (1) causal (masked) self-attention over previously generated tokens, (2) cross-attention over the full encoder output, and (3) a feed-forward block. A final Pre-LN layer norm feeds into the tied linear projection head to produce output token probabilities.
Architectural improvements over the vanilla transformer:
| Feature | Detail |
|---|---|
| Pre-Layer Normalization | Layer norm applied before each sub-layer (Pre-LN) rather than after, improving gradient flow and training stability |
| SwiGLU FFN | Feed-forward blocks use the SwiGLU activation instead of ReLU, providing richer representational capacity at no parameter cost |
| Weight tying | The decoder input embedding matrix is shared with the output linear projection head, reducing redundant parameters |
Hyperparameters:
| d_model | 512 |
| Encoder / Decoder layers | 6 / 6 |
| Attention heads | 8 |
| FFN hidden size | 2048 |
| Vocabulary | 32K (SentencePiece unigram, shared) |
| Total parameters | ~89.7M |
4. Evaluation Results
Install
pip install netra-nmt # core (Python API + CLI)
pip install "netra-nmt[web]" # + FastAPI web app & REST API
Or from source:
git clone https://github.com/NDarayut/netra-nmt
cd netra-nmt
pip install -e ".[web]"
The first translation downloads the weights (180 MB fp16) from the Hugging Face Hub and caches them
under `/.cache/huggingface`.
Usage
1. Python API
from netra_nmt import NetraTranslator
t = NetraTranslator() # auto-detect GPU/CPU; downloads weights once
t.translate("Hello, how are you?", direction="en2km") # β "αα½ααααΈ αα»ααααααΆαα’αα?"
t.translate("αααα»ααααα‘αΆαααααααααααααααα»αα", direction="km2en")
# Batch + decoding options
t.translate_batch(["Good morning.", "See you tomorrow."], direction="en2km")
t.translate("Good morning, my friend.", direction="en2km", mode="beam", beam_size=5)
One-shot helper (caches a default translator):
from netra_nmt import translate
translate("Hello", direction="en2km")
direction is "en2km" (EnglishβKhmer) or "km2en" (KhmerβEnglish).
mode is "greedy" (default), "beam", or "sample".
2. CLI
# Single sentence (default direction en2km):
netra-translate --text "Hello, how are you?"
# Khmer β English with beam search:
netra-translate --text "αα½ααααΈ, ααΎα’ααααα»ααααααΆααα?" --direction km2en --mode beam
# Translate a file (one sentence per line):
netra-translate --file input.txt --output output.txt --direction en2km
# Interactive REPL (omit --text / --file):
netra-translate
3. Web app + REST API (FastAPI)
netra-web # serves the web UI + API at http://127.0.0.1:8000
netra-web --port 8080 --device cpu
netra-web --local-dir export # load weights from a local export dir
A two-pane translation site (source left, output right, ENβKM swap button) plus a JSON API:
curl -X POST http://127.0.0.1:8000/api/translate \
-H 'Content-Type: application/json' \
-d '{"text": "Hello, how are you?", "direction": "en2km"}'
# {"translation": "...", "direction": "en2km"}
Requires the web extra (pip install "netra-nmt[web]").
- Downloads last month
- 122