---
license: apache-2.0
language:
- ne
- en
tags:
- translation
- nepali
- english
- multilingual
- code-mixed
- romanized
- devanagari
- onnx
pipeline_tag: translation
widget:
- text: "mero name ramesh ho"
  example_title: "Romanized Nepali"
- text: "सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।"
  example_title: "Devanagari Nepali"
- text: "what is your nam"
  example_title: "Informal English"
model-index:
- name: SETU
  results:
  - task:
      type: translation
      name: Translation
    dataset:
      type: custom
      name: Nepali-English Mixed Dataset
    metrics:
    - type: bleu
      value: 49.5
      name: BLEU
library_name: transformers
---

# SETU - Script-agnostic English Translation Unifier

SETU is a neural translation model that unifies multiscript, multilingual, and informal text into clean, formal English.

## Model Description

The SETU model can handle:
- Romanized Nepali to English translation
- Devanagari Nepali to English translation
- Code-mixed text to English translation
- Informal/slang to formal English translation

## Try It Out

🚀 **Interactive Demo**: Try SETU in Google Colab: [https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing](https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing)

## Installation

Ensure that `transformers` and `onnxruntime` are installed:

```bash
pip install transformers onnxruntime
```
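
Before loading the model, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper name, and listing `sentencepiece` is an assumption based on the tokenizer this card mentions, not a documented requirement.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "sentencepiece" is an assumption: the card names SentencePiece as the
    # tokenizer but does not list it in the install command.
    required = ["transformers", "onnxruntime", "sentencepiece"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

If anything is reported missing, installing it with `pip install <name>` should resolve the import.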

## Usage

```python
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)

# Translate text
result = model("mero name ramesh ho")
print("Translation:", result)
# Output: "My name is Ramesh."

# Works with Devanagari script too
result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
print("Translation:", result)
# Output: "Social media and reality are different."

# Handles informal text
result = model("what is your nam")
print("Translation:", result)
# Output: "what's your name"
```
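
The calls above translate one string at a time. To process several inputs, a small wrapper may be convenient. This sketch assumes only what the usage example shows: that the loaded model is callable on a single string and returns the translation. `translate_many` is a hypothetical helper, not part of the SETU API, and `fake_model` is a stand-in so the sketch runs without downloading anything.

```python
from typing import Callable, Iterable, List


def translate_many(translate: Callable[[str], str], texts: Iterable[str]) -> List[str]:
    """Apply a single-string translation callable to each non-empty input."""
    results = []
    for text in texts:
        cleaned = text.strip()
        # Skip blank entries rather than sending empty input to the model.
        if cleaned:
            results.append(translate(cleaned))
    return results


def fake_model(text: str) -> str:
    """Stand-in for the real model; it only tags the input."""
    return f"<translated: {text}>"


print(translate_many(fake_model, ["mero name ramesh ho", "  ", "what is your nam"]))
```

With the real model, passing `model` in place of `fake_model` should work as long as the callable interface matches the usage example.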

## Model Details

- **Model Type**: Neural Machine Translation
- **Architecture**: Transformer
- **Vocabulary Size**: 40,253 tokens
- **Languages Supported**: Nepali (Romanized & Devanagari), English, Code-mixed text
- **Model Format**: ONNX for efficient inference

## Technical Implementation

The model uses:
- ONNX Runtime for efficient inference
- SentencePiece for tokenization
- Beam search decoding with configurable beam size
- Separate encoder and decoder ONNX models
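
To illustrate what beam search decoding with a configurable beam size means, here is a minimal, self-contained sketch. It is not SETU's implementation: `beam_search`, `toy_step`, and the token IDs are hypothetical, and a small lookup table stands in for the ONNX decoder. In a real encoder-decoder setup, the step function would presumably run the decoder on the current prefix together with the encoder output.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = 0  # hypothetical end-of-sequence token ID


def beam_search(
    step: Callable[[List[int]], Dict[int, float]],
    beam_size: int = 4,
    max_len: int = 10,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search.

    `step(prefix)` maps each possible next token ID to its probability.
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]  # (log-prob, tokens)
    finished: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in step(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the `beam_size` best hypotheses at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break

    # Fall back to unfinished hypotheses if none reached EOS.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


def toy_step(prefix: List[int]) -> Dict[int, float]:
    """Toy next-token distribution in which the greedy first choice is a trap."""
    table = {
        (): {1: 0.6, 2: 0.4},
        (1,): {3: 0.5, 4: 0.5},
        (2,): {5: 0.9, 6: 0.1},
    }
    return table.get(tuple(prefix), {EOS: 1.0})


print(beam_search(toy_step, beam_size=1))  # [1, 3, 0]
print(beam_search(toy_step, beam_size=2))  # [2, 5, 0]
```

With a beam size of 1 the search is greedy and commits to token 1, ending with total probability 0.30. A beam size of 2 keeps the second-best prefix alive and finds the sequence with probability 0.36. The sketch omits length normalization, which production decoders often add so that shorter hypotheses are not favored.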

## Files Included

- `encoder.onnx`: ONNX encoder model
- `decoder.onnx`: ONNX decoder model
- `spm.model`: SentencePiece tokenizer model
- `spm.vocab`: SentencePiece vocabulary
- `config.json`: Model configuration
- `modeling_setu_translation.py`: Model implementation
- `configuration_setu_translation.py`: Configuration class
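
When working from a local copy of the repository, a quick check that all of the listed files are present can catch incomplete downloads. This is a sketch using only the standard library. `missing_files` is a hypothetical helper, and `./setu` is a placeholder path, not a location defined by this model card.

```python
from pathlib import Path
from typing import List

# File names taken from the list in this section.
EXPECTED_FILES = [
    "encoder.onnx",
    "decoder.onnx",
    "spm.model",
    "spm.vocab",
    "config.json",
    "modeling_setu_translation.py",
    "configuration_setu_translation.py",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected files that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "./setu" is a placeholder; point this at the local model directory.
    absent = missing_files("./setu")
    print("Missing files:", absent if absent else "none")
```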

## Citation

If you use this model, please cite:

```
@misc{setu2025,
  title={SETU: Script-agnostic English Translation Unifier},
  author={Santosh Dahal},
  year={2025}
}
```