---
license: apache-2.0
language:
- ne
- en
tags:
- translation
- nepali
- english
- multilingual
- code-mixed
- romanized
- devanagari
- onnx
pipeline_tag: translation
widget:
- text: "mero name ramesh ho"
  example_title: "Romanized Nepali"
- text: "सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।"
  example_title: "Devanagari Nepali"
- text: "what is your nam"
  example_title: "Informal English"
model-index:
- name: SETU
  results:
  - task:
      type: translation
      name: Translation
    dataset:
      type: custom
      name: Nepali-English Mixed Dataset
    metrics:
    - type: bleu
      value: 49.5
      name: BLEU
library_name: transformers
---
# SETU - Script-agnostic English Translation Unifier
SETU is a neural translation model that unifies multiscript, multilingual, and informal text into clean, formal English.
## Model Description
The SETU model can handle:
- Romanized Nepali to English translation
- Devanagari Nepali to English translation
- Code-mixed text to English translation
- Informal/slang to formal English translation
## Try It Out
🚀 **Interactive Demo**: Try SETU in Google Colab: [https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing](https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing)
## Installation
Ensure that `transformers` and `onnxruntime` are installed:
```bash
pip install transformers onnxruntime
```
## Usage
```python
from transformers import AutoModel

# Load the model. trust_remote_code=True is needed because the model class
# is defined in the repository itself (modeling_setu_translation.py).
model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)

# Translate Romanized Nepali
result = model("mero name ramesh ho")
print("Translation:", result)
# Output: "My name is Ramesh."

# Works with Devanagari script too
result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
print("Translation:", result)
# Output: "Social media and reality are different."

# Handles informal text
result = model("what is your nam")
print("Translation:", result)
# Output: "what's your name"
```
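
To translate several inputs, the call shown above can be wrapped in a small loop. The sketch below assumes only what the usage example shows: the loaded model is a callable that takes a single string and returns its translation. The helper name `translate_all` is hypothetical and not part of SETU. Whether the model accepts a list of strings directly is not documented here, so the helper calls it once per input.

```python
from typing import Callable, Iterable, List


def translate_all(translate: Callable[[str], str], texts: Iterable[str]) -> List[str]:
    """Translate each text with `translate`, one call per input.

    `translate` is assumed to behave like the `model(...)` call in the usage
    example: one string in, one translation out. Blank inputs are returned
    as empty strings without calling the model.
    """
    results = []
    for text in texts:
        cleaned = text.strip()
        results.append(translate(cleaned) if cleaned else "")
    return results
```

With the model loaded as above, this could be called as `translate_all(model, ["mero name ramesh ho", "what is your nam"])`.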
## Model Details
- **Model Type**: Neural Machine Translation
- **Architecture**: Transformer
- **Vocabulary Size**: 40,253 tokens
- **Languages Supported**: Nepali (Romanized & Devanagari), English, Code-mixed text
- **Model Format**: ONNX for efficient inference
## Technical Implementation
The model uses:
- ONNX Runtime for efficient inference
- SentencePiece for tokenization
- Beam search decoding with configurable beam size
- Separate encoder and decoder ONNX models
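
Beam search with a configurable beam size can be illustrated independently of the ONNX models. The following is a minimal sketch of the general technique, not SETU's actual decoding code. `step_fn` is a hypothetical stand-in for one decoder step: given the tokens generated so far, it returns a log-probability for each candidate next token. The names `beam_search`, `bos_id`, `eos_id`, `max_len`, and the toy table are all illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(
    step_fn: Callable[[List[int]], Dict[int, float]],
    bos_id: int,
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 20,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) -> {token_id: log_prob} for the next token.
    Scores are plain sums of log-probabilities (no length normalization).
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
    finished: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates = []
        for score, seq in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((score + logp, seq + [token]))
        if not candidates:
            break

        # Keep only the `beam_size` best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos_id:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break

    # Fall back to unfinished hypotheses if none reached EOS within max_len.
    pool = finished or beams
    _, best_seq = max(pool, key=lambda c: c[0])
    return best_seq


# Toy "decoder": token 0 = BOS, 1 = EOS, 2 and 3 are ordinary tokens.
TOY_TABLE = {
    (0,): {2: math.log(0.6), 3: math.log(0.4)},
    (0, 2): {1: math.log(0.55), 2: math.log(0.45)},
    (0, 3): {1: math.log(0.9), 3: math.log(0.1)},
}


def toy_step(prefix: List[int]) -> Dict[int, float]:
    # Prefixes not in the table simply end the sequence.
    return TOY_TABLE.get(tuple(prefix), {1: 0.0})
```

On this toy table, `beam_size=1` (greedy decoding) commits to token 2 first and ends with `[0, 2, 1]` (probability 0.33), while `beam_size=2` keeps token 3 alive and finds the more probable `[0, 3, 1]` (probability 0.36). A larger beam explores more hypotheses at a proportionally higher cost per step.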
## Files Included
- `encoder.onnx`: ONNX encoder model
- `decoder.onnx`: ONNX decoder model
- `spm.model`: SentencePiece tokenizer model
- `spm.vocab`: SentencePiece vocabulary
- `config.json`: Model configuration
- `modeling_setu_translation.py`: Model implementation
- `configuration_setu_translation.py`: Configuration class
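
If the repository is downloaded for offline use, it can help to confirm that every file listed above is present before loading the model. This is a minimal sketch using only the standard library. The function name `missing_setu_files` and the local `model_dir` argument are assumptions for illustration; the file names are taken from the list above.

```python
from pathlib import Path
from typing import List

# File names as listed in this README.
EXPECTED_FILES = [
    "encoder.onnx",
    "decoder.onnx",
    "spm.model",
    "spm.vocab",
    "config.json",
    "modeling_setu_translation.py",
    "configuration_setu_translation.py",
]


def missing_setu_files(model_dir: str) -> List[str]:
    """Return the expected files that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the local copy contains every file named in this section.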
## Citation
If you use this model, please cite:
```bibtex
@misc{setu2025,
  title={SETU: Script-agnostic English Translation Unifier},
  author={Santosh Dahal},
  year={2025}
}
```