---
license: apache-2.0
language:
- ne
- en
tags:
- translation
- nepali
- english
- multilingual
- code-mixed
- romanized
- devanagari
- onnx
pipeline_tag: translation
widget:
- text: "mero name ramesh  ho"
  example_title: "Romanized Nepali"
- text: "सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।"
  example_title: "Devanagari Nepali"
- text: "what is your nam"
  example_title: "Informal English"
model-index:
- name: SETU
  results:
  - task:
      type: translation
      name: Translation
    dataset:
      type: custom
      name: Nepali-English Mixed Dataset
    metrics:
    - type: bleu
      value: 49.5
      name: BLEU
library_name: transformers
---

# SETU - Script-agnostic English Translation Unifier

SETU is a neural translation model that unifies multiscript, multilingual, and informal text into clean, formal English.

## Model Description

The SETU model can handle:
- Romanized Nepali to English translation
- Devanagari Nepali to English translation
- Code-mixed text to English translation
- Informal/slang to formal English translation

## Try It Out

🚀 **Interactive Demo**: Try SETU in Google Colab: [https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing](https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing)

## Installation

Ensure that `transformers` and `onnxruntime` are installed:

```bash
pip install transformers onnxruntime
```

## Usage

```python
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)

# Translate text
result = model("mero name ramesh  ho")
print("Translation:", result)
# Output: "My name is Ramesh."

# Works with Devanagari script too
result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
print("Translation:", result)
# Output: "Social media and reality are different."

# Handles informal text
result = model("what is your nam")
print("Translation:", result)
# Output: "what's your name"
```

## Model Details

- **Model Type**: Neural Machine Translation
- **Architecture**: Transformer
- **Vocabulary Size**: 40,253 tokens
- **Languages Supported**: Nepali (Romanized & Devanagari), English, Code-mixed text
- **Model Format**: ONNX for efficient inference

## Technical Implementation

The model uses:
- ONNX Runtime for efficient inference
- SentencePiece for tokenization
- Beam search decoding with configurable beam size
- Separate encoder and decoder ONNX models
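
To illustrate the beam search decoding listed above, here is a minimal, library-free sketch. It is not SETU's actual implementation: `step_fn`, `toy_step`, the token ids, and the probability table are all hypothetical stand-ins for a call into the decoder ONNX model, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

BOS, EOS = 0, 1  # hypothetical special-token ids, not taken from SETU's vocabulary


def beam_search(
    step_fn: Callable[[List[int]], Dict[int, float]],
    beam_size: int = 4,
    max_len: int = 20,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search.

    step_fn maps a token prefix to {next_token: log_probability}. In a real
    pipeline it would wrap a decoder call; here it is an assumed stand-in.
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [BOS])]
    finished: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (score + logp, prefix + [token])
            for score, prefix in beams
            for token, logp in step_fn(prefix).items()
        ]
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            (finished if seq[-1] == EOS else beams).append((score, seq))
        if not beams:
            break

    # Prefer completed hypotheses; fall back to unfinished ones at max_len.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


def toy_step(prefix: List[int]) -> Dict[int, float]:
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        BOS: {2: 0.6, 3: 0.4},
        2: {4: 0.55, 5: 0.45},
        3: {4: 0.9, 5: 0.1},
        4: {EOS: 1.0},
        5: {EOS: 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}


print(beam_search(toy_step, beam_size=1))  # [0, 2, 4, 1]  (probability 0.33)
print(beam_search(toy_step, beam_size=2))  # [0, 3, 4, 1]  (probability 0.36)
```

With `beam_size=1` the search is greedy and commits to the locally best first token; a wider beam keeps the alternative alive and finds the higher-probability sequence, at the cost of more decoder calls per step.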

## Files Included

- `encoder.onnx`: ONNX encoder model
- `decoder.onnx`: ONNX decoder model
- `spm.model`: SentencePiece tokenizer model
- `spm.vocab`: SentencePiece vocabulary
- `config.json`: Model configuration
- `modeling_setu_translation.py`: Model implementation
- `configuration_setu_translation.py`: Configuration class
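
If you work from a local copy of the repository, a small check like the following can confirm that every file listed above is present before loading the model. This is only a convenience sketch: the helper name `missing_files` and the directory argument are hypothetical and not part of SETU itself.

```python
from pathlib import Path
from typing import List

# File names copied from the list above.
EXPECTED_FILES = [
    "encoder.onnx",
    "decoder.onnx",
    "spm.model",
    "spm.vocab",
    "config.json",
    "modeling_setu_translation.py",
    "configuration_setu_translation.py",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the local copy is complete; any names returned point to files that still need to be downloaded.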

## Citation

If you use this model, please cite:

```bibtex
@misc{setu2025,
  title={SETU: Script-agnostic English Translation Unifier},
  author={Santosh Dahal},
  year={2025}
}
```