license: apache-2.0
language:
  - ne
  - en
tags:
  - translation
  - nepali
  - english
  - multilingual
  - code-mixed
  - romanized
  - devanagari
  - onnx
pipeline_tag: translation
widget:
  - text: mero name ramesh  ho
    example_title: Romanized Nepali
  - text: सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।
    example_title: Devanagari Nepali
  - text: what is your nam
    example_title: Informal English
model-index:
  - name: SETU
    results:
      - task:
          type: translation
          name: Translation
        dataset:
          type: custom
          name: Nepali-English Mixed Dataset
        metrics:
          - type: bleu
            value: 49.5
            name: BLEU
library_name: transformers

SETU - Script-agnostic English Translation Unifier

SETU is a neural translation model that unifies multiscript, multilingual, and informal text into clean, formal English.

Model Description

The SETU model can handle:

Romanized Nepali to English translation
Devanagari Nepali to English translation
Code-mixed text to English translation
Informal/slang to formal English translation

Try It Out

🚀 Interactive Demo: Try SETU in Google Colab: https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing

Installation

Ensure that you have transformers and onnxruntime installed:

pip install transformers onnxruntime

Usage

from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)

# Translate text
result = model("mero name ramesh  ho")
print("Translation:", result)
# Output: "My name is Ramesh."

# Works with Devanagari script too
result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
print("Translation:", result) 
# Output: "Social media and reality are different."

# Handles informal text
result = model("what is your nam")
print("Translation:", result)
# Output: "what's your name"

Model Details

Model Type: Neural Machine Translation
Architecture: Transformer
Vocabulary Size: 40,253 tokens
Languages Supported: Nepali (Romanized & Devanagari), English, Code-mixed text
Model Format: ONNX for efficient inference

Technical Implementation

The model uses:

ONNX Runtime for efficient inference
SentencePiece for tokenization
Beam search decoding with configurable beam size
Separate encoder and decoder ONNX models
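
To illustrate the beam search decoding step listed above, here is a minimal, self-contained sketch. It is not SETU's actual implementation: `step_fn` is a hypothetical stand-in for one call to the decoder ONNX model, and `TOY_TABLE`, `toy_step`, and the token ids are invented purely for demonstration. The sketch also omits length normalization, which real decoders often apply.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(
    step_fn: Callable[[List[int]], Dict[int, float]],
    bos_id: int,
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 20,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search.

    step_fn is a hypothetical stand-in for one decoder call: given the tokens
    generated so far, it returns log-probabilities of candidate next tokens.
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
    finished: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates = [
            (score + logp, tokens + [token_id])
            for score, tokens in beams
            for token_id, logp in step_fn(tokens).items()
        ]
        if not candidates:
            break

        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((score, tokens))  # hypothesis is complete
            else:
                beams.append((score, tokens))     # hypothesis stays live
        if not beams:
            break

    # Prefer completed hypotheses; fall back to live ones if none finished.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Toy next-token distributions (assumed ids: 0 = BOS, 1 = EOS, 2 and 3 = words).
TOY_TABLE = {
    (0,): {2: math.log(0.6), 3: math.log(0.4)},
    (0, 2): {1: math.log(0.55), 3: math.log(0.45)},
    (0, 3): {1: math.log(0.95), 2: math.log(0.05)},
}


def toy_step(tokens: List[int]) -> Dict[int, float]:
    # Unknown prefixes simply end the sequence with probability 1.
    return TOY_TABLE.get(tuple(tokens), {1: 0.0})
```

With this toy table, `beam_search(toy_step, 0, 1, beam_size=1)` behaves greedily and returns `[0, 2, 1]` (probability 0.33), while `beam_size=2` keeps the initially less likely token 3 alive and returns `[0, 3, 1]` (probability 0.38). This is the trade-off a configurable beam size controls: larger beams cost more decoder calls but can recover better translations.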

Files Included

encoder.onnx: ONNX encoder model
decoder.onnx: ONNX decoder model
spm.model: SentencePiece tokenizer model
spm.vocab: SentencePiece vocabulary
config.json: Model configuration
modeling_setu_translation.py: Model implementation
configuration_setu_translation.py: Configuration class
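
If you download the repository manually, a quick check that every file listed above is present can save debugging time. The following is a small sketch using only the standard library; the helper name `missing_files` and the idea of a local `model_dir` are assumptions for illustration, not part of SETU.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "encoder.onnx",
    "decoder.onnx",
    "spm.model",
    "spm.vocab",
    "config.json",
    "modeling_setu_translation.py",
    "configuration_setu_translation.py",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("path/to/setu")` means the local copy contains every file named in this section; it does not validate the contents of those files.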

Citation

If you use this model, please cite:

@misc{setu2025,
  title={SETU: Script-agnostic English Translation Unifier},
  author={Santosh Dahal},
  year={2025}
}