---
license: apache-2.0
language:
- ne
- en
tags:
- translation
- nepali
- english
- multilingual
- code-mixed
- romanized
- devanagari
- onnx
pipeline_tag: translation
widget:
- text: "mero name ramesh ho"
  example_title: "Romanized Nepali"
- text: "सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।"
  example_title: "Devanagari Nepali"
- text: "what is your nam"
  example_title: "Informal English"
model-index:
- name: SETU
  results:
  - task:
      type: translation
      name: Translation
    dataset:
      type: custom
      name: Nepali-English Mixed Dataset
    metrics:
    - type: bleu
      value: 49.5
      name: BLEU
library_name: transformers
---

# SETU - Script-agnostic English Translation Unifier

SETU is a neural translation model that converts multi-script, multilingual, and informal text into clean, formal English.

## Model Description

The SETU model can handle:

- Romanized Nepali to English translation
- Devanagari Nepali to English translation
- Code-mixed text to English translation
- Informal/slang to formal English translation

## Try It Out

🚀 **Interactive Demo**: Try SETU in Google Colab: [https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing](https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing)

## Installation

Ensure that you have `transformers` and `onnxruntime` installed:

```bash
pip install transformers onnxruntime
```

## Usage

```python
from transformers import AutoModel

# Load the model (trust_remote_code is required because the repository
# ships its own model and configuration classes)
model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)

# Translate Romanized Nepali
result = model("mero name ramesh ho")
print("Translation:", result)
# Output: "My name is Ramesh."

# Works with Devanagari script too
result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
print("Translation:", result)
# Output: "Social media and reality are different."

# Handles informal text
result = model("what is your nam")
print("Translation:", result)
# Output: "what's your name"
```

## Model Details

- **Model Type**: Neural Machine Translation
- **Architecture**: Transformer
- **Vocabulary Size**: 40,253 tokens
- **Languages Supported**: Nepali (Romanized & Devanagari), English, Code-mixed text
- **Model Format**: ONNX for efficient inference

## Technical Implementation

The model uses:

- ONNX Runtime for efficient inference
- SentencePiece for tokenization
- Beam search decoding with configurable beam size
- Separate encoder and decoder ONNX models

## Files Included

- `encoder.onnx`: ONNX encoder model
- `decoder.onnx`: ONNX decoder model
- `spm.model`: SentencePiece tokenizer model
- `spm.vocab`: SentencePiece vocabulary
- `config.json`: Model configuration
- `modeling_setu_translation.py`: Model implementation
- `configuration_setu_translation.py`: Configuration class

## Citation

If you use this model, please cite:

```
@misc{setu2025,
  title={SETU: Script-agnostic English Translation Unifier},
  author={Santosh Dahal},
  year={2025}
}
```
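
## Appendix: Beam Search Sketch

The Technical Implementation section lists beam search decoding with a configurable beam size. As a rough illustration of that idea only, the sketch below runs beam search over a toy next-token table. It is not SETU's decoder: `toy_step`, its probability table, and the `beam_search` signature are all hypothetical, and a real decoder would call the ONNX decoder model instead of looking up a dictionary.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens so far, summed log-probability)


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical stand-in for a decoder call: next-token log-probs for a prefix."""
    table = {
        (): {"my": 0.6, "name": 0.4},
        ("my",): {"name": 0.9, "</s>": 0.1},
        ("name",): {"is": 0.5, "</s>": 0.5},
        ("my", "name"): {"is": 0.8, "</s>": 0.2},
    }
    # Prefixes missing from the toy table simply end the sentence.
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(
    step: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int = 4,
    max_len: int = 10,
    eos: str = "</s>",
) -> Tuple[List[str], float]:
    """Return the highest-scoring token sequence and its log-probability."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for tok, logp in step(prefix).items():
                hyp = (prefix + (tok,), score + logp)
                if tok == eos:
                    finished.append(hyp)    # complete hypothesis
                else:
                    candidates.append(hyp)  # still open, may be extended
        # Keep only the beam_size best open hypotheses for the next step.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len without an end token
    tokens, score = max(finished, key=lambda h: h[1])
    return [t for t in tokens if t != eos], score


tokens, score = beam_search(toy_step, beam_size=2)
print(" ".join(tokens), round(math.exp(score), 3))
```

With the toy table, the search settles on "my name is" with probability about 0.432. The sketch ranks hypotheses by raw summed log-probability; production decoders often add length normalization, which is left out here to keep the example short.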