	---
	license: apache-2.0
	language:
	- ne
	- en
	tags:
	- translation
	- nepali
	- english
	- multilingual
	- code-mixed
	- romanized
	- devanagari
	- onnx
	pipeline_tag: translation
	widget:
	- text: "mero name ramesh ho"
	  example_title: "Romanized Nepali"
	- text: "सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।"
	  example_title: "Devanagari Nepali"
	- text: "what is your nam"
	  example_title: "Informal English"
	model-index:
	- name: SETU
	  results:
	  - task:
	      type: translation
	      name: Translation
	    dataset:
	      type: custom
	      name: Nepali-English Mixed Dataset
	    metrics:
	    - type: bleu
	      value: 49.5
	      name: BLEU
	library_name: transformers
	---

	# SETU - Script-agnostic English Translation Unifier

	SETU is a neural translation model that unifies multiscript, multilingual, and informal text into clean, formal English.

	## Model Description

	The SETU model can handle:
	- Romanized Nepali to English translation
	- Devanagari Nepali to English translation
	- Code-mixed text to English translation
	- Informal/slang to formal English translation
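
To make the input categories above concrete, here is a minimal sketch that classifies a string by the scripts its letters use. It is not part of SETU (the model needs no such preprocessing), and the function name `detect_script` is hypothetical. Note that a Latin-only string may be Romanized Nepali, English, or a mix of both, so this check cannot tell those apart.

```python
# Illustrative helper only; SETU itself accepts any of these inputs directly.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F  # Unicode Devanagari block


def detect_script(text: str) -> str:
    """Classify text as 'devanagari', 'latin', 'mixed', or 'other'."""
    has_devanagari = False
    has_latin = False
    for ch in text:
        if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END:
            has_devanagari = True
        elif ch.isascii() and ch.isalpha():
            has_latin = True
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"


print(detect_script("mero name ramesh ho"))  # latin
print(detect_script("mero नाम ramesh ho"))    # mixed
```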

	## Try It Out

	🚀 Interactive Demo: Try SETU in Google Colab: [https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing](https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing)

	## Installation

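	Install `transformers` and `onnxruntime`. Because the model tokenizes with SentencePiece, the `sentencepiece` package is likely needed as well:

	```bash
	pip install transformers onnxruntime sentencepiece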
	```
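
A quick way to confirm the environment is ready is to check that the packages can be found before loading the model. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and `sentencepiece` is included as an assumption, based on the tokenizer listed under Technical Implementation.

```python
from importlib.util import find_spec


def missing_packages(names=("transformers", "onnxruntime", "sentencepiece")):
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```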

	## Usage

	```python
	from transformers import AutoModel

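	# Load the model. trust_remote_code=True is required because the repo ships
	# its own modeling code (modeling_setu_translation.py), which runs locally.
	model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)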

	# Translate text
	result = model("mero name ramesh ho")
	print("Translation:", result)
	# Output: "My name is Ramesh."

	# Works with Devanagari script too
	result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
	print("Translation:", result)
	# Output: "Social media and reality are different."

	# Handles informal text
	result = model("what is your nam")
	print("Translation:", result)
	# Output: "what's your name"
	```
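
When translating several strings, a small wrapper can skip blanks and avoid translating the same input twice. This is a sketch, assuming the model object is callable with a single string and returns a string, as in the example above. The helper name `translate_all` is hypothetical and not part of SETU.

```python
from typing import Callable, Dict, Iterable


def translate_all(translate: Callable[[str], str], texts: Iterable[str]) -> Dict[str, str]:
    """Translate each distinct, non-empty text once; map cleaned input to output."""
    results: Dict[str, str] = {}
    for text in texts:
        cleaned = text.strip()
        if not cleaned or cleaned in results:
            continue  # skip blanks and inputs that were already translated
        results[cleaned] = translate(cleaned)
    return results
```

With the model loaded as above, `translate_all(model, ["mero name ramesh ho", "what is your nam"])` would return a dictionary keyed by the input strings.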

	## Model Details

	- Model Type: Neural Machine Translation
	- Architecture: Transformer
	- Vocabulary Size: 40,253 tokens
	- Languages Supported: Nepali (Romanized & Devanagari), English, Code-mixed text
	- Model Format: ONNX for efficient inference

	## Technical Implementation

	The model uses:
	- ONNX Runtime for efficient inference
	- SentencePiece for tokenization
	- Beam search decoding with configurable beam size
	- Separate encoder and decoder ONNX models
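
To illustrate what beam search with a configurable beam size does, here is a generic, self-contained sketch. It is not SETU's decoder, which lives in `modeling_setu_translation.py` and is not reproduced here. The callable `next_log_probs` is a hypothetical stand-in for one decoder step (in SETU, presumably a call into the decoder ONNX model), and the toy probability table exists only to make the example runnable. No length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Tuple[float, List[int]]  # (cumulative log-probability, token ids)


def beam_search(
    next_log_probs: Callable[[Sequence[int]], Dict[int, float]],
    bos_id: int,
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 20,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search."""
    beams: List[Hypothesis] = [(0.0, [bos_id])]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for score, tokens in beams:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((score + log_p, tokens + [token]))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((score, tokens))  # hypothesis is complete
            else:
                beams.append((score, tokens))
        if not beams:
            break
    pool = finished or beams  # fall back to unfinished beams if none ended
    return max(pool, key=lambda c: c[0])[1]


# Toy next-token probabilities (hypothetical): 0 = BOS, 1 = EOS, 2 and 3 = words.
TOY_TABLE = {
    (0,): {2: 0.6, 3: 0.4},
    (0, 2): {1: 0.55, 3: 0.45},
    (0, 3): {1: 0.9, 2: 0.1},
    (0, 2, 3): {1: 1.0},
    (0, 3, 2): {1: 1.0},
}


def toy_next_log_probs(tokens: Sequence[int]) -> Dict[int, float]:
    return {tok: math.log(p) for tok, p in TOY_TABLE[tuple(tokens)].items()}


print(beam_search(toy_next_log_probs, bos_id=0, eos_id=1, beam_size=1))  # [0, 2, 1]
print(beam_search(toy_next_log_probs, bos_id=0, eos_id=1, beam_size=2))  # [0, 3, 1]
```

With a beam size of 1 the search is greedy and commits to token 2 (probability 0.6), ending with a sequence of probability 0.33. A beam size of 2 keeps token 3 alive and finds the sequence with probability 0.36. This is the trade-off a configurable beam size controls: wider beams cost more decoder calls but can recover better translations.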

	## Files Included

	- `encoder.onnx`: ONNX encoder model
	- `decoder.onnx`: ONNX decoder model
	- `spm.model`: SentencePiece tokenizer model
	- `spm.vocab`: SentencePiece vocabulary
	- `config.json`: Model configuration
	- `modeling_setu_translation.py`: Model implementation
	- `configuration_setu_translation.py`: Configuration class
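
If you keep a local copy of the repository, the list above can be used to check that nothing is missing before loading. A minimal sketch, assuming all files sit in one directory; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = (
    "encoder.onnx",
    "decoder.onnx",
    "spm.model",
    "spm.vocab",
    "config.json",
    "modeling_setu_translation.py",
    "configuration_setu_translation.py",
)


def missing_files(model_dir: str) -> List[str]:
    """Return the expected file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("path/to/setu")` means every expected file is present.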

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{setu2025,
	  title={SETU: Script-agnostic English Translation Unifier},
	  author={Santosh Dahal},
	  year={2025}
	}
	```