English → Telugu Translation (mBART)

This repository contains an English to Telugu neural machine translation model based on mBART-50, fine-tuned and deployed using the Hugging Face ecosystem.

The model translates English text (en) into Telugu (te) and is suitable for research, learning, and demo purposes.


🔎 Live Demo

You can test the model live using the Hugging Face Space below:

👉 https://huggingface.co/spaces/Yaser77/mbart-en-te-demo


🧠 Model Details

  • Model name: mbart-en-te
  • Base model: facebook/mbart-large-50-many-to-many-mmt
  • Task: Machine Translation (English → Telugu)
  • Framework: 🤗 Transformers
  • Model size: ~600M parameters
  • Precision: FP32
  • Tokenizer: SentencePiece (mBART tokenizer)

📌 How It Works

  • Source language: English (en_XX)
  • Target language: Telugu (te_IN)
  • The model uses forced BOS token decoding, which mBART-based translation models require to select the output language (see the snippet below)
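
To make the language-code mechanics concrete, here is a minimal sketch using the base checkpoint's tokenizer (the exact token id printed depends on the vocabulary):

from transformers import MBart50TokenizerFast

# The tokenizer from the base checkpoint knows the mBART-50 language codes.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# src_lang controls how the English input is encoded; the forced BOS id
# tells generate() to start decoding with the Telugu language token.
tokenizer.src_lang = "en_XX"
telugu_bos_id = tokenizer.lang_code_to_id["te_IN"]

# On newer transformers versions the same id can also be obtained with
# tokenizer.convert_tokens_to_ids("te_IN").
print(telugu_bos_id)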

🚀 Intended Uses

✅ Direct Use

  • Translating English text into Telugu
  • Educational demos and learning projects
  • NLP experimentation with Indic languages

🔄 Downstream Use

  • Can be fine-tuned further on domain-specific parallel corpora
  • Can be integrated into web apps or APIs for translation services (a minimal Gradio sketch follows)
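
As an illustration of the web-app integration mentioned above, a minimal Gradio wrapper might look like this (the interface labels and generation settings are illustrative, not part of the released demo):

import gradio as gr
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)
tokenizer.src_lang = "en_XX"

def translate(text: str) -> str:
    # Encode the English input and force Telugu as the output language.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
        max_length=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

demo = gr.Interface(
    fn=translate,
    inputs="text",
    outputs="text",
    title="English → Telugu (mBART)",
)
demo.launch()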

❌ Out-of-Scope Use

  • Not intended for real-time production systems without optimization
  • Not suitable for legal, medical, or safety-critical translations

⚠️ Limitations & Risks

  • Translation quality depends heavily on sentence complexity
  • May struggle with idioms, slang, or highly technical language
  • Biases present in the original training data may be reflected
  • Slower inference on CPU due to model size

🧪 How to Use the Model (Code)

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

# Encode the input as English and force decoding to start with the
# Telugu language token (mBART-50 needs this to pick the output language).
tokenizer.src_lang = "en_XX"
forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

text = "Hello, how are you?"

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128
)

# Decode the generated token ids back to Telugu text.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
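
On a machine with a GPU, inference can be sped up by moving the model and inputs to the device; a minimal variation of the snippet above, assuming CUDA is available:

import torch

# Pick a GPU if one is available, otherwise stay on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])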

📊 Training Details

Training Data

  • Based on the multilingual translation data used to train mBART-50
  • No additional custom dataset is claimed beyond fine-tuning and experimentation

Training Method

  • Sequence-to-sequence Transformer
  • Teacher forcing during training
  • Cross-entropy loss on target tokens (a minimal sketch of one training step follows)
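
To make the training setup concrete, here is a minimal sketch of a single training step on one hypothetical sentence pair (a real fine-tuning run would iterate over a parallel corpus with an optimizer). Supplying labels makes the model shift them right for teacher forcing and return the cross-entropy loss:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

base_id = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(base_id)
model = MBartForConditionalGeneration.from_pretrained(base_id)

# Language codes for the English source and Telugu target.
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

# One hypothetical parallel example (placeholder pair).
batch = tokenizer(
    "Hello, how are you?",
    text_target="నమస్కారం, మీరు ఎలా ఉన్నారు?",
    return_tensors="pt",
)

# With labels present, the model applies teacher forcing internally
# and returns the token-level cross-entropy loss.
loss = model(**batch).loss
loss.backward()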

🧾 Evaluation

  • Qualitative evaluation using common English sentences
  • No official BLEU or other automatic metrics are reported (a sketch of how BLEU could be computed follows this list)
  • Intended mainly as a demo and learning project
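
If automatic scores are wanted later, corpus-level BLEU can be computed on a held-out parallel set; a minimal sketch with the sacrebleu library (not a dependency of this repo; the sentences shown are placeholders):

import sacrebleu

# Hypothetical model outputs and one stream of reference translations.
hypotheses = ["నమస్కారం, మీరు ఎలా ఉన్నారు?"]
references = [["నమస్కారం, మీరు ఎలా ఉన్నారు?"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")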

🌍 Environmental Impact

  • Hardware: CPU (Hugging Face Spaces – Basic tier)
  • Compute Region: Managed by Hugging Face
  • Precision: FP32
  • Carbon Emissions: Not explicitly measured

🧩 Technical Specifications

  • Architecture: Transformer encoder–decoder (mBART)
  • Libraries:
    • transformers
    • torch
    • sentencepiece
    • gradio

Acknowledgements

This project was developed as part of a hands-on learning workshop conducted by Vijender P at Alumnx AI Labs during the "GPU Hours" session on fine-tuning and deploying large AI models.

The deployment, debugging, Hugging Face integration, and demo Space were independently implemented by the author as a learning exercise.


👤 Author

T Mohamed Yaser
Computer Science Engineering student
Interested in NLP, ML deployment, and real-world AI applications


📬 Contact

For questions, feedback, or collaboration, feel free to reach out to the author.
