
🕊️ TagoreX – A Bengali Text Generator Inspired by Tagore

  • Model name: SwastikGuhaRoy/TagoreX
  • Base model: GPT-2 with LoRA adapters (based on AddaGPT2.0)
  • Language: Bengali
  • Author: Swastik Guha Roy (@SwastikGuhaRoy)
  • License: MIT
  • Model size: ~124M parameters
  • Trained on: Curated (but imperfect) corpus of Rabindranath Tagore’s writings
  • Intended use: Poetic and philosophical Bengali text generation
  • Demo app: TagoreX + Gemini Streamlit App


📘 Model Description

TagoreX is a fine-tuned version of AddaGPT2.0 — a small GPT-2 model adapted for Bengali using LoRA (Low-Rank Adaptation). This model was trained on literary works of Rabindranath Tagore as a tribute.

Given a Bengali prompt, the model continues it in a Tagore-like poetic tone. It generates up to about 256 tokens; in the downstream demo application, this output can optionally be refined by Gemini AI.


🔧 Technical Details

  • Architecture: GPT-2 small (~124M parameters; the original GPT-2 release listed this size as 117M)
  • Training strategy: Full fine-tuning
  • Epochs: 22 (symbolically referencing “২২শে শ্রাবণ”, the 22nd of Shravan, the date of Tagore's death in the Bengali calendar)
  • Max sequence length: 256 tokens
  • Tokenizer: AutoTokenizer from the base model
  • Framework: PyTorch + Transformers
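
The 256-token maximum sequence length listed above implies that the training text has to be cut into pieces the model can consume. The model card does not describe how this was done, so the following is only a minimal sketch of one common approach: splitting a long stream of token IDs into fixed-size blocks. The function name `chunk_token_ids` and the `drop_last` option are hypothetical, and the integer list stands in for real tokenizer output.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int], block_size: int = 256, drop_last: bool = True
) -> List[List[int]]:
    """Split one long token-ID sequence into consecutive blocks.

    Hypothetical helper: this is NOT taken from the TagoreX training code.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids), block_size)
    ]
    # Optionally discard a trailing block that is shorter than block_size,
    # so every training example has the same length.
    if drop_last and blocks and len(blocks[-1]) < block_size:
        blocks.pop()
    return blocks


# Stand-in for tokenizer output: 600 fake token IDs.
fake_ids = list(range(600))
blocks = chunk_token_ids(fake_ids)
print(len(blocks), [len(b) for b in blocks])
```

With `drop_last=False` the short final block is kept instead, which would then need padding before batching.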

📂 Training Data

The dataset includes poems, prose, and other publicly available works by Rabindranath Tagore. It can be accessed in a consolidated .txt format from here

⚠️ Note: The data is known to contain:

  • Typos, formatting errors
  • OCR issues
  • Incomplete or duplicated lines

This model is not a scholarly curation, but an experimental artistic rendering.
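
Anyone reusing the corpus may want to reduce some of the noise listed above before training. The sketch below is one possible light-touch cleanup, assuming the corpus is read as a list of lines; it is not the preprocessing used for TagoreX, and `clean_lines` and `min_chars` are hypothetical names. It only handles whitespace, very short fragments, and repeated lines; it cannot repair OCR errors or typos.

```python
import unicodedata
from typing import Iterable, List


def clean_lines(lines: Iterable[str], min_chars: int = 2) -> List[str]:
    """Normalize whitespace and drop empty, tiny, or repeated lines.

    Hypothetical helper for illustration only.
    """
    cleaned: List[str] = []
    previous = None
    for raw in lines:
        # NFC normalization so that visually identical Bengali strings
        # built from different code point sequences compare equal.
        line = unicodedata.normalize("NFC", raw)
        # Collapse runs of whitespace and strip both ends.
        line = " ".join(line.split())
        # Skip blank lines and stray fragments such as a lone punctuation mark.
        if len(line) < min_chars:
            continue
        # Skip a line identical to the last kept line (duplicated lines).
        if line == previous:
            continue
        cleaned.append(line)
        previous = line
    return cleaned


sample = ["তুমি রবে  নীরবে ", "তুমি রবে নীরবে", "", "।", "আমার সোনার বাংলা"]
result = clean_lines(sample)
print(result)
```

Only repeats of the most recently kept line are removed here; deduplicating across the whole corpus would need a set of seen lines and could also remove intentional refrains in poems.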


🎯 Intended Use

You can use this model to:

  • Experiment with Bengali poetic text generation
  • Create creative writing prompts in Bengali
  • Explore Indic LLM capabilities in low-resource settings

This model is not suitable for:

  • Any commercial or sensitive deployment
  • Factual or linguistic accuracy tasks
  • Scholarly representation of Tagore’s works

💬 How to Prompt

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/TagoreX")
model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/TagoreX")

# A short Bengali prompt for the model to continue
prompt = "তুমি রবে নীরবে"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 256 new tokens; a temperature below 1.0 makes sampling less random
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
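
To give a feel for what `temperature=0.7` does in the call above: when sampling, the logits are divided by the temperature before the softmax, so values below 1.0 concentrate probability on the most likely tokens. The following is a small standalone illustration of that arithmetic using made-up logits, not the Transformers implementation; `softmax_with_temperature` is a hypothetical name.

```python
import math
from typing import List


def softmax_with_temperature(
    logits: List[float], temperature: float = 1.0
) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature the top candidate receives a larger share of the probability mass, which tends to make generations more focused but also more repetitive.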

🚫 Limitations & Disclaimer

  • Not aligned, filtered, or safety-trained.
  • Outputs are often incoherent, repetitive, or nonsensical.
  • This is not meant to reproduce or replace Tagore's literary work.
  • The generation reflects training data and randomness — not any human author.

🌏 Why It Matters

TagoreX demonstrates how even small-scale, open models can express poetic and cultural essence in Indic languages — using limited compute and a lot of curiosity.

It aims to inspire communities to build Indic LLMs, especially in low-resource and rural settings.

"AI doesn’t have to be massive. It can be local, soulful, and deeply human."



📫 Contact

📧 Email: swastikguharoy@googlemail.com

💬 Feedback, bugs, or nice generations? I'd love to hear from you!