
	## 🕊️ TagoreX – A Bengali Text Generator Inspired by Tagore

	Model name: `SwastikGuhaRoy/TagoreX`
	Base model: `GPT-2` with LoRA adapters [(based on `AddaGPT2.0`)](https://huggingface.co/SwastikGuhaRoy/AddaGPT2.0)
	Language: Bengali
	Author: Swastik Guha Roy (`@SwastikGuhaRoy`)
	License: MIT
	Model size: ~124M parameters
	Trained on: Curated (but imperfect) corpus of Rabindranath Tagore’s writings
	Intended use: Poetic and philosophical Bengali text generation
	Demo app: [TagoreX + Gemini Streamlit App](https://tagorexgemini.streamlit.app)

	---

	### 📘 Model Description

	TagoreX is a fine-tuned version of `AddaGPT2.0` — a small GPT-2 model adapted for Bengali using LoRA (Low-Rank Adaptation).
	This model was trained on literary works of Rabindranath Tagore as a tribute.

	The model continues a given Bengali prompt in a Tagore-like poetic tone. It generates up to ~256 new tokens, which a downstream application can optionally refine with Gemini AI.

	---

	### 🔧 Technical Details

	* Architecture: GPT-2 small (~124M parameters)
	* Training strategy: Full fine-tuning
	* Epochs: 22 (symbolically referencing “২২শে শ্রাবণ”)
	* Max sequence length: 256 tokens
	* Tokenizer: AutoTokenizer from the base model
	* Framework: PyTorch + Transformers
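
The 256-token maximum sequence length implies that the corpus has to be cut into fixed-size windows before training. This README does not say how that was done, so the following is only a minimal sketch of one common approach: non-overlapping blocks, with the short tail dropped. The names `chunk_token_ids` and `block_size` are hypothetical and not taken from the original training code.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], block_size: int = 256) -> List[List[int]]:
    """Split one long token-id sequence into non-overlapping blocks.

    Assumption: the trailing remainder shorter than block_size is dropped,
    a common choice for causal-LM training; the original pipeline may differ.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


if __name__ == "__main__":
    # Stand-in for tokenizer output; real ids would come from the model's tokenizer.
    fake_ids = list(range(600))
    blocks = chunk_token_ids(fake_ids)
    print(len(blocks), len(blocks[0]))  # 2 blocks of 256 ids each; 88 ids are dropped
```

Dropping the tail keeps every training example the same length without padding, at the cost of discarding a few tokens per document.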

	---

	### 📂 Training Data

	The dataset includes poems, prose, and other works by Rabindranath Tagore that are [publicly available](https://archive.org/details/RABINDRARACHANABALI/). [A consolidated .txt version of the dataset is available here](https://huggingface.co/datasets/SwastikGuhaRoy/WorksofTagore).

	⚠️ Note: The data does contain:

	* Typos, formatting errors
	* OCR issues
	* Incomplete or duplicated lines

	This model is not a scholarly curation, but an experimental artistic rendering.
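
For readers who want to reduce some of the noise listed above before their own experiments, here is a minimal cleaning sketch. It is an assumption about what a light preprocessing pass could look like, not the procedure used for TagoreX, and the function name `clean_lines` is hypothetical. It only normalizes Unicode and whitespace and drops blank or exactly duplicated lines; it does not attempt to repair OCR errors or typos.

```python
import re
import unicodedata
from typing import Iterable, List


def clean_lines(lines: Iterable[str]) -> List[str]:
    """Normalize whitespace and Unicode, drop blank and exact-duplicate lines."""
    seen = set()
    cleaned = []
    for raw in lines:
        # NFC puts combining sequences into one canonical form.
        line = unicodedata.normalize("NFC", raw)
        # Collapse runs of whitespace left behind by OCR or formatting.
        line = re.sub(r"\s+", " ", line).strip()
        if not line or line in seen:
            continue
        seen.add(line)
        cleaned.append(line)
    return cleaned
```

Note that exact-line deduplication also removes intentional refrains in poems, so it may be too aggressive for verse; applying it per document rather than across the whole corpus is one possible compromise.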

	---

	### 🎯 Intended Use

	You can use this model to:

	* Experiment with Bengali poetic text generation
	* Create creative writing prompts in Bengali
	* Explore Indic LLM capabilities in low-resource settings

	This model is not suitable for:

	* Any commercial or sensitive deployment
	* Factual or linguistic accuracy tasks
	* Scholarly representation of Tagore’s works

	---

	### 💬 How to Prompt

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/TagoreX")
	model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/TagoreX")

	prompt = "তুমি রবে নীরবে"
	inputs = tokenizer(prompt, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
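
Sampling settings strongly affect output quality for a model this small. The sketch below wraps the call above in a helper and adds `top_p`, `repetition_penalty`, and `no_repeat_ngram_size`, which are standard `generate` arguments in Transformers. The specific values and the name `continue_prompt` are illustrative assumptions, not settings recommended by the author.

```python
# Assumed defaults for illustration only; tune them for your prompts.
DEFAULT_GENERATION_KWARGS = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 3,
}


def continue_prompt(model, tokenizer, prompt, **overrides):
    """Generate a continuation for `prompt` using the defaults above.

    `model` and `tokenizer` are expected to be the objects loaded with
    AutoModelForCausalLM / AutoTokenizer; keyword overrides replace defaults.
    """
    kwargs = {**DEFAULT_GENERATION_KWARGS, **overrides}
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, **kwargs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the model and tokenizer loaded as shown earlier, a call such as `continue_prompt(model, tokenizer, "তুমি রবে নীরবে", temperature=0.9)` returns the decoded continuation as a string.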

	---

	### 🚫 Limitations & Disclaimer

	* Not aligned, filtered, or safety-trained.
	* Many outputs are incoherent, repetitive, or nonsensical.
	* This is not meant to reproduce or replace Tagore's literary work.
	* The generation reflects training data and randomness — not any human author.
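
Because repetitive output is one of the limitations above, it can help to screen generations automatically before reading them. The following is a minimal sketch of a distinct n-gram ratio, a simple repetition heuristic. The function name and the whitespace-based word splitting are assumptions made for illustration; this is not part of the TagoreX codebase, and any threshold for rejecting a sample would have to be chosen by hand.

```python
def distinct_ngram_ratio(text: str, n: int = 2) -> float:
    """Return unique n-grams / total n-grams over whitespace-separated words.

    Values near 1.0 mean little repetition; values near 0.0 mean heavy looping.
    Returns 1.0 when the text is too short to form any n-gram.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    print(distinct_ngram_ratio("a b a b a b"))  # 0.4: only 2 of 5 bigrams are unique
    print(distinct_ngram_ratio("a b c d"))      # 1.0: no bigram repeats
```

A low ratio flags looping text, but a high ratio does not imply the text is coherent, so this check complements rather than replaces manual review.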

	---

	### 🌏 Why It Matters

	TagoreX demonstrates how even small-scale, open models can express poetic and cultural essence in Indic languages — using limited compute and a lot of curiosity.

	It aims to inspire communities to build Indic LLMs, especially in low-resource and rural settings.

	> "AI doesn’t have to be massive. It can be local, soulful, and deeply human."

	---


	### 📫 Contact

	📧 Email: `swastikguharoy@googlemail.com`
	💬 Feedback, bugs, or nice generations? I'd love to hear from you!

	---