Commit 2ddb225 (verified) · Parent(s): d8eca4a · SwastikGuhaRoy committed

Update README.md

---
license: apache-2.0
language:
- bn
pipeline_tag: text-generation
---
## 🕊️ TagoreX – A Bengali Text Generator Inspired by Tagore

**Model name:** `SwastikGuhaRoy/TagoreX`
**Base model:** `GPT-2` with LoRA adapters ([based on `AddaGPT2.0`](https://huggingface.co/SwastikGuhaRoy/AddaGPT2.0))
**Language:** Bengali
**Author:** Swastik Guha Roy (`@SwastikGuhaRoy`)
**License:** MIT
**Model size:** ~124M parameters
**Trained on:** Curated (but imperfect) corpus of Rabindranath Tagore's writings
**Intended use:** Poetic and philosophical Bengali text generation
**Demo app:** [TagoreX + Gemini Streamlit App](https://tagorexgemini.streamlit.app)

---

### 📘 Model Description

**TagoreX** is a fine-tuned version of `AddaGPT2.0`, a small GPT-2 model adapted for Bengali using LoRA (Low-Rank Adaptation). It was trained on the literary works of Rabindranath Tagore as a tribute.

Given a Bengali prompt, the model continues it in a Tagore-like poetic tone, generating up to ~256 tokens, which can then optionally be refined by Gemini in a downstream application.

---

### 🔧 Technical Details

* **Architecture**: GPT-2 (117M parameters)
* **Training strategy**: Full fine-tuning
* **Epochs**: 22 (a symbolic reference to "২২শে শ্রাবণ", the 22nd of Shravan, the day of Tagore's passing)
* **Max sequence length**: 256 tokens
* **Tokenizer**: `AutoTokenizer` from the base model
* **Framework**: PyTorch + Transformers

---
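
The 256-token limit above means the corpus must be split into fixed-length blocks for training. A minimal, self-contained sketch of that packing step (whitespace-split "tokens" stand in for the real tokenizer, and `pack_sequences` is a hypothetical helper, not part of this repo):

```python
# Illustrative sketch: pack a corpus into fixed-length training sequences.
# Real training would use the model's tokenizer; whitespace "tokens"
# stand in here so the example stays self-contained.

def pack_sequences(text: str, max_len: int = 256) -> list[list[str]]:
    """Split a token stream into contiguous blocks of at most max_len tokens."""
    tokens = text.split()
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

corpus = "আমার সোনার বাংলা " * 200  # 600 whitespace-separated tokens
blocks = pack_sequences(corpus, max_len=256)
print(len(blocks), len(blocks[0]))  # 3 blocks; the first holds 256 tokens
```

In real fine-tuning the blocks would be produced by the tokenizer and padded or truncated to the 256-token limit; the slicing logic is the same.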

### 📂 Training Data

The dataset includes poems, prose, and other works by Rabindranath Tagore that are [publicly available](https://archive.org/details/RABINDRARACHANABALI/). A consolidated `.txt` version of the dataset is [available here](https://huggingface.co/datasets/SwastikGuhaRoy/WorksofTagore).

⚠️ **Note**: The data may contain:

* typos and formatting errors
* OCR issues
* incomplete or duplicated lines

This model is not a scholarly curation, but an experimental artistic rendering.

---
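
A cleanup pass of the kind such an OCR-derived corpus typically needs can be sketched as follows (a hypothetical illustration, not the script actually used for this model):

```python
# Hypothetical corpus cleanup: collapse stray whitespace and drop blank
# or exactly duplicated lines while preserving the original order.

def clean_corpus(lines):
    seen = set()
    cleaned = []
    for line in lines:
        line = " ".join(line.split())  # normalize runs of whitespace
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned

raw = ["আমার  সোনার বাংলা", "আমার সোনার বাংলা", "", "ও মা, ফাগুনে তোর"]
print(clean_corpus(raw))  # the duplicate and the blank line are removed
```

Exact-match deduplication like this cannot catch near-duplicates or OCR misreads, which is consistent with the imperfections noted above.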

### 🎯 Intended Use

**You can use this model to:**

* experiment with Bengali poetic text generation
* create creative-writing prompts in Bengali
* explore Indic LLM capabilities in low-resource settings

This model is **not suitable** for:

* any commercial or sensitive deployment
* tasks requiring factual or linguistic accuracy
* scholarly representation of Tagore's works

---

### 💬 How to Prompt

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/TagoreX")
model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/TagoreX")

prompt = "তুমি রবে নীরবে"  # opening line of a well-known Tagore song
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---
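
The `temperature=0.7` argument above rescales the model's logits before sampling. A pure-Python sketch of temperature-scaled sampling (illustrative only: the logits here are made up, and Transformers handles all of this internally):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_with_temperature([2.0, 1.0, 0.1])] += 1
print(counts)  # the highest-logit "token" is chosen most often
```

Temperatures below 1.0, like the 0.7 used above, sharpen the distribution toward the most likely tokens; values above 1.0 flatten it and make output more erratic.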

### 🚫 Limitations & Disclaimer

* The model is not aligned, filtered, or safety-trained.
* Outputs may be incoherent, repetitive, or nonsensical.
* It is **not** meant to reproduce or replace Tagore's literary work.
* Generations reflect the training data and sampling randomness, not any human author.

---

### 🌏 Why It Matters

TagoreX demonstrates how even small-scale, open models can express poetic and cultural essence in Indic languages, using limited compute and a lot of curiosity.

It aims to inspire communities to build **Indic LLMs**, especially in low-resource and rural settings.

> *"AI doesn't have to be massive. It can be local, soulful, and deeply human."*

---

### 📫 Contact

📧 Email: `swastikguharoy@googlemail.com`
💬 Feedback, bugs, or nice generations? I'd love to hear from you!

---