---
library_name: transformers
license: apache-2.0
datasets:
- ai4bharat/naamapadam
language:
- bn
base_model:
- openai-community/gpt2
---

# Model Card for AddaGPT 2.0

AddaGPT 2.0 is a Bengali language model based on GPT-2, fine-tuned with LoRA adapters for academic and low-resource applications. Although GPT-2 was originally trained only on English data, this model has been adapted to Bengali using the AI4Bharat NaamaPadam dataset, a corpus built for Named Entity Recognition (NER).

This project is a proof of concept exploring how small pretrained models like GPT-2 can be extended to Indic languages with low-rank adaptation (LoRA), even under limited compute (e.g., free Kaggle GPUs). It lays the groundwork for future work on adapting language models to low-bandwidth, regional, and offline-first use cases that serve local communities.

## Model Details
| **Attribute**                      | **Description**                                                                                                        |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| **Base Model**                     | GPT-2 (117M parameters)                                                                                                |
| **Fine-tuned Using**               | [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685)                                                         |
| **Language**                       | Bengali (`bn`)                                                                                                         |
| **Training Dataset**               | [`ai4bharat/naamapadam`](https://huggingface.co/datasets/ai4bharat/naamapadam) – Bengali NER corpus (train split only) |
| **Sentences Seen During Training** | \~9.6 million Bengali sentences                                                                                        |
| **Training Platform**              | Kaggle (Free T4 GPUs)                                                                            |
| **Frameworks**                     | 🤗 Transformers + PEFT (Parameter-Efficient Fine-Tuning) + Safetensors                                                 |
| **Trainable Parameters**           | 294,912                                                                                                                |
| **Total Parameters**               | 124,734,720                                                                                                            |
| **Percentage Fine-Tuned**          | 0.2364%                                                                                                                |
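
The parameter counts above correspond to a small LoRA adapter on GPT-2's attention projections. As a rough illustration (not the exact training script; the rank, alpha, dropout, and target modules below are assumptions that happen to reproduce the reported trainable-parameter count), the setup with 🤗 PEFT might look like this:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Start from the English GPT-2 base model (~124M parameters)
base_model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Hypothetical LoRA configuration: rank 8 on the fused attention projection
lora_config = LoraConfig(
    r=8,                        # assumption: 12 layers x 8 x (768 + 2304) = 294,912 params
    lora_alpha=16,              # assumption
    lora_dropout=0.05,          # assumption
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
```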



### Model Description

- **Developed by:** Swastik Guha Roy
- **Funded by:** Self-funded


### Uses

AddaGPT 2.0 is an academic proof-of-concept project designed to explore how low-resource, low-compute setups (such as Kaggle T4 GPUs) can be used to adapt pretrained language models like GPT-2 to Indic languages, specifically Bengali.

### Intended Use Cases:

- Academic research on low-rank adaptation (LoRA) for regional languages
- Language modeling experimentation in Bengali
- Demonstration of fine-tuning techniques in resource-constrained environments
- Baseline comparison for future Bengali language model development
- Educational purposes for students and ML enthusiasts working on low-resource NLP

### Intended Users:

- ML/NLP researchers exploring parameter-efficient tuning
- Students building regional language models
- Developers prototyping Bengali language tools (with limitations)
- Community contributors interested in advancing open-source Bengali AI


## Limitations

This model is not capable of generating grammatically or syntactically correct Bengali sentences. Instead, it outputs individual Bengali words or word-like tokens that are often meaningful on their own, a direct result of training on a NER-style dataset rather than full natural language text.

- This version does not produce grammatically coherent Bengali sentences.
- It is trained on a NER dataset, so it mostly outputs individual Bengali words.
- It is not yet suitable for downstream tasks such as summarization, translation, or question answering.



### How to Get Started with the Model

# Load necessary libraries
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
```

# Load the model and tokenizer
```python
model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/AddaGPT2.0")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
```

# Initialize the generation pipeline
```python
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```

# Run inference
```python
# Bengali prompt: "Rabindranath Tagore is a"
prompt = "রবীন্দ্রনাথ ঠাকুর একজন"
output = text_generator(
    prompt,
    max_new_tokens=30,   # generate up to 30 new tokens
    temperature=0.7,     # mild randomness
    top_p=0.95,          # nucleus sampling
    do_sample=True,
)

print(output[0]["generated_text"])
```
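
If the repository ships the LoRA weights as a standalone PEFT adapter rather than fully merged weights (this card does not say which), the adapter can also be attached to the base GPT-2 explicitly. A hedged sketch, assuming `SwastikGuhaRoy/AddaGPT2.0` contains the adapter files:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the English GPT-2 base model, then attach the Bengali LoRA adapter
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
model = PeftModel.from_pretrained(base, "SwastikGuhaRoy/AddaGPT2.0")

# Optionally fold the adapter into the base weights for plain Transformers inference
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
```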

## Evaluation
### Results

The model was evaluated on the validation split of the ai4bharat/naamapadam dataset to measure how well it models Bengali text.

### Metric: Perplexity (Lower is Better)
| Model                   | Validation Perplexity |
| ----------------------- | --------------------- |
| **AddaGPT 2.0**         | **25.61**             |
| Vanilla GPT-2 (English) | 144.53                |


- AddaGPT 2.0 shows a significantly lower perplexity, indicating a better fit to Bengali text.
- Vanilla GPT-2 struggles with Bengali due to the lack of Bengali data during pretraining.

### Summary
Despite the lower perplexity, the model still generates mostly isolated Bengali words rather than grammatically complete sentences, a consequence of the training data being a NER corpus.
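
For reference, here is a minimal sketch of how a validation perplexity figure like the one above can be computed with this model. The exact preprocessing, sequence length, and batching behind the reported numbers are not documented in this card, so treat it as an illustration rather than the evaluation script:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/AddaGPT2.0")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
model.eval()

def perplexity(texts):
    """Average per-token perplexity over a list of Bengali strings."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            # The causal-LM loss is the mean negative log-likelihood
            # over the shifted label positions (sequence length - 1).
            out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].size(1) - 1
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

# Replace this toy list with sentences from the naamapadam validation split
val_sentences = ["রবীন্দ্রনাথ ঠাকুর একজন কবি ছিলেন।"]
print(perplexity(val_sentences))
```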


### Citation 
If you use this model, please cite:

```bibtex
@misc{addagpt2.0,
  author       = {Swastik Guha Roy},
  title        = {AddaGPT 2.0: Bengali Finetuned GPT-2 with LoRA},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SwastikGuhaRoy/AddaGPT2.0}},
}
```