---
library_name: transformers
license: apache-2.0
datasets:
- ai4bharat/naamapadam
language:
- bn
base_model:
- openai-community/gpt2
---
# Model Card for AddaGPT 2.0
AddaGPT 2.0 is a Bengali language model based on GPT-2, fine-tuned using LoRA adapters for academic and low-resource applications. While GPT-2 was originally trained only on English data, this model has been adapted to Bengali using the AI4Bharat NaamaPadam dataset — a corpus focused on Named Entity Recognition (NER).
This project is intended as a proof of concept to explore how small, pretrained models like GPT-2 can be extended to Indic languages using low-rank adaptation (LoRA) techniques, even under limited compute settings (e.g., free Kaggle GPUs). It lays the foundation for future work on adapting language models to low-bandwidth, regional, and offline-first use cases that support local communities.
## Model Details
| **Attribute** | **Description** |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| **Base Model** | GPT-2 (117M parameters) |
| **Fine-tuned Using** | [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685) |
| **Language** | Bengali (`bn`) |
| **Training Dataset** | [`ai4bharat/naamapadam`](https://huggingface.co/datasets/ai4bharat/naamapadam) – Bengali NER corpus (train split only) |
| **Sentences Seen During Training** | \~9.6 million Bengali sentences |
| **Training Platform** | Kaggle (Free T4 GPUs) |
| **Frameworks** | 🤗 Transformers + PEFT (Parameter-Efficient Fine-Tuning) + Safetensors |
| **Trainable Parameters** | 294,912 |
| **Total Parameters** | 124,734,720 |
| **Percentage Fine-Tuned** | 0.2364% |
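
The exact LoRA hyperparameters are not listed on this card. The parameter counts above are, however, exactly what rank-8 adapters on GPT-2's fused QKV projection (`c_attn`) yield (12 layers × 8 × (768 + 2304) = 294,912), so the following PEFT configuration is one plausible sketch; the rank, alpha, dropout, and target modules are assumptions, not the documented recipe.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical configuration: rank-8 adapters on GPT-2's "c_attn" projection
# reproduce the trainable-parameter count reported in the table above.
# Alpha, dropout, and target modules are assumptions.
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

lora_config = LoraConfig(
    r=8,                       # 12 layers * 8 * (768 + 2304) = 294,912 params
    lora_alpha=16,             # assumed scaling factor
    target_modules=["c_attn"], # GPT-2's fused query/key/value projection
    lora_dropout=0.05,         # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# expected: trainable params: 294,912 || all params: 124,734,720 || trainable%: ~0.2364
```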
### Model Description
- **Developed by:** Swastik Guha Roy
- **Funded by:** Self-funded
## Uses
AddaGPT 2.0 is an academic proof-of-concept project designed to explore how low-resource, low-compute setups (such as Kaggle T4 GPUs) can be used to adapt pretrained language models like GPT-2 to Indic languages, specifically Bengali.
### Intended Use Cases
- Academic research on low-rank adaptation (LoRA) for regional languages
- Language modeling experimentation in Bengali
- Demonstration of fine-tuning techniques in resource-constrained environments
- Baseline comparison for future Bengali language model development
- Educational purposes for students and ML enthusiasts working on low-resource NLP
### Intended Users
- ML/NLP researchers exploring parameter-efficient tuning
- Students building regional language models
- Developers prototyping Bengali language tools (with limitations)
- Community contributors interested in advancing open-source Bengali AI
## Limitations
This model is not capable of generating grammatically or syntactically correct Bengali sentences. Instead, it outputs individual Bengali words or word-like tokens that are often meaningful on their own — a direct result of training on a NER-style dataset rather than full natural language text.
- This version does not produce grammatically coherent Bengali sentences.
- It is trained on a NER dataset, so it mostly outputs individual Bengali words (see the preprocessing sketch below).
- It is not yet suitable for downstream tasks such as summarization, translation, or question answering.
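
Because NaamaPadam is a token-level NER corpus, a likely (but undocumented) preprocessing step is to join each example's token list back into a plain line of text before causal-LM training, which would explain the word-level output. A minimal sketch, assuming the dataset's `bn` config and `tokens` field:

```python
from datasets import load_dataset

# Hypothetical preprocessing: flatten NaamaPadam NER examples into plain text
# lines for causal language modeling. The "bn" config name and "tokens" field
# are assumptions; the entity tags are simply discarded.
ds = load_dataset("ai4bharat/naamapadam", "bn", split="train")

def to_text(example):
    return {"text": " ".join(example["tokens"])}

lm_corpus = ds.map(to_text, remove_columns=ds.column_names)
print(lm_corpus[0]["text"])  # one whitespace-joined Bengali sentence
```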
## How to Get Started with the Model
### Load the necessary libraries
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
```
### Load the model and tokenizer
```python
model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/AddaGPT2.0")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
```
### Initialize the generation pipeline
```python
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```
### Run inference
```python
prompt = "রবীন্দ্রনাথ ঠাকুর একজন"
output = text_generator(
prompt,
max_new_tokens=30,
temperature=0.7,
top_p=0.95,
do_sample=True
)
print(output[0]["generated_text"])
```
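
The snippet above loads the repository directly with `AutoModelForCausalLM`. Since the card lists PEFT among its frameworks, the repository may hold LoRA adapter weights rather than a merged model; if so, one common alternative (sketched here under that assumption) is to attach the adapter to the base GPT-2 explicitly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Assumption: the repo contains LoRA adapter weights that can be attached
# to the original GPT-2 checkpoint with PEFT.
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
model = PeftModel.from_pretrained(base, "SwastikGuhaRoy/AddaGPT2.0")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
```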
## Evaluation
### Results
The model was evaluated on the validation split of the ai4bharat/naamapadam dataset to measure how well it models Bengali text.
#### Metric: Perplexity (Lower is Better)
| Model | Validation Perplexity |
| ----------------------- | --------------------- |
| **AddaGPT 2.0** | **25.61** |
| Vanilla GPT-2 (English) | 144.53 |
- AddaGPT 2.0 shows a significantly lower perplexity, indicating a better fit to Bengali text.
- Vanilla GPT-2 struggles with Bengali due to the lack of Bengali data during pretraining.
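
The exact evaluation script is not published here; the following is a minimal sketch of how per-token perplexity on the validation split could be computed, with the dataset config, field names, and subsampling as assumptions:

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("SwastikGuhaRoy/AddaGPT2.0")
tokenizer = AutoTokenizer.from_pretrained("SwastikGuhaRoy/AddaGPT2.0_tokenizer")
model.eval()

# NaamaPadam stores token lists; join them into plain sentences (assumption).
val = load_dataset("ai4bharat/naamapadam", "bn", split="validation")

total_nll, total_tokens = 0.0, 0
for example in val.select(range(min(1000, len(val)))):  # subsample for speed
    text = " ".join(example["tokens"])
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    n = enc["input_ids"].size(1)
    if n < 2:
        continue
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token
    total_nll += out.loss.item() * (n - 1)
    total_tokens += n - 1

print(f"Validation perplexity: {math.exp(total_nll / total_tokens):.2f}")
```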
### Summary
Despite lower perplexity, the model still generates mostly isolated Bengali words, not grammatically complete sentences (due to the nature of the training dataset — a NER corpus).
## Citation
If you use this model, please cite:
```bibtex
@misc{addagpt2.0,
author = {Swastik Guha Roy},
title = {AddaGPT 2.0: Bengali Finetuned GPT-2 with LoRA},
year = 2025,
howpublished = {\url{https://huggingface.co/SwastikGuhaRoy/AddaGPT2.0}},
}
```