---
license: mit
language:
- id
base_model:
- cahya/gpt2-small-indonesian-522M
tags:
- instruct-tuned
---

# GPT2-Small Indonesian Chat Instruct-Tuned Model

An Indonesian conversational AI model fine-tuned from the 124M-parameter GPT-2 small (`cahya/gpt2-small-indonesian-522M`) using instruction-following techniques to enable chat-style interactions.

## Model Overview

This model transforms a base Indonesian GPT-2 text generator into a conversational chatbot capable of following instructions and engaging in question-answering dialogues in Bahasa Indonesia.

- **Base Model**: `cahya/gpt2-small-indonesian-522M` (GPT-2 small, 124M parameters)
- **Fine-tuning Method**: SFT with LoRA (adapter merged into the base weights)
- **Datasets**: `indonesian-nlp/wikipedia-id`, `FreedomIntelligence/evol-instruct-indonesian`, `FreedomIntelligence/sharegpt-indonesian`
- **Language**: Indonesian (Bahasa Indonesia)
- **Task**: Conversational AI / Chat Completion

## Project Background

This model was fine-tuned as part of my personal learning journey in AI and LLMs. The training was done entirely on Google Colab (free tier, T4 GPU) as an exercise in building Indonesian conversational AI with limited resources.

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Set up the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load model and tokenizer
model_path = "IzzulGod/GPT2-Small-Indonesian"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)

# Generate a response
prompt = "User: Siapa presiden pertama Indonesia?\nAI:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Example Output

```
User: Siapa presiden pertama Indonesia?
AI: Presiden pertama Indonesia adalah Soekarno. Sukarno dikenal sebagai seorang pemimpin yang sangat dihormati dan dicintai oleh rakyatnya, terutama di kalangan rakyat Indonesia karena perananya dalam membentuk persatuan bangsa Indonesia. Dia juga dianggap sebagai sosok kunci bagi seluruh masyarakat Indonesia untuk mempertahankan kemerdekaan negara tersebut dari penjajahan Belanda.
```

## Model Capabilities

- **Question Answering**: Responds to factual questions in Indonesian
- **Instruction Following**: Follows a range of instructions and tasks
- **Conversational Context**: Maintains context in chat-style interactions
- **Code Generation**: Can generate simple code snippets (R, Python, etc.) with Indonesian explanations

## Training Details

### Dataset

This model was trained on conversation data in the following format:

```json
[
  {
    "from": "human",
    "value": "Question or instruction in Indonesian"
  },
  {
    "from": "gpt",
    "value": "Detailed response in Indonesian"
  }
]
```

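For training and inference, each conversation in this format can be flattened into the `User:`/`AI:` prompt style shown in Quick Start. A minimal sketch of that preprocessing step (the helper name and exact formatting are illustrative assumptions, not the card's actual training code):

```python
# Hypothetical preprocessing helper: flattens a ShareGPT-style conversation
# (a list of {"from", "value"} turns) into the "User: ... / AI: ..." text
# format this model is prompted with.
ROLE_MAP = {"human": "User", "gpt": "AI"}

def conversation_to_text(turns):
    return "\n".join(f"{ROLE_MAP[t['from']]}: {t['value']}" for t in turns)

example = [
    {"from": "human", "value": "Siapa presiden pertama Indonesia?"},
    {"from": "gpt", "value": "Presiden pertama Indonesia adalah Soekarno."},
]
print(conversation_to_text(example))
```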
### Training Configuration

The model was fine-tuned using LoRA (Low-Rank Adaptation), injecting adapters across the key GPT-2 attention and MLP layers:

**LoRA Configuration:**
- `r`: 64 (rank)
- `lora_alpha`: 128
- `target_modules`: ["c_attn", "c_proj", "mlp.c_fc", "mlp.c_proj"]
- `lora_dropout`: 0.05
- `bias`: "none"

**Training Arguments:**
- `epochs`: 3
- `batch_size`: 16 per device
- `gradient_accumulation_steps`: 2
- `learning_rate`: 2e-4
- `scheduler`: cosine
- `weight_decay`: 0.01
- `fp16`: enabled

### Training Results

```
Final Training Loss: 2.692
Total Steps: 2,766
Training Time: ~1h 45m
```

The model showed consistent improvement, with loss decreasing from 3.44 to 2.51 over the training period.

## Advanced Usage

### Custom Generation Parameters

```python
# For more creative responses
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.3,
    pad_token_id=tokenizer.eos_token_id,
)

# For more focused responses
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.4,
    top_p=0.95,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
)
```

### Prompt Format

The model expects prompts in the following format:

```
User: [Your question or instruction in Indonesian]
AI:
```

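For multi-turn use, earlier turns can be concatenated in the same format, ending with `AI:` so the model continues with its answer. The card does not specify multi-turn behavior, so treat long histories as experimental; the helper below is a hypothetical convenience, not part of the model's API:

```python
# Hypothetical helper: renders a chat history plus a new question into
# the "User:/AI:" prompt format, ending with "AI:" for the model to complete.
def format_prompt(history, question):
    lines = []
    for user_msg, ai_msg in history:
        lines.append(f"User: {user_msg}")
        lines.append(f"AI: {ai_msg}")
    lines.append(f"User: {question}")
    lines.append("AI:")
    return "\n".join(lines)

print(format_prompt([], "Siapa presiden pertama Indonesia?"))
```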
## Limitations

- **Knowledge Base**: The base model was trained primarily on Wikipedia data (`indonesian-nlp/wikipedia-id` by [Cahya](https://huggingface.co/cahya)), which provides general factual knowledge but few real-world conversational patterns
- **Training Data Scope**: The current fine-tuning targets general instruction following and Q&A rather than natural daily conversation
- **Conversational Style**: Responses may feel formal or academic due to the Wikipedia-based foundation and instruction-tuned nature
- **Model Size**: At 124M parameters the model is relatively small, which limits complex reasoning
- **Factual Accuracy**: Responses are generated from training data and may not always be accurate or up to date
- **Language Optimization**: Best performance is achieved with Indonesian-language inputs
- **Response Consistency**: May occasionally generate repetitive or inconsistent responses

## Future Improvements

For enhanced conversational naturalness, consider:
- **Conversational Dataset Training**: Fine-tuning on Indonesian daily-conversation datasets
- **Lighter LoRA Configuration**: Using more efficient LoRA parameters for conversation-specific training
- **Multi-turn Dialogue**: Training on multi-turn conversation data for better context handling
- **Informal Language Patterns**: Incorporating colloquial Indonesian expressions and casual speech patterns

## License

This model is released under the MIT License. See the LICENSE file for details.

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{izzulgod2025gpt2indochat,
  title        = {GPT2-Small Indonesian Chat Instruct-Tuned Model},
  author       = {IzzulGod},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/IzzulGod/GPT2-Small-Indonesian}},
}
```

---

*Disclaimer: This model was developed as an experimental project for learning purposes. While it performs well on basic tasks, it may have limitations in reasoning and real-world usage.*