---
license: mit
language:
- id
base_model:
- cahya/gpt2-small-indonesian-522M
tags:
- instruct-tuned
datasets:
- indonesian-nlp/wikipedia-id
- FreedomIntelligence/evol-instruct-indonesian
- FreedomIntelligence/sharegpt-indonesian
---
# GPT2-Small Indonesian Instruct-Tuned Model
An Indonesian conversational AI model fine-tuned from `GPT2-Small (124M Parameters)` using instruction-following techniques to enable natural chat-like interactions in Bahasa Indonesia.
## Model Overview
This model transforms a base Indonesian GPT-2 text generator into a conversational chatbot capable of following instructions and engaging in question-answering dialogues. The model has been specifically optimized for Indonesian language understanding and generation.
- **Base Model**: `GPT2-Small (124M Parameters)`
- **Fine-tuning Method**: SFT-LoRA (Supervised Fine-Tuning with Low-Rank Adaptation)
- **Training Datasets**:
- `indonesian-nlp/wikipedia-id` (knowledge base)
- `FreedomIntelligence/evol-instruct-indonesian` (instruction following)
- `FreedomIntelligence/sharegpt-indonesian` (conversational patterns)
- **Primary Language**: Indonesian (Bahasa Indonesia)
- **Task**: Conversational AI / Chat Completion
- **License**: MIT
## Project Background
This model was developed as part of a personal learning journey in AI and Large Language Models (LLMs). The entire training process was conducted on **Google Colab's** free tier using a **T4 GPU**, demonstrating how to build effective Indonesian conversational AI with limited computational resources.
The project focuses on creating an accessible Indonesian language model that can understand context, follow instructions, and provide helpful responses in natural Bahasa Indonesia.
## Quick Start
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Load model and tokenizer
model_path = "izzulgod/gpt2-indo-instruct-tuned"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
# Create prompt
prompt = "User: Siapa presiden pertama Indonesia?\nAI:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>")
    )
# Decode response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
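Note that `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded string repeats the full prompt. Continuing the example above, one simple way to keep only the model's reply:
```python
# The decoded output repeats the prompt, so slice it off to keep only the AI's reply
reply = response[len(prompt):].strip()
print(reply)
```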
## Model Capabilities
The model demonstrates strong performance across several Indonesian language tasks:
- **Question Answering**: Provides accurate responses to factual questions in Indonesian
- **Instruction Following**: Capable of understanding and executing various instructions and tasks
- **Conversational Context**: Maintains coherent context throughout chat-like interactions
- **Code Generation**: Can generate simple code snippets (Python, R, etc.) with clear Indonesian explanations
- **Educational Content**: Explains complex concepts in accessible Indonesian language
- **Cultural Awareness**: Understands Indonesian cultural context and references
## Training Details
### Dataset Composition
The model was trained on a carefully curated combination of datasets to balance knowledge, instruction-following, and conversational abilities:
**Training Data Format:**
```json
[
{
"from": "human",
"value": "Question or instruction in Indonesian"
},
{
"from": "gpt",
"value": "Detailed and helpful response in Indonesian"
}
]
```
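The exact preprocessing script is not published with this card; a minimal sketch of how such a conversation list could be flattened into the `User:`/`AI:` format used at inference time (role labels as shown above) might look like this:
```python
def conversation_to_text(conversation):
    """Flatten a list of {"from", "value"} turns into a single training string.

    Assumes the role labels shown above: "human" -> "User", "gpt" -> "AI".
    """
    role_map = {"human": "User", "gpt": "AI"}
    lines = [f"{role_map[turn['from']]}: {turn['value']}" for turn in conversation]
    # Append the end-of-text token so the model learns where a reply stops
    return "\n".join(lines) + "<|endoftext|>"

example = [
    {"from": "human", "value": "Siapa presiden pertama Indonesia?"},
    {"from": "gpt", "value": "Presiden pertama Indonesia adalah Soekarno."},
]
print(conversation_to_text(example))
```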
### Training Configuration
The model was fine-tuned using LoRA (Low-Rank Adaptation) technique, which allows efficient training while preserving the base model's capabilities:
**LoRA Configuration:**
- **Rank (r)**: 64 - Higher rank for better adaptation capacity
- **Alpha**: 128 - Scaling factor for LoRA weights
- **Target Modules**: `["c_attn", "c_proj", "mlp.c_fc", "mlp.c_proj"]` - Key transformer components
- **Dropout**: 0.05 - Regularization to prevent overfitting
- **Bias**: "none" - Focus adaptation on weight matrices
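For reference, this configuration would correspond roughly to the following `peft` setup. This is a sketch under the assumptions listed above, not the exact training script used for this release:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("cahya/gpt2-small-indonesian-522M")

lora_config = LoraConfig(
    r=64,                      # Rank: higher rank for better adaptation capacity
    lora_alpha=128,            # Scaling factor for LoRA weights
    target_modules=["c_attn", "c_proj", "mlp.c_fc", "mlp.c_proj"],
    lora_dropout=0.05,         # Regularization to prevent overfitting
    bias="none",               # Adapt only the weight matrices
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Shows how few parameters LoRA actually trains
```
Training only the LoRA adapters keeps the number of updated parameters small enough to fit comfortably within the memory of a free-tier T4 GPU.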
**Training Hyperparameters:**
- **Epochs**: 3 - Sufficient for convergence without overfitting
- **Batch Size**: 16 per device - Optimized for T4 GPU memory
- **Gradient Accumulation**: 2 steps - Effective batch size of 32
- **Learning Rate**: 2e-4 - Conservative rate for stable training
- **Scheduler**: Cosine annealing - Smooth learning rate decay
- **Weight Decay**: 0.01 - L2 regularization
- **Mixed Precision**: FP16 enabled - Memory and speed optimization
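These hyperparameters map roughly onto the following `transformers.TrainingArguments`; the `output_dir` and `logging_steps` values are illustrative assumptions (the 200-step interval is inferred from the loss log below):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-indo-instruct-lora",   # Hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,           # Effective batch size of 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",              # Cosine annealing decay
    weight_decay=0.01,
    fp16=True,                               # Mixed precision on the T4 GPU
    logging_steps=200,                       # Matches the loss log shown below
)
```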
### Training Progress
The model showed consistent improvement throughout training:
```
Training Progress (5535 total steps over 3 epochs):
Step Training Loss
200 3.533500 # Initial high loss
400 2.964200 # Rapid initial improvement
...
4000 2.416200 # Stable convergence
...
5400 2.397500 # Final optimized loss
Final Metrics:
- Training Loss: 2.573
- Training Time: 3.5 hours
- Samples per Second: 14.049
- Total Training Samples: ~177k
```
The steady decrease from 3.53 to 2.39 demonstrates effective learning and adaptation to the Indonesian instruction-following task.
## Advanced Usage
### Generation Parameter Tuning
**For Creative Responses:**
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=256,        # Longer responses
    do_sample=True,            # Required for temperature/top_p to take effect
    temperature=0.8,           # More randomness
    top_p=0.9,                 # Diverse vocabulary
    repetition_penalty=1.2,    # Avoid repetition
    pad_token_id=tokenizer.eos_token_id
)
```
**For Focused/Factual Responses:**
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=128,        # Concise responses
    do_sample=True,            # Required for temperature/top_p to take effect
    temperature=0.6,           # More deterministic
    top_p=0.95,                # High-quality tokens
    repetition_penalty=1.1,    # Mild repetition control
    pad_token_id=tokenizer.eos_token_id
)
```
### Prompt Engineering
**Recommended Format:**
```
User: [Your question or instruction in Indonesian]
AI: [Expected response starts here]
```
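For multi-turn conversations, earlier turns can be concatenated in the same format before a final `AI:` marker, which the model then continues. A small illustrative helper (not part of the released code):
```python
def build_prompt(history, user_message):
    """Build a User:/AI: style prompt from prior (user, ai) turns plus a new question."""
    lines = []
    for user_turn, ai_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"AI: {ai_turn}")
    lines.append(f"User: {user_message}")
    lines.append("AI:")  # The model continues from here
    return "\n".join(lines)

prompt = build_prompt(
    history=[("Siapa presiden pertama Indonesia?", "Presiden pertama Indonesia adalah Soekarno.")],
    user_message="Kapan ia dilantik?",
)
```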
## Limitations and Considerations
**Knowledge Limitations:**
- **Training Data Cutoff**: Knowledge is limited to the training datasets, primarily Wikipedia-based information
- **Factual Accuracy**: Generated responses may not always be factually accurate or up-to-date
- **Real-time Information**: Cannot access current events or real-time data
**Technical Limitations:**
- **Model Size**: With 124M parameters, complex reasoning capabilities are limited compared to larger models
- **Context Length**: Limited context window may affect very long conversations
- **Language Specialization**: Optimized primarily for Indonesian; other languages may produce suboptimal results
**Response Characteristics:**
- **Formality**: Responses may occasionally sound formal due to Wikipedia-based training data
- **Consistency**: May generate repetitive patterns or inconsistent information across sessions
- **Cultural Nuances**: While trained on Indonesian data, may miss subtle cultural references or regional variations
## Future Development Roadmap
**Short-term Improvements:**
- Fine-tuning with more diverse conversational datasets
- Integration of current Indonesian news and cultural content
- Specialized domain adaptations (education, healthcare, business)
## License
This model is released under the MIT License. Please see the LICENSE file for complete terms.
## Acknowledgments
- **Base Model**: Thanks to [Cahya](https://huggingface.co/cahya) for the Indonesian GPT-2 base model
- **Datasets**: FreedomIntelligence team for Indonesian instruction and conversation datasets
- **Infrastructure**: Google Colab for providing accessible GPU resources for training
---
**Disclaimer**: This model was developed as an experimental project for educational and research purposes. While it demonstrates good performance on various tasks, users should validate outputs for critical applications and be aware of the limitations outlined above.