Safetensors
Indonesian
gpt2
instruct-tuned
izzulgod committed on
Commit 61affee · verified · 1 Parent(s): 72d10ab

Update README.md

Files changed (1)
  1. README.md +131 -108
README.md CHANGED
@@ -14,24 +14,38 @@ datasets:

# GPT2-Small Indonesian Instruct-Tuned Model

- An Indonesian conversational AI model fine-tuned from `GPT2-Small(124M Parameters)` using instruction-following techniques to enable chat-like interactions.

## 📋 Model Overview

- This model transforms a base Indonesian GPT-2 text generator into a conversational chatbot capable of following instructions and engaging in question-answering dialogues in Bahasa Indonesia.

- - **Base Model**: `GPT2-Small`
- - **Fine-tuning Method**: SFT-LoRA (merged adapter)
- - **Dataset**: indonesian-nlp/wikipedia-id, FreedomIntelligence/evol-instruct-indonesian, FreedomIntelligence/sharegpt-indonesian
- - **Language**: Indonesian (Bahasa Indonesia)
- **Task**: Conversational AI / Chat Completion

## 🧪 Project Background

- This model was fine-tuned as part of my personal learning journey in AI and LLMs. The training was done entirely on Google Colab (free tier, T4 GPU), as an exercise in building Indonesian conversational AI with limited resources.

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
@@ -40,16 +54,16 @@ import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

- # Load model dan tokenizer
model_path = "IzzulGod/GPT2-Indo-Instruct-Tuned"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)

- # Prompt
prompt = "User: Siapa presiden pertama Indonesia?\nAI:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

- # Generate output
with torch.no_grad():
    outputs = model.generate(
        **inputs,
@@ -57,162 +71,163 @@ with torch.no_grad():
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
-         repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
-         eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>")  # <== ini penting
    )

- # Decode respons
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
-
```

### Example Output

```
User: Siapa presiden pertama Indonesia?
- AI: Presiden pertama Indonesia adalah Soekarno. Sukarno dikenal sebagai seorang pemimpin yang sangat dihormati dan dicintai oleh rakyatnya, terutama di kalangan rakyat Indonesia karena perananya dalam membentuk persatuan bangsa Indonesia. Dia juga dianggap sebagai sosok kunci bagi seluruh masyarakat Indonesia untuk mempertahankan kemerdekaan negara tersebut dari penjajahan Belanda.
```

## 🎯 Model Capabilities

- - **Question Answering**: Responds to factual questions in Indonesian
- - **Instruction Following**: Capable of following various instructions and tasks
- - **Conversational Context**: Maintains context in chat-like interactions
- - **Code Generation**: Can generate simple code snippets (R, Python, etc.) with Indonesian explanations

## 📊 Training Details

- ### Dataset

- This model was trained on a Evol-Instruct and ShareGPT dataset containing conversation data in the following format:

```json
[
  {
-     "from": "human",
    "value": "Question or instruction in Indonesian"
  },
  {
-     "from": "gpt",
-     "value": "Detailed response in Indonesian"
  }
]
```

### Training Configuration

- The model was fine-tuned using LoRA (Low-Rank Adaptation) with aggressive parameter injection across key GPT-2 layers:

**LoRA Configuration:**
- - `r`: 64 (rank)
- - `lora_alpha`: 128
- - `target_modules`: ["c_attn", "c_proj", "mlp.c_fc", "mlp.c_proj"]
- - `lora_dropout`: 0.05
- - `bias`: "none"
-
- **Training Arguments:**
- - `epochs`: 3
- - `batch_size`: 16 per device
- - `gradient_accumulation_steps`: 2
- - `learning_rate`: 2e-4
- - `scheduler`: cosine
- - `weight_decay`: 0.01
- - `fp16`: enabled
-
- ### Training Results

```
- [5535/5535 3:29:59, Epoch 3/3]
Step    Training Loss
- 200     3.533500
- 400     2.964200
- 600     2.847200
- 800     2.772600
- 1000    2.717300
- 1200    2.671700
- 1400    2.651500
- 1600    2.623400
- 1800    2.586100
- 2000    2.551900
- 2200    2.533900
- 2400    2.523000
- 2600    2.510900
- 2800    2.490900
- 3000    2.482600
- 3200    2.476900
- 3400    2.471900
- 3600    2.455300
- 3800    2.444100
- 4000    2.416200
- 4200    2.407400
- 4400    2.412600
- 4600    2.416100
- 4800    2.419000
- 5000    2.408800
- 5200    2.406000
- 5400    2.397500
- TrainOutput(global_step=5535, training_loss=2.5733587828431994, metrics={'train_runtime': 12603.3708, 'train_samples_per_second': 14.049, 'train_steps_per_second': 0.439, 'total_flos': 5.139926052293837e+16, 'train_loss': 2.5733587828431994, 'epoch': 3.0})
```

- The model showed consistent improvement with loss decreasing from 3.53 to 2.39 over the training period.

## 🔧 Advanced Usage

- ### Custom Generation Parameters

```python
- # For more creative responses
outputs = model.generate(
    **inputs,
-     max_new_tokens=256,
-     do_sample=True,
-     temperature=0.8,
-     top_p=0.9,
-     repetition_penalty=1.2
)

- # For more focused responses
outputs = model.generate(
    **inputs,
-     max_new_tokens=128,
-     do_sample=True,
-     temperature=0.6,
-     top_p=0.95,
-     repetition_penalty=1.1
)
```

- ### Prompt Format

- The model expects prompts in the following format:
```
- User: Pertanyaan dari user
- AI: Jawaban dari AI <|endoftext|>
```

- ## ⚠️ Limitations

- - **Knowledge Base**: The base model was trained primarily on Wikipedia data: `indonesian-nlp/wikipedia-id` by [Cahya](https://huggingface.co/cahya), providing general factual knowledge but limited real-world conversational patterns
- - **Training Data Scope**: Current fine-tuning focuses on general instruction-following and Q&A rather than natural daily conversations
- - **Conversational Style**: Responses may feel formal or academic due to the Wikipedia-based foundation and instruction-tuned nature
- - **Model Size**: Relatively small (124M Parameters), which may limit complex reasoning capabilities
- - **Factual Accuracy**: Responses are generated based on training data and may not always be factually accurate or up-to-date
- - **Language Optimization**: Best performance is achieved with Indonesian language inputs
- - **Response Consistency**: May occasionally generate repetitive or inconsistent responses

- ## 🚀 Future Improvements

- For enhanced conversational naturalness, consider:
- - **Conversational Dataset Training**: Fine-tuning with Indonesian daily conversation datasets
- - **Lighter LoRA Configuration**: Using more efficient LoRA parameters for conversation-specific training
- - **Multi-turn Dialogue**: Training on multi-turn conversation data for better context handling
- - **Informal Language Patterns**: Incorporating colloquial Indonesian expressions and casual speech patterns

## 📝 License

- This model is released under the MIT License. See the LICENSE file for details.

## 📚 Citation

@@ -220,12 +235,20 @@ If you use this model in your research or applications, please cite:

```bibtex
@misc{izzulgod2025gpt2indochat,
-   title = {GPT2-Small Indonesian Instruct-Tuned Model},
-   author = {IzzulGod},
-   year = {2025},
-   howpublished = {\url{https://huggingface.co/IzzulGod/GPT2-Indo-Instruct-Tuned}},
}
```

---

- *Disclaimer: This model was developed as an experimental project for learning purposes. While it performs well on basic tasks, it may have limitations in reasoning and real-world usage.*

# GPT2-Small Indonesian Instruct-Tuned Model

An Indonesian conversational AI model fine-tuned from `GPT2-Small` (124M parameters) using instruction-following techniques to enable natural chat-like interactions in Bahasa Indonesia.

## 📋 Model Overview

This model transforms a base Indonesian GPT-2 text generator into a conversational chatbot capable of following instructions and engaging in question-answering dialogues. It is optimized specifically for Indonesian language understanding and generation.

- **Base Model**: `GPT2-Small` (124M parameters)
- **Fine-tuning Method**: SFT-LoRA (Supervised Fine-Tuning with Low-Rank Adaptation)
- **Training Datasets**:
  - `indonesian-nlp/wikipedia-id` (knowledge base)
  - `FreedomIntelligence/evol-instruct-indonesian` (instruction following)
  - `FreedomIntelligence/sharegpt-indonesian` (conversational patterns)
- **Primary Language**: Indonesian (Bahasa Indonesia)
- **Task**: Conversational AI / Chat Completion
- **License**: MIT

## 🧪 Project Background

This model was developed as part of a personal learning journey in AI and Large Language Models (LLMs). The entire training process was conducted on **Google Colab's** free tier with a **T4 GPU**, demonstrating how to build an effective Indonesian conversational AI with limited computational resources.

The project focuses on creating an accessible Indonesian language model that can understand context, follow instructions, and provide helpful responses in natural Bahasa Indonesia.

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load model and tokenizer
model_path = "IzzulGod/GPT2-Indo-Instruct-Tuned"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)

# Create prompt
prompt = "User: Siapa presiden pertama Indonesia?\nAI:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,  # assumed value; this unchanged argument is elided in the diff view
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>")
    )

# Decode response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Example Output

```
User: Siapa presiden pertama Indonesia?
AI: Presiden pertama Indonesia adalah Soekarno. Soekarno dikenal sebagai seorang pemimpin yang sangat dihormati dan dicintai oleh rakyatnya, terutama karena perannya dalam memproklamirkan kemerdekaan Indonesia. Beliau juga dianggap sebagai sosok kunci dalam mempertahankan persatuan bangsa dan kemerdekaan negara dari penjajahan kolonial.
```

*(English: "The first president of Indonesia was Soekarno. Soekarno is known as a leader who was deeply respected and loved by his people, especially for his role in proclaiming Indonesian independence. He is also regarded as a key figure in preserving national unity and the country's independence from colonial rule.")*
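
For a more compact variant of the quick-start code, the high-level `pipeline` API from `transformers` can drive the same checkpoint; a minimal sketch (not part of the original README; generation settings mirror the defaults above, with `max_new_tokens` assumed):

```python
from transformers import pipeline

# Wrap the checkpoint in a text-generation pipeline
generator = pipeline("text-generation", model="IzzulGod/GPT2-Indo-Instruct-Tuned")

result = generator(
    "User: Siapa presiden pertama Indonesia?\nAI:",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(result[0]["generated_text"])
```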

## 🎯 Model Capabilities

The model handles several Indonesian language tasks:

- **Question Answering**: Answers factual questions in Indonesian
- **Instruction Following**: Understands and executes a variety of instructions and tasks
- **Conversational Context**: Maintains coherent context throughout chat-like interactions
- **Code Generation**: Can generate simple code snippets (Python, R, etc.) with clear Indonesian explanations; see the sketch after this list
- **Educational Content**: Explains complex concepts in accessible Indonesian
- **Cultural Awareness**: Understands Indonesian cultural context and references
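
To illustrate the code-generation capability, the Basic Usage pipeline can be reused with one of the prompts listed under Prompt Engineering below; a minimal sketch, assuming `model`, `tokenizer`, `device`, and `torch` from the Basic Usage snippet:

```python
# Ask for a small Python snippet with an Indonesian explanation
prompt = "User: Buatkan kode Python untuk menghitung luas lingkaran\nAI:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.6,  # focused preset from Advanced Usage below
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```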

## 📊 Training Details

### Dataset Composition

The model was trained on a carefully curated combination of datasets to balance knowledge, instruction following, and conversational ability:

**Training Data Format:**
```json
[
  {
    "from": "human",
    "value": "Question or instruction in Indonesian"
  },
  {
    "from": "gpt",
    "value": "Detailed and helpful response in Indonesian"
  }
]
```
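
The preprocessing code is not part of the README; below is a minimal sketch of how a record in this ShareGPT-style format could be flattened into the `User:`/`AI:` prompt format the model expects. The trailing `<|endoftext|>` follows the Prompt Format section of the previous README revision; the function name and variables are illustrative:

```python
def to_training_text(conversation: list[dict]) -> str:
    """Flatten a ShareGPT-style record into a User:/AI: training string."""
    role_map = {"human": "User", "gpt": "AI"}
    lines = [f"{role_map[turn['from']]}: {turn['value']}" for turn in conversation]
    # Terminate each example so the model learns to stop at <|endoftext|>
    return "\n".join(lines) + " <|endoftext|>"

example = [
    {"from": "human", "value": "Siapa presiden pertama Indonesia?"},
    {"from": "gpt", "value": "Presiden pertama Indonesia adalah Soekarno."},
]
print(to_training_text(example))
# User: Siapa presiden pertama Indonesia?
# AI: Presiden pertama Indonesia adalah Soekarno. <|endoftext|>
```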

### Training Configuration

The model was fine-tuned using the LoRA (Low-Rank Adaptation) technique, which allows efficient training while preserving the base model's capabilities. A code sketch of how these settings fit together follows the two lists below.

**LoRA Configuration:**
- **Rank (r)**: 64 - Higher rank for better adaptation capacity
- **Alpha**: 128 - Scaling factor for LoRA weights
- **Target Modules**: `["c_attn", "c_proj", "mlp.c_fc", "mlp.c_proj"]` - Key transformer components
- **Dropout**: 0.05 - Regularization to prevent overfitting
- **Bias**: `"none"` - Focuses adaptation on the weight matrices

**Training Hyperparameters:**
- **Epochs**: 3 - Sufficient for convergence without overfitting
- **Batch Size**: 16 per device - Sized for T4 GPU memory
- **Gradient Accumulation**: 2 steps - Effective batch size of 32
- **Learning Rate**: 2e-4 - A typical rate for LoRA fine-tuning
- **Scheduler**: Cosine annealing - Smooth learning rate decay
- **Weight Decay**: 0.01 - L2 regularization
- **Mixed Precision**: FP16 enabled - Memory and speed optimization

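The training script itself is not included in the README; a minimal sketch of how these settings map onto `peft` and `transformers`, assuming a tokenized dataset `train_ds` (the base-model repo id and output path are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Base checkpoint: the README credits Cahya's Indonesian GPT-2 (repo id assumed)
base = AutoModelForCausalLM.from_pretrained("cahya/gpt2-small-indonesian-522M")

# LoRA settings copied from the list above
lora = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["c_attn", "c_proj", "mlp.c_fc", "mlp.c_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# Hyperparameters copied from the list above
args = TrainingArguments(
    output_dir="gpt2-indo-instruct",  # assumed
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,    # effective batch size 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    fp16=True,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

The previous README revision describes the method as "SFT-LoRA (merged adapter)", which suggests the adapter weights were merged back into the base model (e.g., with peft's `merge_and_unload()`) before upload.
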
### Training Progress

The model showed consistent improvement throughout training:

```
Training Progress (5535 total steps over 3 epochs):

Step    Training Loss
200     3.533500    # initial high loss
400     2.964200    # rapid early improvement
...
4000    2.416200    # stable convergence
...
5400    2.397500    # final logged loss

Final Metrics:
- Mean Training Loss: 2.573
- Training Time: ~3.5 hours (12,603 s)
- Samples per Second: 14.049
- Total Training Samples: ~177k
```

The steady decrease from 3.53 to 2.39 demonstrates effective learning on the Indonesian instruction-following task. The sample count is consistent with the run metrics: 14.049 samples/s × 12,603 s ≈ 177k samples seen across the three epochs.

## 🔧 Advanced Usage

### Generation Parameter Tuning

**For Creative Responses:**
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=256,      # longer responses
    do_sample=True,          # sampling must be enabled for temperature/top_p to take effect
    temperature=0.8,         # more randomness
    top_p=0.9,               # more diverse vocabulary
    repetition_penalty=1.2   # avoid repetition
)
```

**For Focused/Factual Responses:**
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=128,      # concise responses
    do_sample=True,          # sampling must be enabled for temperature/top_p to take effect
    temperature=0.6,         # more deterministic
    top_p=0.95,              # high-probability tokens only
    repetition_penalty=1.1   # mild repetition control
)
```
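
Since the two presets differ only in their keyword arguments, they can be folded into a single helper; a small sketch (the function name and structure are illustrative, not from the README), reusing `model`, `tokenizer`, `device`, and `torch` from Basic Usage:

```python
CREATIVE = dict(max_new_tokens=256, temperature=0.8, top_p=0.9, repetition_penalty=1.2)
FOCUSED = dict(max_new_tokens=128, temperature=0.6, top_p=0.95, repetition_penalty=1.1)

def generate_response(question: str, creative: bool = False) -> str:
    """Generate an answer using the README's User:/AI: prompt format."""
    prompt = f"User: {question}\nAI:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
            **(CREATIVE if creative else FOCUSED),
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_response("Apa ibu kota Indonesia?"))  # "What is the capital of Indonesia?"
```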

### Prompt Engineering

**Recommended Format:**
```
User: [Your question or instruction in Indonesian]
AI: [Expected response starts here]
```

**Examples of Effective Prompts:**
- `User: Jelaskan cara kerja fotosintesis dengan bahasa sederhana\nAI:` (explain how photosynthesis works in simple language)
- `User: Buatkan kode Python untuk menghitung luas lingkaran\nAI:` (write Python code to compute the area of a circle)
- `User: Apa perbedaan antara demokrasi dan republik?\nAI:` (what is the difference between a democracy and a republic?)

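The README shows only single-turn prompts, but the training data includes ShareGPT-style dialogues, so a multi-turn loop that extends the same format is a plausible usage pattern. A minimal sketch (the history handling is an assumption, not documented behavior; `model`, `tokenizer`, `device`, and `torch` come from Basic Usage):

```python
def chat() -> None:
    """Simple REPL that accumulates User:/AI: turns into one growing prompt."""
    history = ""
    while True:
        user = input("User: ")
        if not user:
            break
        prompt = f"{history}User: {user}\nAI:"
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=128,
                do_sample=True,
                temperature=0.7,
                top_p=0.95,
                repetition_penalty=1.1,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
            )
        # Keep only the newly generated tokens as the assistant's reply
        reply = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        ).strip()
        print(f"AI: {reply}")
        history = f"{prompt} {reply}\n"

chat()
```

GPT-2's context window is 1024 tokens, so long histories would need truncation; this matches the Context Length limitation noted below.
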
## ⚠️ Limitations and Considerations

**Knowledge Limitations:**
- **Training Data Cutoff**: Knowledge is limited to the training datasets, which are primarily Wikipedia-based
- **Factual Accuracy**: Generated responses may not always be factually accurate or up to date
- **Real-time Information**: The model cannot access current events or real-time data

**Technical Limitations:**
- **Model Size**: With 124M parameters, complex reasoning capabilities are limited compared to larger models
- **Context Length**: The limited context window may affect very long conversations
- **Language Specialization**: Optimized primarily for Indonesian; other languages may produce suboptimal results

**Response Characteristics:**
- **Formality**: Responses may occasionally sound formal due to the Wikipedia-based training data
- **Consistency**: The model may generate repetitive patterns or inconsistent information across sessions
- **Cultural Nuances**: Although trained on Indonesian data, the model may miss subtle cultural references or regional variations

## 🚀 Future Development Roadmap

**Short-term Improvements:**
- Fine-tuning with more diverse conversational datasets
- Integration of current Indonesian news and cultural content
- Specialized domain adaptations (education, healthcare, business)

## 📝 License

This model is released under the MIT License. Please see the LICENSE file for complete terms.

## 📚 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{izzulgod2025gpt2indochat,
  title={GPT2-Small Indonesian Instruct-Tuned Model},
  author={IzzulGod},
  year={2025},
  howpublished={\url{https://huggingface.co/IzzulGod/GPT2-Indo-Instruct-Tuned}},
  note={Indonesian conversational AI model fine-tuned for instruction following}
}
```

## 🙏 Acknowledgments

- **Base Model**: Thanks to [Cahya](https://huggingface.co/cahya) for the Indonesian GPT-2 base model
- **Datasets**: The FreedomIntelligence team for the Indonesian instruction and conversation datasets
- **Infrastructure**: Google Colab for providing accessible GPU resources for training

---

**Disclaimer**: This model was developed as an experimental project for educational and research purposes. While it performs well on a range of tasks, users should validate outputs for critical applications and be aware of the limitations outlined above.