amkyawdev committed
Commit bf64cbe · verified · 1 Parent(s): fa051e3

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +28 -13
  2. requirements.txt +3 -1
  3. train.py +106 -32
README.md CHANGED
@@ -1,12 +1,12 @@
  # 🧠 Myanmar LLM Training

- Training script for Myanmar language model using Qwen2.5-0.5B-Instruct.

  ## 📋 Requirements

  - Python 3.8+
- - GPU with 8GB+ VRAM (recommended)
- - HuggingFace Account

  ## 🚀 Quick Start

@@ -21,6 +21,8 @@ huggingface-cli login
  # Enter your token
  ```

  ### 3. Run training
  ```bash
  python train.py
@@ -30,10 +32,18 @@ python train.py

  | Parameter | Default | Description |
  |-----------|---------|-------------|
- | MODEL_NAME | Qwen/Qwen2.5-0.5B-Instruct | Base model |
  | num_train_epochs | 3 | Training iterations |
- | per_device_train_batch_size | 4 | Batch size |
- | learning_rate | 2e-5 | Learning rate |

  ## 📊 Training Data

@@ -42,33 +52,38 @@ Dataset: [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/
  | Split | Samples |
  |-------|---------|
  | Train | 1000 |
- | Test | 1000 |
  | Validation | 1000 |

  ## 💾 Output

- Trained model saved to `./myanmar-llm-output/`

  ## 📤 Upload to HuggingFace

  ```bash
- cd myanmar-llm-output
- huggingface-cli upload amkyawdev/my-myanmar-llm-v1 . --repo-type model
  ```

- ## 🖥️ Run on Google Colab

  ```python
  # Install
- !pip install transformers datasets torch

  # Login
  from huggingface_hub import login
  login("YOUR_TOKEN")

- # Run training script
  %run train.py
  ```

  ---
  Built by amkyawdev
 
  # 🧠 Myanmar LLM Training

+ Fine-tune **Llama-3.1-8B-Instruct** on a Myanmar-language dataset.

  ## 📋 Requirements

  - Python 3.8+
+ - GPU with 16GB+ VRAM (recommended)
+ - HuggingFace Account with Llama access

  ## 🚀 Quick Start

 
  # Enter your token
  ```

+ **Note:** Llama requires accepting the license at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
+
  ### 3. Run training
  ```bash
  python train.py
 

  | Parameter | Default | Description |
  |-----------|---------|-------------|
+ | MODEL_NAME | meta-llama/Llama-3.1-8B-Instruct | Base model |
  | num_train_epochs | 3 | Training iterations |
+ | per_device_train_batch_size | 2 | Batch size (4-bit) |
+ | gradient_accumulation_steps | 8 | Gradient accumulation steps (effective batch 16) |
+ | learning_rate | 1e-5 | Learning rate |
+
+ ## 📊 Features
+
+ - ✅ 4-bit quantization (NF4) - runs with minimal VRAM.
+ - ✅ Gradient checkpointing - saves memory.
+ - ✅ Test/Validation evaluation - evaluates on both splits.
+ - ✅ BF16 mixed precision - more precise training.

  ## 📊 Training Data

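The batch-size rows in the table combine into one effective batch per optimizer step. As a rough sketch, assuming a single GPU (the commit does not state the device count), the arithmetic looks like this; `effective_batch_size` is a hypothetical helper and not part of train.py:

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices


# Values from this commit (batch 2, accumulation 8) and from the
# previous version of the README/script (batch 4, accumulation 4).
new_setting = effective_batch_size(per_device=2, accumulation_steps=8)
old_setting = effective_batch_size(per_device=4, accumulation_steps=4)
print(new_setting, old_setting)
```

Under that single-GPU assumption both settings give 16 samples per step, so the commit halves per-step memory without changing the effective batch.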
 
  | Split | Samples |
  |-------|---------|
  | Train | 1000 |
  | Validation | 1000 |
+ | Test | 1000 |

  ## 💾 Output

+ Trained model saved to `./myanmar-llama-output/`

  ## 📤 Upload to HuggingFace

  ```bash
+ cd myanmar-llama-output
+ huggingface-cli upload amkyawdev/my-myanmar-llama . --repo-type model
  ```

+ ## 🖥️ Google Colab

  ```python
  # Install
+ !pip install transformers datasets torch bitsandbytes accelerate

  # Login
  from huggingface_hub import login
  login("YOUR_TOKEN")

+ # Run
  %run train.py
  ```

+ ## ⚠️ Important
+
+ 1. The Llama license is required. Accept it at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
+ 2. Your token must have Llama access.
+
  ---
  Built by amkyawdev
requirements.txt CHANGED
@@ -4,4 +4,6 @@ transformers>=4.36.0
  datasets>=2.14.0
  torch>=2.0.0
  accelerate>=0.20.0
- tensorboard>=2.12.0

  datasets>=2.14.0
  torch>=2.0.0
  accelerate>=0.20.0
+ tensorboard>=2.12.0
+ bitsandbytes>=0.41.0
+ scikit-learn>=1.0.0
train.py CHANGED
@@ -1,6 +1,6 @@
  """
  Myanmar LLM Training Script
- Fine-tune Qwen2.5-0.5B with Myanmar dataset
  """

  import json
@@ -11,28 +11,75 @@ from transformers import (
      AutoTokenizer,
      TrainingArguments,
      Trainer,
-     DataCollatorForLanguageModeling
  )
  import torch

  # Config
- MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
- OUTPUT_DIR = "./myanmar-llm-output"
  DATASET_PATH = "amkyawdev/myanmar-llm-data"

  def format_conversation(example):
-     """Format conversation for training"""
      messages = example["messages"]
      text = ""
      for msg in messages:
-         if msg["role"] == "system":
-             text += f"<|im_start|>system\n{msg['content']}<|im_end|>\n"
-         elif msg["role"] == "user":
-             text += f"<|im_start|>user\n{msg['content']}<|im_end|>\n"
-         elif msg["role"] == "assistant":
-             text += f"<|im_start|>assistant\n{msg['content']}<|im_end|>\n"
      return {"text": text}

  def load_data():
      """Load and prepare Myanmar dataset"""
      print("📂 Loading dataset...")
@@ -46,14 +93,16 @@ def load_data():
      return dataset

  def main():
-     print("=" * 50)
-     print("🧠 Myanmar LLM Training")
-     print("=" * 50)

      # Check GPU
      if torch.cuda.is_available():
-         print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
-         print(f" VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
      else:
          print("⚠️ No GPU - will use CPU (very slow)")

@@ -61,55 +110,74 @@ def main():
      print(f"\n📥 Loading model: {MODEL_NAME}")
      tokenizer = AutoTokenizer.from_pretrained(
          MODEL_NAME,
-         trust_remote_code=True
      )

      # Set pad token
-     if tokenizer.pad_token is None:
-         tokenizer.pad_token = tokenizer.eos_token

-     # Load model
      model = AutoModelForCausalLM.from_pretrained(
          MODEL_NAME,
          trust_remote_code=True,
-         torch_dtype=torch.float16,
-         device_map="auto"
      )

      # Load dataset
      dataset = load_data()

-     # Split for validation
      train_dataset = dataset["train"]
      eval_dataset = dataset["validation"]

      print(f"\n📊 Dataset:")
      print(f" Train: {len(train_dataset)} samples")
-     print(f" Eval: {len(eval_dataset)} samples")

      # Training args
      training_args = TrainingArguments(
          output_dir=OUTPUT_DIR,
          num_train_epochs=3,
-         per_device_train_batch_size=4,
-         per_device_eval_batch_size=4,
-         gradient_accumulation_steps=4,
-         learning_rate=2e-5,
          warmup_ratio=0.1,
          logging_steps=10,
          save_steps=100,
          eval_steps=100,
          save_total_limit=2,
          bf16=True,
          remove_unused_columns=False,
          optim="adamw_torch",
          report_to="none",
      )

      # Data collator
      data_collator = DataCollatorForLanguageModeling(
          tokenizer=tokenizer,
          mlm=False,
      )

      # Trainer
@@ -119,22 +187,28 @@ def main():
          train_dataset=train_dataset,
          eval_dataset=eval_dataset,
          data_collator=data_collator,
      )

      # Train
      print("\n🚀 Starting training...")
      trainer.train()

      # Save model
      print("\n💾 Saving model...")
-     model.save_pretrained(OUTPUT_DIR)
      tokenizer.save_pretrained(OUTPUT_DIR)

-     print(f"\n✅ Training complete! Model saved to: {OUTPUT_DIR}")
      print(f"\n📤 Upload to HuggingFace:")
-     print(f" huggingface-cli login")
      print(f" cd {OUTPUT_DIR}")
-     print(f" hf_upload amkyawdev/my-myanmar-llm . --repo-type model")

  if __name__ == "__main__":
      main()
 
  """
  Myanmar LLM Training Script
+ Fine-tune Llama-3.1-8B-Instruct with Myanmar dataset
  """

  import json
 
      AutoTokenizer,
      TrainingArguments,
      Trainer,
+     DataCollatorForLanguageModeling,
+     EvalPrediction,
  )
+ from transformers import BitsAndBytesConfig
  import torch
+ from sklearn.metrics import accuracy_score

  # Config
+ MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
+ OUTPUT_DIR = "./myanmar-llama-output"
  DATASET_PATH = "amkyawdev/myanmar-llm-data"

+ # Quantization config for low VRAM
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype="float16",
+     bnb_4bit_use_double_quant=True,
+ )
+
  def format_conversation(example):
+     """Format conversation for Llama chat template"""
      messages = example["messages"]
      text = ""
      for msg in messages:
+         role = msg["role"]
+         content = msg["content"]
+         if role == "system":
+             text += f"<|start_header_id|>system<|end_header_id|>\n\n{content}<|eot_id|>"
+         elif role == "user":
+             text += f"<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>"
+         elif role == "assistant":
+             text += f"<|start_header_id|>assistant<|end_header_id|>\n\n{content}<|eot_id|>"
+     # Add separator
+     text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
      return {"text": text}

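The three role branches in `format_conversation` emit the same header/content/`<|eot_id|>` pattern, and the open assistant header is appended here as well as in `preprocess_function`. A minimal sketch of the same formatting with the trailing header made optional is shown below. `format_llama3` is a hypothetical name, the `add_generation_prompt` flag is borrowed from the parameter of the same name on `tokenizer.apply_chat_template`, and the sketch assumes the tokenizer adds the beginning-of-text token itself:

```python
def format_llama3(messages, add_generation_prompt=False):
    """Render chat messages with Llama 3 style header and end-of-turn markers."""
    parts = []
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role not in {"system", "user", "assistant"}:
            continue  # as in the diff, any other role is skipped
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    text = "".join(parts)
    if add_generation_prompt:
        # Open an assistant turn only when the model is expected to continue,
        # e.g. at inference time.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
print(format_llama3(example))
```

For training samples that already end with an assistant turn, leaving `add_generation_prompt` off keeps the text free of a dangling empty turn.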
+ def preprocess_function(examples, tokenizer, max_length=2048):
+     """Tokenize the text"""
+     # Add prompt suffix for assistant response
+     texts = [text + "<|start_header_id|>assistant<|end_header_id|>\n\n" for text in examples["text"]]
+
+     tokenized = tokenizer(
+         texts,
+         truncation=True,
+         max_length=max_length,
+         padding="max_length",
+         return_tensors=None,
+     )
+
+     # Labels same as input_ids (causal LM)
+     tokenized["labels"] = tokenized["input_ids"].copy()
+     return tokenized
+
+ def compute_metrics(eval_pred):
+     """Compute perplexity as evaluation metric"""
+     logits, labels = eval_pred
+     # Shift along the sequence axis so position t predicts token t+1
+     logits = torch.tensor(logits)[:, :-1, :]
+     labels = torch.tensor(labels)[:, 1:]
+
+     # Calculate perplexity over the flattened (token, vocab) pairs
+     loss = torch.nn.functional.cross_entropy(
+         logits.reshape(-1, logits.size(-1)),
+         labels.reshape(-1),
+         ignore_index=-100
+     )
+     return {"perplexity": torch.exp(loss).item()}
+
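Perplexity for a causal LM is the exponential of the mean next-token cross-entropy: the logits at position t are scored against the label at position t+1, so the shift runs along the sequence axis rather than the batch axis. The dependency-free sketch below illustrates that calculation on nested lists. `shifted_perplexity` is a hypothetical helper for illustration, not the function used by train.py:

```python
import math


def shifted_perplexity(logits, labels, ignore_index=-100):
    """logits: [batch][seq][vocab] floats, labels: [batch][seq] ints."""
    total_nll, count = 0.0, 0
    for seq_logits, seq_labels in zip(logits, labels):
        # Drop the last logits row and the first label so that
        # position t is compared with token t+1.
        for row, target in zip(seq_logits[:-1], seq_labels[1:]):
            if target == ignore_index:
                continue  # masked positions do not contribute
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total_nll += log_z - row[target]
            count += 1
    if count == 0:
        raise ValueError("no labelled tokens to score")
    return math.exp(total_nll / count)


# Uniform logits over a 4-token vocabulary: every token is equally likely.
uniform = [[[0.0] * 4 for _ in range(3)]]
print(shifted_perplexity(uniform, [[0, 1, 2]]))
```

With uniform logits the result equals the vocabulary size, which makes a convenient sanity check for any tensor-based implementation.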
  def load_data():
      """Load and prepare Myanmar dataset"""
      print("📂 Loading dataset...")
 
      return dataset

  def main():
+     print("=" * 60)
+     print("🧠 Myanmar LLM Training - Llama 3.1 8B")
+     print("=" * 60)

      # Check GPU
      if torch.cuda.is_available():
+         gpu_name = torch.cuda.get_device_name(0)
+         vram = torch.cuda.get_device_properties(0).total_memory / 1e9
+         print(f"✅ GPU: {gpu_name}")
+         print(f" VRAM: {vram:.2f} GB")
      else:
          print("⚠️ No GPU - will use CPU (very slow)")

 
      print(f"\n📥 Loading model: {MODEL_NAME}")
      tokenizer = AutoTokenizer.from_pretrained(
          MODEL_NAME,
+         trust_remote_code=True,
+         padding_side="right",
      )

      # Set pad token
+     tokenizer.pad_token = tokenizer.eos_token

+     # Load model with 4-bit quantization
+     print("🔄 Loading model with 4-bit quantization...")
      model = AutoModelForCausalLM.from_pretrained(
          MODEL_NAME,
+         quantization_config=bnb_config,
          trust_remote_code=True,
+         device_map="auto",
      )

+     # Enable gradient checkpointing to reduce memory use
+     model.gradient_checkpointing_enable()
+
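To see why 4-bit loading matters for an 8B model, a back-of-the-envelope estimate of weight storage helps. The sketch below assumes a nominal 8e9 parameters and counts weights only: it ignores quantization constants, activations, gradients and optimizer state, so it should be read as a lower bound. `weight_memory_gb` is a hypothetical helper, and it uses the same 1e9 bytes per GB convention as the VRAM printout in the script:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 8e9  # assumption: round figure for an "8B" model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: about {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB for weights")
```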
      # Load dataset
      dataset = load_data()

+     # Preprocess
+     print("🔧 Tokenizing...")
+     for split in dataset:
+         dataset[split] = dataset[split].map(
+             lambda x: preprocess_function(x, tokenizer),
+             batched=True,
+             remove_columns=dataset[split].column_names,
+         )
+
      train_dataset = dataset["train"]
      eval_dataset = dataset["validation"]
+     test_dataset = dataset["test"]

      print(f"\n📊 Dataset:")
      print(f" Train: {len(train_dataset)} samples")
+     print(f" Validation: {len(eval_dataset)} samples")
+     print(f" Test: {len(test_dataset)} samples")

      # Training args
      training_args = TrainingArguments(
          output_dir=OUTPUT_DIR,
          num_train_epochs=3,
+         per_device_train_batch_size=2,
+         per_device_eval_batch_size=2,
+         gradient_accumulation_steps=8,
+         learning_rate=1e-5,
          warmup_ratio=0.1,
          logging_steps=10,
          save_steps=100,
          eval_steps=100,
          save_total_limit=2,
+         fp16=False,
          bf16=True,
          remove_unused_columns=False,
          optim="adamw_torch",
          report_to="none",
+         load_best_model_at_end=True,
+         eval_strategy="steps",
+         save_strategy="steps",
      )

      # Data collator
      data_collator = DataCollatorForLanguageModeling(
          tokenizer=tokenizer,
          mlm=False,
+         pad_to_multiple_of=8,
      )

      # Trainer
 
          train_dataset=train_dataset,
          eval_dataset=eval_dataset,
          data_collator=data_collator,
+         compute_metrics=compute_metrics,
      )

      # Train
      print("\n🚀 Starting training...")
      trainer.train()

+     # Evaluate on test set
+     print("\n📝 Evaluating on test set...")
+     test_results = trainer.evaluate(test_dataset)
+     print(f"Test Results: {test_results}")
+
      # Save model
      print("\n💾 Saving model...")
+     trainer.save_model(OUTPUT_DIR)
      tokenizer.save_pretrained(OUTPUT_DIR)

+     print(f"\n✅ Training complete!")
+     print(f" Model: {OUTPUT_DIR}")
      print(f"\n📤 Upload to HuggingFace:")
      print(f" cd {OUTPUT_DIR}")
+     print(f" hf upload amkyawdev/my-myanmar-llama . --repo-type model")

  if __name__ == "__main__":
      main()