---
license: apache-2.0
library_name: transformers
---

# ThaiLLM-8B

This model is a continued pre-training of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), trained on a diverse corpus of approximately 63 billion tokens.

**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases. 

**For example**, the following models have been instruction fine-tuned based on ThaiLLM-8B:

- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview

- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa

- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/

- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0

## Data

The training corpus consists of the following datasets:

| Dataset | Tokens |
|---------|--------|
| Fineweb2-ENG | 24,000,000,000 |
| Fineweb2-TH | 31,525,674,209 |
| CuratedData | 8,054,246,789 |

### CuratedData Breakdown

| Category | Token Count |
|----------|-------------|
| Business & Finance | 736,071,807 |
| News | 1,700,662,378 |
| Education | 576,489,778 |
| Social | 211,000,000 |
| Government | 40,492,117 |
| Medical | 42,987,587 |
| Conversation | 80,919,390 |
| Code | 620,218 |
| Research Articles | 4,185,649,758 |
| Law | 467,994,847 |
| Travel | 6,948,290 |
| Others | 4,410,619 |

\*Token counts were calculated with the Qwen3 tokenizer.
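
As a rough illustration, token counts of this kind could be reproduced along the lines of the sketch below. The corpus file and `"text"` column are placeholders, not the actual datasets, and this is not the exact counting script used:

```python
# Hypothetical sketch of counting corpus tokens with the Qwen3 tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")
dataset = load_dataset("json", data_files="your_corpus.jsonl", split="train")  # placeholder corpus

total_tokens = 0
for example in dataset:
    # Count raw content tokens only, without special tokens.
    total_tokens += len(tokenizer(example["text"], add_special_tokens=False).input_ids)

print(f"Total tokens: {total_tokens:,}")
```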

## Requirements

The code of Qwen3 has been integrated into the latest Hugging Face `transformers` library. We strongly recommend using the latest version of `transformers`.

With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
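
To avoid this, you can verify the installed version before loading the model. A minimal check, assuming a standard environment:

```python
# Verify that the installed transformers release recognizes the "qwen3" model type.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.51.0"), \
    "Please upgrade: pip install -U 'transformers>=4.51.0'"
```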

## Usage: Training

**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements.

### Recommended Training Setup

We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques.

#### Quick Start with LLaMA-Factory

```bash
# Clone the repository
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Install dependencies
pip install -e .

# Example training command for LoRA
llamafactory-cli train \
    --model_name_or_path ThaiLLM/ThaiLLM-8B \
    --stage sft \
    --do_train \
    --finetuning_type lora \
    --dataset your_dataset \
    --template qwen3 \
    --cutoff_len 8192 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --output_dir saves/ThaiLLM-8B-lora \
    --bf16
```
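
After training, the resulting LoRA adapter can be loaded on top of the base model for a quick sanity check. The sketch below assumes the output directory from the example command above and that `peft` is installed; it is an illustration, not part of LLaMA-Factory itself:

```python
# Load the base model and attach the LoRA adapter saved by the training run above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "ThaiLLM/ThaiLLM-8B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "saves/ThaiLLM-8B-lora")  # adapter path from --output_dir
tokenizer = AutoTokenizer.from_pretrained("ThaiLLM/ThaiLLM-8B")
```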

## Usage: Inference

Below are code snippets to quickly get started with running the model. First, install the necessary libraries.

```bash
pip install -U transformers torch accelerate
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ThaiLLM/ThaiLLM-8B"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    torch_dtype=torch.bfloat16
)

# Example prompt
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # move inputs to the model's device

# Generate response
with torch.inference_mode(): 
    generate_ids = model.generate( 
        **inputs, 
        max_new_tokens=500, 
        repetition_penalty=1.2, 
        num_beams=1, 
        do_sample=True, 
        top_k=40, 
        top_p=0.75, 
        temperature=0.4, 
        pad_token_id=tokenizer.eos_token_id, 
    )

response = tokenizer.batch_decode(
    generate_ids, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=True
)[0]

print(response)
```

## Benchmarks

We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English.  
Each benchmark measures the probability of selecting the correct choice based on the model’s next-token prediction.
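
As a rough sketch of this scoring scheme (not the exact evaluation harness), a question is answered by comparing the model's next-token probabilities over the choice labels:

```python
# Minimal multiple-choice scoring: pick the label with the highest next-token probability.
import torch

def pick_choice(model, tokenizer, prompt, labels=("A", "B", "C", "D")):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        next_token_logits = model(**inputs).logits[0, -1]
    # First token id of each choice label (assumes single-token labels).
    label_ids = [tokenizer(label, add_special_tokens=False).input_ids[0] for label in labels]
    probs = torch.softmax(next_token_logits[label_ids], dim=-1)
    return labels[int(probs.argmax())]
```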

### 1. Natural Language Understanding (NLU)

| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ |
|------|--------------:|-----------:|---:|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 |
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 |
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.3140 | **0.4829** | +0.1690 |
| ├── ONET | 0.4074 | **0.5864** | +0.1790 |
| ├── IC | 0.5157 | **0.7052** | +0.1895 |
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 |
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 |
| └── A-Level | 0.1653 | **0.5275** | +0.3622 |
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 |
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.5484 | **0.5579** | +0.0095 |
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
| ├── Social | 0.5844 | **0.7088** | +0.1244 |
| ├── Science | 0.4603 | **0.5238** | +0.0635 |
| └── English | 0.7552 | **0.7864** | +0.0312 |
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 |
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 |

---

### 2. Average Performance

| Model | Average Score |
|-------|--------------:|
| Qwen3-8B-Base | 0.5987 |
| ThaiLLM-8B | **0.6891** |

> **Highlights**:  
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**.  
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29).  
> - Regressions are seen in **MMLU-ENG** (-0.01), **XNLI-Thai** (-0.09), and **M6Exam Math** (-0.14).


## Limitations

- This is a base model and requires instruction fine-tuning for optimal performance
- Performance on specialized domains may require domain-specific fine-tuning
- As with all language models, outputs should be verified for accuracy in critical applications

## Citation

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}, 
}
```

## Dataset Contributor

![](dataset_contributors.png)