---
license: apache-2.0
library_name: transformers
---
# ThaiLLM-8B info
ThaiLLM-8B is continually pre-trained from [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) on a diverse corpus of approximately 63 billion tokens.
**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases.
**For example**, the following models have been instruction fine-tuned based on ThaiLLM-8B:
- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview
- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa
- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/
- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0
## Data
The training corpus consists of the following datasets:
| Dataset | Tokens |
|---------|--------|
| Fineweb2-ENG | 24,000,000,000 |
| Fineweb2-TH | 31,525,674,209 |
| CuratedData | 8,054,246,789 |
### CuratedData Breakdown
| Category | Token Count |
|----------|-------------|
| Business & Finance | 736,071,807 |
| News | 1,700,662,378 |
| Education | 576,489,778 |
| Social | 211,000,000 |
| Government | 40,492,117 |
| Medical | 42,987,587 |
| Conversation | 80,919,390 |
| Code | 620,218 |
| Research Articles | 4,185,649,758 |
| Law | 467,994,847 |
| Travel | 6,948,290 |
| Others | 4,410,619 |
*Token counts were calculated using the Qwen3 tokenizer.
## Requirements
The code of Qwen3 has been integrated into the latest Hugging Face `transformers` library. We strongly recommend using the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
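To confirm your environment meets this requirement before loading the model, a quick version check such as the one below can help (the 4.51.0 threshold comes from the note above):
```python
# Sanity check: the installed transformers must know the "qwen3" architecture.
import transformers
from packaging import version  # packaging ships as a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for Qwen3-based models; "
        "run `pip install -U transformers`."
    )
```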
## Usage: Training
**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements.
### Recommended Training Setup
We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques.
#### Quick Start with LLaMA-Factory
```bash
# Clone the repository
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
# Install dependencies
pip install -e .
# Example training command for LoRA
llamafactory-cli train \
--model_name_or_path ThaiLLM/ThaiLLM-8B \
--stage sft \
--do_train \
--finetuning_type lora \
--dataset your_dataset \
--template qwen3 \
--cutoff_len 8192 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--output_dir saves/ThaiLLM-8B-lora \
--bf16
```
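After training completes, the LoRA adapter written to `saves/ThaiLLM-8B-lora` can be attached to the base model for a quick test. The following is a minimal sketch that assumes the `peft` package is installed and uses the output directory from the example above:
```python
# Minimal sketch: attach the trained LoRA adapter to the base model for inference.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "ThaiLLM/ThaiLLM-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "saves/ThaiLLM-8B-lora")
model = model.merge_and_unload()  # optionally fold the LoRA weights into the base model
tokenizer = AutoTokenizer.from_pretrained("ThaiLLM/ThaiLLM-8B")
```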
## Usage: Inference
Below are code snippets to get started quickly with running the model. First, install the necessary libraries.
```bash
pip install -U transformers torch accelerate
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ThaiLLM/ThaiLLM-8B"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Example prompt
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
with torch.inference_mode():
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=500,
        repetition_penalty=1.2,
        num_beams=1,
        do_sample=True,
        top_k=40,
        top_p=0.75,
        temperature=0.4,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0]
print(response)
```
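Because ThaiLLM-8B is a base model, it continues text rather than following instructions; before fine-tuning, a few-shot prompt usually steers it better than a bare question. The format below is only illustrative, not an official template, and reuses the model and tokenizer loaded above:
```python
# Illustrative few-shot prompt (Q/A pairs in Thai); not an official prompt format.
few_shot_prompt = (
    "คำถาม: เมืองหลวงของประเทศไทยคือเมืองใด\nคำตอบ: กรุงเทพมหานคร\n\n"
    "คำถาม: น้ำบริสุทธิ์มีค่า pH เท่าใด\nคำตอบ:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```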
## Benchmarks
We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English.
Each benchmark measures the probability of selecting the correct choice based on the model’s next-token prediction.
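For reference, the sketch below illustrates this style of likelihood-based multiple-choice scoring. It is not the exact evaluation harness used to produce the numbers in the tables, and the prompt formatting is an assumption for illustration only:
```python
# Minimal sketch of likelihood-based multiple-choice scoring (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ThaiLLM/ThaiLLM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.inference_mode():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)  # predictions for tokens 1..N-1
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # first predicted position that belongs to the choice
    idx = torch.arange(start, targets.shape[0], device=targets.device)
    return log_probs[idx, targets[start:]].sum().item()

question = "คำถาม: น้ำบริสุทธิ์มีค่า pH เท่าใด\nคำตอบ: "
choices = ["5", "6", "7", "8"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])  # model's pick
```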
### 1. Natural Language Understanding (NLU)
| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ |
|------|--------------:|-----------:|---:|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 |
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 |
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.31396 | **0.48292** | +0.16896 |
| ├── ONET | 0.4074 | **0.5864** | +0.1790 |
| ├── IC | 0.5157 | **0.7052** | +0.1895 |
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 |
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 |
| └── A-Level | 0.1653 | **0.5275** | +0.3622 |
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 |
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.54844 | **0.55792** | +0.00948 |
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
| ├── Social | 0.5844 | **0.7088** | +0.1244 |
| ├── Science | 0.4603 | **0.5238** | +0.0635 |
| └── English | 0.7552 | **0.7864** | +0.0312 |
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 |
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 |
---
### 2. Average Performance
| Model | Average Score |
|-------|--------------:|
| Qwen3-8B-Base | 0.5987 |
| ThaiLLM-8B | **0.6891** |
> **Highlights**:
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**.
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29).
> - Slight regressions appear in **MMLU-ENG**, **XNLI-Thai**, and the **Math** subsection of M6Exam.
## Limitations
- This is a base model and requires instruction fine-tuning for optimal performance
- Performance on specialized domains may require domain-specific fine-tuning
- As with all language models, outputs should be verified for accuracy in critical applications
## Citation
```bibtex
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}
```
## Dataset Contributor