---
license: apache-2.0
library_name: transformers
---
# ThaiLLM-8B
ThaiLLM-8B is a continued pre-training of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), trained on a diverse corpus of approximately 63 billion tokens.
**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases.
**For example**, the following models were instruction fine-tuned from ThaiLLM-8B:
- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview
- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa
- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/
- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0
## Data
The training corpus consists of the following datasets:
| Dataset | Tokens |
|---------|--------|
| Fineweb2-ENG | 24,000,000,000 |
| Fineweb2-TH | 31,525,674,209 |
| CuratedData | 8,054,246,789 |
### CuratedData Breakdown
| Category | Token Count |
|----------|-------------|
| Business & Finance | 736,071,807 |
| News | 1,700,662,378 |
| Education | 576,489,778 |
| Social | 211,000,000 |
| Government | 40,492,117 |
| Medical | 42,987,587 |
| Conversation | 80,919,390 |
| Code | 620,218 |
| Research Articles | 4,185,649,758 |
| Law | 467,994,847 |
| Travel | 6,948,290 |
| Others | 4,410,619 |
*Token counts calculated using the Qwen3 tokenizer.
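As a rough, hypothetical sketch of how such counts can be reproduced with the Qwen3 tokenizer (the in-memory document list is a placeholder for the actual corpus):
```python
from transformers import AutoTokenizer

# Qwen3 tokenizer, the same family as the ThaiLLM-8B base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")

def count_tokens(texts):
    """Sum content-token counts over an iterable of raw documents."""
    return sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)

# Placeholder usage; in practice, stream documents from each dataset
print(count_tokens(["น้ำบริสุทธิ์มีค่า pH เท่ากับ 7", "Pure water has a pH of 7."]))
```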
## Requirements
ThaiLLM-8B uses the Qwen3 architecture, which is integrated into the Hugging Face `transformers` library. We strongly recommend using the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
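Upgrading (or pinning at least that version) avoids the error:
```bash
pip install -U "transformers>=4.51.0"
```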
## Usage: Training
**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements.
### Recommended Training Setup
We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques.
#### Quick Start with LLaMA-Factory
```bash
# Clone the repository
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Install dependencies
pip install -e .

# Example training command for LoRA fine-tuning
llamafactory-cli train \
    --model_name_or_path ThaiLLM/ThaiLLM-8B \
    --stage sft \
    --do_train \
    --finetuning_type lora \
    --dataset your_dataset \
    --template qwen3 \
    --cutoff_len 8192 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --output_dir saves/ThaiLLM-8B-lora \
    --bf16
```
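After training, you can merge the LoRA adapter into the base weights for standalone deployment. A sketch using LLaMA-Factory's `export` command (the export directory is a placeholder; check the flags against your installed version):
```bash
llamafactory-cli export \
    --model_name_or_path ThaiLLM/ThaiLLM-8B \
    --adapter_name_or_path saves/ThaiLLM-8B-lora \
    --template qwen3 \
    --finetuning_type lora \
    --export_dir models/ThaiLLM-8B-sft
```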
## Usage: Inference
Below are code snippets to get started quickly with the model. First, install the necessary libraries.
```bash
pip install -U transformers torch accelerate
```
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ThaiLLM/ThaiLLM-8B"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Example prompt ("What is the pH of pure water?")
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
with torch.inference_mode():
    generate_ids = model.generate(
        **inputs,  # pass input_ids and attention_mask together
        max_new_tokens=500,
        repetition_penalty=1.2,
        num_beams=1,
        do_sample=True,
        top_k=40,
        top_p=0.75,
        temperature=0.4,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0]
print(response)
```
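Since ThaiLLM-8B is a base model without a chat template, it performs plain text completion, and few-shot prompting usually steers it better than a bare question. A minimal sketch reusing the `model` and `tokenizer` loaded above (the example pairs are illustrative):
```python
# Question/answer pairs teach the base model the completion pattern
few_shot_prompt = (
    "คำถาม: เมืองหลวงของประเทศไทยคืออะไร\n"  # "What is the capital of Thailand?"
    "คำตอบ: กรุงเทพมหานคร\n\n"              # "Bangkok"
    "คำถาม: น้ำบริสุทธิ์มีค่า pH เท่าใด\n"     # "What is the pH of pure water?"
    "คำตอบ:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,  # greedy decoding for short factual answers
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```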
## Benchmarks
We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English.
Each benchmark measures the probability of selecting the correct choice based on the model’s next-token prediction.
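As an illustration of this protocol (the exact evaluation harness is not specified here, so this is a simplified, hypothetical sketch): the prompt ends where the answer letter should appear, and the choice whose token receives the highest next-token probability is taken as the model's answer.
```python
import torch

def pick_choice(model, tokenizer, prompt, choices=("A", "B", "C", "D")):
    """Return the choice letter with the highest next-token probability."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        next_logits = model(**inputs).logits[0, -1]  # logits over the next token
    # Assumes each choice letter encodes to a single token
    ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
    return choices[int(torch.argmax(next_logits[ids]))]
```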
### 1. Natural Language Understanding (NLU)
| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ |
|------|--------------:|-----------:|---:|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 |
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 |
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.3140 | **0.4829** | +0.1690 |
| ├── ONET | 0.4074 | **0.5864** | +0.1790 |
| ├── IC | 0.5157 | **0.7052** | +0.1895 |
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 |
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 |
| └── A-Level | 0.1653 | **0.5275** | +0.3622 |
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 |
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.5484 | **0.5579** | +0.0095 |
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
| ├── Social | 0.5844 | **0.7088** | +0.1244 |
| ├── Science | 0.4603 | **0.5238** | +0.0635 |
| └── English | 0.7552 | **0.7864** | +0.0312 |
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 |
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 |
---
### 2. Average Performance
| Model | Average Score |
|-------|--------------:|
| Qwen3-8B-Base | 0.5987 |
| ThaiLLM-8B | **0.6891** |
> **Highlights**:
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**.
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29).
> - Regressions appear in **MMLU-ENG**, **XNLI-Thai**, and the **Math** subset of M6Exam.
## Limitations
- This is a base model and requires instruction fine-tuning for optimal performance
- Performance on specialized domains may require domain-specific fine-tuning
- As with all language models, outputs should be verified for accuracy in critical applications
## Citation
```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}
```