---
license: apache-2.0
library_name: transformers
---
# ThaiLLM-8B info
ThaiLLM-8B is continually pre-trained from [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) on a diverse corpus of approximately 63 billion tokens.
**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases.
**For example**, the following models have been instruction fine-tuned based on ThaiLLM-8B:
- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview
- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa
- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/
- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0
## Data
The training corpus consists of the following datasets:
| Dataset | Tokens |
|---------|--------|
| Fineweb2-ENG | 24,000,000,000 |
| Fineweb2-TH | 31,525,674,209 |
| CuratedData | 8,054,246,789 |
### CuratedData Breakdown
| Category | Token Count |
|----------|-------------|
| Business & Finance | 736,071,807 |
| News | 1,700,662,378 |
| Education | 576,489,778 |
| Social | 211,000,000 |
| Government | 40,492,117 |
| Medical | 42,987,587 |
| Conversation | 80,919,390 |
| Code | 620,218 |
| Research Articles | 4,185,649,758 |
| Law | 467,994,847 |
| Travel | 6,948,290 |
| Others | 4,410,619 |
*Token counts were calculated using the Qwen3 tokenizer.
## Requirements
The code of Qwen3 has been integrated into the latest Hugging Face `transformers` library. We strongly recommend using the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter the following error:
```
KeyError: 'qwen3'
```
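To confirm your environment meets this requirement before loading the model, a quick version check such as the one below can help (the 4.51.0 threshold comes from the note above):
```python
# Sanity check: the installed transformers must know the "qwen3" architecture.
import transformers
from packaging import version  # packaging ships as a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for Qwen3-based models; "
        "run `pip install -U transformers`."
    )
```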
## Usage: Training
**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements.
### Recommended Training Setup
We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques.
#### Quick Start with LLaMA-Factory
```bash
# Clone the repository
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
# Install dependencies
pip install -e .
# Example training command for LoRA
llamafactory-cli train \
--model_name_or_path ThaiLLM/ThaiLLM-8B \
--stage sft \
--do_train \
--finetuning_type lora \
--dataset your_dataset \
--template qwen3 \
--cutoff_len 8192 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--output_dir saves/ThaiLLM-8B-lora \
--bf16
```
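After training completes, the LoRA adapter written to `saves/ThaiLLM-8B-lora` can be attached to the base model for a quick test. The following is a minimal sketch that assumes the `peft` package is installed and uses the output directory from the example above:
```python
# Minimal sketch: attach the trained LoRA adapter to the base model for inference.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "ThaiLLM/ThaiLLM-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "saves/ThaiLLM-8B-lora")
model = model.merge_and_unload()  # optionally fold the LoRA weights into the base model
tokenizer = AutoTokenizer.from_pretrained("ThaiLLM/ThaiLLM-8B")
```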
## Usage: Inference
Below are code snippets to get started quickly with running the model. First, install the necessary libraries.
```bash
pip install -U transformers torch accelerate
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ThaiLLM/ThaiLLM-8B"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Example prompt
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
with torch.inference_mode():
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=500,
        repetition_penalty=1.2,
        num_beams=1,
        do_sample=True,
        top_k=40,
        top_p=0.75,
        temperature=0.4,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0]
print(response)
```
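Because ThaiLLM-8B is a base model, it continues text rather than following instructions; before fine-tuning, a few-shot prompt usually steers it better than a bare question. The format below is only illustrative, not an official template, and reuses the model and tokenizer loaded above:
```python
# Illustrative few-shot prompt (Q/A pairs in Thai); not an official prompt format.
few_shot_prompt = (
    "คำถาม: เมืองหลวงของประเทศไทยคือเมืองใด\nคำตอบ: กรุงเทพมหานคร\n\n"
    "คำถาม: น้ำบริสุทธิ์มีค่า pH เท่าใด\nคำตอบ:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```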
## Benchmarks
We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English.
Each benchmark measures the probability of selecting the correct choice based on the model’s next-token prediction.
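For reference, the sketch below illustrates this style of likelihood-based multiple-choice scoring. It is not the exact evaluation harness used to produce the numbers in the tables, and the prompt formatting is an assumption for illustration only:
```python
# Minimal sketch of likelihood-based multiple-choice scoring (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ThaiLLM/ThaiLLM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.inference_mode():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)  # predictions for tokens 1..N-1
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # first predicted position that belongs to the choice
    idx = torch.arange(start, targets.shape[0], device=targets.device)
    return log_probs[idx, targets[start:]].sum().item()

question = "คำถาม: น้ำบริสุทธิ์มีค่า pH เท่าใด\nคำตอบ: "
choices = ["5", "6", "7", "8"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])  # model's pick
```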
### 1. Natural Language Understanding (NLU)
| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ |
|------|--------------:|-----------:|---:|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 |
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 |
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.31396 | **0.48292** | +0.16896 |
| ├── ONET | 0.4074 | **0.5864** | +0.1790 |
| ├── IC | 0.5157 | **0.7052** | +0.1895 |
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 |
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 |
| └── A-Level | 0.1653 | **0.5275** | +0.3622 |
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 |
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.54844 | **0.55792** | +0.00948 |
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
| ├── Social | 0.5844 | **0.7088** | +0.1244 |
| ├── Science | 0.4603 | **0.5238** | +0.0635 |
| └── English | 0.7552 | **0.7864** | +0.0312 |
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 |
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 |
---
### 2. Average Performance
| Model | Average Score |
|-------|--------------:|
| Qwen3-8B-Base | 0.5987 |
| ThaiLLM-8B | **0.6891** |
> **Highlights**:
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**.
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29).
> - Slight regressions appear in **MMLU-ENG**, **XNLI-Thai**, and the **Math** subsection of M6Exam.
## Limitations
- This is a base model and requires instruction fine-tuning for optimal performance
- Performance on specialized domains may require domain-specific fine-tuning
- As with all language models, outputs should be verified for accuracy in critical applications
## Citation
```bibtex
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}
```
## Dataset Contributor