|
|
--- |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# ThaiLLM-8B
|
|
|
|
|
This model is a continued pre-training of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), trained on a diverse corpus of approximately 63 billion tokens.
|
|
|
|
|
**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases. |
|
|
|
|
|
**For example**, the following models have been instruction fine-tuned based on ThaiLLM-8B: |
|
|
|
|
|
- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview |
|
|
|
|
|
- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa |
|
|
|
|
|
- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/ |
|
|
|
|
|
- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0 |
|
|
|
|
|
## Data |
|
|
|
|
|
The training corpus consists of the following datasets: |
|
|
|
|
|
| Dataset | Tokens | |
|
|
|---------|--------| |
|
|
| Fineweb2-ENG | 24,000,000,000 | |
|
|
| Fineweb2-TH | 31,525,674,209 | |
|
|
| CuratedData | 8,054,246,789 | |
|
|
|
|
|
### CuratedData Breakdown |
|
|
|
|
|
| Category | Token Count | |
|
|
|----------|-------------| |
|
|
| Business & Finance | 736,071,807 | |
|
|
| News | 1,700,662,378 | |
|
|
| Education | 576,489,778 | |
|
|
| Social | 211,000,000 | |
|
|
| Government | 40,492,117 | |
|
|
| Medical | 42,987,587 | |
|
|
| Conversation | 80,919,390 | |
|
|
| Code | 620,218 | |
|
|
| Research Articles | 4,185,649,758 | |
|
|
| Law | 467,994,847 | |
|
|
| Travel | 6,948,290 | |
|
|
| Others | 4,410,619 | |
|
|
|
|
|
*Token counts were calculated using the Qwen3 tokenizer.*
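
The counts above can be reproduced with the Qwen3 tokenizer. A minimal sketch (the sample texts are purely illustrative):

```python
from transformers import AutoTokenizer

# The base model shares the Qwen3 tokenizer; the counts above use this vocabulary.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")

def count_tokens(texts):
    """Sum token counts over an iterable of documents."""
    return sum(len(tokenizer(text)["input_ids"]) for text in texts)

print(count_tokens(["น้ำบริสุทธิ์มีค่า pH เท่ากับ 7", "Pure water has a pH of 7."]))
```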
|
|
|
|
|
## Requirements |
|
|
|
|
|
The code of Qwen3 has been integrated into the latest Hugging Face `transformers` library. We strongly recommend using the latest version of `transformers`. |
|
|
|
|
|
With `transformers<4.51.0`, you will encounter the following error: |
|
|
``` |
|
|
KeyError: 'qwen3' |
|
|
``` |
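
A quick way to confirm that your environment meets this requirement (a minimal check, not part of the official setup):

```python
import transformers
from packaging import version  # installed as a transformers dependency

# Qwen3-based checkpoints need transformers >= 4.51.0;
# older versions fail to load the model with KeyError: 'qwen3'.
assert version.parse(transformers.__version__) >= version.parse("4.51.0"), \
    f"transformers {transformers.__version__} is too old; run: pip install -U transformers"
```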
|
|
|
|
|
## Usage: Training
|
|
|
|
|
**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements. |
|
|
|
|
|
### Recommended Training Setup |
|
|
|
|
|
We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques. |
|
|
|
|
|
#### Quick Start with LLaMA-Factory |
|
|
|
|
|
```bash |
|
|
# Clone the repository |
|
|
git clone https://github.com/hiyouga/LLaMA-Factory.git |
|
|
cd LLaMA-Factory |
|
|
|
|
|
# Install dependencies |
|
|
pip install -e . |
|
|
|
|
|
# Example training command for LoRA |
|
|
llamafactory-cli train \ |
|
|
--model_name_or_path ThaiLLM/ThaiLLM-8B \ |
|
|
--stage sft \ |
|
|
--do_train \ |
|
|
--finetuning_type lora \ |
|
|
--dataset your_dataset \ |
|
|
--template qwen3 \ |
|
|
--cutoff_len 8192 \ |
|
|
--learning_rate 5e-05 \ |
|
|
--num_train_epochs 3.0 \ |
|
|
--per_device_train_batch_size 2 \ |
|
|
--gradient_accumulation_steps 8 \ |
|
|
--lr_scheduler_type cosine \ |
|
|
--max_grad_norm 1.0 \ |
|
|
--logging_steps 5 \ |
|
|
--save_steps 100 \ |
|
|
--warmup_steps 0 \ |
|
|
--output_dir saves/ThaiLLM-8B-lora \ |
|
|
--bf16 |
|
|
``` |
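
After training, the LoRA adapter saved under `saves/ThaiLLM-8B-lora` can be loaded on top of the base model for inference. A minimal sketch, assuming the `peft` library is installed (LLaMA-Factory also provides its own export tooling for merging adapters):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the fine-tuned LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "ThaiLLM/ThaiLLM-8B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "saves/ThaiLLM-8B-lora")
tokenizer = AutoTokenizer.from_pretrained("ThaiLLM/ThaiLLM-8B")
```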
|
|
|
|
|
## Usage: Inference
|
|
|
|
|
Below are code snippets to get started quickly with running the model. First, install the necessary libraries.
|
|
|
|
|
```bash |
|
|
pip install -U transformers torch accelerate |
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM
|
|
import torch |
|
|
|
|
|
model_id = "ThaiLLM/ThaiLLM-8B" |
|
|
|
|
|
# Load model and tokenizer |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
device_map="auto", |
|
|
torch_dtype=torch.bfloat16 |
|
|
) |
|
|
|
|
|
# Example prompt |
|
|
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด" |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
|
|
|
|
# Generate response |
|
|
with torch.inference_mode(): |
|
|
generate_ids = model.generate( |
|
|
inputs.input_ids, |
|
|
max_new_tokens=500, |
|
|
repetition_penalty=1.2, |
|
|
num_beams=1, |
|
|
do_sample=True, |
|
|
top_k=40, |
|
|
top_p=0.75, |
|
|
temperature=0.4, |
|
|
pad_token_id=tokenizer.eos_token_id, |
|
|
) |
|
|
|
|
|
response = tokenizer.batch_decode( |
|
|
generate_ids, |
|
|
skip_special_tokens=True, |
|
|
clean_up_tokenization_spaces=True |
|
|
)[0] |
|
|
|
|
|
print(response) |
|
|
``` |
|
|
|
|
|
## Benchmarks |
|
|
|
|
|
We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English. |
|
|
Each benchmark measures the probability of selecting the correct choice based on the model’s next-token prediction. |
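
As a rough illustration of this scoring scheme (a simplified sketch, not the exact evaluation harness; it assumes the question tokenization is a prefix of the question-plus-choice tokenization), each candidate choice can be scored by the log-probability the model assigns to its tokens given the question:

```python
import torch

def choice_logprob(model, tokenizer, question: str, choice: str) -> float:
    """Summed log-probability of the choice tokens, conditioned on the question."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.inference_mode():
        logits = model(full_ids).logits
    # Position i predicts token i + 1, so shift by one and keep only the choice span.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, prompt_len:]
    return log_probs[prompt_len - 1:].gather(1, targets.unsqueeze(1)).sum().item()

# The predicted answer is the choice with the highest score, e.g.:
# best = max(choices, key=lambda c: choice_logprob(model, tokenizer, question, c))
```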
|
|
|
|
|
### 1. Natural Language Understanding (NLU) |
|
|
|
|
|
| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ | |
|
|
|------|--------------:|-----------:|---:| |
|
|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 | |
|
|
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 | |
|
|
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.31396 | **0.48292** | +0.16896 | |
|
|
| ├── ONET | 0.4074 | **0.5864** | +0.1790 | |
|
|
| ├── IC | 0.5157 | **0.7052** | +0.1895 | |
|
|
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 | |
|
|
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 | |
|
|
| └── A-Level | 0.1653 | **0.5275** | +0.3622 | |
|
|
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 | |
|
|
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.54844 | **0.55792** | +0.00948 | |
|
|
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
|
|
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
|
|
| ├── Social | 0.5844 | **0.7088** | +0.1244 | |
|
|
| ├── Science | 0.4603 | **0.5238** | +0.0635 | |
|
|
| └── English | 0.7552 | **0.7864** | +0.0312 | |
|
|
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
|
|
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 | |
|
|
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 | |
|
|
|
|
|
--- |
|
|
|
|
|
### 2. Average Performance |
|
|
|
|
|
| Model | Average Score | |
|
|
|-------|--------------:| |
|
|
| Qwen3-8B-Base | 0.5987 | |
|
|
| ThaiLLM-8B | **0.6891** | |
|
|
|
|
|
> **Highlights**: |
|
|
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**. |
|
|
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29). |
|
|
> - Slight regressions appear in **MMLU-ENG**, **XNLI-Thai**, and the **Math** sub-task of M6Exam.
|
|
|
|
|
|
|
|
## Limitations |
|
|
|
|
|
- This is a base model and requires instruction fine-tuning for optimal performance |
|
|
- Performance on specialized domains may require domain-specific fine-tuning |
|
|
- As with all language models, outputs should be verified for accuracy in critical applications |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{qwen3technicalreport, |
|
|
title={Qwen3 Technical Report}, |
|
|
author={Qwen Team}, |
|
|
year={2025}, |
|
|
eprint={2505.09388}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2505.09388}, |
|
|
} |
|
|
``` |