|
|
--- |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# ThaiLLM-8B
|
|
|
|
|
This model is a continued pre-training of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), trained on a diverse corpus of approximately 63 billion tokens.
|
|
|
|
|
**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases. |
|
|
|
|
|
**For example**, the following models have been instruction fine-tuned based on ThaiLLM-8B: |
|
|
|
|
|
- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview |
|
|
|
|
|
- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa |
|
|
|
|
|
- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/ |
|
|
|
|
|
- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0 |
|
|
|
|
|
## Data |
|
|
|
|
|
The training corpus consists of the following datasets: |
|
|
|
|
|
| Dataset | Tokens | |
|
|
|---------|--------| |
|
|
| Fineweb2-ENG | 24,000,000,000 | |
|
|
| Fineweb2-TH | 31,525,674,209 | |
|
|
| CuratedData | 8,054,246,789 | |
|
|
|
|
|
### CuratedData Breakdown |
|
|
|
|
|
| Category | Token Count | |
|
|
|----------|-------------| |
|
|
| Business & Finance | 736,071,807 | |
|
|
| News | 1,700,662,378 | |
|
|
| Education | 576,489,778 | |
|
|
| Social | 211,000,000 | |
|
|
| Government | 40,492,117 | |
|
|
| Medical | 42,987,587 | |
|
|
| Conversation | 80,919,390 | |
|
|
| Code | 620,218 | |
|
|
| Research Articles | 4,185,649,758 | |
|
|
| Law | 467,994,847 | |
|
|
| Travel | 6,948,290 | |
|
|
| Others | 4,410,619 | |
|
|
|
|
|
*Token counts were calculated using the Qwen3 tokenizer.*
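
The counts above can be reproduced with the Qwen3 tokenizer. A minimal sketch (the sample texts are purely illustrative):

```python
from transformers import AutoTokenizer

# The base model shares the Qwen3 tokenizer; the counts above use this vocabulary.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")

def count_tokens(texts):
    """Sum token counts over an iterable of documents."""
    return sum(len(tokenizer(text)["input_ids"]) for text in texts)

print(count_tokens(["น้ำบริสุทธิ์มีค่า pH เท่ากับ 7", "Pure water has a pH of 7."]))
```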
|
|
|
|
|
## Requirements |
|
|
|
|
|
The code of Qwen3 has been integrated into the latest Hugging Face `transformers` library. We strongly recommend using the latest version of `transformers`. |
|
|
|
|
|
With `transformers<4.51.0`, you will encounter the following error: |
|
|
``` |
|
|
KeyError: 'qwen3' |
|
|
``` |
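
A quick way to confirm that your environment meets this requirement (a minimal check, not part of the official setup):

```python
import transformers
from packaging import version  # installed as a transformers dependency

# Qwen3-based checkpoints need transformers >= 4.51.0;
# older versions fail to load the model with KeyError: 'qwen3'.
assert version.parse(transformers.__version__) >= version.parse("4.51.0"), \
    f"transformers {transformers.__version__} is too old; run: pip install -U transformers"
```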
|
|
|
|
|
## Usage: Training
|
|
|
|
|
**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements. |
|
|
|
|
|
### Recommended Training Setup |
|
|
|
|
|
We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques. |
|
|
|
|
|
#### Quick Start with LLaMA-Factory |
|
|
|
|
|
```bash |
|
|
# Clone the repository |
|
|
git clone https://github.com/hiyouga/LLaMA-Factory.git |
|
|
cd LLaMA-Factory |
|
|
|
|
|
# Install dependencies |
|
|
pip install -e . |
|
|
|
|
|
# Example training command for LoRA |
|
|
llamafactory-cli train \ |
|
|
--model_name_or_path ThaiLLM/ThaiLLM-8B \ |
|
|
--stage sft \ |
|
|
--do_train \ |
|
|
--finetuning_type lora \ |
|
|
--dataset your_dataset \ |
|
|
--template qwen3 \ |
|
|
--cutoff_len 8192 \ |
|
|
--learning_rate 5e-05 \ |
|
|
--num_train_epochs 3.0 \ |
|
|
--per_device_train_batch_size 2 \ |
|
|
--gradient_accumulation_steps 8 \ |
|
|
--lr_scheduler_type cosine \ |
|
|
--max_grad_norm 1.0 \ |
|
|
--logging_steps 5 \ |
|
|
--save_steps 100 \ |
|
|
--warmup_steps 0 \ |
|
|
--output_dir saves/ThaiLLM-8B-lora \ |
|
|
--bf16 |
|
|
``` |
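
After training, the LoRA adapter saved under `saves/ThaiLLM-8B-lora` can be loaded on top of the base model for inference. A minimal sketch, assuming the `peft` library is installed (LLaMA-Factory also provides its own export tooling for merging adapters):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the fine-tuned LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "ThaiLLM/ThaiLLM-8B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "saves/ThaiLLM-8B-lora")
tokenizer = AutoTokenizer.from_pretrained("ThaiLLM/ThaiLLM-8B")
```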
|
|
|
|
|
## Usage: Inference
|
|
|
|
|
Below are code snippets to get started quickly with running the model. First, install the necessary libraries.
|
|
|
|
|
```bash |
|
|
pip install -U transformers torch accelerate |
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM
|
|
import torch |
|
|
|
|
|
model_id = "ThaiLLM/ThaiLLM-8B" |
|
|
|
|
|
# Load model and tokenizer |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
device_map="auto", |
|
|
torch_dtype=torch.bfloat16 |
|
|
) |
|
|
|
|
|
# Example prompt |
|
|
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด" |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
|
|
|
|
# Generate response |
|
|
with torch.inference_mode(): |
|
|
generate_ids = model.generate( |
|
|
inputs.input_ids, |
|
|
max_new_tokens=500, |
|
|
repetition_penalty=1.2, |
|
|
num_beams=1, |
|
|
do_sample=True, |
|
|
top_k=40, |
|
|
top_p=0.75, |
|
|
temperature=0.4, |
|
|
pad_token_id=tokenizer.eos_token_id, |
|
|
) |
|
|
|
|
|
response = tokenizer.batch_decode( |
|
|
generate_ids, |
|
|
skip_special_tokens=True, |
|
|
clean_up_tokenization_spaces=True |
|
|
)[0] |
|
|
|
|
|
print(response) |
|
|
``` |
|
|
|
|
|
## Benchmarks |
|
|
|
|
|
We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English. |
|
|
Each benchmark measures the probability of selecting the correct choice based on the model’s next-token prediction. |
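
As a rough illustration of this scoring scheme (a simplified sketch, not the exact evaluation harness; it assumes the question tokenization is a prefix of the question-plus-choice tokenization), each candidate choice can be scored by the log-probability the model assigns to its tokens given the question:

```python
import torch

def choice_logprob(model, tokenizer, question: str, choice: str) -> float:
    """Summed log-probability of the choice tokens, conditioned on the question."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.inference_mode():
        logits = model(full_ids).logits
    # Position i predicts token i + 1, so shift by one and keep only the choice span.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, prompt_len:]
    return log_probs[prompt_len - 1:].gather(1, targets.unsqueeze(1)).sum().item()

# The predicted answer is the choice with the highest score, e.g.:
# best = max(choices, key=lambda c: choice_logprob(model, tokenizer, question, c))
```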
|
|
|
|
|
### 1. Natural Language Understanding (NLU) |
|
|
|
|
|
| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ | |
|
|
|------|--------------:|-----------:|---:| |
|
|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 | |
|
|
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 | |
|
|
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.31396 | **0.48292** | +0.16896 | |
|
|
| ├── ONET | 0.4074 | **0.5864** | +0.1790 | |
|
|
| ├── IC | 0.5157 | **0.7052** | +0.1895 | |
|
|
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 | |
|
|
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 | |
|
|
| └── A-Level | 0.1653 | **0.5275** | +0.3622 | |
|
|
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 | |
|
|
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.54844 | **0.55792** | +0.00948 | |
|
|
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
|
|
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
|
|
| ├── Social | 0.5844 | **0.7088** | +0.1244 | |
|
|
| ├── Science | 0.4603 | **0.5238** | +0.0635 | |
|
|
| └── English | 0.7552 | **0.7864** | +0.0312 | |
|
|
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
|
|
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 | |
|
|
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 | |
|
|
|
|
|
--- |
|
|
|
|
|
### 2. Average Performance |
|
|
|
|
|
| Model | Average Score | |
|
|
|-------|--------------:| |
|
|
| Qwen3-8B-Base | 0.5987 | |
|
|
| ThaiLLM-8B | **0.6891** | |
|
|
|
|
|
> **Highlights**: |
|
|
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**. |
|
|
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29). |
|
|
> - Slight regressions appear in **MMLU-ENG**, **XNLI-Thai**, and the **Math** sub-task of M6Exam.
|
|
|
|
|
|
|
|
## Limitations |
|
|
|
|
|
- This is a base model and requires instruction fine-tuning for optimal performance |
|
|
- Performance on specialized domains may require domain-specific fine-tuning |
|
|
- As with all language models, outputs should be verified for accuracy in critical applications |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{qwen3technicalreport, |
|
|
title={Qwen3 Technical Report}, |
|
|
author={Qwen Team}, |
|
|
year={2025}, |
|
|
eprint={2505.09388}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2505.09388}, |
|
|
} |
|
|
``` |