---
license: apache-2.0
library_name: transformers
---

# ThaiLLM-8B Info

This model is a continued pre-training of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), trained on a diverse corpus of approximately 63 billion tokens.

**Important Note**: This is a base model that requires instruction fine-tuning to align with specific user requirements and use cases.

**For example**, the following models have been instruction fine-tuned based on ThaiLLM-8B:

- Typhoon by SCB10X: https://huggingface.co/typhoon-ai/typhoon-s-thaillm-8b-instruct-research-preview
- THaLLE by KBTG: https://huggingface.co/KBTG-Labs/THaLLE-0.2-ThaiLLM-8B-fa
- OpenThaiGPT by AIEAT: https://huggingface.co/openthaigpt/openthaigpt-thaillm-8b-instruct-v0.7.2-research-preview/
- Pathumma by NECTEC: https://huggingface.co/nectec/Pathumma-ThaiLLM-qwen3-8b-it-2.0.0

## Data

The training corpus consists of the following datasets:

| Dataset | Tokens |
|---------|--------|
| Fineweb2-ENG | 24,000,000,000 |
| Fineweb2-TH | 31,525,674,209 |
| CuratedData | 8,054,246,789 |

### CuratedData Breakdown

| Category | Token Count |
|----------|-------------|
| Business & Finance | 736,071,807 |
| News | 1,700,662,378 |
| Education | 576,489,778 |
| Social | 211,000,000 |
| Government | 40,492,117 |
| Medical | 42,987,587 |
| Conversation | 80,919,390 |
| Code | 620,218 |
| Research Articles | 4,185,649,758 |
| Law | 467,994,847 |
| Travel | 6,948,290 |
| Others | 4,410,619 |

*Token counts calculated using the Qwen3 tokenizer.*

## Requirements

Qwen3 support has been integrated into the latest Hugging Face `transformers` library, and we strongly recommend using the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the following error:

```
KeyError: 'qwen3'
```

## Usage: Training

**Important**: This is a base model and requires instruction fine-tuning before use to ensure optimal performance for your specific tasks and requirements.

### Recommended Training Setup

We recommend using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory.git) for instruction fine-tuning. This framework provides an easy-to-use interface for training language models with various optimization techniques.

#### Quick Start with LLaMA-Factory

```bash
# Clone the repository
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Install dependencies
pip install -e .

# Example training command for LoRA
llamafactory-cli train \
    --model_name_or_path ThaiLLM/ThaiLLM-8B \
    --stage sft \
    --do_train \
    --finetuning_type lora \
    --dataset your_dataset \
    --template qwen3 \
    --cutoff_len 8192 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --output_dir saves/ThaiLLM-8B-lora \
    --bf16
```
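After fine-tuning, the resulting LoRA adapter can be merged back into the base weights for standalone deployment. The snippet below is a minimal sketch, not part of the official workflow; it assumes the adapter was saved to `saves/ThaiLLM-8B-lora` (the `--output_dir` above) and that the `peft` package is installed.

```python
# Minimal sketch (assumptions: adapter path from the command above, `peft` installed)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ThaiLLM/ThaiLLM-8B"
adapter_dir = "saves/ThaiLLM-8B-lora"  # --output_dir used during training

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the trained LoRA adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base_model, adapter_dir)
merged = model.merge_and_unload()

# Save a standalone fine-tuned checkpoint
merged.save_pretrained("ThaiLLM-8B-sft-merged")
tokenizer.save_pretrained("ThaiLLM-8B-sft-merged")
```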
## Usage: Inference

Below are code snippets to get started quickly with running the model. First, install the necessary libraries:

```bash
pip install -U transformers torch accelerate
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ThaiLLM/ThaiLLM-8B"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Example prompt ("What is the pH of pure water?")
prompt = "น้ำบริสุทธิ์มีค่า pH เท่าใด"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
with torch.inference_mode():
    generate_ids = model.generate(
        inputs.input_ids,
        max_new_tokens=500,
        repetition_penalty=1.2,
        num_beams=1,
        do_sample=True,
        top_k=40,
        top_p=0.75,
        temperature=0.4,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)[0]

print(response)
```

## Benchmarks

We evaluated **ThaiLLM-8B** against **Qwen3-8B-Base** using multiple-choice question datasets in both Thai and English. Each benchmark measures the probability of selecting the correct choice based on the model's next-token prediction (see the scoring sketch at the end of this card).

### 1. Natural Language Understanding (NLU)

| Task | Qwen3-8B-Base | ThaiLLM-8B | Δ |
|------|--------------:|-----------:|---:|
| **[MMLU](https://huggingface.co/datasets/cais/mmlu) (ENG, 5-shot)** | **0.7691** | 0.7565 | -0.0126 |
| **[MMLU (TH)](https://huggingface.co/datasets/SeaLLMs/SeaExam/)** | 0.6259 | **0.6459** | +0.0200 |
| **[ThaiExam](https://huggingface.co/datasets/scb10x/thai_exam) Avg.** (ONET, IC, TGAT, TPAT-1, A-Level) | 0.31396 | **0.48292** | +0.16896 |
| ├── ONET | 0.4074 | **0.5864** | +0.1790 |
| ├── IC | 0.5157 | **0.7052** | +0.1895 |
| ├── TGAT | 0.3384 | **0.6307** | +0.2923 |
| ├── TPAT-1 | 0.1379 | **0.3965** | +0.2586 |
| └── A-Level | 0.1653 | **0.5275** | +0.3622 |
| [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) | 0.5802 | **0.6369** | +0.0567 |
| [M6Exam](https://huggingface.co/datasets/openthaigpt/thai-onet-m6-exam) Avg. | 0.54844 | **0.55792** | +0.00948 |
| ├── Thai | 0.4833 | **0.5023** | +0.0190 |
| ├── Math | **0.4090** | 0.2727 | -0.1363 |
| ├── Social | 0.5844 | **0.7088** | +0.1244 |
| ├── Science | 0.4603 | **0.5238** | +0.0635 |
| └── English | 0.7552 | **0.7864** | +0.0312 |
| [XNLI-Thai](https://huggingface.co/datasets/facebook/xnli/viewer/th) | **0.7529** | 0.6667 | -0.0862 |
| [XCOPA-Thai](https://github.com/cambridgeltl/xcopa/blob/master/data/th/test.th.jsonl) | 0.8220 | **0.8340** | +0.0120 |
| [Belebele-Thai](https://huggingface.co/datasets/facebook/belebele/viewer/tha_Thai) | 0.3880 | **0.8447** | +0.4567 |

---

### 2. Average Performance

| Model | Average Score |
|-------|--------------:|
| Qwen3-8B-Base | 0.5987 |
| ThaiLLM-8B | **0.6891** |

> **Highlights**:
> - **ThaiLLM-8B** shows large improvements in **ThaiExam**, **Belebele-Thai**, and **MMLU-TH**.
> - Gains are especially strong in **A-Level** (+0.36) and **TGAT** (+0.29).
> - Slight regressions are seen in **MMLU-ENG** and the **Math** subset of M6Exam.

## Limitations

- This is a base model and requires instruction fine-tuning for optimal performance.
- Performance on specialized domains may require domain-specific fine-tuning.
- As with all language models, outputs should be verified for accuracy in critical applications.

## Citation

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
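## Appendix: Multiple-Choice Scoring Sketch

The benchmark scores above are based on next-token likelihood. The snippet below is an illustrative sketch only, not the exact evaluation harness or prompt format used: each answer letter is scored by the log-probability the model assigns to it as the next token after the question, and the highest-scoring option is taken as the prediction. The question text and answer options here are hypothetical.

```python
# Illustrative sketch of likelihood-based multiple-choice scoring
# (the actual evaluation harness and prompt format may differ).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ThaiLLM/ThaiLLM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

question = (
    "Question: What is the pH of pure water?\n"
    "A. 5\nB. 6\nC. 7\nD. 8\n"
    "Answer:"
)
choices = [" A", " B", " C", " D"]  # hypothetical single-token answer options

inputs = tokenizer(question, return_tensors="pt").to(model.device)
with torch.inference_mode():
    logits = model(**inputs).logits[0, -1]        # next-token logits
    log_probs = torch.log_softmax(logits, dim=-1)

# Log-probability assigned to each answer letter as the next token
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]
scores = {c.strip(): log_probs[i].item() for c, i in zip(choices, choice_ids)}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```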