# ThaiLLM-30B-IQ (Medical)

## Model Description

ThaiLLM-30B-IQ (Medical) is a Thai-language large language model specialized for medical information queries and citation-grounded question answering.

The model is fine-tuned from ThaiLLM-30B via LoRA-based supervised fine-tuning (SFT) on medical-domain data. It is designed for retrieval-augmented generation (RAG) workflows in which answers must be generated only from the provided medical contexts and returned with explicit citations.

This model does not include a retriever and must be paired with externally supplied medical documents.
## Intended Use

### Suitable for
- Medical RAG systems (Thai)
- Medical document question answering
- Citation-grounded medical QA
- Medical education and evaluation
### Not suitable for
- Medical diagnosis or treatment decisions
- Patient-facing clinical advice
- Emergency or critical-care use
## Model Details
- Base model: ThaiLLM-30B
- Domain: Medical
- Model type: Decoder-only causal language model
- Fine-tuning: Supervised fine-tuning (LoRA)
- Language: Thai
- Precision: bfloat16
## Performance (Medical Information Query Setting)
| Model | Response (BLEU) | Citations (Jaccard) |
|---|---|---|
| ThaiLLM-30B (base) | 0.4060 | 0.1786 |
| ThaiLLM-30B-SFT-IQ (Medical) | 0.4331 | 0.5458 |
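The citation score is the Jaccard similarity between the predicted and reference citation-ID sets. A minimal sketch of that metric (the exact evaluation script is not published here, so this is an assumption about how the score is computed):

```python
def citation_jaccard(predicted, gold):
    """Jaccard similarity between predicted and gold citation-ID sets."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty: treat as perfect agreement
    return len(p & g) / len(p | g)

# Example: one of two predicted IDs matches the single gold ID.
print(citation_jaccard(["1", "3"], ["1"]))  # |{1}| / |{1, 3}| = 0.5
```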
## Training
- Framework: LLaMA-Factory
- Method: LoRA-based supervised fine-tuning
| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-4 |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Sequence length | 4096 |
| Epochs | 3 |
| Batch size (effective) | 4 (per-device 1 × grad. accumulation 4) |
Training data consists of Thai medical question–answer pairs with context grounding and citation supervision.
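The batch size of 4 above is the effective batch size: the launch script runs a per-device batch of 1 with 4 gradient-accumulation steps. A small sketch making that relationship, and the LoRA scaling factor, explicit:

```python
# Hyperparameters copied from the table above.
lora_cfg = {
    "lora_rank": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "sequence_length": 4096,
    "epochs": 3,
}

# LoRA updates are scaled by alpha / rank; here the scaling is 1.0.
scaling = lora_cfg["lora_alpha"] / lora_cfg["lora_rank"]

# The table's batch size of 4 is the effective batch:
# per-device batch 1 x 4 gradient-accumulation steps.
per_device_batch = 1
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps

print(scaling, effective_batch)  # 1.0 4
```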
### Training Script

```bash
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path ${MODEL_PATH} \
    --preprocessing_num_workers 16 \
    --deepspeed ./examples/deepspeed/ds_z2_config.json \
    --finetuning_type lora \
    --template qwen3 \
    --flash_attn auto \
    --dataset thaillm-SFT \
    --cutoff_len 4096 \
    --learning_rate 2e-4 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing True \
    --gradient_checkpointing_kwargs '{"use_reentrant":false}' \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --enable_thinking True \
    --report_to tensorboard \
    --quantization_bit 4 \
    --output_dir "./checkpoints/ThaiLLM-30B-SFT-IQ" \
    --overwrite_output_dir True \
    --bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --include_num_input_tokens_seen True \
    --optim adamw_8bit \
    --lora_rank 16 \
    --lora_alpha 16 \
    --upcast_layernorm True \
    --ddp_timeout 180000000 \
    --lora_dropout 0 \
    --lora_target all \
    --freeze_vision_tower True \
    --freeze_multi_modal_projector True \
    --use_unsloth False \
    --seed 1234
```
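LLaMA-Factory resolves `--dataset thaillm-SFT` through its dataset registry; the exact schema of that dataset is not published here. A hypothetical record in the common Alpaca-style format (the field names and placeholder content are assumptions) would pair the instruction and grounded facts with a strict-JSON target:

```python
import json

# Hypothetical thaillm-SFT record in LLaMA-Factory's Alpaca-style schema;
# field names and placeholder content below are illustrative assumptions.
record = {
    "instruction": "Answer in JSON format with citations only. "
                   "Use only the provided medical contexts to answer the question.",
    "input": "Question:\n<QUESTION_TEXT>\nFacts:\n[1] <FACT_TEXT>\n[2] <FACT_TEXT>",
    # The supervised target is itself strict JSON.
    "output": json.dumps({"answer": "<ANSWER_TEXT>", "citations": ["1"]}),
}

# The target must round-trip as JSON for citation supervision to be checkable.
parsed = json.loads(record["output"])
print(parsed["citations"])  # ['1']
```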
## Usage

The model is optimized for strict JSON output and context-only medical answers.

### Expected Instruction Format
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ThaiLLM-30B-SFT-IQ"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

system_prompt = """\
Answer in JSON format with citations only.
Use only the provided medical contexts to answer the question.
Include the fact IDs that support your answer in the following format:
{"answer": "<ANSWER_TEXT>", "citations": ["<FACT_ID1>", "<FACT_ID2>"]}
If unknown, respond with:
{"answer": "unknown", "citations": []}
"""

# Thai: "Where can pain from a traumatic jaw-joint injury radiate to when chewing?"
question = "การบาดเจ็บจากอุบัติเหตุที่ข้อต่อขากรรไกรสามารถทำให้อาการปวดร้าวไปที่ใดเมื่อเคี้ยวอาหาร?"

# Thai facts:
# [1] "A traumatic jaw-joint injury can cause pain radiating to the ear when chewing."
# [2] "Temporomandibular joint (TMJ) inflammation can cause pain near the earlobe,
#      radiating to the ear, when chewing."
facts = """\
[1] การบาดเจ็บจากอุบัติเหตุที่ข้อต่อขากรรไกรสามารถทำให้ปวดร้าวไปที่หูเมื่อเคี้ยวอาหาร
[2] ข้อต่อขากรรไกรอักเสบ (TMJ) สามารถทำให้ปวดบริเวณใกล้ติ่งหูและร้าวไปที่หูเมื่อเคี้ยวอาหาร
"""

prompt = f"""{system_prompt}
Question:
{question}
Facts:
{facts}
"""

# Build the chat-formatted input tensor.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding keeps the strict JSON output deterministic.
with torch.inference_mode():
    outputs = model.generate(
        inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# Decode only the newly generated tokens, not the prompt.
generated = outputs[0, inputs.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
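Because the model is trained to emit strict JSON, the decoded text can be validated before downstream use. A minimal sketch of such a check (the helper name is an assumption; the schema follows the system prompt above):

```python
import json

def parse_model_answer(text):
    """Parse the model's strict-JSON response; return None if malformed."""
    try:
        data = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    # Schema from the system prompt: an answer string plus a list of fact IDs.
    if not isinstance(data.get("answer"), str):
        return None
    if not isinstance(data.get("citations"), list):
        return None
    return data

# A well-formed "unknown" response passes; free text does not.
print(parse_model_answer('{"answer": "unknown", "citations": []}'))
print(parse_model_answer("I am not sure."))  # None
```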