# ThaiLLM-27B-Prescreen

ThaiLLM-27B-Prescreen is a reinforcement-learning fine-tuned version of [google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it), trained specifically for patient pre-screening. Given a patient profile and current symptoms, the model predicts the likely disease, recommends the appropriate hospital department, and estimates clinical severity.
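
The model returns its prediction in three XML-style tags (the same tags checked by the format reward below). An illustrative completion, with hypothetical disease and department values, might look like:

```text
<disease>influenza</disease>
<department>internal medicine</department>
<severity>Visit Hospital / Clinic</severity>
```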
## Training Details

The model was trained with Prime Intellect's prime-rl framework.
### Data

The model was trained on the [ThaiLLM/med-prescreen](https://huggingface.co/datasets/ThaiLLM/med-prescreen) dataset using Prime Intellect's verifiers framework.
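
A minimal sketch of loading the dataset from the Hub (inspect the dataset card for the exact splits and schema):

```python
from datasets import load_dataset

# Download the pre-screening dataset from the Hugging Face Hub
ds = load_dataset("ThaiLLM/med-prescreen")
print(ds)  # inspect the available splits and column names
```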
### Training Configuration

The prime-rl configuration used to train the model:

```toml
max_steps = 500
seq_len = 16384
[deployment]
type = "single_node"
num_train_gpus = 2
num_infer_gpus = 6
[inference.parallel]
dp = 6
[trainer.model]
attn = "flash_attention_3"
optimization_dtype = "bfloat16"
reduce_dtype = "bfloat16"
[trainer.model.lora]
rank = 64
alpha = 128
[trainer.model.ac]
[trainer.optim]
lr = 5e-5
[orchestrator]
batch_size = 512
rollouts_per_example = 16
num_train_workers = 2
[orchestrator.wandb.log_extras]
samples = true
interval = 1
[orchestrator.sampling]
max_tokens = 8192
[[orchestrator.env]]
id = "prescreen_classification"
name = "prescreen_classification"
[ckpt]
interval = 50
keep_interval = 50
```
## Reward Functions

The environment was developed following the verifiers framework and uses four reward functions, weighted [2.0, 1.0, 1.0, 0.3] respectively: exact disease match, exact department match, exact severity match, and output-format compliance. (The `_extract_tag` helper below is an assumed implementation, reconstructed from how it is used; the original is not shown in this card.)
```python
import re


def _extract_tag(text: str, tag: str) -> str | None:
    # Assumed helper (original implementation not shown in this card):
    # returns the contents of <tag>...</tag>, or None if the tag is absent.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


async def disease_reward(completion, answer):
    """1.0 for an exact (case-insensitive) disease match, else 0.0."""
    response = completion[-1]["content"]
    predicted = _extract_tag(response, "disease")
    if predicted is None:
        return 0.0
    return 1.0 if predicted.lower() == answer.get("disease", "").lower() else 0.0


async def department_reward(completion, answer):
    """1.0 for an exact (case-insensitive) department match, else 0.0."""
    response = completion[-1]["content"]
    predicted = _extract_tag(response, "department")
    if predicted is None:  # check before lowercasing to avoid an AttributeError
        return 0.0
    return 1.0 if predicted.lower() == answer.get("department", "").lower() else 0.0


async def severity_reward(completion, answer):
    """1.0 for an exact (case-insensitive) severity match, else 0.0."""
    response = completion[-1]["content"]
    predicted = _extract_tag(response, "severity")
    if predicted is None:
        return 0.0
    return 1.0 if predicted.lower() == answer.get("severity", "").lower() else 0.0


async def format_reward(completion, answer) -> float:
    """Fraction of the three required tags present outside the reasoning block."""
    response = completion[-1]["content"]
    # medgemma uses the <unused94> token instead of <think>
    text_without_think = re.sub(
        r"<unused94>.*?</unused94>", "", response, flags=re.DOTALL | re.IGNORECASE
    )
    tags = ["disease", "department", "severity"]
    present = sum(
        1 for t in tags
        if f"<{t}>" in text_without_think.lower() and f"</{t}>" in text_without_think.lower()
    )
    return present / len(tags)
```
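
These functions are combined by the environment's rubric. A minimal sketch of how they and their weights might be registered, assuming the verifiers `Rubric(funcs=..., weights=...)` interface (check the installed version for the exact signature):

```python
import verifiers as vf

# Weighted rubric: the rollout reward is the weighted sum of the four
# individual rewards, so correct disease prediction dominates at weight 2.0.
rubric = vf.Rubric(
    funcs=[disease_reward, department_reward, severity_reward, format_reward],
    weights=[2.0, 1.0, 1.0, 0.3],
)
```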
## Performance

We benchmark against four baselines spanning general-purpose reasoning models (Qwen3-30B-A3B-Thinking-2507, Qwen3-8B) and medical-domain models (medgemma-27b-text-it, medgemma1.5-4b-it). ThaiLLM-27B-Prescreen improves disease F1 by +0.448 over its base model (0.287 → 0.735) and outperforms Qwen3-30B-A3B-Thinking-2507 (0.515). Department routing also improves meaningfully (+0.048 F1 over the base, +0.077 over Qwen3-30B-A3B-Thinking-2507), with the largest gain in accuracy (0.436 → 0.677), suggesting the model is substantially better at picking the single correct department rather than hedging across plausible ones.

There is, however, a severity trade-off: severity F1 is slightly below the base medgemma-27b-text-it (0.571 vs. 0.601) and noticeably below Qwen3-30B-A3B-Thinking-2507 (0.659). ThaiLLM-27B-Prescreen nonetheless achieves the highest severity accuracy of any model tested (0.799), and the per-class breakdown below shows why the two metrics diverge: the model is strong on the two clinically consequential classes (Emergency and Visit Hospital / Clinic) but fails entirely on Observe at Home.
### Overall Performance (F1)
| Model | Disease | Department | Severity |
|---|---|---|---|
| Qwen3-30B-A3B-Thinking-2507 | 0.515 | 0.464 | 0.659 |
| Qwen3-8B | 0.157 | 0.449 | 0.574 |
| medgemma1.5-4b-it | 0.095 | 0.424 | 0.525 |
| medgemma-27b-text-it | 0.287 | 0.493 | 0.601 |
| ThaiLLM-27B-Prescreen | 0.735 | 0.541 | 0.571 |
### Disease Classification
| Model | F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| Qwen3-30B-A3B-Thinking-2507 | 0.515 | 0.562 | 0.509 | 0.510 |
| Qwen3-8B | 0.157 | 0.215 | 0.148 | 0.149 |
| medgemma1.5-4b-it | 0.095 | 0.131 | 0.082 | 0.076 |
| medgemma-27b-text-it | 0.287 | 0.336 | 0.266 | 0.286 |
| ThaiLLM-27B-Prescreen | 0.735 | 0.776 | 0.730 | 0.729 |
### Department Classification
| Model | F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| Qwen3-30B-A3B-Thinking-2507 | 0.464 | 0.466 | 0.677 | 0.420 |
| Qwen3-8B | 0.449 | 0.419 | 0.648 | 0.358 |
| medgemma1.5-4b-it | 0.424 | 0.394 | 0.541 | 0.358 |
| medgemma-27b-text-it | 0.493 | 0.469 | 0.678 | 0.436 |
| ThaiLLM-27B-Prescreen | 0.541 | 0.606 | 0.518 | 0.677 |
### Severity Classification
| Model | F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| Qwen3-30B-A3B-Thinking-2507 | 0.659 | 0.722 | 0.639 | 0.774 |
| Qwen3-8B | 0.574 | 0.858 | 0.601 | 0.771 |
| medgemma1.5-4b-it | 0.525 | 0.616 | 0.529 | 0.715 |
| medgemma-27b-text-it | 0.601 | 0.835 | 0.609 | 0.755 |
| ThaiLLM-27B-Prescreen | 0.571 | 0.548 | 0.599 | 0.799 |

Per-class severity results for ThaiLLM-27B-Prescreen:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Emergency | 0.878 | 0.857 | 0.867 | 84 |
| Observe At Home | 0.000 | 0.000 | 0.000 | 36 |
| Visit Hospital / Clinic | 0.767 | 0.940 | 0.845 | 168 |
The model never predicts Observe at Home; those 36 cases are absorbed into Visit Hospital / Clinic instead. The collapse of the Observe at Home class is a real limitation of the system and should be taken into account when deploying the model.
## Usage

The model expects a specific system prompt (defined in `system_prompt.py`). The lists of possible diseases and departments can be retrieved from https://github.com/vistec-AI/thaillm-prescreen-rulesets/blob/main/v1/const/diseases.yaml and https://github.com/vistec-AI/thaillm-prescreen-rulesets/blob/main/v1/const/departments.yaml, respectively.
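
A minimal sketch of fetching those lists, assuming the standard raw-file equivalents of the blob URLs above (the YAML schema is not documented in this card, so the sketch only loads and prints them):

```python
import requests
import yaml

# Raw GitHub URLs derived from the blob links above
BASE = "https://raw.githubusercontent.com/vistec-AI/thaillm-prescreen-rulesets/main/v1/const"

diseases = yaml.safe_load(requests.get(f"{BASE}/diseases.yaml", timeout=30).text)
departments = yaml.safe_load(requests.get(f"{BASE}/departments.yaml", timeout=30).text)
print(diseases, departments)
```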
### vLLM

```bash
uv run --with vllm vllm serve google/medgemma-27b-text-it \
    --enable-lora \
    --lora-modules prescreen=ThaiLLM/ThaiLLM-27B-Prescreen \
    --max-lora-rank 64
```
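
Once the server is running, the adapter can be queried through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port 8000; the system prompt is a placeholder (substitute the real one from `system_prompt.py`) and the patient description is illustrative:

```python
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "..."  # placeholder: load the real prompt from system_prompt.py

response = client.chat.completions.create(
    model="prescreen",  # the --lora-modules name registered above
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "56-year-old male, chest pain radiating to the left arm, "
            "sweating, onset 30 minutes ago.",
        },
    ],
    max_tokens=8192,
)
text = response.choices[0].message.content


def extract_tag(text: str, tag: str) -> str | None:
    """Return the contents of <tag>...</tag>, or None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL | re.IGNORECASE)
    return m.group(1).strip() if m else None


print({t: extract_tag(text, t) for t in ("disease", "department", "severity")})
```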