# 🧬 PathQwen2.5 – Multi-task Pathology LLM for TCGA Cancer Reports
PathQwen2.5 is a LoRA fine-tune of `unsloth/Qwen2.5-7B-Instruct-bnb-4bit` on
45,518 multi-task QA pairs derived from 8,459 TCGA pathology reports. From a
single pathology report, the model jointly extracts 9 clinical fields:
| Field | Type | Label space |
|---|---|---|
| cancer_type | str | 32 TCGA studyId values (paper-comparable) |
| primary_site | str | 49 anatomical primary sites |
| histology | str | ICD-O-3 morphology code |
| ajcc_stage | str | Stage I / Stage II / Stage III / Stage IV |
| t_stage | str | T0–T4, Tis, TX |
| n_stage | str | N0–N3, NX |
| m_stage | str | M0, M1, MX |
| prior_malignancy | bool | patient had a prior cancer |
| prognosis_good | bool | survives > per-cancer mean DSS |
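For downstream code, the nine fields above can be pinned down as a typed record. A minimal sketch; the `PathReportFields` name is illustrative, not part of the released code:

```python
from typing import Optional, TypedDict

class PathReportFields(TypedDict):
    """One extracted record per pathology report; None marks an undetermined field."""
    cancer_type: Optional[str]        # one of 32 TCGA studyId values
    primary_site: Optional[str]       # one of 49 anatomical sites
    histology: Optional[str]          # ICD-O-3 morphology code, e.g. "8500/3"
    ajcc_stage: Optional[str]         # "Stage I" .. "Stage IV"
    t_stage: Optional[str]            # T0-T4, Tis, TX
    n_stage: Optional[str]            # N0-N3, NX
    m_stage: Optional[str]            # M0, M1, MX
    prior_malignancy: Optional[bool]
    prognosis_good: Optional[bool]

record: PathReportFields = {
    "cancer_type": "brca_tcga_gdc", "primary_site": "Breast",
    "histology": "8500/3", "ajcc_stage": "Stage II", "t_stage": "T2",
    "n_stage": "N1", "m_stage": "M0",
    "prior_malignancy": False, "prognosis_good": True,
}
```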
Built to extend Saluja et al., *Cancer type, stage and prognosis assessment from pathology reports using LLMs* (Scientific Reports, 2025), with 2.6× more training data and 3× more tasks (adds T/N/M stage, site, histology, prior malignancy).
## 📊 Test-set evaluation (TCGA, n=1,266 held-out patients)
Held-out test set with a locked stratified split (5,919 / 1,266 / 1,266 by studyId × event_status). Numbers below use the per-task extraction prompts (matching the training distribution).
### Multi-task accuracy + macro-F1
| Task | n | Accuracy | Macro-F1 | Saluja 2025 acc | Notes |
|---|---|---|---|---|---|
| cancer_type (32 TCGA studies) | 1,266 | 0.922 | 0.871 | 0.96 | near-paper-grade |
| primary_site (49 classes) | 1,251 | 0.895 | 0.350 | – | novel task, excellent |
| histology (ICD-O-3) | 1,251 | 0.669 | 0.185 | – | novel task, solid |
| ajcc_stage (I/II/III/IV) | 810 | 0.503 | 0.349 | 0.85 | improvable to ~0.78 with CoT v2 |
| t_stage (T0–T4 / Tis / TX) | 930 | 0.793 | 0.450 | – | novel task |
| n_stage (N0–N3 / NX) | 917 | 0.823 | 0.655 | – | novel task |
| m_stage (M0 / M1 / MX) | 809 | 0.633 | 0.387 | – | novel task |
| prior_malignancy | 1,190 | 0.892 | 0.320 | – | novel task, excellent |
| prognosis_good (binary) | 1,266 | 0.434 | 0.281 | 0.55 | matches paper |
Inference mode: use the per-task prompts (snippet below); they match the training distribution and produce the numbers above. A faster joint single-prompt mode is also available, but it produces free-text drift on closed-set tasks (~30% accuracy drop).
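For reference, the accuracy and macro-F1 columns are the standard scikit-learn definitions. A toy sketch with illustrative labels (not actual test-set predictions):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels standing in for one task's test-set predictions
y_true = ["Stage II", "Stage I", "Stage III", "Stage II"]
y_pred = ["Stage II", "Stage I", "Stage II",  "Stage II"]

acc = accuracy_score(y_true, y_pred)                  # fraction exactly correct
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"acc={acc:.3f} macro_f1={macro_f1:.3f}")
# -> acc=0.750 macro_f1=0.600
```

Macro-F1 penalizes rare classes equally with common ones, which is why it sits well below accuracy on the highly imbalanced open-set tasks (e.g. primary_site: 0.895 accuracy vs 0.350 macro-F1).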
## ✨ Recommended prompt – the one the model was trained on
The model was fine-tuned with 9 separate per-task prompts (one question per QA pair). Using the exact training-time prompts gives the best accuracy.
### Per-task system + user templates
```python
SYSTEM_PROMPT = (
    "You are an expert pathology AI assistant. "
    "Analyze the pathology report below and extract the requested field. "
    "Respond ONLY with a single-line JSON object matching the requested schema field. "
    "Do not include any explanations, headers, or prose."
)

TASK_PROMPTS = {
    "cancer_type": 'What is the TCGA study cancer type? Output: {"cancer_type": "<label>"}',
    "primary_site": 'What is the anatomical primary site? Output: {"primary_site": "<text>"}',
    "histology": 'What is the histological diagnosis (ICD-O-3 morphology)? Output: {"histology": "<text>"}',
    "ajcc_stage": 'What is the AJCC overall pathological stage (Stage I/II/III/IV)? Output: {"ajcc_stage": "<label>"}',
    "t_stage": 'What is the pathological T stage (T0–T4, Tis, TX)? Output: {"t_stage": "<label>"}',
    "n_stage": 'What is the pathological N stage (N0–N3, NX)? Output: {"n_stage": "<label>"}',
    "m_stage": 'What is the pathological M stage (M0, M1, MX)? Output: {"m_stage": "<label>"}',
    "prior_malignancy": 'Did this patient have a prior malignancy? Output: {"prior_malignancy": <true|false>}',
    "prognosis_good": 'Will this patient likely survive past the mean disease-specific survival time for their cancer type? Output: {"prognosis_good": <true|false>}',
}

def build_messages(report_text: str, task: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"## Pathology Report:\n{report_text}\n\n## Question:\n{TASK_PROMPTS[task]}"},
    ]
```
## 🚀 Quick start – single task
```python
import json, torch, json_repair
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

ADAPTER = "drkareemkamal/PathQwen2.5"
BASE = "Qwen/Qwen2.5-7B-Instruct"

# Load base + LoRA adapter in 4-bit
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb,
                                            device_map="auto",
                                            torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

# --- Pick a task and build the trained prompt ---
report = """
SURGICAL PATHOLOGY REPORT
Specimen: Left breast lumpectomy.
Diagnosis: Invasive ductal carcinoma, grade 2, tumor size 2.4 cm.
Lymph nodes: 2 of 14 positive. No distant metastasis.
AJCC: pT2 N1 M0, Stage IIB.
"""
task = "ajcc_stage"

messages = build_messages(report, task)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Greedy decode for deterministic output
inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                   max_length=4096).to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
answer = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
# -> {"ajcc_stage": "Stage II"}

# Robust parse (handles minor LLM JSON glitches)
parsed = json_repair.loads(answer.strip().splitlines()[-1])
print(parsed[task])  # -> "Stage II"
```
## 🚀 Quick start – extract all 9 fields (per-task, recommended)
```python
TASKS = ["cancer_type", "primary_site", "histology", "ajcc_stage",
         "t_stage", "n_stage", "m_stage", "prior_malignancy", "prognosis_good"]

def extract_all(report: str) -> dict:
    """Run the model 9 times (one per task) and merge into a single dict."""
    out = {}
    for task in TASKS:
        messages = build_messages(report, task)
        prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                               add_generation_prompt=True)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                           max_length=4096).to(model.device)
        with torch.no_grad():
            gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
        ans = tokenizer.decode(gen[0, inputs.input_ids.shape[1]:],
                               skip_special_tokens=True).strip()
        try:
            out[task] = json_repair.loads(ans.splitlines()[-1]).get(task)
        except Exception:
            out[task] = None
    return out

result = extract_all(report)
print(json.dumps(result, indent=2))
# {
#   "cancer_type": "brca_tcga_gdc",
#   "primary_site": "Breast",
#   "histology": "8500/3",
#   "ajcc_stage": "Stage II",
#   "t_stage": "T2",
#   "n_stage": "N1",
#   "m_stage": "M0",
#   "prior_malignancy": false,
#   "prognosis_good": true
# }
```
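Because closed-set tasks occasionally drift into free text, a light post-hoc normalization of the parsed values can help. A sketch; `normalize` and the (abbreviated) `ALLOWED` sets are illustrative, not part of the released code:

```python
ALLOWED = {
    "ajcc_stage": {"Stage I", "Stage II", "Stage III", "Stage IV"},
    "t_stage": {"T0", "T1", "T2", "T3", "T4", "Tis", "TX"},
    "n_stage": {"N0", "N1", "N2", "N3", "NX"},
    "m_stage": {"M0", "M1", "MX"},
}

def normalize(task: str, value):
    """Snap a raw model answer onto the task's closed label set; None if no match."""
    if task not in ALLOWED or value is None:
        return value  # open-vocabulary tasks (site, histology, ...) pass through
    v = str(value).strip().lower().lstrip("p")  # tolerate "stage ii", "pT2", ...
    for label in ALLOWED[task]:
        if v == label.lower():
            return label
    return None

print(normalize("ajcc_stage", "stage ii"))  # -> Stage II
print(normalize("t_stage", "pT2"))          # -> T2
```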
## ⚡ Faster batched inference (with Unsloth)
If you're processing thousands of reports, install unsloth and use batched decoding:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "drkareemkamal/PathQwen2.5",
    max_seq_length=4096, load_in_4bit=True,
)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
FastLanguageModel.for_inference(model)
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Batch many reports for one task
reports = [...]  # list[str]
task = "cancer_type"
prompts = [tokenizer.apply_chat_template(build_messages(r, task), tokenize=False,
                                         add_generation_prompt=True) for r in reports]
inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                   truncation=True, max_length=3840).to(model.device)
gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                     pad_token_id=tokenizer.pad_token_id)
new_tokens = gen[:, inputs.input_ids.shape[1]:]
answers = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```
Throughput on a single RTX 3090 with `batch_size=8`: ~1 second per patient
for all 9 tasks (~3 hours for the full 8,459-patient TCGA cohort).
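The decoded `answers` are still raw strings. A sketch for turning one batch into values, using stdlib `json` (swap in `json_repair.loads` for the extra robustness recommended above); `parse_batch` is illustrative:

```python
import json

def parse_batch(answers: list[str], task: str) -> list:
    """Parse each decoded answer; None where parsing fails."""
    values = []
    for ans in answers:
        try:
            # Same last-line convention as the single-task example
            values.append(json.loads(ans.strip().splitlines()[-1]).get(task))
        except Exception:
            values.append(None)
    return values

print(parse_batch(['{"cancer_type": "brca_tcga_gdc"}', "garbage"], "cancer_type"))
# -> ['brca_tcga_gdc', None]
```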
## 🧪 Joint single-prompt (faster but lower accuracy)
If you want a single forward pass per report, ask for all 9 fields together. This is ~9× faster but produces free-text drift on closed-set tasks; only recommended for embeddings, not for classification metrics:
```python
SYSTEM_JOINT = (
    "You are an expert pathology AI assistant. Extract structured fields from "
    "the pathology report and respond with ONE JSON object on a single line "
    "with exactly these keys: cancer_type, primary_site, histology, ajcc_stage, "
    "t_stage, n_stage, m_stage, prior_malignancy, prognosis_good. "
    "Use null if a field cannot be determined."
)
messages = [
    {"role": "system", "content": SYSTEM_JOINT},
    {"role": "user", "content": f"## Pathology Report:\n{report}\n\n## Output JSON (single line, all 9 keys):"},
]
```
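If you do use the joint mode, parse defensively and guarantee that all nine keys exist. A minimal sketch (`parse_joint` is illustrative):

```python
import json

EXPECTED_KEYS = ["cancer_type", "primary_site", "histology", "ajcc_stage",
                 "t_stage", "n_stage", "m_stage", "prior_malignancy", "prognosis_good"]

def parse_joint(answer: str) -> dict:
    """Parse the single-line joint JSON; missing or unparseable fields become None."""
    try:
        data = json.loads(answer.strip().splitlines()[-1])
    except Exception:
        data = {}
    return {k: data.get(k) for k in EXPECTED_KEYS}

out = parse_joint('{"cancer_type": "brca_tcga_gdc", "m_stage": "M0"}')
print(out["m_stage"])   # -> M0
print(out["histology"]) # -> None (key absent in the model output)
```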
## 🏋️ Training details
| Hyperparameter | Value |
|---|---|
| Base model | unsloth/Qwen2.5-7B-Instruct-bnb-4bit |
| Trainable params | 141 M (LoRA, ~1.9 % of base) |
| LoRA r / α / dropout | 32 / 32 / 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Excluded modules | embed_tokens, lm_head (saves ~7 GB VRAM, near-zero gain) |
| Max seq length | 4,096 tokens |
| Per-device batch / grad accum | 4 × 4 = effective batch 16 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4, cosine schedule, 5 % warmup |
| Precision | bf16 + FlashAttention 2 |
| Quantization | 4-bit nf4 + double quantization |
| Max epochs | 5 (early-stop patience=3 on eval_loss) |
| Seed | 42 |
| Trained on | RTX 3090 (24 GB), wall time ~9.7 h |
| Best eval_loss | 0.913 at epoch 0.49 |
| Train / val / test QA pairs | 45,518 / 9,734 / 9,690 |
| CoT augmentation | GPT-4o-mini reasoning traces for AJCC stage + prognosis (9,742 rows) |
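The adapter rows above correspond to a PEFT configuration roughly like this. A sketch under the table's stated values, not the actual training script:

```python
from peft import LoraConfig

# Values taken from the hyperparameter table above
lora_config = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # embed_tokens / lm_head excluded
    bias="none",
    task_type="CAUSAL_LM",
)
```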
### Training loss trajectory

| Step | Epoch | train_loss | eval_loss | Note |
|---|---|---|---|---|
| ~280 | 0.07 | 2.444 | 1.336 | |
| ~570 | 0.14 | 2.276 | 1.132 | |
| ~860 | 0.21 | 1.894 | 1.044 | |
| ~1140 | 0.28 | 1.589 | 0.982 | |
| ~1420 | 0.35 | 1.359 | 0.939 | |
| ~1700 | 0.42 | 1.182 | 0.928 | |
| ~1990 | 0.49 | 1.052 | 0.913 | best (saved adapter) |
| ~2280 | 0.56 | 0.917 | 0.935 | patience 1 |
| ~2560 | 0.63 | 0.852 | 0.928 | patience 2 |
| ~2850 | 0.70 | 0.776 | 0.962 | patience 3, stop |
## 📂 Dataset
TCGA – pathology reports from 8,459 patients across 32 cancer cohorts
(studyId): BRCA, LUAD, LUSC, HNSC, COAD, READ, STAD, ESCA, PRAD, BLCA, KIRC,
KIRP, KICH, UCEC, UCS, CESC, OV, LIHC, CHOL, PAAD, THCA, GBM, LGG, SKCM, UVM,
ACC, MENPL, THYM, MESO, TGCT, DLBC, SARC.

Built via `src/training/build_multitask_qa.py` from the harmonized cohort CSV.
Per-task masking: missing labels don't drop the patient, they just skip that
QA pair. Coverage per task ranges from 64–100%.
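The masking rule amounts to emitting one QA pair per present label, never dropping the whole patient. A minimal sketch; the function and field names are illustrative, not the actual `build_multitask_qa.py` code:

```python
def build_qa_pairs(patient: dict, report_text: str, tasks: list[str]) -> list[dict]:
    """Emit one QA pair per task whose label is present; skip only missing ones."""
    pairs = []
    for task in tasks:
        label = patient.get(task)
        if label is None:  # per-task masking: no label -> no QA pair for this task
            continue
        pairs.append({"report": report_text, "task": task, "answer": label})
    return pairs

patient = {"cancer_type": "brca_tcga_gdc", "ajcc_stage": None, "t_stage": "T2"}
pairs = build_qa_pairs(patient, "report text", ["cancer_type", "ajcc_stage", "t_stage"])
print([p["task"] for p in pairs])  # -> ['cancer_type', 't_stage']
```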
## ⚙️ Framework versions
| Library | Version |
|---|---|
| PyTorch | 2.6.0+cu126 |
| Transformers | 4.57.6 |
| PEFT | 0.12+ |
| TRL (SFTTrainer) | 0.24.0 |
| Unsloth | 2026.5.2 |
| bitsandbytes | 0.43+ |
| FlashAttention 2 | 2.8.3 |
| Datasets | 4.3.0 |
| Tokenizers | 0.22.2 |
## ⚠️ Limitations + intended use
- Research only – not approved for clinical decisions
- Trained on retrospective TCGA reports, predominantly U.S. cohorts
- The 27% event rate is higher than the population baseline; risk scores are cohort-calibrated, not absolute
- AJCC stage and prognosis benefit substantially from CoT distillation; if you fine-tune further, use `qa_train_cot.jsonl`, not `qa_train.jsonl`
- Joint single-prompt extraction shows free-text drift on closed-set tasks (cancer_type ~0.58 vs ~0.92 per-task); use the per-task prompts for best accuracy
- The model emits valid JSON in ~99% of cases but should always be wrapped in `json_repair.loads()` or `outlines.generate.json()` in production
## 📚 Citation
If you use this model, please cite the following:
```bibtex
@misc{kamal2026pathqwen,
  author       = {Kamal, Kareem},
  title        = {PathQwen2.5: Multi-task Pathology LLM for TCGA Cancer Reports},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/drkareemkamal/PathQwen2.5}}
}

@article{saluja2025cancer,
  author  = {Saluja, Rachit and Rosenthal, Jacob and Windon, Annika and
             Artzi, Yoav and Pisapia, David J. and Liechty, Benjamin L. and
             Sabuncu, Mert R.},
  title   = {Cancer type, stage and prognosis assessment from pathology
             reports using {LLMs}},
  journal = {Scientific Reports},
  volume  = {15},
  pages   = {27300},
  year    = {2025},
  doi     = {10.1038/s41598-025-10709-4}
}

@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis
                  and Beeching, Edward and Thrush, Tristan and Lambert, Nathan
                  and Huang, Shengyi and Rasul, Kashif and Gallou{\'e}dec, Quentin},
  year         = {2020},
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
## 🤗 Author

Dr. Kareem Kamal · medical-AI researcher · GitHub · Hugging Face

Companion repository (full multimodal pipeline + survival models):
cancer-survival-predictor