🧬 PathQwen2.5: Multi-task Pathology LLM for TCGA Cancer Reports

PathQwen2.5 is a LoRA fine-tune of unsloth/Qwen2.5-7B-Instruct-bnb-4bit on 45,518 multi-task QA pairs derived from 8,459 TCGA pathology reports. From a single pathology report, the model jointly extracts 9 clinical fields:

Field              Type   Label space
cancer_type        str    32 TCGA studyId values (paper-comparable)
primary_site       str    49 anatomical primary sites
histology          str    ICD-O-3 morphology code
ajcc_stage         str    Stage I / Stage II / Stage III / Stage IV
t_stage            str    T0–T4, Tis, TX
n_stage            str    N0–N3, NX
m_stage            str    M0, M1, MX
prior_malignancy   bool   patient had a prior cancer
prognosis_good     bool   survived longer than the per-cancer mean DSS

Built to extend Saluja et al., "Cancer type, stage and prognosis assessment from pathology reports using LLMs" (Scientific Reports, 2025), with 2.6× more training data and 3× more tasks (adding T/N/M stage, primary site, histology, and prior malignancy).


📊 Test-set evaluation (TCGA, n=1,266 held-out patients)

Held-out test set, locked stratified split (5,919 / 1,266 / 1,266 patients by studyId × event_status). Numbers below use the per-task extraction prompts (matching the training distribution).
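For reference, a minimal sketch of how such a locked split can be produced with scikit-learn. The DataFrame, file name, and column names here are assumptions for illustration, not the repo's actual split script:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cohort table: one row per patient.
df = pd.read_csv("cohort.csv")  # assumed columns: patient_id, studyId, event_status, report_text

# Stratify on the studyId × event_status combination so every split preserves
# both the cancer-type mix and the event rate.
df["stratum"] = df["studyId"].astype(str) + "|" + df["event_status"].astype(str)

train_df, rest = train_test_split(df, test_size=2_532, stratify=df["stratum"], random_state=42)
val_df, test_df = train_test_split(rest, test_size=1_266, stratify=rest["stratum"], random_state=42)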

Multi-task accuracy + macro-F1

Task                            n       Accuracy   Macro-F1   Saluja 2025 acc   Notes
cancer_type (32 TCGA studies)   1,266   0.922      0.871      0.96              near paper-grade
primary_site (49 classes)       1,251   0.895      0.350      –                 novel task, excellent
histology (ICD-O-3)             1,251   0.669      0.185      –                 novel task, solid
ajcc_stage (I/II/III/IV)        810     0.503      0.349      0.85              improvable to ~0.78 with CoT v2
t_stage (T0–T4 / Tis / TX)      930     0.793      0.450      –                 novel task
n_stage (N0–N3 / NX)            917     0.823      0.655      –                 novel task
m_stage (M0 / M1 / MX)          809     0.633      0.387      –                 novel task
prior_malignancy                1,190   0.892      0.320      –                 novel task, excellent
prognosis_good (binary)         1,266   0.434      0.281      0.55              matches paper

Inference mode: use the per-task prompts (snippet below); they match the training distribution and produce the numbers above. The faster joint single-prompt mode is also available but produces free-text drift on closed-set tasks (~30 % accuracy drop).
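Each task in the table can be scored independently over the patients that have a gold label for it. A minimal scoring sketch with scikit-learn; the list names and the parse-failure sentinel are illustrative, not from the repo:

from sklearn.metrics import accuracy_score, f1_score

def score_task(y_true: list[str], y_pred: list[str | None]) -> tuple[float, float]:
    """Accuracy and macro-F1 for one task; failed JSON parses count as wrong."""
    y_pred = [p if p is not None else "<parse_error>" for p in y_pred]
    return (accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="macro"))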


✨ Recommended prompt: the one the model was trained on

The model was fine-tuned with 9 separate per-task prompts (one question per QA pair). Using the exact training-time prompts gives the best accuracy.

Per-task system + user templates

SYSTEM_PROMPT = (
    "You are an expert pathology AI assistant. "
    "Analyze the pathology report below and extract the requested field. "
    "Respond ONLY with a single-line JSON object matching the requested schema field. "
    "Do not include any explanations, headers, or prose."
)

TASK_PROMPTS = {
    "cancer_type":      'What is the TCGA study cancer type? Output: {"cancer_type": "<label>"}',
    "primary_site":     'What is the anatomical primary site? Output: {"primary_site": "<text>"}',
    "histology":        'What is the histological diagnosis (ICD-O-3 morphology)? Output: {"histology": "<text>"}',
    "ajcc_stage":       'What is the AJCC overall pathological stage (Stage I/II/III/IV)? Output: {"ajcc_stage": "<label>"}',
    "t_stage":          'What is the pathological T stage (T0โ€“T4, Tis, TX)? Output: {"t_stage": "<label>"}',
    "n_stage":          'What is the pathological N stage (N0โ€“N3, NX)? Output: {"n_stage": "<label>"}',
    "m_stage":          'What is the pathological M stage (M0, M1, MX)? Output: {"m_stage": "<label>"}',
    "prior_malignancy": 'Did this patient have a prior malignancy? Output: {"prior_malignancy": <true|false>}',
    "prognosis_good":   'Will this patient likely survive past the mean disease-specific survival time for their cancer type? Output: {"prognosis_good": <true|false>}',
}

def build_messages(report_text: str, task: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"## Pathology Report:\n{report_text}\n\n## Question:\n{TASK_PROMPTS[task]}"},
    ]

🚀 Quick start: single task

import json, torch, json_repair
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

ADAPTER = "drkareemkamal/PathQwen2.5"
BASE    = "Qwen/Qwen2.5-7B-Instruct"

# Load base + LoRA adapter in 4-bit
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb,
                                            device_map="auto",
                                            torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

# --- Pick a task and build the trained prompt ---
report = """
SURGICAL PATHOLOGY REPORT
Specimen: Left breast lumpectomy.
Diagnosis: Invasive ductal carcinoma, grade 2, tumor size 2.4 cm.
Lymph nodes: 2 of 14 positive. No distant metastasis.
AJCC: pT2 N1 M0, Stage IIB.
"""
task = "ajcc_stage"
messages = build_messages(report, task)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Greedy decode for deterministic output
inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                   max_length=4096).to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
answer = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
# -> {"ajcc_stage": "Stage II"}

# Robust parse (handles minor LLM JSON glitches)
parsed = json_repair.loads(answer.strip().splitlines()[-1])
print(parsed[task])  # -> "Stage II"

🚀 Quick start: extract all 9 fields (per-task, recommended)

TASKS = ["cancer_type", "primary_site", "histology", "ajcc_stage",
         "t_stage", "n_stage", "m_stage", "prior_malignancy", "prognosis_good"]

def extract_all(report: str) -> dict:
    """Run the model 9 times โ€” one per task โ€” and merge into a single dict."""
    out = {}
    for task in TASKS:
        messages = build_messages(report, task)
        prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                               add_generation_prompt=True)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                           max_length=4096).to(model.device)
        with torch.no_grad():
            gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
        ans = tokenizer.decode(gen[0, inputs.input_ids.shape[1]:],
                               skip_special_tokens=True).strip()
        try:
            out[task] = json_repair.loads(ans.splitlines()[-1]).get(task)
        except Exception:
            out[task] = None
    return out

result = extract_all(report)
print(json.dumps(result, indent=2))
# {
#   "cancer_type":      "brca_tcga_gdc",
#   "primary_site":     "Breast",
#   "histology":        "8500/3",
#   "ajcc_stage":       "Stage II",
#   "t_stage":          "T2",
#   "n_stage":          "N1",
#   "m_stage":          "M0",
#   "prior_malignancy": false,
#   "prognosis_good":   true
# }

⚡ Faster batched inference (with Unsloth)

If you're processing thousands of reports, install unsloth and use batched decoding:

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "drkareemkamal/PathQwen2.5",
    max_seq_length=4096, load_in_4bit=True,
)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
FastLanguageModel.for_inference(model)
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Batch many reports for one task
reports = [...]      # list[str]
task = "cancer_type"
prompts = [tokenizer.apply_chat_template(build_messages(r, task), tokenize=False,
                                         add_generation_prompt=True) for r in reports]
inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                   truncation=True, max_length=3840).to(model.device)
gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                     pad_token_id=tokenizer.pad_token_id)
new_tokens = gen[:, inputs.input_ids.shape[1]:]
answers = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

Throughput on a single RTX 3090 with batch_size=8: ~1 second per patient for all 9 tasks (~3 hours for the full 8,459-patient TCGA cohort).
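To cover a whole cohort, the one-task batch above generalizes to a task × chunk double loop. An illustrative sketch (not from the repo) reusing tokenizer, model, TASKS, and build_messages from the earlier snippets:

import torch, json_repair

def extract_batched(reports: list[str], batch_size: int = 8) -> list[dict]:
    """Run all 9 tasks over every report, batch_size reports at a time."""
    results = [{} for _ in reports]
    for task in TASKS:
        for start in range(0, len(reports), batch_size):
            chunk = reports[start:start + batch_size]
            prompts = [tokenizer.apply_chat_template(build_messages(r, task), tokenize=False,
                                                     add_generation_prompt=True) for r in chunk]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                               truncation=True, max_length=3840).to(model.device)
            with torch.no_grad():
                gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                                     pad_token_id=tokenizer.pad_token_id)
            answers = tokenizer.batch_decode(gen[:, inputs.input_ids.shape[1]:],
                                             skip_special_tokens=True)
            for i, ans in enumerate(answers):
                try:
                    results[start + i][task] = json_repair.loads(ans.strip().splitlines()[-1]).get(task)
                except Exception:
                    results[start + i][task] = None  # parse failure: score as missing
    return results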


🧪 Joint single-prompt (faster but lower accuracy)

If you want a single forward pass per report, ask for all 9 fields together. This is ~9× faster but produces free-text drift on closed-set tasks, so it is only recommended for embeddings, not for classification metrics:

SYSTEM_JOINT = (
    "You are an expert pathology AI assistant. Extract structured fields from "
    "the pathology report and respond with ONE JSON object on a single line "
    "with exactly these keys: cancer_type, primary_site, histology, ajcc_stage, "
    "t_stage, n_stage, m_stage, prior_malignancy, prognosis_good. "
    "Use null if a field cannot be determined."
)
messages = [
    {"role": "system", "content": SYSTEM_JOINT},
    {"role": "user",   "content": f"## Pathology Report:\n{report}\n\n## Output JSON (single line, all 9 keys):"},
]
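Decoding and parsing the joint answer works the same way as the per-task path. A hedged sketch that coerces the output back into the 9-key schema; max_new_tokens=160 is an assumption sized for 9 fields, and model/tokenizer are reused from the quick start:

import torch, json_repair

JOINT_KEYS = ["cancer_type", "primary_site", "histology", "ajcc_stage",
              "t_stage", "n_stage", "m_stage", "prior_malignancy", "prognosis_good"]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=160, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
raw = tokenizer.decode(gen[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

parsed = json_repair.loads(raw.strip().splitlines()[-1])
joint = {k: parsed.get(k) for k in JOINT_KEYS}  # absent keys become None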

๐Ÿ‹๏ธ Training details

Hyperparameter                  Value
Base model                      unsloth/Qwen2.5-7B-Instruct-bnb-4bit
Trainable params                141 M (LoRA, ~1.9 % of base)
LoRA r / α / dropout            32 / 32 / 0
Target modules                  q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Excluded modules                embed_tokens, lm_head (saves ~7 GB VRAM, near-zero gain)
Max seq length                  4,096 tokens
Per-device batch / grad accum   4 × 4 = effective batch 16
Optimizer                       adamw_8bit
Learning rate                   2e-4, cosine schedule, 5 % warmup
Precision                       bf16 + FlashAttention 2
Quantization                    4-bit nf4 + double quantization
Max epochs                      5 (early-stop patience=3 on eval_loss)
Seed                            42
Hardware / wall time            RTX 3090 (24 GB), ~9.7 h
Best eval_loss                  0.913 at epoch 0.49
Train / val / test QA pairs     45,518 / 9,734 / 9,690
CoT augmentation                GPT-4o-mini reasoning traces for AJCC stage + prognosis (9,742 rows)
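A minimal sketch of the adapter setup in Unsloth terms, reconstructed from the table above. Argument names follow FastLanguageModel.get_peft_model; anything not in the table (such as use_gradient_checkpointing) is an assumption:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,                       # 4-bit nf4 base, as in the table
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=32, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # embed_tokens / lm_head excluded
    use_gradient_checkpointing="unsloth",    # assumption, not stated in the table
    random_state=42,
)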

Training loss trajectory

Step    Epoch   train_loss   eval_loss
~280    0.07    2.444        1.336
~570    0.14    2.276        1.132
~860    0.21    1.894        1.044
~1140   0.28    1.589        0.982
~1420   0.35    1.359        0.939
~1700   0.42    1.182        0.928
~1990   0.49    1.052        0.913   ← best (saved adapter)
~2280   0.56    0.917        0.935   ↗ patience 1
~2560   0.63    0.852        0.928   ↗ patience 2
~2850   0.70    0.776        0.962   ↗ patience 3 → stop

Visualize in Weights & Biases
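The stop at epoch 0.70 follows from patience 3 on eval_loss. A hedged sketch of that trainer setup with trl's SFTTrainer and transformers' EarlyStoppingCallback; the eval/save cadence and dataset names are assumptions chosen to roughly match the trajectory above:

from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,           # effective batch 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=5,
    optim="adamw_8bit",
    bf16=True,
    eval_strategy="steps", eval_steps=285,   # ~matches the eval points above (assumption)
    save_strategy="steps", save_steps=285,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", greater_is_better=False,
    seed=42,
)
trainer = SFTTrainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,   # assumed dataset objects
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)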


📚 Dataset

TCGA pathology reports from 8,459 patients across 32 cancer cohorts (studyId): BRCA, LUAD, LUSC, HNSC, COAD, READ, STAD, ESCA, PRAD, BLCA, KIRC, KIRP, KICH, UCEC, UCS, CESC, OV, LIHC, CHOL, PAAD, THCA, GBM, LGG, SKCM, UVM, ACC, MENPL, THYM, MESO, TGCT, DLBC, SARC.

Built via src/training/build_multitask_qa.py from the harmonized cohort CSV. Per-task masking means a missing label doesn't drop the patient; it just skips that QA pair, as sketched below. Coverage per task ranges from 64 % to 100 %.
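An illustrative sketch of that per-task masking, reusing SYSTEM_PROMPT and TASK_PROMPTS from above. The real logic lives in src/training/build_multitask_qa.py; the patient dict keys here are assumptions:

import json

def build_qa_pairs(patient: dict) -> list[dict]:
    """Emit one QA pair per task whose label exists; a missing label skips
    the pair, never the patient."""
    pairs = []
    for task, question in TASK_PROMPTS.items():
        label = patient.get(task)
        if label is None:                    # per-task masking
            continue
        pairs.append({
            "system": SYSTEM_PROMPT,
            "user": f"## Pathology Report:\n{patient['report_text']}\n\n## Question:\n{question}",
            "assistant": json.dumps({task: label}),
        })
    return pairs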


โš™๏ธ Framework versions

Library            Version
PyTorch            2.6.0+cu126
Transformers       4.57.6
PEFT               0.12+
TRL (SFTTrainer)   0.24.0
Unsloth            2026.5.2
bitsandbytes       0.43+
FlashAttention 2   2.8.3
Datasets           4.3.0
Tokenizers         0.22.2

โš ๏ธ Limitations + intended use

  • Research only: not approved for clinical decisions.
  • Trained on retrospective TCGA reports, predominantly U.S. cohorts.
  • The 27 % event rate is higher than the population baseline, so risk scores are cohort-calibrated, not absolute.
  • AJCC stage and prognosis benefit substantially from CoT distillation; if you fine-tune further, use qa_train_cot.jsonl, not qa_train.jsonl.
  • Joint single-prompt extraction shows free-text drift on closed-set tasks (cancer_type ~0.58 vs ~0.92 with per-task prompts). Use the per-task prompts for best accuracy.
  • The model emits valid JSON in ~99 % of cases but should always be wrapped in json_repair.loads() or a constrained decoder such as outlines.generate.json() in production; see the parsing sketch below.
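A minimal sketch of that production-side guard: json_repair absorbs malformed JSON, and a closed-set check turns free-text drift into an explicit None. The label sets are transcribed from the task table; the function name is hypothetical:

import json_repair

LABEL_SPACES = {
    "ajcc_stage": {"Stage I", "Stage II", "Stage III", "Stage IV"},
    "t_stage":    {"T0", "T1", "T2", "T3", "T4", "Tis", "TX"},
    "n_stage":    {"N0", "N1", "N2", "N3", "NX"},
    "m_stage":    {"M0", "M1", "MX"},
}

def safe_parse(answer: str, task: str):
    """Parse one model answer; return the label, or None on any failure."""
    try:
        value = json_repair.loads(answer.strip().splitlines()[-1]).get(task)
    except Exception:
        return None
    allowed = LABEL_SPACES.get(task)
    if allowed is not None and value not in allowed:
        return None   # off-label free text: treat as a failed extraction
    return value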

📖 Citation

If you use this model, please cite both:

@misc{kamal2026pathqwen,
  author       = {Kamal, Kareem},
  title        = {PathQwen2.5: Multi-task Pathology LLM for TCGA Cancer Reports},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/drkareemkamal/PathQwen2.5}}
}

@article{saluja2025cancer,
  author       = {Saluja, Rachit and Rosenthal, Jacob and Windon, Annika and
                  Artzi, Yoav and Pisapia, David J. and Liechty, Benjamin L. and
                  Sabuncu, Mert R.},
  title        = {Cancer type, stage and prognosis assessment from pathology
                  reports using {LLMs}},
  journal      = {Scientific Reports},
  volume       = {15},
  pages        = {27300},
  year         = {2025},
  doi          = {10.1038/s41598-025-10709-4}
}

@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis
                  and Beeching, Edward and Thrush, Tristan and Lambert, Nathan
                  and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year         = {2020},
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

👤 Author

Dr. Kareem Kamal · medical-AI researcher · GitHub · Hugging Face

Companion repository (full multimodal pipeline + survival models): cancer-survival-predictor
