🧬 PathQwen2.5: Multi-task Pathology LLM for TCGA Cancer Reports

PathQwen2.5 is a LoRA fine-tune of unsloth/Qwen2.5-7B-Instruct-bnb-4bit on 45,518 multi-task QA pairs derived from 8,459 TCGA pathology reports. From a single pathology report, the model jointly extracts 9 clinical fields:

Field              Type   Label space
cancer_type        str    32 TCGA studyId values (paper-comparable)
primary_site       str    49 anatomical primary sites
histology          str    ICD-O-3 morphology code
ajcc_stage         str    Stage I / Stage II / Stage III / Stage IV
t_stage            str    T0–T4, Tis, TX
n_stage            str    N0–N3, NX
m_stage            str    M0, M1, MX
prior_malignancy   bool   patient had a prior cancer
prognosis_good     bool   survived longer than the per-cancer mean DSS

Built to extend Saluja et al., "Cancer type, stage and prognosis assessment from pathology reports using LLMs" (Scientific Reports, 2025), with 2.6× more training data and 3× more tasks (adding T/N/M stage, primary site, histology, and prior malignancy).


📊 Test-set evaluation (TCGA, n=1,266 held-out patients)

Held-out test set, locked stratified split (5,919 / 1,266 / 1,266 patients by studyId × event_status). Numbers below use the per-task extraction prompts (matching the training distribution).
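For reference, a minimal sketch of how such a locked split can be produced with scikit-learn. The DataFrame, file name, and column names here are assumptions for illustration, not the repo's actual split script:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cohort table: one row per patient.
df = pd.read_csv("cohort.csv")  # assumed columns: patient_id, studyId, event_status, report_text

# Stratify on the studyId × event_status combination so every split preserves
# both the cancer-type mix and the event rate.
df["stratum"] = df["studyId"].astype(str) + "|" + df["event_status"].astype(str)

train_df, rest = train_test_split(df, test_size=2_532, stratify=df["stratum"], random_state=42)
val_df, test_df = train_test_split(rest, test_size=1_266, stratify=rest["stratum"], random_state=42)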

Multi-task accuracy + macro-F1

Task                            n       Accuracy   Macro-F1   Saluja 2025 acc   Notes
cancer_type (32 TCGA studies)   1,266   0.922      0.871      0.96              near paper-grade
primary_site (49 classes)       1,251   0.895      0.350      –                 novel task, excellent
histology (ICD-O-3)             1,251   0.669      0.185      –                 novel task, solid
ajcc_stage (I/II/III/IV)        810     0.503      0.349      0.85              improvable to ~0.78 with CoT v2
t_stage (T0–T4 / Tis / TX)      930     0.793      0.450      –                 novel task
n_stage (N0–N3 / NX)            917     0.823      0.655      –                 novel task
m_stage (M0 / M1 / MX)          809     0.633      0.387      –                 novel task
prior_malignancy                1,190   0.892      0.320      –                 novel task, excellent
prognosis_good (binary)         1,266   0.434      0.281      0.55              matches paper

Inference mode: use the per-task prompts (snippet below); they match the training distribution and produce the numbers above. The faster joint single-prompt mode is also available but produces free-text drift on closed-set tasks (~30 % accuracy drop).
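Each task in the table can be scored independently over the patients that have a gold label for it. A minimal scoring sketch with scikit-learn; the list names and the parse-failure sentinel are illustrative, not from the repo:

from sklearn.metrics import accuracy_score, f1_score

def score_task(y_true: list[str], y_pred: list[str | None]) -> tuple[float, float]:
    """Accuracy and macro-F1 for one task; failed JSON parses count as wrong."""
    y_pred = [p if p is not None else "<parse_error>" for p in y_pred]
    return (accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="macro"))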


✨ Recommended prompt: the one the model was trained on

The model was fine-tuned with 9 separate per-task prompts (one question per QA pair). Using the exact training-time prompts gives the best accuracy.

Per-task system + user templates

SYSTEM_PROMPT = (
    "You are an expert pathology AI assistant. "
    "Analyze the pathology report below and extract the requested field. "
    "Respond ONLY with a single-line JSON object matching the requested schema field. "
    "Do not include any explanations, headers, or prose."
)

TASK_PROMPTS = {
    "cancer_type":      'What is the TCGA study cancer type? Output: {"cancer_type": "<label>"}',
    "primary_site":     'What is the anatomical primary site? Output: {"primary_site": "<text>"}',
    "histology":        'What is the histological diagnosis (ICD-O-3 morphology)? Output: {"histology": "<text>"}',
    "ajcc_stage":       'What is the AJCC overall pathological stage (Stage I/II/III/IV)? Output: {"ajcc_stage": "<label>"}',
    "t_stage":          'What is the pathological T stage (T0โ€“T4, Tis, TX)? Output: {"t_stage": "<label>"}',
    "n_stage":          'What is the pathological N stage (N0โ€“N3, NX)? Output: {"n_stage": "<label>"}',
    "m_stage":          'What is the pathological M stage (M0, M1, MX)? Output: {"m_stage": "<label>"}',
    "prior_malignancy": 'Did this patient have a prior malignancy? Output: {"prior_malignancy": <true|false>}',
    "prognosis_good":   'Will this patient likely survive past the mean disease-specific survival time for their cancer type? Output: {"prognosis_good": <true|false>}',
}

def build_messages(report_text: str, task: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"## Pathology Report:\n{report_text}\n\n## Question:\n{TASK_PROMPTS[task]}"},
    ]

🚀 Quick start: single task

import json, torch, json_repair
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

ADAPTER = "drkareemkamal/PathQwen2.5"
BASE    = "Qwen/Qwen2.5-7B-Instruct"

# Load base + LoRA adapter in 4-bit
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb,
                                            device_map="auto",
                                            torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

# --- Pick a task and build the trained prompt ---
report = """
SURGICAL PATHOLOGY REPORT
Specimen: Left breast lumpectomy.
Diagnosis: Invasive ductal carcinoma, grade 2, tumor size 2.4 cm.
Lymph nodes: 2 of 14 positive. No distant metastasis.
AJCC: pT2 N1 M0, Stage IIB.
"""
task = "ajcc_stage"
messages = build_messages(report, task)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Greedy decode for deterministic output
inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                   max_length=4096).to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
answer = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
# -> {"ajcc_stage": "Stage II"}

# Robust parse (handles minor LLM JSON glitches)
parsed = json_repair.loads(answer.strip().splitlines()[-1])
print(parsed[task])  # -> "Stage II"

🚀 Quick start: extract all 9 fields (per-task, recommended)

TASKS = ["cancer_type", "primary_site", "histology", "ajcc_stage",
         "t_stage", "n_stage", "m_stage", "prior_malignancy", "prognosis_good"]

def extract_all(report: str) -> dict:
    """Run the model 9 times โ€” one per task โ€” and merge into a single dict."""
    out = {}
    for task in TASKS:
        messages = build_messages(report, task)
        prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                               add_generation_prompt=True)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                           max_length=4096).to(model.device)
        with torch.no_grad():
            gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
        ans = tokenizer.decode(gen[0, inputs.input_ids.shape[1]:],
                               skip_special_tokens=True).strip()
        try:
            out[task] = json_repair.loads(ans.splitlines()[-1]).get(task)
        except Exception:
            out[task] = None
    return out

result = extract_all(report)
print(json.dumps(result, indent=2))
# {
#   "cancer_type":      "brca_tcga_gdc",
#   "primary_site":     "Breast",
#   "histology":        "8500/3",
#   "ajcc_stage":       "Stage II",
#   "t_stage":          "T2",
#   "n_stage":          "N1",
#   "m_stage":          "M0",
#   "prior_malignancy": false,
#   "prognosis_good":   true
# }

⚡ Faster batched inference (with Unsloth)

If you're processing thousands of reports, install unsloth and use batched decoding:

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "drkareemkamal/PathQwen2.5",
    max_seq_length=4096, load_in_4bit=True,
)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
FastLanguageModel.for_inference(model)
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Batch many reports for one task
reports = [...]      # list[str]
task = "cancer_type"
prompts = [tokenizer.apply_chat_template(build_messages(r, task), tokenize=False,
                                         add_generation_prompt=True) for r in reports]
inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                   truncation=True, max_length=3840).to(model.device)
gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                     pad_token_id=tokenizer.pad_token_id)
new_tokens = gen[:, inputs.input_ids.shape[1]:]
answers = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

Throughput on a single RTX 3090 with batch_size=8: ~1 second per patient for all 9 tasks (~3 hours for the full 8,459-patient TCGA cohort).
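To cover a whole cohort, the one-task batch above generalizes to a task × chunk double loop. An illustrative sketch (not from the repo) reusing tokenizer, model, TASKS, and build_messages from the earlier snippets:

import torch, json_repair

def extract_batched(reports: list[str], batch_size: int = 8) -> list[dict]:
    """Run all 9 tasks over every report, batch_size reports at a time."""
    results = [{} for _ in reports]
    for task in TASKS:
        for start in range(0, len(reports), batch_size):
            chunk = reports[start:start + batch_size]
            prompts = [tokenizer.apply_chat_template(build_messages(r, task), tokenize=False,
                                                     add_generation_prompt=True) for r in chunk]
            inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                               truncation=True, max_length=3840).to(model.device)
            with torch.no_grad():
                gen = model.generate(**inputs, max_new_tokens=48, do_sample=False,
                                     pad_token_id=tokenizer.pad_token_id)
            answers = tokenizer.batch_decode(gen[:, inputs.input_ids.shape[1]:],
                                             skip_special_tokens=True)
            for i, ans in enumerate(answers):
                try:
                    results[start + i][task] = json_repair.loads(ans.strip().splitlines()[-1]).get(task)
                except Exception:
                    results[start + i][task] = None  # parse failure: score as missing
    return results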


🧪 Joint single-prompt (faster but lower accuracy)

If you want a single forward pass per report, ask for all 9 fields together. This is ~9× faster but produces free-text drift on closed-set tasks, so it is only recommended for embeddings, not for classification metrics:

SYSTEM_JOINT = (
    "You are an expert pathology AI assistant. Extract structured fields from "
    "the pathology report and respond with ONE JSON object on a single line "
    "with exactly these keys: cancer_type, primary_site, histology, ajcc_stage, "
    "t_stage, n_stage, m_stage, prior_malignancy, prognosis_good. "
    "Use null if a field cannot be determined."
)
messages = [
    {"role": "system", "content": SYSTEM_JOINT},
    {"role": "user",   "content": f"## Pathology Report:\n{report}\n\n## Output JSON (single line, all 9 keys):"},
]
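Decoding and parsing the joint answer works the same way as the per-task path. A hedged sketch that coerces the output back into the 9-key schema; max_new_tokens=160 is an assumption sized for 9 fields, and model/tokenizer are reused from the quick start:

import torch, json_repair

JOINT_KEYS = ["cancer_type", "primary_site", "histology", "ajcc_stage",
              "t_stage", "n_stage", "m_stage", "prior_malignancy", "prognosis_good"]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=160, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
raw = tokenizer.decode(gen[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

parsed = json_repair.loads(raw.strip().splitlines()[-1])
joint = {k: parsed.get(k) for k in JOINT_KEYS}  # absent keys become None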

๐Ÿ‹๏ธ Training details

Hyperparameter                  Value
Base model                      unsloth/Qwen2.5-7B-Instruct-bnb-4bit
Trainable params                141 M (LoRA, ~1.9 % of base)
LoRA r / α / dropout            32 / 32 / 0
Target modules                  q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Excluded modules                embed_tokens, lm_head (saves ~7 GB VRAM, near-zero gain)
Max seq length                  4,096 tokens
Per-device batch / grad accum   4 × 4 = effective batch 16
Optimizer                       adamw_8bit
Learning rate                   2e-4, cosine schedule, 5 % warmup
Precision                       bf16 + FlashAttention 2
Quantization                    4-bit nf4 + double quantization
Max epochs                      5 (early-stop patience=3 on eval_loss)
Seed                            42
Hardware / wall time            RTX 3090 (24 GB), ~9.7 h
Best eval_loss                  0.913 at epoch 0.49
Train / val / test QA pairs     45,518 / 9,734 / 9,690
CoT augmentation                GPT-4o-mini reasoning traces for AJCC stage + prognosis (9,742 rows)
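A minimal sketch of the adapter setup in Unsloth terms, reconstructed from the table above. Argument names follow FastLanguageModel.get_peft_model; anything not in the table (such as use_gradient_checkpointing) is an assumption:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,                       # 4-bit nf4 base, as in the table
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=32, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # embed_tokens / lm_head excluded
    use_gradient_checkpointing="unsloth",    # assumption, not stated in the table
    random_state=42,
)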

Training loss trajectory

Step    Epoch   train_loss   eval_loss
~280    0.07    2.444        1.336
~570    0.14    2.276        1.132
~860    0.21    1.894        1.044
~1140   0.28    1.589        0.982
~1420   0.35    1.359        0.939
~1700   0.42    1.182        0.928
~1990   0.49    1.052        0.913   ← best (saved adapter)
~2280   0.56    0.917        0.935   ↗ patience 1
~2560   0.63    0.852        0.928   ↗ patience 2
~2850   0.70    0.776        0.962   ↗ patience 3 → stop

Visualize in Weights & Biases
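The stop at epoch 0.70 follows from patience 3 on eval_loss. A hedged sketch of that trainer setup with trl's SFTTrainer and transformers' EarlyStoppingCallback; the eval/save cadence and dataset names are assumptions chosen to roughly match the trajectory above:

from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,           # effective batch 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=5,
    optim="adamw_8bit",
    bf16=True,
    eval_strategy="steps", eval_steps=285,   # ~matches the eval points above (assumption)
    save_strategy="steps", save_steps=285,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", greater_is_better=False,
    seed=42,
)
trainer = SFTTrainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,   # assumed dataset objects
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)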


📚 Dataset

TCGA pathology reports from 8,459 patients across 32 cancer cohorts (studyId): BRCA, LUAD, LUSC, HNSC, COAD, READ, STAD, ESCA, PRAD, BLCA, KIRC, KIRP, KICH, UCEC, UCS, CESC, OV, LIHC, CHOL, PAAD, THCA, GBM, LGG, SKCM, UVM, ACC, MENPL, THYM, MESO, TGCT, DLBC, SARC.

Built via src/training/build_multitask_qa.py from the harmonized cohort CSV. Per-task masking means a missing label doesn't drop the patient; it just skips that QA pair, as sketched below. Coverage per task ranges from 64 % to 100 %.
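An illustrative sketch of that per-task masking, reusing SYSTEM_PROMPT and TASK_PROMPTS from above. The real logic lives in src/training/build_multitask_qa.py; the patient dict keys here are assumptions:

import json

def build_qa_pairs(patient: dict) -> list[dict]:
    """Emit one QA pair per task whose label exists; a missing label skips
    the pair, never the patient."""
    pairs = []
    for task, question in TASK_PROMPTS.items():
        label = patient.get(task)
        if label is None:                    # per-task masking
            continue
        pairs.append({
            "system": SYSTEM_PROMPT,
            "user": f"## Pathology Report:\n{patient['report_text']}\n\n## Question:\n{question}",
            "assistant": json.dumps({task: label}),
        })
    return pairs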


โš™๏ธ Framework versions

Library            Version
PyTorch            2.6.0+cu126
Transformers       4.57.6
PEFT               0.12+
TRL (SFTTrainer)   0.24.0
Unsloth            2026.5.2
bitsandbytes       0.43+
FlashAttention 2   2.8.3
Datasets           4.3.0
Tokenizers         0.22.2

โš ๏ธ Limitations + intended use

  • Research only: not approved for clinical decisions.
  • Trained on retrospective TCGA reports, predominantly U.S. cohorts.
  • The 27 % event rate is higher than the population baseline, so risk scores are cohort-calibrated, not absolute.
  • AJCC stage and prognosis benefit substantially from CoT distillation; if you fine-tune further, use qa_train_cot.jsonl, not qa_train.jsonl.
  • Joint single-prompt extraction shows free-text drift on closed-set tasks (cancer_type ~0.58 vs ~0.92 with per-task prompts). Use the per-task prompts for best accuracy.
  • The model emits valid JSON in ~99 % of cases but should always be wrapped in json_repair.loads() or a constrained decoder such as outlines.generate.json() in production; see the parsing sketch below.
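A minimal sketch of that production-side guard: json_repair absorbs malformed JSON, and a closed-set check turns free-text drift into an explicit None. The label sets are transcribed from the task table; the function name is hypothetical:

import json_repair

LABEL_SPACES = {
    "ajcc_stage": {"Stage I", "Stage II", "Stage III", "Stage IV"},
    "t_stage":    {"T0", "T1", "T2", "T3", "T4", "Tis", "TX"},
    "n_stage":    {"N0", "N1", "N2", "N3", "NX"},
    "m_stage":    {"M0", "M1", "MX"},
}

def safe_parse(answer: str, task: str):
    """Parse one model answer; return the label, or None on any failure."""
    try:
        value = json_repair.loads(answer.strip().splitlines()[-1]).get(task)
    except Exception:
        return None
    allowed = LABEL_SPACES.get(task)
    if allowed is not None and value not in allowed:
        return None   # off-label free text: treat as a failed extraction
    return value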

📖 Citation

If you use this model, please cite both:

@misc{kamal2026pathqwen,
  author       = {Kamal, Kareem},
  title        = {PathQwen2.5: Multi-task Pathology LLM for TCGA Cancer Reports},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/drkareemkamal/PathQwen2.5}}
}

@article{saluja2025cancer,
  author       = {Saluja, Rachit and Rosenthal, Jacob and Windon, Annika and
                  Artzi, Yoav and Pisapia, David J. and Liechty, Benjamin L. and
                  Sabuncu, Mert R.},
  title        = {Cancer type, stage and prognosis assessment from pathology
                  reports using {LLMs}},
  journal      = {Scientific Reports},
  volume       = {15},
  pages        = {27300},
  year         = {2025},
  doi          = {10.1038/s41598-025-10709-4}
}

@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis
                  and Beeching, Edward and Thrush, Tristan and Lambert, Nathan
                  and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year         = {2020},
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

👤 Author

Dr. Kareem Kamal · medical-AI researcher · GitHub · Hugging Face

Companion repository (full multimodal pipeline + survival models): cancer-survival-predictor
