Glüten · Gemma 4 E4B Marsh classifier (merged fp16)

Gemma 4 E4B fine-tuned via Unsloth QLoRA on the IBDColEpi colon-biopsy patch dataset (Pettersen et al. 2022, CC0). The fine-tuned LoRA adapter is merged into the base in fp16 and re-saved as a single standalone multimodal checkpoint, so deployment doesn't need PEFT or bitsandbytes at runtime.

This is the structural-layer classifier used by the Glüten coeliac disease digital twin: https://gluten--gluten-gemma4.europe-west4.hosted.app

Quick start

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

REPO = "faith-ogun/gluten-gemma4-marsh-merged"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

img = Image.open("patch.tif").convert("RGB").resize((224, 224))
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": (
            "You are a histopathology assistant. Classify the Marsh "
            "grade of this HE-stained intestinal biopsy patch. Respond "
            "with exactly one of: Marsh-0, Marsh-1, Marsh-3a, Marsh-3b."
        )},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)  # follow device_map="auto" instead of hard-coding "cuda"

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(processor.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# -> "Marsh-3b"

Single-GPU inference fits on a 24 GB card (L4/A10) at fp16. A T4 (16 GB nominal, about 14.5 GB usable) will run out of memory unless part of the model is offloaded to CPU, which device_map="auto" handles.
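
As a rough back-of-envelope check on the memory note above, the sketch below estimates the size of the weights alone. It assumes the 8B parameter count shown in the hub metadata for this repository and 2 bytes per parameter (fp16 or bf16). Activations, the KV cache, and framework overhead are not counted, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

# Assumption: ~8e9 parameters, as listed in the hub metadata for this repo.
N_PARAMS = 8e9

weights_gib = weight_memory_gib(N_PARAMS)
print(f"{weights_gib:.1f} GiB for 16-bit weights")
```

Under these assumptions the weights alone come to roughly 15 GiB, which is consistent with a 24 GB card being comfortable and a T4 needing offload.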

What this model does

Given a 224×224 RGB patch from an HE-stained intestinal mucosa biopsy, the model returns one of four Marsh-equivalent labels:

| Label | Meaning |
|---|---|
| Marsh-0 | normal mucosa |
| Marsh-1 | infiltrative (raised intraepithelial lymphocytes, intact villi) |
| Marsh-3a | partial villous atrophy |
| Marsh-3b | subtotal villous atrophy |

Marsh-2 is omitted (under-represented in the literature and in the training data; the molecular reference dataset GSE164883 also skips it).
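
Because the model emits free text, downstream code may want to normalise the decoded string to one of the four labels. The helper below is a minimal sketch of one way to do that. The name parse_marsh_label and the choice to return None for anything outside the four classes are illustrative assumptions, not part of the project code.

```python
import re
from typing import Optional

MARSH_LABELS = ("Marsh-0", "Marsh-1", "Marsh-3a", "Marsh-3b")

# Accepts "Marsh-3b", "marsh 3B", "Marsh3a" and similar variants.
_PATTERN = re.compile(r"marsh[-\s]?(0|1|3a|3b)\b", re.IGNORECASE)

def parse_marsh_label(text: str) -> Optional[str]:
    """Return the canonical label found in `text`, or None if there is none."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return f"Marsh-{match.group(1).lower()}"

print(parse_marsh_label("Marsh-3b"))
```

Returning None rather than guessing keeps unexpected generations (for example "Marsh-2") visible to the caller.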

Provenance

This checkpoint is the deployment artefact of the v2 training run (see notebooks/gluten-gemma4-marsh-qlora-v2.ipynb in the project repo):

  1. Started from google/gemma-4-E4B-it, loaded in 4-bit via Unsloth FastVisionModel.from_pretrained(load_in_4bit=True).
  2. Vision tower frozen (finetune_vision_layers=False); LoRA applied to the language-model layers only (r=16, lora_alpha=32, dropout=0.05, targeting the attention and MLP modules).
  3. SFT for 250 steps on a 2,000-patch sample from IBDColEpi Trainset with per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=2e-4, cosine schedule, fp16 (T4 doesn't support bf16).
  4. Merged with model.save_pretrained_merged(save_method='merged_16bit'), then pushed to this repository.

Training time: ~30 min on a free Kaggle T4. The notebook is public and reproducible end-to-end given a HF_TOKEN with the Gemma 4 license accepted.
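
The hyperparameters in step 3 imply a specific amount of data seen during training. The sketch below works through that arithmetic. The dictionary keys follow Hugging Face TrainingArguments naming, and whether the notebook uses exactly these keys is an assumption, as is the single-device setup.

```python
# Values taken from step 3 above; key names are an assumption.
train_config = {
    "max_steps": 250,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "fp16": True,  # T4 does not support bf16
}

NUM_DEVICES = 1        # assumption: a single T4
TRAIN_PATCHES = 2_000  # size of the sampled training set

effective_batch = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
    * NUM_DEVICES
)
samples_seen = effective_batch * train_config["max_steps"]
epochs = samples_seen / TRAIN_PATCHES

print(effective_batch, samples_seen, epochs)
```

Under these assumptions the run makes one pass over the 2,000-patch sample: an effective batch of 8 for 250 steps.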

Pseudo-label methodology (read this first)

IBDColEpi ships pixel-level epithelium-segmentation masks, NOT Marsh grades. The labels in this fine-tune are derived from epithelium fraction per patch:

epi_frac = (mask > 0).mean()
# quantile-binned over the full dataset into four classes:
#   Q1 (lowest coverage)  → Marsh-3b
#   Q2                    → Marsh-3a
#   Q3                    → Marsh-1
#   Q4 (highest coverage) → Marsh-0

The motivation is published in Pettersen et al. 2022 (DOI 10.18710/TLA01U): epithelium quantification is the same primitive a pathologist counts when staging villous atrophy, and the authors explicitly name coeliac disease as an applicable use case for the same segmentation pipeline.

These are weak-supervision proxy labels, not pathologist-validated Marsh scores. Anyone retraining with pathologist-validated grades should expect substantially higher accuracy on Marsh-1 / Marsh-3a specifically; the labels in this dataset are most accurate at the extremes (no epithelium / mostly epithelium) and noisiest in the middle.
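
The quantile binning described in this section can be sketched as follows. This is a minimal illustration assuming NumPy arrays for the masks and quartile edges computed over whatever set of patches is passed in. The function names are hypothetical and the notebook's actual implementation may differ.

```python
import numpy as np

# Lowest epithelium coverage maps to the most severe label.
LABELS_BY_QUARTILE = ("Marsh-3b", "Marsh-3a", "Marsh-1", "Marsh-0")

def epithelium_fraction(mask: np.ndarray) -> float:
    """Fraction of pixels in the segmentation mask marked as epithelium."""
    return float((mask > 0).mean())

def assign_pseudo_labels(epi_fracs: np.ndarray) -> list[str]:
    """Bin epithelium fractions into quartiles and map each to a label."""
    edges = np.quantile(epi_fracs, [0.25, 0.50, 0.75])
    quartile = np.digitize(epi_fracs, edges)  # 0 (lowest) .. 3 (highest)
    return [LABELS_BY_QUARTILE[q] for q in quartile]

fracs = np.array([0.05, 0.30, 0.60, 0.90])
print(assign_pseudo_labels(fracs))
```

Because the bin edges are quantiles of the data itself, the four classes come out roughly balanced by construction, regardless of how much atrophy-like tissue the dataset really contains.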

Performance

Aggregate (v3 stratified audit, 400 held-out test patches)

| Metric | Value |
|---|---|
| Accuracy (training-time eval, in v2 kernel) | 0.70 |
| Accuracy (deployed merged-fp16, v3 audit) | 0.645 |
| Macro-F1 (deployed) | 0.624 |

Per-class (deployed model, v3 audit)

| Class | Precision | Recall | F1 | n |
|---|---|---|---|---|
| Marsh-0 | 0.68 | 0.73 | 0.70 | 81 |
| Marsh-1 | 0.49 | 0.57 | 0.53 | 96 |
| Marsh-3a | 0.58 | 0.35 | 0.44 | 99 |
| Marsh-3b | 0.78 | 0.88 | 0.83 | 124 |

Marsh-3b, the most clinically actionable severe-atrophy class, is the strongest (F1 0.83). Errors are confined to adjacent grades (3a↔3b, 0↔1), mirroring the 73–80% inter-pathologist agreement reported in the literature.
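
As a consistency check, the aggregate figures can be recomputed from the per-class rows. The sketch below uses only the rounded precision, recall, and n values from the table, so it can only agree with the reported numbers to within rounding.

```python
# (precision, recall, n) per class, copied from the per-class table.
per_class = {
    "Marsh-0": (0.68, 0.73, 81),
    "Marsh-1": (0.49, 0.57, 96),
    "Marsh-3a": (0.58, 0.35, 99),
    "Marsh-3b": (0.78, 0.88, 124),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Macro-F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1(p, r) for p, r, _ in per_class.values()) / len(per_class)

# Accuracy is total correct over total patches; recall * n approximates
# the number of correct predictions in each class.
n_total = sum(n for _, _, n in per_class.values())
accuracy = sum(r * n for _, r, n in per_class.values()) / n_total

print(n_total, round(macro_f1, 3), round(accuracy, 3))
```

The class counts sum to the 400 held-out patches, and both recomputed values land within rounding distance of the reported macro-F1 of 0.624 and accuracy of 0.645.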

Stratified bias audit

The full audit (figure + CSV + per-stratum summary) is in the project repo under results/marsh_stratified*. Key finding:

  • Per-WSI accuracy spans 0.00–1.00 (σ=0.21) across 35 source slides drawn from a single hospital and scanner.

In other words, on the same dataset, scanner, and centre, per-slide accuracy ranges across the full 100 percentage points. Cross-site generalisation is the load-bearing unknown, and the field currently has no answer for it: the SOTA Cambridge benchmark (NEJM AI 2025, 3,383 duodenal WSIs) sits behind a UK NHS data-sharing agreement (IRAS 162057) and has not published per-demographic metrics. Performance for African, South Asian, East Asian, Hispanic, or Middle Eastern patients therefore cannot be quantified for any publicly available CD biopsy model, including this one.
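
A per-slide breakdown like the one above can be computed by grouping patch predictions by their source slide. The sketch below is illustrative only. The record layout (slide_id, true_label, predicted_label) and the toy data are assumptions, and whether the reported σ is a population or sample standard deviation is not stated, so population is used here.

```python
from collections import defaultdict
from statistics import pstdev

def per_slide_accuracy(records):
    """records: iterable of (slide_id, true_label, predicted_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slide_id, y_true, y_pred in records:
        totals[slide_id] += 1
        hits[slide_id] += int(y_true == y_pred)
    return {slide: hits[slide] / totals[slide] for slide in totals}

# Toy data, not taken from the audit.
records = [
    ("wsi_a", "Marsh-0", "Marsh-0"),
    ("wsi_a", "Marsh-1", "Marsh-0"),
    ("wsi_b", "Marsh-3b", "Marsh-3b"),
    ("wsi_b", "Marsh-3b", "Marsh-3b"),
    ("wsi_c", "Marsh-3a", "Marsh-3b"),
]

slide_acc = per_slide_accuracy(records)
spread = pstdev(list(slide_acc.values()))
print(slide_acc, round(spread, 2))
```

Slides that contribute only a handful of patches can reach 0.00 or 1.00 by chance, so the number of patches per slide is worth reporting alongside the spread.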

Limitations

  • Pseudo-labels, not pathologist-validated. See "Pseudo-label methodology" above. Re-training on pathologist grades is a 30-minute job on a T4 if those labels become available.
  • Trained on colon, not duodenum. Coeliac disease is duodenal; the IBDColEpi dataset is colon. Pettersen et al. argue the segmentation primitive transfers, but the canonical structural reference would be the Cambridge duodenal dataset, which is access-restricted.
  • Single-site, single-scanner training data. No demographic metadata accompanies IBDColEpi: no patient ancestry, sex, age, hospital, or scanner. Generalisation across centres and populations is not measurable from the available data.
  • Marsh-2 absent. Quantile binning produces four classes, not five.
  • Patch-level, not slide-level. The model classifies 224×224 patches individually. Whole-slide inference would require tiling + aggregation logic not provided here.
  • Not a diagnostic tool. This is a research prototype. Marsh grading in practice depends on context the model cannot see (IEL counts, clinical history, serology). The output is one input among many to a pathologist's judgement, not a substitute for it.
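
To make the patch-level limitation concrete, the sketch below shows the kind of external tiling and aggregation logic a whole-slide workflow would need. It is a hypothetical illustration, not part of this project: the function names are invented, non-overlapping tiles and a simple majority vote are assumptions, and nothing here filters out background or has been validated clinically.

```python
from collections import Counter
from typing import Iterator

PATCH = 224  # patch edge length expected by the model

def tile_coords(width: int, height: int, patch: int = PATCH,
                stride: int = PATCH) -> Iterator[tuple[int, int]]:
    """Yield (left, top) corners of patches that fit fully inside the image."""
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            yield left, top

def aggregate_majority(patch_labels: list[str]) -> str:
    """Return the most frequent patch label (ties go to the first one seen)."""
    if not patch_labels:
        raise ValueError("no patch labels to aggregate")
    return Counter(patch_labels).most_common(1)[0][0]

print(list(tile_coords(448, 224)))
print(aggregate_majority(["Marsh-3b", "Marsh-3b", "Marsh-0"]))
```

A majority vote is only one possible rule. A real pipeline would also need tissue detection and a principled way to weight patches, neither of which is sketched here.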

Intended use

  • Research and reproducibility of the Glüten coeliac disease digital twin project.
  • Baseline for groups with pathologist-validated Marsh datasets who want a starting point for proper supervised fine-tuning.
  • Educational demonstration of multimodal LoRA fine-tuning of Gemma 4 E4B on histopathology, including the dequantize-and-merge deployment pattern.

Out-of-scope

  • Clinical decision-making
  • Whole-slide analysis without external tiling and aggregation
  • Validation for any non-European patient population (no demographic metadata in training data)
  • Drug, treatment, or therapy decisions of any kind

Citation

If this model is useful in your work, please cite:

@misc{ogundimu2026gluten,
  title  = {Glüten: a coeliac disease digital twin with per-layer
           ancestry-aware confidence},
  author = {Ogundimu, Faith},
  year   = {2026},
  note   = {Gemma 4 Good Hackathon submission, Kaggle/Google DeepMind},
  url    = {https://github.com/faith-ogun/Gluten},
}

And the underlying dataset:

@dataset{pettersen2022ibdcolepi,
  title  = {IBDColEpi: Annotated WSIs of colon epithelium and pixel-level
           U-Net segmentation models},
  author = {Pettersen, Henrik S. and others},
  year   = {2022},
  doi    = {10.18710/TLA01U},
  url    = {https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/TLA01U},
}

And the base model:

@misc{gemma4,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://ai.google.dev/gemma/docs/core/model_card_4},
}

License

This fine-tuned model inherits Gemma's licence (https://ai.google.dev/gemma/terms). The LoRA delta and the merged weights themselves are released under the same terms. The IBDColEpi training data is CC0 1.0 (no attribution required, public domain).

Acknowledgements

  • Google DeepMind for Gemma 4 and the Gemma 4 Good Hackathon
  • Unsloth for the QLoRA training framework
  • Pettersen et al. and NTNU / St. Olavs Hospital, Trondheim for IBDColEpi
  • Modal for hosting the merged-fp16 inference at the live demo URL