How to use from the
Use from the
Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Laborator/microlens-gemma4-e2b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Laborator/microlens-gemma4-e2b", dtype="auto")
Quick Links

MicroLens · Microscopy Vision-Language Model

A small, fine-tuned multimodal model that turns a $150 Android phone + a clip-on microscope into a field-ready assistant for diatom-based water-quality assessment, freshwater zooplankton biodiversity, fungal spore identification, and cyanobacterial bloom monitoring. Runs offline.


Table of Contents

  1. Model Details
  2. Intended Use
  3. Training Data
  4. Training Procedure
  5. Evaluation
  6. Bias, Risks & Limitations
  7. Environmental Impact
  8. Technical Specifications
  9. How to Use
  10. Citation
  11. Model Card Authors
  12. Model Card Contact

1. Model Details

Field Value
Name MicroLens
Version 1.0 · May 2026
Author Serghei Brinza · Vienna, Austria
Model type Vision-Language (image + text → text)
Language(s) English (primary); multilingual output via Gemma 4 base tokenizer
Base model unsloth/gemma-4-E2B-it (Gemma 4 Effective-2B, instruction-tuned)
Parameters 4.65 B total · 59.7 M trainable during fine-tune (1.34 %)
License Apache 2.0
Finetuning method Unsloth FastVisionModel + 4-bit QLoRA (LoRA adapter, r = 32, α = 64, dropout = 0.05)
Framework Unsloth 2026.4.7 · Transformers 5.6.0 · PyTorch 2.10 · CUDA 12.8
Hardware 1 × NVIDIA RTX 3090 Ti (24 GB VRAM)
Training time ~13 h (v3 = 1 epoch on rich format, ~6,200 steps · resumed from v2 checkpoint-18351 · v2 base = 3 epochs, ~37 h)
Distilled from Internal Teacher VLM (Apache 2.0, thinking mode, 3 × vLLM workers)

Distribution artefacts

Artefact Size Purpose Target runtime
LoRA adapter 228 MB Load on top of base Gemma 4 E2B Unsloth / PEFT / Transformers
Merged FP16 9.5 GB Full stand-alone model Transformers / vLLM / SGLang
GGUF Q4_K_M 3.2 GB 4-bit quantised weights Ollama · llama.cpp · LM Studio
BF16 mmproj 942 MB Vision projector for GGUF runtimes Ollama · llama.cpp

All artefacts live on the same HF repo: Laborator/microlens-gemma4-e2b.


2. Intended Use

MicroLens is built to lower the cost of scientific observation in places where expert knowledge or network access is scarce.

Primary intended uses

  • Citizen science. Volunteers contributing to pond-water biodiversity counts, freshwater plankton surveys, or diatom-based water-quality monitoring can capture a smartphone-microscope image and receive a structured natural-language description of the subject and its key visual features.
  • Future of Education. Offline biology / earth-science classes; the model runs on the same Android tablet the students already use, with no cloud call required.
  • Research support. Pre-screening for plankton surveys, diatom-based water-quality monitoring, and fungal spore identification, where the model narrows the candidate set before an expert confirms.
  • Digital equity. The Q4_K_M GGUF build runs on mid-range Android hardware (~$150 phones with 6 GB RAM) via llama.cpp / MLC. No API key, no telemetry, no internet.

Intended users

  • Citizen-science volunteers (amateur botanists, beekeepers, freshwater monitors).
  • Teachers and students in biology / earth-science courses, particularly in low-connectivity regions.
  • Researchers doing preliminary triage of large microscopy datasets.
  • Hackathon / jury members evaluating The Gemma 4 Good Hackathon submission.

Out-of-scope uses

  • Medical diagnosis. MicroLens has not been trained on medical imaging (histology, cytology, pathology, radiology). Do not use it to diagnose disease in humans or animals.
  • Legally or biologically authoritative species identification. The model returns descriptions, not court-defensible or taxonomically rigorous identifications.
  • Materials outside the 8 trained categories. Feeding the model an unrelated image (e.g. a face, a landscape, a screenshot) produces an answer but the answer is not grounded in its training and should be treated as unreliable.
  • Forensics, compliance, or regulated decision-making. Do not chain MicroLens into any pipeline where a confident but wrong output can harm a person or violate regulation.

3. Training Data

Category distribution

All samples are microscopy images from 5 open-licensed source datasets (AquaScope · ZooLake · UDE Diatoms · DiatlAS · TgFC ). Total: 93,014 image-question-answer triples (82,737 train · 10,277 validation · 123 genera across 8 categories).

# Category Train samples Typical subjects Source datasets
1 Diatoms 64,043 (64.6%) Pennate / centric diatoms · genus-level taxonomy UDE Diatoms · DiatlAS
2 Freshwater zooplankton 11,264 (11.4%) Cladocerans · copepods · rotifers AquaScope · ZooLake
3 Fungal spores 4,188 (4.2%) Conidia · ascospores · basidiospores · spore morphology TgFC
4 Cyanobacteria 1,091 (1.1%) Filamentous and unicellular cyanobacteria curated subset
5 Fish 177 (0.2%) Fish or fish part (pseudo-genus, category-level only) TgFC
6 No specimen (service) 1,350 (1.4%) Background / empty fields for OOD detection synthetic negatives
7 Debris (service) 428 (0.4%) Non-biological fragments for OOD handling curated
8 Unknown (service) 196 (0.2%) Unidentified microscopy specimens for fallback curated
Total: 82,737 train · 10,277 val · 123 genera (top-30 hand-curated KB)

Class balancing

The 8 categories naturally vary in genus density. Long-tail genera (~100 of 123 have fewer than 100 samples each) get category-generic morphology rather than genus-specific cues — this is the correct conservative behaviour given training coverage. The 30 most-common genera have hand-curated knowledge-base entries (morphology · habitat · ID cues from AlgaeBase, WoRMS, ITIS, Round 1990, Krammer-Lange-Bertalot 1986–1991).

Description generation pipeline (distillation)

Natural-language descriptions (questionanswer) were generated from the raw images using Internal Teacher VLM (Apache 2.0) running in thinking mode across 3 × vLLM workers in parallel. For every image:

  1. The teacher sees the image alongside a structured prompt asking for subject identification and key visual features.
  2. The teacher produces a detailed chain-of-thought inside <think>…</think> tags, then a concise final answer.
  3. Only the final answer is kept as the training target; the <think> trace is discarded.

This is a teacher-student distillation. The student (MicroLens / Gemma 4 E2B) inherits the teacher's descriptive style while being ~1.7× smaller in parameter count and dramatically smaller after Q4 quantisation.

Licensing of training data

All upstream datasets were checked for license compatibility. Accepted licenses: Apache 2.0, MIT, CC-BY, CC-BY-SA, CC0. Zero samples were used from unlicensed or research-only datasets. The distilled VQA pairs are released under Apache 2.0 alongside the model.

Source datasets & DOI

Dataset License DOI
AquaScope (Eawag) CC-BY 4.0 10.25678/0009YP
ZooLake (Eawag) CC-BY 4.0 10.25678/0004DY
UDE Diatoms CC-BY 4.0 10.1093/gigascience/giae087
DIATLAS CC-BY 4.0 10.5281/zenodo.16260887
TgFC fungal spores CC-BY 4.0 10.6084/m9.figshare.28855910

4. Training Procedure

Hyperparameters

Hyperparameter Value
Fine-tuning method Unsloth FastVisionModel + 4-bit QLoRA (NF4 base + LoRA adapter in bf16)
LoRA rank (r) 32
LoRA α 64
LoRA dropout 0.05
Trainable parameters 59.7 M (1.34 % of 4.65 B)
Target modules All linear projections (vision tower + language tower)
Optimizer AdamW (8-bit)
Learning rate 5 × 10⁻⁵ (4× softer than v2's 2 × 10⁻⁴ — gentle re-learning of rich format)
LR schedule Linear warmup (100 steps) → linear decay
Batch size (per device) 2
Gradient accumulation 8
Effective batch size 16
Max sequence length 2048 tokens
Epochs (v3) 1 (rich format, resumed from v2 checkpoint-18351)
Steps (v3) ~6,200
v2 base config 3 epochs · lr 2 × 10⁻⁴ · 18,351 steps · ~37 h wall-clock
Mixed precision bf16
Gradient checkpointing enabled during fine-tune (Unsloth's optimised path)
Seed 3407

Hardware & runtime

  • 1 × NVIDIA GeForce RTX 3090 Ti (24 GB GDDR6X) — single GPU only (Unsloth currently does not support multi-GPU)
  • AMD Ryzen host · 64 GB system RAM
  • Ubuntu 24.04 · CUDA 12.8 · PyTorch 2.10
  • Wall-clock v3 (1 epoch rich-format resume): ~13 h
  • Wall-clock v2 (3 epochs base training that v3 resumed from): ~37 h
  • Cumulative wall-clock through full v2 + v3 pipeline: ~50 h

Loss curves

Stage Split Final loss
v2 (3 epochs base) Eval (10,277-image holdout) ~0.21
v3 (1 epoch rich-format resume) Eval (220-image stratified holdout) 0.0213

The v3 step preserves v2's category/genus accuracy (no drift, ~45 % top-1 genus accuracy on 123 classes vs random baseline of 0.7 %) while overwriting the response format prior to produce structured rich answers (genus · morphology · habitat · ID cues). The 4× softer learning rate (5e-5 vs 2e-4) was specifically chosen for this gentle re-learning step — without it, 1 epoch on rich format would degrade the genus signal v2 had already learned.

Attention backend

Gemma 4's vision encoder uses a head dimension of 512, which exceeds the 256-head-dim limit of current FlashAttention-2 kernels. Fine-tuning and inference therefore use PyTorch SDPA (scaled-dot-product attention, memory-efficient path). On RTX 3090 Ti this is the correct default; SDPA is the only supported backend for Gemma 4 in Unsloth 2026.4.7 at the time of training. Unsloth's FastVisionModel adds custom 4-bit QLoRA kernels and UnslothVisionDataCollator on top of this backend, which together cut peak VRAM from ~38 GB (vanilla HF Transformers) to ~12 GB and roughly halve the per-step time.


5. Evaluation

Evaluation is qualitative and per-category, reflecting the spirit of the submission (an assistive descriptor, not a classifier). For every category we sampled images from the held-out validation split and compared the MicroLens answer against the original internal teacher answer.

Category Observation
Diatoms Pennate vs. centric distinction is reliable; genus-level naming on common Naviculales / Cymbellaceae / Aulacoseiraceae is consistent. Long-tail diatoms degrade gracefully into morphological description (raphe / striae / valve outline).
Freshwater zooplankton Cladocerans and rotifers are consistently named at family level; common copepod genera (Cyclops, Daphnia, Bosmina) are reliably tagged.
Fungal spores Conidial vs. ascospore vs. basidiospore separation is reliable; common spore morphologies (Neopestalotiopsis, Colletotrichum, Olivea) receive genus-level naming.
Cyanobacteria Identified at category level; specific cyanobacterial genus naming is best-effort due to small training share (~1% of total).
Fish Pseudo-genus class with no species-level annotation in training. Returns category-level templated description rather than species names.

The 3 service classes (debris, no_specimen, unknown) are used for out-of-distribution handling and route the model to conservative, generic responses rather than taxonomic descriptions.

There is no single accuracy number for MicroLens, because the output space is free-form natural language. The correct axis of evaluation is "does the description help a human in the field decide what to do next?". For the trained categories, it does.


6. Bias, Risks & Limitations

Known failure modes

  • Small-model ceiling. MicroLens is built on Gemma 4 E2B (effective 2-billion scale). On edge cases the teacher (Internal Teacher) was stronger; the student inherits the style but not the full capability. Expect the student to be close to, but not equal to, the teacher on hard examples.
  • English-first. Scientific terminology is maximally accurate in English. The Gemma 4 base model is multilingual, so translated output is available, but translations can simplify or partially drop domain terms; always verify critical terms in the English answer.
  • Out-of-distribution images. Photographs that are not microscopy (landscapes, faces, screenshots) will still produce text. That text is not grounded in the training distribution and should not be trusted.

Risks

  • Over-trust by non-experts. A fluent natural-language description can feel more authoritative than it is. Treat MicroLens as a first-pass field note, not as an oracle. Verify before publishing, diagnosing, or acting on any output.
  • Distribution shift. The training data is dominated by lab-quality or curated-quality images. Field images taken through cheap clip-on phone microscopes have more motion blur, chromatic aberration, and inconsistent illumination. Descriptions on those inputs remain helpful but are more generic.

Ethical considerations

  • Distillation is explicitly disclosed. Training data was generated from Internal Teacher VLM (Apache 2.0). The teacher VLM was used under its Apache 2.0 license.
  • Dataset provenance is audited. Only Apache / MIT / CC-BY / CC-BY-SA / CC0 upstream data was used. Zero non-licensed images were included.
  • No faces, no PII. The training pool contains microscopy subjects only: no human faces, no personally identifiable information, no private medical imaging.

Recommended usage pattern

  1. Capture image → 2. MicroLens describes it → 3. Human confirms or rejects → 4. Log both.

The model produces a first draft. Final decisions stay with the user.


7. Environmental Impact

MicroLens v3 was trained on a single workstation GPU for ~13 hours (1-epoch rich-format resume from v2 checkpoint-18351). The v2 base run that v3 resumes from took an additional ~37 hours.

Factor Value
GPU RTX 3090 Ti · ~400 W under sustained fine-tune load
CPU + chassis + cooling overhead ~140 W
Wall-time v3 (1 epoch rich) ~13 h
Wall-time v2 (3 epochs base) ~37 h
Cumulative wall-time ~50 h
Estimated energy (v3 step) ~3 kWh
Estimated energy (cumulative v2 + v3) ~12.5 kWh

At the Austrian 2024 grid carbon intensity (110 g CO₂ / kWh), the v3 training step emits ~0.3 kg CO₂-equivalent, and the full v2 + v3 pipeline emits **1.9 kg CO₂-equivalent**.

Inference cost is negligible: the Q4_K_M GGUF build runs on a mid-range Android phone at a few watts. MicroLens is designed so that the cumulative lifetime inference energy per query can be orders of magnitude smaller than a single cloud-inference call to a frontier model.


8. Technical Specifications

Architecture

  • Backbone: Gemma 4 (E2B), sparse-attention transformer decoder with an integrated vision encoder stack.
  • Vision encoder: Gemma 4 native vision tower (head dim 512).
  • Fusion: multimodal projector that lifts vision tokens into the language model embedding space (mmproj is shipped separately for GGUF runtimes).
  • Positional encoding: inherited from Gemma 4 base.
  • Attention backend: SDPA (scaled dot-product attention) during both fine-tune and inference. FlashAttention-2 is not usable: Gemma 4's vision-tower head dim (512) exceeds the FA-2 kernel limit (256).

Adapter layout

  • LoRA rank: 32
  • LoRA α: 64
  • LoRA dropout: 0.05
  • Target modules: all linear projections across both the language and vision sub-networks (attention Q/K/V/O and MLP gate/up/down), enabling multimodal co-adaptation rather than a language-only adapter.
  • Merged adapter size: 228 MB (bf16).

Quantisations shipped

  • Merged FP16: full-precision full-model snapshot (9.5 GB), Transformers-native.
  • GGUF Q4_K_M: 4-bit quantised weights via llama.cpp convert pipeline (3.2 GB). Pairs with the BF16 mmproj (942 MB) for full multimodal inference.
  • LoRA-only (bf16): for users who want to re-merge against a different Gemma 4 E2B base or stack additional adapters.

Software dependencies at training time

  • unsloth == 2026.4.7
  • transformers == 5.6.0
  • torch == 2.10 (CUDA 12.8)
  • peft, bitsandbytes, trl from Unsloth's pinned resolver.

9. How to Use

Transformers (merged FP16)

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_id = "Laborator/microlens-gemma4-e2b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("my_microscopy_image.jpg").convert("RGB")
prompt = "Describe what you see in this microscopy image. Identify the subject and key visual features."

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text":  prompt},
]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=220, temperature=0.3, do_sample=True)
print(processor.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Unsloth (LoRA on top of base)

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    load_in_4bit=True,
    use_gradient_checkpointing=False,
)
FastVisionModel.for_inference(model)
# … same prompting pattern as above.

Ollama / llama.cpp (Q4_K_M)

# download microlens-gemma4-e2b-Q4_K_M.gguf and mmproj-bf16.gguf from the HF repo
ollama create microlens -f Modelfile        # see repo for Modelfile
ollama run microlens "Describe this sample." --image slide_01.jpg

10. Citation

If you use MicroLens in a publication, project, or downstream model, please cite:

@software{brinza_microlens_2026,
  title        = {MicroLens: a microscopy vision-language model fine-tuned from Gemma 4 E2B},
  author       = {Brinza, Serghei},
  year         = {2026},
  month        = may,
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/Laborator/microlens-gemma4-e2b},
  note         = {Submission to the Gemma 4 Good Hackathon (Kaggle, May 2026)}
}

Upstream works used by MicroLens:

@misc{gemma4_2026,
  title        = {Gemma 4 Technical Report},
  author       = {Google DeepMind},
  year         = {2026},
  note         = {Base model: unsloth/gemma-4-E2B-it}
}

@misc{unsloth_2026,
  title        = {Unsloth: faster LLM fine-tuning},
  author       = {Daniel Han and Michael Han and Unsloth team},
  year         = {2026},
  url          = {https://github.com/unslothai/unsloth}
}

11. Model Card Authors

  • Serghei Brinza · Vienna, Austria · sole author of the model, the training pipeline, and this card.

12. Model Card Contact


MicroLens · built for the Gemma 4 Good Hackathon.

Downloads last month
4,409
GGUF
Model size
5B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Laborator/microlens-gemma4-e2b

Adapter
(20)
this model

Space using Laborator/microlens-gemma4-e2b 1