MicroLens · Microscopy Vision-Language Model

A small, fine-tuned multimodal model that turns a $150 Android phone + a clip-on microscope into a field-ready assistant for pollen surveys, pond-water plankton, mineral identification, plant-disease triage, and more. Runs offline.


Table of Contents

  1. Model Details
  2. Intended Use
  3. Training Data
  4. Training Procedure
  5. Evaluation
  6. Bias, Risks & Limitations
  7. Environmental Impact
  8. Technical Specifications
  9. How to Use
  10. Citation
  11. Model Card Authors
  12. Model Card Contact

1. Model Details

| Field | Value |
|---|---|
| Name | MicroLens |
| Version | 1.0 · May 2026 |
| Author | Serghei Brinza · Vienna, Austria |
| Model type | Vision-language (image + text → text) |
| Language(s) | English (primary); multilingual output via the Gemma 4 base tokenizer |
| Base model | unsloth/gemma-4-E2B-it (Gemma 4 Effective-2B, instruction-tuned) |
| Parameters | 4.65 B total · 59.7 M trainable during fine-tuning (1.34 %) |
| License | Apache 2.0 |
| Fine-tuning method | Unsloth FastVisionModel + 4-bit QLoRA (LoRA adapter, r = 32, α = 64, dropout = 0.05) |
| Framework | Unsloth 2026.4.7 · Transformers 5.6.0 · PyTorch 2.10 · CUDA 12.8 |
| Hardware | 1 × NVIDIA RTX 3090 Ti (24 GB VRAM) |
| Training time | ~13 h (v3 = 1 epoch on the rich format, ~6,200 steps, resumed from v2 checkpoint-18351; v2 base = 3 epochs, ~37 h) |
| Distilled from | Qwen3-VL-8B-AWQ (Apache 2.0, thinking mode, 3 × vLLM workers) |

Distribution artefacts

| Artefact | Size | Purpose | Target runtime |
|---|---|---|---|
| LoRA adapter | 228 MB | Load on top of the base Gemma 4 E2B | Unsloth / PEFT / Transformers |
| Merged FP16 | 8.7 GB | Full stand-alone model | Transformers / vLLM / SGLang |
| GGUF Q4_K_M | 3.2 GB | 4-bit quantised weights | Ollama · llama.cpp · LM Studio |
| BF16 mmproj | 942 MB | Vision projector for GGUF runtimes | Ollama · llama.cpp |

All artefacts live on the same HF repo: Laborator/microlens-gemma4-e2b.
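
To fetch a single artefact rather than the whole repo, the standard huggingface_hub download call is enough. A minimal sketch (the GGUF and mmproj filenames are the ones referenced in section 9; confirm them against the repo's file listing):

from huggingface_hub import hf_hub_download

repo_id = "Laborator/microlens-gemma4-e2b"

# Pull only the 4-bit GGUF weights and the vision projector (for Ollama / llama.cpp).
gguf_path   = hf_hub_download(repo_id, filename="microlens-gemma4-e2b-Q4_K_M.gguf")
mmproj_path = hf_hub_download(repo_id, filename="mmproj-bf16.gguf")
print(gguf_path, mmproj_path)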


2. Intended Use

MicroLens is built to lower the cost of scientific observation in places where expert knowledge or network access is scarce.

Primary intended uses

  • Citizen science. Volunteers contributing to pollen surveys, pond-water biodiversity counts, or amateur mineralogy can capture a smartphone-microscope image and receive a structured natural-language description of the subject and its key visual features.
  • Education. Offline biology / earth-science classes; the model runs on the same Android tablet the students already use, with no cloud call required.
  • Research support. Pre-screening for pollen monitoring, zooplankton surveys, and mineral field work, where the model narrows the candidate set before an expert confirms.
  • Digital equity. The Q4_K_M GGUF build runs on mid-range Android hardware (~$150 phones with 6 GB RAM) via llama.cpp / MLC. No API key, no telemetry, no internet.

Intended users

  • Citizen-science volunteers (amateur botanists, beekeepers, freshwater monitors).
  • Teachers and students in biology / earth-science courses, particularly in low-connectivity regions.
  • Researchers doing preliminary triage of large microscopy datasets.
  • Hackathon jury members evaluating this submission to the Gemma 4 Good Hackathon.

Out-of-scope uses

  • Medical diagnosis. MicroLens has not been trained on medical imaging (histology, cytology, pathology, radiology). Do not use it to diagnose disease in humans or animals.
  • Legally or biologically authoritative species identification. The model returns descriptions, not court-defensible or taxonomically rigorous identifications.
  • Materials outside the 9 trained categories. Feeding the model an unrelated image (e.g. a face, a landscape, a screenshot) produces an answer but the answer is not grounded in its training and should be treated as unreliable.
  • Forensics, compliance, or regulated decision-making. Do not chain MicroLens into any pipeline where a confident but wrong output can harm a person or violate regulation.

3. Training Data

Category distribution

All samples are microscopy images from 6 open-licensed source datasets (AquaScope · ZooLake · UDE Diatoms · DiatlAS · TgFC · Marine zooplankton dataset). Total: 122,399 image-question-answer triples (99,215 train · 12,331 validation · 12,353 test · 1,500 negative-class · 146 genera across 9 categories).

| # | Category | Typical subjects | Source datasets |
|---|---|---|---|
| 1 | Diatoms | Pennate / centric diatoms · genus-level taxonomy | UDE Diatoms · DiatlAS |
| 2 | Freshwater zooplankton | Cladocerans · copepods · rotifers | AquaScope · ZooLake |
| 3 | Marine zooplankton | Copepods · larvae · medusae · marine crustaceans | Marine zooplankton dataset |
| 4 | Fungal spores | Conidia · ascospores · basidiospores · spore morphology | curated subset |
| 5 | Fish larvae | Early-stage fish larvae and pre-larval forms | TgFC |
| 6 | Pollen | Grass · tree · flower pollen grains · aperture morphology | curated subset |
| 7 | Minerals | Thin-section petrographic slides · crystal habit | curated subset |
| 8 | Plant disease | Leaf lesions · phytopathogen morphology · chlorosis / necrosis | curated subset |
| 9 | Snowflakes | Macro- / microphotographed snow crystals · dendrite / plate / column | curated subset |

Total: 99,215 train · 12,331 val · 146 genera (top 30 have hand-curated knowledge-base entries)

Class balancing & negative-class

The 9 categories naturally vary in genus density. Long-tail genera (~100 of the 146 have fewer than 100 samples each) receive category-generic morphology descriptions rather than genus-specific cues; this is the correct conservative behaviour given the training coverage. The 30 most-common genera have hand-curated knowledge-base entries (morphology · habitat · ID cues from AlgaeBase, WoRMS, ITIS, Round 1990, and Krammer & Lange-Bertalot 1986–1991).

A synthetic negative-class of 1,500 non-microscopy images (faces · landscapes · screenshots) was added so the model learns to refuse out-of-distribution inputs at inference time.
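
For orientation, one training triple can be pictured roughly as follows. This is a hypothetical record layout for illustration, not the literal schema of the released files:

sample = {
    "image": "diatoms/navicula_000123.jpg",   # hypothetical file path
    "question": "Describe what you see in this microscopy image. "
                "Identify the subject and key visual features.",
    "answer": "A pennate diatom, likely Navicula: elongated valve outline, "
              "a central raphe, and fine transverse striae.",
    "category": "Diatoms",                    # one of the 9 categories
    "genus": "Navicula",                      # one of the 146 genera
}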

Description generation pipeline (distillation)

Natural-language descriptions (question-answer pairs) were generated from the raw images using Qwen3-VL-8B-AWQ (Apache 2.0) running in thinking mode across 3 × vLLM workers in parallel. For every image:

  1. The teacher sees the image alongside a structured prompt asking for subject identification and key visual features.
  2. The teacher produces a detailed chain-of-thought inside <think>…</think> tags, then a concise final answer.
  3. Only the final answer is kept as the training target; the <think> trace is discarded.

This is a teacher-student distillation. The student (MicroLens / Gemma 4 E2B) inherits the teacher's descriptive style while being ~1.7× smaller in parameter count and dramatically smaller after Q4 quantisation.
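
A minimal sketch of the answer-extraction step in that pipeline (the tag handling is illustrative; the production prompts and filtering are not reproduced here):

import re

def strip_think(teacher_output: str) -> str:
    # Drop the chain-of-thought trace and keep only the teacher's final answer.
    answer = re.sub(r"<think>.*?</think>", "", teacher_output, flags=re.DOTALL)
    return answer.strip()

raw = "<think>Elongated valve, central raphe, fine striae ...</think>A pennate diatom, likely Navicula."
print(strip_think(raw))  # -> "A pennate diatom, likely Navicula."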

Licensing of training data

All upstream datasets were checked for license compatibility. Accepted licenses: Apache 2.0, MIT, CC-BY, CC-BY-SA, CC0. Zero samples were used from unlicensed or research-only datasets. The distilled VQA pairs are released under Apache 2.0 alongside the model.


4. Training Procedure

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Fine-tuning method | Unsloth FastVisionModel + 4-bit QLoRA (NF4 base + LoRA adapter in bf16) |
| LoRA rank (r) | 32 |
| LoRA α | 64 |
| LoRA dropout | 0.05 |
| Trainable parameters | 59.7 M (1.34 % of 4.65 B) |
| Target modules | All linear projections (vision tower + language tower) |
| Optimizer | AdamW (8-bit) |
| Learning rate | 5 × 10⁻⁵ (4× lower than v2's 2 × 10⁻⁴; gentle re-learning of the rich format) |
| LR schedule | Linear warmup (100 steps) → linear decay |
| Batch size (per device) | 2 |
| Gradient accumulation | 8 |
| Effective batch size | 16 |
| Max sequence length | 2048 tokens |
| Epochs (v3) | 1 (rich format, resumed from v2 checkpoint-18351) |
| Steps (v3) | ~6,200 |
| v2 base config | 3 epochs · lr 2 × 10⁻⁴ · 18,351 steps · ~37 h wall-clock |
| Mixed precision | bf16 |
| Gradient checkpointing | Enabled during fine-tuning (Unsloth's optimised path) |
| Seed | 3407 |
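
In code, the configuration above maps onto Unsloth's vision API roughly as follows. This is a sketch, not the exact training script; argument names follow Unsloth's documented FastVisionModel interface and may need adjusting for the pinned 2026.4.7 release.

from unsloth import FastVisionModel

# Load the base model in 4-bit (NF4) and attach a bf16 LoRA adapter: r = 32, alpha = 64, dropout = 0.05.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",   # Unsloth's optimised checkpointing path
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,      # adapt the vision tower, not just the language tower
    finetune_language_layers=True,
    finetune_attention_modules=True,  # Q/K/V/O projections
    finetune_mlp_modules=True,        # gate/up/down projections
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    random_state=3407,
)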

Hardware & runtime

  • 1 × NVIDIA GeForce RTX 3090 Ti (24 GB GDDR6X) — single GPU only (Unsloth currently does not support multi-GPU)
  • AMD Ryzen host · 64 GB system RAM
  • Ubuntu 24.04 · CUDA 12.8 · PyTorch 2.10
  • Wall-clock v3 (1 epoch rich-format resume): ~13 h
  • Wall-clock v2 (3 epochs base training that v3 resumed from): ~37 h
  • Cumulative wall-clock through full v2 + v3 pipeline: ~50 h

Loss curves

| Stage | Split | Final loss |
|---|---|---|
| v2 (3 epochs, base) | Eval (12,331-image holdout) | ~0.21 |
| v3 (1 epoch, rich-format resume) | Eval (220-image stratified holdout) | 0.0213 |

The v3 step preserves v2's category/genus accuracy (no drift; ~45 % top-1 genus accuracy on 146 classes vs. a random baseline of 0.7 %) while overwriting the response-format prior so the model produces structured rich answers (genus · morphology · habitat · ID cues). The 4× lower learning rate (5e-5 vs. 2e-4) was chosen specifically for this gentle re-learning step; without it, a single epoch on the rich format would degrade the genus signal v2 had already learned.

Attention backend

Gemma 4's vision encoder uses a head dimension of 512, which exceeds the 256-head-dim limit of current FlashAttention-2 kernels. Fine-tuning and inference therefore use PyTorch SDPA (scaled-dot-product attention, memory-efficient path). On RTX 3090 Ti this is the correct default; SDPA is the only supported backend for Gemma 4 in Unsloth 2026.4.7 at the time of training. Unsloth's FastVisionModel adds custom 4-bit QLoRA kernels and UnslothVisionDataCollator on top of this backend, which together cut peak VRAM from ~38 GB (vanilla HF Transformers) to ~12 GB and roughly halve the per-step time.
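
When loading the merged model with plain Transformers, the same constraint can be made explicit by requesting the SDPA backend (a sketch; attn_implementation is the standard Transformers argument):

import torch
from transformers import AutoModelForVision2Seq

# FlashAttention-2 kernels cap head_dim at 256; Gemma 4's vision tower uses 512,
# so explicitly select the memory-efficient SDPA backend.
model = AutoModelForVision2Seq.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)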


5. Evaluation

Evaluation is qualitative and per-category, reflecting the spirit of the submission (an assistive descriptor, not a classifier). For every category we sampled images from the held-out validation split and compared the MicroLens answer against the original Qwen3-VL-8B teacher answer.

| Category | Observation |
|---|---|
| Pollen | Consistent identification of pollen vs. non-pollen. Species-level guesses degrade gracefully into morphological descriptions (shape, aperture, surface texture). |
| Algae | Separates filamentous vs. unicellular vs. colonial forms. Genus-level names are best-effort. |
| Yeast | Reliable identification of budding cells; distinguishes yeast from bacteria. |
| Minerals | Good at gross texture (crystalline, granular, foliated) and colour; specific mineral names can be off when the sample lacks diagnostic features visible in brightfield. |
| Plant disease | Strong on lesion descriptions (chlorosis, necrosis, spotting); pathogen identification is probabilistic. |
| PCB | Identifies trace patterns, solder joints, component silhouettes. Not intended for defect triage; it describes rather than grades. |
| Snowflakes | Dendrite / plate / column classification is reliable; novel crystal habits are described morphologically. |
| Zooplankton | Copepods, rotifers, and common cladocerans are consistently named. Rare subclasses degrade gracefully (see limitations). |
| Diatoms | Pennate vs. centric distinction is reliable; genus-level naming on common Naviculales / Cymbellaceae / Aulacoseiraceae is consistent. Long-tail diatoms degrade gracefully into morphological description (raphe / striae / valve outline). |

There is no single accuracy number for MicroLens, because the output space is free-form natural language. The correct axis of evaluation is "does the description help a human in the field decide what to do next?". For the trained categories, it does.
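
The per-category review can be reproduced with a small side-by-side loop (a sketch; the validation file name and field names are hypothetical, and `describe` stands in for any inference callable such as the Transformers snippet in section 9):

import json
import random

def compare_to_teacher(val_path, describe, n=5, seed=3407):
    # Sample n held-out items and print the stored teacher answer next to the student answer.
    # `describe(image_path, question) -> str` is any MicroLens inference callable.
    random.seed(seed)
    with open(val_path) as f:
        items = [json.loads(line) for line in f]
    for item in random.sample(items, n):
        print("CATEGORY:", item["category"])
        print("TEACHER :", item["answer"])
        print("STUDENT :", describe(item["image"], item["question"]))
        print("-" * 72)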


6. Bias, Risks & Limitations

Known failure modes

  • Graceful degradation on rare zooplankton subclasses. Specimens from Branchiopoda, Decapoda, and other sparsely represented orders are typically described as "marine zooplankton" or "crustacean-like organism" rather than named at the correct taxonomic level. This is the correct conservative behaviour given training coverage; the model does not fabricate taxonomy it cannot defend.
  • Small-model ceiling. MicroLens is built on Gemma 4 E2B (effective 2-billion scale). On edge cases the teacher (Qwen3-VL-8B) was stronger; the student inherits the style but not the full capability. Expect the student to be close to, but not equal to, the teacher on hard examples.
  • English-first. Scientific terminology is maximally accurate in English. The Gemma 4 base model is multilingual, so translated output is available, but translations can simplify or partially drop domain terms; always verify critical terms in the English answer.
  • Out-of-distribution images. Photographs that are not microscopy (landscapes, faces, screenshots) will still produce text. That text is not grounded in the training distribution and should not be trusted.

Risks

  • Over-trust by non-experts. A fluent natural-language description can feel more authoritative than it is. Treat MicroLens as a first-pass field note, not as an oracle. Verify before publishing, diagnosing, or acting on any output.
  • Distribution shift. The training data is dominated by lab-quality or curated-quality images. Field images taken through cheap clip-on phone microscopes have more motion blur, chromatic aberration, and inconsistent illumination. Descriptions on those inputs remain helpful but are more generic.

Ethical considerations

  • Distillation is explicitly disclosed. Training data was generated from Qwen3-VL-8B-AWQ (Apache 2.0). Qwen's license permits this; Qwen is credited in the Citation section.
  • Dataset provenance is audited. Only Apache / MIT / CC-BY / CC-BY-SA / CC0 upstream data was used. Zero non-licensed images were included.
  • No faces, no PII. The training pool contains microscopy subjects only: no human faces, no personally identifiable information, no private medical imaging.

Recommended usage pattern

  1. Capture image → 2. MicroLens describes it → 3. Human confirms or rejects → 4. Log both.

The model produces a first draft. Final decisions stay with the user.
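
Step 4 can be as simple as appending both the model draft and the human verdict to a CSV (a sketch with hypothetical field names):

import csv
import datetime

def log_observation(path, image_file, model_answer, human_verdict, notes=""):
    # Keep the model's draft and the human decision side by side, per step 4 above.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            image_file,
            model_answer,
            human_verdict,   # e.g. "confirmed", "rejected", "uncertain"
            notes,
        ])

log_observation("field_log.csv", "slide_01.jpg",
                "Pennate diatom, likely Navicula.", "confirmed")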


7. Environmental Impact

MicroLens v3 was trained on a single workstation GPU for ~13 hours (1-epoch rich-format resume from v2 checkpoint-18351). The v2 base run that v3 resumes from took an additional ~37 hours.

| Factor | Value |
|---|---|
| GPU | RTX 3090 Ti · ~400 W under sustained fine-tune load |
| CPU + chassis + cooling overhead | ~140 W |
| Wall-time, v3 (1 epoch, rich format) | ~13 h |
| Wall-time, v2 (3 epochs, base) | ~37 h |
| Cumulative wall-time | ~50 h |
| Estimated energy (v3 step) | ~3 kWh |
| Estimated energy (cumulative v2 + v3) | ~12.5 kWh |

At the 2024 Austrian grid carbon intensity (110 g CO₂ / kWh), the v3 training step emits ~0.3 kg CO₂-equivalent and the full v2 + v3 pipeline ~1.4 kg CO₂-equivalent.

Inference cost is negligible: the Q4_K_M GGUF build runs on a mid-range Android phone at a few watts. MicroLens is designed so that the cumulative lifetime inference energy per query can be orders of magnitude smaller than a single cloud-inference call to a frontier model.


8. Technical Specifications

Architecture

  • Backbone: Gemma 4 (E2B), sparse-attention transformer decoder with an integrated vision encoder stack.
  • Vision encoder: Gemma 4 native vision tower (head dim 512).
  • Fusion: multimodal projector that lifts vision tokens into the language model embedding space (mmproj is shipped separately for GGUF runtimes).
  • Positional encoding: inherited from Gemma 4 base.
  • Attention backend: SDPA (scaled dot-product attention) during both fine-tune and inference. FlashAttention-2 is not usable: Gemma 4's vision-tower head dim (512) exceeds the FA-2 kernel limit (256).

Adapter layout

  • LoRA rank: 32
  • LoRA α: 64
  • LoRA dropout: 0.05
  • Target modules: all linear projections across both the language and vision sub-networks (attention Q/K/V/O and MLP gate/up/down), enabling multimodal co-adaptation rather than a language-only adapter.
  • Merged adapter size: 228 MB (bf16).

Quantisations shipped

  • Merged FP16: full-precision full-model snapshot (8.7 GB), Transformers-native.
  • GGUF Q4_K_M: 4-bit quantised weights via llama.cpp convert pipeline (3.2 GB). Pairs with the BF16 mmproj (942 MB) for full multimodal inference.
  • LoRA-only (bf16): for users who want to re-merge against a different Gemma 4 E2B base or stack additional adapters (see the sketch below).
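
That re-merge follows the usual PEFT pattern (a sketch; whether the adapter lives at the repo root or in a subfolder is an assumption, so check the repo's file listing first):

import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

# Load the Gemma 4 E2B base, attach the MicroLens LoRA adapter, then bake it in.
base = AutoModelForVision2Seq.from_pretrained(
    "unsloth/gemma-4-E2B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "Laborator/microlens-gemma4-e2b")
merged = model.merge_and_unload()          # fold the adapter into the base weights
merged.save_pretrained("microlens-merged-bf16")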

Software dependencies at training time

  • unsloth == 2026.4.7
  • transformers == 5.6.0
  • torch == 2.10 (CUDA 12.8)
  • peft, bitsandbytes, trl from Unsloth's pinned resolver.

9. How to Use

Transformers (merged FP16)

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_id = "Laborator/microlens-gemma4-e2b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("my_microscopy_image.jpg").convert("RGB")
prompt = "Describe what you see in this microscopy image. Identify the subject and key visual features."

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text":  prompt},
]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=220, temperature=0.3, do_sample=True)
print(processor.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Unsloth (LoRA on top of base)

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    load_in_4bit=True,
    use_gradient_checkpointing=False,
)
FastVisionModel.for_inference(model)
# … same prompting pattern as above.

Ollama / llama.cpp (Q4_K_M)

# download microlens-gemma4-e2b-Q4_K_M.gguf and mmproj-bf16.gguf from the HF repo
ollama create microlens -f Modelfile        # see repo for Modelfile
ollama run microlens "Describe this sample. ./slide_01.jpg"   # image paths included in the prompt are picked up by multimodal models
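
Docker Model Runner

The model can also be pulled straight from the Hub via Docker Model Runner (the BF16 tag is the one published on the repo):

docker model run hf.co/Laborator/microlens-gemma4-e2b:BF16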

10. Citation

If you use MicroLens in a publication, project, or downstream model, please cite:

@software{brinza_microlens_2026,
  title        = {MicroLens: a microscopy vision-language model fine-tuned from Gemma 4 E2B},
  author       = {Brinza, Serghei},
  year         = {2026},
  month        = may,
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/Laborator/microlens-gemma4-e2b},
  note         = {Submission to the Gemma 4 Good Hackathon (Kaggle, May 2026)}
}

Upstream works used by MicroLens:

@misc{gemma4_2026,
  title        = {Gemma 4 Technical Report},
  author       = {Google DeepMind},
  year         = {2026},
  note         = {Base model: unsloth/gemma-4-E2B-it}
}

@misc{unsloth_2026,
  title        = {Unsloth: 2x faster LLM fine-tuning},
  author       = {Daniel Han and Michael Han and Unsloth team},
  year         = {2026},
  url          = {https://github.com/unslothai/unsloth}
}

@misc{qwen3_vl_2025,
  title        = {Qwen3-VL: Vision-Language Models},
  author       = {Alibaba Qwen Team},
  year         = {2025},
  note         = {Teacher model for distillation, Apache 2.0}
}

11. Model Card Authors

  • Serghei Brinza · Vienna, Austria · sole author of the model, the training pipeline, and this card.

12. Model Card Contact


MicroLens · built for the Gemma 4 Good Hackathon.
