---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- gemma
- gemma-4
- vision-language
- microscopy
- scientific-imaging
- lora
- qlora
- unsloth
- citizen-science
- education
- edge-deployment
license: apache-2.0
base_model: unsloth/gemma-4-E2B-it
model-index:
- name: MicroLens
results: []
---
# MicroLens · Microscopy Vision-Language Model
A small, fine-tuned multimodal model that turns a **$150 Android phone + a clip-on microscope** into a field-ready assistant for pollen surveys, pond-water plankton, mineral identification, plant-disease triage, and more. Runs offline.
- **Base model:** [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) (4.65 B params)
- **Adapter / Merged / GGUF:** [`Laborator/microlens-gemma4-e2b`](https://huggingface.co/Laborator/microlens-gemma4-e2b)
- **Source code:** [`SergheiBrinza/microlens`](https://github.com/SergheiBrinza/microlens)
- **Submission for:** *The Gemma 4 Good Hackathon*, Kaggle · May 2026
- **License:** Apache 2.0 (weights · code · dataset, see component licenses below)
---
## Table of Contents
1. [Model Details](#1-model-details)
2. [Intended Use](#2-intended-use)
3. [Training Data](#3-training-data)
4. [Training Procedure](#4-training-procedure)
5. [Evaluation](#5-evaluation)
6. [Bias, Risks & Limitations](#6-bias-risks--limitations)
7. [Environmental Impact](#7-environmental-impact)
8. [Technical Specifications](#8-technical-specifications)
9. [How to Use](#9-how-to-use)
10. [Citation](#10-citation)
11. [Model Card Authors](#11-model-card-authors)
12. [Model Card Contact](#12-model-card-contact)
---
## 1. Model Details
| Field | Value |
|---|---|
| **Name** | MicroLens |
| **Version** | 1.0 · May 2026 |
| **Author** | Serghei Brinza · Vienna, Austria |
| **Model type** | Vision-Language (image + text → text) |
| **Language(s)** | English (primary); multilingual output via Gemma 4 base tokenizer |
| **Base model** | `unsloth/gemma-4-E2B-it` (Gemma 4 Effective-2B, instruction-tuned) |
| **Parameters** | 4.65 B total · 59.7 M trainable during fine-tune (1.34 %) |
| **License** | Apache 2.0 |
| **Finetuning method** | Unsloth FastVisionModel + 4-bit QLoRA (LoRA adapter, r = 32, α = 64, dropout = 0.05) |
| **Framework** | Unsloth 2026.4.7 · Transformers 5.6.0 · PyTorch 2.10 · CUDA 12.8 |
| **Hardware** | 1 × NVIDIA RTX 3090 Ti (24 GB VRAM) |
| **Training time** | ~13 h (v3 = 1 epoch on rich format, ~6,200 steps · resumed from v2 checkpoint-18351 · v2 base = 3 epochs, ~37 h) |
| **Distilled from** | Qwen3-VL-8B-AWQ (Apache 2.0, thinking mode, 3 × vLLM workers) |
### Distribution artefacts
| Artefact | Size | Purpose | Target runtime |
|---|---|---|---|
| **LoRA adapter** | 228 MB | Load on top of base Gemma 4 E2B | Unsloth / PEFT / Transformers |
| **Merged FP16** | 8.7 GB | Full stand-alone model | Transformers / vLLM / SGLang |
| **GGUF Q4_K_M** | 3.2 GB | 4-bit quantised weights | Ollama · llama.cpp · LM Studio |
| **BF16 `mmproj`** | 942 MB | Vision projector for GGUF runtimes | Ollama · llama.cpp |
All artefacts live on the same HF repo: [Laborator/microlens-gemma4-e2b](https://huggingface.co/Laborator/microlens-gemma4-e2b).
---
## 2. Intended Use
MicroLens is built to **lower the cost of scientific observation** in places where expert knowledge or network access is scarce.
### Primary intended uses
- **Citizen science.** Volunteers contributing to pollen surveys, pond-water biodiversity counts, or amateur mineralogy can capture a smartphone-microscope image and receive a structured natural-language description of the subject and its key visual features.
- **Education.** Offline biology / earth-science classes; the model runs on the same Android tablet the students already use, with no cloud call required.
- **Research support.** Pre-screening for pollen monitoring, zooplankton surveys, and mineral field work, where the model narrows the candidate set before an expert confirms.
- **Digital equity.** The Q4_K_M GGUF build runs on mid-range Android hardware (~$150 phones with 6 GB RAM) via `llama.cpp` / `MLC`. No API key, no telemetry, no internet.
### Intended users
- Citizen-science volunteers (amateur botanists, beekeepers, freshwater monitors).
- Teachers and students in biology / earth-science courses, particularly in low-connectivity regions.
- Researchers doing preliminary triage of large microscopy datasets.
- Hackathon / jury members evaluating *The Gemma 4 Good Hackathon* submission.
### Out-of-scope uses
- **Medical diagnosis.** MicroLens has **not** been trained on medical imaging (histology, cytology, pathology, radiology). Do not use it to diagnose disease in humans or animals.
- **Legally or biologically authoritative species identification.** The model returns descriptions, not court-defensible or taxonomically rigorous identifications.
- **Materials outside the 9 trained categories.** Feeding the model an unrelated image (e.g. a face, a landscape, a screenshot) produces an answer but the answer is not grounded in its training and should be treated as unreliable.
- **Forensics, compliance, or regulated decision-making.** Do not chain MicroLens into any pipeline where a confident but wrong output can harm a person or violate regulation.
---
## 3. Training Data
### Category distribution
All samples are microscopy images from **6 open-licensed source datasets** (**AquaScope** · **ZooLake** · **UDE Diatoms** · **DiatlAS** · **TgFC** · **Marine zooplankton dataset**). Total: **122,399 microscopy image-question-answer triples plus a 1,500-sample negative class** (99,215 train · 12,331 validation · 12,353 test, negatives included in the splits) · **146 genera** across **9 categories**.
| # | Category | Typical subjects | Source datasets |
|---|---|---|---|
| 1 | Diatoms | Pennate / centric diatoms · genus-level taxonomy | UDE Diatoms · DiatlAS |
| 2 | Freshwater zooplankton | Cladocerans · copepods · rotifers | AquaScope · ZooLake |
| 3 | Marine zooplankton | Copepods · larvae · medusae · marine crustaceans | Marine zooplankton dataset |
| 4 | Fungal spores | Conidia · ascospores · basidiospores · spore morphology | curated subset |
| 5 | Fish larvae | Early-stage fish larvae and pre-larval forms | TgFC |
| 6 | Pollen | Grass · tree · flower pollen grains · aperture morphology | curated subset |
| 7 | Minerals | Thin-section petrographic slides · crystal habit | curated subset |
| 8 | Plant disease | Leaf lesions · phytopathogen morphology · chlorosis / necrosis | curated subset |
| 9 | Snowflakes | Macro / microphotographed snow crystals · dendrite / plate / column | curated subset |
| | **Total: 99,215 train · 12,331 val · 146 genera (top-30 hand-curated KB)** | | |
### Class balancing & negative-class
The 9 categories naturally vary in genus density. **Long-tail genera** (~100 of 146 have fewer than 100 samples each) get category-generic morphology rather than genus-specific cues — this is the correct conservative behaviour given training coverage. The **30 most-common genera** have hand-curated knowledge-base entries (morphology · habitat · ID cues from AlgaeBase, WoRMS, ITIS, Round 1990, Krammer-Lange-Bertalot 1986–1991).
A **synthetic negative-class of 1,500 non-microscopy images** (faces · landscapes · screenshots) was added so the model learns to refuse out-of-distribution inputs at inference time.
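The 100-sample long-tail cutoff and the top-30 knowledge-base selection can be made concrete with a small helper. This is an illustrative sketch only: the function name, record format, and thresholds are hypothetical, not the actual curation code.

```python
from collections import Counter

def split_by_coverage(records, kb_size=30, long_tail_max=100):
    """Partition genera by sample count: the most frequent genera get
    hand-curated knowledge-base entries, sparsely sampled genera fall
    back to category-generic morphology descriptions."""
    counts = Counter(genus for _, genus in records)
    ranked = [g for g, _ in counts.most_common()]
    kb_genera = set(ranked[:kb_size])                           # top-N curated genera
    long_tail = {g for g, n in counts.items() if n < long_tail_max}
    return kb_genera, long_tail

# Tiny synthetic example: one common genus, one long-tail genus.
records = [(f"img_{i}.png", "Navicula") for i in range(150)]
records += [(f"img_t{i}.png", "Cymbella") for i in range(20)]
kb, tail = split_by_coverage(records, kb_size=1)
print(kb, tail)  # → {'Navicula'} {'Cymbella'}
```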
### Description generation pipeline (distillation)
Natural-language descriptions (`question` → `answer`) were generated from the raw images using **Qwen3-VL-8B-AWQ (Apache 2.0)** running in **thinking mode** across **3 × vLLM workers** in parallel. For every image:
1. The teacher sees the image alongside a structured prompt asking for subject identification and key visual features.
2. The teacher produces a detailed chain-of-thought inside `<think>…</think>` tags, then a concise final answer.
3. Only the final answer is kept as the training target; the `<think>` trace is discarded.
This is a **teacher-student distillation**. The student (MicroLens / Gemma 4 E2B) inherits the teacher's descriptive style while being **~1.7× smaller** in parameter count and dramatically smaller after Q4 quantisation.
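Step 3 above amounts to a small post-processing function over the teacher's raw output. A minimal sketch: the `<think>` tag name is an assumption about the teacher's thinking-mode format, and the helper is illustrative rather than the actual pipeline code.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def extract_final_answer(teacher_output: str) -> str:
    """Strip the chain-of-thought trace, keeping only the concise
    final answer that becomes the student's training target."""
    return THINK_RE.sub("", teacher_output).strip()

raw = "<think>Striae density suggests Navicula...</think>A pennate diatom, likely Navicula."
print(extract_final_answer(raw))  # → A pennate diatom, likely Navicula.
```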
### Licensing of training data
All upstream datasets were checked for license compatibility. Accepted licenses: **Apache 2.0**, **MIT**, **CC-BY**, **CC-BY-SA**, **CC0**. **Zero** samples were used from unlicensed or research-only datasets. The distilled VQA pairs are released under **Apache 2.0** alongside the model.
---
## 4. Training Procedure
### Hyperparameters
| Hyperparameter | Value |
|---|---|
| Fine-tuning method | **Unsloth FastVisionModel + 4-bit QLoRA** (NF4 base + LoRA adapter in bf16) |
| LoRA rank (r) | **32** |
| LoRA α | **64** |
| LoRA dropout | **0.05** |
| Trainable parameters | **59.7 M** (1.34 % of 4.65 B) |
| Target modules | All linear projections (vision tower + language tower) |
| Optimizer | AdamW (8-bit) |
| Learning rate | **5 × 10⁻⁵** (4× softer than v2's 2 × 10⁻⁴ — gentle re-learning of rich format) |
| LR schedule | Linear warmup (100 steps) → linear decay |
| Batch size (per device) | 2 |
| Gradient accumulation | 8 |
| **Effective batch size** | **16** |
| Max sequence length | 2048 tokens |
| Epochs (v3) | **1** (rich format, resumed from v2 checkpoint-18351) |
| Steps (v3) | **~6,200** |
| v2 base config | 3 epochs · lr 2 × 10⁻⁴ · 18,351 steps · ~37 h wall-clock |
| Mixed precision | bf16 |
| Gradient checkpointing | enabled during fine-tune (Unsloth's optimised path) |
| Seed | 3407 |
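The adapter-related rows of the table map onto an Unsloth setup roughly as follows. This is a hedged reconstruction for orientation, not the actual training script; argument names follow Unsloth's `FastVisionModel` API, and the calls require a CUDA GPU, so treat it as a config fragment.

```python
from unsloth import FastVisionModel  # requires a CUDA GPU; shown for illustration

# 4-bit NF4 base weights (QLoRA) + bf16 LoRA adapter, per the table above.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    load_in_4bit=True,
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,    # adapter spans the vision tower...
    finetune_language_layers=True,  # ...and the language tower
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    random_state=3407,
)
```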
### Hardware & runtime
- 1 × NVIDIA GeForce **RTX 3090 Ti** (24 GB GDDR6X) — single GPU only (Unsloth currently does not support multi-GPU)
- AMD Ryzen host · 64 GB system RAM
- Ubuntu 24.04 · CUDA 12.8 · PyTorch 2.10
- Wall-clock **v3** (1 epoch rich-format resume): **~13 h**
- Wall-clock **v2** (3 epochs base training that v3 resumed from): **~37 h**
- Cumulative wall-clock through full v2 + v3 pipeline: **~50 h**
### Loss curves
| Stage | Split | Final loss |
|---|---|---|
| v2 (3 epochs base) | Eval (12,331-image holdout) | **~0.21** |
| v3 (1 epoch rich-format resume) | Eval (220-image stratified holdout) | **0.0213** |
The v3 step preserves v2's category/genus accuracy (no drift, ~45 % top-1 genus accuracy on 146 classes vs random baseline of 0.7 %) while overwriting the response format prior to produce structured rich answers (genus · morphology · habitat · ID cues). The 4× softer learning rate (5e-5 vs 2e-4) was specifically chosen for this gentle re-learning step — without it, 1 epoch on rich format would degrade the genus signal v2 had already learned.
### Attention backend
Gemma 4's vision encoder uses a **head dimension of 512**, which exceeds the 256-head-dim limit of current **FlashAttention-2** kernels. Fine-tuning and inference therefore use **PyTorch SDPA** (scaled-dot-product attention, memory-efficient path). On RTX 3090 Ti this is the correct default; SDPA is the only supported backend for Gemma 4 in Unsloth 2026.4.7 at the time of training. Unsloth's FastVisionModel adds custom 4-bit QLoRA kernels and `UnslothVisionDataCollator` on top of this backend, which together cut peak VRAM from ~38 GB (vanilla HF Transformers) to ~12 GB and roughly halve the per-step time.
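When loading the merged checkpoint with plain Transformers, the backend can be pinned explicitly so that an environment with FlashAttention-2 installed does not attempt (and fail) to use it on the 512-dim vision heads. A minimal sketch, using the standard `attn_implementation` argument; treat it as a config fragment (it downloads the full 8.7 GB checkpoint):

```python
import torch
from transformers import AutoModelForVision2Seq

# Pin the SDPA backend explicitly; FlashAttention-2 kernels reject
# head dims above 256, which rules them out for Gemma 4's vision tower.
model = AutoModelForVision2Seq.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```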
---
## 5. Evaluation
Evaluation is **qualitative and per-category**, reflecting the spirit of the submission (an *assistive descriptor*, not a classifier). For every category we sampled images from the held-out validation split and compared the MicroLens answer against the original Qwen3-VL-8B teacher answer.
| Category | Observation |
|---|---|
| Pollen | Consistent identification of pollen vs. non-pollen. Species-level guesses degrade gracefully into morphological descriptions (shape, aperture, surface texture). |
| Algae | Separates filamentous vs. unicellular vs. colonial. Genus-level names are best-effort. |
| Yeast | Reliable identification of budding cells; distinguishes yeast from bacteria. |
| Minerals | Good at gross texture (crystalline, granular, foliated) and colour; specific mineral names can be off when the sample lacks diagnostic features visible in brightfield. |
| Plant disease | Strong on lesion descriptions (chlorosis, necrosis, spotting); pathogen identification is probabilistic. |
| PCB | Identifies trace patterns, solder joints, component silhouettes. Not intended for defect triage; it describes rather than grades. |
| Snowflakes | Dendrite / plate / column classification is reliable; novel crystal habits are described morphologically. |
| Zooplankton | Copepods, rotifers, and common cladocerans are consistently named. Rare subclasses degrade gracefully (see limitations). |
| Diatoms | Pennate vs. centric distinction is reliable; genus-level naming on common Naviculales / Cymbellaceae / Aulacoseiraceae is consistent. Long-tail diatoms degrade gracefully into morphological description (raphe / striae / valve outline). |
There is no single accuracy number for MicroLens, because the output space is free-form natural language. The correct axis of evaluation is *"does the description help a human in the field decide what to do next?"*. For the trained categories, it does.
---
## 6. Bias, Risks & Limitations
### Known failure modes
- **Graceful degradation on rare zooplankton subclasses.** Specimens from Branchiopoda, Decapoda, and other sparsely represented orders are typically described as *"marine zooplankton"* or *"crustacean-like organism"* rather than named at the correct taxonomic level. This is the correct conservative behaviour given training coverage; the model does not fabricate taxonomy it cannot defend.
- **Small-model ceiling.** MicroLens is built on Gemma 4 **E2B** (effective 2-billion scale). On edge cases the teacher (Qwen3-VL-8B) was stronger; the student inherits the style but not the full capability. Expect the student to be **close to, but not equal to**, the teacher on hard examples.
- **English-first.** Scientific terminology is maximally accurate in English. The Gemma 4 base model is multilingual, so translated output is available, but translations can simplify or partially drop domain terms; always verify critical terms in the English answer.
- **Out-of-distribution images.** Photographs that are not microscopy (landscapes, faces, screenshots) will still produce text. That text is not grounded in the training distribution and should not be trusted.
### Risks
- **Over-trust by non-experts.** A fluent natural-language description can feel more authoritative than it is. Treat MicroLens as a first-pass field note, not as an oracle. Verify before publishing, diagnosing, or acting on any output.
- **Distribution shift.** The training data is dominated by lab-quality or curated-quality images. Field images taken through cheap clip-on phone microscopes have more motion blur, chromatic aberration, and inconsistent illumination. Descriptions on those inputs remain helpful but are more generic.
### Ethical considerations
- **Distillation is explicitly disclosed.** Training data was generated from Qwen3-VL-8B-AWQ (Apache 2.0). Qwen's license permits this; Qwen is credited in the Citation section.
- **Dataset provenance is audited.** Only Apache / MIT / CC-BY / CC-BY-SA / CC0 upstream data was used. **Zero** non-licensed images were included.
- **No faces, no PII.** The training pool contains microscopy subjects only: no human faces, no personally identifiable information, no private medical imaging.
### Recommended usage pattern
1. Capture image → 2. MicroLens describes it → 3. Human confirms or rejects → 4. Log both.
The model produces a first draft. Final decisions stay with the user.
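The four-step pattern can be closed with a trivial logging helper so that every model draft and every human verdict is kept side by side and stays auditable. A minimal sketch; the record fields and file name are illustrative, not part of the shipped tooling.

```python
import json
import time

def log_observation(image_path, model_answer, human_verdict,
                    logfile="field_log.jsonl"):
    """Step 4 of the pattern above: append both the model's draft and
    the human decision to a JSON-lines field log."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "image": image_path,
        "model_answer": model_answer,
        "human_verdict": human_verdict,  # e.g. "confirmed" | "rejected"
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_observation("slide_01.jpg", "Pennate diatom, likely Navicula.", "confirmed")
```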
---
## 7. Environmental Impact
MicroLens v3 was trained on a **single workstation GPU for ~13 hours** (1-epoch rich-format resume from v2 checkpoint-18351). The v2 base run that v3 resumes from took an additional ~37 hours.
| Factor | Value |
|---|---|
| GPU | RTX 3090 Ti · ~400 W under sustained fine-tune load |
| CPU + chassis + cooling overhead | ~140 W |
| Wall-time **v3** (1 epoch rich) | ~13 h |
| Wall-time **v2** (3 epochs base) | ~37 h |
| Cumulative wall-time | ~50 h |
| **Estimated energy (v3 step)** | **~7 kWh** (≈540 W × 13 h) |
| **Estimated energy (cumulative v2 + v3)** | **~27 kWh** (≈540 W × 50 h) |
At the Austrian 2024 grid carbon intensity (~110 g CO₂ / kWh), the **v3 training step emits ~0.8 kg CO₂-equivalent**, and the full v2 + v3 pipeline emits **~3 kg CO₂-equivalent**.
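As a sanity check, the wattage figures in the table convert to energy and CO₂ as follows, assuming the full ~540 W combined draw is sustained for the entire run (a deliberately pessimistic upper bound):

```python
def training_footprint(hours, gpu_watts=400, overhead_watts=140,
                       grid_gco2_per_kwh=110):
    """Back-of-envelope energy (kWh) and emissions (kg CO2-eq) estimate
    from wall-clock hours and sustained power draw."""
    kwh = hours * (gpu_watts + overhead_watts) / 1000
    kg_co2 = kwh * grid_gco2_per_kwh / 1000
    return round(kwh, 1), round(kg_co2, 2)

print(training_footprint(13))  # v3 step → (7.0, 0.77)
print(training_footprint(50))  # cumulative v2 + v3 → (27.0, 2.97)
```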
Inference cost is negligible: the Q4_K_M GGUF build runs on a mid-range Android phone at a few watts. MicroLens is designed so that the cumulative lifetime inference energy per query can be orders of magnitude smaller than a single cloud-inference call to a frontier model.
---
## 8. Technical Specifications
### Architecture
- **Backbone:** Gemma 4 (E2B), sparse-attention transformer decoder with an integrated vision encoder stack.
- **Vision encoder:** Gemma 4 native vision tower (head dim 512).
- **Fusion:** multimodal projector that lifts vision tokens into the language model embedding space (`mmproj` is shipped separately for GGUF runtimes).
- **Positional encoding:** inherited from Gemma 4 base.
- **Attention backend:** SDPA (scaled dot-product attention) during both fine-tune and inference. FlashAttention-2 is **not** usable: Gemma 4's vision-tower head dim (512) exceeds the FA-2 kernel limit (256).
### Adapter layout
- **LoRA rank:** 32
- **LoRA α:** 64
- **LoRA dropout:** 0.05
- **Target modules:** all linear projections across both the language and vision sub-networks (attention Q/K/V/O and MLP gate/up/down), enabling multimodal co-adaptation rather than a language-only adapter.
- **Merged adapter size:** 228 MB (bf16).
### Quantisations shipped
- **Merged FP16**: full-precision full-model snapshot (8.7 GB), Transformers-native.
- **GGUF Q4_K_M**: 4-bit quantised weights via `llama.cpp` convert pipeline (3.2 GB). Pairs with the BF16 `mmproj` (942 MB) for full multimodal inference.
- **LoRA-only (bf16)**: for users who want to re-merge against a different Gemma 4 E2B base or stack additional adapters.
### Software dependencies at training time
- `unsloth == 2026.4.7`
- `transformers == 5.6.0`
- `torch == 2.10` (CUDA 12.8)
- `peft`, `bitsandbytes`, `trl` from Unsloth's pinned resolver.
---
## 9. How to Use
### Transformers (merged FP16)
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_id = "Laborator/microlens-gemma4-e2b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("my_microscopy_image.jpg").convert("RGB")
prompt = "Describe what you see in this microscopy image. Identify the subject and key visual features."

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": prompt},
]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text, images=image, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=220, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### Unsloth (LoRA on top of base)
```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    load_in_4bit=True,
    use_gradient_checkpointing=False,
)
FastVisionModel.for_inference(model)
# … same prompting pattern as above.
```
### Ollama / llama.cpp (Q4_K_M)
```bash
# download microlens-gemma4-e2b-Q4_K_M.gguf and mmproj-bf16.gguf from the HF repo
ollama create microlens -f Modelfile # see repo for Modelfile
ollama run microlens "Describe this sample. ./slide_01.jpg"  # image paths go inside the prompt
```
---
## 10. Citation
If you use MicroLens in a publication, project, or downstream model, please cite:
```bibtex
@software{brinza_microlens_2026,
title = {MicroLens: a microscopy vision-language model fine-tuned from Gemma 4 E2B},
author = {Brinza, Serghei},
year = {2026},
month = may,
publisher = {Hugging Face},
url = {https://huggingface.co/Laborator/microlens-gemma4-e2b},
note = {Submission to the Gemma 4 Good Hackathon (Kaggle, May 2026)}
}
```
Upstream works used by MicroLens:
```bibtex
@misc{gemma4_2026,
title = {Gemma 4 Technical Report},
author = {Google DeepMind},
year = {2026},
note = {Base model: unsloth/gemma-4-E2B-it}
}
@misc{unsloth_2026,
title = {Unsloth: 2x faster LLM fine-tuning},
author = {Daniel Han and Michael Han and Unsloth team},
year = {2026},
url = {https://github.com/unslothai/unsloth}
}
@misc{qwen3_vl_2025,
title = {Qwen3-VL: Vision-Language Models},
author = {Alibaba Qwen Team},
year = {2025},
note = {Teacher model for distillation, Apache 2.0}
}
```
---
## 11. Model Card Authors
- **Serghei Brinza** · Vienna, Austria · sole author of the model, the training pipeline, and this card.
---
## 12. Model Card Contact
- **Hugging Face:** [`Laborator/microlens-gemma4-e2b`](https://huggingface.co/Laborator/microlens-gemma4-e2b). Open an issue / discussion on the repo.
- **GitHub:** [`SergheiBrinza/microlens`](https://github.com/SergheiBrinza/microlens). Issues, pull requests, dataset corrections welcome.
---
*MicroLens · built for the Gemma 4 Good Hackathon.*