Instructions for using Laborator/microlens-gemma4-e2b with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers

How to use Laborator/microlens-gemma4-e2b with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Laborator/microlens-gemma4-e2b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Laborator/microlens-gemma4-e2b", dtype="auto")
```

- llama-cpp-python
How to use Laborator/microlens-gemma4-e2b with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Laborator/microlens-gemma4-e2b",
    filename="gguf/gemma-4-e2b-it.BF16-mmproj.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ]
)
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Laborator/microlens-gemma4-e2b with llama.cpp:
Install from brew

```bash
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Laborator/microlens-gemma4-e2b:BF16

# Run inference directly in the terminal:
llama-cli -hf Laborator/microlens-gemma4-e2b:BF16
```

Install from WinGet (Windows)

```bash
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Laborator/microlens-gemma4-e2b:BF16

# Run inference directly in the terminal:
llama-cli -hf Laborator/microlens-gemma4-e2b:BF16
```

Use pre-built binary

```bash
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Laborator/microlens-gemma4-e2b:BF16

# Run inference directly in the terminal:
./llama-cli -hf Laborator/microlens-gemma4-e2b:BF16
```

Build from source code

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Laborator/microlens-gemma4-e2b:BF16

# Run inference directly in the terminal:
./build/bin/llama-cli -hf Laborator/microlens-gemma4-e2b:BF16
```

Use Docker

```bash
docker model run hf.co/Laborator/microlens-gemma4-e2b:BF16
```
- LM Studio
- Jan
- vLLM
How to use Laborator/microlens-gemma4-e2b with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Laborator/microlens-gemma4-e2b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Laborator/microlens-gemma4-e2b",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```

Use Docker

```bash
docker model run hf.co/Laborator/microlens-gemma4-e2b:BF16
```
- SGLang
How to use Laborator/microlens-gemma4-e2b with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Laborator/microlens-gemma4-e2b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Laborator/microlens-gemma4-e2b",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```

Use Docker images

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Laborator/microlens-gemma4-e2b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Laborator/microlens-gemma4-e2b",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```

- Ollama
How to use Laborator/microlens-gemma4-e2b with Ollama:
```bash
ollama run hf.co/Laborator/microlens-gemma4-e2b:BF16
```

- Unsloth Studio
How to use Laborator/microlens-gemma4-e2b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Laborator/microlens-gemma4-e2b to start chatting
```

Install Unsloth Studio (Windows)

```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Laborator/microlens-gemma4-e2b to start chatting
```

Using HuggingFace Spaces for Unsloth

No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for Laborator/microlens-gemma4-e2b to start chatting.
- Pi
How to use Laborator/microlens-gemma4-e2b with Pi:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf Laborator/microlens-gemma4-e2b:BF16
```

Configure the model in Pi

```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Laborator/microlens-gemma4-e2b:BF16" }
      ]
    }
  }
}
```

Run Pi

```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use Laborator/microlens-gemma4-e2b with Hermes Agent:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf Laborator/microlens-gemma4-e2b:BF16
```

Configure Hermes

```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Laborator/microlens-gemma4-e2b:BF16
```

Run Hermes

```bash
hermes
```
- Docker Model Runner
How to use Laborator/microlens-gemma4-e2b with Docker Model Runner:
```bash
docker model run hf.co/Laborator/microlens-gemma4-e2b:BF16
```
- Lemonade
How to use Laborator/microlens-gemma4-e2b with Lemonade:
Pull the model
```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Laborator/microlens-gemma4-e2b:BF16
```

Run and chat with the model

```bash
lemonade run user.microlens-gemma4-e2b-BF16
```

List all available models

```bash
lemonade list
```
---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- gemma
- gemma-4
- vision-language
- microscopy
- scientific-imaging
- lora
- qlora
- unsloth
- citizen-science
- education
- edge-deployment
license: apache-2.0
base_model: unsloth/gemma-4-E2B-it
model-index:
- name: MicroLens
  results: []
---
# MicroLens · Microscopy Vision-Language Model

A small, fine-tuned multimodal model that turns a **$150 Android phone + a clip-on microscope** into a field-ready assistant for diatom-based water-quality assessment, freshwater zooplankton biodiversity surveys, fungal spore identification, and cyanobacterial bloom monitoring. Runs offline.

- **Base model:** [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) (4.65 B params)
- **Adapter / Merged / GGUF:** [`Laborator/microlens-gemma4-e2b`](https://huggingface.co/Laborator/microlens-gemma4-e2b)
- **Source code:** [`SergheiBrinza/microlens`](https://github.com/SergheiBrinza/microlens)
- **Submission for:** *The Gemma 4 Good Hackathon*, Kaggle · May 2026
- **License:** Apache 2.0 (weights · code · dataset; see component licenses below)

---
## Table of Contents

1. [Model Details](#1-model-details)
2. [Intended Use](#2-intended-use)
3. [Training Data](#3-training-data)
4. [Training Procedure](#4-training-procedure)
5. [Evaluation](#5-evaluation)
6. [Bias, Risks & Limitations](#6-bias-risks--limitations)
7. [Environmental Impact](#7-environmental-impact)
8. [Technical Specifications](#8-technical-specifications)
9. [How to Use](#9-how-to-use)
10. [Citation](#10-citation)
11. [Model Card Authors](#11-model-card-authors)
12. [Model Card Contact](#12-model-card-contact)

---
## 1. Model Details

| Field | Value |
|---|---|
| **Name** | MicroLens |
| **Version** | 1.0 · May 2026 |
| **Author** | Serghei Brinza · Vienna, Austria |
| **Model type** | Vision-Language (image + text → text) |
| **Language(s)** | English (primary); multilingual output via Gemma 4 base tokenizer |
| **Base model** | `unsloth/gemma-4-E2B-it` (Gemma 4 Effective-2B, instruction-tuned) |
| **Parameters** | 4.65 B total · 59.7 M trainable during fine-tune (1.34 %) |
| **License** | Apache 2.0 |
| **Finetuning method** | Unsloth FastVisionModel + 4-bit QLoRA (LoRA adapter, r = 32, α = 64, dropout = 0.05) |
| **Framework** | Unsloth 2026.4.7 · Transformers 5.6.0 · PyTorch 2.10 · CUDA 12.8 |
| **Hardware** | 1 × NVIDIA RTX 3090 Ti (24 GB VRAM) |
| **Training time** | ~13 h (v3 = 1 epoch on rich format, ~6,200 steps · resumed from v2 checkpoint-18351 · v2 base = 3 epochs, ~37 h) |
| **Distilled from** | Internal Teacher VLM (Apache 2.0, thinking mode, 3 × vLLM workers) |

### Distribution artefacts

| Artefact | Size | Purpose | Target runtime |
|---|---|---|---|
| **LoRA adapter** | 228 MB | Load on top of base Gemma 4 E2B | Unsloth / PEFT / Transformers |
| **Merged FP16** | 9.5 GB | Full stand-alone model | Transformers / vLLM / SGLang |
| **GGUF Q4_K_M** | 3.2 GB | 4-bit quantised weights | Ollama · llama.cpp · LM Studio |
| **BF16 `mmproj`** | 942 MB | Vision projector for GGUF runtimes | Ollama · llama.cpp |

All artefacts live on the same HF repo: [Laborator/microlens-gemma4-e2b](https://huggingface.co/Laborator/microlens-gemma4-e2b).
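
Individual artefacts can also be fetched programmatically with `huggingface_hub`. A minimal sketch (the filenames mirror the names used in Section 9, but check the repo's file listing for the exact paths):

```python
# Sketch: fetch single artefacts from the repo instead of cloning everything.
# The filenames below follow the names referenced in Section 9; verify them
# against the repo's "Files and versions" tab before use.
from huggingface_hub import hf_hub_download

repo_id = "Laborator/microlens-gemma4-e2b"

# 4-bit GGUF weights for llama.cpp / Ollama / LM Studio
gguf_path = hf_hub_download(repo_id, filename="microlens-gemma4-e2b-Q4_K_M.gguf")

# Vision projector required by GGUF runtimes
mmproj_path = hf_hub_download(repo_id, filename="mmproj-bf16.gguf")

print(gguf_path, mmproj_path)
```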

---

## 2. Intended Use
MicroLens is built to **lower the cost of scientific observation** in places where expert knowledge or network access is scarce.

### Primary intended uses
- **Citizen science.** Volunteers contributing to pond-water biodiversity counts, freshwater plankton surveys, or diatom-based water-quality monitoring can capture a smartphone-microscope image and receive a structured natural-language description of the subject and its key visual features.
- **Future of Education.** Offline biology / earth-science classes: the model runs on the same Android tablet the students already use, with no cloud call required.
- **Research support.** Pre-screening for plankton surveys, diatom-based water-quality monitoring, and fungal spore identification, where the model narrows the candidate set before an expert confirms.
- **Digital equity.** The Q4_K_M GGUF build runs on mid-range Android hardware (~$150 phones with 6 GB RAM) via `llama.cpp` / `MLC`. No API key, no telemetry, no internet.

### Intended users
- Citizen-science volunteers (amateur botanists, beekeepers, freshwater monitors).
- Teachers and students in biology / earth-science courses, particularly in low-connectivity regions.
- Researchers doing preliminary triage of large microscopy datasets.
- Jury members evaluating this submission to *The Gemma 4 Good Hackathon*.

### Out-of-scope uses
- **Medical diagnosis.** MicroLens has **not** been trained on medical imaging (histology, cytology, pathology, radiology). Do not use it to diagnose disease in humans or animals.
- **Legally or biologically authoritative species identification.** The model returns descriptions, not court-defensible or taxonomically rigorous identifications.
- **Materials outside the 8 trained categories.** Feeding the model an unrelated image (e.g. a face, a landscape, a screenshot) still produces an answer, but that answer is not grounded in its training data and should be treated as unreliable.
- **Forensics, compliance, or regulated decision-making.** Do not chain MicroLens into any pipeline where a confident but wrong output can harm a person or violate regulation.

---
## 3. Training Data

### Category distribution
All samples are microscopy images from **5 open-licensed source datasets** (**AquaScope** · **ZooLake** · **UDE Diatoms** · **DiatlAS** · **TgFC**). Total: **93,014 image-question-answer triples** (82,737 train · 10,277 validation · **123 genera** across **8 categories**).

| # | Category | Train samples | Typical subjects | Source datasets |
|---|---|---:|---|---|
| 1 | Diatoms | 64,043 (77.4%) | Pennate / centric diatoms · genus-level taxonomy | UDE Diatoms · DiatlAS |
| 2 | Freshwater zooplankton | 11,264 (13.6%) | Cladocerans · copepods · rotifers | AquaScope · ZooLake |
| 3 | Fungal spores | 4,188 (5.1%) | Conidia · ascospores · basidiospores · spore morphology | TgFC |
| 4 | Cyanobacteria | 1,091 (1.3%) | Filamentous and unicellular cyanobacteria | curated subset |
| 5 | Fish | 177 (0.2%) | Fish or fish part (pseudo-genus, category-level only) | TgFC |
| 6 | No specimen (service) | 1,350 (1.6%) | Background / empty fields for OOD detection | synthetic negatives |
| 7 | Debris (service) | 428 (0.5%) | Non-biological fragments for OOD handling | curated |
| 8 | Unknown (service) | 196 (0.2%) | Unidentified microscopy specimens for fallback | curated |
| | **Total** | **82,737** | 10,277 validation · 123 genera (top 30 with hand-curated KB entries) | |

### Class balancing
The 8 categories naturally vary in genus density. **Long-tail genera** (~100 of 123 have fewer than 100 samples each) get category-generic morphology rather than genus-specific cues; this is the correct conservative behaviour given training coverage. The **30 most-common genera** have hand-curated knowledge-base entries (morphology · habitat · ID cues from AlgaeBase, WoRMS, ITIS, Round 1990, Krammer-Lange-Bertalot 1986–1991).

### Description generation pipeline (distillation)
Natural-language descriptions (`question` → `answer`) were generated from the raw images using **Internal Teacher VLM (Apache 2.0)** running in **thinking mode** across **3 × vLLM workers** in parallel. For every image:

1. The teacher sees the image alongside a structured prompt asking for subject identification and key visual features.
2. The teacher produces a detailed chain-of-thought inside `<think>…</think>` tags, then a concise final answer.
3. Only the final answer is kept as the training target; the `<think>` trace is discarded.

This is a **teacher-student distillation**. The student (MicroLens / Gemma 4 E2B) inherits the teacher's descriptive style while being **~1.7× smaller** in parameter count and dramatically smaller after Q4 quantisation.
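
In code, the answer-extraction step of this pipeline amounts to stripping the reasoning trace. A minimal sketch (the helper name and the sample string are illustrative, not from the released pipeline):

```python
import re

def extract_final_answer(teacher_output: str) -> str:
    """Drop the <think>…</think> trace and keep only the teacher's final answer,
    which becomes the student's training target."""
    answer = re.sub(r"<think>.*?</think>", "", teacher_output, flags=re.DOTALL)
    return answer.strip()

# Illustrative teacher output (not a real sample from the dataset):
raw = "<think>Pennate outline, central raphe visible…</think>A pennate diatom with a distinct central raphe and fine striae."
print(extract_final_answer(raw))
```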

### Licensing of training data
All upstream datasets were checked for license compatibility. Accepted licenses: **Apache 2.0**, **MIT**, **CC-BY**, **CC-BY-SA**, **CC0**. **Zero** samples were used from unlicensed or research-only datasets. The distilled VQA pairs are released under **Apache 2.0** alongside the model.

### Source datasets & DOI

| Dataset | License | DOI |
|---|---|---|
| AquaScope (Eawag) | CC-BY 4.0 | 10.25678/0009YP |
| ZooLake (Eawag) | CC-BY 4.0 | 10.25678/0004DY |
| UDE Diatoms | CC-BY 4.0 | 10.1093/gigascience/giae087 |
| DIATLAS | CC-BY 4.0 | 10.5281/zenodo.16260887 |
| TgFC fungal spores | CC-BY 4.0 | 10.6084/m9.figshare.28855910 |

---
## 4. Training Procedure

### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Fine-tuning method | **Unsloth FastVisionModel + 4-bit QLoRA** (NF4 base + LoRA adapter in bf16) |
| LoRA rank (r) | **32** |
| LoRA α | **64** |
| LoRA dropout | **0.05** |
| Trainable parameters | **59.7 M** (1.34 % of 4.65 B) |
| Target modules | All linear projections (vision tower + language tower) |
| Optimizer | AdamW (8-bit) |
| Learning rate | **5 × 10⁻⁵** (4× softer than v2's 2 × 10⁻⁴, for gentle re-learning of the rich format) |
| LR schedule | Linear warmup (100 steps) → linear decay |
| Batch size (per device) | 2 |
| Gradient accumulation | 8 |
| **Effective batch size** | **16** |
| Max sequence length | 2048 tokens |
| Epochs (v3) | **1** (rich format, resumed from v2 checkpoint-18351) |
| Steps (v3) | **~6,200** |
| v2 base config | 3 epochs · lr 2 × 10⁻⁴ · 18,351 steps · ~37 h wall-clock |
| Mixed precision | bf16 |
| Gradient checkpointing | enabled during fine-tune (Unsloth's optimised path) |
| Seed | 3407 |
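
For orientation, the table above maps roughly onto the following Unsloth setup (a sketch, not the exact training script; it assumes Unsloth's public `FastVisionModel` API and omits the trainer and data collator):

```python
from unsloth import FastVisionModel

# Load the base model in 4-bit (QLoRA: NF4 base weights, bf16 LoRA adapter).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Attach a LoRA adapter to all linear projections in both towers,
# matching the hyperparameters listed above.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    random_state=3407,
)
```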

### Hardware & runtime
- 1 × NVIDIA GeForce **RTX 3090 Ti** (24 GB GDDR6X); single GPU only (Unsloth currently does not support multi-GPU)
- AMD Ryzen host · 64 GB system RAM
- Ubuntu 24.04 · CUDA 12.8 · PyTorch 2.10
- Wall-clock **v3** (1 epoch rich-format resume): **~13 h**
- Wall-clock **v2** (3 epochs base training that v3 resumed from): **~37 h**
- Cumulative wall-clock through the full v2 + v3 pipeline: **~50 h**

### Loss curves

| Stage | Split | Final loss |
|---|---|---|
| v2 (3 epochs base) | Eval (10,277-image holdout) | **~0.21** |
| v3 (1 epoch rich-format resume) | Eval (220-image stratified holdout) | **0.0213** |

The v3 step preserves v2's category/genus accuracy (no drift; ~45 % top-1 genus accuracy on 123 classes vs a uniform random baseline of ~0.8 %, i.e. 1/123) while overwriting the response-format prior to produce structured rich answers (genus · morphology · habitat · ID cues). The 4× softer learning rate (5e-5 vs 2e-4) was chosen specifically for this gentle re-learning step; without it, 1 epoch on the rich format would degrade the genus signal v2 had already learned.

### Attention backend
Gemma 4's vision encoder uses a **head dimension of 512**, which exceeds the 256-head-dim limit of current **FlashAttention-2** kernels. Fine-tuning and inference therefore use **PyTorch SDPA** (scaled-dot-product attention, memory-efficient path). On RTX 3090 Ti this is the correct default; SDPA is the only supported backend for Gemma 4 in Unsloth 2026.4.7 at the time of training. Unsloth's FastVisionModel adds custom 4-bit QLoRA kernels and `UnslothVisionDataCollator` on top of this backend, which together cut peak VRAM from ~38 GB (vanilla HF Transformers) to ~12 GB and roughly halve the per-step time.
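
Outside Unsloth, the same backend choice can be made explicit when loading with Transformers (a sketch; recent Transformers versions already default to SDPA where available):

```python
import torch
from transformers import AutoModelForVision2Seq

# Request the SDPA backend explicitly; FlashAttention-2 would fail here because
# the vision tower's head dimension (512) exceeds the FA-2 kernel limit (256).
model = AutoModelForVision2Seq.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```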

---

## 5. Evaluation
Evaluation is **qualitative and per-category**, reflecting the spirit of the submission (an *assistive descriptor*, not a classifier). For every category we sampled images from the held-out validation split and compared the MicroLens answer against the original internal teacher answer.

| Category | Observation |
|---|---|
| Diatoms | Pennate vs. centric distinction is reliable; genus-level naming on common Naviculales / Cymbellaceae / Aulacoseiraceae is consistent. Long-tail diatoms degrade gracefully into morphological description (raphe / striae / valve outline). |
| Freshwater zooplankton | Cladocerans and rotifers are consistently named at family level; common genera (Cyclops, Daphnia, Bosmina) are reliably tagged. |
| Fungal spores | Conidial vs. ascospore vs. basidiospore separation is reliable; common spore morphologies (Neopestalotiopsis, Colletotrichum, Olivea) receive genus-level naming. |
| Cyanobacteria | Identified at category level; specific cyanobacterial genus naming is best-effort due to small training share (~1% of total). |
| Fish | Pseudo-genus class with no species-level annotation in training. Returns a category-level templated description rather than species names. |

The 3 service classes (`debris`, `no_specimen`, `unknown`) are used for out-of-distribution handling and route the model to conservative, generic responses rather than taxonomic descriptions.

There is no single accuracy number for MicroLens, because the output space is free-form natural language. The correct axis of evaluation is *"does the description help a human in the field decide what to do next?"*. For the trained categories, it does.
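
The comparison above was done by hand, but the sampling loop is easy to script. A sketch of what such a harness could look like (the dataset id and column names are hypothetical placeholders, not a released dataset):

```python
# Sketch: per-category side-by-side review of MicroLens answers vs. stored teacher answers.
# "Laborator/microlens-vqa" and the column names are hypothetical placeholders.
import random
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Laborator/microlens-gemma4-e2b")
val = load_dataset("Laborator/microlens-vqa", split="validation")

for category in sorted(set(val["category"])):
    rows = [r for r in val if r["category"] == category]
    for row in random.sample(rows, k=min(3, len(rows))):
        messages = [{"role": "user", "content": [
            {"type": "image", "image": row["image"]},
            {"type": "text", "text": row["question"]},
        ]}]
        print(f"[{category}] teacher: {row['answer']}")
        print(f"[{category}] student: {pipe(text=messages)}")
```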

---

## 6. Bias, Risks & Limitations

### Known failure modes
- **Small-model ceiling.** MicroLens is built on Gemma 4 **E2B** (effective 2-billion scale). On edge cases the teacher (Internal Teacher) was stronger; the student inherits the style but not the full capability. Expect the student to be **close to, but not equal to**, the teacher on hard examples.
- **English-first.** Scientific terminology is maximally accurate in English. The Gemma 4 base model is multilingual, so translated output is available, but translations can simplify or partially drop domain terms; always verify critical terms in the English answer.
- **Out-of-distribution images.** Photographs that are not microscopy (landscapes, faces, screenshots) will still produce text. That text is not grounded in the training distribution and should not be trusted.

### Risks
- **Over-trust by non-experts.** A fluent natural-language description can feel more authoritative than it is. Treat MicroLens as a first-pass field note, not as an oracle. Verify before publishing, diagnosing, or acting on any output.
- **Distribution shift.** The training data is dominated by lab-quality or curated-quality images. Field images taken through cheap clip-on phone microscopes have more motion blur, chromatic aberration, and inconsistent illumination. Descriptions on those inputs remain helpful but are more generic.

### Ethical considerations
- **Distillation is explicitly disclosed.** Training descriptions were generated by the Internal Teacher VLM, which was used under its Apache 2.0 license.
- **Dataset provenance is audited.** Only Apache / MIT / CC-BY / CC-BY-SA / CC0 upstream data was used. **Zero** non-licensed images were included.
- **No faces, no PII.** The training pool contains microscopy subjects only: no human faces, no personally identifiable information, no private medical imaging.

### Recommended usage pattern
1. Capture image → 2. MicroLens describes it → 3. Human confirms or rejects → 4. Log both.

The model produces a first draft. Final decisions stay with the user.
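
A minimal sketch of that loop (the `describe` placeholder stands in for any inference path from Section 9; the CSV layout is an illustrative choice, not part of the released code):

```python
# Sketch: capture -> describe -> human confirms -> log both.
import csv
import datetime

def describe(image_path: str) -> str:
    """Placeholder for a MicroLens inference call (see Section 9)."""
    return "placeholder description"

def log_observation(image_path: str, log_file: str = "observations.csv") -> None:
    draft = describe(image_path)                    # 2. MicroLens describes it
    print(draft)
    verdict = input("Confirm description? [y/n] ")  # 3. human confirms or rejects
    with open(log_file, "a", newline="") as f:      # 4. log both
        csv.writer(f).writerow(
            [datetime.datetime.now().isoformat(), image_path, draft, verdict]
        )
```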

---

## 7. Environmental Impact
MicroLens v3 was trained on a **single workstation GPU for ~13 hours** (1-epoch rich-format resume from v2 checkpoint-18351). The v2 base run that v3 resumes from took an additional ~37 hours.

| Factor | Value |
|---|---|
| GPU | RTX 3090 Ti · ~400 W under sustained fine-tune load |
| CPU + chassis + cooling overhead | ~140 W |
| Wall-time **v3** (1 epoch rich) | ~13 h |
| Wall-time **v2** (3 epochs base) | ~37 h |
| Cumulative wall-time | ~50 h |
| **Estimated energy (v3 step)** | **~7 kWh** (≈540 W × 13 h) |
| **Estimated energy (cumulative v2 + v3)** | **~27 kWh** (≈540 W × 50 h) |

At the Austrian 2024 grid carbon intensity (~110 g CO₂ / kWh), the **v3 training step emits ~0.8 kg CO₂-equivalent**, and the full v2 + v3 pipeline emits **~3 kg CO₂-equivalent**.
Inference cost is negligible: the Q4_K_M GGUF build runs on a mid-range Android phone at a few watts. MicroLens is designed so that the cumulative lifetime inference energy per query can be orders of magnitude smaller than a single cloud-inference call to a frontier model.

---
## 8. Technical Specifications

### Architecture
- **Backbone:** Gemma 4 (E2B), sparse-attention transformer decoder with an integrated vision encoder stack.
- **Vision encoder:** Gemma 4 native vision tower (head dim 512).
- **Fusion:** multimodal projector that lifts vision tokens into the language model embedding space (`mmproj` is shipped separately for GGUF runtimes).
- **Positional encoding:** inherited from Gemma 4 base.
- **Attention backend:** SDPA (scaled dot-product attention) during both fine-tune and inference. FlashAttention-2 is **not** usable: Gemma 4's vision-tower head dim (512) exceeds the FA-2 kernel limit (256).

### Adapter layout
- **LoRA rank:** 32
- **LoRA α:** 64
- **LoRA dropout:** 0.05
- **Target modules:** all linear projections across both the language and vision sub-networks (attention Q/K/V/O and MLP gate/up/down), enabling multimodal co-adaptation rather than a language-only adapter.
- **Adapter size:** 228 MB (bf16).

### Quantisations shipped
- **Merged FP16**: unquantised full-model snapshot (9.5 GB), Transformers-native.
- **GGUF Q4_K_M**: 4-bit quantised weights via the `llama.cpp` convert pipeline (3.2 GB). Pairs with the BF16 `mmproj` (942 MB) for full multimodal inference.
- **LoRA-only (bf16)**: for users who want to re-merge against a different Gemma 4 E2B base or stack additional adapters (see the sketch below).
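
Re-merging the LoRA-only artefact onto a base checkpoint follows the standard PEFT pattern. A sketch (the auto-class and the adapter location are assumptions; adjust both to the actual repo layout):

```python
# Sketch: merge the LoRA-only artefact onto a Gemma 4 E2B base checkpoint.
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq

base = AutoModelForVision2Seq.from_pretrained(
    "unsloth/gemma-4-E2B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "Laborator/microlens-gemma4-e2b")  # adapter repo / subfolder
merged = model.merge_and_unload()           # bake the adapter into the base weights
merged.save_pretrained("microlens-merged")  # ready for Transformers / vLLM / SGLang
```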

### Software dependencies at training time
- `unsloth == 2026.4.7`
- `transformers == 5.6.0`
- `torch == 2.10` (CUDA 12.8)
- `peft`, `bitsandbytes`, `trl` from Unsloth's pinned resolver.

---

## 9. How to Use

### Transformers (merged FP16)
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_id = "Laborator/microlens-gemma4-e2b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("my_microscopy_image.jpg").convert("RGB")
prompt = "Describe what you see in this microscopy image. Identify the subject and key visual features."
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": prompt},
]}]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=220, temperature=0.3, do_sample=True)

print(processor.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### Unsloth (LoRA on top of base)

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Laborator/microlens-gemma4-e2b",
    load_in_4bit=True,
    use_gradient_checkpointing=False,
)
FastVisionModel.for_inference(model)
# … same prompting pattern as above.
```

### Ollama / llama.cpp (Q4_K_M)

```bash
# download microlens-gemma4-e2b-Q4_K_M.gguf and mmproj-bf16.gguf from the HF repo
ollama create microlens -f Modelfile   # see repo for Modelfile

# pass the image by including its path in the prompt:
ollama run microlens "Describe this sample. ./slide_01.jpg"
```

---

## 10. Citation
If you use MicroLens in a publication, project, or downstream model, please cite:

```bibtex
@software{brinza_microlens_2026,
  title     = {MicroLens: a microscopy vision-language model fine-tuned from Gemma 4 E2B},
  author    = {Brinza, Serghei},
  year      = {2026},
  month     = may,
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Laborator/microlens-gemma4-e2b},
  note      = {Submission to the Gemma 4 Good Hackathon (Kaggle, May 2026)}
}
```

Upstream works used by MicroLens:

```bibtex
@misc{gemma4_2026,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2026},
  note   = {Base model: unsloth/gemma-4-E2B-it}
}

@misc{unsloth_2026,
  title  = {Unsloth: faster LLM fine-tuning},
  author = {Daniel Han and Michael Han and Unsloth team},
  year   = {2026},
  url    = {https://github.com/unslothai/unsloth}
}
```

---

## 11. Model Card Authors
- **Serghei Brinza** · Vienna, Austria · sole author of the model, the training pipeline, and this card.

---

## 12. Model Card Contact
- **Hugging Face:** [`Laborator/microlens-gemma4-e2b`](https://huggingface.co/Laborator/microlens-gemma4-e2b). Open an issue / discussion on the repo.
- **GitHub:** [`SergheiBrinza/microlens`](https://github.com/SergheiBrinza/microlens). Issues, pull requests, and dataset corrections welcome.

---

*MicroLens · built for the Gemma 4 Good Hackathon.*