---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
base_model_relation: finetune
pipeline_tag: image-text-to-text
library_name: gguf
tags:
- herbarium
- biodiversity
- vision-language
- structured-output
- gguf
- llama.cpp
- gbif
language:
- en
---
# Herb-VISOR

**Visual Inspector for Specimen Observation & Recognition**

A 4B vision-language model that reads herbarium specimen images and emits structured, controlled-vocabulary JSON describing visible attributes (foliage, stem type, reproductive presence, and reference markers such as labels, barcodes, and scale bars). It reports what is visible on the sheet; it does not perform taxonomic identification.

Given a specimen image and its taxon name, the model returns schema-valid JSON with no prompt engineering.

- **Base model:** [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) (Apache 2.0)
- **Method:** full-weight fine-tune, teacher-student distillation
- **Format:** GGUF (llama.cpp-native), runs offline on an 8 GB-class GPU
- **Code, validation, and documentation:** [GitHub repository](https://github.com/CapPow/herb-visor)
## Quickstart (recommended)

A single command downloads the Q8 weights to llama.cpp's cache and fetches the projector automatically:

```bash
llama-server -hf CapPow/herb-visor:Q8 --temp 0 -c 8192
```

This serves an OpenAI-compatible endpoint at `127.0.0.1:8080`.
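
Loading the model can take a moment, so a client script may want to wait before sending its first request. The sketch below assumes the server exposes llama.cpp's `/health` route on the default address shown above. `is_ready` and `wait_until_ready` are hypothetical helper names for illustration, not part of the repository.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url="http://127.0.0.1:8080", timeout=2.0):
    """Return True once GET /health answers with HTTP 200 (assumed route)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, a timeout, or an error status while still loading.
        return False


def wait_until_ready(probe=is_ready, attempts=30, delay=1.0):
    """Poll `probe` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready()` before the first request avoids sending images to a server that is still loading weights. The `probe` argument exists only so the polling loop can be exercised without a live server.
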
## Manual download (alternative)

Only needed for offline/air-gapped use or to pin a specific file. The quickstart command above already handles downloads, so don't do both. Download the projector (required) plus one weight file:

| File | Purpose |
|---|---|
| [`herb-visor-4b-mmproj-f16.gguf`](https://huggingface.co/CapPow/herb-visor/resolve/main/herb-visor-4b-mmproj-f16.gguf?download=true) | vision projector, **required** for image input |
| [`herb-visor-4b-q8.gguf`](https://huggingface.co/CapPow/herb-visor/resolve/main/herb-visor-4b-q8.gguf?download=true) | model weights, q8 (**recommended**; ~8 GB VRAM) |
| [`herb-visor-4b-f16.gguf`](https://huggingface.co/CapPow/herb-visor/resolve/main/herb-visor-4b-f16.gguf?download=true) | model weights, f16 |

Pair the mmproj with either weight file, then run against the local files:

```bash
llama-server \
  --model herb-visor-4b-q8.gguf \
  --mmproj herb-visor-4b-mmproj-f16.gguf \
  --temp 0 \
  -c 8192 \
  --host 127.0.0.1 --port 8080
```

The inference contract is deliberately minimal: no system prompt, no schema
instructions. The only text input is the taxon binomial (standard casing, e.g.
`Acer pseudoplatanus`), with the specimen image attached. Use `temperature 0`
for deterministic output. The model also returns valid JSON without a taxon
name; the name is included to aid reproductive-trait alignment.
A minimal client ([`infer.py`](https://github.com/CapPow/herb-visor/blob/main/infer.py), pure Python standard library):

```bash
python infer.py path/to/specimen.jpg "Acer pseudoplatanus"
```

Or via the OpenAI-compatible endpoint. Build the request payload in Python
(a base64 image is too large to pass as a shell argument), then send it:

```bash
python3 <<'PY'
import base64
import json

# Read the specimen image and base64-encode it for a data URL.
with open("path/to/specimen.jpg", "rb") as fh:
    img = base64.b64encode(fh.read()).decode()

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Acer pseudoplatanus"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
        ]
    }],
    "temperature": 0
}

# Write the request body to a file so curl can read it with --data-binary.
with open("/tmp/req.json", "w") as fh:
    json.dump(payload, fh)
PY

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @/tmp/req.json | python3 -m json.tool
```
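
The same request can be sent without a temporary file or `curl`. The following is a minimal sketch using only the standard library; it is not the repository's `infer.py`. It assumes the server address from the commands above and that the assistant message contains bare JSON with no surrounding text. `build_payload`, `parse_reply`, and `describe` are hypothetical names.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes, taxon, mime="image/jpeg"):
    """Build the chat request: the binomial as text plus the image as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": taxon},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
        "temperature": 0,
    }


def parse_reply(response):
    """Decode the first choice's message content as JSON (assumes bare JSON)."""
    return json.loads(response["choices"][0]["message"]["content"])


def describe(image_path, taxon,
             url="http://127.0.0.1:8080/v1/chat/completions"):
    """Send one specimen image to the local server and return the parsed record."""
    with open(image_path, "rb") as fh:
        payload = build_payload(fh.read(), taxon)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_reply(json.load(resp))
```

With a server running, `describe("path/to/specimen.jpg", "Acer pseudoplatanus")` returns the record as a Python dict.
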
### Example output

For a pressed *Acer pseudoplatanus* sheet:

```json
{
  "type": "PH",
  "attached_photo": false,
  "structures": {
    "foliage": "present",
    "foliage_type": "leaf",
    "stem": "woody",
    "phenology": {
      "flower": false, "fruit": false, "pollen_cone": false,
      "seed_cone": false, "sporulating": false, "reproductive_unknown": false
    }
  },
  "refs": {
    "label": true, "barcode": false, "stamp": false,
    "crc": true, "scale_bar": true
  }
}
```

The full output schema is in the [repository](https://github.com/CapPow/herb-visor/blob/main/schema/schema.json).
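
Downstream code may want a quick shape check before storing a record. The sketch below is inferred only from the single example above, not from the repository's `schema.json`. The real schema may allow other types, such as nulls, and it defines the controlled vocabularies, which this check ignores. `check_record` is a hypothetical helper.

```python
PHENOLOGY_KEYS = {"flower", "fruit", "pollen_cone", "seed_cone",
                  "sporulating", "reproductive_unknown"}
REF_KEYS = {"label", "barcode", "stamp", "crc", "scale_bar"}


def _as_dict(value):
    """Treat anything that is not a dict as an empty dict."""
    return value if isinstance(value, dict) else {}


def check_record(record):
    """Return a list of problems; an empty list means the shape matches the example."""
    problems = []
    if not isinstance(record.get("type"), str):
        problems.append("type must be a string")
    if not isinstance(record.get("attached_photo"), bool):
        problems.append("attached_photo must be a boolean")

    structures = _as_dict(record.get("structures"))
    for key in ("foliage", "foliage_type", "stem"):
        if not isinstance(structures.get(key), str):
            problems.append(f"structures.{key} must be a string")

    # Both flag groups are expected to hold exactly the keys seen in the example.
    groups = (
        ("structures.phenology", PHENOLOGY_KEYS, _as_dict(structures.get("phenology"))),
        ("refs", REF_KEYS, _as_dict(record.get("refs"))),
    )
    for section, expected, holder in groups:
        if set(holder) != expected:
            problems.append(f"{section} has unexpected or missing keys")
        for key, value in holder.items():
            if not isinstance(value, bool):
                problems.append(f"{section}.{key} must be a boolean")
    return problems
```

For authoritative validation, check records against the published schema file instead; this function only catches gross structural drift.
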
## Training

The model was trained by distilling a larger teacher (Qwen3.6-27B, `Qwen3.6-27B-UD-Q5_K_XL`), whose structured-JSON captions were the training ground truth. Training used two phases: phase 1 with full schema instructions in the prompt, and phase 2 with only the image and taxon name. Phase 2 bakes the schema into the weights, so end users need no prompt beyond the binomial. On the held-out test set, output was schema-valid, strict-parsed, controlled-vocabulary JSON in 643 of 643 cases.
## Evaluation

Accuracy was measured against human-validated labels on a 100-specimen blind sample (a single non-specialist annotator scored each field cold from the image, with no access to model predictions). Per-field accuracy is strong on reference markers and foliage; the weaker fields are stem type and stamp detection.

| Field | Accuracy |
|---|---|
| `structures.foliage` | 0.97 |
| `structures.stem` | 0.79 |
| `attached_photo` | 0.95 |
| `refs.label` | 0.99 |
| `refs.barcode` | 1.00 |
| `refs.stamp` | 0.70 |
| `refs.crc` | 1.00 |
| `refs.scale_bar` | 1.00 |
| `repro_visible` (category-level) | 0.88 |

Whole-specimen strict exact match (all 10 fields correct at once) was 0.438, against 0.484 for the 27B teacher. Distillation preserved teacher behavior closely, including its errors; the student did not exceed the teacher.
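
The two kinds of score reported above can be sketched as follows. This is an illustration, not the repository's evaluation code: the records are flattened to dotted field names for convenience, and the four toy specimens are made up.

```python
def field_accuracy(predictions, labels, field):
    """Fraction of specimens where the predicted value of `field` matches the label."""
    matches = sum(p[field] == l[field] for p, l in zip(predictions, labels))
    return matches / len(labels)


def strict_exact_match(predictions, labels, fields):
    """Fraction of specimens where every listed field matches at once."""
    hits = sum(
        all(p[f] == l[f] for f in fields)
        for p, l in zip(predictions, labels)
    )
    return hits / len(labels)


# Toy data (made up): two fields, four specimens.
FIELDS = ("refs.label", "refs.stamp")
LABELS = [{"refs.label": True, "refs.stamp": False} for _ in range(4)]
PREDICTIONS = [
    {"refs.label": True, "refs.stamp": False},   # both correct
    {"refs.label": True, "refs.stamp": True},    # stamp wrong
    {"refs.label": False, "refs.stamp": False},  # label wrong
    {"refs.label": True, "refs.stamp": False},   # both correct
]

for name in FIELDS:
    print(name, field_accuracy(PREDICTIONS, LABELS, name))
print("strict", strict_exact_match(PREDICTIONS, LABELS, FIELDS))
```

In the toy data each field scores 0.75 while strict exact match is 0.5, because the errors fall on different specimens. The same effect explains why a whole-specimen score can sit well below every per-field score.
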
Speed: roughly 5.0 s/img for this model versus 68.6 s/img for the 27B teacher on the same hardware (single stream), which makes the student about 14 times faster.
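
To put the per-image timings in collection terms, the arithmetic can be sketched as below. The 100,000-sheet collection size is a hypothetical figure chosen for illustration, and the estimate assumes the single-stream rates quoted above hold steady over a long run.

```python
STUDENT_S_PER_IMG = 5.0    # this model, from the timing above
TEACHER_S_PER_IMG = 68.6   # 27B teacher, from the timing above


def hours_for(n_images, seconds_per_image):
    """Wall-clock hours to process n_images one at a time."""
    return n_images * seconds_per_image / 3600


speedup = TEACHER_S_PER_IMG / STUDENT_S_PER_IMG
n_sheets = 100_000  # hypothetical collection size

print(f"speedup: {speedup:.1f}x")
print(f"student: {hours_for(n_sheets, STUDENT_S_PER_IMG):.0f} h")
print(f"teacher: {hours_for(n_sheets, TEACHER_S_PER_IMG):.0f} h")
```
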

Full methodology, the label-free taxonomic-consistency check, and reproduction instructions are in the [GitHub repository](https://github.com/CapPow/herb-visor).
## Limitations

- `repro_visible` is validated at the category level only (whether any reproductive structure is present). Fine-grained phenology (flower vs fruit vs cone type) was not human-validated.
- Ground truth is a single non-specialist annotator (n=100); some apparent errors are annotator-limited. Treat reported accuracies as a conservative floor.
- Output is a curator-assist candidate, not authoritative write-back.
- `type` is always `PH` on herbarium input and is not a discriminative result.
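
The first limitation can be made concrete with a small sketch. The definition of `repro_visible` used here is an assumption: true when any specific phenology flag is set, with `reproductive_unknown` left out. The authors' actual definition is in the repository and may differ.

```python
# Assumed set of flags that count as a visible reproductive structure.
REPRO_FLAGS = ("flower", "fruit", "pollen_cone", "seed_cone", "sporulating")


def repro_visible(record):
    """Collapse the fine-grained phenology flags into one category-level boolean."""
    phenology = record["structures"]["phenology"]
    return any(phenology.get(flag, False) for flag in REPRO_FLAGS)


def with_flags(**flags):
    """Build a minimal record carrying only the phenology flags (illustration only)."""
    base = {flag: False for flag in REPRO_FLAGS}
    base["reproductive_unknown"] = False
    base.update(flags)
    return {"structures": {"phenology": base}}


flowering = with_flags(flower=True)
fruiting = with_flags(fruit=True)

# Both records agree at the category level even though the fine-grained flags differ.
print(repro_visible(flowering), repro_visible(fruiting))
```

A prediction of `flower` on a fruiting sheet would therefore still count as correct under category-level validation, which is why the fine-grained flags should be treated as unvalidated.
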
## License and attribution

This model is a full-weight fine-tune of [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct), which is licensed under Apache License 2.0. Herb-VISOR is released under the same Apache 2.0 license. The weights were modified by fine-tuning on distilled teacher captions over herbarium specimen images.

Repository code is released under the MIT license. Training images are GBIF-derived and follow their source-institution terms; they are not redistributed here.
## Citation

```bibtex
@software{powell2026herbvisor,
  author = {Powell, Caleb and Sterner, Beckett},
  title  = {Herb-VISOR: a compact vision-language model for
            structured captioning of herbarium specimens},
  year   = {2026},
  url    = {https://github.com/CapPow/herb-visor},
  note   = {Software and model weights; manuscript in preparation}
}
```