FoolDev committed on
Commit e4beea4 · 1 Parent(s): 9ca8700

Add vision support via llama.cpp; document Ollama upstream gap


Vision via Ollama is currently broken for the qwen35/qwen35moe family.
Ollama 0.22's vendored llama.cpp fork is missing the architecture
entries from upstream ggml-org/llama.cpp, so attaching mmproj via
either FROM or ADAPTER returns:
    unknown model architecture: 'qwen35moe'
Tracked in ollama/ollama#15898 (and duplicates #14730, #15346). The
fix has not landed.

Rather than ship a Modelfile.vision that doesn't work, this commit:

- Adds examples/llama_cpp_vision.py — uses llama-cpp-python with the
  Qwen2.5-VL chat handler and a separate mmproj-F16.gguf. Works today.
- Adds scripts/fetch_mmproj.sh + 'make mmproj' to pull the projector
  from unsloth/Qwen3.6-27B-GGUF (~927 MB).
- Updates README:
  - Replaces the misleading 'Multimodal: Yes (vision)' comparison row
    with a per-loader breakdown.
  - Adds a dedicated Vision section explaining the Ollama gap and the
    working llama.cpp path.
  - Updates Known Limitations.
- Adds explicit text-only headers to Modelfile and Modelfile.z13.
- Updates examples/README.md.
- CHANGELOG: documents this batch and tags 0.4.0 (9ca8700).

CHANGELOG.md CHANGED
@@ -7,6 +7,30 @@ and documentation**, not the underlying base model.
7
 
8
  ## [Unreleased]
9
 
10
  ### Added
11
  - `Makefile` — convenience wrapper. `make help` lists targets:
12
  `build` / `smoke` / `check` / `hooks` / `clean`. Variables
 
7
 
8
  ## [Unreleased]
9
 
10
+ ### Added
11
+ - `examples/llama_cpp_vision.py` — image-text-to-text via
12
+ `llama-cpp-python` + a separate `mmproj-F16.gguf`. Currently the only
13
+ working vision path for Janus-27B (Ollama is broken; see Changed).
14
+ - `scripts/fetch_mmproj.sh` — pulls `mmproj-F16.gguf` (or BF16/F32) from
15
+ `unsloth/Qwen3.6-27B-GGUF`. Honors `MMPROJ_PATH` override.
16
+ - `Makefile`: new `mmproj` target.
17
+
18
+ ### Changed
19
+ - README: replaced the misleading "Multimodal: Yes (vision)" comparison
20
+ row with a per-loader breakdown. Added a dedicated **Vision** section
21
+ documenting:
22
+ - The Ollama 0.22 architecture gap (`unknown model architecture:
23
+ 'qwen35moe'`) tracked in
24
+ [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
25
+ - A working llama.cpp / llama-cpp-python path with the `mmproj`.
26
+ - Updated Known Limitations entry accordingly.
27
+ - `Modelfile` and `Modelfile.z13`: header comments now state explicitly
28
+ that they're text-only and link to the Vision section.
29
+ - `examples/README.md`: reflects the new Vision example and explains
30
+ why Ollama is not the recommended backend for it (yet).
31
+
32
+ ## [0.4.0] - 2026-05-02 — `9ca8700`
33
+
34
  ### Added
35
  - `Makefile` — convenience wrapper. `make help` lists targets:
36
  `build` / `smoke` / `check` / `hooks` / `clean`. Variables
Makefile CHANGED
@@ -31,13 +31,15 @@ MODEL ?= $(TAG)
31
 
32
  .DEFAULT_GOAL := help
33
 
34
- .PHONY: help build smoke check hooks clean
35
 
36
  help: ## Show this help.
37
  @awk 'BEGIN {FS = ":.*##"; printf "Targets:\n"} /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-12s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST)
38
  @echo
39
  @echo "Current settings:"
40
- @echo " QUANT=$(QUANT) PROFILE=$(PROFILE) TAG=$(TAG)"
41
  ifdef GGUF_PATH
42
  @echo " GGUF_PATH=$(GGUF_PATH)"
43
  endif
@@ -48,6 +50,9 @@ build: ## Download GGUF (if needed) and run 'ollama create'.
48
  smoke: ## Verify the model is reachable and round-trips.
49
  MODEL=$(MODEL) ./scripts/smoke_test.sh
50
 
51
  check: ## Lint shell + python files; block dot-pattern footgun.
52
  ./scripts/check.sh
53
 
@@ -56,6 +61,6 @@ hooks: ## Install scripts/check.sh as the git pre-commit hook.
56
 
57
  clean: ## Remove local GGUF copies and ephemeral caches in this repo.
58
  @echo "[*] removing local GGUFs and ephemeral caches in $$PWD"
59
- @rm -f ./Qwen3.6-27B-*.gguf
60
  @rm -rf ./.cache __pycache__ examples/__pycache__
61
  @echo "[+] clean"
 
31
 
32
  .DEFAULT_GOAL := help
33
 
34
+ PRECISION ?= F16
35
+
36
+ .PHONY: help build smoke check hooks mmproj clean
37
 
38
  help: ## Show this help.
39
  @awk 'BEGIN {FS = ":.*##"; printf "Targets:\n"} /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-12s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST)
40
  @echo
41
  @echo "Current settings:"
42
+ @echo " QUANT=$(QUANT) PROFILE=$(PROFILE) TAG=$(TAG) PRECISION=$(PRECISION)"
43
  ifdef GGUF_PATH
44
  @echo " GGUF_PATH=$(GGUF_PATH)"
45
  endif
 
50
  smoke: ## Verify the model is reachable and round-trips.
51
  MODEL=$(MODEL) ./scripts/smoke_test.sh
52
 
53
+ mmproj: ## Fetch the vision projector for llama.cpp (Ollama vision is broken upstream).
54
+ ./scripts/fetch_mmproj.sh $(PRECISION)
55
+
56
  check: ## Lint shell + python files; block dot-pattern footgun.
57
  ./scripts/check.sh
58
 
 
61
 
62
  clean: ## Remove local GGUF copies and ephemeral caches in this repo.
63
  @echo "[*] removing local GGUFs and ephemeral caches in $$PWD"
64
+ @rm -f ./Qwen3.6-27B-*.gguf ./mmproj-*.gguf
65
  @rm -rf ./.cache __pycache__ examples/__pycache__
66
  @echo "[+] clean"
Modelfile CHANGED
@@ -1,5 +1,10 @@
1
  # Janus-27B — Ollama wrapper around Qwen 3.6 27B (dense)
2
  #
3
  # This repo does not redistribute weights. Edit the FROM line below to
4
  # point at a local Qwen 3.6 27B GGUF, then:
5
  #
 
1
  # Janus-27B — Ollama wrapper around Qwen 3.6 27B (dense)
2
  #
3
+ # Text-only. Vision via Ollama is currently broken for this architecture
4
+ # (ollama/ollama#15898 — the vendored llama.cpp fork is missing the
5
+ # qwen35 arch entries). Use llama.cpp directly for image input, or wait
6
+ # for the fix. See the Vision section in README.md.
7
+ #
8
  # This repo does not redistribute weights. Edit the FROM line below to
9
  # point at a local Qwen 3.6 27B GGUF, then:
10
  #
Modelfile.z13 CHANGED
@@ -1,5 +1,8 @@
1
  # Janus-27B — Z13 variant for ASUS ROG Flow Z13 (Ryzen AI Max+ 395, 128 GB)
2
  #
3
  # This Modelfile is tuned for an iGPU with a shared/unified memory pool.
4
  # Defaults differ from the main Modelfile in three ways:
5
  # 1. Smaller context (8K instead of 16K) to keep KV cache slim.
 
1
  # Janus-27B — Z13 variant for ASUS ROG Flow Z13 (Ryzen AI Max+ 395, 128 GB)
2
  #
3
+ # Text-only (same caveat as the default Modelfile — vision via Ollama
4
+ # is broken upstream for qwen35; see README Vision section).
5
+ #
6
  # This Modelfile is tuned for an iGPU with a shared/unified memory pool.
7
  # Defaults differ from the main Modelfile in three ways:
8
  # 1. Smaller context (8K instead of 16K) to keep KV cache slim.
README.md CHANGED
@@ -74,7 +74,9 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
74
  | Q4_K_M GGUF size | ~17 GB | ~19 GB |
75
  | Q3_K_S GGUF size | ~12 GB | n/a |
76
  | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
77
- | Multimodal | Yes (vision) | Yes (vision) |
78
  | Max context | 262 144 | 262 144 |
79
 
80
  ## What's here
@@ -87,6 +89,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
87
  | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
88
  | `scripts/build.sh` | One-shot helper: pulls a GGUF and runs `ollama create` for you |
89
  | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model and runs a round-trip |
 
90
  | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep |
91
  | `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
92
  | `Makefile` | Convenience wrapper — `make help` lists targets |
@@ -107,7 +110,10 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
107
  - Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
108
  - Vocab 248,320 (shared with 35B-A3B sibling)
109
  - 262 144 native context, extensible to ~1 M with YaRN
110
- - Vision + video support via upstream `mmproj` (not in this repo)
111
  - Multi-token prediction (MTP) head trained for speculative decoding
112
 
113
  ## Quick start
@@ -176,6 +182,50 @@ Behavior rules:
176
  - Finish with a usable answer, not just planning.
177
  ```
178
 
179
  ## Hardware requirements
180
 
181
  The dense 27B is the easier of the two Janus models to deploy.
@@ -197,7 +247,7 @@ See the [Janus-35B Chat template section](https://huggingface.co/FoolDev/janus#c
197
  ## Known limitations
198
 
199
  - **Slower per token than the 35B-A3B sibling.** Dense 27B beats sparse 35B/3B-active on steps-per-second benchmarks because every parameter contributes; if you optimize for tokens-per-second, the MoE wins.
200
- - **No mmproj in this release.** Same as 35B fetch upstream for vision input.
201
  - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
202
  - **No formal evaluation in this card.** Numbers above are estimates.
203
 
 
74
  | Q4_K_M GGUF size | ~17 GB | ~19 GB |
75
  | Q3_K_S GGUF size | ~12 GB | n/a |
76
  | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
77
+ | Multimodal (text path) | Yes | Yes |
78
+ | Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
79
+ | Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
80
  | Max context | 262 144 | 262 144 |
81
 
82
  ## What's here
 
89
  | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
90
  | `scripts/build.sh` | One-shot helper: pulls a GGUF and runs `ollama create` for you |
91
  | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model and runs a round-trip |
92
+ | `scripts/fetch_mmproj.sh` | Pulls the vision projector for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)) |
93
  | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep |
94
  | `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
95
  | `Makefile` | Convenience wrapper — `make help` lists targets |
 
110
  - Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
111
  - Vocab 248,320 (shared with 35B-A3B sibling)
112
  - 262 144 native context, extensible to ~1 M with YaRN
113
+ - Vision + video supported by the **base architecture** via a separate
114
+ `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
115
+ from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
116
+ current loader compatibility.
117
  - Multi-token prediction (MTP) head trained for speculative decoding
118
 
119
  ## Quick start
 
182
  - Finish with a usable answer, not just planning.
183
  ```
184
 
185
+ ## Vision
186
+
187
+ The Qwen 3.6 base supports image (and video) input via a separate
188
+ `mmproj` projector. The full multimodal stack is:
189
+
190
+ ```
191
+ Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
192
+ mmproj-F16.gguf (~927 MB, the vision projector)
193
+ ```
194
+
195
+ Both files are at
196
+ [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
197
+ This repo intentionally does not redistribute either.
198
+
199
+ ### Loader compatibility — the honest table
200
+
201
+ | Loader | Text | Vision (mmproj) | Notes |
202
+ |---|---|---|---|
203
+ | **llama.cpp** (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
204
+ | **llama-cpp-python** | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
205
+ | **Ollama 0.22** | ✅ | ❌ | Vendored llama.cpp fork is missing the architecture entries. Attaching `mmproj` via `FROM` *or* `ADAPTER` returns `unknown model architecture: 'qwen35moe'` (and the same for the dense `qwen35`). See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). Should work once that issue is fixed upstream. |
206
+ | **LM Studio** | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |
207
+
208
+ ### Vision via llama.cpp
209
+
210
+ ```bash
211
+ # CLI:
212
+ llama-mtmd-cli \
213
+ -m Qwen3.6-27B-Q4_K_M.gguf \
214
+ --mmproj mmproj-F16.gguf \
215
+ --image photo.jpg \
216
+ -p "Describe this image."
217
+
218
+ # Python:
219
+ python examples/llama_cpp_vision.py \
220
+ --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
221
+ --mmproj /path/to/mmproj-F16.gguf \
222
+ --image /path/to/photo.jpg \
223
+ --prompt "What is in this image?"
224
+ ```
225
+
226
+ Until the Ollama upstream issue is fixed, treat Ollama as **text-only**
227
+ for this model.
228
+
229
  ## Hardware requirements
230
 
231
  The dense 27B is the easier of the two Janus models to deploy.
 
247
  ## Known limitations
248
 
249
  - **Slower per token than the 35B-A3B sibling.** Dense 27B beats sparse 35B/3B-active on steps-per-second benchmarks because every parameter contributes; if you optimize for tokens-per-second, the MoE wins.
250
+ - **No mmproj in this release**, and **vision via Ollama is broken upstream** (qwen35/qwen35moe arch entries missing from Ollama's vendored llama.cpp fork — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
251
  - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
252
  - **No formal evaluation in this card.** Numbers above are estimates.
253
 
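The loader table in the README hunk above lists `llama-server --mmproj` as a working vision path. A minimal client sketch follows, assuming the server was started with `--mmproj` and exposes its OpenAI-compatible `/v1/chat/completions` endpoint at `http://localhost:8080`. The URL, the helper names `build_vision_payload` and `ask_server`, and the `max_tokens` value are illustrative assumptions, not part of this commit.

```python
import base64
import json
import urllib.request


def build_vision_payload(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload carrying one image as a data URI."""
    data_uri = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        # Sampling values borrowed from the examples in this repo.
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 512,
    }


def ask_server(payload: dict, base_url: str = "http://localhost:8080") -> str:
    """POST the payload to a running server (assumed URL) and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Building the payload is separate from sending it so the message shape can be inspected without a running server; `ask_server(build_vision_payload(Path("photo.jpg").read_bytes(), "Describe this image."))` would be the typical call once one is up.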
examples/README.md CHANGED
@@ -4,9 +4,10 @@ Three minimal entry points. Pick the one that matches how you run models.
4
 
5
  | File | Backend | When to use |
6
  |---|---|---|
7
- | `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `janus-27b` model created from the project `Modelfile`. |
8
  | `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
9
- | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). |
 
10
 
11
  All three apply the same Janus system prompt and sampling defaults
12
  (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
@@ -47,3 +48,23 @@ python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
47
 
48
  For GPU offload, rebuild llama-cpp-python with the matching backend — see
49
  the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
 
4
 
5
  | File | Backend | When to use |
6
  |---|---|---|
7
+ | `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `janus-27b` model created from the project `Modelfile`. **Text only** — vision via Ollama is broken upstream for this arch. |
8
  | `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
9
+ | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
10
+ | `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-F16.gguf` and answers questions about an image. The only working vision path right now. |
11
 
12
  All three apply the same Janus system prompt and sampling defaults
13
  (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
 
48
 
49
  For GPU offload, rebuild llama-cpp-python with the matching backend — see
50
  the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
51
+
52
+ ### Vision (image input)
53
+
54
+ ```bash
55
+ # Pull the projector once (~927 MB):
56
+ hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf --local-dir .
57
+
58
+ pip install llama-cpp-python pillow
59
+ python llama_cpp_vision.py \
60
+ --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
61
+ --mmproj /path/to/mmproj-F16.gguf \
62
+ --image /path/to/photo.jpg \
63
+ --prompt "Describe this image."
64
+ ```
65
+
66
+ Why not Ollama? Ollama 0.22's vendored llama.cpp is missing the `qwen35`
67
+ architecture entries needed to attach an mmproj — `FROM` and `ADAPTER`
68
+ both fail with `unknown model architecture: 'qwen35moe'`. Tracked in
69
+ [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
70
+ Until that's fixed, llama.cpp / llama-cpp-python is the working path.
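The `hf download` step in the examples/README.md hunk above can also be done from Python. A minimal sketch, assuming the `huggingface_hub` package is installed; `mmproj_filename` and `fetch_mmproj` are hypothetical helper names, and the precision check is an addition of this sketch (the list of precisions is taken from the usage notes in `scripts/fetch_mmproj.sh`).

```python
from pathlib import Path

# Precisions listed in the usage notes of scripts/fetch_mmproj.sh.
KNOWN_PRECISIONS = ("F16", "BF16", "F32")


def mmproj_filename(precision: str = "F16") -> str:
    """Map a precision tag to the projector file name used in the repo."""
    tag = precision.upper()
    if tag not in KNOWN_PRECISIONS:
        raise ValueError(f"unknown precision {precision!r}; expected one of {KNOWN_PRECISIONS}")
    return f"mmproj-{tag}.gguf"


def fetch_mmproj(precision: str = "F16", local_dir: str = ".") -> Path:
    """Download the projector into local_dir and return its local path."""
    # Imported lazily so mmproj_filename works without the dependency.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="unsloth/Qwen3.6-27B-GGUF",
        filename=mmproj_filename(precision),
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `fetch_mmproj()` should leave `mmproj-F16.gguf` in the current directory, which is the file the `--mmproj` flag expects.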
examples/llama_cpp_vision.py ADDED
@@ -0,0 +1,126 @@
+ #!/usr/bin/env python3
+ """
+ Janus-27B — vision (image-text-to-text) via llama-cpp-python.
+
+ Why this script exists:
+   Ollama 0.22's vendored llama.cpp fork is missing the qwen35/qwen35moe
+   architecture entries needed to attach a separate mmproj projector.
+   Both `FROM mmproj.gguf` and `ADAPTER mmproj.gguf` fail with:
+       unknown model architecture: 'qwen35moe'
+   See ollama/ollama#15898; #14730 and #15346 were closed as duplicates
+   of the same root cause. Until the fix lands, vision via Ollama is
+   broken for Qwen 3.5 / 3.6.
+
+   Upstream ggml-org/llama.cpp **does** have the architecture, so vision
+   works fine via llama.cpp directly. This script uses the python binding.
+
+ Install:
+   pip install llama-cpp-python pillow
+   # GPU offload? rebuild with the matching backend:
+   # CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-binary :all:
+   # CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-binary :all:
+   # CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
+
+ Files you need (both from unsloth/Qwen3.6-27B-GGUF):
+   1. A text GGUF (any quant): e.g. Qwen3.6-27B-Q4_K_M.gguf (~17 GB)
+   2. A vision projector: mmproj-F16.gguf (~927 MB)
+
+ Usage:
+   python llama_cpp_vision.py \
+     --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+     --mmproj /path/to/mmproj-F16.gguf \
+     --image /path/to/photo.jpg \
+     --prompt "What is in this image? Be specific."
+
+   # CLI alternative without python binding (ships with llama.cpp):
+   # llama-mtmd-cli \
+   #   -m Qwen3.6-27B-Q4_K_M.gguf \
+   #   --mmproj mmproj-F16.gguf \
+   #   --image photo.jpg \
+   #   -p "Describe this image."
+ """
+ from __future__ import annotations
+
+ import argparse
+ import base64
+ import sys
+ from pathlib import Path
+
+ try:
+     from llama_cpp import Llama
+     from llama_cpp.llama_chat_format import Qwen25VLChatHandler
+ except ImportError:  # pragma: no cover
+     sys.exit(
+         "Missing llama-cpp-python (>=0.3 with VL handlers).\n"
+         "  pip install --upgrade llama-cpp-python pillow"
+     )
+
+
+ JANUS_SYSTEM = (
+     "You are Janus, a precise vision-language assistant. Describe images "
+     "accurately, do not invent details, and ground every claim in the "
+     "pixels you can actually see."
+ )
+
+
+ def encode_image_data_uri(path: Path) -> str:
+     suffix = path.suffix.lower().lstrip(".")
+     mime = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "webp": "webp", "gif": "gif"}.get(suffix, "jpeg")
+     return f"data:image/{mime};base64,{base64.b64encode(path.read_bytes()).decode()}"
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--gguf", required=True, help="Text GGUF (e.g. Qwen3.6-27B-Q4_K_M.gguf).")
+     ap.add_argument("--mmproj", required=True, help="Vision projector GGUF (mmproj-F16.gguf).")
+     ap.add_argument("--image", required=True, help="Image to analyze.")
+     ap.add_argument("--prompt", default="Describe this image in detail.")
+     ap.add_argument("--ctx", type=int, default=8192)
+     ap.add_argument(
+         "--gpu-layers",
+         type=int,
+         default=0,
+         help="Layers to offload to GPU (-1 or 99 = all).",
+     )
+     ap.add_argument("--max-tokens", type=int, default=512)
+     args = ap.parse_args()
+
+     image_path = Path(args.image)
+     if not image_path.exists():
+         sys.exit(f"Image not found: {image_path}")
+
+     # Qwen 2.5 VL chat handler is the closest match shipped with
+     # llama-cpp-python; Qwen 3.5/3.6 vision uses the same projector layout.
+     # If/when llama-cpp-python ships a Qwen3VLChatHandler, swap it in.
+     handler = Qwen25VLChatHandler(clip_model_path=args.mmproj)
+
+     llm = Llama(
+         model_path=args.gguf,
+         chat_handler=handler,
+         n_ctx=args.ctx,
+         n_gpu_layers=args.gpu_layers,
+         verbose=False,
+     )
+
+     out = llm.create_chat_completion(
+         messages=[
+             {"role": "system", "content": JANUS_SYSTEM},
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "image_url", "image_url": {"url": encode_image_data_uri(image_path)}},
+                     {"type": "text", "text": args.prompt},
+                 ],
+             },
+         ],
+         temperature=0.6,
+         top_p=0.95,
+         top_k=20,
+         repeat_penalty=1.05,
+         max_tokens=args.max_tokens,
+     )
+     print(out["choices"][0]["message"]["content"])
+
+
+ if __name__ == "__main__":
+     main()
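The script above hands the image to the chat handler as a base64 data URI. The sketch below isolates that step and adds a hypothetical inverse, `decode_data_uri`, to show what the URI carries; the inverse helper and the placeholder bytes are assumptions for illustration and are not part of the commit.

```python
import base64
import tempfile
from pathlib import Path

MIME_BY_SUFFIX = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "webp": "webp", "gif": "gif"}


def encode_image_data_uri(path: Path) -> str:
    """Same logic as the script: unknown suffixes fall back to image/jpeg."""
    suffix = path.suffix.lower().lstrip(".")
    mime = MIME_BY_SUFFIX.get(suffix, "jpeg")
    payload = base64.b64encode(path.read_bytes()).decode()
    return f"data:image/{mime};base64,{payload}"


def decode_data_uri(uri: str):
    """Split a base64 data URI back into (mime type, raw bytes)."""
    header, _, payload = uri.partition(",")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)


# Round-trip placeholder bytes (not a real image) through a temp file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.PNG"
    sample.write_bytes(b"\x89PNG placeholder")
    uri = encode_image_data_uri(sample)
    mime, raw = decode_data_uri(uri)
    print(mime, len(raw))
```

Note that the MIME type is chosen from the file suffix alone, so a mislabeled file is passed through with the wrong type; whether the handler cares is not something this commit states.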
scripts/fetch_mmproj.sh ADDED
@@ -0,0 +1,64 @@
+ #!/usr/bin/env bash
+ # Janus-27B — fetch the vision projector (mmproj) for image input.
+ #
+ # Why this is separate from build.sh:
+ #   build.sh is for the Ollama text path. The mmproj is only useful for
+ #   llama.cpp / llama-cpp-python right now, because Ollama's vendored
+ #   llama.cpp fork is missing the qwen35 arch entries needed to attach
+ #   it (see README Vision section, ollama/ollama#15898).
+ #
+ # Usage:
+ #   ./scripts/fetch_mmproj.sh        # default: F16, ~927 MB
+ #   ./scripts/fetch_mmproj.sh BF16   # ~931 MB
+ #   ./scripts/fetch_mmproj.sh F32    # ~1.8 GB
+ #
+ # Requires: huggingface-cli (or hf).
+ set -euo pipefail
+
+ PRECISION="${1:-${PRECISION:-F16}}"
+ REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+ FILE_NAME="mmproj-${PRECISION}.gguf"
+ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+ DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"
+
+ echo "[*] repo: ${REPO_ID}"
+ echo "[*] precision: ${PRECISION}"
+ echo "[*] file: ${FILE_NAME}"
+ echo "[*] dest: ${DEST}"
+
+ if [[ -f "${DEST}" ]]; then
+   echo "[=] already present at ${DEST}, skipping."
+   exit 0
+ fi
+
+ HF=""
+ if command -v hf >/dev/null 2>&1; then
+   HF="hf"
+ elif command -v huggingface-cli >/dev/null 2>&1; then
+   HF="huggingface-cli"
+ else
+   echo "[!] Neither 'hf' nor 'huggingface-cli' found." >&2
+   echo " pip install -U huggingface_hub" >&2
+   exit 1
+ fi
+
+ DEST_DIR="$(dirname "${DEST}")"
+ mkdir -p "${DEST_DIR}"
+
+ case "${HF}" in
+   hf) hf download "${REPO_ID}" "${FILE_NAME}" --local-dir "${DEST_DIR}" ;;
+   huggingface-cli) huggingface-cli download "${REPO_ID}" "${FILE_NAME}" --local-dir "${DEST_DIR}" ;;
+ esac
+
+ if [[ ! -f "${DEST}" ]]; then
+   echo "[!] download failed: ${DEST} not present." >&2
+   exit 1
+ fi
+
+ echo
+ echo "[+] Done. Use it via:"
+ echo " python ${ROOT}/examples/llama_cpp_vision.py \\"
+ echo " --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \\"
+ echo " --mmproj ${DEST} \\"
+ echo " --image /path/to/photo.jpg \\"
+ echo " --prompt 'Describe this image.'"
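The destination logic in the script above (`PRECISION` to `FILE_NAME`, then `MMPROJ_PATH` overriding the default `DEST`) can be sketched in Python to make one edge case visible. The helper names `resolve_mmproj_dest` and `download_lands_at_dest` are hypothetical. The sketch reflects a reading of the script, not a tested behavior: the download goes to `dirname(DEST)/FILE_NAME`, so an `MMPROJ_PATH` whose basename differs from `mmproj-<PRECISION>.gguf` would appear to fail the final existence check.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_mmproj_dest(
    root: Path,
    precision: str = "F16",
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Path]:
    """Mirror the script: FILE_NAME from PRECISION, DEST from MMPROJ_PATH or ROOT."""
    env = os.environ if env is None else env
    file_name = f"mmproj-{precision}.gguf"
    # Like ${MMPROJ_PATH:-...}, an unset or empty value falls back to the default.
    dest = Path(env.get("MMPROJ_PATH") or root / file_name)
    return file_name, dest


def download_lands_at_dest(file_name: str, dest: Path) -> bool:
    """True when a download into dirname(DEST) under FILE_NAME ends up at DEST."""
    return dest.parent / file_name == dest
```

With the default destination the two paths coincide; with `MMPROJ_PATH=/models/vision.gguf` they do not, which suggests treating `MMPROJ_PATH` as a directory-level override that keeps the upstream file name.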