FoolDev Claude Opus 4.7 commited on
Commit
a60eff5
Β·
1 Parent(s): 31ddd88

docs: lead Vision section with llama-server (always-built path)

Browse files

Reconfirmed vision-via-llama.cpp end-to-end on 2026-05-19: bundled
Q4_K_M GGUF + mmproj-F16.gguf from unsloth/Qwen3.6-27B-GGUF, Vulkan
backend on Strix Halo (Ryzen AI Max+ 395 / Radeon 8060S), 65/65
layers offloaded. A 1024-px JPEG posted to llama-server's OpenAI-
compat /v1/chat/completions endpoint with an `image_url` data-URL
content block produced an accurate description (478 completion
tokens) of a Tokyo-style alleyway scene β€” red paper lanterns,
bicycles, Japanese signage, Sapporo sign. Confirms the README
"Vision (mmproj) βœ…" claim for the llama.cpp loader.

Two writing fixes from what the retest actually surfaced:

1. The README's CLI example led with `llama-mtmd-cli`, but on this
box that binary wasn't built. Cause was a selective cmake build
(only `llama-cli`/`llama-server`/`llama-bench` targets), not a
missing flag β€” `LLAMA_BUILD_TOOLS=ON` is already the default.
The note now says to `cmake --build build --target llama-mtmd-cli`
if a selective build skipped it.

2. The `llama-server --mmproj` HTTP path is in stock builds (it's
listed in the loader table) but had no concrete example. Now
shows the server invocation + the wire format (`image_url` data
URL), names where the thinking trace lands (`reasoning_content`)
vs the final answer (`content`), and flags the β‰₯500 max_tokens
budget needed so the `<think>` block doesn't crowd out the
visible answer.

Vision-prereq fetch (`scripts/fetch_vision.sh F16`) and smoke
(`make smoke` + `make smoke-tools` against the HF-pulled
`hf.co/FoolDev/Thanatos-27B` tag) both passed in the same session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (2) hide show
  1. CHANGELOG.md +19 -0
  2. README.md +20 -2
CHANGELOG.md CHANGED
@@ -8,6 +8,25 @@ and documentation**, not the underlying base model.
8
  ## [Unreleased]
9
 
10
  ### Added
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  - Third measured tok/s data point on the Strix Halo reference hardware:
12
  **Q3_K_S under Vulkan β†’ 12.31 tok/s aggregate** (6182 tokens /
13
  501.9 s; 12.67 / 12.55 / 12.25 short/medium/long). Now apples-to-apples
 
8
  ## [Unreleased]
9
 
10
  ### Added
11
+ - README "Vision via llama.cpp" subsection now leads with the
12
+ `llama-server --mmproj` HTTP path (always built into stock llama.cpp,
13
+ no extra cmake targets needed), reconfirmed working 2026-05-19 with
14
+ llama.cpp 389ff61 + Vulkan on the Strix Halo reference machine
15
+ (bundled Q4_K_M GGUF + `mmproj-F16.gguf` from `unsloth/Qwen3.6-27B-GGUF`).
16
+ Sent a 1024-px JPEG via an OpenAI-style `image_url` data-URL content
17
+ block; model produced an accurate description (Japanese alleyway
18
+ with paper lanterns, bicycles, etc.) in 478 completion tokens. The
19
+ visible answer arrived in `message.content`, the thinking trace in
20
+ `message.reasoning_content` β€” the section notes both, plus the
21
+ β‰₯500 max_tokens budget needed so the reasoning block doesn't crowd
22
+ out the final answer. The existing `llama-mtmd-cli` and
23
+ `llama-cpp-python` examples are still listed; `llama-mtmd-cli` now
24
+ carries a note that it's a separate cmake target β€” a plain
25
+ `cmake --build build` produces it, but a selective build (e.g.
26
+ `cmake --build build --target llama-cli llama-server llama-bench`)
27
+ silently skips it, which is what tripped the retest on this box
28
+ (`LLAMA_BUILD_TOOLS=ON` was already set; the mtmd target just hadn't
29
+ been requested).
30
  - Third measured tok/s data point on the Strix Halo reference hardware:
31
  **Q3_K_S under Vulkan β†’ 12.31 tok/s aggregate** (6182 tokens /
32
  501.9 s; 12.67 / 12.55 / 12.25 short/medium/long). Now apples-to-apples
README.md CHANGED
@@ -327,15 +327,33 @@ This repo intentionally does not redistribute either.
327
 
328
  ### Vision via llama.cpp
329
 
 
 
330
  ```bash
331
- # CLI:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
332
  llama-mtmd-cli \
333
  -m Qwen3.6-27B-Q4_K_M.gguf \
334
  --mmproj mmproj-F16.gguf \
335
  --image photo.jpg \
336
  -p "Describe this image."
337
 
338
- # Python:
339
  python examples/llama_cpp_vision.py \
340
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
341
  --mmproj /path/to/mmproj-F16.gguf \
 
327
 
328
  ### Vision via llama.cpp
329
 
330
+ Three flavors, in order of build-time effort:
331
+
332
  ```bash
333
+ # A. HTTP via llama-server (always built β€” the easiest path).
334
+ # Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
335
+ # on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
336
+ llama-server \
337
+ -m Qwen3.6-27B-Q4_K_M.gguf \
338
+ --mmproj mmproj-F16.gguf \
339
+ --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
340
+ # then POST OpenAI-style chat completions with an image_url content
341
+ # block β€” e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
342
+ # The thinking trace arrives in message.reasoning_content; the visible
343
+ # answer is in message.content. Budget β‰₯500 max_tokens so the reasoning
344
+ # block doesn't crowd out the final answer.
345
+
346
+ # B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
347
+ # so a selective `cmake --build build --target llama-cli ...` won't
348
+ # produce it β€” a plain `cmake --build build` will. If yours didn't,
349
+ # run `cmake --build build --target llama-mtmd-cli`.
350
  llama-mtmd-cli \
351
  -m Qwen3.6-27B-Q4_K_M.gguf \
352
  --mmproj mmproj-F16.gguf \
353
  --image photo.jpg \
354
  -p "Describe this image."
355
 
356
+ # C. Python via llama-cpp-python:
357
  python examples/llama_cpp_vision.py \
358
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
359
  --mmproj /path/to/mmproj-F16.gguf \