Add vision support via llama.cpp; document Ollama upstream gap

Vision via Ollama is currently broken for the qwen35/qwen35moe family.
Ollama 0.22's vendored llama.cpp fork is missing the architecture
entries from upstream ggml-org/llama.cpp, so attaching mmproj via
either FROM or ADAPTER returns:

    unknown model architecture: 'qwen35moe'

Tracked in ollama/ollama#15898 (and duplicates #14730, #15346). The
fix has not landed.

Rather than ship a Modelfile.vision that doesn't work, this commit:

- Adds examples/llama_cpp_vision.py: uses llama-cpp-python with the
  Qwen2.5-VL chat handler and a separate mmproj-F16.gguf. Works today.
- Adds scripts/fetch_mmproj.sh + 'make mmproj' to pull the projector
  from unsloth/Qwen3.6-27B-GGUF (~927 MB).
- Updates README:
  - Replaces the misleading 'Multimodal: Yes (vision)' comparison row
    with a per-loader breakdown.
  - Adds a dedicated Vision section explaining the Ollama gap and the
    working llama.cpp path.
  - Updates Known Limitations.
- Adds explicit text-only headers to Modelfile and Modelfile.z13.
- Updates examples/README.md.
- CHANGELOG: documents this batch and tags 0.4.0 (9ca8700).
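
The llama.cpp path this commit documents can also be exercised over HTTP when `llama-server` is started with `--mmproj`. The following is a minimal sketch, not part of the commit: it assumes the server runs locally on its default port 8080 and accepts OpenAI-style `image_url` parts carrying a base64 data URI. `build_vision_request` and `ask` are hypothetical helper names.

```python
import base64
import json
import urllib.request
from pathlib import Path

# Assumption: a server started roughly like
#   llama-server -m Qwen3.6-27B-Q4_K_M.gguf --mmproj mmproj-F16.gguf
# is listening on the default port 8080.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_vision_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload carrying one inline image."""
    data_uri = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 512,
    }


def ask(image_path: str, prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    payload = build_vision_request(Path(image_path).read_bytes(), prompt)
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

Only `ask` touches the network; `build_vision_request` is pure, so the payload shape can be checked without a running server.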
- CHANGELOG.md +24 -0
- Makefile +8 -3
- Modelfile +5 -0
- Modelfile.z13 +3 -0
- README.md +53 -3
- examples/README.md +23 -2
- examples/llama_cpp_vision.py +126 -0
- scripts/fetch_mmproj.sh +64 -0
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,30 @@ and documentation**, not the underlying base model.
 
 ## [Unreleased]
 
+### Added
+- `examples/llama_cpp_vision.py` — image-text-to-text via
+  `llama-cpp-python` + a separate `mmproj-F16.gguf`. Currently the only
+  working vision path for Janus-27B (Ollama is broken; see Changed).
+- `scripts/fetch_mmproj.sh` — pulls `mmproj-F16.gguf` (or BF16/F32) from
+  `unsloth/Qwen3.6-27B-GGUF`. Honors `MMPROJ_PATH` override.
+- `Makefile`: new `mmproj` target.
+
+### Changed
+- README: replaced the misleading "Multimodal: Yes (vision)" comparison
+  row with a per-loader breakdown. Added a dedicated **Vision** section
+  documenting:
+  - The Ollama 0.22 architecture gap (`unknown model architecture:
+    'qwen35moe'`) tracked in
+    [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
+  - A working llama.cpp / llama-cpp-python path with the `mmproj`.
+  - Updated Known Limitations entry accordingly.
+- `Modelfile` and `Modelfile.z13`: header comments now state explicitly
+  that they're text-only and link to the Vision section.
+- `examples/README.md`: reflects the new Vision example and explains
+  why Ollama is not the recommended backend for it (yet).
+
+## [0.4.0] - 2026-05-02 — `9ca8700`
+
 ### Added
 - `Makefile` — convenience wrapper. `make help` lists targets:
   `build` / `smoke` / `check` / `hooks` / `clean`. Variables
--- a/Makefile
+++ b/Makefile
@@ -31,13 +31,15 @@ MODEL ?= $(TAG)
 
 .DEFAULT_GOAL := help
 
-
+PRECISION ?= F16
+
+.PHONY: help build smoke check hooks mmproj clean
 
 help: ## Show this help.
 	@awk 'BEGIN {FS = ":.*##"; printf "Targets:\n"} /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-12s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST)
 	@echo
 	@echo "Current settings:"
-	@echo " QUANT=$(QUANT) PROFILE=$(PROFILE) TAG=$(TAG)"
+	@echo " QUANT=$(QUANT) PROFILE=$(PROFILE) TAG=$(TAG) PRECISION=$(PRECISION)"
 ifdef GGUF_PATH
 	@echo " GGUF_PATH=$(GGUF_PATH)"
 endif
@@ -48,6 +50,9 @@ build: ## Download GGUF (if needed) and run 'ollama create'.
 smoke: ## Verify the model is reachable and round-trips.
 	MODEL=$(MODEL) ./scripts/smoke_test.sh
 
+mmproj: ## Fetch the vision projector for llama.cpp (Ollama vision is broken upstream).
+	./scripts/fetch_mmproj.sh $(PRECISION)
+
 check: ## Lint shell + python files; block dot-pattern footgun.
 	./scripts/check.sh
 
@@ -56,6 +61,6 @@ hooks: ## Install scripts/check.sh as the git pre-commit hook.
 
 clean: ## Remove local GGUF copies and ephemeral caches in this repo.
 	@echo "[*] removing local GGUFs and ephemeral caches in $$PWD"
-	@rm -f ./Qwen3.6-27B-*.gguf
+	@rm -f ./Qwen3.6-27B-*.gguf ./mmproj-*.gguf
 	@rm -rf ./.cache __pycache__ examples/__pycache__
 	@echo "[+] clean"
--- a/Modelfile
+++ b/Modelfile
@@ -1,5 +1,10 @@
 # Janus-27B — Ollama wrapper around Qwen 3.6 27B (dense)
 #
+# Text-only. Vision via Ollama is currently broken for this architecture
+# (ollama/ollama#15898 — the vendored llama.cpp fork is missing the
+# qwen35 arch entries). Use llama.cpp directly for image input, or wait
+# for the fix. See the Vision section in README.md.
+#
 # This repo does not redistribute weights. Edit the FROM line below to
 # point at a local Qwen 3.6 27B GGUF, then:
 #
--- a/Modelfile.z13
+++ b/Modelfile.z13
@@ -1,5 +1,8 @@
 # Janus-27B — Z13 variant for ASUS ROG Flow Z13 (Ryzen AI Max+ 395, 128 GB)
 #
+# Text-only (same caveat as the default Modelfile — vision via Ollama
+# is broken upstream for qwen35; see README Vision section).
+#
 # This Modelfile is tuned for an iGPU with a shared/unified memory pool.
 # Defaults differ from the main Modelfile in three ways:
 # 1. Smaller context (8K instead of 16K) to keep KV cache slim.
--- a/README.md
+++ b/README.md
@@ -74,7 +74,9 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | Q4_K_M GGUF size | ~17 GB | ~19 GB |
 | Q3_K_S GGUF size | ~12 GB | n/a |
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
-| Multimodal
+| Multimodal (text path) | Yes | Yes |
+| Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
+| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
 | Max context | 262 144 | 262 144 |
 
 ## What's here
@@ -87,6 +89,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
 | `scripts/build.sh` | One-shot helper: pulls a GGUF and runs `ollama create` for you |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model and runs a round-trip |
+| `scripts/fetch_mmproj.sh` | Pulls the vision projector for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)) |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep |
 | `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
 | `Makefile` | Convenience wrapper — `make help` lists targets |
@@ -107,7 +110,10 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
 - Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
 - Vocab 248,320 (shared with 35B-A3B sibling)
 - 262 144 native context, extensible to ~1 M with YaRN
-- Vision + video
+- Vision + video supported by the **base architecture** via a separate
+  `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
+  from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
+  current loader compatibility.
 - Multi-token prediction (MTP) head trained for speculative decoding
 
 ## Quick start
@@ -176,6 +182,50 @@ Behavior rules:
 - Finish with a usable answer, not just planning.
 ```
 
+## Vision
+
+The Qwen 3.6 base supports image (and video) input via a separate
+`mmproj` projector. The full multimodal stack is:
+
+```
+Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
+mmproj-F16.gguf (~927 MB, the vision projector)
+```
+
+Both files are at
+[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
+This repo intentionally does not redistribute either.
+
+### Loader compatibility — the honest table
+
+| Loader | Text | Vision (mmproj) | Notes |
+|---|---|---|---|
+| **llama.cpp** (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
+| **llama-cpp-python** | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
+| **Ollama 0.22** | ✅ | ❌ | Vendored llama.cpp fork is missing the architecture entries. Attaching `mmproj` via `FROM` *or* `ADAPTER` returns `unknown model architecture: 'qwen35moe'` (and the same for the dense `qwen35`). See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). Will work once that PR lands. |
+| **LM Studio** | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |
+
+### Vision via llama.cpp
+
+```bash
+# CLI:
+llama-mtmd-cli \
+  -m Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj mmproj-F16.gguf \
+  --image photo.jpg \
+  -p "Describe this image."
+
+# Python:
+python examples/llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj /path/to/mmproj-F16.gguf \
+  --image /path/to/photo.jpg \
+  --prompt "What is in this image?"
+```
+
+Until the Ollama upstream issue is fixed, treat Ollama as **text-only**
+for this model.
+
 ## Hardware requirements
 
 The dense 27B is the easier of the two Janus models to deploy.
@@ -197,7 +247,7 @@ See the [Janus-35B Chat template section](https://huggingface.co/FoolDev/janus#c
 ## Known limitations
 
 - **Slower per token than the 35B-A3B sibling.** Dense 27B beats sparse 35B/3B-active on steps-per-second benchmarks because every parameter contributes; if you optimize for tokens-per-second, the MoE wins.
-- **No mmproj in this release
+- **No mmproj in this release**, and **vision via Ollama is broken upstream** (qwen35/qwen35moe arch entries missing from Ollama's vendored llama.cpp fork — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
 - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
 - **No formal evaluation in this card.** Numbers above are estimates.
 
--- a/examples/README.md
+++ b/examples/README.md
@@ -4,9 +4,10 @@ Three minimal entry points. Pick the one that matches how you run models.
 
 | File | Backend | When to use |
 |---|---|---|
-| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `janus-27b` model created from the project `Modelfile`. |
+| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `janus-27b` model created from the project `Modelfile`. **Text only** — vision via Ollama is broken upstream for this arch. |
 | `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
-| `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). |
+| `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
+| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-F16.gguf` and answers questions about an image. The only working vision path right now. |
 
 All three apply the same Janus system prompt and sampling defaults
 (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
@@ -47,3 +48,23 @@ python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
 
 For GPU offload, rebuild llama-cpp-python with the matching backend — see
 the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
+
+### Vision (image input)
+
+```bash
+# Pull the projector once (~927 MB):
+hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf --local-dir .
+
+pip install llama-cpp-python pillow
+python llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj /path/to/mmproj-F16.gguf \
+  --image /path/to/photo.jpg \
+  --prompt "Describe this image."
+```
+
+Why not Ollama? Ollama 0.22's vendored llama.cpp is missing the `qwen35`
+architecture entries needed to attach an mmproj — `FROM` and `ADAPTER`
+both fail with `unknown model architecture: 'qwen35moe'`. Tracked in
+[ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
+Until that's fixed, llama.cpp / llama-cpp-python is the working path.
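
The `unknown model architecture` error refers to the architecture string stored in a GGUF file's metadata. As a rough sketch for checking which string a given file carries, the `general.architecture` key can be read with the standard library alone. This assumes the little-endian GGUF v2/v3 header layout (64-bit string lengths); the helper names below are hypothetical and not part of this commit.

```python
import io
import struct
from typing import BinaryIO, Optional

# Fixed-size scalar value types in GGUF metadata: type id -> width in bytes.
_SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f: BinaryIO) -> str:
    """Read a length-prefixed UTF-8 string (uint64 length, then bytes)."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _skip_value(f: BinaryIO, vtype: int) -> None:
    """Advance past one metadata value of the given type."""
    if vtype in _SCALAR_SIZES:
        f.read(_SCALAR_SIZES[vtype])
    elif vtype == _STRING:
        _read_string(f)
    elif vtype == _ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f: BinaryIO) -> Optional[str]:
    """Return the value of general.architecture, or None if the key is absent."""
    magic, _version, _tensor_count, kv_count = struct.unpack("<4sIQQ", f.read(24))
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    for _ in range(kv_count):
        key = _read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if key == "general.architecture" and vtype == _STRING:
            return _read_string(f)
        _skip_value(f, vtype)
    return None


def read_architecture_from_bytes(blob: bytes) -> Optional[str]:
    """Convenience wrapper for in-memory data."""
    return read_architecture(io.BytesIO(blob))
```

For a file on disk, `read_architecture(open(path, "rb"))` would be the entry point. The function stops at the first matching key, so it normally reads only a small part of the header.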
--- /dev/null
+++ b/examples/llama_cpp_vision.py
@@ -0,0 +1,126 @@
+#!/usr/bin/env python3
+"""
+Janus-27B — vision (image-text-to-text) via llama-cpp-python.
+
+Why this script exists:
+    Ollama 0.22's vendored llama.cpp fork is missing the qwen35/qwen35moe
+    architecture entries needed to attach a separate mmproj projector.
+    Both `FROM mmproj.gguf` and `ADAPTER mmproj.gguf` fail with:
+        unknown model architecture: 'qwen35moe'
+    See ollama/ollama#15898, #14730 (closed as duplicates of #15898 root
+    cause). Until that lands, vision via Ollama is broken for Qwen 3.5 /
+    3.6.
+
+    Upstream ggml-org/llama.cpp **does** have the architecture, so vision
+    works fine via llama.cpp directly. This script uses the python binding.
+
+Install:
+    pip install llama-cpp-python pillow
+    # GPU offload? rebuild with the matching backend:
+    # CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-binary :all:
+    # CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-binary :all:
+    # CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
+
+Files you need (both from unsloth/Qwen3.6-27B-GGUF):
+    1. A text GGUF (any quant): e.g. Qwen3.6-27B-Q4_K_M.gguf (~17 GB)
+    2. A vision projector: mmproj-F16.gguf (~927 MB)
+
+Usage:
+    python llama_cpp_vision.py \
+        --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+        --mmproj /path/to/mmproj-F16.gguf \
+        --image /path/to/photo.jpg \
+        --prompt "What is in this image? Be specific."
+
+    # CLI alternative without python binding (ships with llama.cpp):
+    # llama-mtmd-cli \
+    #     -m Qwen3.6-27B-Q4_K_M.gguf \
+    #     --mmproj mmproj-F16.gguf \
+    #     --image photo.jpg \
+    #     -p "Describe this image."
+"""
+from __future__ import annotations
+
+import argparse
+import base64
+import sys
+from pathlib import Path
+
+try:
+    from llama_cpp import Llama
+    from llama_cpp.llama_chat_format import Qwen25VLChatHandler
+except ImportError:  # pragma: no cover
+    sys.exit(
+        "Missing llama-cpp-python (>=0.3 with VL handlers).\n"
+        " pip install --upgrade llama-cpp-python pillow"
+    )
+
+
+JANUS_SYSTEM = (
+    "You are Janus, a precise vision-language assistant. Describe images "
+    "accurately, do not invent details, and ground every claim in the "
+    "pixels you can actually see."
+)
+
+
+def encode_image_data_uri(path: Path) -> str:
+    suffix = path.suffix.lower().lstrip(".")
+    mime = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "webp": "webp", "gif": "gif"}.get(suffix, "jpeg")
+    return f"data:image/{mime};base64,{base64.b64encode(path.read_bytes()).decode()}"
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--gguf", required=True, help="Text GGUF (e.g. Qwen3.6-27B-Q4_K_M.gguf).")
+    ap.add_argument("--mmproj", required=True, help="Vision projector GGUF (mmproj-F16.gguf).")
+    ap.add_argument("--image", required=True, help="Image to analyze.")
+    ap.add_argument("--prompt", default="Describe this image in detail.")
+    ap.add_argument("--ctx", type=int, default=8192)
+    ap.add_argument(
+        "--gpu-layers",
+        type=int,
+        default=0,
+        help="Layers to offload to GPU (-1 or 99 = all).",
+    )
+    ap.add_argument("--max-tokens", type=int, default=512)
+    args = ap.parse_args()
+
+    image_path = Path(args.image)
+    if not image_path.exists():
+        sys.exit(f"Image not found: {image_path}")
+
+    # Qwen 2.5 VL chat handler is the closest match shipped with
+    # llama-cpp-python; Qwen 3.5/3.6 vision uses the same projector layout.
+    # If/when llama-cpp-python ships a Qwen3VLChatHandler, swap it in.
+    handler = Qwen25VLChatHandler(clip_model_path=args.mmproj)
+
+    llm = Llama(
+        model_path=args.gguf,
+        chat_handler=handler,
+        n_ctx=args.ctx,
+        n_gpu_layers=args.gpu_layers,
+        verbose=False,
+    )
+
+    out = llm.create_chat_completion(
+        messages=[
+            {"role": "system", "content": JANUS_SYSTEM},
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": encode_image_data_uri(image_path)}},
+                    {"type": "text", "text": args.prompt},
+                ],
+            },
+        ],
+        temperature=0.6,
+        top_p=0.95,
+        top_k=20,
+        repeat_penalty=1.05,
+        max_tokens=args.max_tokens,
+    )
+    print(out["choices"][0]["message"]["content"])
+
+
+if __name__ == "__main__":
+    main()
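
`encode_image_data_uri` in the script labels any unrecognized suffix as `image/jpeg`. A stricter variant is sketched below under stated assumptions: it relies on the standard-library `mimetypes` table (whose coverage of newer formats depends on the Python version) and raises instead of guessing. The name `encode_image_data_uri_strict` is hypothetical and not part of this commit.

```python
import base64
import mimetypes
from pathlib import Path


def encode_image_data_uri_strict(path: Path) -> str:
    """Encode an image file as a data URI, refusing non-image suffixes."""
    # guess_type looks only at the file name, not at the file contents.
    mime, _encoding = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"unsupported image type: {path.suffix or path.name}")
    payload = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime};base64,{payload}"
```

Failing early on something like a `.txt` path gives a clearer error than sending mislabeled bytes to the vision projector.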
--- /dev/null
+++ b/scripts/fetch_mmproj.sh
@@ -0,0 +1,64 @@
+#!/usr/bin/env bash
+# Janus-27B — fetch the vision projector (mmproj) for image input.
+#
+# Why this is separate from build.sh:
+#   build.sh is for the Ollama text path. The mmproj is only useful for
+#   llama.cpp / llama-cpp-python right now, because Ollama's vendored
+#   llama.cpp fork is missing the qwen35 arch entries needed to attach
+#   it (see README Vision section, ollama/ollama#15898).
+#
+# Usage:
+#   ./scripts/fetch_mmproj.sh # default: F16, ~927 MB
+#   ./scripts/fetch_mmproj.sh BF16 # ~931 MB
+#   ./scripts/fetch_mmproj.sh F32 # ~1.8 GB
+#
+# Requires: huggingface-cli (or hf).
+set -euo pipefail
+
+PRECISION="${1:-${PRECISION:-F16}}"
+REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+FILE_NAME="mmproj-${PRECISION}.gguf"
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"
+
+echo "[*] repo: ${REPO_ID}"
+echo "[*] precision: ${PRECISION}"
+echo "[*] file: ${FILE_NAME}"
+echo "[*] dest: ${DEST}"
+
+if [[ -f "${DEST}" ]]; then
+  echo "[=] already present at ${DEST}, skipping."
+  exit 0
+fi
+
+HF=""
+if command -v hf >/dev/null 2>&1; then
+  HF="hf"
+elif command -v huggingface-cli >/dev/null 2>&1; then
+  HF="huggingface-cli"
+else
+  echo "[!] Neither 'hf' nor 'huggingface-cli' found." >&2
+  echo " pip install -U huggingface_hub" >&2
+  exit 1
+fi
+
+DEST_DIR="$(dirname "${DEST}")"
+mkdir -p "${DEST_DIR}"
+
+case "${HF}" in
+  hf) hf download "${REPO_ID}" "${FILE_NAME}" --local-dir "${DEST_DIR}" ;;
+  huggingface-cli) huggingface-cli download "${REPO_ID}" "${FILE_NAME}" --local-dir "${DEST_DIR}" ;;
+esac
+
+if [[ ! -f "${DEST}" ]]; then
+  echo "[!] download failed: ${DEST} not present." >&2
+  exit 1
+fi
+
+echo
+echo "[+] Done. Use it via:"
+echo " python ${ROOT}/examples/llama_cpp_vision.py \\"
+echo " --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \\"
+echo " --mmproj ${DEST} \\"
+echo " --image /path/to/photo.jpg \\"
+echo " --prompt 'Describe this image.'"
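
The variable precedence in `fetch_mmproj.sh` (positional argument, then `$PRECISION`, then `F16`; `$REPO_ID` and `$MMPROJ_PATH` as overrides) can be mirrored in Python. This is a minimal sketch and not part of the commit: `resolve_mmproj` and `fetch` are hypothetical names, and `fetch` assumes the third-party `huggingface_hub` package with its `hf_hub_download` function is installed.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple

DEFAULT_REPO = "unsloth/Qwen3.6-27B-GGUF"


def resolve_mmproj(
    root: Path,
    precision: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, str, Path]:
    """Return (repo_id, file_name, dest) using the script's precedence rules."""
    # Like ${1:-${PRECISION:-F16}}: empty values fall through to the next choice.
    precision = precision or env.get("PRECISION") or "F16"
    repo_id = env.get("REPO_ID") or DEFAULT_REPO
    file_name = f"mmproj-{precision}.gguf"
    dest = Path(env["MMPROJ_PATH"]) if env.get("MMPROJ_PATH") else root / file_name
    return repo_id, file_name, dest


def fetch(root: Path, precision: Optional[str] = None) -> Path:
    """Download the projector unless it is already present at dest."""
    repo_id, file_name, dest = resolve_mmproj(root, precision)
    if dest.is_file():
        return dest
    # Imported lazily so resolve_mmproj works without the package installed.
    from huggingface_hub import hf_hub_download

    dest.parent.mkdir(parents=True, exist_ok=True)
    hf_hub_download(repo_id=repo_id, filename=file_name, local_dir=dest.parent)
    if not dest.is_file():
        raise FileNotFoundError(dest)
    return dest
```

As in the shell script, the download lands at `dest.parent / file_name`, so an `MMPROJ_PATH` whose basename differs from `mmproj-<PRECISION>.gguf` would fail the final existence check even after a successful download.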