Revert base swap back to Qwen/Qwen3.6-27B (keep -Heretic name)
Undoes the Qwen → llmfan46/Qwen3.6-27B-uncensored-heretic-v2 base
swap from 16e1ddd. Project name string (Thanatos-27B-Heretic),
Ollama tag (thanatos-27b-heretic), HF repo URL, banner -HERETIC
wordmark, and git remote are all preserved per explicit choice
("undo base only, keep name").
- README: frontmatter base_model → Qwen/Qwen3.6-27B; drop
base_model_relation and heretic/uncensored tags (imatrix kept).
Tagline, badge, Architecture line, sibling paragraph, Quick-start
path C, Local-apps table, Vision section, Related-models table,
Credits, Known-limitations all back to vanilla framing. Added a
"Note on the name" callout explaining the name-vs-base mismatch.
- Tooling: scripts/build.sh + fetch_vision.sh REPO_ID back to
unsloth/Qwen3.6-27B-GGUF; filename pattern + Q3_K_S smallest
quant restored. Modelfile preamble flipped. transformers
example MODEL_ID back to Qwen/Qwen3.6-27B. examples/README.md
+ llama_cpp_vision.py recipes flipped. CITATION.cff
title/abstract/refs/keywords flipped. Makefile + .gitignore
comments flipped.
- banner.svg subtitle "Dense 27B · Opus 4.7 distilled · uncensored"
→ "Qwen 3.6 · Dense 27B · Opus 4.7 distilled"; PNG re-rasterized.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- .gitignore +2 -8
- CHANGELOG.md +49 -0
- CITATION.cff +13 -18
- Makefile +4 -4
- Modelfile +7 -8
- README.md +70 -105
- banner.png +0 -0
- banner.svg +1 -1
- examples/README.md +14 -15
- examples/llama_cpp_vision.py +7 -7
- examples/transformers_quickstart.py +4 -7
- scripts/build.sh +9 -8
- scripts/check.sh +3 -5
- scripts/fetch_vision.sh +7 -11
--- a/.gitignore
+++ b/.gitignore
@@ -5,22 +5,16 @@ __pycache__/
 .venv/
 venv/
 
-# Local model weights. We don't redistribute the
-# here — `make build` fetches one from
-# llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF locally.
+# Local model weights. We don't redistribute the upstream Qwen GGUFs
+# here — `make build` fetches one from unsloth/Qwen3.6-27B-GGUF locally.
 # The single Thanatos-27B.*.gguf we DO ship backs the HF/Ollama
 # "Use this model" widget (ollama run hf.co/FoolDev/Thanatos-27B-Heretic).
-# The bundled file is still named Thanatos-27B.*.gguf from before the
-# rename; whitelist also covers Thanatos-27B-Heretic.*.gguf for the
-# pending Heretic rebundle.
 *.gguf
 !Thanatos-27B.*.gguf
-!Thanatos-27B-Heretic.*.gguf
 # Local-only rebadge experiments produced by scripts/rename_arch.py.
 # These re-stamp general.architecture and are not loadable by current
 # ollama / llama.cpp; don't track or push them.
 Thanatos-27B.*.qwen[0-9]*.gguf
-Thanatos-27B-Heretic.*.qwen[0-9]*.gguf
 *.safetensors
 *.bin
 
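To see how the post-revert ignore rules interact, here is a minimal Python sketch. It is not Git's actual matcher: it assumes bare file names with no directory components and approximates gitignore globs with `fnmatch`. The helper name `is_ignored` is made up for illustration. The rule it models is that the last matching pattern decides, and a leading `!` re-includes a file.

```python
from fnmatch import fnmatchcase

# Pattern lines from the post-revert .gitignore hunk (comments dropped).
RULES = [
    "*.gguf",
    "!Thanatos-27B.*.gguf",
    "Thanatos-27B.*.qwen[0-9]*.gguf",
    "*.safetensors",
    "*.bin",
]


def is_ignored(filename: str, rules=RULES) -> bool:
    """Approximate gitignore semantics for bare file names: last match wins."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(filename, pattern):
            # A negated rule re-includes the file; a plain rule ignores it.
            ignored = not negated
    return ignored


if __name__ == "__main__":
    for name in (
        "Thanatos-27B.Q4_K_M.gguf",          # the bundled blob
        "Qwen3.6-27B-Q4_K_M.gguf",           # a locally fetched upstream quant
        "Thanatos-27B.Q4_K_M.qwen36.gguf",   # a local rebadge experiment
        "Thanatos-27B-Heretic.Q4_K_M.gguf",  # no longer whitelisted after the revert
    ):
        print(f"{name}: {'ignored' if is_ignored(name) else 'tracked'}")
```

Under these assumptions only the bundled `Thanatos-27B.*.gguf` stays trackable; dropping the `!Thanatos-27B-Heretic.*.gguf` whitelist line means a Heretic-named GGUF falls back under `*.gguf`.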
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,55 @@ and documentation**, not the underlying base model.
 
 ## [Unreleased]
 
+### Reverted (base swap to Heretic v2 — name kept, base back to vanilla Qwen)
+- **Undone the `Qwen/Qwen3.6-27B` → `llmfan46/Qwen3.6-27B-uncensored-heretic-v2`
+  base swap** that shipped in `16e1ddd` and was polished in
+  subsequent commits. Current base is back to vanilla
+  `Qwen/Qwen3.6-27B`. README frontmatter `base_model:`, the
+  `Base-…` badge, the Architecture line, the sibling paragraph,
+  the Quick-start path C, the Local-apps table, the Vision
+  section, the Related-models table, the Credits, and the
+  Known-limitations section all flipped back to the pre-swap
+  Qwen-only framing. `heretic` / `uncensored` tags removed
+  (`imatrix` stays — the bundled blob is still iMatrix-quantized
+  regardless of which base is described). `base_model_relation:
+  finetune` removed; this is a packaging wrapper, not a finetune.
+- **Tooling flipped back to unsloth's GGUF mirror.**
+  `scripts/build.sh` `REPO_ID` back to `unsloth/Qwen3.6-27B-GGUF`
+  with filename pattern `Qwen3.6-27B-${QUANT}.gguf`; quant list
+  back to the unsloth catalog (Q3_K_S restored as the smallest
+  practical quant). `scripts/fetch_vision.sh` defaults back to
+  `PRECISION=F16` and `mmproj-F16.gguf` from unsloth. Modelfile
+  preamble flipped. `examples/transformers_quickstart.py`
+  `MODEL_ID` back to `Qwen/Qwen3.6-27B`. `examples/README.md` and
+  `examples/llama_cpp_vision.py` recipes flipped. `CITATION.cff`
+  title, abstract, references, and keywords flipped. `Makefile`
+  help-text + `build` docstring flipped. `.gitignore` comments
+  + whitelist + rebadge-artifact glob flipped.
+- **`banner.svg`** subtitle reverted `Dense 27B · Opus 4.7
+  distilled · uncensored` → `Qwen 3.6 · Dense 27B · Opus 4.7
+  distilled`. `THANATOS-27B-HERETIC` wordmark **kept** — the
+  project name string and HF repo URL are preserved per explicit
+  choice ("undo base only, keep name"). `banner.png`
+  re-rasterized at 2× via rsvg-convert.
+- **Project name string `Thanatos-27B-Heretic` and Ollama tag
+  `thanatos-27b-heretic` retained** across all files. HF repo
+  URL stays at `FoolDev/Thanatos-27B-Heretic`; git remote
+  unchanged. A "Note on the name" callout added to the README
+  tagline explaining the name-vs-base mismatch so users aren't
+  surprised.
+- **Bundled blob unchanged** (`Thanatos-27B.Q4_K_M.gguf` LFS
+  pointer SHA `5ed60d0a...`). It was always the legacy unsloth
+  Qwen Q4_K_M quant; with the base reverted, the blob and the
+  declared base are now consistent again. The "Bundled blob
+  status" callout in TL;DR removed since it no longer applies.
+- **HF repo migration:** the HF repo at
+  `FoolDev/Thanatos-27B-Heretic` keeps its current name (the
+  user's earlier rename via HF UI stands). If you want to also
+  rename the HF repo back to `FoolDev/Thanatos-27B`, that's a
+  separate HF UI action — HF will serve a 307 redirect from the
+  new name to the old once renamed.
+
 ### Changed (acknowledge HF's `imatrix` auto-tag in frontmatter)
 - **Added `imatrix` to the README `tags:` list.** HF's tag
   auto-detector was surfacing `imatrix` on the rendered model
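The changelog entry above names the upstream repo (`unsloth/Qwen3.6-27B-GGUF`) and the filename pattern (`Qwen3.6-27B-${QUANT}.gguf`) that `scripts/build.sh` uses after the revert. The script itself is not shown in this diff, so the following is only a rough Python sketch of how those two pieces could combine into a download URL. The function names are hypothetical, and the `/resolve/main/` URL layout is assumed from the Hugging Face file links used elsewhere in this repo.

```python
# Values quoted in the CHANGELOG entry; everything else here is illustrative.
REPO_ID = "unsloth/Qwen3.6-27B-GGUF"


def gguf_filename(quant: str) -> str:
    """Mirror the `Qwen3.6-27B-${QUANT}.gguf` pattern from the changelog."""
    return f"Qwen3.6-27B-{quant}.gguf"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL in the Hub's /resolve/<revision>/ layout."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


if __name__ == "__main__":
    for quant in ("Q3_K_S", "Q4_K_M", "Q5_K_M"):
        print(resolve_url(REPO_ID, gguf_filename(quant)))
```

Whether every quant in that list exists upstream under exactly this name is not verified here; treat the printed URLs as what the pattern implies, not as a checked catalog.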
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -1,5 +1,5 @@
 cff-version: 1.2.0
-title: "Thanatos-27B-Heretic: A Dense Distillation Wrapper for
+title: "Thanatos-27B-Heretic: A Dense Distillation Wrapper for Qwen 3.6 27B"
 message: "If you use this model card or its accompanying files, please cite as below."
 type: software
 authors:
@@ -8,15 +8,17 @@ authors:
 repository-code: "https://huggingface.co/FoolDev/Thanatos-27B-Heretic"
 url: "https://huggingface.co/FoolDev/Thanatos-27B-Heretic"
 abstract: >-
-  Thanatos-27B-Heretic is a personal repackaging of
-
-
-
-
-
-
-  (
-
+  Thanatos-27B-Heretic is a personal repackaging of the dense Qwen 3.6 27B base
+  model with Claude Opus 4.7 in the reasoning teacher slot. The
+  repository ships an Ollama Modelfile, sampling defaults, usage
+  examples, and a single ready-to-run GGUF (Q4_K_M ~17 GB) so the HF
+  "Use this model" widget surfaces a one-liner Ollama snippet. Other
+  quants (Q3_K_S, Q5_K_M, Q6_K, etc.) and the upstream safetensors
+  (Qwen/Qwen3.6-27B) are pulled from upstream
+  (unsloth/Qwen3.6-27B-GGUF) on demand rather than redistributed.
+  (The repo carries the `-Heretic` suffix from a prior swap to
+  llmfan46/Qwen3.6-27B-uncensored-heretic-v2 that was reverted;
+  current base is vanilla Qwen 3.6 27B.)
 keywords:
   - qwen
   - qwen3.6
@@ -24,17 +26,10 @@ keywords:
   - distillation
   - reasoning
   - llm
-  - heretic
-  - uncensored
 license: Apache-2.0
 references:
   - type: software
-    title: "Qwen3.6-27B
-    authors:
-      - name: llmfan46
-    url: "https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2"
-  - type: software
-    title: "Qwen3.6-27B (upstream base)"
+    title: "Qwen3.6-27B"
     authors:
       - name: Alibaba Qwen Team
     url: "https://huggingface.co/Qwen/Qwen3.6-27B"
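The new abstract quotes roughly 17 GB for the bundled Q4_K_M GGUF of a 27B-parameter model. As a quick sanity check on that figure, the arithmetic below converts file size to effective bits per weight. It is a back-of-the-envelope sketch under stated assumptions: 1 GB is taken as 10^9 bytes, the parameter count as exactly 27 billion, and GGUF metadata overhead is ignored. The function name is illustrative.

```python
def bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Effective bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    total_params = params_billion * 1e9
    return total_bits / total_params


if __name__ == "__main__":
    # ~17 GB and 27B are the figures quoted in the abstract; both are approximate.
    bpw = bits_per_weight(17, 27)
    print(f"Q4_K_M at ~17 GB for 27B params: about {bpw:.2f} bits/weight")
```

The result lands a little above 5 bits per weight. That is plausible for a mixed-precision 4-bit K-quant, which is the only point of the check.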
--- a/Makefile
+++ b/Makefile
@@ -10,9 +10,9 @@
 # MODEL model tag for smoke (default: $(TAG))
 #
 # Examples:
-# make build # Q4_K_M from
-# make build QUANT=
-# make build GGUF_PATH=~/models/Qwen3.6-27B-
+# make build # Q4_K_M from unsloth (qwen35-stamped, loads today)
+# make build QUANT=Q3_K_S # smaller quant
+# make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf
 # make load-bundle # this repo's bundled GGUF -> local Ollama tag (smudge LFS if needed)
 # make smoke
 # make check
@@ -37,7 +37,7 @@ ifdef GGUF_PATH
 	@echo " GGUF_PATH=$(GGUF_PATH)"
 endif
 
-build: ## Download qwen35-stamped
+build: ## Download qwen35-stamped GGUF from unsloth and run 'ollama create' (loads today).
 	GGUF_PATH=$(GGUF_PATH) TAG=$(TAG) ./scripts/build.sh $(QUANT)
 
 load-bundle: ## Load THIS repo's bundled GGUF into a local Ollama tag (smudge LFS + ollama create).
--- a/Modelfile
+++ b/Modelfile
@@ -16,16 +16,15 @@
 # `e03e10e` after the 4th qwen36 round trip had its friction
 # re-tested in a fresh next-day session).
 #
-# For other quants (
-# downloads the chosen quant from
-#
-#
-#
+# For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
+# downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
+# FROM in a temp Modelfile copy. The Q3_K_S used to ship in this repo;
+# it was removed so HF's Ollama bridge picks Q4_K_M as the default
+# `:latest` tag instead of Q3_K_S (alphabetically-first heuristic).
 #
 # Other GGUF sources (use with `make build GGUF_PATH=...`):
-# https://huggingface.co/
-# https://huggingface.co/
-# https://huggingface.co/unsloth/Qwen3.6-27B-GGUF # vanilla Qwen 3.6 (pre-Heretic)
+# https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
+# https://huggingface.co/rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-GGUF
 
 FROM ./Thanatos-27B.Q4_K_M.gguf
 
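The Modelfile comment above says `make build QUANT=...` "patches FROM in a temp Modelfile copy". The build script is not part of this diff, so the snippet below is only a sketch of that idea in Python: the function name, the temp-file suffix, and the choice to rewrite just the first `FROM` line are all assumptions, not the script's actual behavior.

```python
import tempfile
from pathlib import Path


def patched_modelfile_copy(modelfile_text: str, gguf_path: str) -> Path:
    """Write a temp copy of a Modelfile whose first FROM line points at gguf_path."""
    patched = []
    replaced = False
    for line in modelfile_text.splitlines():
        if not replaced and line.startswith("FROM "):
            patched.append(f"FROM {gguf_path}")
            replaced = True
        else:
            patched.append(line)
    if not replaced:
        raise ValueError("no FROM directive found in Modelfile text")

    # delete=False keeps the file on disk after closing so a later
    # `ollama create -f <path>` could read it; the caller cleans it up.
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".Modelfile", delete=False, encoding="utf-8"
    )
    with tmp:
        tmp.write("\n".join(patched) + "\n")
    return Path(tmp.name)


if __name__ == "__main__":
    original = "# preamble\nFROM ./Thanatos-27B.Q4_K_M.gguf\n"
    copy_path = patched_modelfile_copy(original, "/tmp/Qwen3.6-27B-Q3_K_S.gguf")
    print(copy_path.read_text(encoding="utf-8"))
    copy_path.unlink()  # remove the temp copy once done
```

Working on a copy leaves the checked-in Modelfile pointing at the bundled `Thanatos-27B.Q4_K_M.gguf`, which matches the comment's intent of keeping the repo default untouched.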
--- a/README.md
+++ b/README.md
@@ -1,8 +1,7 @@
 ---
 license: apache-2.0
 base_model:
--
-base_model_relation: finetune
 datasets:
 - crownelius/Creative_Writing_ShareGPT_Enhanced
 - microsoft/rStar-Coder
@@ -41,8 +40,6 @@ tags:
 - agent
 - gguf
 - ollama
-- heretic
-- uncensored
 - imatrix
 library_name: transformers
 pipeline_tag: image-text-to-text
@@ -51,19 +48,24 @@ pipeline_tag: image-text-to-text
 <img src="https://huggingface.co/FoolDev/Thanatos-27B-Heretic/resolve/main/banner.svg" alt="Thanatos-27B-Heretic banner" width="100%" />
 
 [](https://opensource.org/licenses/Apache-2.0)
-[](#architecture)
 [](https://huggingface.co/FoolDev/Janus-35B)
 [](https://buymeacoffee.com/cardoffoolm)
 
 # Thanatos-27B-Heretic
 
-> **Dense Reasoning. Friendlier Footprint.
-> *
 
-**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`
 
-A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on
 
 ## TL;DR
 
@@ -76,25 +78,14 @@ template — HF's Ollama bridge ingests those three files, not
 ollama run hf.co/FoolDev/Thanatos-27B-Heretic # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
 ```
 
-> **Bundled blob status:** the GGUF currently bundled in this repo
-> is the legacy pre-Heretic Qwen 3.6 27B Q4_K_M quant from before
-> the rename. Behaves identically to vanilla Qwen 3.6 27B for now;
-> the Heretic v2 rebundle (from
-> `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) is pending —
-> see the top entry of [CHANGELOG](CHANGELOG.md). If you want the
-> Heretic behavior today, use the local-build path below
-> (`make build`), which pulls the Heretic GGUF directly.
-
 If you pulled the bundle during any of the qwen36 windows on the
 pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
 have a qwen36-stamped blob in your local Ollama store, `make
-heal-hf` rebadges it in place. Fresh pulls
-`Thanatos-27B-Heretic` repo go straight through.
 
-For other quants (
 QUANT=...` is the simplest path. See [Quick start](#quick-start)
-below for the full matrix.
-repo — use Q3_K_M for the smallest practical quant.
 
 For image input use llama.cpp directly — Ollama vision is broken for
 this architecture upstream (see [Vision](#vision)).
@@ -103,7 +94,7 @@ this architecture upstream (see [Vision](#vision)).
 
 The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
 
-The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix
 
 | | Thanatos-27B-Heretic (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
 |---|---|---|
@@ -113,7 +104,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | Layers | 64 | 40 |
 | Hidden size | 5120 | 2048 |
 | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
-|
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
 | Multimodal (text path) | Yes | Yes |
 | Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
@@ -126,15 +117,15 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
 | `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
-| `Modelfile` | Ollama wrapper around the bundled
 | `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
-| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `
 | `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
-| `scripts/heal_hf_pull.sh` | Legacy recovery for users
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
 | `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
-| `scripts/fetch_vision.sh` | Pulls the vision projector (`
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
 | `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
 | `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exit non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
@@ -144,17 +135,16 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | `CHANGELOG.md` | Versioned tooling/docs changes |
 | `README.md` | This file |
 
-For 16 GB GPUs / unified-memory laptops, `make build QUANT=
-downloads the smaller ~
-`
-
-
-
-
-
-`Modelfile`).
 
-If you want the
 
 ## Architecture
 
@@ -170,30 +160,23 @@ If you want the Heretic safetensors for `transformers`, fetch them from [`llmfan
 - Vocab 248,320 (shared with 35B-A3B sibling)
 - 262 144 native context, extensible to ~1 M with YaRN
 - Vision + video supported by the **base architecture** via a separate
-  `mmproj` projector (not redistributed here; pull
-  `Qwen3.6-27B-
-
-  `mmproj-F16.gguf` from `unsloth/Qwen3.6-27B-GGUF` as a reference
-  alternative). See [Vision](#vision) below for current loader
-  compatibility.
 - Multi-token prediction (MTP) head trained for speculative decoding —
   present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
   vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
   **Not usable via llama.cpp / Ollama today**: the GGUF converter
   (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
   `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
-  inference yet"), so the
-
-
-
-
-
-
-
-  support lands. (Earlier README versions claimed MTP was available
-  via llama.cpp without this caveat — confirmed empirically via
-  `gguf.GGUFReader` on both this bundle and
-  `unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
 
 **The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
 workaround for an unimplemented `qwen36` arch, but the canonical
@@ -209,11 +192,9 @@ stack:
   exists in `transformers`; Qwen reuses the 3.5 class names.
 - **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
   `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
-  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The
-  GGUFs this repo pulls from
-
-  stamps, as do the upstream unsloth GGUFs (`unsloth/Qwen3.6-27B-GGUF`,
-  `unsloth/Qwen3.6-35B-A3B-GGUF`).
 - **llama.cpp's model code.** `src/models/qwen35.cpp` has an
   explicit `case 64: type = LLM_TYPE_27B` branch for this model;
   `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
@@ -307,14 +288,12 @@ ollama run hf.co/FoolDev/Thanatos-27B-Heretic # 17 GB Q4_K_M, qwen35-s
 make load-bundle # creates local tag thanatos-27b-heretic
 ollama run thanatos-27b-heretic
 
-# C. Bypass the bundle: download a qwen35-stamped
-#
-# llama.cpp / Ollama. This is the path that gets you actual
-# Heretic behavior until the bundled blob is rebundled.
 make build # Q4_K_M -> thanatos-27b-heretic
-make build QUANT=
-make build QUANT=Q5_K_M #
-make build GGUF_PATH=~/models/Qwen3.6-27B-
 ollama run thanatos-27b-heretic
 ```
 
@@ -338,10 +317,10 @@ python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-
 
 | App | How to load this model |
 |---|---|
-| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=
-| **LM Studio** | Search → `FoolDev/Thanatos-27B-Heretic` → pick `Thanatos-27B.Q4_K_M.gguf`
 | **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B-Heretic`. Same template behavior as LM Studio. |
-| **llama.cpp** | `hf download FoolDev/Thanatos-27B-Heretic Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via `
 | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 | **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
 
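The `scripts/bench.sh` row in the README file table earlier in this diff says throughput is measured from Ollama's `eval_count` / `eval_duration` response metadata. As a hedged illustration of that calculation (not the script itself, which this diff does not show), the Python sketch below relies on Ollama reporting `eval_duration` in nanoseconds. The sample numbers are invented, and pooling tokens and time across prompts is one possible way to combine a multi-prompt mix; the actual script may average differently.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from Ollama metadata (eval_duration is in ns)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def pooled_tokens_per_second(responses: list[dict]) -> float:
    """Total generated tokens divided by total generation time across runs."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return tokens_per_second(total_tokens, total_ns)


if __name__ == "__main__":
    # Made-up metadata standing in for three prompt runs.
    sample = [
        {"eval_count": 120, "eval_duration": 12_000_000_000},
        {"eval_count": 300, "eval_duration": 30_000_000_000},
        {"eval_count": 80, "eval_duration": 8_000_000_000},
    ]
    print(f"{pooled_tokens_per_second(sample):.1f} tok/s")
```

Using the server-side counters rather than wall-clock timing keeps model load and prompt processing out of the number, which is presumably why the table calls the result "real tok/s".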
@@ -397,21 +376,17 @@ Behavior rules:
 
 ## Vision
 
-The Qwen 3.6 base (and
-
-multimodal stack is:
 
 ```
-Qwen3.6-27B-
-
 ```
 
 Both files are at
-[`
-
-[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
-(`mmproj-F16.gguf`, ~927 MB). This repo intentionally does not
-redistribute either.
 
 ### Loader compatibility — the honest table
 
@@ -429,11 +404,10 @@ Three flavors, in order of build-time effort:
|
|
| 429 |
```bash
|
| 430 |
# A. HTTP via llama-server (always built — the easiest path).
|
| 431 |
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
|
| 432 |
-
# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU
|
| 433 |
-
# bundle; Heretic v2 shares the architecture so the recipe carries).
|
| 434 |
llama-server \
|
| 435 |
-
-m Qwen3.6-27B-
|
| 436 |
-
--mmproj
|
| 437 |
--host 127.0.0.1 --port 8765 -c 8192 -ngl 99
|
| 438 |
# then POST OpenAI-style chat completions with an image_url content
|
| 439 |
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
|
|
@@ -446,15 +420,15 @@ llama-server \
|
|
| 446 |
# produce it — a plain `cmake --build build` will. If yours didn't,
|
| 447 |
# run `cmake --build build --target llama-mtmd-cli`.
|
| 448 |
llama-mtmd-cli \
|
| 449 |
-
-m Qwen3.6-27B-
|
| 450 |
-
--mmproj
|
| 451 |
--image photo.jpg \
|
| 452 |
-p "Describe this image."
|
| 453 |
|
| 454 |
# C. Python via llama-cpp-python:
|
| 455 |
python examples/llama_cpp_vision.py \
|
| 456 |
-
--gguf /path/to/Qwen3.6-27B-
|
| 457 |
-
--mmproj /path/to/
|
| 458 |
--image /path/to/photo.jpg \
|
| 459 |
--prompt "What is in this image?"
|
| 460 |
```
|
|
@@ -472,22 +446,19 @@ The dense 27B is the lighter sibling to Janus-35B and the easier of the two to d
|
|
| 472 |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
|
| 473 |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
|
| 474 |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
|
| 475 |
-
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=
|
| 476 |
|
| 477 |
Most numbers in this table are estimates from comparable models; the
|
| 478 |
gradient is right but the absolute values will move ±20% with prompt
|
| 479 |
shape, KV cache type, and parallel-request count. Measure your own
|
| 480 |
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
|
| 481 |
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
|
| 482 |
-
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan
|
| 483 |
-
(measured against the pre-rename Qwen 3.6 bundle; Heretic v2 inherits
|
| 484 |
-
the architecture so per-step cost should match within bench noise):
|
| 485 |
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
|
| 486 |
steady across short / medium / long prompts), sitting between CPU-only
|
| 487 |
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
|
| 488 |
same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
|
| 489 |
-
this hardware.
|
| 490 |
-
~13 GB Q3_K_M should sit within 5% of the ~12 GB Q3_K_S numbers.)
|
| 491 |
|
| 492 |
## Chat template
|
| 493 |
|
|
@@ -588,25 +559,19 @@ python examples/ollama_chat.py # section 3 runs a real round-trip
|
|
| 588 |
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
|
| 589 |
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
|
| 590 |
- **No formal evaluation in this card.** Numbers above are estimates.
|
| 591 |
-
- **Bundled blob is pre-Heretic.** The currently-bundled `Thanatos-27B.Q4_K_M.gguf` blob is the legacy Qwen 3.6 27B Q4_K_M quant from before the rename — it behaves like vanilla Qwen 3.6, not Heretic v2. Use `make build` (which pulls the Heretic GGUF from llmfan46) until the rebundle ships.
|
| 592 |
-
- **Uncensored base.** The Heretic v2 abliteration dials back the refusal-training of upstream Qwen 3.6. Outputs may be more compliant with sensitive requests than the vanilla base; the Thanatos system prompt still steers behavior, but the safety floor is lower. Apply your own filtering for user-facing deployments.
|
| 593 |
|
| 594 |
## Related models
|
| 595 |
|
| 596 |
| Model | Notes |
|
| 597 |
|---|---|
|
| 598 |
-
| [
|
| 599 |
-
| [
|
| 600 |
-
| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved) | Same Heretic v2 but keeps the MTP head for vLLM / SGLang speculative decoding |
|
| 601 |
-
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream pre-Heretic base, safetensors |
|
| 602 |
-
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Pre-Heretic GGUF mirror + reference `mmproj-F16.gguf` projector |
|
| 603 |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
|
| 604 |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
|
| 605 |
|
| 606 |
## Credits
|
| 607 |
|
| 608 |
-
-
|
| 609 |
-
- Upstream base: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
|
| 610 |
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
|
| 611 |
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)
|
| 612 |
|
|
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
base_model:
|
| 4 |
+
- Qwen/Qwen3.6-27B
|
|
|
|
| 5 |
datasets:
|
| 6 |
- crownelius/Creative_Writing_ShareGPT_Enhanced
|
| 7 |
- microsoft/rStar-Coder
|
|
|
|
| 40 |
- agent
|
| 41 |
- gguf
|
| 42 |
- ollama
|
|
|
|
|
|
|
| 43 |
- imatrix
|
| 44 |
library_name: transformers
|
| 45 |
pipeline_tag: image-text-to-text
|
|
|
|
| 48 |
<img src="https://huggingface.co/FoolDev/Thanatos-27B-Heretic/resolve/main/banner.svg" alt="Thanatos-27B-Heretic banner" width="100%" />
|
| 49 |
|
| 50 |
[](https://opensource.org/licenses/Apache-2.0)
|
| 51 |
+
[](https://huggingface.co/Qwen/Qwen3.6-27B)
|
| 52 |
[](#architecture)
|
| 53 |
[](https://huggingface.co/FoolDev/Janus-35B)
|
| 54 |
[](https://buymeacoffee.com/cardoffoolm)
|
| 55 |
|
| 56 |
# Thanatos-27B-Heretic
|
| 57 |
|
| 58 |
+
> **Dense Reasoning. Friendlier Footprint.**
|
| 59 |
+
> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
|
| 60 |
|
| 61 |
+
**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
|
| 62 |
|
| 63 |
+
A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
|
| 64 |
+
|
| 65 |
+
> **Note on the name.** The repo carries the `-Heretic` suffix from a
|
| 66 |
+
> prior swap to `llmfan46/Qwen3.6-27B-uncensored-heretic-v2` that was
|
| 67 |
+
> reverted. The current base is the vanilla `Qwen/Qwen3.6-27B`; the
|
| 68 |
+
> name string and HF repo URL are kept for continuity.
|
| 69 |
|
| 70 |
## TL;DR
|
| 71 |
|
|
|
|
| 78 |
ollama run hf.co/FoolDev/Thanatos-27B-Heretic # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
|
| 79 |
```
|
| 80 |
|
|
| 81 |
If you pulled the bundle during any of the qwen36 windows on the
|
| 82 |
pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
|
| 83 |
have a qwen36-stamped blob in your local Ollama store, `make
|
| 84 |
+
heal-hf` rebadges it in place. Fresh pulls go straight through.
|
|
|
|
| 85 |
|
| 86 |
+
For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
|
| 87 |
QUANT=...` is the simplest path. See [Quick start](#quick-start)
|
| 88 |
+
below for the full matrix.
|
|
|
|
| 89 |
|
| 90 |
For image input use llama.cpp directly — Ollama vision is broken for
|
| 91 |
this architecture upstream (see [Vision](#vision)).
|
|
|
|
| 94 |
|
| 95 |
The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
|
| 96 |
|
| 97 |
+
The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
|
| 98 |
|
| 99 |
| | Thanatos-27B-Heretic (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
|
| 100 |
|---|---|---|
|
|
|
|
| 104 |
| Layers | 64 | 40 |
|
| 105 |
| Hidden size | 5120 | 2048 |
|
| 106 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
|
| 107 |
+
| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
|
| 108 |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
|
| 109 |
| Multimodal (text path) | Yes | Yes |
|
| 110 |
| Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
|
|
|
|
| 117 |
|---|---|
|
| 118 |
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
|
| 119 |
| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
|
| 120 |
+
| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF — used by `make build` / `ollama create` for **local** builds |
|
| 121 |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
|
| 122 |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
|
| 123 |
+
| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
|
| 124 |
| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
|
| 125 |
+
| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B-Heretic` (or the pre-rename `FoolDev/Thanatos-27B`) *before* the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls don't need it. |
|
| 126 |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
|
| 127 |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
|
| 128 |
+
| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
|
| 129 |
| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
|
| 130 |
| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
|
| 131 |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
|
|
|
|
| 135 |
| `CHANGELOG.md` | Versioned tooling/docs changes |
|
| 136 |
| `README.md` | This file |
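
The template-leak assertion that `scripts/smoke_test.sh` is described as making can be illustrated in a few lines. This is only a sketch: the marker list assumes the ChatML control tokens `<|im_start|>` and `<|im_end|>`, and the helper name is hypothetical; the script's actual token list is not shown here.

```python
# Assumed ChatML control tokens; the real smoke test may check a different set.
CHATML_MARKERS = ("<|im_start|>", "<|im_end|>")


def leaked_template_tokens(text: str, markers=CHATML_MARKERS) -> list[str]:
    """Return every assumed template marker that appears in the response text."""
    return [m for m in markers if m in text]


def assert_no_leak(text: str) -> None:
    """Raise AssertionError if any assumed marker leaked into the response."""
    leaked = leaked_template_tokens(text)
    assert not leaked, f"chat-template tokens leaked: {leaked}"
```

A clean response yields an empty list; a response that still carries a control token is reported by name.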
|
| 137 |
|
| 138 |
+
For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
|
| 139 |
+
downloads the smaller ~12 GB Q3_K_S quant from
|
| 140 |
+
`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
|
| 141 |
+
creates a local `thanatos-27b-heretic` Ollama tag. That quant is not
|
| 142 |
+
redistributed via this repo. For other quants use `make build QUANT=...`. The
|
| 143 |
+
local-build path applies this repo's `Modelfile`; the `hf.co/...`
|
| 144 |
+
path applies the root-level `template`, `system`, and `params`
|
| 145 |
+
files (kept in sync with the `Modelfile`).
|
|
|
|
| 146 |
|
| 147 |
+
If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
|
| 148 |
|
| 149 |
## Architecture
|
| 150 |
|
|
|
|
| 160 |
- Vocab 248,320 (shared with 35B-A3B sibling)
|
| 161 |
- 262 144 native context, extensible to ~1 M with YaRN
|
| 162 |
- Vision + video supported by the **base architecture** via a separate
|
| 163 |
+
`mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
|
| 164 |
+
from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
|
| 165 |
+
current loader compatibility.
|
|
|
|
|
|
|
|
|
|
| 166 |
- Multi-token prediction (MTP) head trained for speculative decoding —
|
| 167 |
present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
|
| 168 |
vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
|
| 169 |
**Not usable via llama.cpp / Ollama today**: the GGUF converter
|
| 170 |
(`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
|
| 171 |
`qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
|
| 172 |
+
inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
|
| 173 |
+
851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
|
| 174 |
+
merged 2026-05-16) currently covers other architectures only;
|
| 175 |
+
qwen35 / qwen35moe support is pending that PR's follow-up
|
| 176 |
+
work. (Earlier README versions claimed MTP was
|
| 177 |
+
available without this caveat; the missing head was confirmed via
|
| 178 |
+
`gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
|
| 179 |
+
2026-05-19.)
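
The `gguf.GGUFReader` check mentioned in the MTP bullet can be reproduced roughly as follows. This is a minimal sketch, assuming the `gguf` Python package is installed. The `mtp` / `nextn` substrings and both helper names are assumptions made for illustration, not names confirmed by llama.cpp.

```python
from typing import Iterable

# Assumed substrings that would mark an MTP tensor; adjust if your GGUF differs.
MTP_MARKERS = ("mtp", "nextn")


def find_mtp_tensors(names: Iterable[str]) -> list[str]:
    """Return tensor names that contain any of the assumed MTP markers."""
    return [n for n in names if any(m in n.lower() for m in MTP_MARKERS)]


def tensor_names(gguf_path: str) -> list[str]:
    """List every tensor name stored in a GGUF file."""
    from gguf import GGUFReader  # pip install gguf

    return [t.name for t in GGUFReader(gguf_path).tensors]


# Usage (path is a placeholder for your local file):
#   names = tensor_names("Thanatos-27B.Q4_K_M.gguf")
#   print(len(names), find_mtp_tensors(names))
```

An empty candidate list alongside the expected tensor count would be consistent with the converter having skipped the MTP head.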
|
| 180 |
|
| 181 |
**The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
|
| 182 |
workaround for an unimplemented `qwen36` arch, but the canonical
|
|
|
|
| 192 |
exists in `transformers`; Qwen reuses the 3.5 class names.
|
| 193 |
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
|
| 194 |
`Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
|
| 195 |
+
`Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
|
| 196 |
+
GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
|
| 197 |
+
`unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
|
|
|
|
|
|
|
| 198 |
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
|
| 199 |
explicit `case 64: type = LLM_TYPE_27B` branch for this model;
|
| 200 |
`qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
|
|
|
|
| 288 |
make load-bundle # creates local tag thanatos-27b-heretic
|
| 289 |
ollama run thanatos-27b-heretic
|
| 290 |
|
| 291 |
+
# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
|
| 292 |
+
# and build locally. Loads on every current llama.cpp / Ollama.
|
|
|
|
|
|
|
| 293 |
make build # Q4_K_M -> thanatos-27b-heretic
|
| 294 |
+
make build QUANT=Q3_K_S # 12 GB smaller quant
|
| 295 |
+
make build QUANT=Q5_K_M # 20 GB higher quality
|
| 296 |
+
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
|
| 297 |
ollama run thanatos-27b-heretic
|
| 298 |
```
|
| 299 |
|
|
|
|
| 317 |
|
| 318 |
| App | How to load this model |
|
| 319 |
|---|---|
|
| 320 |
+
| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
|
| 321 |
+
| **LM Studio** | Search → `FoolDev/Thanatos-27B-Heretic` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
|
| 322 |
| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B-Heretic`. Same template behavior as LM Studio. |
|
| 323 |
+
| **llama.cpp** | `hf download FoolDev/Thanatos-27B-Heretic Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
|
| 324 |
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
|
| 325 |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
|
| 326 |
|
|
|
|
| 376 |
|
| 377 |
## Vision
|
| 378 |
|
| 379 |
+
The Qwen 3.6 base supports image (and video) input via a separate
|
| 380 |
+
`mmproj` projector. The full multimodal stack is:
|
|
|
|
| 381 |
|
| 382 |
```
|
| 383 |
+
Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
|
| 384 |
+
mmproj-F16.gguf (~927 MB, the vision projector)
|
| 385 |
```
|
| 386 |
|
| 387 |
Both files are at
|
| 388 |
+
[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
|
| 389 |
+
This repo intentionally does not redistribute either.
|
|
|
|
|
|
|
|
|
|
| 390 |
|
| 391 |
### Loader compatibility — the honest table
|
| 392 |
|
|
|
|
| 404 |
```bash
|
| 405 |
# A. HTTP via llama-server (always built — the easiest path).
|
| 406 |
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
|
| 407 |
+
# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
|
|
|
|
| 408 |
llama-server \
|
| 409 |
+
-m Qwen3.6-27B-Q4_K_M.gguf \
|
| 410 |
+
--mmproj mmproj-F16.gguf \
|
| 411 |
--host 127.0.0.1 --port 8765 -c 8192 -ngl 99
|
| 412 |
# then POST OpenAI-style chat completions with an image_url content
|
| 413 |
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
|
|
|
|
| 420 |
# produce it — a plain `cmake --build build` will. If yours didn't,
|
| 421 |
# run `cmake --build build --target llama-mtmd-cli`.
|
| 422 |
llama-mtmd-cli \
|
| 423 |
+
-m Qwen3.6-27B-Q4_K_M.gguf \
|
| 424 |
+
--mmproj mmproj-F16.gguf \
|
| 425 |
--image photo.jpg \
|
| 426 |
-p "Describe this image."
|
| 427 |
|
| 428 |
# C. Python via llama-cpp-python:
|
| 429 |
python examples/llama_cpp_vision.py \
|
| 430 |
+
--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
|
| 431 |
+
--mmproj /path/to/mmproj-F16.gguf \
|
| 432 |
--image /path/to/photo.jpg \
|
| 433 |
--prompt "What is in this image?"
|
| 434 |
```
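
For flavor A, the request body can be built in Python. This is a sketch rather than the repo's own client: it assumes `llama-server` is listening on the host and port from the command above and serves the OpenAI-compatible `/v1/chat/completions` route. The function names are hypothetical.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload carrying one inline (data URL) image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
                ],
            }
        ]
    }


def post_chat(payload: dict, base_url: str = "http://127.0.0.1:8765") -> dict:
    """POST the payload to the server's chat-completions endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `post_chat(build_vision_request(open("photo.jpg", "rb").read(), "Describe this image."))` should return a response whose text sits under `choices[0].message.content`.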
|
|
|
|
| 446 |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
|
| 447 |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
|
| 448 |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
|
| 449 |
+
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. Use `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |
|
| 450 |
|
| 451 |
Most numbers in this table are estimates from comparable models; the
|
| 452 |
gradient is right but the absolute values will move ±20% with prompt
|
| 453 |
shape, KV cache type, and parallel-request count. Measure your own
|
| 454 |
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
|
| 455 |
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
|
| 456 |
+
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
|
|
|
|
|
|
|
| 457 |
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
|
| 458 |
steady across short / medium / long prompts), sitting between CPU-only
|
| 459 |
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
|
| 460 |
same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
|
| 461 |
+
this hardware.
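
The tok/s figure derived from `eval_count` / `eval_duration` is simple arithmetic: Ollama reports `eval_duration` in nanoseconds, so throughput is tokens divided by seconds. The sketch below shows that calculation. How `make bench` aggregates its three prompts is not shown here, so the pooled mean is an assumption.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from one Ollama response's timing metadata."""
    eval_count = response["eval_count"]            # generated tokens
    eval_duration_ns = response["eval_duration"]   # nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def pooled_tokens_per_second(responses: list[dict]) -> float:
    """Assumed aggregation: total tokens over total decode time across runs."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    if total_ns <= 0:
        raise ValueError("total eval_duration must be positive")
    return total_tokens / (total_ns / 1e9)
```

Because both fields come from the server's own timers, the result excludes network and prompt-processing time, which is why it is less noisy than a stopwatch.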
|
|
|
|
| 462 |
|
| 463 |
## Chat template
|
| 464 |
|
|
|
|
| 559 |
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
|
| 560 |
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
|
| 561 |
- **No formal evaluation in this card.** Numbers above are estimates.
|
|
|
|
|
|
|
| 562 |
|
| 563 |
## Related models
|
| 564 |
|
| 565 |
| Model | Notes |
|
| 566 |
|---|---|
|
| 567 |
+
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
|
| 568 |
+
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
|
|
|
|
|
|
|
|
|
|
| 569 |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
|
| 570 |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
|
| 571 |
|
| 572 |
## Credits
|
| 573 |
|
| 574 |
+
- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
|
|
|
|
| 575 |
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
|
| 576 |
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)
|
| 577 |
|
|
|
|
|
|
|
|
@@ -5,9 +5,9 @@ Four minimal entry points. Pick the one that matches how you run models.
|
|
| 5 |
| File | Backend | When to use |
|
| 6 |
|---|---|---|
|
| 7 |
| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b-heretic` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
|
| 8 |
-
| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the
|
| 9 |
| `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
|
| 10 |
-
| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `
|
| 11 |
|
| 12 |
All four apply the same Thanatos system prompt and sampling defaults
|
| 13 |
(`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
|
|
@@ -36,13 +36,12 @@ in place (qwen36 → qwen35, metadata-only, ~5 s) — the same
|
|
| 36 |
tag then loads. Fresh pulls after the re-stamp go straight
|
| 37 |
through.
|
| 38 |
|
| 39 |
-
For a non-bundled quant (e.g.
|
| 40 |
-
`make build QUANT=...` downloads from
|
| 41 |
-
`
|
| 42 |
-
local `thanatos-27b-heretic` tag:
|
| 43 |
|
| 44 |
```bash
|
| 45 |
-
cd .. && make build QUANT=
|
| 46 |
MODEL=thanatos-27b-heretic python ollama_chat.py
|
| 47 |
```
|
| 48 |
|
|
@@ -55,8 +54,8 @@ MODEL=thanatos-27b-heretic python ollama_chat.py
|
|
| 55 |
```
|
| 56 |
|
| 57 |
For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
|
| 58 |
-
fetch it from `
|
| 59 |
-
|
| 60 |
|
| 61 |
```bash
|
| 62 |
cd .. && make build QUANT=Q5_K_M && cd examples
|
|
@@ -75,7 +74,7 @@ python transformers_quickstart.py --no-4bit # bf16, ~54 GB VRAM
|
|
| 75 |
|
| 76 |
```bash
|
| 77 |
pip install llama-cpp-python # CPU-only build
|
| 78 |
-
python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-
|
| 79 |
```
|
| 80 |
|
| 81 |
For GPU offload, rebuild llama-cpp-python with the matching backend — see
|
|
@@ -84,13 +83,13 @@ the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
|
|
| 84 |
### Vision (image input)
|
| 85 |
|
| 86 |
```bash
|
| 87 |
-
# Pull the projector once (~
|
| 88 |
-
hf download
|
| 89 |
|
| 90 |
pip install llama-cpp-python pillow
|
| 91 |
python llama_cpp_vision.py \
|
| 92 |
-
--gguf /path/to/Qwen3.6-27B-
|
| 93 |
-
--mmproj /path/to/
|
| 94 |
--image /path/to/photo.jpg \
|
| 95 |
--prompt "Describe this image."
|
| 96 |
```
|
|
@@ -102,7 +101,7 @@ lacks them. `ollama create` accepts the dual-`FROM` and `ollama show`
|
|
| 102 |
reports `vision` capability, but the first inference call fails with
|
| 103 |
`error loading model architecture: unknown model architecture:
|
| 104 |
'qwen35'` (verified empirically against the dense 27B +
|
| 105 |
-
|
| 106 |
[ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
|
| 107 |
Until that's fixed, llama.cpp / llama-cpp-python is the working path
|
| 108 |
for vision.
|
|
|
|
| 5 |
| File | Backend | When to use |
|
| 6 |
|---|---|---|
|
| 7 |
| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b-heretic` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
|
| 8 |
+
| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
|
| 9 |
| `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
|
| 10 |
+
| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-F16.gguf` and answers questions about an image. The only working vision path right now. |
|
| 11 |
|
| 12 |
All four apply the same Thanatos system prompt and sampling defaults
|
| 13 |
(`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
|
|
|
|
| 36 |
tag then loads. Fresh pulls after the re-stamp go straight
|
| 37 |
through.
|
| 38 |
|
| 39 |
+
For a non-bundled quant (e.g. Q3_K_S ~12 GB, Q5_K_M ~20 GB),
|
| 40 |
+
`make build QUANT=...` downloads from `unsloth/Qwen3.6-27B-GGUF`
|
| 41 |
+
and creates a local `thanatos-27b-heretic` tag:
|
|
|
|
| 42 |
|
| 43 |
```bash
|
| 44 |
+
cd .. && make build QUANT=Q3_K_S && cd examples
|
| 45 |
MODEL=thanatos-27b-heretic python ollama_chat.py
|
| 46 |
```
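
If you would rather fetch the quant yourself and hand it to `make build GGUF_PATH=...`, the download step can be approximated in Python. This is a sketch, assuming the `huggingface_hub` package is installed and that upstream files follow the `Qwen3.6-27B-<QUANT>.gguf` pattern used by `scripts/build.sh`. The helper names are hypothetical.

```python
def gguf_filename(quant: str = "Q4_K_M") -> str:
    """Upstream file name for a quant, e.g. Qwen3.6-27B-Q4_K_M.gguf."""
    return f"Qwen3.6-27B-{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Fetch one quant from unsloth/Qwen3.6-27B-GGUF and return its local path."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return hf_hub_download(
        repo_id="unsloth/Qwen3.6-27B-GGUF",
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

The returned path can then be passed as `GGUF_PATH` so the build skips its own download.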
|
| 47 |
|
|
|
|
| 54 |
```
|
| 55 |
|
| 56 |
For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
|
| 57 |
+
fetch it from `unsloth/Qwen3.6-27B-GGUF` and automatically patch the
|
| 58 |
+
`FROM` line in a temp copy of the `Modelfile`:
|
| 59 |
|
| 60 |
```bash
|
| 61 |
cd .. && make build QUANT=Q5_K_M && cd examples
|
|
|
|
| 74 |
|
| 75 |
```bash
|
| 76 |
pip install llama-cpp-python # CPU-only build
|
| 77 |
+
python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
|
| 78 |
```
|
| 79 |
|
| 80 |
For GPU offload, rebuild llama-cpp-python with the matching backend — see
|
|
|
|
| 83 |
### Vision (image input)
|
| 84 |
|
| 85 |
```bash
|
| 86 |
+
# Pull the projector once (~927 MB):
|
| 87 |
+
hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf --local-dir .
|
| 88 |
|
| 89 |
pip install llama-cpp-python pillow
|
| 90 |
python llama_cpp_vision.py \
|
| 91 |
+
--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
|
| 92 |
+
--mmproj /path/to/mmproj-F16.gguf \
|
| 93 |
--image /path/to/photo.jpg \
|
| 94 |
--prompt "Describe this image."
|
| 95 |
```
|
|
|
|
| 101 |
reports `vision` capability, but the first inference call fails with
|
| 102 |
`error loading model architecture: unknown model architecture:
|
| 103 |
'qwen35'` (verified empirically against the dense 27B +
|
| 104 |
+
`mmproj-F16.gguf`). Tracked in
|
| 105 |
[ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
|
| 106 |
Until that's fixed, llama.cpp / llama-cpp-python is the working path
|
| 107 |
for vision.
|
|
@@ -23,21 +23,21 @@ Install:
|
|
| 23 |
# CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-binary :all:
|
| 24 |
# CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
|
| 25 |
|
| 26 |
-
Files you need (both from
|
| 27 |
-
1. A text GGUF (any quant): e.g. Qwen3.6-27B-
|
| 28 |
-
2. A vision projector:
|
| 29 |
|
| 30 |
Usage:
|
| 31 |
python llama_cpp_vision.py \
|
| 32 |
-
--gguf /path/to/Qwen3.6-27B-
|
| 33 |
-
--mmproj /path/to/
|
| 34 |
--image /path/to/photo.jpg \
|
| 35 |
--prompt "What is in this image? Be specific."
|
| 36 |
|
| 37 |
# CLI alternative without python binding (ships with llama.cpp):
|
| 38 |
# llama-mtmd-cli \
|
| 39 |
-
# -m Qwen3.6-27B-
|
| 40 |
-
# --mmproj
|
| 41 |
# --image photo.jpg \
|
| 42 |
# -p "Describe this image."
|
| 43 |
"""
|
|
|
|
| 23 |
# CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-binary :all:
|
| 24 |
# CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
|
| 25 |
|
| 26 |
+
Files you need (both from unsloth/Qwen3.6-27B-GGUF):
|
| 27 |
+
1. A text GGUF (any quant): e.g. Qwen3.6-27B-Q4_K_M.gguf (~17 GB)
|
| 28 |
+
2. A vision projector: mmproj-F16.gguf (~927 MB)
|
| 29 |
|
| 30 |
Usage:
|
| 31 |
python llama_cpp_vision.py \
|
| 32 |
+
--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
|
| 33 |
+
--mmproj /path/to/mmproj-F16.gguf \
|
| 34 |
--image /path/to/photo.jpg \
|
| 35 |
--prompt "What is in this image? Be specific."
|
| 36 |
|
| 37 |
# CLI alternative without python binding (ships with llama.cpp):
|
| 38 |
# llama-mtmd-cli \
|
| 39 |
+
# -m Qwen3.6-27B-Q4_K_M.gguf \
|
| 40 |
+
# --mmproj mmproj-F16.gguf \
|
| 41 |
# --image photo.jpg \
|
| 42 |
# -p "Describe this image."
|
| 43 |
"""
|
|
@@ -2,14 +2,11 @@
 """
 Thanatos-27B-Heretic — Hugging Face Transformers quickstart.

-Loads the
+Loads the upstream Qwen 3.6 27B safetensors directly and runs a single
 chat turn using its embedded chat template. Thanatos-27B-Heretic is a
 *wrapper* around that base, so for the transformers route there is nothing
-to download from this repo — point at
-
-
-Set MODEL_ID = "Qwen/Qwen3.6-27B" to bypass the Heretic abliteration and
-load the vanilla upstream base instead.
+to download from this repo — point at Qwen/Qwen3.6-27B and apply the same
+system prompt the Modelfile uses.

 Requirements:
 pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
@@ -39,7 +36,7 @@ except ImportError as e: # pragma: no cover
 )


-MODEL_ID = "
+MODEL_ID = "Qwen/Qwen3.6-27B"

 THANATOS_SYSTEM = (
 "You are Thanatos, a precise and capable assistant for reasoning, writing, "
@@ -7,20 +7,21 @@
 # QUANT=Q6_K ./scripts/build.sh
 #
 # Skip the download by pointing at a GGUF you already have:
-# GGUF_PATH=/path/to/Qwen3.6-27B-
+# GGUF_PATH=/path/to/Qwen3.6-27B-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
 #
 # Requires: huggingface-cli (or hf), ollama, awk.
 set -euo pipefail

 QUANT="${1:-${QUANT:-Q4_K_M}}"

-REPO_ID="${REPO_ID:-
-#
-#
-#
-#
-#
-
+REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+# Upstream uses dashes, e.g. Qwen3.6-27B-Q4_K_M.gguf. Quants known to exist
+# at unsloth/Qwen3.6-27B-GGUF (as of 2026-04):
+# Q3_K_S Q3_K_M Q4_0 Q4_1 Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0
+# IQ4_XS IQ4_NL
+# UD-IQ2_XXS UD-IQ2_M UD-Q2_K_XL UD-IQ3_XXS UD-Q3_K_XL UD-Q4_K_XL
+# UD-Q5_K_XL UD-Q6_K_XL UD-Q8_K_XL
+GGUF_NAME="Qwen3.6-27B-${QUANT}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 # GGUF_PATH defaults to ${ROOT}/${GGUF_NAME}, but can be overridden so users
 # with cached weights elsewhere don't have to copy or symlink anything.
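The build script resolves its quant with `QUANT="${1:-${QUANT:-Q4_K_M}}"`: the first positional argument wins, then the `QUANT` environment variable, then the `Q4_K_M` default, with empty values treated as unset. The sketch below restates that precedence and the dash-separated file name in Python. The helper names are hypothetical, and the `GGUF_PATH` override is modeled on the script's comment, since the exact shell expression is not visible in this hunk.

```python
from typing import Mapping, Sequence


def resolve_quant(argv: Sequence[str], env: Mapping[str, str]) -> str:
    """Mirror ${1:-${QUANT:-Q4_K_M}}: argument, then env var, then default."""
    first_arg = argv[0] if argv else ""
    # `or` skips empty strings, like the `:-` expansion skips unset-or-empty.
    return first_arg or env.get("QUANT", "") or "Q4_K_M"


def gguf_name(quant: str) -> str:
    """Upstream file names join model and quant with a dash, not a dot."""
    return f"Qwen3.6-27B-{quant}.gguf"


def resolve_gguf_path(root: str, quant: str, env: Mapping[str, str]) -> str:
    """Assumed behaviour: GGUF_PATH overrides the ${ROOT}/${GGUF_NAME} default."""
    return env.get("GGUF_PATH", "") or f"{root}/{gguf_name(quant)}"


quant = resolve_quant(["Q6_K"], {"QUANT": "Q8_0"})
print(quant, resolve_gguf_path("/repo", quant, {}))
```

Passing the environment as a mapping, rather than reading `os.environ` directly, keeps the precedence easy to check in isolation.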
@@ -104,11 +104,9 @@ fi

 # ---- 5. footgun: dot-vs-dash filename -------------------------------------
 #
-# Upstream
-#
-#
-# Qwen3.6-27B-Q4_K_M.gguf). Earlier commits used the wrong
-# dot-separated pattern, which 404s. Block re-introduction.
+# Upstream unsloth/Qwen3.6-27B-GGUF uses dashes (Qwen3.6-27B-Q4_K_M.gguf).
+# Earlier commits used the wrong dot-separated pattern, which 404s.
+# Block re-introduction.

 blue "[*] grep: forbidden Qwen3.6-27B.Q* filename pattern"
 if grep -RnE 'Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf' \
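The check above greps for the dot-separated file name pattern that 404s upstream. As a rough illustration, the same regular expression can be tried in Python. The pattern string is copied from the `grep -RnE` call; the helper name `forbidden_names` is hypothetical, and this sketch scans a single string rather than walking the repository.

```python
import re
from typing import List

# Same pattern as the grep call: model name, a literal dot, then a Q-quant.
FORBIDDEN = re.compile(r"Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf")


def forbidden_names(text: str) -> List[str]:
    """Return every dot-separated GGUF file name found in the text."""
    return FORBIDDEN.findall(text)


sample = "wget Qwen3.6-27B.Q4_K_M.gguf  # wrong\nwget Qwen3.6-27B-Q4_K_M.gguf  # right"
print(forbidden_names(sample))
```

As written, the pattern requires a `Q` right after the dot, so a dot-separated name such as `Qwen3.6-27B.IQ4_XS.gguf` would not be caught by it.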
@@ -8,20 +8,16 @@
 # it (see README Vision section, ollama/ollama#15898).
 #
 # Usage:
-# ./scripts/fetch_vision.sh # default:
-#
-#
-# for F16/F32 variants fall back to unsloth's reference projector:
-# REPO_ID=unsloth/Qwen3.6-27B-GGUF FILE_NAME=mmproj-F16.gguf ./scripts/fetch_vision.sh
-# (vision tokens are projected the same way across Qwen 3.6 27B
-# finetunes, so the unsloth projector is functionally interchangeable.)
+# ./scripts/fetch_vision.sh # default: F16, ~927 MB
+# ./scripts/fetch_vision.sh BF16 # ~931 MB
+# ./scripts/fetch_vision.sh F32 # ~1.8 GB
 #
 # Requires: huggingface-cli (or hf).
 set -euo pipefail

-PRECISION="${1:-${PRECISION:-
-REPO_ID="${REPO_ID:-
-FILE_NAME="
+PRECISION="${1:-${PRECISION:-F16}}"
+REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+FILE_NAME="mmproj-${PRECISION}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"

@@ -62,7 +58,7 @@ fi
 echo
 echo "[+] Done. Use it via:"
 echo " python ${ROOT}/examples/llama_cpp_vision.py \\"
-echo " --gguf /path/to/Qwen3.6-27B-
+echo " --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \\"
 echo " --mmproj ${DEST} \\"
 echo " --image /path/to/photo.jpg \\"
 echo " --prompt 'Describe this image.'"
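In the new version of `fetch_vision.sh`, the projector file name is derived from the precision (`mmproj-${PRECISION}.gguf`) and the destination falls back to `${ROOT}/${FILE_NAME}` unless `MMPROJ_PATH` is set. The sketch below restates that logic in Python. The helper names are hypothetical, the size table is copied from the usage comment, and rejecting unlisted precisions is an added assumption that the shell script, as far as this hunk shows, does not make.

```python
from typing import Mapping

# Approximate download sizes, taken from the script's usage comment.
APPROX_SIZES = {"F16": "~927 MB", "BF16": "~931 MB", "F32": "~1.8 GB"}


def mmproj_name(precision: str = "F16") -> str:
    """Build the projector file name, e.g. mmproj-F16.gguf."""
    if precision not in APPROX_SIZES:
        # Assumption: fail early on precisions the usage comment does not list.
        raise ValueError(f"unknown precision: {precision}")
    return f"mmproj-{precision}.gguf"


def resolve_dest(root: str, precision: str, env: Mapping[str, str]) -> str:
    """Mirror DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"."""
    return env.get("MMPROJ_PATH", "") or f"{root}/{mmproj_name(precision)}"


dest = resolve_dest("/repo", "BF16", {})
print(dest, APPROX_SIZES["BF16"])
```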