FoolDev, Claude Opus 4.7 committed on May 24

Commit 73e905b

1 Parent(s): 7097156

Revert base swap back to Qwen/Qwen3.6-27B (keep -Heretic name)

Undoes the Qwen → llmfan46/Qwen3.6-27B-uncensored-heretic-v2 base
swap from 16e1ddd. Project name string (Thanatos-27B-Heretic),
Ollama tag (thanatos-27b-heretic), HF repo URL, banner -HERETIC
wordmark, and git remote are all preserved per explicit choice
("undo base only, keep name").

- README: frontmatter base_model → Qwen/Qwen3.6-27B; drop
base_model_relation and heretic/uncensored tags (imatrix kept).
Tagline, badge, Architecture line, sibling paragraph, Quick-start
path C, Local-apps table, Vision section, Related-models table,
Credits, Known-limitations all back to vanilla framing. Added a
"Note on the name" callout explaining the name-vs-base mismatch.
- Tooling: scripts/build.sh + fetch_vision.sh REPO_ID back to
unsloth/Qwen3.6-27B-GGUF; filename pattern + Q3_K_S smallest
quant restored. Modelfile preamble flipped. transformers
example MODEL_ID back to Qwen/Qwen3.6-27B. examples/README.md
+ llama_cpp_vision.py recipes flipped. CITATION.cff
title/abstract/refs/keywords flipped. Makefile + .gitignore
comments flipped.
- banner.svg subtitle "Dense 27B · Opus 4.7 distilled · uncensored"
→ "Qwen 3.6 · Dense 27B · Opus 4.7 distilled"; PNG re-rasterized.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (14)

.gitignore +2 -8
CHANGELOG.md +49 -0
CITATION.cff +13 -18
Makefile +4 -4
Modelfile +7 -8
README.md +70 -105
banner.png +0 -0
banner.svg +1 -1
examples/README.md +14 -15
examples/llama_cpp_vision.py +7 -7
examples/transformers_quickstart.py +4 -7
scripts/build.sh +9 -8
scripts/check.sh +3 -5
scripts/fetch_vision.sh +7 -11

.gitignore CHANGED

@@ -5,22 +5,16 @@ __pycache__/
 .venv/
 venv/
-# Local model weights. We don't redistribute the Heretic v2 GGUFs
-# here — `make build` fetches one from
-# llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF locally.
 # The single Thanatos-27B.*.gguf we DO ship backs the HF/Ollama
 # "Use this model" widget (ollama run hf.co/FoolDev/Thanatos-27B-Heretic).
-# The bundled file is still named Thanatos-27B.*.gguf from before the
-# rename; whitelist also covers Thanatos-27B-Heretic.*.gguf for the
-# pending Heretic rebundle.
 *.gguf
 !Thanatos-27B.*.gguf
-!Thanatos-27B-Heretic.*.gguf
 # Local-only rebadge experiments produced by scripts/rename_arch.py.
 # These re-stamp general.architecture and are not loadable by current
 # ollama / llama.cpp; don't track or push them.
 Thanatos-27B.*.qwen[0-9]*.gguf
-Thanatos-27B-Heretic.*.qwen[0-9]*.gguf
 *.safetensors
 *.bin

 .venv/
 venv/
+# Local model weights. We don't redistribute the upstream Qwen GGUFs
+# here — `make build` fetches one from unsloth/Qwen3.6-27B-GGUF locally.
 # The single Thanatos-27B.*.gguf we DO ship backs the HF/Ollama
 # "Use this model" widget (ollama run hf.co/FoolDev/Thanatos-27B-Heretic).
 *.gguf
 !Thanatos-27B.*.gguf
 # Local-only rebadge experiments produced by scripts/rename_arch.py.
 # These re-stamp general.architecture and are not loadable by current
 # ollama / llama.cpp; don't track or push them.
 Thanatos-27B.*.qwen[0-9]*.gguf
 *.safetensors
 *.bin
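
The GGUF rules in the post-revert `.gitignore` rely on gitignore's last-match-wins ordering: `*.gguf` ignores everything, the `!` rule re-includes the bundled file, and the rebadge glob ignores the experiments again. A rough sketch of that ordering follows. It uses Python's `fnmatchcase` as a stand-in for git's matcher and covers only bare filenames. The helper `is_ignored` and the sample filenames are illustrative assumptions, not part of the repo.

```python
from fnmatch import fnmatchcase

# GGUF-related patterns from the post-revert .gitignore, in file order.
# A leading "!" re-includes a path that an earlier pattern ignored.
PATTERNS = [
    "*.gguf",
    "!Thanatos-27B.*.gguf",
    "Thanatos-27B.*.qwen[0-9]*.gguf",
]


def is_ignored(filename: str, patterns=PATTERNS) -> bool:
    """Approximate gitignore's last-match-wins rule for a bare filename.

    Hypothetical helper: real gitignore matching also handles directories,
    anchoring, and "**", none of which these patterns need.
    """
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(filename, glob):
            # The most recent matching pattern decides the outcome.
            ignored = not negated
    return ignored


if __name__ == "__main__":
    # Sample filenames are made up for illustration.
    for name in (
        "Thanatos-27B.Q4_K_M.gguf",
        "Thanatos-27B.Q4_K_M.qwen36.gguf",
        "Thanatos-27B-Heretic.Q4_K_M.gguf",
    ):
        print(name, "ignored" if is_ignored(name) else "tracked")
```

Under this approximation, a `Thanatos-27B-Heretic.*.gguf` file is ignored again once its whitelist line is dropped. For a real checkout, `git check-ignore -v <path>` is the authoritative check.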

CHANGELOG.md CHANGED

@@ -7,6 +7,55 @@ and documentation**, not the underlying base model.
 ## [Unreleased]
 ### Changed (acknowledge HF's `imatrix` auto-tag in frontmatter)
 - **Added `imatrix` to the README `tags:` list.** HF's tag
   auto-detector was surfacing `imatrix` on the rendered model

 ## [Unreleased]
+### Reverted (base swap to Heretic v2 — name kept, base back to vanilla Qwen)
+- **Undone the `Qwen/Qwen3.6-27B` → `llmfan46/Qwen3.6-27B-uncensored-heretic-v2`
+  base swap** that shipped in `16e1ddd` and was polished in
+  subsequent commits. Current base is back to vanilla
+  `Qwen/Qwen3.6-27B`. README frontmatter `base_model:`, the
+  `Base-…` badge, the Architecture line, the sibling paragraph,
+  the Quick-start path C, the Local-apps table, the Vision
+  section, the Related-models table, the Credits, and the
+  Known-limitations section all flipped back to the pre-swap
+  Qwen-only framing. `heretic` / `uncensored` tags removed
+  (`imatrix` stays — the bundled blob is still iMatrix-quantized
+  regardless of which base is described). `base_model_relation:
+  finetune` removed; this is a packaging wrapper, not a finetune.
+- **Tooling flipped back to unsloth's GGUF mirror.**
+  `scripts/build.sh` `REPO_ID` back to `unsloth/Qwen3.6-27B-GGUF`
+  with filename pattern `Qwen3.6-27B-${QUANT}.gguf`; quant list
+  back to the unsloth catalog (Q3_K_S restored as the smallest
+  practical quant). `scripts/fetch_vision.sh` defaults back to
+  `PRECISION=F16` and `mmproj-F16.gguf` from unsloth. Modelfile
+  preamble flipped. `examples/transformers_quickstart.py`
+  `MODEL_ID` back to `Qwen/Qwen3.6-27B`. `examples/README.md` and
+  `examples/llama_cpp_vision.py` recipes flipped. `CITATION.cff`
+  title, abstract, references, and keywords flipped. `Makefile`
+  help-text + `build` docstring flipped. `.gitignore` comments
+  + whitelist + rebadge-artifact glob flipped.
+- **`banner.svg`** subtitle reverted `Dense 27B · Opus 4.7
+  distilled · uncensored` → `Qwen 3.6 · Dense 27B · Opus 4.7
+  distilled`. `THANATOS-27B-HERETIC` wordmark **kept** — the
+  project name string and HF repo URL are preserved per explicit
+  choice ("undo base only, keep name"). `banner.png`
+  re-rasterized at 2× via rsvg-convert.
+- **Project name string `Thanatos-27B-Heretic` and Ollama tag
+  `thanatos-27b-heretic` retained** across all files. HF repo
+  URL stays at `FoolDev/Thanatos-27B-Heretic`; git remote
+  unchanged. A "Note on the name" callout added to the README
+  tagline explaining the name-vs-base mismatch so users aren't
+  surprised.
+- **Bundled blob unchanged** (`Thanatos-27B.Q4_K_M.gguf` LFS
+  pointer SHA `5ed60d0a...`). It was always the legacy unsloth
+  Qwen Q4_K_M quant; with the base reverted, the blob and the
+  declared base are now consistent again. The "Bundled blob
+  status" callout in TL;DR removed since it no longer applies.
+- **HF repo migration:** the HF repo at
+  `FoolDev/Thanatos-27B-Heretic` keeps its current name (the
+  user's earlier rename via HF UI stands). If you want to also
+  rename the HF repo back to `FoolDev/Thanatos-27B`, that's a
+  separate HF UI action — HF will serve a 307 redirect from the
+  new name to the old once renamed.
 ### Changed (acknowledge HF's `imatrix` auto-tag in frontmatter)
 - **Added `imatrix` to the README `tags:` list.** HF's tag
   auto-detector was surfacing `imatrix` on the rendered model
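
The changelog entry above names the restored `REPO_ID` and the filename pattern `Qwen3.6-27B-${QUANT}.gguf`. The sketch below shows how a quant name could map to a filename and a Hugging Face `resolve` URL under that pattern. The diff does not show how `scripts/build.sh` actually downloads the file, so the function names here are hypothetical. Whether a given quant exists upstream is not checked.

```python
# Values quoted from the changelog entry; everything else is illustrative.
REPO_ID = "unsloth/Qwen3.6-27B-GGUF"


def gguf_filename(quant: str) -> str:
    """Apply the Qwen3.6-27B-${QUANT}.gguf filename pattern."""
    return f"Qwen3.6-27B-{quant}.gguf"


def resolve_url(quant: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build a Hub download URL for the quant (assumes the file exists upstream)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{gguf_filename(quant)}"


if __name__ == "__main__":
    for quant in ("Q3_K_S", "Q4_K_M"):
        print(resolve_url(quant))
```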

CITATION.cff CHANGED

@@ -1,5 +1,5 @@
 cff-version: 1.2.0
-title: "Thanatos-27B-Heretic: A Dense Distillation Wrapper for llmfan46's Qwen 3.6 27B Uncensored Heretic v2"
 message: "If you use this model card or its accompanying files, please cite as below."
 type: software
 authors:
@@ -8,15 +8,17 @@ authors:
 repository-code: "https://huggingface.co/FoolDev/Thanatos-27B-Heretic"
 url: "https://huggingface.co/FoolDev/Thanatos-27B-Heretic"
 abstract: >-
-  Thanatos-27B-Heretic is a personal repackaging of llmfan46's uncensored
-  Heretic v2 finetune of Qwen 3.6 27B (dense), with Claude Opus 4.7 in
-  the reasoning teacher slot. The repository ships an Ollama Modelfile,
-  sampling defaults, usage examples, and a single ready-to-run GGUF
-  (Q4_K_M ~17 GB) so the HF "Use this model" widget surfaces a one-liner
-  Ollama snippet. Other quants (Q3_K_M, Q5_K_M, Q6_K, etc.) and the
-  Heretic safetensors are pulled from upstream
-  (llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF and the matching
-  non-GGUF repo) on demand rather than redistributed.
 keywords:
   - qwen
   - qwen3.6
@@ -24,17 +26,10 @@ keywords:
   - distillation
   - reasoning
   - llm
-  - heretic
-  - uncensored
 license: Apache-2.0
 references:
   - type: software
-    title: "Qwen3.6-27B-uncensored-heretic-v2 (immediate base)"
-    authors:
-      - name: llmfan46
-    url: "https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2"
-  - type: software
-    title: "Qwen3.6-27B (upstream base)"
     authors:
       - name: Alibaba Qwen Team
     url: "https://huggingface.co/Qwen/Qwen3.6-27B"

 cff-version: 1.2.0
+title: "Thanatos-27B-Heretic: A Dense Distillation Wrapper for Qwen 3.6 27B"
 message: "If you use this model card or its accompanying files, please cite as below."
 type: software
 authors:
 repository-code: "https://huggingface.co/FoolDev/Thanatos-27B-Heretic"
 url: "https://huggingface.co/FoolDev/Thanatos-27B-Heretic"
 abstract: >-
+  Thanatos-27B-Heretic is a personal repackaging of the dense Qwen 3.6 27B base
+  model with Claude Opus 4.7 in the reasoning teacher slot. The
+  repository ships an Ollama Modelfile, sampling defaults, usage
+  examples, and a single ready-to-run GGUF (Q4_K_M ~17 GB) so the HF
+  "Use this model" widget surfaces a one-liner Ollama snippet. Other
+  quants (Q3_K_S, Q5_K_M, Q6_K, etc.) and the upstream safetensors
+  (Qwen/Qwen3.6-27B) are pulled from upstream
+  (unsloth/Qwen3.6-27B-GGUF) on demand rather than redistributed.
+  (The repo carries the `-Heretic` suffix from a prior swap to
+  llmfan46/Qwen3.6-27B-uncensored-heretic-v2 that was reverted;
+  current base is vanilla Qwen 3.6 27B.)
 keywords:
   - qwen
   - qwen3.6
   - distillation
   - reasoning
   - llm
 license: Apache-2.0
 references:
   - type: software
+    title: "Qwen3.6-27B"
     authors:
       - name: Alibaba Qwen Team
     url: "https://huggingface.co/Qwen/Qwen3.6-27B"

Makefile CHANGED

@@ -10,9 +10,9 @@
 #   MODEL     model tag for smoke     (default: $(TAG))
 #
 # Examples:
-#   make build                          # Q4_K_M from llmfan46 Heretic v2 GGUF (qwen35-stamped, loads today)
-#   make build QUANT=Q3_K_M             # smaller quant (Heretic repo has no Q3_K_S)
-#   make build GGUF_PATH=~/models/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf
 #   make load-bundle                    # this repo's bundled GGUF -> local Ollama tag (smudge LFS if needed)
 #   make smoke
 #   make check
@@ -37,7 +37,7 @@ ifdef GGUF_PATH
 	@echo "  GGUF_PATH=$(GGUF_PATH)"
 endif
-build:  ## Download qwen35-stamped Heretic v2 GGUF from llmfan46 and run 'ollama create' (loads today).
 	GGUF_PATH=$(GGUF_PATH) TAG=$(TAG) ./scripts/build.sh $(QUANT)
 load-bundle:  ## Load THIS repo's bundled GGUF into a local Ollama tag (smudge LFS + ollama create).

 #   MODEL     model tag for smoke     (default: $(TAG))
 #
 # Examples:
+#   make build                          # Q4_K_M from unsloth (qwen35-stamped, loads today)
+#   make build QUANT=Q3_K_S             # smaller quant
+#   make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf
 #   make load-bundle                    # this repo's bundled GGUF -> local Ollama tag (smudge LFS if needed)
 #   make smoke
 #   make check
 	@echo "  GGUF_PATH=$(GGUF_PATH)"
 endif
+build:  ## Download qwen35-stamped GGUF from unsloth and run 'ollama create' (loads today).
 	GGUF_PATH=$(GGUF_PATH) TAG=$(TAG) ./scripts/build.sh $(QUANT)
 load-bundle:  ## Load THIS repo's bundled GGUF into a local Ollama tag (smudge LFS + ollama create).

Modelfile CHANGED

@@ -16,16 +16,15 @@
 # `e03e10e` after the 4th qwen36 round trip had its friction
 # re-tested in a fresh next-day session).
 #
-# For other quants (Q3_K_M, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_M`
-# downloads the chosen quant from llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF
-# (filename pattern Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf) and
-# patches FROM in a temp Modelfile copy. Note: no Q3_K_S in this repo;
-# use Q3_K_M for the smallest practical quant.
 #
 # Other GGUF sources (use with `make build GGUF_PATH=...`):
-#     https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF        # primary (this repo's default)
-#     https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF  # MTP head preserved
-#     https://huggingface.co/unsloth/Qwen3.6-27B-GGUF                               # vanilla Qwen 3.6 (pre-Heretic)
 FROM ./Thanatos-27B.Q4_K_M.gguf

 # `e03e10e` after the 4th qwen36 round trip had its friction
 # re-tested in a fresh next-day session).
 #
+# For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
+# downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
+# FROM in a temp Modelfile copy. The Q3_K_S used to ship in this repo;
+# it was removed so HF's Ollama bridge picks Q4_K_M as the default
+# `:latest` tag instead of Q3_K_S (alphabetically-first heuristic).
 #
 # Other GGUF sources (use with `make build GGUF_PATH=...`):
+#     https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
+#     https://huggingface.co/rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-GGUF
 FROM ./Thanatos-27B.Q4_K_M.gguf
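
The Modelfile comment says `make build QUANT=...` "patches FROM in a temp Modelfile copy". The script that does this is not part of the diff, so the following is only a minimal sketch of the idea: rewrite the first `FROM` directive and write the result to a temporary file, leaving the tracked `Modelfile` untouched. `patch_from` and `write_temp_modelfile` are hypothetical names, and the GGUF path is a placeholder.

```python
import re
import tempfile
from pathlib import Path


def patch_from(modelfile_text: str, gguf_path: str) -> str:
    """Return the Modelfile text with its first FROM directive pointing at gguf_path."""
    patched, count = re.subn(
        r"(?m)^FROM[ \t]+.*$",
        # A callable replacement avoids backslash escapes in the path being interpreted.
        lambda _match: f"FROM {gguf_path}",
        modelfile_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no FROM directive found")
    return patched


def write_temp_modelfile(source: Path, gguf_path: str) -> Path:
    """Write a patched copy to a temp file; the original Modelfile is not modified."""
    text = patch_from(source.read_text(encoding="utf-8"), gguf_path)
    with tempfile.NamedTemporaryFile(
        "w", suffix=".Modelfile", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(text)
    return Path(tmp.name)


if __name__ == "__main__":
    sample = "# comment mentioning FROM\nFROM ./Thanatos-27B.Q4_K_M.gguf\n"
    print(patch_from(sample, "/tmp/Qwen3.6-27B-Q3_K_S.gguf"))
```

Comment lines that mention `FROM` are left alone because the pattern only matches a line that starts with the directive. The temp file could then be passed to `ollama create <tag> -f <path>`.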

README.md CHANGED

@@ -1,8 +1,7 @@
 ---
 license: apache-2.0
 base_model:
-  - llmfan46/Qwen3.6-27B-uncensored-heretic-v2
-base_model_relation: finetune
 datasets:
   - crownelius/Creative_Writing_ShareGPT_Enhanced
   - microsoft/rStar-Coder
@@ -41,8 +40,6 @@ tags:
   - agent
   - gguf
   - ollama
-  - heretic
-  - uncensored
   - imatrix
 library_name: transformers
 pipeline_tag: image-text-to-text
@@ -51,19 +48,24 @@ pipeline_tag: image-text-to-text
 <img src="https://huggingface.co/FoolDev/Thanatos-27B-Heretic/resolve/main/banner.svg" alt="Thanatos-27B-Heretic banner" width="100%" />
 [![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
-[![Base Model](https://img.shields.io/badge/Base-Heretic_v2-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2)
 [![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
 [![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
 [![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)
 # Thanatos-27B-Heretic
-> **Dense Reasoning. Friendlier Footprint. Uncensored.**
-> *llmfan46's Heretic v2 abliteration of Qwen 3.6 27B (dense), repackaged with Claude Opus 4.7 in the teacher slot.*
-**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Base:`** `Heretic v2 (llmfan46)` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled + Abliterated LLM`
-A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on [`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) — an uncensored Heretic-style abliteration of the dense [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base — instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises, and refusal-trained behavior is dialed back at the base layer.
 ## TL;DR
@@ -76,25 +78,14 @@ template — HF's Ollama bridge ingests those three files, not
 ollama run hf.co/FoolDev/Thanatos-27B-Heretic           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
 ```
-> **Bundled blob status:** the GGUF currently bundled in this repo
-> is the legacy pre-Heretic Qwen 3.6 27B Q4_K_M quant from before
-> the rename. Behaves identically to vanilla Qwen 3.6 27B for now;
-> the Heretic v2 rebundle (from
-> `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) is pending —
-> see the top entry of [CHANGELOG](CHANGELOG.md). If you want the
-> Heretic behavior today, use the local-build path below
-> (`make build`), which pulls the Heretic GGUF directly.
 If you pulled the bundle during any of the qwen36 windows on the
 pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
 have a qwen36-stamped blob in your local Ollama store, `make
-heal-hf` rebadges it in place. Fresh pulls of the new
-`Thanatos-27B-Heretic` repo go straight through.
-For other quants (Q3_K_M ~13 GB, Q5_K_M ~19 GB, etc.), `make build
 QUANT=...` is the simplest path. See [Quick start](#quick-start)
-below for the full matrix. Note: no Q3_K_S in the Heretic GGUF
-repo — use Q3_K_M for the smallest practical quant.
 For image input use llama.cpp directly — Ollama vision is broken for
 this architecture upstream (see [Vision](#vision)).
@@ -103,7 +94,7 @@ this architecture upstream (see [Vision](#vision)).
 The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
-The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix, measured against the pre-rename Qwen 3.6 bundle; Heretic v2 inherits the same architecture so per-step cost should match) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
 | | Thanatos-27B-Heretic (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
 |---|---|---|
@@ -113,7 +104,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | Layers | 64 | 40 |
 | Hidden size | 5120 | 2048 |
 | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
-| Q3_K_M GGUF size | ~13 GB (build locally via `make build QUANT=Q3_K_M`) | n/a |
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
 | Multimodal (text path) | Yes | Yes |
 | Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
@@ -126,15 +117,15 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
 | `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
-| `Modelfile` | Ollama wrapper around the bundled GGUF (currently the legacy pre-Heretic Qwen 3.6 27B Q4_K_M; Heretic v2 rebundle pending) — used by `make build` / `ollama create` for **local** builds |
 | `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
-| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`). This is the path that gets you actual Heretic behavior until the bundled blob is rebundled. |
 | `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
-| `scripts/heal_hf_pull.sh` | Legacy recovery for users migrating from the pre-rename `FoolDev/Thanatos-27B` repo who still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls of `Thanatos-27B-Heretic` don't need it. |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
 | `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
-| `scripts/fetch_vision.sh` | Pulls the vision projector (`Qwen3.6-27B-mmproj-BF16.gguf` from the Heretic GGUF repo, or `mmproj-F16.gguf` from the unsloth reference projector) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
 | `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
 | `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exit non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
@@ -144,17 +135,16 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | `CHANGELOG.md` | Versioned tooling/docs changes |
 | `README.md` | This file |
-For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_M`
-downloads the smaller ~13 GB Q3_K_M quant from
-`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` (qwen35-stamped,
-loads directly) and creates a local `thanatos-27b-heretic` Ollama
-tag. Does not redistribute via this repo. For other quants use
-`make build QUANT=...`. The local-build path applies this repo's
-`Modelfile`; the `hf.co/...` path applies the root-level
-`template`, `system`, and `params` files (kept in sync with the
-`Modelfile`).
-If you want the Heretic safetensors for `transformers`, fetch them from [`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2). For the vanilla pre-Heretic Qwen 3.6 27B base, use [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
 ## Architecture
@@ -170,30 +160,23 @@ If you want the Heretic safetensors for `transformers`, fetch them from [`llmfan
 - Vocab 248,320 (shared with 35B-A3B sibling)
 - 262 144 native context, extensible to ~1 M with YaRN
 - Vision + video supported by the **base architecture** via a separate
-  `mmproj` projector (not redistributed here; pull
-  `Qwen3.6-27B-mmproj-BF16.gguf` from
-  `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`, or
-  `mmproj-F16.gguf` from `unsloth/Qwen3.6-27B-GGUF` as a reference
-  alternative). See [Vision](#vision) below for current loader
-  compatibility.
 - Multi-token prediction (MTP) head trained for speculative decoding —
   present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
   vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
   **Not usable via llama.cpp / Ollama today**: the GGUF converter
   (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
   `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
-  inference yet"), so the standard GGUFs (this bundle, unsloth's,
-  llmfan46's Heretic v2) ship with 851 tensors and no MTP head.
-  llmfan46 also publishes a separate
-  `Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF` repo
-  that keeps the MTP tensors for vLLM/SGLang users who want both
-  Heretic v2 + MTP. llama.cpp's MTP support (PR #22673, merged
-  2026-05-16) currently covers other architectures only; tracking
-  that PR's follow-up work for when qwen35 / qwen35moe consumer
-  support lands. (Earlier README versions claimed MTP was available
-  via llama.cpp without this caveat; the missing MTP tensors were
-  confirmed empirically via `gguf.GGUFReader` on both this bundle and
-  `unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
 **The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
 workaround for an unimplemented `qwen36` arch, but the canonical
@@ -209,11 +192,9 @@ stack:
   exists in `transformers`; Qwen reuses the 3.5 class names.
 - **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
   `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
-  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The Heretic
-  GGUFs this repo pulls from
-  (`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) inherit those
-  stamps, as do the upstream unsloth GGUFs (`unsloth/Qwen3.6-27B-GGUF`,
-  `unsloth/Qwen3.6-35B-A3B-GGUF`).
 - **llama.cpp's model code.** `src/models/qwen35.cpp` has an
   explicit `case 64: type = LLM_TYPE_27B` branch for this model;
   `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
@@ -307,14 +288,12 @@ ollama run hf.co/FoolDev/Thanatos-27B-Heretic           # 17 GB Q4_K_M, qwen35-s
 make load-bundle                                 # creates local tag thanatos-27b-heretic
 ollama run thanatos-27b-heretic
-# C. Bypass the bundle: download a qwen35-stamped Heretic v2 GGUF
-#    from llmfan46 and build locally. Loads on every current
-#    llama.cpp / Ollama. This is the path that gets you actual
-#    Heretic behavior until the bundled blob is rebundled.
 make build                                              # Q4_K_M  -> thanatos-27b-heretic
-make build QUANT=Q3_K_M                                 # 13 GB smaller quant
-make build QUANT=Q5_K_M                                 # 19 GB higher quality
-make build GGUF_PATH=~/models/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf   # skip download
 ollama run thanatos-27b-heretic
 ```
@@ -338,10 +317,10 @@ python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-
 | App | How to load this model |
 |---|---|
-| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_M` downloads from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
-| **LM Studio** | Search → `FoolDev/Thanatos-27B-Heretic` → pick `Thanatos-27B.Q4_K_M.gguf` (current bundled filename; will become `Thanatos-27B-Heretic.Q4_K_M.gguf` after the rebundle). Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
 | **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B-Heretic`. Same template behavior as LM Studio. |
-| **llama.cpp** | `hf download FoolDev/Thanatos-27B-Heretic Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via `Qwen3.6-27B-mmproj-BF16.gguf` from the Heretic GGUF repo). |
 | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 | **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
@@ -397,21 +376,17 @@ Behavior rules:
 ## Vision
-The Qwen 3.6 base (and llmfan46's Heretic v2 finetune of it) supports
-image (and video) input via a separate `mmproj` projector. The full
-multimodal stack is:
 ```
-Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf   (~17 GB, the text decoder)
-Qwen3.6-27B-mmproj-BF16.gguf                    (~931 MB, the vision projector)
 ```
 Both files are at
-[`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF).
-For the vanilla pre-Heretic projector, see
-[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
-(`mmproj-F16.gguf`, ~927 MB). This repo intentionally does not
-redistribute either.
 ### Loader compatibility — the honest table
@@ -429,11 +404,10 @@ Three flavors, in order of build-time effort:
 ```bash
 # A. HTTP via llama-server (always built — the easiest path).
 #    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
-#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU (pre-Heretic Qwen 3.6
-#    bundle; Heretic v2 shares the architecture so the recipe carries).
 llama-server \
-  -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
-  --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
   --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
 # then POST OpenAI-style chat completions with an image_url content
 # block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
@@ -446,15 +420,15 @@ llama-server \
 #    produce it — a plain `cmake --build build` will. If yours didn't,
 #    run `cmake --build build --target llama-mtmd-cli`.
 llama-mtmd-cli \
-  -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
-  --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
   --image photo.jpg \
   -p "Describe this image."
 # C. Python via llama-cpp-python:
 python examples/llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
-  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
   --image /path/to/photo.jpg \
   --prompt "What is in this image?"
 ```
@@ -472,22 +446,19 @@ The dense 27B is the lighter sibling to Janus-35B and the easier of the two to d
 | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
 | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
 | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
-| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_M` (~13 GB) and trim `num_ctx` for headroom. |
 Most numbers in this table are estimates from comparable models; the
 gradient is right but the absolute values will move ±20% with prompt
 shape, KV cache type, and parallel-request count. Measure your own
 machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
 `eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
-data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan
-(measured against the pre-rename Qwen 3.6 bundle; Heretic v2 inherits
-the architecture so per-step cost should match within bench noise):
 **~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
 steady across short / medium / long prompts), sitting between CPU-only
 and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
 same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
-this hardware. (Heretic v2 publishes Q3_K_M rather than Q3_K_S; the
-~13 GB Q3_K_M should sit within 5% of the ~12 GB Q3_K_S numbers.)
 ## Chat template
@@ -588,25 +559,19 @@ python examples/ollama_chat.py        # section 3 runs a real round-trip
 - **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
 - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
 - **No formal evaluation in this card.** Numbers above are estimates.
-- **Bundled blob is pre-Heretic.** The currently-bundled `Thanatos-27B.Q4_K_M.gguf` blob is the legacy Qwen 3.6 27B Q4_K_M quant from before the rename — it behaves like vanilla Qwen 3.6, not Heretic v2. Use `make build` (which pulls the Heretic GGUF from llmfan46) until the rebundle ships.
-- **Uncensored base.** The Heretic v2 abliteration dials back the refusal-training of upstream Qwen 3.6. Outputs may be more compliant with sensitive requests than the vanilla base; the Thanatos system prompt still steers behavior, but the safety floor is lower. Apply your own filtering for user-facing deployments.
 ## Related models
 | Model | Notes |
 |---|---|
-| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) | **Immediate base**, safetensors |
-| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF) | Recommended GGUF source (what `make build` pulls from) |
-| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved) | Same Heretic v2 but keeps the MTP head for vLLM / SGLang speculative decoding |
-| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream pre-Heretic base, safetensors |
-| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Pre-Heretic GGUF mirror + reference `mmproj-F16.gguf` projector |
 | [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
 | [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
 ## Credits
-- Immediate base: [llmfan46/Qwen3.6-27B-uncensored-heretic-v2](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) — Heretic-style abliteration of Qwen 3.6 27B
-- Upstream base: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
 - Reasoning teacher: Claude Opus 4.7 (Anthropic)
 - Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

 ---
 license: apache-2.0
 base_model:
+  - Qwen/Qwen3.6-27B
 datasets:
   - crownelius/Creative_Writing_ShareGPT_Enhanced
   - microsoft/rStar-Coder
   - agent
   - gguf
   - ollama
   - imatrix
 library_name: transformers
 pipeline_tag: image-text-to-text
 <img src="https://huggingface.co/FoolDev/Thanatos-27B-Heretic/resolve/main/banner.svg" alt="Thanatos-27B-Heretic banner" width="100%" />
 [![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
+[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
 [![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
 [![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
 [![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)
 # Thanatos-27B-Heretic
+> **Dense Reasoning. Friendlier Footprint.**
+> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
+**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
+A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
+> **Note on the name.** The repo carries the `-Heretic` suffix from a
+> prior swap to `llmfan46/Qwen3.6-27B-uncensored-heretic-v2` that was
+> reverted. The current base is the vanilla `Qwen/Qwen3.6-27B`; the
+> name string and HF repo URL are kept for continuity.
 ## TL;DR
 ollama run hf.co/FoolDev/Thanatos-27B-Heretic           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
 ```
 If you pulled the bundle during any of the qwen36 windows on the
 pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
 have a qwen36-stamped blob in your local Ollama store, `make
+heal-hf` rebadges it in place. Fresh pulls go straight through.
+For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
 QUANT=...` is the simplest path. See [Quick start](#quick-start)
+below for the full matrix.
 For image input use llama.cpp directly — Ollama vision is broken for
 this architecture upstream (see [Vision](#vision)).
 The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
+The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
 | | Thanatos-27B-Heretic (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
 |---|---|---|
 | Layers | 64 | 40 |
 | Hidden size | 5120 | 2048 |
 | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
+| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
 | Multimodal (text path) | Yes | Yes |
 | Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
 | `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
+| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF — used by `make build` / `ollama create` for **local** builds |
 | `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
+| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
 | `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
+| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B-Heretic` (or the pre-rename `FoolDev/Thanatos-27B`) *before* the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls don't need it. |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
 | `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
+| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
 | `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
 | `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
 | `CHANGELOG.md` | Versioned tooling/docs changes |
 | `README.md` | This file |
+For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
+downloads the smaller ~12 GB Q3_K_S quant from
+`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
+creates a local `thanatos-27b-heretic` Ollama tag. This repo does not
+redistribute that quant. For other quants use `make build QUANT=...`. The
+local-build path applies this repo's `Modelfile`; the `hf.co/...`
+path applies the root-level `template`, `system`, and `params`
+files (kept in sync with the `Modelfile`).
+If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
 ## Architecture
 - Vocab 248,320 (shared with 35B-A3B sibling)
 - 262,144 native context, extensible to ~1 M with YaRN
 - Vision + video supported by the **base architecture** via a separate
+  `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
+  from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
+  current loader compatibility.
 - Multi-token prediction (MTP) head trained for speculative decoding —
   present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
   vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
   **Not usable via llama.cpp / Ollama today**: the GGUF converter
   (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
   `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
+  inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
+  851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
+  merged 2026-05-16) currently covers other architectures only;
+  tracking that PR's follow-up work for when qwen35 / qwen35moe
+  consumer support lands. (Earlier README versions claimed MTP was
+  available without this caveat; the missing MTP tensors were confirmed
+  empirically via `gguf.GGUFReader` on both this bundle and
+  `unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
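The MTP check described in the bullet above can be reproduced with a few lines of Python. This is a minimal sketch, not the script used for this README: the helper names (`find_mtp_tensors`, `inspect_gguf`) are hypothetical, and the name markers `"mtp"` / `"nextn"` are an assumption about how MTP tensors would be named if a converter kept them. Only `GGUFReader` and its `tensors` list come from the `gguf` package.

```python
from typing import Iterable

# Assumption: MTP / next-token-prediction tensors, if present, carry one of
# these substrings in their names. Adjust to match your converter's naming.
MTP_MARKERS = ("mtp", "nextn")


def find_mtp_tensors(names: Iterable[str]) -> list[str]:
    """Return the tensor names that look like MTP-head tensors."""
    return [n for n in names if any(m in n.lower() for m in MTP_MARKERS)]


def inspect_gguf(path: str) -> tuple[int, list[str]]:
    """Return (total tensor count, MTP-looking tensor names) for a GGUF file."""
    # Imported lazily so the pure helper above works without the package.
    from gguf import GGUFReader  # pip install gguf

    reader = GGUFReader(path)
    names = [t.name for t in reader.tensors]
    return len(names), find_mtp_tensors(names)
```

Calling `inspect_gguf("Thanatos-27B.Q4_K_M.gguf")` on a file without an MTP head should return the total tensor count and an empty list; a non-empty list means tensors matching the assumed markers were found and are worth inspecting by hand.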
 **The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
 workaround for an unimplemented `qwen36` arch, but the canonical
   exists in `transformers`; Qwen reuses the 3.5 class names.
 - **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
   `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
+  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
+  GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
+  `unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
 - **llama.cpp's model code.** `src/models/qwen35.cpp` has an
   explicit `case 64: type = LLM_TYPE_27B` branch for this model;
   `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
 make load-bundle                                 # creates local tag thanatos-27b-heretic
 ollama run thanatos-27b-heretic
+# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
+#    and build locally. Loads on every current llama.cpp / Ollama.
 make build                                              # Q4_K_M  -> thanatos-27b-heretic
+make build QUANT=Q3_K_S                                 # 12 GB smaller quant
+make build QUANT=Q5_K_M                                 # 20 GB higher quality
+make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf   # skip download
 ollama run thanatos-27b-heretic
 ```
 | App | How to load this model |
 |---|---|
+| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B-Heretic` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
+| **LM Studio** | Search → `FoolDev/Thanatos-27B-Heretic` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
 | **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B-Heretic`. Same template behavior as LM Studio. |
+| **llama.cpp** | `hf download FoolDev/Thanatos-27B-Heretic Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
 | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 | **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
 ## Vision
+The Qwen 3.6 base supports image (and video) input via a separate
+`mmproj` projector. The full multimodal stack is:
 ```
+Qwen3.6-27B-Q4_K_M.gguf   (~17 GB, the text decoder)
+mmproj-F16.gguf           (~927 MB, the vision projector)
 ```
 Both files are at
+[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
+This repo intentionally does not redistribute either.
 ### Loader compatibility — the honest table
 ```bash
 # A. HTTP via llama-server (always built — the easiest path).
 #    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
+#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
 llama-server \
+  -m Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj mmproj-F16.gguf \
   --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
 # then POST OpenAI-style chat completions with an image_url content
 # block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
 #    produce it — a plain `cmake --build build` will. If yours didn't,
 #    run `cmake --build build --target llama-mtmd-cli`.
 llama-mtmd-cli \
+  -m Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj mmproj-F16.gguf \
   --image photo.jpg \
   -p "Describe this image."
 # C. Python via llama-cpp-python:
 python examples/llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj /path/to/mmproj-F16.gguf \
   --image /path/to/photo.jpg \
   --prompt "What is in this image?"
 ```
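For path A above, the request body can be built with the standard library alone. The sketch below is illustrative rather than taken from this repo's examples: the function names are hypothetical, the host and port simply mirror the `llama-server` command above, and `max_tokens=256` is an arbitrary choice. It assumes the server exposes the usual OpenAI-compatible `/v1/chat/completions` route.

```python
import base64
import json
import urllib.request


def build_vision_payload(image_bytes: bytes, prompt: str,
                         mime: str = "image/jpeg",
                         max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat request with one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # The image travels inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict, base_url: str = "http://127.0.0.1:8765") -> str:
    """POST the payload to a running llama-server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from path A running, `post_chat(build_vision_payload(open("photo.jpg", "rb").read(), "Describe this image."))` should return the model's description as a string.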
 | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
 | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
 | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
+| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |
 Most numbers in this table are estimates from comparable models; the
 gradient is right but the absolute values will move ±20% with prompt
 shape, KV cache type, and parallel-request count. Measure your own
 machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
 `eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
+data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
 **~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
 steady across short / medium / long prompts), sitting between CPU-only
 and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
 same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
+this hardware.
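The tok/s figure comes from two fields in Ollama's response metadata: `eval_count` (generated tokens) and `eval_duration` (decode time in nanoseconds). A minimal sketch of the arithmetic follows. The function names are hypothetical, and pooling tokens and time across prompts is an assumption about how a multi-prompt mix might be aggregated; `scripts/bench.sh` may do it differently.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(response: dict) -> float:
    """Decode throughput for one Ollama response (eval_duration is in ns)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] * NANOSECONDS_PER_SECOND / duration_ns


def mix_tokens_per_second(responses: list[dict]) -> float:
    """Pooled throughput: total tokens over total decode time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    if total_ns <= 0:
        raise ValueError("total eval_duration must be positive")
    return total_tokens * NANOSECONDS_PER_SECOND / total_ns
```

Because both fields are reported by the server, the result excludes model load time and prompt processing, which is why it is steadier than timing the request with a stopwatch.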
 ## Chat template
 - **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
 - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
 - **No formal evaluation in this card.** Numbers above are estimates.
 ## Related models
 | Model | Notes |
 |---|---|
+| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
+| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
 | [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
 | [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
 ## Credits
+- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
 - Reasoning teacher: Claude Opus 4.7 (Anthropic)
 - Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

banner.png CHANGED

banner.svg CHANGED

examples/README.md CHANGED

@@ -5,9 +5,9 @@ Four minimal entry points. Pick the one that matches how you run models.
 | File | Backend | When to use |
 |---|---|---|
 | `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b-heretic` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
-| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the Heretic safetensors (`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`) on GPU, optionally in 4-bit via bitsandbytes. |
 | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
-| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `Qwen3.6-27B-mmproj-BF16.gguf` and answers questions about an image. The only working vision path right now. |
 All four apply the same Thanatos system prompt and sampling defaults
 (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
@@ -36,13 +36,12 @@ in place (qwen36 → qwen35, metadata-only, ~5 s) — the same
 tag then loads. Fresh pulls after the re-stamp go straight
 through.
-For a non-bundled quant (e.g. Q3_K_M ~13 GB, Q5_K_M ~19 GB),
-`make build QUANT=...` downloads from
-`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and creates a
-local `thanatos-27b-heretic` tag:
 ```bash
-cd ..  &&  make build QUANT=Q3_K_M  &&  cd examples
 MODEL=thanatos-27b-heretic python ollama_chat.py
 ```
@@ -55,8 +54,8 @@ MODEL=thanatos-27b-heretic python ollama_chat.py
 ```
 For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
-fetch it from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and
-patch the `Modelfile` `FROM` line into a temp copy automatically:
 ```bash
 cd ..  &&  make build QUANT=Q5_K_M  &&  cd examples
@@ -75,7 +74,7 @@ python transformers_quickstart.py --no-4bit  # bf16, ~54 GB VRAM
 ```bash
 pip install llama-cpp-python  # CPU-only build
-python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf --gpu-layers 99
 ```
 For GPU offload, rebuild llama-cpp-python with the matching backend — see
@@ -84,13 +83,13 @@ the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
 ### Vision (image input)
 ```bash
-# Pull the projector once (~931 MB):
-hf download llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF Qwen3.6-27B-mmproj-BF16.gguf --local-dir .
 pip install llama-cpp-python pillow
 python llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
-  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
   --image /path/to/photo.jpg \
   --prompt "Describe this image."
 ```
@@ -102,7 +101,7 @@ lacks them. `ollama create` accepts the dual-`FROM` and `ollama show`
 reports `vision` capability, but the first inference call fails with
 `error loading model architecture: unknown model architecture:
 'qwen35'` (verified empirically against the dense 27B +
-the F16 reference projector). Tracked in
 [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
 Until that's fixed, llama.cpp / llama-cpp-python is the working path
 for vision.

 | File | Backend | When to use |
 |---|---|---|
 | `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b-heretic` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
+| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
 | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
+| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-F16.gguf` and answers questions about an image. The only working vision path right now. |
 All four apply the same Thanatos system prompt and sampling defaults
 (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
 tag then loads. Fresh pulls after the re-stamp go straight
 through.
+For a non-bundled quant (e.g. Q3_K_S ~12 GB, Q5_K_M ~20 GB),
+`make build QUANT=...` downloads from `unsloth/Qwen3.6-27B-GGUF`
+and creates a local `thanatos-27b-heretic` tag:
 ```bash
+cd ..  &&  make build QUANT=Q3_K_S  &&  cd examples
 MODEL=thanatos-27b-heretic python ollama_chat.py
 ```
 ```
 For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
+fetch it from `unsloth/Qwen3.6-27B-GGUF` and patch the `Modelfile`
+`FROM` line into a temp copy automatically:
 ```bash
 cd ..  &&  make build QUANT=Q5_K_M  &&  cd examples
 ```bash
 pip install llama-cpp-python  # CPU-only build
+python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
 ```
 For GPU offload, rebuild llama-cpp-python with the matching backend — see
 ### Vision (image input)
 ```bash
+# Pull the projector once (~927 MB):
+hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf --local-dir .
 pip install llama-cpp-python pillow
 python llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+  --mmproj /path/to/mmproj-F16.gguf \
   --image /path/to/photo.jpg \
   --prompt "Describe this image."
 ```
 reports `vision` capability, but the first inference call fails with
 `error loading model architecture: unknown model architecture:
 'qwen35'` (verified empirically against the dense 27B +
+`mmproj-F16.gguf`). Tracked in
 [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
 Until that's fixed, llama.cpp / llama-cpp-python is the working path
 for vision.

examples/llama_cpp_vision.py CHANGED

@@ -23,21 +23,21 @@ Install:
     #   CMAKE_ARGS="-DGGML_METAL=on"  pip install llama-cpp-python --no-binary :all:
     #   CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
-Files you need (both from llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF):
-    1. A text GGUF (any quant): e.g. Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf  (~17 GB)
-    2. A vision projector:        Qwen3.6-27B-mmproj-BF16.gguf                      (~931 MB)
 Usage:
     python llama_cpp_vision.py \
-        --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
-        --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
         --image  /path/to/photo.jpg \
         --prompt "What is in this image? Be specific."
     # CLI alternative without python binding (ships with llama.cpp):
     #   llama-mtmd-cli \
-    #     -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
-    #     --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
     #     --image photo.jpg \
     #     -p "Describe this image."
 """

     #   CMAKE_ARGS="-DGGML_METAL=on"  pip install llama-cpp-python --no-binary :all:
     #   CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
+Files you need (both from unsloth/Qwen3.6-27B-GGUF):
+    1. A text GGUF (any quant): e.g. Qwen3.6-27B-Q4_K_M.gguf  (~17 GB)
+    2. A vision projector:        mmproj-F16.gguf              (~927 MB)
 Usage:
     python llama_cpp_vision.py \
+        --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
+        --mmproj /path/to/mmproj-F16.gguf \
         --image  /path/to/photo.jpg \
         --prompt "What is in this image? Be specific."
     # CLI alternative without python binding (ships with llama.cpp):
     #   llama-mtmd-cli \
+    #     -m Qwen3.6-27B-Q4_K_M.gguf \
+    #     --mmproj mmproj-F16.gguf \
     #     --image photo.jpg \
     #     -p "Describe this image."
 """

examples/transformers_quickstart.py CHANGED

@@ -2,14 +2,11 @@
 """
 Thanatos-27B-Heretic — Hugging Face Transformers quickstart.
-Loads the Heretic v2 Qwen 3.6 27B safetensors directly and runs a single
 chat turn using its embedded chat template. Thanatos-27B-Heretic is a
 *wrapper* around that base, so for the transformers route there is nothing
-to download from this repo — point at llmfan46/Qwen3.6-27B-uncensored-heretic-v2
-and apply the same system prompt the Modelfile uses.
-Set MODEL_ID = "Qwen/Qwen3.6-27B" to bypass the Heretic abliteration and
-load the vanilla upstream base instead.
 Requirements:
     pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
@@ -39,7 +36,7 @@ except ImportError as e:  # pragma: no cover
     )
-MODEL_ID = "llmfan46/Qwen3.6-27B-uncensored-heretic-v2"
 THANATOS_SYSTEM = (
     "You are Thanatos, a precise and capable assistant for reasoning, writing, "

 """
 Thanatos-27B-Heretic — Hugging Face Transformers quickstart.
+Loads the upstream Qwen 3.6 27B safetensors directly and runs a single
 chat turn using its embedded chat template. Thanatos-27B-Heretic is a
 *wrapper* around that base, so for the transformers route there is nothing
+to download from this repo — point at Qwen/Qwen3.6-27B and apply the same
+system prompt the Modelfile uses.
 Requirements:
     pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
     )
+MODEL_ID = "Qwen/Qwen3.6-27B"
 THANATOS_SYSTEM = (
     "You are Thanatos, a precise and capable assistant for reasoning, writing, "

scripts/build.sh CHANGED

@@ -7,20 +7,21 @@
 #   QUANT=Q6_K ./scripts/build.sh
 #
 # Skip the download by pointing at a GGUF you already have:
-#   GGUF_PATH=/path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
 #
 # Requires: huggingface-cli (or hf), ollama, awk.
 set -euo pipefail
 QUANT="${1:-${QUANT:-Q4_K_M}}"
-REPO_ID="${REPO_ID:-llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF}"
-# Filenames at llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF follow
-#   Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf
-# Quants known to exist (as of 2026-05):
-#   Q3_K_M Q3_K_L Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0 BF16
-# Note: no Q3_K_S in this repo — use Q3_K_M for the smallest practical quant.
-GGUF_NAME="Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 # GGUF_PATH defaults to ${ROOT}/${GGUF_NAME}, but can be overridden so users
 # with cached weights elsewhere don't have to copy or symlink anything.

 #   QUANT=Q6_K ./scripts/build.sh
 #
 # Skip the download by pointing at a GGUF you already have:
+#   GGUF_PATH=/path/to/Qwen3.6-27B-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
 #
 # Requires: huggingface-cli (or hf), ollama, awk.
 set -euo pipefail
 QUANT="${1:-${QUANT:-Q4_K_M}}"
+REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+# Upstream uses dashes, e.g. Qwen3.6-27B-Q4_K_M.gguf. Quants known to exist
+# at unsloth/Qwen3.6-27B-GGUF (as of 2026-04):
+#   Q3_K_S Q3_K_M Q4_0 Q4_1 Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0
+#   IQ4_XS IQ4_NL
+#   UD-IQ2_XXS UD-IQ2_M UD-Q2_K_XL UD-IQ3_XXS UD-Q3_K_XL UD-Q4_K_XL
+#   UD-Q5_K_XL UD-Q6_K_XL UD-Q8_K_XL
+GGUF_NAME="Qwen3.6-27B-${QUANT}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 # GGUF_PATH defaults to ${ROOT}/${GGUF_NAME}, but can be overridden so users
 # with cached weights elsewhere don't have to copy or symlink anything.
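
To make the naming rule in the new side of this hunk concrete, here is a minimal Python sketch of it (the script itself is bash). Only the `Qwen3.6-27B-${QUANT}.gguf` pattern, the `Q4_K_M` default, and the existence of a `GGUF_PATH` override come from the diff. The helper names `gguf_name` and `resolve_gguf_path` are hypothetical, and the sketch assumes the override follows the usual `${VAR:-default}` semantics, where an empty value counts as unset.

```python
import os
from pathlib import Path

DEFAULT_QUANT = "Q4_K_M"  # same default the diff shows for QUANT


def gguf_name(quant: str = DEFAULT_QUANT) -> str:
    """Return the dash-separated upstream filename for a given quant."""
    return f"Qwen3.6-27B-{quant}.gguf"


def resolve_gguf_path(root, quant: str = DEFAULT_QUANT, env=None) -> Path:
    """Prefer a non-empty GGUF_PATH; otherwise fall back to <root>/<name>.

    Assumption: this mirrors a bash "${GGUF_PATH:-...}" default, which the
    diff describes in a comment but does not show in full.
    """
    env = os.environ if env is None else env
    override = env.get("GGUF_PATH")
    return Path(override) if override else Path(root) / gguf_name(quant)
```

For example, `resolve_gguf_path("/repo", "Q5_K_M", env={})` yields a path ending in `Qwen3.6-27B-Q5_K_M.gguf`, while any non-empty `GGUF_PATH` is returned unchanged.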

scripts/check.sh CHANGED

@@ -104,11 +104,9 @@ fi
 # ---- 5. footgun: dot-vs-dash filename -------------------------------------
 #
-# Upstream llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF (and the
-# legacy unsloth/Qwen3.6-27B-GGUF) use dashes
-# (Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf,
-#  Qwen3.6-27B-Q4_K_M.gguf). Earlier commits used the wrong
-# dot-separated pattern, which 404s. Block re-introduction.
 blue "[*] grep: forbidden Qwen3.6-27B.Q* filename pattern"
 if grep -RnE 'Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf' \

 # ---- 5. footgun: dot-vs-dash filename -------------------------------------
 #
+# Upstream unsloth/Qwen3.6-27B-GGUF uses dashes (Qwen3.6-27B-Q4_K_M.gguf).
+# Earlier commits used the wrong dot-separated pattern, which 404s.
+# Block re-introduction.
 blue "[*] grep: forbidden Qwen3.6-27B.Q* filename pattern"
 if grep -RnE 'Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf' \
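
The check above can be tried outside the shell. The following is a rough Python equivalent that reuses the same regular expression as the `grep -RnE` call. The function name `find_dot_filenames` is hypothetical, and the sketch scans a single string rather than walking a directory tree.

```python
import re

# Same pattern as the grep in check.sh: a dot (not a dash) between
# "27B" and the quant suffix.
FORBIDDEN = re.compile(r"Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf")


def find_dot_filenames(text: str) -> list:
    """Return every dot-separated GGUF filename found in text."""
    return FORBIDDEN.findall(text)
```

As written, the pattern only matches quant names that start with `Q`. A dot-separated `IQ4_XS` or `UD-Q4_K_XL` filename would not be flagged, which may or may not be intended.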

scripts/fetch_vision.sh CHANGED

@@ -8,20 +8,16 @@
 #   it (see README Vision section, ollama/ollama#15898).
 #
 # Usage:
-#   ./scripts/fetch_vision.sh                    # default: BF16 (~931 MB)
-#
-# llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF publishes BF16 only;
-# for F16/F32 variants fall back to unsloth's reference projector:
-#   REPO_ID=unsloth/Qwen3.6-27B-GGUF FILE_NAME=mmproj-F16.gguf ./scripts/fetch_vision.sh
-# (vision tokens are projected the same way across Qwen 3.6 27B
-# finetunes, so the unsloth projector is functionally interchangeable.)
 #
 # Requires: huggingface-cli (or hf).
 set -euo pipefail
-PRECISION="${1:-${PRECISION:-BF16}}"
-REPO_ID="${REPO_ID:-llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF}"
-FILE_NAME="${FILE_NAME:-Qwen3.6-27B-mmproj-${PRECISION}.gguf}"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"
@@ -62,7 +58,7 @@ fi
 echo
 echo "[+] Done. Use it via:"
 echo "    python ${ROOT}/examples/llama_cpp_vision.py \\"
-echo "        --gguf  /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \\"
 echo "        --mmproj ${DEST} \\"
 echo "        --image /path/to/photo.jpg \\"
 echo "        --prompt 'Describe this image.'"

 #   it (see README Vision section, ollama/ollama#15898).
 #
 # Usage:
+#   ./scripts/fetch_vision.sh                    # default: F16, ~927 MB
+#   ./scripts/fetch_vision.sh BF16               # ~931 MB
+#   ./scripts/fetch_vision.sh F32                # ~1.8 GB
 #
 # Requires: huggingface-cli (or hf).
 set -euo pipefail
+PRECISION="${1:-${PRECISION:-F16}}"
+REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+FILE_NAME="mmproj-${PRECISION}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"
 echo
 echo "[+] Done. Use it via:"
 echo "    python ${ROOT}/examples/llama_cpp_vision.py \\"
+echo "        --gguf  /path/to/Qwen3.6-27B-Q4_K_M.gguf \\"
 echo "        --mmproj ${DEST} \\"
 echo "        --image /path/to/photo.jpg \\"
 echo "        --prompt 'Describe this image.'"
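
For reference, the precedence encoded by `PRECISION="${1:-${PRECISION:-F16}}"` in the new side can be sketched in Python as follows. The function names are hypothetical. The sketch assumes standard bash `:-` behaviour, where an empty positional argument or an empty variable falls through to the next default.

```python
def resolve_precision(args, env) -> str:
    """First positional argument, then $PRECISION, then "F16"."""
    first = args[0] if args else ""
    return first or env.get("PRECISION") or "F16"


def mmproj_name(precision: str) -> str:
    """Filename pattern used by the new side of the diff."""
    return f"mmproj-{precision}.gguf"


if __name__ == "__main__":
    import os
    import sys

    # Prints the projector filename the script would look for.
    print(mmproj_name(resolve_precision(sys.argv[1:], os.environ)))
```

In the new side, `FILE_NAME` is assigned unconditionally, so the earlier `FILE_NAME=...` environment override no longer takes effect. `REPO_ID` and `MMPROJ_PATH` remain overridable.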