FoolDev Claude Opus 4.7 committed on May 23

Commit: 16e1ddd

Parent: 9cf363e

Rename to Thanatos-Heretic-27B and swap base to llmfan46 Heretic v2

Project rename Thanatos-27B -> Thanatos-Heretic-27B (Ollama tag
thanatos-heretic-27b) and immediate-base swap from Qwen/Qwen3.6-27B
to llmfan46/Qwen3.6-27B-uncensored-heretic-v2 (an uncensored Heretic
abliteration of the same Qwen 3.6 27B dense arch).

Docs + Modelfile + scripts only — bundled Thanatos-27B.Q4_K_M.gguf
LFS pointer unchanged. The blob is still the legacy pre-Heretic
Qwen quant; README "Bundled blob status" callout + Known Limitations
warn users until the rebundle ships.

- scripts/build.sh: REPO_ID -> llmfan46 Heretic GGUF, filename
pattern Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf, default
TAG thanatos-heretic-27b. Q3_K_S replaced by Q3_K_M throughout
(Heretic repo doesn't publish Q3_K_S).
- scripts/fetch_vision.sh: PRECISION=BF16, REPO_ID -> llmfan46,
FILE_NAME=Qwen3.6-27B-mmproj-BF16.gguf. Unsloth's mmproj-F16.gguf
documented as a reference fallback.
- README: tagline, base_model frontmatter, badge, Vision section,
Related models, Credits, hardware/quick-start tables all flipped
to the Heretic lineage. Architecture section unchanged — Heretic
v2 is qwen35-stamped like vanilla Qwen 3.6 27B.
- CHANGELOG: top entry documents the rename + base swap; historical
entries below intentionally left referring to Thanatos-27B as
they happened on the old repo identity.

HF repo migration (new FoolDev/Thanatos-Heretic-27B repo + remote
re-point + old-repo migration notice) and Heretic re-quantization
rebundle are separate follow-ups.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
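
The `scripts/build.sh` change described above can be sketched in Python. This is a minimal illustration, not the script itself: the repo id, filename pattern, and default tag are taken from the commit message, while the helper name `resolve_gguf` and the `UNPUBLISHED` mapping are hypothetical names introduced here.

```python
# Hypothetical helper mirroring the naming described in the commit message.
# scripts/build.sh is a shell script and is not reproduced here.
REPO_ID = "llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF"
FILENAME_PATTERN = "Qwen3.6-27B-uncensored-heretic-v2-{quant}.gguf"
DEFAULT_TAG = "thanatos-heretic-27b"

# Quants the commit says the Heretic repo does not publish, mapped to the
# replacement the commit uses instead.
UNPUBLISHED = {"Q3_K_S": "Q3_K_M"}


def resolve_gguf(quant: str = "Q4_K_M") -> tuple[str, str]:
    """Return (repo_id, filename) for a quant, rejecting unpublished ones."""
    if quant in UNPUBLISHED:
        raise ValueError(
            f"{quant} is not published in {REPO_ID}; use {UNPUBLISHED[quant]}"
        )
    return REPO_ID, FILENAME_PATTERN.format(quant=quant)


if __name__ == "__main__":
    repo, filename = resolve_gguf("Q3_K_M")
    print(f"{repo}/{filename} -> ollama tag {DEFAULT_TAG}")
```

Failing early on `Q3_K_S` is one possible design; the actual script may simply let the download fail.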

Files changed (20)

CHANGELOG.md +79 -0
CITATION.cff +20 -12
Makefile +5 -5
Modelfile +18 -17
README.md +131 -91
examples/README.md +21 -20
examples/llama_cpp_quickstart.py +1 -1
examples/llama_cpp_vision.py +8 -8
examples/ollama_chat.py +5 -5
examples/transformers_quickstart.py +10 -7
scripts/bench.sh +4 -4
scripts/build.sh +11 -12
scripts/check.sh +6 -4
scripts/check_bridge_sync.py +2 -2
scripts/fetch_vision.sh +12 -8
scripts/heal_hf_pull.sh +8 -8
scripts/install-hooks.sh +1 -1
scripts/load_bundle.sh +7 -7
scripts/smoke_test.sh +6 -6
scripts/verify_arch.py +4 -4

CHANGELOG.md CHANGED

@@ -7,6 +7,85 @@ and documentation**, not the underlying base model.
 ## [Unreleased]
 ### Changed (5th round trip — qwen36 → qwen35, retested next-day)
 - **Bundle re-stamped `general.architecture: 'qwen36'` → `'qwen35'`**
   in `hf upload` commit `e03e10e` (HF), 2026-05-20 midday — 8

 ## [Unreleased]
+### Changed (project rename + base swap to Heretic v2)
+- **Renamed project `Thanatos-27B` → `Thanatos-Heretic-27B`** and
+  **swapped immediate base from `Qwen/Qwen3.6-27B` (vanilla) →
+  `llmfan46/Qwen3.6-27B-uncensored-heretic-v2`** (an uncensored
+  Heretic-style abliteration of the dense Qwen 3.6 27B base).
+  README, Modelfile preamble, `CITATION.cff`, all scripts, and
+  all examples now refer to `Thanatos-Heretic-27B` /
+  `thanatos-heretic-27b` (lowercase Ollama tag) and pull GGUFs
+  from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`.
+  Architecture is unchanged (still Qwen 3.6 dense 27B,
+  `qwen35`-stamped, hybrid SSM+attention stack) — only the
+  weights' finetune lineage moves.
+- **`base_model:` frontmatter** flipped to
+  `llmfan46/Qwen3.6-27B-uncensored-heretic-v2`;
+  `base_model_relation: finetune` added; `heretic` and
+  `uncensored` tags appended. `library_name: transformers` stays
+  for HF Hub placement (snippet trap accepted as before;
+  `config.json` is still intentionally absent).
+- **`scripts/build.sh`** now points `REPO_ID` at
+  `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and uses the
+  filename pattern `Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf`.
+  Default `TAG` is `thanatos-heretic-27b`. Note: no `Q3_K_S` in
+  the Heretic GGUF repo — use `Q3_K_M` for the smallest practical
+  quant (`Modelfile` preamble and README hardware/quick-start
+  tables updated accordingly).
+- **`scripts/fetch_vision.sh`** defaults flipped to
+  `PRECISION=BF16` and
+  `REPO_ID=llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`
+  (`Qwen3.6-27B-mmproj-BF16.gguf`, ~931 MB). Unsloth's
+  `mmproj-F16.gguf` is documented as a reference fallback for
+  users who want the F16/F32 variants.
+- **Bundled blob status:** the in-repo
+  `Thanatos-27B.Q4_K_M.gguf` LFS pointer is unchanged — still the
+  legacy pre-Heretic Qwen 3.6 27B Q4_K_M quant
+  (`5ed60d0af4650a854b1755bd392f9aef4872643dc25a254bc68043fa638392a0`).
+  Behaves identically to vanilla Qwen 3.6 27B for now. Heretic v2
+  re-quantization + rebundle (file rename to
+  `Thanatos-Heretic-27B.Q4_K_M.gguf` + LFS swap) is a separate
+  follow-up; users wanting actual Heretic behavior today should
+  use the local-build path (`make build`).
+- **HF repo migration:** the local git remote still points at
+  `huggingface.co/FoolDev/Thanatos-27B`. A new HF repo at
+  `FoolDev/Thanatos-Heretic-27B` needs to be created and the
+  remote re-pointed before the next push. Migration notice on the
+  old `FoolDev/Thanatos-27B` model card is pending.
+- **CHANGELOG history left intact:** entries below this one still
+  reference `Thanatos-27B` and the bundled-blob saga as they
+  happened on the old repo identity. Historical, not retconned.
+### Changed (HF tag-surface cleanup — `general.tags` strip + `config.json` drop)
+- **Stripped `general.tags` KV from the bundled GGUF** (`9cc78e7`,
+  2026-05-20). Drops the upstream-baked `unsloth` and
+  `image-text-to-text` tags that `llama.cpp`'s converter copies
+  into GGUFs from `unsloth/Qwen3.6-27B-GGUF`; both surfaced on
+  the HF model page and obscured this card's positioning.
+  Tensors byte-identical; only the `general.tags` KV is gone.
+- **Dropped `config.json`** (`5302d10`, 2026-05-20) to suppress
+  HF's tag auto-detector surfacing `qwen3_5` in the repo header
+  — the detector reads `architectures` from `config.json`.
+  Consequence: `AutoModelForCausalLM.from_pretrained(
+  "FoolDev/Thanatos-27B")` no longer works on its own.
+  `examples/transformers_quickstart.py` and the README
+  transformers note now point users at upstream
+  `Qwen/Qwen3.6-27B` directly (tensors byte-identical, so the
+  result is the same model). `library_name: transformers` stays
+  in the model-card metadata for Hub placement.
+### Reverted (safetensors mirror experiment)
+- **Mirrored Qwen/Qwen3.6-27B's safetensors set into this repo
+  (`b420378`, 2026-05-20), reverted within the day** (`50f6684`
+  + `9cf363e`, 2026-05-21). 15 sharded `.safetensors` + tokenizer
+  + processor configs (~58 GB) were briefly added so users
+  wanting GGUF + safetensors in one place could skip a second
+  `hf download`; reverted on reflection. Transformers users
+  continue to pull from upstream `Qwen/Qwen3.6-27B`. `.gitignore`
+  whitelist for the Qwen sharded naming pattern (`0c5bee4`) was
+  removed alongside the mirror; `*.safetensors` block rule is
+  back to baseline.
 ### Changed (5th round trip — qwen36 → qwen35, retested next-day)
 - **Bundle re-stamped `general.architecture: 'qwen36'` → `'qwen35'`**
   in `hf upload` commit `e03e10e` (HF), 2026-05-20 midday — 8

CITATION.cff CHANGED

@@ -1,21 +1,22 @@
 cff-version: 1.2.0
-title: "Thanatos-27B: A Dense Distillation Wrapper for Qwen 3.6 27B"
 message: "If you use this model card or its accompanying files, please cite as below."
 type: software
 authors:
   - name: FoolDev
     website: "https://huggingface.co/FoolDev"
-repository-code: "https://huggingface.co/FoolDev/Thanatos-27B"
-url: "https://huggingface.co/FoolDev/Thanatos-27B"
 abstract: >-
-  Thanatos-27B is a personal repackaging of the dense Qwen 3.6 27B base model
-  with Claude Opus 4.7 in the reasoning teacher slot. The repository ships
-  an Ollama Modelfile, sampling defaults, usage examples, and a single
-  ready-to-run GGUF (Q4_K_M ~17 GB) so the HF "Use this model" widget
-  surfaces a one-liner Ollama snippet. Other quants (Q3_K_S, Q5_K_M,
-  Q6_K, etc.) and the upstream safetensors (Qwen/Qwen3.6-27B) are
-  pulled from upstream (unsloth/Qwen3.6-27B-GGUF) on demand rather
-  than redistributed.
 keywords:
   - qwen
   - qwen3.6
@@ -23,10 +24,17 @@ keywords:
   - distillation
   - reasoning
   - llm
 license: Apache-2.0
 references:
   - type: software
-    title: "Qwen3.6-27B"
     authors:
       - name: Alibaba Qwen Team
     url: "https://huggingface.co/Qwen/Qwen3.6-27B"

 cff-version: 1.2.0
+title: "Thanatos-Heretic-27B: A Dense Distillation Wrapper for llmfan46's Qwen 3.6 27B Uncensored Heretic v2"
 message: "If you use this model card or its accompanying files, please cite as below."
 type: software
 authors:
   - name: FoolDev
     website: "https://huggingface.co/FoolDev"
+repository-code: "https://huggingface.co/FoolDev/Thanatos-Heretic-27B"
+url: "https://huggingface.co/FoolDev/Thanatos-Heretic-27B"
 abstract: >-
+  Thanatos-Heretic-27B is a personal repackaging of llmfan46's uncensored
+  Heretic v2 finetune of Qwen 3.6 27B (dense), with Claude Opus 4.7 in
+  the reasoning teacher slot. The repository ships an Ollama Modelfile,
+  sampling defaults, usage examples, and a single ready-to-run GGUF
+  (Q4_K_M ~17 GB) so the HF "Use this model" widget surfaces a one-liner
+  Ollama snippet. Other quants (Q3_K_M, Q5_K_M, Q6_K, etc.) and the
+  Heretic safetensors are pulled from upstream
+  (llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF and the matching
+  non-GGUF repo) on demand rather than redistributed.
 keywords:
   - qwen
   - qwen3.6
   - distillation
   - reasoning
   - llm
+  - heretic
+  - uncensored
 license: Apache-2.0
 references:
   - type: software
+    title: "Qwen3.6-27B-uncensored-heretic-v2 (immediate base)"
+    authors:
+      - name: llmfan46
+    url: "https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2"
+  - type: software
+    title: "Qwen3.6-27B (upstream base)"
     authors:
       - name: Alibaba Qwen Team
     url: "https://huggingface.co/Qwen/Qwen3.6-27B"

Makefile CHANGED

@@ -1,11 +1,11 @@
-# Thanatos-27B convenience Makefile.
 #
 # All work is delegated to scripts/* — this file just gives common
 # operations short, discoverable names.
 #
 # Variables you can override on the command line:
 #   QUANT     GGUF quant suffix       (default: Q4_K_M)
-#   TAG       Ollama model tag        (default: thanatos-27b)
 #   GGUF_PATH path to existing GGUF   (skip the download)
 #   MODEL     model tag for smoke     (default: $(TAG))
 #
@@ -19,7 +19,7 @@
 #   make clean
 QUANT ?= Q4_K_M
-TAG   ?= thanatos-27b
 MODEL ?= $(TAG)
 .DEFAULT_GOAL := help
@@ -43,7 +43,7 @@ build:  ## Download qwen35-stamped GGUF from unsloth and run 'ollama create' (lo
 load-bundle:  ## Load THIS repo's bundled GGUF into a local Ollama tag (smudge LFS + ollama create).
 	TAG=$(TAG) ./scripts/load_bundle.sh
-heal-hf:  ## Heal an already-pulled hf.co/FoolDev/Thanatos-27B tag in-store (rebadge blob + manifest digest).
 	./scripts/heal_hf_pull.sh
 smoke:  ## Verify the model is reachable and round-trips.
@@ -69,6 +69,6 @@ hooks:  ## Install scripts/check.sh as the git pre-commit hook.
 clean:  ## Remove local GGUF copies and ephemeral caches in this repo.
 	@echo "[*] removing local GGUFs and ephemeral caches in $$PWD"
-	@rm -f ./Qwen3.6-27B-*.gguf ./mmproj-*.gguf ./Thanatos-27B.*.qwen[0-9]*.gguf
 	@rm -rf ./.cache __pycache__ examples/__pycache__
 	@echo "[+] clean"

+# Thanatos-Heretic-27B convenience Makefile.
 #
 # All work is delegated to scripts/* — this file just gives common
 # operations short, discoverable names.
 #
 # Variables you can override on the command line:
 #   QUANT     GGUF quant suffix       (default: Q4_K_M)
+#   TAG       Ollama model tag        (default: thanatos-heretic-27b)
 #   GGUF_PATH path to existing GGUF   (skip the download)
 #   MODEL     model tag for smoke     (default: $(TAG))
 #
 #   make clean
 QUANT ?= Q4_K_M
+TAG   ?= thanatos-heretic-27b
 MODEL ?= $(TAG)
 .DEFAULT_GOAL := help
 load-bundle:  ## Load THIS repo's bundled GGUF into a local Ollama tag (smudge LFS + ollama create).
 	TAG=$(TAG) ./scripts/load_bundle.sh
+heal-hf:  ## Heal an already-pulled hf.co/FoolDev/Thanatos-Heretic-27B tag in-store (rebadge blob + manifest digest).
 	./scripts/heal_hf_pull.sh
 smoke:  ## Verify the model is reachable and round-trips.
 clean:  ## Remove local GGUF copies and ephemeral caches in this repo.
 	@echo "[*] removing local GGUFs and ephemeral caches in $$PWD"
+	@rm -f ./Qwen3.6-27B-*.gguf ./mmproj-*.gguf ./Thanatos-Heretic-27B.*.qwen[0-9]*.gguf
 	@rm -rf ./.cache __pycache__ examples/__pycache__
 	@echo "[+] clean"
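
The Makefile header lists `QUANT`, `TAG`, and `GGUF_PATH` as variables that can be overridden on the command line. As a rough sketch of what that looks like when scripted, the hypothetical helper below assembles the `make build` invocation with the new defaults. It only builds the argument list and does not run `make`.

```python
import shlex

# Defaults as they appear in the updated Makefile.
DEFAULT_QUANT = "Q4_K_M"
DEFAULT_TAG = "thanatos-heretic-27b"


def make_build_argv(quant=DEFAULT_QUANT, tag=DEFAULT_TAG, gguf_path=None):
    """Build the argv for `make build`, overriding Makefile variables."""
    argv = ["make", "build", f"QUANT={quant}", f"TAG={tag}"]
    if gguf_path is not None:
        # GGUF_PATH points at an existing GGUF and skips the download.
        argv.append(f"GGUF_PATH={gguf_path}")
    return argv


if __name__ == "__main__":
    # Print a copy-pasteable command line for the smallest practical quant.
    print(shlex.join(make_build_argv(quant="Q3_K_M")))
```

Variables given on the `make` command line take precedence over the `?=` defaults, so omitting them falls back to `Q4_K_M` and `thanatos-heretic-27b`.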

Modelfile CHANGED

@@ -1,4 +1,4 @@
-# Thanatos-27B — Ollama wrapper around Qwen 3.6 27B (dense)
 #
 # Text + tool calling. Vision via Ollama is currently broken for this
 # architecture (ollama/ollama#15898 — the qwen35 arch entries are in
@@ -10,21 +10,22 @@
 # stamped `general.architecture: 'qwen35'` — the upstream-canonical
 # arch entry every released llama.cpp / Ollama loads under for the
 # Qwen 3.5 / 3.6 hybrid SSM + attention family. `ollama create
-# thanatos-27b -f Modelfile && ollama run thanatos-27b` loads it
 # directly. See README "Architecture" for the full stamp history
 # (eight flips between qwen35 and qwen36, settled on qwen35 at
 # `e03e10e` after the 4th qwen36 round trip had its friction
 # re-tested in a fresh next-day session).
 #
-# For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
-# downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
-# FROM in a temp Modelfile copy. The Q3_K_S used to ship in this repo;
-# it was removed so HF's Ollama bridge picks Q4_K_M as the default
-# `:latest` tag instead of Q3_K_S (alphabetically-first heuristic).
 #
 # Other GGUF sources (use with `make build GGUF_PATH=...`):
-#     https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
-#     https://huggingface.co/rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-GGUF
 FROM ./Thanatos-27B.Q4_K_M.gguf
@@ -140,14 +141,14 @@ Behavior rules:
 #       (6182 tokens / 501.9 s; 12.67 / 12.55 / 12.25 short/medium/long)
 #     Q3_K_S → 11.70 tok/s aggregate (run 2, 2026-05-19 evening)
 #       (8009 tokens / 684.0 s; 12.23 / 12.12 / 11.66 short/medium/long)
-#       Second run measured against `thanatos-27b:latest` built via
-#       `make build QUANT=Q3_K_S` — i.e. unsloth/Qwen3.6-27B-GGUF's
-#       qwen35-stamped Q3_K_S, the friction-free path the README
-#       points users at. Aggregate is 4.9% below run 1 (within
-#       the ±20% noise band) — slightly longer per-prompt outputs
-#       this run (8009 vs 6182 tokens) likely contribute the
-#       difference, plus late-in-session thermal pressure on the
-#       Strix Halo iGPU. The friction-free unsloth path works.
 #     Q4_K_M →  9.31 tok/s aggregate (run 1)
 #       (5356 tokens / 574.9 s;  9.48 /  9.43 /  9.28 short/medium/long)
 #     Q4_K_M →  9.19 tok/s aggregate (run 2, 2026-05-19 afternoon)

+# Thanatos-Heretic-27B — Ollama wrapper around Qwen 3.6 27B (dense)
 #
 # Text + tool calling. Vision via Ollama is currently broken for this
 # architecture (ollama/ollama#15898 — the qwen35 arch entries are in
 # stamped `general.architecture: 'qwen35'` — the upstream-canonical
 # arch entry every released llama.cpp / Ollama loads under for the
 # Qwen 3.5 / 3.6 hybrid SSM + attention family. `ollama create
+# thanatos-heretic-27b -f Modelfile && ollama run thanatos-heretic-27b` loads it
 # directly. See README "Architecture" for the full stamp history
 # (eight flips between qwen35 and qwen36, settled on qwen35 at
 # `e03e10e` after the 4th qwen36 round trip had its friction
 # re-tested in a fresh next-day session).
 #
+# For other quants (Q3_K_M, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_M`
+# downloads the chosen quant from llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF
+# (filename pattern Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf) and
+# patches FROM in a temp Modelfile copy. Note: no Q3_K_S in this repo;
+# use Q3_K_M for the smallest practical quant.
 #
 # Other GGUF sources (use with `make build GGUF_PATH=...`):
+#     https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF        # primary (this repo's default)
+#     https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF  # MTP head preserved
+#     https://huggingface.co/unsloth/Qwen3.6-27B-GGUF                               # vanilla Qwen 3.6 (pre-Heretic)
 FROM ./Thanatos-27B.Q4_K_M.gguf
 #       (6182 tokens / 501.9 s; 12.67 / 12.55 / 12.25 short/medium/long)
 #     Q3_K_S → 11.70 tok/s aggregate (run 2, 2026-05-19 evening)
 #       (8009 tokens / 684.0 s; 12.23 / 12.12 / 11.66 short/medium/long)
+#       Second run measured against a `thanatos-27b:latest` (pre-rename)
+#       built via `make build QUANT=Q3_K_S` against the then-current
+#       unsloth/Qwen3.6-27B-GGUF source. Aggregate is 4.9% below
+#       run 1 (within the ±20% noise band) — slightly longer
+#       per-prompt outputs this run (8009 vs 6182 tokens) likely
+#       contribute the difference, plus late-in-session thermal
+#       pressure on the Strix Halo iGPU.
+#       (Heretic v2 base is not benched here yet; rebundle pending.)
 #     Q4_K_M →  9.31 tok/s aggregate (run 1)
 #       (5356 tokens / 574.9 s;  9.48 /  9.43 /  9.28 short/medium/long)
 #     Q4_K_M →  9.19 tok/s aggregate (run 2, 2026-05-19 afternoon)

README.md CHANGED

@@ -1,7 +1,8 @@
 ---
 license: apache-2.0
 base_model:
-  - Qwen/Qwen3.6-27B
 datasets:
   - crownelius/Creative_Writing_ShareGPT_Enhanced
   - microsoft/rStar-Coder
@@ -40,26 +41,28 @@ tags:
   - agent
   - gguf
   - ollama
 library_name: transformers
 pipeline_tag: image-text-to-text
 ---
-<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />
 [![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
-[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
 [![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
 [![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
 [![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)
-# Thanatos-27B
-> **Dense Reasoning. Friendlier Footprint.**
-> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
-**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
-A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
 ## TL;DR
@@ -69,18 +72,28 @@ template — HF's Ollama bridge ingests those three files, not
 `Modelfile`):
 ```bash
-ollama run hf.co/FoolDev/Thanatos-27B           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
 ```
-If you pulled the bundle during any of the qwen36 windows on
-2026-05-19/20 (most recently between `ae67ed1` and `e03e10e`)
-the load will 500 on that stale blob — `make heal-hf` rebadges
-it in place. Fresh pulls after the latest qwen35 re-stamp
-(`e03e10e`) go straight through.
-For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
 QUANT=...` is the simplest path. See [Quick start](#quick-start)
-below for the full matrix.
 For image input use llama.cpp directly — Ollama vision is broken for
 this architecture upstream (see [Vision](#vision)).
@@ -89,9 +102,9 @@ this architecture upstream (see [Vision](#vision)).
 The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
-The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
-| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
 |---|---|---|
 | Architecture | Dense transformer | MoE 256 experts, 8 active |
 | Total params | 27 B | 35 B |
@@ -99,7 +112,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | Layers | 64 | 40 |
 | Hidden size | 5120 | 2048 |
 | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
-| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
 | Multimodal (text path) | Yes | Yes |
 | Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
@@ -111,15 +124,15 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | File | Use |
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
-| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF — used by `make build` / `ollama create` for **local** builds |
-| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
-| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
-| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy v0.6.0-era / 3rd-round-trip-era checkouts — no-op on the current qwen35-stamped bundle. |
-| `scripts/heal_hf_pull.sh` | Recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` *before* the latest qwen35 re-stamp (`978798f`) and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls don't need it. |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
 | `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
-| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
 | `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
 | `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
@@ -129,21 +142,22 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | `CHANGELOG.md` | Versioned tooling/docs changes |
 | `README.md` | This file |
-For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
-downloads the smaller ~12 GB Q3_K_S quant from
-`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
-creates a local `thanatos-27b` Ollama tag. Does not redistribute
-via this repo. For other quants use `make build QUANT=...`. The
-local-build path applies this repo's `Modelfile`; the `hf.co/...`
-path applies the root-level `template`, `system`, and `params`
-files (kept in sync with the `Modelfile`).
-If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
 ## Architecture
 <p align="left">
-  <img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
 </p>
 - Qwen 3.6 dense, 27B parameters, 64 transformer layers
@@ -154,23 +168,30 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
 - Vocab 248,320 (shared with 35B-A3B sibling)
 - 262 144 native context, extensible to ~1 M with YaRN
 - Vision + video supported by the **base architecture** via a separate
-  `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
-  from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
-  current loader compatibility.
 - Multi-token prediction (MTP) head trained for speculative decoding —
   present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
   vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
   **Not usable via llama.cpp / Ollama today**: the GGUF converter
   (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
   `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
-  inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
-  851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
-  merged 2026-05-16) currently covers other architectures only;
-  tracking that PR's follow-up work for when qwen35 / qwen35moe
-  consumer support lands. (Earlier README versions claimed MTP was
-  available without this caveat — confirmed empirically via
-  `gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
-  2026-05-19.)
 **The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
 workaround for an unimplemented `qwen36` arch, but the canonical
@@ -186,9 +207,11 @@ stack:
   exists in `transformers`; Qwen reuses the 3.5 class names.
 - **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
   `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
-  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
-  GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
-  `unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
 - **llama.cpp's model code.** `src/models/qwen35.cpp` has an
   explicit `case 64: type = LLM_TYPE_27B` branch for this model;
   `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
@@ -200,7 +223,7 @@ There is no PR or tracking issue for a `qwen36` arch entry in
 `qwen35` already loads the model the upstream code path was
 designed to load.
-`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
 Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
 loaders.
@@ -257,7 +280,8 @@ the legacy qwen36 → qwen35 in-store rebadge (used by `make
 heal-hf` and `make load-bundle`) and any future arch flip:
 ```bash
-# qwen36 -> qwen35 (the legacy recovery direction)
 python3 scripts/rename_arch.py \
     --from-arch qwen36 --to-arch qwen35 \
     Thanatos-27B.Q4_K_M.qwen36.gguf \
@@ -273,21 +297,23 @@ Three paths:
 ```bash
 # A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
 #    root-level template / system / params files in one step):
-ollama run hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M, qwen35-stamped
-# B. Build a local `thanatos-27b` tag from THIS repo's bundle
 #    (LFS smudge if needed, then `ollama create`). Useful if you
 #    want a bare local tag rather than the `hf.co/...` path:
-make load-bundle                                 # creates local tag thanatos-27b
-ollama run thanatos-27b
-# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
-#    and build locally. Loads on every current llama.cpp / Ollama.
-make build                                              # Q4_K_M  -> thanatos-27b
-make build QUANT=Q3_K_S                                 # 12 GB smaller quant
 make build QUANT=Q5_K_M                                 # 20 GB higher quality
-make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf   # skip download
-ollama run thanatos-27b
 ```
 Under the hood, `make build` calls `scripts/build.sh`, which downloads the
@@ -295,7 +321,7 @@ GGUF if missing (set `GGUF_PATH` to point at one you already have) and
 runs `ollama create` with the matching `Modelfile`.
 If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
-run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.
 Confirm everything works:
@@ -310,10 +336,10 @@ python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-
 | App | How to load this model |
 |---|---|
-| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
-| **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
-| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
-| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
 | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 | **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
@@ -331,7 +357,7 @@ external schema.
 curl -s http://localhost:11434/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "thanatos-27b",
     "messages": [
       {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
       {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
@@ -369,17 +395,21 @@ Behavior rules:
 ## Vision
-The Qwen 3.6 base supports image (and video) input via a separate
-`mmproj` projector. The full multimodal stack is:
 ```
-Qwen3.6-27B-Q4_K_M.gguf   (~17 GB, the text decoder)
-mmproj-F16.gguf           (~927 MB, the vision projector)
 ```
 Both files are at
-[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
-This repo intentionally does not redistribute either.
 ### Loader compatibility — the honest table
@@ -397,10 +427,11 @@ Three flavors, in order of build-time effort:
 ```bash
 # A. HTTP via llama-server (always built — the easiest path).
 #    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
-#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
 llama-server \
-  -m Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj mmproj-F16.gguf \
   --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
 # then POST OpenAI-style chat completions with an image_url content
 # block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
@@ -413,15 +444,15 @@ llama-server \
 #    produce it — a plain `cmake --build build` will. If yours didn't,
 #    run `cmake --build build --target llama-mtmd-cli`.
 llama-mtmd-cli \
-  -m Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj mmproj-F16.gguf \
   --image photo.jpg \
   -p "Describe this image."
 # C. Python via llama-cpp-python:
 python examples/llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj /path/to/mmproj-F16.gguf \
   --image /path/to/photo.jpg \
   --prompt "What is in this image?"
 ```
@@ -439,19 +470,22 @@ The dense 27B is the lighter sibling to Janus-35B and the easier of the two to d
 | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
 | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
 | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
-| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |
 Most numbers in this table are estimates from comparable models; the
 gradient is right but the absolute values will move ±20% with prompt
 shape, KV cache type, and parallel-request count. Measure your own
 machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
 `eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
-data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
 **~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
 steady across short / medium / long prompts), sitting between CPU-only
 and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
 same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
-this hardware.
 ## Chat template
@@ -465,10 +499,10 @@ Ollama is the exception: its conversion of the embedded jinja loses the
 `.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
 Two paths fix this, depending on how you pull the model:
-- **`ollama run hf.co/FoolDev/Thanatos-27B`** — HF's Ollama bridge applies
   the root-level `template` / `system` / `params` files in this repo
   (the bridge does **not** read `Modelfile`).
-- **`make build` / `ollama create thanatos-27b -f Modelfile`** — uses the
   `Modelfile`'s `TEMPLATE` block.
 Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
@@ -511,7 +545,7 @@ the model adapts to whichever shape the system prompt prescribes.
 **Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
 prompts the model to emit JSON-in-XML, the form Ollama's tool-call
 extractor parses into a structured `tool_calls` array. After
-`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
 under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
 accept a `tools` array.
@@ -552,19 +586,25 @@ python examples/ollama_chat.py        # section 3 runs a real round-trip
 - **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
 - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
 - **No formal evaluation in this card.** Numbers above are estimates.
 ## Related models
 | Model | Notes |
 |---|---|
-| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
-| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
 | [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
 | [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
 ## Credits
-- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
 - Reasoning teacher: Claude Opus 4.7 (Anthropic)
 - Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

 ---
 license: apache-2.0
 base_model:
+  - llmfan46/Qwen3.6-27B-uncensored-heretic-v2
+base_model_relation: finetune
 datasets:
   - crownelius/Creative_Writing_ShareGPT_Enhanced
   - microsoft/rStar-Coder
   - agent
   - gguf
   - ollama
+  - heretic
+  - uncensored
 library_name: transformers
 pipeline_tag: image-text-to-text
 ---
+<img src="https://huggingface.co/FoolDev/Thanatos-Heretic-27B/resolve/main/banner.svg" alt="Thanatos-Heretic-27B banner" width="100%" />
 [![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
+[![Base Model](https://img.shields.io/badge/Base-Heretic_v2-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2)
 [![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
 [![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
 [![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)
+# Thanatos-Heretic-27B
+> **Dense Reasoning. Friendlier Footprint. Uncensored.**
+> *llmfan46's Heretic v2 abliteration of Qwen 3.6 27B (dense), repackaged with Claude Opus 4.7 in the teacher slot.*
+**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Base:`** `Heretic v2 (llmfan46)` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled + Abliterated LLM`
+A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on [`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) — an uncensored Heretic-style abliteration of the dense [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base — instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises, and refusal-trained behavior is dialed back at the base layer.
 ## TL;DR
 `Modelfile`):
 ```bash
+ollama run hf.co/FoolDev/Thanatos-Heretic-27B           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
 ```
+> **Bundled blob status:** the GGUF currently bundled in this repo
+> is the legacy pre-Heretic Qwen 3.6 27B Q4_K_M quant from before
+> the rename. Behaves identically to vanilla Qwen 3.6 27B for now;
+> the Heretic v2 rebundle (from
+> `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) is pending —
+> see the top entry of [CHANGELOG](CHANGELOG.md). If you want the
+> Heretic behavior today, use the local-build path below
+> (`make build`), which pulls the Heretic GGUF directly.
+If you pulled the bundle during any of the qwen36 windows on the
+pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
+have a qwen36-stamped blob in your local Ollama store, `make
+heal-hf` rebadges it in place. Fresh pulls of the new
+`Thanatos-Heretic-27B` repo go straight through.
+For other quants (Q3_K_M ~13 GB, Q5_K_M ~20 GB, etc.), `make build
 QUANT=...` is the simplest path. See [Quick start](#quick-start)
+below for the full matrix. Note: no Q3_K_S in the Heretic GGUF
+repo — use Q3_K_M for the smallest practical quant.
 For image input use llama.cpp directly — Ollama vision is broken for
 this architecture upstream (see [Vision](#vision)).
 The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
+The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix, measured against the pre-rename Qwen 3.6 bundle; Heretic v2 inherits the same architecture so per-step cost should match) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
+| | Thanatos-Heretic-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
 |---|---|---|
 | Architecture | Dense transformer | MoE 256 experts, 8 active |
 | Total params | 27 B | 35 B |
 | Layers | 64 | 40 |
 | Hidden size | 5120 | 2048 |
 | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
+| Q3_K_M GGUF size | ~13 GB (build locally via `make build QUANT=Q3_K_M`) | n/a |
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
 | Multimodal (text path) | Yes | Yes |
 | Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
 | File | Use |
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
+| `Modelfile` | Ollama wrapper around the bundled GGUF (currently the legacy pre-Heretic Qwen 3.6 27B Q4_K_M; Heretic v2 rebundle pending) — used by `make build` / `ollama create` for **local** builds |
+| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-Heretic-27B` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
+| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`). This is the path that gets you actual Heretic behavior until the bundled blob is rebundled. |
+| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
+| `scripts/heal_hf_pull.sh` | Legacy recovery for users migrating from the pre-rename `FoolDev/Thanatos-27B` repo who still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35; fresh pulls of `Thanatos-Heretic-27B` don't need it. |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
 | `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
+| `scripts/fetch_vision.sh` | Pulls the vision projector (`Qwen3.6-27B-mmproj-BF16.gguf` from the Heretic GGUF repo, or `mmproj-F16.gguf` from the unsloth reference projector) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
 | `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
 | `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
 | `CHANGELOG.md` | Versioned tooling/docs changes |
 | `README.md` | This file |
+For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_M`
+downloads the smaller ~13 GB Q3_K_M quant from
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` (qwen35-stamped,
+loads directly) and creates a local `thanatos-heretic-27b` Ollama
+tag. Does not redistribute via this repo. For other quants use
+`make build QUANT=...`. The local-build path applies this repo's
+`Modelfile`; the `hf.co/...` path applies the root-level
+`template`, `system`, and `params` files (kept in sync with the
+`Modelfile`).
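To make "kept in sync" concrete, here is a minimal sketch of the kind of comparison such a check could perform on sampling parameters. It is not the repo's `scripts/check_bridge_sync.py`. It assumes the root-level `params` file is JSON (as the HF Ollama docs describe) and that the `Modelfile` uses one `PARAMETER key value` directive per line. The function names are hypothetical, and a line inside a multi-line `TEMPLATE` block that happens to start with `PARAMETER` would be misread.

```python
import json


def parse_modelfile_parameters(modelfile_text: str) -> dict:
    """Collect `PARAMETER key value` directives; repeated keys become lists."""
    params: dict = {}
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3 or parts[0].upper() != "PARAMETER":
            continue
        key, raw = parts[1], parts[2].strip().strip('"')
        try:
            value = json.loads(raw)  # numbers and booleans
        except json.JSONDecodeError:
            value = raw  # plain strings such as stop tokens
        if key in params:
            existing = params[key]
            params[key] = existing + [value] if isinstance(existing, list) else [existing, value]
        else:
            params[key] = value
    return params


def _as_list(value):
    """Normalize so a single value and a one-element list compare equal."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def diff_params(modelfile_text: str, params_json_text: str) -> list:
    """Return the sorted parameter names whose values differ between the two sources."""
    from_modelfile = parse_modelfile_parameters(modelfile_text)
    from_bridge = json.loads(params_json_text)
    keys = sorted(set(from_modelfile) | set(from_bridge))
    return [k for k in keys if _as_list(from_modelfile.get(k)) != _as_list(from_bridge.get(k))]
```

An empty list from `diff_params(open("Modelfile").read(), open("params").read())` would mean the two sources agree on sampling parameters; the `TEMPLATE` and `SYSTEM` comparison is left out of this sketch.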
+If you want the Heretic safetensors for `transformers`, fetch them from [`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2). For the vanilla pre-Heretic Qwen 3.6 27B base, use [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
 ## Architecture
 <p align="left">
+  <img src="https://huggingface.co/FoolDev/Thanatos-Heretic-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
 </p>
 - Qwen 3.6 dense, 27B parameters, 64 transformer layers
 - Vocab 248,320 (shared with 35B-A3B sibling)
 - 262 144 native context, extensible to ~1 M with YaRN
 - Vision + video supported by the **base architecture** via a separate
+  `mmproj` projector (not redistributed here; pull
+  `Qwen3.6-27B-mmproj-BF16.gguf` from
+  `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`, or
+  `mmproj-F16.gguf` from `unsloth/Qwen3.6-27B-GGUF` as a reference
+  alternative). See [Vision](#vision) below for current loader
+  compatibility.
 - Multi-token prediction (MTP) head trained for speculative decoding —
   present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
   vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
   **Not usable via llama.cpp / Ollama today**: the GGUF converter
   (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
   `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
+  inference yet"), so the standard GGUFs (this bundle, unsloth's,
+  llmfan46's Heretic v2) ship with 851 tensors and no MTP head.
+  llmfan46 also publishes a separate
+  `Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF` repo
+  that keeps the MTP tensors for vLLM/SGLang users who want both
+  Heretic v2 + MTP. llama.cpp's MTP support (PR #22673, merged
+  2026-05-16) currently covers other architectures only; tracking
+  that PR's follow-up work for when qwen35 / qwen35moe consumer
+  support lands. (Earlier README versions claimed MTP was available
+  via llama.cpp without this caveat — confirmed empirically via
+  `gguf.GGUFReader` on both this bundle and
+  `unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
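The empirical check mentioned in the MTP bullet can be approximated with a short script. This is a sketch, not the exact procedure used for this card: it assumes the `gguf` Python package is installed for the file-reading helper, and the `"mtp"` substring is an assumed naming convention for MTP tensors, so adjust it to whatever names the converter actually emits.

```python
def find_tensors(names, needle="mtp"):
    """Return tensor names containing `needle` (case-insensitive).

    The default needle is an assumption about how MTP tensors are named.
    """
    return [n for n in names if needle in n.lower()]


def gguf_tensor_names(path):
    """List tensor names in a GGUF file (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the filter works without it

    reader = GGUFReader(path)
    return [t.name for t in reader.tensors]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        names = gguf_tensor_names(sys.argv[1])
        hits = find_tensors(names)
        print(f"{len(names)} tensors, {len(hits)} matching 'mtp'")
```

Run against a standard GGUF, a result of zero matches would be consistent with the converter skipping MTP tensors, provided the substring assumption holds.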
 **The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
 workaround for an unimplemented `qwen36` arch, but the canonical
   exists in `transformers`; Qwen reuses the 3.5 class names.
 - **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
   `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
+  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The Heretic
+  GGUFs this repo pulls from
+  (`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) inherit those
+  stamps, as do the upstream unsloth GGUFs (`unsloth/Qwen3.6-27B-GGUF`,
+  `unsloth/Qwen3.6-35B-A3B-GGUF`).
 - **llama.cpp's model code.** `src/models/qwen35.cpp` has an
   explicit `case 64: type = LLM_TYPE_27B` branch for this model;
   `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
 `qwen35` already loads the model the upstream code path was
 designed to load.
+`ollama run hf.co/FoolDev/Thanatos-Heretic-27B` and `llama-server -m
 Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
 loaders.
 heal-hf` and `make load-bundle`) and any future arch flip:
 ```bash
+# qwen36 -> qwen35 (the legacy recovery direction, for blobs
+# pulled from the pre-rename FoolDev/Thanatos-27B repo)
 python3 scripts/rename_arch.py \
     --from-arch qwen36 --to-arch qwen35 \
     Thanatos-27B.Q4_K_M.qwen36.gguf \
 ```bash
 # A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
 #    root-level template / system / params files in one step):
+ollama run hf.co/FoolDev/Thanatos-Heretic-27B           # 17 GB Q4_K_M, qwen35-stamped
+# B. Build a local `thanatos-heretic-27b` tag from THIS repo's bundle
 #    (LFS smudge if needed, then `ollama create`). Useful if you
 #    want a bare local tag rather than the `hf.co/...` path:
+make load-bundle                                 # creates local tag thanatos-heretic-27b
+ollama run thanatos-heretic-27b
+# C. Bypass the bundle: download a qwen35-stamped Heretic v2 GGUF
+#    from llmfan46 and build locally. Loads on every current
+#    llama.cpp / Ollama. This is the path that gets you actual
+#    Heretic behavior until the bundled blob is rebundled.
+make build                                              # Q4_K_M  -> thanatos-heretic-27b
+make build QUANT=Q3_K_M                                 # 13 GB smaller quant
 make build QUANT=Q5_K_M                                 # 20 GB higher quality
+make build GGUF_PATH=~/models/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf   # skip download
+ollama run thanatos-heretic-27b
 ```
 Under the hood, `make build` calls `scripts/build.sh`, which downloads the
 runs `ollama create` with the matching `Modelfile`.
 If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
+run `ollama create thanatos-heretic-27b -f Modelfile && ollama run thanatos-heretic-27b`.
 Confirm everything works:
 | App | How to load this model |
 |---|---|
+| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-Heretic-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_M` downloads from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
+| **LM Studio** | Search → `FoolDev/Thanatos-Heretic-27B` → pick `Thanatos-27B.Q4_K_M.gguf` (current bundled filename; will become `Thanatos-Heretic-27B.Q4_K_M.gguf` after the rebundle). Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
+| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-Heretic-27B`. Same template behavior as LM Studio. |
+| **llama.cpp** | `hf download FoolDev/Thanatos-Heretic-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via `Qwen3.6-27B-mmproj-BF16.gguf` from the Heretic GGUF repo). |
 | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 | **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
 curl -s http://localhost:11434/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{
+    "model": "thanatos-heretic-27b",
     "messages": [
       {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
       {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
 ## Vision
+The Qwen 3.6 base (and llmfan46's Heretic v2 finetune of it) supports
+image (and video) input via a separate `mmproj` projector. The full
+multimodal stack is:
 ```
+Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf   (~17 GB, the text decoder)
+Qwen3.6-27B-mmproj-BF16.gguf                    (~931 MB, the vision projector)
 ```
 Both files are at
+[`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF).
+For the vanilla pre-Heretic projector, see
+[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
+(`mmproj-F16.gguf`, ~927 MB). This repo intentionally does not
+redistribute either.
 ### Loader compatibility — the honest table
 ```bash
 # A. HTTP via llama-server (always built — the easiest path).
 #    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
+#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU (pre-Heretic Qwen 3.6
+#    bundle; Heretic v2 shares the architecture so the recipe carries).
 llama-server \
+  -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
   --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
 # then POST OpenAI-style chat completions with an image_url content
 # block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
 #    produce it — a plain `cmake --build build` will. If yours didn't,
 #    run `cmake --build build --target llama-mtmd-cli`.
 llama-mtmd-cli \
+  -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
   --image photo.jpg \
   -p "Describe this image."
 # C. Python via llama-cpp-python:
 python examples/llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
   --image /path/to/photo.jpg \
   --prompt "What is in this image?"
 ```
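For path A, the request body described in the comments above can be built like this. It is a minimal sketch: `build_vision_request` is a hypothetical helper name, `max_tokens=256` is an arbitrary choice, and the JPEG MIME type is assumed unless you pass another.

```python
import base64


def build_vision_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload with one inline image as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        "max_tokens": 256,  # arbitrary cap for a short description
    }
```

Serialize the result with `json.dumps` and POST it to `http://127.0.0.1:8765/v1/chat/completions` (the host and port from the `llama-server` command above) with a `Content-Type: application/json` header.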
 | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
 | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
 | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
+| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_M` (~13 GB) and trim `num_ctx` for headroom. |
 Most numbers in this table are estimates from comparable models; the
 gradient is right but the absolute values will move ±20% with prompt
 shape, KV cache type, and parallel-request count. Measure your own
 machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
 `eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
+data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan
+(measured against the pre-rename Qwen 3.6 bundle; Heretic v2 inherits
+the architecture so per-step cost should match within bench noise):
 **~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
 steady across short / medium / long prompts), sitting between CPU-only
 and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
 same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
+this hardware. (Heretic v2 publishes Q3_K_M rather than Q3_K_S; the
+~13 GB Q3_K_M should sit within 5% of the ~12 GB Q3_K_S numbers.)
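The tok/s arithmetic behind these figures is simple enough to reproduce by hand. The sketch below is not `scripts/bench.sh`; it only shows the calculation, relying on the fact that Ollama reports `eval_duration` in nanoseconds. The function names are hypothetical.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Decode throughput from Ollama's response metadata (duration in ns)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def mean_tokens_per_second(responses: list) -> float:
    """Aggregate several responses as total tokens over total decode time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return tokens_per_second(total_tokens, total_ns)
```

Summing tokens and durations before dividing weights long generations more heavily than averaging per-prompt rates would, which is usually what you want for a mixed-length prompt set.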
 ## Chat template
 `.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
 Two paths fix this, depending on how you pull the model:
+- **`ollama run hf.co/FoolDev/Thanatos-Heretic-27B`** — HF's Ollama bridge applies
   the root-level `template` / `system` / `params` files in this repo
   (the bridge does **not** read `Modelfile`).
+- **`make build` / `ollama create thanatos-heretic-27b -f Modelfile`** — uses the
   `Modelfile`'s `TEMPLATE` block.
 Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
 **Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
 prompts the model to emit JSON-in-XML, the form Ollama's tool-call
 extractor parses into a structured `tool_calls` array. After
+`make build`, `ollama show thanatos-heretic-27b` lists `tools` and `thinking`
 under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
 accept a `tools` array.
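A minimal sketch of what a request body with a `tools` array might look like for `/api/chat`. The `get_weather` tool, its schema, and the `build_chat_request` helper are hypothetical and exist only for illustration; the model tag assumes the default local tag created by `make build`.

```python
import json


def build_chat_request(model, user_text, tools):
    """Assemble a non-streaming chat request body that carries a tools array."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "stream": False,
    }


# Hypothetical tool definition in the function-schema shape.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_chat_request(
    "thanatos-heretic-27b",
    "What is the weather in Oslo?",
    [get_weather],
)
print(json.dumps(payload, indent=2))
```

The body would be sent as JSON in a POST to `/api/chat` on the Ollama host; any tool calls the model emits should then appear as a structured `tool_calls` array on the returned message.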
 - **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
 - **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
 - **No formal evaluation in this card.** Numbers above are estimates.
+- **Bundled blob is pre-Heretic.** The currently-bundled `Thanatos-27B.Q4_K_M.gguf` blob is the legacy Qwen 3.6 27B Q4_K_M quant from before the rename — it behaves like vanilla Qwen 3.6, not Heretic v2. Use `make build` (which pulls the Heretic GGUF from llmfan46) until the rebundle ships.
+- **Uncensored base.** The Heretic v2 abliteration dials back the refusal-training of upstream Qwen 3.6. Outputs may be more compliant with sensitive requests than the vanilla base; the Thanatos system prompt still steers behavior, but the safety floor is lower. Apply your own filtering for user-facing deployments.
 ## Related models
 | Model | Notes |
 |---|---|
+| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) | **Immediate base**, safetensors |
+| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF) | Recommended GGUF source (what `make build` pulls from) |
+| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved) | Same Heretic v2 but keeps the MTP head for vLLM / SGLang speculative decoding |
+| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream pre-Heretic base, safetensors |
+| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Pre-Heretic GGUF mirror + reference `mmproj-F16.gguf` projector |
 | [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
 | [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
 ## Credits
+- Immediate base: [llmfan46/Qwen3.6-27B-uncensored-heretic-v2](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) — Heretic-style abliteration of Qwen 3.6 27B
+- Upstream base: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
 - Reasoning teacher: Claude Opus 4.7 (Anthropic)
 - Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

examples/README.md

@@ -1,13 +1,13 @@
-# Thanatos-27B examples
 Four minimal entry points. Pick the one that matches how you run models.
 | File | Backend | When to use |
 |---|---|---|
-| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
-| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
 | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
-| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-F16.gguf` and answers questions about an image. The only working vision path right now. |
 All four apply the same Thanatos system prompt and sampling defaults
 (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
@@ -24,9 +24,9 @@ root-level `template` / `system` / `params` files via HF's Ollama
 bridge):
 ```bash
-ollama pull hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M (only bundled quant)
 pip install requests
-MODEL=hf.co/FoolDev/Thanatos-27B python ollama_chat.py
 ```
 If you pulled before the latest qwen35 re-stamp (HF commit
@@ -36,13 +36,14 @@ in place (qwen36 → qwen35, metadata-only, ~5 s) — the same
 tag then loads. Fresh pulls after the re-stamp go straight
 through.
-For a non-bundled quant (e.g. Q3_K_S ~12 GB, Q5_K_M ~20 GB),
-`make build QUANT=...` downloads from `unsloth/Qwen3.6-27B-GGUF`
-and creates a local `thanatos-27b` tag:
 ```bash
-cd ..  &&  make build QUANT=Q3_K_S  &&  cd examples
-MODEL=thanatos-27b python ollama_chat.py
 ```
 Or build a local tag from this repo's bundled GGUF without going
@@ -50,12 +51,12 @@ through the HF pull:
 ```bash
 cd ..  &&  make load-bundle  &&  cd examples
-MODEL=thanatos-27b python ollama_chat.py
 ```
 For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
-fetch it from `unsloth/Qwen3.6-27B-GGUF` and patch the `Modelfile`
-`FROM` line into a temp copy automatically:
 ```bash
 cd ..  &&  make build QUANT=Q5_K_M  &&  cd examples
@@ -74,7 +75,7 @@ python transformers_quickstart.py --no-4bit  # bf16, ~54 GB VRAM
 ```bash
 pip install llama-cpp-python  # CPU-only build
-python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
 ```
 For GPU offload, rebuild llama-cpp-python with the matching backend — see
@@ -83,13 +84,13 @@ the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
 ### Vision (image input)
 ```bash
-# Pull the projector once (~927 MB):
-hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf --local-dir .
 pip install llama-cpp-python pillow
 python llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj /path/to/mmproj-F16.gguf \
   --image /path/to/photo.jpg \
   --prompt "Describe this image."
 ```
@@ -101,7 +102,7 @@ lacks them. `ollama create` accepts the dual-`FROM` and `ollama show`
 reports `vision` capability, but the first inference call fails with
 `error loading model architecture: unknown model architecture:
 'qwen35'` (verified empirically against the dense 27B +
-`mmproj-F16.gguf`). Tracked in
 [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
 Until that's fixed, llama.cpp / llama-cpp-python is the working path
 for vision.

+# Thanatos-Heretic-27B examples
 Four minimal entry points. Pick the one that matches how you run models.
 | File | Backend | When to use |
 |---|---|---|
+| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-heretic-27b` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
+| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the Heretic safetensors (`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`) on GPU, optionally in 4-bit via bitsandbytes. |
 | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
+| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `Qwen3.6-27B-mmproj-BF16.gguf` and answers questions about an image. The only working vision path right now. |
 All four apply the same Thanatos system prompt and sampling defaults
 (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
 bridge):
 ```bash
+ollama pull hf.co/FoolDev/Thanatos-Heretic-27B           # 17 GB Q4_K_M (only bundled quant)
 pip install requests
+MODEL=hf.co/FoolDev/Thanatos-Heretic-27B python ollama_chat.py
 ```
 If you pulled before the latest qwen35 re-stamp (HF commit
 tag then loads. Fresh pulls after the re-stamp go straight
 through.
+For a non-bundled quant (e.g. Q3_K_M ~13 GB, Q5_K_M ~20 GB),
+`make build QUANT=...` downloads from
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and creates a
+local `thanatos-heretic-27b` tag:
 ```bash
+cd ..  &&  make build QUANT=Q3_K_M  &&  cd examples
+MODEL=thanatos-heretic-27b python ollama_chat.py
 ```
 Or build a local tag from this repo's bundled GGUF without going
 ```bash
 cd ..  &&  make load-bundle  &&  cd examples
+MODEL=thanatos-heretic-27b python ollama_chat.py
 ```
 For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
+fetch it from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and
+patch the `Modelfile` `FROM` line into a temp copy automatically:
 ```bash
 cd ..  &&  make build QUANT=Q5_K_M  &&  cd examples
 ```bash
 pip install llama-cpp-python  # CPU-only build
+python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf --gpu-layers 99
 ```
 For GPU offload, rebuild llama-cpp-python with the matching backend — see
 ### Vision (image input)
 ```bash
+# Pull the projector once (~931 MB):
+hf download llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF Qwen3.6-27B-mmproj-BF16.gguf --local-dir .
 pip install llama-cpp-python pillow
 python llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
   --image /path/to/photo.jpg \
   --prompt "Describe this image."
 ```
 reports `vision` capability, but the first inference call fails with
 `error loading model architecture: unknown model architecture:
 'qwen35'` (verified empirically against the dense 27B +
+the F16 reference projector). Tracked in
 [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
 Until that's fixed, llama.cpp / llama-cpp-python is the working path
 for vision.
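For the llama-cpp-python route, a local image is commonly passed as an `image_url` content part holding a base64 data URI. The sketch below shows one way to build that message using only the standard library. The helper names are hypothetical, and whether `llama_cpp_vision.py` builds its messages this way is an assumption.

```python
import base64
import mimetypes


def image_data_uri(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URI, guessing the MIME type by name."""
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def vision_message(prompt: str, data: bytes, filename: str) -> dict:
    """Build a single user turn with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_data_uri(data, filename)}},
        ],
    }


# Three placeholder bytes stand in for the contents of a real photo.jpg.
message = vision_message("Describe this image.", b"\xff\xd8\xff", "photo.jpg")
print(message["content"][1]["image_url"]["url"][:30])
```

In practice the bytes would come from reading the image file in binary mode, and the resulting message would be passed in the `messages` list of a chat-completion call on a model loaded with its mmproj.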

examples/llama_cpp_quickstart.py

@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B — llama-cpp-python quickstart.
 Skip Ollama entirely and call the GGUF directly through llama-cpp-python.
 Useful for batch jobs, CI, or environments where you don't want a daemon.

 #!/usr/bin/env python3
 """
+Thanatos-Heretic-27B — llama-cpp-python quickstart.
 Skip Ollama entirely and call the GGUF directly through llama-cpp-python.
 Useful for batch jobs, CI, or environments where you don't want a daemon.

examples/llama_cpp_vision.py

@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B — vision (image-text-to-text) via llama-cpp-python.
 Why this script exists:
     Ollama's Go engine has the qwen35 / qwen35moe arch entries (text
@@ -23,21 +23,21 @@ Install:
     #   CMAKE_ARGS="-DGGML_METAL=on"  pip install llama-cpp-python --no-binary :all:
     #   CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
-Files you need (both from unsloth/Qwen3.6-27B-GGUF):
-    1. A text GGUF (any quant): e.g. Qwen3.6-27B-Q4_K_M.gguf  (~17 GB)
-    2. A vision projector:        mmproj-F16.gguf              (~927 MB)
 Usage:
     python llama_cpp_vision.py \
-        --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
-        --mmproj /path/to/mmproj-F16.gguf \
         --image  /path/to/photo.jpg \
         --prompt "What is in this image? Be specific."
     # CLI alternative without python binding (ships with llama.cpp):
     #   llama-mtmd-cli \
-    #     -m Qwen3.6-27B-Q4_K_M.gguf \
-    #     --mmproj mmproj-F16.gguf \
     #     --image photo.jpg \
     #     -p "Describe this image."
 """

 #!/usr/bin/env python3
 """
+Thanatos-Heretic-27B — vision (image-text-to-text) via llama-cpp-python.
 Why this script exists:
     Ollama's Go engine has the qwen35 / qwen35moe arch entries (text
     #   CMAKE_ARGS="-DGGML_METAL=on"  pip install llama-cpp-python --no-binary :all:
     #   CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:
+Files you need (both from llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF):
+    1. A text GGUF (any quant): e.g. Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf  (~17 GB)
+    2. A vision projector:        Qwen3.6-27B-mmproj-BF16.gguf                      (~931 MB)
 Usage:
     python llama_cpp_vision.py \
+        --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+        --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
         --image  /path/to/photo.jpg \
         --prompt "What is in this image? Be specific."
     # CLI alternative without python binding (ships with llama.cpp):
     #   llama-mtmd-cli \
+    #     -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+    #     --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
     #     --image photo.jpg \
     #     -p "Describe this image."
 """

examples/ollama_chat.py

@@ -1,17 +1,17 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B — Ollama chat examples.
 Prerequisites (pick one):
     A. From the bundled GGUFs (default flow):
         $ make build                     # uses Thanatos-27B.Q4_K_M.gguf
         # or:
-        $ ollama create thanatos-27b -f ../Modelfile
     B. Pull straight from HF (Q4_K_M is the only bundled quant):
-        $ ollama run hf.co/FoolDev/Thanatos-27B
-        # then set MODEL=hf.co/FoolDev/Thanatos-27B below
 Then:
     $ ollama serve         # usually already running
@@ -39,7 +39,7 @@ from typing import Any, Iterator
 import requests
-MODEL = os.environ.get("MODEL", "thanatos-27b")
 HOST = os.environ.get("HOST", "http://localhost:11434")
 _THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

 #!/usr/bin/env python3
 """
+Thanatos-Heretic-27B — Ollama chat examples.
 Prerequisites (pick one):
     A. From the bundled GGUFs (default flow):
         $ make build                     # uses Thanatos-27B.Q4_K_M.gguf
         # or:
+        $ ollama create thanatos-heretic-27b -f ../Modelfile
     B. Pull straight from HF (Q4_K_M is the only bundled quant):
+        $ ollama run hf.co/FoolDev/Thanatos-Heretic-27B
+        # then set MODEL=hf.co/FoolDev/Thanatos-Heretic-27B below
 Then:
     $ ollama serve         # usually already running
 import requests
+MODEL = os.environ.get("MODEL", "thanatos-heretic-27b")
 HOST = os.environ.get("HOST", "http://localhost:11434")
 _THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
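The `_THINK_RE` pattern above matches a `<think>...</think>` block plus any trailing whitespace, across newlines. A minimal sketch of how such a pattern could be used to drop reasoning blocks from a reply follows; the `strip_thinking` helper is hypothetical, since the script's actual use of the regex is not shown in this hunk.

```python
import re

# Same pattern as in the script: non-greedy, and DOTALL so "." spans newlines.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every closed <think>...</think> block and the whitespace after it."""
    return _THINK_RE.sub("", text)


reply = "<think>step 1\nstep 2</think>\n\nThe answer is 4."
print(strip_thinking(reply))
```

The non-greedy `.*?` keeps two separate think blocks from being merged into one match, and an unclosed `<think>` is left untouched because the pattern requires the closing tag.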

examples/transformers_quickstart.py

@@ -1,12 +1,15 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B — Hugging Face Transformers quickstart.
-Loads the upstream Qwen 3.6 27B safetensors directly and runs a single
-chat turn using its embedded chat template. Thanatos-27B is a *wrapper*
-around that base, so for the transformers route there is nothing to
-download from this repo — point at Qwen/Qwen3.6-27B and apply the same
-system prompt the Modelfile uses.
 Requirements:
     pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
@@ -36,7 +39,7 @@ except ImportError as e:  # pragma: no cover
     )
-MODEL_ID = "Qwen/Qwen3.6-27B"
 THANATOS_SYSTEM = (
     "You are Thanatos, a precise and capable assistant for reasoning, writing, "

 #!/usr/bin/env python3
 """
+Thanatos-Heretic-27B — Hugging Face Transformers quickstart.
+Loads the Heretic v2 Qwen 3.6 27B safetensors directly and runs a single
+chat turn using its embedded chat template. Thanatos-Heretic-27B is a
+*wrapper* around that base, so for the transformers route there is nothing
+to download from this repo — point at llmfan46/Qwen3.6-27B-uncensored-heretic-v2
+and apply the same system prompt the Modelfile uses.
+Set MODEL_ID = "Qwen/Qwen3.6-27B" to bypass the Heretic abliteration and
+load the vanilla upstream base instead.
 Requirements:
     pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
     )
+MODEL_ID = "llmfan46/Qwen3.6-27B-uncensored-heretic-v2"
 THANATOS_SYSTEM = (
     "You are Thanatos, a precise and capable assistant for reasoning, writing, "

scripts/bench.sh

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — tok/s benchmark via Ollama.
 #
 # Reads timing from Ollama's /api/chat response metadata (eval_count and
 # eval_duration are authoritative — no client-side stopwatch noise) and
@@ -7,14 +7,14 @@
 # number generalises a bit beyond a single shape.
 #
 # Usage:
-#   ./scripts/bench.sh                       # uses MODEL=thanatos-27b
-#   MODEL=thanatos-27b ./scripts/bench.sh
 #   HOST=http://localhost:11434 ./scripts/bench.sh
 #
 # Requires: curl, jq, a running Ollama daemon with the model created.
 set -euo pipefail
-MODEL="${MODEL:-thanatos-27b}"
 HOST="${HOST:-http://localhost:11434}"
 red()   { printf "\033[31m%s\033[0m\n" "$*" >&2; }

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — tok/s benchmark via Ollama.
 #
 # Reads timing from Ollama's /api/chat response metadata (eval_count and
 # eval_duration are authoritative — no client-side stopwatch noise) and
 # number generalises a bit beyond a single shape.
 #
 # Usage:
+#   ./scripts/bench.sh                       # uses MODEL=thanatos-heretic-27b
+#   MODEL=thanatos-heretic-27b ./scripts/bench.sh
 #   HOST=http://localhost:11434 ./scripts/bench.sh
 #
 # Requires: curl, jq, a running Ollama daemon with the model created.
 set -euo pipefail
+MODEL="${MODEL:-thanatos-heretic-27b}"
 HOST="${HOST:-http://localhost:11434}"
 red()   { printf "\033[31m%s\033[0m\n" "$*" >&2; }

scripts/build.sh

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — fetch a Qwen 3.6 27B GGUF and build the Ollama model.
 #
 # Usage:
 #   ./scripts/build.sh                       # default: Q4_K_M
@@ -7,28 +7,27 @@
 #   QUANT=Q6_K ./scripts/build.sh
 #
 # Skip the download by pointing at a GGUF you already have:
-#   GGUF_PATH=/path/to/Qwen3.6-27B-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
 #
 # Requires: huggingface-cli (or hf), ollama, awk.
 set -euo pipefail
 QUANT="${1:-${QUANT:-Q4_K_M}}"
-REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
-# Upstream uses dashes, e.g. Qwen3.6-27B-Q4_K_M.gguf. Quants known to exist
-# at unsloth/Qwen3.6-27B-GGUF (as of 2026-04):
-#   Q3_K_S Q3_K_M Q4_0 Q4_1 Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0
-#   IQ4_XS IQ4_NL
-#   UD-IQ2_XXS UD-IQ2_M UD-Q2_K_XL UD-IQ3_XXS UD-Q3_K_XL UD-Q4_K_XL
-#   UD-Q5_K_XL UD-Q6_K_XL UD-Q8_K_XL
-GGUF_NAME="Qwen3.6-27B-${QUANT}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 # GGUF_PATH defaults to ${ROOT}/${GGUF_NAME}, but can be overridden so users
 # with cached weights elsewhere don't have to copy or symlink anything.
 GGUF_PATH="${GGUF_PATH:-${ROOT}/${GGUF_NAME}}"
 MODELFILE="${ROOT}/Modelfile"
-TAG="${TAG:-thanatos-27b}"
 echo "[*] repo:     ${REPO_ID}"
 echo "[*] quant:    ${QUANT}"
@@ -96,4 +95,4 @@ ollama create "${TAG}" -f "${TMP_MODELFILE}"
 echo
 echo "[+] Done. Try it:"
 echo "    ollama run ${TAG}"
-echo "    python ${ROOT}/examples/ollama_chat.py   # update MODEL constant if not 'thanatos-27b'"

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — fetch a Qwen 3.6 27B GGUF and build the Ollama model.
 #
 # Usage:
 #   ./scripts/build.sh                       # default: Q4_K_M
 #   QUANT=Q6_K ./scripts/build.sh
 #
 # Skip the download by pointing at a GGUF you already have:
+#   GGUF_PATH=/path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
 #
 # Requires: huggingface-cli (or hf), ollama, awk.
 set -euo pipefail
 QUANT="${1:-${QUANT:-Q4_K_M}}"
+REPO_ID="${REPO_ID:-llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF}"
+# Filenames at llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF follow
+#   Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf
+# Quants known to exist (as of 2026-05):
+#   Q3_K_M Q3_K_L Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0 BF16
+# Note: no Q3_K_S in this repo — use Q3_K_M for the smallest practical quant.
+GGUF_NAME="Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 # GGUF_PATH defaults to ${ROOT}/${GGUF_NAME}, but can be overridden so users
 # with cached weights elsewhere don't have to copy or symlink anything.
 GGUF_PATH="${GGUF_PATH:-${ROOT}/${GGUF_NAME}}"
 MODELFILE="${ROOT}/Modelfile"
+TAG="${TAG:-thanatos-heretic-27b}"
 echo "[*] repo:     ${REPO_ID}"
 echo "[*] quant:    ${QUANT}"
 echo
 echo "[+] Done. Try it:"
 echo "    ollama run ${TAG}"
+echo "    python ${ROOT}/examples/ollama_chat.py   # update MODEL constant if not 'thanatos-heretic-27b'"
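The filename convention and quant list in the script comments can be sketched in Python as follows. The list is copied from the comment (stated there as of 2026-05). The validation step is an addition for illustration: the script as shown builds the name without checking it against the list.

```python
# Quants listed in the build.sh comment for the Heretic v2 GGUF repo.
KNOWN_QUANTS = (
    "Q3_K_M", "Q3_K_L", "Q4_K_S", "Q4_K_M",
    "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0", "BF16",
)


def gguf_name(quant: str = "Q4_K_M") -> str:
    """Return the dash-separated GGUF filename for a quant.

    Raising on unknown quants is a hypothetical guard, not script behavior.
    """
    if quant not in KNOWN_QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {KNOWN_QUANTS}")
    return f"Qwen3.6-27B-uncensored-heretic-v2-{quant}.gguf"


print(gguf_name())
print(gguf_name("Q3_K_M"))
```

A guard of this kind would turn a request for `Q3_K_S`, which the comment says the repo does not publish, into an immediate error rather than a failed download.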

scripts/check.sh

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — repo-local sanity checks.
 #
 # Runs everything that's cheap and catches a real-world bug we've already hit:
 #
@@ -104,9 +104,11 @@ fi
 # ---- 5. footgun: dot-vs-dash filename -------------------------------------
 #
-# Upstream unsloth/Qwen3.6-27B-GGUF uses dashes (Qwen3.6-27B-Q4_K_M.gguf).
-# Earlier commits used the wrong dot-separated pattern, which 404s.
-# Block re-introduction.
 blue "[*] grep: forbidden Qwen3.6-27B.Q* filename pattern"
 if grep -RnE 'Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf' \

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — repo-local sanity checks.
 #
 # Runs everything that's cheap and catches a real-world bug we've already hit:
 #
 # ---- 5. footgun: dot-vs-dash filename -------------------------------------
 #
+# Upstream llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF (and the
+# legacy unsloth/Qwen3.6-27B-GGUF) use dashes
+# (Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf,
+#  Qwen3.6-27B-Q4_K_M.gguf). Earlier commits used the wrong
+# dot-separated pattern, which 404s. Block re-introduction.
 blue "[*] grep: forbidden Qwen3.6-27B.Q* filename pattern"
 if grep -RnE 'Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf' \
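To make the behavior of the forbidden-filename pattern concrete, the sketch below runs the same regular expression from Python against a few names. It also shows that the pattern, as written, would not flag a dot-separated Heretic filename. The `WIDENED` pattern is a hypothetical variant, not something the repository is known to use.

```python
import re

# Pattern as it appears in the grep call.
LEGACY = re.compile(r"Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf")

# Hypothetical widened variant that also covers the Heretic v2 prefix.
WIDENED = re.compile(r"Qwen3\.6-27B(?:-uncensored-heretic-v2)?\.Q[0-9A-Z_]+\.gguf")

names = [
    "Qwen3.6-27B.Q4_K_M.gguf",                        # dotted legacy name
    "Qwen3.6-27B-Q4_K_M.gguf",                        # dashed legacy name
    "Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf",  # dashed Heretic name
    "Qwen3.6-27B-uncensored-heretic-v2.Q4_K_M.gguf",  # dotted Heretic name
]

for name in names:
    flagged_legacy = bool(LEGACY.search(name))
    flagged_widened = bool(WIDENED.search(name))
    print(f"{name}: legacy={flagged_legacy} widened={flagged_widened}")
```

With `grep -E`, the non-capturing group `(?:...)` would have to be written as a plain group `(...)`, since POSIX extended regular expressions do not support the `?:` syntax.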

scripts/check_bridge_sync.py

@@ -1,13 +1,13 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B — verify Modelfile and HF Ollama bridge files stay in sync.
 The repo ships two parallel Ollama configurations:
   - ``Modelfile`` is consumed by the local-build path (``ollama create -f Modelfile``).
     It contains ``TEMPLATE`` / ``SYSTEM`` / ``PARAMETER`` directives.
   - ``template`` / ``system`` / ``params`` at the repo root are consumed by HF's
-    Ollama bridge when users ``ollama run hf.co/FoolDev/Thanatos-27B`` directly. HF
     does NOT read the Modelfile (per https://huggingface.co/docs/hub/en/ollama).
 If the two configurations drift apart, ``hf.co/...`` users and ``make build``

 #!/usr/bin/env python3
 """
+Thanatos-Heretic-27B — verify Modelfile and HF Ollama bridge files stay in sync.
 The repo ships two parallel Ollama configurations:
   - ``Modelfile`` is consumed by the local-build path (``ollama create -f Modelfile``).
     It contains ``TEMPLATE`` / ``SYSTEM`` / ``PARAMETER`` directives.
   - ``template`` / ``system`` / ``params`` at the repo root are consumed by HF's
+    Ollama bridge when users ``ollama run hf.co/FoolDev/Thanatos-Heretic-27B`` directly. HF
     does NOT read the Modelfile (per https://huggingface.co/docs/hub/en/ollama).
 If the two configurations drift apart, ``hf.co/...`` users and ``make build``
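The idea of keeping the `Modelfile` and the bridge files in sync can be sketched as below. This is not the script's actual logic, which is not shown in this hunk. It assumes the `TEMPLATE` and `SYSTEM` directives use triple-quoted bodies, compares them after trimming surrounding whitespace, and leaves `PARAMETER` versus `params` out entirely. All helper names and sample strings are hypothetical.

```python
import re


def modelfile_block(modelfile_text: str, directive: str):
    """Return the triple-quoted body of a Modelfile directive, or None."""
    pattern = re.compile(
        rf'^{directive}\s+"""(.*?)"""', re.DOTALL | re.MULTILINE
    )
    match = pattern.search(modelfile_text)
    return match.group(1) if match else None


def in_sync(modelfile_text: str, template_text: str, system_text: str) -> bool:
    """True when both directive bodies equal the bridge file contents."""
    pairs = (("TEMPLATE", template_text), ("SYSTEM", system_text))
    for directive, bridge_text in pairs:
        body = modelfile_block(modelfile_text, directive)
        if body is None or body.strip() != bridge_text.strip():
            return False
    return True


# Tiny stand-in contents; the real files are much longer.
modelfile = (
    "FROM ./model.gguf\n"
    'TEMPLATE """{{ .Prompt }}"""\n'
    'SYSTEM """You are a test."""\n'
)
print(in_sync(modelfile, "{{ .Prompt }}\n", "You are a test."))
```

In a real check the three strings would be read from `Modelfile`, `template`, and `system` at the repo root, and a mismatch would fail the pre-commit run.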

scripts/fetch_vision.sh

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — fetch the vision projector (mmproj) for image input.
 #
 # Why this is separate from build.sh:
 #   build.sh is for the Ollama text path. The mmproj is only useful for
@@ -8,16 +8,20 @@
 #   it (see README Vision section, ollama/ollama#15898).
 #
 # Usage:
-#   ./scripts/fetch_vision.sh                    # default: F16, ~927 MB
-#   ./scripts/fetch_vision.sh BF16               # ~931 MB
-#   ./scripts/fetch_vision.sh F32                # ~1.8 GB
 #
 # Requires: huggingface-cli (or hf).
 set -euo pipefail
-PRECISION="${1:-${PRECISION:-F16}}"
-REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
-FILE_NAME="mmproj-${PRECISION}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"
@@ -58,7 +62,7 @@ fi
 echo
 echo "[+] Done. Use it via:"
 echo "    python ${ROOT}/examples/llama_cpp_vision.py \\"
-echo "        --gguf  /path/to/Qwen3.6-27B-Q4_K_M.gguf \\"
 echo "        --mmproj ${DEST} \\"
 echo "        --image /path/to/photo.jpg \\"
 echo "        --prompt 'Describe this image.'"

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — fetch the vision projector (mmproj) for image input.
 #
 # Why this is separate from build.sh:
 #   build.sh is for the Ollama text path. The mmproj is only useful for
 #   it (see README Vision section, ollama/ollama#15898).
 #
 # Usage:
+#   ./scripts/fetch_vision.sh                    # default: BF16 (~931 MB)
+#
+# llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF publishes BF16 only;
+# for F16/F32 variants fall back to unsloth's reference projector:
+#   REPO_ID=unsloth/Qwen3.6-27B-GGUF FILE_NAME=mmproj-F16.gguf ./scripts/fetch_vision.sh
+# (vision tokens are projected the same way across Qwen 3.6 27B
+# finetunes, so the unsloth projector is functionally interchangeable.)
 #
 # Requires: huggingface-cli (or hf).
 set -euo pipefail
+PRECISION="${1:-${PRECISION:-BF16}}"
+REPO_ID="${REPO_ID:-llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF}"
+FILE_NAME="${FILE_NAME:-Qwen3.6-27B-mmproj-${PRECISION}.gguf}"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"
 echo
 echo "[+] Done. Use it via:"
 echo "    python ${ROOT}/examples/llama_cpp_vision.py \\"
+echo "        --gguf  /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \\"
 echo "        --mmproj ${DEST} \\"
 echo "        --image /path/to/photo.jpg \\"
 echo "        --prompt 'Describe this image.'"
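The default-and-override chain for `PRECISION`, `REPO_ID`, and `FILE_NAME` can be mirrored in Python to see what each invocation resolves to. The `resolve_projector` helper is hypothetical; it takes the positional arguments and environment as plain values so the behavior is easy to check. As with bash's `:-` expansion, an empty value is treated the same as an unset one.

```python
def resolve_projector(argv, env):
    """Mirror the script's defaults: argument, then environment, then fallback."""
    first_arg = argv[0] if argv else ""
    precision = first_arg or env.get("PRECISION") or "BF16"
    repo_id = env.get("REPO_ID") or "llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF"
    file_name = env.get("FILE_NAME") or f"Qwen3.6-27B-mmproj-{precision}.gguf"
    return repo_id, file_name


# Default invocation: BF16 projector from the Heretic v2 GGUF repo.
print(resolve_projector([], {}))

# Fallback described in the usage comment: the unsloth F16 projector.
print(resolve_projector([], {
    "REPO_ID": "unsloth/Qwen3.6-27B-GGUF",
    "FILE_NAME": "mmproj-F16.gguf",
}))
```

Because `FILE_NAME` is resolved independently, overriding it is what lets the unsloth naming scheme (`mmproj-F16.gguf`) bypass the `Qwen3.6-27B-mmproj-` prefix used by the default repo.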

scripts/heal_hf_pull.sh

@@ -1,10 +1,10 @@
 #!/usr/bin/env bash
-# Thanatos-27B — heal a previously pulled HF-bridge tag whose bundled
 # GGUF is `qwen36`-stamped (legacy v0.6.0-era pulls before `964e418`,
 # 3rd-round-trip-era pulls between `973d7ef` and `978798f`, or
 # 5th-round-trip-era pulls between `ae67ed1` and `e03e10e`).
 #
-# Fresh pulls of `ollama run hf.co/FoolDev/Thanatos-27B` now get the
 # qwen35-stamped bundle and load directly — this script is the
 # recovery path for users who pulled a qwen36-stamped blob into
 # their local Ollama store during one of the qwen36 windows
@@ -13,7 +13,7 @@
 # It rebadges the HF-bridge tag's model blob in-place (qwen36 ->
 # qwen35, metadata-only, byte-identical tensors) and rewrites the
 # manifest's model-layer digest to point at the new blob. After
-# running, the cached `hf.co/FoolDev/Thanatos-27B` tag loads.
 #
 # Idempotent: a tag already on qwen35 / qwen35moe is left untouched.
 # The current bundle is qwen35-stamped so this script is a no-op for
@@ -22,13 +22,13 @@
 #
 # Usage:
 #   ./scripts/heal_hf_pull.sh                                # default tag
-#   TAG=hf.co/FoolDev/Thanatos-27B:Q4_K_M ./scripts/heal_hf_pull.sh
 #
 # Requires: ollama, jq, python3 with the `gguf` package, sha256sum.
 set -euo pipefail
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
-TAG="${TAG:-hf.co/FoolDev/Thanatos-27B:Q4_K_M}"
 OLLAMA_MODELS="${OLLAMA_MODELS:-${HOME}/.ollama/models}"
 red()   { printf "\033[31m%s\033[0m\n" "$*"; }
@@ -50,7 +50,7 @@ done
 # `ollama show --modelfile` writes a FROM line with the absolute blob path.
 # Reliable regardless of which case variant the user pulled with
-# (hf.co's 307 lets `Thanatos-27B` and `thanatos-27b` both resolve to the
 # canonical repo, and ollama stores the manifest under whichever case
 # was first registered).
 #
@@ -79,8 +79,8 @@ blue "[*] blob:   ${MODEL_BLOB}"
 # referenced from exactly one tag in the heal scenario — fresh HF pull
 # of a single :Q4_K_M tag — but if someone has multiple tags pointing
 # at the same blob, we filter down to the one matching ${TAG}.
-TAG_PATH="${TAG#hf.co/}"      # FoolDev/Thanatos-27B:Q4_K_M
-NAMESPACE_PATH="${TAG_PATH%:*}" # FoolDev/Thanatos-27B
 TAG_FILE="${TAG_PATH##*:}"    # Q4_K_M
 MANIFEST="$(find "${OLLAMA_MODELS}/manifests/hf.co" \

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — heal a previously pulled HF-bridge tag whose bundled
 # GGUF is `qwen36`-stamped (legacy v0.6.0-era pulls before `964e418`,
 # 3rd-round-trip-era pulls between `973d7ef` and `978798f`, or
 # 5th-round-trip-era pulls between `ae67ed1` and `e03e10e`).
 #
+# Fresh pulls of `ollama run hf.co/FoolDev/Thanatos-Heretic-27B` now get the
 # qwen35-stamped bundle and load directly — this script is the
 # recovery path for users who pulled a qwen36-stamped blob into
 # their local Ollama store during one of the qwen36 windows
 # It rebadges the HF-bridge tag's model blob in-place (qwen36 ->
 # qwen35, metadata-only, byte-identical tensors) and rewrites the
 # manifest's model-layer digest to point at the new blob. After
+# running, the cached `hf.co/FoolDev/Thanatos-Heretic-27B` tag loads.
 #
 # Idempotent: a tag already on qwen35 / qwen35moe is left untouched.
 # The current bundle is qwen35-stamped so this script is a no-op for
 #
 # Usage:
 #   ./scripts/heal_hf_pull.sh                                # default tag
+#   TAG=hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M ./scripts/heal_hf_pull.sh
 #
 # Requires: ollama, jq, python3 with the `gguf` package, sha256sum.
 set -euo pipefail
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+TAG="${TAG:-hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M}"
 OLLAMA_MODELS="${OLLAMA_MODELS:-${HOME}/.ollama/models}"
 red()   { printf "\033[31m%s\033[0m\n" "$*"; }
 # `ollama show --modelfile` writes a FROM line with the absolute blob path.
 # Reliable regardless of which case variant the user pulled with
+# (hf.co's 307 lets `Thanatos-Heretic-27B` and `thanatos-heretic-27b` both resolve to the
 # canonical repo, and ollama stores the manifest under whichever case
 # was first registered).
 #
 # referenced from exactly one tag in the heal scenario — fresh HF pull
 # of a single :Q4_K_M tag — but if someone has multiple tags pointing
 # at the same blob, we filter down to the one matching ${TAG}.
+TAG_PATH="${TAG#hf.co/}"      # FoolDev/Thanatos-Heretic-27B:Q4_K_M
+NAMESPACE_PATH="${TAG_PATH%:*}" # FoolDev/Thanatos-Heretic-27B
 TAG_FILE="${TAG_PATH##*:}"    # Q4_K_M
 MANIFEST="$(find "${OLLAMA_MODELS}/manifests/hf.co" \
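The three parameter expansions above (`#hf.co/`, `%:*`, `##*:`) split the tag into a repo path and a quant label. A Python sketch of the same split may make the result easier to see. The function name is hypothetical, and `str.removeprefix` needs Python 3.9 or newer.

```python
def split_hf_tag(tag: str):
    """Return (tag_path, namespace_path, tag_file) like the bash expansions.

    tag_path       drops a leading "hf.co/"             (${TAG#hf.co/})
    namespace_path drops the last ":" and what follows  (${TAG_PATH%:*})
    tag_file       keeps what follows the last ":"      (${TAG_PATH##*:})
    """
    tag_path = tag.removeprefix("hf.co/")
    head, sep, tail = tag_path.rpartition(":")
    if not sep:
        # With no colon, both bash expansions leave the string unchanged.
        return tag_path, tag_path, tag_path
    return tag_path, head, tail


print(split_hf_tag("hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M"))
```

The no-colon branch highlights an edge case: a tag passed without a quant suffix would yield a `TAG_FILE` equal to the whole path, so the manifest lookup would likely need an explicit `:Q4_K_M`-style suffix.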

scripts/install-hooks.sh

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — install scripts/check.sh as a git pre-commit hook.
 #
 # Idempotent. Re-runs are safe.
 set -euo pipefail

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — install scripts/check.sh as a git pre-commit hook.
 #
 # Idempotent. Re-runs are safe.
 set -euo pipefail

scripts/load_bundle.sh

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — load this repo's bundle into Ollama as a local tag.
 #
 # The bundled GGUF (Thanatos-27B.Q4_K_M.gguf) is qwen35-stamped and
 # loads directly on stock llama.cpp / Ollama. This script is the
@@ -15,13 +15,13 @@
 #   3. Run `ollama create <tag> -f <temp Modelfile pointing at the
 #      resolved bundle>`.
 #
-# Useful if you want a bare local tag (`thanatos-27b`) rather than
-# the `hf.co/FoolDev/Thanatos-27B` path. The legacy qwen36 rebadge
 # branch is kept for anyone working from a pre-e03e10e checkout.
 #
 # Usage:
-#   ./scripts/load_bundle.sh                 # default tag: thanatos-27b
-#   TAG=thanatos-27b-bundle ./scripts/load_bundle.sh
 #   BUNDLE=/path/to/Thanatos-27B.Q4_K_M.gguf ./scripts/load_bundle.sh
 #
 # Requires: ollama, python3 with the `gguf` package, hf (if the bundle
@@ -30,8 +30,8 @@ set -euo pipefail
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 BUNDLE="${BUNDLE:-${ROOT}/Thanatos-27B.Q4_K_M.gguf}"
-TAG="${TAG:-thanatos-27b}"
-REPO_ID="${REPO_ID:-FoolDev/Thanatos-27B}"
 MODELFILE="${ROOT}/Modelfile"
 red()    { printf "\033[31m%s\033[0m\n" "$*"; }

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — load this repo's bundle into Ollama as a local tag.
 #
 # The bundled GGUF (Thanatos-27B.Q4_K_M.gguf) is qwen35-stamped and
 # loads directly on stock llama.cpp / Ollama. This script is the
 #   3. Run `ollama create <tag> -f <temp Modelfile pointing at the
 #      resolved bundle>`.
 #
+# Useful if you want a bare local tag (`thanatos-heretic-27b`) rather than
+# the `hf.co/FoolDev/Thanatos-Heretic-27B` path. The legacy qwen36 rebadge
 # branch is kept for anyone working from a pre-e03e10e checkout.
 #
 # Usage:
+#   ./scripts/load_bundle.sh                 # default tag: thanatos-heretic-27b
+#   TAG=thanatos-heretic-27b-bundle ./scripts/load_bundle.sh
 #   BUNDLE=/path/to/Thanatos-27B.Q4_K_M.gguf ./scripts/load_bundle.sh
 #
 # Requires: ollama, python3 with the `gguf` package, hf (if the bundle
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 BUNDLE="${BUNDLE:-${ROOT}/Thanatos-27B.Q4_K_M.gguf}"
+TAG="${TAG:-thanatos-heretic-27b}"
+REPO_ID="${REPO_ID:-FoolDev/Thanatos-Heretic-27B}"
 MODELFILE="${ROOT}/Modelfile"
 red()    { printf "\033[31m%s\033[0m\n" "$*"; }
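
Reviewer note: `BUNDLE`, `TAG`, and `REPO_ID` all use the `${VAR:-default}` form, which falls back to the default when the variable is unset or empty. The sketch below shows the same resolution rule in Python, using the new defaults from this diff. The helper name `resolve_settings` is hypothetical and exists only for illustration.

```python
from pathlib import Path
from typing import Mapping


def resolve_settings(env: Mapping[str, str], root: Path) -> dict[str, str]:
    """Apply ${VAR:-default} semantics: unset OR empty falls back to the default."""

    def pick(name: str, default: str) -> str:
        # `or` treats both a missing key and an empty string as "use the default".
        return env.get(name) or default

    return {
        "BUNDLE": pick("BUNDLE", str(root / "Thanatos-27B.Q4_K_M.gguf")),
        "TAG": pick("TAG", "thanatos-heretic-27b"),
        "REPO_ID": pick("REPO_ID", "FoolDev/Thanatos-Heretic-27B"),
    }


if __name__ == "__main__":
    import os

    print(resolve_settings(os.environ, Path.cwd()))
```

Note that only the tag and repo ID defaults change in this commit. The default bundle filename stays `Thanatos-27B.Q4_K_M.gguf`.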

scripts/smoke_test.sh CHANGED

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B — smoke test against a running Ollama daemon.
 #
 # Verifies:
 #   1. The Ollama server is reachable.
@@ -14,11 +14,11 @@
 # Usage:
 #   ./scripts/smoke_test.sh                       # fast checks only
 #   TOOLS_TEST=1 ./scripts/smoke_test.sh          # add tool-call round-trip
-#   MODEL=hf.co/FoolDev/Thanatos-27B:Q4_K_M ./scripts/smoke_test.sh
 #   HOST=http://localhost:11434 ./scripts/smoke_test.sh
 set -euo pipefail
-MODEL="${MODEL:-thanatos-27b}"
 HOST="${HOST:-http://localhost:11434}"
 PROMPT="${PROMPT:-Reply with the single word: OK}"
@@ -46,9 +46,9 @@ green "[+] server reachable"
 # 2. Model present? Match case-insensitively: Ollama 0.24 normalizes
 # model names at lookup but preserves whatever case was first registered
-# on disk (e.g. `make load-bundle` may produce `Thanatos-27B:latest`
-# even when invoked with TAG=thanatos-27b, if an earlier session left a
-# Thanatos-27B manifest dir behind). The exact tag the user typed is
 # still valid for `ollama run` — the comparison just needs to be
 # case-folded to match.
 if ! curl -fsS "${HOST}/api/tags" | jq -e --arg m "${MODEL}" '.models[] | select((.name | ascii_downcase) | startswith($m | ascii_downcase))' >/dev/null; then

 #!/usr/bin/env bash
+# Thanatos-Heretic-27B — smoke test against a running Ollama daemon.
 #
 # Verifies:
 #   1. The Ollama server is reachable.
 # Usage:
 #   ./scripts/smoke_test.sh                       # fast checks only
 #   TOOLS_TEST=1 ./scripts/smoke_test.sh          # add tool-call round-trip
+#   MODEL=hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M ./scripts/smoke_test.sh
 #   HOST=http://localhost:11434 ./scripts/smoke_test.sh
 set -euo pipefail
+MODEL="${MODEL:-thanatos-heretic-27b}"
 HOST="${HOST:-http://localhost:11434}"
 PROMPT="${PROMPT:-Reply with the single word: OK}"
 # 2. Model present? Match case-insensitively: Ollama 0.24 normalizes
 # model names at lookup but preserves whatever case was first registered
+# on disk (e.g. `make load-bundle` may produce `Thanatos-Heretic-27B:latest`
+# even when invoked with TAG=thanatos-heretic-27b, if an earlier session left a
+# Thanatos-Heretic-27B manifest dir behind). The exact tag the user typed is
 # still valid for `ollama run` — the comparison just needs to be
 # case-folded to match.
 if ! curl -fsS "${HOST}/api/tags" | jq -e --arg m "${MODEL}" '.models[] | select((.name | ascii_downcase) | startswith($m | ascii_downcase))' >/dev/null; then
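
Reviewer note: the jq filter above lowercases both the registered model name and the requested `MODEL`, then does a prefix match, so `thanatos-heretic-27b` matches a registered `Thanatos-Heretic-27B:latest`. The sketch below reproduces that check in Python against an already-decoded `/api/tags` response. The function name `model_present` is hypothetical. One assumed difference: `str.lower()` also folds non-ASCII letters, whereas jq's `ascii_downcase` touches only ASCII, which does not matter for the names in this repo.

```python
from typing import Any, Mapping


def model_present(tags_response: Mapping[str, Any], model: str) -> bool:
    """Case-folded prefix match over the `models[].name` entries."""
    wanted = model.lower()
    return any(
        str(entry.get("name", "")).lower().startswith(wanted)
        for entry in tags_response.get("models", [])
    )


if __name__ == "__main__":
    sample = {"models": [{"name": "Thanatos-Heretic-27B:latest"}]}
    print(model_present(sample, "thanatos-heretic-27b"))
```

Because this is a prefix match, the old name `thanatos-27b` does not match a model registered as `Thanatos-Heretic-27B:latest`, so callers still passing the old default would fail this check after the rename.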

scripts/verify_arch.py CHANGED

@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B — verify the README "Architecture" forward-pass bullets
 against the actual GGUF metadata.
 Reads either the qwen35- or qwen36-stamped bundle (or any GGUF that
@@ -69,8 +69,8 @@ def main() -> int:
         return 2
     root = Path(__file__).resolve().parent.parent
     default_paths = [
-        root / "Thanatos-27B.Q4_K_M.qwen35.gguf",
-        root / "Thanatos-27B.Q4_K_M.qwen36.gguf",
         root / "Thanatos-27B.Q4_K_M.gguf",
     ]
     if len(sys.argv) == 2:
@@ -78,7 +78,7 @@ def main() -> int:
     else:
         path = next((p for p in default_paths if p.exists() and p.stat().st_size > 1024), None)
         if path is None:
-            print("[!] no Thanatos-27B GGUF found in repo root; pass a path explicitly", file=sys.stderr)
             return 2
     print(f"[*] reading: {path}")

 #!/usr/bin/env python3
 """
+Thanatos-Heretic-27B — verify the README "Architecture" forward-pass bullets
 against the actual GGUF metadata.
 Reads either the qwen35- or qwen36-stamped bundle (or any GGUF that
         return 2
     root = Path(__file__).resolve().parent.parent
     default_paths = [
+        root / "Thanatos-Heretic-27B.Q4_K_M.qwen35.gguf",
+        root / "Thanatos-Heretic-27B.Q4_K_M.qwen36.gguf",
         root / "Thanatos-27B.Q4_K_M.gguf",
     ]
     if len(sys.argv) == 2:
     else:
         path = next((p for p in default_paths if p.exists() and p.stat().st_size > 1024), None)
         if path is None:
+            print("[!] no Thanatos-Heretic-27B GGUF found in repo root; pass a path explicitly", file=sys.stderr)
             return 2
     print(f"[*] reading: {path}")
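
Reviewer note: the fallback logic in `main()` picks the first default path that exists and is larger than 1024 bytes. The standalone sketch below isolates that selection so it can be exercised without a real GGUF. The helper name `first_usable` and the `min_bytes` parameter are hypothetical. The reading that the size floor is there to skip small placeholder files (for example, Git LFS pointers) is an assumption, not something the diff states.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_usable(paths: Iterable[Path], min_bytes: int = 1024) -> Optional[Path]:
    """Return the first path that exists and is strictly larger than min_bytes."""
    return next(
        (p for p in paths if p.exists() and p.stat().st_size > min_bytes),
        None,
    )


if __name__ == "__main__":
    root = Path(__file__).resolve().parent
    # Same candidate order as the new side of the diff.
    candidates = [
        root / "Thanatos-Heretic-27B.Q4_K_M.qwen35.gguf",
        root / "Thanatos-Heretic-27B.Q4_K_M.qwen36.gguf",
        root / "Thanatos-27B.Q4_K_M.gguf",
    ]
    print(first_usable(candidates))
```

After this change the first two candidates carry the new `Thanatos-Heretic-27B` prefix while the third keeps the old `Thanatos-27B` filename, so order matters when more than one file is present in the repo root.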