Instructions to use FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use FoolDev/Thanatos-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")

llama-cpp-python

How to use FoolDev/Thanatos-27B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FoolDev/Thanatos-27B",
	filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use FoolDev/Thanatos-27B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M


vLLM

How to use FoolDev/Thanatos-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

SGLang

How to use FoolDev/Thanatos-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FoolDev/Thanatos-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use FoolDev/Thanatos-27B with Ollama:
```
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Unsloth Studio

How to use FoolDev/Thanatos-27B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Pi

How to use FoolDev/Thanatos-27B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FoolDev/Thanatos-27B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FoolDev/Thanatos-27B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
```
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Lemonade

How to use FoolDev/Thanatos-27B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M

Run and chat with the model

lemonade run user.Thanatos-27B-Q4_K_M

List all available models

lemonade list

Thanatos-27B / README.md

FoolDev

Rename back: Thanatos-27B-Heretic → Thanatos-27B (HF repo also renamed)

7197abd 6 days ago


	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3.6-27B
	datasets:
	- crownelius/Creative_Writing_ShareGPT_Enhanced
	- microsoft/rStar-Coder
	- peteromallet/dataclaw-peteromallet
	- crownelius/Opus-4.7-Reasoning
	- openbmb/UltraData-Math
	- Crownelius/Crow-Heretic-TeichAI-Unified
	language:
	- en
	- zh
	- ru
	- es
	- fr
	- it
	- ja
	- ko
	- de
	- ar
	- tr
	- pl
	- sv
	- nl
	- he
	- id
	- uk
	- fa
	- pt
	- ms
	- fi
	- el
	tags:
	- qwen36
	- dense
	- conversational
	- multimodal
	- agent
	- gguf
	- ollama
	- imatrix
	library_name: transformers
	pipeline_tag: image-text-to-text
	---

	<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />

	[![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
	[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
	[![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
	[![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
	[![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)

	# Thanatos-27B

	> Dense Reasoning. Friendlier Footprint.
	> Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.

	`Architecture:` `Qwen 3.6 27B (Dense)` | `Parameters:` `27B` | `Teacher:` `Claude Opus 4.7` | `Type:` `Distilled LLM`

	A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the dense [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.

	## TL;DR

	One-liner via Hugging Face (pulls a GGUF + this repo's root-level
	`template` / `system` / `params` files, including the tool-calling
	template — HF's Ollama bridge ingests those three files, not
	`Modelfile`):

	```bash
	ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
	```

	If you pulled the bundle during any of the qwen36 windows on the
	pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
	have a qwen36-stamped blob in your local Ollama store, `make
	heal-hf` rebadges it in place. Fresh pulls go straight through.

	For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
	QUANT=...` is the simplest path. See [Quick start](#quick-start)
	below for the full matrix.

	For image input use llama.cpp directly — Ollama vision is broken for
	this architecture upstream (see [Vision](#vision)).

	## Why a 27B variant?

	The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but memory-hungry at load time — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.

	The 27B is dense: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.

	| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
	|---|---|---|
	| Architecture | Dense transformer | MoE 256 experts, 8 active |
	| Total params | 27 B | 35 B |
	| Active params per token | 27 B | ~3 B |
	| Layers | 64 | 40 |
	| Hidden size | 5120 | 2048 |
	| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
	| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
	| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
	| Multimodal (text path) | Yes | Yes |
	| Multimodal (vision via Ollama) | Broken upstream (see below) | Broken upstream |
	| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
	| Max context | 262 144 | 262 144 |
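The GGUF sizes in the table follow roughly from parameter count times bits per weight. The sketch below illustrates that arithmetic for the dense 27B; the bits-per-weight figures are approximate assumptions for llama.cpp K-quants, not values taken from this repo, and real files differ slightly because some tensors are stored at higher precision.

```python
# Approximate average bits per weight for some llama.cpp quant types.
# ASSUMPTION: rough illustrative figures, not measured from this bundle.
APPROX_BITS_PER_WEIGHT = {
    "Q3_K_S": 3.5,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
}


def estimate_gguf_gb(total_params: float, quant: str) -> float:
    """Estimate GGUF size in GB (10^9 bytes) from the total parameter count."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return total_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_gguf_gb(27e9, quant):.1f} GB")
```

Load-time memory scales with total parameters, which is why the dense model's footprint is simply its file size plus KV cache and runtime overhead.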

	## What's here

	| File | Use |
	|---|---|
	| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
	| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
	| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF, used by `make build` / `ollama create` for local builds |
	| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does not read `Modelfile`; see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
	| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
	| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
	| `scripts/load_bundle.sh` | One-shot path from this repo's bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts; it is a no-op on the current qwen35-stamped bundle. |
	| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename `FoolDev/Thanatos-27B`) before the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35; fresh pulls don't need it. |
	| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
	| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
	| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream; see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
	| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
	| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
	| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
	| `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
	| `Makefile` | Convenience wrapper; `make help` lists targets |
	| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
	| `CHANGELOG.md` | Versioned tooling/docs changes |
	| `README.md` | This file |

	For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
	downloads the smaller ~12 GB Q3_K_S quant from
	`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
	creates a local `thanatos-27b` Ollama tag. That quant is not redistributed
	via this repo. For other quants, use `make build QUANT=...`. The
	local-build path applies this repo's `Modelfile`; the `hf.co/...`
	path applies the root-level `template`, `system`, and `params`
	files (kept in sync with the `Modelfile`).
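Keeping the `Modelfile` and the bridge files in sync can be checked mechanically. The following is a minimal sketch of such a comparison for sampling parameters only; it is not the repo's `scripts/check_bridge_sync.py`, the function names are hypothetical, and it assumes the `params` file is a JSON object as described in the HF Ollama docs.

```python
import json


def parse_modelfile_parameters(modelfile_text: str) -> dict:
    """Collect PARAMETER directives; repeated keys (e.g. stop) become lists."""
    params = {}
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0].upper() == "PARAMETER":
            _, key, value = parts
            params.setdefault(key, []).append(value.strip('"'))
    return params


def normalize_params_json(params_json: str) -> dict:
    """Bring a JSON params object into the same key -> list-of-strings shape."""
    normalized = {}
    for key, value in json.loads(params_json).items():
        values = value if isinstance(value, list) else [value]
        normalized[key] = [str(v) for v in values]
    return normalized


def params_in_sync(modelfile_text: str, params_json: str) -> bool:
    """True when both sources describe the same parameters and values."""
    return parse_modelfile_parameters(modelfile_text) == normalize_params_json(params_json)
```

The comparison is textual, so `1` in one file and `1.0` in the other would be reported as a mismatch; a stricter tool would compare numeric values.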

	If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).

	## Architecture

	<p align="left">
	<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
	</p>

	- Qwen 3.6 dense, 27B parameters, 64 transformer layers
	- Hybrid attention stack: 16 repeats of `[3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]`
	- Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
	- Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
	- Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
	- Vocab 248,320 (shared with 35B-A3B sibling)
	- 262 144 native context, extensible to ~1 M with YaRN
	- Vision + video supported by the base architecture via a separate
	`mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
	from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
	current loader compatibility.
	- Multi-token prediction (MTP) head trained for speculative decoding —
	present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
	vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
	Not usable via llama.cpp / Ollama today: the GGUF converter
	(`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
	`qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
	inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
	851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
	merged 2026-05-16) currently covers other architectures only;
	that PR's follow-up work is being tracked for when qwen35 /
	qwen35moe consumer support lands. (Earlier README versions claimed
	MTP was available without this caveat. The missing MTP tensors were
	confirmed empirically via `gguf.GGUFReader` on both this bundle and
	`unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
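The numbers in the bullets above can be cross-checked with a few lines of arithmetic. This sketch rebuilds the layer layout from the stated repeat pattern and derives the ratios; the helper name is hypothetical, and reading "partial RoPE (factor 0.25)" as "a quarter of each head's dimensions are rotated" is an assumption.

```python
def layer_pattern(repeats: int = 16, linear_per_block: int = 3, full_per_block: int = 1) -> list:
    """Expand N repeats of [3 x Gated DeltaNet, 1 x Gated Attention] into a flat layer list."""
    block = ["deltanet"] * linear_per_block + ["attention"] * full_per_block
    return block * repeats


layers = layer_pattern()
total_layers = len(layers)                    # 16 * (3 + 1)
deltanet_layers = layers.count("deltanet")    # linear-attention layers
attention_layers = layers.count("attention")  # softmax-attention layers

ffn_ratio = 17408 / 5120         # FFN intermediate size / hidden size
gqa_group_size = 24 // 4         # Q-heads sharing each KV-head
rotary_dims = int(256 * 0.25)    # ASSUMPTION: head_dim * partial RoPE factor

if __name__ == "__main__":
    print(total_layers, deltanet_layers, attention_layers)
    print(round(ffn_ratio, 2), gqa_group_size, rotary_dims)
```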

	The bundled GGUF declares `general.architecture: 'qwen35'` — not a
	workaround for an unimplemented `qwen36` arch, but the canonical
	upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention
	family. The naming convergence runs through three layers of the
	stack:

	- Qwen's own HF configs. `Qwen/Qwen3.6-27B/config.json` declares
	`"model_type": "qwen3_5"` and
	`"architectures": ["Qwen3_5ForConditionalGeneration"]`. The MoE
	sibling `Qwen/Qwen3.6-35B-A3B` declares `"qwen3_5_moe"` /
	`Qwen3_5MoeForConditionalGeneration`. No `Qwen3_6` arch class
	exists in `transformers`; Qwen reuses the 3.5 class names.
	- llama.cpp's converter. `convert_hf_to_gguf.py` registers
	`Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
	`Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
	GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
	`unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
	- llama.cpp's model code. `src/models/qwen35.cpp` has an
	explicit `case 64: type = LLM_TYPE_27B` branch for this model;
	`qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
	Janus-35B sibling base. The arch entries were written to load
	Qwen 3.6 weights, not just Qwen 3.5.

	There is no PR or tracking issue for a `qwen36` arch entry in
	`ggml-org/llama.cpp` or `ollama/ollama` because none is needed —
	`qwen35` already loads the model the upstream code path was
	designed to load.

	`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
	Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
	loaders.
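To see which stamp a given GGUF carries, the metadata header can be read directly. The sketch below uses only the standard library instead of `gguf.GGUFReader`; it assumes the GGUF v2/v3 header layout (64-bit counts) and that `general.architecture` is the first metadata entry, which is conventional but not guaranteed. Function names are illustrative.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # value-type id used for strings in GGUF metadata


def _read_string(stream) -> str:
    # GGUF strings are a little-endian uint64 length followed by UTF-8 bytes.
    (length,) = struct.unpack("<Q", stream.read(8))
    return stream.read(length).decode("utf-8")


def read_first_string_kv(stream) -> tuple:
    """Return (key, value) of the first metadata entry, which must be a string."""
    if stream.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # After the magic: version (uint32), tensor count and KV count (uint64 each).
    _version, _tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    if kv_count == 0:
        raise ValueError("file has no metadata entries")
    key = _read_string(stream)
    (value_type,) = struct.unpack("<I", stream.read(4))
    if value_type != GGUF_TYPE_STRING:
        raise ValueError(f"first entry {key!r} is not a string")
    return key, _read_string(stream)


def read_first_string_kv_from_bytes(data: bytes) -> tuple:
    """Convenience wrapper for in-memory data, e.g. a synthetic header."""
    return read_first_string_kv(io.BytesIO(data))
```

For a real file, open it in binary mode and pass the handle: `with open("Thanatos-27B.Q4_K_M.gguf", "rb") as f: print(read_first_string_kv(f))`. Only the first few dozen bytes are read, so no LFS-sized load is needed.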

	### History

	The bundle's `general.architecture` stamp has now flipped eight
	times — four landings on qwen36 and four on qwen35 — each time
	after weighing the friction-vs-honesty tradeoff anew. The saga
	is resolved on the upstream-canonical `qwen35` side:

	- v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC): initial qwen35
	→ qwen36 stamp, on the theory that qwen35 was a loader stand-in
	awaiting proper Qwen 3.6 support. Upstream audit later showed
	that theory was mistaken (see above).
	- 2026-05-19 afternoon (`964e418`): flipped back to qwen35
	after daily friction outweighed version-specificity for that
	iteration; doc workaround narrative collapsed (`83022eb`).
	- 2026-05-19 evening (`07fa120`): brief re-flip to qwen36
	during a fresh-pull integration test on Strix Halo.
	- 2026-05-19 evening (`72259c1`, ~1 hour later): reverted to
	qwen35 again because the live friction was worse than the doc
	prose suggested.
	- 2026-05-19 evening (`973d7ef`): flipped to qwen36 one more
	time, after the upstream-evidence audit had been shipped and
	the friction was a known quantity. Project owner wanted to
	test the friction tradeoff in practice with the audit's
	conclusion staring them in the face.
	- 2026-05-19 evening (`978798f`): flipped back to qwen35
	after seven sequential fresh-pull → heal-hf cycles on the
	Strix Halo box made the friction concretely-experienced
	rather than hypothetical. Each cycle worked (the heal flow
	is solid) — and each cycle was an unnecessary obstacle for
	users who just want `ollama run` to work first try. The
	audit (`a4d3b6e`) called the canonical stamp correctly and
	the practical friction outweighed the version-specificity
	payoff.
	- 2026-05-20 midday (`ae67ed1`): brief re-flip to qwen36
	the next morning to re-test the friction in a fresh session.
	- 2026-05-20 midday (`e03e10e`, 8 minutes later): flipped
	back to qwen35. Same conclusion as the prior round trip —
	friction outweighs version-specificity. **This is the
	current state.**

	Tensor data was byte-identical across all stamps; only the
	`general.architecture` KV (and namespaced KV keys) flipped.
	See the [CHANGELOG](CHANGELOG.md) entries for each flip's
	rationale.

	### Rebadge utility

	`scripts/rename_arch.py` is the generic GGUF arch renamer
	(metadata only, tensors byte-identical), kept in the repo for
	the legacy qwen36 → qwen35 in-store rebadge (used by `make
	heal-hf` and `make load-bundle`) and any future arch flip:

	```bash
	# qwen36 -> qwen35 (the legacy recovery direction, for blobs
	# pulled from the pre-rename FoolDev/Thanatos-27B repo)
	python3 scripts/rename_arch.py \
	--from-arch qwen36 --to-arch qwen35 \
	Thanatos-27B.Q4_K_M.qwen36.gguf \
	Thanatos-27B.Q4_K_M.gguf
	```
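Conceptually, a rebadge touches only metadata keys: the `general.architecture` value and every key namespaced under the old arch name. The sketch below shows that key mapping on a plain dictionary; it is an illustration of the idea, not the code in `scripts/rename_arch.py`, and the function name and sample keys are hypothetical.

```python
def rebadge_keys(metadata: dict, from_arch: str, to_arch: str) -> dict:
    """Return a copy of metadata with the arch stamp and namespaced keys renamed."""
    prefix = from_arch + "."
    renamed = {}
    for key, value in metadata.items():
        if key == "general.architecture" and value == from_arch:
            value = to_arch  # flip the stamp itself
        elif key.startswith(prefix):
            key = to_arch + "." + key[len(prefix):]  # e.g. qwen36.block_count
        renamed[key] = value
    return renamed


if __name__ == "__main__":
    sample = {"general.architecture": "qwen36", "qwen36.block_count": 64}
    print(rebadge_keys(sample, "qwen36", "qwen35"))
```

Applying it to metadata that is already on the target arch changes nothing, which mirrors the idempotent behavior described for `make heal-hf`.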

	## Quick start

	### Ollama

	Three paths:

	```bash
	# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
	# root-level template / system / params files in one step):
	ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen35-stamped

	# B. Build a local `thanatos-27b` tag from THIS repo's bundle
	# (LFS smudge if needed, then `ollama create`). Useful if you
	# want a bare local tag rather than the `hf.co/...` path:
	make load-bundle # creates local tag thanatos-27b
	ollama run thanatos-27b

	# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
	# and build locally. Loads on every current llama.cpp / Ollama.
	make build # Q4_K_M -> thanatos-27b
	make build QUANT=Q3_K_S # 12 GB smaller quant
	make build QUANT=Q5_K_M # 20 GB higher quality
	make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
	ollama run thanatos-27b
	```

	Under the hood, `make build` calls `scripts/build.sh`, which downloads the
	GGUF if missing (set `GGUF_PATH` to point at one you already have) and
	runs `ollama create` with the matching `Modelfile`.

	If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
	run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.

	Confirm everything works:

	```bash
	make smoke # checks server, model, round-trip, no token leakage
	make smoke-tools # adds an end-to-end tool-call round-trip (~10s extra)
	make bench # measured tok/s on this machine (3-prompt mix)
	python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-compat
	```

	### Local apps

	| App | How to load this model |
	|---|---|
	| Ollama | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does not read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
	| LM Studio | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
	| Jan | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
	| llama.cpp | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
	| llama-cpp-python | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
	| Open WebUI / KoboldCpp / text-generation-webui | Standard llama.cpp loader path: point at the GGUF, use the embedded chat template. |

	For the full Vision (image input) loader matrix, see [Vision](#vision).
	Tool calling currently works in Ollama (via the root-level
	`template` file when pulling from `hf.co/...`, or via the `Modelfile`
	TEMPLATE when building locally) and llama.cpp / llama-cpp-python
	(via the GGUF's embedded jinja). Other apps' tool-calling support
	depends on whether they read the embedded template or require an
	external schema.

	### Inference (OpenAI-compatible)

	```bash
	curl -s http://localhost:11434/v1/chat/completions \
	  -H 'Content-Type: application/json' \
	  -d '{
	    "model": "thanatos-27b",
	    "messages": [
	      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
	      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
	    ],
	    "temperature": 0.6
	  }' | jq -r '.choices[0].message.content'
	```
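The same request can be made from Python without extra dependencies. This is a minimal sketch using the standard library; the helper names are hypothetical, and the base URL and model tag are the ones from the curl example above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, system: str, user: str,
                       temperature: float = 0.6) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_content(response_body: bytes) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

With a server running, send it with `urllib.request.urlopen(build_chat_request("http://localhost:11434/v1", "thanatos-27b", "You are Thanatos, a precise reasoning assistant.", "Explain the Burrows-Wheeler transform in 200 words."))` and pass the response's `.read()` to `extract_content`.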

	### Recommended sampling

	| Use | temp | top_p | top_k | repeat_penalty |
	|---|---:|---:|---:|---:|
	| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
	| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |

	If the model loops inside `<think>` tags, lower the temperature to 0.4-0.6 and raise `repeat_penalty` to 1.08.
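These presets map directly onto the `options` object of an Ollama request. The sketch below encodes the table and the loop workaround; the preset names and the `looping` flag are hypothetical, and choosing 0.5 as the lowered temperature is an assumption within the recommended 0.4-0.6 range.

```python
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def sampling_options(use: str, looping: bool = False) -> dict:
    """Return sampling options for a preset, optionally tightened against think-loops."""
    options = dict(SAMPLING_PRESETS[use])  # copy so the preset stays untouched
    if looping:
        # ASSUMPTION: 0.5 is an arbitrary midpoint of the 0.4-0.6 range.
        options["temperature"] = min(options["temperature"], 0.5)
        options["repeat_penalty"] = 1.08
    return options
```

The returned dictionary can be passed as `"options"` in an `/api/chat` request body.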

	### System prompt

	The Modelfile bakes this in. Override per-request via the `system` role
	in your client:

	```text
	You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

	Behavior rules:
	- Answer the user's actual request directly.
	- Be accurate, complete, and structured.
	- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
	- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
	- If the user wants creative writing, preserve tone, continuity, and character consistency.
	- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
	- Finish with a usable answer, not just planning.
	```

	## Vision

	The Qwen 3.6 base supports image (and video) input via a separate
	`mmproj` projector. The full multimodal stack is:

	```
	Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
	mmproj-F16.gguf (~927 MB, the vision projector)
	```

	Both files are at
	[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
	This repo intentionally does not redistribute either.

	### Loader compatibility — the honest table

	| Loader | Text | Vision (mmproj) | Notes |
	|---|---|---|---|
	| llama.cpp (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
	| llama-cpp-python | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
	| Ollama 0.24 | ✅ | ❌ | Text inference works: Ollama's Go engine has the `qwen35` / `qwen35moe` arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. `ollama create` accepts a dual-`FROM` (text + mmproj) and `ollama show` reports `vision` capability, but the first inference request fails with `error loading model architecture: unknown model architecture: 'qwen35'` (or `'qwen35moe'`), and once mmproj is attached this blocks text inference too. See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). |
	| LM Studio | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |

	### Vision via llama.cpp

	Three flavors, in order of build-time effort:

	```bash
	# A. HTTP via llama-server (always built — the easiest path).
	# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
	# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
	llama-server \
	-m Qwen3.6-27B-Q4_K_M.gguf \
	--mmproj mmproj-F16.gguf \
	--host 127.0.0.1 --port 8765 -c 8192 -ngl 99
	# then POST OpenAI-style chat completions with an image_url content
	# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
	# The thinking trace arrives in message.reasoning_content; the visible
	# answer is in message.content. Budget ≥500 max_tokens so the reasoning
	# block doesn't crowd out the final answer.

	# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
	# so a selective `cmake --build build --target llama-cli ...` won't
	# produce it — a plain `cmake --build build` will. If yours didn't,
	# run `cmake --build build --target llama-mtmd-cli`.
	llama-mtmd-cli \
	-m Qwen3.6-27B-Q4_K_M.gguf \
	--mmproj mmproj-F16.gguf \
	--image photo.jpg \
	-p "Describe this image."

	# C. Python via llama-cpp-python:
	python examples/llama_cpp_vision.py \
	--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
	--mmproj /path/to/mmproj-F16.gguf \
	--image /path/to/photo.jpg \
	--prompt "What is in this image?"
	```
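For path A, the request body needs the image as a base64 data URL inside an `image_url` content block. The sketch below builds such a payload and splits the reply into reasoning and answer; the helper names are hypothetical, and the default `max_tokens` of 1024 is an assumption chosen to stay above the suggested 500-token floor.

```python
import base64


def image_message(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build a user message with a text block and a base64 data-URL image block."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


def vision_payload(image_bytes: bytes, prompt: str, max_tokens: int = 1024) -> dict:
    """Chat-completions body; max_tokens leaves room for the reasoning trace."""
    return {"messages": [image_message(image_bytes, prompt)], "max_tokens": max_tokens}


def split_reasoning(message: dict) -> tuple:
    """Return (reasoning trace, visible answer) from a response message."""
    return message.get("reasoning_content", ""), message.get("content", "")
```

Serialize `vision_payload(open("photo.jpg", "rb").read(), "Describe this image.")` as JSON and POST it to the server's `/v1/chat/completions` endpoint on the host and port you started `llama-server` with.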

	Until the Ollama upstream issue is fixed, treat Ollama as text-only
	for this model.

	## Hardware requirements

	The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.

	| Hardware | Status |
	|---|---|
	| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
	| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
	| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
	| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
	| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |

	Most numbers in this table are estimates from comparable models; the
	gradient is right but the absolute values will move ±20% with prompt
	shape, KV cache type, and parallel-request count. Measure your own
	machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
	`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
	data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
	~12.3 tok/s at Q3_K_S and ~9.3 tok/s at Q4_K_M (3-prompt mix,
	steady across short / medium / long prompts), sitting between CPU-only
	and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
	same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
	this hardware.
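The tok/s figure comes from two fields Ollama returns with each response: `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds). The sketch below shows the calculation; it is not the code in `scripts/bench.sh`, and averaging by total tokens over total time is one reasonable choice among several.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def mean_tokens_per_second(responses: list) -> float:
    """Aggregate speed over several responses: total tokens / total time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return tokens_per_second(total_tokens, total_ns)
```

Because the duration is measured inside the server, model load time and network latency do not distort the result.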

	## Chat template

	Standard Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` role markers
	and `<think>...</think>` blocks for reasoning traces. The Qwen 3.6 jinja
	template is embedded in the GGUF metadata; loaders that read GGUF chat
	templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the
	plain-conversation formatting automatically.

	Ollama is the exception: its conversion of the embedded jinja loses the
	`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
	Two paths fix this, depending on how you pull the model:

	- `ollama run hf.co/FoolDev/Thanatos-27B` — HF's Ollama bridge applies
	the root-level `template` / `system` / `params` files in this repo
	(the bridge does not read `Modelfile`).
	- `make build` / `ollama create thanatos-27b -f Modelfile` — uses the
	`Modelfile`'s `TEMPLATE` block.

	Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
	`/api/chat` and `/v1/chat/completions`. The two configurations are
	kept in sync: edit them together if you change one.

	#### Plain conversation

	```text
	<\|im_start\|>system
	You are Thanatos, a precise and capable assistant…<\|im_end\|>
	<\|im_start\|>user
	What is the time complexity of mergesort?<\|im_end\|>
	<\|im_start\|>assistant
	```
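
For clients that have to build the prompt string themselves, the layout above can be approximated in a few lines. This is a sketch of the plain-conversation case only; the helper name is hypothetical, and the jinja template embedded in the GGUF remains the reference (it also covers tools and reasoning, which this does not).

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render role/content messages into plain ChatML text."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
])
print(prompt)
```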

	#### With reasoning trace

	```text
	<\|im_start\|>assistant
	<think>
	The user asked about mergesort. It splits, recursively sorts each half,
	then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
	</think>

	Mergesort runs in O(n log n) time in the worst, average, and best
	cases.<\|im_end\|>
	```

	Most clients (Open WebUI, LibreChat, etc.) hide the `<think>` block by
	default and surface only the visible answer. Strip it manually with
	`re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)` if your
	client doesn't.
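
The same stripping step as a reusable helper. This is a sketch with a hypothetical function name; it uses a non-greedy match so that several `<think>` blocks in one message are removed independently, and `re.DOTALL` so the match spans newlines.

```python
import re

# Non-greedy body plus trailing whitespace, so the answer starts cleanly.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove reasoning traces and return only the visible answer."""
    return _THINK_RE.sub("", content)


raw = "<think>\nSplit, sort halves, merge.\n</think>\n\nMergesort runs in O(n log n) time."
print(strip_think(raw))
```

An unterminated `<think>` block (for example, output truncated by a token limit) is left untouched by this pattern, so callers may want to handle that case separately.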

	#### Tool / function calling

	The wire format depends on the loader. Both are valid Qwen 3.6 outputs;
	the model adapts to whichever shape the system prompt prescribes.

	Ollama path (this repo's `Modelfile`). The `TEMPLATE` directive
	prompts the model to emit JSON-in-XML, the form Ollama's tool-call
	extractor parses into a structured `tool_calls` array. After
	`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
	under Capabilities, and both `/api/chat` and `/v1/chat/completions`
	accept a `tools` array.

	```text
	<tool_call>
	{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
	</tool_call>
	```
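
Ollama normally does this extraction itself. If you are reading raw model text instead, the JSON-in-XML form can be parsed along these lines. This is a sketch with a hypothetical helper name, and it assumes each `<tool_call>` body is a single valid JSON object.

```python
import json
import re

_JSON_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_json_tool_calls(text: str) -> list[dict]:
    """Return every JSON tool call found in raw model output."""
    return [json.loads(body) for body in _JSON_CALL_RE.findall(text)]


raw = """<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>"""
calls = parse_json_tool_calls(raw)
print(calls[0]["name"], calls[0]["arguments"])
```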

	Embedded-jinja path (llama.cpp, llama-cpp-python, LM Studio). The
	Qwen 3.6 native chat template baked into the GGUF instructs the model
	to emit the more verbose XML form it was trained on:

	```text
	<tool_call>
	<function=get_current_weather>
	<parameter=city>
	Paris
	</parameter>
	<parameter=unit>
	celsius
	</parameter>
	</function>
	</tool_call>
	```
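
A comparable sketch for the XML form, for clients that receive raw text from a llama.cpp-style loader. The helper name is hypothetical; every argument comes back as a string, and coercing values to the types declared in the tool schema is left out.

```python
import re

_XML_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
_PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_xml_tool_calls(text: str) -> list[dict]:
    """Return every XML-form tool call found in raw model output."""
    calls = []
    for name, body in _XML_CALL_RE.findall(text):
        # Values sit on their own lines, so trim the surrounding newlines.
        arguments = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


raw = """<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>"""
calls = parse_xml_tool_calls(raw)
print(calls)
```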

	Use whichever your client expects; don't mix parsers.

	End-to-end exercise (Ollama path):

	```bash
	python examples/ollama_chat.py # section 3 runs a real round-trip
	```
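
For orientation, a request body with a `tools` array might look like the sketch below. It is not the contents of `examples/ollama_chat.py`: the helper name, tool description, and user message are hypothetical, and the schema follows the OpenAI-style function format that Ollama's `/api/chat` accepts. The model name assumes you built it as `thanatos-27b`.

```python
import json


def build_chat_request(model: str, user_text: str) -> dict:
    """Build a non-streaming chat request that offers one tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [weather_tool],
        "stream": False,
    }


payload = build_chat_request("thanatos-27b", "What's the weather in Paris?")
print(json.dumps(payload, indent=2))
```

POST the printed JSON to `/api/chat` on your Ollama server; a successful tool call shows up as a `tool_calls` array on the returned assistant message.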

	## Known limitations

	- Slower per token than the 35B-A3B sibling. The dense 27B runs every parameter on every token, which can help answer quality but costs speed against the sparse 35B with 3B active parameters; if you optimize for tokens per second, the MoE wins.
	- No mmproj in this release, and vision via Ollama is broken upstream (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
	- Q4_K_M quality loss is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
	- No formal evaluation in this card. Numbers above are estimates.

	## Related models

	\| Model \| Notes \|
	\|---\|---\|
	\| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) \| Upstream base, safetensors \|
	\| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) \| Recommended GGUF source \|
	\| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) \| 35B-A3B MoE sibling. More capacity, more memory pressure. \|
	\| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) \| 9B starter model when 27B/35B is too heavy \|

	## Credits

	- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
	- Reasoning teacher: Claude Opus 4.7 (Anthropic)
	- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

	License inherited from upstream: Apache-2.0.