Instructions for using FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use FoolDev/Thanatos-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")

llama-cpp-python

How to use FoolDev/Thanatos-27B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FoolDev/Thanatos-27B",
	filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use FoolDev/Thanatos-27B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M


vLLM

How to use FoolDev/Thanatos-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

SGLang

How to use FoolDev/Thanatos-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FoolDev/Thanatos-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use FoolDev/Thanatos-27B with Ollama:
```
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Unsloth Studio

How to use FoolDev/Thanatos-27B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Pi

How to use FoolDev/Thanatos-27B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FoolDev/Thanatos-27B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FoolDev/Thanatos-27B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
```
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Lemonade

How to use FoolDev/Thanatos-27B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M

Run and chat with the model

lemonade run user.Thanatos-27B-Q4_K_M

List all available models

lemonade list


---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-27B
datasets:
  - crownelius/Creative_Writing_ShareGPT_Enhanced
  - microsoft/rStar-Coder
  - peteromallet/dataclaw-peteromallet
  - crownelius/Opus-4.7-Reasoning
  - openbmb/UltraData-Math
  - Crownelius/Crow-Heretic-TeichAI-Unified
language:
  - en
  - zh
  - ru
  - es
  - fr
  - it
  - ja
  - ko
  - de
  - ar
  - tr
  - pl
  - sv
  - nl
  - he
  - id
  - uk
  - fa
  - pt
  - ms
  - fi
  - el
tags:
  - qwen36
  - dense
  - conversational
  - multimodal
  - agent
  - gguf
  - ollama
  - imatrix
library_name: transformers
pipeline_tag: image-text-to-text
---

<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />

[![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
[![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
[![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
[![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)

# Thanatos-27B

> **Dense Reasoning. Friendlier Footprint.**
> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*

**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`

A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.

## TL;DR

One-liner via Hugging Face (pulls a GGUF + this repo's root-level
`template` / `system` / `params` files, including the tool-calling
template — HF's Ollama bridge ingests those three files, not
`Modelfile`):

```bash
ollama run hf.co/FoolDev/Thanatos-27B           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
```

If you pulled the bundle during any of the qwen36 windows on the
pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
have a qwen36-stamped blob in your local Ollama store, `make
heal-hf` rebadges it in place. Fresh pulls go straight through.

For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
QUANT=...` is the simplest path. See [Quick start](#quick-start)
below for the full matrix.

For image input use llama.cpp directly — Ollama vision is broken for
this architecture upstream (see [Vision](#vision)).

## Why a 27B variant?

The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.

The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B: on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10-12 tok/s depending on backend (see [Hardware requirements](#hardware-requirements)), versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix). In exchange, the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.

| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
| Active params per token | 27 B | ~3 B |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
| Max context | 262 144 | 262 144 |
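As a rough sanity check on the size rows, the quoted GGUF sizes can be turned into an implied average bits-per-weight figure. This is back-of-envelope arithmetic only: it treats the README's approximate sizes as decimal gigabytes and ignores metadata overhead, and the helper name `implied_bits_per_weight` is illustrative.

```python
# Implied average bits per weight for each GGUF in the table above.
# Assumption: sizes are the README's approximate figures in decimal GB,
# and metadata overhead is ignored.

def implied_bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average number of bits stored per parameter."""
    return (file_gb * 1e9 * 8) / (params_billion * 1e9)

SIZES = {
    "Thanatos-27B Q4_K_M": (17, 27),
    "Thanatos-27B Q3_K_S": (12, 27),
    "Janus-35B Q4_K_M": (19, 35),
}

for name, (file_gb, params_b) in SIZES.items():
    print(f"{name}: ~{implied_bits_per_weight(file_gb, params_b):.1f} bits/weight")
```

Because the dense model touches every weight on every token, this on-disk size is also roughly the per-token working set, which is the trade-off the table summarizes.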

## What's here

| File | Use |
|---|---|
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF — used by `make build` / `ollama create` for **local** builds |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename `FoolDev/Thanatos-27B`) *before* the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls don't need it. |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
| `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
| `Makefile` | Convenience wrapper — `make help` lists targets |
| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |

For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
downloads the smaller ~12 GB Q3_K_S quant from
`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
creates a local `thanatos-27b` Ollama tag. That quant is not
redistributed via this repo. For other quants use `make build QUANT=...`. The
local-build path applies this repo's `Modelfile`; the `hf.co/...`
path applies the root-level `template`, `system`, and `params`
files (kept in sync with the `Modelfile`).

If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).

## Architecture

<p align="left">
  <img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
</p>

- Qwen 3.6 dense, 27B parameters, 64 transformer layers
- **Hybrid attention stack**: 16 repeats of `[3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]`
  - Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
  - Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
- Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the **base architecture** via a separate
  `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
  from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
  current loader compatibility.
- Multi-token prediction (MTP) head trained for speculative decoding —
  present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
  vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
  **Not usable via llama.cpp / Ollama today**: the GGUF converter
  (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
  `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
  inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
  851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
  merged 2026-05-16) currently covers other architectures only;
  tracking that PR's follow-up work for when qwen35 / qwen35moe
  consumer support lands. (Earlier README versions claimed MTP was
  available without this caveat — confirmed empirically via
  `gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
  2026-05-19.)
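The layer layout and head counts in the bullets above imply a fairly small context-dependent cache. The sketch below works that out under two assumptions that are not stated in this README: cache entries are stored as f16 (2 bytes each), and only the softmax Gated Attention layers keep a cache that grows with context, while the Gated DeltaNet layers carry a fixed-size recurrent state. Actual loaders may allocate differently.

```python
# Layer layout and KV-cache arithmetic derived from the architecture bullets.
# Assumptions (not from the README): f16 cache entries, and only the softmax
# Gated Attention layers hold a per-token KV cache.

REPEATS = 16
BLOCK = ["deltanet"] * 3 + ["attention"]  # 3 x Gated DeltaNet, then 1 x Gated Attention
layers = BLOCK * REPEATS

KV_HEADS = 4          # GQA KV-heads in the Gated Attention layers
HEAD_DIM = 256        # head_dim of the Gated Attention layers
BYTES_PER_VALUE = 2   # assumed f16 cache


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes of K and V cache across all softmax-attention layers."""
    attention_layers = layers.count("attention")
    per_token = attention_layers * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE
    return per_token * context_tokens


print(len(layers), layers.count("deltanet"), layers.count("attention"))  # 64 48 16
print(kv_cache_bytes(8192) / 2**20, "MiB at 8K context")                 # 512.0
print(kv_cache_bytes(262_144) / 2**30, "GiB at full native context")     # 16.0
```

Under these assumptions, three quarters of the stack contributes nothing to context-proportional memory, which is why long contexts stay affordable relative to a pure softmax-attention model of the same depth.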

**The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
workaround for an unimplemented `qwen36` arch, but the canonical
upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention
family. The naming convergence runs through three layers of the
stack:

- **Qwen's own HF configs.** `Qwen/Qwen3.6-27B/config.json` declares
  `"model_type": "qwen3_5"` and
  `"architectures": ["Qwen3_5ForConditionalGeneration"]`. The MoE
  sibling `Qwen/Qwen3.6-35B-A3B` declares `"qwen3_5_moe"` /
  `Qwen3_5MoeForConditionalGeneration`. No `Qwen3_6` arch class
  exists in `transformers`; Qwen reuses the 3.5 class names.
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
  `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
  GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
  `unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
  explicit `case 64: type = LLM_TYPE_27B` branch for this model;
  `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
  Janus-35B sibling base. The arch entries were written to load
  Qwen 3.6 weights, not just Qwen 3.5.

There is no PR or tracking issue for a `qwen36` arch entry in
`ggml-org/llama.cpp` or `ollama/ollama` because none is needed —
`qwen35` already loads the model the upstream code path was
designed to load.

`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
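To check the stamp on a local file without installing anything, the GGUF header can be read directly. The following is a minimal standard-library sketch, not this repo's `scripts/verify_arch.py`: it assumes a little-endian GGUF v2/v3 file, and the function name `read_architecture` is illustrative. The `gguf` Python package's `GGUFReader` is the more complete route.

```python
import struct

# Fixed-width GGUF metadata value types mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-width value described by a struct format."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, value_type):
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        element_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, element_type) for _ in range(count)]
    return _read(f, _SCALARS[value_type])


def read_architecture(path):
    """Return the general.architecture string from a GGUF file, or None."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version = _read(f, "<I")
        _tensor_count = _read(f, "<Q")
        kv_count = _read(f, "<Q")
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key == "general.architecture":
                return value  # stop early; later keys can be very large
    return None
```

Calling `read_architecture("Thanatos-27B.Q4_K_M.gguf")` on the current bundle should, per the description above, return `qwen35`.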

### History

The bundle's `general.architecture` stamp has now flipped eight
times — four landings on qwen36 and four on qwen35 — each time
after weighing the friction-vs-honesty tradeoff anew. The saga
is resolved on the upstream-canonical `qwen35` side:

- **v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC):** initial qwen35
  → qwen36 stamp, on the theory that qwen35 was a loader stand-in
  awaiting proper Qwen 3.6 support. Upstream audit later showed
  that theory was mistaken (see above).
- **2026-05-19 afternoon (`964e418`):** flipped back to qwen35
  after daily friction outweighed version-specificity for that
  iteration; doc workaround narrative collapsed (`83022eb`).
- **2026-05-19 evening (`07fa120`):** brief re-flip to qwen36
  during a fresh-pull integration test on Strix Halo.
- **2026-05-19 evening (`72259c1`, ~1 hour later):** reverted to
  qwen35 again because the live friction was worse than the doc
  prose suggested.
- **2026-05-19 evening (`973d7ef`):** flipped to qwen36 one more
  time, after the upstream-evidence audit had been shipped and
  the friction was a known quantity. Project owner wanted to
  test the friction tradeoff in practice with the audit's
  conclusion staring them in the face.
- **2026-05-19 evening (`978798f`):** flipped back to qwen35
  after seven sequential fresh-pull → heal-hf cycles on the
  Strix Halo box made the friction concretely experienced
  rather than hypothetical. Each cycle worked (the heal flow
  is solid) — and each cycle was an unnecessary obstacle for
  users who just want `ollama run` to work first try. The
  audit (`a4d3b6e`) called the canonical stamp correctly and
  the practical friction outweighed the version-specificity
  payoff.
- **2026-05-20 midday (`ae67ed1`):** brief re-flip to qwen36
  the next morning to re-test the friction in a fresh session.
- **2026-05-20 midday (`e03e10e`, 8 minutes later):** flipped
  back to qwen35. Same conclusion as the prior round trip —
  friction outweighs version-specificity. **This is the
  current state.**

Tensor data was byte-identical across all stamps; only the
`general.architecture` KV (and namespaced KV keys) flipped.
See the [CHANGELOG](CHANGELOG.md) entries for each flip's
rationale.

### Rebadge utility

`scripts/rename_arch.py` is the generic GGUF arch renamer
(metadata only, tensors byte-identical), kept in the repo for
the legacy qwen36 → qwen35 in-store rebadge (used by `make
heal-hf` and `make load-bundle`) and any future arch flip:

```bash
# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
    --from-arch qwen36 --to-arch qwen35 \
    Thanatos-27B.Q4_K_M.qwen36.gguf \
    Thanatos-27B.Q4_K_M.gguf
```

## Quick start

### Ollama

Three paths:

```bash
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
#    root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M, qwen35-stamped

# B. Build a local `thanatos-27b` tag from THIS repo's bundle
#    (LFS smudge if needed, then `ollama create`). Useful if you
#    want a bare local tag rather than the `hf.co/...` path:
make load-bundle                                 # creates local tag thanatos-27b
ollama run thanatos-27b

# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
#    and build locally. Loads on every current llama.cpp / Ollama.
make build                                              # Q4_K_M  -> thanatos-27b
make build QUANT=Q3_K_S                                 # 12 GB smaller quant
make build QUANT=Q5_K_M                                 # 20 GB higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf   # skip download
ollama run thanatos-27b
```

Under the hood, `make build` calls `scripts/build.sh`, which downloads the
GGUF if missing (set `GGUF_PATH` to point at one you already have) and
runs `ollama create` with the matching `Modelfile`.

If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.

Confirm everything works:

```bash
make smoke                          # checks server, model, round-trip, no token leakage
make smoke-tools                    # adds an end-to-end tool-call round-trip (~10s extra)
make bench                          # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-compat
```

### Local apps

| App | How to load this model |
|---|---|
| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
| **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |

For the full Vision (image input) loader matrix, see [Vision](#vision).
Tool calling currently works in **Ollama** (via the root-level
`template` file when pulling from `hf.co/...`, or via the `Modelfile`
TEMPLATE when building locally) and **llama.cpp / llama-cpp-python**
(via the GGUF's embedded jinja). Other apps' tool-calling support
depends on whether they read the embedded template or require an
external schema.

### Inference (OpenAI-compatible)

```bash
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "thanatos-27b",
    "messages": [
      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
    ],
    "temperature": 0.6
  }' | jq -r '.choices[0].message.content'
```
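The same request from Python, as a minimal standard-library sketch. The helper names `build_chat_request` and `chat` are illustrative and not part of this repo's `examples/`; the base URL assumes a local Ollama daemon on its default port, as in the curl call above.

```python
import json
import urllib.request


def build_chat_request(user_prompt, model="thanatos-27b", temperature=0.6,
                       system="You are Thanatos, a precise reasoning assistant."):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
    }


def chat(user_prompt, base_url="http://localhost:11434/v1", **kwargs):
    """POST the payload and return the assistant's visible answer."""
    payload = json.dumps(build_chat_request(user_prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the daemon running, `print(chat("Explain the Burrows-Wheeler transform in 200 words."))` mirrors the curl example.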

### Recommended sampling

| Use | temp | top_p | top_k | repeat_penalty |
|---|---:|---:|---:|---:|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |

Lower the temperature to 0.4-0.6 and raise `repeat_penalty` to 1.08 if the model loops inside `<think>` tags.
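One way to carry these presets in client code is a small lookup that returns an Ollama-style `options` dict. This is a sketch: the function name `sampling_options` and the `looping` switch are hypothetical, and the choice of 0.4 as the fallback temperature is just one point inside the 0.4-0.6 range suggested above.

```python
# Sampling presets from the table above, keyed by use case.
PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def sampling_options(use="reasoning", looping=False, loop_temperature=0.4):
    """Return a copy of the preset, tightened if the model loops in <think>."""
    options = dict(PRESETS[use])
    if looping:
        # Assumed policy: never raise the temperature, only lower it.
        options["temperature"] = min(options["temperature"], loop_temperature)
        options["repeat_penalty"] = 1.08
    return options
```

The returned dict can be passed as the `options` field of an Ollama `/api/chat` request body.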

### System prompt

The Modelfile bakes this in. Override per-request via the `system` role
in your client:

```text
You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.
```

## Vision

The Qwen 3.6 base supports image (and video) input via a separate
`mmproj` projector. The full multimodal stack is:

```
Qwen3.6-27B-Q4_K_M.gguf   (~17 GB, the text decoder)
mmproj-F16.gguf           (~927 MB, the vision projector)
```

Both files are at
[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
This repo intentionally does not redistribute either.

### Loader compatibility — the honest table

| Loader | Text | Vision (mmproj) | Notes |
|---|---|---|---|
| **llama.cpp** (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
| **llama-cpp-python** | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
| **Ollama 0.24** | ✅ | ❌ | Text inference works: Ollama's Go engine has the `qwen35` / `qwen35moe` arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. `ollama create` accepts a dual-`FROM` (text + mmproj) and `ollama show` reports `vision` capability — but the **first inference request** fails with `error loading model architecture: unknown model architecture: 'qwen35'` (or `'qwen35moe'`), and once mmproj is attached this blocks text inference too. See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). |
| **LM Studio** | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |

### Vision via llama.cpp

Three flavors, in order of build-time effort:

```bash
# A. HTTP via llama-server (always built — the easiest path).
#    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.

# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
#    so a selective `cmake --build build --target llama-cli ...` won't
#    produce it — a plain `cmake --build build` will. If yours didn't,
#    run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image photo.jpg \
  -p "Describe this image."

# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-F16.gguf \
  --image /path/to/photo.jpg \
  --prompt "What is in this image?"
```

Until the Ollama upstream issue is fixed, treat Ollama as **text-only**
for this model.
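For path A, the request body and the reasoning/answer split can be sketched in a few lines of Python. The helper names here are illustrative, and the default `max_tokens` of 1024 is an arbitrary value above the 500-token floor mentioned in the comments above; `examples/llama_cpp_vision.py` remains the repo's own reference client.

```python
import base64


def image_message(image_bytes, prompt, mime="image/jpeg"):
    """Build a user message carrying a prompt plus a base64 data-URI image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


def vision_request(image_bytes, prompt, max_tokens=1024):
    """OpenAI-style chat completion body for llama-server with --mmproj."""
    return {"messages": [image_message(image_bytes, prompt)], "max_tokens": max_tokens}


def split_reply(response):
    """Return (reasoning, answer) from a decoded chat completion response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content", ""), message.get("content", "")
```

The body from `vision_request(open("photo.jpg", "rb").read(), "Describe this image.")` would be POSTed as JSON to `http://127.0.0.1:8765/v1/chat/completions` when the server is started as in path A.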

## Hardware requirements

The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.

| Hardware | Status |
|---|---|
| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |

Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
this hardware.
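The tok/s figure can be reproduced from any Ollama response, since `eval_duration` is reported in nanoseconds. A minimal sketch follows; how `scripts/bench.sh` aggregates its three prompts is not shown here, so the total-tokens-over-total-time average below is an assumption.

```python
def tokens_per_second(response):
    """Generation speed from one Ollama response (eval_duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def mix_tokens_per_second(responses):
    """Assumed aggregate: total generated tokens over total generation time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_seconds = sum(r["eval_duration"] for r in responses) / 1e9
    return total_tokens / total_seconds
```

Both functions take the decoded JSON of a non-streaming `/api/generate` or `/api/chat` response, or the final chunk of a streaming one.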

## Chat template

Standard Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` role markers
and `<think>...</think>` blocks for reasoning traces. The Qwen 3.6 jinja
template is embedded in the GGUF metadata; loaders that read GGUF chat
templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the
plain-conversation formatting automatically.

Ollama is the exception: its conversion of the embedded jinja loses the
`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
Two paths fix this, depending on how you pull the model:

- **`ollama run hf.co/FoolDev/Thanatos-27B`** — HF's Ollama bridge applies
  the root-level `template` / `system` / `params` files in this repo
  (the bridge does **not** read `Modelfile`).
- **`make build` / `ollama create thanatos-27b -f Modelfile`** — uses the
  `Modelfile`'s `TEMPLATE` block.

Both routes wire `.Tools` / `.ToolCalls`, so tools work end-to-end on
`/api/chat` and `/v1/chat/completions`. The two configurations are
kept in sync by hand: if you change one, make the same edit in the other.

#### Plain conversation

```text
<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
```
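
If you need to build this prompt by hand (for a raw completion endpoint, say), the layout above can be sketched as follows. This is a simplified stand-in that covers only plain conversation; the real embedded jinja template also handles tools and reasoning, and `render_chatml` is a hypothetical helper name, not part of any library.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render role/content messages into the plain ChatML layout shown above."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a precise and capable assistant."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
])
print(prompt)
```

Prefer the loader's built-in template whenever it is available; hand-rendering is easy to get subtly wrong.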

#### With reasoning trace

```text
<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>

Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>
```

Most clients (Open WebUI, LibreChat, etc.) hide the `<think>` block by
default and surface only the visible answer. Strip it manually with
`re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)` if your
client doesn't.
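
Wrapped in a small helper, the same regex looks like this. The `strip_think` name and the sample string are illustrative only.

```python
import re

# Same pattern as above: non-greedy match across newlines, plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove every complete <think>...</think> block from a model reply."""
    return THINK_BLOCK.sub("", content)


raw = "<think>\nsplit, sort, merge\n</think>\n\nMergesort runs in O(n log n) time."
print(strip_think(raw))  # prints "Mergesort runs in O(n log n) time."
```

An unclosed `<think>` (for example, a reply cut off by the token limit) does not match the pattern and is left untouched, so handle that case separately if it matters to your client.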

#### Tool / function calling

The wire format depends on the loader. Both are valid Qwen 3.6 outputs;
the model adapts to whichever shape the system prompt prescribes.

**Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
extractor parses into a structured `tool_calls` array. After
`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
accept a `tools` array.

```text
<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>
```
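
Ollama does this extraction for you, but if you ever see the raw text (a proxy that bypasses the tool-call parser, say), the JSON-in-XML form can be parsed with a sketch like the one below. `parse_json_tool_calls` is a hypothetical helper, not Ollama's own parser, and it assumes each block holds one well-formed JSON object.

```python
import json
import re

# One JSON object between <tool_call> tags; DOTALL lets the object span lines.
JSON_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", flags=re.DOTALL)


def parse_json_tool_calls(text: str) -> list[dict]:
    """Extract {"name", "arguments"} dicts from JSON-in-XML tool calls."""
    calls = []
    for match in JSON_TOOL_CALL.finditer(text):
        payload = json.loads(match.group(1))  # raises ValueError on malformed JSON
        calls.append({
            "name": payload["name"],
            "arguments": payload.get("arguments", {}),
        })
    return calls


reply = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}\n'
    "</tool_call>"
)
print(parse_json_tool_calls(reply))
```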

**Embedded-jinja path** (llama.cpp, llama-cpp-python, LM Studio). The
Qwen 3.6 native chat template baked into the GGUF instructs the model
to emit the more verbose XML form it was trained on:

```text
<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
```
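
llama.cpp-based servers normally parse this form themselves. For raw completions, a minimal sketch of a parser follows; `parse_xml_tool_calls` is a hypothetical helper, and it makes the simplifying assumption that every parameter value is plain text (values come back as strings, with no type coercion against the tool schema).

```python
import re

# <function=NAME> ... </function> inside a <tool_call> block.
XML_TOOL_CALL = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    flags=re.DOTALL,
)
# <parameter=KEY> VALUE </parameter>, dropping the newline on each side of VALUE.
XML_PARAMETER = re.compile(
    r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>",
    flags=re.DOTALL,
)


def parse_xml_tool_calls(text: str) -> list[dict]:
    """Extract {"name", "arguments"} dicts from the verbose XML tool-call form."""
    calls = []
    for name, body in XML_TOOL_CALL.findall(text):
        arguments = {key: value for key, value in XML_PARAMETER.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


reply = (
    "<tool_call>\n<function=get_current_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_xml_tool_calls(reply))
```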

Use whichever your client expects; don't mix parsers.

End-to-end exercise (Ollama path):

```bash
python examples/ollama_chat.py        # section 3 runs a real round-trip
```

## Known limitations

- **Slower per token than the 35B-A3B sibling.** Dense 27B tends to beat sparse 35B/3B-active on quality-oriented benchmarks because every parameter contributes to each token; if you optimize for tokens-per-second, the MoE wins.
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
- **No formal evaluation in this card.** Throughput figures in the hardware table are estimates; only the Ryzen AI Max+ 395 data points were measured.

## Related models

| Model | Notes |
|---|---|
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |

## Credits

- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

License inherited from upstream: Apache-2.0.