Instructions to use FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use FoolDev/Thanatos-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")

llama-cpp-python

How to use FoolDev/Thanatos-27B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FoolDev/Thanatos-27B",
	filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)


llama.cpp

How to use FoolDev/Thanatos-27B with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf FoolDev/Thanatos-27B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

vLLM

How to use FoolDev/Thanatos-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

SGLang

How to use FoolDev/Thanatos-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FoolDev/Thanatos-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use FoolDev/Thanatos-27B with Ollama:
```
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Unsloth Studio

How to use FoolDev/Thanatos-27B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Pi

How to use FoolDev/Thanatos-27B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf FoolDev/Thanatos-27B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FoolDev/Thanatos-27B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FoolDev/Thanatos-27B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf FoolDev/Thanatos-27B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use FoolDev/Thanatos-27B with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf FoolDev/Thanatos-27B:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "FoolDev/Thanatos-27B:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
```
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Lemonade

How to use FoolDev/Thanatos-27B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M

Run and chat with the model

lemonade run user.Thanatos-27B-Q4_K_M

List all available models

lemonade list

Thanatos-27B

Dense Reasoning. Friendlier Footprint. Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.

Architecture: Qwen 3.6 27B (Dense) | Parameters: 27B | Teacher: Claude Opus 4.7 | Type: Distilled LLM

A personal sibling to FoolDev/Janus-35B. Same teacher (Claude Opus 4.7), same dataset family, but built on the dense Qwen/Qwen3.6-27B base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.

TL;DR

One-liner via Hugging Face (pulls a GGUF + this repo's root-level template / system / params files, including the tool-calling template — HF's Ollama bridge ingests those three files, not Modelfile):

ollama run hf.co/FoolDev/Thanatos-27B           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama

If you pulled the bundle during any of the qwen36 windows on the pre-rename FoolDev/Thanatos-27B repo (2026-05-19/20) and still have a qwen36-stamped blob in your local Ollama store, make heal-hf rebadges it in place. Fresh pulls go straight through.

For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), make build QUANT=... is the simplest path. See Quick start below for the full matrix.

For image input use llama.cpp directly — Ollama vision is broken for this architecture upstream (see Vision).

Why a 27B variant?

The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but memory-hungry at load time — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.

The 27B is dense: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (make bench, 3-prompt mix) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.

Property	Thanatos-27B (this)	Janus-35B
Architecture	Dense transformer	MoE 256 experts, 8 active
Total params	27 B	35 B
Active params per token	27 B	~3 B
Layers	64	40
Hidden size	5120	2048
Q4_K_M GGUF size	~17 GB (bundled)	~19 GB (bundled)
Q3_K_S GGUF size	~12 GB (build locally via `make build QUANT=Q3_K_S`)	n/a
Min host memory @ Q4 / 8K ctx	~22 GB	~38 GB
Multimodal (text path)	Yes	Yes
Multimodal (vision via Ollama)	Broken upstream — see below	Broken upstream
Multimodal (vision via llama.cpp)	Yes, with mmproj	Yes, with mmproj
Max context	262 144	262 144
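
As a rough cross-check on the size rows above, the file sizes can be turned into effective bits per weight. This is only a sketch: it assumes 1 GB = 10^9 bytes, takes the approximate (~) sizes at face value, and ignores that a GGUF also carries metadata and mixes quantization types across tensors. The function name is hypothetical.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Effective bits per parameter, assuming 1 GB = 1e9 bytes (the 1e9 factors cancel)."""
    return file_gb * 8 / params_billion


# (approximate GGUF size in GB, total parameters in billions), taken from the table above
models = {
    "Thanatos-27B Q4_K_M": (17, 27),
    "Thanatos-27B Q3_K_S": (12, 27),
    "Janus-35B Q4_K_M": (19, 35),
}

for name, (size_gb, params_b) in models.items():
    print(f"{name}: ~{bits_per_weight(size_gb, params_b):.1f} bits/weight")
```

Because the inputs are rounded, the results are useful for comparing quants, not for predicting exact file sizes.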

What's here

File	Use
`banner.svg` / `banner.png`	Repo header, Tokyo Night themed
`dense-flow.svg` / `dense-flow.png`	Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG)
`Modelfile`	Ollama wrapper around the bundled Qwen 3.6 27B GGUF — used by `make build` / `ollama create` for local builds
`template`, `system`, `params`	Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does not read `Modelfile` — see HF Ollama docs). Mirrors the `Modelfile`'s template / system prompt / sampling params.
`examples/`	Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python
`scripts/build.sh`	Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`)
`scripts/load_bundle.sh`	One-shot path from this repo's bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle.
`scripts/heal_hf_pull.sh`	Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename `FoolDev/Thanatos-27B`) before the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls don't need it.
`scripts/smoke_test.sh`	Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape
`scripts/bench.sh`	Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`)
`scripts/fetch_vision.sh`	Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream — see Vision). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match).
`scripts/check.sh`	Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check
`scripts/check_bridge_sync.py`	Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook.
`scripts/verify_arch.py`	Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exit non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand.
`scripts/install-hooks.sh`	Installs `check.sh` as a git pre-commit hook
`Makefile`	Convenience wrapper — `make help` lists targets
`LICENSE`, `CITATION.cff`	Apache-2.0 license and citation metadata
`CHANGELOG.md`	Versioned tooling/docs changes
`README.md`	This file

For 16 GB GPUs / unified-memory laptops, make build QUANT=Q3_K_S downloads the smaller ~12 GB Q3_K_S quant from unsloth/Qwen3.6-27B-GGUF (qwen35-stamped, loads directly) and creates a local thanatos-27b Ollama tag. This repo does not redistribute that quant. For other quants use make build QUANT=.... The local-build path applies this repo's Modelfile; the hf.co/... path applies the root-level template, system, and params files (kept in sync with the Modelfile).

If you want the safetensors for transformers, fetch them from Qwen/Qwen3.6-27B.

Architecture

Figure: animated dense forward-pass visualization. A 64-layer hybrid attention stack with a pulse traversing left to right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn.

Qwen 3.6 dense, 27B parameters, 64 transformer layers
Hybrid attention stack: 16 repeats of [3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]
- Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
- Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
Vocab 248,320 (shared with 35B-A3B sibling)
262 144 native context, extensible to ~1 M with YaRN
Vision + video supported by the base architecture via a separate mmproj projector (not redistributed here; pull mmproj-F16.gguf from unsloth/Qwen3.6-27B-GGUF). See Vision below for current loader compatibility.
Multi-token prediction (MTP) head trained for speculative decoding — present in the upstream Qwen/Qwen3.6-27B safetensors and usable via vLLM (qwen3_next_mtp) or SGLang (--speculative-algo NEXTN). Not usable via llama.cpp / Ollama today: the GGUF converter (convert_hf_to_gguf.py) explicitly skips MTP tensors for the qwen35 / qwen35moe arch family ("MTP tensors are not used at inference yet"), so the bundled GGUF and the unsloth GGUFs ship with 851 tensors and no MTP head. llama.cpp's MTP support (PR #22673, merged 2026-05-16) currently covers other architectures only; tracking that PR's follow-up work for when qwen35 / qwen35moe consumer support lands. (Earlier README versions claimed MTP was available without this caveat — confirmed empirically via gguf.GGUFReader on both this bundle and unsloth/Qwen3.6-27B-GGUF, 2026-05-19.)
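
The layer arithmetic in the bullets above can be checked in a few lines. This is a sketch only: the layer labels are illustrative strings, and reading "partial RoPE (factor 0.25)" as rotary embedding applied to a quarter of each attention head's dimensions is an assumption, not something the card states.

```python
# One repeat of the hybrid block: 3 Gated DeltaNet layers, then 1 Gated Attention layer.
BLOCK = ["gated_deltanet"] * 3 + ["gated_attention"]
REPEATS = 16

layers = BLOCK * REPEATS                      # full stack, in forward-pass order
n_deltanet = layers.count("gated_deltanet")   # linear-attention layers
n_attention = layers.count("gated_attention") # softmax-attention layers

ffn_ratio = 17408 / 5120                      # FFN intermediate size / hidden size
gqa_group = 24 // 4                           # Q-heads sharing each KV-head
rotary_dims = int(256 * 0.25)                 # assumed meaning of "partial RoPE (factor 0.25)"

print(len(layers), n_deltanet, n_attention, ffn_ratio, gqa_group, rotary_dims)
```

Every fourth layer in this list is a softmax-attention layer, which matches the 3:1 pattern described above.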

The bundled GGUF declares general.architecture: 'qwen35' — not a workaround for an unimplemented qwen36 arch, but the canonical upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention family. The naming convergence runs through three layers of the stack:

Qwen's own HF configs. Qwen/Qwen3.6-27B/config.json declares "model_type": "qwen3_5" and "architectures": ["Qwen3_5ForConditionalGeneration"]. The MoE sibling Qwen/Qwen3.6-35B-A3B declares "qwen3_5_moe" / Qwen3_5MoeForConditionalGeneration. No Qwen3_6 arch class exists in transformers; Qwen reuses the 3.5 class names.
llama.cpp's converter. convert_hf_to_gguf.py registers Qwen3_5ForCausalLM → MODEL_ARCH.QWEN35 and Qwen3_5MoeForCausalLM → MODEL_ARCH.QWEN35MOE. The unsloth GGUFs this repo pulls from (unsloth/Qwen3.6-27B-GGUF, unsloth/Qwen3.6-35B-A3B-GGUF) inherit those stamps.
llama.cpp's model code. src/models/qwen35.cpp has an explicit case 64: type = LLM_TYPE_27B branch for this model; qwen35moe.cpp has case 40: type = LLM_TYPE_35B_A3B for the Janus-35B sibling base. The arch entries were written to load Qwen 3.6 weights, not just Qwen 3.5.

There is no PR or tracking issue for a qwen36 arch entry in ggml-org/llama.cpp or ollama/ollama because none is needed — qwen35 already loads the model the upstream code path was designed to load.

ollama run hf.co/FoolDev/Thanatos-27B and llama-server -m Thanatos-27B.Q4_K_M.gguf both load directly on current stock loaders.

History

The bundle's general.architecture stamp has now flipped eight times — four landings on qwen36 and four on qwen35 — each time after weighing the friction-vs-honesty tradeoff anew. The saga is resolved on the upstream-canonical qwen35 side:

v0.6.0-era (e1f78fa, 2026-05-19 14:38 UTC): initial qwen35 → qwen36 stamp, on the theory that qwen35 was a loader stand-in awaiting proper Qwen 3.6 support. Upstream audit later showed that theory was mistaken (see above).
2026-05-19 afternoon (964e418): flipped back to qwen35 after daily friction outweighed version-specificity for that iteration; doc workaround narrative collapsed (83022eb).
2026-05-19 evening (07fa120): brief re-flip to qwen36 during a fresh-pull integration test on Strix Halo.
2026-05-19 evening (72259c1, ~1 hour later): reverted to qwen35 again because the live friction was worse than the doc prose suggested.
2026-05-19 evening (973d7ef): flipped to qwen36 one more time, after the upstream-evidence audit had been shipped and the friction was a known quantity. Project owner wanted to test the friction tradeoff in practice with the audit's conclusion staring them in the face.
2026-05-19 evening (978798f): flipped back to qwen35 after seven sequential fresh-pull → heal-hf cycles on the Strix Halo box made the friction concretely experienced rather than hypothetical. Each cycle worked (the heal flow is solid), and each cycle was an unnecessary obstacle for users who just want ollama run to work first try. The audit (a4d3b6e) called the canonical stamp correctly and the practical friction outweighed the version-specificity payoff.
2026-05-20 midday (ae67ed1): brief re-flip to qwen36 the next day to re-test the friction in a fresh session.
2026-05-20 midday (e03e10e, 8 minutes later): flipped back to qwen35. Same conclusion as the prior round trip — friction outweighs version-specificity. This is the current state.

Tensor data was byte-identical across all stamps; only the general.architecture KV (and namespaced KV keys) flipped. See the CHANGELOG entries for each flip's rationale.
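
To see which stamp a given file carries without loading any tensors, the metadata key-value section at the start of the file can be read directly. The following is a minimal sketch, not the repo's scripts/rename_arch.py or scripts/verify_arch.py: it assumes the GGUF version 2/3 header layout (little-endian, 64-bit counts), and the function names are hypothetical.

```python
import struct

# GGUF metadata value type id -> struct format for fixed-size scalars.
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
STRING, ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size little-endian value."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    return f.read(_read(f, "<Q")).decode("utf-8")


def _read_value(f, vtype):
    if vtype == STRING:
        return _read_string(f)
    if vtype == ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    return _read(f, SCALARS[vtype])


def read_gguf_metadata(path, stop_at=None):
    """Return metadata KVs as a dict; stop early once `stop_at` has been read."""
    meta = {}
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _read(f, "<I")                 # format version (assumed 2 or 3)
        _read(f, "<Q")                 # tensor count, unused here
        kv_count = _read(f, "<Q")
        for _ in range(kv_count):
            key = _read_string(f)
            meta[key] = _read_value(f, _read(f, "<I"))
            if key == stop_at:
                break
    return meta


# Usage, with a local copy of the bundle:
# read_gguf_metadata("Thanatos-27B.Q4_K_M.gguf", stop_at="general.architecture")
```

Passing stop_at avoids parsing the large tokenizer arrays when only the architecture stamp is of interest.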

Rebadge utility

scripts/rename_arch.py is the generic GGUF arch renamer (metadata only, tensors byte-identical), kept in the repo for the legacy qwen36 → qwen35 in-store rebadge (used by make heal-hf and make load-bundle) and any future arch flip:

# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
    --from-arch qwen36 --to-arch qwen35 \
    Thanatos-27B.Q4_K_M.qwen36.gguf \
    Thanatos-27B.Q4_K_M.gguf

Quick start

Ollama

Three paths:

# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
#    root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M, qwen35-stamped

# B. Build a local `thanatos-27b` tag from THIS repo's bundle
#    (LFS smudge if needed, then `ollama create`). Useful if you
#    want a bare local tag rather than the `hf.co/...` path:
make load-bundle                                 # creates local tag thanatos-27b
ollama run thanatos-27b

# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
#    and build locally. Loads on every current llama.cpp / Ollama.
make build                                              # Q4_K_M  -> thanatos-27b
make build QUANT=Q3_K_S                                 # 12 GB smaller quant
make build QUANT=Q5_K_M                                 # 20 GB higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf   # skip download
ollama run thanatos-27b

Under the hood, make build calls scripts/build.sh, which downloads the GGUF if missing (set GGUF_PATH to point at one you already have) and runs ollama create with the matching Modelfile.

If you'd rather do it by hand: edit the FROM line in Modelfile and run ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b.

Confirm everything works:

make smoke                          # checks server, model, round-trip, no token leakage
make smoke-tools                    # adds an end-to-end tool-call round-trip (~10s extra)
make bench                          # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-compat

Local apps

App	How to load this model
Ollama	`ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does not read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files.
LM Studio	Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`.
Jan	Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio.
llama.cpp	`hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`).
llama-cpp-python	See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input).
Open WebUI / KoboldCpp / text-generation-webui	Standard llama.cpp loader path — point at the GGUF, use the embedded chat template.

For the full Vision (image input) loader matrix, see Vision. Tool calling currently works in Ollama (via the root-level template file when pulling from hf.co/..., or via the Modelfile TEMPLATE when building locally) and llama.cpp / llama-cpp-python (via the GGUF's embedded jinja). Other apps' tool-calling support depends on whether they read the embedded template or require an external schema.

Inference (OpenAI-compatible)

curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "thanatos-27b",
    "messages": [
      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
    ],
    "temperature": 0.6
  }' | jq -r '.choices[0].message.content'
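
The same request can be issued from Python with only the standard library. This is a sketch under the same assumptions as the curl call above (an Ollama server on localhost:11434 and a local thanatos-27b tag); the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(messages, model="thanatos-27b", temperature=0.6,
                       base_url="http://localhost:11434/v1"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    body = json.dumps({"model": model, "messages": messages, "temperature": temperature})
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's visible answer."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


messages = [
    {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
    {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."},
]
request = build_chat_request(messages)
# print(chat(messages))  # uncomment once the server is running
```

Splitting request construction from sending keeps the payload inspectable without a running server.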

Recommended sampling

Use	temp	top_p	top_k	repeat_penalty
Reasoning / general	0.6	0.95	20	1.05
Creative / RP	0.8	0.95	40	1.02

If the model loops inside <think> tags, lower the temperature to 0.4-0.6 and raise repeat_penalty to 1.08.
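
One way to apply these values per request is through the options object of Ollama's native /api/chat endpoint. The sketch below only builds the payload; the preset names and helper are hypothetical, and the local thanatos-27b tag is assumed to exist.

```python
# Values copied from the recommended-sampling table above.
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def ollama_chat_payload(prompt, preset="reasoning", model="thanatos-27b", **overrides):
    """Build a JSON-serializable body for POST /api/chat; overrides win over the preset."""
    options = {**SAMPLING_PRESETS[preset], **overrides}
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": options,
        "stream": False,
    }


default_payload = ollama_chat_payload("Summarize mergesort.")
# Anti-loop settings for runs that get stuck inside <think> blocks.
anti_loop_payload = ollama_chat_payload(
    "Summarize mergesort.", temperature=0.4, repeat_penalty=1.08
)
```

Keeping the presets in one dictionary makes it harder for the two rows of the table to drift apart across client code.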

System prompt

The Modelfile bakes this in. Override per-request via the system role in your client:

You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.

Vision

The Qwen 3.6 base supports image (and video) input via a separate mmproj projector. The full multimodal stack is:

Qwen3.6-27B-Q4_K_M.gguf   (~17 GB, the text decoder)
mmproj-F16.gguf           (~927 MB, the vision projector)

Both files are at unsloth/Qwen3.6-27B-GGUF. This repo intentionally does not redistribute either.

Loader compatibility — the honest table

Loader	Text	Vision (mmproj)	Notes
llama.cpp (`llama-mtmd-cli`, `llama-server --mmproj`)	✅	✅	Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries.
llama-cpp-python	✅	✅	See `examples/llama_cpp_vision.py`.
Ollama 0.24	✅	❌	Text inference works: Ollama's Go engine has the `qwen35` / `qwen35moe` arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. `ollama create` accepts a dual-`FROM` (text + mmproj) and `ollama show` reports `vision` capability — but the first inference request fails with `error loading model architecture: unknown model architecture: 'qwen35'` (or `'qwen35moe'`), and once mmproj is attached this blocks text inference too. See ollama/ollama#15898.
LM Studio	✅	✅ (last tested)	Uses upstream llama.cpp directly.

Vision via llama.cpp

Three flavors, in order of build-time effort:

# A. HTTP via llama-server (always built — the easiest path).
#    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.

# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
#    so a selective `cmake --build build --target llama-cli ...` won't
#    produce it — a plain `cmake --build build` will. If yours didn't,
#    run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image photo.jpg \
  -p "Describe this image."

# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-F16.gguf \
  --image /path/to/photo.jpg \
  --prompt "What is in this image?"

Until the Ollama upstream issue is fixed, treat Ollama as text-only for this model.
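
For path A above, the request body and the two-part reply can be handled as follows. This is a sketch with hypothetical helper names; it builds the image_url content block from raw bytes and separates the reasoning trace from the visible answer, assuming the response fields described in the comments of path A.

```python
import base64


def image_chat_payload(image_bytes, prompt, mime="image/jpeg", max_tokens=1024):
    """Build an OpenAI-style chat body with the image inlined as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
        # Leave room for the reasoning block as well as the final answer.
        "max_tokens": max_tokens,
    }


def split_reply(response):
    """Return (reasoning_trace, visible_answer) from a chat-completion response dict."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


# Usage: read the file yourself, then POST the payload to
# http://127.0.0.1:8765/v1/chat/completions (the port used in path A).
# payload = image_chat_payload(open("photo.jpg", "rb").read(), "Describe this image.")
```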

Hardware requirements

The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.

Hardware	Status
≥32 GB RAM (CPU-only)	Works, ~1-3 tok/s
RTX 3090 / 4090 24 GB	Works, full Q4 offload, ~25-40 tok/s
RTX 5090 32 GB	Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s
Mac Studio M2/M3 32 GB+ unified	Works, ~15-25 tok/s
32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.)	Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom.

Most numbers in this table are estimates from comparable models; the gradient is right but the absolute values will move ±20% with prompt shape, KV cache type, and parallel-request count. Measure your own machine with make bench (3-prompt mix, reports tok/s from Ollama's eval_count / eval_duration so it's not stopwatch-noisy). Reference data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan: ~12.3 tok/s at Q3_K_S and ~9.3 tok/s at Q4_K_M (3-prompt mix, steady across short / medium / long prompts), sitting between CPU-only and a 24 GB discrete card as expected. An earlier ROCm snapshot of the same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on this hardware.
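
The tok/s figure can be derived from a single Ollama response as sketched below. Ollama reports durations in nanoseconds; the aggregation over several prompts (total tokens over total time) is an assumption and may differ from what scripts/bench.sh does. The sample numbers are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama metadata; eval_duration is in nanoseconds."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def aggregate_tokens_per_second(responses) -> float:
    """Assumed aggregation: total generated tokens divided by total generation time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return total_tokens / total_ns * 1e9


# Hypothetical metadata from three prompts (short, medium, long).
samples = [
    {"eval_count": 100, "eval_duration": 10_000_000_000},
    {"eval_count": 300, "eval_duration": 25_000_000_000},
    {"eval_count": 600, "eval_duration": 45_000_000_000},
]
per_prompt = [tokens_per_second(r) for r in samples]
overall = aggregate_tokens_per_second(samples)
```

Because both fields come from the server's own timers, the result excludes model load time and prompt processing.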

Chat template

Standard Qwen 3.x ChatML with <|im_start|> / <|im_end|> role markers and <think>...</think> blocks for reasoning traces. The Qwen 3.6 jinja template is embedded in the GGUF metadata; loaders that read GGUF chat templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the plain-conversation formatting automatically.

Ollama is the exception: its conversion of the embedded jinja loses the .Tools / .ToolCalls blocks Ollama's capability detector requires. Two paths fix this, depending on how you pull the model:

ollama run hf.co/FoolDev/Thanatos-27B — HF's Ollama bridge applies the root-level template / system / params files in this repo (the bridge does not read Modelfile).
make build / ollama create thanatos-27b -f Modelfile — uses the Modelfile's TEMPLATE block.

Both routes wire .Tools / .ToolCalls and tools work end-to-end on /api/chat and /v1/chat/completions. The two configurations are kept in sync: edit them together if you change one.
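
For intuition, the plain-conversation ChatML layout can be rendered by hand. This is a simplified sketch with a hypothetical function name; the real jinja template embedded in the GGUF also handles tools and reasoning blocks, so prefer the loader's own template in practice.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages with <|im_start|> / <|im_end|> markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are Thanatos, a precise and capable assistant."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
])
print(prompt)
```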

Plain conversation

<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant

With reasoning trace

<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>

Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>

Most clients (Open WebUI, LibreChat, etc.) hide the <think> block by default and surface only the visible answer. Strip it manually with re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL) if your client doesn't.
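
Wrapped as a reusable helper, the same substitution looks like this. It is a sketch with a hypothetical function name; note that the pattern only removes closed blocks, so a trace cut off before </think> (for example by a token limit) is left in place.

```python
import re

# Same pattern as the one-liner above, compiled once for reuse.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove every closed <think>...</think> block and the whitespace after it."""
    return THINK_BLOCK.sub("", content)


raw = (
    "<think>\nThe recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).\n</think>\n\n"
    "Mergesort runs in **O(n log n)** time."
)
visible = strip_think(raw)
print(visible)
```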

Tool / function calling

The wire format depends on the loader. The two formats shown below are both valid Qwen 3.6 outputs; the model adapts to whichever shape the system prompt prescribes.

Ollama path (this repo's Modelfile). The TEMPLATE directive prompts the model to emit JSON-in-XML, the form Ollama's tool-call extractor parses into a structured tool_calls array. After make build, ollama show thanatos-27b lists tools and thinking under Capabilities, and both /api/chat and /v1/chat/completions accept a tools array.

<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>

Embedded-jinja path (llama.cpp, llama-cpp-python, LM Studio). The Qwen 3.6 native chat template baked into the GGUF instructs the model to emit the more verbose XML form it was trained on:

<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>

Use whichever your client expects; don't mix parsers.
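
If your client does not parse tool calls for you, each wire format needs its own parser. The two functions below are a sketch with hypothetical names, one per format; pick the one that matches your loader. In the XML form every argument value arrives as a string, so any type conversion is left to the caller.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_json_tool_calls(text):
    """JSON-in-XML form (the Ollama path): each block holds one JSON object."""
    calls = []
    for body in TOOL_CALL.findall(text):
        obj = json.loads(body)
        calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


def parse_xml_tool_calls(text):
    """Verbose XML form (the embedded-jinja path): <function=...> with <parameter=...> children."""
    calls = []
    for body in TOOL_CALL.findall(text):
        fn = FUNCTION.search(body)
        if fn:
            calls.append({"name": fn.group(1), "arguments": dict(PARAMETER.findall(fn.group(2)))})
    return calls


json_form = '<tool_call>\n{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}\n</tool_call>'
xml_form = (
    "<tool_call>\n<function=get_current_weather>\n<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n</function>\n</tool_call>"
)
print(parse_json_tool_calls(json_form))
print(parse_xml_tool_calls(xml_form))
```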

End-to-end exercise (Ollama path):

python examples/ollama_chat.py        # section 3 runs a real round-trip

Known limitations

Slower per token than the 35B-A3B sibling. Every parameter of the dense 27B participates in every forward pass, whereas the sparse 35B activates only ~3B per token; if you optimize for tokens per second, the MoE wins.
No mmproj in this release, and vision via Ollama is broken upstream (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the Vision section). For image input use llama.cpp directly until that's fixed.
Q4_K_M quality loss is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
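No formal evaluation in this card. Throughput numbers above are estimates, apart from the make bench reference points measured on the Ryzen AI Max+ 395 / Radeon 8060S iGPU.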

Related models

Model	Notes
Qwen/Qwen3.6-27B	Upstream base, safetensors
unsloth/Qwen3.6-27B-GGUF	Recommended GGUF source
FoolDev/Janus-35B	35B-A3B MoE sibling. More capacity, more memory pressure.
Crownelius/Crow-9B-HERETIC-4.6	9B starter model when 27B/35B is too heavy

Credits

Base model: Qwen/Qwen3.6-27B (Alibaba)
Reasoning teacher: Claude Opus 4.7 (Anthropic)
Distillation lineage and dataset curation: Crownelius

License inherited from upstream: Apache-2.0.
