---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
datasets:
- crownelius/Creative_Writing_ShareGPT_Enhanced
- microsoft/rStar-Coder
- peteromallet/dataclaw-peteromallet
- crownelius/Opus-4.7-Reasoning
- openbmb/UltraData-Math
- Crownelius/Crow-Heretic-TeichAI-Unified
language:
- en
- zh
- ru
- es
- fr
- it
- ja
- ko
- de
- ar
- tr
- pl
- sv
- nl
- he
- id
- uk
- fa
- pt
- ms
- fi
- el
tags:
- qwen36
- dense
- conversational
- multimodal
- agent
- gguf
- ollama
- imatrix
library_name: transformers
pipeline_tag: image-text-to-text
---
<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />
# Thanatos-27B
> **Dense Reasoning. Friendlier Footprint.**
> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
## TL;DR
One-liner via Hugging Face (pulls a GGUF + this repo's root-level
`template` / `system` / `params` files, including the tool-calling
template; HF's Ollama bridge ingests those three files, not
`Modelfile`):
```bash
ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
```
If you pulled the bundle during any of the qwen36 windows on the
pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
have a qwen36-stamped blob in your local Ollama store, `make
heal-hf` rebadges it in place. Fresh pulls go straight through.
For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
QUANT=...` is the simplest path. See [Quick start](#quick-start)
below for the full matrix.
For image input, use llama.cpp directly: Ollama vision is broken for
this architecture upstream (see [Vision](#vision)).
## Why a 27B variant?
The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time**: the full 35B has to live in VRAM/RAM even though only ~3B is doing useful work each step.
The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B (on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4; `make bench`, 3-prompt mix), but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
| Active params per token | 27 B | ~3 B |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream (see [Vision](#vision)) | Broken upstream |
| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
| Max context | 262 144 | 262 144 |
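As a rough cross-check on the GGUF sizes in the table, the effective bits per weight can be backed out from file size and parameter count. This is only a back-of-envelope sketch: it treats the approximate figures above as decimal gigabytes, ignores GGUF metadata overhead, and `bits_per_weight` is an illustrative helper that does not exist in this repo.

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Effective bits stored per parameter, treating 1 GB as 10**9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billions * 1e9)


# Approximate (size in GB, parameters in billions) from the table above.
bundles = {
    "Thanatos-27B Q4_K_M": (17.0, 27.0),
    "Thanatos-27B Q3_K_S": (12.0, 27.0),
    "Janus-35B Q4_K_M": (19.0, 35.0),
}

for name, (size_gb, params_b) in bundles.items():
    print(f"{name}: ~{bits_per_weight(size_gb, params_b):.2f} bits/weight")
```

Both models have to keep every weight resident; what differs is how many of those weights each token touches (27 B for the dense model versus ~3 B for the MoE).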
## What's here
| File | Use |
|---|---|
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF; used by `make build` / `ollama create` for **local** builds |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does **not** read `Modelfile`; see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts, which is a no-op on the current qwen35-stamped bundle. |
| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename `FoolDev/Thanatos-27B`) *before* the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35; fresh pulls don't need it. |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream; see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
| `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
| `Makefile` | Convenience wrapper; `make help` lists targets |
| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |
For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
downloads the smaller ~12 GB Q3_K_S quant from
`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
creates a local `thanatos-27b` Ollama tag. That quant is not
redistributed via this repo. For other quants, use `make build QUANT=...`. The
local-build path applies this repo's `Modelfile`; the `hf.co/...`
path applies the root-level `template`, `system`, and `params`
files (kept in sync with the `Modelfile`).
If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
## Architecture
<p align="left">
<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
</p>
- Qwen 3.6 dense, 27B parameters, 64 transformer layers
- **Hybrid attention stack**: 16 repeats of `[3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]`
- Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
- Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
- Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the **base architecture** via a separate
`mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
current loader compatibility.
- Multi-token prediction (MTP) head trained for speculative decoding,
present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
**Not usable via llama.cpp / Ollama today**: the GGUF converter
(`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
`qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
merged 2026-05-16) currently covers other architectures only;
tracking that PR's follow-up work for when qwen35 / qwen35moe
consumer support lands. (Earlier README versions claimed MTP was
available without this caveat; the absence was confirmed empirically via
`gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
2026-05-19.)
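The layer pattern and attention dimensions listed above can be turned into a quick estimate of how the KV cache grows with context. This is a sketch under stated assumptions: it counts only the softmax Gated Attention layers, assumes an unquantized 16-bit cache, and ignores the fixed-size recurrent state of the Gated DeltaNet layers as well as any loader-specific allocation. `layer_schedule` and `kv_cache_bytes` are illustrative names, not functions from this repo.

```python
from collections import Counter

REPEATS = 16  # 16 repeats of [3 x Gated DeltaNet, 1 x Gated Attention]


def layer_schedule(repeats: int = REPEATS) -> list[str]:
    """Return the mixer type of each layer in the hybrid stack, in order."""
    block = ["deltanet", "deltanet", "deltanet", "attention"]
    return block * repeats


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimated KV-cache size for the softmax-attention layers only.

    Assumes a 16-bit cache by default (bytes_per_value=2).
    """
    attention_layers = layer_schedule().count("attention")
    kv_heads, head_dim = 4, 256  # GQA figures from the list above
    # Factor 2 covers the separate K and V tensors.
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


schedule = layer_schedule()
print(len(schedule), Counter(schedule))
for ctx in (8_192, 262_144):
    print(f"{ctx} tokens: {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

Because only one layer in four keeps a growing cache, long contexts cost less memory here than they would in a 64-layer stack made entirely of softmax attention.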
**The bundled GGUF declares `general.architecture: 'qwen35'`**. This is not a
workaround for an unimplemented `qwen36` arch, but the canonical
upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention
family. The naming convergence runs through three layers of the
stack:
- **Qwen's own HF configs.** `Qwen/Qwen3.6-27B/config.json` declares
`"model_type": "qwen3_5"` and
`"architectures": ["Qwen3_5ForConditionalGeneration"]`. The MoE
sibling `Qwen/Qwen3.6-35B-A3B` declares `"qwen3_5_moe"` /
`Qwen3_5MoeForConditionalGeneration`. No `Qwen3_6` arch class
exists in `transformers`; Qwen reuses the 3.5 class names.
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
`Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
`Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
`unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
explicit `case 64: type = LLM_TYPE_27B` branch for this model;
`qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
Janus-35B sibling base. The arch entries were written to load
Qwen 3.6 weights, not just Qwen 3.5.
There is no PR or tracking issue for a `qwen36` arch entry in
`ggml-org/llama.cpp` or `ollama/ollama` because none is needed:
`qwen35` already loads the model the upstream code path was
designed to load.
`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
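To check which stamp a given GGUF carries without installing anything, the metadata header can be read with the standard library alone. The sketch below assumes the little-endian GGUF v2/v3 layout (magic, version, tensor count, KV count, then key/value pairs) and stops at the first `general.architecture` key; it is not the repo's `verify_arch.py`, and `read_architecture` is a hypothetical helper name.

```python
import struct
from typing import BinaryIO

# Fixed-width GGUF value types: type id -> little-endian struct format.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _unpack(f: BinaryIO, fmt: str):
    """Read one fixed-width value, failing loudly on a truncated file."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _unpack(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f: BinaryIO, value_type: int):
    if value_type in _SCALARS:
        return _unpack(f, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        elem_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_architecture(f: BinaryIO) -> str:
    """Return the `general.architecture` string from an open GGUF file."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _unpack(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
    _unpack(f, "<Q")  # tensor count, unused
    kv_count = _unpack(f, "<Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _unpack(f, "<I"))
        if key == "general.architecture":
            return value
    raise KeyError("general.architecture not found")
```

Usage would look like `read_architecture(open("Thanatos-27B.Q4_K_M.gguf", "rb"))`, which should return `qwen35` for the current bundle if the stamp is as described above.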
### History
The bundle's `general.architecture` stamp has now flipped eight
times (four landings on qwen36 and four on qwen35), each time
after weighing the friction-vs-honesty tradeoff anew. The saga
is resolved on the upstream-canonical `qwen35` side:
- **v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC):** initial qwen35
→ qwen36 stamp, on the theory that qwen35 was a loader stand-in
awaiting proper Qwen 3.6 support. Upstream audit later showed
that theory was mistaken (see above).
- **2026-05-19 afternoon (`964e418`):** flipped back to qwen35
after daily friction outweighed version-specificity for that
iteration; doc workaround narrative collapsed (`83022eb`).
- **2026-05-19 evening (`07fa120`):** brief re-flip to qwen36
during a fresh-pull integration test on Strix Halo.
- **2026-05-19 evening (`72259c1`, ~1 hour later):** reverted to
qwen35 again because the live friction was worse than the doc
prose suggested.
- **2026-05-19 evening (`973d7ef`):** flipped to qwen36 one more
time, after the upstream-evidence audit had been shipped and
the friction was a known quantity. Project owner wanted to
test the friction tradeoff in practice with the audit's
conclusion staring them in the face.
- **2026-05-19 evening (`978798f`):** flipped back to qwen35
after seven sequential fresh-pull → heal-hf cycles on the
Strix Halo box made the friction concretely experienced
rather than hypothetical. Each cycle worked (the heal flow
is solid), and each cycle was an unnecessary obstacle for
users who just want `ollama run` to work first try. The
audit (`a4d3b6e`) called the canonical stamp correctly and
the practical friction outweighed the version-specificity
payoff.
- **2026-05-20 midday (`ae67ed1`):** brief re-flip to qwen36
the next morning to re-test the friction in a fresh session.
- **2026-05-20 midday (`e03e10e`, 8 minutes later):** flipped
back to qwen35. Same conclusion as the prior round trip:
friction outweighs version-specificity. **This is the
current state.**
Tensor data was byte-identical across all stamps; only the
`general.architecture` KV (and namespaced KV keys) flipped.
See the [CHANGELOG](CHANGELOG.md) entries for each flip's
rationale.
### Rebadge utility
`scripts/rename_arch.py` is the generic GGUF arch renamer
(metadata only, tensors byte-identical), kept in the repo for
the legacy qwen36 → qwen35 in-store rebadge (used by `make
heal-hf` and `make load-bundle`) and any future arch flip:
```bash
# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
--from-arch qwen36 --to-arch qwen35 \
Thanatos-27B.Q4_K_M.qwen36.gguf \
Thanatos-27B.Q4_K_M.gguf
```
## Quick start
### Ollama
Three paths:
```bash
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
# root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen35-stamped
# B. Build a local `thanatos-27b` tag from THIS repo's bundle
# (LFS smudge if needed, then `ollama create`). Useful if you
# want a bare local tag rather than the `hf.co/...` path:
make load-bundle # creates local tag thanatos-27b
ollama run thanatos-27b
# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
# and build locally. Loads on every current llama.cpp / Ollama.
make build # Q4_K_M -> thanatos-27b
make build QUANT=Q3_K_S # 12 GB smaller quant
make build QUANT=Q5_K_M # 20 GB higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
ollama run thanatos-27b
```
Under the hood, `make build` calls `scripts/build.sh`, which downloads the
GGUF if missing (set `GGUF_PATH` to point at one you already have) and
runs `ollama create` with the matching `Modelfile`.
If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.
Confirm everything works:
```bash
make smoke # checks server, model, round-trip, no token leakage
make smoke-tools # adds an end-to-end tool-call round-trip (~10s extra)
make bench # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-compat
```
### Local apps
| App | How to load this model |
|---|---|
| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
| **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path: point at the GGUF, use the embedded chat template. |
For the full Vision (image input) loader matrix, see [Vision](#vision).
Tool calling currently works in **Ollama** (via the root-level
`template` file when pulling from `hf.co/...`, or via the `Modelfile`
TEMPLATE when building locally) and **llama.cpp / llama-cpp-python**
(via the GGUF's embedded jinja). Other apps' tool-calling support
depends on whether they read the embedded template or require an
external schema.
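For the Ollama path, a tool-call round-trip looks roughly like the sketch below. It targets Ollama's native `/api/chat` endpoint on the default port and assumes a local `thanatos-27b` tag; the `get_weather` tool is hypothetical and exists only to show the schema shape. See `examples/ollama_chat.py` for the repo's own demo.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama port

# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str, model: str = "thanatos-27b") -> dict:
    """Assemble a non-streaming chat request that advertises one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "stream": False,
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, arguments) pairs from a chat response, if any."""
    calls = response.get("message", {}).get("tool_calls") or []
    return [(c["function"]["name"], c["function"]["arguments"]) for c in calls]


def chat(payload: dict) -> dict:
    """POST the payload to a running Ollama daemon and decode the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With a daemon running, `extract_tool_calls(chat(build_tool_request("What's the weather in Oslo?")))` should yield the requested call; an empty list means the model answered in plain text instead.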
### Inference (OpenAI-compatible)
```bash
curl -s http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "thanatos-27b",
"messages": [
{"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
{"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
],
"temperature": 0.6
}' | jq -r '.choices[0].message.content'
```
### Recommended sampling
| Use | temp | top_p | top_k | repeat_penalty |
|---|---:|---:|---:|---:|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |
Lower the temperature (0.4-0.6) and bump `repeat_penalty` to 1.08 if the model loops inside `<think>` tags.
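One way to carry these presets into client code is a small lookup that produces an Ollama `options` dict. This is a sketch, not something shipped in this repo: the preset names and the `sampling_options` helper are made up for illustration, and the 0.5 anti-loop temperature is an arbitrary pick from the 0.4-0.6 range above.

```python
PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def sampling_options(use: str, looping: bool = False,
                     loop_temperature: float = 0.5) -> dict:
    """Return Ollama sampling options for a preset.

    With looping=True, cap the temperature and raise repeat_penalty to 1.08,
    following the anti-loop advice above.
    """
    options = dict(PRESETS[use])  # copy so the preset table stays untouched
    if looping:
        options["temperature"] = min(options["temperature"], loop_temperature)
        options["repeat_penalty"] = 1.08
    return options


def build_chat_request(prompt: str, use: str = "reasoning",
                       model: str = "thanatos-27b") -> dict:
    """Body for Ollama's /api/chat with the chosen preset applied."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": sampling_options(use),
        "stream": False,
    }
```

Per-request `options` override whatever the `Modelfile` or `params` file set as defaults, so the presets can be switched without rebuilding the tag.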
### System prompt
The Modelfile bakes this in. Override per-request via the `system` role
in your client:
```text
You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.
```
## Vision
The Qwen 3.6 base supports image (and video) input via a separate
`mmproj` projector. The full multimodal stack is:
```
Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
mmproj-F16.gguf (~927 MB, the vision projector)
```
Both files are at
[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
This repo intentionally does not redistribute either.
### Loader compatibility: the honest table
| Loader | Text | Vision (mmproj) | Notes |
|---|---|---|---|
| **llama.cpp** (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
| **llama-cpp-python** | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
| **Ollama 0.24** | ✅ | ❌ | Text inference works: Ollama's Go engine has the `qwen35` / `qwen35moe` arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. `ollama create` accepts a dual-`FROM` (text + mmproj) and `ollama show` reports `vision` capability, but the **first inference request** fails with `error loading model architecture: unknown model architecture: 'qwen35'` (or `'qwen35moe'`), and once mmproj is attached this blocks text inference too. See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). |
| **LM Studio** | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |
### Vision via llama.cpp
Three flavors, in order of build-time effort:
```bash
# A. HTTP via llama-server (always built; the easiest path).
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-F16.gguf \
--host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block, e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.
# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
# so a selective `cmake --build build --target llama-cli ...` won't
# produce it; a plain `cmake --build build` will. If yours didn't,
# run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-F16.gguf \
--image photo.jpg \
-p "Describe this image."
# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj /path/to/mmproj-F16.gguf \
--image /path/to/photo.jpg \
--prompt "What is in this image?"
```
Until the Ollama upstream issue is fixed, treat Ollama as **text-only**
for this model.
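
For flavor A, the request body can be assembled with the standard library alone. The sketch below is an illustration, not the code in `examples/`: `build_vision_payload` and `ask` are hypothetical helper names, the port matches the `llama-server` command above, and `max_tokens=1024` is an arbitrary value picked to stay above the 500-token guidance.

```python
import base64
import json
import urllib.request


def build_vision_payload(image_bytes, prompt, mime="image/jpeg", max_tokens=1024):
    """Build an OpenAI-style chat completion body carrying one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "max_tokens": max_tokens,  # leave room for the reasoning block
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{b64}"},
                    },
                ],
            }
        ],
    }


def ask(image_path, prompt, base_url="http://127.0.0.1:8765"):
    """POST the payload to a running llama-server; returns (reasoning, answer)."""
    with open(image_path, "rb") as fh:
        payload = build_vision_payload(fh.read(), prompt)
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # reasoning_content may be absent if the server is not splitting the trace.
    return message.get("reasoning_content"), message.get("content")
```

With the server from flavor A running, `ask("photo.jpg", "Describe this image.")` should return the thinking trace and the visible answer as separate strings.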
## Hardware requirements
The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.
| Hardware | Status |
|---|---|
| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |
Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s, so Vulkan was the clear winner on
this hardware.
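
To cross-check a single run by hand, the same two fields can be read from any Ollama `/api/generate` or `/api/chat` response. A minimal sketch, assuming `eval_duration` is reported in nanoseconds as in Ollama's API documentation; `tokens_per_second` is a hypothetical helper, and the sample values are made up for illustration rather than measured. How `make bench` aggregates its three prompts is not reproduced here.

```python
def tokens_per_second(response):
    """Decode throughput from an Ollama response dict.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical response fragment: 100 tokens generated in 8 seconds.
sample = {"eval_count": 100, "eval_duration": 8_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Because both fields come from the server's own timers, the result excludes model load and prompt processing time.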
## Chat template
Standard Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` role markers
and `<think>...</think>` blocks for reasoning traces. The Qwen 3.6 jinja
template is embedded in the GGUF metadata; loaders that read GGUF chat
templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the
plain-conversation formatting automatically.
Ollama is the exception: its conversion of the embedded jinja loses the
`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
Two paths fix this, depending on how you pull the model:
- **`ollama run hf.co/FoolDev/Thanatos-27B`**: HF's Ollama bridge applies
the root-level `template` / `system` / `params` files in this repo
(the bridge does **not** read `Modelfile`).
- **`make build` / `ollama create thanatos-27b -f Modelfile`**: uses the
`Modelfile`'s `TEMPLATE` block.
Both routes wire up `.Tools` / `.ToolCalls`, and tools work end-to-end on
`/api/chat` and `/v1/chat/completions`. The two configurations are
kept in sync; edit them together if you change one.
#### Plain conversation
```text
<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
```
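
For clients that have to build the prompt string themselves, the plain-conversation layout above reduces to a few lines. This is a sketch only: `render_chatml` is a hypothetical helper, it covers plain text turns without tools or images, and the jinja template embedded in the GGUF remains the authority on exact whitespace and any reasoning-related prefixes.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the time complexity of mergesort?"},
    ]
)
print(prompt)
```

Loaders that read the GGUF template do this for you; hand-rendering is mainly useful for raw completion endpoints or for debugging what the model actually sees.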
#### With reasoning trace
```text
<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>
Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>
```
Most clients (Open WebUI, LibreChat, etc.) hide the `<think>` block by
default and surface only the visible answer. Strip it manually with
`re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)` if your
client doesn't.
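
Wrapped as helpers, the same regex can also return the trace instead of discarding it. A sketch; `strip_think` and `split_think` are hypothetical names, and `split_think` assumes at most one complete `<think>` block per message.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", flags=re.DOTALL)


def strip_think(content):
    """Remove every complete <think>...</think> block and trailing whitespace."""
    return THINK_BLOCK.sub("", content)


def split_think(content):
    """Return (reasoning, visible_answer); reasoning is "" if no block is found."""
    match = THINK_BLOCK.search(content)
    if match is None:
        return "", content
    visible = content[: match.start()] + content[match.end():]
    return match.group(1).strip(), visible


raw = "<think>\nT(n) = 2T(n/2) + O(n)\n</think>\nMergesort runs in O(n log n)."
print(strip_think(raw))
```

An unterminated `<think>` (for example when `max_tokens` cuts the reply short) is left untouched by both helpers, which makes truncated outputs easy to spot.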
#### Tool / function calling
The wire format depends on the loader. Both are valid Qwen 3.6 outputs;
the model adapts to whichever shape the system prompt prescribes.
**Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
extractor parses into a structured `tool_calls` array. After
`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
accept a `tools` array.
```text
<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>
```
**Embedded-jinja path** (llama.cpp, llama-cpp-python, LM Studio). The
Qwen 3.6 native chat template baked into the GGUF instructs the model
to emit the more verbose XML form it was trained on:
```text
<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
```
Use whichever your client expects; don't mix parsers.
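
If you are handling raw model text yourself, with no loader-side extraction, each shape needs its own small parser. The sketch below is an illustration under stated assumptions, not the parser Ollama or llama.cpp uses: the function names are hypothetical, the regexes cover only the two layouts shown above, and the XML form yields every argument as a string.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_json_tool_calls(text):
    """JSON-in-XML shape: each <tool_call> body is a single JSON object."""
    return [json.loads(body) for body in TOOL_CALL.findall(text)]


def parse_xml_tool_calls(text):
    """Verbose XML shape: <function=NAME> wrapping <parameter=KEY> blocks."""
    calls = []
    for body in TOOL_CALL.findall(text):
        for name, inner in FUNCTION.findall(body):
            arguments = dict(PARAMETER.findall(inner))  # values stay strings
            calls.append({"name": name, "arguments": arguments})
    return calls


json_style = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}\n'
    "</tool_call>"
)
xml_style = (
    "<tool_call>\n<function=get_current_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_json_tool_calls(json_style))
print(parse_xml_tool_calls(xml_style))
```

Keeping the two parsers as separate functions makes the choice explicit at the call site, so a client never silently falls back from one shape to the other.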
End-to-end exercise (Ollama path):
```bash
python examples/ollama_chat.py # section 3 runs a real round-trip
```
## Known limitations
- **Slower per token than the 35B-A3B sibling.** The dense 27B runs every parameter for every token, while the sparse 35B-A3B activates only about 3B per token; if you optimize for tokens per second, the MoE wins.
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached; see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
- **No formal evaluation in this card.** Numbers above are estimates.
## Related models
| Model | Notes |
|---|---|
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
## Credits
- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)
License inherited from upstream: Apache-2.0.