---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-27B
datasets:
  - crownelius/Creative_Writing_ShareGPT_Enhanced
  - microsoft/rStar-Coder
  - peteromallet/dataclaw-peteromallet
  - crownelius/Opus-4.7-Reasoning
  - openbmb/UltraData-Math
  - Crownelius/Crow-Heretic-TeichAI-Unified
language:
  - en
  - zh
  - ru
  - es
  - fr
  - it
  - ja
  - ko
  - de
  - ar
  - tr
  - pl
  - sv
  - nl
  - he
  - id
  - uk
  - fa
  - pt
  - ms
  - fi
  - el
tags:
  - qwen36
  - dense
  - conversational
  - multimodal
  - agent
  - gguf
  - ollama
  - imatrix
library_name: transformers
pipeline_tag: image-text-to-text
---

<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />

[![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
[![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
[![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
[![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)

# Thanatos-27B

> **Dense Reasoning. Friendlier Footprint.**
> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*

**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`

A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.

## TL;DR

One-liner via Hugging Face (pulls a GGUF + this repo's root-level
`template` / `system` / `params` files, including the tool-calling
template; HF's Ollama bridge ingests those three files, not
`Modelfile`):

```bash
ollama run hf.co/FoolDev/Thanatos-27B           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
```

If you pulled the bundle during any of the qwen36 windows on the
pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
have a qwen36-stamped blob in your local Ollama store, `make
heal-hf` rebadges it in place. Fresh pulls go straight through.

For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
QUANT=...` is the simplest path. See [Quick start](#quick-start)
below for the full matrix.

For image input use llama.cpp directly: Ollama vision is broken for
this architecture upstream (see [Vision](#vision)).

## Why a 27B variant?

The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time**: the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.

The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B (on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4; `make bench`, 3-prompt mix), but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.

| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
| Active params per token | 27 B | ~3 B |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream, see below | Broken upstream |
| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
| Max context | 262 144 | 262 144 |

## What's here

| File | Use |
|---|---|
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF, used by `make build` / `ollama create` for **local** builds |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does **not** read `Modelfile`; see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts; it is a no-op on the current qwen35-stamped bundle. |
| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename `FoolDev/Thanatos-27B`) *before* the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35; fresh pulls don't need it. |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream; see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
| `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
| `Makefile` | Convenience wrapper; `make help` lists targets |
| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |

For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
downloads the smaller ~12 GB Q3_K_S quant from
`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
creates a local `thanatos-27b` Ollama tag. This quant is not redistributed
via this repo. For other quants use `make build QUANT=...`. The
local-build path applies this repo's `Modelfile`; the `hf.co/...`
path applies the root-level `template`, `system`, and `params`
files (kept in sync with the `Modelfile`).

If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).

## Architecture

<p align="left">
  <img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
</p>

- Qwen 3.6 dense, 27B parameters, 64 transformer layers
- **Hybrid attention stack**: 16 repeats of `[3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]`
  - Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
  - Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
- Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the **base architecture** via a separate
  `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
  from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
  current loader compatibility.
- Multi-token prediction (MTP) head trained for speculative decoding:
  present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
  vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
  **Not usable via llama.cpp / Ollama today**: the GGUF converter
  (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
  `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
  inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
  851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
  merged 2026-05-16) currently covers other architectures only;
  tracking that PR's follow-up work for when qwen35 / qwen35moe
  consumer support lands. (Earlier README versions claimed MTP was
  available without this caveat; confirmed empirically via
  `gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
  2026-05-19.)
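
The layer pattern and dimensions in the bullets above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the repo's `scripts/verify_arch.py`. The 2-byte (f16) cache element size is an assumption, and the estimate counts only the Gated Attention layers, on the understanding that linear-attention layers keep a fixed-size state rather than a cache that grows with context.

```python
# Rebuild the 64-layer stack from the repeat pattern listed above.
REPEATS = 16
BLOCK = ["deltanet"] * 3 + ["attention"]  # 3 x Gated DeltaNet, then 1 x Gated Attention

layers = BLOCK * REPEATS
n_attention = layers.count("attention")

HIDDEN = 5120
FFN_INTERMEDIATE = 17408
KV_HEADS = 4      # Gated Attention KV-heads (GQA)
HEAD_DIM = 256    # Gated Attention head_dim


def kv_cache_bytes(context_tokens: int, bytes_per_element: int = 2) -> int:
    """Rough K+V cache size for the softmax-attention layers only.

    bytes_per_element=2 assumes an f16 cache (an assumption, not a repo default).
    """
    per_token = 2 * n_attention * KV_HEADS * HEAD_DIM * bytes_per_element  # K and V
    return per_token * context_tokens


print(len(layers), n_attention)             # total layers, attention layers
print(round(FFN_INTERMEDIATE / HIDDEN, 2))  # FFN expansion ratio
print(kv_cache_bytes(8192) / 2**20, "MiB")  # cache estimate at an 8K context
```

Under these assumptions an 8K context needs about 512 MiB of KV cache, which is small next to the ~17 GB of Q4_K_M weights.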

**The bundled GGUF declares `general.architecture: 'qwen35'`**: not a
workaround for an unimplemented `qwen36` arch, but the canonical
upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention
family. The naming convergence runs through three layers of the
stack:

- **Qwen's own HF configs.** `Qwen/Qwen3.6-27B/config.json` declares
  `"model_type": "qwen3_5"` and
  `"architectures": ["Qwen3_5ForConditionalGeneration"]`. The MoE
  sibling `Qwen/Qwen3.6-35B-A3B` declares `"qwen3_5_moe"` /
  `Qwen3_5MoeForConditionalGeneration`. No `Qwen3_6` arch class
  exists in `transformers`; Qwen reuses the 3.5 class names.
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
  `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
  `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
  GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
  `unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
  explicit `case 64: type = LLM_TYPE_27B` branch for this model;
  `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
  Janus-35B sibling base. The arch entries were written to load
  Qwen 3.6 weights, not just Qwen 3.5.

There is no PR or tracking issue for a `qwen36` arch entry in
`ggml-org/llama.cpp` or `ollama/ollama` because none is needed:
`qwen35` already loads the model the upstream code path was
designed to load.

`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
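
To see which stamp a given file carries without installing anything, the GGUF header can be read with the standard library alone. The sketch below follows the published GGUF layout (magic, version, tensor count, KV count, then key/value pairs) and assumes a little-endian v2 or v3 file. It is not the `gguf.GGUFReader` path used elsewhere in this repo, and `read_architecture` is a name invented for this example.

```python
import io
import struct

# Fixed-width scalar types from the GGUF spec: type id -> struct format.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-width little-endian value."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f):
    """Return general.architecture from a little-endian GGUF v2/v3 stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit lengths; not handled here")
    _read(f, "<Q")  # tensor count, unused
    kv_count = _read(f, "<Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value  # stop early; no tensor data is touched
    raise KeyError("general.architecture")
```

Usage would look like `read_architecture(open("Thanatos-27B.Q4_K_M.gguf", "rb"))`, which should report `qwen35` for the current bundle. Only the metadata header is parsed, so the check does not need to load the weights.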

### History

The bundle's `general.architecture` stamp has now flipped eight
times (four landings on qwen36 and four on qwen35), each time
after weighing the friction-vs-honesty tradeoff anew. The saga
is resolved on the upstream-canonical `qwen35` side:

- **v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC):** initial qwen35
  → qwen36 stamp, on the theory that qwen35 was a loader stand-in
  awaiting proper Qwen 3.6 support. Upstream audit later showed
  that theory was mistaken (see above).
- **2026-05-19 afternoon (`964e418`):** flipped back to qwen35
  after daily friction outweighed version-specificity for that
  iteration; doc workaround narrative collapsed (`83022eb`).
- **2026-05-19 evening (`07fa120`):** brief re-flip to qwen36
  during a fresh-pull integration test on Strix Halo.
- **2026-05-19 evening (`72259c1`, ~1 hour later):** reverted to
  qwen35 again because the live friction was worse than the doc
  prose suggested.
- **2026-05-19 evening (`973d7ef`):** flipped to qwen36 one more
  time, after the upstream-evidence audit had been shipped and
  the friction was a known quantity. Project owner wanted to
  test the friction tradeoff in practice with the audit's
  conclusion staring them in the face.
- **2026-05-19 evening (`978798f`):** flipped back to qwen35
  after seven sequential fresh-pull → heal-hf cycles on the
  Strix Halo box made the friction concretely-experienced
  rather than hypothetical. Each cycle worked (the heal flow
  is solid), and each cycle was an unnecessary obstacle for
  users who just want `ollama run` to work first try. The
  audit (`a4d3b6e`) called the canonical stamp correctly and
  the practical friction outweighed the version-specificity
  payoff.
- **2026-05-20 midday (`ae67ed1`):** brief re-flip to qwen36
  the next morning to re-test the friction in a fresh session.
- **2026-05-20 midday (`e03e10e`, 8 minutes later):** flipped
  back to qwen35. Same conclusion as the prior round trip:
  friction outweighs version-specificity. **This is the
  current state.**

Tensor data was byte-identical across all stamps; only the
`general.architecture` KV (and namespaced KV keys) flipped.
See the [CHANGELOG](CHANGELOG.md) entries for each flip's
rationale.

### Rebadge utility

`scripts/rename_arch.py` is the generic GGUF arch renamer
(metadata only, tensors byte-identical), kept in the repo for
the legacy qwen36 → qwen35 in-store rebadge (used by `make
heal-hf` and `make load-bundle`) and any future arch flip:

```bash
# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
    --from-arch qwen36 --to-arch qwen35 \
    Thanatos-27B.Q4_K_M.qwen36.gguf \
    Thanatos-27B.Q4_K_M.gguf
```
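
What "metadata only" means in practice is that the arch string and the keys namespaced under it change, while everything else stays put. The sketch below illustrates that renaming on a plain dictionary. It is not the code in `scripts/rename_arch.py`; a real rebadge also has to re-serialize the GGUF header and copy the tensor data through unchanged. The sample keys are illustrative.

```python
def rename_arch_keys(metadata: dict, from_arch: str, to_arch: str) -> dict:
    """Return a copy of `metadata` with the arch stamp and namespaced keys renamed."""
    prefix = from_arch + "."
    renamed = {}
    for key, value in metadata.items():
        if key == "general.architecture" and value == from_arch:
            value = to_arch                            # the stamp itself
        elif key.startswith(prefix):
            key = to_arch + "." + key[len(prefix):]    # e.g. qwen36.block_count
        renamed[key] = value                           # other keys pass through
    return renamed


# Illustrative metadata, not dumped from the real bundle.
legacy = {
    "general.architecture": "qwen36",
    "qwen36.block_count": 64,
    "qwen36.embedding_length": 5120,
    "tokenizer.ggml.model": "gpt2",
}
print(rename_arch_keys(legacy, "qwen36", "qwen35"))
```

Running the function on metadata that is already qwen35-stamped returns it unchanged, which matches the idempotent behavior described for `make heal-hf`.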

## Quick start

### Ollama

Three paths:

```bash
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
#    root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M, qwen35-stamped

# B. Build a local `thanatos-27b` tag from THIS repo's bundle
#    (LFS smudge if needed, then `ollama create`). Useful if you
#    want a bare local tag rather than the `hf.co/...` path:
make load-bundle                                 # creates local tag thanatos-27b
ollama run thanatos-27b

# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
#    and build locally. Loads on every current llama.cpp / Ollama.
make build                                              # Q4_K_M  -> thanatos-27b
make build QUANT=Q3_K_S                                 # 12 GB smaller quant
make build QUANT=Q5_K_M                                 # 20 GB higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf   # skip download
ollama run thanatos-27b
```

Under the hood, `make build` calls `scripts/build.sh`, which downloads the
GGUF if missing (set `GGUF_PATH` to point at one you already have) and
runs `ollama create` with the matching `Modelfile`.

If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.

Confirm everything works:

```bash
make smoke                          # checks server, model, round-trip, no token leakage
make smoke-tools                    # adds an end-to-end tool-call round-trip (~10s extra)
make bench                          # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-compat
```

### Local apps

| App | How to load this model |
|---|---|
| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
| **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path: point at the GGUF, use the embedded chat template. |

For the full Vision (image input) loader matrix, see [Vision](#vision).
Tool calling currently works in **Ollama** (via the root-level
`template` file when pulling from `hf.co/...`, or via the `Modelfile`
TEMPLATE when building locally) and **llama.cpp / llama-cpp-python**
(via the GGUF's embedded jinja). Other apps' tool-calling support
depends on whether they read the embedded template or require an
external schema.

### Inference (OpenAI-compatible)

```bash
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "thanatos-27b",
    "messages": [
      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
    ],
    "temperature": 0.6
  }' | jq -r '.choices[0].message.content'
```
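
The same request can be sent from Python without extra dependencies. A minimal sketch using only the standard library; `build_request` and `ask` are names made up for this example, and the endpoint and model tag are the ones from the `curl` call above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(prompt: str, model: str = "thanatos-27b",
                  temperature: float = 0.6) -> urllib.request.Request:
    """Build the same JSON body as the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the visible answer (needs a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain the Burrows-Wheeler transform in 200 words.")` should mirror the `curl | jq` pipeline, provided the `thanatos-27b` tag exists locally.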

### Recommended sampling

| Use | temp | top_p | top_k | repeat_penalty |
|---|---:|---:|---:|---:|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |

Lower temperature (0.4-0.6) and bump `repeat_penalty` to 1.08 if it loops inside `<think>` tags.
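
One way to carry these presets in client code is a small lookup that produces an Ollama `options` dictionary. This is a sketch: the preset names, the `looping` flag, and the choice of 0.5 as the lowered temperature (a point inside the 0.4-0.6 range above) are assumptions made for illustration.

```python
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def ollama_options(use: str, looping: bool = False) -> dict:
    """Return sampling options for a preset, tightened if the model loops in <think>."""
    options = dict(SAMPLING_PRESETS[use])  # copy so the preset table stays untouched
    if looping:
        options["temperature"] = min(options["temperature"], 0.5)
        options["repeat_penalty"] = 1.08
    return options


print(ollama_options("reasoning"))
print(ollama_options("creative", looping=True))
```

The result can be passed as the `options` field of an `/api/chat` request body.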

### System prompt

The Modelfile bakes this in. Override per-request via the `system` role
in your client:

```text
You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.
```

## Vision

The Qwen 3.6 base supports image (and video) input via a separate
`mmproj` projector. The full multimodal stack is:

```
Qwen3.6-27B-Q4_K_M.gguf   (~17 GB, the text decoder)
mmproj-F16.gguf           (~927 MB, the vision projector)
```

Both files are at
[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
This repo intentionally does not redistribute either.

### Loader compatibility: the honest table

| Loader | Text | Vision (mmproj) | Notes |
|---|---|---|---|
| **llama.cpp** (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
| **llama-cpp-python** | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
| **Ollama 0.24** | ✅ | ❌ | Text inference works: Ollama's Go engine has the `qwen35` / `qwen35moe` arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. `ollama create` accepts a dual-`FROM` (text + mmproj) and `ollama show` reports `vision` capability, but the **first inference request** fails with `error loading model architecture: unknown model architecture: 'qwen35'` (or `'qwen35moe'`), and once mmproj is attached this blocks text inference too. See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). |
| **LM Studio** | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |

### Vision via llama.cpp

Three flavors, in order of build-time effort:

```bash
# A. HTTP via llama-server (always built; the easiest path).
#    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block, e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.

# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
#    so a selective `cmake --build build --target llama-cli ...` won't
#    produce it; a plain `cmake --build build` will. If yours didn't,
#    run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image photo.jpg \
  -p "Describe this image."

# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-F16.gguf \
  --image /path/to/photo.jpg \
  --prompt "What is in this image?"
```
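
For path A, the request body is an ordinary OpenAI-style chat completion whose user message carries the image as a base64 data URI. A minimal sketch of building it; the helper names and the 1024-token default are assumptions, chosen to stay above the 500-token budget mentioned in the comments above.

```python
import base64
import json


def image_message(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build a user message with a text part and an inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


def vision_payload(image_bytes: bytes, prompt: str, max_tokens: int = 1024) -> str:
    """Serialize a chat-completions body for llama-server started with --mmproj."""
    return json.dumps({
        "messages": [image_message(image_bytes, prompt)],
        "max_tokens": max_tokens,  # leave room for the reasoning trace
    })
```

The resulting string would be POSTed to `http://127.0.0.1:8765/v1/chat/completions` with a `Content-Type: application/json` header, matching the host and port in the `llama-server` command above.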

Until the Ollama upstream issue is fixed, treat Ollama as **text-only**
for this model.

## Hardware requirements

The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.

| Hardware | Status |
|---|---|
| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |

Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s; Vulkan was the clear winner on
this hardware.
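
The tok/s figure comes straight from two fields in an Ollama response: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch of the calculation follows. How `scripts/bench.sh` aggregates the three prompts is not shown in this README, so the total-tokens-over-total-time average here is an assumption.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput for one response; eval_duration is in nanoseconds."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def mean_tokens_per_second(responses: list) -> float:
    """Throughput over a prompt mix: total tokens divided by total decode time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return total_tokens / total_ns * 1e9


# Made-up sample values, only to show the unit conversion.
sample = {"eval_count": 246, "eval_duration": 20_000_000_000}
print(round(tokens_per_second(sample), 1))
```

Because both fields are reported by the server, the number excludes network latency and prompt-processing time.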

## Chat template

Standard Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` role markers
and `<think>...</think>` blocks for reasoning traces. The Qwen 3.6 jinja
template is embedded in the GGUF metadata; loaders that read GGUF chat
templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the
plain-conversation formatting automatically.

Ollama is the exception: its conversion of the embedded jinja loses the
`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
Two paths fix this, depending on how you pull the model:

- **`ollama run hf.co/FoolDev/Thanatos-27B`**: HF's Ollama bridge applies
  the root-level `template` / `system` / `params` files in this repo
  (the bridge does **not** read `Modelfile`).
- **`make build` / `ollama create thanatos-27b -f Modelfile`**: uses the
  `Modelfile`'s `TEMPLATE` block.

Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
`/api/chat` and `/v1/chat/completions`. The two configurations are
kept in sync: edit them together if you change one.

#### Plain conversation

```text
<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
```
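
For clients that have to build the prompt string themselves, the plain-conversation layout above can be produced in a few lines. This is a simplified sketch: the jinja template embedded in the GGUF also handles tools, reasoning traces, and images, none of which is covered here, and `render_chatml` is a name made up for this example.

```python
def render_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Render role/content messages with <|im_start|> / <|im_end|> markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # leave the assistant turn open
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are Thanatos, a precise and capable assistant."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
]))
```

Loaders that read the embedded template do this for you, so the function is mainly useful for raw-completion endpoints and for debugging.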

#### With reasoning trace

```text
<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>

Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>
```

Most clients (Open WebUI, LibreChat, etc.) hide the `<think>` block by
default and surface only the visible answer. Strip it manually with
`re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)` if your
client doesn't.
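
Wrapped as a helper, the substitution above looks like this. A small sketch; `strip_think` is a name chosen for this example.

```python
import re

# Non-greedy match so several <think> blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove reasoning traces and the whitespace that follows them."""
    return _THINK_BLOCK.sub("", content)


reply = "<think>\nSplit, sort halves, merge.\n</think>\n\nMergesort runs in O(n log n)."
print(strip_think(reply))
```

An unterminated `<think>` block (for example when generation is cut off by `max_tokens`) does not match the pattern and is left in place, so truncated replies may need separate handling.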

#### Tool / function calling

The wire format depends on the loader. Both are valid Qwen 3.6 outputs;
the model adapts to whichever shape the system prompt prescribes.

**Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
extractor parses into a structured `tool_calls` array. After
`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
accept a `tools` array.

```text
<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>
```

**Embedded-jinja path** (llama.cpp, llama-cpp-python, LM Studio). The
Qwen 3.6 native chat template baked into the GGUF instructs the model
to emit the more verbose XML form it was trained on:

```text
<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
```

Use whichever your client expects; don't mix parsers.
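
If a client receives raw text rather than a structured `tool_calls` array, each wire format needs its own parser. The sketch below shows one for each shape, using the examples above as input. The function names are made up for this example, and the XML parser returns every argument as a string because that format carries no type information.

```python
import json
import re

_TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
_FUNCTION = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
_PARAMETER = re.compile(r"<parameter=([^>]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_json_tool_calls(text: str) -> list:
    """Ollama-path shape: a JSON object inside each <tool_call> block."""
    return [json.loads(body) for body in _TOOL_CALL.findall(text)]


def parse_xml_tool_calls(text: str) -> list:
    """Embedded-jinja shape: <function=...> with nested <parameter=...> blocks."""
    calls = []
    for body in _TOOL_CALL.findall(text):
        for name, inner in _FUNCTION.findall(body):
            arguments = dict(_PARAMETER.findall(inner))  # values stay strings
            calls.append({"name": name, "arguments": arguments})
    return calls


json_form = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}\n'
    "</tool_call>"
)
xml_form = (
    "<tool_call>\n<function=get_current_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_json_tool_calls(json_form))
print(parse_xml_tool_calls(xml_form))
```

Both calls yield the same name and arguments for these inputs. Feeding one format to the other parser either raises a JSON error or returns an empty list.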

End-to-end exercise (Ollama path):

```bash
python examples/ollama_chat.py        # section 3 runs a real round-trip
```

## Known limitations

- **Slower per token than the 35B-A3B sibling.** All 27B parameters participate in every forward pass, while the MoE activates only ~3B per token; if you optimize for tokens-per-second, the MoE wins.
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached; see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
- **No formal evaluation in this card.** Numbers above are estimates.

## Related models

| Model | Notes |
|---|---|
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |

## Credits

- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

License inherited from upstream: Apache-2.0.