---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
datasets:
- crownelius/Creative_Writing_ShareGPT_Enhanced
- microsoft/rStar-Coder
- peteromallet/dataclaw-peteromallet
- crownelius/Opus-4.7-Reasoning
- openbmb/UltraData-Math
- Crownelius/Crow-Heretic-TeichAI-Unified
language:
- en
- zh
- ru
- es
- fr
- it
- ja
- ko
- de
- ar
- tr
- pl
- sv
- nl
- he
- id
- uk
- fa
- pt
- ms
- fi
- el
tags:
- qwen36
- dense
- conversational
- multimodal
- agent
- gguf
- ollama
- imatrix
library_name: transformers
pipeline_tag: image-text-to-text
---
<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />
[![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
[![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
[![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/Janus-35B)
[![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)
# Thanatos-27B
> **Dense Reasoning. Friendlier Footprint.**
> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
## TL;DR
One-liner via Hugging Face (pulls a GGUF + this repo's root-level
`template` / `system` / `params` files, including the tool-calling
template — HF's Ollama bridge ingests those three files, not
`Modelfile`):
```bash
ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
```
If you pulled the bundle during any of the qwen36 windows on the
pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
have a qwen36-stamped blob in your local Ollama store, `make
heal-hf` rebadges it in place. Fresh pulls go straight through.
For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
QUANT=...` is the simplest path. See [Quick start](#quick-start)
below for the full matrix.
For image input use llama.cpp directly — Ollama vision is broken for
this architecture upstream (see [Vision](#vision)).
## Why a 27B variant?
The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B: on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10-12 tok/s depending on backend (see [Hardware requirements](#hardware-requirements)), versus ~27 tok/s for the MoE 35B at ~Q4 (`make bench`, 3-prompt mix). In exchange, the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
| Active params per token | 27 B | ~3 B |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
| Max context | 262 144 | 262 144 |
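As a rough cross-check on the GGUF size rows above, file size is approximately parameter count times average bits per weight. The bits-per-weight values in this sketch are commonly quoted ballpark figures for llama.cpp k-quants, not numbers measured from this repo's files, and `estimate_gguf_gb` is a hypothetical helper name:

```python
# Approximate average bits per weight for llama.cpp k-quants.
# ASSUMPTION: ballpark figures, not measured from this repo's GGUFs.
APPROX_BITS_PER_WEIGHT = {
    "Q3_K_S": 3.5,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
}


def estimate_gguf_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF size in GB (10^9 bytes) from parameter count and quant."""
    total_bits = params_billion * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"27B @ {quant}: ~{estimate_gguf_gb(27, quant):.1f} GB")
```

The estimates land close to the ~12 / ~17 / ~20 GB figures quoted in this card; runtime memory is higher because of the KV cache and compute buffers.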
## What's here
| File | Use |
|---|---|
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF — used by `make build` / `ollama create` for **local** builds |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does **not** read `Modelfile` — see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts — no-op on the current qwen35-stamped bundle. |
| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename `FoolDev/Thanatos-27B`) *before* the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35 — fresh pulls don't need it. |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
| `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
| `Makefile` | Convenience wrapper — `make help` lists targets |
| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |
For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
downloads the smaller ~12 GB Q3_K_S quant from
`unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads directly) and
creates a local `thanatos-27b` Ollama tag. That quant is not
redistributed via this repo. For other quants use `make build QUANT=...`. The
local-build path applies this repo's `Modelfile`; the `hf.co/...`
path applies the root-level `template`, `system`, and `params`
files (kept in sync with the `Modelfile`).
If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
## Architecture
<p align="left">
<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
</p>
- Qwen 3.6 dense, 27B parameters, 64 transformer layers
- **Hybrid attention stack**: 16 repeats of `[3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]`
- Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
- Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
- Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the **base architecture** via a separate
`mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
current loader compatibility.
- Multi-token prediction (MTP) head trained for speculative decoding —
present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
**Not usable via llama.cpp / Ollama today**: the GGUF converter
(`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
`qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
merged 2026-05-16) currently covers other architectures only; this
repo is tracking that PR's follow-up work for when qwen35 / qwen35moe
consumer support lands. (Earlier README versions claimed MTP was
available without this caveat — confirmed empirically via
`gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
2026-05-19.)
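The layer arithmetic in the bullets above can be checked by hand, and it also gives a back-of-envelope KV cache size, since only the Gated Attention layers keep a cache that grows with context. This sketch assumes an f16 cache and ignores the fixed-size Gated DeltaNet recurrent state and all compute buffers; `kv_cache_bytes` is a hypothetical helper, not part of this repo:

```python
REPEATS = 16              # repeats of the hybrid block
DELTANET_PER_REPEAT = 3   # Gated DeltaNet layers per repeat
ATTENTION_PER_REPEAT = 1  # Gated Attention layers per repeat
KV_HEADS = 4
HEAD_DIM = 256
HIDDEN = 5120
FFN = 17408

total_layers = REPEATS * (DELTANET_PER_REPEAT + ATTENTION_PER_REPEAT)
attention_layers = REPEATS * ATTENTION_PER_REPEAT


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = 2) -> int:
    """K and V cache for the softmax-attention layers only.

    ASSUMPTION: f16 cache (2 bytes per value) unless overridden.
    """
    per_token = 2 * attention_layers * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens


print(total_layers)                    # 64
print(round(FFN / HIDDEN, 2))          # 3.4
print(kv_cache_bytes(8192) / 2**20)    # 512.0 (MiB)
print(kv_cache_bytes(262144) / 2**30)  # 16.0 (GiB)
```

Under these assumptions the cache costs 64 KiB per token, which is why trimming `num_ctx` is the first lever on memory-constrained machines.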
**The bundled GGUF declares `general.architecture: 'qwen35'`** — not a
workaround for an unimplemented `qwen36` arch, but the canonical
upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention
family. The naming convergence runs through three layers of the
stack:
- **Qwen's own HF configs.** `Qwen/Qwen3.6-27B/config.json` declares
`"model_type": "qwen3_5"` and
`"architectures": ["Qwen3_5ForConditionalGeneration"]`. The MoE
sibling `Qwen/Qwen3.6-35B-A3B` declares `"qwen3_5_moe"` /
`Qwen3_5MoeForConditionalGeneration`. No `Qwen3_6` arch class
exists in `transformers`; Qwen reuses the 3.5 class names.
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
`Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
`Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth
GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`,
`unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
explicit `case 64: type = LLM_TYPE_27B` branch for this model;
`qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
Janus-35B sibling base. The arch entries were written to load
Qwen 3.6 weights, not just Qwen 3.5.
There is no PR or tracking issue for a `qwen36` arch entry in
`ggml-org/llama.cpp` or `ollama/ollama` because none is needed —
`qwen35` already loads the model the upstream code path was
designed to load.
`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
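To see which stamp a local file carries without installing anything, the GGUF header can be read with the standard library. This is a minimal sketch that assumes the little-endian GGUF v2/v3 layout (magic, version, tensor count, KV count, then key/value pairs); `read_architecture` is a hypothetical helper, and the `gguf` package's `GGUFReader` remains the more robust route:

```python
import struct

# Fixed-size scalar formats keyed by GGUF metadata value-type id.
_SCALAR_FORMATS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read one fixed-size value described by a struct format."""
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]


def _read_string(stream):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    return stream.read(length).decode("utf-8")


def _read_value(stream, value_type):
    if value_type == _STRING:
        return _read_string(stream)
    if value_type == _ARRAY:
        elem_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, elem_type) for _ in range(count)]
    return _read(stream, _SCALAR_FORMATS[value_type])


def read_architecture(stream):
    """Return general.architecture from a GGUF byte stream, or None."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(stream, "<I")             # format version (ASSUMPTION: 2 or 3)
    _read(stream, "<Q")             # tensor count, unused here
    kv_count = _read(stream, "<Q")
    for _ in range(kv_count):
        key = _read_string(stream)
        value = _read_value(stream, _read(stream, "<I"))
        if key == "general.architecture":
            return value
    return None
```

Usage would look like `read_architecture(open("Thanatos-27B.Q4_K_M.gguf", "rb"))`, which should report `qwen35` for the current bundle. The function returns as soon as it finds the key, so it normally reads only a small part of the header.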
### History
The bundle's `general.architecture` stamp has now flipped eight
times — four landings on qwen36 and four on qwen35 — each time
after weighing the friction-vs-honesty tradeoff anew. The saga
is resolved on the upstream-canonical `qwen35` side:
- **v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC):** initial qwen35
→ qwen36 stamp, on the theory that qwen35 was a loader stand-in
awaiting proper Qwen 3.6 support. Upstream audit later showed
that theory was mistaken (see above).
- **2026-05-19 afternoon (`964e418`):** flipped back to qwen35
after daily friction outweighed version-specificity for that
iteration; doc workaround narrative collapsed (`83022eb`).
- **2026-05-19 evening (`07fa120`):** brief re-flip to qwen36
during a fresh-pull integration test on Strix Halo.
- **2026-05-19 evening (`72259c1`, ~1 hour later):** reverted to
qwen35 again because the live friction was worse than the doc
prose suggested.
- **2026-05-19 evening (`973d7ef`):** flipped to qwen36 one more
time, after the upstream-evidence audit had been shipped and
the friction was a known quantity. Project owner wanted to
test the friction tradeoff in practice with the audit's
conclusion staring them in the face.
- **2026-05-19 evening (`978798f`):** flipped back to qwen35
after seven sequential fresh-pull → heal-hf cycles on the
Strix Halo box made the friction concretely-experienced
rather than hypothetical. Each cycle worked (the heal flow
is solid) — and each cycle was an unnecessary obstacle for
users who just want `ollama run` to work first try. The
audit (`a4d3b6e`) called the canonical stamp correctly and
the practical friction outweighed the version-specificity
payoff.
- **2026-05-20 midday (`ae67ed1`):** brief re-flip to qwen36
the next morning to re-test the friction in a fresh session.
- **2026-05-20 midday (`e03e10e`, 8 minutes later):** flipped
back to qwen35. Same conclusion as the prior round trip —
friction outweighs version-specificity. **This is the
current state.**
Tensor data was byte-identical across all stamps; only the
`general.architecture` KV (and namespaced KV keys) flipped.
See the [CHANGELOG](CHANGELOG.md) entries for each flip's
rationale.
### Rebadge utility
`scripts/rename_arch.py` is the generic GGUF arch renamer
(metadata only, tensors byte-identical), kept in the repo for
the legacy qwen36 → qwen35 in-store rebadge (used by `make
heal-hf` and `make load-bundle`) and any future arch flip:
```bash
# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
--from-arch qwen36 --to-arch qwen35 \
Thanatos-27B.Q4_K_M.qwen36.gguf \
Thanatos-27B.Q4_K_M.gguf
```
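Conceptually, a rebadge touches only the arch stamp and the keys namespaced under it. The sketch below shows that mapping on a plain dict; it is an illustration, not the logic of `scripts/rename_arch.py` (whose internals aren't shown in this README), and the example keys are chosen for illustration:

```python
def rebadge_metadata(metadata: dict, from_arch: str, to_arch: str) -> dict:
    """Return a copy with the arch stamp and arch-namespaced keys renamed."""
    prefix = from_arch + "."
    renamed = {}
    for key, value in metadata.items():
        if key == "general.architecture" and value == from_arch:
            value = to_arch
        elif key.startswith(prefix):
            key = to_arch + "." + key[len(prefix):]
        renamed[key] = value
    return renamed


# Illustrative metadata only; a real GGUF carries many more keys.
example = {
    "general.architecture": "qwen36",
    "qwen36.block_count": 64,
    "qwen36.context_length": 262144,
    "tokenizer.ggml.model": "gpt2",
}
print(rebadge_metadata(example, "qwen36", "qwen35"))
```

Keys outside the arch namespace, and all tensor data, are left alone, which is what "tensors byte-identical" means in practice.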
## Quick start
### Ollama
Three paths:
```bash
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
# root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen35-stamped
# B. Build a local `thanatos-27b` tag from THIS repo's bundle
# (LFS smudge if needed, then `ollama create`). Useful if you
# want a bare local tag rather than the `hf.co/...` path:
make load-bundle # creates local tag thanatos-27b
ollama run thanatos-27b
# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
# and build locally. Loads on every current llama.cpp / Ollama.
make build # Q4_K_M -> thanatos-27b
make build QUANT=Q3_K_S # 12 GB smaller quant
make build QUANT=Q5_K_M # 20 GB higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
ollama run thanatos-27b
```
Under the hood, `make build` calls `scripts/build.sh`, which downloads the
GGUF if missing (set `GGUF_PATH` to point at one you already have) and
runs `ollama create` with the matching `Modelfile`.
If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.
Confirm everything works:
```bash
make smoke # checks server, model, round-trip, no token leakage
make smoke-tools # adds an end-to-end tool-call round-trip (~10s extra)
make bench # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-compat
```
### Local apps
| App | How to load this model |
|---|---|
| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
| **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
For the full Vision (image input) loader matrix, see [Vision](#vision).
Tool calling currently works in **Ollama** (via the root-level
`template` file when pulling from `hf.co/...`, or via the `Modelfile`
TEMPLATE when building locally) and **llama.cpp / llama-cpp-python**
(via the GGUF's embedded jinja). Other apps' tool-calling support
depends on whether they read the embedded template or require an
external schema.
### Inference (OpenAI-compatible)
```bash
curl -s http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "thanatos-27b",
"messages": [
{"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
{"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
],
"temperature": 0.6
}' | jq -r '.choices[0].message.content'
```
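The same request can be made from Python with only the standard library. This is a minimal sketch that assumes an Ollama server on the default `localhost:11434` and a local `thanatos-27b` tag; `build_chat_request` and `send` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_request(prompt, model="thanatos-27b", temperature=0.6,
                       base_url="http://localhost:11434"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are Thanatos, a precise reasoning assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("Explain the Burrows-Wheeler transform in 200 words."))` mirrors the `curl` call above.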
### Recommended sampling
| Use | temp | top_p | top_k | repeat_penalty |
|---|---:|---:|---:|---:|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |
Lower the temperature (0.4-0.6) and raise `repeat_penalty` to 1.08 if the model loops inside `<think>` tags.
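One way to carry these presets in client code is a small lookup that yields an `options` dict for Ollama's `/api/chat`. A sketch under stated assumptions: the preset names and `sampling_options` are hypothetical, and the loop-fix temperature of 0.4 is simply the low end of the suggested range:

```python
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                  "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40,
                 "repeat_penalty": 1.02},
}

# ASSUMPTION: an arbitrary pick from the 0.4-0.6 range suggested above.
LOOP_FIX_TEMPERATURE = 0.4
LOOP_FIX_REPEAT_PENALTY = 1.08


def sampling_options(use: str, looping: bool = False) -> dict:
    """Return a fresh options dict for the given use case."""
    options = dict(SAMPLING_PRESETS[use])
    if looping:
        options["temperature"] = LOOP_FIX_TEMPERATURE
        options["repeat_penalty"] = LOOP_FIX_REPEAT_PENALTY
    return options


print(sampling_options("reasoning"))
print(sampling_options("creative", looping=True))
```

The returned dict can be passed as the `options` field of an Ollama chat request.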
### System prompt
The Modelfile bakes this in. Override per-request via the `system` role
in your client:
```text
You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.
```
## Vision
The Qwen 3.6 base supports image (and video) input via a separate
`mmproj` projector. The full multimodal stack is:
```
Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
mmproj-F16.gguf (~927 MB, the vision projector)
```
Both files are at
[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF).
This repo intentionally does not redistribute either.
### Loader compatibility — the honest table
| Loader | Text | Vision (mmproj) | Notes |
|---|---|---|---|
| **llama.cpp** (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the `qwen35`/`qwen35moe` arch entries. |
| **llama-cpp-python** | ✅ | ✅ | See `examples/llama_cpp_vision.py`. |
| **Ollama 0.24** | ✅ | ❌ | Text inference works: Ollama's Go engine has the `qwen35` / `qwen35moe` arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. `ollama create` accepts a dual-`FROM` (text + mmproj) and `ollama show` reports `vision` capability — but the **first inference request** fails with `error loading model architecture: unknown model architecture: 'qwen35'` (or `'qwen35moe'`), and once mmproj is attached this blocks text inference too. See [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898). |
| **LM Studio** | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |
### Vision via llama.cpp
Three flavors, in order of build-time effort:
```bash
# A. HTTP via llama-server (always built — the easiest path).
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-F16.gguf \
--host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.
# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
# so a selective `cmake --build build --target llama-cli ...` won't
# produce it — a plain `cmake --build build` will. If yours didn't,
# run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-F16.gguf \
--image photo.jpg \
-p "Describe this image."
# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj /path/to/mmproj-F16.gguf \
--image /path/to/photo.jpg \
--prompt "What is in this image?"
```
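For option A, the `image_url` content block can be built like this. A minimal sketch: `image_message` and `vision_payload` are hypothetical helpers, and the default `max_tokens` of 1024 is just a value above the 500-token floor suggested in the comments above:

```python
import base64
import mimetypes


def image_message(prompt: str, image_bytes: bytes,
                  filename: str = "photo.jpg") -> dict:
    """Build an OpenAI-style user message with text plus an inline image."""
    mime = mimetypes.guess_type(filename)[0] or "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


def vision_payload(prompt: str, image_bytes: bytes,
                   max_tokens: int = 1024) -> dict:
    """Request body for POST /v1/chat/completions on llama-server."""
    return {
        "messages": [image_message(prompt, image_bytes)],
        "max_tokens": max_tokens,
    }
```

Serialize the payload with `json.dumps` and POST it to `http://127.0.0.1:8765/v1/chat/completions` to match the `llama-server` invocation above.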
Until the Ollama upstream issue is fixed, treat Ollama as **text-only**
for this model.
## Hardware requirements
The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.
| Hardware | Status |
|---|---|
| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |
Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
this hardware.
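The tok/s figure comes straight from Ollama's response metadata: `eval_count` is the number of generated tokens and `eval_duration` is reported in nanoseconds. A minimal sketch of the calculation follows; how `scripts/bench.sh` aggregates across its three prompts isn't shown here, so the token-weighted mix below is an assumption:

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput for one Ollama response (eval_duration is in ns)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def mix_tokens_per_second(responses: list) -> float:
    """Token-weighted throughput over several responses.

    ASSUMPTION: total tokens over total time; bench.sh may average differently.
    """
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return total_tokens * 1e9 / total_ns


# Made-up sample metadata for illustration, not a measured run.
sample = {"eval_count": 123, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Because the duration excludes model load and prompt processing, the figure reflects steady-state generation speed rather than wall-clock latency.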
## Chat template
Standard Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` role markers
and `<think>...</think>` blocks for reasoning traces. The Qwen 3.6 jinja
template is embedded in the GGUF metadata; loaders that read GGUF chat
templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the
plain-conversation formatting automatically.
Ollama is the exception: its conversion of the embedded jinja loses the
`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
Two paths fix this, depending on how you pull the model:
- **`ollama run hf.co/FoolDev/Thanatos-27B`** — HF's Ollama bridge applies
the root-level `template` / `system` / `params` files in this repo
(the bridge does **not** read `Modelfile`).
- **`make build` / `ollama create thanatos-27b -f Modelfile`** — uses the
`Modelfile`'s `TEMPLATE` block.
Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
`/api/chat` and `/v1/chat/completions`. The two configurations are
kept in sync: edit them together if you change one.
#### Plain conversation
```text
<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
```
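For debugging, the plain-conversation layout above is simple enough to render by hand. This sketch covers only the basic role markers; the real embedded jinja template also handles tools and reasoning, and `render_chatml` is a hypothetical helper:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages in the plain ChatML layout shown above."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are Thanatos."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
]))
```

Comparing this output against what a loader actually sends is a quick way to spot template-token leakage.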
#### With reasoning trace
```text
<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>
Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>
```
Most clients (Open WebUI, LibreChat, etc.) hide the `<think>` block by
default and surface only the visible answer. Strip it manually with
`re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)` if your
client doesn't.
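Wrapped as helpers, the regex above might look like the following. A small sketch: `strip_think` and `split_think` are hypothetical names, and the second function additionally keeps the trace for logging:

```python
import re

# Same pattern as above, with a capture group so the trace can be kept.
_THINK = re.compile(r"<think>(.*?)</think>\s*", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove every <think>...</think> block and trailing whitespace."""
    return _THINK.sub("", content)


def split_think(content: str) -> tuple:
    """Return (reasoning, visible_answer)."""
    reasoning = "\n".join(block.strip() for block in _THINK.findall(content))
    return reasoning, _THINK.sub("", content)


raw = "<think>\nsplit, sort halves, merge\n</think>\nMergesort runs in O(n log n)."
print(strip_think(raw))
```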
#### Tool / function calling
The wire format depends on the loader. The two shapes below are both valid
Qwen 3.6 outputs; the model adapts to whichever one the system prompt prescribes.
**Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
extractor parses into a structured `tool_calls` array. After
`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
accept a `tools` array.
```text
<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>
```
**Embedded-jinja path** (llama.cpp, llama-cpp-python, LM Studio). The
Qwen 3.6 native chat template baked into the GGUF instructs the model
to emit the more verbose XML form it was trained on:
```text
<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
```
Use whichever your client expects; don't mix parsers.
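If your client does not parse tool calls for you, each shape needs its own parser. The two functions below are a minimal sketch of that, kept separate on purpose; the names are hypothetical and the regexes assume well-formed output like the examples above:

```python
import json
import re

_TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
_FUNCTION = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
_PARAMETER = re.compile(r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_json_tool_calls(text: str) -> list:
    """JSON-in-XML form (Ollama path): each body is one JSON object."""
    return [json.loads(body) for body in _TOOL_CALL.findall(text)]


def parse_xml_tool_calls(text: str) -> list:
    """Verbose XML form (embedded-jinja path).

    Parameter values come back as strings; this form carries no type info.
    """
    calls = []
    for body in _TOOL_CALL.findall(text):
        function = _FUNCTION.search(body)
        if function is None:
            continue  # not in the XML shape; skip rather than guess
        arguments = dict(_PARAMETER.findall(function.group(2)))
        calls.append({"name": function.group(1), "arguments": arguments})
    return calls
```

Both return a list of `{"name": ..., "arguments": {...}}` dicts, so downstream dispatch code can stay the same whichever loader you use.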
End-to-end exercise (Ollama path):
```bash
python examples/ollama_chat.py # section 3 runs a real round-trip
```
## Known limitations
- **Slower per token than the 35B-A3B sibling.** The dense 27B's edge is quality per parameter, because every parameter contributes to every token; if you optimize for tokens per second, the MoE wins.
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
- **No formal evaluation in this card.** Numbers above are estimates.
## Related models
| Model | Notes |
|---|---|
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
## Credits
- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)
License inherited from upstream: Apache-2.0.