How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
Use Docker
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
Thanatos-27B

Dense Reasoning. Friendlier Footprint. Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.

Architecture: Qwen 3.6 27B (Dense) | Parameters: 27B | Teacher: Claude Opus 4.7 | Type: Distilled LLM

A personal sibling to FoolDev/Janus-35B. Same teacher (Claude Opus 4.7), same dataset family, but built on the dense Qwen/Qwen3.6-27B base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.

TL;DR

One-liner via Hugging Face (pulls a GGUF + this repo's root-level template / system / params files, including the tool-calling template — HF's Ollama bridge ingests those three files, not Modelfile):

ollama run hf.co/FoolDev/Thanatos-27B           # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama

If you pulled the bundle during any of the qwen36 windows on the pre-rename FoolDev/Thanatos-27B repo (2026-05-19/20) and still have a qwen36-stamped blob in your local Ollama store, make heal-hf rebadges it in place. Fresh pulls go straight through.

For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), make build QUANT=... is the simplest path. See Quick start below for the full matrix.

For image input use llama.cpp directly — Ollama vision is broken for this architecture upstream (see Vision).

Why a 27B variant?

The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but memory-hungry at load time — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.

The 27B is dense: every parameter participates in every forward pass. It's slower per token than 35B-A3B: on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s under ROCm (~12 tok/s under Vulkan), versus ~27 tok/s for the MoE 35B at ~Q4 (make bench, 3-prompt mix). In exchange, the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.

| | Thanatos-27B (this) | Janus-35B |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
| Active params per token | 27 B | ~3 B |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
| Q3_K_S GGUF size | ~12 GB (build locally via make build QUANT=Q3_K_S) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream (see below) | Broken upstream |
| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
| Max context | 262 144 | 262 144 |
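
As a rough cross-check on the sizes quoted in this card, the sketch below derives the average storage cost per parameter that each quant implies. It assumes the quoted sizes are decimal gigabytes and that the parameter count is a round 27e9; both are simplifications, so read the results as approximate.

```python
# Effective bits per weight implied by the GGUF sizes quoted in this card.
# Assumptions: sizes are decimal GB, parameter count is a round 27e9.
PARAMS = 27e9

QUANT_SIZES_GB = {
    "Q3_K_S": 12,
    "Q4_K_M": 17,
    "Q5_K_M": 20,
}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average storage cost per parameter, in bits."""
    return size_gb * 1e9 * 8 / params


for quant, size_gb in QUANT_SIZES_GB.items():
    print(f"{quant}: ~{bits_per_weight(size_gb):.1f} bits/weight")
```

The same function run in reverse (bits per weight times parameters) gives a quick first guess at whether a quant will fit before downloading it; context and runtime overhead come on top.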

What's here

| File | Use |
|---|---|
| banner.svg / banner.png | Repo header, Tokyo Night themed |
| dense-flow.svg / dense-flow.png | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
| Modelfile | Ollama wrapper around the bundled Qwen 3.6 27B GGUF; used by make build / ollama create for local builds |
| template, system, params | Used by HF's Ollama bridge when users ollama run hf.co/FoolDev/Thanatos-27B directly (the bridge does not read Modelfile; see HF Ollama docs). Mirrors the Modelfile's template / system prompt / sampling params. |
| examples/ | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
| scripts/build.sh | Pulls a qwen35-stamped GGUF from unsloth/Qwen3.6-27B-GGUF and runs ollama create (loads on today's llama.cpp / Ollama; see make build) |
| scripts/load_bundle.sh | One-shot path from this repo's bundle → loadable local Ollama tag (smudges LFS pointer via hf download if needed, runs ollama create; see make load-bundle). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts; a no-op on the current qwen35-stamped bundle. |
| scripts/heal_hf_pull.sh | Legacy recovery for users who pulled hf.co/FoolDev/Thanatos-27B (or the pre-rename FoolDev/Thanatos-27B) before the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See make heal-hf. Idempotent and a no-op on tags already on qwen35; fresh pulls don't need it. |
| scripts/smoke_test.sh | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With TOOLS_TEST=1, also exercises an end-to-end tool-call round-trip and checks the response shape |
| scripts/bench.sh | Measures real tok/s using Ollama's eval_count / eval_duration metadata over a 3-prompt mix (run make bench) |
| scripts/fetch_vision.sh | Pulls the vision projector (mmproj-F16.gguf) for llama.cpp (Ollama vision is broken upstream; see Vision). Renamed from fetch_mmproj.sh because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| scripts/check.sh | Local lint: bash -n, pyflakes, py_compile, footgun-grep, plus Modelfile-vs-bridge-files sync check |
| scripts/check_bridge_sync.py | Verifies the Modelfile TEMPLATE / SYSTEM / PARAMETER directives stay in sync with the root-level template / system / params files. Run as part of make check; called from the pre-commit hook. |
| scripts/verify_arch.py | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as make verify-arch. Handles both qwen35- and qwen36-stamped bundles; exits non-zero if any value mismatches. Not part of make check because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
| scripts/install-hooks.sh | Installs check.sh as a git pre-commit hook |
| Makefile | Convenience wrapper; make help lists targets |
| LICENSE, CITATION.cff | Apache-2.0 license and citation metadata |
| CHANGELOG.md | Versioned tooling/docs changes |
| README.md | This file |

For 16 GB GPUs / unified-memory laptops, make build QUANT=Q3_K_S downloads the smaller ~12 GB Q3_K_S quant from unsloth/Qwen3.6-27B-GGUF (qwen35-stamped, loads directly) and creates a local thanatos-27b Ollama tag. Does not redistribute via this repo. For other quants use make build QUANT=.... The local-build path applies this repo's Modelfile; the hf.co/... path applies the root-level template, system, and params files (kept in sync with the Modelfile).

If you want the safetensors for transformers, fetch them from Qwen/Qwen3.6-27B.

Architecture

animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn

  • Qwen 3.6 dense, 27B parameters, 64 transformer layers
  • Hybrid attention stack: 16 repeats of [3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]
    • Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
    • Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
  • Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
  • Vocab 248,320 (shared with 35B-A3B sibling)
  • 262 144 native context, extensible to ~1 M with YaRN
  • Vision + video supported by the base architecture via a separate mmproj projector (not redistributed here; pull mmproj-F16.gguf from unsloth/Qwen3.6-27B-GGUF). See Vision below for current loader compatibility.
  • Multi-token prediction (MTP) head trained for speculative decoding — present in the upstream Qwen/Qwen3.6-27B safetensors and usable via vLLM (qwen3_next_mtp) or SGLang (--speculative-algo NEXTN). Not usable via llama.cpp / Ollama today: the GGUF converter (convert_hf_to_gguf.py) explicitly skips MTP tensors for the qwen35 / qwen35moe arch family ("MTP tensors are not used at inference yet"), so the bundled GGUF and the unsloth GGUFs ship with 851 tensors and no MTP head. llama.cpp's MTP support (PR #22673, merged 2026-05-16) currently covers other architectures only; tracking that PR's follow-up work for when qwen35 / qwen35moe consumer support lands. (Earlier README versions claimed MTP was available without this caveat — confirmed empirically via gguf.GGUFReader on both this bundle and unsloth/Qwen3.6-27B-GGUF, 2026-05-19.)
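
The figures in the list above can be tied together with a little arithmetic. The sketch below rebuilds the 64-layer stack from the repeat pattern and estimates the KV cache for the softmax-attention layers. It assumes an f16 cache (2 bytes per element) and that the Gated DeltaNet layers keep a fixed-size recurrent state that does not grow with context; neither assumption comes from this card, so treat the output as a ballpark.

```python
# Rebuild the layer stack from the repeat pattern described in this card.
REPEATS = 16
BLOCK = ["deltanet", "deltanet", "deltanet", "attention"]
layers = BLOCK * REPEATS

HIDDEN, FFN = 5120, 17408
ATTN_HEAD_DIM, KV_HEADS = 256, 4
ROPE_FACTOR = 0.25


def kv_cache_bytes(context: int, bytes_per_elem: int = 2) -> int:
    """K and V storage for the softmax-attention layers only.

    Assumptions (not from the card): f16 cache, and DeltaNet layers hold a
    constant-size state, so they are excluded from this estimate.
    """
    attn_layers = layers.count("attention")
    return 2 * attn_layers * KV_HEADS * ATTN_HEAD_DIM * context * bytes_per_elem


print(len(layers), layers.count("deltanet"), layers.count("attention"))  # 64 48 16
print(FFN / HIDDEN)                       # FFN-to-hidden ratio
print(int(ATTN_HEAD_DIM * ROPE_FACTOR))   # rotary dimensions per attention head
print(kv_cache_bytes(8192) / 2**30)       # GiB at 8K context
print(kv_cache_bytes(262144) / 2**30)     # GiB at the full native context
```

Under those assumptions only a quarter of the layers carry a cache that grows with context, which is why long contexts are cheaper here than on a pure-attention model of the same depth.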

The bundled GGUF declares general.architecture: 'qwen35' — not a workaround for an unimplemented qwen36 arch, but the canonical upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention family. The naming convergence runs through three layers of the stack:

  • Qwen's own HF configs. Qwen/Qwen3.6-27B/config.json declares "model_type": "qwen3_5" and "architectures": ["Qwen3_5ForConditionalGeneration"]. The MoE sibling Qwen/Qwen3.6-35B-A3B declares "qwen3_5_moe" / Qwen3_5MoeForConditionalGeneration. No Qwen3_6 arch class exists in transformers; Qwen reuses the 3.5 class names.
  • llama.cpp's converter. convert_hf_to_gguf.py registers Qwen3_5ForCausalLM → MODEL_ARCH.QWEN35 and Qwen3_5MoeForCausalLM → MODEL_ARCH.QWEN35MOE. The unsloth GGUFs this repo pulls from (unsloth/Qwen3.6-27B-GGUF, unsloth/Qwen3.6-35B-A3B-GGUF) inherit those stamps.
  • llama.cpp's model code. src/models/qwen35.cpp has an explicit case 64: type = LLM_TYPE_27B branch for this model; qwen35moe.cpp has case 40: type = LLM_TYPE_35B_A3B for the Janus-35B sibling base. The arch entries were written to load Qwen 3.6 weights, not just Qwen 3.5.

There is no PR or tracking issue for a qwen36 arch entry in ggml-org/llama.cpp or ollama/ollama because none is needed — qwen35 already loads the model the upstream code path was designed to load.

ollama run hf.co/FoolDev/Thanatos-27B and llama-server -m Thanatos-27B.Q4_K_M.gguf both load directly on current stock loaders.

History

The bundle's general.architecture stamp has now flipped eight times — four landings on qwen36 and four on qwen35 — each time after weighing the friction-vs-honesty tradeoff anew. The saga is resolved on the upstream-canonical qwen35 side:

  • v0.6.0-era (e1f78fa, 2026-05-19 14:38 UTC): initial qwen35 → qwen36 stamp, on the theory that qwen35 was a loader stand-in awaiting proper Qwen 3.6 support. Upstream audit later showed that theory was mistaken (see above).
  • 2026-05-19 afternoon (964e418): flipped back to qwen35 after daily friction outweighed version-specificity for that iteration; doc workaround narrative collapsed (83022eb).
  • 2026-05-19 evening (07fa120): brief re-flip to qwen36 during a fresh-pull integration test on Strix Halo.
  • 2026-05-19 evening (72259c1, ~1 hour later): reverted to qwen35 again because the live friction was worse than the doc prose suggested.
  • 2026-05-19 evening (973d7ef): flipped to qwen36 one more time, after the upstream-evidence audit had been shipped and the friction was a known quantity. Project owner wanted to test the friction tradeoff in practice with the audit's conclusion staring them in the face.
  • 2026-05-19 evening (978798f): flipped back to qwen35 after seven sequential fresh-pull → heal-hf cycles on the Strix Halo box made the friction concretely-experienced rather than hypothetical. Each cycle worked (the heal flow is solid) — and each cycle was an unnecessary obstacle for users who just want ollama run to work first try. The audit (a4d3b6e) called the canonical stamp correctly and the practical friction outweighed the version-specificity payoff.
  • 2026-05-20 midday (ae67ed1): brief re-flip to qwen36 the next day to re-test the friction in a fresh session.
  • 2026-05-20 midday (e03e10e, 8 minutes later): flipped back to qwen35. Same conclusion as the prior round trip — friction outweighs version-specificity. This is the current state.

Tensor data was byte-identical across all stamps; only the general.architecture KV (and namespaced KV keys) flipped. See the CHANGELOG entries for each flip's rationale.

Rebadge utility

scripts/rename_arch.py is the generic GGUF arch renamer (metadata only, tensors byte-identical), kept in the repo for the legacy qwen36 → qwen35 in-store rebadge (used by make heal-hf and make load-bundle) and any future arch flip:

# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
    --from-arch qwen36 --to-arch qwen35 \
    Thanatos-27B.Q4_K_M.qwen36.gguf \
    Thanatos-27B.Q4_K_M.gguf
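
To illustrate what "metadata only" means here, the sketch below applies the same renaming rule to a plain dictionary: the general.architecture value changes and every key namespaced under the old arch moves to the new prefix. This is not the code in scripts/rename_arch.py, and the function name and sample metadata are hypothetical; rewriting a real file also needs a GGUF reader/writer.

```python
def rename_arch_keys(metadata: dict, from_arch: str, to_arch: str) -> dict:
    """Return a copy of `metadata` with the arch stamp and namespaced keys renamed.

    Idempotent: if the metadata is not stamped with `from_arch`, it is
    returned unchanged.
    """
    if metadata.get("general.architecture") != from_arch:
        return dict(metadata)

    prefix = from_arch + "."
    renamed = {}
    for key, value in metadata.items():
        if key.startswith(prefix):
            # e.g. "qwen36.block_count" -> "qwen35.block_count"
            key = to_arch + "." + key[len(prefix):]
        renamed[key] = value
    renamed["general.architecture"] = to_arch
    return renamed


# Hypothetical, heavily truncated metadata for illustration only.
legacy = {
    "general.architecture": "qwen36",
    "qwen36.block_count": 64,
    "qwen36.context_length": 262144,
    "tokenizer.ggml.model": "gpt2",
}
healed = rename_arch_keys(legacy, "qwen36", "qwen35")
print(healed)
```

Running the function a second time on the result leaves it untouched, which mirrors the no-op behaviour described for tags already on qwen35.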

Quick start

Ollama

Three paths:

# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
#    root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M, qwen35-stamped

# B. Build a local `thanatos-27b` tag from THIS repo's bundle
#    (LFS smudge if needed, then `ollama create`). Useful if you
#    want a bare local tag rather than the `hf.co/...` path:
make load-bundle                                 # creates local tag thanatos-27b
ollama run thanatos-27b

# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
#    and build locally. Loads on every current llama.cpp / Ollama.
make build                                              # Q4_K_M  -> thanatos-27b
make build QUANT=Q3_K_S                                 # ~12 GB, smaller quant
make build QUANT=Q5_K_M                                 # ~20 GB, higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf   # skip download
ollama run thanatos-27b

Under the hood, make build calls scripts/build.sh, which downloads the GGUF if missing (set GGUF_PATH to point at one you already have) and runs ollama create with the matching Modelfile.

If you'd rather do it by hand: edit the FROM line in Modelfile and run ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b.

Confirm everything works:

make smoke                          # checks server, model, round-trip, no token leakage
make smoke-tools                    # adds an end-to-end tool-call round-trip (~10s extra)
make bench                          # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-compat

Local apps

| App | How to load this model |
|---|---|
| Ollama | ollama run hf.co/FoolDev/Thanatos-27B (default Q4_K_M). Pulls the GGUF + the root-level template / system / params files in one step (HF's Ollama bridge ingests these three files; it does not read Modelfile). For other quants, make build QUANT=Q3_K_S downloads from unsloth and creates a local Ollama tag using the Modelfile, which is kept in sync with the bridge files. |
| LM Studio | Search → FoolDev/Thanatos-27B → pick Thanatos-27B.Q4_K_M.gguf. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the SYSTEM block in this repo's Modelfile. |
| Jan | Hub → "Import from Hugging Face" → FoolDev/Thanatos-27B. Same template behavior as LM Studio. |
| llama.cpp | hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir . then llama-server -m Thanatos-27B.Q4_K_M.gguf (or llama-cli, llama-mtmd-cli for vision via the upstream mmproj-F16.gguf). |
| llama-cpp-python | See examples/llama_cpp_quickstart.py (text) and examples/llama_cpp_vision.py (image input). |
| Open WebUI / KoboldCpp / text-generation-webui | Standard llama.cpp loader path: point at the GGUF, use the embedded chat template. |

For the full Vision (image input) loader matrix, see Vision. Tool calling currently works in Ollama (via the root-level template file when pulling from hf.co/..., or via the Modelfile TEMPLATE when building locally) and llama.cpp / llama-cpp-python (via the GGUF's embedded jinja). Other apps' tool-calling support depends on whether they read the embedded template or require an external schema.

Inference (OpenAI-compatible)

curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "thanatos-27b",
    "messages": [
      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
    ],
    "temperature": 0.6
  }' | jq -r '.choices[0].message.content'
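
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the function names are illustrative, and it assumes a local Ollama daemon on the default port with a thanatos-27b tag already built.

```python
import json
import urllib.request


def build_chat_request(prompt, model="thanatos-27b",
                       base_url="http://localhost:11434", temperature=0.6):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are Thanatos, a precise reasoning assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the visible answer text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the daemon running, chat("Explain the Burrows-Wheeler transform in 200 words.") returns the same text the jq filter extracts.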

Recommended sampling

| Use | temp | top_p | top_k | repeat_penalty |
|---|---|---|---|---|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |

Lower temperature (0.4-0.6) and bump repeat_penalty to 1.08 if it loops inside <think> tags.
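
One way to carry these presets in client code is a small lookup that also applies the anti-loop adjustment. This is a sketch: the preset names and the function are hypothetical, and the 0.5 temperature used for the looping case is simply a midpoint picked from the 0.4-0.6 range above. The key names match the sampling options Ollama accepts in a request's options object.

```python
PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def sampling_options(use: str = "reasoning", looping: bool = False) -> dict:
    """Return a copy of the preset, tightened if the model loops inside <think>."""
    options = dict(PRESETS[use])
    if looping:
        # Assumption: 0.5 is an arbitrary pick inside the suggested 0.4-0.6 band.
        options["temperature"] = min(options["temperature"], 0.5)
        options["repeat_penalty"] = 1.08
    return options


print(sampling_options("reasoning"))
print(sampling_options("creative", looping=True))
```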

System prompt

The Modelfile bakes this in. Override per-request via the system role in your client:

You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.

Vision

The Qwen 3.6 base supports image (and video) input via a separate mmproj projector. The full multimodal stack is:

Qwen3.6-27B-Q4_K_M.gguf   (~17 GB, the text decoder)
mmproj-F16.gguf           (~927 MB, the vision projector)

Both files are at unsloth/Qwen3.6-27B-GGUF. This repo bundles a Q4_K_M text GGUF (Thanatos-27B.Q4_K_M.gguf) but intentionally does not redistribute the mmproj projector.

Loader compatibility — the honest table

| Loader | Text | Vision (mmproj) | Notes |
|---|---|---|---|
| llama.cpp (llama-mtmd-cli, llama-server --mmproj) | ✅ | ✅ | Reference path. Upstream has the qwen35/qwen35moe arch entries. |
| llama-cpp-python | ✅ | ✅ | See examples/llama_cpp_vision.py. |
| Ollama 0.24 | ✅ | ❌ | Text inference works: Ollama's Go engine has the qwen35 / qwen35moe arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. ollama create accepts a dual-FROM (text + mmproj) and ollama show reports vision capability, but the first inference request fails with error loading model architecture: unknown model architecture: 'qwen35' (or 'qwen35moe'), and once mmproj is attached this blocks text inference too. See ollama/ollama#15898. |
| LM Studio | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |

Vision via llama.cpp

Three flavors, in order of build-time effort:

# A. HTTP via llama-server (always built — the easiest path).
#    Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
#    on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.

# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
#    so a selective `cmake --build build --target llama-cli ...` won't
#    produce it — a plain `cmake --build build` will. If yours didn't,
#    run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image photo.jpg \
  -p "Describe this image."

# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-F16.gguf \
  --image /path/to/photo.jpg \
  --prompt "What is in this image?"
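
For path A, the request body can be assembled in Python as sketched below. The function names are illustrative, the host and port match the llama-server command above, and the model field is a placeholder label that you may need to adjust for your server. The max_tokens default of 1024 is an arbitrary value chosen to stay above the 500-token floor mentioned earlier.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes, prompt, mime="image/jpeg",
                         base_url="http://127.0.0.1:8765", max_tokens=1024):
    """Build (but do not send) a chat completion request with one inline image."""
    data_uri = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": "thanatos-27b",  # placeholder label (assumption)
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_image(path, prompt="Describe this image."):
    """Send one image and return (reasoning trace, visible answer)."""
    with open(path, "rb") as fh:
        request = build_vision_request(fh.read(), prompt)
    with urllib.request.urlopen(request) as resp:
        message = json.load(resp)["choices"][0]["message"]
    return message.get("reasoning_content", ""), message["content"]
```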

Until the Ollama upstream issue is fixed, treat Ollama as text-only for this model.

Hardware requirements

The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.

| Hardware | Status |
|---|---|
| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. make build QUANT=Q3_K_S (~12 GB) and trim num_ctx for headroom. |

Most numbers in this table are estimates from comparable models; the gradient is right but the absolute values will move ±20% with prompt shape, KV cache type, and parallel-request count. Measure your own machine with make bench (3-prompt mix, reports tok/s from Ollama's eval_count / eval_duration so it's not stopwatch-noisy). Reference data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan: ~12.3 tok/s at Q3_K_S and ~9.3 tok/s at Q4_K_M (3-prompt mix, steady across short / medium / long prompts), sitting between CPU-only and a 24 GB discrete card as expected. An earlier ROCm snapshot of the same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on this hardware.
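
The tok/s figure can be reproduced from any Ollama response, since eval_duration is reported in nanoseconds. The sketch below shows the arithmetic; it is not the code in scripts/bench.sh, the token-weighted averaging is one reasonable choice rather than the script's confirmed method, and the sample values are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed for one response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def mean_tokens_per_second(responses) -> float:
    """Token-weighted average: total generated tokens over total generation time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_seconds = sum(r["eval_duration"] for r in responses) / 1e9
    return total_tokens / total_seconds


# Made-up sample metadata standing in for three bench prompts.
samples = [
    {"eval_count": 123, "eval_duration": 10_000_000_000},
    {"eval_count": 246, "eval_duration": 20_000_000_000},
    {"eval_count": 492, "eval_duration": 40_000_000_000},
]
print(f"{mean_tokens_per_second(samples):.1f} tok/s")
```

Because the timing comes from the server's own counters, prompt processing and network latency are excluded from the result.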

Chat template

Standard Qwen 3.x ChatML with <|im_start|> / <|im_end|> role markers and <think>...</think> blocks for reasoning traces. The Qwen 3.6 jinja template is embedded in the GGUF metadata; loaders that read GGUF chat templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the plain-conversation formatting automatically.

Ollama is the exception: its conversion of the embedded jinja loses the .Tools / .ToolCalls blocks Ollama's capability detector requires. Two paths fix this, depending on how you pull the model:

  • ollama run hf.co/FoolDev/Thanatos-27B — HF's Ollama bridge applies the root-level template / system / params files in this repo (the bridge does not read Modelfile).
  • make build / ollama create thanatos-27b -f Modelfile — uses the Modelfile's TEMPLATE block.

Both routes wire .Tools / .ToolCalls and tools work end-to-end on /api/chat and /v1/chat/completions. The two configurations are kept in sync: edit them together if you change one.

Plain conversation

<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
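
The plain-conversation layout above is simple enough to reproduce by hand when a client needs a raw prompt string. This is a minimal sketch of that layout only; the embedded jinja template also handles tools and reasoning blocks, which this hypothetical helper does not attempt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages into the plain ChatML layout shown above."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are Thanatos, a precise and capable assistant."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
])
print(prompt)
```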

With reasoning trace

<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>

Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>

Most clients (Open WebUI, LibreChat, etc.) hide the <think> block by default and surface only the visible answer. Strip it manually with re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL) if your client doesn't.
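
If you want to keep the reasoning instead of discarding it, the same regular expression can split a response into its two parts. The helper below is a sketch with a hypothetical name; it handles a closed <think> block and leaves anything else untouched.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)
THINK_BODY = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_reasoning(content: str):
    """Return (reasoning, answer) from a raw assistant message."""
    match = THINK_BODY.search(content)
    reasoning = match.group(1).strip() if match else ""
    answer = THINK_BLOCK.sub("", content).strip()
    return reasoning, answer


raw = "<think>\nsplit, sort, merge\n</think>\n\nMergesort runs in O(n log n)."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```

A response cut off by the token limit before the closing tag will not match, so in that case the whole text comes back as the answer.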

Tool / function calling

The wire format depends on the loader. Both are valid Qwen 3.6 outputs; the model adapts to whichever shape the system prompt prescribes.

Ollama path (this repo's Modelfile). The TEMPLATE directive prompts the model to emit JSON-in-XML, the form Ollama's tool-call extractor parses into a structured tool_calls array. After make build, ollama show thanatos-27b lists tools and thinking under Capabilities, and both /api/chat and /v1/chat/completions accept a tools array.

<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>

Embedded-jinja path (llama.cpp, llama-cpp-python, LM Studio). The Qwen 3.6 native chat template baked into the GGUF instructs the model to emit the more verbose XML form it was trained on:

<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>

Use whichever your client expects; don't mix parsers.
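
For clients that receive raw text rather than a structured tool_calls array, the two shapes can each be parsed with a few regular expressions. This is a sketch with hypothetical function names, written against the two samples above only; it does no type coercion, so values from the XML form come back as strings.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_json_tool_calls(text: str) -> list:
    """JSON-in-XML form: each <tool_call> body is one JSON object."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        obj = json.loads(body)
        calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


def parse_xml_tool_calls(text: str) -> list:
    """Verbose XML form: <function=NAME> wrapping <parameter=KEY> blocks."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        match = FUNCTION_RE.search(body)
        if match is None:
            continue  # not this format; skip rather than guess
        name, params = match.groups()
        calls.append({"name": name, "arguments": dict(PARAM_RE.findall(params))})
    return calls


json_form = (
    '<tool_call>\n'
    '{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}\n'
    '</tool_call>'
)
xml_form = (
    "<tool_call>\n<function=get_current_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_json_tool_calls(json_form))
print(parse_xml_tool_calls(xml_form))
```

Keeping the two parsers separate follows the advice above: pick the one that matches your loader's template instead of trying both on every response.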

End-to-end exercise (Ollama path):

python examples/ollama_chat.py        # section 3 runs a real round-trip

Known limitations

  • Slower per token than the 35B-A3B sibling. Every parameter participates in every forward pass, so each token costs roughly 27B parameters of compute against ~3B active for the MoE; if you optimize for tokens per second, the MoE wins.
  • No mmproj in this release, and vision via Ollama is broken upstream (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the Vision section). For image input use llama.cpp directly until that's fixed.
  • Q4_K_M quality loss is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
  • No formal evaluation in this card. Numbers above are estimates.

Related models

| Model | Notes |
|---|---|
| Qwen/Qwen3.6-27B | Upstream base, safetensors |
| unsloth/Qwen3.6-27B-GGUF | Recommended GGUF source |
| FoolDev/Janus-35B | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| Crownelius/Crow-9B-HERETIC-4.6 | 9B starter model when 27B/35B is too heavy |

Credits

  • Base model: Qwen/Qwen3.6-27B (Alibaba)
  • Reasoning teacher: Claude Opus 4.7 (Anthropic)
  • Distillation lineage and dataset curation: Crownelius

License inherited from upstream: Apache-2.0.
