Thanatos-27B
Dense Reasoning. Friendlier Footprint. Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.
Architecture: Qwen 3.6 27B (Dense) | Parameters: 27B | Teacher: Claude Opus 4.7 | Type: Distilled LLM
A personal sibling to FoolDev/Janus-35B. Same teacher (Claude Opus 4.7), same dataset family, but built on the dense Qwen/Qwen3.6-27B base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
TL;DR
One-liner via Hugging Face (pulls a GGUF + this repo's root-level
template / system / params files, including the tool-calling
template — HF's Ollama bridge ingests those three files, not
Modelfile):
ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
If you pulled the bundle during any of the qwen36 windows on the
pre-rename FoolDev/Thanatos-27B repo (2026-05-19/20) and still
have a qwen36-stamped blob in your local Ollama store, `make heal-hf` rebadges it in place. Fresh pulls go straight through.
For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build QUANT=...` is the simplest path. See Quick start
below for the full matrix.
For image input use llama.cpp directly — Ollama vision is broken for this architecture upstream (see Vision).
Why a 27B variant?
The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but memory-hungry at load time — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
The 27B is dense: every parameter participates in every forward pass. It's slower per token than 35B-A3B — on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4 (make bench, 3-prompt mix) — but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
| | Thanatos-27B (this) | Janus-35B |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
| Active params per token | 27 B | ~3 B |
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
| Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream — see below | Broken upstream |
| Multimodal (vision via llama.cpp) | Yes, with mmproj | Yes, with mmproj |
| Max context | 262 144 | 262 144 |
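
The size rows in the table imply an average number of bits stored per parameter, which is a handy sanity check when sizing other quants. A minimal sketch, assuming decimal gigabytes and treating the whole file as weights (metadata and any higher-precision tensors are ignored, so the results are approximate). The function name is illustrative and not part of this repo:

```python
def bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Average bits per parameter from file size (decimal GB) and parameter count (billions)."""
    return file_size_gb * 8 / params_billion


# Figures taken from the comparison table (all approximate).
thanatos_q4 = bits_per_weight(17, 27)   # Thanatos-27B Q4_K_M
thanatos_q3 = bits_per_weight(12, 27)   # Thanatos-27B Q3_K_S
janus_q4 = bits_per_weight(19, 35)      # Janus-35B Q4_K_M

print(f"Thanatos Q4_K_M: {thanatos_q4:.2f} bits/weight")
print(f"Thanatos Q3_K_S: {thanatos_q3:.2f} bits/weight")
print(f"Janus Q4_K_M:    {janus_q4:.2f} bits/weight")
```

Multiplying a target bits-per-weight by 27 and dividing by 8 gives a rough file size in GB for a quant not listed here.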
What's here
| File | Use |
|---|---|
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
| `dense-flow.svg` / `dense-flow.png` | Architecture diagram: 64-layer hybrid attention stack with animated forward-pass pulse (SVG); static frame fallback (PNG) |
| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B GGUF; used by `make build` / `ollama create` for local builds |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does not read `Modelfile`; see HF Ollama docs). Mirrors the Modelfile's template / system prompt / sampling params. |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `unsloth/Qwen3.6-27B-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`) |
| `scripts/load_bundle.sh` | One-shot path from this repo's bundle to a loadable local Ollama tag (smudges the LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts; a no-op on the current qwen35-stamped bundle. |
| `scripts/heal_hf_pull.sh` | Legacy recovery for users who pulled `hf.co/FoolDev/Thanatos-27B` (or the pre-rename FoolDev/Thanatos-27B) before the latest qwen35 re-stamp and still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35; fresh pulls don't need it. |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream; see Vision). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| `scripts/check.sh` | Local lint: `bash -n`, pyflakes, py_compile, footgun-grep, plus Modelfile-vs-bridge-files sync check |
| `scripts/check_bridge_sync.py` | Verifies the Modelfile `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both qwen35- and qwen36-stamped bundles; exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
| `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
| `Makefile` | Convenience wrapper; `make help` lists targets |
| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |
For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
downloads the smaller ~12 GB Q3_K_S quant from
unsloth/Qwen3.6-27B-GGUF (qwen35-stamped, loads directly) and
creates a local `thanatos-27b` Ollama tag. That quant is not
redistributed via this repo. For other quants use `make build QUANT=...`.
The local-build path applies this repo's `Modelfile`; the `hf.co/...`
path applies the root-level `template`, `system`, and `params`
files (kept in sync with the Modelfile).
If you want the safetensors for transformers, fetch them from Qwen/Qwen3.6-27B.
Architecture
- Qwen 3.6 dense, 27B parameters, 64 transformer layers
- Hybrid attention stack: 16 repeats of `[3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)]`
  - Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
  - Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
- Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the base architecture via a separate `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf` from `unsloth/Qwen3.6-27B-GGUF`). See Vision below for current loader compatibility.
- Multi-token prediction (MTP) head trained for speculative decoding: present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`). Not usable via llama.cpp / Ollama today: the GGUF converter (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at inference yet"), so the bundled GGUF and the unsloth GGUFs ship with 851 tensors and no MTP head. llama.cpp's MTP support (PR #22673, merged 2026-05-16) currently covers other architectures only; tracking that PR's follow-up work for when qwen35 / qwen35moe consumer support lands. (Earlier README versions claimed MTP was available without this caveat. Confirmed empirically via `gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
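
The arithmetic behind the bullets above can be checked in a few lines. This is a sketch built only from the numbers listed here; the KV-cache estimate additionally assumes that only the softmax (Gated Attention) layers keep a per-token cache, that the cache is stored in f16, and it ignores the fixed-size state of the Gated DeltaNet layers. The rotary-dimension line assumes the RoPE factor is applied to head_dim.

```python
# Layer pattern from the Architecture bullets: 16 repeats of
# [3 x Gated DeltaNet, 1 x Gated Attention], each followed by an FFN.
REPEATS = 16
BLOCK = ["deltanet", "deltanet", "deltanet", "attention"]
layers = BLOCK * REPEATS

n_layers = len(layers)                   # 64
n_attention = layers.count("attention")  # 16
n_deltanet = layers.count("deltanet")    # 48

# FFN expansion ratio: intermediate size over hidden size.
ffn_ratio = 17408 / 5120

# Partial RoPE (assumption: factor times head_dim gives the rotated dims).
rotary_dims = int(256 * 0.25)


def kv_cache_bytes(context_tokens: int, bytes_per_element: int = 2) -> int:
    """Estimated KV cache for the softmax-attention layers (f16 by default)."""
    kv_heads, head_dim = 4, 256
    # Factor 2 covers one K and one V vector per KV head per layer.
    per_token = 2 * n_attention * kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


print(n_layers, n_deltanet, n_attention)
print(f"FFN ratio: {ffn_ratio:.1f}, rotary dims per head: {rotary_dims}")
print(f"KV cache @ 8192 ctx:   {kv_cache_bytes(8192) / 2**30:.2f} GiB")
print(f"KV cache @ 262144 ctx: {kv_cache_bytes(262144) / 2**30:.2f} GiB")
```

Under these assumptions the per-token cache is small because only a quarter of the layers use softmax attention and GQA keeps the KV-head count at 4.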
The bundled GGUF declares general.architecture: 'qwen35' — not a
workaround for an unimplemented qwen36 arch, but the canonical
upstream label for the entire Qwen 3.5 / 3.6 hybrid SSM + attention
family. The naming convergence runs through three layers of the
stack:
- Qwen's own HF configs. `Qwen/Qwen3.6-27B/config.json` declares `"model_type": "qwen3_5"` and `"architectures": ["Qwen3_5ForConditionalGeneration"]`. The MoE sibling `Qwen/Qwen3.6-35B-A3B` declares `"qwen3_5_moe"` / `Qwen3_5MoeForConditionalGeneration`. No `Qwen3_6` arch class exists in `transformers`; Qwen reuses the 3.5 class names.
- llama.cpp's converter. `convert_hf_to_gguf.py` registers `Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and `Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The unsloth GGUFs this repo pulls from (`unsloth/Qwen3.6-27B-GGUF`, `unsloth/Qwen3.6-35B-A3B-GGUF`) inherit those stamps.
- llama.cpp's model code. `src/models/qwen35.cpp` has an explicit `case 64: type = LLM_TYPE_27B` branch for this model; `qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the Janus-35B sibling base. The arch entries were written to load Qwen 3.6 weights, not just Qwen 3.5.
There is no PR or tracking issue for a qwen36 arch entry in
ggml-org/llama.cpp or ollama/ollama because none is needed —
qwen35 already loads the model the upstream code path was
designed to load.
`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
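
To see which stamp a given file carries without loading the weights, the `general.architecture` key can be read straight from the GGUF header. The following is a standard-library sketch, not one of this repo's scripts: it assumes a little-endian GGUF of version 2 or later (64-bit string lengths) and walks the metadata key/value section until it finds the key. The helper names are illustrative.

```python
import io
import struct

# Byte sizes of fixed-width GGUF metadata value types, keyed by type id.
_SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f) -> str:
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _skip_value(f, value_type: int) -> None:
    """Advance the stream past one metadata value of the given type."""
    if value_type in _SCALAR_SIZES:
        f.seek(_SCALAR_SIZES[value_type], io.SEEK_CUR)
    elif value_type == _STRING:
        (length,) = struct.unpack("<Q", f.read(8))
        f.seek(length, io.SEEK_CUR)
    elif value_type == _ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {value_type}")


def read_architecture(f) -> str:
    """Return general.architecture from an open binary GGUF stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    if version < 2:
        raise ValueError("this sketch assumes GGUF version 2 or later")
    for _ in range(kv_count):
        key = _read_string(f)
        (value_type,) = struct.unpack("<I", f.read(4))
        if key == "general.architecture" and value_type == _STRING:
            return _read_string(f)
        _skip_value(f, value_type)
    raise KeyError("general.architecture not found")


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:  # e.g. python read_arch.py Thanatos-27B.Q4_K_M.gguf
        with open(sys.argv[1], "rb") as fh:
            print(read_architecture(fh))
```

Because only the header is parsed, this works on an LFS-smudged multi-gigabyte file without reading tensor data.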
History
The bundle's general.architecture stamp has now flipped eight
times — four landings on qwen36 and four on qwen35 — each time
after weighing the friction-vs-honesty tradeoff anew. The saga
is resolved on the upstream-canonical qwen35 side:
- v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC): initial qwen35 → qwen36 stamp, on the theory that qwen35 was a loader stand-in awaiting proper Qwen 3.6 support. Upstream audit later showed that theory was mistaken (see above).
- 2026-05-19 afternoon (`964e418`): flipped back to qwen35 after daily friction outweighed version-specificity for that iteration; doc workaround narrative collapsed (`83022eb`).
- 2026-05-19 evening (`07fa120`): brief re-flip to qwen36 during a fresh-pull integration test on Strix Halo.
- 2026-05-19 evening (`72259c1`, ~1 hour later): reverted to qwen35 again because the live friction was worse than the doc prose suggested.
- 2026-05-19 evening (`973d7ef`): flipped to qwen36 one more time, after the upstream-evidence audit had been shipped and the friction was a known quantity. Project owner wanted to test the friction tradeoff in practice with the audit's conclusion staring them in the face.
- 2026-05-19 evening (`978798f`): flipped back to qwen35 after seven sequential fresh-pull → heal-hf cycles on the Strix Halo box made the friction concretely experienced rather than hypothetical. Each cycle worked (the heal flow is solid), and each cycle was an unnecessary obstacle for users who just want `ollama run` to work first try. The audit (`a4d3b6e`) called the canonical stamp correctly and the practical friction outweighed the version-specificity payoff.
- 2026-05-20 midday (`ae67ed1`): brief re-flip to qwen36 the next morning to re-test the friction in a fresh session.
- 2026-05-20 midday (`e03e10e`, 8 minutes later): flipped back to qwen35. Same conclusion as the prior round trip: friction outweighs version-specificity. This is the current state.
Tensor data was byte-identical across all stamps; only the
general.architecture KV (and namespaced KV keys) flipped.
See the CHANGELOG entries for each flip's
rationale.
Rebadge utility
`scripts/rename_arch.py` is the generic GGUF arch renamer
(metadata only, tensors byte-identical), kept in the repo for
the legacy qwen36 → qwen35 in-store rebadge (used by `make heal-hf` and `make load-bundle`) and any future arch flip:
# qwen36 -> qwen35 (the legacy recovery direction, for blobs
# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
--from-arch qwen36 --to-arch qwen35 \
Thanatos-27B.Q4_K_M.qwen36.gguf \
Thanatos-27B.Q4_K_M.gguf
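
Conceptually, a rebadge touches two things: the `general.architecture` value and every metadata key namespaced under the old arch name. The sketch below illustrates that on a plain dictionary; it is not the code in `scripts/rename_arch.py` (which is not reproduced here), and the function name and sample keys are assumptions chosen for illustration.

```python
def rename_arch_metadata(metadata: dict, from_arch: str, to_arch: str) -> dict:
    """Return a copy of GGUF-style metadata with the arch stamp and its namespaced keys renamed."""
    if metadata.get("general.architecture") != from_arch:
        raise ValueError(f"expected architecture {from_arch!r}")
    prefix = from_arch + "."
    renamed = {}
    for key, value in metadata.items():
        if key.startswith(prefix):
            # e.g. qwen36.block_count -> qwen35.block_count
            key = to_arch + "." + key[len(prefix):]
        renamed[key] = value
    renamed["general.architecture"] = to_arch
    return renamed


# Hypothetical metadata for a legacy qwen36-stamped blob; values follow this README.
legacy = {
    "general.architecture": "qwen36",
    "general.name": "Thanatos-27B",
    "qwen36.block_count": 64,
    "qwen36.context_length": 262144,
}
print(rename_arch_metadata(legacy, "qwen36", "qwen35"))
```

Values are carried over untouched, which mirrors the "tensors byte-identical" property: only names change.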
Quick start
Ollama
Three paths:
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
# root-level template / system / params files in one step):
ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen35-stamped
# B. Build a local `thanatos-27b` tag from THIS repo's bundle
# (LFS smudge if needed, then `ollama create`). Useful if you
# want a bare local tag rather than the `hf.co/...` path:
make load-bundle # creates local tag thanatos-27b
ollama run thanatos-27b
# C. Bypass the bundle: download a qwen35-stamped GGUF from unsloth
# and build locally. Loads on every current llama.cpp / Ollama.
make build # Q4_K_M -> thanatos-27b
make build QUANT=Q3_K_S # 12 GB smaller quant
make build QUANT=Q5_K_M # 20 GB higher quality
make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
ollama run thanatos-27b
Under the hood, `make build` calls `scripts/build.sh`, which downloads the GGUF if missing (set `GGUF_PATH` to point at one you already have) and runs `ollama create` with the matching Modelfile.
If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.
Confirm everything works:
make smoke # checks server, model, round-trip, no token leakage
make smoke-tools # adds an end-to-end tool-call round-trip (~10s extra)
make bench # measured tok/s on this machine (3-prompt mix)
python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-compat
Local apps
| App | How to load this model |
|---|---|
| Ollama | ollama run hf.co/FoolDev/Thanatos-27B (default Q4_K_M). Pulls the GGUF + the root-level template / system / params files in one step (HF's Ollama bridge ingests these three files; it does not read Modelfile). For other quants, make build QUANT=Q3_K_S downloads from unsloth and creates a local Ollama tag using the Modelfile, which is kept in sync with the bridge files. |
| LM Studio | Search → FoolDev/Thanatos-27B → pick Thanatos-27B.Q4_K_M.gguf. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the SYSTEM block in this repo's Modelfile. |
| Jan | Hub → "Import from Hugging Face" → FoolDev/Thanatos-27B. Same template behavior as LM Studio. |
| llama.cpp | hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir . then llama-server -m Thanatos-27B.Q4_K_M.gguf (or llama-cli, llama-mtmd-cli for vision via the upstream mmproj-F16.gguf). |
| llama-cpp-python | See examples/llama_cpp_quickstart.py (text) and examples/llama_cpp_vision.py (image input). |
| Open WebUI / KoboldCpp / text-generation-webui | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |
For the full Vision (image input) loader matrix, see Vision.
Tool calling currently works in Ollama (via the root-level
template file when pulling from hf.co/..., or via the Modelfile
TEMPLATE when building locally) and llama.cpp / llama-cpp-python
(via the GGUF's embedded jinja). Other apps' tool-calling support
depends on whether they read the embedded template or require an
external schema.
Inference (OpenAI-compatible)
curl -s http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "thanatos-27b",
"messages": [
{"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
{"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
],
"temperature": 0.6
}' | jq -r '.choices[0].message.content'
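
The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl call above (same URL, model tag, messages, and temperature); the helper names are illustrative, and the actual network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, system: str, user: str,
                       temperature: float = 0.6) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's visible answer."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:11434/v1",
    "thanatos-27b",
    "You are Thanatos, a precise reasoning assistant.",
    "Explain the Burrows-Wheeler transform in 200 words.",
)
# print(chat(request))  # uncomment with an Ollama server running locally
```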
Recommended sampling
| Use | temp | top_p | top_k | repeat_penalty |
|---|---|---|---|---|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |
Lower the temperature (0.4-0.6) and bump `repeat_penalty` to 1.08 if the model loops inside `<think>` tags.
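
For clients that pass sampling settings per request, the table and the loop workaround can be captured as a small helper. A sketch only: the preset names and the `looping` switch are illustrative, and 0.4 is just one point in the suggested 0.4-0.6 range. The keys match the option names Ollama accepts in a request's `options` object.

```python
# Presets copied from the "Recommended sampling" table.
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def sampling_options(use: str, looping: bool = False,
                     loop_temperature: float = 0.4) -> dict:
    """Return sampling options for a use case, tightened if the model loops in <think>."""
    options = dict(SAMPLING_PRESETS[use])  # copy so the presets stay untouched
    if looping:
        options["temperature"] = min(options["temperature"], loop_temperature)
        options["repeat_penalty"] = 1.08
    return options


print(sampling_options("reasoning"))
print(sampling_options("reasoning", looping=True))
```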
System prompt
The Modelfile bakes this in. Override per-request via the system role
in your client:
You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.
Vision
The Qwen 3.6 base supports image (and video) input via a separate
mmproj projector. The full multimodal stack is:
Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
mmproj-F16.gguf (~927 MB, the vision projector)
Both files are at
unsloth/Qwen3.6-27B-GGUF.
This repo intentionally does not redistribute either.
Loader compatibility — the honest table
| Loader | Text | Vision (mmproj) | Notes |
|---|---|---|---|
| llama.cpp (`llama-mtmd-cli`, `llama-server --mmproj`) | ✅ | ✅ | Reference path. Upstream has the qwen35/qwen35moe arch entries. |
| llama-cpp-python | ✅ | ✅ | See examples/llama_cpp_vision.py. |
| Ollama 0.24 | ✅ | ❌ | Text inference works: Ollama's Go engine has the qwen35 / qwen35moe arch entries. Vision (mmproj) is still broken: the C++ llama.cpp fallback that Ollama switches to when an mmproj is attached lacks those entries. ollama create accepts a dual-FROM (text + mmproj) and ollama show reports vision capability — but the first inference request fails with error loading model architecture: unknown model architecture: 'qwen35' (or 'qwen35moe'), and once mmproj is attached this blocks text inference too. See ollama/ollama#15898. |
| LM Studio | ✅ | ✅ (last tested) | Uses upstream llama.cpp directly. |
Vision via llama.cpp
Three flavors, in order of build-time effort:
# A. HTTP via llama-server (always built — the easiest path).
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-F16.gguf \
--host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block — e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
# The thinking trace arrives in message.reasoning_content; the visible
# answer is in message.content. Budget ≥500 max_tokens so the reasoning
# block doesn't crowd out the final answer.
# B. CLI via llama-mtmd-cli (one-shot). It's a separate cmake target,
# so a selective `cmake --build build --target llama-cli ...` won't
# produce it — a plain `cmake --build build` will. If yours didn't,
# run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-F16.gguf \
--image photo.jpg \
-p "Describe this image."
# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
--gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj /path/to/mmproj-F16.gguf \
--image /path/to/photo.jpg \
--prompt "What is in this image?"
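
For flavor A, the request body needs the image inlined as a base64 data URI, and the reply splits into a reasoning trace and a visible answer. The sketch below builds such a body and separates the two reply fields; it does no networking, the helper names are illustrative, and the default `max_tokens` of 1024 is an arbitrary choice above the suggested 500 minimum.

```python
import base64


def image_chat_payload(image_bytes: bytes, prompt: str, mime: str = "image/jpeg",
                       max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat body with one text part and one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        "max_tokens": max_tokens,
    }


def split_reply(response: dict) -> tuple[str, str]:
    """Return (reasoning trace, visible answer) from a chat completion response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


# Tiny stand-in bytes; in practice read them from a file such as photo.jpg.
payload = image_chat_payload(b"abc", "Describe this image.")
print(payload["messages"][0]["content"][1]["image_url"]["url"])
```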
Until the Ollama upstream issue is fixed, treat Ollama as text-only for this model.
Hardware requirements
The dense 27B is the lighter sibling to Janus-35B and the easier of the two to deploy.
| Hardware | Status |
|---|---|
| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. make build QUANT=Q3_K_S (~12 GB) and trim num_ctx for headroom. |
Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with make bench (3-prompt mix, reports tok/s from Ollama's
eval_count / eval_duration so it's not stopwatch-noisy). Reference
data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan:
~12.3 tok/s at Q3_K_S and ~9.3 tok/s at Q4_K_M (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s — Vulkan was the clear winner on
this hardware.
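
The tok/s figure follows directly from the two metadata fields: Ollama reports `eval_duration` in nanoseconds, so throughput is `eval_count` divided by the duration in seconds. A minimal sketch; how `scripts/bench.sh` aggregates its three prompts is not shown in this README, so pooling total tokens over total time is an assumption, and the sample numbers are made up for illustration.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from Ollama response metadata (duration in nanoseconds)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def pooled_tokens_per_second(responses: list[dict]) -> float:
    """Aggregate several responses as total tokens over total generation time."""
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    return tokens_per_second(total_tokens, total_ns)


# Hypothetical metadata from three responses (illustrative, not measurements).
sample = [
    {"eval_count": 120, "eval_duration": 10_000_000_000},
    {"eval_count": 300, "eval_duration": 25_000_000_000},
    {"eval_count": 480, "eval_duration": 40_000_000_000},
]
print(f"{pooled_tokens_per_second(sample):.1f} tok/s")
```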
Chat template
Standard Qwen 3.x ChatML with <|im_start|> / <|im_end|> role markers
and <think>...</think> blocks for reasoning traces. The Qwen 3.6 jinja
template is embedded in the GGUF metadata; loaders that read GGUF chat
templates directly (llama.cpp, llama-cpp-python, LM Studio) handle the
plain-conversation formatting automatically.
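
For plain conversations, the role-marker layout is simple enough to render by hand, which helps when debugging what a loader actually sends. This is a simplified sketch of the ChatML framing only; it is not the embedded jinja template, which also handles tools and reasoning traces. The function name is illustrative.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render role/content messages with <|im_start|> / <|im_end|> markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are Thanatos, a precise and capable assistant."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
])
print(prompt)
```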
Ollama is the exception: its conversion of the embedded jinja loses the
.Tools / .ToolCalls blocks Ollama's capability detector requires.
Two paths fix this, depending on how you pull the model:
- `ollama run hf.co/FoolDev/Thanatos-27B`: HF's Ollama bridge applies the root-level `template` / `system` / `params` files in this repo (the bridge does not read `Modelfile`).
- `make build` / `ollama create thanatos-27b -f Modelfile`: uses the `Modelfile`'s `TEMPLATE` block.
Both routes wire .Tools / .ToolCalls and tools work end-to-end on
/api/chat and /v1/chat/completions. The two configurations are
kept in sync: edit them together if you change one.
Plain conversation
<|im_start|>system
You are Thanatos, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
With reasoning trace
<|im_start|>assistant
<think>
The user asked about mergesort. It splits, recursively sorts each half,
then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>
Mergesort runs in **O(n log n)** time in the worst, average, and best
cases.<|im_end|>
Most clients (Open WebUI, LibreChat, etc.) hide the <think> block by
default and surface only the visible answer. Strip it manually with
re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL) if your
client doesn't.
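
Wrapped as a reusable helper, the regex above looks like this. A small sketch; the function name is illustrative.

```python
import re

# Non-greedy match so several <think> blocks in one reply are removed separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove <think>...</think> reasoning traces and the whitespace that follows them."""
    return _THINK_BLOCK.sub("", content)


reply = (
    "<think>\nThe recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).\n</think>\n"
    "Mergesort runs in **O(n log n)** time."
)
print(strip_think(reply))
```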
Tool / function calling
The wire format depends on the loader. Both are valid Qwen 3.6 outputs; the model adapts to whichever shape the system prompt prescribes.
Ollama path (this repo's Modelfile). The TEMPLATE directive
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
extractor parses into a structured tool_calls array. After
make build, ollama show thanatos-27b lists tools and thinking
under Capabilities, and both /api/chat and /v1/chat/completions
accept a tools array.
<tool_call>
{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
</tool_call>
Embedded-jinja path (llama.cpp, llama-cpp-python, LM Studio). The Qwen 3.6 native chat template baked into the GGUF instructs the model to emit the more verbose XML form it was trained on:
<tool_call>
<function=get_current_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
Use whichever your client expects; don't mix parsers.
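
When a client receives raw text instead of a structured `tool_calls` array, each wire format needs its own parser. The following sketch handles both shapes shown above with regular expressions and normalizes them to the same `{"name", "arguments"}` dictionaries. It is illustrative only: it is not the extractor Ollama or llama.cpp use, and in the XML form every argument value is kept as a string.

```python
import json
import re

_TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
_FUNCTION = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
_PARAMETER = re.compile(r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_json_tool_calls(text: str) -> list[dict]:
    """Parse the JSON-in-XML form (the Ollama path)."""
    return [json.loads(body) for body in _TOOL_CALL.findall(text)]


def parse_xml_tool_calls(text: str) -> list[dict]:
    """Parse the verbose XML form (the embedded-jinja path)."""
    calls = []
    for body in _TOOL_CALL.findall(text):
        for name, inner in _FUNCTION.findall(body):
            arguments = dict(_PARAMETER.findall(inner))
            calls.append({"name": name, "arguments": arguments})
    return calls


json_form = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Paris", "unit": "celsius"}}\n'
    "</tool_call>"
)
xml_form = (
    "<tool_call>\n<function=get_current_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=unit>\ncelsius\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_json_tool_calls(json_form))
print(parse_xml_tool_calls(xml_form))
```

Keeping the two parsers as separate functions makes it explicit which loader path a client is written against.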
End-to-end exercise (Ollama path):
python examples/ollama_chat.py # section 3 runs a real round-trip
Known limitations
- Slower per token than the 35B-A3B sibling. Every parameter of the dense 27B participates in each forward pass, whereas the MoE activates only ~3B of its 35B per token; if you optimize for tokens per second, the MoE wins.
- No mmproj in this release, and vision via Ollama is broken upstream (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached — see the Vision section). For image input use llama.cpp directly until that's fixed.
- Q4_K_M quality loss is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
- No formal evaluation in this card. Numbers above are estimates.
Related models
| Model | Notes |
|---|---|
| Qwen/Qwen3.6-27B | Upstream base, safetensors |
| unsloth/Qwen3.6-27B-GGUF | Recommended GGUF source |
| FoolDev/Janus-35B | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| Crownelius/Crow-9B-HERETIC-4.6 | 9B starter model when 27B/35B is too heavy |
Credits
- Base model: Qwen/Qwen3.6-27B (Alibaba)
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: Crownelius
License inherited from upstream: Apache-2.0.