# Thanatos-27B: Ollama wrapper around Qwen 3.6 27B (dense)
#
# Text + tool calling. Vision via Ollama is currently broken for this
# architecture (ollama/ollama#15898: the qwen35 arch entries are in
# Ollama's Go text engine but missing from the C++ llama.cpp fallback
# Ollama uses when an mmproj is attached). Use llama.cpp directly for
# image input, or wait for the fix. See the Vision section in README.md.
#
# This repo bundles a single GGUF: Thanatos-27B.Q4_K_M.gguf (~17 GB),
# stamped `general.architecture: 'qwen35'`, the upstream-canonical
# arch entry every released llama.cpp / Ollama loads under for the
# Qwen 3.5 / 3.6 hybrid SSM + attention family. `ollama create
# thanatos-27b -f Modelfile && ollama run thanatos-27b` loads it
# directly. See README "Architecture" for the full stamp history
# (eight flips between qwen35 and qwen36, settled on qwen35 at
# `e03e10e` after the 4th qwen36 round trip had its friction
# re-tested in a fresh next-day session).
#
# For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
# downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
# FROM in a temp Modelfile copy. The Q3_K_S used to ship in this repo;
# it was removed so HF's Ollama bridge picks Q4_K_M as the default
# `:latest` tag instead of Q3_K_S (alphabetically-first heuristic).
#
# Other GGUF sources (use with `make build GGUF_PATH=...`):
# https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
# https://huggingface.co/rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-GGUF
FROM ./Thanatos-27B.Q4_K_M.gguf
# Chat template: Qwen 3.6 ChatML in Ollama Go-template form, with the
# tool-calling blocks Ollama's capability detector looks for. Without a
# TEMPLATE that references .Tools and .ToolCalls, /api/chat and
# /v1/chat/completions reject any request carrying a `tools` array with
# `<model> does not support tools`. Same template as the 35B sibling;
# both share the Qwen 3.6 chat format.
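#
# A minimal sketch of a tool-carrying request this template is meant to
# let through, assuming the model was created as `thanatos-27b` and
# Ollama is listening on its default port 11434. The `get_weather` tool
# is hypothetical and only illustrates the shape of the `tools` array:
#   curl http://localhost:11434/api/chat -d '{
#     "model": "thanatos-27b",
#     "stream": false,
#     "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
#     "tools": [{
#       "type": "function",
#       "function": {
#         "name": "get_weather",
#         "description": "Get the current weather for a city",
#         "parameters": {
#           "type": "object",
#           "properties": {"city": {"type": "string"}},
#           "required": ["city"]
#         }
#       }
#     }]
#   }'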
TEMPLATE """{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}{{ .System }}
{{ end }}
{{- if .Tools }}# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}{{ end }}
{{- if .ToolCalls }}
{{- range .ToolCalls }}
<tool_call>
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
</tool_call>
{{- end }}
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
<think>
{{ end }}
{{- end }}"""
# Sampling tuned for reasoning + general use. See README "Recommended sampling"
# for creative/RP alternatives.
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 16384
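# The values above are defaults baked into the model; a request can
# override them through the `options` field. A sketch, assuming the
# model was created as `thanatos-27b` and Ollama is on its default port
# 11434 (the override values are illustrative, not the README's
# recommended creative/RP settings):
#   curl http://localhost:11434/api/chat -d '{
#     "model": "thanatos-27b",
#     "messages": [{"role": "user", "content": "Hello"}],
#     "options": {"temperature": 0.8, "num_ctx": 8192}
#   }'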
# Stop tokens. Without these, Ollama only honors <|im_end|> from the GGUF
# metadata; the model occasionally emits <|endoftext|> instead and Ollama
# keeps generating past it (synthesising a fake new user turn). Listing
# both, plus <|im_start|> as a belt-and-braces guard against the same
# loop, keeps responses cleanly terminated.
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_start|>"
SYSTEM """You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning."""
# Hardware notes
# --------------
# Qwen 3.6 27B is *dense*: every parameter participates in every forward pass.
# Q4_K_M GGUF is ~17 GB. Practical footprint:
#   weights mmap         ~17 GB
#   compute graph alloc  ~12 GB  (smaller than 35B-A3B because dense rather than MoE)
#   KV cache @ 16K ctx   ~1 GB   (with OLLAMA_KV_CACHE_TYPE=q8_0)
#   total minimum        ~30 GB
#
# Working configurations:
#   - RTX 3090 / 4090 24 GB: full Q4 offload, ~25-40 tok/s
#   - RTX 5090 32 GB: full offload at Q5/Q6 quant
#   - Mac Studio M2/M3 32 GB+ unified: ~15-25 tok/s
#   - Linux box with 32 GB+ RAM (CPU-only): ~1-3 tok/s
#   - 32 GB unified-memory laptops: borderline at Q4; try
#     `make build QUANT=Q3_K_S` (~12 GB) and trim num_ctx
#
# Measured data points (ASUS ROG Flow Z13 GZ302EA, Ryzen AI Max+ 395 +
# Radeon 8060S iGPU, 32 GB unified, gfx1151, OLLAMA_FLASH_ATTENTION=1,
# OLLAMA_KV_CACHE_TYPE=q8_0, num_ctx 16384, 3-prompt mix):
#   Vulkan (OLLAMA_VULKAN=1):
#     Q3_K_S: 12.31 tok/s aggregate (run 1)
#       (6182 tokens / 501.9 s; 12.67 / 12.55 / 12.25 short/medium/long)
#     Q3_K_S: 11.70 tok/s aggregate (run 2, 2026-05-19 evening)
#       (8009 tokens / 684.0 s; 12.23 / 12.12 / 11.66 short/medium/long)
#       Second run measured against a `thanatos-27b:latest` (pre-rename)
#       built via `make build QUANT=Q3_K_S` against the then-current
#       unsloth/Qwen3.6-27B-GGUF source. Aggregate is 4.9% below
#       run 1 (within the ±20% noise band); slightly longer
#       per-prompt outputs this run (8009 vs 6182 tokens) likely
#       contribute to the difference, plus late-in-session thermal
#       pressure on the Strix Halo iGPU.
#       (Heretic v2 base is not benched here yet; rebundle pending.)
#     Q4_K_M: 9.31 tok/s aggregate (run 1)
#       (5356 tokens / 574.9 s; 9.48 / 9.43 / 9.28 short/medium/long)
#     Q4_K_M: 9.19 tok/s aggregate (run 2, 2026-05-19 afternoon)
#       (6210 tokens / 675.6 s; 9.40 / 9.29 / 9.16 short/medium/long)
#       Second run measured against the qwen36-stamped HF-bridge tag
#       after `make heal-hf` rebadged it to qwen35 in store; this confirms
#       the in-place heal produces a model with the same performance
#       profile as `make load-bundle`. Aggregate is 1.3% below run 1
#       (within the ±20% noise band the README hardware section
#       warns about).
#     Q4_K_M: 9.32 tok/s aggregate (run 3, 2026-05-19 evening)
#       (4592 tokens / 492.7 s; 9.49 / 9.44 / 9.28 short/medium/long)
#       Third run, also against a heal-hf-rebadged qwen36-stamped
#       HF-bridge tag, this time the 3rd-round-trip bundle from
#       commit 973d7ef. Aggregate is within 0.1% of run 1's 9.31,
#       confirming the latest qwen36 -> qwen35 heal yields the same
#       performance profile as the prior two runs (no regression
#       from the third stamp flip).
#   ROCm (older snapshot, kept for backend comparison):
#     Q3_K_S: 10.14 tok/s aggregate
#       (8080 tokens / 796.5 s; 10.37 / 10.31 / 10.11 short/medium/long)
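#
# A sketch of the server environment behind the Vulkan numbers above,
# assuming a POSIX shell and that `ollama serve` is started by hand
# rather than by a service manager (with systemd or launchd, set the
# same variables in the service definition instead):
#   OLLAMA_VULKAN=1 OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve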