# Thanatos-27B: Ollama wrapper around Qwen 3.6 27B (dense)
#
# Text + tool calling. Vision via Ollama is currently broken for this
# architecture (ollama/ollama#15898: the qwen35 arch entries are in
# Ollama's Go text engine but missing from the C++ llama.cpp fallback
# Ollama uses when an mmproj is attached). Use llama.cpp directly for
# image input, or wait for the fix. See the Vision section in README.md.
#
# This repo bundles a single GGUF: Thanatos-27B.Q4_K_M.gguf (~17 GB),
# stamped `general.architecture: 'qwen35'`, the upstream-canonical
# arch entry that every released llama.cpp / Ollama build uses to load
# the Qwen 3.5 / 3.6 hybrid SSM + attention family. `ollama create
# thanatos-27b -f Modelfile && ollama run thanatos-27b` loads it
# directly. See README "Architecture" for the full stamp history
# (eight flips between qwen35 and qwen36, settled on qwen35 at
# `e03e10e` after the 4th qwen36 round trip had its friction
# re-tested in a fresh next-day session).
#
# For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
# downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
# FROM in a temp Modelfile copy. The Q3_K_S used to ship in this repo;
# it was removed so HF's Ollama bridge picks Q4_K_M as the default
# `:latest` tag instead of Q3_K_S (alphabetically-first heuristic).
#
# Other GGUF sources (use with `make build GGUF_PATH=...`):
#     https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
#     https://huggingface.co/rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-GGUF

FROM ./Thanatos-27B.Q4_K_M.gguf

# Chat template: Qwen 3.6 ChatML in Ollama Go-template form, with the
# tool-calling blocks Ollama's capability detector looks for. Without a
# TEMPLATE that references .Tools and .ToolCalls, /api/chat and
# /v1/chat/completions reject any request carrying a `tools` array with
# `<model> does not support tools`. Same template as the 35B sibling;
# both share the Qwen 3.6 chat format.
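#
# A minimal request sketch for checking that tool calling is accepted,
# assuming a local Ollama server on its default port (11434) and the
# model name from the `ollama create` command above. The `get_weather`
# tool is hypothetical and exists only for illustration:
#
#   curl http://localhost:11434/api/chat -d '{
#     "model": "thanatos-27b",
#     "stream": false,
#     "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
#     "tools": [{
#       "type": "function",
#       "function": {
#         "name": "get_weather",
#         "description": "Get the current weather for a city",
#         "parameters": {
#           "type": "object",
#           "properties": {"city": {"type": "string"}},
#           "required": ["city"]
#         }
#       }
#     }]
#   }'
#
# If the template were missing the .Tools / .ToolCalls references, this
# request would be expected to fail with the "does not support tools"
# error described above.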
TEMPLATE """{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}{{ .System }}

{{ end }}
{{- if .Tools }}# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}{{ end }}
{{- if .ToolCalls }}
{{- range .ToolCalls }}
<tool_call>
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
</tool_call>
{{- end }}
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
<think>
{{ end }}
{{- end }}"""

# Sampling tuned for reasoning + general use. See README "Recommended sampling"
# for creative/RP alternatives.
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 16384

# Stop tokens. Without these, Ollama only honors <|im_end|> from the GGUF
# metadata; the model occasionally emits <|endoftext|> instead and Ollama
# keeps generating past it (synthesising a fake new user turn). Listing
# both, plus <|im_start|> as a belt-and-braces guard against the same
# loop, keeps responses cleanly terminated.
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_start|>"

SYSTEM """You are Thanatos, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning."""

# Hardware notes
# --------------
# Qwen 3.6 27B is *dense*: every parameter participates in every forward pass.
# Q4_K_M GGUF is ~17 GB. Practical footprint:
#   weights mmap          ~17 GB
#   compute graph alloc   ~12 GB  (smaller than 35B-A3B because dense ≠ MoE)
#   KV cache @ 16K ctx     ~1 GB  (with OLLAMA_KV_CACHE_TYPE=q8_0)
#   total minimum          ~30 GB
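#
# A sketch of starting the server with the KV-cache setting the table
# above assumes. To the best of our knowledge Ollama only applies a
# quantized KV cache when flash attention is also enabled, so the two
# variables are set together here:
#   OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve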
#
# Working configurations:
#   ✓ RTX 3090 / 4090 24 GB                     - full Q4 offload, ~25-40 tok/s
#   ✓ RTX 5090 32 GB                            - full offload at Q5/Q6 quant
#   ✓ Mac Studio M2/M3 32 GB+ unified           - ~15-25 tok/s
#   ✓ Linux box with 32 GB+ RAM (CPU-only)      - ~1-3 tok/s
#   ⚠ 32 GB unified-memory laptops              - borderline at Q4, try
#                                                 `make build QUANT=Q3_K_S`
#                                                 (~12 GB) and trim num_ctx
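#
# A sketch of trimming the context window for the borderline case above.
# 8192 is an illustrative value, not a measured one; pick whatever fits
# the available memory. Either edit the PARAMETER line in this file:
#   PARAMETER num_ctx 8192
# or override it for a single interactive `ollama run` session:
#   /set parameter num_ctx 8192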
#
# Measured data points (ASUS ROG Flow Z13 GZ302EA, Ryzen AI Max+ 395 +
# Radeon 8060S iGPU, 32 GB unified, gfx1151, OLLAMA_FLASH_ATTENTION=1,
# OLLAMA_KV_CACHE_TYPE=q8_0, num_ctx 16384, 3-prompt mix):
#   Vulkan (OLLAMA_VULKAN=1):
#     Q3_K_S → 12.31 tok/s aggregate (run 1)
#       (6182 tokens / 501.9 s; 12.67 / 12.55 / 12.25 short/medium/long)
#     Q3_K_S → 11.70 tok/s aggregate (run 2, 2026-05-19 evening)
#       (8009 tokens / 684.0 s; 12.23 / 12.12 / 11.66 short/medium/long)
#       Second run measured against a `thanatos-27b:latest` (pre-rename)
#       built via `make build QUANT=Q3_K_S` against the then-current
#       unsloth/Qwen3.6-27B-GGUF source. Aggregate is 4.9% below
#       run 1 (within the ±20% noise band); slightly longer
#       per-prompt outputs this run (8009 vs 6182 tokens) likely
#       contribute to the difference, along with late-in-session
#       thermal pressure on the Strix Halo iGPU.
#       (Heretic v2 base is not benched here yet; rebundle pending.)
#     Q4_K_M →  9.31 tok/s aggregate (run 1)
#       (5356 tokens / 574.9 s;  9.48 /  9.43 /  9.28 short/medium/long)
#     Q4_K_M →  9.19 tok/s aggregate (run 2, 2026-05-19 afternoon)
#       (6210 tokens / 675.6 s;  9.40 /  9.29 /  9.16 short/medium/long)
#       Second run measured against the qwen36-stamped HF-bridge tag
#       after `make heal-hf` rebadged it to qwen35 in store. This confirms
#       the in-place heal produces a model with the same performance
#       profile as `make load-bundle`. Aggregate is 1.3% below run 1
#       (within the ±20% noise band the README hardware section
#       warns about).
#     Q4_K_M →  9.32 tok/s aggregate (run 3, 2026-05-19 evening)
#       (4592 tokens / 492.7 s;  9.49 /  9.44 /  9.28 short/medium/long)
#       Third run, also against a heal-hf-rebadged qwen36-stamped
#       HF-bridge tag, this time the 3rd-round-trip bundle from
#       commit 973d7ef. Aggregate is within 0.1% of run 1's 9.31,
#       confirming the latest qwen36 -> qwen35 heal yields the same
#       performance profile as the prior two runs (no regression
#       from the third stamp flip).
#   ROCm (older snapshot, kept for backend comparison):
#     Q3_K_S → 10.14 tok/s aggregate
#       (8080 tokens / 796.5 s; 10.37 / 10.31 / 10.11 short/medium/long)