FoolDev committed · Commit e1f78fa · verified · 1 Parent(s): a60eff5

Rebadge bundle qwen35 -> qwen36 + doc the workaround

Re-flip general.architecture in the bundled Q4_K_M GGUF to qwen36, the architecturally-honest label. Reverses f0d70ee (which itself reverted 2dbe526).

No released llama.cpp / Ollama recognizes qwen36 yet (reconfirmed 2026-05-19 against llama.cpp 389ff61 + Ollama 0.24.0), so the bundle is unloadable on stock loaders until upstream adds the arch entry.

README/Modelfile/CHANGELOG updated with the qwen36 -> qwen35 rebadge workaround (scripts/rename_arch.py is metadata-only, tensor data byte-identical).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
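To see which label a given file actually carries before and after a rebadge, the `general.architecture` key can be read straight from the GGUF header. The following is a minimal sketch, not the repo's `scripts/rename_arch.py` (whose implementation is not shown in this commit). It assumes a little-endian GGUF of version 2 or later (64-bit counts), and the function name `read_architecture` is made up for illustration.

```python
import io
import struct

# GGUF metadata value type ids mapped to struct formats (little-endian assumed).
_SCALAR_FORMATS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
_STRING, _ARRAY = 8, 9


def _read_string(f):
    # GGUF strings are a uint64 byte length followed by UTF-8 bytes.
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        # Arrays carry an element type (uint32) and a count (uint64).
        elem_type, count = struct.unpack("<IQ", f.read(12))
        return [_read_value(f, elem_type) for _ in range(count)]
    fmt = _SCALAR_FORMATS[vtype]
    (value,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    return value


def read_architecture(f):
    """Return the value of general.architecture, or None if the key is absent.

    Hypothetical helper: walks the metadata key/value section of an open
    binary file object and stops at the first matching key.
    """
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    for _ in range(kv_count):
        key = _read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        value = _read_value(f, vtype)
        if key == "general.architecture":
            return value
    return None
```

Usage would look like `with open("Thanatos-27B.Q4_K_M.gguf", "rb") as f: print(read_architecture(f))`. Only the header is read, so this stays fast even on a ~17 GB file.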

Files changed (4)
  1. CHANGELOG.md +21 -0
  2. Modelfile +15 -4
  3. README.md +67 -42
  4. Thanatos-27B.Q4_K_M.gguf +1 -1
CHANGELOG.md CHANGED
@@ -7,6 +7,27 @@ and documentation**, not the underlying base model.
 
  ## [Unreleased]
 
+ ### Changed
+ - **BREAKING (loaders).** Bundle re-stamped from
+ `general.architecture: 'qwen35'` to `'qwen36'` -- the
+ architecturally-honest label. No released llama.cpp / Ollama
+ recognizes `qwen36` yet (reconfirmed 2026-05-19 against llama.cpp
+ 389ff61 and Ollama 0.24.0), so `ollama run hf.co/FoolDev/Thanatos-27B`
+ and direct `llama-server -m Thanatos-27B.Q4_K_M.gguf` both fail with
+ `unknown model architecture: 'qwen36'` until upstream adds the entry.
+ To load *today*, rebadge to qwen35 locally with
+ `scripts/rename_arch.py --from-arch qwen36 --to-arch qwen35` -- see
+ the README "Architecture" section for the one-liner. Tensor data
+ stays byte-identical; only metadata flips. Reverses prior commit
+ `f0d70ee` which had restored qwen35 for compatibility; deliberate
+ re-flip with eyes open about the breakage.
+ - README "Heads up" callout near the top of the model card, TL;DR,
+ loader matrix, Quick start Ollama block, and Modelfile preamble
+ all updated to flag the qwen36 stamp and point users at the
+ rebadge workaround. Quick-start option B (`make build` from
+ unsloth) called out as the loads-today path, since unsloth still
+ ships qwen35-stamped GGUFs.
+
  ### Added
  - README "Vision via llama.cpp" subsection now leads with the
  `llama-server --mmproj` HTTP path (always built into stock llama.cpp,
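The changelog's claim that "tensor data stays byte-identical; only metadata flips" can be spot-checked by comparing the original and rebadged files byte for byte. The sketch below assumes the two files have the same length, which is plausible here because `qwen35` and `qwen36` are the same number of characters and the LFS pointer size is unchanged in this commit. The function name `diff_offsets` is hypothetical.

```python
import os


def diff_offsets(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offsets at which two equal-length files differ.

    Hypothetical helper: reads both files in chunks so a multi-gigabyte
    GGUF never has to fit in memory.
    """
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        raise ValueError("files differ in length; offsets are not comparable")
    offsets = []
    base = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a:
                break
            if a != b:
                # Only scan byte by byte inside chunks that actually differ.
                offsets.extend(
                    base + i for i, (x, y) in enumerate(zip(a, b)) if x != y
                )
            base += len(a)
    return offsets
```

Run against `Thanatos-27B.Q4_K_M.gguf` and `Thanatos-27B.Q4_K_M.qwen35.gguf`, a metadata-only rename should yield a short list of offsets that all sit near the start of the file, in the header's key/value section. Any offset deep into the file would suggest tensor data was touched.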
Modelfile CHANGED
@@ -6,11 +6,22 @@
  # Ollama uses when an mmproj is attached). Use llama.cpp directly for
  # image input, or wait for the fix. See the Vision section in README.md.
  #
- # This repo bundles a single GGUF: Thanatos-27B.Q4_K_M.gguf (~17 GB).
- # The FROM line below points at it, so a fresh clone (with LFS smudge
- # enabled) supports the no-script path:
+ # This repo bundles a single GGUF: Thanatos-27B.Q4_K_M.gguf (~17 GB),
+ # stamped `general.architecture: 'qwen36'`. The FROM line below points
+ # at it, but no released llama.cpp / Ollama recognizes `qwen36` yet, so
+ # `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`
+ # fails today with `unknown model architecture: 'qwen36'`. To load now,
+ # rebadge the bundle to `qwen35` first:
  #
- # ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b
+ # python3 scripts/rename_arch.py --from-arch qwen36 --to-arch qwen35 \
+ # Thanatos-27B.Q4_K_M.gguf Thanatos-27B.Q4_K_M.qwen35.gguf
+ # # then temporarily point FROM at the qwen35 file, or use:
+ # # echo "FROM $PWD/Thanatos-27B.Q4_K_M.qwen35.gguf" > /tmp/Modelfile.qwen35
+ # # ollama create thanatos-27b -f /tmp/Modelfile.qwen35
+ #
+ # Once upstream adds the qwen36 arch entry the workaround disappears
+ # and the FROM line below works as-is. See README "Architecture" for
+ # the full story.
  #
  # For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
  # downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
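The temporary Modelfile in the preamble uses `$PWD` so that `FROM` carries an absolute path. The same step can be scripted; this is a rough sketch that mirrors the `echo` line rather than anything shipped in the repo. The helper name `write_modelfile` and the default output directory are assumptions.

```python
import tempfile
from pathlib import Path


def write_modelfile(gguf_path, out_dir=None):
    """Write a one-line Modelfile whose FROM is an absolute path.

    Hypothetical helper: resolves gguf_path against the current working
    directory, so the Modelfile can live anywhere (e.g. the temp dir).
    """
    gguf = Path(gguf_path).resolve()
    target_dir = Path(out_dir) if out_dir else Path(tempfile.gettempdir())
    modelfile = target_dir / "Modelfile.qwen35"
    modelfile.write_text(f"FROM {gguf}\n", encoding="utf-8")
    return modelfile
```

After `write_modelfile("Thanatos-27B.Q4_K_M.qwen35.gguf")`, the returned path can be passed to `ollama create thanatos-27b -f <path>`, matching the commented commands above.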
README.md CHANGED
@@ -61,6 +61,15 @@ pipeline_tag: image-text-to-text
 
  A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
 
+ > ⚠️ **Heads up -- bundle is stamped `qwen36`.** As of 2026-05-19 the
+ > bundled GGUF declares `general.architecture: 'qwen36'`, which no
+ > released llama.cpp / Ollama recognizes yet. `ollama run
+ > hf.co/FoolDev/Thanatos-27B` and `llama-server -m
+ > Thanatos-27B.Q4_K_M.gguf` both fail today with `unknown model
+ > architecture: 'qwen36'`. To load now, rebadge locally to `qwen35`:
+ > see [Architecture](#architecture) for the one-liner. Once upstream
+ > ships qwen36 the workaround disappears.
+
  ## TL;DR
 
  One-liner via Hugging Face (pulls a GGUF + this repo's root-level
@@ -69,12 +78,16 @@ template -- HF's Ollama bridge ingests those three files, not
  `Modelfile`):
 
  ```bash
- ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M (the only bundled quant)
+ ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M, qwen36-stamped (see Heads-up above)
  ```
 
+ That command fails today with `unknown model architecture: 'qwen36'`
+ until you rebadge locally -- see [Architecture](#architecture).
+
  For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
- QUANT=Q3_K_S` downloads from `unsloth/Qwen3.6-27B-GGUF` and creates the
- local Ollama tag. See [Quick start](#quick-start) below.
+ QUANT=Q3_K_S` downloads from `unsloth/Qwen3.6-27B-GGUF` (which still
+ ships `qwen35`-stamped GGUFs) and creates the local Ollama tag.
+ See [Quick start](#quick-start) below.
 
  Or build locally (uses this repo's `Modelfile`, kept in sync with the
  three bridge files) for any quant:
@@ -131,16 +144,18 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
  | `README.md` | This file |
 
  This repo ships a single GGUF to back the HF/Ollama "Use this model"
- widget -- `Thanatos-27B.Q4_K_M.gguf` (~17 GB):
+ widget -- `Thanatos-27B.Q4_K_M.gguf` (~17 GB, qwen36-stamped):
 
  ```bash
- ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M (only bundled quant)
+ ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen36 -- fails today, see Heads-up above
  ```
 
- For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
- downloads the smaller ~12 GB Q3_K_S quant from `unsloth/Qwen3.6-27B-GGUF`
- and creates a local `thanatos-27b` Ollama tag (does not redistribute via
- this repo).
+ For 16 GB GPUs / unified-memory laptops -- and as a working-today
+ fallback while the bundle waits on upstream qwen36 support --
+ `make build QUANT=Q3_K_S` downloads the smaller ~12 GB Q3_K_S quant
+ from `unsloth/Qwen3.6-27B-GGUF` (qwen35-stamped, loads on every
+ current llama.cpp / Ollama build) and creates a local `thanatos-27b`
+ Ollama tag. Does not redistribute via this repo.
 
  For other quants or local builds, pull from
  [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
@@ -169,39 +184,42 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
  current loader compatibility.
  - Multi-token prediction (MTP) head trained for speculative decoding
 
- The bundled GGUF declares `general.architecture: 'qwen35'`, not
- `'qwen36'`, on purpose. Upstream `ggml-org/llama.cpp` and `ollama/ollama`
- both register Qwen 3.5 and Qwen 3.6 under the same `qwen35` / `qwen35moe`
- arch tags because the hybrid SSM + attention stack is shared between
- the two generations -- there is no separate `qwen36` arch entry in
- either project (checked 2026-05-19, master / main respectively).
- Rebadging the GGUF to `qwen36` makes it unloadable (`error loading
- model architecture: unknown model architecture: 'qwen36'`).
- `scripts/rename_arch.py` is in the repo for the day that changes
- upstream -- flipping `qwen35 → qwen36` (or back) is a metadata-only
- operation, tensor data stays byte-identical.
-
- Want to retest the qwen36 path on your machine? Smudge LFS, then:
+ **The bundled GGUF declares `general.architecture: 'qwen36'`** -- the
+ architecturally-honest stamp. Upstream `ggml-org/llama.cpp` and
+ `ollama/ollama` currently only register the hybrid SSM + attention
+ stack under `qwen35` / `qwen35moe`; no `qwen36` arch entry exists yet
+ (reconfirmed 2026-05-19 against llama.cpp 389ff61 and Ollama 0.24.0).
+ Consequence: **the bundle is unloadable on current stock loaders.**
+ `ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m ...` both
+ fail with `error loading model architecture: unknown model
+ architecture: 'qwen36'` (Ollama 0.24 surfaces it as a 500 wrapping a
+ generic `unable to load model: <blob>` -- check `journalctl --user -u
+ ollama` for the underlying line).
+
+ This is intentional. The bundle is the *correct* metadata; the
+ loaders are the lagging side. The flip is reversible until upstream
+ catches up -- go the other direction locally with
+ `scripts/rename_arch.py` (metadata-only, tensors stay byte-identical):
 
  ```bash
+ # Get the bundle in a loadable state on today's llama.cpp / Ollama:
  python3 scripts/rename_arch.py \
+ --from-arch qwen36 --to-arch qwen35 \
  Thanatos-27B.Q4_K_M.gguf \
- Thanatos-27B.Q4_K_M.qwen36.gguf
- # Use an absolute path -- ollama resolves a relative FROM against the
- # Modelfile's directory, not your CWD, so `FROM ./...` in /tmp/ fails
- # with a misleading 400 / "unable to load model" before the loader runs.
- echo "FROM $PWD/Thanatos-27B.Q4_K_M.qwen36.gguf" > /tmp/Modelfile.qwen36
- ollama create test-qwen36 -f /tmp/Modelfile.qwen36
- ollama run test-qwen36 hi
+ Thanatos-27B.Q4_K_M.qwen35.gguf
+ # Then either build a local Ollama tag (note absolute path --
+ # `ollama create` resolves a relative FROM against the Modelfile's
+ # directory, not your CWD):
+ echo "FROM $PWD/Thanatos-27B.Q4_K_M.qwen35.gguf" > /tmp/Modelfile.qwen35
+ ollama create thanatos-27b -f /tmp/Modelfile.qwen35
+ ollama run thanatos-27b hi
+ # …or point llama-server at the qwen35 file directly:
+ llama-server -m Thanatos-27B.Q4_K_M.qwen35.gguf -ngl 99 -c 8192
  ```
 
- Today that last line errors with `500 Internal Server Error: unable
- to load model: <blob path>`. The CLI no longer surfaces the arch
- name; check `journalctl --user -u ollama` (or wherever your server
- logs go) for the underlying `error loading model architecture:
- unknown model architecture: 'qwen36'`. The day that line succeeds is
- the day the bundle can be flipped (reconfirmed 2026-05-19 against
- Ollama 0.24.0).
+ Once upstream adds the `qwen36` arch entry -- patch landed in
+ `ggml-org/llama.cpp` and propagated into Ollama -- the bundle works
+ as-is and the workaround above can be deleted.
 
  ## Quick start
 
@@ -211,10 +229,13 @@ Two paths:
 
  ```bash
  # A. Pull straight from HF (uses the bundled Q4_K_M + root-level
- # template / system / params files):
- ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M (only bundled quant)
+ # template / system / params files). Fails today with
+ # `unknown model architecture: 'qwen36'`; see Architecture for
+ # the qwen36 → qwen35 rebadge workaround.
+ ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen36-stamped
 
- # B. Build locally for a different quant (downloads from unsloth):
+ # B. Build locally for a different quant (downloads qwen35-stamped
+ # GGUFs from unsloth -- these load on today's llama.cpp / Ollama):
  make build # Q4_K_M -> thanatos-27b
  make build QUANT=Q3_K_S # 12 GB smaller quant
  make build QUANT=Q5_K_M # 20 GB higher quality
@@ -240,12 +261,16 @@ python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-
 
  ### Local apps
 
- The bundled `Thanatos-27B.Q4_K_M.gguf` works in any GGUF-compatible local
- app -- point it at this repo and load.
+ The bundled `Thanatos-27B.Q4_K_M.gguf` is `qwen36`-stamped -- every row
+ below assumes you've rebadged it to `qwen35` per
+ [Architecture](#architecture), or that you're pulling a `qwen35`-stamped
+ GGUF from `unsloth/Qwen3.6-27B-GGUF` instead. The "fails today with
+ `unknown model architecture: 'qwen36'`" caveat applies to every row
+ until that's done.
 
  | App | How to load this model |
  |---|---|
- | **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
+ | **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, or to bypass the qwen36 block today, `make build QUANT=Q3_K_S` downloads from unsloth (qwen35-stamped) and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
  | **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
  | **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
  | **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
Thanatos-27B.Q4_K_M.gguf CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5ed60d0af4650a854b1755bd392f9aef4872643dc25a254bc68043fa638392a0
+ oid sha256:b2b63b941d714b0aff5e9afc9b337e7607b84e371ba991c182d92d0f7805c0e9
  size 16817244384
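The pointer change above is what a same-length, in-place metadata edit would be expected to look like: the content hash changes while the byte size does not. A small sketch that parses the two pointer bodies from this hunk and compares them follows; the helper name `parse_lfs_pointer` is made up for illustration, and only the three fields shown in the diff are assumed to be present.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file body into a small dict.

    Hypothetical helper: expects the `version`, `oid` and `size` lines
    that appear in this commit's pointer file.
    """
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer bodies copied from the old and new sides of the hunk above.
OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:5ed60d0af4650a854b1755bd392f9aef4872643dc25a254bc68043fa638392a0
size 16817244384
"""
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:b2b63b941d714b0aff5e9afc9b337e7607b84e371ba991c182d92d0f7805c0e9
size 16817244384
"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)
print("size unchanged:", old["size"] == new["size"])
print("digest changed:", old["digest"] != new["digest"])
```

An unchanged size does not by itself prove the tensors are untouched; it only rules out edits that grow or shrink the file. A byte-level comparison of the two GGUFs is the stronger check.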