FoolDev commited on
Commit
75bbdfe
Β·
1 Parent(s): c9ce901

Remove bundled Q3_K_S so HF picks Q4_K_M as the default Ollama tag

Browse files

Resolution to the recurring 'Q4 to widget' issue: with Q3_K_S removed,
Janus-27B.Q4_K_M.gguf is the only GGUF in the repo. HF's Ollama bridge
now defaults to it unambiguously β€” bare 'ollama run hf.co/FoolDev/janus-27b'
resolves to the 17 GB Q4_K_M blob instead of the 12 GB Q3_K_S blob.

Other approaches we tried first:
- Documenting ':Q4_K_M' explicit tag (kept working but didn't fix the widget)
- Renaming Janus-27B.Q3_K_S.gguf -> Janus-27B.q3_k_s.gguf (broke HF GGUF
detection entirely; reverted in commit 384e186)
- Looking for HF metadata to override default-quant pick (none documented)

Removing Q3_K_S keeps the smaller-footprint workflow available β€” users
on 16 GB GPUs / 32 GB unified-memory laptops run 'make build QUANT=Q3_K_S'
which downloads Qwen3.6-27B-Q3_K_S.gguf from unsloth/Qwen3.6-27B-GGUF
and creates a local Ollama tag using the bundled Modelfile (kept in
sync with the bridge files, verified by check_bridge_sync.py).

Tradeoff: 12 GB less data shipped via the repo; Q3_K_S users do an
extra ~3 minute download; default Ollama tag now matches what the
README's been recommending all along.

Updated:
- Modelfile: comment block now says 'single bundled GGUF'
- README: TL;DR / Architecture table / What's here / Quick start /
Local apps / Hardware sections all reflect Q4_K_M-only bundling
- examples/README.md + examples/ollama_chat.py: updated Ollama snippets
- CITATION.cff: abstract mentions one bundled GGUF, not two
- CHANGELOG: full forensic entry under [Unreleased] / Removed

CHANGELOG.md CHANGED
@@ -7,6 +7,25 @@ and documentation**, not the underlying base model.
7
 
8
  ## [Unreleased]
9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10
  ### Investigated (and reverted)
11
  - Tried renaming `Janus-27B.Q3_K_S.gguf` to `Janus-27B.q3_k_s.gguf`
12
  (lowercase quant suffix) hoping to flip HF's default-`:latest`-tag
 
7
 
8
  ## [Unreleased]
9
 
10
+ ### Removed
11
+ - `Janus-27B.Q3_K_S.gguf` no longer redistributed in this repo.
12
+ Removing it leaves `Janus-27B.Q4_K_M.gguf` as the only GGUF, which
13
+ flips HF's Ollama bridge default-tag pick from Q3_K_S to Q4_K_M
14
+ unambiguously: `ollama run hf.co/FoolDev/janus-27b` now resolves
15
+ to the 17 GB Q4_K_M blob instead of the 12 GB Q3_K_S blob. This
16
+ is the resolution to the long-running "Q4 to widget" thread β€”
17
+ rename hacks broke HF's GGUF detection (see earlier "Investigated
18
+ (and reverted)" entry), and HF doesn't expose a documented
19
+ default-quant metadata field, so the only reliable fix was to
20
+ remove the conflicting file. Q3_K_S users build it locally with
21
+ `make build QUANT=Q3_K_S`, which downloads
22
+ `Qwen3.6-27B-Q3_K_S.gguf` from `unsloth/Qwen3.6-27B-GGUF` and
23
+ patches the Modelfile FROM line into a temp copy. Tradeoff: 12 GB
24
+ less data shipped through the repo, smaller-footprint users do an
25
+ extra ~3 minute download, default Ollama tag now matches the
26
+ README's recommendation. Updated Modelfile / README / examples /
27
+ CITATION.cff to reflect Q4_K_M as the only bundled quant.
28
+
29
  ### Investigated (and reverted)
30
  - Tried renaming `Janus-27B.Q3_K_S.gguf` to `Janus-27B.q3_k_s.gguf`
31
  (lowercase quant suffix) hoping to flip HF's default-`:latest`-tag
CITATION.cff CHANGED
@@ -10,11 +10,12 @@ url: "https://huggingface.co/FoolDev/janus-27b"
10
  abstract: >-
11
  Janus-27B is a personal repackaging of the dense Qwen 3.6 27B base model
12
  with Claude Opus 4.7 in the reasoning teacher slot. The repository ships
13
- an Ollama Modelfile, sampling defaults, usage examples, and two
14
- ready-to-run GGUFs (Q4_K_M ~17 GB and Q3_K_S ~12 GB) so the HF "Use
15
- this model" widget surfaces a one-liner Ollama snippet. Other quants
16
- and the upstream safetensors (Qwen/Qwen3.6-27B) are pulled from
17
- upstream on demand rather than redistributed.
 
18
  keywords:
19
  - qwen
20
  - qwen3.6
 
10
  abstract: >-
11
  Janus-27B is a personal repackaging of the dense Qwen 3.6 27B base model
12
  with Claude Opus 4.7 in the reasoning teacher slot. The repository ships
13
+ an Ollama Modelfile, sampling defaults, usage examples, and a single
14
+ ready-to-run GGUF (Q4_K_M ~17 GB) so the HF "Use this model" widget
15
+ surfaces a one-liner Ollama snippet. Other quants (Q3_K_S, Q5_K_M,
16
+ Q6_K, etc.) and the upstream safetensors (Qwen/Qwen3.6-27B) are
17
+ pulled from upstream (unsloth/Qwen3.6-27B-GGUF) on demand rather
18
+ than redistributed.
19
  keywords:
20
  - qwen
21
  - qwen3.6
Janus-27B.Q3_K_S.gguf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4afb4abcf0207a484b0d7e92c0421b74e8ce1c7a7250bb9d824b79288da68f20
3
- size 12358727904
 
 
 
 
Modelfile CHANGED
@@ -5,15 +5,17 @@
5
  # missing the qwen35 arch entries). Use llama.cpp directly for image
6
  # input, or wait for the fix. See the Vision section in README.md.
7
  #
8
- # This repo bundles two GGUFs: Janus-27B.Q4_K_M.gguf (~17 GB, default)
9
- # and Janus-27B.Q3_K_S.gguf (~12 GB, smaller-footprint option). The FROM
10
- # line below points at the bundled Q4_K_M, so a fresh clone (with LFS
11
- # smudge enabled) supports the no-script path:
12
  #
13
  # ollama create janus-27b -f Modelfile && ollama run janus-27b
14
  #
15
- # To use the smaller quant, edit FROM to ./Janus-27B.Q3_K_S.gguf, or run
16
- # `make build QUANT=Q3_K_S` which patches FROM in a temp Modelfile copy.
 
 
 
17
  #
18
  # Other GGUF sources (use with `make build GGUF_PATH=...`):
19
  # https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
@@ -121,7 +123,8 @@ Behavior rules:
121
  # βœ“ RTX 5090 32 GB β€” full offload at Q5/Q6 quant
122
  # βœ“ Mac Studio M2/M3 32 GB+ unified β€” ~15-25 tok/s
123
  # βœ“ Linux box with 32 GB+ RAM (CPU-only) β€” ~1-3 tok/s
124
- # ⚠ 32 GB unified-memory laptops β€” borderline at Q4, try Q3_K_S
 
125
  # (~12 GB) and trim num_ctx
126
  #
127
  # Measured data point (ASUS ROG Flow Z13 GZ302EA, Ryzen AI Max+ 395 +
 
5
  # missing the qwen35 arch entries). Use llama.cpp directly for image
6
  # input, or wait for the fix. See the Vision section in README.md.
7
  #
8
+ # This repo bundles a single GGUF: Janus-27B.Q4_K_M.gguf (~17 GB).
9
+ # The FROM line below points at it, so a fresh clone (with LFS smudge
10
+ # enabled) supports the no-script path:
 
11
  #
12
  # ollama create janus-27b -f Modelfile && ollama run janus-27b
13
  #
14
+ # For other quants (Q3_K_S, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_S`
15
+ # downloads the chosen quant from unsloth/Qwen3.6-27B-GGUF and patches
16
+ # FROM in a temp Modelfile copy. The Q3_K_S used to ship in this repo;
17
+ # it was removed so HF's Ollama bridge picks Q4_K_M as the default
18
+ # `:latest` tag instead of Q3_K_S (alphabetically-first heuristic).
19
  #
20
  # Other GGUF sources (use with `make build GGUF_PATH=...`):
21
  # https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
 
123
  # βœ“ RTX 5090 32 GB β€” full offload at Q5/Q6 quant
124
  # βœ“ Mac Studio M2/M3 32 GB+ unified β€” ~15-25 tok/s
125
  # βœ“ Linux box with 32 GB+ RAM (CPU-only) β€” ~1-3 tok/s
126
+ # ⚠ 32 GB unified-memory laptops β€” borderline at Q4, try
127
+ # `make build QUANT=Q3_K_S`
128
  # (~12 GB) and trim num_ctx
129
  #
130
  # Measured data point (ASUS ROG Flow Z13 GZ302EA, Ryzen AI Max+ 395 +
README.md CHANGED
@@ -68,15 +68,12 @@ template β€” HF's Ollama bridge ingests those three files, not
68
  `Modelfile`):
69
 
70
  ```bash
71
- ollama run hf.co/FoolDev/janus-27b:Q4_K_M # ~17 GB, recommended
72
- ollama run hf.co/FoolDev/janus-27b:Q3_K_S # ~12 GB, smaller-footprint
73
  ```
74
 
75
- > **Use the explicit `:Q4_K_M` tag.** HF's Ollama bridge currently picks
76
- > `Janus-27B.Q3_K_S.gguf` (the alphabetically-first GGUF in the repo) as
77
- > the default for the bare `ollama run hf.co/FoolDev/janus-27b` form,
78
- > *not* Q4_K_M. If you don't pin a quant tag explicitly you'll get the
79
- > 12 GB Q3_K_S blob.
80
 
81
  Or build locally (uses this repo's `Modelfile`, kept in sync with the
82
  three bridge files) for any quant:
@@ -104,8 +101,8 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
104
  | Active params per token | 27 B | ~3 B |
105
  | Layers | 64 | 40 |
106
  | Hidden size | 5120 | 2048 |
107
- | Q4_K_M GGUF size | ~17 GB | ~19 GB |
108
- | Q3_K_S GGUF size | ~12 GB | n/a |
109
  | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
110
  | Multimodal (text path) | Yes | Yes |
111
  | Multimodal (vision via Ollama) | Broken upstream β€” see below | Broken upstream |
@@ -133,19 +130,16 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
133
  | `README.md` | This file |
134
 
135
  This repo ships two GGUFs to back the HF/Ollama "Use this model"
136
- widget β€” `Janus-27B.Q4_K_M.gguf` (~17 GB, recommended) and
137
- `Janus-27B.Q3_K_S.gguf` (~12 GB, smaller-footprint option for 16 GB
138
- GPUs / unified-memory laptops):
139
 
140
  ```bash
141
- ollama run hf.co/FoolDev/janus-27b:Q4_K_M # 17 GB, recommended
142
- ollama run hf.co/FoolDev/janus-27b:Q3_K_S # 12 GB, smaller-footprint
143
  ```
144
 
145
- The bare `ollama run hf.co/FoolDev/janus-27b` form (no quant tag)
146
- currently resolves to the alphabetically-first GGUF in the repo β€”
147
- `Janus-27B.Q3_K_S.gguf` β€” *not* Q4_K_M. Pin the tag explicitly to get
148
- the recommended Q4_K_M quant.
149
 
150
  For other quants or local builds, pull from
151
  [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
@@ -177,16 +171,14 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
177
  Two paths:
178
 
179
  ```bash
180
- # A. Pull straight from HF (uses the bundled GGUF + root-level
181
- # template / system / params files; pin a quant tag explicitly):
182
- ollama run hf.co/FoolDev/janus-27b:Q4_K_M # 17 GB, recommended
183
- ollama run hf.co/FoolDev/janus-27b:Q3_K_S # 12 GB, smaller-footprint
184
- # (Bare `ollama run hf.co/FoolDev/janus-27b` resolves to Q3_K_S, not
185
- # Q4_K_M β€” HF picks the alphabetically-first GGUF when no tag pinned.)
186
-
187
- # B. Build locally (lets you pick the quant):
188
  make build # Q4_K_M -> janus-27b
189
- make build QUANT=Q3_K_S # smaller quant
 
190
  make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
191
  ollama run janus-27b
192
  ```
@@ -209,13 +201,13 @@ python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-
209
 
210
  ### Local apps
211
 
212
- Both bundled GGUFs (Q4_K_M and Q3_K_S) work in any GGUF-compatible
213
- local app β€” point it at this repo and pick a quant.
214
 
215
  | App | How to load this model |
216
  |---|---|
217
- | **Ollama** | `ollama run hf.co/FoolDev/janus-27b:Q4_K_M` (or `:Q3_K_S`). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For local builds, `make build` uses `Modelfile`, which is kept in sync. **Pin the quant tag** β€” bare `hf.co/FoolDev/janus-27b` resolves to Q3_K_S, not Q4_K_M. |
218
- | **LM Studio** | Search β†’ `FoolDev/janus-27b` β†’ pick `Janus-27B.Q4_K_M.gguf` or `Janus-27B.Q3_K_S.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
219
  | **Jan** | Hub β†’ "Import from Hugging Face" β†’ `FoolDev/janus-27b`. Same template behavior as LM Studio. |
220
  | **llama.cpp** | `hf download FoolDev/janus-27b Janus-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Janus-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
221
  | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
@@ -325,7 +317,7 @@ The dense 27B is the easier of the two Janus models to deploy.
325
  | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
326
  | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
327
  | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
328
- | 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. Drop to Q3_K_S (~12 GB) and trim `num_ctx` for headroom. |
329
 
330
  Most numbers in this table are estimates from comparable models; the
331
  gradient is right but the absolute values will move Β±20% with prompt
 
68
  `Modelfile`):
69
 
70
  ```bash
71
+ ollama run hf.co/FoolDev/janus-27b # ~17 GB Q4_K_M (the only bundled quant)
 
72
  ```
73
 
74
+ For other quants (Q3_K_S ~12 GB, Q5_K_M ~20 GB, etc.), `make build
75
+ QUANT=Q3_K_S` downloads from `unsloth/Qwen3.6-27B-GGUF` and creates the
76
+ local Ollama tag. See [Quick start](#quick-start) below.
 
 
77
 
78
  Or build locally (uses this repo's `Modelfile`, kept in sync with the
79
  three bridge files) for any quant:
 
101
  | Active params per token | 27 B | ~3 B |
102
  | Layers | 64 | 40 |
103
  | Hidden size | 5120 | 2048 |
104
+ | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
105
+ | Q3_K_S GGUF size | ~12 GB (build locally via `make build QUANT=Q3_K_S`) | n/a |
106
  | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
107
  | Multimodal (text path) | Yes | Yes |
108
  | Multimodal (vision via Ollama) | Broken upstream β€” see below | Broken upstream |
 
130
  | `README.md` | This file |
131
 
132
  This repo ships two GGUFs to back the HF/Ollama "Use this model"
133
+ widget β€” `Janus-27B.Q4_K_M.gguf` (~17 GB):
 
 
134
 
135
  ```bash
136
+ ollama run hf.co/FoolDev/janus-27b # 17 GB Q4_K_M (only bundled quant)
 
137
  ```
138
 
139
+ For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_S`
140
+ downloads the smaller ~12 GB Q3_K_S quant from `unsloth/Qwen3.6-27B-GGUF`
141
+ and creates a local `janus-27b` Ollama tag (does not redistribute via
142
+ this repo).
143
 
144
  For other quants or local builds, pull from
145
  [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
 
171
  Two paths:
172
 
173
  ```bash
174
+ # A. Pull straight from HF (uses the bundled Q4_K_M + root-level
175
+ # template / system / params files):
176
+ ollama run hf.co/FoolDev/janus-27b # 17 GB Q4_K_M (only bundled quant)
177
+
178
+ # B. Build locally for a different quant (downloads from unsloth):
 
 
 
179
  make build # Q4_K_M -> janus-27b
180
+ make build QUANT=Q3_K_S # 12 GB smaller quant
181
+ make build QUANT=Q5_K_M # 20 GB higher quality
182
  make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
183
  ollama run janus-27b
184
  ```
 
201
 
202
  ### Local apps
203
 
204
+ The bundled `Janus-27B.Q4_K_M.gguf` works in any GGUF-compatible local
205
+ app β€” point it at this repo and load.
206
 
207
  | App | How to load this model |
208
  |---|---|
209
+ | **Ollama** | `ollama run hf.co/FoolDev/janus-27b` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_S` downloads from unsloth and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
210
+ | **LM Studio** | Search β†’ `FoolDev/janus-27b` β†’ pick `Janus-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
211
  | **Jan** | Hub β†’ "Import from Hugging Face" β†’ `FoolDev/janus-27b`. Same template behavior as LM Studio. |
212
  | **llama.cpp** | `hf download FoolDev/janus-27b Janus-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Janus-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
213
  | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 
317
  | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
318
  | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
319
  | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
320
+ | 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=Q3_K_S` (~12 GB) and trim `num_ctx` for headroom. |
321
 
322
  Most numbers in this table are estimates from comparable models; the
323
  gradient is right but the absolute values will move Β±20% with prompt
examples/README.md CHANGED
@@ -21,16 +21,13 @@ Easiest path β€” pull straight from HF (gets the bundled Q4_K_M GGUF +
21
  this repo's Modelfile in one step):
22
 
23
  ```bash
24
- ollama pull hf.co/FoolDev/janus-27b:Q4_K_M # 17 GB, recommended
25
- # or:
26
- ollama pull hf.co/FoolDev/janus-27b:Q3_K_S # 12 GB, smaller-footprint
27
  pip install requests
28
- MODEL=hf.co/FoolDev/janus-27b:Q4_K_M python ollama_chat.py
29
  ```
30
 
31
- > Pin the quant tag explicitly. Bare `ollama pull hf.co/FoolDev/janus-27b`
32
- > resolves to Q3_K_S (HF picks the alphabetically-first GGUF in the
33
- > repo), not Q4_K_M.
34
 
35
  Or build locally from this repo (uses the bundled `Janus-27B.Q4_K_M.gguf`,
36
  no edits required):
 
21
  this repo's Modelfile in one step):
22
 
23
  ```bash
24
+ ollama pull hf.co/FoolDev/janus-27b # 17 GB Q4_K_M (only bundled quant)
 
 
25
  pip install requests
26
+ MODEL=hf.co/FoolDev/janus-27b python ollama_chat.py
27
  ```
28
 
29
+ For the smaller-footprint Q3_K_S (~12 GB) or other quants, build
30
+ locally instead β€” see the parent repo's `make build QUANT=...` flow.
 
31
 
32
  Or build locally from this repo (uses the bundled `Janus-27B.Q4_K_M.gguf`,
33
  no edits required):
examples/ollama_chat.py CHANGED
@@ -9,10 +9,9 @@ Prerequisites (pick one):
9
  # or:
10
  $ ollama create janus-27b -f ../Modelfile
11
 
12
- B. Pull straight from HF (pin the quant tag explicitly β€” bare
13
- `hf.co/FoolDev/janus-27b` resolves to Q3_K_S, not Q4_K_M):
14
- $ ollama run hf.co/FoolDev/janus-27b:Q4_K_M
15
- # then set MODEL=hf.co/FoolDev/janus-27b:Q4_K_M below
16
 
17
  Then:
18
  $ ollama serve # usually already running
 
9
  # or:
10
  $ ollama create janus-27b -f ../Modelfile
11
 
12
+ B. Pull straight from HF (Q4_K_M is the only bundled quant):
13
+ $ ollama run hf.co/FoolDev/janus-27b
14
+ # then set MODEL=hf.co/FoolDev/janus-27b below
 
15
 
16
  Then:
17
  $ ollama serve # usually already running