Commit 82677d0 by FoolDev · 1 Parent(s): 6f2884f

Fix GGUF filename pattern + correct architecture description

- Upstream unsloth/Qwen3.6-27B-GGUF uses dashes (Qwen3.6-27B-Q4_K_M.gguf),
not dots. Without this, scripts/build.sh 404s on first run. Updated
Modelfile, Modelfile.z13, examples/README, examples/llama_cpp_quickstart,
and scripts/build.sh.
- scripts/build.sh: list known-good quant suffixes in a comment so people
don't typo Q4_KM, q4_k_m, etc.
- README architecture section: replace pure-GQA description with the
actual hybrid Gated DeltaNet + Gated Attention layout from the upstream
Qwen3.6 docs; add MTP note and YaRN ~1M ceiling.
- README comparison table: reflect actual Q4_K_M size (17 GB), add Q3_K_S
row, and qualify min-host-memory by quant + context.

Modelfile CHANGED
@@ -13,7 +13,7 @@
  #
  # Replace the path below with wherever you keep the GGUF.

- FROM ./Qwen3.6-27B.Q4_K_M.gguf
+ FROM ./Qwen3.6-27B-Q4_K_M.gguf

  # Sampling tuned for reasoning + general use. See README "Recommended sampling"
  # for creative/RP alternatives.
Modelfile.z13 CHANGED
@@ -8,7 +8,7 @@
  # and we compensate with top_k instead.
  #
  # Recommended base GGUF for this profile:
- # https://huggingface.co/unsloth/Qwen3.6-27B-GGUF -> Qwen3.6-27B.Q3_K_S.gguf
+ # https://huggingface.co/unsloth/Qwen3.6-27B-GGUF -> Qwen3.6-27B-Q3_K_S.gguf
  #
  # Usage:
  # ollama create janus-27b-z13 -f Modelfile.z13
@@ -20,7 +20,7 @@
  # export OLLAMA_NUM_PARALLEL=1 # don't fan out across requests
  # export HSA_OVERRIDE_GFX_VERSION=11.5.1 # if ROCm doesn't auto-detect gfx1150

- FROM ./Qwen3.6-27B.Q3_K_S.gguf
+ FROM ./Qwen3.6-27B-Q3_K_S.gguf

  PARAMETER temperature 0.6
  PARAMETER top_p 0.95
README.md CHANGED
@@ -71,8 +71,9 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
  | Active params per token | 27 B | ~3 B |
  | Layers | 64 | 40 |
  | Hidden size | 5120 | 2048 |
- | Q4_K_M GGUF size | ~16 GB | ~19 GB |
- | Min host memory | ~24 GB | ~38 GB |
+ | Q4_K_M GGUF size | ~17 GB | ~19 GB |
+ | Q3_K_S GGUF size | ~12 GB | n/a |
+ | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
  | Multimodal | Yes (vision) | Yes (vision) |
  | Max context | 262 144 | 262 144 |

@@ -96,11 +97,14 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
  ## Architecture

  - Qwen 3.6 dense, 27B parameters, 64 transformer layers
- - 24 attention heads, 4 KV heads (GQA), head_dim 256
- - Hidden size 5120, intermediate size 17408 (~3.4× ratio)
+ - **Hybrid attention stack**: 16 repeats of `[3 × (Gated DeltaNet → FFN) 1 × (Gated Attention → FFN)]`
+ - Gated DeltaNet (linear attention): 48 V-heads, 16 QK-heads, head_dim 128
+ - Gated Attention (softmax): 24 Q-heads, 4 KV-heads (GQA), head_dim 256, partial RoPE (factor 0.25)
+ - Hidden size 5120, FFN intermediate 17408 (~3.4× ratio)
  - Vocab 248,320 (shared with 35B-A3B sibling)
- - 262k native context, extensible with YaRN
+ - 262 144 native context, extensible to ~1 M with YaRN
  - Vision + video support via upstream `mmproj` (not in this repo)
+ - Multi-token prediction (MTP) head trained for speculative decoding

  ## Quick start
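Review note on the README.md change: the new architecture bullets imply that only 16 of the 64 layers are softmax attention layers, which is what makes the "min host memory by quant + context" qualification meaningful. The sketch below expands the stated layout and estimates the per-token KV cache from the numbers in the diff. The function names are hypothetical, and two things are assumptions rather than facts from this commit: a 2-byte (fp16/bf16) cache element, and that only the Gated Attention layers keep a cache that grows with context (the Gated DeltaNet layers are treated as fixed-size state and ignored).

```python
# Sketch only. Head counts and the 16 x [3 + 1] layout come from the README
# bullets in this diff; everything marked "assumption" is not from the commit.

REPEATS = 16                  # "16 repeats of [3 x DeltaNet, 1 x Gated Attention]"
DELTANET_PER_REPEAT = 3
ATTENTION_PER_REPEAT = 1

KV_HEADS = 4                  # Gated Attention KV-heads (GQA)
HEAD_DIM = 256                # Gated Attention head_dim
BYTES_PER_ELEMENT = 2         # assumption: fp16/bf16 KV cache


def layer_layout() -> list[str]:
    """Expand the repeating block into one label per layer."""
    block = ["deltanet"] * DELTANET_PER_REPEAT + ["attention"] * ATTENTION_PER_REPEAT
    return block * REPEATS


def kv_cache_bytes(context_tokens: int) -> int:
    """Estimate K+V cache size, counting only the softmax attention layers."""
    attention_layers = layer_layout().count("attention")
    # Factor 2 covers the separate K and V tensors.
    per_token = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT * attention_layers
    return per_token * context_tokens


if __name__ == "__main__":
    layout = layer_layout()
    print(len(layout), "layers,", layout.count("attention"), "with a growing KV cache")
    for ctx in (8192, 262144):
        print(ctx, "tokens:", kv_cache_bytes(ctx) / 2**30, "GiB")
```

Under those assumptions the cache costs 64 KiB per token, so about 0.5 GiB at 8K context and 16 GiB at the full 262 144. A real runtime may allocate differently (quantized cache, recurrent state, compute buffers), so treat this as a sanity check on the table, not a measurement.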
 
examples/README.md CHANGED
@@ -18,9 +18,9 @@ be consistent across backends modulo quantization noise.

  ```bash
  # 1. Pull a Qwen 3.6 27B GGUF, e.g. unsloth/Qwen3.6-27B-GGUF
- hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B.Q4_K_M.gguf --local-dir .
+ hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir .

- # 2. Edit ../Modelfile -> FROM ./Qwen3.6-27B.Q4_K_M.gguf
+ # 2. Edit ../Modelfile -> FROM ./Qwen3.6-27B-Q4_K_M.gguf

  # 3. Build the model
  ollama create janus-27b -f ../Modelfile
@@ -42,7 +42,7 @@ python transformers_quickstart.py --no-4bit # bf16, ~54 GB VRAM

  ```bash
  pip install llama-cpp-python # CPU-only build
- python llama_cpp_quickstart.py /path/to/Qwen3.6-27B.Q4_K_M.gguf --gpu-layers 99
+ python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
  ```

  For GPU offload, rebuild llama-cpp-python with the matching backend — see
examples/llama_cpp_quickstart.py CHANGED
@@ -14,7 +14,7 @@ For GPU offload (CUDA / Metal / ROCm), install with the matching extras:
  CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:

  Usage:
- python llama_cpp_quickstart.py /path/to/Qwen3.6-27B.Q4_K_M.gguf
+ python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf
  python llama_cpp_quickstart.py /path/to/file.gguf --gpu-layers 99
  python llama_cpp_quickstart.py /path/to/file.gguf --prompt "..."
  """
scripts/build.sh CHANGED
@@ -14,7 +14,13 @@ QUANT="${1:-${QUANT:-Q4_K_M}}"
  PROFILE="${2:-${PROFILE:-default}}"

  REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
- GGUF_NAME="Qwen3.6-27B.${QUANT}.gguf"
+ # Upstream uses dashes, e.g. Qwen3.6-27B-Q4_K_M.gguf. Quants known to exist
+ # at unsloth/Qwen3.6-27B-GGUF (as of 2026-04):
+ # Q3_K_S Q3_K_M Q4_0 Q4_1 Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0
+ # IQ4_XS IQ4_NL
+ # UD-IQ2_XXS UD-IQ2_M UD-Q2_K_XL UD-IQ3_XXS UD-Q3_K_XL UD-Q4_K_XL
+ # UD-Q5_K_XL UD-Q6_K_XL UD-Q8_K_XL
+ GGUF_NAME="Qwen3.6-27B-${QUANT}.gguf"
  ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
  GGUF_PATH="${ROOT}/${GGUF_NAME}"
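Review note on scripts/build.sh: the new comment lists the known-good suffixes, but the script still accepts any `QUANT` string, so a typo such as `Q4_KM` or `q4_k_m` would presumably still only surface as a 404 at download time. A minimal sketch of how the list could be enforced before any download is attempted is below. It is not part of this commit: `KNOWN_QUANTS` and `gguf_name` are hypothetical names, the set is copied from the comment above (as of 2026-04), and the check is deliberately case-sensitive because the upstream filenames are.

```python
import difflib

# Copied from the comment in scripts/build.sh; may go stale as upstream changes.
KNOWN_QUANTS = frozenset("""
Q3_K_S Q3_K_M Q4_0 Q4_1 Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0
IQ4_XS IQ4_NL
UD-IQ2_XXS UD-IQ2_M UD-Q2_K_XL UD-IQ3_XXS UD-Q3_K_XL UD-Q4_K_XL
UD-Q5_K_XL UD-Q6_K_XL UD-Q8_K_XL
""".split())


def gguf_name(quant: str, base: str = "Qwen3.6-27B") -> str:
    """Return the dash-separated GGUF filename, rejecting unknown quant suffixes."""
    if quant not in KNOWN_QUANTS:
        # Suggest the closest known suffix so common typos are easy to fix.
        close = difflib.get_close_matches(quant.upper(), sorted(KNOWN_QUANTS), n=1)
        hint = f" Did you mean {close[0]}?" if close else ""
        raise ValueError(f"Unknown quant {quant!r}.{hint}")
    # Same pattern as GGUF_NAME in build.sh: dashes, not dots.
    return f"{base}-{quant}.gguf"


if __name__ == "__main__":
    print(gguf_name("Q4_K_M"))
    try:
        gguf_name("Q4_KM")
    except ValueError as err:
        print(err)
```

The same guard could be written directly in bash with a `case` statement; Python is used here only to keep the example easy to test in isolation.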