File size: 4,159 Bytes
7197abd
6f2884f
70c2f62
6f2884f
 
 
7197abd
73e905b
e4beea4
73e905b
6f2884f
7d11d16
6f2884f
7d11d16
 
 
6f2884f
 
 
 
 
4811e8d
 
 
b20f7c9
6f2884f
7197abd
b20f7c9
7197abd
b20f7c9
6f2884f
5426482
ac94e67
5426482
 
 
 
4811e8d
73e905b
 
7197abd
6f2884f
b20f7c9
73e905b
7197abd
c336f44
 
83022eb
 
c336f44
 
83022eb
7197abd
70c2f62
 
 
73e905b
 
70c2f62
 
 
6f2884f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
73e905b
6f2884f
 
 
 
e4beea4
 
 
 
73e905b
 
e4beea4
 
 
73e905b
 
e4beea4
 
 
 
613559b
 
 
 
 
 
 
73e905b
e4beea4
613559b
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
# Thanatos-27B examples

Four minimal entry points. Pick the one that matches how you run models.

| File | Backend | When to use |
|---|---|---|
| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b` model created from the project `Modelfile`. **Text + tool calling** — vision via Ollama is broken upstream for this arch. |
| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
| `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-F16.gguf` and answers questions about an image. The only working vision path right now. |

All four apply the same Thanatos system prompt and sampling defaults
(`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
be consistent across backends modulo quantization noise. The three
non-Ollama scripts set them explicitly; `ollama_chat.py` inherits them
from the `Modelfile` / bridge files.

## Setup

### Ollama

Pull straight from HF (gets the bundled Q4_K_M GGUF + this repo's
root-level `template` / `system` / `params` files via HF's Ollama
bridge):

```bash
ollama pull hf.co/FoolDev/Thanatos-27B           # 17 GB Q4_K_M (only bundled quant)
pip install requests
MODEL=hf.co/FoolDev/Thanatos-27B python ollama_chat.py
```

If you pulled before the latest qwen35 re-stamp (HF commit
`e03e10e`) and still have a qwen36-stamped blob in your local
Ollama store, run `cd .. && make heal-hf` once to rebadge it
in place (qwen36 → qwen35, metadata-only, ~5 s) — the same
tag then loads. Fresh pulls after the re-stamp go straight
through.

For a non-bundled quant (e.g. Q3_K_S ~12 GB, Q5_K_M ~20 GB),
`make build QUANT=...` downloads from `unsloth/Qwen3.6-27B-GGUF`
and creates a local `thanatos-27b` tag:

```bash
cd ..  &&  make build QUANT=Q3_K_S  &&  cd examples
MODEL=thanatos-27b python ollama_chat.py
```

Or build a local tag from this repo's bundled GGUF without going
through the HF pull:

```bash
cd ..  &&  make load-bundle  &&  cd examples
MODEL=thanatos-27b python ollama_chat.py
```

For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
fetch it from `unsloth/Qwen3.6-27B-GGUF` and patch the `Modelfile`
`FROM` line into a temp copy automatically:

```bash
cd ..  &&  make build QUANT=Q5_K_M  &&  cd examples
python ollama_chat.py
```

### Transformers (safetensors)

```bash
pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
python transformers_quickstart.py            # 4-bit, ~16 GB VRAM
python transformers_quickstart.py --no-4bit  # bf16, ~54 GB VRAM
```

### llama-cpp-python (GGUF, no daemon)

```bash
pip install llama-cpp-python  # CPU-only build
python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
```

For GPU offload, rebuild llama-cpp-python with the matching backend — see
the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).

### Vision (image input)

```bash
# Pull the projector once (~927 MB):
hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf --local-dir .

pip install llama-cpp-python pillow
python llama_cpp_vision.py \
  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-F16.gguf \
  --image /path/to/photo.jpg \
  --prompt "Describe this image."
```

Why not Ollama? Ollama's Go engine has the `qwen35` / `qwen35moe`
arch entries (text inference works in 0.24+), but the C++ llama.cpp
fallback that Ollama switches to when an mmproj is attached still
lacks them. `ollama create` accepts the dual-`FROM` and `ollama show`
reports `vision` capability, but the first inference call fails with
`error loading model architecture: unknown model architecture:
'qwen35'` (verified empirically against the dense 27B +
`mmproj-F16.gguf`). Tracked in
[ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
Until that's fixed, llama.cpp / llama-cpp-python is the working path
for vision.