Instructions to use FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use FoolDev/Thanatos-27B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")
```

- llama-cpp-python
How to use FoolDev/Thanatos-27B with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="FoolDev/Thanatos-27B",
    filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
            ],
        }
    ]
)
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use FoolDev/Thanatos-27B with llama.cpp:
Install from brew
```bash
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
```
Install from WinGet (Windows)
```powershell
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
```
Use pre-built binary
```bash
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
```
Build from source code
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M
```
Use Docker
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use FoolDev/Thanatos-27B with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FoolDev/Thanatos-27B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
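The curl call above sends an OpenAI-style chat body that mixes a `text` part and an `image_url` part. A minimal standard-library sketch that builds the same body from Python is below. It adds one assumption of ours: a local image file is inlined as a base64 `data:` URI, which OpenAI-compatible servers commonly accept, but you should confirm that against your vLLM, SGLang, or llama-server build. `BASE_URL` and the helper names are illustrative and not part of any of these projects.

```python
import base64
import json
import mimetypes
import urllib.request
from pathlib import Path

BASE_URL = "http://localhost:8000/v1"  # vLLM port from the curl example above
MODEL_ID = "FoolDev/Thanatos-27B"


def image_part(source: str) -> dict:
    """Return an OpenAI-style image_url content part.

    http(s) and data: URLs pass through unchanged; anything else is read as a
    local file and inlined as a base64 data URI (assumed server support).
    """
    if source.startswith(("http://", "https://", "data:")):
        url = source
    else:
        path = Path(source)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}


def build_request(prompt: str, image: str, model: str = MODEL_ID) -> dict:
    """Build the same chat body the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}, image_part(image)],
            }
        ],
    }


def send(body: dict, base_url: str = BASE_URL) -> str:
    """POST body to {base_url}/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With the vLLM server running, `print(send(build_request("Describe this image in one sentence.", "photo.jpg")))` should print the reply; for the SGLang server, pass `base_url="http://localhost:30000/v1"`.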
- SGLang
How to use FoolDev/Thanatos-27B with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "FoolDev/Thanatos-27B" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FoolDev/Thanatos-27B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FoolDev/Thanatos-27B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Ollama
How to use FoolDev/Thanatos-27B with Ollama:
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
- Unsloth Studio
How to use FoolDev/Thanatos-27B with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting
```
Using HuggingFace Spaces for Unsloth
No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for FoolDev/Thanatos-27B to start chatting.
- Pi
How to use FoolDev/Thanatos-27B with Pi:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "FoolDev/Thanatos-27B:Q4_K_M" }
      ]
    }
  }
}
```

Run Pi
```bash
# Start Pi in your project directory:
pi
```
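Pasting the snippet over an existing `~/.pi/agent/models.json` would drop any providers already configured there. A minimal Python sketch that merges the `llama-cpp` provider block into the file instead is below; it assumes only the JSON shape shown in the snippet and knows nothing about other Pi settings. The helper names are hypothetical.

```python
import copy
import json
from pathlib import Path
from typing import Optional

# Provider block copied from the models.json snippet above.
PROVIDER_NAME = "llama-cpp"
PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "FoolDev/Thanatos-27B:Q4_K_M"}],
}


def merge_provider(config: dict, name: str = PROVIDER_NAME, provider: dict = PROVIDER) -> dict:
    """Return a merged copy of config; the input is not modified.

    Other providers and other top-level keys are kept as-is. For this
    provider, scalar fields (baseUrl, api, apiKey) are overwritten and a
    model id is appended only if it is not listed yet.
    """
    merged = copy.deepcopy(config)
    entry = merged.setdefault("providers", {}).setdefault(name, {})
    for key, value in provider.items():
        if key != "models":
            entry[key] = value
    models = entry.setdefault("models", [])
    known = {m.get("id") for m in models}
    for model in provider.get("models", []):
        if model["id"] not in known:
            models.append(copy.deepcopy(model))
            known.add(model["id"])
    return merged


def update_models_file(path: Optional[Path] = None) -> None:
    """Merge PROVIDER into models.json, creating the file if it is missing."""
    path = path or Path.home() / ".pi" / "agent" / "models.json"
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_provider(config), indent=2) + "\n", encoding="utf-8")
```

Running `update_models_file()` twice leaves a single model entry, so it is safe to re-run after the server tag changes.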
- Hermes Agent
How to use FoolDev/Thanatos-27B with Hermes Agent:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M
```
Run Hermes
hermes
- Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
- Lemonade
How to use FoolDev/Thanatos-27B with Lemonade:
Pull the model
```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M
```
Run and chat with the model
lemonade run user.Thanatos-27B-Q4_K_M
List all available models
lemonade list
Rename to Thanatos-Heretic-27B and swap base to llmfan46 Heretic v2
Project rename Thanatos-27B -> Thanatos-Heretic-27B (Ollama tag
thanatos-heretic-27b) and immediate-base swap from Qwen/Qwen3.6-27B
to llmfan46/Qwen3.6-27B-uncensored-heretic-v2 (an uncensored Heretic
abliteration of the same Qwen 3.6 27B dense arch).
Docs + Modelfile + scripts only; the bundled Thanatos-27B.Q4_K_M.gguf
LFS pointer is unchanged. The blob is still the legacy pre-Heretic
Qwen quant; the README "Bundled blob status" callout and Known Limitations
section warn users until the rebundle ships.
- scripts/build.sh: REPO_ID -> llmfan46 Heretic GGUF, filename
pattern Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf, default
TAG thanatos-heretic-27b. Q3_K_S replaced by Q3_K_M throughout
(Heretic repo doesn't publish Q3_K_S).
- scripts/fetch_vision.sh: PRECISION=BF16, REPO_ID -> llmfan46,
FILE_NAME=Qwen3.6-27B-mmproj-BF16.gguf. Unsloth's mmproj-F16.gguf
documented as a reference fallback.
- README: tagline, base_model frontmatter, badge, Vision section,
Related models, Credits, hardware/quick-start tables all flipped
to the Heretic lineage. Architecture section unchanged; Heretic
v2 is qwen35-stamped like vanilla Qwen 3.6 27B.
- CHANGELOG: top entry documents the rename + base swap; historical
entries below intentionally left referring to Thanatos-27B as
they happened on the old repo identity.
HF repo migration (new FoolDev/Thanatos-Heretic-27B repo + remote
re-point + old-repo migration notice) and Heretic re-quantization
rebundle are separate follow-ups.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- CHANGELOG.md +79 -0
- CITATION.cff +20 -12
- Makefile +5 -5
- Modelfile +18 -17
- README.md +131 -91
- examples/README.md +21 -20
- examples/llama_cpp_quickstart.py +1 -1
- examples/llama_cpp_vision.py +8 -8
- examples/ollama_chat.py +5 -5
- examples/transformers_quickstart.py +10 -7
- scripts/bench.sh +4 -4
- scripts/build.sh +11 -12
- scripts/check.sh +6 -4
- scripts/check_bridge_sync.py +2 -2
- scripts/fetch_vision.sh +12 -8
- scripts/heal_hf_pull.sh +8 -8
- scripts/install-hooks.sh +1 -1
- scripts/load_bundle.sh +7 -7
- scripts/smoke_test.sh +6 -6
- scripts/verify_arch.py +4 -4
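The `scripts/build.sh` and `scripts/fetch_vision.sh` bullets in the commit message describe how download filenames are now derived. The Python sketch below mirrors that naming; the helper names are hypothetical and not taken from the scripts, and only `Q3_K_S` is treated as unavailable because that is the one gap the commit message names.

```python
# Values taken from the commit message above; the helpers are hypothetical.
GGUF_REPO = "llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF"
DEFAULT_QUANT = "Q4_K_M"
MMPROJ_PRECISION = "BF16"  # fetch_vision.sh default after this commit
# The one quant the commit says the Heretic repo does not publish,
# mapped to the replacement it recommends.
UNPUBLISHED = {"Q3_K_S": "Q3_K_M"}


def gguf_filename(quant: str = DEFAULT_QUANT) -> str:
    """Mirror build.sh's pattern Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf."""
    if quant in UNPUBLISHED:
        raise ValueError(
            f"{quant} is not published in {GGUF_REPO}; use {UNPUBLISHED[quant]} instead"
        )
    return f"Qwen3.6-27B-uncensored-heretic-v2-{quant}.gguf"


def mmproj_filename(precision: str = MMPROJ_PRECISION) -> str:
    """Mirror fetch_vision.sh's FILE_NAME; only BF16 is documented for this repo."""
    return f"Qwen3.6-27B-mmproj-{precision}.gguf"


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Hub download URL, same /resolve/main/ layout as the README banner link."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"
```

To actually fetch a quant, the name can be passed to `huggingface_hub.hf_hub_download(repo_id=GGUF_REPO, filename=gguf_filename("Q5_K_M"))`.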
```diff
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,85 @@ and documentation**, not the underlying base model.
 
 ## [Unreleased]
 
+### Changed (project rename + base swap to Heretic v2)
+- **Renamed project `Thanatos-27B` → `Thanatos-Heretic-27B`** and
+**swapped immediate base from `Qwen/Qwen3.6-27B` (vanilla) →
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`** (an uncensored
+Heretic-style abliteration of the dense Qwen 3.6 27B base).
+README, Modelfile preamble, `CITATION.cff`, all scripts, and
+all examples now refer to `Thanatos-Heretic-27B` /
+`thanatos-heretic-27b` (lowercase Ollama tag) and pull GGUFs
+from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`.
+Architecture is unchanged (still Qwen 3.6 dense 27B,
+`qwen35`-stamped, hybrid SSM+attention stack); only the
+weights' finetune lineage moves.
+- **`base_model:` frontmatter** flipped to
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`;
+`base_model_relation: finetune` added; `heretic` and
+`uncensored` tags appended. `library_name: transformers` stays
+for HF Hub placement (snippet trap accepted as before;
+`config.json` is still intentionally absent).
+- **`scripts/build.sh`** now points `REPO_ID` at
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and uses the
+filename pattern `Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf`.
+Default `TAG` is `thanatos-heretic-27b`. Note: no `Q3_K_S` in
+the Heretic GGUF repo; use `Q3_K_M` for the smallest practical
+quant (`Modelfile` preamble and README hardware/quick-start
+tables updated accordingly).
+- **`scripts/fetch_vision.sh`** defaults flipped to
+`PRECISION=BF16` and
+`REPO_ID=llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`
+(`Qwen3.6-27B-mmproj-BF16.gguf`, ~931 MB). Unsloth's
+`mmproj-F16.gguf` is documented as a reference fallback for
+users who want the F16/F32 variants.
+- **Bundled blob status:** the in-repo
+`Thanatos-27B.Q4_K_M.gguf` LFS pointer is unchanged; it is still the
+legacy pre-Heretic Qwen 3.6 27B Q4_K_M quant
+(`5ed60d0af4650a854b1755bd392f9aef4872643dc25a254bc68043fa638392a0`).
+Behaves identically to vanilla Qwen 3.6 27B for now. Heretic v2
+re-quantization + rebundle (file rename to
+`Thanatos-Heretic-27B.Q4_K_M.gguf` + LFS swap) is a separate
+follow-up; users wanting actual Heretic behavior today should
+use the local-build path (`make build`).
+- **HF repo migration:** the local git remote still points at
+`huggingface.co/FoolDev/Thanatos-27B`. A new HF repo at
+`FoolDev/Thanatos-Heretic-27B` needs to be created and the
+remote re-pointed before the next push. Migration notice on the
+old `FoolDev/Thanatos-27B` model card is pending.
+- **CHANGELOG history left intact:** entries below this one still
+reference `Thanatos-27B` and the bundled-blob saga as they
+happened on the old repo identity. Historical, not retconned.
+
+### Changed (HF tag-surface cleanup: `general.tags` strip + `config.json` drop)
+- **Stripped `general.tags` KV from the bundled GGUF** (`9cc78e7`,
+2026-05-20). Drops the upstream-baked `unsloth` and
+`image-text-to-text` tags that `llama.cpp`'s converter copies
+into GGUFs from `unsloth/Qwen3.6-27B-GGUF`; both surfaced on
+the HF model page and obscured this card's positioning.
+Tensors byte-identical; only the `general.tags` KV is gone.
+- **Dropped `config.json`** (`5302d10`, 2026-05-20) to suppress
+HF's tag auto-detector surfacing `qwen3_5` in the repo header
+(the detector reads `architectures` from `config.json`).
+Consequence: `AutoModelForCausalLM.from_pretrained(
+"FoolDev/Thanatos-27B")` no longer works on its own.
+`examples/transformers_quickstart.py` and the README
+transformers note now point users at upstream
+`Qwen/Qwen3.6-27B` directly (tensors byte-identical, so the
+result is the same model). `library_name: transformers` stays
+in the model-card metadata for Hub placement.
+
+### Reverted (safetensors mirror experiment)
+- **Mirrored Qwen/Qwen3.6-27B's safetensors set into this repo
+(`b420378`, 2026-05-20), reverted within the day** (`50f6684`
++ `9cf363e`, 2026-05-21). 15 sharded `.safetensors` + tokenizer
++ processor configs (~58 GB) were briefly added so users
+wanting GGUF + safetensors in one place could skip a second
+`hf download`; reverted on reflection. Transformers users
+continue to pull from upstream `Qwen/Qwen3.6-27B`. `.gitignore`
+whitelist for the Qwen sharded naming pattern (`0c5bee4`) was
+removed alongside the mirror; `*.safetensors` block rule is
+back to baseline.
+
 ### Changed (5th round trip: qwen36 → qwen35, retested next-day)
 - **Bundle re-stamped `general.architecture: 'qwen36'` → `'qwen35'`**
 in `hf upload` commit `e03e10e` (HF), 2026-05-20 midday - 8
```
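The "Bundled blob status" entry records a SHA-256 for the legacy Q4_K_M blob that still sits behind `Thanatos-27B.Q4_K_M.gguf`. Assuming that value is the blob's Git LFS oid (LFS pointers store the SHA-256 of the content), a short standard-library sketch can tell whether a local file is that legacy blob, or still an un-smudged LFS pointer to it. The function names are ours, not the repo's.

```python
import hashlib
from pathlib import Path
from typing import Optional

# SHA-256 of the legacy pre-Heretic Q4_K_M blob, as recorded in the CHANGELOG.
LEGACY_BUNDLE_SHA256 = "5ed60d0af4650a854b1755bd392f9aef4872643dc25a254bc68043fa638392a0"
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1"


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash in 1 MiB chunks so a ~17 GB GGUF never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lfs_pointer_oid(path: Path) -> Optional[str]:
    """Return the sha256 oid if path is an un-smudged Git LFS pointer, else None."""
    with path.open("rb") as fh:
        head = fh.read(1024)
    if not head.startswith(LFS_SPEC_LINE):
        return None
    for line in head.decode("utf-8", "replace").splitlines():
        if line.startswith("oid sha256:"):
            return line[len("oid sha256:"):].strip()
    return None


def is_legacy_bundle(path: Path) -> bool:
    """True if path is (or points at) the pre-Heretic blob."""
    oid = lfs_pointer_oid(path)
    if oid is not None:
        return oid == LEGACY_BUNDLE_SHA256
    return sha256_file(path) == LEGACY_BUNDLE_SHA256
```

Checking the pointer first avoids hashing 17 GB when the file was never smudged.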
```diff
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -1,21 +1,22 @@
 cff-version: 1.2.0
-title: "Thanatos-27B: A Dense Distillation Wrapper for Qwen 3.6 27B"
+title: "Thanatos-Heretic-27B: A Dense Distillation Wrapper for llmfan46's Qwen 3.6 27B Uncensored Heretic v2"
 message: "If you use this model card or its accompanying files, please cite as below."
 type: software
 authors:
 - name: FoolDev
 website: "https://huggingface.co/FoolDev"
-repository-code: "https://huggingface.co/FoolDev/Thanatos-27B"
-url: "https://huggingface.co/FoolDev/Thanatos-27B"
+repository-code: "https://huggingface.co/FoolDev/Thanatos-Heretic-27B"
+url: "https://huggingface.co/FoolDev/Thanatos-Heretic-27B"
 abstract: >-
-Thanatos-27B is a personal repackaging of
-
-
-
-
-
-
-
+Thanatos-Heretic-27B is a personal repackaging of llmfan46's uncensored
+Heretic v2 finetune of Qwen 3.6 27B (dense), with Claude Opus 4.7 in
+the reasoning teacher slot. The repository ships an Ollama Modelfile,
+sampling defaults, usage examples, and a single ready-to-run GGUF
+(Q4_K_M ~17 GB) so the HF "Use this model" widget surfaces a one-liner
+Ollama snippet. Other quants (Q3_K_M, Q5_K_M, Q6_K, etc.) and the
+Heretic safetensors are pulled from upstream
+(llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF and the matching
+non-GGUF repo) on demand rather than redistributed.
 keywords:
 - qwen
 - qwen3.6
@@ -23,10 +24,17 @@ keywords:
 - distillation
 - reasoning
 - llm
+- heretic
+- uncensored
 license: Apache-2.0
 references:
 - type: software
-title: "Qwen3.6-27B"
+title: "Qwen3.6-27B-uncensored-heretic-v2 (immediate base)"
+authors:
+- name: llmfan46
+url: "https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2"
+- type: software
+title: "Qwen3.6-27B (upstream base)"
 authors:
 - name: Alibaba Qwen Team
 url: "https://huggingface.co/Qwen/Qwen3.6-27B"
```
|
```diff
--- a/Makefile
+++ b/Makefile
@@ -1,11 +1,11 @@
-# Thanatos-27B convenience Makefile.
+# Thanatos-Heretic-27B convenience Makefile.
 #
 # All work is delegated to scripts/*; this file just gives common
 # operations short, discoverable names.
 #
 # Variables you can override on the command line:
 # QUANT GGUF quant suffix (default: Q4_K_M)
-# TAG Ollama model tag (default: thanatos-27b)
+# TAG Ollama model tag (default: thanatos-heretic-27b)
 # GGUF_PATH path to existing GGUF (skip the download)
 # MODEL model tag for smoke (default: $(TAG))
 #
@@ -19,7 +19,7 @@
 # make clean
 
 QUANT ?= Q4_K_M
-TAG ?= thanatos-27b
+TAG ?= thanatos-heretic-27b
 MODEL ?= $(TAG)
 
 .DEFAULT_GOAL := help
@@ -43,7 +43,7 @@ build: ## Download qwen35-stamped GGUF from unsloth and run 'ollama create' (lo
 load-bundle: ## Load THIS repo's bundled GGUF into a local Ollama tag (smudge LFS + ollama create).
 TAG=$(TAG) ./scripts/load_bundle.sh
 
-heal-hf: ## Heal an already-pulled hf.co/FoolDev/Thanatos-27B tag in-store (rebadge blob + manifest digest).
+heal-hf: ## Heal an already-pulled hf.co/FoolDev/Thanatos-Heretic-27B tag in-store (rebadge blob + manifest digest).
 ./scripts/heal_hf_pull.sh
 
 smoke: ## Verify the model is reachable and round-trips.
@@ -69,6 +69,6 @@ hooks: ## Install scripts/check.sh as the git pre-commit hook.
 
 clean: ## Remove local GGUF copies and ephemeral caches in this repo.
 @echo "[*] removing local GGUFs and ephemeral caches in $$PWD"
-@rm -f ./Qwen3.6-27B-*.gguf ./mmproj-*.gguf ./Thanatos-27B.*.qwen[0-9]*.gguf
+@rm -f ./Qwen3.6-27B-*.gguf ./mmproj-*.gguf ./Thanatos-Heretic-27B.*.qwen[0-9]*.gguf
 @rm -rf ./.cache __pycache__ examples/__pycache__
 @echo "[+] clean"
```
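The `clean` target deletes files by shell glob. Python's `fnmatch.fnmatchcase` uses the same `*` and `[0-9]` syntax, so it can show which files each version of the pattern list would remove. The filenames in the usage below are hypothetical examples, not files this commit is known to create. Two things follow from the patterns themselves: the bundled `Thanatos-27B.Q4_K_M.gguf` matches neither list, and a `Thanatos-27B.*.qwenNN.gguf` copy made before the rename is no longer covered by the new third pattern.

```python
from fnmatch import fnmatchcase

# The three `rm -f` globs from the clean target, with the leading "./" dropped.
OLD_PATTERNS = ["Qwen3.6-27B-*.gguf", "mmproj-*.gguf", "Thanatos-27B.*.qwen[0-9]*.gguf"]
NEW_PATTERNS = ["Qwen3.6-27B-*.gguf", "mmproj-*.gguf", "Thanatos-Heretic-27B.*.qwen[0-9]*.gguf"]


def cleaned(filenames, patterns):
    """Return the filenames `make clean` would delete under the given globs.

    fnmatchcase is case-sensitive, like the shell on Linux and macOS.
    """
    return [name for name in filenames if any(fnmatchcase(name, p) for p in patterns)]


if __name__ == "__main__":
    files = [
        "Thanatos-27B.Q4_K_M.gguf",
        "Thanatos-27B.Q4_K_M.qwen35.gguf",
        "Thanatos-Heretic-27B.Q4_K_M.qwen35.gguf",
        "Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf",
        "mmproj-F16.gguf",
    ]
    print("old:", cleaned(files, OLD_PATTERNS))
    print("new:", cleaned(files, NEW_PATTERNS))
```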
```diff
--- a/Modelfile
+++ b/Modelfile
@@ -1,4 +1,4 @@
-# Thanatos-27B: Ollama wrapper around Qwen 3.6 27B (dense)
+# Thanatos-Heretic-27B: Ollama wrapper around Qwen 3.6 27B (dense)
 #
 # Text + tool calling. Vision via Ollama is currently broken for this
 # architecture (ollama/ollama#15898: the qwen35 arch entries are in
@@ -10,21 +10,22 @@
 # stamped `general.architecture: 'qwen35'`, the upstream-canonical
 # arch entry every released llama.cpp / Ollama loads under for the
 # Qwen 3.5 / 3.6 hybrid SSM + attention family. `ollama create
-# thanatos-27b -f Modelfile && ollama run thanatos-27b` loads it
+# thanatos-heretic-27b -f Modelfile && ollama run thanatos-heretic-27b` loads it
 # directly. See README "Architecture" for the full stamp history
 # (eight flips between qwen35 and qwen36, settled on qwen35 at
 # `e03e10e` after the 4th qwen36 round trip had its friction
 # re-tested in a fresh next-day session).
 #
-# For other quants (
-# downloads the chosen quant from
-#
-#
-#
+# For other quants (Q3_K_M, Q5_K_M, Q6_K, etc.), `make build QUANT=Q3_K_M`
+# downloads the chosen quant from llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF
+# (filename pattern Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf) and
+# patches FROM in a temp Modelfile copy. Note: no Q3_K_S in this repo;
+# use Q3_K_M for the smallest practical quant.
 #
 # Other GGUF sources (use with `make build GGUF_PATH=...`):
-# https://huggingface.co/
-# https://huggingface.co/
+# https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF # primary (this repo's default)
+# https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF # MTP head preserved
+# https://huggingface.co/unsloth/Qwen3.6-27B-GGUF # vanilla Qwen 3.6 (pre-Heretic)
 
 FROM ./Thanatos-27B.Q4_K_M.gguf
 
@@ -140,14 +141,14 @@ Behavior rules:
 # (6182 tokens / 501.9 s; 12.67 / 12.55 / 12.25 short/medium/long)
 # Q3_K_S: 11.70 tok/s aggregate (run 2, 2026-05-19 evening)
 # (8009 tokens / 684.0 s; 12.23 / 12.12 / 11.66 short/medium/long)
-# Second run measured against `thanatos-27b:latest`
-# `make build QUANT=Q3_K_S`
-#
-#
-#
-#
-#
-#
+# Second run measured against a `thanatos-27b:latest` (pre-rename)
+# built via `make build QUANT=Q3_K_S` against the then-current
+# unsloth/Qwen3.6-27B-GGUF source. Aggregate is 4.9% below
+# run 1 (within the ±20% noise band); slightly longer
+# per-prompt outputs this run (8009 vs 6182 tokens) likely
+# contribute the difference, plus late-in-session thermal
+# pressure on the Strix Halo iGPU.
+# (Heretic v2 base is not benched here yet; rebundle pending.)
 # Q4_K_M: 9.31 tok/s aggregate (run 1)
 # (5356 tokens / 574.9 s; 9.48 / 9.43 / 9.28 short/medium/long)
 # Q4_K_M: 9.19 tok/s aggregate (run 2, 2026-05-19 afternoon)
```
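The Modelfile comments report aggregate throughput as total tokens over total seconds, and `scripts/bench.sh` is described as reading Ollama's `eval_count` / `eval_duration` response fields. Ollama reports `eval_duration` in nanoseconds, so that arithmetic can be sketched as below, assuming the seconds in the comments are summed decode time. The helper names are ours, not the script's.

```python
NS_PER_S = 1_000_000_000  # Ollama durations are reported in nanoseconds


def tok_per_s(eval_count: int, eval_duration_ns: int) -> float:
    """Decode throughput for a single response."""
    return eval_count / (eval_duration_ns / NS_PER_S)


def aggregate_tok_per_s(responses) -> float:
    """Total generated tokens over total decode seconds across responses.

    Longer responses weigh more here than in a plain mean of per-prompt rates.
    """
    tokens = sum(r["eval_count"] for r in responses)
    seconds = sum(r["eval_duration"] for r in responses) / NS_PER_S
    return tokens / seconds


def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0
```

Applied to the two Q3_K_S runs, `pct_change(8009 / 684.0, 6182 / 501.9)` comes out at roughly -4.9, consistent with the "4.9% below run 1" note.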
````diff
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
 ---
 license: apache-2.0
 base_model:
--
 datasets:
 - crownelius/Creative_Writing_ShareGPT_Enhanced
 - microsoft/rStar-Coder
@@ -40,26 +41,28 @@ tags:
 - agent
 - gguf
 - ollama
 library_name: transformers
 pipeline_tag: image-text-to-text
 ---
 
-<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/banner.svg" alt="Thanatos-27B banner" width="100%" />
 
 [](https://opensource.org/licenses/Apache-2.0)
-[](#architecture)
 [](https://huggingface.co/FoolDev/Janus-35B)
 [](https://buymeacoffee.com/cardoffoolm)
 
-# Thanatos-27B
 
-> **Dense Reasoning. Friendlier Footprint.**
-> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
 
-**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
 
-A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on the
 
 ## TL;DR
 
@@ -69,18 +72,28 @@ template; HF's Ollama bridge ingests those three files, not
 `Modelfile`):
 
 ```bash
-ollama run hf.co/FoolDev/Thanatos-27B # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
 ```
 
-
-
-the
-
-
-
-
 QUANT=...` is the simplest path. See [Quick start](#quick-start)
-below for the full matrix.
 
 For image input use llama.cpp directly; Ollama vision is broken for
 this architecture upstream (see [Vision](#vision)).
@@ -89,9 +102,9 @@ this architecture upstream (see [Vision](#vision)).
 
 The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time**: the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
 
-The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B (on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4; `make bench`, 3-prompt mix), but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
 
-| | Thanatos-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
 |---|---|---|
 | Architecture | Dense transformer | MoE 256 experts, 8 active |
 | Total params | 27 B | 35 B |
@@ -99,7 +112,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | Layers | 64 | 40 |
 | Hidden size | 5120 | 2048 |
 | Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
-
 | Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
 | Multimodal (text path) | Yes | Yes |
 | Multimodal (vision via Ollama) | Broken upstream (see below) | Broken upstream |
@@ -111,15 +124,15 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | File | Use |
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
-| `Modelfile` | Ollama wrapper around the bundled Qwen 3.6 27B
-| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-27B` directly (the bridge does **not** read `Modelfile`; see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
-| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `
-| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy
-| `scripts/heal_hf_pull.sh` |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
 | `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
-| `scripts/fetch_vision.sh` | Pulls the vision projector (`mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream; see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
 | `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
 | `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles; exit non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
````
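The `scripts/check_bridge_sync.py` row above describes keeping the `Modelfile` directives in sync with the root-level bridge files. A reduced sketch of the `PARAMETER` half of that check follows. It is not the project's script: it assumes `params` is a JSON object (the format HF's Ollama docs describe) and that `stop` is the only key Ollama lets you repeat, and it does not compare `TEMPLATE` or `SYSTEM`.

```python
import json

# Ollama allows several `PARAMETER stop ...` lines; collect those into a list.
MULTI_VALUED = {"stop"}


def _coerce(raw: str):
    """Convert a PARAMETER value to int or float when possible; otherwise
    return the string with one pair of surrounding double quotes removed."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return raw[1:-1]
    return raw


def modelfile_params(text: str) -> dict:
    """Collect `PARAMETER <key> <value>` lines (instruction name is case-insensitive)."""
    params: dict = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3 or parts[0].upper() != "PARAMETER":
            continue
        key, value = parts[1], _coerce(parts[2].strip())
        if key in MULTI_VALUED:
            params.setdefault(key, []).append(value)
        else:
            params[key] = value
    return params


def param_drift(modelfile_text: str, params_json: str) -> dict:
    """Return {key: (modelfile_value, params_value)} for every mismatched key."""
    left = modelfile_params(modelfile_text)
    right = json.loads(params_json)
    return {
        key: (left.get(key), right.get(key))
        for key in sorted(set(left) | set(right))
        if left.get(key) != right.get(key)
    }
```

An empty dict from `param_drift(Path("Modelfile").read_text(), Path("params").read_text())` would mean the sampling parameters agree; the values used in any test of this sketch are illustrative, not the repo's actual defaults.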
|
|
@@ -129,21 +142,22 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |

-For 16 GB GPUs / unified-memory laptops, `make build QUANT=
-downloads the smaller ~
-`
-creates a local `thanatos-27b` Ollama
-via this repo. For other quants use
-local-build path applies this repo's
-path applies the root-level
-files (kept in sync with the

-If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).

## Architecture

<p align="left">
-<img src="https://huggingface.co/FoolDev/Thanatos-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
</p>

- Qwen 3.6 dense, 27B parameters, 64 transformer layers
@@ -154,23 +168,30 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the **base architecture** via a separate
-`mmproj` projector (not redistributed here; pull
-
-
- Multi-token prediction (MTP) head trained for speculative decoding,
present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
**Not usable via llama.cpp / Ollama today**: the GGUF converter
(`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
`qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
-inference yet"), so the
-851 tensors and no MTP head.
-
-
-
-
-
-

**The bundled GGUF declares `general.architecture: 'qwen35'`**: not a
workaround for an unimplemented `qwen36` arch, but the canonical
@@ -186,9 +207,11 @@ stack:
exists in `transformers`; Qwen reuses the 3.5 class names.
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
`Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
-`Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The
-GGUFs this repo pulls from
-`
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
explicit `case 64: type = LLM_TYPE_27B` branch for this model;
`qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
@@ -200,7 +223,7 @@ There is no PR or tracking issue for a `qwen36` arch entry in
`qwen35` already loads the model the upstream code path was
designed to load.

-`ollama run hf.co/FoolDev/Thanatos-27B` and `llama-server -m
Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
@@ -257,7 +280,8 @@ the legacy qwen36 → qwen35 in-store rebadge (used by `make
heal-hf` and `make load-bundle`) and any future arch flip:

```bash
-# qwen36 -> qwen35 (the legacy recovery direction
python3 scripts/rename_arch.py \
  --from-arch qwen36 --to-arch qwen35 \
  Thanatos-27B.Q4_K_M.qwen36.gguf \
@@ -273,21 +297,23 @@ Three paths:
```bash
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
# root-level template / system / params files in one step):
-ollama run hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M, qwen35-stamped

-# B. Build a local `thanatos-27b` tag from THIS repo's bundle
# (LFS smudge if needed, then `ollama create`). Useful if you
# want a bare local tag rather than the `hf.co/...` path:
-make load-bundle # creates local tag thanatos-27b
-ollama run thanatos-27b
-
-# C. Bypass the bundle: download a qwen35-stamped
-# and build locally. Loads on every current
-
-
make build QUANT=Q5_K_M # 20 GB higher quality
-make build GGUF_PATH=~/models/Qwen3.6-27B-Q4_K_M.gguf # skip download
-ollama run thanatos-27b
```

Under the hood, `make build` calls `scripts/build.sh`, which downloads the
@@ -295,7 +321,7 @@ GGUF if missing (set `GGUF_PATH` to point at one you already have) and
runs `ollama create` with the matching `Modelfile`.

If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
-run `ollama create thanatos-27b -f Modelfile && ollama run thanatos-27b`.

Confirm everything works:
@@ -310,10 +336,10 @@ python examples/ollama_chat.py # full demo: chat, streaming, tools, OpenAI-

| App | How to load this model |
|---|---|
-| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=
-| **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
-| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
-| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path: point at the GGUF, use the embedded chat template. |
@@ -331,7 +357,7 @@ external schema.
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
-    "model": "thanatos-27b",
    "messages": [
      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
@@ -369,17 +395,21 @@ Behavior rules:

## Vision

-The Qwen 3.6 base
-`mmproj` projector. The full

```
-Qwen3.6-27B-Q4_K_M.gguf (~17 GB, the text decoder)
-mmproj-
```

Both files are at
-[`
-

### Loader compatibility: the honest table
@@ -397,10 +427,11 @@ Three flavors, in order of build-time effort:
```bash
# A. HTTP via llama-server (always built; the easiest path).
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
-# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU.
llama-server \
-  -m Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj mmproj-
  --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block, e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
@@ -413,15 +444,15 @@ llama-server \
# produce it; a plain `cmake --build build` will. If yours didn't,
# run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
-  -m Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj mmproj-
  --image photo.jpg \
  -p "Describe this image."

# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj /path/to/mmproj-
  --image /path/to/photo.jpg \
  --prompt "What is in this image?"
```
@@ -439,19 +470,22 @@ The dense 27B is the lighter sibling to Janus-35B and the easier of the two to d
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
-| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. `make build QUANT=

Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
-data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s; Vulkan was the clear winner on
-this hardware.

## Chat template
@@ -465,10 +499,10 @@ Ollama is the exception: its conversion of the embedded jinja loses the
`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
Two paths fix this, depending on how you pull the model:

-- **`ollama run hf.co/FoolDev/Thanatos-27B`**: HF's Ollama bridge applies
the root-level `template` / `system` / `params` files in this repo
(the bridge does **not** read `Modelfile`).
-- **`make build` / `ollama create thanatos-27b -f Modelfile`**: uses the
`Modelfile`'s `TEMPLATE` block.

Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
@@ -511,7 +545,7 @@ the model adapts to whichever shape the system prompt prescribes.
**Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
extractor parses into a structured `tool_calls` array. After
-`make build`, `ollama show thanatos-27b` lists `tools` and `thinking`
under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
accept a `tools` array.
@@ -552,19 +586,25 @@ python examples/ollama_chat.py # section 3 runs a real round-trip
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached; see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
- **No formal evaluation in this card.** Numbers above are estimates.

## Related models

| Model | Notes |
|---|---|
-| [
-| [
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |

## Credits

--
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

---
|
| 2 |
license: apache-2.0
|
| 3 |
base_model:
|
| 4 |
+
- llmfan46/Qwen3.6-27B-uncensored-heretic-v2
|
| 5 |
+
base_model_relation: finetune
|
| 6 |
datasets:
|
| 7 |
- crownelius/Creative_Writing_ShareGPT_Enhanced
|
| 8 |
- microsoft/rStar-Coder
|
|
|
|
| 41 |
- agent
|
| 42 |
- gguf
|
| 43 |
- ollama
|
| 44 |
+
- heretic
|
| 45 |
+
- uncensored
|
| 46 |
library_name: transformers
|
| 47 |
pipeline_tag: image-text-to-text
|
| 48 |
---
|

+<img src="https://huggingface.co/FoolDev/Thanatos-Heretic-27B/resolve/main/banner.svg" alt="Thanatos-Heretic-27B banner" width="100%" />

[](https://opensource.org/licenses/Apache-2.0)
+[](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2)
[](#architecture)
[](https://huggingface.co/FoolDev/Janus-35B)
[](https://buymeacoffee.com/cardoffoolm)

+# Thanatos-Heretic-27B

+> **Dense Reasoning. Friendlier Footprint. Uncensored.**
+> *llmfan46's Heretic v2 abliteration of Qwen 3.6 27B (dense), repackaged with Claude Opus 4.7 in the teacher slot.*

+**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Base:`** `Heretic v2 (llmfan46)` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled + Abliterated LLM`

+A personal sibling to [`FoolDev/Janus-35B`](https://huggingface.co/FoolDev/Janus-35B). Same teacher (Claude Opus 4.7), same dataset family, but built on [`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) (an uncensored Heretic-style abliteration of the dense [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base) instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises, and refusal-trained behavior is dialed back at the base layer.

## TL;DR

⋮
`Modelfile`):

```bash
+ollama run hf.co/FoolDev/Thanatos-Heretic-27B # ~17 GB Q4_K_M, qwen35-stamped, loads on stock Ollama
```

+> **Bundled blob status:** the GGUF currently bundled in this repo
+> is the legacy pre-Heretic Qwen 3.6 27B Q4_K_M quant from before
+> the rename. It behaves identically to vanilla Qwen 3.6 27B for now;
+> the Heretic v2 rebundle (from
+> `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) is pending;
+> see the top entry of [CHANGELOG](CHANGELOG.md). If you want the
+> Heretic behavior today, use the local-build path below
+> (`make build`), which pulls the Heretic GGUF directly.
+
+If you pulled the bundle during any of the qwen36 windows on the
+pre-rename `FoolDev/Thanatos-27B` repo (2026-05-19/20) and still
+have a qwen36-stamped blob in your local Ollama store, `make
+heal-hf` rebadges it in place. Fresh pulls of the new
+`Thanatos-Heretic-27B` repo go straight through.
+
+For other quants (Q3_K_M ~13 GB, Q5_K_M ~20 GB, etc.), `make build
QUANT=...` is the simplest path. See [Quick start](#quick-start)
+below for the full matrix. Note: there is no Q3_K_S in the Heretic GGUF
+repo, so use Q3_K_M for the smallest practical quant.
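
The quant sizes above can be cross-checked roughly: a GGUF file is about parameter count × average bits per weight ÷ 8. The bits-per-weight table in this sketch is an assumption, not something read from these files. The values are ballpark averages for llama.cpp k-quants, which mix block formats and keep some tensors at other precisions.

```python
# Rough GGUF size estimate: params * bits_per_weight / 8.
# ASSUMPTION: these bits-per-weight values are approximate averages for
# llama.cpp k-quants; real files differ because some tensors are stored
# at other precisions and the file also carries metadata.
APPROX_BPW = {
    "Q3_K_S": 3.5,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
}


def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Approximate file size in GB (10**9 bytes) for one quant type."""
    bits_per_weight = APPROX_BPW[quant]
    return n_params * bits_per_weight / 8 / 1e9


for q in ("Q3_K_M", "Q4_K_M", "Q5_K_M"):
    print(f"{q}: ~{estimate_gguf_gb(27e9, q):.1f} GB")
```

For 27e9 parameters this prints roughly 13.2, 16.4 and 19.2 GB. That is close to the ~13 / ~17 / ~20 GB figures quoted in this card, but expect real files to differ by a GB or so.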

For image input use llama.cpp directly; Ollama vision is broken for
this architecture upstream (see [Vision](#vision)).
⋮

The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time**: the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.

+The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B (on a Ryzen AI Max+ 395 / Radeon 8060S iGPU the dense 27B at Q3_K_S clocks ~10 tok/s, versus ~27 tok/s for the MoE 35B at ~Q4; measured with `make bench`, 3-prompt mix, against the pre-rename Qwen 3.6 bundle, and Heretic v2 inherits the same architecture so per-step cost should match), but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
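
A memory-bandwidth ceiling helps explain why the dense model is slower per token. Single-stream decoding is usually bandwidth-bound, because each generated token streams every *active* weight from memory roughly once. That gives tokens/s ≲ bandwidth ÷ bytes of active weights. This is a back-of-envelope sketch, not a benchmark, and it rests on several assumptions:

- The 256 GB/s bandwidth is a placeholder; substitute your machine's figure.
- ~4.85 bits/weight approximates Q4_K_M.
- KV-cache traffic, compute limits and MoE routing overhead are ignored.

```python
def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Each token reads every *active* weight once, so
        tok/s <= bandwidth / (active_params * bits_per_weight / 8)
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# ASSUMPTION: 256 GB/s is a placeholder bandwidth; 4.85 bpw approximates Q4_K_M.
BANDWIDTH_GB_S = 256.0
dense = decode_ceiling_tok_s(27e9, 4.85, BANDWIDTH_GB_S)  # all 27B weights per token
moe = decode_ceiling_tok_s(3e9, 4.85, BANDWIDTH_GB_S)     # only ~3B active per token
print(f"dense 27B ceiling ~{dense:.0f} tok/s, MoE (~3B active) ceiling ~{moe:.0f} tok/s")
```

Under these assumptions the dense 27B tops out near 16 tok/s, while the ~3B-active MoE's ceiling is nine times higher. Measured speeds land well below both ceilings. The ratio, though, is why the MoE decodes faster even though all 35B of its weights must stay resident.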

+| | Thanatos-Heretic-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/Janus-35B) |
|---|---|---|
| Architecture | Dense transformer | MoE 256 experts, 8 active |
| Total params | 27 B | 35 B |
⋮
| Layers | 64 | 40 |
| Hidden size | 5120 | 2048 |
| Q4_K_M GGUF size | ~17 GB (bundled) | ~19 GB (bundled) |
+| Q3_K_M GGUF size | ~13 GB (build locally via `make build QUANT=Q3_K_M`) | n/a |
| Min host memory @ Q4 / 8K ctx | ~22 GB | ~38 GB |
| Multimodal (text path) | Yes | Yes |
| Multimodal (vision via Ollama) | Broken upstream (see below) | Broken upstream |
⋮
| File | Use |
|---|---|
| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
+| `Modelfile` | Ollama wrapper around the bundled GGUF (currently the legacy pre-Heretic Qwen 3.6 27B Q4_K_M; Heretic v2 rebundle pending); used by `make build` / `ollama create` for **local** builds |
+| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Thanatos-Heretic-27B` directly (the bridge does **not** read `Modelfile`; see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)). Mirrors the `Modelfile`'s template / system prompt / sampling params. |
| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
+| `scripts/build.sh` | Pulls a qwen35-stamped GGUF from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and runs `ollama create` (loads on today's llama.cpp / Ollama; see `make build`). This is the path that gets you actual Heretic behavior until the bundled blob is rebundled. |
+| `scripts/load_bundle.sh` | One-shot path from *this repo's* bundle → loadable local Ollama tag (smudges LFS pointer via `hf download` if needed, runs `ollama create`; see `make load-bundle`). Carries a qwen36 → qwen35 rebadge branch for legacy pre-rename checkouts; it is a no-op on the current qwen35-stamped bundle. |
+| `scripts/heal_hf_pull.sh` | Legacy recovery for users migrating from the pre-rename `FoolDev/Thanatos-27B` repo who still have a qwen36-stamped blob in their local Ollama store: rebadges the blob qwen36 → qwen35 and rewrites the manifest's model-layer digest so the same tag becomes loadable in place. See `make heal-hf`. Idempotent and a no-op on tags already on qwen35; fresh pulls of `Thanatos-Heretic-27B` don't need it. |
| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, asserts no chat-template tokens leak into the response. With `TOOLS_TEST=1`, also exercises an end-to-end tool-call round-trip and checks the response shape |
| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
+| `scripts/fetch_vision.sh` | Pulls the vision projector (`Qwen3.6-27B-mmproj-BF16.gguf` from the Heretic GGUF repo, or the unsloth reference projector `mmproj-F16.gguf`) for llama.cpp (Ollama vision is broken upstream; see [Vision](#vision)). Renamed from `fetch_mmproj.sh` because HF's Ollama bridge auto-indexed the script as a vision projector layer (filename pattern match). |
| `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep, plus `Modelfile`-vs-bridge-files sync check |
| `scripts/check_bridge_sync.py` | Verifies the `Modelfile` `TEMPLATE` / `SYSTEM` / `PARAMETER` directives stay in sync with the root-level `template` / `system` / `params` files. Run as part of `make check`; called from the pre-commit hook. |
| `scripts/verify_arch.py` | Cross-checks the README "Architecture" forward-pass bullets (layer count, head counts, hidden / FFN dims, RoPE factor, SSM dims, vocab, context) against the actual GGUF metadata keys. Run as `make verify-arch`. Handles both `qwen35`- and `qwen36`-stamped bundles and exits non-zero if any value mismatches. Not part of `make check` because it loads the 17 GB GGUF (LFS smudge required); run on demand. |
⋮
| `CHANGELOG.md` | Versioned tooling/docs changes |
| `README.md` | This file |
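
`scripts/smoke_test.sh` asserts that no chat-template tokens leak into a response. Below is a minimal Python sketch of that kind of check. The token list is an assumption: it uses the common Qwen/ChatML control strings, not necessarily the list the script checks.

```python
# ASSUMPTION: typical Qwen/ChatML control strings; the real smoke test may
# check a different or longer list.
TEMPLATE_TOKENS = ("<|im_start|>", "<|im_end|>", "<|endoftext|>")


def leaked_tokens(response_text: str) -> list[str]:
    """Return every chat-template token that appears verbatim in the reply."""
    return [tok for tok in TEMPLATE_TOKENS if tok in response_text]


def assert_clean(response_text: str) -> None:
    """Raise AssertionError if any template token leaked into the reply."""
    leaks = leaked_tokens(response_text)
    if leaks:
        raise AssertionError(f"chat-template tokens leaked into response: {leaks}")


assert_clean("The Burrows-Wheeler transform permutes a string so that ...")
print(leaked_tokens("Sure.<|im_end|>\n<|im_start|>user"))
```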

+For 16 GB GPUs / unified-memory laptops, `make build QUANT=Q3_K_M`
+downloads the smaller ~13 GB Q3_K_M quant from
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` (qwen35-stamped,
+loads directly) and creates a local `thanatos-heretic-27b` Ollama
+tag. The weights are not redistributed via this repo. For other quants use
+`make build QUANT=...`. The local-build path applies this repo's
+`Modelfile`; the `hf.co/...` path applies the root-level
+`template`, `system`, and `params` files (kept in sync with the
+`Modelfile`).
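
Keeping the `Modelfile` and the bridge's `params` file in sync comes down to comparing two parameter sets. The sketch below is hypothetical and is not `scripts/check_bridge_sync.py`; it also skips the `TEMPLATE` / `SYSTEM` comparison that script performs. It assumes:

- `params` is a JSON object, as HF's Ollama docs describe.
- The Modelfile has one `PARAMETER <name> <value>` per line.
- Repeated names such as `stop` are collected into a list.

```python
import json
import shlex


def _coerce(value: str):
    """Turn a Modelfile value into int/float when it parses as one."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def modelfile_params(text: str) -> dict:
    """Collect PARAMETER lines; repeated names (e.g. `stop`) become lists."""
    params: dict = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Only PARAMETER lines are tokenised, so TEMPLATE/SYSTEM bodies with
        # stray quotes never reach shlex.
        if not stripped.upper().startswith("PARAMETER "):
            continue
        parts = shlex.split(stripped)
        if len(parts) < 3:
            continue
        name, value = parts[1], _coerce(" ".join(parts[2:]))
        if name in params:
            prev = params[name]
            params[name] = (prev if isinstance(prev, list) else [prev]) + [value]
        else:
            params[name] = value
    return params


def params_in_sync(modelfile_text: str, params_json: str) -> bool:
    """Compare Modelfile PARAMETERs with the JSON `params` bridge file."""
    bridge = json.loads(params_json)
    local = modelfile_params(modelfile_text)
    # A single `stop` line in the Modelfile still matches a one-element list.
    for key, val in bridge.items():
        if isinstance(val, list) and key in local and not isinstance(local[key], list):
            local[key] = [local[key]]
    return local == bridge
```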

+If you want the Heretic safetensors for `transformers`, fetch them from [`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2). For the vanilla pre-Heretic Qwen 3.6 27B base, use [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).

## Architecture

<p align="left">
+<img src="https://huggingface.co/FoolDev/Thanatos-Heretic-27B/resolve/main/dense-flow.svg" alt="animated dense forward-pass visualization: 64-layer hybrid attention stack with a pulse traversing left-to-right, illuminating Gated DeltaNet (purple) and Gated Attention (cyan) layers in turn" width="800" />
</p>

- Qwen 3.6 dense, 27B parameters, 64 transformer layers
⋮
- Vocab 248,320 (shared with 35B-A3B sibling)
- 262 144 native context, extensible to ~1 M with YaRN
- Vision + video supported by the **base architecture** via a separate
+`mmproj` projector (not redistributed here; pull
+`Qwen3.6-27B-mmproj-BF16.gguf` from
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`, or
+`mmproj-F16.gguf` from `unsloth/Qwen3.6-27B-GGUF` as a reference
+alternative). See [Vision](#vision) below for current loader
+compatibility.
- Multi-token prediction (MTP) head trained for speculative decoding,
present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
**Not usable via llama.cpp / Ollama today**: the GGUF converter
(`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
`qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
+inference yet"), so the standard GGUFs (this bundle, unsloth's,
+llmfan46's Heretic v2) ship with 851 tensors and no MTP head.
+llmfan46 also publishes a separate
+`Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF` repo
+that keeps the MTP tensors for vLLM/SGLang users who want both
+Heretic v2 + MTP. llama.cpp's MTP support (PR #22673, merged
+2026-05-16) currently covers other architectures only; this repo is
+tracking that PR's follow-up work for when qwen35 / qwen35moe consumer
+support lands. (Earlier README versions claimed MTP was available
+via llama.cpp without this caveat; the missing MTP tensors were confirmed empirically via
+`gguf.GGUFReader` on both this bundle and
+`unsloth/Qwen3.6-27B-GGUF`, 2026-05-19.)
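
The tensor count mentioned above sits in the fixed GGUF header, so it can be checked without reading any weights. This minimal stdlib sketch assumes the GGUF v2/v3 header layout:

- a 4-byte magic `GGUF`;
- a little-endian `uint32` version;
- a `uint64` tensor count;
- a `uint64` metadata key/value count.

`gguf.GGUFReader` from llama.cpp's `gguf` package, as used above, is the fuller tool.

```python
import struct
import sys
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    kv_count: int


def read_gguf_header(f: BinaryIO) -> GGUFHeader:
    """Read the fixed-size GGUF header (v2/v3 layout, little-endian)."""
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    (version,) = struct.unpack("<I", f.read(4))
    tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return GGUFHeader(version, tensor_count, kv_count)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gguf_header.py Thanatos-27B.Q4_K_M.gguf
    with open(sys.argv[1], "rb") as fh:
        hdr = read_gguf_header(fh)
    print(f"GGUF v{hdr.version}: {hdr.tensor_count} tensors, {hdr.kv_count} metadata keys")
```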

**The bundled GGUF declares `general.architecture: 'qwen35'`**: not a
workaround for an unimplemented `qwen36` arch, but the canonical
⋮
exists in `transformers`; Qwen reuses the 3.5 class names.
- **llama.cpp's converter.** `convert_hf_to_gguf.py` registers
`Qwen3_5ForCausalLM` → `MODEL_ARCH.QWEN35` and
+`Qwen3_5MoeForCausalLM` → `MODEL_ARCH.QWEN35MOE`. The Heretic
+GGUFs this repo pulls from
+(`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`) inherit those
+stamps, as do the upstream unsloth GGUFs (`unsloth/Qwen3.6-27B-GGUF`,
+`unsloth/Qwen3.6-35B-A3B-GGUF`).
- **llama.cpp's model code.** `src/models/qwen35.cpp` has an
explicit `case 64: type = LLM_TYPE_27B` branch for this model;
`qwen35moe.cpp` has `case 40: type = LLM_TYPE_35B_A3B` for the
⋮
`qwen35` already loads the model the upstream code path was
designed to load.

+`ollama run hf.co/FoolDev/Thanatos-Heretic-27B` and `llama-server -m
Thanatos-27B.Q4_K_M.gguf` both load directly on current stock
loaders.
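
Before pointing a loader at a GGUF, you can confirm which architecture string it carries (the `qwen35` stamp discussed above) by walking the metadata key/value section with the standard library. The sketch below assumes:

- the GGUF v2/v3 value-type numbering, where 0-7 and 10-12 are fixed-size scalars, 8 is a string and 9 is an array;
- that any array it passes over is fully decoded. A large tokenizer vocab can therefore be slow, but the scan stops at the first matching key and never touches tensor data.

```python
import struct
from typing import BinaryIO

# GGUF metadata value types -> struct format for the fixed-size scalars.
_SCALAR_FMT = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
               6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read_str(f: BinaryIO) -> str:
    (n,) = struct.unpack("<Q", f.read(8))  # uint64 byte length, then UTF-8 bytes
    return f.read(n).decode("utf-8")


def _read_value(f: BinaryIO, vtype: int):
    if vtype in _SCALAR_FMT:
        fmt = _SCALAR_FMT[vtype]
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]
    if vtype == _STRING:
        return _read_str(f)
    if vtype == _ARRAY:
        (elem_type,) = struct.unpack("<I", f.read(4))
        (count,) = struct.unpack("<Q", f.read(8))
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_metadata_key(f: BinaryIO, wanted: str = "general.architecture"):
    """Scan the metadata section and return the value for `wanted`, or None."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    f.read(4)  # version (uint32), unused here
    _tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    for _ in range(kv_count):
        key = _read_str(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        value = _read_value(f, vtype)
        if key == wanted:
            return value
    return None
```

Opening the bundled file in `"rb"` mode and passing it to `read_metadata_key` should return `"qwen35"` for the current bundle. A qwen36-stamped legacy blob would return `"qwen36"`.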

⋮
heal-hf` and `make load-bundle`) and any future arch flip:

```bash
+# qwen36 -> qwen35 (the legacy recovery direction, for blobs
+# pulled from the pre-rename FoolDev/Thanatos-27B repo)
python3 scripts/rename_arch.py \
  --from-arch qwen36 --to-arch qwen35 \
  Thanatos-27B.Q4_K_M.qwen36.gguf \
⋮
```bash
# A. Pull straight from HF (gets the bundled Q4_K_M GGUF + the
# root-level template / system / params files in one step):
+ollama run hf.co/FoolDev/Thanatos-Heretic-27B # 17 GB Q4_K_M, qwen35-stamped

+# B. Build a local `thanatos-heretic-27b` tag from THIS repo's bundle
# (LFS smudge if needed, then `ollama create`). Useful if you
# want a bare local tag rather than the `hf.co/...` path:
+make load-bundle # creates local tag thanatos-heretic-27b
+ollama run thanatos-heretic-27b
+
+# C. Bypass the bundle: download a qwen35-stamped Heretic v2 GGUF
+# from llmfan46 and build locally. Loads on every current
+# llama.cpp / Ollama. This is the path that gets you actual
+# Heretic behavior until the bundled blob is rebundled.
+make build # Q4_K_M -> thanatos-heretic-27b
+make build QUANT=Q3_K_M # 13 GB smaller quant
make build QUANT=Q5_K_M # 20 GB higher quality
+make build GGUF_PATH=~/models/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf # skip download
+ollama run thanatos-heretic-27b
```
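
Ollama's `GET /api/tags` endpoint lists the installed models, so you can check which of the three paths actually produced a usable local tag. The sketch assumes the default `http://localhost:11434` address and that listed names carry a tag suffix such as `:latest`. The bare-name matching rule in `has_model` is this example's own convention, not an Ollama API.

```python
import json
import urllib.request


def has_model(tags_payload: dict, name: str) -> bool:
    """True if `name` (with or without a tag suffix) appears in /api/tags output."""
    # Bare names like "thanatos-heretic-27b" are listed as "thanatos-heretic-27b:latest".
    wanted = name if ":" in name.rsplit("/", 1)[-1] else f"{name}:latest"
    return any(m.get("name") == wanted for m in tags_payload.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama daemon."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Usage against a running daemon (not executed here):
# print(has_model(fetch_tags(), "thanatos-heretic-27b"))
# print(has_model(fetch_tags(), "hf.co/FoolDev/Thanatos-Heretic-27B"))
```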

Under the hood, `make build` calls `scripts/build.sh`, which downloads the
⋮
runs `ollama create` with the matching `Modelfile`.

If you'd rather do it by hand: edit the `FROM` line in `Modelfile` and
+run `ollama create thanatos-heretic-27b -f Modelfile && ollama run thanatos-heretic-27b`.

Confirm everything works:

⋮

| App | How to load this model |
|---|---|
+| **Ollama** | `ollama run hf.co/FoolDev/Thanatos-Heretic-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`). For other quants, `make build QUANT=Q3_K_M` downloads from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
+| **LM Studio** | Search → `FoolDev/Thanatos-Heretic-27B` → pick `Thanatos-27B.Q4_K_M.gguf` (the current bundled filename; it becomes `Thanatos-Heretic-27B.Q4_K_M.gguf` after the rebundle). Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
+| **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-Heretic-27B`. Same template behavior as LM Studio. |
+| **llama.cpp** | `hf download FoolDev/Thanatos-Heretic-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via `Qwen3.6-27B-mmproj-BF16.gguf` from the Heretic GGUF repo). |
| **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
| **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path: point at the GGUF, use the embedded chat template. |

⋮
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
+    "model": "thanatos-heretic-27b",
    "messages": [
      {"role": "system", "content": "You are Thanatos, a precise reasoning assistant."},
      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
⋮

## Vision

+The Qwen 3.6 base (and llmfan46's Heretic v2 finetune of it) supports
+image (and video) input via a separate `mmproj` projector. The full
+multimodal stack is:

```
+Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf (~17 GB, the text decoder)
+Qwen3.6-27B-mmproj-BF16.gguf (~931 MB, the vision projector)
```

Both files are at
+[`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF).
+For the vanilla pre-Heretic projector, see
+[`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
+(`mmproj-F16.gguf`, ~927 MB). This repo intentionally does not
+redistribute either.

### Loader compatibility: the honest table

⋮
```bash
# A. HTTP via llama-server (always built; the easiest path).
# Reconfirmed working 2026-05-19 against llama.cpp 389ff61 + Vulkan
+# on a Ryzen AI Max+ 395 / Radeon 8060S iGPU (pre-Heretic Qwen 3.6
+# bundle; Heretic v2 shares the architecture so the recipe carries over).
llama-server \
+  -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
  --host 127.0.0.1 --port 8765 -c 8192 -ngl 99
# then POST OpenAI-style chat completions with an image_url content
# block, e.g. {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}
⋮
# produce it; a plain `cmake --build build` will. If yours didn't,
# run `cmake --build build --target llama-mtmd-cli`.
llama-mtmd-cli \
+  -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
  --image photo.jpg \
  -p "Describe this image."

# C. Python via llama-cpp-python:
python examples/llama_cpp_vision.py \
+  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
  --image /path/to/photo.jpg \
  --prompt "What is in this image?"
```
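
Path A's comment gives the request shape in a single line; here it is expanded using only the standard library. The sketch rests on these assumptions:

- The `llama-server` from path A is listening on `127.0.0.1:8765`.
- The server accepts OpenAI-style `/v1/chat/completions` bodies with a base64 `data:` URL.
- The default `image/jpeg` MIME type matches your file; change it if not.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes: bytes, prompt: str,
                         mime: str = "image/jpeg") -> dict:
    """OpenAI-style chat body with one text part and one inline image part."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 256,
    }


def post_chat(body: dict, base_url: str = "http://127.0.0.1:8765") -> str:
    """POST the body and return the assistant's text reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Usage (needs the running server and a local photo.jpg; not executed here):
# with open("photo.jpg", "rb") as fh:
#     print(post_chat(build_vision_request(fh.read(), "Describe this image.")))
```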
⋮
| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
+| 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4; use `make build QUANT=Q3_K_M` (~13 GB) and trim `num_ctx` for headroom. |

Most numbers in this table are estimates from comparable models; the
gradient is right but the absolute values will move ±20% with prompt
shape, KV cache type, and parallel-request count. Measure your own
machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
+data points on a Ryzen AI Max+ 395 / Radeon 8060S iGPU under Vulkan
+(measured against the pre-rename Qwen 3.6 bundle; Heretic v2 inherits
+the architecture so per-step cost should match within bench noise):
**~12.3 tok/s at Q3_K_S** and **~9.3 tok/s at Q4_K_M** (3-prompt mix,
steady across short / medium / long prompts), sitting between CPU-only
and a 24 GB discrete card as expected. An earlier ROCm snapshot of the
same Q3_K_S bench gave ~10.1 tok/s; Vulkan was the clear winner on
+this hardware. (Heretic v2 publishes Q3_K_M rather than Q3_K_S; the
+~13 GB Q3_K_M should sit within 5% of the ~12 GB Q3_K_S numbers.)
|
| 489 |
|
| 490 |
## Chat template
|
| 491 |
|
|
|
|
| 499 |
`.Tools` / `.ToolCalls` blocks Ollama's capability detector requires.
|
| 500 |
Two paths fix this, depending on how you pull the model:
|
| 501 |
|
| 502 |
+
- **`ollama run hf.co/FoolDev/Thanatos-Heretic-27B`** β HF's Ollama bridge applies
|
| 503 |
the root-level `template` / `system` / `params` files in this repo
|
| 504 |
(the bridge does **not** read `Modelfile`).
|
| 505 |
+
- **`make build` / `ollama create thanatos-heretic-27b -f Modelfile`** β uses the
|
| 506 |
`Modelfile`'s `TEMPLATE` block.
|
| 507 |
|
| 508 |
Both routes wire `.Tools` / `.ToolCalls` and tools work end-to-end on
|
|
|
|
| 545 |
**Ollama path** (this repo's `Modelfile`). The `TEMPLATE` directive
|
| 546 |
prompts the model to emit JSON-in-XML, the form Ollama's tool-call
|
| 547 |
extractor parses into a structured `tool_calls` array. After
|
| 548 |
+
`make build`, `ollama show thanatos-heretic-27b` lists `tools` and `thinking`
|
| 549 |
under **Capabilities**, and both `/api/chat` and `/v1/chat/completions`
|
| 550 |
accept a `tools` array.
|
| 551 |
|
|
|
|
| 586 |
- **No mmproj in this release**, and **vision via Ollama is broken upstream** (the qwen35/qwen35moe arch entries are present in Ollama's Go engine but missing from the C++ llama.cpp fallback Ollama uses when mmproj is attached β see the [Vision](#vision) section). For image input use llama.cpp directly until that's fixed.
|
| 587 |
- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
|
| 588 |
- **No formal evaluation in this card.** Numbers above are estimates.
|
| 589 |
+
- **Bundled blob is pre-Heretic.** The currently-bundled `Thanatos-27B.Q4_K_M.gguf` blob is the legacy Qwen 3.6 27B Q4_K_M quant from before the rename β it behaves like vanilla Qwen 3.6, not Heretic v2. Use `make build` (which pulls the Heretic GGUF from llmfan46) until the rebundle ships.
|
| 590 |
+
- **Uncensored base.** The Heretic v2 abliteration dials back the refusal-training of upstream Qwen 3.6. Outputs may be more compliant with sensitive requests than the vanilla base; the Thanatos system prompt still steers behavior, but the safety floor is lower. Apply your own filtering for user-facing deployments.
|
| 591 |
|
| 592 |
## Related models
|
| 593 |
|
| 594 |
| Model | Notes |
|
| 595 |
|---|---|
|
| 596 |
+
| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) | **Immediate base**, safetensors |
|
| 597 |
+
| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF) | Recommended GGUF source (what `make build` pulls from) |
|
| 598 |
+
| [llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved) | Same Heretic v2 but keeps the MTP head for vLLM / SGLang speculative decoding |
|
| 599 |
+
| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream pre-Heretic base, safetensors |
|
| 600 |
+
| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Pre-Heretic GGUF mirror + reference `mmproj-F16.gguf` projector |
|
| 601 |
| [FoolDev/Janus-35B](https://huggingface.co/FoolDev/Janus-35B) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
|
| 602 |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
|
| 603 |
|
| 604 |
## Credits
|
| 605 |
|
| 606 |
+
- Immediate base: [llmfan46/Qwen3.6-27B-uncensored-heretic-v2](https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2) β Heretic-style abliteration of Qwen 3.6 27B
|
| 607 |
+
- Upstream base: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
|
| 608 |
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
|
| 609 |
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)
|
| 610 |
|
|
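The llama-server recipe in the README diff above says to POST OpenAI-style chat completions carrying an `image_url` block with a base64 `data:` URL. A minimal standard-library sketch of that request follows. It assumes the server was started with the README's `--host 127.0.0.1 --port 8765` flags and serves the OpenAI-compatible `/v1/chat/completions` route; the helper names (`image_data_url`, `build_vision_request`, `ask_about_image`) are hypothetical and not part of this repo.

```python
import base64
import json
import urllib.request

# Assumed endpoint, matching the README's llama-server --host/--port flags.
SERVER_URL = "http://127.0.0.1:8765/v1/chat/completions"


def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data: URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_request(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat body with one text block and one image_url block."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_data_url(image_bytes, mime)}},
                ],
            }
        ]
    }


def ask_about_image(path: str, prompt: str = "Describe this image.") -> str:
    """POST the request to a running llama-server and return the reply text."""
    with open(path, "rb") as fh:
        body = build_vision_request(prompt, fh.read())
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask_about_image("photo.jpg"))` sends one image and prints the model's answer. Encoding the image inline avoids having the server fetch a remote URL, at the cost of a larger request body.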
@@ -1,13 +1,13 @@
-# Thanatos-27B examples
+# Thanatos-Heretic-27B examples

 Four minimal entry points. Pick the one that matches how you run models.

 | File | Backend | When to use |
 |---|---|---|
-| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-27b` model created from the project `Modelfile`. **Text + tool calling**; vision via Ollama is broken upstream for this arch. |
-| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the
+| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `thanatos-heretic-27b` model created from the project `Modelfile`. **Text + tool calling**; vision via Ollama is broken upstream for this arch. |
+| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the Heretic safetensors (`llmfan46/Qwen3.6-27B-uncensored-heretic-v2`) on GPU, optionally in 4-bit via bitsandbytes. |
 | `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). Text only. |
-| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `mmproj-
+| `llama_cpp_vision.py` | llama-cpp-python + mmproj | **Image input.** Loads a text GGUF + `Qwen3.6-27B-mmproj-BF16.gguf` and answers questions about an image. The only working vision path right now. |

 All four apply the same Thanatos system prompt and sampling defaults
 (`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
@@ -24,9 +24,9 @@ root-level `template` / `system` / `params` files via HF's Ollama
 bridge):

 ```bash
-ollama pull hf.co/FoolDev/Thanatos-27B # 17 GB Q4_K_M (only bundled quant)
+ollama pull hf.co/FoolDev/Thanatos-Heretic-27B # 17 GB Q4_K_M (only bundled quant)
 pip install requests
-MODEL=hf.co/FoolDev/Thanatos-27B python ollama_chat.py
+MODEL=hf.co/FoolDev/Thanatos-Heretic-27B python ollama_chat.py
 ```

 If you pulled before the latest qwen35 re-stamp (HF commit
@@ -36,13 +36,14 @@ in place (qwen36 -> qwen35, metadata-only, ~5 s); the same
 tag then loads. Fresh pulls after the re-stamp go straight
 through.

-For a non-bundled quant (e.g.
-`make build QUANT=...` downloads from
-and creates a
+For a non-bundled quant (e.g. Q3_K_M ~12 GB, Q5_K_M ~20 GB),
+`make build QUANT=...` downloads from
+`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and creates a
+local `thanatos-heretic-27b` tag:

 ```bash
-cd .. && make build QUANT=
-MODEL=thanatos-27b python ollama_chat.py
+cd .. && make build QUANT=Q3_K_M && cd examples
+MODEL=thanatos-heretic-27b python ollama_chat.py
 ```

 Or build a local tag from this repo's bundled GGUF without going
@@ -50,12 +51,12 @@ through the HF pull:

 ```bash
 cd .. && make load-bundle && cd examples
-MODEL=thanatos-27b python ollama_chat.py
+MODEL=thanatos-heretic-27b python ollama_chat.py
 ```

 For a quant the repo doesn't bundle (e.g. Q5_K_M), `make build` will
-fetch it from `
-`FROM` line into a temp copy automatically:
+fetch it from `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` and
+patch the `Modelfile` `FROM` line into a temp copy automatically:

 ```bash
 cd .. && make build QUANT=Q5_K_M && cd examples
@@ -74,7 +75,7 @@ python transformers_quickstart.py --no-4bit # bf16, ~54 GB VRAM

 ```bash
 pip install llama-cpp-python # CPU-only build
-python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-Q4_K_M.gguf --gpu-layers 99
+python llama_cpp_quickstart.py /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf --gpu-layers 99
 ```

 For GPU offload, rebuild llama-cpp-python with the matching backend; see
@@ -83,13 +84,13 @@ the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
 ### Vision (image input)

 ```bash
-# Pull the projector once (~
-hf download
+# Pull the projector once (~931 MB):
+hf download llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF Qwen3.6-27B-mmproj-BF16.gguf --local-dir .

 pip install llama-cpp-python pillow
 python llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj /path/to/mmproj-
+  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
   --image /path/to/photo.jpg \
   --prompt "Describe this image."
 ```
@@ -101,7 +102,7 @@ lacks them. `ollama create` accepts the dual-`FROM` and `ollama show`
 reports `vision` capability, but the first inference call fails with
 `error loading model architecture: unknown model architecture:
 'qwen35'` (verified empirically against the dense 27B +
-
+the F16 reference projector). Tracked in
 [ollama/ollama#15898](https://github.com/ollama/ollama/issues/15898).
 Until that's fixed, llama.cpp / llama-cpp-python is the working path
 for vision.

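The examples README above says all four scripts share the sampling defaults `temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`. Against Ollama's native `/api/chat` endpoint those values would travel in the request's `options` object. The sketch below assumes Ollama's option names (`temperature`, `top_p`, `top_k`, `repeat_penalty`) and a non-streaming call; `SAMPLING_DEFAULTS` and `build_chat_body` are illustrative names rather than code from `ollama_chat.py`, and the system prompt in the usage line is a placeholder, not the real Thanatos prompt.

```python
import json
import os

MODEL = os.environ.get("MODEL", "thanatos-heretic-27b")

# Shared sampling defaults from the examples README, keyed by Ollama option names.
SAMPLING_DEFAULTS = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}


def build_chat_body(user_text: str, system_prompt: str, **overrides) -> dict:
    """Return a non-streaming /api/chat request body with the shared defaults."""
    options = {**SAMPLING_DEFAULTS, **overrides}  # per-call overrides win
    return {
        "model": MODEL,
        "stream": False,
        "options": options,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


if __name__ == "__main__":
    # Placeholder system prompt; the real one lives in the Modelfile / `system` file.
    print(json.dumps(build_chat_body("Hello", "You are Thanatos.", temperature=0.2), indent=2))
```

The resulting dict can be POSTed to `http://localhost:11434/api/chat` (the `HOST` default in `ollama_chat.py`). Keeping the defaults in one mapping makes it easy to check that every entry point sends the same values.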
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B: llama-cpp-python quickstart.
+Thanatos-Heretic-27B: llama-cpp-python quickstart.

 Skip Ollama entirely and call the GGUF directly through llama-cpp-python.
 Useful for batch jobs, CI, or environments where you don't want a daemon.

@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B: vision (image-text-to-text) via llama-cpp-python.
+Thanatos-Heretic-27B: vision (image-text-to-text) via llama-cpp-python.

 Why this script exists:
 Ollama's Go engine has the qwen35 / qwen35moe arch entries (text
@@ -23,21 +23,21 @@ Install:
 # CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-binary :all:
 # CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary :all:

-Files you need (both from
-1. A text GGUF (any quant): e.g. Qwen3.6-27B-Q4_K_M.gguf (~17 GB)
-2. A vision projector: mmproj-
+Files you need (both from llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF):
+1. A text GGUF (any quant): e.g. Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf (~17 GB)
+2. A vision projector: Qwen3.6-27B-mmproj-BF16.gguf (~931 MB)

 Usage:
 python llama_cpp_vision.py \
-  --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \
-  --mmproj /path/to/mmproj-
+  --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+  --mmproj /path/to/Qwen3.6-27B-mmproj-BF16.gguf \
   --image /path/to/photo.jpg \
   --prompt "What is in this image? Be specific."

 # CLI alternative without python binding (ships with llama.cpp):
 # llama-mtmd-cli \
-#   -m Qwen3.6-27B-Q4_K_M.gguf \
-#   --mmproj mmproj-
+#   -m Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \
+#   --mmproj Qwen3.6-27B-mmproj-BF16.gguf \
 #   --image photo.jpg \
 #   -p "Describe this image."
 """

@@ -1,17 +1,17 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B: Ollama chat examples.
+Thanatos-Heretic-27B: Ollama chat examples.

 Prerequisites (pick one):

 A. From the bundled GGUFs (default flow):
 $ make build # uses Thanatos-27B.Q4_K_M.gguf
 # or:
-$ ollama create thanatos-27b -f ../Modelfile
+$ ollama create thanatos-heretic-27b -f ../Modelfile

 B. Pull straight from HF (Q4_K_M is the only bundled quant):
-$ ollama run hf.co/FoolDev/Thanatos-27B
-# then set MODEL=hf.co/FoolDev/Thanatos-27B below
+$ ollama run hf.co/FoolDev/Thanatos-Heretic-27B
+# then set MODEL=hf.co/FoolDev/Thanatos-Heretic-27B below

 Then:
 $ ollama serve # usually already running
@@ -39,7 +39,7 @@ from typing import Any, Iterator

 import requests

-MODEL = os.environ.get("MODEL", "thanatos-27b")
+MODEL = os.environ.get("MODEL", "thanatos-heretic-27b")
 HOST = os.environ.get("HOST", "http://localhost:11434")

 _THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

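`ollama_chat.py` keeps a `_THINK_RE` pattern (visible in the hunk above) for dropping `<think>...</think>` reasoning, and the README says `/api/chat` accepts a `tools` array and returns a structured `tool_calls` array. The sketch below shows both post-processing steps on an already-decoded, non-streaming `/api/chat` response. It assumes Ollama's usual shapes: a tool is declared as `{"type": "function", "function": {...}}`, and each returned call carries `function.name` plus `function.arguments`, which Ollama normally sends as an object (the `json.loads` fallback for string arguments is a defensive assumption). The `get_weather` tool and both helper names are hypothetical.

```python
from __future__ import annotations

import json
import re
from typing import Any

# Same pattern as ollama_chat.py's _THINK_RE.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

# Hypothetical tool declaration in the OpenAI-style function schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and the whitespace that follows them."""
    return _THINK_RE.sub("", text)


def extract_tool_calls(response: dict[str, Any]) -> list[tuple[str, dict[str, Any]]]:
    """Return (name, arguments) pairs from a non-streaming /api/chat response."""
    calls = []
    for call in response.get("message", {}).get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", {})
        if isinstance(args, str):  # tolerate JSON-encoded arguments
            args = json.loads(args)
        calls.append((fn.get("name", ""), args))
    return calls
```

In a request, the tool list would go in the body as `"tools": [WEATHER_TOOL]`; after the reply arrives, `extract_tool_calls` tells you which function to run, and `strip_think` cleans any plain-text answer before display.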
@@ -1,12 +1,15 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B: Hugging Face Transformers quickstart.
+Thanatos-Heretic-27B: Hugging Face Transformers quickstart.

-Loads the
-chat turn using its embedded chat template. Thanatos-27B is a
-around that base, so for the transformers route there is nothing
-download from this repo; point at
-system prompt the Modelfile uses.
+Loads the Heretic v2 Qwen 3.6 27B safetensors directly and runs a single
+chat turn using its embedded chat template. Thanatos-Heretic-27B is a
+*wrapper* around that base, so for the transformers route there is nothing
+to download from this repo; point at llmfan46/Qwen3.6-27B-uncensored-heretic-v2
+and apply the same system prompt the Modelfile uses.
+
+Set MODEL_ID = "Qwen/Qwen3.6-27B" to bypass the Heretic abliteration and
+load the vanilla upstream base instead.

 Requirements:
 pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
@@ -36,7 +39,7 @@ except ImportError as e: # pragma: no cover
 )


-MODEL_ID = "
+MODEL_ID = "llmfan46/Qwen3.6-27B-uncensored-heretic-v2"

 THANATOS_SYSTEM = (
 "You are Thanatos, a precise and capable assistant for reasoning, writing, "

@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B: tok/s benchmark via Ollama.
+# Thanatos-Heretic-27B: tok/s benchmark via Ollama.
 #
 # Reads timing from Ollama's /api/chat response metadata (eval_count and
 # eval_duration are authoritative: no client-side stopwatch noise) and
@@ -7,14 +7,14 @@
 # number generalises a bit beyond a single shape.
 #
 # Usage:
-# ./scripts/bench.sh # uses MODEL=thanatos-27b
-# MODEL=thanatos-27b ./scripts/bench.sh
+# ./scripts/bench.sh # uses MODEL=thanatos-heretic-27b
+# MODEL=thanatos-heretic-27b ./scripts/bench.sh
 # HOST=http://localhost:11434 ./scripts/bench.sh
 #
 # Requires: curl, jq, a running Ollama daemon with the model created.
 set -euo pipefail

-MODEL="${MODEL:-thanatos-27b}"
+MODEL="${MODEL:-thanatos-heretic-27b}"
 HOST="${HOST:-http://localhost:11434}"

 red() { printf "\033[31m%s\033[0m\n" "$*" >&2; }

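`bench.sh` (hunk above) reads `eval_count` and `eval_duration` from Ollama's `/api/chat` metadata instead of timing requests client-side. Ollama reports durations in nanoseconds, so decode speed is `eval_count / (eval_duration / 1e9)`. The sketch below assumes non-streaming responses and pools several prompts by dividing total tokens by total time; that is one reasonable way to summarize a 3-prompt mix, but the diff does not show how `bench.sh` actually aggregates, so treat it as an illustration.

```python
from typing import Iterable

NS_PER_S = 1_000_000_000  # Ollama durations are reported in nanoseconds


def tokens_per_second(response: dict) -> float:
    """Decode speed for one non-streaming /api/chat response."""
    count = response["eval_count"]
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return count / (duration_ns / NS_PER_S)


def pooled_tokens_per_second(responses: Iterable[dict]) -> float:
    """Token-weighted speed across several prompts: total tokens / total seconds."""
    total_tokens = 0
    total_ns = 0
    for r in responses:
        total_tokens += r["eval_count"]
        total_ns += r["eval_duration"]
    if total_ns <= 0:
        raise ValueError("no timing data")
    return total_tokens / (total_ns / NS_PER_S)
```

Pooling weights long answers more heavily than short ones; averaging the per-prompt rates instead would count a 20-token reply the same as a 2,000-token one.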
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B: fetch a Qwen 3.6 27B GGUF and build the Ollama model.
+# Thanatos-Heretic-27B: fetch a Qwen 3.6 27B GGUF and build the Ollama model.
 #
 # Usage:
 # ./scripts/build.sh # default: Q4_K_M
@@ -7,28 +7,27 @@
 # QUANT=Q6_K ./scripts/build.sh
 #
 # Skip the download by pointing at a GGUF you already have:
-# GGUF_PATH=/path/to/Qwen3.6-27B-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
+# GGUF_PATH=/path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf ./scripts/build.sh Q4_K_M
 #
 # Requires: huggingface-cli (or hf), ollama, awk.
 set -euo pipefail

 QUANT="${1:-${QUANT:-Q4_K_M}}"

-REPO_ID="${REPO_ID:-
-#
-#
-#
-#
-#
-
-GGUF_NAME="Qwen3.6-27B-${QUANT}.gguf"
+REPO_ID="${REPO_ID:-llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF}"
+# Filenames at llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF follow
+# Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf
+# Quants known to exist (as of 2026-05):
+# Q3_K_M Q3_K_L Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0 BF16
+# Note: no Q3_K_S in this repo; use Q3_K_M for the smallest practical quant.
+GGUF_NAME="Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 # GGUF_PATH defaults to ${ROOT}/${GGUF_NAME}, but can be overridden so users
 # with cached weights elsewhere don't have to copy or symlink anything.
 GGUF_PATH="${GGUF_PATH:-${ROOT}/${GGUF_NAME}}"

 MODELFILE="${ROOT}/Modelfile"
-TAG="${TAG:-thanatos-27b}"
+TAG="${TAG:-thanatos-heretic-27b}"

 echo "[*] repo: ${REPO_ID}"
 echo "[*] quant: ${QUANT}"
@@ -96,4 +95,4 @@ ollama create "${TAG}" -f "${TMP_MODELFILE}"
 echo
 echo "[+] Done. Try it:"
 echo " ollama run ${TAG}"
-echo " python ${ROOT}/examples/ollama_chat.py # update MODEL constant if not 'thanatos-27b'"
+echo " python ${ROOT}/examples/ollama_chat.py # update MODEL constant if not 'thanatos-heretic-27b'"

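The new `build.sh` header lists the quants known to exist in `llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF` (as of 2026-05) and derives the filename as `Qwen3.6-27B-uncensored-heretic-v2-${QUANT}.gguf`. Nothing in the visible hunks rejects a quant outside that list (for example `Q3_K_S`, which this repo does not publish), so a typo would only surface as a failed download. A hypothetical pre-flight check in Python might look like the following; `KNOWN_QUANTS` is copied from the script comment and will drift if the upstream repo adds quants.

```python
# Copied from the build.sh comment; may go stale as upstream adds quants.
KNOWN_QUANTS = {
    "Q3_K_M", "Q3_K_L", "Q4_K_S", "Q4_K_M",
    "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0", "BF16",
}


def gguf_name(quant: str = "Q4_K_M") -> str:
    """Map a QUANT value to the upstream (dash-separated) GGUF filename."""
    if quant not in KNOWN_QUANTS:
        hint = " (this repo has no Q3_K_S; try Q3_K_M)" if quant == "Q3_K_S" else ""
        raise ValueError(f"unknown quant {quant!r}{hint}")
    return f"Qwen3.6-27B-uncensored-heretic-v2-{quant}.gguf"
```

Failing fast here turns a confusing 404 from the download step into an error message that names the problem.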
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B: repo-local sanity checks.
+# Thanatos-Heretic-27B: repo-local sanity checks.
 #
 # Runs everything that's cheap and catches a real-world bug we've already hit:
 #
@@ -104,9 +104,11 @@ fi

 # ---- 5. footgun: dot-vs-dash filename -------------------------------------
 #
-# Upstream
-#
-#
+# Upstream llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF (and the
+# legacy unsloth/Qwen3.6-27B-GGUF) use dashes
+# (Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf,
+# Qwen3.6-27B-Q4_K_M.gguf). Earlier commits used the wrong
+# dot-separated pattern, which 404s. Block re-introduction.

 blue "[*] grep: forbidden Qwen3.6-27B.Q* filename pattern"
 if grep -RnE 'Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf' \

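Check 5 in the sanity script greps the repo for the dot-separated `Qwen3.6-27B.Q*.gguf` pattern, which 404s upstream because the real filenames use dashes. The same extended regex works unchanged with Python's `re`, so the rule can be unit-tested on in-memory file contents. `find_dot_filenames` is a hypothetical helper for illustration, not part of the script.

```python
from __future__ import annotations

import re

# Same pattern as the grep -E in check 5.
FORBIDDEN = re.compile(r"Qwen3\.6-27B\.Q[0-9A-Z_]+\.gguf")


def find_dot_filenames(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every line using the dot pattern."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN.search(line):
                hits.append((path, lineno, line))
    return hits
```

Note that the repo's own bundled blob name, `Thanatos-27B.Q4_K_M.gguf`, is not flagged: the pattern only fires on the `Qwen3.6-27B.` prefix.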
@@ -1,13 +1,13 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B: verify Modelfile and HF Ollama bridge files stay in sync.
+Thanatos-Heretic-27B: verify Modelfile and HF Ollama bridge files stay in sync.

 The repo ships two parallel Ollama configurations:

 - ``Modelfile`` is consumed by the local-build path (``ollama create -f Modelfile``).
   It contains ``TEMPLATE`` / ``SYSTEM`` / ``PARAMETER`` directives.
 - ``template`` / ``system`` / ``params`` at the repo root are consumed by HF's
-  Ollama bridge when users ``ollama run hf.co/FoolDev/Thanatos-27B`` directly. HF
+  Ollama bridge when users ``ollama run hf.co/FoolDev/Thanatos-Heretic-27B`` directly. HF
   does NOT read the Modelfile (per https://huggingface.co/docs/hub/en/ollama).

 If the two configurations drift apart, ``hf.co/...`` users and ``make build``

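The sync checker's docstring explains why the `Modelfile` (`TEMPLATE` / `SYSTEM` / `PARAMETER`) and the root-level `template` / `system` / `params` files must not drift apart: HF's bridge reads only the latter. The diff does not show how the checker compares them, so what follows is only a sketch of one piece, comparing `PARAMETER` lines against the `params` file. It assumes `params` is the JSON object described in HF's Ollama documentation, treats repeated `stop` parameters as a list, and ignores every directive except `PARAMETER`; all function names are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any


def _coerce(value: str) -> Any:
    """Convert a PARAMETER value to int/float when possible; strip quotes otherwise."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value.strip('"')


def modelfile_params(modelfile: str) -> dict[str, Any]:
    """Collect `PARAMETER key value` lines; `stop` may repeat, so keep it as a list."""
    params: dict[str, Any] = {}
    for line in modelfile.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3 or parts[0].upper() != "PARAMETER":
            continue
        key, value = parts[1], _coerce(parts[2].strip())
        if key == "stop":
            params.setdefault("stop", []).append(value)
        else:
            params[key] = value
    return params


def param_drift(modelfile: str, params_json: str) -> dict[str, tuple[Any, Any]]:
    """Map each differing key to (Modelfile value, params-file value)."""
    left = modelfile_params(modelfile)
    right = json.loads(params_json)
    return {
        k: (left.get(k), right.get(k))
        for k in sorted(left.keys() | right.keys())
        if left.get(k) != right.get(k)
    }
```

An empty result from `param_drift` means the two sampling configurations agree; a real checker would also need to compare `TEMPLATE` against `template` and `SYSTEM` against `system`.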
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B: fetch the vision projector (mmproj) for image input.
+# Thanatos-Heretic-27B: fetch the vision projector (mmproj) for image input.
 #
 # Why this is separate from build.sh:
 # build.sh is for the Ollama text path. The mmproj is only useful for
@@ -8,16 +8,20 @@
 # it (see README Vision section, ollama/ollama#15898).
 #
 # Usage:
-# ./scripts/fetch_vision.sh # default:
-#
-#
+# ./scripts/fetch_vision.sh # default: BF16 (~931 MB)
+#
+# llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF publishes BF16 only;
+# for F16/F32 variants fall back to unsloth's reference projector:
+# REPO_ID=unsloth/Qwen3.6-27B-GGUF FILE_NAME=mmproj-F16.gguf ./scripts/fetch_vision.sh
+# (vision tokens are projected the same way across Qwen 3.6 27B
+# finetunes, so the unsloth projector is functionally interchangeable.)
 #
 # Requires: huggingface-cli (or hf).
 set -euo pipefail

-PRECISION="${1:-${PRECISION:-
-REPO_ID="${REPO_ID:-
-FILE_NAME="mmproj-${PRECISION}.gguf"
+PRECISION="${1:-${PRECISION:-BF16}}"
+REPO_ID="${REPO_ID:-llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF}"
+FILE_NAME="${FILE_NAME:-Qwen3.6-27B-mmproj-${PRECISION}.gguf}"
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 DEST="${MMPROJ_PATH:-${ROOT}/${FILE_NAME}}"

@@ -58,7 +62,7 @@ fi
 echo
 echo "[+] Done. Use it via:"
 echo " python ${ROOT}/examples/llama_cpp_vision.py \\"
-echo " --gguf /path/to/Qwen3.6-27B-Q4_K_M.gguf \\"
+echo " --gguf /path/to/Qwen3.6-27B-uncensored-heretic-v2-Q4_K_M.gguf \\"
 echo " --mmproj ${DEST} \\"
 echo " --image /path/to/photo.jpg \\"
 echo " --prompt 'Describe this image.'"

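`fetch_vision.sh` now defaults to `PRECISION=BF16` and `REPO_ID=llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF`, builds `FILE_NAME` as `Qwen3.6-27B-mmproj-${PRECISION}.gguf` unless `FILE_NAME` is set, and points at `unsloth/Qwen3.6-27B-GGUF` with `mmproj-F16.gguf` for other precisions. For callers who would rather stay in Python, a rough equivalent using `huggingface_hub.hf_hub_download` is sketched below. It mirrors the shell defaults but is not part of the repo, and the download happens only when `fetch_mmproj` is called.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Optional

DEFAULT_REPO = "llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GGUF"


def mmproj_filename(precision: str = "BF16", file_name: Optional[str] = None) -> str:
    """Mirror FILE_NAME="${FILE_NAME:-Qwen3.6-27B-mmproj-${PRECISION}.gguf}" (empty counts as unset)."""
    return file_name or f"Qwen3.6-27B-mmproj-{precision}.gguf"


def fetch_mmproj(dest_dir: str = ".", precision: Optional[str] = None) -> str:
    """Download the projector with huggingface_hub and return its local path."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    precision = precision or os.environ.get("PRECISION", "BF16")
    repo_id = os.environ.get("REPO_ID", DEFAULT_REPO)
    filename = mmproj_filename(precision, os.environ.get("FILE_NAME"))
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=dest_dir)
```

As with the shell script, `REPO_ID=unsloth/Qwen3.6-27B-GGUF FILE_NAME=mmproj-F16.gguf` in the environment switches to the unsloth reference projector.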
@@ -1,10 +1,10 @@
 #!/usr/bin/env bash
-# Thanatos-27B - heal a previously pulled HF-bridge tag whose bundled
+# Thanatos-Heretic-27B - heal a previously pulled HF-bridge tag whose bundled
 # GGUF is `qwen36`-stamped (legacy v0.6.0-era pulls before `964e418`,
 # 3rd-round-trip-era pulls between `973d7ef` and `978798f`, or
 # 5th-round-trip-era pulls between `ae67ed1` and `e03e10e`).
 #
-# Fresh pulls of `ollama run hf.co/FoolDev/Thanatos-27B` now get the
+# Fresh pulls of `ollama run hf.co/FoolDev/Thanatos-Heretic-27B` now get the
 # qwen35-stamped bundle and load directly - this script is the
 # recovery path for users who pulled a qwen36-stamped blob into
 # their local Ollama store during one of the qwen36 windows
@@ -13,7 +13,7 @@
 # It rebadges the HF-bridge tag's model blob in-place (qwen36 ->
 # qwen35, metadata-only, byte-identical tensors) and rewrites the
 # manifest's model-layer digest to point at the new blob. After
-# running, the cached `hf.co/FoolDev/Thanatos-27B` tag loads.
+# running, the cached `hf.co/FoolDev/Thanatos-Heretic-27B` tag loads.
 #
 # Idempotent: a tag already on qwen35 / qwen35moe is left untouched.
 # The current bundle is qwen35-stamped so this script is a no-op for
@@ -22,13 +22,13 @@
 #
 # Usage:
 # ./scripts/heal_hf_pull.sh # default tag
-# TAG=hf.co/FoolDev/Thanatos-27B:Q4_K_M ./scripts/heal_hf_pull.sh
+# TAG=hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M ./scripts/heal_hf_pull.sh
 #
 # Requires: ollama, jq, python3 with the `gguf` package, sha256sum.
 set -euo pipefail
 
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
-TAG="${TAG:-hf.co/FoolDev/Thanatos-27B:Q4_K_M}"
+TAG="${TAG:-hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M}"
 OLLAMA_MODELS="${OLLAMA_MODELS:-${HOME}/.ollama/models}"
 
 red() { printf "\033[31m%s\033[0m\n" "$*"; }
@@ -50,7 +50,7 @@ done
 
 # `ollama show --modelfile` writes a FROM line with the absolute blob path.
 # Reliable regardless of which case variant the user pulled with
-# (hf.co's 307 lets `Thanatos-27B` and `thanatos-27b` both resolve to the
+# (hf.co's 307 lets `Thanatos-Heretic-27B` and `thanatos-heretic-27b` both resolve to the
 # canonical repo, and ollama stores the manifest under whichever case
 # was first registered).
 #
@@ -79,8 +79,8 @@ blue "[*] blob: ${MODEL_BLOB}"
 # referenced from exactly one tag in the heal scenario - fresh HF pull
 # of a single :Q4_K_M tag - but if someone has multiple tags pointing
 # at the same blob, we filter down to the one matching ${TAG}.
-TAG_PATH="${TAG#hf.co/}" # FoolDev/Thanatos-27B:Q4_K_M
-NAMESPACE_PATH="${TAG_PATH%:*}" # FoolDev/Thanatos-27B
+TAG_PATH="${TAG#hf.co/}" # FoolDev/Thanatos-Heretic-27B:Q4_K_M
+NAMESPACE_PATH="${TAG_PATH%:*}" # FoolDev/Thanatos-Heretic-27B
 TAG_FILE="${TAG_PATH##*:}" # Q4_K_M
 
 MANIFEST="$(find "${OLLAMA_MODELS}/manifests/hf.co" \
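The heal script's bookkeeping (splitting `TAG` into path, namespace and file, then repointing the manifest's model layer at the rebadged blob) can be sketched in Python. This is an illustration under stated assumptions, not the script itself: `parse_tag` and `repoint_model_layer` are hypothetical helpers, and the manifest shape (a JSON object with a `layers` list whose weights layer has `mediaType` `application/vnd.ollama.image.model` and a `sha256:<hex>` digest) is assumed rather than taken from the diff.

```python
import hashlib
import json
from pathlib import Path

# Assumption: media type Ollama uses for the GGUF weights layer in a manifest.
MODEL_MEDIA_TYPE = "application/vnd.ollama.image.model"


def parse_tag(tag: str) -> tuple[str, str, str]:
    """Hypothetical helper mirroring the three bash expansions applied to TAG."""
    # ${TAG#hf.co/}: drop a leading "hf.co/" if present.
    tag_path = tag.removeprefix("hf.co/")
    # ${TAG_PATH%:*}: drop the shortest ":..." suffix, i.e. from the last colon on.
    namespace_path = tag_path.rsplit(":", 1)[0]
    # ${TAG_PATH##*:}: drop the longest "...:" prefix, i.e. up to the last colon.
    tag_file = tag_path.rsplit(":", 1)[-1]
    return tag_path, namespace_path, tag_file


def sha256_digest(path: Path) -> str:
    """Stream a blob through SHA-256 and format it as 'sha256:<hex>'."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def repoint_model_layer(manifest: dict, new_blob: Path) -> dict:
    """Hypothetical helper: copy of the manifest with the model layer pointing at new_blob."""
    updated = json.loads(json.dumps(manifest))  # deep copy; the input is left unchanged
    for layer in updated.get("layers", []):
        if layer.get("mediaType") == MODEL_MEDIA_TYPE:
            layer["digest"] = sha256_digest(new_blob)
            layer["size"] = new_blob.stat().st_size
    return updated


if __name__ == "__main__":
    print(parse_tag("hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M"))
```

As with the bash `%` and `##` operators, a tag with no colon passes through `parse_tag` unchanged in all three positions.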
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B - install scripts/check.sh as a git pre-commit hook.
+# Thanatos-Heretic-27B - install scripts/check.sh as a git pre-commit hook.
 #
 # Idempotent. Re-runs are safe.
 set -euo pipefail
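One way to get the "re-runs are safe" property the installer advertises is to make the hook a relative symlink and treat an already-correct link as success. The Python sketch below assumes the hook lives at `.git/hooks/pre-commit` and should point at `scripts/check.sh`; `install_hook` is a hypothetical name, and the real script may copy or wrap `check.sh` instead of linking it.

```python
import os
from pathlib import Path


def install_hook(repo_root: Path) -> str:
    """Hypothetical helper: link .git/hooks/pre-commit to scripts/check.sh, idempotently."""
    hook = repo_root / ".git" / "hooks" / "pre-commit"
    target = repo_root / "scripts" / "check.sh"
    if not target.is_file():
        raise FileNotFoundError(target)
    # Relative link so moving the checkout does not break the hook.
    rel_target = os.path.relpath(target, hook.parent)
    if hook.is_symlink() and os.readlink(hook) == rel_target:
        return "already installed"
    if hook.exists() or hook.is_symlink():
        # Move any foreign hook aside instead of silently overwriting it.
        hook.rename(hook.with_name("pre-commit.bak"))
    hook.parent.mkdir(parents=True, exist_ok=True)
    hook.symlink_to(rel_target)
    return "installed"
```

Calling `install_hook(Path("."))` twice from the repo root returns `"installed"` and then `"already installed"`, which is the idempotence the comment describes.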
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B - load this repo's bundle into Ollama as a local tag.
+# Thanatos-Heretic-27B - load this repo's bundle into Ollama as a local tag.
 #
 # The bundled GGUF (Thanatos-27B.Q4_K_M.gguf) is qwen35-stamped and
 # loads directly on stock llama.cpp / Ollama. This script is the
@@ -15,13 +15,13 @@
 # 3. Run `ollama create <tag> -f <temp Modelfile pointing at the
 # resolved bundle>`.
 #
-# Useful if you want a bare local tag (`thanatos-27b`) rather than
-# the `hf.co/FoolDev/Thanatos-27B` path. The legacy qwen36 rebadge
+# Useful if you want a bare local tag (`thanatos-heretic-27b`) rather than
+# the `hf.co/FoolDev/Thanatos-Heretic-27B` path. The legacy qwen36 rebadge
 # branch is kept for anyone working from a pre-e03e10e checkout.
 #
 # Usage:
-# ./scripts/load_bundle.sh # default tag: thanatos-27b
-# TAG=thanatos-27b-bundle ./scripts/load_bundle.sh
+# ./scripts/load_bundle.sh # default tag: thanatos-heretic-27b
+# TAG=thanatos-heretic-27b-bundle ./scripts/load_bundle.sh
 # BUNDLE=/path/to/Thanatos-27B.Q4_K_M.gguf ./scripts/load_bundle.sh
 #
 # Requires: ollama, python3 with the `gguf` package, hf (if the bundle
@@ -30,8 +30,8 @@ set -euo pipefail
 
 ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
 BUNDLE="${BUNDLE:-${ROOT}/Thanatos-27B.Q4_K_M.gguf}"
-TAG="${TAG:-thanatos-27b}"
-REPO_ID="${REPO_ID:-FoolDev/Thanatos-27B}"
+TAG="${TAG:-thanatos-heretic-27b}"
+REPO_ID="${REPO_ID:-FoolDev/Thanatos-Heretic-27B}"
 MODELFILE="${ROOT}/Modelfile"
 
 red() { printf "\033[31m%s\033[0m\n" "$*"; }
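Step 3 of the loader (write a temporary Modelfile that points at the resolved bundle, then hand it to `ollama create <tag> -f <file>`) can be sketched as below. The sketch also reproduces bash's `${VAR:-default}`, which falls back when the variable is unset *or* empty, via `os.environ.get(name) or default`. `resolve_env` and `build_create_command` are hypothetical names; the sketch only builds the argv and writes a `FROM`-only Modelfile, and does not run Ollama. The real script may put more than a `FROM` line in its Modelfile.

```python
import os
import tempfile
from pathlib import Path


def resolve_env(name: str, default: str) -> str:
    """Emulate bash ${NAME:-default}: use default when NAME is unset or empty."""
    return os.environ.get(name) or default


def build_create_command(root: Path) -> tuple[list[str], Path]:
    """Hypothetical helper: write a throwaway Modelfile and return the `ollama create` argv."""
    bundle = Path(resolve_env("BUNDLE", str(root / "Thanatos-27B.Q4_K_M.gguf")))
    tag = resolve_env("TAG", "thanatos-heretic-27b")
    # Minimal Modelfile: only a FROM line pointing at the resolved bundle.
    fd, name = tempfile.mkstemp(suffix=".Modelfile")
    with os.fdopen(fd, "w") as f:
        f.write(f"FROM {bundle.resolve()}\n")
    modelfile = Path(name)
    return ["ollama", "create", tag, "-f", str(modelfile)], modelfile


if __name__ == "__main__":
    argv, modelfile = build_create_command(Path.cwd())
    print(" ".join(argv))
    # To actually load: subprocess.run(argv, check=True) before removing the file.
    modelfile.unlink()
```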
@@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Thanatos-27B - smoke test against a running Ollama daemon.
+# Thanatos-Heretic-27B - smoke test against a running Ollama daemon.
 #
 # Verifies:
 # 1. The Ollama server is reachable.
@@ -14,11 +14,11 @@
 # Usage:
 # ./scripts/smoke_test.sh # fast checks only
 # TOOLS_TEST=1 ./scripts/smoke_test.sh # add tool-call round-trip
-# MODEL=hf.co/FoolDev/Thanatos-27B:Q4_K_M ./scripts/smoke_test.sh
+# MODEL=hf.co/FoolDev/Thanatos-Heretic-27B:Q4_K_M ./scripts/smoke_test.sh
 # HOST=http://localhost:11434 ./scripts/smoke_test.sh
 set -euo pipefail
 
-MODEL="${MODEL:-thanatos-27b}"
+MODEL="${MODEL:-thanatos-heretic-27b}"
 HOST="${HOST:-http://localhost:11434}"
 PROMPT="${PROMPT:-Reply with the single word: OK}"
 
@@ -46,9 +46,9 @@ green "[+] server reachable"
 
 # 2. Model present? Match case-insensitively: Ollama 0.24 normalizes
 # model names at lookup but preserves whatever case was first registered
-# on disk (e.g. `make load-bundle` may produce `Thanatos-27B:latest`
-# even when invoked with TAG=thanatos-27b, if an earlier session left a
-# Thanatos-27B manifest dir behind). The exact tag the user typed is
+# on disk (e.g. `make load-bundle` may produce `Thanatos-Heretic-27B:latest`
+# even when invoked with TAG=thanatos-heretic-27b, if an earlier session left a
+# Thanatos-Heretic-27B manifest dir behind). The exact tag the user typed is
 # still valid for `ollama run` - the comparison just needs to be
 # case-folded to match.
 if ! curl -fsS "${HOST}/api/tags" | jq -e --arg m "${MODEL}" '.models[] | select((.name | ascii_downcase) | startswith($m | ascii_downcase))' >/dev/null; then
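The jq filter in check 2 lowercases both sides and does a prefix test against every `.models[].name` in the `/api/tags` response. A Python port of just that predicate, assuming the response body has already been fetched as a JSON string (`model_present` is a hypothetical name):

```python
import json


def ascii_downcase(s: str) -> str:
    """Lowercase ASCII letters only, matching jq's ascii_downcase."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def model_present(tags_json: str, model: str) -> bool:
    """Hypothetical port of the smoke test's jq select over /api/tags."""
    wanted = ascii_downcase(model)
    models = json.loads(tags_json).get("models", [])
    return any(ascii_downcase(m.get("name", "")).startswith(wanted) for m in models)
```

Because the comparison is `startswith`, `MODEL=thanatos-heretic-27b` is also satisfied by a tag such as `thanatos-heretic-27b-bundle:latest`; passing the full `name:tag` narrows the check.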
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thanatos-27B - verify the README "Architecture" forward-pass bullets
+Thanatos-Heretic-27B - verify the README "Architecture" forward-pass bullets
 against the actual GGUF metadata.
 
 Reads either the qwen35- or qwen36-stamped bundle (or any GGUF that
@@ -69,8 +69,8 @@ def main() -> int:
         return 2
     root = Path(__file__).resolve().parent.parent
     default_paths = [
-        root / "Thanatos-27B.Q4_K_M.qwen35.gguf",
-        root / "Thanatos-27B.Q4_K_M.qwen36.gguf",
+        root / "Thanatos-Heretic-27B.Q4_K_M.qwen35.gguf",
+        root / "Thanatos-Heretic-27B.Q4_K_M.qwen36.gguf",
         root / "Thanatos-27B.Q4_K_M.gguf",
     ]
     if len(sys.argv) == 2:
@@ -78,7 +78,7 @@ def main() -> int:
     else:
         path = next((p for p in default_paths if p.exists() and p.stat().st_size > 1024), None)
         if path is None:
-            print("[!] no Thanatos-27B GGUF found in repo root; pass a path explicitly", file=sys.stderr)
+            print("[!] no Thanatos-Heretic-27B GGUF found in repo root; pass a path explicitly", file=sys.stderr)
             return 2
 
     print(f"[*] reading: {path}")
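The verifier uses the `gguf` package, but the field that distinguishes a qwen35- from a qwen36-stamped bundle, `general.architecture`, sits in the GGUF header and can be read with the standard library alone. A minimal sketch, assuming the little-endian GGUF v2/v3 layout (magic `GGUF`, uint32 version, uint64 tensor count, uint64 key/value count, then typed key/value pairs). `read_architecture` and `heal_action` are hypothetical helpers; `heal_action` only restates the heal script's documented rule (qwen35 / qwen35moe left untouched, qwen36 rebadged).

```python
import struct
from typing import BinaryIO, Optional

# GGUF metadata value types: fixed-width scalars mapped to struct codes.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i", 6: "f", 7: "?",
            10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, fmt: str):
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack("<" + fmt, data)[0]


def _read_str(f: BinaryIO) -> str:
    # GGUF strings are a uint64 byte length followed by UTF-8 bytes.
    return f.read(_read(f, "Q")).decode("utf-8")


def _read_value(f: BinaryIO, vtype: int):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_str(f)
    if vtype == _ARRAY:
        elem_type, count = _read(f, "I"), _read(f, "Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f: BinaryIO) -> Optional[str]:
    """Return general.architecture from a GGUF stream, or None if absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _read(f, "I") < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; not handled here")
    _read(f, "Q")  # tensor count, unused here
    for _ in range(_read(f, "Q")):
        key = _read_str(f)
        value = _read_value(f, _read(f, "I"))
        if key == "general.architecture":
            return value
    return None


def heal_action(arch: Optional[str]) -> str:
    if arch in ("qwen35", "qwen35moe"):
        return "skip"
    if arch == "qwen36":
        return "rebadge"
    return "unknown"
```

Usage: `with open("Thanatos-27B.Q4_K_M.gguf", "rb") as f: print(heal_action(read_architecture(f)))`. Since `general.architecture` is conventionally the first key, the loop normally returns before reaching large arrays such as the tokenizer vocabulary.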