Instructions to use FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use FoolDev/Thanatos-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")

llama-cpp-python

How to use FoolDev/Thanatos-27B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="FoolDev/Thanatos-27B",
    filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use FoolDev/Thanatos-27B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

LM Studio
Jan

vLLM

How to use FoolDev/Thanatos-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
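
The same OpenAI-compatible request can also be built from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server from the step above is listening on localhost:8000, and the helper names `build_chat_request` and `send` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is reachable at this address.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "FoolDev/Thanatos-27B"


def build_chat_request(prompt, image_url, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send a prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("Describe this image in one sentence.", "<image URL>"))` should return the usual chat-completion JSON, with the reply text under `choices[0].message.content`. Building the request separately from sending it keeps the payload easy to inspect without a live server.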

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

SGLang

How to use FoolDev/Thanatos-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FoolDev/Thanatos-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use FoolDev/Thanatos-27B with Ollama:
```
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Unsloth Studio

How to use FoolDev/Thanatos-27B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Pi

How to use FoolDev/Thanatos-27B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FoolDev/Thanatos-27B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FoolDev/Thanatos-27B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
```
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Lemonade

How to use FoolDev/Thanatos-27B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M

Run and chat with the model

lemonade run user.Thanatos-27B-Q4_K_M

List all available models

lemonade list

FoolDev and Claude Opus 4.7 committed 10 days ago

Commit 2b2ba03, 1 parent: 25d5454

docs: fix four README defects surfaced by fresh-eyes audit

1. **"Local apps" llama.cpp row recipe was broken.** The cell told
users to `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf`
then `llama-server -m Thanatos-27B.Q4_K_M.gguf` — but that bundle
is qwen36-stamped post-973d7ef, so llama-server immediately
errors with `unknown model architecture: 'qwen36'`. The intro
paragraph above the table noted the qwen36 caveat but the recipe
itself didn't honor it. Rewrote to either rebadge in place via
scripts/rename_arch.py OR (cleaner) pull the qwen35-stamped
`Qwen3.6-27B-Q4_K_M.gguf` from unsloth directly.

2. **History date wrong.** Said `v0.6.0 (e1f78fa, 2026-05-18)` but
`git log e1f78fa` shows 2026-05-19 14:38 UTC. Corrected.

3. **History flip count wrong.** Said "flipped between the two
stamps three times" — actual count is five stamp changes (three
landings on qwen36 at e1f78fa / 07fa120 / 973d7ef, two on
qwen35 at 964e418 / 72259c1). Split the round-trip bullet into
its two constituent flips and corrected the lede.

4. **examples/README.md heal-hf framing stale.** Said "If you
pulled before commit `964e418`" — that text predated the 3rd
round trip (973d7ef). Current state: every fresh `ollama pull`
of this repo's bundle needs `make heal-hf`. Rewrote to say so
and point at the main README's Stamp choice section.
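
The corrected count in item 3 can be checked mechanically. A small sketch that replays the five commits named above in order; the starting stamp of qwen35 is taken from the "initial qwen35 → qwen36" wording in the History section, and the names `STAMP_HISTORY` and `count_stamp_changes` are illustrative only.

```python
from collections import Counter

# (commit, stamp the bundle carried after that commit), oldest first.
STAMP_HISTORY = [
    ("e1f78fa", "qwen36"),
    ("964e418", "qwen35"),
    ("07fa120", "qwen36"),
    ("72259c1", "qwen35"),
    ("973d7ef", "qwen36"),
]


def count_stamp_changes(initial_stamp, history):
    """Count how many commits actually changed the stamp."""
    changes = 0
    current = initial_stamp
    for _commit, stamp in history:
        if stamp != current:
            changes += 1
            current = stamp
    return changes


total_changes = count_stamp_changes("qwen35", STAMP_HISTORY)
landings = Counter(stamp for _commit, stamp in STAMP_HISTORY)
current_stamp = STAMP_HISTORY[-1][1]

print(total_changes, dict(landings), current_stamp)
```

This prints five changes, three landings on qwen36 and two on qwen35, with qwen36 as the current stamp.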

Bug class: stale text the round-trip residue cleanup (c1c4dfd) and
the qwen36 re-alignment (cee14f4) didn't fully sweep. None blocked
working users, but the llama.cpp recipe in particular would have
sent anyone trying that path straight into the qwen36 error.
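
llama.cpp picks its loader from the `general.architecture` key in the GGUF metadata, so the stamp can be checked before a file is handed to `llama-server`. The following is a minimal sketch of such a check, assuming a little-endian GGUF v2 or v3 file; `read_architecture` is a hypothetical helper written for illustration and is not the repo's `scripts/rename_arch.py`.

```python
import os
import struct

GGUF_MAGIC = b"GGUF"
# Byte sizes of the fixed-width GGUF metadata value types (type id -> size).
_FIXED_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_u32(f):
    return struct.unpack("<I", f.read(4))[0]


def _read_u64(f):
    return struct.unpack("<Q", f.read(8))[0]


def _read_string(f):
    # GGUF strings are a uint64 byte length followed by UTF-8 bytes.
    length = _read_u64(f)
    return f.read(length).decode("utf-8")


def _skip_value(f, value_type):
    """Advance the file position past one metadata value."""
    if value_type in _FIXED_SIZES:
        f.seek(_FIXED_SIZES[value_type], os.SEEK_CUR)
    elif value_type == _STRING:
        f.seek(_read_u64(f), os.SEEK_CUR)
    elif value_type == _ARRAY:
        item_type = _read_u32(f)
        count = _read_u64(f)
        if item_type in _FIXED_SIZES:
            f.seek(_FIXED_SIZES[item_type] * count, os.SEEK_CUR)
        else:
            for _ in range(count):
                _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {value_type}")


def read_architecture(path):
    """Return the general.architecture string, or None if the key is absent."""
    with open(path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError("not a GGUF file")
        version = _read_u32(f)
        if version < 2:
            raise ValueError("GGUF v1 uses 32-bit lengths; not handled here")
        _read_u64(f)  # tensor count, unused
        kv_count = _read_u64(f)
        for _ in range(kv_count):
            key = _read_string(f)
            value_type = _read_u32(f)
            if key == "general.architecture":
                if value_type != _STRING:
                    raise ValueError("general.architecture is not a string")
                return _read_string(f)
            _skip_value(f, value_type)
    return None
```

Calling `read_architecture("Thanatos-27B.Q4_K_M.gguf")` on the bundled file would be expected to return `qwen36` in the current state described above; a result of `qwen35` indicates a file that stock llama.cpp builds should accept. The sketch only reads the header and never modifies the file.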

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (2)

README.md +21 -20
examples/README.md +8 -4

README.md CHANGED

@@ -251,26 +251,27 @@ the tensor data is byte-identical across both stamps.
 ### History
-The bundle has now flipped between the two stamps three times,
-each time after weighing the friction-vs-honesty tradeoff anew:
-- **v0.6.0 (e1f78fa, 2026-05-18):** initial qwen35 → qwen36
-  stamp, on the theory that qwen35 was a loader stand-in
-  awaiting proper Qwen 3.6 support. Upstream audit later
-  showed that theory was mistaken (see above).
-- **2026-05-19 morning (964e418):** flipped back to qwen35
+The bundle has now changed stamps five times (three landings on
+qwen36, two on qwen35), each time after weighing the
+friction-vs-honesty tradeoff anew:
+- **v0.6.0-era (`e1f78fa`, 2026-05-19 14:38 UTC):** initial qwen35
+  → qwen36 stamp, on the theory that qwen35 was a loader stand-in
+  awaiting proper Qwen 3.6 support. Upstream audit later showed
+  that theory was mistaken (see above).
+- **2026-05-19 afternoon (`964e418`):** flipped back to qwen35
   after daily friction outweighed version-specificity for that
-  iteration; doc workaround narrative collapsed
-  (`83022eb`).
-- **2026-05-19 evening (07fa120, reverted `72259c1`):** brief
-  ~1-hour re-flip to qwen36 during a fresh-pull integration
-  test; reverted because the live friction was worse than the
-  doc prose suggested.
-- **2026-05-19 evening, again (`973d7ef`):** flipped to qwen36
-  one more time, after the upstream-evidence audit had been
-  shipped and the friction was a known quantity. Project owner
-  prefers the version-specific stamp despite the audit
-  conclusion. **This is the current state.**
+  iteration; doc workaround narrative collapsed (`83022eb`).
+- **2026-05-19 evening (`07fa120`):** brief re-flip to qwen36
+  during a fresh-pull integration test on Strix Halo.
+- **2026-05-19 evening (`72259c1`, ~1 hour later):** reverted to
+  qwen35 again because the live friction was worse than the doc
+  prose suggested.
+- **2026-05-19 evening (`973d7ef`):** flipped to qwen36 one more
+  time, after the upstream-evidence audit had been shipped and
+  the friction was a known quantity. Project owner prefers the
+  version-specific stamp despite the audit conclusion. **This
+  is the current state.**
 Tensor data was byte-identical across all stamps; only the
 `general.architecture` KV (and namespaced KV keys) flipped.
@@ -350,7 +351,7 @@ caveat applies to every row until that's done.
 | **Ollama** | `ollama run hf.co/FoolDev/Thanatos-27B` (default Q4_K_M). Pulls the GGUF + the root-level `template` / `system` / `params` files in one step (HF's Ollama bridge ingests these three files; it does **not** read `Modelfile`); fails on first inference, then `make heal-hf` rebadges the cached blob. For other quants, or to bypass the qwen36 block entirely, `make build QUANT=Q3_K_S` downloads from unsloth (qwen35-stamped) and creates a local Ollama tag using the `Modelfile`, which is kept in sync with the bridge files. |
 | **LM Studio** | Search → `FoolDev/Thanatos-27B` → pick `Thanatos-27B.Q4_K_M.gguf`. Uses the GGUF's embedded jinja chat template (Qwen 3.6 ChatML); set the system prompt manually from the `SYSTEM` block in this repo's `Modelfile`. |
 | **Jan** | Hub → "Import from Hugging Face" → `FoolDev/Thanatos-27B`. Same template behavior as LM Studio. |
-| **llama.cpp** | `hf download FoolDev/Thanatos-27B Thanatos-27B.Q4_K_M.gguf --local-dir .` then `llama-server -m Thanatos-27B.Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
+| **llama.cpp** | The bundled GGUF is qwen36-stamped, so `llama-server -m Thanatos-27B.Q4_K_M.gguf` errors with `unknown model architecture: 'qwen36'`. Either rebadge first (`python3 scripts/rename_arch.py --from-arch qwen36 --to-arch qwen35 Thanatos-27B.Q4_K_M.gguf Thanatos-27B.Q4_K_M.qwen35.gguf`), or — cleaner — `hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir .` to get the qwen35-stamped GGUF directly, then `llama-server -m Qwen3.6-27B-Q4_K_M.gguf` (or `llama-cli`, `llama-mtmd-cli` for vision via the upstream `mmproj-F16.gguf`). |
 | **llama-cpp-python** | See `examples/llama_cpp_quickstart.py` (text) and `examples/llama_cpp_vision.py` (image input). |
 | **Open WebUI / KoboldCpp / text-generation-webui** | Standard llama.cpp loader path — point at the GGUF, use the embedded chat template. |

examples/README.md CHANGED

@@ -29,10 +29,14 @@ pip install requests
 MODEL=hf.co/FoolDev/Thanatos-27B python ollama_chat.py
 ```
-If you pulled before commit `964e418` (the qwen35 re-stamp) and
-still have the broken qwen36 blob in your Ollama store, run
-`cd .. && make heal-hf` once to rebadge it in place. Fresh pulls
-after the re-stamp go straight through.
+The bundled GGUF is currently `qwen36`-stamped (HF commit
+`973d7ef`), so the `ollama pull` above fails on first inference
+with `unable to load model`. Run `cd .. && make heal-hf` once to
+rebadge the cached blob in place (qwen36 → qwen35, metadata-only,
+~5 s) — the same tag then loads. Every fresh `ollama pull` of
+this repo's bundle needs the heal step until the project flips
+back to qwen35 (see the main README's "Stamp choice" section for
+why this repo stamps qwen36 deliberately).
 For a non-bundled quant (e.g. Q3_K_S ~12 GB, Q5_K_M ~20 GB),
 `make build QUANT=...` downloads from `unsloth/Qwen3.6-27B-GGUF`