Instructions for using FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use FoolDev/Thanatos-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")

llama-cpp-python

How to use FoolDev/Thanatos-27B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FoolDev/Thanatos-27B",
	filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)
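
The call above returns its result rather than printing it. As a minimal sketch, assuming the non-streaming return value is an OpenAI-style dict with a `choices` list (the helper name `first_reply_text` and the sample response are illustrative, not part of llama-cpp-python), the assistant text can be pulled out like this:

```python
from typing import Any


def first_reply_text(response: dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    # `content` may be missing or None (for example on a tool call).
    return message.get("content") or ""


# Hypothetical response, shaped like an OpenAI-style chat completion.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "A statue on an island."}}
    ]
}
print(first_reply_text(sample))
```

In practice the argument would be the value returned by `llm.create_chat_completion(...)`.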


llama.cpp

How to use FoolDev/Thanatos-27B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M


vLLM

How to use FoolDev/Thanatos-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
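
The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl body above; the function names `build_chat_payload` and `send_chat` are hypothetical, and the response parsing assumes the server answers in the usual OpenAI chat-completion shape.

```python
import json
import urllib.request
from typing import Any, Optional


def build_chat_payload(
    model: str, prompt: str, image_url: Optional[str] = None
) -> dict[str, Any]:
    """Build an OpenAI-style chat request body with an optional image part."""
    content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def send_chat(base_url: str, payload: dict[str, Any], timeout: float = 600.0) -> str:
    """POST the payload to <base_url>/v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Build (but do not send) a text-only request for inspection.
payload = build_chat_payload(
    "FoolDev/Thanatos-27B", "Describe this image in one sentence."
)
print(json.dumps(payload, indent=2))
```

With the server running, `send_chat("http://localhost:8000", payload)` would perform the call; pass an `image_url` to `build_chat_payload` to include the image part.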

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

SGLang

How to use FoolDev/Thanatos-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FoolDev/Thanatos-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use FoolDev/Thanatos-27B with Ollama:
```
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Unsloth Studio

How to use FoolDev/Thanatos-27B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Pi

How to use FoolDev/Thanatos-27B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FoolDev/Thanatos-27B:Q4_K_M"
        }
      ]
    }
  }
}
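
If `models.json` already lists other providers, the entry has to be merged rather than pasted over the file. The following is a rough sketch of that merge on an in-memory dict; the helper name `add_llama_cpp_provider` and the `other` provider are made up for illustration, and it assumes the file is a JSON object with a top-level `providers` map as shown above.

```python
import json
from typing import Any


def add_llama_cpp_provider(
    config: dict[str, Any],
    model_id: str,
    base_url: str = "http://localhost:8080/v1",
) -> dict[str, Any]:
    """Return a copy of a models.json dict with the llama-cpp provider set."""
    merged = dict(config)
    # Copy the providers map so the caller's dict is left untouched.
    providers = dict(merged.get("providers", {}))
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    merged["providers"] = providers
    return merged


# Hypothetical existing config with one unrelated provider.
existing = {"providers": {"other": {"baseUrl": "http://example.invalid/v1"}}}
updated = add_llama_cpp_provider(existing, "FoolDev/Thanatos-27B:Q4_K_M")
print(json.dumps(updated, indent=2))
```

The printed JSON can then be written back to `~/.pi/agent/models.json`.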

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FoolDev/Thanatos-27B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
```
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Lemonade

How to use FoolDev/Thanatos-27B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M

Run and chat with the model

lemonade run user.Thanatos-27B-Q4_K_M

List all available models

lemonade list

Thanatos-27B/examples/ollama_chat.py

FoolDev

Rename back: Thanatos-27B-Heretic → Thanatos-27B (HF repo also renamed)

7197abd 12 days ago


	#!/usr/bin/env python3
	"""
	Thanatos-27B — Ollama chat examples.

	Prerequisites (pick one):

	A. From the bundled GGUFs (default flow):
	$ make build # uses Thanatos-27B.Q4_K_M.gguf
	# or:
	$ ollama create thanatos-27b -f ../Modelfile

	B. Pull straight from HF (Q4_K_M is the only bundled quant):
	$ ollama run hf.co/FoolDev/Thanatos-27B
	# then set MODEL=hf.co/FoolDev/Thanatos-27B below

	Then:
	$ ollama serve # usually already running
	$ python ollama_chat.py

	The model emits <think>...</think> reasoning blocks before its answer.
	Current Ollama (0.24, especially with `OLLAMA_NEW_ENGINE=1`) returns the
	reasoning in a separate `message.thinking` field and keeps `content`
	clean. Older builds put the whole `<think>...</think>` block inside
	`content`. The demo below reads `message.thinking` first and falls
	back to parsing `<think>` tags out of `content` so it works against
	either path.

	Endpoints used:
	- Native Ollama: http://localhost:11434/api/chat
	- OpenAI-compat: http://localhost:11434/v1/chat/completions
	"""
	from __future__ import annotations

	import json
	import os
	import re
	import sys
	from typing import Any, Iterator

	import requests

	MODEL = os.environ.get("MODEL", "thanatos-27b")
	HOST = os.environ.get("HOST", "http://localhost:11434")

	_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


	def split_thinking(content: str) -> tuple[str, str]:
	    """Return (thinking, final_answer) from a content string."""
	    parts = re.findall(r"<think>(.*?)</think>", content, re.DOTALL)
	    thinking = "\n".join(p.strip() for p in parts).strip()
	    answer = _THINK_RE.sub("", content).strip()
	    return thinking, answer


	# ---------- 1. Simple chat ----------

	def chat(prompt: str, system: str | None = None) -> dict[str, Any]:
	    """Send one non-streaming request to the native Ollama chat endpoint."""
	    msgs: list[dict[str, Any]] = []
	    if system:
	        msgs.append({"role": "system", "content": system})
	    msgs.append({"role": "user", "content": prompt})
	    r = requests.post(
	        f"{HOST}/api/chat",
	        json={"model": MODEL, "messages": msgs, "stream": False},
	        timeout=600,
	    )
	    r.raise_for_status()
	    return r.json()


	# ---------- 2. Streaming ----------

	def chat_stream(prompt: str) -> Iterator[str]:
	    """Yield content tokens as they arrive."""
	    with requests.post(
	        f"{HOST}/api/chat",
	        json={
	            "model": MODEL,
	            "messages": [{"role": "user", "content": prompt}],
	            "stream": True,
	        },
	        stream=True,
	        timeout=600,
	    ) as r:
	        r.raise_for_status()
	        for line in r.iter_lines():
	            if not line:
	                continue
	            chunk = json.loads(line)
	            if "message" in chunk and "content" in chunk["message"]:
	                yield chunk["message"]["content"]
	            if chunk.get("done"):
	                break


	# ---------- 3. Tool calling ----------

	WEATHER_TOOL = {
	    "type": "function",
	    "function": {
	        "name": "get_current_weather",
	        "description": "Get the current weather in a given city",
	        "parameters": {
	            "type": "object",
	            "properties": {
	                "city": {"type": "string", "description": "City name"},
	                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
	            },
	            "required": ["city", "unit"],
	        },
	    },
	}


	def fake_weather(city: str, unit: str) -> str:
	    """Stand-in tool implementation."""
	    return json.dumps(
	        {"city": city, "temperature": 14, "unit": unit, "conditions": "light rain"}
	    )


	def tool_round_trip(prompt: str) -> str:
	    """Single-shot tool call: model -> tool -> model -> final answer."""
	    history: list[dict[str, Any]] = [{"role": "user", "content": prompt}]
	    r = requests.post(
	        f"{HOST}/api/chat",
	        json={
	            "model": MODEL,
	            "messages": history,
	            "tools": [WEATHER_TOOL],
	            "stream": False,
	        },
	        timeout=600,
	    )
	    r.raise_for_status()
	    msg = r.json()["message"]

	    # No tool requested: the first reply is already the final answer.
	    if not msg.get("tool_calls"):
	        return msg["content"]

	    history.append({"role": "assistant", "tool_calls": msg["tool_calls"]})
	    for tc in msg["tool_calls"]:
	        fn = tc["function"]
	        if fn["name"] == "get_current_weather":
	            result = fake_weather(**fn["arguments"])
	        else:
	            result = json.dumps({"error": f"unknown tool {fn['name']}"})
	        history.append({"role": "tool", "tool_name": fn["name"], "content": result})

	    r = requests.post(
	        f"{HOST}/api/chat",
	        json={
	            "model": MODEL,
	            "messages": history,
	            "tools": [WEATHER_TOOL],
	            "stream": False,
	        },
	        timeout=600,
	    )
	    r.raise_for_status()
	    return r.json()["message"]["content"]


	# ---------- 4. OpenAI-compatible endpoint ----------

	def openai_chat(prompt: str) -> str:
	    """Send one request through the OpenAI-compatible endpoint."""
	    r = requests.post(
	        f"{HOST}/v1/chat/completions",
	        json={
	            "model": MODEL,
	            "messages": [{"role": "user", "content": prompt}],
	            "temperature": 0.6,
	        },
	        timeout=600,
	    )
	    r.raise_for_status()
	    return r.json()["choices"][0]["message"]["content"]


	# ---------- demo ----------

	def _demo() -> None:
	    print("=== 1. simple chat ===")
	    resp = chat("What is 84 * 3 / 2?")
	    msg = resp["message"]
	    # Prefer the dedicated `thinking` field (Ollama 0.24+ / new engine);
	    # fall back to extracting <think>...</think> from `content` for
	    # older builds that inline the reasoning.
	    thinking = (msg.get("thinking") or "").strip()
	    answer = msg.get("content", "")
	    if not thinking:
	        thinking, answer = split_thinking(answer)
	    if thinking:
	        print(f"[thinking] {thinking[:200]}...")
	    print(f"[answer] {answer}")

	    print("\n=== 2. streaming ===")
	    for tok in chat_stream("Count from 1 to 5 in one line."):
	        sys.stdout.write(tok)
	        sys.stdout.flush()
	    print()

	    print("\n=== 3. tool round-trip ===")
	    print(tool_round_trip("What is the weather in Paris in celsius?"))

	    print("\n=== 4. OpenAI-compat ===")
	    print(openai_chat("Say 'OpenAI endpoint OK' and nothing else."))


	if __name__ == "__main__":
	    _demo()
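
The two reasoning formats described in the module docstring can be checked without a running Ollama server. The sketch below is a standalone variant of that logic, not the file's own code: `extract_reasoning` is a hypothetical helper, and both sample messages are invented to mimic a build that fills `message.thinking` and one that inlines `<think>` tags in `content`.

```python
import re
from typing import Any

# Matches one <think>...</think> block plus any whitespace that follows it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def extract_reasoning(message: dict[str, Any]) -> tuple[str, str]:
    """Return (thinking, answer), preferring the dedicated `thinking` field."""
    thinking = (message.get("thinking") or "").strip()
    content = message.get("content") or ""
    if thinking:
        return thinking, content.strip()
    # Fallback: the reasoning is inlined in `content` as <think> blocks.
    parts = THINK_BLOCK.findall(content)
    thinking = "\n".join(p.strip() for p in parts).strip()
    answer = THINK_BLOCK.sub("", content).strip()
    return thinking, answer


# Invented messages in the two shapes the docstring describes.
new_style = {"thinking": "84 * 3 = 252; 252 / 2 = 126.", "content": "126"}
old_style = {"content": "<think>84 * 3 = 252; 252 / 2 = 126.</think>\n\n126"}

print(extract_reasoning(new_style))
print(extract_reasoning(old_style))
```

Both calls yield the same `(thinking, answer)` pair, which is the property the demo relies on when it runs against either kind of build.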