Instructions to use FoolDev/Thanatos-27B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use FoolDev/Thanatos-27B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FoolDev/Thanatos-27B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FoolDev/Thanatos-27B", dtype="auto")

llama-cpp-python

How to use FoolDev/Thanatos-27B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FoolDev/Thanatos-27B",
	filename="Thanatos-27B.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)
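The llama-cpp-python, vLLM and SGLang snippets on this page all send the same OpenAI-style message shape: a `content` list that mixes a `text` part and an `image_url` part. Below is a small helper that keeps that shape consistent across clients. `image_chat_message` is a hypothetical name and is not part of any of these libraries.

Loading a GGUF with `Llama.from_pretrained` alone may ignore the image part. llama-cpp-python generally needs a multimodal chat handler and a vision projector (mmproj) file to read images, and the README notes that this repository ships no mmproj.

```python
from typing import Any


def image_chat_message(text: str, image_url: str, role: str = "user") -> dict[str, Any]:
    """Build one OpenAI-style chat message holding a text part and an image part."""
    return {
        "role": role,
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Usage: pass [image_chat_message("Describe this image in one sentence.", "<image URL>")]
# as `messages` to llm.create_chat_completion(...) or in the JSON body sent to vLLM/SGLang.
```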

Notebooks
Google Colab
Kaggle

llama.cpp

How to use FoolDev/Thanatos-27B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FoolDev/Thanatos-27B:Q4_K_M

Use Docker

docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M

LM Studio
Jan

vLLM

How to use FoolDev/Thanatos-27B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FoolDev/Thanatos-27B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
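The same request can be sent from Python without extra dependencies. Below is a minimal stdlib-only sketch. It assumes the vLLM server above is listening on `localhost:8000`; the SGLang server later on this page listens on port 30000 and accepts the same body. The helper names are hypothetical. `first_reply` reads the same field as the `jq -r '.choices[0].message.content'` filter in the README.

```python
import json
import urllib.request
from typing import Any


def build_chat_request(base_url: str, model: str, messages: list[dict[str, Any]],
                       **params: Any) -> urllib.request.Request:
    """Prepare (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": messages, **params}  # params: temperature, max_tokens, ...
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response: dict[str, Any]) -> str:
    """Read the same field as the README's jq filter: .choices[0].message.content."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, messages: list[dict[str, Any]], **params: Any) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    req = build_chat_request(base_url, model, messages, **params)
    with urllib.request.urlopen(req, timeout=600) as resp:  # long timeout: large-model decoding can be slow
        return first_reply(json.load(resp))


# Usage, with `vllm serve "FoolDev/Thanatos-27B"` running (use port 30000 for SGLang):
# print(chat("http://localhost:8000", "FoolDev/Thanatos-27B",
#            [{"role": "user", "content": "Say hello."}], temperature=0.6))
```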


SGLang

How to use FoolDev/Thanatos-27B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FoolDev/Thanatos-27B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FoolDev/Thanatos-27B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FoolDev/Thanatos-27B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
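The README's chat-template notes say that reasoning traces are wrapped in `<think>...</think>`. Unless the server is configured to return reasoning in a separate field, the returned `content` may still contain that trace before the answer. Below is a minimal client-side sketch, assuming the tags appear literally in the text. `split_reasoning` is a hypothetical helper. The closing-tag-only branch covers templates that open `<think>` inside the prompt, so the model generates only the closing tag.

```python
import re

# One <think>...</think> block, including newlines inside it (non-greedy).
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no think block is present."""
    if "<think>" not in text and "</think>" in text:
        # Opening tag was part of the prompt, so only the closing tag was generated.
        head, _, tail = text.partition("</think>")
        return head.strip(), tail.strip()
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```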

Ollama
How to use FoolDev/Thanatos-27B with Ollama:
```
ollama run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```
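Pulling the model via `hf.co/...` does not read the Modelfile committed to this repository, so its sampling settings may not apply. Ollama's native `/api/chat` endpoint accepts per-request `options`, which lets you pass those values explicitly. Below is a sketch that builds such a request body. The numbers come from the Modelfile and the README "Recommended sampling" table. `PRESETS` and `ollama_chat_body` are hypothetical names.

```python
from typing import Any

# Values from this repo's Modelfile / README sampling table.
PRESETS: dict[str, dict[str, float]] = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def ollama_chat_body(model: str, prompt: str, preset: str = "reasoning",
                     num_ctx: int = 16384) -> dict[str, Any]:
    """JSON body for POST http://localhost:11434/api/chat with explicit sampling options."""
    options = {**PRESETS[preset], "num_ctx": num_ctx}  # copy, so PRESETS stays unchanged
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of streamed chunks
        "options": options,
    }


# Usage: ollama_chat_body("hf.co/FoolDev/Thanatos-27B:Q4_K_M", "Hello", preset="creative")
```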

Unsloth Studio

How to use FoolDev/Thanatos-27B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Using Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for FoolDev/Thanatos-27B to start chatting

Pi

How to use FoolDev/Thanatos-27B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FoolDev/Thanatos-27B:Q4_K_M"
        }
      ]
    }
  }
}
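If `~/.pi/agent/models.json` already lists other providers, pasting the block over the file would drop them. Below is a minimal sketch that merges the entry instead. It assumes the file is plain JSON with a top-level `providers` object, as in the snippet above. The helper names are hypothetical, and Pi's schema is taken from that snippet rather than verified.

```python
import json
from pathlib import Path
from typing import Any

# Entry copied from the snippet above.
PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "FoolDev/Thanatos-27B:Q4_K_M"}],
}


def merge_provider(config: dict[str, Any], name: str, provider: dict[str, Any]) -> dict[str, Any]:
    """Add or update one provider in place without dropping others; skips duplicate model ids."""
    existing = config.setdefault("providers", {}).setdefault(name, {})
    known = {m["id"] for m in existing.get("models", [])}
    for key, value in provider.items():
        if key != "models":
            existing[key] = value
    existing.setdefault("models", [])
    existing["models"] += [dict(m) for m in provider["models"] if m["id"] not in known]
    return config


def update_file(path: Path = Path.home() / ".pi" / "agent" / "models.json") -> None:
    """Read the config (if any), merge the llama-cpp provider, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    merge_provider(config, "llama-cpp", PROVIDER)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
```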

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FoolDev/Thanatos-27B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FoolDev/Thanatos-27B:Q4_K_M
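llama-server has to download and load a ~16 GB Q4_K_M file before Hermes (or Pi) can connect. llama-server exposes a `GET /health` route. In recent builds it answers 503 while the model is still loading and 200 once it is ready. Below is a small stdlib-only poller. It assumes the default `127.0.0.1:8080` address used in the Hermes config; `wait_for_llama_server` is a hypothetical name.

```python
import time
import urllib.request


def wait_for_llama_server(base_url: str = "http://127.0.0.1:8080", timeout_s: float = 600.0,
                          poll_s: float = 2.0) -> bool:
    """Poll GET /health until it returns 200; give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError/HTTPError (e.g. 503 while loading), refused connections and
            # socket timeouts are all OSError subclasses: keep waiting.
            pass
        time.sleep(poll_s)
    return False


# Usage: run this before `hermes`; if it returns False, check the llama-server output.
```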

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FoolDev/Thanatos-27B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use FoolDev/Thanatos-27B with Docker Model Runner:
```
docker model run hf.co/FoolDev/Thanatos-27B:Q4_K_M
```

Lemonade

How to use FoolDev/Thanatos-27B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FoolDev/Thanatos-27B:Q4_K_M

Run and chat with the model

lemonade run user.Thanatos-27B-Q4_K_M

List all available models

lemonade list

FoolDev committed about 1 month ago

Commit b564869 (1 parent: bfe34c3)

Initial release: Janus-27B repo

Sibling distribution package to FoolDev/janus, targeting the dense
Qwen 3.6 27B base instead of the 35B-A3B MoE.

Includes:
- README with arch/hardware/sampling/limitations sections matching the
35B sibling card
- Modelfile that wraps a user-provided Qwen 3.6 27B GGUF
- Tokyo-Night-themed banner (PNG + SVG source) using purple as the
sibling-distinct accent vs the 35B's cyan
- Standard HF .gitattributes for LFS-tracked binary types

This repo does not redistribute weights; users pull from
unsloth/Qwen3.6-27B-GGUF or another community quant.

Files changed (5)

.gitattributes +1 -0
Modelfile +52 -0
README.md +192 -0
banner.png +0 -0
banner.svg +60 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text

Modelfile ADDED

	@@ -0,0 +1,52 @@

+# Janus-27B — Ollama wrapper around Qwen 3.6 27B (dense)
+#
+# This repo does not redistribute weights. Edit the FROM line below to
+# point at a local Qwen 3.6 27B GGUF, then:
+#
+#     ollama create janus-27b -f Modelfile && ollama run janus-27b
+#
+# Recommended GGUF source:
+#     https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
+#
+# Or a community Opus-distilled variant:
+#     https://huggingface.co/rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-GGUF
+#
+# Replace the path below with wherever you keep the GGUF.
+FROM ./Qwen3.6-27B.Q4_K_M.gguf
+# Sampling tuned for reasoning + general use. See README "Recommended sampling"
+# for creative/RP alternatives.
+PARAMETER temperature 0.6
+PARAMETER top_p 0.95
+PARAMETER top_k 20
+PARAMETER repeat_penalty 1.05
+PARAMETER num_ctx 16384
+SYSTEM """You are Janus, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
+Behavior rules:
+- Answer the user's actual request directly.
+- Be accurate, complete, and structured.
+- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
+- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
+- If the user wants creative writing, preserve tone, continuity, and character consistency.
+- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
+- Finish with a usable answer, not just planning."""
+# Hardware notes
+# --------------
+# Qwen 3.6 27B is *dense* — every parameter participates in every forward pass.
+# Q4_K_M GGUF is ~16 GB. Practical footprint:
+#   weights mmap          ~16 GB
+#   compute graph alloc   ~12 GB  (smaller than 35B-A3B because dense ≠ MoE)
+#   KV cache @ 16K ctx     ~1 GB  (with OLLAMA_KV_CACHE_TYPE=q8_0)
+#   total minimum          ~29 GB
+#
+# Working configurations:
+#   ✓ RTX 3090 / 4090 24 GB                     — full Q4 offload, ~25-40 tok/s
+#   ✓ RTX 5090 32 GB                            — full offload at Q5/Q6 quant
+#   ✓ Mac Studio M2/M3 32 GB+ unified           — ~15-25 tok/s
+#   ✓ Linux box with 32 GB+ RAM (CPU-only)      — ~1-3 tok/s
+#   ⚠ ASUS ROG Flow Z13 (32 GB unified)         — borderline, try Q3_K_S quant
+#                                                 (~12 GB) for headroom
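The hardware notes above put the KV cache at 16K context at ~1 GB with a q8_0 cache. The usual estimate is 2 (K and V) × layers that cache K/V × KV heads × head_dim × context length × bytes per element. The sketch below uses the README's Architecture figures (64 layers, 4 KV heads, head_dim 256) and assumes that every layer keeps a full KV cache. Under that assumption it gives about 4 GiB at f16 and about 2.1 GiB at q8_0, roughly double the note's figure. If the upstream model caches K/V in only some of its layers, the real number is lower. Treat the result as an upper bound and compare it with the KV-cache size reported when the model loads.

```python
def kv_cache_bytes(n_kv_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    """K and V tensors: 2 x layers x KV heads x head_dim values per token, times context."""
    return 2 * n_kv_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GiB = 1024 ** 3
# Assumption: all 64 layers (README "Architecture") store a KV cache.
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32)]:  # q8_0 block: 32 int8 values + one f16 scale
    size = kv_cache_bytes(64, 4, 256, 16384, bpe)
    print(f"{name}: {size / GiB:.2f} GiB at 16K context")
```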

README.md ADDED

	@@ -0,0 +1,192 @@

+---
+license: apache-2.0
+base_model:
+  - Qwen/Qwen3.6-27B
+datasets:
+  - crownelius/Creative_Writing_ShareGPT_Enhanced
+  - microsoft/rStar-Coder
+  - peteromallet/dataclaw-peteromallet
+  - crownelius/Opus-4.7-Reasoning
+  - openbmb/UltraData-Math
+  - Crownelius/Crow-Heretic-TeichAI-Unified
+language:
+  - en
+  - zh
+  - ru
+  - es
+  - fr
+  - it
+  - ja
+  - ko
+  - de
+  - ar
+  - tr
+  - pl
+  - sv
+  - nl
+  - he
+  - id
+  - uk
+  - fa
+  - pt
+  - ms
+  - fi
+  - el
+tags:
+  - qwen3_6
+  - dense
+  - conversational
+  - multimodal
+  - agent
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+<img src="https://huggingface.co/FoolDev/janus-27b/resolve/main/banner.png" alt="Janus-27B banner" width="100%" />
+[![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
+[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--27B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-27B)
+[![Architecture](https://img.shields.io/badge/Arch-Dense_27B-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
+[![Sibling](https://img.shields.io/badge/Sibling-Janus--35B-7dcfff?style=flat&labelColor=1a1b26)](https://huggingface.co/FoolDev/janus)
+# Janus-27B
+> **Dense Reasoning. Friendlier Footprint.**
+> *Qwen 3.6 27B (dense) repackaged with Claude Opus 4.7 in the teacher slot.*
+**`Architecture:`** `Qwen 3.6 27B (Dense)` | **`Parameters:`** `27B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled LLM`
+A personal sibling to [`FoolDev/janus`](https://huggingface.co/FoolDev/janus). Same teacher (Claude Opus 4.7), same dataset family, but built on the **dense** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) base instead of the 35B-A3B MoE. Smaller, easier to deploy, no expert-routing surprises.
+## Why a 27B variant?
+The 35B-A3B is a sparse mixture-of-experts model: 35B parameters total but only ~3B active per token. That makes it fast at inference but **memory-hungry at load time** — the full 35B has to live in VRAM/RAM even though only 3B is doing useful work each step.
+The 27B is **dense**: every parameter participates in every forward pass. It's slower per token than 35B-A3B (no sparse advantage), but the working set fits comfortably on commodity GPUs and avoids the MoE-specific load-balance failure modes.
+| | Janus-27B (this) | [Janus-35B](https://huggingface.co/FoolDev/janus) |
+|---|---|---|
+| Architecture | Dense transformer | MoE 256 experts, 8 active |
+| Total params | 27 B | 35 B |
+| Active params per token | 27 B | ~3 B |
+| Layers | 64 | 40 |
+| Hidden size | 5120 | 2048 |
+| Q4_K_M GGUF size | ~16 GB | ~19 GB |
+| Min host memory | ~24 GB | ~38 GB |
+| Multimodal | Yes (vision) | Yes (vision) |
+| Max context | 262 144 | 262 144 |
+## What's here
+| File | Use |
+|---|---|
+| `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
+| `Modelfile` | Ollama wrapper around the upstream Qwen3.6-27B GGUF |
+| `README.md` | This file |
+This repo does **not** redistribute weights. Pull the upstream GGUF from [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) or any other community quant, point the Modelfile at it, and run `ollama create janus-27b -f Modelfile`.
+If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B).
+## Architecture
+- Qwen 3.6 dense, 27B parameters, 64 transformer layers
+- 24 attention heads, 4 KV heads (GQA), head_dim 256
+- Hidden size 5120, intermediate size 17408 (~3.4× ratio)
+- Vocab 248,320 (shared with 35B-A3B sibling)
+- 262k native context, extensible with YaRN
+- Vision + video support via upstream `mmproj` (not in this repo)
+## Quick start
+### Ollama
+A ready-to-use `Modelfile` is included. Edit the `FROM` line to point at your local GGUF copy:
+```bash
+# After pulling unsloth/Qwen3.6-27B-GGUF or another quant locally:
+ollama create janus-27b -f Modelfile && ollama run janus-27b
+```
+### Inference (OpenAI-compatible)
+```bash
+curl -s http://localhost:11434/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "janus-27b",
+    "messages": [
+      {"role": "system", "content": "You are Janus, a precise reasoning assistant."},
+      {"role": "user", "content": "Explain the Burrows-Wheeler transform in 200 words."}
+    ],
+    "temperature": 0.6
+  }' | jq -r '.choices[0].message.content'
+```
+### Recommended sampling
+| Use | temp | top_p | top_k | repeat_penalty |
+|---|---:|---:|---:|---:|
+| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
+| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |
+If it loops inside `<think>` tags, lower the temperature (0.4-0.6) and raise `repeat_penalty` to 1.08.
+### System prompt
+Same as the 35B sibling:
+```text
+You are Janus, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
+Behavior rules:
+- Answer the user's actual request directly.
+- Be accurate, complete, and structured.
+- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
+- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
+- If the user wants creative writing, preserve tone, continuity, and character consistency.
+- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
+- Finish with a usable answer, not just planning.
+```
+## Hardware requirements
+The dense 27B is the easier of the two Janus models to deploy.
+| Hardware | Status |
+|---|---|
+| ≥32 GB RAM (CPU-only) | Works, ~1-3 tok/s |
+| RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
+| RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
+| Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
+| ASUS ROG Flow Z13 (Ryzen AI Max+, 32 GB unified) | Borderline — 16 GB Q4 GGUF + ~16 GB compute graph crowds the 20 GB iGPU pool. Try Q3_K_S (~12 GB) for headroom. |
+## Chat template
+Identical to the 35B sibling — Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` markers, `<think>...</think>` for reasoning traces, XML-style `<tool_call>` for function calling. The template is embedded in the GGUF metadata.
+See the [Janus-35B Chat template section](https://huggingface.co/FoolDev/janus#chat-template) for examples — they apply unchanged here.
+## Known limitations
+- **Slower per token than the 35B-A3B sibling.** All 27B parameters participate in every token, versus ~3B active in the sparse 35B; if you optimize for tokens per second, the MoE wins.
+- **No mmproj in this release.** Same as 35B — fetch upstream for vision input.
+- **Q4_K_M quality loss** is real. Use Q5_K_M or Q6_K if you have the VRAM (~20-22 GB).
+- **No formal evaluation in this card.** Numbers above are estimates.
+## Related models
+| Model | Notes |
+|---|---|
+| [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | Upstream base, safetensors |
+| [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Recommended GGUF source |
+| [FoolDev/janus](https://huggingface.co/FoolDev/janus) | 35B-A3B MoE sibling. More capacity, more memory pressure. |
+| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B starter model when 27B/35B is too heavy |
+## Credits
+- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba)
+- Reasoning teacher: Claude Opus 4.7 (Anthropic)
+- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)
+License inherited from upstream: Apache-2.0.

banner.png ADDED

banner.svg ADDED
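The README's trade-off (dense is slower per token, MoE needs more memory) follows from single-stream decoding being mostly memory-bandwidth bound: each new token reads every active weight roughly once. The back-of-envelope sketch below rests on several assumptions:

- Bytes read per token scale with active parameters at the same bits per weight.
- KV-cache reads and kernel overhead are ignored.
- The 1008 GB/s bandwidth figure (RTX 4090-class) is an assumed input, not a measurement.

The 16 GB and 19 GB sizes come from the README table.

```python
def decode_tok_s_upper_bound(bytes_read_per_token: float, mem_bandwidth_bytes_s: float) -> float:
    """Memory-bound ceiling: each generated token streams the active weights once."""
    return mem_bandwidth_bytes_s / bytes_read_per_token


GB = 1e9
dense_27b = 16 * GB                 # Q4_K_M size from the README table; every weight is active
moe_35b_active = 19 * GB * 3 / 35   # crude: ~3B of 35B params active, same bits per weight
bandwidth = 1008 * GB               # assumption: RTX 4090-class memory bandwidth

print(f"dense 27B ceiling: {decode_tok_s_upper_bound(dense_27b, bandwidth):.0f} tok/s")
print(f"35B-A3B ceiling:   {decode_tok_s_upper_bound(moe_35b_active, bandwidth):.0f} tok/s")
```

Real throughput lands well below these ceilings; the README lists ~25-40 tok/s for the 27B on 24 GB cards. The roughly tenfold ratio between the two ceilings is what makes the 35B-A3B faster per token, even though all 35B of its weights must be resident in memory.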