FoolDev committed on May 2

Commit

6f2884f

1 Parent(s): b564869

Add usage scaffolding: examples, build/smoke scripts, Z13 profile, LICENSE

- examples/: ollama_chat.py, transformers_quickstart.py (4-bit bnb),
llama_cpp_quickstart.py, plus a README explaining when to use each
- scripts/build.sh: one-shot GGUF pull + 'ollama create' for default and
z13 profiles; scripts/smoke_test.sh: server/model/round-trip check
- Modelfile.z13: Q3_K_S, 8K ctx, FA + q8_0 KV cache profile that fits in
the Ryzen AI Max+ unified pool
- LICENSE (Apache-2.0), CITATION.cff, .gitignore
- README updated to reference the new files and replace the 'borderline'
Z13 caveat with the working profile

Files changed (11)

.gitignore +17 -0
CITATION.cff +34 -0
LICENSE +201 -0
Modelfile.z13 +49 -0
README.md +25 -5
examples/README.md +49 -0
examples/llama_cpp_quickstart.py +84 -0
examples/ollama_chat.py +193 -0
examples/transformers_quickstart.py +119 -0
scripts/build.sh +89 -0
scripts/smoke_test.sh +68 -0

.gitignore ADDED

	@@ -0,0 +1,17 @@

+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.venv/
+venv/
+# Local model weights (we don't redistribute these)
+*.gguf
+*.safetensors
+*.bin
+# Editor / OS
+.DS_Store
+.idea/
+.vscode/
+*.swp

CITATION.cff ADDED

	@@ -0,0 +1,34 @@

+cff-version: 1.2.0
+title: "Janus-27B: A Dense Distillation Wrapper for Qwen 3.6 27B"
+message: "If you use this model card or its accompanying files, please cite as below."
+type: software
+authors:
+  - name: FoolDev
+    website: "https://huggingface.co/FoolDev"
+repository-code: "https://huggingface.co/FoolDev/janus-27b"
+url: "https://huggingface.co/FoolDev/janus-27b"
+abstract: >-
+  Janus-27B is a personal repackaging of the dense Qwen 3.6 27B base model
+  with Claude Opus 4.7 in the reasoning teacher slot. The repository ships
+  an Ollama Modelfile, sampling defaults, and usage examples; weights are
+  pulled from upstream (Qwen/Qwen3.6-27B safetensors or
+  unsloth/Qwen3.6-27B-GGUF quants) rather than redistributed.
+keywords:
+  - qwen
+  - qwen3.6
+  - dense
+  - distillation
+  - reasoning
+  - llm
+license: Apache-2.0
+references:
+  - type: software
+    title: "Qwen3.6-27B"
+    authors:
+      - name: Alibaba Qwen Team
+    url: "https://huggingface.co/Qwen/Qwen3.6-27B"
+  - type: software
+    title: "Janus-35B-A3B (MoE sibling)"
+    authors:
+      - name: FoolDev
+    url: "https://huggingface.co/FoolDev/janus"

LICENSE ADDED

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for describing the origin of the Work and
+      reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may accept and charge a
+      fee for acceptance of support, warranty, indemnity, or other liability
+      obligations and/or rights consistent with this License. However, in
+      accepting such obligations, You may act only on Your own behalf and
+      on Your sole responsibility, not on behalf of any other Contributor,
+      and only if You agree to indemnify, defend, and hold each Contributor
+      harmless for any liability incurred by, or claims asserted against,
+      such Contributor by reason of your accepting any such warranty or
+      additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright 2025 FoolDev
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

Modelfile.z13 ADDED

	@@ -0,0 +1,49 @@

+# Janus-27B — Z13 variant for ASUS ROG Flow Z13 (Ryzen AI Max+ 395, 128 GB)
+#
+# This Modelfile is tuned for an iGPU with a shared/unified memory pool.
+# Defaults differ from the main Modelfile in three ways:
+#   1. Smaller context (8K instead of 16K) to keep KV cache slim.
+#   2. Q3_K_S GGUF assumed (~12 GB) so weights + compute graph fit under 20 GB.
+#   3. Slightly lower repeat_penalty since smaller quants are more loop-prone
+#      and we compensate with top_k instead.
+#
+# Recommended base GGUF for this profile:
+#   https://huggingface.co/unsloth/Qwen3.6-27B-GGUF -> Qwen3.6-27B.Q3_K_S.gguf
+#
+# Usage:
+#   ollama create janus-27b-z13 -f Modelfile.z13
+#   ollama run janus-27b-z13
+#
+# Environment variables that help on the Z13 (set before `ollama serve`):
+#   export OLLAMA_KV_CACHE_TYPE=q8_0      # halve KV cache memory
+#   export OLLAMA_FLASH_ATTENTION=1       # tighter attention working set
+#   export OLLAMA_NUM_PARALLEL=1          # don't fan out across requests
+#   export HSA_OVERRIDE_GFX_VERSION=11.5.1 # if ROCm doesn't auto-detect gfx1151
+FROM ./Qwen3.6-27B.Q3_K_S.gguf
+PARAMETER temperature 0.6
+PARAMETER top_p 0.95
+PARAMETER top_k 30
+PARAMETER repeat_penalty 1.03
+PARAMETER num_ctx 8192
+SYSTEM """You are Janus, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.
+Behavior rules:
+- Answer the user's actual request directly.
+- Be accurate, complete, and structured.
+- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
+- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
+- If the user wants creative writing, preserve tone, continuity, and character consistency.
+- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
+- Finish with a usable answer, not just planning."""
+# Footprint estimate on Z13 (Ryzen AI Max+ 395, 32 GB unified pool, gfx1151):
+#   weights mmap (Q3_K_S)    ~12 GB
+#   compute graph alloc       ~4 GB   (with FA + 8K ctx)
+#   KV cache @ 8K, q8_0       ~0.7 GB
+#   total                    ~17 GB   -> fits under 20 GB GTT cap
+#
+# If you have headroom and want better quality, swap to Q4_K_S (~14 GB).
+# Q4_K_M (~16 GB) will work but leaves almost no slack.
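
The footprint estimate in the Modelfile comments can be re-checked with a few lines of arithmetic. The sketch below only re-adds the approximate figures quoted in those comments; the names `Z13_FOOTPRINT_GB`, `GTT_CAP_GB`, and `fits_under_cap` are illustrative and do not appear in the repository.

```python
from __future__ import annotations

# Approximate figures copied from the Modelfile.z13 comments (all in GB).
Z13_FOOTPRINT_GB = {
    "weights_q3_k_s": 12.0,
    "compute_graph": 4.0,      # with flash attention and 8K context
    "kv_cache_8k_q8_0": 0.7,
}
GTT_CAP_GB = 20.0              # the GTT cap quoted in the comments


def fits_under_cap(parts: dict[str, float], cap_gb: float) -> tuple[float, bool]:
    """Return (total_gb, fits) for a set of memory components."""
    total = sum(parts.values())
    return total, total < cap_gb


total, fits = fits_under_cap(Z13_FOOTPRINT_GB, GTT_CAP_GB)
print(f"total ~{total:.1f} GB, fits under {GTT_CAP_GB:.0f} GB cap: {fits}")
```

Replacing the weights entry with the ~14 GB quoted for Q4_K_S gives roughly 18.7 GB under the same assumptions, which is why the comments describe that quant as usable only when there is headroom.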

README.md CHANGED

@@ -81,7 +81,12 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | File | Use |
 |---|---|
 | `banner.svg` / `banner.png` | Repo header, Tokyo Night themed |
-| `Modelfile` | Ollama wrapper around the upstream Qwen3.6-27B GGUF |
+| `Modelfile` | Ollama wrapper around the upstream Qwen 3.6 27B GGUF (default profile, Q4_K_M) |
+| `Modelfile.z13` | Tighter profile for ASUS ROG Flow Z13 (Q3_K_S, 8K ctx, FA + q8_0 KV cache) |
+| `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
+| `scripts/build.sh` | One-shot helper: pulls a GGUF and runs `ollama create` for you |
+| `scripts/smoke_test.sh` | Verifies an Ollama daemon + model and runs a round-trip |
+| `LICENSE`, `CITATION.cff` | Apache-2.0 license and citation metadata |
 | `README.md` | This file |
 This repo does **not** redistribute weights. Pull the upstream GGUF from [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) or any other community quant, point the Modelfile at it, and `ollama create janus-27b -f Modelfile`.
@@ -99,15 +104,30 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
 ## Quick start
-### Ollama
-A ready-to-use `Modelfile` is included. Edit the `FROM` line to point at your local GGUF copy:
+### Ollama (one-liner)
+`scripts/build.sh` will download the GGUF and create the Ollama model in one shot:
+```bash
+./scripts/build.sh                  # Q4_K_M, default profile      -> janus-27b
+./scripts/build.sh Q3_K_S z13       # Z13 profile (Modelfile.z13)  -> janus-27b-z13
+./scripts/build.sh Q5_K_M           # higher-quality quant         -> janus-27b
+ollama run janus-27b
+```
+Or do it manually if you already have a GGUF on disk — edit the `FROM` line in `Modelfile` and run:
 ```bash
-# After pulling unsloth/Qwen3.6-27B-GGUF or another quant locally:
 ollama create janus-27b -f Modelfile && ollama run janus-27b
 ```
+Confirm everything works:
+```bash
+./scripts/smoke_test.sh             # checks server, model, round-trip
+python examples/ollama_chat.py      # full demo: chat, streaming, tools, OpenAI-compat
+```
 ### Inference (OpenAI-compatible)
 ```bash
@@ -159,7 +179,7 @@ The dense 27B is the easier of the two Janus models to deploy.
 | RTX 3090 / 4090 24 GB | Works, full Q4 offload, ~25-40 tok/s |
 | RTX 5090 32 GB | Works, full offload at higher quant (Q5/Q6), ~30-50 tok/s |
 | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
-| ASUS ROG Flow Z13 (Ryzen AI Max+, 32 GB unified) | Borderline — 16 GB Q4 GGUF + ~16 GB compute graph crowds the 20 GB iGPU pool. Try Q3_K_S (~12 GB) for headroom. |
+| ASUS ROG Flow Z13 (Ryzen AI Max+, 32 GB unified) | Borderline at Q4. Use the included `Modelfile.z13` (Q3_K_S, 8K ctx, FA + q8_0 KV cache) — fits in ~17 GB. |
 ## Chat template
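
The README describes `scripts/smoke_test.sh` as checking the server, the model, and a round-trip, but the script body is not shown in this part of the diff. A rough Python equivalent might look like the sketch below. It assumes Ollama's default address `http://localhost:11434` and its `/api/tags` and `/api/chat` endpoints; the helper names are hypothetical, and this is not the repository's script.

```python
from __future__ import annotations

import json
import urllib.request

HOST = "http://localhost:11434"  # assumed default Ollama address
MODEL = "janus-27b"


def model_present(tags: dict, name: str) -> bool:
    """True if `name` (with or without a tag suffix) is listed in /api/tags output."""
    for entry in tags.get("models", []):
        listed = entry.get("name", "")
        if listed == name or listed.split(":", 1)[0] == name:
            return True
    return False


def request_json(url: str, payload: dict | None = None, timeout: float = 120.0) -> dict:
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def smoke_test(host: str = HOST, model: str = MODEL) -> str:
    """Run the three checks in order and return the model's reply text."""
    tags = request_json(f"{host}/api/tags")      # 1. server is reachable
    if not model_present(tags, model):           # 2. model has been created
        raise RuntimeError(f"model {model!r} not found; run `ollama create` first")
    reply = request_json(                        # 3. one chat round-trip
        f"{host}/api/chat",
        {
            "model": model,
            "stream": False,
            "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
        },
    )
    return reply["message"]["content"]
```

With `ollama serve` running and the model created, `print(smoke_test())` would show the model's reply; a connection error from step 1 indicates the daemon is not reachable.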

examples/README.md ADDED

	@@ -0,0 +1,49 @@

+# Janus-27B examples
+Three minimal entry points. Pick the one that matches how you run models.
+| File | Backend | When to use |
+|---|---|---|
+| `ollama_chat.py` | Ollama HTTP API | You already have `ollama serve` running and the `janus-27b` model created from the project `Modelfile`. |
+| `transformers_quickstart.py` | Hugging Face Transformers | You want to run the upstream safetensors (`Qwen/Qwen3.6-27B`) on GPU, optionally in 4-bit via bitsandbytes. |
+| `llama_cpp_quickstart.py` | llama-cpp-python | You want to invoke a local GGUF directly without a daemon (CI, batch jobs, scripts). |
+All three apply the same Janus system prompt and sampling defaults
+(`temp=0.6, top_p=0.95, top_k=20, repeat_penalty=1.05`) so behavior should
+be consistent across backends modulo quantization noise.
+## Setup
+### Ollama
+```bash
+# 1. Pull a Qwen 3.6 27B GGUF, e.g. unsloth/Qwen3.6-27B-GGUF
+hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B.Q4_K_M.gguf --local-dir .
+# 2. Edit ../Modelfile -> FROM ./Qwen3.6-27B.Q4_K_M.gguf
+# 3. Build the model
+ollama create janus-27b -f ../Modelfile
+# 4. Run the demo
+pip install requests
+python ollama_chat.py
+```
+### Transformers (safetensors)
+```bash
+pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
+python transformers_quickstart.py            # 4-bit, ~16 GB VRAM
+python transformers_quickstart.py --no-4bit  # bf16, ~54 GB VRAM
+```
+### llama-cpp-python (GGUF, no daemon)
+```bash
+pip install llama-cpp-python  # CPU-only build
+python llama_cpp_quickstart.py /path/to/Qwen3.6-27B.Q4_K_M.gguf --gpu-layers 99
+```
+For GPU offload, rebuild llama-cpp-python with the matching backend — see
+the script header for `CMAKE_ARGS` recipes (CUDA, Metal, ROCm/HIP).
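
Since the three examples are described as sharing one set of sampling defaults, it can help to see how a single dictionary maps onto each backend's parameter names. The sketch below is illustrative only: the function names are hypothetical, and the mapping reflects the parameter names used by Ollama's `options` field, llama-cpp-python's `create_chat_completion`, and Transformers' `generate`, rather than code taken from the example scripts.

```python
from __future__ import annotations

# Defaults quoted in examples/README.md.
JANUS_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}


def ollama_options(sampling: dict) -> dict:
    """Ollama takes these under the request's `options` key, names unchanged."""
    return dict(sampling)


def llama_cpp_kwargs(sampling: dict) -> dict:
    """llama-cpp-python accepts the same names as keyword arguments."""
    return dict(sampling)


def transformers_kwargs(sampling: dict) -> dict:
    """Transformers names the penalty `repetition_penalty` and needs sampling enabled."""
    out = {k: v for k, v in sampling.items() if k != "repeat_penalty"}
    out["repetition_penalty"] = sampling["repeat_penalty"]
    out["do_sample"] = True
    return out


print(transformers_kwargs(JANUS_SAMPLING))
```

Keeping the defaults in one place like this makes it harder for the backends to drift apart when a value is tuned.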

examples/llama_cpp_quickstart.py ADDED

	@@ -0,0 +1,84 @@

+#!/usr/bin/env python3
+"""
+Janus-27B — llama-cpp-python quickstart.
+Skip Ollama entirely and call the GGUF directly through llama-cpp-python.
+Useful for batch jobs, CI, or environments where you don't want a daemon.
+Install:
+    pip install llama-cpp-python
+For GPU offload (CUDA / Metal / ROCm), rebuild from source with the matching CMake flag:
+    CMAKE_ARGS="-DGGML_CUDA=on"  pip install llama-cpp-python --no-binary llama-cpp-python
+    CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-binary llama-cpp-python
+    CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --no-binary llama-cpp-python
+Usage:
+    python llama_cpp_quickstart.py /path/to/Qwen3.6-27B.Q4_K_M.gguf
+    python llama_cpp_quickstart.py /path/to/file.gguf --gpu-layers 99
+    python llama_cpp_quickstart.py /path/to/file.gguf --prompt "..."
+"""
+from __future__ import annotations
+import argparse
+import sys
+try:
+    from llama_cpp import Llama
+except ImportError:  # pragma: no cover
+    sys.exit("Missing llama-cpp-python. Install with: pip install llama-cpp-python")
+JANUS_SYSTEM = (
+    "You are Janus, a precise and capable assistant for reasoning, writing, "
+    "coding, and long-form dialogue.\n\n"
+    "Behavior rules:\n"
+    "- Answer the user's actual request directly.\n"
+    "- Be accurate, complete, and structured.\n"
+    "- Think before answering, but do not get stuck in repetitive loops.\n"
+    "- If the request is ambiguous, state what is missing and make the smallest "
+    "reasonable assumption needed to continue.\n"
+    "- Finish with a usable answer, not just planning."
+)
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("gguf", help="Path to Qwen3.6-27B GGUF (e.g. Q4_K_M).")
+    ap.add_argument(
+        "--prompt",
+        default="Explain the Burrows-Wheeler transform in 200 words.",
+    )
+    ap.add_argument("--ctx", type=int, default=16384, help="Context window.")
+    ap.add_argument(
+        "--gpu-layers",
+        type=int,
+        default=0,
+        help="Layers to offload to GPU (-1 or 99 = all).",
+    )
+    ap.add_argument("--max-tokens", type=int, default=512)
+    args = ap.parse_args()
+    llm = Llama(
+        model_path=args.gguf,
+        n_ctx=args.ctx,
+        n_gpu_layers=args.gpu_layers,
+        verbose=False,
+    )
+    out = llm.create_chat_completion(
+        messages=[
+            {"role": "system", "content": JANUS_SYSTEM},
+            {"role": "user", "content": args.prompt},
+        ],
+        temperature=0.6,
+        top_p=0.95,
+        top_k=20,
+        repeat_penalty=1.05,
+        max_tokens=args.max_tokens,
+    )
+    print(out["choices"][0]["message"]["content"])
+if __name__ == "__main__":
+    main()

examples/ollama_chat.py ADDED

	@@ -0,0 +1,193 @@

+#!/usr/bin/env python3
+"""
+Janus-27B — Ollama chat examples.
+Prerequisites:
+    1. Pull a Qwen 3.6 27B GGUF (e.g. unsloth/Qwen3.6-27B-GGUF).
+    2. Edit ../Modelfile so the FROM line points at the GGUF path.
+    3. ollama create janus-27b -f ../Modelfile
+    4. ollama serve   (usually already running)
+    5. python ollama_chat.py
+The model emits <think>...</think> reasoning blocks before its answer.
+Ollama (as of 0.22) does not always split these into a separate field for
+qwen3_6, so the reasoning lands inside `content`. Helpers below strip it
+when you only want the final answer.
+Endpoints used:
+    - Native Ollama:  http://localhost:11434/api/chat
+    - OpenAI-compat:  http://localhost:11434/v1/chat/completions
+"""
+from __future__ import annotations
+import json
+import re
+import sys
+from typing import Any, Iterator
+import requests
+MODEL = "janus-27b"
+HOST = "http://localhost:11434"
+_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
+def split_thinking(content: str) -> tuple[str, str]:
+    """Return (thinking, final_answer) from a content string."""
+    parts = re.findall(r"<think>(.*?)</think>", content, re.DOTALL)
+    thinking = "\n".join(p.strip() for p in parts).strip()
+    answer = _THINK_RE.sub("", content).strip()
+    return thinking, answer
+# ---------- 1. Simple chat ----------
+def chat(prompt: str, system: str | None = None) -> dict[str, Any]:
+    msgs: list[dict[str, Any]] = []
+    if system:
+        msgs.append({"role": "system", "content": system})
+    msgs.append({"role": "user", "content": prompt})
+    r = requests.post(
+        f"{HOST}/api/chat",
+        json={"model": MODEL, "messages": msgs, "stream": False},
+        timeout=600,
+    )
+    r.raise_for_status()
+    return r.json()
+# ---------- 2. Streaming ----------
+def chat_stream(prompt: str) -> Iterator[str]:
+    """Yield content tokens as they arrive."""
+    with requests.post(
+        f"{HOST}/api/chat",
+        json={
+            "model": MODEL,
+            "messages": [{"role": "user", "content": prompt}],
+            "stream": True,
+        },
+        stream=True,
+        timeout=600,
+    ) as r:
+        r.raise_for_status()
+        for line in r.iter_lines():
+            if not line:
+                continue
+            chunk = json.loads(line)
+            if "message" in chunk and "content" in chunk["message"]:
+                yield chunk["message"]["content"]
+            if chunk.get("done"):
+                break
+# ---------- 3. Tool calling ----------
+WEATHER_TOOL = {
+    "type": "function",
+    "function": {
+        "name": "get_current_weather",
+        "description": "Get the current weather in a given city",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "city": {"type": "string", "description": "City name"},
+                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
+            },
+            "required": ["city", "unit"],
+        },
+    },
+}
+def fake_weather(city: str, unit: str) -> str:
+    """Stand-in tool implementation."""
+    return json.dumps(
+        {"city": city, "temperature": 14, "unit": unit, "conditions": "light rain"}
+    )
+def tool_round_trip(prompt: str) -> str:
+    """Single-shot tool call: model -> tool -> model -> final answer."""
+    history: list[dict[str, Any]] = [{"role": "user", "content": prompt}]
+    r = requests.post(
+        f"{HOST}/api/chat",
+        json={
+            "model": MODEL,
+            "messages": history,
+            "tools": [WEATHER_TOOL],
+            "stream": False,
+        },
+        timeout=600,
+    )
+    r.raise_for_status()
+    msg = r.json()["message"]
+    if not msg.get("tool_calls"):
+        return msg["content"]
+    history.append({"role": "assistant", "tool_calls": msg["tool_calls"]})
+    for tc in msg["tool_calls"]:
+        fn = tc["function"]
+        if fn["name"] == "get_current_weather":
+            result = fake_weather(**fn["arguments"])
+        else:
+            result = json.dumps({"error": f"unknown tool {fn['name']}"})
+        history.append({"role": "tool", "tool_name": fn["name"], "content": result})
+    r = requests.post(
+        f"{HOST}/api/chat",
+        json={
+            "model": MODEL,
+            "messages": history,
+            "tools": [WEATHER_TOOL],
+            "stream": False,
+        },
+        timeout=600,
+    )
+    r.raise_for_status()
+    return r.json()["message"]["content"]
+# ---------- 4. OpenAI-compatible endpoint ----------
+def openai_chat(prompt: str) -> str:
+    r = requests.post(
+        f"{HOST}/v1/chat/completions",
+        json={
+            "model": MODEL,
+            "messages": [{"role": "user", "content": prompt}],
+            "temperature": 0.6,
+        },
+        timeout=600,
+    )
+    r.raise_for_status()
+    return r.json()["choices"][0]["message"]["content"]
+# ---------- demo ----------
+def _demo() -> None:
+    print("=== 1. simple chat ===")
+    resp = chat("What is 84 * 3 / 2?")
+    thinking, answer = split_thinking(resp["message"]["content"])
+    if thinking:
+        print(f"[thinking] {thinking[:200]}...")
+    print(f"[answer]   {answer}")
+    print("\n=== 2. streaming ===")
+    for tok in chat_stream("Count from 1 to 5 in one line."):
+        sys.stdout.write(tok)
+        sys.stdout.flush()
+    print()
+    print("\n=== 3. tool round-trip ===")
+    print(tool_round_trip("What is the weather in Paris in celsius?"))
+    print("\n=== 4. OpenAI-compat ===")
+    print(openai_chat("Say 'OpenAI endpoint OK' and nothing else."))
+if __name__ == "__main__":
+    _demo()
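
A note on `split_thinking`: if generation stops before `</think>` is emitted (for instance when a token limit is hit), the regexes above find no closed block, so the partial reasoning stays in the returned answer. A minimal sketch of a more tolerant variant follows. The name `split_thinking_tolerant` is hypothetical and not part of this PR.

```python
from __future__ import annotations

import re

# Matches a complete <think>...</think> block plus trailing whitespace.
_CLOSED = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)
# Matches an opening <think> that is never closed, through end of string.
_OPEN_TAIL = re.compile(r"<think>(.*)\Z", re.DOTALL)


def split_thinking_tolerant(content: str) -> tuple[str, str]:
    """Return (thinking, answer); also handles a trailing unclosed <think>."""
    parts = [p.strip() for p in _CLOSED.findall(content)]
    rest = _CLOSED.sub("", content)
    tail = _OPEN_TAIL.search(rest)
    if tail:  # generation stopped mid-reasoning
        parts.append(tail.group(1).strip())
        rest = rest[: tail.start()]
    thinking = "\n".join(p for p in parts if p)
    return thinking, rest.strip()


if __name__ == "__main__":
    print(split_thinking_tolerant("<think>84*3=252, /2=126</think>\n126"))
    print(split_thinking_tolerant("Partial <think>still reasoning"))
```

Closed blocks are handled the same way as in the original helper; only the unterminated case differs.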

examples/transformers_quickstart.py ADDED

	@@ -0,0 +1,119 @@

+#!/usr/bin/env python3
+"""
+Janus-27B — Hugging Face Transformers quickstart.
+Loads the upstream Qwen 3.6 27B safetensors directly and runs a single
+chat turn using its embedded chat template. Janus-27B is a *wrapper*
+around that base, so for the transformers route there is nothing to
+download from this repo — point at Qwen/Qwen3.6-27B and apply the same
+system prompt the Modelfile uses.
+Requirements:
+    pip install --upgrade "transformers>=4.45" accelerate sentencepiece bitsandbytes
+Memory:
+    - bf16 full precision: ~54 GB VRAM (won't fit on a single 24 GB card).
+    - 4-bit (bitsandbytes nf4): ~16 GB VRAM, runs on a 3090/4090 24 GB.
+    - Fall back to device_map="auto" + bnb_4bit on consumer GPUs.
+Usage:
+    python transformers_quickstart.py
+    python transformers_quickstart.py --no-4bit         # bf16, needs ~54 GB VRAM for weights
+    python transformers_quickstart.py --prompt "..."    # custom prompt
+"""
+from __future__ import annotations
+import argparse
+import sys
+try:
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+except ImportError as e:  # pragma: no cover
+    sys.exit(
+        f"Missing dependency: {e.name}. Install with:\n"
+        "  pip install --upgrade 'transformers>=4.45' accelerate sentencepiece bitsandbytes"
+    )
+MODEL_ID = "Qwen/Qwen3.6-27B"
+JANUS_SYSTEM = (
+    "You are Janus, a precise and capable assistant for reasoning, writing, "
+    "coding, and long-form dialogue.\n\n"
+    "Behavior rules:\n"
+    "- Answer the user's actual request directly.\n"
+    "- Be accurate, complete, and structured.\n"
+    "- Think before answering, but do not get stuck in repetitive loops or "
+    "meta-commentary.\n"
+    "- If the request is ambiguous or incomplete, state what is missing and "
+    "make the smallest reasonable assumption needed to continue.\n"
+    "- If the user wants creative writing, preserve tone, continuity, and "
+    "character consistency.\n"
+    "- If the user wants analysis or technical help, prefer concrete steps, "
+    "examples, and decisions over fluff.\n"
+    "- Finish with a usable answer, not just planning."
+)
+def load(use_4bit: bool):
+    kwargs: dict = {"device_map": "auto", "torch_dtype": torch.bfloat16}
+    if use_4bit:
+        from transformers import BitsAndBytesConfig
+        kwargs["quantization_config"] = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16,
+            bnb_4bit_use_double_quant=True,
+        )
+        kwargs.pop("torch_dtype", None)
+    tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True, **kwargs)
+    return tok, model
+def generate(tok, model, prompt: str, max_new_tokens: int = 512) -> str:
+    messages = [
+        {"role": "system", "content": JANUS_SYSTEM},
+        {"role": "user", "content": prompt},
+    ]
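+    # return_dict=True yields input_ids plus attention_mask, so generate()
+    # does not have to guess the mask.
+    inputs = tok.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        return_dict=True,
+        return_tensors="pt",
+    ).to(model.device)
+    out = model.generate(
+        **inputs,
+        max_new_tokens=max_new_tokens,
+        do_sample=True,
+        temperature=0.6,
+        top_p=0.95,
+        top_k=20,
+        repetition_penalty=1.05,
+    )
+    # Decode only the newly generated tokens, not the echoed prompt.
+    prompt_len = inputs["input_ids"].shape[-1]
+    return tok.decode(out[0][prompt_len:], skip_special_tokens=True)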
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--prompt", default="Explain the Burrows-Wheeler transform in 200 words.")
+    ap.add_argument(
+        "--no-4bit",
+        action="store_true",
+        help="Disable 4-bit quantization (requires ~54 GB VRAM in bf16).",
+    )
+    ap.add_argument("--max-new-tokens", type=int, default=512)
+    args = ap.parse_args()
+    print(f"[load] {MODEL_ID} (4bit={'no' if args.no_4bit else 'yes'})")
+    tok, model = load(use_4bit=not args.no_4bit)
+    print(f"[gen]  prompt: {args.prompt!r}")
+    print()
+    print(generate(tok, model, args.prompt, args.max_new_tokens))
+if __name__ == "__main__":
+    main()
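
The memory figures in the docstring follow from parameter count times bytes per parameter. A rough sketch of that arithmetic, assuming 27e9 parameters and counting weights only. KV cache, activations, and quantization metadata come on top, which is presumably why the 4-bit figure is quoted as ~16 GB rather than 13.5 GB. `weight_memory_gb` is a hypothetical helper, not part of the script.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    n = 27e9  # assumed parameter count for a "27B" model
    print(f"bf16 : {weight_memory_gb(n, 16):.1f} GB")
    print(f"4-bit: {weight_memory_gb(n, 4):.1f} GB")
```

By this estimate bf16 weights alone exceed a 48 GB card, so `--no-4bit` needs either a larger device or sharding across several GPUs via `device_map="auto"`.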

scripts/build.sh ADDED

	@@ -0,0 +1,89 @@

+#!/usr/bin/env bash
+# Janus-27B — fetch a Qwen 3.6 27B GGUF and build the Ollama model.
+#
+# Usage:
+#   ./scripts/build.sh                       # default: Q4_K_M, profile=default
+#   ./scripts/build.sh Q5_K_M                # different quant
+#   ./scripts/build.sh Q3_K_S z13            # quant + Z13 profile (uses Modelfile.z13)
+#   QUANT=Q6_K PROFILE=default ./scripts/build.sh
+#
+# Requires: hf (or huggingface-cli), ollama, awk, mktemp.
+set -euo pipefail
+QUANT="${1:-${QUANT:-Q4_K_M}}"
+PROFILE="${2:-${PROFILE:-default}}"
+REPO_ID="${REPO_ID:-unsloth/Qwen3.6-27B-GGUF}"
+GGUF_NAME="Qwen3.6-27B.${QUANT}.gguf"
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+GGUF_PATH="${ROOT}/${GGUF_NAME}"
+case "${PROFILE}" in
+    default) MODELFILE="${ROOT}/Modelfile";        TAG="janus-27b" ;;
+    z13)     MODELFILE="${ROOT}/Modelfile.z13";    TAG="janus-27b-z13" ;;
+    *)       echo "[!] Unknown profile: ${PROFILE} (expected: default | z13)" >&2; exit 1 ;;
+esac
+echo "[*] repo:     ${REPO_ID}"
+echo "[*] quant:    ${QUANT}"
+echo "[*] profile:  ${PROFILE}"
+echo "[*] tag:      ${TAG}"
+echo "[*] modelfile:${MODELFILE}"
+echo "[*] gguf:     ${GGUF_PATH}"
+# ---- 1. Sanity ---------------------------------------------------------------
+if ! command -v ollama >/dev/null 2>&1; then
+    echo "[!] ollama not found in PATH" >&2; exit 1
+fi
+if [[ ! -f "${MODELFILE}" ]]; then
+    echo "[!] Missing ${MODELFILE}" >&2; exit 1
+fi
+# ---- 2. Pick a HuggingFace CLI ----------------------------------------------
+HF=""
+if command -v hf >/dev/null 2>&1; then
+    HF="hf"
+elif command -v huggingface-cli >/dev/null 2>&1; then
+    HF="huggingface-cli"
+else
+    echo "[!] Neither 'hf' nor 'huggingface-cli' found." >&2
+    echo "    pip install -U huggingface_hub" >&2
+    exit 1
+fi
+# ---- 3. Download GGUF if missing --------------------------------------------
+if [[ -f "${GGUF_PATH}" ]]; then
+    echo "[=] GGUF already present, skipping download."
+else
+    echo "[*] Downloading ${GGUF_NAME} from ${REPO_ID} ..."
+    case "${HF}" in
+        hf)                 hf download "${REPO_ID}" "${GGUF_NAME}" --local-dir "${ROOT}" ;;
+        huggingface-cli)    huggingface-cli download "${REPO_ID}" "${GGUF_NAME}" --local-dir "${ROOT}" ;;
+    esac
+fi
+if [[ ! -f "${GGUF_PATH}" ]]; then
+    echo "[!] Download failed: ${GGUF_PATH} not present." >&2; exit 1
+fi
+# ---- 4. Patch the Modelfile FROM line in a temp copy -------------------------
+TMP_MODELFILE="$(mktemp -t janus27b-modelfile.XXXXXX)"
+trap 'rm -f "${TMP_MODELFILE}"' EXIT
+awk -v p="${GGUF_PATH}" '
+    /^FROM[[:space:]]/ && !done { print "FROM " p; done=1; next }
+    { print }
+' "${MODELFILE}" > "${TMP_MODELFILE}"
+# ---- 5. Create the Ollama model ---------------------------------------------
+echo "[*] ollama create ${TAG} -f <patched modelfile>"
+ollama create "${TAG}" -f "${TMP_MODELFILE}"
+echo
+echo "[+] Done. Try it:"
+echo "    ollama run ${TAG}"
+echo "    python ${ROOT}/examples/ollama_chat.py   # update MODEL constant if not 'janus-27b'"
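
Step 4 rewrites only the first `FROM` line of the Modelfile and copies every other line through unchanged. For readers less familiar with awk, the same logic can be sketched in Python. `patch_from_line` is a hypothetical helper for illustration, not something the script calls.

```python
import re


def patch_from_line(modelfile_text: str, gguf_path: str) -> str:
    """Replace the first line starting with 'FROM<whitespace>' and keep the rest."""
    out = []
    done = False
    for line in modelfile_text.splitlines():
        if not done and re.match(r"FROM\s", line):
            out.append(f"FROM {gguf_path}")
            done = True
        else:
            out.append(line)
    # awk prints every record with a trailing newline; mirror that here.
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "# base model\nFROM ./placeholder.gguf\nPARAMETER temperature 0.6\n"
    print(patch_from_line(sample, "/models/example.gguf"), end="")
```

Working on a temporary copy, as the script does, leaves the checked-in Modelfile free of machine-specific paths.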

scripts/smoke_test.sh ADDED

	@@ -0,0 +1,68 @@

+#!/usr/bin/env bash
+# Janus-27B — smoke test against a running Ollama daemon.
+#
+# Verifies:
+#   1. The Ollama server is reachable.
+#   2. The target model appears in the server's local model list (/api/tags).
+#   3. A single chat round-trip succeeds and produces non-empty output.
+#
+# Usage:
+#   ./scripts/smoke_test.sh                  # uses MODEL=janus-27b
+#   MODEL=janus-27b-z13 ./scripts/smoke_test.sh
+#   HOST=http://localhost:11434 ./scripts/smoke_test.sh
+set -euo pipefail
+MODEL="${MODEL:-janus-27b}"
+HOST="${HOST:-http://localhost:11434}"
+PROMPT="${PROMPT:-Reply with the single word: OK}"
+red()   { printf "\033[31m%s\033[0m\n" "$*"; }
+green() { printf "\033[32m%s\033[0m\n" "$*"; }
+blue()  { printf "\033[34m%s\033[0m\n" "$*"; }
+require() {
+    if ! command -v "$1" >/dev/null 2>&1; then
+        red "[!] missing dependency: $1"; exit 1
+    fi
+}
+require curl
+require jq
+blue "[*] host:   ${HOST}"
+blue "[*] model:  ${MODEL}"
+# 1. Server up?
+if ! curl -fsS "${HOST}/api/tags" >/dev/null; then
+    red "[!] Ollama not reachable at ${HOST}. Is 'ollama serve' running?"
+    exit 1
+fi
+green "[+] server reachable"
+# 2. Model present?
+if ! curl -fsS "${HOST}/api/tags" | jq -e --arg m "${MODEL}" '.models[] | select(.name == $m or (.name | startswith($m + ":")))' >/dev/null; then
+    red "[!] Model '${MODEL}' not found. Build it first:"
+    red "    ./scripts/build.sh                # default profile"
+    red "    ./scripts/build.sh Q3_K_S z13     # Z13 profile"
+    exit 1
+fi
+green "[+] model present"
+# 3. Round-trip
+blue "[*] sending test prompt..."
+RESP="$(curl -fsS "${HOST}/api/chat" \
+    -H 'Content-Type: application/json' \
+    -d "$(jq -n --arg m "${MODEL}" --arg p "${PROMPT}" '{
+        model: $m,
+        messages: [{role:"user", content:$p}],
+        stream: false
+    }')" | jq -r '.message.content // empty')"
+if [[ -z "${RESP}" ]]; then
+    red "[!] empty response from model"
+    exit 1
+fi
+green "[+] round-trip OK"
+echo "----- model said -----"
+echo "${RESP}"
+echo "----------------------"
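
Check 2 matches the model by name against the `/api/tags` listing. Ollama lists names with a tag suffix such as `janus-27b:latest`, so a bare prefix comparison on `janus-27b` would also accept `janus-27b-z13:latest`. A small sketch contrasting prefix matching with tag-aware matching. Both function names are hypothetical.

```python
from typing import Iterable


def present_by_prefix(listed: Iterable[str], model: str) -> bool:
    """Loose check: any listed name that merely starts with `model`."""
    return any(name.startswith(model) for name in listed)


def present_by_tag(listed: Iterable[str], model: str) -> bool:
    """Stricter check: exact name, or `model` followed by a ':tag' suffix."""
    return any(name == model or name.startswith(model + ":") for name in listed)


if __name__ == "__main__":
    only_z13 = ["janus-27b-z13:latest"]
    print(present_by_prefix(only_z13, "janus-27b"))  # loose match accepts the z13 build
    print(present_by_tag(only_z13, "janus-27b"))     # tag-aware match rejects it
```

The distinction matters when only the Z13 profile has been built and the smoke test is run with the default `MODEL`.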