---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-35B-A3B
datasets:
  - crownelius/Creative_Writing_ShareGPT_Enhanced
  - microsoft/rStar-Coder
  - peteromallet/dataclaw-peteromallet
  - crownelius/Opus-4.7-Reasoning
  - openbmb/UltraData-Math
  - Crownelius/Crow-Heretic-TeichAI-Unified
language:
  - en
  - zh
  - ru
  - es
  - fr
  - it
  - ja
  - ko
  - de
  - ar
  - tr
  - pl
  - sv
  - nl
  - he
  - id
  - uk
  - fa
  - pt
  - ms
  - fi
  - el
tags:
  - qwen3_6
  - moe
  - conversational
  - multimodal
  - agent
  - gguf
library_name: transformers
pipeline_tag: image-text-to-text
---

<img src="https://huggingface.co/FoolDev/Janus-35B/resolve/main/banner.svg" alt="Janus-35B banner" width="100%" />

[![License](https://img.shields.io/badge/License-Apache_2.0-7aa2f7?style=flat&labelColor=1a1b26)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3.6--35B--A3B-bb9af7?style=flat&labelColor=1a1b26)](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
[![Architecture](https://img.shields.io/badge/Arch-MoE_35B/3B_active-ff9e64?style=flat&labelColor=1a1b26)](#architecture)
[![Quant](https://img.shields.io/badge/GGUF-Q4__K__M-9ece6a?style=flat&labelColor=1a1b26)](#whats-here)
[![Buy me a coffee](https://img.shields.io/badge/%E2%98%95%20Buy_me_a_coffee-e0af68?style=flat&logo=buymeacoffee&logoColor=1a1b26&labelColor=1a1b26)](https://buymeacoffee.com/cardoffoolm)

# Janus-35B

> **Flagship Reasoning. Sparse Footprint.**
> *Qwen 3.6 35B-A3B repackaged with Claude Opus 4.7 in the teacher slot.*

**`Architecture:`** `Qwen 3.6 35B-A3B (MoE)` | **`Total Params:`** `35B` | **`Active Params:`** `3B` | **`Teacher:`** `Claude Opus 4.7` | **`Type:`** `Distilled MoE LLM`

A personal fork of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) — a 35B-total / 3B-active mixture-of-experts multimodal model — repackaged as Janus-35B with Claude Opus 4.7 reasoning data in the teacher slot.

## TL;DR

One-liner via Hugging Face. This pulls the GGUF plus this repo's
root-level `template` / `system` / `params` files, including the
tool-calling template. HF's Ollama bridge ingests those three files,
not `Modelfile`:

```bash
ollama run hf.co/FoolDev/Janus-35B               # default ~19 GB Q4_K_M
ollama run hf.co/FoolDev/Janus-35B:Q4_K_M        # same blob, explicit tag
```

Or build locally (uses this repo's `Modelfile`, kept in sync with the
three bridge files):

```bash
git clone https://huggingface.co/FoolDev/Janus-35B && cd Janus-35B
ollama create janus -f Modelfile && ollama run janus
```

After either path, `ollama show <model>` lists `completion`, `tools`,
and `thinking` under Capabilities, where `<model>` is `janus` for the
local build and `hf.co/FoolDev/Janus-35B` for the HF pull. Hardware:
~38 GB RAM at the default `num_ctx 16384`; 32 GB hosts fit by trimming
context and batch size (see [Hardware requirements](#hardware-requirements)).

## What's here

| File | Use |
|---|---|
| `Janus-35B-A3B.Q4_K_M.gguf` | Recommended default, ~19 GB |
| `Modelfile` | Ollama wrapper for **local** builds (`ollama create janus -f Modelfile`) — overrides the GGUF's embedded template with one that exposes `.Tools` / `.ToolCalls` to Ollama's capability detector. |
| `template`, `system`, `params` | Used by HF's Ollama bridge when users `ollama run hf.co/FoolDev/Janus-35B` directly. The bridge does **not** read `Modelfile` (see [HF Ollama docs](https://huggingface.co/docs/hub/en/ollama)); it ingests these three root-level files instead. Kept in sync with the `Modelfile`'s `TEMPLATE` / `SYSTEM` / `PARAMETER` directives. |
| `scripts/check_bridge_sync.py` | Run before pushing an edit to `Modelfile`, `template`, `system`, or `params` to verify that the four files remain in sync. Exits 0 if in sync, 1 with a per-key diff if not. |

GGUF-only release. Pull the upstream safetensors from `Qwen/Qwen3.6-35B-A3B` if you need the `transformers` tree.
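
The sync requirement between `Modelfile` and the three bridge files can be illustrated with a short sketch. This is not the repo's `scripts/check_bridge_sync.py`; it is a hypothetical re-implementation that assumes `params` is a JSON object and that the `Modelfile` uses triple-quoted `TEMPLATE` / `SYSTEM` blocks. The function names `parse_modelfile` and `diff_bridge` are made up for illustration.

```python
import json
import re


def parse_modelfile(text: str) -> dict:
    """Extract TEMPLATE, SYSTEM, and PARAMETER directives from Modelfile text."""

    def block(name: str) -> str:
        # Triple-quoted form first, then the single-line form.
        m = re.search(rf'^{name}\s+"""(.*?)"""', text, re.S | re.M)
        if m is None:
            m = re.search(rf'^{name}\s+"?(.*?)"?\s*$', text, re.M)
        return m.group(1).strip() if m else ""

    params: dict = {}
    for key, value in re.findall(r"^PARAMETER\s+(\S+)\s+(.+?)\s*$", text, re.M):
        try:
            parsed = json.loads(value)   # numbers and quoted strings
        except json.JSONDecodeError:
            parsed = value               # bare strings
        if key in params:                # repeated keys accumulate into a list
            prev = params[key]
            params[key] = (prev if isinstance(prev, list) else [prev]) + [parsed]
        else:
            params[key] = parsed
    return {"template": block("TEMPLATE"), "system": block("SYSTEM"), "params": params}


def diff_bridge(modelfile_text: str, template: str, system: str, params_json: str) -> list[str]:
    """Return the names of entries that differ; an empty list means in sync."""
    mf = parse_modelfile(modelfile_text)
    problems = []
    if mf["template"] != template.strip():
        problems.append("template")
    if mf["system"] != system.strip():
        problems.append("system")
    bridge_params = json.loads(params_json)
    for key in sorted(set(mf["params"]) | set(bridge_params)):
        if mf["params"].get(key) != bridge_params.get(key):
            problems.append(f"params.{key}")
    return problems


# Toy inputs, not the real files from this repo.
DEMO_MODELFILE = (
    "FROM ./Janus-35B-A3B.Q4_K_M.gguf\n"
    'TEMPLATE """{{ .Prompt }}"""\n'
    'SYSTEM """You are Janus."""\n'
    "PARAMETER temperature 0.6\n"
    "PARAMETER num_ctx 16384\n"
)
print(diff_bridge(DEMO_MODELFILE, "{{ .Prompt }}\n", "You are Janus.",
                  '{"temperature": 0.6, "num_ctx": 16384}'))
```

A single `stop` value in the `Modelfile` and a one-element list in `params` would be reported as a mismatch by this sketch; a real checker would probably normalize that case.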

## Architecture

<p align="left">
  <img src="https://huggingface.co/FoolDev/Janus-35B/resolve/main/moe-routing.svg" alt="animated MoE routing visualization: 16x16 grid of 256 expert dots with 8 lit at any time, cycling through 8 routing patterns" width="640" />
</p>

- Qwen 3.6, 35B total / 3B active, MoE (256 experts, 8 activated per token)
- 40 layers, 10 × (3 × DeltaNet → MoE / 1 × Gated Attention → MoE)
- 262k native context, extensible to ~1M with YaRN
- Vision + video supported by upstream (mmproj not included in this release)
- Vocab 248,320
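
As a rough illustration of the numbers in the list above, the sketch below builds the 40-layer pattern and routes one token to 8 of 256 experts. The softmax-over-selected-logits normalization and the random logits are assumptions for illustration; the upstream router may normalize differently.

```python
import math
import random

# Layer layout from the list above: 10 repeats of (3 DeltaNet + 1 gated attention),
# each followed by an MoE block.
LAYERS = (["deltanet"] * 3 + ["gated_attention"]) * 10

NUM_EXPERTS = 256
TOP_K = 8


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Assumption: softmax over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)          # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


rng = random.Random(0)                           # stand-in for real router logits
weights = route([rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
print(len(LAYERS), len(weights), round(sum(weights.values()), 6))
```

Only the 8 selected experts run for a given token, which is why roughly 3B of the 35B parameters are active per token.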

## Quick start

### llama.cpp / LM Studio

Drop the GGUF into your loader of choice. The chat template is embedded in the GGUF metadata, so llama.cpp's `--chat-template auto` and LM Studio's GGUF auto-detection handle plain conversation correctly.

### Ollama

The chat template baked into the GGUF is **not sufficient on Ollama** — it lacks the `.Tools` / `.ToolCalls` blocks Ollama's capability detector requires, so a naive `ollama pull` reports `does not support tools` and rejects any request carrying a `tools` array. Two paths fix this:

```bash
# A. Pull straight from HF (uses the root-level template/system/params files):
ollama run hf.co/FoolDev/Janus-35B               # default tag, ~19 GB Q4_K_M
ollama run hf.co/FoolDev/Janus-35B:Q4_K_M        # same blob, explicit tag
# Note: HF's Ollama bridge does NOT read Modelfile; it reads template/system/params.

# B. Build locally (uses Modelfile, which is kept in sync with the three above):
ollama create janus -f Modelfile && ollama run janus
```

After either path, `ollama show <model>` should list `completion`, `tools`, and `thinking` under Capabilities. Use `janus` as `<model>` for the local build and `hf.co/FoolDev/Janus-35B` for the HF pull.

### Inference examples

Once the model is loaded (via `ollama run janus`, `lms server start`, or `llama-server`), any OpenAI-compatible client works. The examples assume the loader is listening on `http://localhost:11434` (the Ollama default); change the port for LM Studio (`:1234`) or llama.cpp (`:8080`). If you pulled through the HF bridge instead of building locally, use `hf.co/FoolDev/Janus-35B` as the `model` value.

#### curl

```bash
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "janus",
    "messages": [
      {"role": "system", "content": "You are Janus, a precise reasoning assistant."},
      {"role": "user", "content": "Sketch an algorithm to detect cycles in a directed graph."}
    ],
    "temperature": 0.6,
    "max_tokens": 800
  }' | jq -r '.choices[0].message.content'
```

#### Python (openai-compat)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ignored")

resp = client.chat.completions.create(
    model="janus",
    messages=[
        {"role": "user", "content": "Write a haiku about a stack overflow."}
    ],
    temperature=0.8,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```

#### Streaming

```python
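from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ignored")

stream = client.chat.completions.create(
    model="janus",
    messages=[{"role": "user", "content": "Explain RoPE briefly."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # some servers send a final chunk with no choices
        continue
    # delta.content is None on role-only and final chunks
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)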
```

### Recommended sampling

| Use | temp | top_p | top_k | repeat_penalty |
|---|---:|---:|---:|---:|
| Reasoning / general | 0.6 | 0.95 | 20 | 1.05 |
| Creative / RP | 0.8 | 0.95 | 40 | 1.02 |

Lower the temperature to 0.4–0.6 and raise `repeat_penalty` to 1.08 if the model loops inside `<think>` tags.
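
The OpenAI-compatible examples above only set `temperature` and `top_p`. One way to apply the full presets is Ollama's native `/api/chat` endpoint, which accepts them under `options`. The sketch below assumes a local Ollama on the default port and a model named `janus`; the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Presets copied from the sampling table.
PRESETS = {
    "reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.95, "top_k": 40, "repeat_penalty": 1.02},
}


def build_chat_request(prompt: str, preset: str = "reasoning", model: str = "janus") -> dict:
    """Build an Ollama /api/chat payload with the chosen sampling preset."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": dict(PRESETS[preset]),
    }


def chat(prompt: str, preset: str = "reasoning", host: str = "http://localhost:11434") -> str:
    """POST the payload and return the assistant text (needs a running Ollama)."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_request(prompt, preset)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling `chat("Write a haiku about a stack overflow.", preset="creative")` would then use the Creative / RP row.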

### System prompt

```text
You are Janus, a precise and capable assistant for reasoning, writing, coding, and long-form dialogue.

Behavior rules:
- Answer the user's actual request directly.
- Be accurate, complete, and structured.
- Think before answering, but do not get stuck in repetitive loops or meta-commentary.
- If the request is ambiguous or incomplete, state what is missing and make the smallest reasonable assumption needed to continue.
- If the user wants creative writing, preserve tone, continuity, and character consistency.
- If the user wants analysis or technical help, prefer concrete steps, examples, and decisions over fluff.
- Finish with a usable answer, not just planning.
```

## Hardware requirements

This is an 18.9 GB Q4_K_M GGUF. Ollama's runtime footprint at default settings is **roughly 2× the model file** (weights mmap + compute graph allocation), plus KV cache, which comes to ~38 GB total memory at `num_ctx 16384`. The compute-graph allocation scales with context and batch size, so 32 GB hosts can fit the model by trimming both (see the 32 GB unified-memory row in the table below).

| Hardware | Status |
|---|---|
| ≥48 GB RAM (CPU-only) | Works, ~3-6 tok/s |
| Single H100/A100 80 GB | Works, full offload, ~30+ tok/s |
| RTX 4090 24 GB / 5090 32 GB + 32 GB RAM | Works, partial offload, ~15-25 tok/s |
| Mac Studio M2/M3 Ultra 64 GB+ unified | Works, ~20+ tok/s |
| 32 GB unified-memory laptops (Ryzen AI Max+, Apple M-series) | Works with `num_ctx ≤ 4096` and `num_batch ≤ 256` to fit the compute graph; default 16K ctx OOMs. Measured 28.71 tok/s on ASUS ROG Flow Z13 GZ302EA at Q4_K_M (Radeon 8060S iGPU via ROCm gfx1151). |
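
The 2× rule of thumb and the trimmed settings from the 32 GB row can be turned into a small helper. This is a back-of-the-envelope sketch: the KV cache term is left as an input because this card does not give a per-token figure, and the function names are hypothetical. It has not been validated for hosts below 32 GB.

```python
MODEL_FILE_GB = 18.9   # Q4_K_M file size from this card
RUNTIME_FACTOR = 2.0   # "roughly 2x the model file" rule of thumb


def estimated_footprint_gb(file_gb: float = MODEL_FILE_GB, kv_cache_gb: float = 0.0) -> float:
    """Estimate Ollama's memory footprint at default settings."""
    return file_gb * RUNTIME_FACTOR + kv_cache_gb


def ollama_options_for(ram_gb: float) -> dict:
    """Return num_ctx / num_batch overrides for a host with ram_gb of memory."""
    if ram_gb >= estimated_footprint_gb():
        return {"num_ctx": 16384}                # card default
    return {"num_ctx": 4096, "num_batch": 256}   # trimmed settings from the 32 GB row


print(round(estimated_footprint_gb(), 1))   # 37.8
print(ollama_options_for(32))
print(ollama_options_for(64))
```

The returned dictionary can be passed as `options` in an Ollama API request or mirrored as `PARAMETER` lines in a `Modelfile`.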

## Chat template

The model uses the standard Qwen 3.x ChatML format with `<|im_start|>` / `<|im_end|>` role markers. The template embedded in the GGUF metadata is enough for plain conversation. Ollama users should rely on the `TEMPLATE` block in the included `Modelfile` (or the matching root-level `template` file when pulling through the HF bridge), because that version exposes the tool-calling scaffolding Ollama's capability detector requires; see [Ollama](#ollama) above.

### Plain conversation

```text
<|im_start|>system
You are Janus, a precise and capable assistant…<|im_end|>
<|im_start|>user
What is the time complexity of mergesort?<|im_end|>
<|im_start|>assistant
```
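
For raw-completion endpoints that expect a pre-rendered prompt, the plain-conversation layout above can be produced by hand. The sketch below covers only this simple case; it does not reproduce the tool or thinking logic of the real template, and `render_chatml` is an illustrative name.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render role/content messages into the ChatML layout shown above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are Janus."},
    {"role": "user", "content": "What is the time complexity of mergesort?"},
])
print(prompt)
```

Chat endpoints apply the template for you, so this is only needed when you bypass them.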

### With reasoning trace

When the model decides to think, the assistant turn contains a `<think>…</think>` block followed by the visible answer:

```text
<|im_start|>assistant
<think>
The user is asking about mergesort. Mergesort divides the array, recursively sorts each half, then merges. The recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n).
</think>

Mergesort runs in **O(n log n)** time in the worst, average, and best cases. The recurrence is T(n) = 2T(n/2) + O(n), which solves to Θ(n log n) by the master theorem.<|im_end|>
```

Most clients (Open WebUI, LibreChat, etc.) hide the `<think>` block by default and show only the final answer. If yours displays it, turn off its "show reasoning" option.
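
If you consume raw output yourself, the reasoning trace can be separated from the visible answer with a small helper. This is a minimal sketch that assumes well-formed `<think>…</think>` pairs; a truncated trace with no closing tag is left in the answer untouched.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw assistant turn."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>\nT(n) = 2T(n/2) + O(n) solves to O(n log n).\n</think>\n\nMergesort runs in O(n log n) time."
reasoning, answer = split_reasoning(raw)
print(answer)
```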

### Tool / function calling

The wire format depends on which path you take. **Both are valid**: the model follows whichever format the active chat template's tool instructions specify.

**Ollama path** (this repo's `Modelfile`). The TEMPLATE advertises tools inside `<tools>…</tools>` and asks the model to reply in JSON-in-XML — the form Ollama's tool-call extractor parses into a structured `tool_calls` array on `/api/chat` and `/v1/chat/completions`:

```text
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>
```

**Embedded-jinja path** (llama.cpp, llama-cpp-python, LM Studio). The Qwen 3.6 native chat template baked into the GGUF instructs the model to emit a more verbose XML form. This is the shape you'll see if you talk to `llama-server` or LM Studio directly:

```text
<tool_call>
<function=get_weather>
<parameter=city>
Tokyo
</parameter>
</function>
</tool_call>
```

Pick the parser shape that matches your loader. Don't mix.
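
If your loader hands back raw text rather than a structured `tool_calls` array, each shape needs its own parser. The two functions below are a sketch based only on the examples above; real output may contain several calls or extra whitespace, and parameter values in the XML form are kept as strings.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\s*(.*?)\s*</parameter>", re.DOTALL)


def parse_json_tool_calls(text: str) -> list[dict]:
    """Parse the JSON-in-XML shape (Ollama path)."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        data = json.loads(body)
        calls.append({"name": data["name"], "arguments": data.get("arguments", {})})
    return calls


def parse_xml_tool_calls(text: str) -> list[dict]:
    """Parse the verbose XML shape (embedded-jinja path)."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        for name, inner in FUNCTION_RE.findall(body):
            calls.append({"name": name, "arguments": dict(PARAM_RE.findall(inner))})
    return calls


json_form = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call>'
xml_form = (
    "<tool_call>\n<function=get_weather>\n<parameter=city>\nTokyo\n"
    "</parameter>\n</function>\n</tool_call>"
)
print(parse_json_tool_calls(json_form))
print(parse_xml_tool_calls(xml_form))
```

Both calls print the same structure for the Tokyo example, which makes it easy to keep the downstream tool dispatch identical across loaders.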

#### Example (Ollama, OpenAI-compatible API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ignored")

resp = client.chat.completions.create(
    model="janus",
    messages=[
        {"role": "user", "content": "Call get_weather for Tokyo. Respond ONLY with the tool call."}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    temperature=0.3,
)
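# tool_calls is None when the model answers in plain text instead of calling a tool.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
# Example output (exact JSON spacing may vary): get_weather {"city":"Tokyo"}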
```
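
The example above stops once the model has requested a call. A sketch of the next step, running the tool locally and building the `role: "tool"` messages to send back, might look like the following. `get_weather` and the `TOOLS` registry are hypothetical stand-ins, and the `SimpleNamespace` object only mimics the shape of `resp.choices[0].message.tool_calls` so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    """Hypothetical local tool; returns canned data instead of calling an API."""
    return {"city": city, "forecast": "placeholder"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each requested call and build the tool messages to send back."""
    messages = []
    for call in tool_calls or []:
        fn = TOOLS.get(call.function.name)
        if fn is None:
            result = {"error": f"unknown tool {call.function.name}"}
        else:
            # arguments arrives as a JSON string, not a dict
            result = fn(**json.loads(call.function.arguments))
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return messages


# Stand-in for resp.choices[0].message.tool_calls
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
    )
]
tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

To finish the turn, append the assistant message that carried the tool calls and then these tool messages to `messages`, and call `client.chat.completions.create` again.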

#### Tips

- Use direct prompts ("Call X for Y") rather than soft hints ("Use the tool"). The model thinks before committing to a call, and weak prompts can exhaust `num_predict` inside the `<think>` block before the call is emitted.
- Allow at least `num_predict: 1024` (or `max_tokens: 1024`) for tool-calling turns, more if the schemas are large.
- The Modelfile's JSON-in-XML format is what Ollama's tool-call extractor understands; if you swap loaders, swap the parser to match (see "Embedded-jinja path" above).

## Known limitations

- **No mmproj in this release.** The base Qwen3.6 supports image and video input via a separate `mmproj` file, which is not included here. Text-only inference works out of the box; multimodal inference requires an `mmproj` GGUF that matches Qwen3.6-35B-A3B, obtained or converted from the upstream weights. A projector built for a different model family (for example Qwen2.5-VL) is not expected to work.
- **Quantization-induced quality loss.** Q4_K_M is a strong general-purpose quant but does measurably degrade math and code accuracy compared to BF16. If you need maximum quality, run the upstream safetensors on a GPU that fits BF16 (~70 GB).
- **MoE expert utilization is uneven.** Stock Qwen3.6-35B-A3B routes 8 of 256 experts per token. On narrow domains (e.g. only one programming language) a small subset of experts dominates; load-balance loss was a training-time concern, not a runtime guarantee.
- **Thinking traces can loop.** Like most reasoning-distilled models, Janus-35B occasionally gets stuck repeating itself inside `<think>` tags. Mitigations: lower temperature to 0.4-0.6, raise `repeat_penalty` to 1.08, or set a `<think>`-token budget cap if your loader supports it.
- **Not aligned with any specific safety policy.** This is a personal repackage of an open-weight base model with reasoning-focused distillation. There is no RLHF refusal layer beyond what Qwen 3.6 ships with; downstream safety is the operator's responsibility.
- **No formal evaluation in this card.** Throughput figures in the hardware table are estimates, except the 32 GB unified-memory row, which reports a measured 28.71 tok/s. If you produce real benchmarks (MMLU, HumanEval, etc.) and want them included, file a PR.
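
For the looping-trace limitation above, a client that streams output can watch for repetition and abort early. The detector below is a crude sketch that counts repeated word n-grams; the window size and repeat threshold are arbitrary assumptions that would need tuning against real traces.

```python
from collections import Counter


def is_looping(text: str, window: int = 8, max_repeats: int = 3) -> bool:
    """True when any run of `window` consecutive words occurs more than max_repeats times."""
    words = text.split()
    grams = Counter(
        tuple(words[i:i + window]) for i in range(len(words) - window + 1)
    )
    return any(count > max_repeats for count in grams.values())


stuck = "let me check the recurrence again to be sure " * 6
fine = "Mergesort divides the array, recursively sorts each half, then merges."
print(is_looping(stuck), is_looping(fine))
```

In a streaming loop, call `is_looping` on the accumulated text every few hundred characters and stop reading the stream once it returns `True`, then retry with a lower temperature or a higher `repeat_penalty`.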

## Related models

| Model | Size | Notes |
|---|---|---|
| [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) | 35B / 3B active | Upstream base model. `transformers`-native multimodal weights. |
| [FoolDev/Thanatos-27B-Heretic](https://huggingface.co/FoolDev/Thanatos-27B-Heretic) | 27B dense | Dense sibling, now based on `llmfan46/Qwen3.6-27B-uncensored-heretic-v2` (Heretic-style abliteration of the Qwen 3.6 27B base). Same teacher (Opus 4.7), same dataset family, smaller memory footprint, no MoE quirks, uncensored. (Renamed from `FoolDev/Thanatos-27B`; HF serves a 307 redirect from the old path.) |
| [Crownelius/Crow-9B-HERETIC-4.6](https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6) | 9B dense | Heretic-flavored fine-tune of the Qwen 3.5 9B base, a smaller starting point. Useful as a fast first-pass model when 35B is too heavy for the host. |

## Credits

- Base model: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) (Alibaba)
- Reasoning teacher: Claude Opus 4.7 (Anthropic)
- Distillation lineage and dataset curation: [Crownelius](https://huggingface.co/Crownelius)

License inherited from upstream: Apache-2.0.