Instructions to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="JThomas-CoE/coe-qwen3.5-coding-18b-a3b",
	filename="CoE-qwen3.5-coding-18b-a3b-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Use Docker

docker model run hf.co/JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

LM Studio
Jan
Ollama
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Ollama:
```
ollama run hf.co/JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
```

Unsloth Studio

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JThomas-CoE/coe-qwen3.5-coding-18b-a3b to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JThomas-CoE/coe-qwen3.5-coding-18b-a3b to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for JThomas-CoE/coe-qwen3.5-coding-18b-a3b to start chatting

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Run Hermes

hermes

Atomic Chat new

OpenClaw new

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Docker Model Runner:
```
docker model run hf.co/JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
```

Lemonade

How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M

Run and chat with the model

lemonade run user.coe-qwen3.5-coding-18b-a3b-Q4_K_M

List all available models

lemonade list

CoE-Qwen3.5-Algorithmic-Systems-Coding-18B-A3B — College of Experts Specialist · Beta

⚠ Beta Release. This model has undergone GGUF-level expert surgery but has received no post-surgery supervised fine-tuning. Task performance within the target domain is validated (see §Validation below), but edge-case behaviour may differ from the full base model. Use with the recommended prompt harness and temperature settings.

Base model: Qwen/Qwen3.5-35B-A3B
Parameter count: ~18B total · 3B active/token
Quantization: Q4_K_M
Format: GGUF (Ollama / llama.cpp compatible)
Modality: Text only (vision not supported in this release)
Domain: Algorithmic / Systems Coding — algorithms, systems programming, OS, low-level languages

What This Is

Qwen3.5-35B-A3B is a Mixture-of-Experts model with 256 experts per layer, activating 8 per token. This release is a surgical specialist — 128 of those 256 experts have been removed per layer, retaining only those whose activation frequency is highest for Algorithmic / Systems Coding content.

The result is a model with ~~half the total parameters (~~18B vs 35B), identical active parameters per token (3B), and meaningfully concentrated domain routing.

This is one of 8 domain specialists in the College of Experts beta release.

Surgery Methodology

Expert selection uses a coverage mask derived from 3D activation histograms collected over a domain-specific text corpus.

Utilisation score per expert:

$\text{util}[l, e] = \sum_{k=0}^{7} \frac{8-k}{36} \cdot H[l, e, k]$

where $H[l, e, k]$ is the count of times expert $e$ in layer $l$ was selected at rank $k$ across the domain corpus. Rank 0 (top-selected) is weighted 8/36; rank 7 contributes 1/36.

Mask selection: Top-128 experts by utilisation score per layer, computed from the textual-only histogram for this domain (textual and visual routing activations were found to select largely different experts; see §Validation for the T-vs-V separability result).

GGUF surgery: scripts/prune_gguf_from_mask.py in the GitHub repo operates directly on the Ollama GGUF blob via struct-level I/O — no safetensors or HuggingFace loading required. It slices all blk.N.*ffn_expert* tensors to the retained 128 indices, permutes the router gate weight rows to match, and updates llm.expert_count metadata to 128.

The retained expert indices for this model are in masks/coverage_SYSTEMS_K128.pt (this repo) — a list[torch.LongTensor(128,)] of length 40, one per layer.

Validation

Functional threshold

Expert budget K=128 was validated as the deployment target against the full 256-expert baseline on Python coding tasks (pass@5 at T=0.4):

Expert budget K	% of pool	pass@5 (T=0.4)
256 (baseline)	100%	100%
128	50%	100%
64	25%	86%
32	12.5%	74%

K=128 is the minimum budget that fully saturates pass@5 on the validation suite. K=64 and K=32 show measurable degradation ("cliff" behaviour begins below K=128).

Domain separability

To confirm that retaining domain-specific experts preserves meaningful structure rather than degrading uniformly, the bidirectional separability experiment compared the SYSTEMS/coding and HUMANITIES K=128 pruned models on each other's tasks:

Pruned model	Coding tasks (pass@5, T=0.4)	Humanities tasks
SYSTEMS K=128 (this release family)	100%	~4%
HUMANITIES K=128	~4%	high

The 96 percentage-point gap confirms that the retained expert pools are principled domain partitions, not random subsets. A humanities-specialist model fails almost completely on coding tasks where the coding specialist achieves perfect pass@5 — and vice versa. This is expected from the routing structure of the base model and validates that surgery preserves domain-specific computation.

Full methodology, per-layer expert indices, and all experimental results are in the RESEARCH_LOG.md in the GitHub repo (§21–§23 for the separability experiments; §29 for the full 14-domain Jaccard overlap analysis).

Quickstart (Ollama)

# Pull and run
ollama pull JThomas-CoE/coe-qwen3.5-coding-18b-a3b

ollama run JThomas-CoE/coe-qwen3.5-coding-18b-a3b

Or register directly from this repo's GGUF:

# Download the GGUF and Modelfile, then:
ollama create coe-qwen3.5-coding-18b-a3b -f Modelfile
ollama run coe-qwen3.5-coding-18b-a3b

Recommended temperature: T=0.4
Expert pruning sharpens the router's logit distribution. Higher temperatures (T≥0.7) may cause routing instability. T=0.4 was the validated operating point across all K=128 experiments.

Recommended Prompt Harness

The full model's default of T=0.6 will generally work for the pruned models but expert pruning sharpens the routing distribution so T=0.4 may work better as an operating point for these pruned variants on some tasks but temperatures up to 0.9 have been tested and generally work if greater variability/creativity is desired.

This model performs best with an explicit domain framing in the system prompt. Examples for each specialist are given below — substitute the example matching this model's domain.

Coding specialist (coe-qwen3.5-coding-18b-a3b, T=0.4)

System: "You are a Python coding expert. Complete the following task as such.
         Return a single, complete block of functional Python code. Keep comments
         and explanations concise and minimal. Do not second guess your answer."

User:   "write python code to implement a thread-safe LRU cache with O(1) get and put."

Web specialist (coe-qwen3.5-web-18b-a3b, T=0.4)

System: "You are a web development expert. Answer the following with working code.
         Prefer modern standards and best practices. Add inline comments only where
         the logic is non-obvious. Stop after your answer."

User:   "Create a standalone HTML file for a snake game web app. All CSS and JS must
         be inline. Give the app a retro, dark neon look."

Math specialist (coe-qwen3.5-math-18b-a3b, T=0.4)

System: "You are a mathematics expert. Solve the following problem. Show non-trivial
         intermediate steps. State any assumptions. Use standard notation. Stop after
         the solution."

User:   "Find the eigenvalues and eigenvectors of the matrix [[3, 1], [1, 3]]."

Physics specialist (coe-qwen3.5-physics-18b-a3b, T=0.4)

System: "You are a physics expert. Answer with precision. Show derivations where
         relevant. Use SI units throughout. Stop after your answer."

User:   "Derive the expression for the period of a simple pendulum in the
         small-angle approximation."

Biology specialist (coe-qwen3.5-biology-18b-a3b, T=0.4)

System: "You are a biology expert. Answer with scientific precision. Reference
         specific mechanisms, structures, and established terminology. Do not
         add unsolicited commentary — stop after your answer."

User:   "Explain the role of the sodium-potassium pump in maintaining the
         resting membrane potential of a neuron."

Engineering specialist (coe-qwen3.5-engineering-18b-a3b, T=0.4)

System: "You are an engineering expert. Answer with technical precision. Include
         relevant standards, tolerances, or safety considerations where they
         apply. When you have given your answer stop without further elaboration."

User:   "Compare the fatigue life of a notched versus unnotched steel specimen
         under cyclic loading, and explain the mechanism responsible for the
         difference."

Vocational specialist (coe-qwen3.5-vocational-18b-a3b, T=0.4)

System: "You are an expert on welding. Answer as such. If appropriate include
         best practices guidelines including safety protocols. When you have
         given your answer stop without further elaboration."

User:   "What type of filler rod should I use for TIG welding 304 stainless
         steel, and what shielding gas is appropriate?"

Humanities specialist (coe-qwen3.5-humanities-18b-a3b, T=0.4)

System: "You are a humanities scholar. Answer the question with precision and
         appropriate depth. Cite specific works, authors, or dates when relevant.
         Do not add unsolicited commentary — stop after your answer."

User:   "What is the dramatic function of the Chorus in Greek tragedy?
         Use Sophocles as your primary reference."

Limitations / Beta Caveats

No post-surgery training. This model is the output of structural surgery on the base model weights. No supervised fine-tuning, RLHF, or DPO has been applied after expert removal. Behaviour on unusual prompts may be less robust than the base model.
Text only. Vision inputs are not supported. The _T (textual) mask was used, which selects experts optimised for text routing only. Visual expert pools differ significantly (mean T-vs-V Jaccard ~0.43 for same domain).
Q4_K_M quantization only. Full-precision (BF16) weights will be released following post-surgery validation across all 8 domains.
Recommended context window: 32k tokens (Modelfile default). Longer contexts have not been validated post-surgery.
Domain framing recommended. See prompt harness section above.

Expert Mask

The file masks/coverage_SYSTEMS_K128.pt in this repo contains the retained expert indices used to produce this GGUF. Format:

import torch
masks = torch.load("masks/coverage_SYSTEMS_K128.pt", weights_only=False)
# masks: list of 40 torch.LongTensor, each shape (128,)
# masks[layer_idx] = 1D tensor of 128 retained expert indices for that layer
print(masks[0])   # expert indices retained in layer 0

To reproduce the surgery from the base model GGUF:

python scripts/prune_gguf_from_mask.py \
    --mask   "masks/coverage_SYSTEMS_K128.pt" \
    --input  "<path-to-Qwen3.5-35B-A3B-base.gguf>" \
    --output-dir "./output"

Full surgery script and build pipeline: https://github.com/JThomas-CoE/College-of-Experts-AI

License

This model is derived from Qwen3.5-35B-A3B. The surgical modifications, masks, and College of Experts tooling are released under PolyForm Noncommercial 1.0.0. Commercial licensing available upon request.

The base model weights remain subject to the Qwen3.5 model license.

Downloads last month: 113

GGUF

Model size

20B params

Architecture

qwen35moe

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for JThomas-CoE/coe-qwen3.5-coding-18b-a3b

Base model

Qwen/Qwen3.5-35B-A3B-Base

Finetuned

Qwen/Qwen3.5-35B-A3B

Quantized

(273)

this model