Instructions to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="JThomas-CoE/coe-qwen3.5-coding-18b-a3b", filename="CoE-qwen3.5-coding-18b-a3b-Q4_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M # Run inference directly in the terminal: llama-cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Use Docker
docker model run hf.co/JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Ollama:
ollama run hf.co/JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
- Unsloth Studio new
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for JThomas-CoE/coe-qwen3.5-coding-18b-a3b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for JThomas-CoE/coe-qwen3.5-coding-18b-a3b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for JThomas-CoE/coe-qwen3.5-coding-18b-a3b to start chatting
- Pi new
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Docker Model Runner:
docker model run hf.co/JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
- Lemonade
How to use JThomas-CoE/coe-qwen3.5-coding-18b-a3b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull JThomas-CoE/coe-qwen3.5-coding-18b-a3b:Q4_K_M
Run and chat with the model
lemonade run user.coe-qwen3.5-coding-18b-a3b-Q4_K_M
List all available models
lemonade list
llm.create_chat_completion(
messages = "No input example has been defined for this model task."
)CoE-Qwen3.5-Algorithmic-Systems-Coding-18B-A3B โ College of Experts Specialist ยท Beta
โ Beta Release. This model has undergone GGUF-level expert surgery but has received no post-surgery supervised fine-tuning. Task performance within the target domain is validated (see ยงValidation below), but edge-case behaviour may differ from the full base model. Use with the recommended prompt harness and temperature settings.
Base model: Qwen/Qwen3.5-35B-A3B
Parameter count: ~18B total ยท 3B active/token
Quantization: Q4_K_M
Format: GGUF (Ollama / llama.cpp compatible)
Modality: Text only (vision not supported in this release)
Domain: Algorithmic / Systems Coding โ algorithms, systems programming, OS, low-level languages
What This Is
Qwen3.5-35B-A3B is a Mixture-of-Experts model with 256 experts per layer, activating 8 per token. This release is a surgical specialist โ 128 of those 256 experts have been removed per layer, retaining only those whose activation frequency is highest for Algorithmic / Systems Coding content.
The result is a model with half the total parameters (18B vs 35B), identical active
parameters per token (3B), and meaningfully concentrated domain routing.
This is one of 8 domain specialists in the College of Experts beta release.
Surgery Methodology
Expert selection uses a coverage mask derived from 3D activation histograms collected over a domain-specific text corpus.
Utilisation score per expert:
where $H[l, e, k]$ is the count of times expert $e$ in layer $l$ was selected at rank $k$ across the domain corpus. Rank 0 (top-selected) is weighted 8/36; rank 7 contributes 1/36.
Mask selection: Top-128 experts by utilisation score per layer, computed from the textual-only histogram for this domain (textual and visual routing activations were found to select largely different experts; see ยงValidation for the T-vs-V separability result).
GGUF surgery: scripts/prune_gguf_from_mask.py in the GitHub repo operates directly
on the Ollama GGUF blob via struct-level I/O โ no safetensors or HuggingFace loading
required. It slices all blk.N.*ffn_expert* tensors to the retained 128 indices, permutes
the router gate weight rows to match, and updates llm.expert_count metadata to 128.
The retained expert indices for this model are in masks/coverage_SYSTEMS_K128.pt
(this repo) โ a list[torch.LongTensor(128,)] of length 40, one per layer.
Validation
Functional threshold
Expert budget K=128 was validated as the deployment target against the full 256-expert baseline on Python coding tasks (pass@5 at T=0.4):
| Expert budget K | % of pool | pass@5 (T=0.4) |
|---|---|---|
| 256 (baseline) | 100% | 100% |
| 128 | 50% | 100% |
| 64 | 25% | 86% |
| 32 | 12.5% | 74% |
K=128 is the minimum budget that fully saturates pass@5 on the validation suite. K=64 and K=32 show measurable degradation ("cliff" behaviour begins below K=128).
Domain separability
To confirm that retaining domain-specific experts preserves meaningful structure rather than degrading uniformly, the bidirectional separability experiment compared the SYSTEMS/coding and HUMANITIES K=128 pruned models on each other's tasks:
| Pruned model | Coding tasks (pass@5, T=0.4) | Humanities tasks |
|---|---|---|
| SYSTEMS K=128 (this release family) | 100% | ~4% |
| HUMANITIES K=128 | ~4% | high |
The 96 percentage-point gap confirms that the retained expert pools are principled domain partitions, not random subsets. A humanities-specialist model fails almost completely on coding tasks where the coding specialist achieves perfect pass@5 โ and vice versa. This is expected from the routing structure of the base model and validates that surgery preserves domain-specific computation.
Full methodology, per-layer expert indices, and all experimental results are in the RESEARCH_LOG.md in the GitHub repo (ยง21โยง23 for the separability experiments; ยง29 for the full 14-domain Jaccard overlap analysis).
Quickstart (Ollama)
# Pull and run
ollama pull JThomas-CoE/coe-qwen3.5-coding-18b-a3b
ollama run JThomas-CoE/coe-qwen3.5-coding-18b-a3b
Or register directly from this repo's GGUF:
# Download the GGUF and Modelfile, then:
ollama create coe-qwen3.5-coding-18b-a3b -f Modelfile
ollama run coe-qwen3.5-coding-18b-a3b
Recommended temperature: T=0.4
Expert pruning sharpens the router's logit distribution. Higher temperatures (Tโฅ0.7)
may cause routing instability. T=0.4 was the validated operating point across all
K=128 experiments.
Recommended Prompt Harness
The full model's default of T=0.6 will generally work for the pruned models but expert pruning sharpens the routing distribution so T=0.4 may work better as an operating point for these pruned variants on some tasks but temperatures up to 0.9 have been tested and generally work if greater variability/creativity is desired.
This model performs best with an explicit domain framing in the system prompt. Examples for each specialist are given below โ substitute the example matching this model's domain.
Coding specialist (coe-qwen3.5-coding-18b-a3b, T=0.4)
System: "You are a Python coding expert. Complete the following task as such.
Return a single, complete block of functional Python code. Keep comments
and explanations concise and minimal. Do not second guess your answer."
User: "write python code to implement a thread-safe LRU cache with O(1) get and put."
Web specialist (coe-qwen3.5-web-18b-a3b, T=0.4)
System: "You are a web development expert. Answer the following with working code.
Prefer modern standards and best practices. Add inline comments only where
the logic is non-obvious. Stop after your answer."
User: "Create a standalone HTML file for a snake game web app. All CSS and JS must
be inline. Give the app a retro, dark neon look."
Math specialist (coe-qwen3.5-math-18b-a3b, T=0.4)
System: "You are a mathematics expert. Solve the following problem. Show non-trivial
intermediate steps. State any assumptions. Use standard notation. Stop after
the solution."
User: "Find the eigenvalues and eigenvectors of the matrix [[3, 1], [1, 3]]."
Physics specialist (coe-qwen3.5-physics-18b-a3b, T=0.4)
System: "You are a physics expert. Answer with precision. Show derivations where
relevant. Use SI units throughout. Stop after your answer."
User: "Derive the expression for the period of a simple pendulum in the
small-angle approximation."
Biology specialist (coe-qwen3.5-biology-18b-a3b, T=0.4)
System: "You are a biology expert. Answer with scientific precision. Reference
specific mechanisms, structures, and established terminology. Do not
add unsolicited commentary โ stop after your answer."
User: "Explain the role of the sodium-potassium pump in maintaining the
resting membrane potential of a neuron."
Engineering specialist (coe-qwen3.5-engineering-18b-a3b, T=0.4)
System: "You are an engineering expert. Answer with technical precision. Include
relevant standards, tolerances, or safety considerations where they
apply. When you have given your answer stop without further elaboration."
User: "Compare the fatigue life of a notched versus unnotched steel specimen
under cyclic loading, and explain the mechanism responsible for the
difference."
Vocational specialist (coe-qwen3.5-vocational-18b-a3b, T=0.4)
System: "You are an expert on welding. Answer as such. If appropriate include
best practices guidelines including safety protocols. When you have
given your answer stop without further elaboration."
User: "What type of filler rod should I use for TIG welding 304 stainless
steel, and what shielding gas is appropriate?"
Humanities specialist (coe-qwen3.5-humanities-18b-a3b, T=0.4)
System: "You are a humanities scholar. Answer the question with precision and
appropriate depth. Cite specific works, authors, or dates when relevant.
Do not add unsolicited commentary โ stop after your answer."
User: "What is the dramatic function of the Chorus in Greek tragedy?
Use Sophocles as your primary reference."
Limitations / Beta Caveats
- No post-surgery training. This model is the output of structural surgery on the base model weights. No supervised fine-tuning, RLHF, or DPO has been applied after expert removal. Behaviour on unusual prompts may be less robust than the base model.
- Text only. Vision inputs are not supported. The
_T(textual) mask was used, which selects experts optimised for text routing only. Visual expert pools differ significantly (mean T-vs-V Jaccard ~0.43 for same domain). - Q4_K_M quantization only. Full-precision (BF16) weights will be released following post-surgery validation across all 8 domains.
- Recommended context window: 32k tokens (Modelfile default). Longer contexts have not been validated post-surgery.
- Domain framing recommended. See prompt harness section above.
Expert Mask
The file masks/coverage_SYSTEMS_K128.pt in this repo contains the retained expert
indices used to produce this GGUF. Format:
import torch
masks = torch.load("masks/coverage_SYSTEMS_K128.pt", weights_only=False)
# masks: list of 40 torch.LongTensor, each shape (128,)
# masks[layer_idx] = 1D tensor of 128 retained expert indices for that layer
print(masks[0]) # expert indices retained in layer 0
To reproduce the surgery from the base model GGUF:
python scripts/prune_gguf_from_mask.py \
--mask "masks/coverage_SYSTEMS_K128.pt" \
--input "<path-to-Qwen3.5-35B-A3B-base.gguf>" \
--output-dir "./output"
Full surgery script and build pipeline: https://github.com/JThomas-CoE/College-of-Experts-AI
License
This model is derived from Qwen3.5-35B-A3B. The surgical modifications, masks, and College of Experts tooling are released under PolyForm Noncommercial 1.0.0. Commercial licensing available upon request.
The base model weights remain subject to the Qwen3.5 model license.
- Downloads last month
- 162
4-bit
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="JThomas-CoE/coe-qwen3.5-coding-18b-a3b", filename="CoE-qwen3.5-coding-18b-a3b-Q4_K_M.gguf", )