gpt-oss-claude-mlx

Fine-tuned gpt-oss-20b for Claude Code integration and Harmony tool calls, quantized to 4-bit for Apple Silicon via mlx-lm.

Model Details

| Property | Value |
|---|---|
| Base model | mlx-community/gpt-oss-20b-MXFP4-Q4 |
| Architecture | GptOss MoE, 24 layers, 32 experts |
| Fine-tuning method | LoRA |
| LoRA rank | 8 |
| LoRA scale | 20.0 |
| LoRA target layers | attention (q/k/v/o_proj) + MoE (mlp.router, mlp.experts.*) |
| Quantization | 4-bit affine, group size 64 (routers at 8-bit) |
| Disk size | ~10 GB |
| Context length | 131,072 tokens |
| Format | MLX safetensors |

Fine-Tuning Process

Step 1: Dataset Preparation

Training examples are JSONL with a messages array. The assistant turn has two fields:

  • thinking — internal reasoning, mapped to Harmony's analysis channel
  • content — user-visible answer, mapped to Harmony's final channel

The key constraint: pre-render once with tokenizer.apply_chat_template() and save as {"text": "..."}. If you pass raw messages to mlx_lm, it applies the chat template again on top of existing Harmony tokens and corrupts them.
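The pre-rendering step can be sketched as follows. This is a minimal sketch, not the exact script used: the `prerender` function name is hypothetical, while `apply_chat_template(..., tokenize=False)` is the standard Hugging Face tokenizer API the text refers to.

```python
import json

def prerender(tokenizer, examples, out_path):
    """Render each {"messages": [...]} example to {"text": "..."} exactly once,
    so the trainer never re-applies the chat template on top of Harmony tokens."""
    with open(out_path, "w") as f:
        for ex in examples:
            text = tokenizer.apply_chat_template(ex["messages"], tokenize=False)
            f.write(json.dumps({"text": text}) + "\n")
```

mlx_lm's `text` dataset format then treats each line as opaque pre-tokenized text, which is exactly what keeps the Harmony tokens intact.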

Training data combines two sources:

  • Hand-crafted examples covering response style and Harmony tool call syntax (<|start|>assistant to=TOOL blocks for read/write/bash/glob)
  • Real Claude Code conversations mined from ~/.claude/projects/ — threads are reconstructed by following parentUuid links, filtered for at least one tool call (Bash, Read, Write, Edit, Glob, Grep), and sliced into overlapping windows of up to 12 turns. Tool results are truncated at 800 characters; up to 500 examples are kept, deduplicated and shuffled.
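The parentUuid reconstruction reduces to a simple back-walk from a leaf record to the root. A minimal sketch, assuming each JSONL record carries `uuid` and `parentUuid` keys as in Claude Code's project logs (the function name and record shape beyond those two keys are assumptions):

```python
def reconstruct_thread(records, leaf_uuid):
    """Walk parentUuid links from a leaf record back to the root.

    `records` maps uuid -> record dict; returns the thread in
    chronological (root-first) order.
    """
    thread = []
    uuid = leaf_uuid
    while uuid is not None:
        rec = records[uuid]
        thread.append(rec)
        uuid = rec.get("parentUuid")
    return list(reversed(thread))
```

Filtering for tool calls and slicing into 12-turn windows then operates on the ordered thread list.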

Step 2: LoRA on Q4

LoRA adapters were trained directly on the Q4 base model — no Q8 intermediate needed. Adapters stay in full precision and compensate for quantization noise. Gradient checkpointing kept peak RAM manageable on a 32 GB machine.

Training config:

| Parameter | Value |
|---|---|
| Iterations | 200 |
| Batch size | 1 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max seq length | 2048 |
| Gradient checkpointing | true |
| Checkpoint / eval every | 50 steps |
| Best checkpoint | iter 150 |
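The config above maps onto an `mlx_lm.lora` invocation roughly like the following. This is a sketch, not the exact command used: flag names assume a recent mlx-lm release, and the LoRA rank/scale/target keys plus weight decay are normally supplied through a `--config` YAML rather than flags.

```shell
# Sketch: LoRA training directly on the Q4 base model
mlx_lm.lora \
    --model mlx-community/gpt-oss-20b-MXFP4-Q4 \
    --train \
    --data data/ \
    --iters 200 \
    --batch-size 1 \
    --learning-rate 1e-5 \
    --max-seq-length 2048 \
    --grad-checkpoint \
    --steps-per-eval 50 \
    --save-every 50 \
    --adapter-path adapters
# LoRA rank (8), scale (20.0), target keys (attention + mlp.router /
# mlp.experts.*) and weight decay (0.01) go in a --config YAML file.
```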

Step 3: Fuse → Quantize

Fusing directly into a quantized model stacks quantization errors. Instead:

  1. Fuse with --de-quantize — merges the adapters into clean bf16 weights (~40 GB intermediate)
  2. Re-quantize fresh to Q4 — group size 64, routers kept at 8-bit

Q4 base → LoRA (iter 150) → fuse --de-quantize (bf16) → re-quantize Q4 → ~10 GB
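The two steps can be sketched with the mlx-lm CLI. This is a sketch under assumptions: flag names follow recent mlx-lm releases (the fuse flag is spelled `--de-quantize`), and keeping the routers at 8-bit requires a per-layer quantization predicate that is not shown here.

```shell
# 1) Merge adapters and de-quantize to clean bf16 weights (~40 GB on disk)
mlx_lm.fuse \
    --model mlx-community/gpt-oss-20b-MXFP4-Q4 \
    --adapter-path adapters \
    --de-quantize \
    --save-path gpt-oss-fused-bf16

# 2) Quantize the fused bf16 weights back to 4-bit, group size 64
mlx_lm.convert \
    --hf-path gpt-oss-fused-bf16 \
    --mlx-path gpt-oss-claude-mlx \
    -q --q-bits 4 --q-group-size 64
# (keeping mlp.router layers at 8-bit needs a custom quant predicate)
```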

Usage

The goal is to use this model as a local drop-in for Claude Code. The setup has two layers:

  1. vllm-mlx — serves the model on port 8080 with an OpenAI-compatible /v1/chat/completions API
  2. A thin stdlib bridge (~50 lines) on port 8082 — translates Claude Code's Anthropic /v1/messages requests to OpenAI format, forwards to vllm-mlx, strips Harmony tokens from the response, and returns clean JSON
Claude Code → bridge :8082 → vllm-mlx :8080 → model
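The bridge's request translation is the interesting part; a minimal sketch of just that step follows (the full bridge also forwards the result to vllm-mlx and strips Harmony tokens). Field handling here is an assumption based on the public Anthropic and OpenAI request schemas, not the bridge's actual source.

```python
def anthropic_to_openai(req):
    """Translate an Anthropic /v1/messages body into an OpenAI
    /v1/chat/completions body (minimal fields only)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field.
    if req.get("system"):
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        # Anthropic content may be a list of blocks; flatten the text blocks.
        if isinstance(content, list):
            content = "".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req.get("model", "gpt-oss-20b"),
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }
```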

Harmony token stripping

gpt-oss outputs structured Harmony tokens. The bridge extracts only the final channel:

<|channel|>final<|message|>...response...<|end|>

The analysis (chain-of-thought) channel is discarded — Claude Code only sees the clean assistant string.
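The extraction amounts to one regular expression over the raw completion. A minimal sketch (the function name is hypothetical; the `<|channel|>final<|message|>...<|end|>` shape is taken from the example above, with `<|return|>` accepted as an alternative terminator and a raw-text fallback when no markers are present):

```python
import re

# Lazily capture everything between the final-channel header and a terminator.
FINAL = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)", re.S
)

def strip_harmony(raw):
    """Return only the final-channel text; analysis channels are dropped."""
    m = FINAL.search(raw)
    return m.group(1).strip() if m else raw.strip()
```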

Point Claude Code at your local model

mkdir -p /tmp/claude-local
HOME=/tmp/claude-local \
ANTHROPIC_BASE_URL=http://localhost:8082 \
ANTHROPIC_API_KEY=local \
claude --model gpt-oss-20b

The isolated HOME keeps your normal Claude config untouched. The dummy API key satisfies the client; the bridge ignores it.

Harmony-native agent (optional)

If you want to use the model's native tool-call format directly — without the bridge or Claude Code — a localcode agent parses <|start|>assistant to=TOOL blocks (read, write, bash, glob, etc.) and executes them in a loop. This exposes the full Harmony tool-call training that the Claude Code path masks.
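A tool-call parser for such an agent can be sketched in a few lines. The `<|start|>assistant to=TOOL<|message|>...<|end|>` block shape follows the training-data description above; JSON argument bodies and the fallback behavior are assumptions, not the localcode agent's actual implementation.

```python
import json
import re

# Matches blocks like: <|start|>assistant to=bash<|message|>{"cmd": "ls"}<|end|>
TOOL_CALL = re.compile(
    r"<\|start\|>assistant to=(\w+)<\|message\|>(.*?)<\|end\|>", re.S
)

def parse_tool_calls(raw):
    """Yield (tool_name, args) pairs for each Harmony tool-call block."""
    for tool, body in TOOL_CALL.findall(raw):
        try:
            yield tool, json.loads(body)
        except json.JSONDecodeError:
            # Non-JSON body: hand it to the tool verbatim.
            yield tool, {"raw": body}
```

An agent loop would execute each call (read, write, bash, glob, ...), append the tool result to the conversation, and re-prompt the model.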

Hardware

Trained and tested on a Mac Mini M4 with 32 GB unified memory. The Q4 model uses ~12 GB RAM — fits comfortably on 16 GB machines.
