# gpt-oss-claude-mlx
Fine-tuned gpt-oss-20b for Claude Code integration and Harmony tool calls, quantized to 4-bit for Apple Silicon via mlx-lm.
## Model Details
| Property | Value |
|---|---|
| Base model | mlx-community/gpt-oss-20b-MXFP4-Q4 |
| Architecture | GptOss MoE, 24 layers, 32 experts |
| Fine-tuning method | LoRA |
| LoRA rank | 8 |
| LoRA scale | 20.0 |
| LoRA target layers | attention (q/k/v/o_proj) + MoE (mlp.router, mlp.experts.*) |
| Quantization | 4-bit affine, group size 64 (routers at 8-bit) |
| Disk size | ~10 GB |
| Context length | 131072 tokens |
| Format | MLX safetensors |
## Fine-Tuning Process

### Step 1: Dataset Preparation

Training examples are JSONL with a `messages` array. The assistant turn has two fields:

- `thinking` — internal reasoning, mapped to Harmony's `analysis` channel
- `content` — user-visible answer, mapped to Harmony's `final` channel
The key constraint: pre-render once with `tokenizer.apply_chat_template()` and save as `{"text": "..."}`. If you pass raw `messages` to `mlx_lm`, it applies the chat template again on top of the existing Harmony tokens and corrupts them.
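The pre-render step can be sketched like this (a minimal sketch: it assumes a tokenizer with a Harmony chat template, e.g. loaded via `mlx_lm.load`, and the helper name is illustrative):

```python
import json

def prerender(examples, tokenizer, out_path):
    """Render each messages array exactly once and save it as raw text.

    Saving {"text": ...} means mlx_lm trains on the string as-is and
    never re-applies the chat template over existing Harmony tokens.
    """
    with open(out_path, "w") as f:
        for ex in examples:
            text = tokenizer.apply_chat_template(
                ex["messages"], tokenize=False, add_generation_prompt=False
            )
            f.write(json.dumps({"text": text}) + "\n")
```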
Training data combines two sources:

- Hand-crafted examples covering response style and Harmony tool-call syntax (`<|start|>assistant to=TOOL` blocks for read/write/bash/glob)
- Real Claude Code conversations mined from `~/.claude/projects/` — threads are reconstructed by following `parentUuid` links, filtered for at least one tool call (`Bash`, `Read`, `Write`, `Edit`, `Glob`, `Grep`), and sliced into overlapping windows of up to 12 turns. Tool results are truncated at 800 chars. Up to 500 examples, deduplicated and shuffled.
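The `parentUuid` walk can be sketched as follows (a sketch under the assumption that each record also carries a `uuid` id field, which is not stated above):

```python
def reconstruct_thread(records, leaf_uuid):
    """Walk parentUuid links from a leaf record back to the root,
    returning the thread in chronological (root-first) order."""
    by_uuid = {r["uuid"]: r for r in records}
    thread = []
    cur = by_uuid.get(leaf_uuid)
    while cur is not None:
        thread.append(cur)
        cur = by_uuid.get(cur.get("parentUuid"))
    return list(reversed(thread))
```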
### Step 2: LoRA on Q4
LoRA adapters were trained directly on the Q4 base model — no Q8 intermediate needed. Adapters stay in full precision and compensate for quantization noise. Gradient checkpointing kept peak RAM manageable on a 32 GB machine.
Training config:
| Parameter | Value |
|---|---|
| Iterations | 200 |
| Batch size | 1 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max seq length | 2048 |
| Gradient checkpointing | true |
| Checkpoint / eval every | 50 steps |
| Best checkpoint | iter 150 |
### Step 3: Fuse → Quantize
Fusing directly into a quantized model stacks quantization errors. Instead:
- Fuse with `--dequantize` — merges adapters into clean bf16 weights (~40 GB intermediate)
- Re-quantize fresh to Q4 — group size 64, routers kept at 8-bit

```
Q4 base → LoRA (iter 150) → fuse --dequantize (bf16) → re-quantize Q4 → ~10 GB
```
## Usage
The goal is to use this model as a local drop-in for Claude Code. The setup has two layers:
- vllm-mlx — serves the model on port 8080 with an OpenAI-compatible `/v1/chat/completions` API
- A thin stdlib bridge (~50 lines) on port 8082 — translates Claude Code's Anthropic `/v1/messages` requests to OpenAI format, forwards to vllm-mlx, strips Harmony tokens from the response, and returns clean JSON

```
Claude Code → bridge :8082 → vllm-mlx :8080 → model
```
### Harmony token stripping

gpt-oss outputs structured Harmony tokens. The bridge extracts only the `final` channel:

```
<|channel|>final<|message|>...response...<|end|>
```
The analysis (chain-of-thought) channel is discarded — Claude Code only sees the clean assistant string.
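A sketch of what the bridge and the stripping look like together (illustrative, not the actual ~50-line implementation: content-block handling is simplified, and streaming and error handling are omitted):

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080/v1/chat/completions"

def strip_harmony(text):
    """Keep only the final-channel payload of a Harmony response."""
    marker = "<|channel|>final<|message|>"
    if marker in text:
        text = text.split(marker, 1)[1]
    return text.split("<|end|>", 1)[0].split("<|return|>", 1)[0]

def to_openai(body):
    """Translate an Anthropic /v1/messages body to OpenAI chat format."""
    messages = []
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for m in body.get("messages", []):
        content = m["content"]
        if isinstance(content, list):  # Anthropic content blocks -> plain text
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": m["role"], "content": content})
    return {"model": body.get("model", "gpt-oss-20b"), "messages": messages,
            "max_tokens": body.get("max_tokens", 1024)}

def to_anthropic(openai_resp, model):
    """Wrap the stripped completion back into an Anthropic-style reply."""
    text = strip_harmony(openai_resp["choices"][0]["message"]["content"])
    return {"type": "message", "role": "assistant", "model": model,
            "content": [{"type": "text", "text": text}],
            "stop_reason": "end_turn"}

class Bridge(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        req = urllib.request.Request(
            UPSTREAM, data=json.dumps(to_openai(body)).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as r:
            reply = json.dumps(to_anthropic(json.load(r),
                                            body.get("model", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

# To run: HTTPServer(("127.0.0.1", 8082), Bridge).serve_forever()
```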
### Point Claude Code at your local model

```shell
mkdir -p /tmp/claude-local
HOME=/tmp/claude-local \
  ANTHROPIC_BASE_URL=http://localhost:8082 \
  ANTHROPIC_API_KEY=local \
  claude --model gpt-oss-20b
```
The isolated HOME keeps your normal Claude config untouched. The dummy API key satisfies the client; the bridge ignores it.
### Harmony-native agent (optional)
If you want to use the model's native tool-call format directly — without the bridge or Claude Code — a localcode agent parses `<|start|>assistant to=TOOL` blocks (`read`, `write`, `bash`, `glob`, etc.) and executes them in a loop. This exposes the full Harmony tool-call training that the Claude Code path masks.
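The parsing step might look like this (a sketch: the exact token layout between the `to=TOOL` header and the argument payload is an assumption here):

```python
import re

# Matches `<|start|>assistant to=TOOL ... <|message|>ARGS<|call|>` blocks;
# the non-greedy `.*?` skips any channel header between the tool name and
# the argument payload.
TOOL_CALL = re.compile(
    r"<\|start\|>assistant to=([\w.]+)"
    r".*?<\|message\|>(.*?)(?:<\|call\|>|<\|end\|>)",
    re.DOTALL,
)

def parse_tool_calls(raw):
    """Extract (tool_name, raw_arguments) pairs from Harmony output."""
    return [(m.group(1), m.group(2).strip()) for m in TOOL_CALL.finditer(raw)]
```

An agent loop would then dispatch each pair to the matching tool (run the command, read the file, …), append the result to the conversation, and generate again.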
## Hardware
Trained and tested on a Mac Mini M4 with 32 GB unified memory. The Q4 model uses ~12 GB RAM — fits comfortably on 16 GB machines.