# gpt-oss-claude-mlx
Fine-tuned gpt-oss-20b for Claude Code integration and Harmony tool calls, quantized to 4-bit for Apple Silicon via mlx-lm.
## Model Details
| Property | Value |
|---|---|
| Base model | mlx-community/gpt-oss-20b-MXFP4-Q4 |
| Architecture | GptOss MoE, 24 layers, 32 experts |
| Fine-tuning method | LoRA |
| LoRA rank | 8 |
| LoRA scale | 20.0 |
| LoRA target layers | attention (q/k/v/o_proj) + MoE (mlp.router, mlp.experts.*) |
| Quantization | 4-bit affine, group size 64 (routers at 8-bit) |
| Disk size | ~10 GB |
| Context length | 131072 tokens |
| Format | MLX safetensors |
## Fine-Tuning Process

### Step 1: Dataset Preparation

Training examples are JSONL with a `messages` array. The assistant turn has two fields:

- `thinking` — internal reasoning, mapped to Harmony's `analysis` channel
- `content` — user-visible answer, mapped to Harmony's `final` channel
The key constraint: pre-render once with `tokenizer.apply_chat_template()` and save as `{"text": "..."}`. If you pass raw `messages` to `mlx_lm`, it applies the chat template again on top of the existing Harmony tokens and corrupts them.
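The pre-render step can be sketched like this (a minimal sketch: it assumes a tokenizer with a Harmony chat template, e.g. loaded via `mlx_lm.load`, and the helper name is illustrative):

```python
import json

def prerender(examples, tokenizer, out_path):
    """Render each messages array exactly once and save it as raw text.

    Saving {"text": ...} means mlx_lm trains on the string as-is and
    never re-applies the chat template over existing Harmony tokens.
    """
    with open(out_path, "w") as f:
        for ex in examples:
            text = tokenizer.apply_chat_template(
                ex["messages"], tokenize=False, add_generation_prompt=False
            )
            f.write(json.dumps({"text": text}) + "\n")
```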
Training data combines two sources:

- Hand-crafted examples covering response style and Harmony tool-call syntax (`<|start|>assistant to=TOOL` blocks for read/write/bash/glob)
- Real Claude Code conversations mined from `~/.claude/projects/` — threads are reconstructed by following `parentUuid` links, filtered for at least one tool call (`Bash`, `Read`, `Write`, `Edit`, `Glob`, `Grep`), and sliced into overlapping windows of up to 12 turns. Tool results are truncated at 800 chars. Up to 500 examples, deduplicated and shuffled.
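The `parentUuid` walk can be sketched as follows (a sketch under the assumption that each record also carries a `uuid` id field, which is not stated above):

```python
def reconstruct_thread(records, leaf_uuid):
    """Walk parentUuid links from a leaf record back to the root,
    returning the thread in chronological (root-first) order."""
    by_uuid = {r["uuid"]: r for r in records}
    thread = []
    cur = by_uuid.get(leaf_uuid)
    while cur is not None:
        thread.append(cur)
        cur = by_uuid.get(cur.get("parentUuid"))
    return list(reversed(thread))
```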
### Step 2: LoRA on Q4
LoRA adapters were trained directly on the Q4 base model — no Q8 intermediate needed. Adapters stay in full precision and compensate for quantization noise. Gradient checkpointing kept peak RAM manageable on a 32 GB machine.
Training config:
| Parameter | Value |
|---|---|
| Iterations | 200 |
| Batch size | 1 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max seq length | 2048 |
| Gradient checkpointing | true |
| Checkpoint / eval every | 50 steps |
| Best checkpoint | iter 150 |
### Step 3: Fuse → Quantize
Fusing directly into a quantized model stacks quantization errors. Instead:
- Fuse with `--dequantize` — merges adapters into clean bf16 weights (~40 GB intermediate)
- Re-quantize fresh to Q4 — group size 64, routers kept at 8-bit

```
Q4 base → LoRA (iter 150) → fuse --dequantize (bf16) → re-quantize Q4 → ~10 GB
```
## Usage
The goal is to use this model as a local drop-in for Claude Code. The setup has two layers:
- vllm-mlx — serves the model on port 8080 with an OpenAI-compatible `/v1/chat/completions` API
- A thin stdlib bridge (~50 lines) on port 8082 — translates Claude Code's Anthropic `/v1/messages` requests to OpenAI format, forwards to vllm-mlx, strips Harmony tokens from the response, and returns clean JSON

```
Claude Code → bridge :8082 → vllm-mlx :8080 → model
```
### Harmony token stripping

gpt-oss outputs structured Harmony tokens. The bridge extracts only the `final` channel:

```
<|channel|>final<|message|>...response...<|end|>
```
The analysis (chain-of-thought) channel is discarded — Claude Code only sees the clean assistant string.
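A sketch of what the bridge and the stripping look like together (illustrative, not the actual ~50-line implementation: content-block handling is simplified, and streaming and error handling are omitted):

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080/v1/chat/completions"

def strip_harmony(text):
    """Keep only the final-channel payload of a Harmony response."""
    marker = "<|channel|>final<|message|>"
    if marker in text:
        text = text.split(marker, 1)[1]
    return text.split("<|end|>", 1)[0].split("<|return|>", 1)[0]

def to_openai(body):
    """Translate an Anthropic /v1/messages body to OpenAI chat format."""
    messages = []
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for m in body.get("messages", []):
        content = m["content"]
        if isinstance(content, list):  # Anthropic content blocks -> plain text
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": m["role"], "content": content})
    return {"model": body.get("model", "gpt-oss-20b"), "messages": messages,
            "max_tokens": body.get("max_tokens", 1024)}

def to_anthropic(openai_resp, model):
    """Wrap the stripped completion back into an Anthropic-style reply."""
    text = strip_harmony(openai_resp["choices"][0]["message"]["content"])
    return {"type": "message", "role": "assistant", "model": model,
            "content": [{"type": "text", "text": text}],
            "stop_reason": "end_turn"}

class Bridge(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        req = urllib.request.Request(
            UPSTREAM, data=json.dumps(to_openai(body)).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as r:
            reply = json.dumps(to_anthropic(json.load(r),
                                            body.get("model", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

# To run: HTTPServer(("127.0.0.1", 8082), Bridge).serve_forever()
```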
### Point Claude Code at your local model

```shell
mkdir -p /tmp/claude-local
HOME=/tmp/claude-local \
  ANTHROPIC_BASE_URL=http://localhost:8082 \
  ANTHROPIC_API_KEY=local \
  claude --model gpt-oss-20b
```
The isolated HOME keeps your normal Claude config untouched. The dummy API key satisfies the client; the bridge ignores it.
### Harmony-native agent (optional)
If you want to use the model's native tool-call format directly — without the bridge or Claude Code — a localcode agent parses `<|start|>assistant to=TOOL` blocks (`read`, `write`, `bash`, `glob`, etc.) and executes them in a loop. This exposes the full Harmony tool-call training that the Claude Code path masks.
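The parsing step might look like this (a sketch: the exact token layout between the `to=TOOL` header and the argument payload is an assumption here):

```python
import re

# Matches `<|start|>assistant to=TOOL ... <|message|>ARGS<|call|>` blocks;
# the non-greedy `.*?` skips any channel header between the tool name and
# the argument payload.
TOOL_CALL = re.compile(
    r"<\|start\|>assistant to=([\w.]+)"
    r".*?<\|message\|>(.*?)(?:<\|call\|>|<\|end\|>)",
    re.DOTALL,
)

def parse_tool_calls(raw):
    """Extract (tool_name, raw_arguments) pairs from Harmony output."""
    return [(m.group(1), m.group(2).strip()) for m in TOOL_CALL.finditer(raw)]
```

An agent loop would then dispatch each pair to the matching tool (run the command, read the file, …), append the result to the conversation, and generate again.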
## Hardware
Trained and tested on a Mac Mini M4 with 32 GB unified memory. The Q4 model uses ~12 GB RAM — fits comfortably on 16 GB machines.