Carnice-9b-MLX

4-bit MLX quantization of kai-os/Carnice-9b.

This conversion was produced using mlx_lm.convert for use with MLX on Apple Silicon.

Original Model

Carnice-9b by kai-os is a standalone merged model built on Qwen/Qwen3.5-9B, tuned specifically for the Hermes Agent harness. It was trained in two stages:

  • Stage A — reasoning repair pass on high-signal reasoning data
  • Stage B — Hermes-specific refresh pass built around harness-native traces and action structure

The model is optimized for terminal-heavy task execution, file editing, structured tool use, browser-assisted agent behavior, and multi-turn tool calling inside the Hermes runtime.

See the original model card for full training details and data sources.

Quantization Details

| Property       | Value    |
|----------------|----------|
| Quantization   | 4-bit    |
| Group size     | 64       |
| Mode           | Affine   |
| Original dtype | bfloat16 |
| File size      | ~4.7 GB  |
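The file size follows roughly from the quantization settings. A back-of-envelope sketch, under the assumption that each group of 64 weights stores an fp16 scale and fp16 bias alongside the 4-bit codes (an assumption about MLX's affine storage layout, not a documented guarantee):

```python
# Rough size estimate for 4-bit affine quantization with group size 64.
# Assumption: each 64-weight group adds one fp16 scale and one fp16 bias,
# i.e. 32 extra bits per 64 weights = 0.5 extra bits per weight.
params = 8.95e9
bits_per_weight = 4 + 32 / 64            # 4.5 effective bits per weight
size_gib = params * bits_per_weight / 8 / 2**30
print(f"{size_gib:.2f} GiB")             # ~4.69 GiB, consistent with ~4.7 GB
```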

Model Architecture

| Property           | Value                                                      |
|--------------------|------------------------------------------------------------|
| Architecture       | Qwen3_5ForCausalLM                                         |
| Parameters         | ~8.95B                                                     |
| Hidden layers      | 32                                                         |
| Hidden size        | 4096                                                       |
| Attention heads    | 16 (4 KV heads)                                            |
| Head dim           | 256                                                        |
| Intermediate size  | 12288                                                      |
| Max context length | 262,144 tokens (256K)                                      |
| Vocab size         | 248,320                                                    |
| Attention          | Hybrid (linear + full; every 4th layer is full attention)  |
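The hybrid attention layout can be sketched as below. The exact indexing convention (whether the full-attention layers are layers 3, 7, 11, … or 0, 4, 8, …) is an assumption; the architecture only states that every 4th layer uses full attention:

```python
# Sketch of the hybrid attention layout: 32 hidden layers, with full
# attention on every 4th layer and linear attention on the rest.
# The 1-based "every 4th" indexing here is an assumption.
num_layers = 32
layer_types = ["full" if (i + 1) % 4 == 0 else "linear" for i in range(num_layers)]
print(layer_types.count("full"), layer_types.count("linear"))  # 8 24
```

With this layout, 8 of the 32 layers keep a full KV cache while the remaining 24 use linear attention, which is what makes the 256K context window tractable in memory.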

Features

  • Reasoning — supports <think> / </think> tags for chain-of-thought reasoning
  • Tool calling — Hermes-style tool use with <tool_call> formatting
  • Long context — 256K token context window
  • Multimodal tokens — tokenizer includes vision and audio special tokens (inherited from Qwen3.5 base)
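A minimal sketch of the Hermes-style tool-call format the model emits: a JSON payload wrapped in `<tool_call>` tags. The tool name and arguments below are hypothetical, purely for illustration:

```python
import json

# Hypothetical tool call: "read_file" and its arguments are illustrative only.
call = {"name": "read_file", "arguments": {"path": "README.md"}}

# Hermes-style models wrap the JSON payload in <tool_call> tags;
# the harness parses this block back out of the model's output.
message = f"<tool_call>\n{json.dumps(call)}\n</tool_call>"
print(message)
```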

Benchmarks

Sampled evaluation (30 items per benchmark) on the 4-bit MLX quantized model:

| Benchmark     | Accuracy | Correct | Total | Sampled from |
|---------------|----------|---------|-------|--------------|
| MMLU          | 66.7%    | 20      | 30    | 14,042       |
| HellaSwag     | 90.0%    | 27      | 30    | 10,042       |
| TruthfulQA    | 86.7%    | 26      | 30    | 817          |
| HumanEval     | 83.3%    | 25      | 30    | 164          |
| LiveCodeBench | 30.0%    | 9       | 30    | 1,055        |

Note: These are sampled results (30 items each), not full benchmark runs. They give a rough signal of quantized model quality but should not be compared directly to full-run scores.
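To put the 30-item sample size in perspective, the binomial standard error at these accuracies is substantial. For the MMLU score, a normal-approximation 95% interval looks like this:

```python
import math

# 20/30 correct on the MMLU sample; 95% interval via normal approximation.
p, n = 20 / 30, 30
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"66.7% ± {margin:.1%}")  # roughly ±17 percentage points
```

An interval that wide is why these numbers should be read as a sanity check on the quantization, not as benchmark scores.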

Usage

With mlx-lm

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("jason-schulz/Carnice-9b-MLX")

prompt = "Explain the difference between linear and full attention."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(
    model,
    tokenizer,
    prompt=text,
    max_tokens=512,
)
print(response)

With MLX chat

mlx_lm.chat --model jason-schulz/Carnice-9b-MLX

Conversion

Produced with:

mlx_lm.convert \
    --hf-path kai-os/Carnice-9b \
    --mlx-path Carnice-9b-MLX \
    -q \
    --q-bits 4 \
    --q-group-size 64 \
    --q-mode affine

Conversion Note

The original model's config.json uses model_type: "qwen_3_5_text", which mlx_lm.convert does not recognize. The conversion will fail with an unsupported model type error. To work around this, manually change model_type to "qwen3_5" in the source model's config.json before running the conversion. The resulting model loads and runs correctly under MLX.
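The workaround can be scripted. A small sketch, where the path is illustrative and assumes a local snapshot of the source model:

```python
import json
from pathlib import Path

def patch_model_type(config_path: str) -> None:
    """Rewrite model_type so mlx_lm.convert recognizes the architecture."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    if config.get("model_type") == "qwen_3_5_text":
        config["model_type"] = "qwen3_5"
        path.write_text(json.dumps(config, indent=2))

# Run against a local snapshot of kai-os/Carnice-9b before converting:
# patch_model_type("Carnice-9b/config.json")  # path is illustrative
```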
