# Carnice-9b-MLX

4-bit MLX quantization of kai-os/Carnice-9b. This conversion was produced using `mlx_lm.convert` for use with MLX on Apple Silicon.
## Original Model
Carnice-9b by kai-os is a standalone merged model built on Qwen/Qwen3.5-9B, tuned specifically for the Hermes Agent harness. It was trained in two stages:
- Stage A — reasoning repair pass on high-signal reasoning data
- Stage B — Hermes-specific refresh pass built around harness-native traces and action structure
The model is optimized for terminal-heavy task execution, file editing, structured tool use, browser-assisted agent behavior, and multi-turn tool calling inside the Hermes runtime.
See the original model card for full training details and data sources.
## Quantization Details
| Property | Value |
|---|---|
| Quantization | 4-bit |
| Group size | 64 |
| Mode | Affine |
| Original dtype | bfloat16 |
| File size | ~4.7 GB |
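The ~4.7 GB figure is consistent with the quantization settings. A back-of-the-envelope check, assuming every weight is stored at 4 bits with a bf16 scale and bias per group of 64 (embeddings or individual layers may be handled differently, so this is only an estimate):

```python
# Rough size estimate for 4-bit affine quantization with group size 64.
# Assumption: each group of 64 weights carries one bf16 scale and one bf16 bias.
params = 8.95e9
bits_per_weight = 4 + (16 + 16) / 64  # 4-bit weights + 0.5 bits/weight of group metadata
size_gib = params * bits_per_weight / 8 / 2**30
print(f"{size_gib:.2f} GiB")  # ≈ 4.69 GiB, matching the ~4.7 GB above
```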
## Model Architecture
| Property | Value |
|---|---|
| Architecture | Qwen3_5ForCausalLM |
| Parameters | ~8.95B |
| Hidden layers | 32 |
| Hidden size | 4096 |
| Attention heads | 16 (4 KV heads) |
| Head dim | 256 |
| Intermediate size | 12288 |
| Max context length | 262,144 tokens (256K) |
| Vocab size | 248,320 |
| Attention | Hybrid (linear + full, every 4th layer is full attention) |
## Features
- Reasoning — supports `<think>`/`</think>` tags for chain-of-thought reasoning
- Tool calling — Hermes-style tool use with `<tool_call>` formatting
- Long context — 256K token context window
- Multimodal tokens — tokenizer includes vision and audio special tokens (inherited from Qwen3.5 base)
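The Hermes-style tags above can be consumed programmatically. A minimal sketch of pulling a tool call out of generated text, assuming the common Hermes convention of JSON inside `<tool_call>` tags (the authoritative format is defined by the model's chat template, so verify against it before relying on this):

```python
import json
import re

# Example model output in the common Hermes tool-call convention (illustrative).
output = '<tool_call>{"name": "read_file", "arguments": {"path": "notes.txt"}}</tool_call>'

# Extract the JSON payload between the tags and parse it.
match = re.search(r"<tool_call>(.*?)</tool_call>", output, re.DOTALL)
call = json.loads(match.group(1))
print(call["name"], call["arguments"]["path"])  # read_file notes.txt
```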
## Benchmarks
Sampled evaluation (30 items per benchmark) on the 4-bit MLX quantized model:
| Benchmark | Accuracy | Correct | Total | Sampled From |
|---|---|---|---|---|
| MMLU | 66.7% | 20 | 30 | 14,042 |
| HellaSwag | 90.0% | 27 | 30 | 10,042 |
| TruthfulQA | 86.7% | 26 | 30 | 817 |
| HumanEval | 83.3% | 25 | 30 | 164 |
| LiveCodeBench | 30.0% | 9 | 30 | 1,055 |
Note: These are sampled results (30 items each), not full benchmark runs. They give a rough signal of quantized model quality but should not be compared directly to full-run scores.
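To quantify how rough a 30-item sample is, a normal-approximation 95% confidence interval for the MMLU row above shows a margin of error of roughly ±17 points:

```python
import math

# 95% normal-approximation interval for a 30-item sampled benchmark score.
n, correct = 30, 20  # the MMLU row above
p = correct / n
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"{p:.1%} ± {margin:.1%}")  # 66.7% ± 16.9%
```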
## Usage
### With mlx-lm

```shell
pip install mlx-lm
```
```python
from mlx_lm import load, generate

model, tokenizer = load("jason-schulz/Carnice-9b-MLX")

prompt = "Explain the difference between linear and full attention."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(
    model,
    tokenizer,
    prompt=text,
    max_tokens=512,
)
print(response)
```
### With MLX chat

```shell
mlx_lm.chat --model jason-schulz/Carnice-9b-MLX
```
## Conversion

Produced with:

```shell
mlx_lm.convert \
  --hf-path kai-os/Carnice-9b \
  --mlx-path Carnice-9b-MLX \
  -q \
  --q-bits 4 \
  --q-group-size 64 \
  --q-mode affine
```
### Conversion Note

The original model's `config.json` uses `model_type: "qwen_3_5_text"`, which `mlx_lm.convert` does not recognize; the conversion fails with an unsupported model type error. To work around this, manually change `model_type` to `"qwen3_5"` in the source model's `config.json` before running the conversion. The resulting model loads and runs correctly under MLX.
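The edit can also be scripted. A hypothetical helper (the snapshot path is illustrative, and the demo below runs against a stand-in config rather than the real checkpoint):

```python
import json
import tempfile
from pathlib import Path

def patch_model_type(config_path: Path) -> bool:
    """Rename model_type 'qwen_3_5_text' -> 'qwen3_5' so mlx_lm.convert accepts it.

    Returns True if the file was rewritten, False if no change was needed.
    """
    config = json.loads(config_path.read_text())
    if config.get("model_type") != "qwen_3_5_text":
        return False
    config["model_type"] = "qwen3_5"
    config_path.write_text(json.dumps(config, indent=2))
    return True

# Demo on a stand-in config.json; point this at the local snapshot of
# kai-os/Carnice-9b instead when running the real conversion.
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "config.json"
    path.write_text(json.dumps({"model_type": "qwen_3_5_text"}))
    print(patch_model_type(path), json.loads(path.read_text())["model_type"])
```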
## Credits
- Original model: kai-os/Carnice-9b by kai-os
- Base model: Qwen/Qwen3.5-9B by Qwen
- MLX conversion: This repository