# Nanbeige4.1-3B-MLX-4bit (4-bit Quantized)

This is the Nanbeige4.1-3B model converted to MLX format with 4-bit quantization (affine, group_size=64) for efficient inference on Apple Silicon. It is the smallest and fastest variant, ideal for speed-sensitive and memory-constrained use cases.
Other variants:
- andrevp/Nanbeige4.1-3B-MLX-8bit — Best balance of quality & speed (~3.91 GB)
- andrevp/Nanbeige4.1-3B-MLX-BF16 — Full precision (~7.3 GB)
## All Variants Compared

### Performance
| Variant | Size | Memory | Prompt Speed | Gen Speed |
|---|---|---|---|---|
| 4-bit (this) | 2.06 GB | ~2.3 GB | ~279 tok/s | ~103 tok/s |
| 8-bit | 3.91 GB | ~4.3 GB | ~342 tok/s | ~59 tok/s |
| BF16 | 7.35 GB | ~8.0 GB | ~276 tok/s | ~33 tok/s |
### Quality Comparison (Head-to-Head, Identical Prompts, temp=0)
All three variants were tested with the same prompts under deterministic settings (temperature=0) to evaluate quality differences:
| Test | 4-bit | 8-bit | BF16 |
|---|---|---|---|
| Math: 47 * 83 | 3901 | 3901 | 3901 |
| Logic: "All but 9 die" trick | 9 | 9 | 9 |
| Code: Binary search | Correct | Correct | Correct |
| Math: f(x)=2x^2-3x+1, f(5) | 36 | 36 | 36 |
| Nuanced reasoning: Paper folding | Correct | Correct | Correct |
| Tool call: BookFlight JSON | Identical | Identical | Identical |
| AIME-style: 2^100 mod 7 | 2 | 2 | 2 |
Key findings:
- 8-bit vs BF16: Produced word-for-word identical reasoning and answers in the majority of tests. Essentially zero quality loss.
- 4-bit vs BF16: Sometimes takes slightly different reasoning paths, but arrives at the same correct answers. Tool calling output is 100% identical across all variants.
- Recommendation: 4-bit is best for speed and memory. 8-bit is the sweet spot for quality. BF16 is for research and benchmarking where exact reproduction matters.
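These head-to-head runs are straightforward to reproduce. A minimal sketch, assuming the variant repos listed above and an illustrative prompt (each model is loaded in turn, so memory only needs to hold one at a time):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# temp=0 makes decoding greedy, so any output difference between
# variants reflects quantization alone, not sampling noise.
sampler = make_sampler(temp=0.0)

question = "What is 47 * 83? Answer with the number only."

for repo in (
    "andrevp/Nanbeige4.1-3B-MLX-4bit",
    "andrevp/Nanbeige4.1-3B-MLX-8bit",
):
    model, tokenizer = load(repo)
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        tokenize=False,
    )
    response = generate(
        model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler
    )
    # Compare only the final answer after the reasoning block.
    print(repo, "->", response.split("</think>")[-1].strip())
```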
## Original Model
- Source: Nanbeige/Nanbeige4.1-3B
- Developer: Nanbeige (BOSS Zhipin)
- Architecture: LlamaForCausalLM (4B parameters)
- License: Apache 2.0 (same as the original model)
## Conversion Details
| Property | Value |
|---|---|
| Quantization | 4-bit affine |
| Group size | 64 |
| Bits per weight | ~4.5 |
| Original size (BF16) | ~7.87 GB |
| Quantized size | ~2.06 GB |
| Compression ratio | 3.8x |
| Conversion tool | mlx-lm v0.30.7 |
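An equivalent conversion can be run locally with mlx-lm's `convert` API. A minimal sketch, assuming the argument names of current mlx-lm releases (`q_bits`/`q_group_size` select the quantization shown above):

```python
from mlx_lm import convert

# Download the original BF16 weights and quantize to 4-bit affine
# with group size 64, writing the MLX model to the given path.
convert(
    hf_path="Nanbeige/Nanbeige4.1-3B",
    mlx_path="Nanbeige4.1-3B-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```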
## Performance on Apple Silicon
Tested on Apple Silicon:
| Metric | Value |
|---|---|
| Prompt processing | ~279 tokens/sec |
| Generation speed | ~103 tokens/sec |
| Peak memory usage | ~2.3 GB |
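These figures can be approximated on your own machine: `generate` reports throughput and peak memory when called with `verbose=True`. A minimal sketch (the prompt is illustrative; exact numbers depend on chip and prompt length):

```python
from mlx_lm import load, generate

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the history of the transistor."}],
    add_generation_prompt=True,
    tokenize=False,
)
# verbose=True prints prompt tok/s, generation tok/s, and peak memory.
generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```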
## Capabilities
This model retains all capabilities of the original Nanbeige4.1-3B:
- Reasoning/Thinking: Uses `<think>...</think>` tags for chain-of-thought reasoning
- Tool/Function Calling: Generates structured `<tool_call>...</tool_call>` JSON output
- Multi-turn Conversation: Supports multi-turn chat with context tracking
- Multilingual: Strong performance in both English and Chinese
- Code Generation: Capable of writing and explaining code
- Deep-Search Agent: Supports deep-search tasks with 500+ rounds of tool invocations (using `tokenizer_config_search.json`)
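Because reasoning and tool calls arrive as tagged spans in the raw generation, downstream code usually parses them out. A minimal sketch (the regex and the JSON shape inside `<tool_call>` are illustrative, not an official parser):

```python
import json
import re

# Illustrative raw output containing both tagged spans.
raw = (
    "<think>The user wants weather, so call the tool.</think>\n"
    '<tool_call>{"name": "SearchWeather", "arguments": {"location": "Tokyo"}}</tool_call>'
)

# The final answer is whatever follows the closing </think> tag.
answer = raw.split("</think>")[-1].strip()

# Tool calls are emitted as JSON between <tool_call>...</tool_call> tags.
tool_calls = [
    json.loads(m)
    for m in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", raw, re.DOTALL)
]
print(tool_calls)
```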
## Quickstart

### Installation

```bash
pip install mlx-lm
```

### CLI Usage

```bash
mlx_lm generate \
  --model andrevp/Nanbeige4.1-3B-MLX-4bit \
  --prompt "Explain quantum computing in simple terms." \
  --max-tokens 512 \
  --temp 0.6 \
  --top-p 0.95
```
### Python - Chat

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.8?"}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler
)
print(response)
```
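For interactive use, the same setup works with token streaming via `stream_generate`; in recent mlx-lm versions it yields response chunks with a `.text` field (a sketch, not the only streaming API shape):

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Which number is bigger, 9.11 or 9.8?"}],
    add_generation_prompt=True,
    tokenize=False,
)
# Print tokens as they are produced instead of waiting for the full answer.
for chunk in stream_generate(
    model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler
):
    print(chunk.text, end="", flush=True)
print()
```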
### Python - Tool Calling

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "SearchWeather",
            "description": "Find the current weather in a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler
)
print(response)
```
### Python - Multi-turn with Tool Responses

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

tools = [
    {
        "type": "function",
        "function": {
            "name": "SearchWeather",
            "description": "Find the current weather in a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "SearchWeather", "arguments": '{"location": "Paris"}'}}
    ]},
    {"role": "tool", "content": '{"temperature": "18°C", "condition": "Partly cloudy"}'}
]

prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=300, sampler=sampler
)
print(response)
```
## Recommended Inference Hyperparameters
| Parameter | Value |
|---|---|
| Temperature | 0.6 |
| Top-p | 0.95 |
| Repeat penalty | 1.0 |
| Max new tokens | 131,072 |
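In mlx-lm these settings map onto the sampler and logits-processor helpers roughly as follows. A minimal sketch (`make_logits_processors` is assumed from `mlx_lm.sample_utils`; a repeat penalty of 1.0 is the neutral value, shown only for completeness):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")

# Temperature 0.6 and top-p 0.95, per the table above.
sampler = make_sampler(temp=0.6, top_p=0.95)
# Repeat penalty 1.0 applies no penalty; raise it only if outputs loop.
logits_processors = make_logits_processors(repetition_penalty=1.0)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    add_generation_prompt=True,
    tokenize=False,
)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```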
## Benchmarks (Original Model)

### General Reasoning Tasks
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3-14B | Qwen3-32B | Qwen3-30B-A3B | Nanbeige4.1-3B |
|---|---|---|---|---|---|---|
| **Code** | | | | | | |
| Live-Code-Bench-V6 | 57.4 | 49.4 | 55.9 | 55.7 | 66.0 | 76.9 |
| LCB-Pro-Easy | 40.2 | 41.2 | 33.0 | 42.3 | 60.8 | 81.4 |
| LCB-Pro-Medium | 5.3 | 3.5 | 1.8 | 3.5 | 3.5 | 28.1 |
| **Math** | | | | | | |
| AIME 2026 I | 81.46 | 70.42 | 76.46 | 75.83 | 87.30 | 87.40 |
| HMMT Nov | 68.33 | 48.33 | 56.67 | 57.08 | 71.25 | 77.92 |
| IMO-Answer-Bench | 48.00 | 36.56 | 41.81 | 43.94 | 54.34 | 53.38 |
| **Science** | | | | | | |
| GPQA | 65.8 | 62.0 | 63.38 | 68.4 | 73.4 | 83.8 |
| HLE (Text-only) | 6.72 | 5.28 | 7.00 | 9.31 | 11.77 | 12.60 |
| **Alignment** | | | | | | |
| Arena-Hard-v2 | 34.9 | 26.3 | 36.9 | 56.0 | 60.2 | 73.2 |
| Multi-Challenge | 41.14 | 36.30 | 36.97 | 38.72 | 49.40 | 52.21 |
| **Tool Use** | | | | | | |
| BFCL-V4 | 44.87 | 42.20 | 45.14 | 47.90 | 48.6 | 56.50 |
| Tau2-Bench | 45.9 | 42.06 | 44.96 | 45.26 | 47.70 | 48.57 |
### Deep Search Benchmarks
| Model | xBench-DS-2505 | xBench-DS-2510 | Browse-Comp | GAIA | HLE | SEAL-0 |
|---|---|---|---|---|---|---|
| MiroThinker-v1.0-8B | 61 | - | 31.1 | 66.4 | 21.5 | 40.4 |
| AgentCPM-Explore-4B | 70 | - | 25.0 | 63.9 | 19.1 | 40.0 |
| Qwen3-32B | 39 | 8 | 3.15 | 30.17 | 9.26 | 8.15 |
| Nanbeige4.1-3B | 75 | 39 | 19.12 | 69.90 | 22.29 | 41.44 |
## Files Included

| File | Description |
|---|---|
| `model.safetensors` | Quantized 4-bit model weights |
| `model.safetensors.index.json` | Weight index mapping |
| `config.json` | Model architecture config (with quantization params) |
| `tokenizer.json` | Fast tokenizer |
| `tokenizer.model` | SentencePiece model |
| `tokenizer_config.json` | Tokenizer config with all special tokens |
| `tokenizer_config_search.json` | Tokenizer config for deep-search mode |
| `chat_template.jinja` | Chat template (chat + tool calling + reasoning) |
| `special_tokens_map.json` | Special tokens mapping |
| `added_tokens.json` | Additional token definitions |
| `generation_config.json` | Default generation parameters |
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|im_start\|>` | 166100 | BOS / message start |
| `<\|im_end\|>` | 166101 | EOS / message end |
| `<\|endoftext\|>` | 166102 | End of text |
| `<think>` | 166103 | Start of reasoning |
| `</think>` | 166104 | End of reasoning |
| `<tool_call>` | 166105 | Start of tool call |
| `</tool_call>` | 166106 | End of tool call |
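The IDs above can be sanity-checked against the shipped tokenizer; the wrapper returned by `load` forwards standard Hugging Face tokenizer methods such as `convert_tokens_to_ids` (a quick check, not part of any official API contract):

```python
from mlx_lm import load

_, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")

# Each token should map to the ID listed in the table above.
for token in ("<|im_start|>", "<|im_end|>", "<|endoftext|>",
              "<think>", "</think>", "<tool_call>", "</tool_call>"):
    print(token, tokenizer.convert_tokens_to_ids(token))
```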
## Deep-Search Mode

For deep-search agent capabilities, switch to `tokenizer_config_search.json` and use the miroflow-framework for inference. See the original model card for detailed deep-search setup instructions.
## Important: Extended Thinking Behavior

Nanbeige4.1-3B is a reasoning model that generates chain-of-thought inside `<think>...</think>` tags before producing a final answer. This is by design — and it affects all variants equally (4-bit, 8-bit, BF16).
### What to expect
| Task Type | Typical Thinking Length | Recommended max_tokens |
|---|---|---|
| Math, logic, tool calling | 200–500 tokens | 512–2,048 |
| Code generation | 500–1,500 tokens | 2,048–4,096 |
| Translation, creative writing, commonsense | 3,000–5,000+ tokens | 8,192+ |
For complex tasks (translation, creative writing, open-ended questions), the model may spend 3,000–5,000+ tokens reasoning before delivering an answer. If `max_tokens` is too low, the output will be truncated mid-thinking and no final answer will appear. This is not a failure; the model simply needs more tokens to finish its thought process.
### Workarounds

**1. Increase `max_tokens` (recommended)**

```python
# For complex tasks, use high max_tokens
# (continues from the Quickstart setup above)
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=8192,  # or higher for very complex tasks
    sampler=sampler
)
```
**2. Skip thinking (experimental)**

You can pre-fill an empty thinking block to force the model to answer directly. This does not always work — the model may re-enter thinking mode — but it can help for simpler open-ended tasks:

```python
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
prompt += '<think>\n\n</think>\n\n'  # Force empty thinking block
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=512, sampler=sampler
)
```
**3. Post-process to extract the answer**

```python
if '</think>' in response:
    answer = response.split('</think>')[-1].strip()
else:
    answer = response  # Still in thinking — increase max_tokens
```
## Limitations

While safety was emphasized during training, the model may still produce unexpected outputs due to its size and probabilistic nature. Users should not disseminate harmful content generated by the model. The developers assume no responsibility for consequences arising from the spread of inappropriate content.
## License
This model is released under the Apache 2.0 License, the same license as the original Nanbeige4.1-3B model.
## Credits
- Original model: Nanbeige/Nanbeige4.1-3B by Nanbeige (BOSS Zhipin)
- MLX framework: Apple MLX
- Conversion: mlx-lm
- Contact (original model): nanbeige@kanzhun.com