# Nanbeige4.1-3B-MLX-4bit (4-bit Quantized)

This is the Nanbeige4.1-3B model converted to MLX format with 4-bit quantization (affine, group_size=64) for efficient inference on Apple Silicon. It is the smallest and fastest variant, ideal for speed-sensitive and memory-constrained use cases.
Other variants:
- andrevp/Nanbeige4.1-3B-MLX-8bit — Best balance of quality & speed (~3.91 GB)
- andrevp/Nanbeige4.1-3B-MLX-BF16 — Full precision (~7.3 GB)
## All Variants Compared

### Performance
| Variant | Size | Memory | Prompt Speed | Gen Speed |
|---|---|---|---|---|
| 4-bit (this) | 2.06 GB | ~2.3 GB | ~279 tok/s | ~103 tok/s |
| 8-bit | 3.91 GB | ~4.3 GB | ~342 tok/s | ~59 tok/s |
| BF16 | 7.35 GB | ~8.0 GB | ~276 tok/s | ~33 tok/s |
### Quality Comparison (Head-to-Head, Identical Prompts, temp=0)
All three variants were tested with the same prompts under deterministic settings (temperature=0) to evaluate quality differences:
| Test | 4-bit | 8-bit | BF16 |
|---|---|---|---|
| Math: 47 * 83 | 3901 | 3901 | 3901 |
| Logic: "All but 9 die" trick | 9 | 9 | 9 |
| Code: Binary search | Correct | Correct | Correct |
| Math: f(x)=2x^2-3x+1, f(5) | 36 | 36 | 36 |
| Nuanced reasoning: Paper folding | Correct | Correct | Correct |
| Tool call: BookFlight JSON | Identical | Identical | Identical |
| AIME-style: 2^100 mod 7 | 2 | 2 | 2 |
Key findings:
- 8-bit vs BF16: Produced word-for-word identical reasoning and answers in the majority of tests. Essentially zero quality loss.
- 4-bit vs BF16: Sometimes takes slightly different reasoning paths, but arrives at the same correct answers. Tool calling output is 100% identical across all variants.
- Recommendation: 4-bit is best for speed and memory. 8-bit is the sweet spot for quality. BF16 is for research and benchmarking where exact reproduction matters.
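These head-to-head runs are straightforward to reproduce. A minimal sketch, assuming the variant repos listed above and an illustrative prompt (each model is loaded in turn, so memory only needs to hold one at a time):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# temp=0 makes decoding greedy, so any output difference between
# variants reflects quantization alone, not sampling noise.
sampler = make_sampler(temp=0.0)

question = "What is 47 * 83? Answer with the number only."

for repo in (
    "andrevp/Nanbeige4.1-3B-MLX-4bit",
    "andrevp/Nanbeige4.1-3B-MLX-8bit",
):
    model, tokenizer = load(repo)
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        tokenize=False,
    )
    response = generate(
        model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler
    )
    # Compare only the final answer after the reasoning block.
    print(repo, "->", response.split("</think>")[-1].strip())
```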
## Original Model
- Source: Nanbeige/Nanbeige4.1-3B
- Developer: Nanbeige (BOSS Zhipin)
- Architecture: LlamaForCausalLM (4B parameters)
- License: Apache 2.0 (same as the original model)
## Conversion Details
| Property | Value |
|---|---|
| Quantization | 4-bit affine |
| Group size | 64 |
| Bits per weight | ~4.5 |
| Original size (BF16) | ~7.87 GB |
| Quantized size | ~2.06 GB |
| Compression ratio | 3.8x |
| Conversion tool | mlx-lm v0.30.7 |
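An equivalent conversion can be run locally with mlx-lm's `convert` API. A minimal sketch, assuming the argument names of current mlx-lm releases (`q_bits`/`q_group_size` select the quantization shown above):

```python
from mlx_lm import convert

# Download the original BF16 weights and quantize to 4-bit affine
# with group size 64, writing the MLX model to the given path.
convert(
    hf_path="Nanbeige/Nanbeige4.1-3B",
    mlx_path="Nanbeige4.1-3B-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```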
## Performance on Apple Silicon
Tested on Apple Silicon:
| Metric | Value |
|---|---|
| Prompt processing | ~279 tokens/sec |
| Generation speed | ~103 tokens/sec |
| Peak memory usage | ~2.3 GB |
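These figures can be approximated on your own machine: `generate` reports throughput and peak memory when called with `verbose=True`. A minimal sketch (the prompt is illustrative; exact numbers depend on chip and prompt length):

```python
from mlx_lm import load, generate

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the history of the transistor."}],
    add_generation_prompt=True,
    tokenize=False,
)
# verbose=True prints prompt tok/s, generation tok/s, and peak memory.
generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```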
## Capabilities
This model retains all capabilities of the original Nanbeige4.1-3B:
- Reasoning/Thinking: Uses `<think>...</think>` tags for chain-of-thought reasoning
- Tool/Function Calling: Generates structured `<tool_call>...</tool_call>` JSON output
- Multi-turn Conversation: Supports multi-turn chat with context tracking
- Multilingual: Strong performance in both English and Chinese
- Code Generation: Capable of writing and explaining code
- Deep-Search Agent: Supports deep-search tasks with 500+ rounds of tool invocations (using `tokenizer_config_search.json`)
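Because reasoning and tool calls arrive as tagged spans in the raw generation, downstream code usually parses them out. A minimal sketch (the regex and the JSON shape inside `<tool_call>` are illustrative, not an official parser):

```python
import json
import re

# Illustrative raw output containing both tagged spans.
raw = (
    "<think>The user wants weather, so call the tool.</think>\n"
    '<tool_call>{"name": "SearchWeather", "arguments": {"location": "Tokyo"}}</tool_call>'
)

# The final answer is whatever follows the closing </think> tag.
answer = raw.split("</think>")[-1].strip()

# Tool calls are emitted as JSON between <tool_call>...</tool_call> tags.
tool_calls = [
    json.loads(m)
    for m in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", raw, re.DOTALL)
]
print(tool_calls)
```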
## Quickstart

### Installation

```bash
pip install mlx-lm
```

### CLI Usage

```bash
mlx_lm generate \
  --model andrevp/Nanbeige4.1-3B-MLX-4bit \
  --prompt "Explain quantum computing in simple terms." \
  --max-tokens 512 \
  --temp 0.6 \
  --top-p 0.95
```
### Python - Chat

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.8?"}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler
)
print(response)
```
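For interactive use, the same setup works with token streaming via `stream_generate`; in recent mlx-lm versions it yields response chunks with a `.text` field (a sketch, not the only streaming API shape):

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Which number is bigger, 9.11 or 9.8?"}],
    add_generation_prompt=True,
    tokenize=False,
)
# Print tokens as they are produced instead of waiting for the full answer.
for chunk in stream_generate(
    model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler
):
    print(chunk.text, end="", flush=True)
print()
```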
### Python - Tool Calling

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "SearchWeather",
            "description": "Find the current weather in a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler
)
print(response)
```
### Python - Multi-turn with Tool Responses

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

tools = [
    {
        "type": "function",
        "function": {
            "name": "SearchWeather",
            "description": "Find the current weather in a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "SearchWeather", "arguments": '{"location": "Paris"}'}}
    ]},
    {"role": "tool", "content": '{"temperature": "18°C", "condition": "Partly cloudy"}'}
]

prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=300, sampler=sampler
)
print(response)
```
## Recommended Inference Hyperparameters
| Parameter | Value |
|---|---|
| Temperature | 0.6 |
| Top-p | 0.95 |
| Repeat penalty | 1.0 |
| Max new tokens | 131,072 |
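In mlx-lm these settings map onto the sampler and logits-processor helpers roughly as follows. A minimal sketch (`make_logits_processors` is assumed from `mlx_lm.sample_utils`; a repeat penalty of 1.0 is the neutral value, shown only for completeness):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")

# Temperature 0.6 and top-p 0.95, per the table above.
sampler = make_sampler(temp=0.6, top_p=0.95)
# Repeat penalty 1.0 applies no penalty; raise it only if outputs loop.
logits_processors = make_logits_processors(repetition_penalty=1.0)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    add_generation_prompt=True,
    tokenize=False,
)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```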
## Benchmarks (Original Model)

### General Reasoning Tasks
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3-14B | Qwen3-32B | Qwen3-30B-A3B | Nanbeige4.1-3B |
|---|---|---|---|---|---|---|
| **Code** | | | | | | |
| Live-Code-Bench-V6 | 57.4 | 49.4 | 55.9 | 55.7 | 66.0 | 76.9 |
| LCB-Pro-Easy | 40.2 | 41.2 | 33.0 | 42.3 | 60.8 | 81.4 |
| LCB-Pro-Medium | 5.3 | 3.5 | 1.8 | 3.5 | 3.5 | 28.1 |
| **Math** | | | | | | |
| AIME 2026 I | 81.46 | 70.42 | 76.46 | 75.83 | 87.30 | 87.40 |
| HMMT Nov | 68.33 | 48.33 | 56.67 | 57.08 | 71.25 | 77.92 |
| IMO-Answer-Bench | 48.00 | 36.56 | 41.81 | 43.94 | 54.34 | 53.38 |
| **Science** | | | | | | |
| GPQA | 65.8 | 62.0 | 63.38 | 68.4 | 73.4 | 83.8 |
| HLE (Text-only) | 6.72 | 5.28 | 7.00 | 9.31 | 11.77 | 12.60 |
| **Alignment** | | | | | | |
| Arena-Hard-v2 | 34.9 | 26.3 | 36.9 | 56.0 | 60.2 | 73.2 |
| Multi-Challenge | 41.14 | 36.30 | 36.97 | 38.72 | 49.40 | 52.21 |
| **Tool Use** | | | | | | |
| BFCL-V4 | 44.87 | 42.20 | 45.14 | 47.90 | 48.6 | 56.50 |
| Tau2-Bench | 45.9 | 42.06 | 44.96 | 45.26 | 47.70 | 48.57 |
### Deep Search Benchmarks
| Model | xBench-DS-2505 | xBench-DS-2510 | Browse-Comp | GAIA | HLE | SEAL-0 |
|---|---|---|---|---|---|---|
| MiroThinker-v1.0-8B | 61 | - | 31.1 | 66.4 | 21.5 | 40.4 |
| AgentCPM-Explore-4B | 70 | - | 25.0 | 63.9 | 19.1 | 40.0 |
| Qwen3-32B | 39 | 8 | 3.15 | 30.17 | 9.26 | 8.15 |
| Nanbeige4.1-3B | 75 | 39 | 19.12 | 69.90 | 22.29 | 41.44 |
## Files Included

| File | Description |
|---|---|
| `model.safetensors` | Quantized 4-bit model weights |
| `model.safetensors.index.json` | Weight index mapping |
| `config.json` | Model architecture config (with quantization params) |
| `tokenizer.json` | Fast tokenizer |
| `tokenizer.model` | SentencePiece model |
| `tokenizer_config.json` | Tokenizer config with all special tokens |
| `tokenizer_config_search.json` | Tokenizer config for deep-search mode |
| `chat_template.jinja` | Chat template (chat + tool calling + reasoning) |
| `special_tokens_map.json` | Special tokens mapping |
| `added_tokens.json` | Additional token definitions |
| `generation_config.json` | Default generation parameters |
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|im_start\|>` | 166100 | BOS / message start |
| `<\|im_end\|>` | 166101 | EOS / message end |
| `<\|endoftext\|>` | 166102 | End of text |
| `<think>` | 166103 | Start of reasoning |
| `</think>` | 166104 | End of reasoning |
| `<tool_call>` | 166105 | Start of tool call |
| `</tool_call>` | 166106 | End of tool call |
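The IDs above can be sanity-checked against the shipped tokenizer; the wrapper returned by `load` forwards standard Hugging Face tokenizer methods such as `convert_tokens_to_ids` (a quick check, not part of any official API contract):

```python
from mlx_lm import load

_, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")

# Each token should map to the ID listed in the table above.
for token in ("<|im_start|>", "<|im_end|>", "<|endoftext|>",
              "<think>", "</think>", "<tool_call>", "</tool_call>"):
    print(token, tokenizer.convert_tokens_to_ids(token))
```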
## Deep-Search Mode

For deep-search agent capabilities, switch to `tokenizer_config_search.json` and use the miroflow-framework for inference. See the original model card for detailed deep-search setup instructions.
## Important: Extended Thinking Behavior

Nanbeige4.1-3B is a reasoning model that generates chain-of-thought inside `<think>...</think>` tags before producing a final answer. This is by design — and it affects all variants equally (4-bit, 8-bit, BF16).
### What to expect
| Task Type | Typical Thinking Length | Recommended max_tokens |
|---|---|---|
| Math, logic, tool calling | 200–500 tokens | 512–2,048 |
| Code generation | 500–1,500 tokens | 2,048–4,096 |
| Translation, creative writing, commonsense | 3,000–5,000+ tokens | 8,192+ |
For complex tasks (translation, creative writing, open-ended questions), the model may spend 3,000–5,000+ tokens reasoning before delivering an answer. If `max_tokens` is too low, the output will be truncated mid-thinking and no final answer will appear. This is not a failure; the model simply needs more tokens to finish its thought process.
### Workarounds

**1. Increase `max_tokens` (recommended)**

```python
# For complex tasks, use high max_tokens
# (continues from the Quickstart setup above)
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=8192,  # or higher for very complex tasks
    sampler=sampler
)
```
**2. Skip thinking (experimental)**

You can pre-fill an empty thinking block to force the model to answer directly. This does not always work — the model may re-enter thinking mode — but it can help for simpler open-ended tasks:

```python
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
prompt += '<think>\n\n</think>\n\n'  # Force empty thinking block
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=512, sampler=sampler
)
```
**3. Post-process to extract the answer**

```python
if '</think>' in response:
    answer = response.split('</think>')[-1].strip()
else:
    answer = response  # Still in thinking — increase max_tokens
```
## Limitations

While safety was emphasized during training, the model may still produce unexpected outputs due to its size and probabilistic nature. Users should not disseminate harmful content generated by the model. The developers assume no responsibility for consequences arising from the spread of inappropriate content.
## License
This model is released under the Apache 2.0 License, the same license as the original Nanbeige4.1-3B model.
## Credits
- Original model: Nanbeige/Nanbeige4.1-3B by Nanbeige (BOSS Zhipin)
- MLX framework: Apple MLX
- Conversion: mlx-lm
- Contact (original model): nanbeige@kanzhun.com