|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- deepseek |
|
|
- kimi_k2 |
|
|
- text-generation |
|
|
- reasoning |
|
|
- agentic |
|
|
- tool-calling |
|
|
- compressed-tensors |
|
|
pipeline_tag: text-generation |
|
|
base_model: moonshotai/Kimi-K2-Thinking |
|
|
--- |
|
|
|
|
|
# Zen Max - Kimi K2 Thinking Architecture |
|
|
|
|
|
**Organization**: [Zen LM](https://zenlm.org) (Hanzo AI × Zoo Labs Foundation)
|
|
**Base Model**: Moonshot AI Kimi K2 Thinking (DeepseekV3ForCausalLM) |
|
|
**Parameters**: 671B total (384 experts × ~1.75B each, 8 active per token = ~14B)
|
|
**License**: Apache 2.0 |
|
|
**Context Window**: 256K tokens |
|
|
**Thinking Capacity**: 96K-128K thinking tokens per step |
|
|
**Architecture**: DeepseekV3 MoE (Mixture of Experts) |
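
The expert arithmetic above is easy to sanity-check (the ~1.75B-per-expert figure is approximate, which is why the product lands near, rather than exactly on, the 671B headline number):

```python
# Sanity-check the MoE parameter arithmetic quoted above.
n_experts = 384
params_per_expert = 1.75e9  # ~1.75B parameters per expert (approximate)
active_experts = 8          # experts routed to each token

total = n_experts * params_per_expert        # ~672B, near the 671B headline
active = active_experts * params_per_expert  # ~14B active per token

print(f"total expert params: {total / 1e9:.0f}B")
print(f"active per token:    {active / 1e9:.0f}B")
```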
|
|
|
|
|
## Model Overview |
|
|
|
|
|
Zen Max is a reasoning-first language model built on Moonshot AI's Kimi K2 Thinking architecture, designed for **test-time scaling** through extended thinking and tool-calling capabilities. |
|
|
|
|
|
Built as a **thinking agent**, Zen Max reasons step-by-step while using tools, executing **200-300 sequential tool calls** without human intervention and reasoning coherently across hundreds of steps to solve complex problems.
|
|
|
|
|
> **Note**: This repository contains configuration files and documentation for Zen Max. The full model weights (~1TB) are available from the base model: [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking). Zen-specific fine-tuning instructions and adapters will be provided in future releases. |
|
|
|
|
|
### Key Capabilities |
|
|
|
|
|
#### 1. Agentic Reasoning (HLE: 44.9%) |
|
|
- Extended chain-of-thought reasoning with `<think>` tags |
|
|
- Multi-step planning and execution |
|
|
- Adaptive reasoning with hypothesis generation and refinement |
|
|
- Think → search → code → verify → think cycles
|
|
|
|
|
#### 2. Agentic Search & Browsing (BrowseComp: 60.2%) |
|
|
- Goal-directed web-based reasoning |
|
|
- 200-300 sequential tool calls for information gathering |
|
|
- Real-world information collection and synthesis |
|
|
- Dynamic search → browser → reasoning loops
|
|
|
|
|
#### 3. Agentic Coding (SWE-Bench Verified: 71.3%) |
|
|
- Multi-language support (100+ languages) |
|
|
- Agentic coding workflows with tool integration |
|
|
- Component-heavy web development (React, HTML) |
|
|
- Terminal automation (Terminal-Bench: 47.1%) |
|
|
|
|
|
#### 4. Mathematical Reasoning |
|
|
- AIME 2025: 99.1% (with Python) |
|
|
- HMMT 2025: 95.1% (with Python) |
|
|
- IMO-AnswerBench: 78.6% |
|
|
- GPQA-Diamond: 84.5% |
|
|
|
|
|
### Architecture Features |
|
|
|
|
|
#### Test-Time Scaling |
|
|
- **Thinking Tokens**: 96K-128K per reasoning step |
|
|
- **Extended Context**: 256K tokens |
|
|
- **Sequential Tool Calls**: 200-300 without human intervention |
|
|
- **Parallel Rollouts**: Heavy mode with 8 simultaneous trajectories |
|
|
|
|
|
#### INT4 Quantization-Aware Training |
|
|
- Native INT4 inference support |
|
|
- 2x generation speed improvement |
|
|
- State-of-the-art performance at INT4 precision |
|
|
- Optimized for low-bit quantization during post-training |
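
To illustrate what weight-only INT4 quantization does, here is a minimal per-row symmetric sketch in NumPy. It is not the model's actual QAT recipe, which bakes quantization error into post-training rather than applying it after the fact:

```python
import numpy as np

def quantize_int4_symmetric(w: np.ndarray):
    """Per-row symmetric INT4: integers in [-8, 7] plus one fp scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_int4_symmetric(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step per element.
```

Two INT4 values pack into one byte at serialization time, which is where the roughly 4x size and memory-bandwidth savings over BF16 come from.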
|
|
|
|
|
#### Inference Efficiency |
|
|
- Quantization-aware training (QAT) for MoE components |
|
|
- INT4 weight-only quantization |
|
|
- ~50% latency reduction |
|
|
- Minimal performance degradation |
|
|
|
|
|
## Benchmark Performance |
|
|
|
|
|
### Reasoning Tasks |
|
|
| Benchmark | Score | Notes | |
|
|
|-----------|-------|-------| |
|
|
| HLE (with tools) | 44.9% | vs Human baseline 29.2% | |
|
|
| AIME 2025 (with Python) | 99.1% | 75.2% without tools | |
|
|
| HMMT 2025 (with Python) | 95.1% | 70.4% without tools | |
|
|
| IMO-AnswerBench | 78.6% | Mathematical olympiad | |
|
|
| GPQA-Diamond | 84.5% | Expert-level questions | |
|
|
|
|
|
### Agentic Search |
|
|
| Benchmark | Score | Notes | |
|
|
|-----------|-------|-------| |
|
|
| BrowseComp | 60.2% | vs Human 29.2% | |
|
|
| BrowseComp-ZH | 62.3% | Chinese browsing | |
|
|
| Seal-0 | 56.3% | Real-world info | |
|
|
| FinSearchComp-T3 | 47.4% | Financial search | |
|
|
| Frames | 87.0% | Multi-step search | |
|
|
|
|
|
### Coding |
|
|
| Benchmark | Score | Notes | |
|
|
|-----------|-------|-------| |
|
|
| SWE-Bench Verified | 71.3% | Software engineering | |
|
|
| SWE-Multilingual | 61.1% | Multi-language coding | |
|
|
| Multi-SWE-Bench | 41.9% | Multiple repositories | |
|
|
| LiveCodeBench v6 | 83.1% | Competitive programming | |
|
|
| Terminal-Bench | 47.1% | Shell automation | |
|
|
|
|
|
### General Capabilities |
|
|
| Benchmark | Score | Notes | |
|
|
|-----------|-------|-------| |
|
|
| MMLU-Pro | 84.6% | Professional knowledge | |
|
|
| MMLU-Redux | 94.4% | General knowledge | |
|
|
| Longform Writing | 73.8% | Creative writing | |
|
|
| HealthBench | 58.0% | Medical knowledge | |
|
|
|
|
|
## Training Approach |
|
|
|
|
|
### Base Architecture |
|
|
- Kimi K2 Thinking foundation |
|
|
- Mixture of Experts (MoE) components |
|
|
- Extended thinking token support |
|
|
- Multi-modal reasoning capabilities |
|
|
|
|
|
### Zen Identity Fine-Tuning |
|
|
1. **Constitutional AI Training**: Hanzo AI principles and values |
|
|
2. **Tool-Calling Specialization**: 200-300 step sequences |
|
|
3. **Thinking Mode Optimization**: Extended reasoning patterns |
|
|
4. **Multi-Agent Workflows**: Coordinated task execution |
|
|
|
|
|
### Optimization |
|
|
- INT4 quantization-aware training |
|
|
- MoE component optimization |
|
|
- Context management strategies |
|
|
- Parallel trajectory aggregation (Heavy Mode) |
|
|
|
|
|
## Usage Examples |
|
|
|
|
|
### 1. Extended Reasoning with Tools |
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer

# A checkpoint this large needs automatic device placement (or an
# offloading strategy) and the model's custom code to load at all.
model = AutoModelForCausalLM.from_pretrained(
    "zenlm/zen-max",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("zenlm/zen-max", trust_remote_code=True)
|
|
|
|
|
# Enable thinking mode with tool access |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": "Research and analyze the latest developments in quantum computing, then write a comprehensive report." |
|
|
} |
|
|
] |
|
|
|
|
|
# Model will: |
|
|
# 1. Think about search strategy |
|
|
# 2. Execute 50+ web searches |
|
|
# 3. Browse relevant pages |
|
|
# 4. Synthesize information |
|
|
# 5. Generate structured report |
|
|
response = model.chat(tokenizer, messages, thinking_budget=128000, max_tool_calls=300) |
|
|
``` |
|
|
|
|
|
### 2. Agentic Coding Workflow |
|
|
```python |
|
|
# Component-heavy web development |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": "Build a fully functional Word clone with React, including document editing, formatting, and export features." |
|
|
} |
|
|
] |
|
|
|
|
|
# Model will: |
|
|
# 1. Plan component architecture |
|
|
# 2. Generate HTML/React code |
|
|
# 3. Implement styling and interactions |
|
|
# 4. Test and debug iteratively |
|
|
# 5. Deliver production-ready application |
|
|
response = model.chat(tokenizer, messages, thinking_budget=96000, enable_tools=True) |
|
|
``` |
|
|
|
|
|
### 3. Mathematical Problem Solving |
|
|
```python |
|
|
# PhD-level mathematics with Python |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": "Solve the hyperbolic space sampling problem involving Lorentz model and Brownian bridge covariance." |
|
|
} |
|
|
] |
|
|
|
|
|
# Model will: |
|
|
# 1. Analyze mathematical structure |
|
|
# 2. Execute Python computations |
|
|
# 3. Derive closed-form solutions |
|
|
# 4. Verify results numerically |
|
|
response = model.chat(tokenizer, messages, thinking_budget=128000, python_enabled=True) |
|
|
``` |
|
|
|
|
|
### 4. Heavy Mode (Parallel Reasoning) |
|
|
```python |
|
|
# 8 parallel trajectories with reflective aggregation |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": "Comprehensive analysis of climate change solutions across economics, technology, and policy." |
|
|
} |
|
|
] |
|
|
|
|
|
response = model.chat( |
|
|
tokenizer, |
|
|
messages, |
|
|
mode="heavy", # 8 parallel rollouts |
|
|
thinking_budget=128000, |
|
|
enable_reflection=True |
|
|
) |
|
|
``` |
|
|
|
|
|
## Configuration |
|
|
|
|
|
### Thinking Budget |
|
|
- **Low**: 32K thinking tokens (fast responses) |
|
|
- **Medium**: 96K thinking tokens (balanced) |
|
|
- **High**: 128K thinking tokens (complex reasoning) |
|
|
- **Heavy Mode**: 8 × 128K parallel trajectories
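
These presets could be wrapped in a small helper. The preset names and the `thinking_budget`/`mode` keyword names are hypothetical, mirroring the usage examples above rather than a confirmed API:

```python
# Hypothetical mapping from the presets above to chat() keyword arguments.
THINKING_PRESETS = {
    "low":    {"thinking_budget": 32_000},   # fast responses
    "medium": {"thinking_budget": 96_000},   # balanced
    "high":   {"thinking_budget": 128_000},  # complex reasoning
    "heavy":  {"thinking_budget": 128_000, "mode": "heavy"},  # 8 parallel rollouts
}

def generation_kwargs(preset: str) -> dict:
    """Return a copy of the kwargs for a named thinking preset."""
    if preset not in THINKING_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return dict(THINKING_PRESETS[preset])
```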
|
|
|
|
|
### Tool Configuration |
|
|
```python |
|
|
tools = { |
|
|
"search": True, # Web search |
|
|
"browser": True, # Page browsing |
|
|
"python": True, # Code execution |
|
|
"bash": True, # Shell commands |
|
|
"file_operations": True, # File I/O |
|
|
} |
|
|
``` |
|
|
|
|
|
### Context Management |
|
|
- **Context Window**: 256K tokens |
|
|
- **Auto-hiding**: Tool outputs hidden when exceeding context |
|
|
- **Smart truncation**: Preserves reasoning chain and key results |
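
The auto-hiding behavior can be pictured with a toy sketch that blanks the oldest tool outputs first while leaving user and assistant turns untouched. The model's real context manager is internal and certainly more nuanced:

```python
PLACEHOLDER = "[tool output hidden]"

def hide_tool_outputs(messages, max_chars, placeholder=PLACEHOLDER):
    """Blank the oldest tool outputs until the transcript fits in max_chars,
    preserving the reasoning turns and the most recent tool results."""
    total = sum(len(m["content"]) for m in messages)
    out = [dict(m) for m in messages]  # don't mutate the caller's transcript
    for m in out:  # oldest messages first
        if total <= max_chars:
            break
        if m["role"] == "tool":
            total -= len(m["content"]) - len(placeholder)
            m["content"] = placeholder
    return out

transcript = [
    {"role": "user",      "content": "q" * 10},
    {"role": "tool",      "content": "x" * 100},  # oldest tool output
    {"role": "assistant", "content": "r" * 10},
    {"role": "tool",      "content": "y" * 100},  # most recent tool output
]
trimmed = hide_tool_outputs(transcript, max_chars=150)
```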
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
### Inference (INT4 from HuggingFace) |
|
|
- **Model Size**: ~370GB (62 safetensors shards, INT4 quantized) |
|
|
- **Minimum**: 247GB combined RAM+VRAM+Disk |
|
|
- **Optimal**: 370GB+ RAM+VRAM for 5+ tokens/s |
|
|
- **Budget Setup**: 1x 24GB GPU + 256GB RAM (~1-2 tokens/s) |
|
|
- **High Performance**: 4x A100 80GB or 8x A100 40GB |
|
|
|
|
|
### Alternative: GGUF Quantizations (Unsloth) |
|
|
- **1.66-bit (UD-TQ1_0)**: 245GB - fits on 247GB combined RAM+VRAM |
|
|
- **2.71-bit (UD-Q2_K_XL)**: 381GB - recommended for accuracy |
|
|
- **4.5-bit (UD-Q4_K_XL)**: 588GB - near full precision |
|
|
|
|
|
### QLoRA Training |
|
|
- **VRAM**: ~500GB total (370GB model + 130GB activations) |
|
|
- **GPUs**: 4x A100 80GB or 8x A100 40GB |
|
|
- **Training Time**: 4-8 hours for 1000 steps |
|
|
- **Output**: LoRA adapters (~100MB) |
|
|
|
|
|
## Format Availability |
|
|
|
|
|
### Current |
|
|
- SafeTensors (BF16, full precision)


- INT4 Quantized (native QAT)
|
|
|
|
|
### Coming Soon |
|
|
- GGUF quantizations (Q4_K_M, Q5_K_M, Q8_0)


- MLX optimized formats (4-bit, 8-bit for Apple Silicon)


- ONNX export for edge deployment
|
|
|
|
|
## Special Features |
|
|
|
|
|
### 1. Thinking Mode |
|
|
- Chain-of-thought reasoning with `<think>` tags |
|
|
- Explicit reasoning traces |
|
|
- Up to 128K thinking tokens per step |
|
|
- Adaptive depth based on problem complexity |
|
|
|
|
|
### 2. Tool-Calling Agent |
|
|
- 200-300 sequential tool invocations |
|
|
- No human intervention required |
|
|
- Dynamic tool selection |
|
|
- Error recovery and retry logic |
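
The loop these bullets describe, a sequential-call budget plus retry on tool failure, can be sketched as follows; `step_fn` and `execute_tool` are hypothetical stand-ins for the model and the tool runtime:

```python
def run_agent(step_fn, execute_tool, max_calls=300, max_retries=2):
    """Drive a think -> tool -> observe loop until the model emits a final
    answer or the sequential-call budget is exhausted."""
    observation = None
    for _ in range(max_calls):
        action = step_fn(observation)  # model decides: tool call or final answer
        if action["type"] == "final":
            return action["answer"]
        for _attempt in range(max_retries + 1):
            try:
                observation = execute_tool(action["tool"], action["args"])
                break
            except Exception as err:  # error recovery: retry, then surface the error
                observation = f"tool error: {err}"
    return None  # budget exhausted without a final answer

# Demo with stubs: the tool fails once, succeeds on retry, then the model answers.
_state = {"failed": False}
def _tool(name, args):
    if not _state["failed"]:
        _state["failed"] = True
        raise RuntimeError("transient")
    return "42"
def _model(obs):
    if obs == "42":
        return {"type": "final", "answer": obs}
    return {"type": "tool", "tool": "search", "args": "q"}

result = run_agent(_model, _tool)
```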
|
|
|
|
|
### 3. Parallel Reasoning (Heavy Mode) |
|
|
- 8 simultaneous reasoning trajectories |
|
|
- Reflective aggregation of outputs |
|
|
- Consensus-based answer selection |
|
|
- 2-3x accuracy improvement on hard problems |
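
Consensus-based selection over the 8 rollouts can be approximated by a plain majority vote. The model's reflective aggregation reasons over the trajectories rather than just counting them, so treat this as a lower bound on the idea:

```python
from collections import Counter

def consensus_answer(trajectories):
    """Return the most common final answer across parallel rollouts."""
    counts = Counter(t["answer"] for t in trajectories)
    return counts.most_common(1)[0][0]

rollouts = [{"answer": a} for a in ["42", "42", "41", "42", "40", "42", "41", "42"]]
best = consensus_answer(rollouts)
```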
|
|
|
|
|
### 4. Multi-Modal Extensions |
|
|
- Vision-language understanding (future) |
|
|
- Audio processing (future) |
|
|
- Code → execution → analysis loops
|
|
|
|
|
## Limitations |
|
|
|
|
|
1. **Thinking Token Overhead**: Extended reasoning increases latency |
|
|
2. **Tool Call Limits**: 300 steps may not suffice for extremely complex tasks |
|
|
3. **Context Management**: Auto-hiding may lose important intermediate results |
|
|
4. **Quantization**: INT4 optimized, but BF16 still preferred for maximum accuracy |
|
|
|
|
|
## Training Data |
|
|
|
|
|
- **Base Training**: Kimi K2 Thinking pre-training corpus |
|
|
- **Zen Fine-Tuning**: |
|
|
- Zoo-Gym framework with RAIS technology |
|
|
- Constitutional AI alignment data |
|
|
- Multi-turn tool-calling trajectories |
|
|
- Agentic workflow demonstrations |
|
|
- **Verification**: Human expert validation on HLE, AIME, coding tasks |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{zenmax2025, |
|
|
title={Zen Max: Reasoning-First Language Model with Test-Time Scaling}, |
|
|
author={Hanzo AI and Zoo Labs Foundation}, |
|
|
year={2025}, |
|
|
url={https://zenlm.org}, |
|
|
note={Based on Moonshot AI Kimi K2 Thinking architecture} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Moonshot AI**: K2 Thinking architecture and training methodology |
|
|
- **Hanzo AI**: Constitutional AI training and Zen identity |
|
|
- **Zoo Labs Foundation**: Open AI research and community governance |
|
|
|
|
|
## Links |
|
|
|
|
|
- **Website**: https://zenlm.org |
|
|
- **HuggingFace**: https://huggingface.co/zenlm/zen-max |
|
|
- **GitHub**: https://github.com/zenlm/zen |
|
|
- **Moonshot AI**: https://www.moonshot.cn/ |
|
|
- **K2 Thinking**: https://platform.moonshot.cn/docs/intro#kimi-k2-thinking |
|
|
|
|
|
--- |
|
|
|
|
|
**Zen AI**: Clarity Through Intelligence |
|
|
*Now with reasoning at test-time* |
|
|
|