---
library_name: transformers
license: apache-2.0
tags:
- deepseek
- kimi_k2
- text-generation
- reasoning
- agentic
- tool-calling
- compressed-tensors
pipeline_tag: text-generation
base_model: moonshotai/Kimi-K2-Thinking
---
# Zen Max - Kimi K2 Thinking Architecture
**Organization**: [Zen LM](https://zenlm.org) (Hanzo AI × Zoo Labs Foundation)
**Base Model**: Moonshot AI Kimi K2 Thinking (DeepseekV3ForCausalLM)
**Parameters**: ~1T total (384 experts, 8 activated per token ≈ 32B active)
**License**: Apache 2.0
**Context Window**: 256K tokens
**Thinking Capacity**: 96K-128K thinking tokens per step
**Architecture**: DeepseekV3 MoE (Mixture of Experts)
## Model Overview
Zen Max is a reasoning-first language model built on Moonshot AI's Kimi K2 Thinking architecture, designed for **test-time scaling** through extended thinking and tool-calling capabilities.
Built as a **thinking agent**, Zen Max reasons step by step while using tools, executing **200-300 sequential tool calls** without human intervention and maintaining coherent reasoning across hundreds of steps to solve complex problems.
> **Note**: This repository contains configuration files and documentation for Zen Max. The full model weights (~1TB) are available from the base model: [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking). Zen-specific fine-tuning instructions and adapters will be provided in future releases.
### Key Capabilities
#### 1. Agentic Reasoning (HLE: 44.9%)
- Extended chain-of-thought reasoning with `<think>` tags
- Multi-step planning and execution
- Adaptive reasoning with hypothesis generation and refinement
- Think → search → code → verify → think cycles
#### 2. Agentic Search & Browsing (BrowseComp: 60.2%)
- Goal-directed web-based reasoning
- 200-300 sequential tool calls for information gathering
- Real-world information collection and synthesis
- Dynamic search → browser → reasoning loops
#### 3. Agentic Coding (SWE-Bench Verified: 71.3%)
- Multi-language support (100+ languages)
- Agentic coding workflows with tool integration
- Component-heavy web development (React, HTML)
- Terminal automation (Terminal-Bench: 47.1%)
#### 4. Mathematical Reasoning
- AIME 2025: 99.1% (with Python)
- HMMT 2025: 95.1% (with Python)
- IMO-AnswerBench: 78.6%
- GPQA-Diamond: 84.5%
### Architecture Features
#### Test-Time Scaling
- **Thinking Tokens**: 96K-128K per reasoning step
- **Extended Context**: 256K tokens
- **Sequential Tool Calls**: 200-300 without human intervention
- **Parallel Rollouts**: Heavy mode with 8 simultaneous trajectories
#### INT4 Quantization-Aware Training
- Native INT4 inference support
- 2x generation speed improvement
- State-of-the-art performance at INT4 precision
- Optimized for low-bit quantization during post-training
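The core idea of INT4 weight-only quantization is to keep one floating-point scale per small group of weights and store each weight as a 4-bit integer. A minimal pure-Python sketch of symmetric group quantization (illustrative only; the actual QAT kernels are far more sophisticated and operate during post-training):

```python
import random

def quantize_int4(weights, group_size=32):
    """Symmetric per-group INT4 quantization: reals mapped to integers in [-8, 7]."""
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # one scale per group
        scales.append(scale)
        q.append([max(-8, min(7, round(w / scale))) for w in group])
    return q, scales

def dequantize_int4(q, scales):
    """Recover approximate weights: value * group scale."""
    return [v * s for group, s in zip(q, scales) for v in group]

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(4096)]
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The per-group scale bounds the reconstruction error to half a quantization step, which is why QAT (training with quantization in the loop) can keep degradation minimal.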
#### Inference Efficiency
- Quantization-aware training (QAT) for MoE components
- INT4 weight-only quantization
- ~50% latency reduction
- Minimal performance degradation
## Benchmark Performance
### Reasoning Tasks
| Benchmark | Score | Notes |
|-----------|-------|-------|
| HLE (with tools) | 44.9% | vs Human baseline 29.2% |
| AIME 2025 (with Python) | 99.1% | 75.2% without tools |
| HMMT 2025 (with Python) | 95.1% | 70.4% without tools |
| IMO-AnswerBench | 78.6% | Mathematical olympiad |
| GPQA-Diamond | 84.5% | Expert-level questions |
### Agentic Search
| Benchmark | Score | Notes |
|-----------|-------|-------|
| BrowseComp | 60.2% | vs Human 29.2% |
| BrowseComp-ZH | 62.3% | Chinese browsing |
| Seal-0 | 56.3% | Real-world info |
| FinSearchComp-T3 | 47.4% | Financial search |
| Frames | 87.0% | Multi-step search |
### Coding
| Benchmark | Score | Notes |
|-----------|-------|-------|
| SWE-Bench Verified | 71.3% | Software engineering |
| SWE-Multilingual | 61.1% | Multi-language coding |
| Multi-SWE-Bench | 41.9% | Multiple repositories |
| LiveCodeBench v6 | 83.1% | Competitive programming |
| Terminal-Bench | 47.1% | Shell automation |
### General Capabilities
| Benchmark | Score | Notes |
|-----------|-------|-------|
| MMLU-Pro | 84.6% | Professional knowledge |
| MMLU-Redux | 94.4% | General knowledge |
| Longform Writing | 73.8% | Creative writing |
| HealthBench | 58.0% | Medical knowledge |
## Training Approach
### Base Architecture
- Kimi K2 Thinking foundation
- Mixture of Experts (MoE) components
- Extended thinking token support
- Multi-modal reasoning capabilities
### Zen Identity Fine-Tuning
1. **Constitutional AI Training**: Hanzo AI principles and values
2. **Tool-Calling Specialization**: 200-300 step sequences
3. **Thinking Mode Optimization**: Extended reasoning patterns
4. **Multi-Agent Workflows**: Coordinated task execution
### Optimization
- INT4 quantization-aware training
- MoE component optimization
- Context management strategies
- Parallel trajectory aggregation (Heavy Mode)
## Usage Examples
### 1. Extended Reasoning with Tools
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading a checkpoint this large requires sharding across devices;
# device_map="auto" spreads weights over all available GPUs/CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "zenlm/zen-max",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("zenlm/zen-max", trust_remote_code=True)
# Enable thinking mode with tool access
messages = [
{
"role": "user",
"content": "Research and analyze the latest developments in quantum computing, then write a comprehensive report."
}
]
# Model will:
# 1. Think about search strategy
# 2. Execute 50+ web searches
# 3. Browse relevant pages
# 4. Synthesize information
# 5. Generate structured report
# NOTE: chat() and the thinking/tool keyword arguments are model-specific
# helpers exposed by the checkpoint's remote code, not core transformers API.
response = model.chat(tokenizer, messages, thinking_budget=128000, max_tool_calls=300)
```
### 2. Agentic Coding Workflow
```python
# Component-heavy web development
messages = [
{
"role": "user",
"content": "Build a fully functional Word clone with React, including document editing, formatting, and export features."
}
]
# Model will:
# 1. Plan component architecture
# 2. Generate HTML/React code
# 3. Implement styling and interactions
# 4. Test and debug iteratively
# 5. Deliver production-ready application
response = model.chat(tokenizer, messages, thinking_budget=96000, enable_tools=True)
```
### 3. Mathematical Problem Solving
```python
# PhD-level mathematics with Python
messages = [
{
"role": "user",
"content": "Solve the hyperbolic space sampling problem involving Lorentz model and Brownian bridge covariance."
}
]
# Model will:
# 1. Analyze mathematical structure
# 2. Execute Python computations
# 3. Derive closed-form solutions
# 4. Verify results numerically
response = model.chat(tokenizer, messages, thinking_budget=128000, python_enabled=True)
```
### 4. Heavy Mode (Parallel Reasoning)
```python
# 8 parallel trajectories with reflective aggregation
messages = [
{
"role": "user",
"content": "Comprehensive analysis of climate change solutions across economics, technology, and policy."
}
]
response = model.chat(
tokenizer,
messages,
mode="heavy", # 8 parallel rollouts
thinking_budget=128000,
enable_reflection=True
)
```
## Configuration
### Thinking Budget
- **Low**: 32K thinking tokens (fast responses)
- **Medium**: 96K thinking tokens (balanced)
- **High**: 128K thinking tokens (complex reasoning)
- **Heavy Mode**: 8 Γ— 128K parallel trajectories
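The presets above can be wired up with a small helper that maps a named budget to the `thinking_budget` argument used in the usage examples (parameter names are illustrative, matching those examples rather than a published API):

```python
# Hypothetical preset table for the thinking_budget parameter shown above.
THINKING_PRESETS = {
    "low": 32_000,     # fast responses
    "medium": 96_000,  # balanced
    "high": 128_000,   # complex reasoning
}

def chat_kwargs(preset: str, heavy: bool = False) -> dict:
    """Map a named preset to chat() keyword arguments (illustrative only)."""
    kwargs = {"thinking_budget": THINKING_PRESETS[preset]}
    if heavy:
        kwargs["mode"] = "heavy"  # 8 parallel rollouts, each with the full budget
    return kwargs
```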
### Tool Configuration
```python
tools = {
"search": True, # Web search
"browser": True, # Page browsing
"python": True, # Code execution
"bash": True, # Shell commands
"file_operations": True, # File I/O
}
```
### Context Management
- **Context Window**: 256K tokens
- **Auto-hiding**: Tool outputs hidden when exceeding context
- **Smart truncation**: Preserves reasoning chain and key results
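The auto-hiding policy can be sketched as: replace the oldest tool outputs with a stub until the history fits, leaving user turns and assistant reasoning untouched. This is a hypothetical stand-in (the production context manager and its token accounting are not published):

```python
def hide_tool_outputs(messages, max_tokens,
                      count_tokens=lambda m: len(m["content"].split())):
    """Stub out the oldest tool outputs until the history fits the budget.

    Illustrative only: real token counting would use the tokenizer,
    and the production truncation policy may differ.
    """
    messages = [dict(m) for m in messages]  # don't mutate the caller's list
    total = sum(count_tokens(m) for m in messages)
    for m in messages:
        if total <= max_tokens:
            break
        if m["role"] == "tool":  # oldest tool outputs are hidden first
            total -= count_tokens(m)
            m["content"] = "[tool output hidden]"
            total += count_tokens(m)
    return messages
```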
## Hardware Requirements
### Inference (INT4 from HuggingFace)
- **Model Size**: ~370GB (62 safetensors shards, INT4 quantized)
- **Minimum**: 247GB combined RAM+VRAM+Disk
- **Optimal**: 370GB+ RAM+VRAM for 5+ tokens/s
- **Budget Setup**: 1x 24GB GPU + 256GB RAM (~1-2 tokens/s)
- **High Performance**: 4x A100 80GB or 8x A100 40GB
### Alternative: GGUF Quantizations (Unsloth)
- **1.66-bit (UD-TQ1_0)**: 245GB - fits on 247GB combined RAM+VRAM
- **2.71-bit (UD-Q2_K_XL)**: 381GB - recommended for accuracy
- **4.5-bit (UD-Q4_K_XL)**: 588GB - near full precision
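A quick way to sanity-check which variant fits a given machine, using the on-disk sizes from the tables above (rough heuristic: it ignores KV cache, activations, and runtime overhead):

```python
# Approximate on-disk sizes (GB) from the tables above.
VARIANT_SIZE_GB = {
    "int4-safetensors": 370,
    "gguf-ud-tq1_0": 245,    # 1.66-bit
    "gguf-ud-q2_k_xl": 381,  # 2.71-bit
    "gguf-ud-q4_k_xl": 588,  # 4.5-bit
}

def variants_that_fit(ram_gb: float, vram_gb: float, disk_offload_gb: float = 0):
    """Return variants whose weights fit in combined RAM+VRAM (+ disk offload)."""
    budget = ram_gb + vram_gb + disk_offload_gb
    return sorted(name for name, size in VARIANT_SIZE_GB.items() if size <= budget)
```

For example, the budget setup above (1× 24 GB GPU + 256 GB RAM) can hold only the 1.66-bit GGUF without spilling to disk.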
### QLoRA Training
- **VRAM**: ~500GB total (370GB model + 130GB activations)
- **GPUs**: 4x A100 80GB or 8x A100 40GB
- **Training Time**: 4-8 hours for 1000 steps
- **Output**: LoRA adapters (~100MB)
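The ~100 MB adapter figure follows from LoRA storing two low-rank matrices per targeted projection. A back-of-the-envelope sketch (layer count, width, rank, and targeted modules are illustrative assumptions, not the actual zen-max recipe):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters for one LoRA pair: A (d_in x rank) plus B (rank x d_out)."""
    return d_in * rank + rank * d_out

# Assumed numbers: 61 layers, square projections of width 7168, rank 8,
# 4 targeted projections per layer.
hidden, rank, layers, projs_per_layer = 7168, 8, 61, 4
total = layers * projs_per_layer * lora_param_count(hidden, hidden, rank)
size_mb = total * 2 / 1e6  # BF16 storage: 2 bytes per parameter
```

Under these assumptions the adapters come to roughly 56 MB; targeting more modules (e.g. MLP projections) or a higher rank pushes the total toward the ~100 MB figure above.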
## Format Availability
### Current
- ✅ SafeTensors (BF16, full precision)
- ✅ INT4 Quantized (native QAT)
### Coming Soon
- 🔄 GGUF quantizations (Q4_K_M, Q5_K_M, Q8_0)
- 🔄 MLX optimized formats (4-bit, 8-bit for Apple Silicon)
- 🔄 ONNX export for edge deployment
## Special Features
### 1. Thinking Mode
- Chain-of-thought reasoning with `<think>` tags
- Explicit reasoning traces
- Up to 128K thinking tokens per step
- Adaptive depth based on problem complexity
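Separating the reasoning trace from the final answer is a simple parse of the `<think>` tags, assuming they appear verbatim in the decoded text:

```python
import re

def split_thinking(text: str) -> tuple:
    """Separate <think>...</think> traces from the final answer."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return "\n".join(t.strip() for t in thoughts), answer

trace, answer = split_thinking("<think>Check parity first.</think>42 is even.")
```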
### 2. Tool-Calling Agent
- 200-300 sequential tool invocations
- No human intervention required
- Dynamic tool selection
- Error recovery and retry logic
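The error-recovery behaviour can be sketched as a bounded retry loop around each tool invocation (an illustrative harness, not the model's internal logic):

```python
def call_with_retry(tool, args, max_retries=3):
    """Invoke a tool, retrying on failure; surface the last error if all fail."""
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            return {"ok": True, "result": tool(**args), "attempts": attempt}
        except Exception as e:  # error recovery: record the failure and retry
            last_err = e
    return {"ok": False, "error": str(last_err), "attempts": max_retries}
```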
### 3. Parallel Reasoning (Heavy Mode)
- 8 simultaneous reasoning trajectories
- Reflective aggregation of outputs
- Consensus-based answer selection
- 2-3x accuracy improvement on hard problems
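Consensus-based selection over the 8 trajectories can be approximated by a majority vote on the final answers (the actual reflective aggregation step is more involved; this is a minimal stand-in):

```python
from collections import Counter

def consensus_answer(trajectory_answers):
    """Pick the most common final answer and report its agreement ratio."""
    counts = Counter(trajectory_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(trajectory_answers)

ans, agreement = consensus_answer(["42", "42", "41", "42", "42", "43", "42", "42"])
```

A low agreement ratio is a useful signal that the problem deserves a larger thinking budget or another round of rollouts.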
### 4. Multi-Modal Extensions
- Vision-language understanding (future)
- Audio processing (future)
- Code β†’ execution β†’ analysis loops
## Limitations
1. **Thinking Token Overhead**: Extended reasoning increases latency
2. **Tool Call Limits**: 300 steps may not suffice for extremely complex tasks
3. **Context Management**: Auto-hiding may lose important intermediate results
4. **Quantization**: INT4 optimized, but BF16 still preferred for maximum accuracy
## Training Data
- **Base Training**: Kimi K2 Thinking pre-training corpus
- **Zen Fine-Tuning**:
- Zoo-Gym framework with RAIS technology
- Constitutional AI alignment data
- Multi-turn tool-calling trajectories
- Agentic workflow demonstrations
- **Verification**: Human expert validation on HLE, AIME, coding tasks
## Citation
```bibtex
@misc{zenmax2025,
title={Zen Max: Reasoning-First Language Model with Test-Time Scaling},
author={Hanzo AI and Zoo Labs Foundation},
year={2025},
url={https://zenlm.org},
note={Based on Moonshot AI Kimi K2 Thinking architecture}
}
```
## Acknowledgments
- **Moonshot AI**: K2 Thinking architecture and training methodology
- **Hanzo AI**: Constitutional AI training and Zen identity
- **Zoo Labs Foundation**: Open AI research and community governance
## Links
- **Website**: https://zenlm.org
- **HuggingFace**: https://huggingface.co/zenlm/zen-max
- **GitHub**: https://github.com/zenlm/zen
- **Moonshot AI**: https://www.moonshot.cn/
- **K2 Thinking**: https://platform.moonshot.cn/docs/intro#kimi-k2-thinking
---
**Zen AI**: Clarity Through Intelligence
*Now with reasoning at test-time*