---
language:
- en
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
tags:
- zenith
- tenstorrent
- reasoning
- math
- moe
- ring-attention
- eq-adapter
- deepseek-r1
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: zenith
hardware:
- tenstorrent-blackhole-p300a
---
# Zenith-32B-p300 (V1-Tenstorrent-Blackhole-p300)

A 32B-parameter model based on DeepSeek-R1-Distill-Qwen-32B, optimized for Tenstorrent Blackhole p300a hardware.
## Features
- 32B Parameters: Based on DeepSeek-R1-Distill-Qwen-32B
- p300a Optimized: Specifically tuned for Tenstorrent p300a hardware
- Ring Attention: 32K context window with efficient chunked attention
- MoE Support: Mixture of Experts for sparse activation
- EQ Adapter: Emotional intelligence capabilities
- Reasoning & Code: Strong performance on reasoning and coding tasks
- Tensor/Pipeline Parallelism: Optimized for distributed training
- NoC Optimization: Efficient chip-to-chip communication
- Ollama Compatible: Ready for deployment
## Hardware Requirements

### Training

- Tenstorrent p300a: 2 chips (64 RISC-V cores)
- Memory: 64 GB GDDR6
- Storage: 2 TB+ NVMe SSD

### Inference

- p300a: full 32K context supported
- Standard GPU: 64 GB+ VRAM (e.g., A100 80GB, H100 80GB)
- Consumer GPUs: use QLoRA or reduce the context length
## Quick Start

### Installation

```bash
cd Zenith/V1-Tenstorrent-Blackhole-p300/32B
pip install -r requirements.txt
```
### Training

```bash
# LoRA fine-tuning (recommended)
python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --use_ring_attention \
  --max_seq_length 32768 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --mixed_precision bf16
```
### Inference

```bash
# Interactive mode
python inference.py --checkpoint ./outputs/checkpoint-final

# Single prompt
python inference.py \
  --checkpoint ./outputs/checkpoint-final \
  --prompt "Write a Python function to implement quicksort" \
  --max_new_tokens 1024
```
### Ollama

```bash
ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Explain the difference between supervised and unsupervised learning"
```
## Architecture

### Model Configuration

```python
from configs.zenith_config import get_32b_config

config = get_32b_config()
```

Key parameters:

- `hidden_size`: 4096
- `num_layers`: 40
- `num_heads`: 32
- `num_experts`: 8 (configurable)
- `moe_top_k`: 2
- `max_seq_len`: 32768
- `use_ring_attention`: True
- `ring_attention_chunk_size`: 8192
- `ring_attention_overlap`: 2048
### p300 Optimizations
- Tensor Parallelism (TP=8): Across 8 cores per chip
- Pipeline Parallelism (PP=4): 4 stages per chip
- NoC Optimization: Efficient inter-core communication
- Ring Attention: 32K context without OOM
- Mixed Precision: BF16 native support
## Data Processing

### OpenThoughts Integration

```python
from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

ot_config = OpenThoughtsConfig(
    dataset_name="open-thoughts/OpenThoughts3-1.2M",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer,
)
processor = OpenThoughtsProcessor(ot_config)
```
### Curriculum Stages
- Foundation: High-quality samples (score > 0.8)
- Reasoning: Chain-of-thought examples
- Code: Programming tasks
- Full: Complete dataset
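The four stages above amount to progressively looser sample filters. A minimal sketch of the routing logic — the field names (`quality_score`, `has_cot`, `is_code`) are illustrative, not the processor's actual schema:

```python
def stage_filter(stage, sample):
    """Decide whether a sample belongs to a curriculum stage (illustrative fields)."""
    if stage == "foundation":
        return sample["quality_score"] > 0.8   # high-quality samples only
    if stage == "reasoning":
        return sample["has_cot"]               # chain-of-thought examples
    if stage == "code":
        return sample["is_code"]               # programming tasks
    return True                                # "full": complete dataset

sample = {"quality_score": 0.9, "has_cot": True, "is_code": False}
selected = [s for s in ("foundation", "reasoning", "code", "full")
            if stage_filter(s, sample)]
```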
### Quality Filtering

Samples are scored along multiple dimensions:
- Length: 512-32000 tokens
- Language: English
- Repetition: < 15%
- Coherence: > 0.7
- Structure: Valid formatting
- Thought quality: CoT depth > 3 steps
## Advanced Features

### MoE

```bash
--use_moe --num_experts 8 --moe_top_k 2
```
- Top-2 routing
- Load balancing loss
- MoE applied to 60% of the middle layers
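Top-2 routing with a load-balancing auxiliary loss can be sketched as follows; this is a NumPy illustration of the general technique (Switch-Transformer-style balancing), not the model's actual router code:

```python
import numpy as np

def top2_route(logits, k=2):
    """Select the top-k experts per token and softmax-normalize their gates."""
    idx = np.argsort(logits, axis=-1)[:, -k:]          # (tokens, k) expert ids
    raw = np.take_along_axis(logits, idx, axis=-1)     # their router scores
    g = np.exp(raw - raw.max(axis=-1, keepdims=True))
    return idx, g / g.sum(axis=-1, keepdims=True)

def load_balancing_loss(logits):
    """Auxiliary loss that pushes tokens to spread evenly across experts."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    n = logits.shape[-1]
    # fraction of tokens whose top-1 pick is each expert x mean router probability
    frac = np.bincount(np.argmax(logits, axis=-1), minlength=n) / len(logits)
    return n * float((frac * probs.mean(axis=0)).sum())

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(16, 8))   # 16 tokens, 8 experts
expert_ids, gate_weights = top2_route(router_logits)
aux_loss = load_balancing_loss(router_logits)
```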
### EQ Adapter

```bash
--use_eq_adapter --eq_loss_weight 0.05
```
- Frustration detection
- 8-emotion classification
- Fused with attention
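One common way to fuse such an adapter with the attention stream is a residual bottleneck plus a pooled classification head. The sketch below illustrates that pattern only; the weight names and shapes are assumptions, not the trained adapter's layout:

```python
import numpy as np

def eq_adapter(hidden, W_down, W_up, W_emo):
    """Bottleneck adapter on the attention output plus an 8-way emotion head."""
    z = np.maximum(hidden @ W_down, 0.0)        # down-project + ReLU bottleneck
    fused = hidden + z @ W_up                   # residual fusion with attention stream
    logits = fused.mean(axis=0) @ W_emo         # sequence-pooled 8-emotion logits
    probs = np.exp(logits - logits.max())
    return fused, probs / probs.sum()

rng = np.random.default_rng(2)
h = rng.normal(size=(10, 64))                   # 10 tokens, hidden dim 64
fused, emotion_probs = eq_adapter(
    h,
    rng.normal(size=(64, 16)),                  # down-projection
    rng.normal(size=(16, 64)) * 0.01,           # up-projection (small init)
    rng.normal(size=(64, 8)),                   # 8-emotion classifier
)
```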
### Ring Attention

```bash
--use_ring_attention --ring_chunk_size 8192 --ring_overlap 2048
```
- Enables 32K context
- Memory: O(seq_len × chunk_size)
- Chunked processing
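The memory bound comes from materializing only one query chunk's score matrix at a time. A minimal single-device sketch of chunked causal attention (ring attention additionally streams K/V chunks between devices, which this omits):

```python
import numpy as np

def chunked_causal_attention(q, k, v, chunk=4):
    """Causal attention computed one query chunk at a time.

    Peak score-matrix memory is O(chunk * seq_len) rather than O(seq_len^2).
    """
    seq, d = q.shape
    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(d)
    for start in range(0, seq, chunk):
        end = min(start + chunk, seq)
        s = q[start:end] @ k.T * scale                       # (chunk, seq) scores
        # causal mask: query i may only attend to keys <= i
        mask = np.arange(seq)[None, :] > np.arange(start, end)[:, None]
        s = np.where(mask, -np.inf, s)
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        out[start:end] = (s / s.sum(axis=-1, keepdims=True)) @ v
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
ref = chunked_causal_attention(q, k, v, chunk=16)   # single chunk = unchunked
```

Because softmax is computed per query row, chunking over queries is exact: any chunk size reproduces the unchunked result.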
## Testing

```bash
python test_model.py
```
Tests cover:
- Model creation
- Forward pass
- p300 optimizations
- MoE configuration
- Ring attention
- EQ adapter
- Generation
- Gradient flow
## Evaluation

```bash
python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa
```
## Deployment

### Ollama

```bash
ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Your prompt here"
```

### vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
```
## Troubleshooting

### Memory Issues
- Reduce batch size
- Use gradient accumulation
- Enable LoRA/QLoRA
- Reduce sequence length
- Enable gradient checkpointing
### Slow Training
- Increase batch size
- Reduce gradient accumulation
- Use mixed precision
- Optimize data loading
- Enable NoC optimization
### Poor Quality
- Use curriculum learning
- Apply quality filtering
- Train more epochs
- Adjust learning rate
- Use more high-quality data
## Performance

| Configuration | Memory | Speed (tokens/s) | Quality |
|---|---|---|---|
| Full fine-tune, 2K context | ~58 GB | 50-80 | baseline |
| LoRA r=16, 2K context | ~18 GB | 80-120 | ~98% of baseline |
| QLoRA r=8, 2K context | ~10 GB | 100-150 | ~95% of baseline |
| Ring attention, 32K context | +20% memory | 30-50 | enables long context |
## Citation

```bibtex
@misc{zenith-32b-p300-2025,
  title={Zenith-32B-p300: A Tenstorrent-Optimized Reasoning Model},
  author={Zenith Project},
  year={2025}
}
```
## License

MIT (see the `license` field in the model card metadata above).
## Support

- Documentation: `README.md`
- Fine-tuning: `FINETUNE_GUIDE.md`
- Config: `configs/zenith_config.py`
- Issues: open an issue with detailed logs