Fine-Tuning Guide for Zenith-32B-p300
Complete guide for fine-tuning the 32B model on Tenstorrent p300a hardware.
Table of Contents
- Hardware Setup
- Prerequisites
- Data Preparation
- Training Strategies
- p300 Configuration
- Advanced Features
- Evaluation
- Deployment
- Troubleshooting
Hardware Setup
Tenstorrent p300a
- Chips: 2x p300a
- Cores: 32 RISC-V cores per chip (64 total)
- Memory: 64GB GDDR6 shared
- Interconnect: NoC
- Storage: NVMe SSD recommended
Core Allocation
Default:
- TP=8: 8 cores/chip for tensor parallelism
- PP=4: 4 cores/chip for pipeline parallelism
- Remaining: Data parallelism or idle
The default layout spans all 32 cores per chip (64 total): 12 per chip go to tensor and pipeline parallelism, and the rest handle data parallelism or sit idle.
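As a sanity check, the split above can be expressed as a small helper. This is an illustrative sketch only; it assumes the TP and PP core groups are disjoint, which the layout above implies but does not state.

```python
CORES_PER_CHIP = 32
NUM_CHIPS = 2

def core_allocation(tp: int, pp: int) -> dict:
    # Partition one chip's cores into TP, PP, and leftover (DP or idle) groups.
    if tp + pp > CORES_PER_CHIP:
        raise ValueError("TP + PP exceeds the per-chip core budget")
    return {
        "tensor_parallel": tp,
        "pipeline_parallel": pp,
        "data_parallel_or_idle": CORES_PER_CHIP - (tp + pp),
        "total_cores": CORES_PER_CHIP * NUM_CHIPS,
    }

alloc = core_allocation(tp=8, pp=4)  # the default layout above
```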
Prerequisites
cd Zenith/V1-Tenstorrent-Blackhole-p300/32B
pip install -r requirements.txt
Note: May need Tenstorrent's custom PyTorch build.
Data Preparation
Format
[
  {
    "instruction": "Write a function to find the longest common subsequence",
    "input": "strings: 'ABCDGH', 'AEDFHR'",
    "output": "def lcs(s1, s2):\n m, n = len(s1), len(s2)\n dp = [[0] * (n+1) for _ in range(m+1)]\n for i in range(1, m+1):\n for j in range(1, n+1):\n if s1[i-1] == s2[j-1]:\n dp[i][j] = dp[i-1][j-1] + 1\n else:\n dp[i][j] = max(dp[i-1][j], dp[i][j-1])\n return dp[m][n]",
    "thoughts": "This is a classic DP problem. Need to build a 2D table where dp[i][j] represents LCS length of s1[:i] and s2[:j].",
    "emotion": "neutral",
    "frustration_level": 0.0,
    "domain": "algorithms"
  }
]
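Before training, it can help to validate records against this schema. A minimal sketch; the set of required fields and the [0, 1] range for frustration_level are read off the example above, not taken from the repo's actual validator.

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output", "thoughts",
                   "emotion", "frustration_level", "domain"}

def validate_record(record: dict) -> list:
    # Return a list of problems found in one training record (empty list = OK).
    problems = ["missing field: " + f for f in sorted(REQUIRED_FIELDS - record.keys())]
    level = record.get("frustration_level")
    if not isinstance(level, (int, float)) or not 0.0 <= level <= 1.0:
        problems.append("frustration_level must be a number in [0, 1]")
    return problems

record = json.loads('{"instruction": "i", "input": "", "output": "o", '
                    '"thoughts": "t", "emotion": "neutral", '
                    '"frustration_level": 0.0, "domain": "algorithms"}')
issues = validate_record(record)
```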
Preprocessing
from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig
from configs.zenith_config import get_32b_config
from models.tokenizer import AdvancedTokenizer  # adjust this import path to your checkout
config = get_32b_config()
tokenizer = AdvancedTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
ot_config = OpenThoughtsConfig(
    dataset_name="your-dataset",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer,
)
processor = OpenThoughtsProcessor(ot_config)
dataset = processor.load_dataset()
Training Strategies
1. LoRA (Recommended)
python train.py \
--base_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--train_data ./data/train.json \
--use_lora \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--epochs 3 \
--batch_size 4 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-4 \
--use_ring_attention \
--max_seq_length 32768 \
--tensor_parallel_size 8 \
--pipeline_parallel_size 4 \
--use_noc_optimization \
--mixed_precision bf16
Memory: ~18GB
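The memory savings come from training only the low-rank adapter matrices while the base weights stay frozen. A back-of-the-envelope sketch (the 5120x5120 projection is a hypothetical layer shape, not taken from the model config):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two trainable matrices per adapted layer: A (d_in x r) and B (r x d_out).
    return d_in * r + r * d_out

full_weight = 5120 * 5120                    # frozen base weight (hypothetical shape)
adapter = lora_param_count(5120, 5120, 16)   # --lora_r 16
fraction = adapter / full_weight             # well under 1% of the layer's parameters
```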
2. QLoRA
python train.py \
--use_qlora \
--use_lora \
--lora_r 8 \
--batch_size 4 \
--learning_rate 2e-4 \
...
Memory: ~10GB
3. Full Fine-Tuning
Not recommended unless you have specific needs and sufficient memory.
python train.py \
--batch_size 2 \
--gradient_accumulation_steps 16 \
--learning_rate 5e-6 \
...
Memory: ~58GB (tight against the 64GB budget)
p300 Configuration
Distributed Training
export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2
torchrun --nproc_per_node=2 --nnodes=1 train.py ...
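torchrun passes rendezvous information to each worker through environment variables; a quick sanity check before launching (the defaults mirror the exports above):

```python
import os

def distributed_env() -> dict:
    # Read the rendezvous settings torchrun / torch.distributed rely on.
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

env = distributed_env()
```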
Ring Attention
--use_ring_attention \
--ring_chunk_size 8192 \
--ring_overlap 2048
Essential for 32K context.
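To see what these numbers imply, the chunking can be sketched as below. This assumes --ring_overlap means tokens shared between neighbouring chunks, which is a guess at the flag's semantics:

```python
def ring_chunks(seq_len: int, chunk_size: int, overlap: int) -> list:
    # Split [0, seq_len) into chunks of chunk_size tokens, each sharing
    # `overlap` tokens with the previous chunk.
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, seq_len))
            for start in range(0, seq_len, step)]

chunks = ring_chunks(seq_len=32768, chunk_size=8192, overlap=2048)
```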
Mixed Precision
--mixed_precision bf16
bf16 is natively supported on the p300.
Advanced Features
MoE
--use_moe --num_experts 8 --moe_top_k 2
Increases capacity, use with LoRA.
EQ Adapter
--use_eq_adapter --eq_loss_weight 0.05
For emotional intelligence.
Curriculum
--use_curriculum
Orders training examples from easy to hard.
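Curriculum learning presents examples easiest-first. A toy sketch using reasoning-trace length as a difficulty proxy (the actual difficulty metric in OpenThoughtsProcessor may differ):

```python
def curriculum_order(examples, difficulty):
    # Sort training examples easiest-first.
    return sorted(examples, key=difficulty)

data = [{"thoughts": "a" * n} for n in (300, 50, 120)]
ordered = curriculum_order(data, difficulty=lambda ex: len(ex["thoughts"]))
```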
Quality Filter
--use_quality_filter --min_quality_score 0.6
Automatically drops examples scoring below the threshold.
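The effect of the threshold can be illustrated with a stand-in scorer (a toy heuristic, not the repo's actual quality model):

```python
def filter_by_quality(examples, score_fn, min_score=0.6):
    # Keep only examples whose score clears the threshold (--min_quality_score).
    return [ex for ex in examples if score_fn(ex) >= min_score]

def toy_score(ex):
    # Stand-in heuristic: reward non-empty output and reasoning trace.
    return 0.5 * bool(ex.get("output")) + 0.5 * bool(ex.get("thoughts"))

kept = filter_by_quality(
    [{"output": "code", "thoughts": "plan"}, {"output": "", "thoughts": ""}],
    toy_score,
)
```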
Evaluation
python -m evaluation.benchmark \
--model_path ./outputs/checkpoint-final \
--benchmarks humaneval mbpp gsm8k math
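HumanEval and MBPP results are conventionally reported as pass@k. For reference, the standard unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples is correct, given
    # c correct completions out of n generated (Chen et al., 2021).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

score = pass_at_k(n=10, c=3, k=1)  # 3 of 10 samples passed
```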
Deployment
Ollama
ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Your prompt"
vLLM
python -m vllm.entrypoints.openai.api_server \
--model ./outputs/checkpoint-final \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--port 8000
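Once the server is up, it speaks the OpenAI chat completions protocol at /v1/chat/completions. A sketch of the request body (max_tokens and temperature values are illustrative):

```python
import json

def chat_request(prompt: str, model: str = "./outputs/checkpoint-final") -> dict:
    # Minimal OpenAI-style chat completions payload for the vLLM server.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.7,
    }

body = json.dumps(chat_request("Summarize ring attention in two sentences."))
```

POST this body to http://localhost:8000/v1/chat/completions with Content-Type: application/json.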
Troubleshooting
Out of Memory
- Reduce batch size
- Increase gradient accumulation
- Use QLoRA
- Reduce max_seq_length
- Enable gradient checkpointing
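When trading batch size against gradient accumulation, keep the effective batch constant so the optimizer sees the same number of examples per step:

```python
def effective_batch(batch_size: int, grad_accum: int, dp_degree: int = 1) -> int:
    # Examples contributing to one optimizer step.
    return batch_size * grad_accum * dp_degree

# Halving --batch_size while doubling --gradient_accumulation_steps
# leaves the effective batch unchanged.
assert effective_batch(4, 8) == effective_batch(2, 16) == 32
```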
Slow Training
- Increase batch size
- Reduce gradient accumulation
- Use mixed precision
- Optimize data loading
- Enable NoC optimization (--use_noc_optimization)
Poor Quality
- Use curriculum
- Apply quality filter
- More epochs
- Lower learning rate
- More CoT data
Performance
| Config | Memory | Speed (tokens/s) | Quality |
|---|---|---|---|
| LoRA r=16 | 18GB | 80-120 | 98% |
| QLoRA r=8 | 10GB | 100-150 | 95% |
| Ring attention (32K context) | +20% | 30-50 | Enabled |
Citation
@misc{zenith-32b-p300-2025,
title={Zenith-32B-p300: A Tenstorrent-Optimized Reasoning Model},
year={2025}
}
License
[Specify]
Support
- README.md for quick reference
- FINETUNE_GUIDE.md for detailed instructions
- configs/zenith_config.py for configuration options
- Open issues with logs