
Fine-Tuning Guide for Zenith-32B-p300

Complete guide for fine-tuning the 32B model on Tenstorrent p300a hardware.

Table of Contents

  1. Hardware Setup
  2. Prerequisites
  3. Data Preparation
  4. Training Strategies
  5. p300 Configuration
  6. Advanced Features
  7. Evaluation
  8. Deployment
  9. Troubleshooting

Hardware Setup

Tenstorrent p300a

  • Chips: 2x p300a
  • Cores: 32 RISC-V cores per chip (64 total)
  • Memory: 64GB GDDR6 shared
  • Interconnect: NoC
  • Storage: NVMe SSD recommended

Core Allocation

Default:

  • TP=8: tensor-parallel degree of 8
  • PP=4: pipeline-parallel degree of 4
  • Remaining cores: data parallelism or idle

Together, TP × PP = 32 engages 32 cores per chip, 64 in total.

Prerequisites

cd Zenith/V1-Tenstorrent-Blackhole-p300/32B
pip install -r requirements.txt

Note: You may need Tenstorrent's custom PyTorch build.

Data Preparation

Format

[
  {
    "instruction": "Write a function to find the longest common subsequence",
    "input": "strings: 'ABCDGH', 'AEDFHR'",
    "output": "def lcs(s1, s2):\n    m, n = len(s1), len(s2)\n    dp = [[0] * (n+1) for _ in range(m+1)]\n    for i in range(1, m+1):\n        for j in range(1, n+1):\n            if s1[i-1] == s2[j-1]:\n                dp[i][j] = dp[i-1][j-1] + 1\n            else:\n                dp[i][j] = max(dp[i-1][j], dp[i][j-1])\n    return dp[m][n]",
    "thoughts": "This is a classic DP problem. Need to build a 2D table where dp[i][j] represents LCS length of s1[:i] and s2[:j].",
    "emotion": "neutral",
    "frustration_level": 0.0,
    "domain": "algorithms"
  }
]
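
Before training, it is worth sanity-checking records against this schema. A minimal sketch (the field names come from the example above; the validator itself is not part of the repo):

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}
OPTIONAL_FIELDS = {"thoughts", "emotion", "frustration_level", "domain"}

def validate_record(record: dict) -> list:
    """Return a list of problems found in one training record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    unknown = record.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    level = record.get("frustration_level", 0.0)
    if not (0.0 <= float(level) <= 1.0):
        problems.append(f"frustration_level out of range: {level}")
    return problems

def validate_file(path: str) -> dict:
    """Map record index -> problems for every bad record in a JSON file."""
    with open(path) as f:
        data = json.load(f)
    return {i: p for i, r in enumerate(data) if (p := validate_record(r))}
```

Run `validate_file("./data/train.json")` and fix anything it reports before launching a multi-hour job.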

Preprocessing

from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig
from configs.zenith_config import get_32b_config
# AdvancedTokenizer is provided by this repo; adjust the import path to
# match your checkout.
from tokenization import AdvancedTokenizer

config = get_32b_config()
tokenizer = AdvancedTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

ot_config = OpenThoughtsConfig(
    dataset_name="your-dataset",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer
)

processor = OpenThoughtsProcessor(ot_config)
dataset = processor.load_dataset()

Training Strategies

1. LoRA (Recommended)

python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.1 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --use_ring_attention \
  --max_seq_length 32768 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --mixed_precision bf16

Memory: ~18GB
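
The low memory footprint follows from how few parameters LoRA actually trains. A back-of-the-envelope sketch, assuming the Qwen2.5-32B shape (hidden size 5120, 64 layers) and LoRA on the four attention projections, treated as square for simplicity:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the frozen d_in x d_out weight's update as
    # A (d_in x r) times B (r x d_out), so only r * (d_in + d_out)
    # parameters are trained per projection.
    return r * (d_in + d_out)

# Assumed shapes for illustration; GQA makes the real k/v projections
# smaller, so this slightly overestimates.
hidden, layers, r = 5120, 64, 16
per_layer = 4 * lora_param_count(hidden, hidden, r)
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable params")  # prints 41.9M trainable params
```

Roughly 42M trainable parameters against a frozen 32B base, which is why the optimizer state stays tiny.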

2. QLoRA

python train.py \
  --use_qlora \
  --use_lora \
  --lora_r 8 \
  --batch_size 4 \
  --learning_rate 2e-4 \
  ...

Memory: ~10GB

3. Full Fine-Tuning

Not recommended unless you have specific needs and sufficient memory.

python train.py \
  --batch_size 2 \
  --gradient_accumulation_steps 16 \
  --learning_rate 5e-6 \
  ...

Memory: ~58GB (tight against the 64GB pool)

p300 Configuration

Distributed Training

export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2

torchrun --nproc_per_node=2 --nnodes=1 train.py ...

Ring Attention

--use_ring_attention \
--ring_chunk_size 8192 \
--ring_overlap 2048

Essential for 32K context.
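
A sketch of the chunking geometry these flags imply: the 32K sequence is split into chunk_size spans, each extended by overlap tokens of left context. This illustrates only the splitting, not the device-ring communication the real implementation performs:

```python
def ring_chunks(seq_len: int, chunk_size: int, overlap: int) -> list:
    """Yield (start, end) spans covering the sequence, each extended
    leftward by `overlap` tokens so attention near a chunk border
    still sees its neighbour's context."""
    spans = []
    start = 0
    while start < seq_len:
        end = min(start + chunk_size, seq_len)
        spans.append((max(0, start - overlap), end))
        start = end
    return spans

spans = ring_chunks(32768, 8192, 2048)
# 4 chunks; every chunk after the first carries 2048 overlap tokens
```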

Mixed Precision

--mixed_precision bf16

The p300 supports bf16 natively.

Advanced Features

MoE

--use_moe --num_experts 8 --moe_top_k 2

Increases model capacity; pair with LoRA to keep memory in check.
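
With num_experts 8 and moe_top_k 2, each token is routed to the two highest-scoring experts and their gate weights are renormalized. A minimal pure-Python sketch of standard top-k gating (not the repo's implementation):

```python
import math

def top_k_route(logits: list, k: int = 2) -> list:
    """Pick the k highest-scoring experts and renormalize their softmax
    weights so the selected weights sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts, top-2, as in the flags above
route = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```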

EQ Adapter

--use_eq_adapter --eq_loss_weight 0.05

Enables the emotional-intelligence adapter, which draws on the emotion and frustration_level fields in the training data.
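
The eq_loss_weight flag suggests the adapter's auxiliary loss is mixed into the main objective with a small coefficient. A sketch of that weighting (an assumption about the loss structure, not code from the repo):

```python
def combined_loss(task_loss: float, eq_loss: float, eq_weight: float = 0.05) -> float:
    # The auxiliary EQ loss is added with a small weight so it nudges
    # the model without competing with the main language-modeling loss.
    return task_loss + eq_weight * eq_loss
```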

Curriculum

--use_curriculum

Enables curriculum learning: examples are presented from easy to hard.
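
The core idea can be sketched as sorting records by a difficulty proxy; here difficulty is approximated by output and reasoning-trace length, while the real processor may use richer signals:

```python
def curriculum_order(records: list) -> list:
    """Order training records from easy to hard, using combined output
    and reasoning-trace length as a crude difficulty proxy."""
    def difficulty(r):
        return len(r.get("output", "")) + len(r.get("thoughts", ""))
    return sorted(records, key=difficulty)
```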

Quality Filter

--use_quality_filter --min_quality_score 0.6

Automatically drops examples whose quality score falls below the threshold.
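
The mechanism amounts to scoring each record and keeping those at or above min_quality_score. A toy sketch (the heuristic below is a stand-in for the repo's real scorer):

```python
def quality_score(record: dict) -> float:
    """Toy heuristic score in [0, 1]: rewards a non-empty output,
    a reasoning trace, and a non-empty instruction."""
    score = 0.0
    if record.get("output", "").strip():
        score += 0.5
    if record.get("thoughts", "").strip():
        score += 0.3
    if record.get("instruction", "").strip():
        score += 0.2
    return score

def filter_by_quality(records: list, min_score: float = 0.6) -> list:
    return [r for r in records if quality_score(r) >= min_score]
```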

Evaluation

python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math

Deployment

Ollama

ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Your prompt"
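
The Modelfile referenced above might look like this minimal sketch; the GGUF path and parameter values are placeholders, not files shipped with the repo:

```
FROM ./outputs/checkpoint-final.gguf
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
SYSTEM "You are Zenith, a reasoning assistant."
```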

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
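
The server exposes an OpenAI-compatible API, so any HTTP client can talk to it. A sketch of the request body for the /v1/chat/completions endpoint (max_tokens and temperature values are illustrative):

```python
import json

def build_chat_request(prompt: str, model: str = "./outputs/checkpoint-final") -> str:
    """Serialize an OpenAI-compatible chat-completion request body for
    the vLLM server started above (POST it to localhost:8000)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.7,
    }
    return json.dumps(payload)
```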

Troubleshooting

Out of Memory

  1. Reduce batch size
  2. Increase gradient accumulation
  3. Use QLoRA
  4. Reduce max_seq_length
  5. Enable gradient checkpointing

Slow Training

  1. Increase batch size
  2. Reduce gradient accumulation
  3. Use mixed precision
  4. Optimize data loading
  5. Enable NoC

Poor Quality

  1. Use curriculum
  2. Apply quality filter
  3. More epochs
  4. Lower learning rate
  5. More CoT data

Performance

Config      Memory   Speed     Quality
LoRA r=16   18GB     80-120    98%
QLoRA r=8   10GB     100-150   95%
Ring 32K    +20%     30-50     Enabled

Citation

@misc{zenith-32b-p300-2025,
  title={Zenith-32B-p300: A Tenstorrent-Optimized Reasoning Model},
  year={2025}
}

License

[Specify]

Support

  • README.md for quick reference
  • FINETUNE_GUIDE.md for detailed instructions
  • configs/zenith_config.py for configuration options
  • Open an issue and attach relevant logs