
Fine-Tuning Guide for Zenith-32B-p300

Complete guide for fine-tuning the 32B model on Tenstorrent p300a hardware.

Table of Contents

  1. Hardware Setup
  2. Prerequisites
  3. Data Preparation
  4. Training Strategies
  5. p300 Configuration
  6. Advanced Features
  7. Evaluation
  8. Deployment
  9. Troubleshooting

Hardware Setup

Tenstorrent p300a

  • Chips: 2x p300a
  • Cores: 32 RISC-V cores per chip (64 total)
  • Memory: 64GB GDDR6 shared
  • Interconnect: NoC
  • Storage: NVMe SSD recommended

Core Allocation

Default:

  • TP=8: tensor-parallel degree of 8
  • PP=4: pipeline-parallel degree of 4
  • Remaining cores: data parallelism or idle

Together, TP × PP = 32 engages 32 cores per chip, 64 in total.

Prerequisites

cd Zenith/V1-Tenstorrent-Blackhole-p300/32B
pip install -r requirements.txt

Note: You may need Tenstorrent's custom PyTorch build.

Data Preparation

Format

[
  {
    "instruction": "Write a function to find the longest common subsequence",
    "input": "strings: 'ABCDGH', 'AEDFHR'",
    "output": "def lcs(s1, s2):\n    m, n = len(s1), len(s2)\n    dp = [[0] * (n+1) for _ in range(m+1)]\n    for i in range(1, m+1):\n        for j in range(1, n+1):\n            if s1[i-1] == s2[j-1]:\n                dp[i][j] = dp[i-1][j-1] + 1\n            else:\n                dp[i][j] = max(dp[i-1][j], dp[i][j-1])\n    return dp[m][n]",
    "thoughts": "This is a classic DP problem. Need to build a 2D table where dp[i][j] represents LCS length of s1[:i] and s2[:j].",
    "emotion": "neutral",
    "frustration_level": 0.0,
    "domain": "algorithms"
  }
]
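
Before training, it is worth sanity-checking records against this schema. A minimal sketch (the field names come from the example above; the validator itself is not part of the repo):

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}
OPTIONAL_FIELDS = {"thoughts", "emotion", "frustration_level", "domain"}

def validate_record(record: dict) -> list:
    """Return a list of problems found in one training record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    unknown = record.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    level = record.get("frustration_level", 0.0)
    if not (0.0 <= float(level) <= 1.0):
        problems.append(f"frustration_level out of range: {level}")
    return problems

def validate_file(path: str) -> dict:
    """Map record index -> problems for every bad record in a JSON file."""
    with open(path) as f:
        data = json.load(f)
    return {i: p for i, r in enumerate(data) if (p := validate_record(r))}
```

Run `validate_file("./data/train.json")` and fix anything it reports before launching a multi-hour job.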

Preprocessing

from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig
from configs.zenith_config import get_32b_config
# AdvancedTokenizer is provided by this repo; adjust the import path to
# match your checkout.
from tokenization import AdvancedTokenizer

config = get_32b_config()
tokenizer = AdvancedTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

ot_config = OpenThoughtsConfig(
    dataset_name="your-dataset",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer
)

processor = OpenThoughtsProcessor(ot_config)
dataset = processor.load_dataset()

Training Strategies

1. LoRA (Recommended)

python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.1 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --use_ring_attention \
  --max_seq_length 32768 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --mixed_precision bf16

Memory: ~18GB
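
The low memory footprint follows from how few parameters LoRA actually trains. A back-of-the-envelope sketch, assuming the Qwen2.5-32B shape (hidden size 5120, 64 layers) and LoRA on the four attention projections, treated as square for simplicity:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the frozen d_in x d_out weight's update as
    # A (d_in x r) times B (r x d_out), so only r * (d_in + d_out)
    # parameters are trained per projection.
    return r * (d_in + d_out)

# Assumed shapes for illustration; GQA makes the real k/v projections
# smaller, so this slightly overestimates.
hidden, layers, r = 5120, 64, 16
per_layer = 4 * lora_param_count(hidden, hidden, r)
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable params")  # prints 41.9M trainable params
```

Roughly 42M trainable parameters against a frozen 32B base, which is why the optimizer state stays tiny.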

2. QLoRA

python train.py \
  --use_qlora \
  --use_lora \
  --lora_r 8 \
  --batch_size 4 \
  --learning_rate 2e-4 \
  ...

Memory: ~10GB

3. Full Fine-Tuning

Not recommended unless you have specific needs and sufficient memory.

python train.py \
  --batch_size 2 \
  --gradient_accumulation_steps 16 \
  --learning_rate 5e-6 \
  ...

Memory: ~58GB (tight against the 64GB pool)

p300 Configuration

Distributed Training

export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2

torchrun --nproc_per_node=2 --nnodes=1 train.py ...

Ring Attention

--use_ring_attention \
--ring_chunk_size 8192 \
--ring_overlap 2048

Essential for 32K context.
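
A sketch of the chunking geometry these flags imply: the 32K sequence is split into chunk_size spans, each extended by overlap tokens of left context. This illustrates only the splitting, not the device-ring communication the real implementation performs:

```python
def ring_chunks(seq_len: int, chunk_size: int, overlap: int) -> list:
    """Yield (start, end) spans covering the sequence, each extended
    leftward by `overlap` tokens so attention near a chunk border
    still sees its neighbour's context."""
    spans = []
    start = 0
    while start < seq_len:
        end = min(start + chunk_size, seq_len)
        spans.append((max(0, start - overlap), end))
        start = end
    return spans

spans = ring_chunks(32768, 8192, 2048)
# 4 chunks; every chunk after the first carries 2048 overlap tokens
```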

Mixed Precision

--mixed_precision bf16

The p300 supports bf16 natively.

Advanced Features

MoE

--use_moe --num_experts 8 --moe_top_k 2

Increases model capacity; pair with LoRA to keep memory in check.
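
With num_experts 8 and moe_top_k 2, each token is routed to the two highest-scoring experts and their gate weights are renormalized. A minimal pure-Python sketch of standard top-k gating (not the repo's implementation):

```python
import math

def top_k_route(logits: list, k: int = 2) -> list:
    """Pick the k highest-scoring experts and renormalize their softmax
    weights so the selected weights sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts, top-2, as in the flags above
route = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```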

EQ Adapter

--use_eq_adapter --eq_loss_weight 0.05

Enables the emotional-intelligence adapter, which draws on the emotion and frustration_level fields in the training data.
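
The eq_loss_weight flag suggests the adapter's auxiliary loss is mixed into the main objective with a small coefficient. A sketch of that weighting (an assumption about the loss structure, not code from the repo):

```python
def combined_loss(task_loss: float, eq_loss: float, eq_weight: float = 0.05) -> float:
    # The auxiliary EQ loss is added with a small weight so it nudges
    # the model without competing with the main language-modeling loss.
    return task_loss + eq_weight * eq_loss
```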

Curriculum

--use_curriculum

Enables curriculum learning: examples are presented from easy to hard.
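
The core idea can be sketched as sorting records by a difficulty proxy; here difficulty is approximated by output and reasoning-trace length, while the real processor may use richer signals:

```python
def curriculum_order(records: list) -> list:
    """Order training records from easy to hard, using combined output
    and reasoning-trace length as a crude difficulty proxy."""
    def difficulty(r):
        return len(r.get("output", "")) + len(r.get("thoughts", ""))
    return sorted(records, key=difficulty)
```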

Quality Filter

--use_quality_filter --min_quality_score 0.6

Automatically drops examples whose quality score falls below the threshold.
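
The mechanism amounts to scoring each record and keeping those at or above min_quality_score. A toy sketch (the heuristic below is a stand-in for the repo's real scorer):

```python
def quality_score(record: dict) -> float:
    """Toy heuristic score in [0, 1]: rewards a non-empty output,
    a reasoning trace, and a non-empty instruction."""
    score = 0.0
    if record.get("output", "").strip():
        score += 0.5
    if record.get("thoughts", "").strip():
        score += 0.3
    if record.get("instruction", "").strip():
        score += 0.2
    return score

def filter_by_quality(records: list, min_score: float = 0.6) -> list:
    return [r for r in records if quality_score(r) >= min_score]
```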

Evaluation

python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math

Deployment

Ollama

ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Your prompt"
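
The Modelfile referenced above might look like this minimal sketch; the GGUF path and parameter values are placeholders, not files shipped with the repo:

```
FROM ./outputs/checkpoint-final.gguf
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
SYSTEM "You are Zenith, a reasoning assistant."
```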

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
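
The server exposes an OpenAI-compatible API, so any HTTP client can talk to it. A sketch of the request body for the /v1/chat/completions endpoint (max_tokens and temperature values are illustrative):

```python
import json

def build_chat_request(prompt: str, model: str = "./outputs/checkpoint-final") -> str:
    """Serialize an OpenAI-compatible chat-completion request body for
    the vLLM server started above (POST it to localhost:8000)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.7,
    }
    return json.dumps(payload)
```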

Troubleshooting

Out of Memory

  1. Reduce batch size
  2. Increase gradient accumulation
  3. Use QLoRA
  4. Reduce max_seq_length
  5. Enable gradient checkpointing

Slow Training

  1. Increase batch size
  2. Reduce gradient accumulation
  3. Use mixed precision
  4. Optimize data loading
  5. Enable NoC

Poor Quality

  1. Use curriculum
  2. Apply quality filter
  3. More epochs
  4. Lower learning rate
  5. More CoT data

Performance

Config      Memory   Speed     Quality
LoRA r=16   18GB     80-120    98%
QLoRA r=8   10GB     100-150   95%
Ring 32K    +20%     30-50     Enabled

Citation

@misc{zenith-32b-p300-2025,
  title={Zenith-32B-p300: A Tenstorrent-Optimized Reasoning Model},
  year={2025}
}

License

[Specify]

Support

  • README.md for quick reference
  • FINETUNE_GUIDE.md for detailed instructions
  • configs/zenith_config.py for configuration options
  • Open an issue and attach relevant logs