Zenith-28B-p300 (V1, Tenstorrent Blackhole p300)

A 28B-parameter model with advanced reasoning capabilities, optimized for Tenstorrent p300a hardware.

Features

  • 28B Parameters: Based on Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
  • p300a Optimized: Specifically tuned for Tenstorrent p300a hardware (dual-chip, 32 RISC-V cores, 64GB GDDR6)
  • Ring Attention: 32K context window with efficient chunked attention
  • MoE Architecture: Mixture of Experts for sparse activation (configurable)
  • EQ Adapter: Emotional intelligence and frustration detection
  • Reasoning Focus: Enhanced chain-of-thought and problem-solving abilities
  • Tensor/Pipeline Parallelism: Optimized for distributed training on p300
  • NoC Optimization: Efficient chip-to-chip communication
  • Ollama Compatible: Ready for deployment

Hardware Requirements

Training

  • Tenstorrent p300a: 2 chips (32 cores total)
  • Memory: 64GB GDDR6 (shared across chips)
  • Storage: 2TB+ NVMe SSD for dataset
  • Recommended: Use LoRA/QLoRA for efficient fine-tuning

Inference

  • p300a: Full 32K context supported
  • Standard GPU: 48GB+ VRAM (e.g., A100, H100) for the full model
  • Consumer GPUs: Use QLoRA or smaller context windows

Quick Start

Installation

cd Zenith/V1-Tenstorrent-Blackhole-p300/28B
pip install -r requirements.txt

Training on p300

# Full fine-tuning (requires a p300 with all 32 cores)
python train.py \
  --base_model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --train_data ./data/train.json \
  --epochs 2 \
  --batch_size 2 \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-5 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --use_ring_attention \
  --max_seq_length 32768 \
  --mixed_precision bf16

# LoRA fine-tuning (recommended)
python train.py \
  --base_model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --use_ring_attention \
  --max_seq_length 32768

Inference

# Interactive mode with reasoning focus
python inference.py --checkpoint ./outputs/checkpoint-final

# Single prompt with long context
python inference.py \
  --checkpoint ./outputs/checkpoint-final \
  --prompt "Analyze the following 10K text and extract key insights..." \
  --max_new_tokens 2048 \
  --temperature 0.55

Ollama Deployment

# Build model
ollama create zenith-28b-p300 -f Modelfile

# Run with reasoning focus
ollama run zenith-28b-p300 "Solve: A train travels at 60 mph for 2.5 hours. How far does it go? Show your reasoning."

# Long context example
ollama run zenith-28b-p300 "Summarize the key points from this document: [paste 30K text]"

Architecture

Model Configuration

from configs.zenith_config import get_28b_p300_config

config = get_28b_p300_config()
print(config)

Key Parameters:

  • hidden_size: 3072
  • num_layers: 36
  • num_heads: 24
  • num_experts: 8 (configurable, set 0 for dense)
  • moe_top_k: 2
  • max_seq_len: 32768
  • use_ring_attention: True
  • ring_attention_chunk_size: 8192
  • ring_attention_overlap: 2048

p300 Optimizations

  1. Tensor Parallelism (TP=8): Shards each layer's weights across 8 cores
  2. Pipeline Parallelism (PP=4): Splits the layer stack into 4 sequential stages (TP × PP = 32 cores)
  3. NoC Optimization: Efficient inter-core and chip-to-chip communication
  4. Ring Attention: Chunked attention for 32K context without OOM
  5. Mixed Precision: BF16 for compute, FP16 for storage

MoE Configuration

config.num_experts = 8
config.moe_top_k = 2
config.moe_load_balancing_weight = 0.01
config.moe_capacity_factor = 1.0
config.moe_router_learning_rate = 1e-3
  • Top-2 routing (2 experts active per token)
  • Load balancing loss to ensure expert utilization
  • Roughly 60% of layers use MoE (22 of 36, concentrated in the middle of the stack)
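The top-2 routing above can be sketched in a few lines. This is a minimal illustration with a plain softmax gate, not the project's actual router (`top2_route` is a hypothetical name):

```python
import math

def top2_route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    # Softmax over the expert logits (numerically stabilized)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k highest-probability experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the active experts' weights sum to 1
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts, top-2 routing: only two experts fire for this token
routes = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
```

The load-balancing loss (weight 0.01 above) would be added on top of these gate probabilities to keep all 8 experts utilized.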

EQ Adapter

config.use_eq_adapter = True
config.eq_adapter_hidden_size = 64
config.eq_loss_weight = 0.05
config.emotion_loss_weight = 0.05
config.frustration_loss_weight = 0.05
  • Frustration detection (regression, 0-1)
  • 8-emotion classification (joy, sadness, anger, fear, surprise, disgust, trust, neutral)
  • Fused after attention in each transformer layer
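A minimal sketch of the adapter's two heads, assuming simple probes over a pooled hidden state (the names and shapes here are illustrative, not the project's API):

```python
import math

# Emotion label set from the card above; ordering is an assumption
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust", "trust", "neutral"]

def frustration_score(raw):
    """Regression head: squash a raw score into the documented 0-1 range."""
    return 1.0 / (1.0 + math.exp(-raw))

def classify_emotion(logits):
    """Classification head: argmax over the 8 emotion logits."""
    return EMOTIONS[max(range(len(logits)), key=lambda i: logits[i])]
```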

Data Processing

OpenThoughts Integration

from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

ot_config = OpenThoughtsConfig(
    dataset_name="open-thoughts/OpenThoughts3-1.2M",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    augmentation=False,  # Disabled for reasoning tasks
    tokenizer=tokenizer
)
processor = OpenThoughtsProcessor(ot_config)

Curriculum Stages

  1. Foundation: High-quality reasoning examples (CoT, well-structured)
  2. Reasoning: Complex problem-solving and analytical tasks
  3. Code: Programming and algorithm implementation
  4. Full: Complete dataset with all quality levels
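A toy scheduler for these stages might map normalized training progress onto a stage name; the cutoffs below are illustrative assumptions, not the project's actual schedule:

```python
STAGES = ["foundation", "reasoning", "code", "full"]

def curriculum_stage(progress, cutoffs=(0.25, 0.5, 0.75)):
    """Map training progress in [0, 1] onto one of the four stages above."""
    for stage, cutoff in zip(STAGES, cutoffs):
        if progress < cutoff:
            return stage
    return STAGES[-1]
```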

Quality Filtering

Multi-metric scoring:

  • Length: 512-32000 tokens (optimal for 32K context)
  • Language: English only (for reasoning tasks)
  • Repetition: < 15% character-level repetition
  • Coherence: Sentence-level semantic coherence > 0.7
  • Structure: Proper formatting and organization
  • Thought quality: For CoT data, depth of reasoning > 3 steps
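The length and repetition checks can be sketched as follows; `repetition_ratio` and `passes_filter` are hypothetical helpers, with thresholds mirroring the list above:

```python
def repetition_ratio(text, n=4):
    """Fraction of character n-grams that are repeats (lower is better)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def passes_filter(text, token_count, min_tokens=512, max_tokens=32000, max_rep=0.15):
    """Apply the length and repetition checks listed above."""
    return min_tokens <= token_count <= max_tokens and repetition_ratio(text) < max_rep
```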

Advanced Features

Ring Attention

Enables a 32K context without quadratic attention memory:

config.use_ring_attention = True
config.ring_attention_chunk_size = 8192  # 8K chunks
config.ring_attention_overlap = 2048     # 2K overlap for continuity
  • Splits sequence into chunks processed sequentially
  • Overlap ensures smooth attention across chunk boundaries
  • Memory complexity: O(seq_len * chunk_size) instead of O(seq_len²)
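The chunking scheme can be illustrated with a small helper that emits overlapping (start, end) spans; this is a sketch of the bookkeeping only, not the attention kernel itself:

```python
def attention_chunks(seq_len, chunk_size=8192, overlap=2048):
    """(start, end) spans for sequential chunked attention with overlap."""
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < seq_len:
        # Each chunk sees `overlap` tokens of the previous chunk for continuity
        chunks.append((start, min(start + chunk_size, seq_len)))
        if start + chunk_size >= seq_len:
            break
        start += step
    return chunks
```

With the defaults above, a 32K sequence is covered by five 8K chunks whose boundaries overlap by 2K tokens.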

Distributed Training

For multi-chip p300 setups:

# Set environment variables for distributed training
export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2  # 2 chips

# Run training with torchrun
torchrun --nproc_per_node=2 --nnodes=1 train.py ...

Mixed Precision

--mixed_precision bf16  # Native on p300 (and Ampere+ GPUs)
# or
--mixed_precision fp16  # For hardware without BF16 support

Gradient Checkpointing

--gradient_checkpointing  # Reduces memory by ~60%

Testing

# Run test suite
python test_model.py

# Includes:
# - Model creation
# - Forward pass
# - p300 optimizations
# - MoE configuration
# - Ring attention
# - EQ adapter
# - Generation
# - Gradient flow

Evaluation

Standard Benchmarks

python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa \
  --output_dir ./eval_results

Custom Evaluation

Create custom evaluation script:

from evaluation.eval_datasets import load_benchmark
from evaluation.metrics import compute_metrics

# Load reasoning benchmark
test_data = load_benchmark("gsm8k")

# Generate predictions
predictions = []
for sample in test_data:
    prompt = f"Question: {sample['question']}\nLet's think step by step.\nAnswer:"
    response = generate(model, tokenizer, prompt, max_new_tokens=512)
    predictions.append(response)

# Evaluate
metrics = compute_metrics(predictions, test_data, task="gsm8k")
print(f"Accuracy: {metrics['accuracy']:.2%}")
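Scoring GSM8K-style outputs usually requires extracting the final numeric answer from the generation; a common heuristic (the project's `compute_metrics` may do this differently) is:

```python
import re

def extract_final_answer(text):
    """Take the last number in a generated solution as the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None
```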

Performance Optimization

Memory Efficiency

  1. Use LoRA/QLoRA: Cuts trainable parameters by two to three orders of magnitude
  2. Gradient Checkpointing: Trade compute for memory
  3. Reduce Batch Size: Use gradient accumulation to maintain effective batch
  4. Shorter Sequences: Use --max_seq_length appropriate for your task
  5. Mixed Precision: BF16 reduces memory by 50%
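As a quick sanity check for point 3, the effective batch size works out as micro-batch × accumulation steps × data-parallel replicas (the function name is illustrative):

```python
def effective_batch_size(micro_batch, grad_accum_steps, data_parallel=1):
    """Effective batch = micro-batch x accumulation steps x replicas."""
    return micro_batch * grad_accum_steps * data_parallel

# The full fine-tuning recipe above (batch_size 2, accumulation 16)
# keeps an effective batch of 32 per data-parallel replica.
```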

Speed Optimization

  1. Increase Batch Size: Maximize GPU utilization
  2. Reduce Gradient Accumulation: Fewer steps, larger batches
  3. Disable Unused Features: Turn off EQ adapter, MoE if not needed
  4. Use Flash Attention: If available on your hardware
  5. Preprocess Data: Cache tokenized dataset

p300-Specific Tuning

  1. Core Allocation: TP=8 × PP=4 uses all 32 RISC-V cores
  2. NoC Optimization: Enable --use_noc_optimization for chip-to-chip comms
  3. Ring Attention: Essential for 32K context on limited memory
  4. BF16 Precision: Native support on p300 for faster compute
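For point 1, a quick guard against a mismatched core layout might look like this (assuming the 32-core total described above; the function is illustrative, not part of the repo):

```python
def validate_parallelism(tp, pp, total_cores=32):
    """Confirm TP x PP exactly covers the available cores (TP=8 x PP=4 = 32)."""
    if tp * pp != total_cores:
        raise ValueError(f"TP({tp}) x PP({pp}) != {total_cores} cores")
    return tp * pp
```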

Troubleshooting

Out of Memory on p300

  • Reduce --batch_size to 1
  • Increase --gradient_accumulation_steps
  • Use --use_qlora for 4-bit quantization
  • Reduce --max_seq_length (try 16384 instead of 32768)
  • Enable --gradient_checkpointing

Slow Training

  • Check that tensor/pipeline parallelism is configured correctly
  • Verify NoC optimization is enabled for multi-chip
  • Ensure data loading is not the bottleneck (use an SSD, pre-tokenize)
  • Increase batch size if memory allows
  • Use mixed precision

Poor Reasoning Quality

  • Use curriculum learning (--use_curriculum)
  • Apply quality filtering (--use_quality_filter)
  • Train for more epochs (3-5 for reasoning tasks)
  • Use lower learning rate (5e-6 to 1e-5)
  • Ensure the base model is reasoning-optimized (e.g., Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled)
  • Add more reasoning-focused data (CoT, math, code)

Ring Attention Issues

  • Ensure --max_seq_length is divisible by --ring_chunk_size
  • Reduce --ring_overlap if memory is tight
  • Check that the chunk size is not too small (keep it ≥ 2048)
  • Verify attention mask is correctly generated

Citation

@misc{zenith-28b-p300-2025,
  title={Zenith-28B-p300: A Tenstorrent-Optimized Reasoning Model with Ring Attention},
  author={Zenith Project},
  year={2025},
  publisher={Zenith Project}
}

License

[Specify license]

Support

  • Documentation: README.md
  • Fine-tuning guide: FINETUNE_GUIDE.md
  • Configuration: configs/zenith_config.py
  • Issues: Please open an issue with detailed logs