Zenith-28B-p300 (V1, Tenstorrent Blackhole p300)

A 28B-parameter model with advanced reasoning capabilities, optimized for Tenstorrent p300a hardware.

Features

  • 28B Parameters: Based on Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
  • p300a Optimized: Specifically tuned for Tenstorrent p300a hardware (dual-chip, 32 RISC-V cores, 64GB GDDR6)
  • Ring Attention: 32K context window with efficient chunked attention
  • MoE Architecture: Mixture of Experts for sparse activation (configurable)
  • EQ Adapter: Emotional intelligence and frustration detection
  • Reasoning Focus: Enhanced chain-of-thought and problem-solving abilities
  • Tensor/Pipeline Parallelism: Optimized for distributed training on p300
  • NoC Optimization: Efficient chip-to-chip communication
  • Ollama Compatible: Ready for deployment

Hardware Requirements

Training

  • Tenstorrent p300a: 2 chips (32 cores total)
  • Memory: 64GB GDDR6 (shared across chips)
  • Storage: 2TB+ NVMe SSD for dataset
  • Recommended: Use LoRA/QLoRA for efficient fine-tuning

Inference

  • p300a: Full 32K context supported
  • Standard GPU: 48GB+ VRAM (e.g., A100, H100) for the full model
  • Consumer GPUs: Use QLoRA or smaller context windows

Quick Start

Installation

cd Zenith/V1-Tenstorrent-Blackhole-p300/28B
pip install -r requirements.txt

Training on p300

# Full fine-tuning (requires a p300 with all 32 cores)
python train.py \
  --base_model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --train_data ./data/train.json \
  --epochs 2 \
  --batch_size 2 \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-5 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --use_ring_attention \
  --max_seq_length 32768 \
  --mixed_precision bf16

# LoRA fine-tuning (recommended)
python train.py \
  --base_model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --use_ring_attention \
  --max_seq_length 32768

Inference

# Interactive mode with reasoning focus
python inference.py --checkpoint ./outputs/checkpoint-final

# Single prompt with long context
python inference.py \
  --checkpoint ./outputs/checkpoint-final \
  --prompt "Analyze the following 10K text and extract key insights..." \
  --max_new_tokens 2048 \
  --temperature 0.55

Ollama Deployment

# Build model
ollama create zenith-28b-p300 -f Modelfile

# Run with reasoning focus
ollama run zenith-28b-p300 "Solve: A train travels at 60 mph for 2.5 hours. How far does it go? Show your reasoning."

# Long context example
ollama run zenith-28b-p300 "Summarize the key points from this document: [paste 30K text]"

Architecture

Model Configuration

from configs.zenith_config import get_28b_p300_config

config = get_28b_p300_config()
print(config)

Key Parameters:

  • hidden_size: 3072
  • num_layers: 36
  • num_heads: 24
  • num_experts: 8 (configurable, set 0 for dense)
  • moe_top_k: 2
  • max_seq_len: 32768
  • use_ring_attention: True
  • ring_attention_chunk_size: 8192
  • ring_attention_overlap: 2048

p300 Optimizations

  1. Tensor Parallelism (TP=8): Shards each layer's weights across 8 cores
  2. Pipeline Parallelism (PP=4): Splits the layer stack into 4 sequential stages (TP × PP = 32 cores)
  3. NoC Optimization: Efficient inter-core and chip-to-chip communication
  4. Ring Attention: Chunked attention for 32K context without OOM
  5. Mixed Precision: BF16 for compute, FP16 for storage

MoE Configuration

config.num_experts = 8
config.moe_top_k = 2
config.moe_load_balancing_weight = 0.01
config.moe_capacity_factor = 1.0
config.moe_router_learning_rate = 1e-3
  • Top-2 routing (2 experts active per token)
  • Load balancing loss to ensure expert utilization
  • Roughly 60% of layers use MoE (22 of 36, concentrated in the middle of the stack)
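The top-2 routing above can be sketched in a few lines. This is a minimal illustration with a plain softmax gate, not the project's actual router (`top2_route` is a hypothetical name):

```python
import math

def top2_route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    # Softmax over the expert logits (numerically stabilized)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k highest-probability experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the active experts' weights sum to 1
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts, top-2 routing: only two experts fire for this token
routes = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
```

The load-balancing loss (weight 0.01 above) would be added on top of these gate probabilities to keep all 8 experts utilized.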

EQ Adapter

config.use_eq_adapter = True
config.eq_adapter_hidden_size = 64
config.eq_loss_weight = 0.05
config.emotion_loss_weight = 0.05
config.frustration_loss_weight = 0.05
  • Frustration detection (regression, 0-1)
  • 8-emotion classification (joy, sadness, anger, fear, surprise, disgust, trust, neutral)
  • Fused after attention in each transformer layer
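A minimal sketch of the adapter's two heads, assuming simple probes over a pooled hidden state (the names and shapes here are illustrative, not the project's API):

```python
import math

# Emotion label set from the card above; ordering is an assumption
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust", "trust", "neutral"]

def frustration_score(raw):
    """Regression head: squash a raw score into the documented 0-1 range."""
    return 1.0 / (1.0 + math.exp(-raw))

def classify_emotion(logits):
    """Classification head: argmax over the 8 emotion logits."""
    return EMOTIONS[max(range(len(logits)), key=lambda i: logits[i])]
```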

Data Processing

OpenThoughts Integration

from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

ot_config = OpenThoughtsConfig(
    dataset_name="open-thoughts/OpenThoughts3-1.2M",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    augmentation=False,  # Disabled for reasoning tasks
    tokenizer=tokenizer
)
processor = OpenThoughtsProcessor(ot_config)

Curriculum Stages

  1. Foundation: High-quality reasoning examples (CoT, well-structured)
  2. Reasoning: Complex problem-solving and analytical tasks
  3. Code: Programming and algorithm implementation
  4. Full: Complete dataset with all quality levels
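A toy scheduler for these stages might map normalized training progress onto a stage name; the cutoffs below are illustrative assumptions, not the project's actual schedule:

```python
STAGES = ["foundation", "reasoning", "code", "full"]

def curriculum_stage(progress, cutoffs=(0.25, 0.5, 0.75)):
    """Map training progress in [0, 1] onto one of the four stages above."""
    for stage, cutoff in zip(STAGES, cutoffs):
        if progress < cutoff:
            return stage
    return STAGES[-1]
```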

Quality Filtering

Multi-metric scoring:

  • Length: 512-32000 tokens (optimal for 32K context)
  • Language: English only (for reasoning tasks)
  • Repetition: < 15% character-level repetition
  • Coherence: Sentence-level semantic coherence > 0.7
  • Structure: Proper formatting and organization
  • Thought quality: For CoT data, depth of reasoning > 3 steps
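The length and repetition checks can be sketched as follows; `repetition_ratio` and `passes_filter` are hypothetical helpers, with thresholds mirroring the list above:

```python
def repetition_ratio(text, n=4):
    """Fraction of character n-grams that are repeats (lower is better)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def passes_filter(text, token_count, min_tokens=512, max_tokens=32000, max_rep=0.15):
    """Apply the length and repetition checks listed above."""
    return min_tokens <= token_count <= max_tokens and repetition_ratio(text) < max_rep
```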

Advanced Features

Ring Attention

Enables a 32K context without quadratic attention memory:

config.use_ring_attention = True
config.ring_attention_chunk_size = 8192  # 8K chunks
config.ring_attention_overlap = 2048     # 2K overlap for continuity
  • Splits sequence into chunks processed sequentially
  • Overlap ensures smooth attention across chunk boundaries
  • Memory complexity: O(seq_len * chunk_size) instead of O(seq_len²)
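The chunking scheme can be illustrated with a small helper that emits overlapping (start, end) spans; this is a sketch of the bookkeeping only, not the attention kernel itself:

```python
def attention_chunks(seq_len, chunk_size=8192, overlap=2048):
    """(start, end) spans for sequential chunked attention with overlap."""
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < seq_len:
        # Each chunk sees `overlap` tokens of the previous chunk for continuity
        chunks.append((start, min(start + chunk_size, seq_len)))
        if start + chunk_size >= seq_len:
            break
        start += step
    return chunks
```

With the defaults above, a 32K sequence is covered by five 8K chunks whose boundaries overlap by 2K tokens.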

Distributed Training

For multi-chip p300 setups:

# Set environment variables for distributed training
export MASTER_ADDR=localhost
export MASTER_PORT=29500
export WORLD_SIZE=2  # 2 chips

# Run training with torchrun
torchrun --nproc_per_node=2 --nnodes=1 train.py ...

Mixed Precision

--mixed_precision bf16  # Native on p300 (and Ampere+ GPUs)
# or
--mixed_precision fp16  # For hardware without BF16 support

Gradient Checkpointing

--gradient_checkpointing  # Reduces memory by ~60%

Testing

# Run test suite
python test_model.py

# Includes:
# - Model creation
# - Forward pass
# - p300 optimizations
# - MoE configuration
# - Ring attention
# - EQ adapter
# - Generation
# - Gradient flow

Evaluation

Standard Benchmarks

python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa \
  --output_dir ./eval_results

Custom Evaluation

Create custom evaluation script:

from evaluation.eval_datasets import load_benchmark
from evaluation.metrics import compute_metrics

# Load reasoning benchmark
test_data = load_benchmark("gsm8k")

# Generate predictions
predictions = []
for sample in test_data:
    prompt = f"Question: {sample['question']}\nLet's think step by step.\nAnswer:"
    response = generate(model, tokenizer, prompt, max_new_tokens=512)
    predictions.append(response)

# Evaluate
metrics = compute_metrics(predictions, test_data, task="gsm8k")
print(f"Accuracy: {metrics['accuracy']:.2%}")
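Scoring GSM8K-style outputs usually requires extracting the final numeric answer from the generation; a common heuristic (the project's `compute_metrics` may do this differently) is:

```python
import re

def extract_final_answer(text):
    """Take the last number in a generated solution as the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None
```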

Performance Optimization

Memory Efficiency

  1. Use LoRA/QLoRA: Cuts trainable parameters by two to three orders of magnitude
  2. Gradient Checkpointing: Trade compute for memory
  3. Reduce Batch Size: Use gradient accumulation to maintain effective batch
  4. Shorter Sequences: Use --max_seq_length appropriate for your task
  5. Mixed Precision: BF16 reduces memory by 50%
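As a quick sanity check for point 3, the effective batch size works out as micro-batch × accumulation steps × data-parallel replicas (the function name is illustrative):

```python
def effective_batch_size(micro_batch, grad_accum_steps, data_parallel=1):
    """Effective batch = micro-batch x accumulation steps x replicas."""
    return micro_batch * grad_accum_steps * data_parallel

# The full fine-tuning recipe above (batch_size 2, accumulation 16)
# keeps an effective batch of 32 per data-parallel replica.
```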

Speed Optimization

  1. Increase Batch Size: Maximize GPU utilization
  2. Reduce Gradient Accumulation: Fewer steps, larger batches
  3. Disable Unused Features: Turn off EQ adapter, MoE if not needed
  4. Use Flash Attention: If available on your hardware
  5. Preprocess Data: Cache tokenized dataset

p300-Specific Tuning

  1. Core Allocation: TP=8 × PP=4 uses all 32 RISC-V cores
  2. NoC Optimization: Enable --use_noc_optimization for chip-to-chip comms
  3. Ring Attention: Essential for 32K context on limited memory
  4. BF16 Precision: Native support on p300 for faster compute
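For point 1, a quick guard against a mismatched core layout might look like this (assuming the 32-core total described above; the function is illustrative, not part of the repo):

```python
def validate_parallelism(tp, pp, total_cores=32):
    """Confirm TP x PP exactly covers the available cores (TP=8 x PP=4 = 32)."""
    if tp * pp != total_cores:
        raise ValueError(f"TP({tp}) x PP({pp}) != {total_cores} cores")
    return tp * pp
```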

Troubleshooting

Out of Memory on p300

  • Reduce --batch_size to 1
  • Increase --gradient_accumulation_steps
  • Use --use_qlora for 4-bit quantization
  • Reduce --max_seq_length (try 16384 instead of 32768)
  • Enable --gradient_checkpointing

Slow Training

  • Check that tensor/pipeline parallelism is configured correctly
  • Verify NoC optimization is enabled for multi-chip
  • Ensure data loading is not the bottleneck (use an SSD, pre-tokenize)
  • Increase batch size if memory allows
  • Use mixed precision

Poor Reasoning Quality

  • Use curriculum learning (--use_curriculum)
  • Apply quality filtering (--use_quality_filter)
  • Train for more epochs (3-5 for reasoning tasks)
  • Use lower learning rate (5e-6 to 1e-5)
  • Ensure the base model is reasoning-optimized (e.g., Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled)
  • Add more reasoning-focused data (CoT, math, code)

Ring Attention Issues

  • Ensure --max_seq_length is divisible by --ring_chunk_size
  • Reduce --ring_overlap if memory is tight
  • Check that the chunk size is not too small (keep it ≥ 2048)
  • Verify attention mask is correctly generated

Citation

@misc{zenith-28b-p300-2025,
  title={Zenith-28B-p300: A Tenstorrent-Optimized Reasoning Model with Ring Attention},
  author={Zenith Project},
  year={2025},
  publisher={Zenith Project}
}

License

[Specify license]

Support

  • Documentation: README.md
  • Fine-tuning guide: FINETUNE_GUIDE.md
  • Configuration: configs/zenith_config.py
  • Issues: Please open an issue with detailed logs