---
language:
- en
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
tags:
- zenith
- tenstorrent
- reasoning
- math
- moe
- ring-attention
- eq-adapter
- deepseek-r1
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: zenith
hardware:
- tenstorrent-blackhole-p300a
---
# Zenith-32B-p300 (V1-Tenstorrent-Blackhole-p300)

A 32B-parameter model based on DeepSeek-R1-Distill-Qwen-32B, optimized for Tenstorrent Blackhole p300a hardware.
## Features
- 32B Parameters: Based on DeepSeek-R1-Distill-Qwen-32B
- p300a Optimized: Specifically tuned for Tenstorrent p300a hardware
- Ring Attention: 32K context window with efficient chunked attention
- MoE Support: Mixture of Experts for sparse activation
- EQ Adapter: Emotional intelligence capabilities
- Reasoning & Code: Strong performance on reasoning and coding tasks
- Tensor/Pipeline Parallelism: Optimized for distributed training
- NoC Optimization: Efficient chip-to-chip communication
- Ollama Compatible: Ready for deployment
## Hardware Requirements

### Training

- Tenstorrent p300a: 2 chips (64 RISC-V cores)
- Memory: 64 GB GDDR6
- Storage: 2 TB+ NVMe SSD

### Inference

- p300a: full 32K context supported
- Standard GPU: 64 GB+ VRAM (e.g., A100 80GB, H100 80GB)
- Consumer GPUs: use QLoRA or reduce the context length
## Quick Start

### Installation

```bash
cd Zenith/V1-Tenstorrent-Blackhole-p300/32B
pip install -r requirements.txt
```
### Training

```bash
# LoRA fine-tuning (recommended)
python train.py \
  --base_model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --train_data ./data/train.json \
  --use_lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --use_ring_attention \
  --max_seq_length 32768 \
  --tensor_parallel_size 8 \
  --pipeline_parallel_size 4 \
  --use_noc_optimization \
  --mixed_precision bf16
```
### Inference

```bash
# Interactive mode
python inference.py --checkpoint ./outputs/checkpoint-final

# Single prompt
python inference.py \
  --checkpoint ./outputs/checkpoint-final \
  --prompt "Write a Python function to implement quicksort" \
  --max_new_tokens 1024
```
### Ollama

```bash
ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Explain the difference between supervised and unsupervised learning"
```
## Architecture

### Model Configuration

```python
from configs.zenith_config import get_32b_config

config = get_32b_config()
```

Key parameters:

- `hidden_size`: 4096
- `num_layers`: 40
- `num_heads`: 32
- `num_experts`: 8 (configurable)
- `moe_top_k`: 2
- `max_seq_len`: 32768
- `use_ring_attention`: True
- `ring_attention_chunk_size`: 8192
- `ring_attention_overlap`: 2048
### p300 Optimizations
- Tensor Parallelism (TP=8): Across 8 cores per chip
- Pipeline Parallelism (PP=4): 4 stages per chip
- NoC Optimization: Efficient inter-core communication
- Ring Attention: 32K context without OOM
- Mixed Precision: BF16 native support
## Data Processing

### OpenThoughts Integration

```python
from data.openthoughts_processor import OpenThoughtsProcessor, OpenThoughtsConfig

ot_config = OpenThoughtsConfig(
    dataset_name="open-thoughts/OpenThoughts3-1.2M",
    streaming=True,
    max_seq_length=32768,
    quality_filtering=True,
    curriculum_learning=True,
    tokenizer=tokenizer,
)
processor = OpenThoughtsProcessor(ot_config)
```
### Curriculum Stages
- Foundation: High-quality samples (score > 0.8)
- Reasoning: Chain-of-thought examples
- Code: Programming tasks
- Full: Complete dataset
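The four stages above amount to progressively looser sample filters. A minimal sketch of the routing logic — the field names (`quality_score`, `has_cot`, `is_code`) are illustrative, not the processor's actual schema:

```python
def stage_filter(stage, sample):
    """Decide whether a sample belongs to a curriculum stage (illustrative fields)."""
    if stage == "foundation":
        return sample["quality_score"] > 0.8   # high-quality samples only
    if stage == "reasoning":
        return sample["has_cot"]               # chain-of-thought examples
    if stage == "code":
        return sample["is_code"]               # programming tasks
    return True                                # "full": complete dataset

sample = {"quality_score": 0.9, "has_cot": True, "is_code": False}
selected = [s for s in ("foundation", "reasoning", "code", "full")
            if stage_filter(s, sample)]
```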
### Quality Filtering

Samples are scored along multiple dimensions:
- Length: 512-32000 tokens
- Language: English
- Repetition: < 15%
- Coherence: > 0.7
- Structure: Valid formatting
- Thought quality: CoT depth > 3 steps
## Advanced Features

### MoE

```bash
--use_moe --num_experts 8 --moe_top_k 2
```
- Top-2 routing
- Load balancing loss
- MoE applied to 60% of the middle layers
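Top-2 routing with a load-balancing auxiliary loss can be sketched as follows; this is a NumPy illustration of the general technique (Switch-Transformer-style balancing), not the model's actual router code:

```python
import numpy as np

def top2_route(logits, k=2):
    """Select the top-k experts per token and softmax-normalize their gates."""
    idx = np.argsort(logits, axis=-1)[:, -k:]          # (tokens, k) expert ids
    raw = np.take_along_axis(logits, idx, axis=-1)     # their router scores
    g = np.exp(raw - raw.max(axis=-1, keepdims=True))
    return idx, g / g.sum(axis=-1, keepdims=True)

def load_balancing_loss(logits):
    """Auxiliary loss that pushes tokens to spread evenly across experts."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    n = logits.shape[-1]
    # fraction of tokens whose top-1 pick is each expert x mean router probability
    frac = np.bincount(np.argmax(logits, axis=-1), minlength=n) / len(logits)
    return n * float((frac * probs.mean(axis=0)).sum())

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(16, 8))   # 16 tokens, 8 experts
expert_ids, gate_weights = top2_route(router_logits)
aux_loss = load_balancing_loss(router_logits)
```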
### EQ Adapter

```bash
--use_eq_adapter --eq_loss_weight 0.05
```
- Frustration detection
- 8-emotion classification
- Fused with attention
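One common way to fuse such an adapter with the attention stream is a residual bottleneck plus a pooled classification head. The sketch below illustrates that pattern only; the weight names and shapes are assumptions, not the trained adapter's layout:

```python
import numpy as np

def eq_adapter(hidden, W_down, W_up, W_emo):
    """Bottleneck adapter on the attention output plus an 8-way emotion head."""
    z = np.maximum(hidden @ W_down, 0.0)        # down-project + ReLU bottleneck
    fused = hidden + z @ W_up                   # residual fusion with attention stream
    logits = fused.mean(axis=0) @ W_emo         # sequence-pooled 8-emotion logits
    probs = np.exp(logits - logits.max())
    return fused, probs / probs.sum()

rng = np.random.default_rng(2)
h = rng.normal(size=(10, 64))                   # 10 tokens, hidden dim 64
fused, emotion_probs = eq_adapter(
    h,
    rng.normal(size=(64, 16)),                  # down-projection
    rng.normal(size=(16, 64)) * 0.01,           # up-projection (small init)
    rng.normal(size=(64, 8)),                   # 8-emotion classifier
)
```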
### Ring Attention

```bash
--use_ring_attention --ring_chunk_size 8192 --ring_overlap 2048
```
- Enables 32K context
- Memory: O(seq_len × chunk_size)
- Chunked processing
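The memory bound comes from materializing only one query chunk's score matrix at a time. A minimal single-device sketch of chunked causal attention (ring attention additionally streams K/V chunks between devices, which this omits):

```python
import numpy as np

def chunked_causal_attention(q, k, v, chunk=4):
    """Causal attention computed one query chunk at a time.

    Peak score-matrix memory is O(chunk * seq_len) rather than O(seq_len^2).
    """
    seq, d = q.shape
    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(d)
    for start in range(0, seq, chunk):
        end = min(start + chunk, seq)
        s = q[start:end] @ k.T * scale                       # (chunk, seq) scores
        # causal mask: query i may only attend to keys <= i
        mask = np.arange(seq)[None, :] > np.arange(start, end)[:, None]
        s = np.where(mask, -np.inf, s)
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        out[start:end] = (s / s.sum(axis=-1, keepdims=True)) @ v
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
ref = chunked_causal_attention(q, k, v, chunk=16)   # single chunk = unchunked
```

Because softmax is computed per query row, chunking over queries is exact: any chunk size reproduces the unchunked result.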
## Testing

```bash
python test_model.py
```
Tests cover:
- Model creation
- Forward pass
- p300 optimizations
- MoE configuration
- Ring attention
- EQ adapter
- Generation
- Gradient flow
## Evaluation

```bash
python -m evaluation.benchmark \
  --model_path ./outputs/checkpoint-final \
  --benchmarks humaneval mbpp gsm8k math truthfulqa
```
## Deployment

### Ollama

```bash
ollama create zenith-32b-p300 -f Modelfile
ollama run zenith-32b-p300 "Your prompt here"
```

### vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./outputs/checkpoint-final \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
```
## Troubleshooting

### Memory Issues
- Reduce batch size
- Use gradient accumulation
- Enable LoRA/QLoRA
- Reduce sequence length
- Enable gradient checkpointing
### Slow Training
- Increase batch size
- Reduce gradient accumulation
- Use mixed precision
- Optimize data loading
- Enable NoC optimization
### Poor Quality
- Use curriculum learning
- Apply quality filtering
- Train more epochs
- Adjust learning rate
- Use more high-quality data
## Performance

| Configuration | Memory | Speed (tokens/s) | Quality |
|---|---|---|---|
| Full fine-tune, 2K context | ~58 GB | 50-80 | baseline |
| LoRA r=16, 2K context | ~18 GB | 80-120 | ~98% of baseline |
| QLoRA r=8, 2K context | ~10 GB | 100-150 | ~95% of baseline |
| Ring attention, 32K context | +20% memory | 30-50 | enables long context |
## Citation

```bibtex
@misc{zenith-32b-p300-2025,
  title={Zenith-32B-p300: A Tenstorrent-Optimized Reasoning Model},
  author={Zenith Project},
  year={2025}
}
```
## License

MIT (see the `license` field in the model card metadata above).
## Support

- Documentation: `README.md`
- Fine-tuning: `FINETUNE_GUIDE.md`
- Config: `configs/zenith_config.py`
- Issues: open an issue with detailed logs