Pointer-Mini / README.md

OzTianlu

metadata

license: apache-2.0
language:
  - en
  - zh
library_name: pytorch
tags:
  - transformer
  - decoder-only
  - pointer-networks
  - knowledge-distillation
  - sparse-attention
  - pytorch
pipeline_tag: text-generation

Pointer: Decoder-only Transformer with Relational Routing

Pointer is a novel decoder-only transformer architecture that implements relational routing through sparse pointer mechanisms. Its core idea is to write only the "edges" (relationships between positions) into the weights and to dereference node vectors at runtime. FFN blocks then supply the non-linear transformations.

Model Architecture

Core Innovation: Pointer Block

The PointerBlock is the heart of this architecture, implementing:

Sparse Address Generation: Creates sparse address distributions through top-k selection
Multi-head Attention: Uses multiple attention heads for pointer computation
Dynamic Vector Aggregation: Aggregates neighbor vectors based on pointer probabilities
Pointer-of-Pointer Chaining: Enables hierarchical knowledge addressing across layers
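
The first three steps above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the repository's `PointerBlock`: the function name `sparse_pointer_aggregate`, the scaled dot-product scoring, and the causal mask are all assumptions made for the example, and the inputs are taken to be already split into heads.

```python
import math

import torch
import torch.nn.functional as F


def sparse_pointer_aggregate(q, k, v, top_k=2):
    """Causal top-k pointer aggregation (hypothetical sketch).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    Returns the aggregated vectors and the selected pointer indices.
    """
    seq_len, head_dim = q.shape[-2], q.shape[-1]

    # Assumed scoring rule: scaled dot product between queries and keys.
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # (B, H, T, T)

    # Causal mask: position t may only point at positions <= t.
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device)
    )
    scores = scores.masked_fill(~causal, float("-inf"))

    # Sparse address generation: keep the top_k targets per position.
    k_eff = min(top_k, seq_len)
    top_scores, top_idx = scores.topk(k_eff, dim=-1)  # (B, H, T, k)

    # Pointer probabilities over the selected targets only.
    # Masked slots carry -inf and therefore receive weight 0.
    probs = F.softmax(top_scores, dim=-1)

    # Dereference: gather the value vectors the pointers refer to.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, head_dim)
    candidates = v.unsqueeze(2).expand(-1, -1, seq_len, -1, -1)
    neighbors = candidates.gather(3, idx)  # (B, H, T, k, D)

    # Dynamic vector aggregation: probability-weighted sum of neighbors.
    out = (probs.unsqueeze(-1) * neighbors).sum(dim=3)  # (B, H, T, D)
    return out, top_idx
```

Note that this sketch still computes the dense score matrix before selecting the top-k entries; only the aggregation step is sparse. Whether the actual implementation avoids the dense scores is not stated here.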

Architecture Components

TokenEmbedding → [PointerLayer × N] → LayerNorm → LM Head

PointerLayer:
├── LayerNorm
├── PointerBlock (sparse addressing + aggregation)
├── Gate + Residual Connection
├── LayerNorm  
└── FFN (d → d_ff → d)
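
The layer diagram above can be read as a pre-norm residual block. The following is a rough sketch of that reading, not the code in `pointer_layer.py`: the class name `PointerLayerSketch`, the sigmoid form of the gate, the GELU activation, and the residual connection around the FFN are assumptions. The `mixer` argument is a stand-in for the PointerBlock and only needs to map `(batch, seq_len, d)` to the same shape.

```python
import torch
import torch.nn as nn


class PointerLayerSketch(nn.Module):
    """LayerNorm -> mixer -> gate + residual -> LayerNorm -> FFN (hypothetical)."""

    def __init__(self, d, d_ff, mixer):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.mixer = mixer  # stand-in for PointerBlock
        self.gate = nn.Linear(d, d)  # assumed: element-wise sigmoid gate
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(  # FFN (d -> d_ff -> d)
            nn.Linear(d, d_ff),
            nn.GELU(),  # assumed activation
            nn.Linear(d_ff, d),
        )

    def forward(self, x):
        h = self.norm1(x)
        mixed = self.mixer(h)
        # Gate + residual connection around the mixer output.
        x = x + torch.sigmoid(self.gate(h)) * mixed
        # Assumed residual connection around the FFN.
        x = x + self.ffn(self.norm2(x))
        return x
```

For a quick shape check, `PointerLayerSketch(d=16, d_ff=43, mixer=nn.Identity())` can be applied to a tensor of shape `(2, 5, 16)` and returns a tensor of the same shape.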

Key Features

Relational Routing: Only "edges" are written into weights; node vectors are dereferenced at runtime
Sparse Attention: Top-k selection mechanism for efficient computation
Knowledge Address Chains: Higher layers reference increasingly abstract relationship patterns
KV Caching: Efficient inference with dynamic cache expansion

Model Specifications

Parameter	Value
Architecture	Decoder-only Transformer
Model Size	Pointer-300M
Vocabulary Size	Dynamic (based on tokenizer)
Hidden Dimension (d)	1,024
Number of Layers	24
Attention Heads	16
Top-k Selection	2
FFN Expansion Ratio	2.7
Maximum Sequence Length	4,096
Parameters	~300M
Dropout	0.1
FP16 Training	Yes
Tied Embeddings	Yes

Training Details

Mix-Distillation Strategy

The model was trained using Mix-Distillation following the "Small Models Struggle to Learn from Strong Reasoners" approach:

Teacher Model: DeepSeek-R1
Training Data: Mix-Long strategy, with Long-CoT and Short-CoT samples mixed at a 0.2 : 0.8 ratio
Training Steps: 10,000 steps with gradient accumulation
Precision: FP16 with numerical stability protections

Training Hyperparameters

num_epochs: 2
per_device_batch_size: 4
gradient_accumulation_steps: 4
effective_batch_size: 16  # 4 * 4
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
weight_decay: 0.01
save_steps: 1000
eval_steps: 500
logging_steps: 50
fp16: true
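
As a worked example of how these numbers relate, the snippet below derives the effective batch size, the optimizer steps per pass over the 110,000-sample dataset, and the warmup length for a 10,000-step run. The helper name `steps_per_epoch` and the single-device setting are assumptions; the README does not state the number of devices or any train/eval split.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps,
                    num_devices=1):
    """Return (effective batch size, optimizer steps per epoch)."""
    effective = per_device_batch_size * grad_accum_steps * num_devices
    return effective, math.ceil(num_samples / effective)


# Values taken from this README; num_devices=1 is an assumption.
effective, per_epoch = steps_per_epoch(110_000, 4, 4)

# warmup_ratio 0.05 applied to a 10,000-step schedule.
warmup = round(0.05 * 10_000)

print(effective, per_epoch, warmup)
```

Under the single-device assumption this gives an effective batch size of 16, 6,875 optimizer steps per epoch, and 500 warmup steps.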

Distillation Configuration

temperature: 2.0
alpha: 0.5  # KD loss weight
beta: 1.0   # CE loss weight  
gamma: 0.5  # Additional loss weight
use_kd_loss: true
use_ce_loss: true
use_hidden_mse: false
use_pointer_kl: false
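
One common way to combine the two enabled terms is a temperature-softened KL divergence against the teacher plus the usual cross-entropy. The sketch below shows that standard formulation with the weights listed above. It is an assumption about how `alpha`, `beta`, and `temperature` are used, not the project's training code; it also assumes student and teacher share a vocabulary, and it leaves out `gamma` because the README does not say which term it weights.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5, beta=1.0,
                      ignore_index=-100):
    """alpha * KD + beta * CE (hypothetical helper).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids
    """
    vocab = student_logits.size(-1)
    s = student_logits.reshape(-1, vocab)
    t = teacher_logits.reshape(-1, vocab)
    y = labels.reshape(-1)

    # Hard-label language modeling loss.
    ce = F.cross_entropy(s, y, ignore_index=ignore_index)

    # Soft-label loss: KL(teacher || student) at the given temperature.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    log_p_student = F.log_softmax(s / temperature, dim=-1)
    p_teacher = F.softmax(t / temperature, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    kd = kd * temperature ** 2

    return alpha * kd + beta * ce
```

When the student's logits equal the teacher's, the KD term vanishes and the loss reduces to `beta` times the cross-entropy.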

Training Data

Dataset Size: 110,000 samples from Chinese-DeepSeek-R1-Distill
CoT Distribution:
- Long-CoT: 22,000 samples (20%)
- Short-CoT: 88,000 samples (80%)
Sequence Length: 21-2,048 tokens (mean: 885, median: 721)
Quality Scores: 7-10 (mean: 9.09)
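
A mixture with this 20% / 80% split could be assembled as follows. This is a minimal sketch: the helper `mix_long_short`, the sampling without replacement, and the toy sample pools are illustrative assumptions and do not describe how the actual dataset was built.

```python
import random


def mix_long_short(long_cot, short_cot, total, long_fraction=0.2, seed=0):
    """Draw `total` samples with the given Long-CoT fraction, then shuffle."""
    rng = random.Random(seed)
    n_long = round(total * long_fraction)
    n_short = total - n_long
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)
    return mixed


# Toy pools standing in for the real Long-CoT and Short-CoT samples.
long_pool = [("long", i) for i in range(50)]
short_pool = [("short", i) for i in range(200)]

mixed = mix_long_short(long_pool, short_pool, total=100)
n_long = sum(1 for kind, _ in mixed if kind == "long")
print(n_long, len(mixed) - n_long)
```

With `total=110_000` and `long_fraction=0.2`, the same rule yields the 22,000 / 88,000 split listed above.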

Loss Components

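Cross-Entropy Loss: Standard language modeling objective (use_ce_loss: true)
Hidden State MSE: Knowledge distillation from teacher hidden states (use_hidden_mse: false)
Pointer KL Divergence: Alignment of pointer attention distributions (use_pointer_kl: false)
Pointer Cross-Entropy: Hard distillation for pointer indices

In the distillation configuration listed in this README, only the KD and cross-entropy terms are enabled; the hidden-state MSE and pointer KL terms are switched off.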

Key Innovations

1. Pointer-of-Pointer Mechanism

Each layer produces pointer indices to previous positions, and the next layer uses these indices to create "pointer-of-pointer" chains, enabling hierarchical knowledge addressing patterns.
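
The chaining idea can be illustrated as index composition. The sketch below simplifies to a single pointer per position (top-1) and uses a hypothetical helper, `chain_pointers`; how the model actually consumes the previous layer's indices is not specified in this README.

```python
import torch


def chain_pointers(ptr_lower, ptr_upper):
    """Compose two layers of pointer indices (hypothetical sketch).

    ptr_lower, ptr_upper: (batch, seq_len) long tensors, where entry t is
    the position that position t points at in that layer.
    Returns a (batch, seq_len) tensor whose entry t is
    ptr_lower[ptr_upper[t]], i.e. the pointer-of-pointer target.
    """
    return torch.gather(ptr_lower, dim=1, index=ptr_upper)


# Layer 1: position 2 points at 1, position 3 points at 2, and so on.
lower = torch.tensor([[0, 0, 1, 2]])
# Layer 2: position 3 points at itself, position 2 points at 1.
upper = torch.tensor([[0, 1, 1, 3]])

chained = chain_pointers(lower, upper)
print(chained)  # tensor([[0, 0, 0, 2]])
```

Because both layers point only at earlier or equal positions, the composed pointer is also causal: every chained target is at or before the position it starts from.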

2. Sparse Relational Routing

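Instead of dense attention, the model uses sparse top-k selection to identify the most relevant connections, so each position aggregates only its k highest-scoring targets (k = 2 in Pointer-300M) rather than a weighted sum over every previous position. This is intended to reduce computation while maintaining expressiveness.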

3. Runtime Vector Dereferencing

Unlike traditional transformers that compute attention over all positions, Pointer writes relationship patterns into weights and dereferences specific node vectors only when needed.

4. Numerical Stability for FP16

The forward pass performs NaN detection and handling at several points, including:

Input validation in embeddings
Attention score clamping
Emergency NaN repairs
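
Score clamping and NaN repair of this kind might look like the sketch below. The function name `stabilize_scores` and the bound `clamp_value=1e4` (chosen to stay well inside the FP16 maximum of 65504) are assumptions, not values taken from the repository.

```python
import torch


def stabilize_scores(scores, clamp_value=1e4):
    """Replace NaN/Inf entries and clamp scores to a finite range."""
    # Emergency repair: NaN -> 0, +Inf/-Inf -> +/- clamp_value.
    scores = torch.nan_to_num(
        scores, nan=0.0, posinf=clamp_value, neginf=-clamp_value
    )
    # Clamp remaining large magnitudes so later ops stay representable.
    return scores.clamp(min=-clamp_value, max=clamp_value)


raw = torch.tensor([float("nan"), float("inf"), float("-inf"), 5.0, 2e4])
safe = stabilize_scores(raw)
print(safe)
```

Since this replaces `-inf` with a finite value, a helper like this would need to run before any causal mask that relies on `-inf` is applied.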

Usage

import torch
from src.model.pointer_model import PointerDecoder

# `tokenizer` is not defined in this snippet: load the tokenizer you train
# with beforehand; only its `vocab_size` attribute is used here.

# Initialize the Pointer-300M model with the configuration listed above
model = PointerDecoder(
    vocab_size=tokenizer.vocab_size,  # Dynamic, based on tokenizer
    d=1024,                           # Hidden dimension
    n_layers=24,                      # Number of layers
    n_heads=16,                       # Attention heads
    top_k=2,                          # Pointer selection (targets per position)
    r=2.7,                            # FFN expansion ratio
    max_seq_len=4096,                 # Max sequence length
    dropout=0.1,                      # Dropout rate
    tie_embeddings=True,              # Tie input/output embeddings
    fp16=True                         # FP16 training
)

# Forward pass over a random batch of shape (batch=1, seq_len=100)
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 100))
logits = model(input_ids)

# Inference with caching: feed the sequence one token at a time
input_sequence = input_ids[0]
cache = model.init_cache(batch_size=1)
for token in input_sequence:
    logits, cache = model.step(token, cache)

File Structure

src/
├── layers/
│   ├── embedding.py       # TokenEmbedding with vocab reduction support
│   ├── rotary.py         # Rotary positional encoding
│   ├── pointer_block.py  # Core PointerBlock implementation
│   ├── ffn.py           # Feed-forward network
│   └── pointer_layer.py  # PointerBlock + FFN + Residual connections
└── model/
    └── pointer_model.py  # Complete PointerDecoder implementation

Supported Languages

English
Chinese (Simplified)

Limitations

Supports only left-to-right (causal) generation; there is no bidirectional mode
FP16 training requires care because of numerical stability issues
The top-k selection parameter needs tuning for different tasks
At roughly 300M parameters, the model is small compared with larger language models
Trained primarily on Chinese data distilled from DeepSeek-R1

Citation

If you use this model in your research, please cite:

@misc{pointer300m2025,
  title={Pointer-300M: Decoder-only Transformer with Relational Routing},
  author={[Noesis Lab]},
  year={2025},
  howpublished={\url{https://huggingface.co/NoesisLab/Pointer-300M}}
}

License

This model is released under the Apache 2.0 License.