Update README.md
Browse files
README.md
CHANGED

| Parameter | Value |
|-----------|-------|
| Architecture | Decoder-only Transformer |
| Model Size | Pointer-300M |
| Vocabulary Size | Dynamic (based on tokenizer) |
| Hidden Dimension (d) | 1,024 |
| Number of Layers | 24 |
| Attention Heads | 16 |
| Top-k Selection | 2 |
| FFN Expansion Ratio | 2.7 |
| Maximum Sequence Length | 4,096 |
| Parameters | ~300M |
| Dropout | 0.1 |
| FP16 Training | Yes |
| Tied Embeddings | Yes |
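
As a rough sanity check, the figures above are consistent with the ~300M total. The sketch below is a back-of-envelope estimate only: it assumes standard attention and FFN projection shapes and an illustrative vocabulary of 50,000 tokens (the real vocabulary is tokenizer-dependent).

```python
# Back-of-envelope parameter estimate from the table above.
# Assumes standard attention (4 * d^2) and FFN (2 * r * d^2) projections;
# the 50k vocabulary is only an illustrative placeholder.
d, n_layers, r, vocab = 1024, 24, 2.7, 50_000

attn_params = 4 * d * d            # Q, K, V, and output projections
ffn_params = 2 * int(r * d) * d    # up- and down-projection
per_layer = attn_params + ffn_params
embedding = vocab * d              # tied input/output embeddings counted once

total = n_layers * per_layer + embedding
print(f"{total / 1e6:.0f}M parameters")  # ~288M here, in line with the ~300M above
```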

## Training Details

### Training Hyperparameters
```yaml
num_epochs: 2
per_device_batch_size: 4
gradient_accumulation_steps: 4
effective_batch_size: 16  # 4 * 4
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
weight_decay: 0.01
save_steps: 1000
eval_steps: 500
logging_steps: 50
fp16: true
```
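
The README does not state which training loop consumes these values; if they are fed to the Hugging Face `Trainer` (an assumption), they map roughly onto `TrainingArguments` as sketched below, with `output_dir` as a placeholder.

```python
# Hypothetical mapping of the YAML above onto Hugging Face TrainingArguments.
# Assumes the HF Trainer is used; the actual training loop may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs/pointer-300m",   # placeholder path
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch size 4 * 4 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    save_steps=1000,
    eval_steps=500,
    logging_steps=50,
    fp16=True,
)
```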

### Distillation Configuration
```yaml
temperature: 2.0
alpha: 0.5   # KD loss weight
beta: 1.0    # CE loss weight
gamma: 0.5   # Additional loss weight
use_kd_loss: true
use_ce_loss: true
use_hidden_mse: false
use_pointer_kl: false
```
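
As a reference for how such weights are typically combined, the sketch below assembles a temperature-scaled KD term with the CE objective. The function is illustrative only and is not the repository's implementation; `gamma` would weight the optional hidden-state/pointer terms, which are disabled in this configuration.

```python
# Illustrative combination of the distillation weights above (not the repo's code):
# total = beta * CE + alpha * T^2 * KL(student || teacher).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5, beta=1.0):
    # Standard language-modeling cross-entropy on the hard labels
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    # Temperature-scaled KL divergence against the teacher distribution
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return beta * ce + alpha * kd
```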

### Training Data
- **Dataset Size**: 110,000 samples from Chinese-DeepSeek-R1-Distill
- **CoT Distribution**:
  - Long-CoT: 22,000 samples (20%)
  - Short-CoT: 88,000 samples (80%)
- **Sequence Length**: 21-2,048 tokens (mean: 885, median: 721)
- **Quality Scores**: 7-10 (mean: 9.09)
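
A minimal sketch of assembling the 20/80 long/short-CoT mix described above (the dummy pools and seed are hypothetical; the actual data pipeline is not shown here):

```python
# Hypothetical illustration of the 20/80 long/short-CoT mix (110,000 samples total).
import random

random.seed(0)
long_pool = [f"long_cot_{i}" for i in range(40_000)]      # placeholder long-CoT samples
short_pool = [f"short_cot_{i}" for i in range(200_000)]   # placeholder short-CoT samples

mixed = random.sample(long_pool, 22_000) + random.sample(short_pool, 88_000)
random.shuffle(mixed)
assert len(mixed) == 110_000  # 20% long-CoT, 80% short-CoT
```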

### Loss Components
- **Cross-Entropy Loss**: Standard language modeling objective
- **Hidden State MSE**: Knowledge distillation from teacher hidden states

```python
import torch
from src.model.pointer_model import PointerDecoder

# `tokenizer` is assumed to be an already-loaded tokenizer (e.g. a Hugging Face
# tokenizer); the model's vocabulary size is taken from it.

# Initialize Pointer-300M model with your config
model = PointerDecoder(
    vocab_size=tokenizer.vocab_size,  # Dynamic based on tokenizer
    d=1024,                # Hidden dimension
    n_layers=24,           # Number of layers
    n_heads=16,            # Attention heads
    top_k=2,               # Pointer selection
    r=2.7,                 # FFN expansion ratio
    max_seq_len=4096,      # Max sequence length
    dropout=0.1,           # Dropout rate
    tie_embeddings=True,   # Tie input/output embeddings
    fp16=True              # FP16 training
)

# Forward pass
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 100))
logits = model(input_ids)

# Inference with caching
```

- Currently supports only left-to-right generation (no bidirectional)
- Requires careful FP16 training due to numerical stability considerations
- Top-k selection parameter needs tuning for different tasks
- Model size is 300M parameters, which limits capacity compared to larger language models
- Trained primarily on Chinese data distilled from DeepSeek-R1
|
| 193 |
## Citation
|
| 194 |
|
| 195 |
If you use this model in your research, please cite:
|
| 196 |
|
| 197 |
```bibtex
|
| 198 |
+
@misc{pointer300m2025,
|
| 199 |
+
title={Pointer-300M: Decoder-only Transformer with Relational Routing},
|
| 200 |
+
author={[Noesis Lab]},
|
| 201 |
+
year={2025},
|
| 202 |
+
howpublished={\url{https://huggingface.co/NoesisLab/Pointer-300M}}
|
| 203 |
}
|
| 204 |
```
|
| 205 |
|