Update README.md
README.md
CHANGED
@@ -1,55 +1,182 @@
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->

---
library_name: transformers
tags:
- pytorch
- neollm
- hybrid-attention
- fanformer
- gated-delta-networks
- polynomial-activations
- fineweb-edu
- ademamix
- custom-scheduler
- flash-attention
- torch-compile
pipeline_tag: text-generation
model-index:
- name: NeoLLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: ARC-Easy
    metrics:
    - type: accuracy
      value: 39.14
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: HellaSwag
    metrics:
    - type: accuracy
      value: 26.55
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: MMLU
    metrics:
    - type: accuracy
      value: 24.25
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: ARC-Challenge
    metrics:
    - type: accuracy
      value: 17.24
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
---

# NeoLLM

NeoLLM is a hybrid-architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M-parameter model demonstrates architectural innovations including Fourier Analysis Networks, hybrid attention mechanisms, and advanced normalization techniques.

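Below is a minimal, hypothetical loading sketch: the repository id is a placeholder for wherever this checkpoint is published, and `trust_remote_code=True` is assumed because the architecture uses custom modeling code.

```python
# Hypothetical usage sketch — the repo id is a placeholder, not the actual Hub path.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/NeoLLM"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("The Fourier transform is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
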
## Model Description

NeoLLM incorporates several cutting-edge components:

- **FANformer Integration**: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a fan_ratio of 0.125
- **Hybrid Attention Architecture**: alternates between full-attention and linear-attention (Gated Delta Net) layers, inspired by Qwen3-Next
- **Polynomial Composition Activations**: PolyNorm activation functions in the MLP layers for enhanced dynamics
- **Advanced Normalization**: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
- **Efficient Linear Attention**: Gated Delta Networks for improved computational efficiency

### Architecture Details

- **Model Size**: 110M parameters (77M embedding + 33M non-embedding)
- **Hidden Size**: 512
- **Layers**: 12, arranged in a hybrid attention pattern
- **Attention Heads**: 8 (with 2 KV heads via Grouped Query Attention)
- **Intermediate Size**: 1024
- **Sequence Length**: 512 tokens
- **Vocabulary**: 151,665 tokens (Qwen3 tokenizer)

### Layer Pattern

The model uses a hybrid attention pattern in which layers alternate between:

- **Linear Attention**: layers 1, 2, 3, 5, 6, 7, 9, 10, 11 (Gated Delta Networks)
- **Full Attention**: layers 4, 8, 12 (Flash Attention 2)

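A minimal sketch of how this alternation can be expressed (an illustrative helper, not the model's actual configuration code):

```python
# Every 4th layer is full attention; the remaining layers use linear attention
# (Gated Delta Net), matching the 12-layer pattern listed above (1-indexed).
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(12)
]
# layers 4, 8 and 12 -> "full_attention"; all others -> "linear_attention"
```
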
## Training Details

### Dataset

- **Source**: FineWeb-Edu (sample-10BT subset)
- **Training Samples**: 4 million examples
- **Validation Split**: 1% (40,000 samples)
- **Text Processing**: dynamic truncation to 4x block_size during tokenization
- **Tokenizer**: Qwen3 fast tokenizer, with embedding weight tying enabled

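A sketch of this preprocessing step using the Hugging Face `datasets` library; the tokenizer repo below is illustrative (any Qwen3 checkpoint ships the same fast tokenizer), and the exact arguments used for training are not stated here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 512  # model sequence length

# Illustrative tokenizer source; the training run used the Qwen3 fast tokenizer.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")

def tokenize(batch):
    # Dynamic truncation to 4x block_size, as described above.
    return tok(batch["text"], truncation=True, max_length=4 * block_size)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
```
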
### Training Configuration

- **Hardware**: NVIDIA RTX 5090
- **Training Time**: 3 hours
- **Loss Function**: Cut Cross-Entropy from "Cut Your Losses in Large-Vocabulary Language Models" (not standard cross-entropy)
- **Optimizer**: AdEMAMix with:
  - betas: (0.9, 0.999, 0.999)
  - alpha: 5.0
  - t_alpha: 5000, t_beta3: 5000
  - weight decay: 0.1
- **Learning Rate Schedule**: custom cosine with linear warmup (see the sketch after this list)
  - start LR: 3e-4
  - peak LR: 6e-4 (reached after 5,000 warmup steps)
  - min LR: 6e-5
- **Batch Size**: 64 per device
- **Precision**: BF16 with torch.compile optimization
- **Hardware Optimizations**: Flash Attention 2
- **Epochs**: 1

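The schedule described above, expressed as a sketch; the total step count is a placeholder, since only the warmup length and the three learning-rate values are stated.

```python
import math

def neollm_lr(step, warmup_steps=5000, total_steps=60000,
              start_lr=3e-4, peak_lr=6e-4, min_lr=6e-5):
    """Linear warmup from start_lr to peak_lr, then cosine decay to min_lr.
    total_steps is a placeholder, not the actual training length."""
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```
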
### Framework Versions

- **PyTorch**: 2.8.0+cu129
- **Transformers**: 4.57.0.dev0
- **Flash Attention**: 2.x
- **CUDA**: 12.9

## Evaluation Results

### Benchmark Performance (1-shot evaluation)

| Task | Accuracy |
|------|----------|
| ARC-Easy | 39.14% |
| HellaSwag | 26.55% |
| MMLU | 24.25% |
| ARC-Challenge | 17.24% |

*All evaluations were performed in a 1-shot setting.*

## Model Architecture Components

### Fourier Analysis Network (FANLayer)

Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":

```
FANLayer'(X) = [cos(WpX) || sin(WpX) || (WpX + Bp)]
```

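A PyTorch sketch of this formula; the split between the periodic and non-periodic branches follows one common convention for fan_ratio and may not match the model's exact implementation.

```python
import torch
import torch.nn as nn

class FANLayer(nn.Module):
    """Sketch of the FAN layer above: a fraction of the output comes from the
    periodic branch (cos/sin of a linear projection), the rest from a plain
    linear branch with bias."""
    def __init__(self, dim_in, dim_out, fan_ratio=0.125):
        super().__init__()
        p_dim = int(dim_out * fan_ratio)                       # periodic dims per cos/sin branch
        self.p_proj = nn.Linear(dim_in, p_dim, bias=False)     # W_p
        self.g_proj = nn.Linear(dim_in, dim_out - 2 * p_dim)   # W_p X + B_p

    def forward(self, x):
        p = self.p_proj(x)
        return torch.cat([torch.cos(p), torch.sin(p), self.g_proj(x)], dim=-1)
```
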
### LayerNorm Scaling (LNS)

Implements the 1/√ℓ scaling factor described in "The Curse of Depth in Large Language Models":

```
h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
```

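A sketch of LNS applied around a standard LayerNorm (the model may apply the same scaling to an RMSNorm instead; `layer_idx` is the 1-indexed layer depth):

```python
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_idx), per the formula above."""
    def __init__(self, hidden_size, layer_idx):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / (layer_idx ** 0.5)  # layer_idx is the 1-indexed layer depth

    def forward(self, x):
        return self.norm(x) * self.scale
```
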
### Gradient-Preserving Activation Scaling (GPAS)

Scales activations in the forward pass without attenuating their gradients, using stop-gradient operations.

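The core stop-gradient trick can be sketched as below; this is a generic illustration of gradient-preserving scaling, not necessarily the exact GPAS formulation (which uses learnable, per-layer gates).

```python
import torch

def gradient_preserving_scale(h: torch.Tensor, s: float) -> torch.Tensor:
    # Forward pass returns h * s, but because the scaled delta is detached,
    # the backward gradient is identical to the unscaled case.
    return h + (h * s - h).detach()
```
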
### Polynomial Composition Activations (PolyNorm)

Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".

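A sketch of an order-3 polynomial composition activation; the normalization and initialization details are assumptions and may differ from the model's PolyNorm implementation.

```python
import torch
import torch.nn as nn

class PolyNorm(nn.Module):
    """Sum of normalized element-wise powers of x (orders 1..3) with learnable
    coefficients and bias — an illustrative sketch of polynomial composition."""
    def __init__(self, order=3, eps=1e-6):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(order) / order)
        self.bias = nn.Parameter(torch.zeros(1))
        self.order = order
        self.eps = eps

    def _rms_norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

    def forward(self, x):
        return sum(self.weights[i] * self._rms_norm(x ** (i + 1))
                   for i in range(self.order)) + self.bias
```
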
### Gated Delta Networks

Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.

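An illustrative per-token reference loop for the gated delta rule recurrence (the real implementation uses chunked kernels; shapes and gate conventions here are assumptions):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha (decay gate), beta (write strength): (T,).
    State update: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    Output:       o_t = S_t q_t
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)
    outs = []
    for t in range(T):
        S = alpha[t] * (S - beta[t] * torch.outer(S @ k[t], k[t])) + beta[t] * torch.outer(v[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)
```
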
## Intended Uses & Limitations

### Intended Uses

- Research into hybrid attention architectures
- Educational use for understanding advanced LLM components
- Small-scale language modeling experiments
- Benchmarking novel architectural components

### Limitations

- The relatively small size (110M parameters) limits capability compared to larger models
- Training was limited to 4M samples from a single dataset
- Performance is below state-of-the-art models on standard benchmarks
- The experimental architecture may have stability considerations in production

### Recommendations

- Best suited for research and educational applications
- Consider fine-tuning for specific downstream tasks
- Monitor performance carefully if adapting for production use

## Training Infrastructure

- **Mixed Precision**: BF16 for numerical stability
- **Compilation**: torch.compile with max-autotune mode (see the sketch below)

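How these two settings are typically combined for a forward pass (illustrative only; requires a CUDA device and is not the project's actual training script):

```python
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(512, 512).cuda(), mode="max-autotune")
x = torch.randn(8, 512, device="cuda")

# BF16 autocast for the forward pass, matching the mixed-precision setting above.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
```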