KitsuVp committed
Commit e9c4723 · verified · 1 Parent(s): fee196d

Update README.md

Files changed (1):
  1. README.md +173 -46

README.md CHANGED
@@ -1,55 +1,182 @@
  ---
  library_name: transformers
  tags:
- - generated_from_trainer
  model-index:
  - name: NeoLLM
-   results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # NeoLLM
 
- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 3.5958
- - Num Input Tokens Seen: 0
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0006
- - train_batch_size: 64
- - eval_batch_size: 64
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1
-
- ### Training results
-
- ### Framework versions
-
- - Transformers 4.57.0.dev0
- - Pytorch 2.8.0+cu129
- - Datasets 3.6.0
- - Tokenizers 0.22.0
  ---
  library_name: transformers
  tags:
+ - pytorch
+ - neollm
+ - hybrid-attention
+ - fanformer
+ - gated-delta-networks
+ - polynomial-activations
+ - fineweb-edu
+ - ademamix
+ - custom-scheduler
+ - flash-attention
+ - torch-compile
+ pipeline_tag: text-generation
  model-index:
  - name: NeoLLM
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: multiple-choice
+       name: ARC-Easy
+     metrics:
+     - type: accuracy
+       value: 39.14
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: multiple-choice
+       name: HellaSwag
+     metrics:
+     - type: accuracy
+       value: 26.55
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: multiple-choice
+       name: MMLU
+     metrics:
+     - type: accuracy
+       value: 24.25
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: multiple-choice
+       name: ARC-Challenge
+     metrics:
+     - type: accuracy
+       value: 17.24
+ license: apache-2.0
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ language:
+ - en
  ---
 
  # NeoLLM
 
+ NeoLLM is a hybrid-architecture language model that combines several state-of-the-art techniques for efficient and effective language modeling. This 110M-parameter model demonstrates architectural innovations including Fourier Analysis Networks, hybrid attention mechanisms, and advanced normalization techniques.
+
+ ## Model Description
+
+ NeoLLM incorporates several cutting-edge components:
+
+ - **FANformer Integration**: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a fan_ratio of 0.125
+ - **Hybrid Attention Architecture**: alternates between full-attention and linear-attention (Gated Delta Net) layers, inspired by Qwen3-Next
+ - **Polynomial Composition Activations**: PolyNorm activation functions in the MLP layers for enhanced dynamics
+ - **Advanced Normalization**: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
+ - **Efficient Linear Attention**: Gated Delta Networks for improved computational efficiency
+
+ ### Architecture Details
+
+ - **Model Size**: 110M parameters (77M embedding + 33M non-embedding)
+ - **Hidden Size**: 512
+ - **Layers**: 12, with a hybrid attention pattern
+ - **Attention Heads**: 8 (2 KV heads, using Grouped Query Attention)
+ - **Intermediate Size**: 1024
+ - **Sequence Length**: 512 tokens
+ - **Vocabulary**: 151,665 tokens (Qwen3 tokenizer)
+
+ ### Layer Pattern
+ The model uses a hybrid attention pattern in which layers alternate between:
+ - **Linear Attention**: layers 1, 2, 3, 5, 6, 7, 9, 10, 11 (Gated Delta Networks)
+ - **Full Attention**: layers 4, 8, 12 (Flash Attention 2)
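This 3-to-1 interleaving follows a simple positional rule: every fourth layer uses full attention. A minimal sketch (the helper name `layer_type` is illustrative, not from the released code):

```python
def layer_type(layer_idx: int) -> str:
    """Return the attention type for a 1-indexed layer: every 4th layer
    uses full attention, all others use linear attention (Gated Delta Net)."""
    return "full_attention" if layer_idx % 4 == 0 else "linear_attention"

# The 12-layer schedule described above.
pattern = [layer_type(i) for i in range(1, 13)]
```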
+
+ ## Training Details
+
+ ### Dataset
+ - **Source**: FineWeb-Edu (sample-10BT subset)
+ - **Training Samples**: 4 million examples
+ - **Validation Split**: 1% (40,000 samples)
+ - **Text Processing**: dynamic truncation to 4x block_size during tokenization
+ - **Tokenizer**: Qwen3 fast tokenizer; input/output embedding weights are tied
+
+ ### Training Configuration
+ - **Hardware**: NVIDIA RTX 5090
+ - **Training Time**: 3 hours
+ - **Loss Function**: Cut Cross-Entropy (from "Cut Your Losses in Large-Vocabulary Language Models"), not standard cross-entropy
+ - **Optimizer**: AdEMAMix with parameters:
+   - Betas: (0.9, 0.999, 0.999)
+   - Alpha: 5.0
+   - t_alpha: 5000, t_beta3: 5000
+   - Weight decay: 0.1
+ - **Learning Rate Schedule**: custom cosine with linear warmup
+   - Start LR: 3e-4
+   - Peak LR: 6e-4 (reached after 5,000 warmup steps)
+   - Min LR: 6e-5
+ - **Batch Size**: 64 per device
+ - **Precision**: BF16 with torch.compile optimization
+ - **Hardware Optimizations**: Flash Attention 2
+ - **Epochs**: 1
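The warmup and decay numbers above pin the schedule down completely. A sketch in plain Python, assuming linear warmup from the start LR to the peak LR followed by cosine decay to the floor; the function name and the total-step count are illustrative, not taken from the training code:

```python
import math

START_LR, PEAK_LR, MIN_LR = 3e-4, 6e-4, 6e-5
WARMUP_STEPS = 5000

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given step under the schedule described above."""
    if step < WARMUP_STEPS:
        # Linear warmup from START_LR up to PEAK_LR.
        return START_LR + (PEAK_LR - START_LR) * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```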
+
+ ### Framework Versions
+ - **PyTorch**: 2.8.0+cu129
+ - **Transformers**: 4.57.0.dev0
+ - **Flash Attention**: 2.x
+ - **CUDA**: 12.9
+
+ ## Evaluation Results
+
+ ### Benchmark Performance (1-shot evaluation)
+
+ | Task | Accuracy |
+ |------|----------|
+ | ARC-Easy | 39.14% |
+ | HellaSwag | 26.55% |
+ | MMLU | 24.25% |
+ | ARC-Challenge | 17.24% |
+
+ *All evaluations were performed in a 1-shot setting.*
+
+ ## Model Architecture Components
+
+ ### Fourier Analysis Network (FANLayer)
+ Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":
+ ```
+ FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W̄_p X + B̄_p)]
+ ```
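A toy version of this forward pass in plain Python, assuming (per the FANformer paper) that the periodic branch `W_p` and the linear branch `(W̄_p, B̄_p)` are separate projections. The helpers `matvec` and `fan_layer` and the tiny dimensions are illustrative only; in NeoLLM a fraction fan_ratio = 0.125 of the output width would come from the cos/sin branches.

```python
import math

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def fan_layer(w_p, w_bar, b_bar, x):
    """FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W̄_p X + B̄_p)]."""
    p = matvec(w_p, x)  # periodic branch pre-activation
    linear = [h + b for h, b in zip(matvec(w_bar, x), b_bar)]
    # Concatenate the cos, sin, and linear parts along the feature axis.
    return [math.cos(v) for v in p] + [math.sin(v) for v in p] + linear
```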
+
+ ### LayerNorm Scaling (LNS)
+ Implements the scaling factor 1/√ℓ, where ℓ is the 1-indexed layer depth, as described in "The Curse of Depth in Large Language Models":
+ ```
+ h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
+ ```
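A minimal sketch of this rule, with a plain reimplementation of LayerNorm for illustration (not NeoLLM's actual module):

```python
import math

def layer_norm(x, eps=1e-5):
    """Standard LayerNorm without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def lns(x, layer_idx):
    """LayerNorm Scaling: damp the LayerNorm output by 1/sqrt(layer depth)."""
    scale = 1.0 / math.sqrt(layer_idx)
    return [v * scale for v in layer_norm(x)]
```

Deeper layers thus receive progressively smaller normalized activations, which the paper argues counteracts variance growth in deep pre-LN stacks.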
+
+ ### Gradient-Preserving Activation Scaling (GPAS)
+ Scales activations in the forward pass while using stop-gradient operations so that the backward gradients are not penalized.
+
+ ### Polynomial Composition Activations (PolyNorm)
+ A custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
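As a rough sketch of the general PolyNorm form from that paper: element-wise powers of the input are each RMS-normalized and then combined with learned weights. The weights here are fixed constants for illustration, and the exact NeoLLM implementation may differ.

```python
import math

def rms_norm(x, eps=1e-6):
    """Normalize a vector by its root-mean-square magnitude."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def polynorm(x, weights=(0.5, 0.25, 0.25), bias=0.0):
    """PolyNorm-style activation: weighted sum of normalized powers x^1..x^3."""
    out = [bias] * len(x)
    for i, w in enumerate(weights, start=1):
        powered = rms_norm([v ** i for v in x])
        out = [o + w * p for o, p in zip(out, powered)]
    return out
```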
+
+ ### Gated Delta Networks
+ Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
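The gated delta rule can be written as the recurrence S_t = a_t * S_{t-1} * (I - b_t k_t k_t^T) + b_t v_t k_t^T with output o_t = S_t q_t, where a_t is a decay gate and b_t a write strength. A toy, unchunked sketch under that assumption (the real layer runs a chunked, hardware-efficient variant, and all names here are illustrative):

```python
def gated_delta_step(S, k, v, q, alpha, beta):
    """One recurrent step: decay the state, apply the delta-rule
    correction toward (k, v), then read out with the query q."""
    d = len(k)
    # Current prediction of v from the state: S k
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    # S <- alpha * (S - beta * (S k) k^T) + beta * v k^T
    new_S = [[alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
              for j in range(d)] for i in range(d)]
    # Output: o = S q
    out = [sum(new_S[i][j] * q[j] for j in range(d)) for i in range(d)]
    return new_S, out
```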
+
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+ - Research into hybrid attention architectures
+ - Educational purposes: understanding advanced LLM components
+ - Small-scale language modeling experiments
+ - Benchmarking novel architectural components
+
+ ### Limitations
+ - The small model size (110M parameters) limits capability compared to larger models
+ - Training was limited to 4M samples from a single dataset
+ - Performance is below state-of-the-art models on standard benchmarks
+ - The experimental architecture has not been validated for production stability
+
+ ### Recommendations
+ - Best suited for research and educational applications
+ - Consider fine-tuning for specific downstream tasks
+ - Monitor performance carefully if adapting for production use
+
+ ## Training Infrastructure
+
+ - **Mixed Precision**: BF16 for numerical stability
+ - **Compilation**: torch.compile with max-autotune mode