KitsuVp committed on
Commit 9a7ec7b · verified · 1 Parent(s): 3404b3b

Model save

Files changed (4):
  1. README.md +49 -183
  2. config.json +6 -7
  3. model.safetensors +2 -2
  4. training_args.bin +1 -1
README.md CHANGED
@@ -1,192 +1,58 @@
 ---
 library_name: transformers
 tags:
-- pytorch
-- neollm
-- hybrid-attention
-- fanformer
-- gated-delta-networks
-- polynomial-activations
-- fineweb-edu
-- ademamix
-- custom-scheduler
-- flash-attention
-- torch-compile
-pipeline_tag: text-generation
+- generated_from_trainer
 model-index:
 - name: NeoLLM
-  results:
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: ARC-Easy
-    metrics:
-    - type: accuracy
-      value: 39.14
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: HellaSwag
-    metrics:
-    - type: accuracy
-      value: 26.55
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: MMLU
-    metrics:
-    - type: accuracy
-      value: 24.25
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: ARC-Challenge
-    metrics:
-    - type: accuracy
-      value: 17.24
-license: apache-2.0
-datasets:
-- HuggingFaceFW/fineweb-edu
-language:
-- en
+  results: []
 ---
 
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
 # NeoLLM
 
-NeoLLM is a hybrid architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M parameter model demonstrates novel architectural innovations including Fourier Analysis Networks, hybrid attention mechanisms, and advanced normalization techniques.
-
-## Model Description
-
-NeoLLM incorporates several cutting-edge components:
-
-- **FANformer Integration**: Fourier Analysis Network (FAN) layers for effective periodicity modeling with fan_ratio of 0.125
-- **Hybrid Attention Architecture**: Follows Qwen3-Next's approach with 1 full attention layer per 3 linear attention layers
-- **Polynomial Composition Activations**: PolyNorm activation functions in MLP layers for enhanced dynamics
-- **Advanced Normalization**: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
-- **Efficient Linear Attention**: Gated Delta Networks for improved computational efficiency
-
-## Installation
-
-Before using this model, install the required dependencies:
-
-```bash
-pip install git+https://github.com/huggingface/transformers.git@main
-pip install "cut-cross-entropy @ git+https://github.com/apple/ml-cross-entropy.git"
-pip install flash-linear-attention
-```
-
-### Architecture Details
-
-- **Model Size**: 110M parameters (77M embeddings + 33M non-embeddings)
-- **Hidden Size**: 512
-- **Layers**: 12 layers with hybrid attention pattern
-- **Attention Heads**: 8 (with 2 KV heads using Grouped Query Attention)
-- **Intermediate Size**: 1024
-- **Sequence Length**: 512 tokens
-- **Vocabulary**: 151,665 tokens (Qwen3 tokenizer)
-
-### Layer Pattern
-The model uses a hybrid attention pattern where layers alternate between:
-- **Linear Attention**: Layers 1,2,3,5,6,7,9,10,11 (Gated Delta Networks)
-- **Full Attention**: Layers 4,8,12 (Flash Attention 2)
-
-## Training Details
-
-### Dataset
-- **Source**: FineWeb-Edu (sample-10BT subset)
-- **Training Samples**: 4 million examples
-- **Validation Split**: 1% (40,000 samples)
-- **Text Processing**: Dynamic truncation to 4x block_size during tokenization
-- **Tokenizer**: Qwen3 Fast Tokenizer with weight tying enabled
-
-### Training Configuration
-- **Hardware**: NVIDIA RTX 5090
-- **Training Time**: 3 hours
-- **Loss Function**: Cut Your Losses (from "Cut Your Losses in Large-Vocabulary Language Models") - NOT standard Cross-Entropy
-- **Optimizer**: AdEMAMix with parameters:
-  - Betas: (0.9, 0.999, 0.999)
-  - Alpha: 5.0
-  - t_alpha: 5000, t_beta3: 5000
-  - Weight decay: 0.1
-- **Learning Rate Schedule**: Custom cosine with linear warmup
-  - Start LR: 3e-4
-  - Peak LR: 6e-4 (at 5000 warmup steps)
-  - Min LR: 6e-5
-- **Batch Size**: 64 per device
-- **Precision**: BF16 with torch.compile optimization
-- **Hardware Optimizations**: Flash Attention 2
-- **Epochs**: 1
-
-### Framework Versions
-- **PyTorch**: 2.8.0+cu129
-- **Transformers**: 4.57.0.dev0
-- **Flash Attention**: 2.x
-- **CUDA**: 12.9
-
-## Evaluation Results
-
-### Benchmark Performance (1-shot evaluation)
-
-| Task | Score |
-|------|-------|
-| ARC-Easy | 39.14% |
-| HellaSwag | 26.55% |
-| MMLU | 24.25% |
-| ARC-Challenge | 17.24% |
-
-*All evaluations performed in few-shot (1-shot) setting*
-
-## Model Architecture Components
-
-### Fourier Analysis Network (FANLayer)
-Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":
-```
-FANLayer'(X) = [cos(WpX)||sin(WpX)||(WpX + Bp)]
-```
-
-### LayerNorm Scaling (LNS)
-Implements scaling factor 1/√ℓ as described in "The Curse of Depth in Large Language Models":
-```
-h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
-```
-
-### Gradient-Preserving Activation Scaling (GPAS)
-Scales activations without penalizing gradients using stop-gradient operations.
-
-### Polynomial Composition Activations (PolyNorm)
-Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
-
-### Gated Delta Networks
-Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
-
-## Intended Uses & Limitations
-
-### Intended Uses
-- Research into hybrid attention architectures
-- Educational purposes for understanding advanced LLM components
-- Small-scale language modeling experiments
-- Benchmarking novel architectural components
-
-### Limitations
-- Relatively small model size (110M parameters) limits capability compared to larger models
-- Training limited to 4M samples from single dataset
-- Performance below state-of-the-art models on standard benchmarks
-- Experimental architecture may have stability considerations in production
-
-### Recommendations
-- Best suited for research and educational applications
-- Consider fine-tuning for specific downstream tasks
-- Monitor performance carefully if adapting for production use
-
-## Training Infrastructure
-
-- **Mixed Precision**: BF16 for numerical stability
-- **Compilation**: torch.compile with max-autotune mode
+This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+It achieves the following results on the evaluation set:
+- Loss: 3.8652
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 0.0006
+- train_batch_size: 64
+- eval_batch_size: 64
+- seed: 42
+- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 1
+
+### Training results
+
+| Training Loss | Epoch | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 4.2056 | 0.3840 | 3000 | 4.2055 |
+| 3.8841 | 0.7680 | 6000 | 3.8652 |
+
+### Framework versions
+
+- Transformers 4.57.0.dev0
+- Pytorch 2.8.0+cu129
+- Datasets 3.6.0
+- Tokenizers 0.22.1
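The removed model card describes the original run's schedule: linear warmup from a start LR of 3e-4 to a 6e-4 peak at 5,000 steps, then cosine decay to a 6e-5 floor. As a minimal sketch of what such a schedule computes (the function name `lr_at_step` and its signature are illustrative, not the repo's actual code):

```python
import math

def lr_at_step(step, total_steps, warmup_steps=5000,
               start_lr=3e-4, peak_lr=6e-4, min_lr=6e-5):
    """Linear warmup from start_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return start_lr + frac * (peak_lr - start_lr)
    # Cosine decay over the remaining steps, ending at min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note that the auto-generated card above instead reports `lr_scheduler_type: linear` with a 0.1 warmup ratio, so this sketch matches the earlier card's description, not the saved Trainer arguments.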
config.json CHANGED
@@ -2,15 +2,14 @@
   "architectures": [
     "NeoLLMForCausalLM"
   ],
-  "auto_map": {
-    "AutoConfig": "configuration_neollm.NeoLLMConfig",
-    "AutoModel": "modeling_neollm.NeoLLMModel",
-    "AutoModelForCausalLM": "modeling_neollm.NeoLLMForCausalLM"
-  },
   "attention_bias": false,
   "attention_dropout": 0.1,
+  "auto_map": {
+    "AutoConfig": "configuration_unified.UnifiedModelConfig",
+    "AutoModel": "modeling_unified.UnifiedModel",
+    "AutoModelForCausalLM": "modeling_unified.UnifiedModel"
+  },
   "dropout_rate": 0.1,
-
   "dtype": "bfloat16",
   "eos_token_id": 151645,
   "fan_ratio": 0.125,
@@ -18,7 +17,7 @@
   "hidden_act": "xielu",
   "hidden_size": 512,
   "initializer_range": 0.02,
-  "intermediate_size": 1024,
+  "intermediate_size": 1536,
   "layer_types": [
     "linear_attention",
     "linear_attention",
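The `layer_types` array being diffed here encodes the hybrid pattern from the removed model card: full attention on layers 4, 8 and 12, linear attention (Gated Delta Networks) elsewhere. A small sketch of how such a list can be generated; `make_layer_types` is an illustrative helper, and the `"full_attention"` label is an assumption since the hunk only shows the `"linear_attention"` entries:

```python
def make_layer_types(num_layers=12, full_attn_every=4):
    """One full-attention layer per (full_attn_every - 1) linear-attention layers."""
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "linear_attention"
        for i in range(num_layers)
    ]

layer_types = make_layer_types()
# Full attention lands on layers 4, 8 and 12 (1-indexed), matching the card.
```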
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dc073ea209fede7edbf7c6eaf470b935fcafade692d86dfcb001e52da1df45e7
-size 219053832
+oid sha256:673a2ad3e9fb95397d7c50a0d7023b13ddd589eb5b9205b3370e9da8be1d4991
+size 231636744
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:bd062dd4a82d5ccb7b2eb217167c2361df73816e8fecd313bccdfc47eca850b0
+oid sha256:1a7ed46ac173cd670ec0cb96d3ba813baf4fad6c4f51be08a8e3127610528168
 size 5969
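For reference, the FANLayer formula quoted in the removed model card, FANLayer'(X) = [cos(WpX)||sin(WpX)||(WpX + Bp)], concatenates periodic and linear branches of a shared projection. A literal, single-vector reading of that formula in plain Python (the real FANformer splits hidden dimensions by `fan_ratio` and uses learned projections; this is only a sketch of the quoted equation):

```python
import math

def fan_layer(x, Wp, Bp):
    """FANLayer'(x) = [cos(Wp x) || sin(Wp x) || (Wp x + Bp)] for one vector x.

    Wp is a list of rows (output dim x input dim), Bp a bias per output dim.
    """
    z = [sum(w * xi for w, xi in zip(row, x)) for row in Wp]  # Wp x
    return ([math.cos(v) for v in z]
            + [math.sin(v) for v in z]
            + [v + b for v, b in zip(z, Bp)])
```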