Austin207 committed (verified) · Commit 445fd2d · 1 parent: a683148

Update README.md

Files changed (1): README.md (+275, −3)
---
language:
- en
license: mit
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- rope
- rmsnorm
- swiglu
- flash-attention
- 16k-context
pipeline_tag: text-generation
widget:
- text: "The future of artificial intelligence is"
  example_title: "AI Future"
- text: "Write a short story about"
  example_title: "Story Generation"
- text: "Explain quantum computing in simple terms:"
  example_title: "Technical Explanation"
datasets:
- tiiuae/falcon-refinedweb
metrics:
- perplexity
model-index:
- name: MAP-NEO Mini
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: RefinedWeb (100K subset)
      type: tiiuae/falcon-refinedweb
    metrics:
    - type: loss
      value: 3.9
      name: Final Training Loss
---

# MAP-NEO Mini

## Model Description

**MAP-NEO Mini** is a 253M-parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that a capable small language model can be trained efficiently on modest hardware through careful data curation and architectural choices.

- **Developed by**: Antony Austin (Austin207)
- **Model type**: Autoregressive Language Model
- **Language(s)**: English
- **License**: MIT
- **Architecture**: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention

## Key Features

- 🚀 **Efficient Training**: Trained on an RTX 5070 (8GB VRAM) in ~4 hours
- 📏 **Extended Context**: 16,384-token context window (16x the typical small-model window)
- 🧠 **Memory Efficient**: Only 1.3GB VRAM for inference over 1,800 tokens
- ⚡ **Fast Inference**: ~10 tokens/second on a consumer GPU
- 🎯 **High-Quality Data**: Trained on a curated RefinedWeb subset

## Architecture Details

### Model Architecture
- **Parameters**: 253,085,696 (253M)
- **Layers**: 16 transformer blocks
- **Hidden Size**: 1,024
- **Attention Heads**: 16
- **Head Dimension**: 64
- **FFN Hidden Size**: 2,736 (~2.67x hidden size)
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 16,384 tokens

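For reference, these numbers map onto a configuration object roughly as sketched below. The field names are illustrative assumptions and may differ from the actual `NeoMiniConfig` defined in `model_neo.py`.

```python
from dataclasses import dataclass

@dataclass
class NeoMiniConfigSketch:
    """Illustrative config mirroring the table above (field names are assumptions)."""
    vocab_size: int = 50257      # GPT-2 tokenizer vocabulary
    n_layers: int = 16           # transformer blocks
    hidden_size: int = 1024      # model width
    n_heads: int = 16            # attention heads (head_dim = 1024 / 16 = 64)
    ffn_hidden_size: int = 2736  # ~2.67x hidden size, used by the SwiGLU FFN
    max_seq_len: int = 16384     # extended context window
    rope_theta: float = 10000.0  # RoPE base frequency (assumed default)
    norm_eps: float = 1e-6       # RMSNorm epsilon (assumed default)
    tie_embeddings: bool = True  # share input/output embedding weights
```

Assuming bias-free linear layers and tied embeddings, the count works out to roughly 51.5M embedding parameters (50,257 × 1,024) plus about 12.6M per block across 16 blocks, which lines up with the reported 253M total.
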
### Architectural Innovations
- **RMSNorm**: Root Mean Square Layer Normalization for training stability
- **RoPE**: Rotary Positional Embeddings for better positional understanding
- **SwiGLU**: Swish-Gated Linear Units for improved FFN performance
- **Flash Attention**: Memory-efficient attention computation
- **Weight Tying**: Input/output embeddings shared for parameter efficiency

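A minimal PyTorch sketch of the normalization and FFN components, for readers unfamiliar with them; this is not the exact code in `model_neo.py`, and the dimensions are taken from the table above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescales by the RMS of activations, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated projection in place of a plain MLP."""
    def __init__(self, dim: int = 1024, hidden: int = 2736):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Flash Attention is typically reached through `torch.nn.functional.scaled_dot_product_attention` or the `flash-attn` package; which path `model_neo.py` takes is not specified here.
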
## Training Data

### Dataset
- **Source**: `tiiuae/falcon-refinedweb` (curated subset)
- **Size**: 100,000 high-quality web documents
- **Tokens**: ~41 million tokens
- **Sequence Length**: 1,024 tokens per sequence
- **Sequences**: 40,965 packed sequences

### Data Quality
- Length filtering: 200-10,000 characters
- Language detection: English only
- Quality scoring: heuristic filters to retain high-quality web content
- Deduplication: Exact and near-duplicate removal

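A simplified sketch of this kind of filtering pipeline; the actual curation scripts are not part of this repository, so the heuristics below are illustrative only.

```python
import hashlib

def keep_document(text: str) -> bool:
    """Illustrative filter: length bounds plus a crude stand-in for English detection."""
    if not (200 <= len(text) <= 10_000):        # length filtering: 200-10,000 characters
        return False
    ascii_ratio = sum(c.isascii() for c in text) / len(text)
    return ascii_ratio > 0.9                    # a real pipeline would use a language classifier

def dedup_key(text: str) -> str:
    """Exact-duplicate key; near-duplicate removal would use shingling/MinHash instead."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def filter_corpus(docs):
    """Yield documents that pass the quality filter and have not been seen before."""
    seen = set()
    for doc in docs:
        if not keep_document(doc):
            continue
        key = dedup_key(doc)
        if key not in seen:
            seen.add(key)
            yield doc
```
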
## Training Procedure

### Training Configuration
- **Hardware**: NVIDIA RTX 5070 (8GB VRAM)
- **Precision**: bfloat16 mixed precision
- **Batch Size**: 1 per device
- **Gradient Accumulation**: 32 steps
- **Effective Batch Size**: 32
- **Learning Rate**: 3e-4
- **Scheduler**: Cosine with linear warmup
- **Warmup Steps**: 3,750
- **Total Steps**: 150,000
- **Training Time**: ~4 hours

### Optimization Details
- **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- **Gradient Clipping**: 1.0
- **Gradient Checkpointing**: Enabled for memory efficiency
- **Loss Function**: Cross-entropy loss

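Taken together, this is the standard AdamW + cosine-warmup + gradient-accumulation recipe. The sketch below is a generic reconstruction under those settings, not the repository's actual training script; it assumes the model maps `(B, T)` token ids to `(B, T, vocab)` logits.

```python
import math
import torch
import torch.nn.functional as F

def build_optimizer_and_scheduler(model, total_steps=150_000, warmup_steps=3_750):
    """AdamW with the listed betas/weight decay, cosine decay after linear warmup."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                            betas=(0.9, 0.95), weight_decay=0.01)

    def lr_lambda(step):
        if step < warmup_steps:                                   # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay toward 0

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

def train_step(model, micro_batches, opt, sched, accum=32):
    """One optimizer step from 32 micro-batches of batch size 1 (effective batch size 32)."""
    opt.zero_grad(set_to_none=True)
    for input_ids in micro_batches:                               # each: (1, 1024) token ids
        with torch.autocast("cuda", dtype=torch.bfloat16):        # bfloat16 mixed precision
            logits = model(input_ids)                             # assumed (B, T, vocab) output
            loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   input_ids[:, 1:].reshape(-1))  # next-token cross-entropy
        (loss / accum).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)       # gradient clipping at 1.0
    opt.step()
    sched.step()
```
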
### Context Extension
- **Base Context**: 2,048 tokens
- **Extended Context**: 16,384 tokens
- **Method**: Linear interpolation of positional embeddings
- **Validation**: Successfully tested up to 3,600 tokens

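For RoPE models, linear interpolation usually means compressing positions by the extension factor (here 16,384 / 2,048 = 8) before computing the rotary angles. The exact scaling used in this repository is not shown, so the sketch below is an assumption of that standard approach.

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int = 64,
                base: float = 10000.0, scale: float = 16384 / 2048):
    """RoPE angles with linear position interpolation: positions are divided by `scale`
    so 16,384-token sequences reuse the rotation range seen during 2,048-token training."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_pos = positions.float() / scale            # the interpolation step
    angles = torch.outer(scaled_pos, inv_freq)        # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()                 # applied to queries/keys per head
```
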
## Performance

### Training Metrics
- **Final Loss**: 3.907
- **Training Speed**: ~10 iterations/second
- **Peak Memory**: ~8GB VRAM
- **Convergence**: Smooth loss curve, no overfitting

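For context, a final cross-entropy loss of 3.907 corresponds to a perplexity of exp(3.907) ≈ 50 on the training distribution.
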
### Inference Performance
- **Speed**: ~10 tokens/second (RTX 5070)
- **Memory Usage**: 1.3GB for 1,800-token context
- **Context Limit**: ~3,600 tokens practical limit
- **Temperature**: Recommended 0.7-0.9 for creative tasks

## Usage

### Quick Start
```python
import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig

# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt", map_location="cpu")
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))  # first (and only) sequence in the batch
```

### Interactive Chat
```bash
python interactive_chat.py
```

### Generation Parameters
- **Temperature**: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
- **Top-k**: 40-50
- **Top-p**: 0.8-0.9
- **Repetition Penalty**: 1.1-1.3

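If `NeoMini.generate` does not expose all of these knobs, the same strategy can be applied to raw logits. The step below is an illustrative, generic implementation of temperature, top-k, top-p, and repetition-penalty sampling, not the model's actual API.

```python
import torch

def sample_next_token(logits: torch.Tensor, generated_ids: list[int],
                      temperature=0.8, top_k=40, top_p=0.9, repetition_penalty=1.2) -> int:
    """One sampling step over a (vocab,) logits vector; generated_ids are prior token ids."""
    logits = logits.clone()
    # Repetition penalty: dampen logits of tokens already generated.
    for tok in set(generated_ids):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 else logits[tok] * repetition_penalty
    logits = logits / temperature
    # Top-k: keep only the k most likely tokens.
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative mass stays within top_p.
    sorted_probs, order = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
    keep[0] = True                                    # always keep the single best token
    kept = order[keep]
    probs = torch.zeros_like(probs).scatter_(0, kept, probs[kept])
    probs = probs / probs.sum()
    choice = torch.multinomial(probs, 1)
    return topk_idx[choice].item()
```
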
## Limitations

### Current Limitations
- **Base Model Only**: Not instruction-tuned (requires fine-tuning for chat)
- **Context Window**: Practical limit of ~3,600 tokens despite the 16K architecture
- **Hardware Requirements**: A CUDA-capable GPU is needed for reasonable speed
- **Knowledge**: No fixed knowledge cutoff; the model only reflects patterns in its web training data

### Known Issues
- Occasionally generates repetitive patterns (mitigated by fine-tuning or a repetition penalty)
- May not follow instructions well (expected base-model behavior)
- Sometimes produces formatting artifacts inherited from web data

## Ethical Considerations

### Bias and Fairness
- Trained on web data, which may contain societal biases
- No explicit bias mitigation applied during training
- Users should be aware of potentially biased outputs

### Use Cases
**Intended Uses:**
- Research and experimentation
- Text generation and completion
- Creative writing assistance
- Educational purposes

**Out-of-Scope Uses:**
- Medical or legal advice
- High-stakes decision making
- Content that could cause harm

## Environmental Impact

### Carbon Footprint
- **Training Hardware**: Single RTX 5070 (200W)
- **Training Time**: 4 hours
- **Estimated CO₂**: ~0.3 kg CO₂ equivalent
- **Efficiency**: 253M parameters per 0.3 kg CO₂

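This estimate is consistent with roughly 200 W × 4 h = 0.8 kWh of GPU energy at a grid intensity of about 0.35-0.4 kg CO₂/kWh (the intensity figure is an assumption, not a measurement).
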
## Model Card Authors

Antony Austin - model development and training
30/08/2025 - model card creation

## Citation

```bibtex
@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/map-neo-mini}},
  note={Trained on NVIDIA RTX 5070 with RefinedWeb data}
}
```

## Technical Details

### Files Structure
```
map-neo-mini/
├── config.json              # Model configuration
├── pytorch_model.bin        # Model weights
├── tokenizer.json           # Tokenizer configuration
├── tokenizer_config.json    # Tokenizer metadata
├── special_tokens_map.json  # Special tokens
├── vocab.json               # Vocabulary
├── merges.txt               # BPE merges
└── model_neo.py             # Model architecture code
```

### Hardware Requirements
- **Minimum**: 4GB VRAM for inference
- **Recommended**: 8GB VRAM for extended context
- **Training**: 8GB+ VRAM with mixed precision
- **CPU**: Any modern CPU (inference possible but slow)

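These figures line up with the model's size: 253M parameters in bfloat16 occupy roughly 253M × 2 bytes ≈ 0.5 GB for weights alone, with the remainder of the reported 1.3 GB going to the KV cache and activations (a rough back-of-the-envelope split, not a measurement).
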
## Future Work

### Planned Improvements
- [ ] Conversational fine-tuning with UltraChat dataset
- [ ] Instruction following capabilities
- [ ] Multi-language support
- [ ] Quantized versions (4-bit, 8-bit)
- [ ] ONNX export for edge deployment

### Research Directions
- Context window optimization beyond 16K
- More efficient attention mechanisms
- Improved training data curation
- Specialized domain fine-tuning

## Acknowledgments

- **Falcon RefinedWeb**: High-quality training data
- **Hugging Face**: Transformers library and infrastructure
- **Community**: Open-source ML community for architectural insights

---

**Last Updated**: August 30, 2025
**Model Version**: 1.0.0
**Status**: Base model (pre-conversational fine-tuning)