# RNJ-1: Building from Scratch

A complete PyTorch implementation of the **RNJ-1** (pronounced "range-1") architecture, following the design principles of the model developed by Essential AI, led by Ashish Vaswani (co-author of "Attention Is All You Need").

## Overview

This is a **configurable implementation** of the RNJ-1 architecture that allows you to build models of various sizes. The original RNJ-1 is an 8.3B-parameter model optimized for code generation, agentic tasks, and STEM problem solving, but this implementation lets you adjust the model size to fit your needs and available resources.

**The actual parameter count is calculated at runtime** and depends on your configuration in `RNJ1_CONFIG`. The default configuration targets the full RNJ-1 architecture (~8.3B), but you can easily modify it to create smaller models for testing or for GPUs with limited memory.

### Key Facts

- **Parameters**: Fully configurable - actual count calculated at runtime (see `RNJ1_CONFIG` in `rnj1.py`)
- **Default Config**: Targets ~8.3B parameters (can be modified for smaller models)
- **Context Length**: Configurable (default 32K tokens)
- **License**: Apache 2.0
- **Architecture**: Based on Gemma 3, with key simplifications
- **Original Model**: Essential AI (led by Transformer co-inventor Ashish Vaswani)
## Architecture

### Model Specifications

The model configuration is defined in `RNJ1_CONFIG` in `rnj1.py`. You can modify these values to create models of any size:

| Hyperparameter | Default Value | Config Key | Notes |
|----------------|---------------|------------|-------|
| **Number of Layers** | 32 | `n_layers` | Main size factor |
| **Model Dimension** | 4096 | `emb_dim` | Affects all layers |
| **MLP Dimension** | 16384 | `hidden_dim` | Typically 4x emb_dim |
| **Number of Attention Heads** | 32 | `n_heads` | Should divide emb_dim |
| **Number of Key-Value Heads** | 8 | `n_kv_groups` | GQA ratio |
| **Attention Head Dimension** | 128 | `head_dim` | Typically emb_dim/n_heads |
| **Vocabulary Size** | From tokenizer | `vocab_size` | Affects embedding size |
| **Tokenizer** | SentencePiece BPE | Auto-detected | With fallback options |
| **Context Length** | 32768 | `context_length` | Can be reduced |
| **Activation Function** | GeGLU | Fixed | In FeedForward class |
| **Tied Embeddings** | Yes | Fixed | Embedding and output head share weights |

**Important**:
- **Total Parameters**: Calculated automatically from the config above by the `count_parameters()` function and printed when you run the script - look for `Total trainable parameters: X,XXX,XXX (~X.XXB)` (a rough estimation sketch follows this list)
- To create a smaller model, modify `RNJ1_CONFIG` in `rnj1.py` before running
- The embedding layer size = `vocab_size Γ— emb_dim`, which can be significant
- Example: Reducing `emb_dim` from 4096 to 1024 and `n_layers` from 32 to 12 creates a much smaller model
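For orientation, here is a back-of-the-envelope estimate of how the configuration translates into a parameter count, assuming the layer layout described below (GQA projections, GeGLU feedforward, four RMSNorms per block, tied embeddings). The helper name `estimate_params` is illustrative and not part of `rnj1.py`; the authoritative number is still the one printed by `count_parameters()`.

```python
# Rough, illustrative estimate of the parameter count from an RNJ1_CONFIG-style dict.
def estimate_params(cfg: dict) -> int:
    emb = cfg["vocab_size"] * cfg["emb_dim"]                             # token embedding (tied with output head)
    attn = 2 * cfg["emb_dim"] * cfg["n_heads"] * cfg["head_dim"]         # W_q and W_o
    attn += 2 * cfg["emb_dim"] * cfg["n_kv_groups"] * cfg["head_dim"]    # W_k and W_v
    ffn = 3 * cfg["emb_dim"] * cfg["hidden_dim"]                         # GeGLU: gate, up, and down projections
    norms = 4 * cfg["emb_dim"]                                           # four RMSNorms per block
    return emb + cfg["n_layers"] * (attn + ffn + norms) + cfg["emb_dim"] # plus the final norm

# With the default values from the table (vocab_size=128_000, emb_dim=4096, n_heads=32,
# head_dim=128, n_kv_groups=8, n_layers=32, hidden_dim=16_384) this comes out to ~8.3B,
# of which ~524M sit in the embedding layer alone.
```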
### Key Architectural Features

1. **Global Attention Only**: Unlike Gemma 3's hybrid sliding window + global attention, RNJ-1 uses **only global attention** throughout all layers. This provides full context awareness at every layer, which is beneficial for code and agentic tasks.

2. **Standard RoPE**: Uses a single RoPE (Rotary Position Embedding) with `theta_base = 10,000`. Context extension from 8K to 32K is handled via YaRN (Yet another RoPE extensioN) during mid-training.

3. **GeGLU Activation**: Uses the GeGLU (gated GeLU) activation function in the feedforward network, which provides better expressiveness than standard GeLU.

4. **Grouped Query Attention (GQA)**: 32 query heads with 8 KV heads (4:1 ratio), providing memory efficiency while maintaining performance.

5. **QK Normalization**: Uses query-key normalization for training stability.

6. **4 RMSNorm Layers**: Normalization both before and after the attention and feedforward sub-layers, giving 4 RMSNorm layers per transformer block (a minimal sketch of this layout follows the list):
   - `input_layernorm` (pre-attention)
   - `post_attention_layernorm` (post-attention, pre-residual)
   - `pre_feedforward_layernorm` (pre-feedforward)
   - `post_feedforward_layernorm` (post-feedforward, pre-residual)
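A minimal sketch of the block layout implied by features 3 and 6, with attention abstracted behind a generic module. Class and attribute names follow the description above and may not match `rnj1.py` exactly (in particular, `nn.RMSNorm` stands in for the script's custom zero-centered RMSNorm).

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """GeGLU feedforward: down( GeLU(gate(x)) * up(x) )."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.gelu(self.gate_proj(x)) * self.up_proj(x))

class TransformerBlock(nn.Module):
    """One block with four norms: each sub-layer is normalized before and after, then added to the residual."""
    def __init__(self, emb_dim: int, hidden_dim: int, attention: nn.Module):
        super().__init__()
        self.input_layernorm = nn.RMSNorm(emb_dim)
        self.post_attention_layernorm = nn.RMSNorm(emb_dim)
        self.pre_feedforward_layernorm = nn.RMSNorm(emb_dim)
        self.post_feedforward_layernorm = nn.RMSNorm(emb_dim)
        self.attention = attention                      # e.g. GQA with QK normalization
        self.feedforward = FeedForward(emb_dim, hidden_dim)

    def forward(self, x):
        x = x + self.post_attention_layernorm(self.attention(self.input_layernorm(x)))
        x = x + self.post_feedforward_layernorm(self.feedforward(self.pre_feedforward_layernorm(x)))
        return x
```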
## Differences from Gemma 3

| Feature | Gemma 3 | RNJ-1 |
|---------|---------|-------|
| **Attention** | Hybrid sliding window (5:1 pattern) | Global attention only |
| **RoPE** | Dual RoPE (10K for local, 1M for global) | Single RoPE (10K, extended via YaRN) |
| **Activation** | GeLU | GeGLU |
| **Context Length** | 128K (native) | 32K (extended from 8K) |
| **Optimizer** | AdamW | Muon (custom) |
| **Focus** | General-purpose | Code & STEM |
## Installation

### Requirements

```bash
pip install torch numpy transformers datasets tqdm matplotlib
```
### GPU Requirements

Memory requirements depend on your model configuration:

- **With the default config** (targeting ~8.3B):
  - **Recommended**: NVIDIA A100 (40GB+) or H100
  - **Minimum**: NVIDIA L4 (24GB) with reduced batch size
  - **Memory**: ~35-40GB VRAM (batch_size=16, block_size=128)

- **For smaller models** (modify `RNJ1_CONFIG`):
  - Reduce `emb_dim`, `n_layers`, and `hidden_dim` in `RNJ1_CONFIG`
  - Can run on GPUs with 8-16GB VRAM with appropriate reductions
  - Example smaller config: `emb_dim=1024, n_layers=12, hidden_dim=4096`
  - **Check the printed parameter count** to see your actual model size
## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_id = "EssentialAI/rnj-1-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a Python function to calculate factorial"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,   # required for temperature/top_p to take effect
    temperature=0.2,
    top_p=0.95
)

response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
### Training from Scratch

The complete training script is provided in `rnj1.py`. It includes:

1. **Dataset Loading**: TinyStories dataset (ideal for small language models)
2. **Tokenization**: SentencePiece BPE tokenizer with 128K vocabulary
3. **Model Architecture**: Complete RNJ-1 implementation
4. **Training Loop**: With mixed precision, gradient accumulation, and learning rate scheduling
#### Training Configuration

```python
# Training hyperparameters (from rnj1.py)
batch_size = 16
block_size = 128
learning_rate = 1e-4
max_iters = 150000
warmup_steps = 1000
gradient_accumulation_steps = 32
```
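To show how these hyperparameters fit together, here is a hedged sketch of a training step with linear warmup plus cosine decay, bfloat16 autocast, and gradient accumulation. `model` and `get_batch` stand in for the objects defined in `rnj1.py`, and the exact schedule and loss interface there may differ.

```python
import math
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.1)

def lr_at(step: int) -> float:
    # Linear warmup, then cosine decay to 10% of the peak learning rate.
    if step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_iters - warmup_steps)
    return learning_rate * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))

for step in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.zero_grad(set_to_none=True)
    for _ in range(gradient_accumulation_steps):
        xb, yb = get_batch("train")                       # (batch_size, block_size) token IDs
        with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
            _, loss = model(xb, yb)                       # assumed to return (logits, loss)
        (loss / gradient_accumulation_steps).backward()   # average over accumulated micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```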
#### Model Configuration

The model size is determined by `RNJ1_CONFIG` in the script. The actual parameter count is calculated at runtime and printed during initialization. To create a smaller model, modify the configuration:

```python
# Example: Smaller model for testing
RNJ1_CONFIG = {
    "vocab_size": vocab_size,      # From tokenizer
    "emb_dim": 1024,               # Reduced from 4096
    "n_heads": 16,                 # Reduced from 32
    "head_dim": 64,                # Reduced from 128
    "n_kv_groups": 4,              # Reduced from 8
    "n_layers": 12,                # Reduced from 32
    "hidden_dim": 4096,            # Reduced from 16384
    "context_length": 2048,        # Reduced from 32K
    "rope_base": 10_000.0,
    "qk_norm": True,
    "dtype": torch.bfloat16,
}
```
#### Running Training

```bash
python rnj1.py
```

**What the script does:**
1. Loads the tokenizer (with fallback options if the RNJ-1 tokenizer is unavailable)
2. Downloads and tokenizes the TinyStories dataset (if `train.bin` doesn't exist)
3. Initializes the model with the `RNJ1_CONFIG` settings
4. **Prints the actual parameter count** (this is the real model size!)
5. Trains with mixed precision (bfloat16)
6. Saves the best model based on validation loss (`rnj1_model.pt`)
7. Generates sample text after training

**To see your actual model size**, look for this output when running:
```
Total trainable parameters: X,XXX,XXX (~X.XXB)
```
### Model Components

The implementation includes:

- **RoPE (Rotary Position Embeddings)**: Standard implementation with configurable base frequency
- **RMSNorm**: Zero-centered weights with `(1 + weight)` scaling (sketched below)
- **GroupedQueryAttention**: GQA with QK normalization
- **FeedForward**: GeGLU-based feedforward network
- **TransformerBlock**: Complete transformer block with 4 normalization layers
- **Rnj1Model**: Full model with token embeddings, transformer blocks, and output head
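The zero-centered RMSNorm is worth spelling out: the learnable weight starts at zero and is applied as `(1 + weight)`, so the layer begins as a plain RMS normalization. The sketch below mirrors that description rather than copying `rnj1.py`.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))    # zero-centered: initialized to 0

    def forward(self, x):
        # Normalize in float32 for stability, scale by (1 + weight), cast back.
        x_f32 = x.float()
        norm = x_f32 * torch.rsqrt(x_f32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (norm * (1.0 + self.weight.float())).to(x.dtype)
```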
## Performance

### Benchmarks

**Code Generation:**
- **HumanEval+**: Strong performance
- **MBPP+**: Strong performance
- **BigCodeBench**: Strong performance
- **SWE-bench**: 20.8% (exceptional for an 8B model)

**Mathematical Reasoning:**
- **GSM8K**: Strong performance
- **Minerva-MATH**: On par with the best models
- **AIME**: Outperforms or matches the best models

**STEM:**
- **GPQA-Diamond**: Close to the best similarly sized models
- **SuperGPQA**: Strong long-context reasoning
## Implementation Details

### Tokenizer

- **Type**: SentencePiece BPE
- **Vocabulary Size**: 128,000 tokens
- **Loading**: Uses the `EssentialAI/rnj-1` tokenizer with fallback options

### Data Type Handling

- **Training**: bfloat16 (preferred) or float16
- **Token IDs**: uint32 (required for vocab_size > 65536; see the sketch below)
- **Mixed Precision**: Automatic via `torch.amp.autocast`
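A short illustration of the token-ID dtype point: with a 128K vocabulary, IDs can exceed the uint16 range (0-65535), so the tokenized dataset must be written and memory-mapped as uint32. File names and the `tokenizer` object are illustrative and only loosely match the script's conventions.

```python
import numpy as np

ids = tokenizer.encode("Once upon a time...")            # token IDs, some of which exceed 65535
arr = np.array(ids, dtype=np.uint32)                     # uint16 would silently overflow here
arr.tofile("train.bin")

# The training loop can then read batches without loading the whole file into RAM:
data = np.memmap("train.bin", dtype=np.uint32, mode="r")
```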
### Memory Optimization

- **Gradient Accumulation**: Simulates a larger batch size without more memory
- **Mixed Precision**: Reduces memory usage
- **Gradient Checkpointing**: Can be added for further memory savings (see the sketch below)
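If you want to add gradient checkpointing, the usual pattern is to wrap each transformer block's forward call so activations are recomputed during the backward pass instead of stored. This is a sketch only; `self.blocks` and the block call signature are assumptions about how `rnj1.py` is structured.

```python
from torch.utils.checkpoint import checkpoint

def forward_blocks_with_checkpointing(self, x):
    for block in self.blocks:
        # use_reentrant=False is the recommended mode in recent PyTorch versions
        x = checkpoint(block, x, use_reentrant=False)
    return x
```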
## Key Features

1. **Complete Implementation**: All components from scratch in PyTorch
2. **Training Ready**: Full training loop with best practices
3. **Modular Design**: Easy to modify and extend
4. **Well Documented**: Inline comments explaining each component
5. **Production Ready**: Includes evaluation, checkpointing, and text generation
## Limitations & Notes

1. **Model Size**: The actual parameter count is **calculated and printed at runtime**. The default `RNJ1_CONFIG` targets ~8.3B parameters, but:
   - The actual size depends on `vocab_size` (from the tokenizer) and all config values
   - You can modify `RNJ1_CONFIG` to create much smaller models
   - For testing, many users reduce `emb_dim`, `n_layers`, and `hidden_dim` significantly
   - The embedding layer (`vocab_size Γ— emb_dim`) is often the largest single component (128,000 Γ— 4,096 β‰ˆ 524M parameters with the default config)

2. **Optimizer**: This implementation uses AdamW, while the original RNJ-1 uses the **Muon optimizer**. Muon reportedly provides superior token efficiency but is not included in this implementation.

3. **Training Scale**: The provided script uses the TinyStories dataset for demonstration. Full RNJ-1 training requires:
   - 8.4T tokens for pre-training (8K context)
   - 380B tokens for context extension (8K β†’ 32K)
   - 150B tokens for supervised fine-tuning

4. **Memory Requirements**: Memory usage depends on model size. The full 8.3B model needs substantial GPU memory; for smaller models, adjust `batch_size` and `block_size` to fit the available hardware, or reduce the model dimensions in `RNJ1_CONFIG`.

5. **Tokenizer Fallback**: If the RNJ-1 tokenizer is unavailable, the script falls back to the Llama 3.1 tokenizer (also a 128K-vocabulary BPE tokenizer). The actual `vocab_size` significantly affects the embedding layer size.
## File Structure

```
rnj-1/
β”œβ”€β”€ README.md                     # This file
β”œβ”€β”€ rnj1.py                       # Complete training script
β”œβ”€β”€ RNJ1_QUICK_REFERENCE.md       # Quick reference guide
β”œβ”€β”€ RNJ1_REVIEW.md                # Detailed model review
β”œβ”€β”€ RNJ1_TOKENIZER_INFO.md        # Tokenizer details
β”œβ”€β”€ RNJ1_VS_GEMMA3_COMPARISON.md  # Architecture comparison
└── linkedin_post_rnj1.md         # Social media post about the implementation
```
## References

1. **Essential AI Research Blog**: [essential.ai/research/rnj-1](https://www.essential.ai/research/rnj-1)
2. **Hugging Face Model**: [EssentialAI/rnj-1](https://huggingface.co/EssentialAI/rnj-1)
3. **Original Paper**: "Attention Is All You Need" (Vaswani et al., 2017)
4. **Gemma 3**: Architecture base for RNJ-1
5. **Google Colab**: [colab.research.google.com](https://colab.research.google.com/drive/1kwnLGHCDLXjeztkDoOuAS90dQIz2TgjU?usp=sharing)
## License

This implementation follows the Apache 2.0 license, matching the original RNJ-1 model.

## Acknowledgments

- **Essential AI** for releasing the open-weight RNJ-1 model
- **Ashish Vaswani** and team for the Transformer architecture and RNJ-1 development
- **Hugging Face** for model hosting and the `transformers` library
- **TinyStories** dataset creators for providing the training data
## Contributing

This is an educational implementation. For improvements or corrections:
1. Check existing documentation files for details
2. Verify against official RNJ-1 specifications
3. Test on appropriate hardware
4. Document any changes

## Questions & Support

For questions about:
- **Model Architecture**: See `RNJ1_REVIEW.md` and `RNJ1_VS_GEMMA3_COMPARISON.md`
- **Tokenizer**: See `RNJ1_TOKENIZER_INFO.md`
- **Quick Usage**: See `RNJ1_QUICK_REFERENCE.md`
- **Implementation Details**: See inline comments in `rnj1.py`

---

**Last Updated**: December 2025  
**Model Version**: RNJ-1 (Base and Instruct)  
**Implementation Version**: 1.0