Update README.md
Browse files
README.md
CHANGED
|
@@ -1,4 +1,3 @@
|
|
| 1 |
-
|
| 2 |
---
|
| 3 |
language: en
|
| 4 |
license: apache-2.0
|
|
@@ -7,293 +6,723 @@ tags:
|
|
| 7 |
- pytorch
|
| 8 |
- transformer
|
| 9 |
- recursive-language-model
|
| 10 |
-
-
|
| 11 |
- adaptive-computation
|
|
|
|
| 12 |
datasets:
|
| 13 |
-
-
|
|
|
|
|
|
|
| 14 |
metrics:
|
| 15 |
- perplexity
|
| 16 |
library_name: transformers
|
| 17 |
pipeline_tag: text-generation
|
| 18 |
---
|
| 19 |
|
| 20 |
-
# Recursive Language Model - 48M
|
| 21 |
|
| 22 |
-
A transformer-based language model with adaptive recursive processing
|
| 23 |
|
| 24 |
## Model Description
|
| 25 |
|
| 26 |
-
This model implements a **
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
-
|
|
|
|
|
|
|
| 29 |
|
| 30 |
## Quick Start
|
| 31 |
|
| 32 |
### Installation
|
| 33 |
-
|
| 34 |
```bash
|
| 35 |
pip install transformers torch
|
| 36 |
```
|
| 37 |
|
| 38 |
### Basic Usage
|
| 39 |
-
|
| 40 |
```python
|
| 41 |
-
from transformers import AutoModelForCausalLM,
|
|
|
|
| 42 |
|
| 43 |
# Load model and tokenizer
|
| 44 |
model = AutoModelForCausalLM.from_pretrained(
|
| 45 |
"Girinath11/recursive-language-model-48m",
|
| 46 |
trust_remote_code=True
|
| 47 |
)
|
| 48 |
-
tokenizer =
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
|
| 50 |
# Generate text
|
| 51 |
prompt = "The future of artificial intelligence"
|
| 52 |
-
input_ids = tokenizer.encode(prompt, return_tensors="pt")
|
| 53 |
|
| 54 |
outputs = model.generate(
|
| 55 |
input_ids,
|
| 56 |
max_new_tokens=50,
|
| 57 |
temperature=0.8,
|
| 58 |
-
|
|
|
|
|
|
|
| 59 |
)
|
| 60 |
|
| 61 |
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 62 |
```
|
| 63 |
|
| 64 |
-
## Model
|
| 65 |
|
| 66 |
-
### Architecture
|
| 67 |
|
| 68 |
-
| Component |
|
| 69 |
-
|
| 70 |
-
| **Parameters** |
|
| 71 |
-
| **Vocabulary** | 50,257 tokens (GPT-2) |
|
| 72 |
| **Embedding Dimension** | 512 |
|
| 73 |
-
| **Transformer Layers** | 6
|
| 74 |
-
| **Attention Heads** | 8 |
|
| 75 |
-
| **
|
| 76 |
-
| **
|
| 77 |
-
| **
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
|
| 79 |
### Architecture Components
|
| 80 |
|
| 81 |
-
1. **
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 88 |
|
| 89 |
## Training Details
|
| 90 |
|
| 91 |
-
### Dataset
|
| 92 |
|
| 93 |
-
|
| 94 |
-
- **Training Split:** 95,000 samples (95%)
|
| 95 |
-
- **Validation Split:** 5,000 samples (5%)
|
| 96 |
-
- **Tokenizer:** GPT-2 tokenizer (50,257 vocab)
|
| 97 |
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
|
|
|
|
| 100 |
```yaml
|
| 101 |
Hardware:
|
| 102 |
-
GPU: NVIDIA T4 (
|
| 103 |
-
Mixed Precision: FP16
|
|
|
|
| 104 |
|
| 105 |
Hyperparameters:
|
| 106 |
-
Batch Size:
|
| 107 |
-
Gradient Accumulation:
|
| 108 |
-
Effective Batch Size:
|
| 109 |
-
Learning Rate:
|
| 110 |
Optimizer: AdamW
|
| 111 |
-
Weight Decay: 0.
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 115 |
|
| 116 |
Regularization:
|
| 117 |
-
Dropout
|
|
|
|
| 118 |
```
|
| 119 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 120 |
### Training Time
|
| 121 |
|
| 122 |
-
- **Total Duration:** 2
|
| 123 |
-
- **Time per
|
| 124 |
-
- **
|
|
|
|
| 125 |
|
| 126 |
### Training Progression
|
| 127 |
|
| 128 |
-
|
|
| 129 |
-
|
| 130 |
-
|
|
| 131 |
-
|
|
| 132 |
-
|
|
| 133 |
-
|
|
| 134 |
-
|
|
| 135 |
-
| 6 | 3.81 | 3.90 | 49.27 |
|
| 136 |
-
| 7 | 3.67 | 3.85 | 47.12 |
|
| 137 |
-
| 8 | 3.59 | 3.84 | **46.75** |
|
| 138 |
|
| 139 |
-
**
|
|
|
|
| 140 |
|
| 141 |
-
## Performance
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 142 |
|
| 143 |
### Generation Quality
|
| 144 |
|
| 145 |
-
**Perplexity:
|
| 146 |
|
| 147 |
-
- β
Generates coherent sentences
|
| 148 |
-
- β
Maintains
|
| 149 |
-
- β
Produces
|
| 150 |
-
- β
|
| 151 |
-
-
|
| 152 |
-
- β οΈ
|
|
|
|
|
|
|
| 153 |
|
| 154 |
-
### Inference
|
| 155 |
|
| 156 |
-
| Hardware | Tokens/Second (
|
| 157 |
-
|
| 158 |
-
| CPU (Intel i7) | ~
|
| 159 |
-
| GPU (T4) | ~
|
| 160 |
-
| GPU (V100) | ~
|
|
|
|
| 161 |
|
| 162 |
### Memory Requirements
|
| 163 |
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
|
|
|
|
|
|
|
|
|
| 167 |
|
| 168 |
-
## Usage
|
| 169 |
-
|
| 170 |
-
### Interactive Text Completion
|
| 171 |
|
|
|
|
| 172 |
```python
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 184 |
prompts = [
|
| 185 |
-
"The history of
|
| 186 |
-
"Climate change
|
| 187 |
-
"
|
| 188 |
]
|
| 189 |
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 193 |
```
|
| 194 |
|
| 195 |
-
###
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 196 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 197 |
```python
|
| 198 |
-
#
|
| 199 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 200 |
|
| 201 |
-
#
|
| 202 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 203 |
```
|
| 204 |
|
| 205 |
-
##
|
| 206 |
|
| 207 |
-
###
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 208 |
|
| 209 |
-
|
| 210 |
-
2. **Model Size:** 48M parameters - smaller than production models
|
| 211 |
-
3. **Language:** Primarily English (GPT-2 tokenizer)
|
| 212 |
-
4. **Coherence:** Long-form generation may lose coherence
|
| 213 |
-
5. **Factuality:** May generate plausible but incorrect information
|
| 214 |
|
| 215 |
-
|
|
|
|
|
|
|
| 216 |
|
| 217 |
-
|
| 218 |
-
-
|
| 219 |
-
-
|
|
|
|
|
|
|
| 220 |
|
| 221 |
-
|
|
|
|
|
|
|
| 222 |
|
| 223 |
-
|
|
|
|
|
|
|
| 224 |
|
| 225 |
-
|
| 226 |
-
-
|
| 227 |
-
-
|
| 228 |
-
- Potential biases in article quality across topics
|
| 229 |
|
| 230 |
-
###
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 231 |
|
| 232 |
-
|
| 233 |
-
- β
|
| 234 |
-
- β
|
| 235 |
-
-
|
| 236 |
-
-
|
| 237 |
-
- β Don't use for medical, legal, or financial advice
|
| 238 |
|
| 239 |
-
##
|
| 240 |
|
| 241 |
-
###
|
| 242 |
|
| 243 |
-
|
| 244 |
-
- π¬ Research on adaptive computation in transformers
|
| 245 |
-
- π οΈ Prototyping language model applications
|
| 246 |
-
- π» Resource-constrained deployment scenarios
|
| 247 |
-
- π Experimenting with language models
|
| 248 |
-
- βοΈ Text completion and generation experiments
|
| 249 |
|
| 250 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 251 |
|
| 252 |
-
|
| 253 |
-
-
|
| 254 |
-
-
|
| 255 |
-
-
|
| 256 |
-
-
|
| 257 |
|
| 258 |
-
|
| 259 |
|
| 260 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 261 |
|
| 262 |
-
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
|
|
|
| 271 |
```
|
| 272 |
|
| 273 |
-
|
| 274 |
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 278 |
|
| 279 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 280 |
|
| 281 |
-
|
|
|
|
|
|
|
| 282 |
|
| 283 |
-
|
|
|
|
|
|
|
|
|
|
| 284 |
|
| 285 |
-
|
| 286 |
|
| 287 |
-
|
| 288 |
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 292 |
|
| 293 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 294 |
|
| 295 |
-
**
|
| 296 |
-
**
|
| 297 |
-
**
|
| 298 |
-
**
|
| 299 |
-
**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
language: en
|
| 3 |
license: apache-2.0
|
|
|
|
| 6 |
- pytorch
|
| 7 |
- transformer
|
| 8 |
- recursive-language-model
|
| 9 |
+
- mixture-of-recursion
|
| 10 |
- adaptive-computation
|
| 11 |
+
- rotary-embeddings
|
| 12 |
datasets:
|
| 13 |
+
- HuggingFaceFW/fineweb-edu
|
| 14 |
+
- HuggingFaceTB/cosmopedia
|
| 15 |
+
- openwebtext
|
| 16 |
metrics:
|
| 17 |
- perplexity
|
| 18 |
library_name: transformers
|
| 19 |
pipeline_tag: text-generation
|
| 20 |
---
|
| 21 |
|
| 22 |
+
# Recursive Language Model - 48M (Mixture of Recursion)
|
| 23 |
|
| 24 |
+
A transformer-based language model with **Mixture of Recursion** architecture featuring adaptive recursive processing, rotary positional embeddings (RoPE), and intelligent sequence-level complexity routing for enhanced text generation.
|
| 25 |
|
| 26 |
## Model Description
|
| 27 |
|
| 28 |
+
This model implements a novel **Mixture of Recursion** architecture that dynamically determines the optimal number of recursive refinement passes based on input sequence complexity. Unlike standard transformers that process all inputs uniformly, this model intelligently allocates computational resources.
|
| 29 |
+
|
| 30 |
+
### Key Innovations
|
| 31 |
+
|
| 32 |
+
- π§ **Sequence-Level Router**: Neural classifier that analyzes entire sequences to predict complexity (simple/medium/complex)
|
| 33 |
+
- π **Adaptive Recursion**: 1, 3, or 5 recursive transformer passes based on router prediction
|
| 34 |
+
- π **Rotary Positional Embeddings (RoPE)**: Superior positional encoding with better length generalization
|
| 35 |
+
- β‘ **Dynamic Computation**: Efficient processing that adapts to input difficulty
|
| 36 |
+
- π― **Weight Tying**: Shared embeddings between input and output layers for parameter efficiency
|
| 37 |
+
- π **Multi-Dataset Training**: Trained on diverse, high-quality web text from FineWeb-Edu, Cosmopedia, and OpenWebText
|
| 38 |
|
| 39 |
+
### Architecture Philosophy
|
| 40 |
+
|
| 41 |
+
Traditional transformers apply the same computational depth to all inputs. This model recognizes that some sequences (simple greetings, common phrases) need minimal processing, while others (technical explanations, complex reasoning) benefit from deeper iterative refinement. The router learns to make this decision automatically.
|
| 42 |
|
| 43 |
## Quick Start
|
| 44 |
|
| 45 |
### Installation
|
|
|
|
| 46 |
```bash
|
| 47 |
pip install transformers torch
|
| 48 |
```
|
| 49 |
|
| 50 |
### Basic Usage
|
|
|
|
| 51 |
```python
|
| 52 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 53 |
+
import torch
|
| 54 |
|
| 55 |
# Load model and tokenizer
|
| 56 |
model = AutoModelForCausalLM.from_pretrained(
|
| 57 |
"Girinath11/recursive-language-model-48m",
|
| 58 |
trust_remote_code=True
|
| 59 |
)
|
| 60 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
| 61 |
+
"Girinath11/recursive-language-model-48m"
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
# Move to GPU if available
|
| 65 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 66 |
+
model = model.to(device)
|
| 67 |
+
model.eval()
|
| 68 |
+
|
| 69 |
+
print("β
Model loaded successfully!")
|
| 70 |
|
| 71 |
# Generate text
|
| 72 |
prompt = "The future of artificial intelligence"
|
| 73 |
+
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
|
| 74 |
|
| 75 |
outputs = model.generate(
|
| 76 |
input_ids,
|
| 77 |
max_new_tokens=50,
|
| 78 |
temperature=0.8,
|
| 79 |
+
top_p=0.9,
|
| 80 |
+
do_sample=True,
|
| 81 |
+
pad_token_id=tokenizer.eos_token_id
|
| 82 |
)
|
| 83 |
|
| 84 |
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 85 |
```
|
| 86 |
|
| 87 |
+
## Model Architecture
|
| 88 |
|
| 89 |
+
### Detailed Architecture Specifications
|
| 90 |
|
| 91 |
+
| Component | Configuration |
|
| 92 |
+
|-----------|--------------|
|
| 93 |
+
| **Total Parameters** | 48,208,641 (~48.2M) |
|
| 94 |
+
| **Vocabulary Size** | 50,257 tokens (GPT-2 BPE) |
|
| 95 |
| **Embedding Dimension** | 512 |
|
| 96 |
+
| **Base Transformer Layers** | 6 |
|
| 97 |
+
| **Attention Heads** | 8 heads per layer |
|
| 98 |
+
| **Head Dimension** | 64 (512 Γ· 8) |
|
| 99 |
+
| **FFN Intermediate Size** | 2048 |
|
| 100 |
+
| **Max Sequence Length** | 512 tokens |
|
| 101 |
+
| **Positional Encoding** | Rotary Positional Embeddings (RoPE) |
|
| 102 |
+
| **Dropout Rate** | 0.1 (both hidden and attention) |
|
| 103 |
+
| **Layer Normalization** | eps=1e-5 |
|
| 104 |
+
|
| 105 |
+
### Recursion Configuration
|
| 106 |
+
|
| 107 |
+
| Complexity Class | Recursion Steps | Use Case |
|
| 108 |
+
|-----------------|----------------|----------|
|
| 109 |
+
| **Simple** | 1 step | Common phrases, greetings, simple completions |
|
| 110 |
+
| **Medium** | 3 steps | Standard text, moderate complexity |
|
| 111 |
+
| **Complex** | 5 steps | Technical content, reasoning, complex narratives |
|
| 112 |
|
| 113 |
### Architecture Components
|
| 114 |
|
| 115 |
+
1. **Embedding Layer**
|
| 116 |
+
- Token embeddings (50,257 Γ 512)
|
| 117 |
+
- Tied with output projection for efficiency
|
| 118 |
+
- Padding token handling (ID: 50256)
|
| 119 |
+
|
| 120 |
+
2. **Base Transformer Stack** (6 layers)
|
| 121 |
+
- Multi-head self-attention with RoPE
|
| 122 |
+
- Feed-forward networks (512 β 2048 β 512)
|
| 123 |
+
- Pre-normalization with LayerNorm
|
| 124 |
+
- Residual connections
|
| 125 |
+
- Causal masking for autoregressive generation
|
| 126 |
+
|
| 127 |
+
3. **Sequence-Level Router**
|
| 128 |
+
- Attention-weighted pooling over sequence
|
| 129 |
+
- 2-layer MLP classifier (512 β 256 β 3)
|
| 130 |
+
- Outputs: complexity class (0=simple, 1=medium, 2=complex)
|
| 131 |
+
- Trained with pseudo-labels based on sequence length
|
| 132 |
+
|
| 133 |
+
4. **Recursive Refinement Layer**
|
| 134 |
+
- Additional transformer block (reused 1-5 times)
|
| 135 |
+
- Same architecture as base layers
|
| 136 |
+
- Applied iteratively based on router decision
|
| 137 |
+
|
| 138 |
+
5. **Output Projection Head**
|
| 139 |
+
- Linear layer (512 β 50,257)
|
| 140 |
+
- Weight-tied with input embeddings
|
| 141 |
+
- Final LayerNorm before projection
|
| 142 |
+
|
| 143 |
+
### Rotary Positional Embeddings
|
| 144 |
+
|
| 145 |
+
Uses RoPE instead of learned positional embeddings for:
|
| 146 |
+
- Better extrapolation to longer sequences
|
| 147 |
+
- Relative position encoding
|
| 148 |
+
- Improved performance on positional tasks
|
| 149 |
+
- Base frequency: 10,000
|
| 150 |
|
| 151 |
## Training Details
|
| 152 |
|
| 153 |
+
### Training Dataset
|
| 154 |
|
| 155 |
+
**Total Training Samples: 50,000** (high-quality web text)
|
|
|
|
|
|
|
|
|
|
| 156 |
|
| 157 |
+
| Dataset | Percentage | Samples | Description |
|
| 158 |
+
|---------|-----------|---------|-------------|
|
| 159 |
+
| **FineWeb-Edu** | 45% | 22,500 | Educational web content, filtered for quality |
|
| 160 |
+
| **Cosmopedia** | 30% | 15,000 | Synthetic educational content |
|
| 161 |
+
| **OpenWebText** | 25% | 12,500 | Web text from Reddit links |
|
| 162 |
+
| **Validation** | - | 1,000 | Held-out FineWeb-Edu samples |
|
| 163 |
+
|
| 164 |
+
**Filtering Criteria:**
|
| 165 |
+
- Minimum sequence length: 128 tokens
|
| 166 |
+
- Maximum sequence length: 384 tokens
|
| 167 |
+
- Actual samples after filtering: ~45,000-48,000
|
| 168 |
|
| 169 |
+
### Training Configuration
|
| 170 |
```yaml
|
| 171 |
Hardware:
|
| 172 |
+
GPU: NVIDIA T4 (15 GB)
|
| 173 |
+
Mixed Precision: FP16
|
| 174 |
+
Framework: PyTorch 2.0+ with CUDA
|
| 175 |
|
| 176 |
Hyperparameters:
|
| 177 |
+
Batch Size: 1
|
| 178 |
+
Gradient Accumulation: 32
|
| 179 |
+
Effective Batch Size: 32
|
| 180 |
+
Learning Rate: 3e-4
|
| 181 |
Optimizer: AdamW
|
| 182 |
+
Weight Decay: 0.01
|
| 183 |
+
Warmup Steps: 500
|
| 184 |
+
Max Gradient Norm: 1.0
|
| 185 |
+
Total Epochs: 3
|
| 186 |
+
Max Sequence Length: 384 tokens
|
| 187 |
+
|
| 188 |
+
Loss Function:
|
| 189 |
+
Language Modeling: CrossEntropyLoss (ignore_index=-100)
|
| 190 |
+
Router Loss: CrossEntropyLoss (weight: 0.1)
|
| 191 |
+
Total Loss: LM Loss + 0.1 Γ Router Loss
|
| 192 |
|
| 193 |
Regularization:
|
| 194 |
+
Hidden Dropout: 0.1
|
| 195 |
+
Attention Dropout: 0.1
|
| 196 |
```
|
| 197 |
|
| 198 |
+
### Training Schedule
|
| 199 |
+
|
| 200 |
+
- **Total Training Steps:** 4,686
|
| 201 |
+
- **Steps per Epoch:** 1,562
|
| 202 |
+
- **Warmup:** 500 steps
|
| 203 |
+
- **Learning Rate Schedule:** Linear warmup β Linear decay
|
| 204 |
+
- **Evaluation Frequency:** Every 1,000 steps
|
| 205 |
+
- **Checkpoint Saving:** Every 1,000 steps (top 2 kept)
|
| 206 |
+
|
| 207 |
### Training Time
|
| 208 |
|
| 209 |
+
- **Total Duration:** ~2 hours 10 minutes
|
| 210 |
+
- **Time per Step:** ~1.5-1.6 seconds
|
| 211 |
+
- **Throughput:** 19.12 samples/second
|
| 212 |
+
- **Training Speed:** 0.597 steps/second
|
| 213 |
|
| 214 |
### Training Progression
|
| 215 |
|
| 216 |
+
| Checkpoint | Steps | Training Loss | Eval Loss | Perplexity | Epoch |
|
| 217 |
+
|------------|-------|--------------|-----------|------------|-------|
|
| 218 |
+
| **Start** | 0 | 9.82 | - | - | 0.00 |
|
| 219 |
+
| **Checkpoint 1** | 1000 | 5.46 | 5.72 | 305.15 | 0.21 |
|
| 220 |
+
| **Checkpoint 2** | 2000 | 4.92 | 5.06 | 156.84 | 1.10 |
|
| 221 |
+
| **Checkpoint 3** | 3000 | 4.51 | 4.86 | 128.63 | 2.20 |
|
| 222 |
+
| **Final** | 4686 | 4.32 | 4.59 | **98.86** | 3.02 |
|
|
|
|
|
|
|
|
|
|
| 223 |
|
| 224 |
+
**Loss Reduction:** 9.82 β 4.59 (53% improvement)
|
| 225 |
+
**Perplexity Achievement:** 98.86 (excellent for 48M model!)
|
| 226 |
|
| 227 |
+
## Performance Metrics
|
| 228 |
+
|
| 229 |
+
### Final Evaluation Results
|
| 230 |
+
```
|
| 231 |
+
π FINAL METRICS:
|
| 232 |
+
βββββββββββββββββββββββββββββββββββββββ
|
| 233 |
+
β Evaluation Loss: 4.59
|
| 234 |
+
β Perplexity: 98.86
|
| 235 |
+
β Training Loss (avg): 5.08
|
| 236 |
+
β Total Samples Seen: 150,000 (3 epochs Γ 50K)
|
| 237 |
+
βββββββββββββββββββββββββββββββββββββββ
|
| 238 |
+
```
|
| 239 |
|
| 240 |
### Generation Quality
|
| 241 |
|
| 242 |
+
**Perplexity: 98.86** indicates **good quality** for a 48M parameter model:
|
| 243 |
|
| 244 |
+
- β
Generates coherent and grammatical sentences
|
| 245 |
+
- β
Maintains context over short passages (50-100 tokens)
|
| 246 |
+
- β
Produces diverse outputs with proper sampling
|
| 247 |
+
- β
Handles various writing styles and topics
|
| 248 |
+
- β
Suitable for creative writing, completions, and prototyping
|
| 249 |
+
- β οΈ May show repetition in very long generations (200+ tokens)
|
| 250 |
+
- β οΈ Less factually reliable than larger models (175B+)
|
| 251 |
+
- β οΈ Limited reasoning capabilities compared to state-of-the-art
|
| 252 |
|
| 253 |
+
### Inference Performance
|
| 254 |
|
| 255 |
+
| Hardware | Tokens/Second | Latency (50 tokens) |
|
| 256 |
+
|----------|--------------|---------------------|
|
| 257 |
+
| CPU (Intel i7) | ~100 | ~500ms |
|
| 258 |
+
| GPU (T4) | ~500 | ~100ms |
|
| 259 |
+
| GPU (V100) | ~800 | ~60ms |
|
| 260 |
+
| GPU (A100) | ~1200 | ~40ms |
|
| 261 |
|
| 262 |
### Memory Requirements
|
| 263 |
|
| 264 |
+
| Mode | RAM/VRAM | Disk Space |
|
| 265 |
+
|------|----------|------------|
|
| 266 |
+
| **Model Weights** | - | 184 MB |
|
| 267 |
+
| **CPU Inference** | 600 MB | - |
|
| 268 |
+
| **GPU Inference (FP16)** | 1.5 GB | - |
|
| 269 |
+
| **GPU Training (batch=1)** | ~8 GB | - |
|
| 270 |
|
| 271 |
+
## Advanced Usage
|
|
|
|
|
|
|
| 272 |
|
| 273 |
+
### Batch Generation
|
| 274 |
```python
|
| 275 |
+
import torch
|
| 276 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 277 |
+
|
| 278 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 279 |
+
"Girinath11/recursive-language-model-48m",
|
| 280 |
+
trust_remote_code=True
|
| 281 |
+
)
|
| 282 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
| 283 |
+
"Girinath11/recursive-language-model-48m"
|
| 284 |
+
)
|
| 285 |
+
|
| 286 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 287 |
+
model = model.to(device)
|
| 288 |
+
model.eval()
|
| 289 |
+
|
| 290 |
+
# Batch generation
|
| 291 |
prompts = [
|
| 292 |
+
"The history of computing",
|
| 293 |
+
"Climate change impacts",
|
| 294 |
+
"Space exploration in"
|
| 295 |
]
|
| 296 |
|
| 297 |
+
# Tokenize all prompts
|
| 298 |
+
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
|
| 299 |
+
|
| 300 |
+
# Generate for all prompts at once
|
| 301 |
+
outputs = model.generate(
|
| 302 |
+
**inputs,
|
| 303 |
+
max_new_tokens=50,
|
| 304 |
+
temperature=0.8,
|
| 305 |
+
top_p=0.9,
|
| 306 |
+
do_sample=True,
|
| 307 |
+
pad_token_id=tokenizer.eos_token_id
|
| 308 |
+
)
|
| 309 |
+
|
| 310 |
+
# Decode all outputs
|
| 311 |
+
for i, output in enumerate(outputs):
|
| 312 |
+
print(f"\nPrompt {i+1}: {prompts[i]}")
|
| 313 |
+
print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")
|
| 314 |
```
|
| 315 |
|
| 316 |
+
### Fine-tuning on Custom Data
|
| 317 |
+
```python
|
| 318 |
+
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
|
| 319 |
+
from datasets import load_dataset
|
| 320 |
+
|
| 321 |
+
# Load your custom dataset
|
| 322 |
+
dataset = load_dataset("your_dataset")
|
| 323 |
+
|
| 324 |
+
# Tokenize
|
| 325 |
+
def tokenize(examples):
|
| 326 |
+
return tokenizer(examples['text'], truncation=True, max_length=384)
|
| 327 |
+
|
| 328 |
+
tokenized = dataset.map(tokenize, batched=True)
|
| 329 |
+
|
| 330 |
+
# Training arguments
|
| 331 |
+
training_args = TrainingArguments(
|
| 332 |
+
output_dir="./finetuned-model",
|
| 333 |
+
per_device_train_batch_size=1,
|
| 334 |
+
gradient_accumulation_steps=32,
|
| 335 |
+
learning_rate=1e-4, # Lower LR for fine-tuning
|
| 336 |
+
num_train_epochs=1,
|
| 337 |
+
fp16=True,
|
| 338 |
+
save_steps=500,
|
| 339 |
+
logging_steps=100,
|
| 340 |
+
)
|
| 341 |
+
|
| 342 |
+
# Data collator
|
| 343 |
+
data_collator = DataCollatorForLanguageModeling(
|
| 344 |
+
tokenizer=tokenizer,
|
| 345 |
+
mlm=False
|
| 346 |
+
)
|
| 347 |
|
| 348 |
+
# Trainer
|
| 349 |
+
trainer = Trainer(
|
| 350 |
+
model=model,
|
| 351 |
+
args=training_args,
|
| 352 |
+
train_dataset=tokenized['train'],
|
| 353 |
+
data_collator=data_collator,
|
| 354 |
+
)
|
| 355 |
+
|
| 356 |
+
# Fine-tune
|
| 357 |
+
trainer.train()
|
| 358 |
+
```
|
| 359 |
+
|
| 360 |
+
### Temperature and Sampling Control
|
| 361 |
```python
|
| 362 |
+
# Creative writing (high temperature)
|
| 363 |
+
creative_output = model.generate(
|
| 364 |
+
input_ids,
|
| 365 |
+
max_new_tokens=100,
|
| 366 |
+
temperature=1.0, # More random
|
| 367 |
+
top_p=0.95,
|
| 368 |
+
top_k=50,
|
| 369 |
+
do_sample=True
|
| 370 |
+
)
|
| 371 |
+
|
| 372 |
+
# Focused completion (low temperature)
|
| 373 |
+
focused_output = model.generate(
|
| 374 |
+
input_ids,
|
| 375 |
+
max_new_tokens=100,
|
| 376 |
+
temperature=0.5, # More deterministic
|
| 377 |
+
top_p=0.9,
|
| 378 |
+
top_k=40,
|
| 379 |
+
do_sample=True
|
| 380 |
+
)
|
| 381 |
|
| 382 |
+
# Greedy decoding (most likely tokens)
|
| 383 |
+
greedy_output = model.generate(
|
| 384 |
+
input_ids,
|
| 385 |
+
max_new_tokens=50,
|
| 386 |
+
do_sample=False # Greedy
|
| 387 |
+
)
|
| 388 |
```
|
| 389 |
|
| 390 |
+
## Technical Architecture
|
| 391 |
|
| 392 |
+
### Model Structure
|
| 393 |
+
```
|
| 394 |
+
Input Text
|
| 395 |
+
β
|
| 396 |
+
[Token Embedding Layer] (50,257 Γ 512)
|
| 397 |
+
β
|
| 398 |
+
[6Γ Base Transformer Blocks]
|
| 399 |
+
ββ Multi-Head Attention (8 heads, RoPE)
|
| 400 |
+
ββ Feed-Forward Network (512 β 2048 β 512)
|
| 401 |
+
ββ LayerNorm + Residual Connections
|
| 402 |
+
β
|
| 403 |
+
[Sequence-Level Router]
|
| 404 |
+
ββ Attention-Weighted Pooling
|
| 405 |
+
ββ MLP Classifier (512 β 256 β 3)
|
| 406 |
+
ββ Output: Complexity Class (0/1/2)
|
| 407 |
+
β
|
| 408 |
+
[Adaptive Recursive Refinement]
|
| 409 |
+
ββ Simple: 1Γ Recursion Layer
|
| 410 |
+
ββ Medium: 3Γ Recursion Layer
|
| 411 |
+
ββ Complex: 5Γ Recursion Layer
|
| 412 |
+
β
|
| 413 |
+
[Final LayerNorm]
|
| 414 |
+
β
|
| 415 |
+
[LM Head] (512 β 50,257, weight-tied)
|
| 416 |
+
β
|
| 417 |
+
Output Tokens
|
| 418 |
+
```
|
| 419 |
|
| 420 |
+
### Layer Breakdown
|
|
|
|
|
|
|
|
|
|
|
|
|
| 421 |
|
| 422 |
+
**1. Embedding Layer (25.7M params)**
|
| 423 |
+
- Token embeddings: 50,257 Γ 512 = 25,731,584 params
|
| 424 |
+
- Weight-tied with output projection
|
| 425 |
|
| 426 |
+
**2. Base Transformer (6 layers, ~19M params)**
|
| 427 |
+
- Each layer: ~3.15M params
|
| 428 |
+
- Self-attention: 4 Γ (512 Γ 512) = 1,048,576
|
| 429 |
+
- FFN: 2 Γ (512 Γ 2048) = 2,097,152
|
| 430 |
+
- LayerNorms: small overhead
|
| 431 |
|
| 432 |
+
**3. Router Network (~0.4M params)**
|
| 433 |
+
- Pooler: 512 Γ 512 = 262,144
|
| 434 |
+
- Classifier: (512 Γ 256) + (256 Γ 3) = 131,840
|
| 435 |
|
| 436 |
+
**4. Recursion Layer (~3.15M params)**
|
| 437 |
+
- Single transformer block (reused 1-5 times)
|
| 438 |
+
- Same structure as base layers
|
| 439 |
|
| 440 |
+
**5. Output Components**
|
| 441 |
+
- Final LayerNorm: ~1K params
|
| 442 |
+
- LM Head: weight-tied (0 additional params)
|
|
|
|
| 443 |
|
| 444 |
+
### Rotary Positional Embeddings (RoPE)
|
| 445 |
+
```python
|
| 446 |
+
# RoPE computation
|
| 447 |
+
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
|
| 448 |
+
t = torch.arange(seq_len)
|
| 449 |
+
freqs = torch.einsum('i,j->ij', t, inv_freq)
|
| 450 |
+
emb = torch.cat((freqs, freqs), dim=-1)
|
| 451 |
+
cos, sin = emb.cos(), emb.sin()
|
| 452 |
+
|
| 453 |
+
# Applied to queries and keys before attention
|
| 454 |
+
q_rotated = (q * cos) + (rotate_half(q) * sin)
|
| 455 |
+
k_rotated = (k * cos) + (rotate_half(k) * sin)
|
| 456 |
+
```
|
| 457 |
|
| 458 |
+
Benefits:
|
| 459 |
+
- β
Better length extrapolation
|
| 460 |
+
- β
Relative position awareness
|
| 461 |
+
- β
No learned position parameters
|
| 462 |
+
- β
Efficient computation
|
|
|
|
| 463 |
|
| 464 |
+
## Training Details
|
| 465 |
|
| 466 |
+
### Dataset Composition
|
| 467 |
|
| 468 |
+
**Training Data:** 50,000 samples from three high-quality sources
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 469 |
|
| 470 |
+
| Dataset | Source | Percentage | Samples | Description |
|
| 471 |
+
|---------|--------|-----------|---------|-------------|
|
| 472 |
+
| **FineWeb-Edu** | HuggingFace | 45% | 22,500 | Educational web pages, high quality |
|
| 473 |
+
| **Cosmopedia** | HuggingFace | 30% | 15,000 | Synthetic educational textbooks |
|
| 474 |
+
| **OpenWebText** | Community | 25% | 12,500 | Web text from Reddit submissions |
|
| 475 |
|
| 476 |
+
**Data Preprocessing:**
|
| 477 |
+
- Tokenization: GPT-2 BPE tokenizer
|
| 478 |
+
- Truncation: 384 tokens max
|
| 479 |
+
- Filtering: Minimum 128 tokens per sample
|
| 480 |
+
- No data augmentation applied
|
| 481 |
|
| 482 |
+
**Validation Set:** 1,000 samples from FineWeb-Edu
|
| 483 |
|
| 484 |
+
### Training Hyperparameters
|
| 485 |
+
```yaml
|
| 486 |
+
Batch Configuration:
|
| 487 |
+
Per-Device Batch Size: 1
|
| 488 |
+
Gradient Accumulation: 32
|
| 489 |
+
Effective Batch Size: 32
|
| 490 |
+
Total Training Steps: 4,686
|
| 491 |
+
Steps per Epoch: 1,562
|
| 492 |
+
|
| 493 |
+
Optimization:
|
| 494 |
+
Optimizer: AdamW
|
| 495 |
+
Learning Rate: 3e-4
|
| 496 |
+
Weight Decay: 0.01
|
| 497 |
+
Warmup Steps: 500
|
| 498 |
+
LR Schedule: Linear warmup β Linear decay
|
| 499 |
+
Max Gradient Norm: 1.0
|
| 500 |
+
Beta1: 0.9
|
| 501 |
+
Beta2: 0.999
|
| 502 |
+
Epsilon: 1e-8
|
| 503 |
+
|
| 504 |
+
Mixed Precision:
|
| 505 |
+
Enabled: True
|
| 506 |
+
Format: FP16
|
| 507 |
+
Loss Scaling: Dynamic
|
| 508 |
|
| 509 |
+
Regularization:
|
| 510 |
+
Hidden Dropout: 0.1
|
| 511 |
+
Attention Dropout: 0.1
|
| 512 |
+
No additional regularization
|
| 513 |
+
|
| 514 |
+
Evaluation:
|
| 515 |
+
Strategy: steps
|
| 516 |
+
Eval Steps: 1,000
|
| 517 |
+
Metric: eval_loss
|
| 518 |
+
Best Model Selection: Minimum eval_loss
|
| 519 |
```
|
| 520 |
|
| 521 |
+
### Loss Function
|
| 522 |
|
| 523 |
+
**Composite Loss = Language Modeling Loss + Router Loss**
|
| 524 |
+
```python
|
| 525 |
+
# Language modeling loss (primary)
|
| 526 |
+
lm_loss = CrossEntropyLoss(
|
| 527 |
+
predictions=shift_logits,
|
| 528 |
+
targets=shift_labels,
|
| 529 |
+
ignore_index=-100 # Ignore padding tokens
|
| 530 |
+
)
|
| 531 |
|
| 532 |
+
# Router loss (auxiliary, 10% weight)
|
| 533 |
+
router_loss = CrossEntropyLoss(
|
| 534 |
+
predictions=complexity_logits,
|
| 535 |
+
targets=pseudo_labels # Based on sequence length
|
| 536 |
+
)
|
| 537 |
|
| 538 |
+
# Total loss
|
| 539 |
+
total_loss = lm_loss + 0.1 * router_loss
|
| 540 |
+
```
|
| 541 |
|
| 542 |
+
Router pseudo-labels assignment:
|
| 543 |
+
- Sequence length < 170: Simple (class 0)
|
| 544 |
+
- Sequence length 170-340: Medium (class 1)
|
| 545 |
+
- Sequence length > 340: Complex (class 2)
|
| 546 |
|
| 547 |
+
### Training Metrics Over Time
|
| 548 |
|
| 549 |
+
#### Loss Progression
|
| 550 |
|
| 551 |
+
| Step | Epoch | Training Loss | Eval Loss | Eval Perplexity |
|
| 552 |
+
|------|-------|--------------|-----------|-----------------|
|
| 553 |
+
| 100 | 0.02 | 9.82 | - | - |
|
| 554 |
+
| 500 | 0.11 | 6.29 | - | - |
|
| 555 |
+
| 1000 | 0.21 | 5.46 | 5.72 | 305.15 |
|
| 556 |
+
| 1500 | 0.32 | 5.09 | - | - |
|
| 557 |
+
| 2000 | 1.10 | 4.92 | 5.06 | 156.84 |
|
| 558 |
+
| 2500 | 1.29 | 4.51 | - | - |
|
| 559 |
+
| 3000 | 2.20 | 4.51 | 4.86 | 128.63 |
|
| 560 |
+
| 3500 | 2.30 | 4.24 | - | - |
|
| 561 |
+
| 4000 | 2.85 | 4.32 | 4.59 | 98.86 |
|
| 562 |
+
| **4686** | **3.02** | **4.32** | **4.59** | **98.86** |
|
| 563 |
|
| 564 |
+
#### Training Dynamics
|
| 565 |
+
|
| 566 |
+
**Loss Improvement by Phase:**
|
| 567 |
+
- **Epoch 1 (0-1562 steps):** 9.82 β 5.16 (47% reduction) - Rapid initial learning
|
| 568 |
+
- **Epoch 2 (1563-3124 steps):** 5.16 β 4.38 (15% reduction) - Steady refinement
|
| 569 |
+
- **Epoch 3 (3125-4686 steps):** 4.38 β 4.32 (1% reduction) - Fine convergence
|
| 570 |
+
|
| 571 |
+
**Gradient Norms:** Remained stable (0.7-1.5), indicating healthy training without exploding/vanishing gradients.
|
| 572 |
+
|
| 573 |
+
**Learning Rate Schedule:**
|
| 574 |
+
- Warmup (0-500 steps): 0 β 3e-4
|
| 575 |
+
- Peak (500-1000): 3e-4
|
| 576 |
+
- Decay (1000-4686): 3e-4 β ~6e-6
|
| 577 |
+
|
| 578 |
+
### Final Training Statistics
|
| 579 |
+
```
|
| 580 |
+
Total Runtime: 7,844 seconds (2h 10m 44s)
|
| 581 |
+
Samples Processed: 150,000 (50K Γ 3 epochs)
|
| 582 |
+
Training Throughput: 19.12 samples/second
|
| 583 |
+
Steps per Second: 0.597
|
| 584 |
+
Average Step Time: 1.67 seconds
|
| 585 |
+
GPU Utilization: ~90-95%
|
| 586 |
+
Peak Memory Usage: ~8.5 GB (GPU)
|
| 587 |
+
```
|
| 588 |
+
|
| 589 |
+
## Performance Benchmarks
|
| 590 |
+
|
| 591 |
+
### Perplexity Comparison
|
| 592 |
+
|
| 593 |
+
| Model | Parameters | Perplexity | Notes |
|
| 594 |
+
|-------|-----------|------------|-------|
|
| 595 |
+
| **This Model** | **48M** | **98.86** | Mixture of Recursion, 3 epochs |
|
| 596 |
+
| Baseline GPT-2 Small | 117M | ~29 | Official OpenAI |
|
| 597 |
+
| TinyLlama | 1.1B | ~10 | Much larger |
|
| 598 |
+
| Random Baseline | - | ~50,000 | Theoretical worst case |
|
| 599 |
+
|
| 600 |
+
**Context:** For a 48M parameter model trained on 50K samples, perplexity of 98.86 is competitive and indicates good learning.
|
| 601 |
+
|
| 602 |
+
### Generation Quality Assessment
|
| 603 |
+
|
| 604 |
+
**Strengths:**
|
| 605 |
+
- β
Grammatically correct output
|
| 606 |
+
- β
Coherent short-form text (1-3 sentences)
|
| 607 |
+
- β
Diverse vocabulary usage
|
| 608 |
+
- β
Proper punctuation and capitalization
|
| 609 |
+
- β
Context maintenance in short passages
|
| 610 |
+
|
| 611 |
+
**Weaknesses:**
|
| 612 |
+
- β οΈ Occasional repetition in long generations
|
| 613 |
+
- β οΈ Limited factual knowledge (small training set)
|
| 614 |
+
- β οΈ May generate generic/vague statements
|
| 615 |
+
- β οΈ Struggles with very technical topics
|
| 616 |
+
- β οΈ Short context window (384 tokens)
|
| 617 |
+
|
| 618 |
+
## Limitations & Considerations
|
| 619 |
+
|
| 620 |
+
### Technical Limitations
|
| 621 |
+
|
| 622 |
+
1. **Context Window**: 512 token maximum (vs 2048+ for modern models)
|
| 623 |
+
2. **Model Size**: 48M parameters - limited capacity vs billions-scale models
|
| 624 |
+
3. **Training Data**: 50K samples - relatively small dataset
|
| 625 |
+
4. **Single Language**: Primarily English (GPT-2 tokenizer bias)
|
| 626 |
+
5. **Domain Coverage**: Limited by training data diversity
|
| 627 |
+
6. **Reasoning**: Basic completion, limited multi-step reasoning
|
| 628 |
+
|
| 629 |
+
### Known Issues
|
| 630 |
|
| 631 |
+
- **Repetition**: May repeat phrases after 100+ tokens
|
| 632 |
+
- **Factual Errors**: Small knowledge base, may hallucinate facts
|
| 633 |
+
- **Consistency**: Long-form coherence degrades over 200+ tokens
|
| 634 |
+
- **Technical Domains**: Struggles with highly specialized topics
|
| 635 |
+
- **Math/Code**: Limited capability for formal reasoning
|
| 636 |
+
- **Context Retention**: May lose track of earlier context in long sequences
|
| 637 |
+
|
| 638 |
+
### Generation Artifacts
|
| 639 |
+
|
| 640 |
+
- Occasional incomplete sentences at max_tokens boundary
|
| 641 |
+
- May generate run-on sentences without proper punctuation
|
| 642 |
+
- Sometimes produces generic filler phrases
|
| 643 |
+
- Temperature tuning needed for optimal quality
|
| 644 |
+
|
| 645 |
+
## Ethical Considerations
|
| 646 |
+
|
| 647 |
+
### Bias & Fairness
|
| 648 |
+
|
| 649 |
+
This model may exhibit biases inherited from training data:
|
| 650 |
+
|
| 651 |
+
**Potential Biases:**
|
| 652 |
+
- Geographic: Overrepresentation of Western/English content
|
| 653 |
+
- Demographic: Gender, age, cultural biases from web text
|
| 654 |
+
- Temporal: Training data reflects content up to 2024
|
| 655 |
+
- Topic: Educational content may skew certain perspectives
|
| 656 |
+
|
| 657 |
+
**Mitigation Strategies:**
|
| 658 |
+
- Diverse training data sources (FineWeb-Edu, Cosmopedia, OpenWebText)
|
| 659 |
+
- No explicit harmful content filtering (relies on source quality)
|
| 660 |
+
- Users should validate outputs for fairness-critical applications
|
| 661 |
+
|
| 662 |
+
### Responsible Use
|
| 663 |
+
|
| 664 |
+
**β
Recommended:**
|
| 665 |
+
- Educational demonstrations
|
| 666 |
+
- Research on adaptive computation
|
| 667 |
+
- Creative writing assistance (with human review)
|
| 668 |
+
- Prototyping and experimentation
|
| 669 |
+
- Learning about language models
|
| 670 |
+
|
| 671 |
+
**β Not Recommended:**
|
| 672 |
+
- Medical, legal, or financial advice
|
| 673 |
+
- Generating authoritative content without verification
|
| 674 |
+
- Creating misleading or deceptive content
|
| 675 |
+
- Applications requiring high factual accuracy
|
| 676 |
+
- Automated content moderation or decision-making
|
| 677 |
+
- Safety-critical systems
|
| 678 |
+
|
| 679 |
+
### Environmental Impact
|
| 680 |
+
|
| 681 |
+
**Training Carbon Footprint (Estimated):**
|
| 682 |
+
- GPU Hours: ~2.2 hours on T4
|
| 683 |
+
- Estimated COβ: ~0.15 kg (assuming 0.068 kg/GPU-hour for T4)
|
| 684 |
+
- Relatively low impact due to small model size and short training
|
| 685 |
+
|
| 686 |
+
## Comparison with Similar Models
|
| 687 |
+
|
| 688 |
+
| Model | Params | Perplexity | Architecture | Special Features |
|
| 689 |
+
|-------|--------|------------|--------------|-----------------|
|
| 690 |
+
| **This Model** | **48M** | **98.86** | **Mixture of Recursion** | **Adaptive depth, RoPE** |
|
| 691 |
+
| GPT-2 Small | 117M | ~29 | Standard Transformer | OpenAI, well-tested |
|
| 692 |
+
| DistilGPT-2 | 82M | ~35 | Distilled GPT-2 | Faster inference |
|
| 693 |
+
| GPT-Neo 125M | 125M | ~25 | Mesh Transformer | More data, larger |
|
| 694 |
+
|
| 695 |
+
**Trade-offs:**
|
| 696 |
+
- β
Smaller size: Better for deployment
|
| 697 |
+
- β
Novel architecture: Research value
|
| 698 |
+
- β
Adaptive computation: Potentially more efficient
|
| 699 |
+
- β Higher perplexity: Less predictive accuracy
|
| 700 |
+
- β Less training: Smaller knowledge base
|
| 701 |
+
|
| 702 |
+
## Model Card & Transparency
|
| 703 |
+
|
| 704 |
+
### Intended Use
|
| 705 |
+
|
| 706 |
+
**Primary Use Cases:**
|
| 707 |
+
- π **Education**: Teaching language model concepts
|
| 708 |
+
- π¬ **Research**: Experimenting with adaptive computation
|
| 709 |
+
- π οΈ **Prototyping**: Testing LM-based applications
|
| 710 |
+
- π‘ **Learning**: Understanding transformer architectures
|
| 711 |
+
|
| 712 |
+
**Out-of-Scope Uses:**
|
| 713 |
+
- Production chatbots without oversight
|
| 714 |
+
- Generating factual content for publication
|
| 715 |
+
- Automated decision systems
|
| 716 |
+
- Content requiring domain expertise
|
| 717 |
+
|
| 718 |
+
### Evaluation Methodology
|
| 719 |
+
|
| 720 |
+
**Metrics:**
|
| 721 |
+
- Primary: Perplexity on validation set
|
| 722 |
+
- Secondary: Training loss, gradient norms
|
| 723 |
+
- Qualitative: Manual review of generations
|
| 724 |
+
|
| 725 |
+
**Evaluation Data:**
|
| 726 |
+
- 1,000 samples from FineWeb-Edu (held-out)
|
| 727 |
+
- Same preprocessing as training data
|
| 728 |
+
- Evaluated every 1,000 training steps
|