Nebulixlabs committed · Commit 63e2c6a · verified · 1 Parent(s): 4d9b678

Update README.md

Files changed (1)
  1. README.md +186 -1
README.md CHANGED
@@ -1,3 +1,188 @@
  ---
- license: apache-2.0
  ---
---
license: mit
language:
- en
tags:
- custom-architecture
- pytorch
- scratch-model
- nanocoder
---

# 🚀 Nebulixlabs/Nanocoder-Base

**Nanocoder-Base** is a custom-built, ultra-lightweight autoregressive language model trained from scratch. With approximately **19.2 million parameters**, it is designed to be efficient, experimental, and able to run on severely resource-constrained hardware (including edge devices and a single standard GPU).

It was built specifically to model basic English language structure and the foundational syntax of programming languages such as Python and JavaScript.

## 📊 Model Details

* **Developer:** Nebulixlabs
* **Model Type:** Custom Autoregressive Decoder-Only Transformer
* **Parameter Count:** 19,231,488 (~19.2M)
* **Language(s):** English, Python, JavaScript
* **License:** MIT

### Architecture Specifications
| Component | Specification |
| :--- | :--- |
| **Layers (Transformer Blocks)** | 8 |
| **Hidden Dimension (d_model)** | 256 |
| **Attention Heads** | 8 (32 dimensions per head) |
| **Context Window (MAX_SEQ_LEN)** | 256 tokens |
| **Vocabulary Size** | 50,257 (Standard GPT-2 Tokenizer) |

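The stated parameter count can be checked by hand from the table. The sketch below is plain arithmetic, not the model itself. It assumes the layer layout used in this README's demo script: bias-free linear layers, a 4x MLP expansion, learned positional embeddings, LayerNorm with weight and bias, and an output head that shares the token embedding matrix. Under those assumptions the tied total comes out to 19,231,488, and the untied figure shows what a separate output head would cost.

```python
# Architecture constants from the specification table
VOCAB_SIZE = 50257
MAX_SEQ_LEN = 256
EMBED_DIM = 256
NUM_LAYERS = 8

# Embeddings
token_emb = VOCAB_SIZE * EMBED_DIM   # also serves as the tied output head
pos_emb = MAX_SEQ_LEN * EMBED_DIM    # learned positional embedding

# One transformer block (linear layers assumed bias-free)
layer_norm = 2 * EMBED_DIM                                      # weight + bias
attention = EMBED_DIM * 3 * EMBED_DIM + EMBED_DIM * EMBED_DIM   # QKV + output projection
mlp = 2 * (EMBED_DIM * 4 * EMBED_DIM)                           # up- and down-projection
block = 2 * layer_norm + attention + mlp

# All blocks plus the final LayerNorm
total_tied = token_emb + pos_emb + NUM_LAYERS * block + layer_norm
# Hypothetical variant where lm_head has its own weight matrix
total_untied = total_tied + token_emb

print(f"tied:   {total_tied:,}")
print(f"untied: {total_untied:,}")
```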
---

## ⚙️ How It Works (Under the Hood)

Nanocoder is not a standard Hugging Face `transformers` class; it is a raw, custom PyTorch implementation optimized for speed and memory efficiency.

1. **Flash Attention Integration:** Instead of hand-written multi-head attention math, Nanocoder calls PyTorch 2.0's native `F.scaled_dot_product_attention`, which can dispatch to fused kernels such as FlashAttention when the hardware and inputs support them. When a fused kernel is used, the full attention matrix is never materialized, which reduces VRAM usage and speeds up both training and inference.
2. **Weight Tying:** The embedding layer (`token_emb`) and the final output layer (`lm_head`) share the same weight matrix. This saves about 12.9 million parameters (50,257 × 256) and lets the input and output token representations be learned jointly.
3. **Pre-Layer Normalization:** To keep gradients stable during training, LayerNorm is applied *before* the self-attention and feed-forward sublayers rather than after them.
4. **Compute-Optimal Scaling:** The model was trained on roughly 15 tokens per parameter (~292.5 million tokens), a budget chosen to get the most out of its small parameter count without overfitting.

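For reference, the per-head computation behind `F.scaled_dot_product_attention(q, k, v, is_causal=True)` can be written out in plain Python. This is a sketch of the math only: one head, no batching, no dropout. It says nothing about how PyTorch's fused kernels are implemented, and `causal_attention` is a hypothetical name used for illustration. Each position attends only to itself and earlier positions, with scores scaled by `1 / sqrt(head_dim)`.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal scaled dot-product attention on nested lists.

    q, k, v are lists of T vectors (lists of floats); q and k share a width.
    """
    seq_len, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(seq_len):
        # Causal mask: position i only sees positions 0..i
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        # Weighted sum of the visible value vectors
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = causal_attention(q, k, v)
print(out[0])  # the first token can only attend to itself: [1.0, 2.0]
```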
---

## 🎯 Capabilities & Limitations

**What Nanocoder is good at:**
* **Syntax Recognition:** It picks up the basic surface structure of code (e.g., Python indentation, function definitions `def ... :`, and basic loops).
* **Pattern Completion:** It can generate short sequences of text or continue a simple coding prompt.
* **Educational Prototyping:** It is a good foundational model for students and researchers who want to learn how LLMs work, how to write custom PyTorch architectures, and how to run fine-tuning pipelines locally without massive GPU clusters.

**What Nanocoder is NOT good at:**
* With only about 19.2M parameters (compared to billions in Llama or GPT), it hits a strict "Capacity Wall."
* It cannot carry out complex mathematical reasoning, remember long conversational contexts, or write production-ready software.
* It will hallucinate when asked complex reasoning questions.

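One practical consequence of the 256-token context window: the learned position embedding only covers positions 0 to 255, so longer inputs have to be shortened before they reach the model. A minimal sketch, assuming token ids are held in a plain Python list; `clip_to_context` is a hypothetical helper, not part of the released code, and keeping the most recent tokens is only one possible policy.

```python
MAX_SEQ_LEN = 256  # context window from the architecture table

def clip_to_context(token_ids, max_len=MAX_SEQ_LEN):
    """Return the most recent `max_len` token ids (hypothetical helper)."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return list(token_ids[-max_len:])

# Stand-in for a tokenized conversation that has grown past the window
history = list(range(300))
window = clip_to_context(history)

print(len(window), window[0], window[-1])  # prints: 256 44 299
```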
---

## 📚 Recommended Fine-Tuning Data

To make Nanocoder highly effective for your specific use case, you must fine-tune it on **high-quality, narrowly focused datasets**. Do not feed it broad knowledge; feed it specific formats.

* **For a Chatbot:** Use datasets like `OpenAssistant/oasst_top1_2023-08-25`. This will teach the model the `<|im_start|>user` and `<|im_start|>assistant` conversational tags.
* **For a Coding Assistant:** Use `sahil2801/CodeAlpaca-20k`. This teaches the model to read an `Instruction:` and generate the corresponding `Output:` code.
* **Format is Everything:** Ensure your fine-tuning data strictly follows a uniform template. Small models learn formats much faster than they learn raw facts.

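A uniform template is easiest to enforce by rendering every example through a single function. The sketch below shows one way to do that for the two formats mentioned above. The function names are hypothetical, and the exact layout is an assumption: the `<|im_end|>` closing tag and the newline placement follow the common ChatML convention rather than anything specified in this README, and the column names of the actual datasets may differ.

```python
def format_chat(user_msg, assistant_msg):
    """Render one user/assistant exchange with <|im_start|> role tags.

    The <|im_end|> terminator is an assumed convention (ChatML style).
    """
    return (
        f"<|im_start|>user\n{user_msg.strip()}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant_msg.strip()}<|im_end|>\n"
    )

def format_instruction(instruction, output):
    """Render one coding example as an Instruction/Output pair."""
    return f"Instruction:\n{instruction.strip()}\nOutput:\n{output.strip()}\n"

chat_sample = format_chat("What is a list?", "An ordered, mutable sequence.")
code_sample = format_instruction(
    "Write a function that adds two numbers.",
    "def add(a, b):\n    return a + b",
)

print(chat_sample)
print(code_sample)
```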
---

## 💻 Demo: How to Load and Fine-Tune Nanocoder

Because Nanocoder uses a custom architecture, you cannot load it using `AutoModelForCausalLM.from_pretrained()`. You must define the architecture in your script and load the state dictionary.

Here is a complete, self-contained PyTorch script that loads the model and runs a single fine-tuning step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ==========================================
# 1. DEFINE THE EXACT ARCHITECTURE
# ==========================================
VOCAB_SIZE = 50257
MAX_SEQ_LEN = 256
EMBED_DIM = 256
NUM_LAYERS = 8
NUM_HEADS = 8

class SelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # Fused projection that produces queries, keys, and values in one matmul
        self.c_attn = nn.Linear(EMBED_DIM, 3 * EMBED_DIM, bias=False)
        self.c_proj = nn.Linear(EMBED_DIM, EMBED_DIM, bias=False)
        self.n_head = NUM_HEADS
        self.head_dim = EMBED_DIM // NUM_HEADS
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(EMBED_DIM, dim=2)
        # Reshape to (B, n_head, T, head_dim) so attention runs per head
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        y = F.scaled_dot_product_attention(q, k, v, is_causal=True, dropout_p=0.1 if self.training else 0.0)
        # Merge the heads back to (B, T, C) before the output projection
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.c_proj(y))

class TransformerBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln_1 = nn.LayerNorm(EMBED_DIM)
        self.attn = SelfAttention()
        self.ln_2 = nn.LayerNorm(EMBED_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(EMBED_DIM, 4 * EMBED_DIM, bias=False),
            nn.GELU(),
            nn.Linear(4 * EMBED_DIM, EMBED_DIM, bias=False),
            nn.Dropout(0.1),
        )

    def forward(self, x):
        # Pre-LN: normalize before each sublayer, then add the residual
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class NanoCoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_SEQ_LEN, EMBED_DIM)
        self.blocks = nn.ModuleList([TransformerBlock() for _ in range(NUM_LAYERS)])
        self.ln_f = nn.LayerNorm(EMBED_DIM)
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE, bias=False)
        self.token_emb.weight = self.lm_head.weight  # Weight Tying

    def forward(self, idx, targets=None):
        B, T = idx.size()
        # pos_emb only covers MAX_SEQ_LEN positions
        assert T <= MAX_SEQ_LEN, f"sequence length {T} exceeds MAX_SEQ_LEN ({MAX_SEQ_LEN})"
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            # Flatten (B, T, vocab) and (B, T) for token-level cross-entropy
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss

# ==========================================
# 2. LOAD WEIGHTS SAFELY
# ==========================================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NanoCoder().to(device)

# Replace "nanocoder_base.pth" with your downloaded model path.
# weights_only=True restricts unpickling to tensors and other safe types.
state_dict = torch.load("nanocoder_base.pth", map_location=device, weights_only=True)

# Strip the leading 'module.' prefix that nn.DataParallel adds to keys, if present
prefix = "module."
clean_state_dict = {(k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state_dict.items()}
model.load_state_dict(clean_state_dict)

print("✅ Nebulixlabs/Nanocoder loaded successfully!")

# ==========================================
# 3. QUICK FINE-TUNING LOOP EXAMPLE
# ==========================================
# Set up the optimizer (use a lower learning rate for fine-tuning)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batch (replace with your tokenized DataLoader).
# One extra token is sampled per sequence so that the targets are the
# inputs shifted left by one position (next-token prediction).
# Shape of input and target: [Batch Size, Sequence Length]
dummy_tokens = torch.randint(0, VOCAB_SIZE, (4, MAX_SEQ_LEN + 1), device=device)
dummy_input = dummy_tokens[:, :-1].contiguous()
dummy_target = dummy_tokens[:, 1:].contiguous()

model.train()
optimizer.zero_grad()

# Forward pass
logits, loss = model(dummy_input, targets=dummy_target)

# Backward pass
loss.backward()
optimizer.step()

print(f"📉 Sample Training Step Complete. Loss: {loss.item():.4f}")
```