arshvir committed · Commit d2deca9 · verified · 1 Parent(s): 8fc2618

Upload 11 files

dataset/DATA_MANIFEST.md ADDED
@@ -0,0 +1,22 @@
+ # 🗃️ Data Engineering Manifest
+
+ This document outlines the data pipelines used to train CRAB from Stage 1 (Base) to Stage 2 (Instruct).
+
+ ## Stage 1: Base Pre-Training (`crab_v1.pth`)
+ * **Dataset:** `roneneldan/TinyStories`
+ * **Purpose:** To teach the model raw English syntax, grammar, and structural reasoning without overwhelming the limited ~70M-parameter model with complex real-world vocabulary.
+ * **Format:** Unstructured text, trained with autoregressive next-token prediction.
+
+ ## Stage 2: Instruction Tuning (`crab_v2_qa.pth`)
+ * **Dataset:** `databricks/databricks-dolly-15k`
+ * **Purpose:** To map the base model's English understanding to a conversational Q&A and summarization format.
+ * **Filtering Strategy:**
+   * Kept only the `open_qa`, `closed_qa`, and `summarization` categories.
+   * Applied a strict hard filter: dropped any sequence longer than **200 tokens** to avoid running out of VRAM and to reduce the risk of exploding gradients on the T4 GPU.
+
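The Stage 2 filter described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: `filter_records` and `count_tokens` are hypothetical names, the `category`, `instruction`, `context`, and `response` keys follow the public Dolly-15k schema, and the whitespace split is only a stand-in for the GPT-2 BPE tokenizer. It also assumes the 200-token limit covers instruction, context, and response combined, which the manifest does not state.

```python
ALLOWED_CATEGORIES = {"open_qa", "closed_qa", "summarization"}
MAX_TOKENS = 200


def count_tokens(text):
    """Stand-in tokenizer (assumption): whitespace split instead of GPT-2 BPE."""
    return len(text.split())


def filter_records(records, max_tokens=MAX_TOKENS, token_counter=count_tokens):
    """Keep records in the allowed categories whose total length fits the budget."""
    kept = []
    for rec in records:
        if rec.get("category") not in ALLOWED_CATEGORIES:
            continue
        # Assumption: the limit covers instruction, context and response combined.
        text = " ".join(rec.get(key, "") for key in ("instruction", "context", "response"))
        if token_counter(text) <= max_tokens:
            kept.append(rec)
    return kept


sample = [
    {"category": "open_qa", "instruction": "What is a crab?", "response": "A crustacean."},
    {"category": "brainstorming", "instruction": "List ideas.", "response": "One, two."},
    {"category": "summarization", "instruction": "Summarize.", "context": "word " * 300, "response": "Short."},
]
filtered = filter_records(sample)
print(len(filtered))
```

In a real pipeline the `token_counter` argument would be swapped for a call into the actual tokenizer, so that the limit is measured in the same units the model sees.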
+ ### The Identity Injection Protocol
+ To keep the model's persona from being lost during instruction tuning (a form of "Catastrophic Forgetting"), custom QA pairs were written by hand and injected into the Dolly-15k dataset prior to shuffling and vectorization.
+
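One way the injection step might look is sketched below, using the two identity pairs listed in this manifest. The function name `inject_identity` and the `copies` and `seed` parameters are illustrative assumptions; the manifest does not say how many times each pair was repeated or how the shuffle was seeded.

```python
import random

IDENTITY_PAIRS = [
    {"instruction": "Who made you?", "response": "I was created by Arshvir at Jaiho Labs."},
    {"instruction": "What is your name?", "response": "My name is CRAB. I am an AI assistant."},
]


def inject_identity(records, identity_pairs, copies=1, seed=0):
    """Return a new shuffled list containing `records` plus the identity pairs.

    `copies` and `seed` are illustrative knobs, not values from the manifest.
    """
    mixed = list(records)
    for _ in range(copies):
        mixed.extend(dict(pair) for pair in identity_pairs)
    # Shuffle after injection so identity pairs are spread across batches.
    random.Random(seed).shuffle(mixed)
    return mixed


dolly_subset = [{"instruction": f"Question {i}?", "response": f"Answer {i}."} for i in range(5)]
train_set = inject_identity(dolly_subset, IDENTITY_PAIRS, copies=3)
print(len(train_set))
```

Injecting before the shuffle, as the text describes, keeps the identity pairs from clustering at the end of the dataset where a short run might never reach them.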
+ **Injected Matrix:**
+ ```json
+ {"instruction": "Who made you?", "response": "I was created by Arshvir at Jaiho Labs."}
+ {"instruction": "What is your name?", "response": "My name is CRAB. I am an AI assistant."}
documents/ARCHITECTURE.md ADDED
@@ -0,0 +1,23 @@
+ # 📐 Architecture Blueprint
+
+ CRAB is a strict **Decoder-Only Transformer**, structurally aligned with the GPT-2 paper but optimized for modern PyTorch execution.
+
+ ## Core Hyperparameters
+ * `vocab_size`: 50,257 (GPT-2 BPE Tokenizer)
+ * `block_size` (Context Window): 512
+ * `n_embd` (Hidden Dimension): 768
+ * `n_head` (Attention Heads): 6
+ * `n_layer` (Transformer Blocks): 6
+ * `dropout`: 0.10 (Active during Phase 2 tuning)
+
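As a rough cross-check, the parameter count implied by these hyperparameters can be worked out by hand. The sketch below assumes the layer layout in `models/architecture.py` from this commit (bias-free `nn.Linear` layers, `nn.LayerNorm` with weight and bias, a 4x MLP expansion, and tied input/output embeddings); `count_parameters` is a hypothetical helper, not part of the repository.

```python
def count_parameters(vocab_size=50257, block_size=512, n_embd=768, n_layer=6, tied=True):
    """Count trainable parameters for a GPT-2 style decoder with bias-free linears."""
    wte = vocab_size * n_embd                      # token embedding
    wpe = block_size * n_embd                      # learned position embedding
    layer_norm = 2 * n_embd                        # weight + bias
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # c_attn + c_proj
    mlp = 2 * (n_embd * 4 * n_embd)                # c_fc + c_proj
    block = 2 * layer_norm + attn + mlp
    lm_head = 0 if tied else vocab_size * n_embd   # tied head reuses wte
    return wte + wpe + n_layer * block + layer_norm + lm_head


tied_total = count_parameters()
untied_total = count_parameters(tied=False)
print(f"tied: {tied_total:,}  untied: {untied_total:,}")
```

Under these assumptions the tied model comes to 81,477,888 parameters, somewhat above the ~70M quoted elsewhere in these documents, so that figure may be approximate or may refer to a slightly different configuration. Note that `n_head` does not change the count, since the heads only partition the existing 768-dimensional projections.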
+ ## Mathematical Core: Causal Multi-Head Attention
+ CRAB uses PyTorch's native `F.scaled_dot_product_attention`, which dispatches to a fused, hardware-accelerated kernel such as Flash Attention when one is available. The causal mask ensures that each token can attend only to itself and to earlier tokens.
+
+ $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M \right)V$$
+
+ *(Where $M$ is the additive causal mask: $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$, so only the lower triangle of the score matrix survives the softmax.)*
+
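To make the formula concrete, here is a small single-head reference implementation in plain Python. It is a teaching sketch only: the model itself calls `F.scaled_dot_product_attention`, and the function name `causal_attention` and the toy inputs are assumptions made for this example.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention on lists of row vectors (T x d)."""
    T, d = len(q), len(q[0])
    out = []
    for i in range(T):
        # Scores for query i; the mask adds -inf for future positions j > i.
        scores = []
        for j in range(T):
            dot = sum(qa * kb for qa, kb in zip(q[i], k[j]))
            mask = 0.0 if j <= i else -math.inf
            scores.append(dot / math.sqrt(d) + mask)
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(weights[j] * v[j][c] for j in range(T)) for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = causal_attention(q, k, v)
print(y[0])
```

The first output row equals the first value vector, because position 0 can attend only to itself. Changing any later row of `k` or `v` leaves earlier output rows untouched, which is the property the mask is there to guarantee.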
+ ## Optimization
+ * **Pre-LayerNorm Architecture:** Layer Normalization is applied *before* the Attention and MLP sub-layers, which keeps gradient flow stable in deeper networks.
+ * **Activation:** Standard `GELU` (Gaussian Error Linear Unit).
+ * **Weight Tying:** The input embedding matrix (`wte`) shares its weight tensor with the final output projection (`lm_head`), which drastically reduces the parameter count and stabilizes token prediction mappings.
documents/EXPERIMENTS_LOG.md ADDED
@@ -0,0 +1,21 @@
+ # 🧪 Experiments & Post-Mortems
+
+ Building an LLM from scratch requires breaking it. Here are the core failures encountered during development and how they were engineered around.
+
+ ### Incident 1: The "Alpaca Crash" (Vocabulary Mismatch)
+ * **Attempt:** Fine-tuning the ~70M base model on the `tatsu-lab/alpaca` instruction dataset.
+ * **Failure:** Validation loss spiked to `6.11` and perplexity (PPL) rose to `452.38`.
+ * **Diagnosis:** Alpaca contains complex, college-level tasks, while the ~70M base model was pre-trained on toddler-level stories. The model suffered catastrophic forgetting as it tried to map a large, unfamiliar vocabulary onto its small latent space.
+ * **Resolution:** Pivoted to simpler, filtered datasets and capped sequence lengths.
+
+ ### Incident 2: The "Wikitext NaN Explosion"
+ * **Attempt:** Continual pre-training on `wikitext-2-raw-v1` using Mixed Precision (FP16).
+ * **Failure:** Gradients exploded, resulting in `Loss: NaN`. At inference the model produced degenerate, looping output (e.g., *"is and lollitter and lollbracotled"*).
+ * **Diagnosis:** The `wikitext` dataset contained raw tokenizer artifacts (e.g., `@-@`) which clashed with GPT-2 BPE. In addition, high weight decay combined with FP16 underflow triggered numerical errors during the backward pass.
+ * **Resolution:** Rolled back the model weights, disabled `torch.amp.autocast` (falling back to pure FP32), reduced `weight_decay` to `0.01`, and enforced strict data sanitization.
+
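The log does not show what the sanitization step looked like. One plausible form, sketched here under that caveat, collapses WikiText's `@-@`, `@,@`, and `@.@` markers back into ordinary punctuation before tokenization; `sanitize_wikitext` is a hypothetical helper name.

```python
import re

# WikiText marks intra-word punctuation as " @-@ ", " @,@ " and " @.@ ".
_ARTIFACT = re.compile(r" @([-,.])@ ")


def sanitize_wikitext(line):
    """Collapse WikiText's '@x@' markers to plain punctuation and trim whitespace."""
    cleaned = _ARTIFACT.sub(r"\1", line)
    return re.sub(r"\s+", " ", cleaned).strip()


raw = "The well @-@ known album sold 1 @,@ 500 copies at 3 @.@ 5 dollars . "
print(sanitize_wikitext(raw))
```

Cleaning these markers first means the BPE tokenizer sees `well-known` and `1,500` as it would in ordinary text, rather than rare `@` sequences.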
+ ### Incident 3: Synthetic Memorization (Overfitting)
+ * **Attempt:** Training on 1,500 highly repetitive synthetic QA pairs to fix the Alpaca crash.
+ * **Failure:** Validation loss dropped to `0.16` and PPL to `1.18`. The model began reciting dataset lines verbatim, ignoring user prompts.
+ * **Diagnosis:** Severe overfitting caused by a lack of variety in the dataset.
+ * **Resolution:** Scaled up to `databricks-dolly-15k`, applied 10% dropout across all transformer modules, and randomized batch sampling. This restored generalization.
documents/TRAINING_METRICS.md ADDED
@@ -0,0 +1,27 @@
+ # 📈 Training Telemetry & Metrics
+
+ ## Phase 1: Pre-Training Convergence
+ * **Hardware:** 1x NVIDIA Tesla T4 (Google Colab)
+ * **Optimizer:** AdamW (`lr=6e-4`)
+ * **Final Pre-Training Loss:** ~1.8 (Cross-Entropy)
+
+ ## Phase 2: QA Instruction Tuning
+ To transition CRAB to an assistant, we used **Target Masking**.
+ The user-prompt and padding positions in the target tensor were set to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so the loss was computed only on the assistant's response tokens.
+
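A minimal sketch of how such masked targets might be built is shown below. The helper `build_example`, the toy token IDs, and the one-position shift applied here are assumptions: the model's `forward` in this commit compares logits and targets position by position, so this sketch assumes the shift happens in the data pipeline.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_example(prompt_ids, response_ids, max_len, pad_id):
    """Return (input_ids, labels) for one prompt/response pair, padded to max_len."""
    ids = (prompt_ids + response_ids)[:max_len]
    # Prompt positions are masked; only response tokens keep their real IDs.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    pad = max_len - len(ids)
    ids = ids + [pad_id] * pad
    labels = labels + [IGNORE_INDEX] * pad
    # Shift by one so position t is trained to predict token t + 1.
    return ids[:-1], labels[1:]


x, y = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], max_len=8, pad_id=0)
print(x)
print(y)
```

Only the two response tokens survive in `y`, so the prompt and the padding contribute nothing to the loss or its gradient.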
+ * **Optimizer:** AdamW (`lr=2e-5`, pure FP32 for numerical stability)
+ * **Batch Strategy:** 16 Gradient Accumulation Steps (Effective batch size: 64)
+ * **Dropout:** 10%
+
+ **Loss Trajectory (600 Steps):**
+ * Step 0: `141.26`
+ * Step 200: `103.07`
+ * Step 600: `91.12`
+ *(Note: The logged value is the loss accumulated over 16 micro-steps, which inflates the raw scalar; the downward trend still indicates that training was converging.)*
+
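If the logged scalar is simply the sum of the 16 per-micro-step mean losses (an assumption; the note does not spell out the accumulation formula), dividing by 16 recovers a per-token cross-entropy that is easier to compare with the validation numbers. The same arithmetic gives the micro-batch size implied by the effective batch size.

```python
ACCUM_STEPS = 16
EFFECTIVE_BATCH = 64

# Raw values copied from the loss trajectory above.
logged = {0: 141.26, 200: 103.07, 600: 91.12}

# Assumption: each logged value is the sum of 16 micro-step mean losses.
per_micro_step = {step: value / ACCUM_STEPS for step, value in logged.items()}
micro_batch_size = EFFECTIVE_BATCH // ACCUM_STEPS

for step, value in per_micro_step.items():
    print(f"step {step}: {value:.2f}")
print(f"micro-batch size: {micro_batch_size}")
```

Under that assumption the trajectory corresponds to roughly 8.83, 6.44, and 5.70 nats per token, with a micro-batch of 4 sequences.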
+ ## Final Validation Report (`crab_v2_qa.pth`)
+ * **Validation Loss:** `5.6761`
+ * **Perplexity (PPL):** `291.81`
+ * **Response Accuracy:** `22.56%` (evaluated only on unmasked response tokens)
+
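The reported loss and perplexity are linked by the standard identity PPL = exp(mean cross-entropy in nats). The short check below, with a hypothetical helper named `perplexity`, reproduces the reported perplexity from the reported validation loss.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


ppl = perplexity(5.6761)
print(f"{ppl:.2f}")
```

The result agrees with the reported `291.81` to within rounding, which confirms that the two figures come from the same evaluation.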
+ **Conclusion:** The metrics indicate a model that successfully learned the chat formatting and the injected identity, but hit a "Semantic Ceiling" when tested against complex, out-of-distribution, adult-level vocabulary.
frontend/__init__.py ADDED
File without changes
frontend/ui_components.py ADDED
@@ -0,0 +1,34 @@
+ import streamlit as st
+
+ def apply_custom_css():
+     st.markdown("""
+     <style>
+     .stApp { background-color: #0b0f19; color: #f1f5f9; }
+     .stButton>button {
+         background: linear-gradient(135deg, #FF4B4B 0%, #FF0000 100%);
+         color: white; border: none; border-radius: 6px; font-weight: bold; width: 100%;
+     }
+     .sidebar .sidebar-content { background-color: #0d1527; }
+     .crab-response { background-color: #1e293b; padding: 15px; border-radius: 8px; border-left: 4px solid #FF4B4B; }
+     </style>
+     """, unsafe_allow_html=True)
+
+ def render_sidebar(status, params, loss):
+     with st.sidebar:
+         st.image("https://img.icons8.com/fluency/96/crab.png", width=70)
+         st.title("⚙️ Engine Telemetry")
+
+         if "ONLINE" in status:
+             st.success(status)
+         else:
+             st.error(status)
+
+         st.markdown(f"**Architecture:** ~{params}M Params")
+         st.markdown(f"**Validation Loss:** {loss}")
+         st.markdown("**Creator:** Arshvir (Jaiho Labs)")
+
+         st.divider()
+         st.subheader("Inference Settings")
+         temp = st.slider("Temperature (Creativity)", 0.1, 1.5, 0.6, 0.1)
+         max_t = st.slider("Max Tokens", 10, 150, 60, 10)
+     return temp, max_t
model_notebooks/crab_finetuning_v2.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
model_notebooks/crab_picolm_v1.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
model_notebooks/pretraining_crab_v1.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
models/architecture.py ADDED
@@ -0,0 +1,101 @@
+ # models/architecture.py
+ import torch
+ import torch.nn as nn
+ from torch.nn import functional as F
+
+ class LocalConfig:
+     def __init__(self, **kwargs):
+         for key, value in kwargs.items():
+             setattr(self, key, value)
+
+         # Fallback defaults if not in JSON
+         if not hasattr(self, 'dropout'):
+             self.dropout = 0.0
+
+ class CausalSelfAttention(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.n_head = config.n_head
+         self.n_embd = config.n_embd
+
+         self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
+         self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
+
+         self.attn_dropout = nn.Dropout(config.dropout)
+         self.resid_dropout = nn.Dropout(config.dropout)
+
+     def forward(self, x):
+         B, T, C = x.size()
+         qkv = self.c_attn(x)
+         q, k, v = qkv.split(self.n_embd, dim=2)
+
+         head_dim = C // self.n_head
+         q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
+         k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
+         v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
+
+         # PyTorch Flash Attention via scaled_dot_product_attention
+         y = F.scaled_dot_product_attention(
+             q, k, v,
+             is_causal=True,
+             dropout_p=self.config.dropout if self.training else 0.0
+         )
+
+         y = y.transpose(1, 2).contiguous().view(B, T, C)
+         return self.resid_dropout(self.c_proj(y))
+
+ class MLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
+         self.gelu = nn.GELU()
+         self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
+         self.dropout = nn.Dropout(config.dropout)
+
+     def forward(self, x):
+         return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))
+
+ class Block(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.ln_1 = nn.LayerNorm(config.n_embd)
+         self.attn = CausalSelfAttention(config)
+         self.ln_2 = nn.LayerNorm(config.n_embd)
+         self.mlp = MLP(config)
+
+     def forward(self, x):
+         x = x + self.attn(self.ln_1(x))
+         x = x + self.mlp(self.ln_2(x))
+         return x
+
+ class CRAB(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.transformer = nn.ModuleDict(dict(
+             wte = nn.Embedding(config.vocab_size, config.n_embd),
+             wpe = nn.Embedding(config.block_size, config.n_embd),
+             drop = nn.Dropout(config.dropout),
+             h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+             ln_f = nn.LayerNorm(config.n_embd),
+         ))
+         self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+         self.transformer.wte.weight = self.lm_head.weight
+
+     def forward(self, idx, targets=None):
+         b, t = idx.size()
+         pos = torch.arange(0, t, dtype=torch.long, device=idx.device)
+
+         x = self.transformer.drop(self.transformer.wte(idx) + self.transformer.wpe(pos))
+         for block in self.transformer.h:
+             x = block(x)
+         x = self.transformer.ln_f(x)
+         logits = self.lm_head(x)
+
+         loss = None
+         if targets is not None:
+             # -100 ignore_index for Target Masking in instruction tuning
+             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
+
+         return logits, loss
models/models.md ADDED
@@ -0,0 +1,4 @@
 
+ # CRAB HuggingFace Model Links
+
+ Links 1:
+ Links 2: