Upload 11 files
- dataset/DATA_MANIFEST.md +22 -0
- documents/ARCHITECTURE.md +23 -0
- documents/EXPERIMENTS_LOG.md +21 -0
- documents/TRAINING_METRICS.md +27 -0
- frontend/__init__.py +0 -0
- frontend/ui_components.py +34 -0
- model_notebooks/crab_finetuning_v2.ipynb +0 -0
- model_notebooks/crab_picolm_v1.ipynb +0 -0
- model_notebooks/pretraining_crab_v1.ipynb +0 -0
- models/architecture.py +101 -0
- models/models.md +4 -0
dataset/DATA_MANIFEST.md
ADDED
@@ -0,0 +1,22 @@
# 🗃️ Data Engineering Manifest

This document describes the data pipelines used to train CRAB from Stage 1 (Base) to Stage 2 (Instruct).

## Stage 1: Base Pre-Training (`crab_v1.pth`)
* **Dataset:** `roneneldan/TinyStories`
* **Purpose:** To teach the model raw English syntax, grammar, and structural reasoning without overwhelming the limited ~70M-parameter model with complex real-world vocabulary.
* **Format:** Unstructured autoregressive next-token prediction.

## Stage 2: Instruction Tuning (`crab_v2_qa.pth`)
* **Dataset:** `databricks/databricks-dolly-15k`
* **Purpose:** To map the base model's English understanding to a conversational Q&A and summarization format.
* **Filtering Strategy:**
  * Kept only the `open_qa`, `closed_qa`, and `summarization` categories.
  * Applied a hard length filter: dropped any sequence longer than **200 tokens** to prevent VRAM overflow and gradient explosions on the T4 GPU.

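The Stage 2 filter can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: `filter_dolly` and `whitespace_token_count` are hypothetical names, the whitespace counter stands in for the GPT-2 BPE tokenizer, and measuring length on the concatenated instruction, context, and response is an assumption.

```python
# Categories and length cap are taken from the manifest above.
KEPT_CATEGORIES = {"open_qa", "closed_qa", "summarization"}
MAX_TOKENS = 200


def whitespace_token_count(text):
    """Hypothetical stand-in for a real tokenizer (the project uses GPT-2 BPE)."""
    return len(text.split())


def filter_dolly(rows, count_tokens=whitespace_token_count, max_tokens=MAX_TOKENS):
    """Keep rows in the allowed categories whose combined text fits the cap."""
    kept = []
    for row in rows:
        if row["category"] not in KEPT_CATEGORIES:
            continue
        # Assumption: length is measured over instruction + context + response.
        parts = (row["instruction"], row.get("context", ""), row["response"])
        text = " ".join(part for part in parts if part)
        if count_tokens(text) > max_tokens:
            continue
        kept.append(row)
    return kept
```

Passing the real tokenizer's length function as `count_tokens` would make the cap apply to BPE tokens rather than whitespace-separated words.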
### The Identity Injection Protocol
To prevent "catastrophic forgetting" of the model's persona, custom QA pairs were written by hand and injected into the Dolly-15k dataset before shuffling and vectorization.

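A rough sketch of the injection step, using the two identity pairs recorded in this manifest. The function name `inject_identity` and the `copies` and `seed` parameters are hypothetical; the manifest does not say how many times each pair was repeated.

```python
import random

# The two identity pairs recorded in this manifest.
IDENTITY_PAIRS = [
    {"instruction": "Who made you?", "response": "I was created by Arshvir at Jaiho Labs."},
    {"instruction": "What is your name?", "response": "My name is CRAB. I am an AI assistant."},
]


def inject_identity(rows, identity_pairs, copies=1, seed=0):
    """Return a new shuffled list: the original rows plus `copies` of each identity pair."""
    mixed = list(rows)  # copy so the caller's list is left untouched
    for _ in range(copies):
        mixed.extend(dict(pair) for pair in identity_pairs)
    random.Random(seed).shuffle(mixed)  # seeded for reproducibility
    return mixed
```

Shuffling after injection spreads the identity pairs across batches instead of leaving them clustered at the end of the dataset.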
**Injected Matrix:**
```json
{"instruction": "Who made you?", "response": "I was created by Arshvir at Jaiho Labs."}
{"instruction": "What is your name?", "response": "My name is CRAB. I am an AI assistant."}
```
documents/ARCHITECTURE.md
ADDED
@@ -0,0 +1,23 @@
# 📐 Architecture Blueprint

CRAB is a strict **Decoder-Only Transformer**, structurally aligned with the GPT-2 paper but optimized for modern PyTorch execution.

## Core Hyperparameters
* `vocab_size`: 50,257 (GPT-2 BPE Tokenizer)
* `block_size` (Context Window): 512
* `n_embd` (Hidden Dimension): 768
* `n_head` (Attention Heads): 6 (head dimension 768 / 6 = 128)
* `n_layer` (Transformer Blocks): 6
* `dropout`: 0.10 (Active during Phase 2 tuning)

## Mathematical Core: Causal Multi-Head Attention
CRAB uses PyTorch's native `F.scaled_dot_product_attention`, which dispatches to a fused FlashAttention kernel when the hardware and inputs support it. The causal mask ensures that each token attends only to itself and to earlier tokens.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M \right)V$$

*(Where $M$ is the additive causal mask, $0$ on and below the diagonal and $-\infty$ above it, and $d_k$ is the per-head dimension.)*

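To make the role of the mask concrete, here is a small pure-Python sketch of single-head causal attention. It is only an illustration of the formula; the model itself calls `F.scaled_dot_product_attention`, and the function name `causal_attention` is hypothetical.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention over lists of row vectors (T rows each)."""
    T, d_k = len(q), len(q[0])
    out = []
    for i in range(T):
        # Scaled dot-product scores plus the additive mask M:
        # 0 for positions j <= i, -inf for future positions j > i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
            + (0.0 if j <= i else float("-inf"))
            for j in range(T)
        ]
        m = max(scores)  # finite, because position i itself is never masked
        exps = [math.exp(s - m) for s in scores]  # masked entries become 0.0
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out
```

Because masked scores contribute zero weight, changing a later row of `v` leaves every earlier output row unchanged, which is the property that autoregressive generation relies on.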
## Optimization
* **Pre-LayerNorm Architecture:** Layer Normalization is applied *before* the Attention and MLP blocks, which keeps gradient flow stable as depth increases.
* **Activation:** Standard `GELU` (Gaussian Error Linear Unit).
* **Weight Tying:** The input embedding matrix (`wte`) shares its weights with the final output projection (`lm_head`), which removes a second 50,257 × 768 matrix (about 38.6M parameters) and keeps the input and output token representations consistent.
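The effect of weight tying on model size can be checked with simple arithmetic. The sketch below assumes the layer shapes in `models/architecture.py` from this commit (bias-free linear layers, a 4x MLP expansion, LayerNorm with weight and bias) and the hyperparameters listed above; `count_parameters` is a hypothetical helper, not part of the repository.

```python
def count_parameters(vocab_size=50257, block_size=512, n_embd=768, n_layer=6, tied=True):
    """Tally trainable parameters for the architecture described above."""
    wte = vocab_size * n_embd                      # token embeddings
    wpe = block_size * n_embd                      # learned position embeddings
    layer_norm = 2 * n_embd                        # weight + bias
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # c_attn + c_proj, no biases
    mlp = 2 * (n_embd * 4 * n_embd)                # c_fc + c_proj, no biases
    block = attn + mlp + 2 * layer_norm
    lm_head = 0 if tied else vocab_size * n_embd   # tied head adds nothing
    return wte + wpe + n_layer * block + layer_norm + lm_head


tied_total = count_parameters(tied=True)
untied_total = count_parameters(tied=False)
saved = untied_total - tied_total
```

Under these assumptions the tied total comes to roughly 81.5M and tying saves about 38.6M. That is somewhat above the ~70M figure quoted elsewhere in these notes, which may be a rounded estimate or refer to a different configuration.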
documents/EXPERIMENTS_LOG.md
ADDED
@@ -0,0 +1,21 @@
# 🧪 Experiments & Post-Mortems

Building an LLM from scratch means breaking it along the way. These are the main failures encountered during development and how each was worked around.

### Incident 1: The "Alpaca Crash" (Vocabulary Mismatch)
* **Attempt:** Fine-tuning the ~70M base model on the `tatsu-lab/alpaca` instruction dataset.
* **Failure:** Validation loss spiked to `6.11` and perplexity (PPL) rose to `452.38`.
* **Diagnosis:** Alpaca contains complex, college-level tasks, while the ~70M base model was pre-trained on toddler-level stories. The model suffered catastrophic forgetting as it tried to fit a large amount of unfamiliar vocabulary into its small latent space.
* **Resolution:** Pivoted to simpler, filtered datasets and capped sequence lengths.

### Incident 2: The "Wikitext NaN Explosion"
* **Attempt:** Continual pre-training on `wikitext-2-raw-v1` using Mixed Precision (FP16).
* **Failure:** Gradients exploded, resulting in `Loss: NaN`. At inference, the model produced degenerate, looping output (e.g., *"is and lollitter and lollbracotled"*).
* **Diagnosis:** The `wikitext` dataset contains raw tokenization artifacts (e.g., `@-@`) that clash with GPT-2 BPE. In addition, high weight decay combined with FP16 underflow produced invalid values during backward passes.
* **Resolution:** Rolled back the model weights, disabled `torch.amp.autocast` (falling back to pure FP32), reduced `weight_decay` to `0.01`, and enforced strict data sanitization.

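The log does not show what "strict data sanitization" consisted of. As one plausible sketch, the WikiText separator artifacts (` @-@ `, ` @,@ `, ` @.@ `) can be collapsed back to plain punctuation before tokenization. The function names are hypothetical, and dropping heading lines and empty lines is an added assumption.

```python
import re

# WikiText writes intra-word punctuation as spaced markers, e.g. "well @-@ known".
_ARTIFACTS = {" @-@ ": "-", " @,@ ": ",", " @.@ ": "."}
# Section headings look like " = Title = " or " = = Subsection = = ".
_HEADING = re.compile(r"^\s*=+ .* =+\s*$")


def sanitize_wikitext_line(line):
    """Collapse the spaced markers back into normal punctuation."""
    for raw, clean in _ARTIFACTS.items():
        line = line.replace(raw, clean)
    return line.strip()


def sanitize_wikitext(lines):
    """Drop empty lines and headings (an assumption), then clean what remains."""
    cleaned = []
    for line in lines:
        if not line.strip() or _HEADING.match(line):
            continue
        cleaned.append(sanitize_wikitext_line(line))
    return cleaned
```

For example, `sanitize_wikitext_line("a well @-@ known 1 @,@ 000 @.@ 5 case")` returns `"a well-known 1,000.5 case"`.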
### Incident 3: Synthetic Memorization (Overfitting)
* **Attempt:** Training on 1,500 highly repetitive synthetic QA pairs to fix the Alpaca crash.
* **Failure:** Validation loss dropped to `0.16` and PPL to `1.18`. The model began reciting dataset lines verbatim, ignoring user prompts.
* **Diagnosis:** Severe overfitting caused by a lack of variety in the dataset.
* **Resolution:** Scaled up to `databricks-dolly-15k`, applied 10% dropout across all transformer modules, and randomized batch sampling, which restored generalization.
documents/TRAINING_METRICS.md
ADDED
@@ -0,0 +1,27 @@
# 📈 Training Telemetry & Metrics

## Phase 1: Pre-Training Convergence
* **Hardware:** 1x NVIDIA Tesla T4 (Google Colab)
* **Optimizer:** AdamW (`lr=6e-4`)
* **Final Pre-Training Loss:** ~1.8 (Cross-Entropy)

## Phase 2: QA Instruction Tuning
To transition CRAB to an assistant, we used **Target Masking**.
The user-prompt and padding positions in the target tensor were set to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so the loss was computed only on the assistant's response tokens.

* **Optimizer:** AdamW (`lr=2e-5`, pure FP32 for numerical stability)
* **Batch Strategy:** 16 gradient accumulation steps (effective batch size: 64, i.e. a micro-batch size of 4)
* **Dropout:** 10%

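Target masking can be illustrated without any framework. The sketch below builds one padded training example from token ID lists; `build_example`, `max_len`, and `pad_id` are hypothetical names, and doing the one-position shift during data preparation is an assumption (the model's `forward` compares logits and targets position by position).

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, response_ids, max_len, pad_id):
    """Return (inputs, targets) where only response tokens contribute to the loss."""
    ids = (list(prompt_ids) + list(response_ids))[:max_len]
    # Prompt positions are masked; response positions keep their token IDs.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_len]
    pad = max_len - len(ids)
    ids = ids + [pad_id] * pad
    labels = labels + [IGNORE_INDEX] * pad  # padding is masked as well
    # Shift by one so that position t is trained to predict token t + 1.
    return ids[:-1], labels[1:]
```

With a 3-token prompt, a 2-token response, and `max_len=7`, only the two response positions in the returned targets carry real token IDs; everything else is `-100`.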
**Loss Trajectory (600 Steps):**
* Step 0: `141.26`
* Step 200: `103.07`
* Step 600: `91.12`
*(Note: these values are the loss accumulated across the 16 micro-steps of each optimizer step, which is why the raw numbers are large; the steady downward trend indicates convergence.)*

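If the logged scalars are plain sums of 16 micro-batch mean losses (an interpretation of the note above, not something the log states outright), dividing by the accumulation count recovers a per-token cross-entropy that is comparable to the validation loss. The names below are illustrative.

```python
ACCUM_STEPS = 16

# Logged values from the loss trajectory above.
logged = {0: 141.26, 200: 103.07, 600: 91.12}


def per_micro_batch(total_loss, accum_steps=ACCUM_STEPS):
    """Average an accumulated loss back to a single micro-batch's scale."""
    return total_loss / accum_steps


normalized = {step: per_micro_batch(value) for step, value in logged.items()}
```

Under that assumption the trajectory is roughly 8.83, 6.44, and 5.70. A common convention is to divide each micro-batch loss by the accumulation count before calling `backward()`, so that both the logged value and the gradient scale match a single large batch.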
## Final Validation Report (`crab_v2_qa.pth`)
* **Validation Loss:** `5.6761`
* **Perplexity (PPL):** `291.81`
* **Response Accuracy:** `22.56%` (evaluated only on unmasked response tokens)

**Conclusion:** The metrics indicate a model that learned the chat formatting and the injected identity, but hit a "Semantic Ceiling" when tested on complex, out-of-distribution adult vocabulary.
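The reported perplexity follows directly from the validation loss, since perplexity is the exponential of the mean cross-entropy (in nats). A one-function check, with `perplexity` as an illustrative name:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy, measured in nats."""
    return math.exp(mean_cross_entropy)


validation_ppl = perplexity(5.6761)
```

Evaluating this for the validation loss of `5.6761` gives approximately 291.8, in line with the reported figure.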
frontend/__init__.py
ADDED
File without changes
frontend/ui_components.py
ADDED
@@ -0,0 +1,34 @@
import streamlit as st


def apply_custom_css():
    """Inject the app's dark theme plus button and response-box styling."""
    # The CSS is kept flush-left so Markdown does not treat it as a code block.
    st.markdown("""
<style>
.stApp { background-color: #0b0f19; color: #f1f5f9; }
.stButton>button {
    background: linear-gradient(135deg, #FF4B4B 0%, #FF0000 100%);
    color: white; border: none; border-radius: 6px; font-weight: bold; width: 100%;
}
.sidebar .sidebar-content { background-color: #0d1527; }
.crab-response { background-color: #1e293b; padding: 15px; border-radius: 8px; border-left: 4px solid #FF4B4B; }
</style>
""", unsafe_allow_html=True)


def render_sidebar(status, params, loss):
    """Draw the telemetry sidebar and return the (temperature, max_tokens) settings."""
    with st.sidebar:
        st.image("https://img.icons8.com/fluency/96/crab.png", width=70)
        st.title("⚙️ Engine Telemetry")

        # Green banner when the engine reports ONLINE, red otherwise
        if "ONLINE" in status:
            st.success(status)
        else:
            st.error(status)

        st.markdown(f"**Architecture:** ~{params}M Params")
        st.markdown(f"**Validation Loss:** {loss}")
        st.markdown("**Creator:** Arshvir (Jaiho Labs)")

        st.divider()
        st.subheader("Inference Settings")
        # Slider arguments: label, min, max, default, step
        temp = st.slider("Temperature (Creativity)", 0.1, 1.5, 0.6, 0.1)
        max_t = st.slider("Max Tokens", 10, 150, 60, 10)
        return temp, max_t
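The sidebar only returns the slider values; how they are consumed is not shown in this commit. As a rough illustration of what a temperature setting usually does during sampling, the sketch below rescales logits before the softmax. The function name is hypothetical and this is not taken from the app's inference code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

Values near the slider's minimum of 0.1 concentrate probability on the top token, while values near the maximum of 1.5 flatten the distribution and make sampling more varied.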
model_notebooks/crab_finetuning_v2.ipynb
ADDED
The diff for this file is too large to render.
model_notebooks/crab_picolm_v1.ipynb
ADDED
The diff for this file is too large to render.
model_notebooks/pretraining_crab_v1.ipynb
ADDED
The diff for this file is too large to render.
models/architecture.py
ADDED
@@ -0,0 +1,101 @@
# models/architecture.py
import torch
import torch.nn as nn
from torch.nn import functional as F


class LocalConfig:
    """Lightweight config object built from keyword arguments (e.g. a loaded JSON dict)."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

        # Fallback defaults if not in JSON
        if not hasattr(self, 'dropout'):
            self.dropout = 0.0


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.n_head = config.n_head
        self.n_embd = config.n_embd

        # Fused Q/K/V projection and output projection (no biases)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

        # attn_dropout is not called in forward(); attention dropout is applied via dropout_p below
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape each of q, k, v to (B, n_head, T, head_dim)
        head_dim = C // self.n_head
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)

        # PyTorch Flash Attention via scaled_dot_product_attention
        y = F.scaled_dot_product_attention(
            q, k, v,
            is_causal=True,
            dropout_p=self.config.dropout if self.training else 0.0
        )

        # Merge heads back to (B, T, C) and project
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_dropout(self.c_proj(y))


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Pre-LayerNorm residual connections
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class CRAB(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # Weight tying: the token embedding and the output projection share one matrix
        self.transformer.wte.weight = self.lm_head.weight

    def forward(self, idx, targets=None):
        b, t = idx.size()
        # Guard: the position embedding table only covers block_size positions
        assert t <= self.config.block_size, "sequence length exceeds block_size"
        pos = torch.arange(0, t, dtype=torch.long, device=idx.device)

        # Token + position embeddings; (t, n_embd) broadcasts over the batch
        x = self.transformer.drop(self.transformer.wte(idx) + self.transformer.wpe(pos))
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            # -100 ignore_index for Target Masking in instruction tuning
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)

        return logits, loss
models/models.md
ADDED
@@ -0,0 +1,4 @@
# CRAB Hugging Face Model Links

Link 1:
Link 2: