Spaces: arshvir/crab_language_model


arshvir committed on May 16

Commit d2deca9 (verified) · 1 Parent(s): 8fc2618

Upload 11 files

Files changed (11)

dataset/DATA_MANIFEST.md +22 -0
documents/ARCHITECTURE.md +23 -0
documents/EXPERIMENTS_LOG.md +21 -0
documents/TRAINING_METRICS.md +27 -0
frontend/__init__.py +0 -0
frontend/ui_components.py +34 -0
model_notebooks/crab_finetuning_v2.ipynb +0 -0
model_notebooks/crab_picolm_v1.ipynb +0 -0
model_notebooks/pretraining_crab_v1.ipynb +0 -0
models/architecture.py +101 -0
models/models.md +4 -0

dataset/DATA_MANIFEST.md ADDED

	@@ -0,0 +1,22 @@

+# 🗃️ Data Engineering Manifest
+This document outlines the exact data pipelines used to train CRAB from Stage 1 (Base) to Stage 2 (Instruct).
+## Stage 1: Base Pre-Training (`crab_v1.pth`)
+* **Dataset:** `roneneldan/TinyStories`
+* **Purpose:** To teach the model basic English syntax, grammar, and simple narrative structure without overwhelming its limited ~70M-parameter capacity with complex real-world vocabulary.
+* **Format:** Plain, unstructured text trained with autoregressive next-token prediction.
+## Stage 2: Instruction Tuning (`crab_v2_qa.pth`)
+* **Dataset:** `databricks/databricks-dolly-15k`
+* **Purpose:** To map the base English understanding to a conversational Q&A and Summarization format.
+* **Filtering Strategy:**
+  * Extracted only the `open_qa`, `closed_qa`, and `summarization` categories.
+  * Applied a hard filter: dropped any sequence longer than **200 tokens** to avoid out-of-memory errors and unstable gradients on the T4 GPU.
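The filtering step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes each record is a dict with Dolly-15k's `instruction`, `context`, `response`, and `category` fields, and the names `filter_records`, `count_tokens`, and `whitespace_count` are hypothetical. Whether the 200-token cap was measured on the prompt, the response, or the full sequence is not stated; this sketch counts all three fields together.

```python
from typing import Callable, Dict, Iterable, List

KEPT_CATEGORIES = {"open_qa", "closed_qa", "summarization"}
MAX_TOKENS = 200  # cap stated in the manifest


def filter_records(
    records: Iterable[Dict[str, str]],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> List[Dict[str, str]]:
    """Keep records in the allowed categories whose total length fits the cap."""
    kept = []
    for rec in records:
        if rec.get("category") not in KEPT_CATEGORIES:
            continue  # drop categories outside the QA/summarization set
        # Assumption: the cap applies to instruction + context + response combined.
        text = " ".join(rec.get(k, "") for k in ("instruction", "context", "response"))
        if count_tokens(text) > max_tokens:
            continue  # drop over-long sequences
        kept.append(rec)
    return kept


def whitespace_count(text: str) -> int:
    """Stand-in token counter for the demo; a real run would use the GPT-2 BPE tokenizer."""
    return len(text.split())


sample = [
    {"instruction": "What is a crab?", "context": "", "response": "A crustacean.", "category": "open_qa"},
    {"instruction": "Write a poem.", "context": "", "response": "Roses are red.", "category": "creative_writing"},
    {"instruction": "Summarize.", "context": "word " * 300, "response": "Short.", "category": "summarization"},
]
filtered = filter_records(sample, whitespace_count)
print(len(filtered))
```

Passing the token counter in as a callable keeps the sketch independent of any particular tokenizer library.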
+### The Identity Injection Protocol
+To keep the model's persona from being lost during instruction tuning ("Catastrophic Forgetting"), custom QA pairs were written by hand and injected into the Dolly-15k dataset prior to shuffling and vectorization.
+**Injected pairs:**
+```json
+{"instruction": "Who made you?", "response": "I was created by Arshvir at Jaiho Labs."}
+{"instruction": "What is your name?", "response": "My name is CRAB. I am an AI assistant."}
+```

documents/ARCHITECTURE.md ADDED

	@@ -0,0 +1,23 @@

+# 📐 Architecture Blueprint
+CRAB is a strict **Decoder-Only Transformer**, structurally aligned with the GPT-2 paper but optimized for modern PyTorch execution.
+## Core Hyperparameters
+* `vocab_size`: 50,257 (GPT-2 BPE Tokenizer)
+* `block_size` (Context Window): 512
+* `n_embd` (Hidden Dimension): 768
+* `n_head` (Attention Heads): 6
+* `n_layer` (Transformer Blocks): 6
+* `dropout`: 0.10 (Active during Phase 2 tuning)
+## Mathematical Core: Causal Multi-Head Attention
+CRAB uses PyTorch's native `F.scaled_dot_product_attention`, which dispatches to a fused kernel (such as FlashAttention) when the hardware and dtype support one. The causal mask ensures each token can attend only to itself and earlier tokens.
+$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M \right)V$$
+*(Where $M$ is the additive causal mask: $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$).*
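To make the formula above concrete, here is a small dependency-free sketch of single-head causal attention. It is written in plain Python for readability and is not how CRAB computes attention (the model calls PyTorch's fused kernel). The function name `causal_attention` and the toy matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head softmax(QK^T / sqrt(d_k) + M) V with a causal mask M."""
    d_k = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Positions j > i have M_ij = -inf, so exp(score) = 0; we simply skip them.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        )
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = causal_attention(q, k, v)
print(result[0])  # the first token can only see itself
```

Changing the value vector of a later position leaves the outputs of earlier positions untouched, which is the property the mask exists to guarantee.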
+## Optimization
+* **Pre-LayerNorm Architecture:** Layer Normalization is applied *before* the Attention and MLP blocks, providing stable gradient flow for deeper networks.
+* **Activation:** Standard `GELU` (Gaussian Error Linear Unit).
+* **Weight Tying:** The input embedding matrix (`wte`) shares its weights with the final output projection (`lm_head`). This avoids a separate 50,257 × 768 output matrix (about 38.6M parameters) and keeps the input and output token representations consistent.
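As a sanity check on the hyperparameters and weight tying described above, the parameter count can be worked out by hand. The sketch below assumes bias-free `nn.Linear` layers and `nn.LayerNorm` with both weight and bias, as in `models/architecture.py`; the function name `count_parameters` is hypothetical. With the values listed in this document it gives roughly 81.5M tied parameters, somewhat above the ~70M quoted elsewhere in this commit. The trained checkpoint may use different settings, which this sketch cannot confirm.

```python
def count_parameters(
    vocab_size: int = 50257,
    block_size: int = 512,
    n_embd: int = 768,
    n_layer: int = 6,
    tie_weights: bool = True,
) -> int:
    """Count trainable parameters for a GPT-2 style decoder with bias-free linears."""
    wte = vocab_size * n_embd                      # token embedding
    wpe = block_size * n_embd                      # learned position embedding
    layer_norm = 2 * n_embd                        # weight + bias
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # c_attn + c_proj
    mlp = 2 * (n_embd * 4 * n_embd)                # c_fc + c_proj
    block = 2 * layer_norm + attn + mlp
    # A tied lm_head reuses wte, so it adds nothing to the total.
    lm_head = 0 if tie_weights else vocab_size * n_embd
    return wte + wpe + n_layer * block + layer_norm + lm_head


tied = count_parameters()
untied = count_parameters(tie_weights=False)
print(f"tied: {tied:,}  untied: {untied:,}  saved: {untied - tied:,}")
```

Note that the head count does not appear in the formula: splitting `n_embd` across heads changes the shape of the attention computation, not the number of weights.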

documents/EXPERIMENTS_LOG.md ADDED

	@@ -0,0 +1,21 @@

+# 🧪 Experiments & Post-Mortems
+Building an LLM from scratch requires breaking it. Here are the core failures encountered during development and how they were engineered around.
+### Incident 1: The "Alpaca Crash" (Vocabulary Mismatch)
+* **Attempt:** Fine-tuning the 70M base model on the `tatsu-lab/alpaca` instruction dataset.
+* **Failure:** Validation loss spiked to `6.11` and PPL exploded to `452.38`.
+* **Diagnosis:** Alpaca contains complex, college-level tasks, while our ~70M base model was pre-trained only on toddler-level stories. Fine-tuning forced it to fit a large amount of vocabulary it had rarely or never seen during pre-training, and it lost much of what it had already learned (catastrophic forgetting).
+* **Resolution:** Pivoted to filtering simpler datasets and capping sequence lengths.
+### Incident 2: The "Wikitext NaN Explosion"
+* **Attempt:** Continual pre-training on `wikitext-2-raw-v1` using Mixed Precision (FP16).
+* **Failure:** Gradients exploded, resulting in `Loss: NaN`. Inference output degenerated into repetitive gibberish (e.g., *"is and lollitter and lollbracotled"*).
+* **Diagnosis:** The `wikitext` dataset contained raw tokenizer artifacts (e.g., `@-@`) which clashed with GPT-2 BPE. Furthermore, high weight decay coupled with FP16 underflow produced non-finite values during backward passes.
+* **Resolution:** Rolled back the model weights. Disabled `torch.amp.autocast` (falling back to pure FP32), reduced `weight_decay` to `0.01`, and enforced strict data sanitization.
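The "strict data sanitization" mentioned above is not shown in this commit. One plausible minimal version, offered only as a sketch, undoes WikiText's escaped punctuation (`@-@`, `@,@`, `@.@`) and collapses leftover whitespace. The function name `sanitize_wikitext` and the exact replacement table are assumptions, not the author's code.

```python
# WikiText escapes punctuation inside tokens as " @-@ ", " @,@ " and " @.@ ".
ARTIFACTS = {" @-@ ": "-", " @,@ ": ",", " @.@ ": "."}


def sanitize_wikitext(line: str) -> str:
    """Restore escaped punctuation and normalize whitespace in one raw line."""
    for raw, clean in ARTIFACTS.items():
        line = line.replace(raw, clean)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(line.split())


print(sanitize_wikitext("a well @-@ known figure of 1 @,@ 000 @.@ 5 "))
```

Cleaning the text before tokenization means the BPE tokenizer sees ordinary hyphens and numerals instead of rare `@` sequences.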
+### Incident 3: Synthetic Memorization (Overfitting)
+* **Attempt:** Training on 1,500 highly repetitive synthetic QA pairs to fix the Alpaca crash.
+* **Failure:** Validation Loss dropped to `0.16` and PPL to `1.18`. The model began reciting dataset lines verbatim, ignoring user prompts.
+* **Diagnosis:** Severe Overfitting due to lack of dataset variance.
+* **Resolution:** Scaled up to `databricks-dolly-15k`, applied 10% Dropout across all transformer modules, and randomized batch sampling. Generalization was successfully restored.

documents/TRAINING_METRICS.md ADDED

	@@ -0,0 +1,27 @@

+# 📈 Training Telemetry & Metrics
+## Phase 1: Pre-Training Convergence
+* **Hardware:** 1x NVIDIA Tesla T4 (Google Colab)
+* **Optimizer:** AdamW (`lr=6e-4`)
+* **Final Pre-Training Loss:** ~1.8 (Cross-Entropy)
+## Phase 2: QA Instruction Tuning
+To transition CRAB to an assistant, we utilized **Target Masking**.
+The User Prompt and Padding tokens were masked with PyTorch's `ignore_index=-100`. Loss was calculated exclusively on the generated assistant tokens.
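The target masking described above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the helper name `build_example`, the pad id, and the toy token ids are hypothetical. Labels are shifted by one position here because the model's `forward` passes `targets` straight to `cross_entropy` without shifting.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_example(
    prompt_ids: List[int],
    response_ids: List[int],
    pad_id: int,
    max_len: int,
) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) where only response tokens contribute to the loss."""
    tokens = (list(prompt_ids) + list(response_ids))[: max_len + 1]
    inputs = tokens[:-1]
    labels = tokens[1:]  # next-token targets
    n_prompt = len(prompt_ids)
    # labels[i] is tokens[i + 1]; mask it while that target is still a prompt token.
    labels = [IGNORE_INDEX if i + 1 < n_prompt else tok for i, tok in enumerate(labels)]
    pad = max_len - len(inputs)
    inputs = inputs + [pad_id] * pad
    labels = labels + [IGNORE_INDEX] * pad  # padding never contributes to the loss
    return inputs, labels


inputs, labels = build_example([11, 12, 13], [21, 22], pad_id=0, max_len=6)
print(inputs)
print(labels)
```

Because masked positions are excluded from the average, the reported loss reflects only how well the model predicts the assistant's reply.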
+* **Optimizer:** AdamW (`lr=2e-5`, pure FP32 for numerical stability)
+* **Batch Strategy:** 16 Gradient Accumulation Steps (Effective batch size: 64)
+* **Dropout:** 10%
+**Loss Trajectory (600 Steps):**
+* Step 0: `141.26`
+* Step 200: `103.07`
+* Step 600: `91.12`
+*(Note: These values are the loss accumulated across 16 micro-steps, so the raw scalars are higher than a per-batch cross-entropy; the downward trend is what indicates convergence).*
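If the logged values are plain sums over the 16 micro-steps (an assumption; the logging code is not shown), dividing by the accumulation count recovers a per-micro-batch mean that is easier to compare with other numbers in this report. The variable names below are hypothetical.

```python
ACCUM_STEPS = 16  # gradient accumulation steps stated above

# Logged (accumulated) losses from the trajectory above, keyed by step.
logged = {0: 141.26, 200: 103.07, 600: 91.12}

# Assumption: each logged value is the sum of 16 micro-step losses.
per_micro_step = {step: total / ACCUM_STEPS for step, total in logged.items()}

for step, mean_loss in per_micro_step.items():
    print(f"step {step}: {mean_loss:.2f}")
```

Under that assumption the step-600 mean is about 5.70, which sits close to the final validation loss reported in this document.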
+## Final Validation Report (`crab_v2_qa.pth`)
+* **Validation Loss:** `5.6761`
+* **Perplexity (PPL):** `291.81`
+* **Response Accuracy:** `22.56%` (Evaluated purely on unmasked response tokens)
+**Conclusion:** The metrics indicate a model that successfully learned the Chat formatting and Identity injection, but hit a "Semantic Ceiling" when tested against complex out-of-distribution adult vocabulary.
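The validation loss and perplexity in the report above are two views of the same quantity: perplexity is the exponential of the mean cross-entropy (in nats). A short check, with the hypothetical helper name `perplexity`:

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy, measured in nats."""
    return math.exp(mean_cross_entropy)


reported_loss = 5.6761
ppl = perplexity(reported_loss)
print(f"{ppl:.2f}")
```

The same relation roughly holds for the pairs in the experiments log (for example, a loss of `6.11` corresponds to a perplexity of about 450), with small differences attributable to rounding of the logged loss.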

frontend/__init__.py ADDED

File without changes

frontend/ui_components.py ADDED

	@@ -0,0 +1,34 @@

+import streamlit as st
+def apply_custom_css():
+    """Inject the dark theme and the CRAB response-card styles into the Streamlit page."""
+    st.markdown("""
+    <style>
+        .stApp { background-color: #0b0f19; color: #f1f5f9; }
+        .stButton>button {
+            background: linear-gradient(135deg, #FF4B4B 0%, #FF0000 100%);
+            color: white; border: none; border-radius: 6px; font-weight: bold; width: 100%;
+        }
+        .sidebar .sidebar-content { background-color: #0d1527; }
+        .crab-response { background-color: #1e293b; padding: 15px; border-radius: 8px; border-left: 4px solid #FF4B4B; }
+    </style>
+    """, unsafe_allow_html=True)
+def render_sidebar(status, params, loss):
+    """Draw the telemetry sidebar and return the selected (temperature, max_tokens)."""
+    with st.sidebar:
+        st.image("https://img.icons8.com/fluency/96/crab.png", width=70)
+        st.title("⚙️ Engine Telemetry")
+        if "ONLINE" in status:
+            st.success(status)
+        else:
+            st.error(status)
+        st.markdown(f"**Architecture:** ~{params}M Params")
+        st.markdown(f"**Validation Loss:** {loss}")
+        st.markdown("**Creator:** Arshvir (Jaiho Labs)")
+        st.divider()
+        st.subheader("Inference Settings")
+        temp = st.slider("Temperature (Creativity)", 0.1, 1.5, 0.6, 0.1)
+        max_t = st.slider("Max Tokens", 10, 150, 60, 10)
+        return temp, max_t
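The inference code that consumes the temperature slider is not part of this diff. As a hedged sketch of what such a setting usually controls, the snippet below divides logits by the temperature before the softmax; the function name `softmax_with_temperature` and the toy logits are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.1)  # slider minimum
flat = softmax_with_temperature(logits, 1.5)   # slider maximum
print(sharp[0] > flat[0])
```

Lower temperatures concentrate probability on the top token, while higher ones spread it out, which is why the slider is labelled "Creativity".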

model_notebooks/crab_finetuning_v2.ipynb ADDED

The diff for this file is too large to render. See raw diff

model_notebooks/crab_picolm_v1.ipynb ADDED

The diff for this file is too large to render. See raw diff

model_notebooks/pretraining_crab_v1.ipynb ADDED

The diff for this file is too large to render. See raw diff

models/architecture.py ADDED

	@@ -0,0 +1,101 @@

+# models/architecture.py
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+class LocalConfig:
+    def __init__(self, **kwargs):
+        for key, value in kwargs.items():
+            setattr(self, key, value)
+        # Fallback defaults if not in JSON
+        if not hasattr(self, 'dropout'):
+            self.dropout = 0.0
+class CausalSelfAttention(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.n_head = config.n_head
+        self.n_embd = config.n_embd
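+        assert config.n_embd % config.n_head == 0, "n_embd must be divisible by n_head"
+        # Fused projection: one matmul produces queries, keys and values
+        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
+        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
+        # attn_dropout is never called in forward(); attention dropout is applied via dropout_p
+        self.attn_dropout = nn.Dropout(config.dropout)
+        self.resid_dropout = nn.Dropout(config.dropout)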
+    def forward(self, x):
+        B, T, C = x.size()
+        qkv = self.c_attn(x)
+        q, k, v = qkv.split(self.n_embd, dim=2)
+        head_dim = C // self.n_head
+        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
+        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
+        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
+        # Fused attention via scaled_dot_product_attention (uses a FlashAttention kernel when supported)
+        y = F.scaled_dot_product_attention(
+            q, k, v,
+            is_causal=True,
+            dropout_p=self.config.dropout if self.training else 0.0
+        )
+        y = y.transpose(1, 2).contiguous().view(B, T, C)
+        return self.resid_dropout(self.c_proj(y))
+class MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
+        self.gelu = nn.GELU()
+        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
+        self.dropout = nn.Dropout(config.dropout)
+    def forward(self, x):
+        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))
+class Block(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.ln_1 = nn.LayerNorm(config.n_embd)
+        self.attn = CausalSelfAttention(config)
+        self.ln_2 = nn.LayerNorm(config.n_embd)
+        self.mlp = MLP(config)
+    def forward(self, x):
+        x = x + self.attn(self.ln_1(x))
+        x = x + self.mlp(self.ln_2(x))
+        return x
+class CRAB(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.transformer = nn.ModuleDict(dict(
+            wte = nn.Embedding(config.vocab_size, config.n_embd),
+            wpe = nn.Embedding(config.block_size, config.n_embd),
+            drop = nn.Dropout(config.dropout),
+            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+            ln_f = nn.LayerNorm(config.n_embd),
+        ))
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+        # Weight tying: the token embedding and the output projection share one parameter tensor
+        self.transformer.wte.weight = self.lm_head.weight
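+    def forward(self, idx, targets=None):
+        b, t = idx.size()
+        assert t <= self.config.block_size, f"Sequence length {t} exceeds block_size {self.config.block_size}"
+        pos = torch.arange(0, t, dtype=torch.long, device=idx.device)
+        # Token embeddings (b, t, n_embd) plus position embeddings (t, n_embd), broadcast over the batch
+        x = self.transformer.drop(self.transformer.wte(idx) + self.transformer.wpe(pos))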
+        for block in self.transformer.h:
+            x = block(x)
+        x = self.transformer.ln_f(x)
+        logits = self.lm_head(x)
+        loss = None
+        if targets is not None:
+            # -100 ignore_index for Target Masking in instruction tuning
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)
+        return logits, loss

models/models.md ADDED

	@@ -0,0 +1,4 @@

+# CRAB HuggingFace Model Links
+Links 1:
+Links 2: