v0.01

Files changed (3)

jepa_llm_prototypes.ipynb +1258 -0
jepa_option1_sentence_encoder.ipynb +690 -0
jepa_option2_llm_hidden_states.ipynb +699 -0

jepa_llm_prototypes.ipynb ADDED

	@@ -0,0 +1,1258 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🧠 JEPA-Style LLM Prototypes\n",
+    "\n",
+    "## Making Decoder-Only LLMs Predict State Consequences Instead of Tokens\n",
+    "\n",
+    "This notebook implements three approaches to converting a standard decoder-only LLM into a JEPA-style world model, which predicts the embedding of the next state rather than the next token:\n",
+    "\n",
+    "1. **Option 1:** Sentence Encoder Approach (Simplest)\n",
+    "2. **Option 2:** LLM Hidden States Approach (Medium)\n",
+    "3. **Option 3:** Full Autoencoder Approach (Most Powerful)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 📦 Setup & Installation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install required packages\n",
+    "!pip install -q transformers accelerate bitsandbytes sentence-transformers datasets torch matplotlib tqdm"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM\n",
+    "from sentence_transformers import SentenceTransformer\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from tqdm.auto import tqdm\n",
+    "import random\n",
+    "\n",
+    "# Check GPU\n",
+    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
+    "print(f\"Using device: {device}\")\n",
+    "if torch.cuda.is_available():\n",
+    "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
+    "    print(f\"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 📊 Create Synthetic Dataset\n",
+    "\n",
+    "We'll create a simple \"enterprise workflow\" dataset with:\n",
+    "- **States:** Document/workflow status descriptions\n",
+    "- **Actions:** User actions\n",
+    "- **Next States:** Resulting state after action\n",
+    "\n",
+    "This simulates learning the \"physics\" of your enterprise domain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class EnterpriseWorkflowDataset(Dataset):\n",
+    "    \"\"\"\n",
+    "    Synthetic dataset simulating enterprise workflow state transitions.\n",
+    "    \n",
+    "    Each sample is a (state, action, next_state) triplet.\n",
+    "    The model learns to predict next_state given state + action.\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    def __init__(self, num_samples=1000, seed=42):\n",
+    "        random.seed(seed)\n",
+    "        self.samples = self._generate_samples(num_samples)\n",
+    "        \n",
+    "    def _generate_samples(self, num_samples):\n",
+    "        samples = []\n",
+    "        \n",
+    "        # Document workflow transitions\n",
+    "        doc_transitions = [\n",
+    "            # (current_state, action, next_state)\n",
+    "            (\"Document is in draft status with 0 sections\", \"User creates new section\", \"Document is in draft status with 1 section\"),\n",
+    "            (\"Document is in draft status with 1 section\", \"User creates new section\", \"Document is in draft status with 2 sections\"),\n",
+    "            (\"Document is in draft status with 2 sections\", \"User creates new section\", \"Document is in draft status with 3 sections\"),\n",
+    "            (\"Document is in draft status with 3 sections\", \"User submits for review\", \"Document is pending review with 3 sections\"),\n",
+    "            (\"Document is pending review with 3 sections\", \"Reviewer approves document\", \"Document is approved and published\"),\n",
+    "            (\"Document is pending review with 3 sections\", \"Reviewer requests changes\", \"Document is in revision with 3 sections\"),\n",
+    "            (\"Document is in revision with 3 sections\", \"User makes requested changes\", \"Document is pending review with 3 sections\"),\n",
+    "            (\"Document is approved and published\", \"User archives document\", \"Document is archived\"),\n",
+    "            (\"Document is in draft status with 1 section\", \"User deletes section\", \"Document is in draft status with 0 sections\"),\n",
+    "            (\"Document is in draft status with 2 sections\", \"User deletes section\", \"Document is in draft status with 1 section\"),\n",
+    "        ]\n",
+    "        \n",
+    "        # Project workflow transitions\n",
+    "        project_transitions = [\n",
+    "            (\"Project is in planning phase with 0 tasks\", \"Manager adds task\", \"Project is in planning phase with 1 task\"),\n",
+    "            (\"Project is in planning phase with 1 task\", \"Manager adds task\", \"Project is in planning phase with 2 tasks\"),\n",
+    "            (\"Project is in planning phase with 2 tasks\", \"Manager starts project\", \"Project is active with 2 tasks and 0 completed\"),\n",
+    "            (\"Project is active with 2 tasks and 0 completed\", \"Team completes task\", \"Project is active with 2 tasks and 1 completed\"),\n",
+    "            (\"Project is active with 2 tasks and 1 completed\", \"Team completes task\", \"Project is completed with all tasks done\"),\n",
+    "            (\"Project is active with 2 tasks and 0 completed\", \"Manager pauses project\", \"Project is on hold with 2 tasks\"),\n",
+    "            (\"Project is on hold with 2 tasks\", \"Manager resumes project\", \"Project is active with 2 tasks and 0 completed\"),\n",
+    "            (\"Project is completed with all tasks done\", \"Manager closes project\", \"Project is archived\"),\n",
+    "        ]\n",
+    "        \n",
+    "        # User account transitions\n",
+    "        account_transitions = [\n",
+    "            (\"User account is new with basic permissions\", \"Admin grants editor role\", \"User account is active with editor permissions\"),\n",
+    "            (\"User account is active with editor permissions\", \"Admin grants admin role\", \"User account is active with admin permissions\"),\n",
+    "            (\"User account is active with admin permissions\", \"User requests deactivation\", \"User account is pending deactivation\"),\n",
+    "            (\"User account is pending deactivation\", \"Admin confirms deactivation\", \"User account is deactivated\"),\n",
+    "            (\"User account is active with editor permissions\", \"Security flags suspicious activity\", \"User account is locked pending review\"),\n",
+    "            (\"User account is locked pending review\", \"Security clears account\", \"User account is active with editor permissions\"),\n",
+    "        ]\n",
+    "        \n",
+    "        # Inventory transitions\n",
+    "        inventory_transitions = [\n",
+    "            (\"Inventory has 100 units in stock\", \"Customer orders 10 units\", \"Inventory has 90 units in stock\"),\n",
+    "            (\"Inventory has 90 units in stock\", \"Customer orders 20 units\", \"Inventory has 70 units in stock\"),\n",
+    "            (\"Inventory has 70 units in stock\", \"Supplier delivers 50 units\", \"Inventory has 120 units in stock\"),\n",
+    "            (\"Inventory has 20 units in stock\", \"System triggers low stock alert\", \"Inventory has 20 units with reorder pending\"),\n",
+    "            (\"Inventory has 20 units with reorder pending\", \"Supplier delivers 100 units\", \"Inventory has 120 units in stock\"),\n",
+    "            (\"Inventory has 0 units in stock\", \"Customer attempts order\", \"Inventory has 0 units with backorder created\"),\n",
+    "        ]\n",
+    "        \n",
+    "        all_transitions = doc_transitions + project_transitions + account_transitions + inventory_transitions\n",
+    "        \n",
+    "        # Generate samples by randomly selecting transitions\n",
+    "        for _ in range(num_samples):\n",
+    "            state, action, next_state = random.choice(all_transitions)\n",
+    "            samples.append({\n",
+    "                'state': state,\n",
+    "                'action': action,\n",
+    "                'next_state': next_state\n",
+    "            })\n",
+    "            \n",
+    "        return samples\n",
+    "    \n",
+    "    def __len__(self):\n",
+    "        return len(self.samples)\n",
+    "    \n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.samples[idx]\n",
+    "\n",
+    "\n",
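+    "# Create datasets\n",
+    "# Note: both splits are sampled from the same 30 hand-written transitions,\n",
+    "# so validation measures how well the model fits them, not generalization.\n",
+    "train_dataset = EnterpriseWorkflowDataset(num_samples=2000, seed=42)\n",
+    "val_dataset = EnterpriseWorkflowDataset(num_samples=200, seed=123)\n",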
+    "\n",
+    "print(f\"Training samples: {len(train_dataset)}\")\n",
+    "print(f\"Validation samples: {len(val_dataset)}\")\n",
+    "print(\"\\nExample sample:\")\n",
+    "print(f\"  State: {train_dataset[0]['state']}\")\n",
+    "print(f\"  Action: {train_dataset[0]['action']}\")\n",
+    "print(f\"  Next State: {train_dataset[0]['next_state']}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 🔵 Option 1: Sentence Encoder Approach (Simplest)\n",
+    "\n",
+    "Uses a frozen pre-trained sentence encoder (such as `all-MiniLM-L6-v2`) to embed states and actions; only a small predictor is trained on top.\n",
+    "\n",
+    "**Pros:**\n",
+    "- Fastest to train\n",
+    "- No need to train encoder\n",
+    "- Small memory footprint\n",
+    "\n",
+    "**Cons:**\n",
+    "- Limited by pre-trained encoder's representation\n",
+    "- May not capture domain-specific nuances"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Option1_SentenceEncoderJEPA(nn.Module):\n",
+    "    \"\"\"\n",
+    "    JEPA-style world model using pre-trained sentence encoder.\n",
+    "    \n",
+    "    Architecture:\n",
+    "    - State Encoder: Pre-trained SentenceTransformer (frozen)\n",
+    "    - Predictor: Small transformer that predicts next state embedding\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    def __init__(\n",
+    "        self,\n",
+    "        sentence_model_name='all-MiniLM-L6-v2',\n",
+    "        hidden_dim=256,\n",
+    "        num_layers=2,\n",
+    "        num_heads=4,\n",
+    "        dropout=0.1\n",
+    "    ):\n",
+    "        super().__init__()\n",
+    "        \n",
+    "        # Pre-trained sentence encoder (frozen)\n",
+    "        self.sentence_encoder = SentenceTransformer(sentence_model_name)\n",
+    "        self.sentence_encoder.requires_grad_(False)  # Freeze\n",
+    "        \n",
+    "        # Get embedding dimension from sentence encoder\n",
+    "        self.embed_dim = self.sentence_encoder.get_sentence_embedding_dimension()\n",
+    "        \n",
+    "        # Project state and action to hidden dim\n",
+    "        self.state_proj = nn.Linear(self.embed_dim, hidden_dim)\n",
+    "        self.action_proj = nn.Linear(self.embed_dim, hidden_dim)\n",
+    "        \n",
+    "        # Combine state + action\n",
+    "        self.combine = nn.Sequential(\n",
+    "            nn.Linear(hidden_dim * 2, hidden_dim),\n",
+    "            nn.LayerNorm(hidden_dim),\n",
+    "            nn.GELU()\n",
+    "        )\n",
+    "        \n",
+    "        # Transformer predictor\n",
+    "        encoder_layer = nn.TransformerEncoderLayer(\n",
+    "            d_model=hidden_dim,\n",
+    "            nhead=num_heads,\n",
+    "            dim_feedforward=hidden_dim * 4,\n",
+    "            dropout=dropout,\n",
+    "            activation='gelu',\n",
+    "            batch_first=True\n",
+    "        )\n",
+    "        self.predictor = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)\n",
+    "        \n",
+    "        # Output projection to state embedding space\n",
+    "        self.output_proj = nn.Sequential(\n",
+    "            nn.Linear(hidden_dim, hidden_dim),\n",
+    "            nn.GELU(),\n",
+    "            nn.Linear(hidden_dim, self.embed_dim)\n",
+    "        )\n",
+    "        \n",
+    "    def encode_text(self, texts):\n",
+    "        \"\"\"Encode text to embeddings using sentence encoder.\"\"\"\n",
+    "        with torch.no_grad():\n",
+    "            embeddings = self.sentence_encoder.encode(\n",
+    "                texts, \n",
+    "                convert_to_tensor=True,\n",
+    "                show_progress_bar=False\n",
+    "            )\n",
+    "        return embeddings\n",
+    "    \n",
+    "    def forward(self, state_texts, action_texts):\n",
+    "        \"\"\"\n",
+    "        Predict next state embedding given current state and action.\n",
+    "        \n",
+    "        Args:\n",
+    "            state_texts: List of state descriptions\n",
+    "            action_texts: List of action descriptions\n",
+    "            \n",
+    "        Returns:\n",
+    "            predicted_next_state: Predicted next state embeddings [B, embed_dim]\n",
+    "        \"\"\"\n",
+    "        # Encode state and action\n",
+    "        state_emb = self.encode_text(state_texts)  # [B, embed_dim]\n",
+    "        action_emb = self.encode_text(action_texts)  # [B, embed_dim]\n",
+    "        \n",
+    "        # Project to hidden dim\n",
+    "        state_h = self.state_proj(state_emb)  # [B, hidden_dim]\n",
+    "        action_h = self.action_proj(action_emb)  # [B, hidden_dim]\n",
+    "        \n",
+    "        # Combine\n",
+    "        combined = self.combine(torch.cat([state_h, action_h], dim=-1))  # [B, hidden_dim]\n",
+    "        \n",
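+    "        # Add a length-1 sequence dimension for the transformer.\n",
+    "        # With a single token, self-attention cannot mix across positions,\n",
+    "        # so the predictor effectively acts as a residual MLP stack.\n",
+    "        combined = combined.unsqueeze(1)  # [B, 1, hidden_dim]\n",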
+    "        \n",
+    "        # Predict through transformer\n",
+    "        predicted = self.predictor(combined)  # [B, 1, hidden_dim]\n",
+    "        \n",
+    "        # Project to state embedding space\n",
+    "        predicted_next_state = self.output_proj(predicted.squeeze(1))  # [B, embed_dim]\n",
+    "        \n",
+    "        return predicted_next_state\n",
+    "    \n",
+    "    def get_target_embedding(self, next_state_texts):\n",
+    "        \"\"\"Get target embedding for loss computation.\"\"\"\n",
+    "        return self.encode_text(next_state_texts)\n",
+    "\n",
+    "\n",
+    "# Create model\n",
+    "print(\"Creating Option 1 model...\")\n",
+    "model_opt1 = Option1_SentenceEncoderJEPA(\n",
+    "    sentence_model_name='all-MiniLM-L6-v2',\n",
+    "    hidden_dim=256,\n",
+    "    num_layers=2,\n",
+    "    num_heads=4\n",
+    ").to(device)\n",
+    "\n",
+    "# Count parameters\n",
+    "trainable_params = sum(p.numel() for p in model_opt1.parameters() if p.requires_grad)\n",
+    "total_params = sum(p.numel() for p in model_opt1.parameters())\n",
+    "print(f\"Trainable parameters: {trainable_params:,}\")\n",
+    "print(f\"Total parameters: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_option1(model, train_dataset, val_dataset, epochs=10, batch_size=32, lr=1e-3):\n",
+    "    \"\"\"\n",
+    "    Training loop for Option 1 model.\n",
+    "    \n",
+    "    Loss: 1 - cosine similarity between predicted and target state embeddings.\n",
+    "    \"\"\"\n",
+    "    model.train()\n",
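+    "    # Optimize only the trainable parameters (the sentence encoder is frozen)\n",
+    "    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)\n",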
+    "    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)\n",
+    "    \n",
+    "    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n",
+    "    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)\n",
+    "    \n",
+    "    history = {'train_loss': [], 'val_loss': [], 'val_similarity': []}\n",
+    "    \n",
+    "    for epoch in range(epochs):\n",
+    "        # Training\n",
+    "        model.train()\n",
+    "        train_losses = []\n",
+    "        \n",
+    "        pbar = tqdm(train_loader, desc=f\"Epoch {epoch+1}/{epochs}\")\n",
+    "        for batch in pbar:\n",
+    "            states = batch['state']\n",
+    "            actions = batch['action']\n",
+    "            next_states = batch['next_state']\n",
+    "            \n",
+    "            # Forward pass\n",
+    "            predicted = model(states, actions)\n",
+    "            target = model.get_target_embedding(next_states)\n",
+    "            \n",
+    "            # Cosine similarity loss (1 - similarity to minimize)\n",
+    "            similarity = F.cosine_similarity(predicted, target, dim=-1)\n",
+    "            loss = (1 - similarity).mean()\n",
+    "            \n",
+    "            # Backward pass\n",
+    "            optimizer.zero_grad()\n",
+    "            loss.backward()\n",
+    "            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "            optimizer.step()\n",
+    "            \n",
+    "            train_losses.append(loss.item())\n",
+    "            pbar.set_postfix({'loss': f'{loss.item():.4f}'})\n",
+    "        \n",
+    "        scheduler.step()\n",
+    "        \n",
+    "        # Validation\n",
+    "        model.eval()\n",
+    "        val_losses = []\n",
+    "        val_similarities = []\n",
+    "        \n",
+    "        with torch.no_grad():\n",
+    "            for batch in val_loader:\n",
+    "                states = batch['state']\n",
+    "                actions = batch['action']\n",
+    "                next_states = batch['next_state']\n",
+    "                \n",
+    "                predicted = model(states, actions)\n",
+    "                target = model.get_target_embedding(next_states)\n",
+    "                \n",
+    "                similarity = F.cosine_similarity(predicted, target, dim=-1)\n",
+    "                loss = (1 - similarity).mean()\n",
+    "                \n",
+    "                val_losses.append(loss.item())\n",
+    "                val_similarities.append(similarity.mean().item())\n",
+    "        \n",
+    "        # Record history\n",
+    "        history['train_loss'].append(np.mean(train_losses))\n",
+    "        history['val_loss'].append(np.mean(val_losses))\n",
+    "        history['val_similarity'].append(np.mean(val_similarities))\n",
+    "        \n",
+    "        print(f\"Epoch {epoch+1}: Train Loss={np.mean(train_losses):.4f}, \"\n",
+    "              f\"Val Loss={np.mean(val_losses):.4f}, Val Similarity={np.mean(val_similarities):.4f}\")\n",
+    "    \n",
+    "    return history\n",
+    "\n",
+    "\n",
+    "# Train the model\n",
+    "print(\"\\n\" + \"=\"*50)\n",
+    "print(\"Training Option 1: Sentence Encoder JEPA\")\n",
+    "print(\"=\"*50)\n",
+    "history_opt1 = train_option1(model_opt1, train_dataset, val_dataset, epochs=10, batch_size=32)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def test_model(model, test_samples):\n",
+    "    \"\"\"Test the model on specific examples.\"\"\"\n",
+    "    model.eval()\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(\"Model Predictions\")\n",
+    "    print(\"=\"*60)\n",
+    "    \n",
+    "    for sample in test_samples:\n",
+    "        state = sample['state']\n",
+    "        action = sample['action']\n",
+    "        actual_next = sample['next_state']\n",
+    "        \n",
+    "        with torch.no_grad():\n",
+    "            # Get prediction\n",
+    "            predicted_emb = model([state], [action])\n",
+    "            actual_emb = model.get_target_embedding([actual_next])\n",
+    "            \n",
+    "            # Compute similarity\n",
+    "            similarity = F.cosine_similarity(predicted_emb, actual_emb, dim=-1).item()\n",
+    "            \n",
+    "        print(f\"\\nState: {state}\")\n",
+    "        print(f\"Action: {action}\")\n",
+    "        print(f\"Actual Next State: {actual_next}\")\n",
+    "        print(f\"Prediction Similarity: {similarity:.4f} {'✓' if similarity > 0.8 else '✗'}\")\n",
+    "\n",
+    "\n",
+    "# Test on a few examples\n",
+    "test_samples = [\n",
+    "    {'state': 'Document is in draft status with 2 sections', 'action': 'User creates new section', 'next_state': 'Document is in draft status with 3 sections'},\n",
+    "    {'state': 'Project is active with 2 tasks and 0 completed', 'action': 'Team completes task', 'next_state': 'Project is active with 2 tasks and 1 completed'},\n",
+    "    {'state': 'Inventory has 100 units in stock', 'action': 'Customer orders 10 units', 'next_state': 'Inventory has 90 units in stock'},\n",
+    "]\n",
+    "\n",
+    "test_model(model_opt1, test_samples)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 🟢 Option 2: LLM Hidden States Approach (Medium)\n",
+    "\n",
+    "Mean-pools the final-layer hidden states of a small LLM (GPT-2 by default) to represent states and actions.\n",
+    "\n",
+    "**Pros:**\n",
+    "- Better language understanding\n",
+    "- Can fine-tune encoder\n",
+    "- More expressive representations\n",
+    "\n",
+    "**Cons:**\n",
+    "- Slower than Option 1\n",
+    "- Requires more memory"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Option2_LLMHiddenStateJEPA(nn.Module):\n",
+    "    \"\"\"\n",
+    "    JEPA-style world model using LLM hidden states.\n",
+    "    \n",
+    "    Architecture:\n",
+    "    - State Encoder: Small LLM (GPT-2 or similar) + pooling\n",
+    "    - Predictor: MLP that predicts next state embedding\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    def __init__(\n",
+    "        self,\n",
+    "        model_name='gpt2',  # Small model for Colab\n",
+    "        state_dim=512,\n",
+    "        freeze_encoder=True\n",
+    "    ):\n",
+    "        super().__init__()\n",
+    "        \n",
+    "        # Load tokenizer and model\n",
+    "        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+    "        # GPT-2 ships without a pad token; reuse EOS so batches can be padded\n",
+    "        self.tokenizer.pad_token = self.tokenizer.eos_token\n",
+    "        \n",
+    "        self.encoder = AutoModel.from_pretrained(model_name)\n",
+    "        self.hidden_size = self.encoder.config.hidden_size\n",
+    "        \n",
+    "        if freeze_encoder:\n",
+    "            for param in self.encoder.parameters():\n",
+    "                param.requires_grad = False\n",
+    "        \n",
+    "        self.state_dim = state_dim\n",
+    "        \n",
+    "        # State projection (from LLM hidden to state space)\n",
+    "        self.state_proj = nn.Sequential(\n",
+    "            nn.Linear(self.hidden_size, state_dim),\n",
+    "            nn.LayerNorm(state_dim),\n",
+    "            nn.GELU()\n",
+    "        )\n",
+    "        \n",
+    "        # Action projection\n",
+    "        self.action_proj = nn.Sequential(\n",
+    "            nn.Linear(self.hidden_size, state_dim),\n",
+    "            nn.LayerNorm(state_dim),\n",
+    "            nn.GELU()\n",
+    "        )\n",
+    "        \n",
+    "        # Predictor: takes state + action, outputs next state\n",
+    "        self.predictor = nn.Sequential(\n",
+    "            nn.Linear(state_dim * 2, state_dim * 2),\n",
+    "            nn.LayerNorm(state_dim * 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.Dropout(0.1),\n",
+    "            nn.Linear(state_dim * 2, state_dim * 2),\n",
+    "            nn.LayerNorm(state_dim * 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.Dropout(0.1),\n",
+    "            nn.Linear(state_dim * 2, state_dim)\n",
+    "        )\n",
+    "        \n",
+    "    def encode_text(self, texts):\n",
+    "        \"\"\"\n",
+    "        Encode text to embeddings using LLM.\n",
+    "        Uses mean pooling over hidden states.\n",
+    "        \"\"\"\n",
+    "        # Tokenize\n",
+    "        inputs = self.tokenizer(\n",
+    "            texts,\n",
+    "            return_tensors='pt',\n",
+    "            padding=True,\n",
+    "            truncation=True,\n",
+    "            max_length=128\n",
+    "        ).to(self.encoder.device)\n",
+    "        \n",
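+    "        # Get hidden states. Track gradients only when the encoder is being\n",
+    "        # fine-tuned, and never override an enclosing torch.no_grad() block.\n",
+    "        encoder_trainable = any(p.requires_grad for p in self.encoder.parameters())\n",
+    "        with torch.set_grad_enabled(encoder_trainable and torch.is_grad_enabled()):\n",
+    "            outputs = self.encoder(**inputs)\n",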
+    "        \n",
+    "        # Mean pooling over real tokens only (padding is masked out)\n",
+    "        hidden_states = outputs.last_hidden_state\n",
+    "        attention_mask = inputs['attention_mask'].unsqueeze(-1).to(hidden_states.dtype)\n",
+    "        pooled = (hidden_states * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1.0)\n",
+    "        \n",
+    "        return pooled\n",
+    "    \n",
+    "    def forward(self, state_texts, action_texts):\n",
+    "        \"\"\"\n",
+    "        Predict next state embedding given current state and action.\n",
+    "        \"\"\"\n",
+    "        # Encode state and action\n",
+    "        state_hidden = self.encode_text(state_texts)  # [B, hidden_size]\n",
+    "        action_hidden = self.encode_text(action_texts)  # [B, hidden_size]\n",
+    "        \n",
+    "        # Project to state space\n",
+    "        state_emb = self.state_proj(state_hidden)  # [B, state_dim]\n",
+    "        action_emb = self.action_proj(action_hidden)  # [B, state_dim]\n",
+    "        \n",
+    "        # Combine and predict\n",
+    "        combined = torch.cat([state_emb, action_emb], dim=-1)  # [B, state_dim * 2]\n",
+    "        predicted_next_state = self.predictor(combined)  # [B, state_dim]\n",
+    "        \n",
+    "        return predicted_next_state\n",
+    "    \n",
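+    "    def get_target_embedding(self, next_state_texts):\n",
+    "        \"\"\"Get target embedding for loss computation.\n",
+    "\n",
+    "        The target is detached (stop-gradient), a common choice in JEPA-style\n",
+    "        training, so gradients cannot lower the loss by moving the targets themselves.\n",
+    "        \"\"\"\n",
+    "        hidden = self.encode_text(next_state_texts)\n",
+    "        return self.state_proj(hidden).detach()\n",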
+    "\n",
+    "\n",
+    "# Create model\n",
+    "print(\"Creating Option 2 model...\")\n",
+    "model_opt2 = Option2_LLMHiddenStateJEPA(\n",
+    "    model_name='gpt2',\n",
+    "    state_dim=512,\n",
+    "    freeze_encoder=True\n",
+    ").to(device)\n",
+    "\n",
+    "# Count parameters\n",
+    "trainable_params = sum(p.numel() for p in model_opt2.parameters() if p.requires_grad)\n",
+    "total_params = sum(p.numel() for p in model_opt2.parameters())\n",
+    "print(f\"Trainable parameters: {trainable_params:,}\")\n",
+    "print(f\"Total parameters: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_option2(model, train_dataset, val_dataset, epochs=10, batch_size=16, lr=1e-3):\n",
+    "    \"\"\"\n",
+    "    Training loop for Option 2 model.\n",
+    "    Loss: MSE + (1 - cosine similarity) between predicted and target embeddings.\n",
+    "    \"\"\"\n",
+    "    model.train()\n",
+    "    optimizer = torch.optim.AdamW(\n",
+    "        filter(lambda p: p.requires_grad, model.parameters()),\n",
+    "        lr=lr,\n",
+    "        weight_decay=0.01\n",
+    "    )\n",
+    "    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)\n",
+    "    \n",
+    "    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n",
+    "    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)\n",
+    "    \n",
+    "    history = {'train_loss': [], 'val_loss': [], 'val_similarity': []}\n",
+    "    \n",
+    "    for epoch in range(epochs):\n",
+    "        # Training\n",
+    "        model.train()\n",
+    "        train_losses = []\n",
+    "        \n",
+    "        pbar = tqdm(train_loader, desc=f\"Epoch {epoch+1}/{epochs}\")\n",
+    "        for batch in pbar:\n",
+    "            states = batch['state']\n",
+    "            actions = batch['action']\n",
+    "            next_states = batch['next_state']\n",
+    "            \n",
+    "            # Forward pass\n",
+    "            predicted = model(states, actions)\n",
+    "            target = model.get_target_embedding(next_states)\n",
+    "            \n",
+    "            # Combined loss: MSE + (1 - cosine similarity)\n",
+    "            mse_loss = F.mse_loss(predicted, target)\n",
+    "            cos_loss = (1 - F.cosine_similarity(predicted, target, dim=-1)).mean()\n",
+    "            loss = mse_loss + cos_loss\n",
+    "            \n",
+    "            # Backward pass\n",
+    "            optimizer.zero_grad()\n",
+    "            loss.backward()\n",
+    "            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "            optimizer.step()\n",
+    "            \n",
+    "            train_losses.append(loss.item())\n",
+    "            pbar.set_postfix({'loss': f'{loss.item():.4f}'})\n",
+    "        \n",
+    "        scheduler.step()\n",
+    "        \n",
+    "        # Validation\n",
+    "        model.eval()\n",
+    "        val_losses = []\n",
+    "        val_similarities = []\n",
+    "        \n",
+    "        with torch.no_grad():\n",
+    "            for batch in val_loader:\n",
+    "                states = batch['state']\n",
+    "                actions = batch['action']\n",
+    "                next_states = batch['next_state']\n",
+    "                \n",
+    "                predicted = model(states, actions)\n",
+    "                target = model.get_target_embedding(next_states)\n",
+    "                \n",
+    "                mse_loss = F.mse_loss(predicted, target)\n",
+    "                cos_loss = (1 - F.cosine_similarity(predicted, target, dim=-1)).mean()\n",
+    "                loss = mse_loss + cos_loss\n",
+    "                \n",
+    "                val_losses.append(loss.item())\n",
+    "                val_similarities.append(F.cosine_similarity(predicted, target, dim=-1).mean().item())\n",
+    "        \n",
+    "        # Record history\n",
+    "        history['train_loss'].append(np.mean(train_losses))\n",
+    "        history['val_loss'].append(np.mean(val_losses))\n",
+    "        history['val_similarity'].append(np.mean(val_similarities))\n",
+    "        \n",
+    "        print(f\"Epoch {epoch+1}: Train Loss={np.mean(train_losses):.4f}, \"\n",
+    "              f\"Val Loss={np.mean(val_losses):.4f}, Val Similarity={np.mean(val_similarities):.4f}\")\n",
+    "    \n",
+    "    return history\n",
+    "\n",
+    "\n",
+    "# Train the model\n",
+    "print(\"\\n\" + \"=\"*50)\n",
+    "print(\"Training Option 2: LLM Hidden State JEPA\")\n",
+    "print(\"=\"*50)\n",
+    "history_opt2 = train_option2(model_opt2, train_dataset, val_dataset, epochs=10, batch_size=16)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Test Option 2\n",
+    "test_model(model_opt2, test_samples)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 🔴 Option 3: Full Autoencoder Approach (Most Powerful)\n",
+    "\n",
+    "Trains a full state autoencoder for domain-specific representations.\n",
+    "\n",
+    "**Pros:**\n",
+    "- Best domain adaptation\n",
+    "- Learnable encoder captures task-specific features\n",
+    "- Highest potential accuracy\n",
+    "\n",
+    "**Cons:**\n",
+    "- Requires more training data\n",
+    "- Longer training time\n",
+    "- More complex to tune"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Option3_AutoencoderJEPA(nn.Module):\n",
+    "    \"\"\"\n",
+    "    JEPA-style world model with learned state autoencoder.\n",
+    "    \n",
+    "    Architecture:\n",
+    "    - State Encoder: Trainable encoder that learns domain-specific embeddings\n",
+    "    - State Decoder: Reconstructs the pooled LLM hidden state from the embedding (training-time regularizer)\n",
+    "    - Predictor: Transformer that predicts next state in latent space\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    def __init__(\n",
+    "        self,\n",
+    "        model_name='gpt2',\n",
+    "        state_dim=256,\n",
+    "        predictor_layers=3,\n",
+    "        predictor_heads=4\n",
+    "    ):\n",
+    "        super().__init__()\n",
+    "        \n",
+    "        # Tokenizer\n",
+    "        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+    "        self.tokenizer.pad_token = self.tokenizer.eos_token\n",
+    "        \n",
+    "        # Base LLM for encoding (will be fine-tuned)\n",
+    "        self.base_llm = AutoModel.from_pretrained(model_name)\n",
+    "        self.hidden_size = self.base_llm.config.hidden_size\n",
+    "        self.vocab_size = self.base_llm.config.vocab_size\n",
+    "        \n",
+    "        self.state_dim = state_dim\n",
+    "        \n",
+    "        # State Encoder: LLM hidden → compressed state\n",
+    "        self.state_encoder = nn.Sequential(\n",
+    "            nn.Linear(self.hidden_size, self.hidden_size // 2),\n",
+    "            nn.LayerNorm(self.hidden_size // 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.Linear(self.hidden_size // 2, state_dim),\n",
+    "            nn.LayerNorm(state_dim)\n",
+    "        )\n",
+    "        \n",
+    "        # State Decoder: compressed state → reconstruction\n",
+    "        self.state_decoder = nn.Sequential(\n",
+    "            nn.Linear(state_dim, self.hidden_size // 2),\n",
+    "            nn.LayerNorm(self.hidden_size // 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.Linear(self.hidden_size // 2, self.hidden_size),\n",
+    "            nn.LayerNorm(self.hidden_size)\n",
+    "        )\n",
+    "        \n",
+    "        # Action Encoder\n",
+    "        self.action_encoder = nn.Sequential(\n",
+    "            nn.Linear(self.hidden_size, state_dim),\n",
+    "            nn.LayerNorm(state_dim),\n",
+    "            nn.GELU()\n",
+    "        )\n",
+    "        \n",
+    "        # Transformer Predictor: (state, action) → next_state\n",
+    "        self.input_proj = nn.Linear(state_dim * 2, state_dim)\n",
+    "        \n",
+    "        encoder_layer = nn.TransformerEncoderLayer(\n",
+    "            d_model=state_dim,\n",
+    "            nhead=predictor_heads,\n",
+    "            dim_feedforward=state_dim * 4,\n",
+    "            dropout=0.1,\n",
+    "            activation='gelu',\n",
+    "            batch_first=True\n",
+    "        )\n",
+    "        self.predictor = nn.TransformerEncoder(encoder_layer, num_layers=predictor_layers)\n",
+    "        \n",
+    "        # Output projection\n",
+    "        self.output_proj = nn.Linear(state_dim, state_dim)\n",
+    "        \n",
+    "    def get_llm_hidden(self, texts):\n",
+    "        \"\"\"Get LLM hidden states for texts.\"\"\"\n",
+    "        inputs = self.tokenizer(\n",
+    "            texts,\n",
+    "            return_tensors='pt',\n",
+    "            padding=True,\n",
+    "            truncation=True,\n",
+    "            max_length=128\n",
+    "        ).to(self.base_llm.device)\n",
+    "        \n",
+    "        outputs = self.base_llm(**inputs)\n",
+    "        \n",
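+    "        # Masked mean pooling over real (non-padding) tokens\n",
+    "        attention_mask = inputs['attention_mask'].unsqueeze(-1)  # [B, T, 1]\n",
+    "        hidden_states = outputs.last_hidden_state  # [B, T, hidden_size]\n",
+    "        # clamp guards against division by zero if a text tokenizes to nothing\n",
+    "        pooled = (hidden_states * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1)\n",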
+    "        \n",
+    "        return pooled\n",
+    "    \n",
+    "    def encode_state(self, texts):\n",
+    "        \"\"\"Encode text to state embedding.\"\"\"\n",
+    "        hidden = self.get_llm_hidden(texts)\n",
+    "        return self.state_encoder(hidden)\n",
+    "    \n",
+    "    def encode_action(self, texts):\n",
+    "        \"\"\"Encode action text to action embedding.\"\"\"\n",
+    "        hidden = self.get_llm_hidden(texts)\n",
+    "        return self.action_encoder(hidden)\n",
+    "    \n",
+    "    def decode_state(self, state_emb):\n",
+    "        \"\"\"Decode state embedding back to hidden space (for reconstruction loss).\"\"\"\n",
+    "        return self.state_decoder(state_emb)\n",
+    "    \n",
+    "    def forward(self, state_texts, action_texts):\n",
+    "        \"\"\"\n",
+    "        Predict next state embedding given current state and action.\n",
+    "        \"\"\"\n",
+    "        # Encode\n",
+    "        state_emb = self.encode_state(state_texts)  # [B, state_dim]\n",
+    "        action_emb = self.encode_action(action_texts)  # [B, state_dim]\n",
+    "        \n",
+    "        # Combine\n",
+    "        combined = torch.cat([state_emb, action_emb], dim=-1)  # [B, state_dim * 2]\n",
+    "        combined = self.input_proj(combined)  # [B, state_dim]\n",
+    "        \n",
+    "        # Add sequence dimension\n",
+    "        combined = combined.unsqueeze(1)  # [B, 1, state_dim]\n",
+    "        \n",
+    "        # Predict through transformer\n",
+    "        predicted = self.predictor(combined)  # [B, 1, state_dim]\n",
+    "        predicted = self.output_proj(predicted.squeeze(1))  # [B, state_dim]\n",
+    "        \n",
+    "        return predicted\n",
+    "    \n",
+    "    def forward_with_reconstruction(self, state_texts, action_texts, next_state_texts):\n",
+    "        \"\"\"\n",
+    "        Forward pass with reconstruction for training.\n",
+    "        Returns: predicted_next_state, target_next_state, reconstructed state hidden, original state hidden\n",
+    "        \"\"\"\n",
+    "        # Run the LLM once for the current state and reuse the pooled vector\n",
+    "        # as both encoder input and reconstruction target (a second pass would\n",
+    "        # differ under dropout in train mode)\n",
+    "        state_hidden_original = self.get_llm_hidden(state_texts)\n",
+    "        state_emb = self.state_encoder(state_hidden_original)\n",
+    "        action_emb = self.encode_action(action_texts)\n",
+    "        target_emb = self.encode_state(next_state_texts)\n",
+    "        \n",
+    "        # Predict next state\n",
+    "        combined = torch.cat([state_emb, action_emb], dim=-1)\n",
+    "        combined = self.input_proj(combined)\n",
+    "        combined = combined.unsqueeze(1)\n",
+    "        predicted = self.predictor(combined)\n",
+    "        predicted_next = self.output_proj(predicted.squeeze(1))\n",
+    "        \n",
+    "        # Reconstruction of current state (for autoencoder regularization)\n",
+    "        state_reconstructed = self.decode_state(state_emb)\n",
+    "        \n",
+    "        return predicted_next, target_emb, state_reconstructed, state_hidden_original\n",
+    "\n",
+    "\n",
+    "# Create model\n",
+    "print(\"Creating Option 3 model...\")\n",
+    "model_opt3 = Option3_AutoencoderJEPA(\n",
+    "    model_name='gpt2',\n",
+    "    state_dim=256,\n",
+    "    predictor_layers=3,\n",
+    "    predictor_heads=4\n",
+    ").to(device)\n",
+    "\n",
+    "# Count parameters\n",
+    "trainable_params = sum(p.numel() for p in model_opt3.parameters() if p.requires_grad)\n",
+    "total_params = sum(p.numel() for p in model_opt3.parameters())\n",
+    "print(f\"Trainable parameters: {trainable_params:,}\")\n",
+    "print(f\"Total parameters: {total_params:,}\")"
+   ]
+  },
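+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The masked mean pooling inside `get_llm_hidden` can be checked in isolation. The cell below is a minimal sketch on a hand-made tensor (the toy values and the `toy_*` names are illustrative assumptions, not part of the model): padded positions are zeroed by the attention mask, so they do not dilute the average."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "# Toy batch: 2 sequences, 3 token positions, hidden size 2 (illustrative values)\n",
+    "toy_hidden = torch.tensor([\n",
+    "    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding\n",
+    "    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding\n",
+    "])\n",
+    "toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]]).unsqueeze(-1)  # [B, T, 1]\n",
+    "\n",
+    "# Same idea as get_llm_hidden: masked sum divided by the number of real tokens\n",
+    "toy_pooled = (toy_hidden * toy_mask).sum(dim=1) / toy_mask.sum(dim=1)\n",
+    "print(toy_pooled)  # rows: [2., 3.] and [4., 2.]"
+   ]
+  },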
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_option3(model, train_dataset, val_dataset, epochs=15, batch_size=16, lr=5e-4):\n",
+    "    \"\"\"\n",
+    "    Training loop for Option 3 model.\n",
+    "    Loss = prediction MSE + (1 - cosine similarity) + 0.1 * reconstruction MSE.\n",
+    "    \"\"\"\n",
+    "    model.train()\n",
+    "    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)\n",
+    "    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)\n",
+    "    \n",
+    "    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n",
+    "    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)\n",
+    "    \n",
+    "    history = {'train_loss': [], 'val_loss': [], 'val_similarity': []}\n",
+    "    \n",
+    "    for epoch in range(epochs):\n",
+    "        # Training\n",
+    "        model.train()\n",
+    "        train_losses = []\n",
+    "        \n",
+    "        pbar = tqdm(train_loader, desc=f\"Epoch {epoch+1}/{epochs}\")\n",
+    "        for batch in pbar:\n",
+    "            states = batch['state']\n",
+    "            actions = batch['action']\n",
+    "            next_states = batch['next_state']\n",
+    "            \n",
+    "            # Forward pass with reconstruction\n",
+    "            predicted_next, target_next, state_recon, state_orig = model.forward_with_reconstruction(\n",
+    "                states, actions, next_states\n",
+    "            )\n",
+    "            \n",
+    "            # Prediction loss (main objective)\n",
+    "            pred_mse = F.mse_loss(predicted_next, target_next)\n",
+    "            pred_cos = (1 - F.cosine_similarity(predicted_next, target_next, dim=-1)).mean()\n",
+    "            \n",
+    "            # Reconstruction loss (regularization)\n",
+    "            recon_loss = F.mse_loss(state_recon, state_orig.detach())\n",
+    "            \n",
+    "            # Combined loss\n",
+    "            loss = pred_mse + pred_cos + 0.1 * recon_loss\n",
+    "            \n",
+    "            # Backward pass\n",
+    "            optimizer.zero_grad()\n",
+    "            loss.backward()\n",
+    "            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "            optimizer.step()\n",
+    "            \n",
+    "            train_losses.append(loss.item())\n",
+    "            pbar.set_postfix({'loss': f'{loss.item():.4f}'})\n",
+    "        \n",
+    "        scheduler.step()\n",
+    "        \n",
+    "        # Validation\n",
+    "        model.eval()\n",
+    "        val_losses = []\n",
+    "        val_similarities = []\n",
+    "        \n",
+    "        with torch.no_grad():\n",
+    "            for batch in val_loader:\n",
+    "                states = batch['state']\n",
+    "                actions = batch['action']\n",
+    "                next_states = batch['next_state']\n",
+    "                \n",
+    "                predicted_next, target_next, state_recon, state_orig = model.forward_with_reconstruction(\n",
+    "                    states, actions, next_states\n",
+    "                )\n",
+    "                \n",
+    "                pred_mse = F.mse_loss(predicted_next, target_next)\n",
+    "                pred_cos = (1 - F.cosine_similarity(predicted_next, target_next, dim=-1)).mean()\n",
+    "                recon_loss = F.mse_loss(state_recon, state_orig)\n",
+    "                loss = pred_mse + pred_cos + 0.1 * recon_loss\n",
+    "                \n",
+    "                val_losses.append(loss.item())\n",
+    "                val_similarities.append(F.cosine_similarity(predicted_next, target_next, dim=-1).mean().item())\n",
+    "        \n",
+    "        # Record history\n",
+    "        history['train_loss'].append(np.mean(train_losses))\n",
+    "        history['val_loss'].append(np.mean(val_losses))\n",
+    "        history['val_similarity'].append(np.mean(val_similarities))\n",
+    "        \n",
+    "        print(f\"Epoch {epoch+1}: Train Loss={np.mean(train_losses):.4f}, \"\n",
+    "              f\"Val Loss={np.mean(val_losses):.4f}, Val Similarity={np.mean(val_similarities):.4f}\")\n",
+    "    \n",
+    "    return history\n",
+    "\n",
+    "\n",
+    "# Train the model\n",
+    "print(\"\\n\" + \"=\"*50)\n",
+    "print(\"Training Option 3: Autoencoder JEPA\")\n",
+    "print(\"=\"*50)\n",
+    "history_opt3 = train_option3(model_opt3, train_dataset, val_dataset, epochs=15, batch_size=16)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def test_model_opt3(model, test_samples):\n",
+    "    \"\"\"Test Option 3 model.\"\"\"\n",
+    "    model.eval()\n",
+    "    \n",
+    "    print(\"\\n\" + \"=\"*60)\n",
+    "    print(\"Option 3 Model Predictions\")\n",
+    "    print(\"=\"*60)\n",
+    "    \n",
+    "    for sample in test_samples:\n",
+    "        state = sample['state']\n",
+    "        action = sample['action']\n",
+    "        actual_next = sample['next_state']\n",
+    "        \n",
+    "        with torch.no_grad():\n",
+    "            # Get prediction\n",
+    "            predicted_emb = model([state], [action])\n",
+    "            actual_emb = model.encode_state([actual_next])\n",
+    "            \n",
+    "            # Compute similarity\n",
+    "            similarity = F.cosine_similarity(predicted_emb, actual_emb, dim=-1).item()\n",
+    "            \n",
+    "        print(f\"\\nState: {state}\")\n",
+    "        print(f\"Action: {action}\")\n",
+    "        print(f\"Actual Next State: {actual_next}\")\n",
+    "        print(f\"Prediction Similarity: {similarity:.4f} {'✓' if similarity > 0.8 else '✗'}\")\n",
+    "\n",
+    "\n",
+    "# Test Option 3\n",
+    "test_model_opt3(model_opt3, test_samples)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 📊 Compare All Three Approaches"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Plot comparison\n",
+    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
+    "\n",
+    "# Training Loss\n",
+    "axes[0].plot(history_opt1['train_loss'], label='Option 1: Sentence Encoder', marker='o')\n",
+    "axes[0].plot(history_opt2['train_loss'], label='Option 2: LLM Hidden States', marker='s')\n",
+    "axes[0].plot(history_opt3['train_loss'], label='Option 3: Autoencoder', marker='^')\n",
+    "axes[0].set_xlabel('Epoch')\n",
+    "axes[0].set_ylabel('Training Loss')\n",
+    "axes[0].set_title('Training Loss Comparison')\n",
+    "axes[0].legend()\n",
+    "axes[0].grid(True, alpha=0.3)\n",
+    "\n",
+    "# Validation Loss\n",
+    "axes[1].plot(history_opt1['val_loss'], label='Option 1', marker='o')\n",
+    "axes[1].plot(history_opt2['val_loss'], label='Option 2', marker='s')\n",
+    "axes[1].plot(history_opt3['val_loss'], label='Option 3', marker='^')\n",
+    "axes[1].set_xlabel('Epoch')\n",
+    "axes[1].set_ylabel('Validation Loss')\n",
+    "axes[1].set_title('Validation Loss Comparison')\n",
+    "axes[1].legend()\n",
+    "axes[1].grid(True, alpha=0.3)\n",
+    "\n",
+    "# Validation Similarity\n",
+    "axes[2].plot(history_opt1['val_similarity'], label='Option 1', marker='o')\n",
+    "axes[2].plot(history_opt2['val_similarity'], label='Option 2', marker='s')\n",
+    "axes[2].plot(history_opt3['val_similarity'], label='Option 3', marker='^')\n",
+    "axes[2].set_xlabel('Epoch')\n",
+    "axes[2].set_ylabel('Cosine Similarity')\n",
+    "axes[2].set_title('Validation Similarity (Higher = Better)')\n",
+    "axes[2].legend()\n",
+    "axes[2].grid(True, alpha=0.3)\n",
+    "axes[2].set_ylim([0, 1])\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.savefig('jepa_comparison.png', dpi=150, bbox_inches='tight')\n",
+    "plt.show()\n",
+    "\n",
+    "# Print final metrics\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\"Final Metrics Comparison\")\n",
+    "print(\"=\"*60)\n",
+    "print(f\"{'Model':<30} {'Val Loss':<15} {'Val Similarity':<15}\")\n",
+    "print(\"-\"*60)\n",
+    "print(f\"{'Option 1: Sentence Encoder':<30} {history_opt1['val_loss'][-1]:<15.4f} {history_opt1['val_similarity'][-1]:<15.4f}\")\n",
+    "print(f\"{'Option 2: LLM Hidden States':<30} {history_opt2['val_loss'][-1]:<15.4f} {history_opt2['val_similarity'][-1]:<15.4f}\")\n",
+    "print(f\"{'Option 3: Autoencoder':<30} {history_opt3['val_loss'][-1]:<15.4f} {history_opt3['val_similarity'][-1]:<15.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 🚀 Interactive Demo: Try Your Own State Transitions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def interactive_demo(model, model_name=\"Model\"):\n",
+    "    \"\"\"\n",
+    "    Interactive demo to test state transitions.\n",
+    "    \"\"\"\n",
+    "    print(f\"\\n{'='*60}\")\n",
+    "    print(f\"Interactive Demo: {model_name}\")\n",
+    "    print(\"=\"*60)\n",
+    "    print(\"\\nEnter a state and action to predict the next state.\")\n",
+    "    print(\"Type 'quit' to exit.\\n\")\n",
+    "    \n",
+    "    # Pre-compute embeddings for all known states for nearest neighbor search\n",
+    "    known_states = list(set(\n",
+    "        [s['state'] for s in train_dataset.samples] + \n",
+    "        [s['next_state'] for s in train_dataset.samples]\n",
+    "    ))\n",
+    "    \n",
+    "    model.eval()\n",
+    "    with torch.no_grad():\n",
+    "        if hasattr(model, 'encode_state'):\n",
+    "            known_embeddings = model.encode_state(known_states)\n",
+    "        else:\n",
+    "            known_embeddings = model.get_target_embedding(known_states)\n",
+    "    \n",
+    "    while True:\n",
+    "        state = input(\"\\nState: \").strip()\n",
+    "        if state.lower() == 'quit':\n",
+    "            break\n",
+    "            \n",
+    "        action = input(\"Action: \").strip()\n",
+    "        if action.lower() == 'quit':\n",
+    "            break\n",
+    "        \n",
+    "        with torch.no_grad():\n",
+    "            # Predict next state embedding\n",
+    "            predicted_emb = model([state], [action])\n",
+    "            \n",
+    "            # Find nearest known state\n",
+    "            similarities = F.cosine_similarity(\n",
+    "                predicted_emb.unsqueeze(1),\n",
+    "                known_embeddings.unsqueeze(0),\n",
+    "                dim=-1\n",
+    "            )\n",
+    "            \n",
+    "            # One topk call returns both values and indices; cap k at the pool size\n",
+    "            top_k = min(3, len(known_states))\n",
+    "            top_sims, top_indices = similarities[0].topk(top_k)\n",
+    "            \n",
+    "            print(\"\\nPredicted Next States (by similarity):\")\n",
+    "            for i, (idx, sim) in enumerate(zip(top_indices, top_sims)):\n",
+    "                print(f\"  {i+1}. [{sim:.4f}] {known_states[idx]}\")\n",
+    "\n",
+    "\n",
+    "# Demo setup for Option 1 (swap in whichever model scored best above)\n",
+    "print(\"\\nInteractive demo is set up for Option 1 (Sentence Encoder); the call below is commented out.\")\n",
+    "print(\"\\nExample inputs to try:\")\n",
+    "print(\"  State: 'Document is in draft status with 1 section'\")\n",
+    "print(\"  Action: 'User creates new section'\")\n",
+    "\n",
+    "# Uncomment to run interactive demo:\n",
+    "# interactive_demo(model_opt1, \"Option 1: Sentence Encoder\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 💾 Save Models"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save all three models\n",
+    "torch.save({\n",
+    "    'model_state_dict': model_opt1.state_dict(),\n",
+    "    'history': history_opt1,\n",
+    "}, 'jepa_option1_sentence_encoder.pt')\n",
+    "\n",
+    "torch.save({\n",
+    "    'model_state_dict': model_opt2.state_dict(),\n",
+    "    'history': history_opt2,\n",
+    "}, 'jepa_option2_llm_hidden.pt')\n",
+    "\n",
+    "torch.save({\n",
+    "    'model_state_dict': model_opt3.state_dict(),\n",
+    "    'history': history_opt3,\n",
+    "}, 'jepa_option3_autoencoder.pt')\n",
+    "\n",
+    "print(\"Models saved!\")\n",
+    "print(\"  - jepa_option1_sentence_encoder.pt\")\n",
+    "print(\"  - jepa_option2_llm_hidden.pt\")\n",
+    "print(\"  - jepa_option3_autoencoder.pt\")"
+   ]
+  },
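+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Reloading follows the same pattern in reverse. The cell below is a minimal sketch for Option 3, assuming `model_opt3` has already been rebuilt with the same constructor arguments and that the checkpoint file comes from the save cell above. Because the saved `history` holds NumPy floats, PyTorch releases that default to `weights_only=True` may refuse to unpickle the file, hence the explicit `weights_only=False`; only use that on checkpoints you trust."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: restore Option 3 from the checkpoint written above\n",
+    "checkpoint = torch.load('jepa_option3_autoencoder.pt', map_location=device, weights_only=False)\n",
+    "model_opt3.load_state_dict(checkpoint['model_state_dict'])\n",
+    "model_opt3.eval()\n",
+    "\n",
+    "# The training curves were stored alongside the weights\n",
+    "history_opt3 = checkpoint['history']\n",
+    "print(f\"Restored {len(history_opt3['val_loss'])} epochs of history\")"
+   ]
+  },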
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "# 📝 Summary & Next Steps\n",
+    "\n",
+    "## What We Built\n",
+    "\n",
+    "Three JEPA-style world models that predict state consequences:\n",
+    "\n",
+    "| Option | Encoder | Complexity | Best For |\n",
+    "|--------|---------|------------|----------|\n",
+    "| 1 | Pre-trained SentenceTransformer | Simplest | Quick prototyping |\n",
+    "| 2 | Frozen LLM + trainable head | Medium | General domains |\n",
+    "| 3 | Trainable autoencoder | Complex | Domain-specific |\n",
+    "\n",
+    "## Key Differences from Normal LLMs\n",
+    "\n",
+    "- **Input:** State + Action embeddings (not tokens)\n",
+    "- **Output:** State embeddings (not vocabulary logits)\n",
+    "- **Loss:** MSE + cosine distance, i.e. `1 - cosine_similarity` (not cross-entropy)\n",
+    "- **Generation:** Single-shot prediction (not autoregressive)\n",
+    "\n",
+    "## Next Steps\n",
+    "\n",
+    "1. **Scale up:** Use larger base models (Llama, Mistral)\n",
+    "2. **Real data:** Replace synthetic data with actual enterprise logs\n",
+    "3. **Multi-step:** Chain predictions for trajectory forecasting\n",
+    "4. **Planning:** Use predicted states for action selection\n",
+    "5. **Continual learning:** Add test-time training (TTT)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  },
+  "accelerator": "GPU"
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
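
The loss listed under "Key Differences" in the notebook above (MSE plus cosine distance) can be worked through by hand on small vectors. The following is a minimal pure-Python sketch for a single prediction/target pair; the helper names `mse`, `cosine_similarity`, and `jepa_loss` are hypothetical and stand in for the `F.mse_loss` and `F.cosine_similarity` calls used in the training loops.

```python
import math


def mse(pred, target):
    """Mean squared error over all vector components."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jepa_loss(pred, target):
    """MSE plus cosine distance (1 - cosine similarity) for one embedding pair."""
    return mse(pred, target) + (1.0 - cosine_similarity(pred, target))


orthogonal = jepa_loss([1.0, 0.0], [0.0, 1.0])  # MSE 1.0 + cosine distance 1.0
identical = jepa_loss([0.0, 1.0], [0.0, 1.0])   # both terms vanish
rescaled = jepa_loss([2.0, 0.0], [1.0, 0.0])    # same direction, wrong magnitude

print(orthogonal, identical, rescaled)
```

The third case suggests why the two terms are combined: cosine distance alone is zero for any prediction that points in the right direction, so only the MSE term penalizes a wrong magnitude.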

jepa_option1_sentence_encoder.ipynb ADDED

	@@ -0,0 +1,690 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🧠 JEPA-Style LLM - Option 1: Sentence Encoder Approach\n",
+    "\n",
+    "**The Simplest Path: Use pre-trained sentence embeddings as your state space**\n",
+    "\n",
+    "This notebook demonstrates how to make a decoder-only transformer act like a JEPA world model:\n",
+    "- Input: State embedding + Action embedding\n",
+    "- Output: Predicted next state embedding (NOT tokens)\n",
+    "- Loss: MSE in embedding space\n",
+    "\n",
+    "**Key Insight:** We're predicting *consequences of actions* in a continuous space, not generating text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install dependencies\n",
+    "!pip install -q transformers accelerate sentence-transformers torch datasets wandb"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from sentence_transformers import SentenceTransformer\n",
+    "from transformers import AutoModel, AutoTokenizer, AutoConfig\n",
+    "import numpy as np\n",
+    "from tqdm.auto import tqdm\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# Check GPU\n",
+    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
+    "print(f\"Using device: {device}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Create Synthetic Training Data\n",
+    "\n",
+    "We'll simulate an enterprise workflow where:\n",
+    "- **State** = description of current document/workflow status\n",
+    "- **Action** = what the user does\n",
+    "- **Next State** = resulting status after action\n",
+    "\n",
+    "In production, you'd collect this from real user interactions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Synthetic enterprise workflow data\n",
+    "# Format: (current_state, action, next_state)\n",
+    "\n",
+    "WORKFLOW_DATA = [\n",
+    "    # Document editing workflows\n",
+    "    (\"Document is empty with no content\", \"User creates new section titled Introduction\", \"Document has one section: Introduction with no content\"),\n",
+    "    (\"Document has one section: Introduction with no content\", \"User writes 500 words in Introduction\", \"Document has Introduction section with 500 words of content\"),\n",
+    "    (\"Document has Introduction section with 500 words of content\", \"User adds new section titled Methods\", \"Document has two sections: Introduction (500 words) and Methods (empty)\"),\n",
+    "    (\"Document has two sections: Introduction (500 words) and Methods (empty)\", \"User writes 300 words in Methods\", \"Document has Introduction (500 words) and Methods (300 words)\"),\n",
+    "    (\"Document has Introduction (500 words) and Methods (300 words)\", \"User submits document for review\", \"Document is pending review with total 800 words\"),\n",
+    "    (\"Document is pending review with total 800 words\", \"Reviewer approves document\", \"Document is approved and ready for publication\"),\n",
+    "    (\"Document is pending review with total 800 words\", \"Reviewer requests changes\", \"Document returned to author with revision requests\"),\n",
+    "    (\"Document returned to author with revision requests\", \"User makes requested edits\", \"Document revised and ready for re-review\"),\n",
+    "    \n",
+    "    # Project management workflows\n",
+    "    (\"Project has no tasks assigned\", \"Manager creates 5 new tasks\", \"Project has 5 tasks all in pending status\"),\n",
+    "    (\"Project has 5 tasks all in pending status\", \"Developer starts working on task 1\", \"Project has 1 in-progress task and 4 pending tasks\"),\n",
+    "    (\"Project has 1 in-progress task and 4 pending tasks\", \"Developer completes task 1\", \"Project has 1 completed task and 4 pending tasks\"),\n",
+    "    (\"Project has 1 completed task and 4 pending tasks\", \"Developer starts tasks 2 and 3\", \"Project has 1 completed, 2 in-progress, and 2 pending tasks\"),\n",
+    "    (\"Project has 1 completed, 2 in-progress, and 2 pending tasks\", \"Developer completes all in-progress tasks\", \"Project has 3 completed tasks and 2 pending tasks\"),\n",
+    "    \n",
+    "    # Email/Communication workflows\n",
+    "    (\"Inbox has 10 unread emails\", \"User reads 3 emails\", \"Inbox has 7 unread emails and 3 read emails\"),\n",
+    "    (\"Inbox has 7 unread emails and 3 read emails\", \"User archives 2 read emails\", \"Inbox has 7 unread, 1 read, and 2 archived emails\"),\n",
+    "    (\"Inbox has 7 unread, 1 read, and 2 archived emails\", \"User receives new email\", \"Inbox has 8 unread, 1 read, and 2 archived emails\"),\n",
+    "    (\"Inbox has 8 unread, 1 read, and 2 archived emails\", \"User marks all as read\", \"Inbox has 0 unread, 9 read, and 2 archived emails\"),\n",
+    "    \n",
+    "    # Database/Data workflows\n",
+    "    (\"Database table has 100 records\", \"User inserts 50 new records\", \"Database table has 150 records\"),\n",
+    "    (\"Database table has 150 records\", \"User deletes 30 records matching filter\", \"Database table has 120 records\"),\n",
+    "    (\"Database table has 120 records\", \"User updates 20 records with new values\", \"Database table has 120 records with 20 modified\"),\n",
+    "    (\"Database table has 120 records with 20 modified\", \"User exports table to CSV\", \"Database table unchanged, CSV file created with 120 rows\"),\n",
+    "    \n",
+    "    # File system workflows  \n",
+    "    (\"Folder contains 5 files totaling 10MB\", \"User uploads 3 new files of 5MB each\", \"Folder contains 8 files totaling 25MB\"),\n",
+    "    (\"Folder contains 8 files totaling 25MB\", \"User deletes 2 files of 3MB each\", \"Folder contains 6 files totaling 19MB\"),\n",
+    "    (\"Folder contains 6 files totaling 19MB\", \"User creates new subfolder\", \"Folder contains 6 files totaling 19MB and 1 empty subfolder\"),\n",
+    "    (\"Folder contains 6 files totaling 19MB and 1 empty subfolder\", \"User moves 2 files to subfolder\", \"Folder contains 4 files and subfolder with 2 files\"),\n",
+    "    \n",
+    "    # Shopping cart workflows\n",
+    "    (\"Cart is empty with 0 items\", \"User adds product A priced at $50\", \"Cart has 1 item with total $50\"),\n",
+    "    (\"Cart has 1 item with total $50\", \"User adds product B priced at $30\", \"Cart has 2 items with total $80\"),\n",
+    "    (\"Cart has 2 items with total $80\", \"User applies 10% discount code\", \"Cart has 2 items with total $72 after discount\"),\n",
+    "    (\"Cart has 2 items with total $72 after discount\", \"User removes product A\", \"Cart has 1 item with total $27 after discount\"),\n",
+    "    (\"Cart has 1 item with total $27 after discount\", \"User proceeds to checkout\", \"Order created for 1 item totaling $27\"),\n",
+    "]\n",
+    "\n",
+    "# Augment data with variations\n",
+    "def augment_data(data, multiplier=5):\n",
+    "    \"\"\"Return the original triples plus (multiplier - 1) noisy variants of each.\"\"\"\n",
+    "    augmented = list(data)\n",
+    "    \n",
+    "    # Filler phrases used as surface-level noise; the next state is left untouched\n",
+    "    prefixes = [\"\", \"Currently, \", \"At this point, \", \"Right now, \"]\n",
+    "    action_prefixes = [\"\", \"Then \", \"Next, \", \"Subsequently, \"]\n",
+    "    \n",
+    "    for state, action, next_state in data:\n",
+    "        for _ in range(multiplier - 1):\n",
+    "            # With probability 0.5, lowercase the text and prepend a random filler phrase\n",
+    "            new_state = np.random.choice(prefixes) + state.lower() if np.random.random() > 0.5 else state\n",
+    "            new_action = np.random.choice(action_prefixes) + action.lower() if np.random.random() > 0.5 else action\n",
+    "            \n",
+    "            augmented.append((new_state, new_action, next_state))\n",
+    "    \n",
+    "    return augmented\n",
+    "\n",
+    "training_data = augment_data(WORKFLOW_DATA, multiplier=10)\n",
+    "print(f\"Total training examples: {len(training_data)}\")\n",
+    "print(f\"\\nSample:\\n  State: {training_data[0][0]}\\n  Action: {training_data[0][1]}\\n  Next State: {training_data[0][2]}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Define the JEPA-Style Model\n",
+    "\n",
+    "The key architectural change:\n",
+    "- **Normal LLM:** `hidden_state → vocab_head → token_logits`\n",
+    "- **JEPA LLM:** `hidden_state → state_head → state_embedding`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class JEPAWorldModel(nn.Module):\n",
+    "    \"\"\"\n",
+    "    A decoder-only transformer modified to act like a JEPA world model.\n",
+    "    \n",
+    "    Instead of predicting next tokens, it predicts next STATE EMBEDDINGS\n",
+    "    given current state + action.\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    def __init__(\n",
+    "        self,\n",
+    "        sentence_encoder_name: str = \"all-MiniLM-L6-v2\",\n",
+    "        backbone_name: str = \"gpt2\",  # Small model for testing\n",
+    "        state_dim: int = 384,  # MiniLM output dim\n",
+    "        hidden_dim: int = 512,\n",
+    "        freeze_sentence_encoder: bool = True\n",
+    "    ):\n",
+    "        super().__init__()\n",
+    "        \n",
+    "        # Sentence encoder for state/action embeddings\n",
+    "        self.sentence_encoder = SentenceTransformer(sentence_encoder_name)\n",
+    "        if freeze_sentence_encoder:\n",
+    "            for param in self.sentence_encoder.parameters():\n",
+    "                param.requires_grad = False\n",
+    "        \n",
+    "        self.state_dim = state_dim\n",
+    "        \n",
+    "        # Backbone transformer (we use its hidden layers, not its LM head)\n",
+    "        self.backbone = AutoModel.from_pretrained(backbone_name)\n",
+    "        backbone_hidden = self.backbone.config.hidden_size  # 768 for GPT-2\n",
+    "        \n",
+    "        # Project state+action embeddings into backbone space\n",
+    "        self.input_projection = nn.Sequential(\n",
+    "            nn.Linear(state_dim * 2, hidden_dim),\n",
+    "            nn.GELU(),\n",
+    "            nn.LayerNorm(hidden_dim),\n",
+    "            nn.Linear(hidden_dim, backbone_hidden)\n",
+    "        )\n",
+    "        \n",
+    "        # State prediction head (replaces vocabulary head)\n",
+    "        # This is the JEPA key: output embeddings, not tokens\n",
+    "        self.state_predictor = nn.Sequential(\n",
+    "            nn.Linear(backbone_hidden, hidden_dim),\n",
+    "            nn.GELU(),\n",
+    "            nn.LayerNorm(hidden_dim),\n",
+    "            nn.Linear(hidden_dim, state_dim)\n",
+    "        )\n",
+    "        \n",
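+    "    def encode_text(self, texts: list) -> torch.Tensor:\n",
+    "        \"\"\"Convert text to embeddings using the sentence encoder.\n",
+    "        \n",
+    "        Note: SentenceTransformer.encode() runs without gradient tracking, so the\n",
+    "        encoder is effectively frozen here even if freeze_sentence_encoder=False.\n",
+    "        \"\"\"\n",
+    "        embeddings = self.sentence_encoder.encode(\n",
+    "            texts, \n",
+    "            convert_to_tensor=True,\n",
+    "            show_progress_bar=False\n",
+    "        )\n",
+    "        # Copy so the trainable layers receive an ordinary tensor (harmless if it already is one)\n",
+    "        return embeddings.clone()\n",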
+    "    \n",
+    "    def forward(\n",
+    "        self, \n",
+    "        state_texts: list,\n",
+    "        action_texts: list\n",
+    "    ) -> torch.Tensor:\n",
+    "        \"\"\"\n",
+    "        Predict next state embedding given current state and action.\n",
+    "        \n",
+    "        Args:\n",
+    "            state_texts: List of strings describing current states\n",
+    "            action_texts: List of strings describing actions\n",
+    "            \n",
+    "        Returns:\n",
+    "            predicted_next_state: [batch_size, state_dim] embedding\n",
+    "        \"\"\"\n",
+    "        # Encode state and action to embeddings\n",
+    "        state_emb = self.encode_text(state_texts)  # [B, state_dim]\n",
+    "        action_emb = self.encode_text(action_texts)  # [B, state_dim]\n",
+    "        \n",
+    "        # Concatenate state and action\n",
+    "        combined = torch.cat([state_emb, action_emb], dim=-1)  # [B, state_dim*2]\n",
+    "        \n",
+    "        # Project to backbone space\n",
+    "        backbone_input = self.input_projection(combined)  # [B, backbone_hidden]\n",
+    "        backbone_input = backbone_input.unsqueeze(1)  # [B, 1, backbone_hidden]\n",
+    "        \n",
+    "        # Pass through backbone transformer\n",
+    "        backbone_output = self.backbone(\n",
+    "            inputs_embeds=backbone_input\n",
+    "        ).last_hidden_state[:, -1, :]  # [B, backbone_hidden]\n",
+    "        \n",
+    "        # Predict next state embedding (NOT tokens!)\n",
+    "        predicted_next_state = self.state_predictor(backbone_output)  # [B, state_dim]\n",
+    "        \n",
+    "        return predicted_next_state\n",
+    "    \n",
+    "    def get_target_embedding(self, next_state_texts: list) -> torch.Tensor:\n",
+    "        \"\"\"Get target embeddings for loss computation\"\"\"\n",
+    "        with torch.no_grad():\n",
+    "            return self.encode_text(next_state_texts)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Create Dataset and DataLoader"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class WorkflowDataset(Dataset):\n",
+    "    \"\"\"Dataset of (state, action, next_state) triplets\"\"\"\n",
+    "    \n",
+    "    def __init__(self, data):\n",
+    "        self.data = data\n",
+    "        \n",
+    "    def __len__(self):\n",
+    "        return len(self.data)\n",
+    "    \n",
+    "    def __getitem__(self, idx):\n",
+    "        state, action, next_state = self.data[idx]\n",
+    "        return {\n",
+    "            'state': state,\n",
+    "            'action': action,\n",
+    "            'next_state': next_state\n",
+    "        }\n",
+    "\n",
+    "def collate_fn(batch):\n",
+    "    \"\"\"Collate function that keeps strings as lists\"\"\"\n",
+    "    return {\n",
+    "        'states': [item['state'] for item in batch],\n",
+    "        'actions': [item['action'] for item in batch],\n",
+    "        'next_states': [item['next_state'] for item in batch]\n",
+    "    }\n",
+    "\n",
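+    "# Split data\n",
+    "# Note: the split happens after augmentation, so copies of the same base triplet\n",
+    "# land in both sets; validation here measures fit, not generalization.\n",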
+    "np.random.shuffle(training_data)\n",
+    "split_idx = int(len(training_data) * 0.9)\n",
+    "train_data = training_data[:split_idx]\n",
+    "val_data = training_data[split_idx:]\n",
+    "\n",
+    "train_dataset = WorkflowDataset(train_data)\n",
+    "val_dataset = WorkflowDataset(val_data)\n",
+    "\n",
+    "train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)\n",
+    "val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False, collate_fn=collate_fn)\n",
+    "\n",
+    "print(f\"Train batches: {len(train_loader)}, Val batches: {len(val_loader)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Training Loop\n",
+    "\n",
+    "**The key difference from LLM training:**\n",
+    "- LLM: `loss = CrossEntropy(predicted_logits, target_tokens)`\n",
+    "- JEPA: `loss = MSE(predicted_embedding, target_embedding)`"
+   ]
+  },
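+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the two objectives concrete, here is a minimal sketch on random toy tensors. The sizes (`batch_size`, `vocab_size`, `state_dim`) are illustrative assumptions, not values used elsewhere in this notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "torch.manual_seed(0)\n",
+    "batch_size, vocab_size, state_dim = 4, 10, 8  # hypothetical toy sizes\n",
+    "\n",
+    "# Token-prediction objective: logits over a vocabulary vs. integer token ids\n",
+    "logits = torch.randn(batch_size, vocab_size)\n",
+    "target_tokens = torch.randint(0, vocab_size, (batch_size,))\n",
+    "lm_loss = F.cross_entropy(logits, target_tokens)\n",
+    "\n",
+    "# Embedding-prediction objective: predicted vs. target state vectors\n",
+    "predicted_emb = torch.randn(batch_size, state_dim)\n",
+    "target_emb = torch.randn(batch_size, state_dim)\n",
+    "jepa_loss = F.mse_loss(predicted_emb, target_emb)\n",
+    "\n",
+    "print(f\"cross-entropy: {lm_loss.item():.4f}, mse: {jepa_loss.item():.4f}\")"
+   ]
+  },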
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_epoch(model, dataloader, optimizer, device):\n",
+    "    model.train()\n",
+    "    total_loss = 0\n",
+    "    \n",
+    "    for batch in tqdm(dataloader, desc=\"Training\"):\n",
+    "        # Forward pass\n",
+    "        predicted_next = model(batch['states'], batch['actions'])\n",
+    "        \n",
+    "        # Get target embeddings\n",
+    "        target_next = model.get_target_embedding(batch['next_states'])\n",
+    "        \n",
+    "        # JEPA-style loss: MSE in embedding space\n",
+    "        loss = F.mse_loss(predicted_next, target_next)\n",
+    "        \n",
+    "        # Alternative: Cosine similarity loss\n",
+    "        # loss = 1 - F.cosine_similarity(predicted_next, target_next).mean()\n",
+    "        \n",
+    "        # Backward pass\n",
+    "        optimizer.zero_grad()\n",
+    "        loss.backward()\n",
+    "        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "        optimizer.step()\n",
+    "        \n",
+    "        total_loss += loss.item()\n",
+    "    \n",
+    "    return total_loss / len(dataloader)\n",
+    "\n",
+    "\n",
+    "def validate(model, dataloader, device):\n",
+    "    model.eval()\n",
+    "    total_loss = 0\n",
+    "    total_cosine_sim = 0\n",
+    "    \n",
+    "    with torch.no_grad():\n",
+    "        for batch in dataloader:\n",
+    "            predicted_next = model(batch['states'], batch['actions'])\n",
+    "            target_next = model.get_target_embedding(batch['next_states'])\n",
+    "            \n",
+    "            loss = F.mse_loss(predicted_next, target_next)\n",
+    "            cosine_sim = F.cosine_similarity(predicted_next, target_next).mean()\n",
+    "            \n",
+    "            total_loss += loss.item()\n",
+    "            total_cosine_sim += cosine_sim.item()\n",
+    "    \n",
+    "    return {\n",
+    "        'loss': total_loss / len(dataloader),\n",
+    "        'cosine_similarity': total_cosine_sim / len(dataloader)\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize model\n",
+    "model = JEPAWorldModel(\n",
+    "    sentence_encoder_name=\"all-MiniLM-L6-v2\",\n",
+    "    backbone_name=\"gpt2\",\n",
+    "    state_dim=384,\n",
+    "    hidden_dim=512\n",
+    ")\n",
+    "model = model.to(device)\n",
+    "\n",
+    "# Optimizer\n",
+    "optimizer = torch.optim.AdamW(\n",
+    "    filter(lambda p: p.requires_grad, model.parameters()),\n",
+    "    lr=1e-4,\n",
+    "    weight_decay=0.01\n",
+    ")\n",
+    "\n",
+    "# Count parameters\n",
+    "trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "total_params = sum(p.numel() for p in model.parameters())\n",
+    "print(f\"Trainable parameters: {trainable_params:,}\")\n",
+    "print(f\"Total parameters: {total_params:,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Training\n",
+    "num_epochs = 20\n",
+    "train_losses = []\n",
+    "val_losses = []\n",
+    "val_cosine_sims = []\n",
+    "\n",
+    "for epoch in range(num_epochs):\n",
+    "    train_loss = train_epoch(model, train_loader, optimizer, device)\n",
+    "    val_metrics = validate(model, val_loader, device)\n",
+    "    \n",
+    "    train_losses.append(train_loss)\n",
+    "    val_losses.append(val_metrics['loss'])\n",
+    "    val_cosine_sims.append(val_metrics['cosine_similarity'])\n",
+    "    \n",
+    "    print(f\"Epoch {epoch+1}/{num_epochs}\")\n",
+    "    print(f\"  Train Loss: {train_loss:.4f}\")\n",
+    "    print(f\"  Val Loss: {val_metrics['loss']:.4f}\")\n",
+    "    print(f\"  Val Cosine Similarity: {val_metrics['cosine_similarity']:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Plot training curves\n",
+    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
+    "\n",
+    "axes[0].plot(train_losses, label='Train')\n",
+    "axes[0].plot(val_losses, label='Validation')\n",
+    "axes[0].set_xlabel('Epoch')\n",
+    "axes[0].set_ylabel('MSE Loss')\n",
+    "axes[0].set_title('Training Progress')\n",
+    "axes[0].legend()\n",
+    "\n",
+    "axes[1].plot(val_cosine_sims, color='green')\n",
+    "axes[1].set_xlabel('Epoch')\n",
+    "axes[1].set_ylabel('Cosine Similarity')\n",
+    "axes[1].set_title('Prediction Quality (higher = better)')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Test the World Model: Predict Consequences of Actions"
+   ]
+  },
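+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The test below turns a predicted embedding back into text by nearest-neighbor lookup over candidate states. As a minimal sketch of that step alone, hand-written 3-d vectors stand in for real embeddings (an assumption for illustration only)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "# Hypothetical predicted embedding and three candidate embeddings\n",
+    "predicted = torch.tensor([[0.9, 0.1, 0.0]])      # [1, 3]\n",
+    "candidates = torch.tensor([[1.0, 0.0, 0.0],\n",
+    "                           [0.0, 1.0, 0.0],\n",
+    "                           [0.0, 0.0, 1.0]])     # [3, 3]\n",
+    "\n",
+    "# Cosine similarity broadcasts the single prediction against every candidate\n",
+    "sims = F.cosine_similarity(predicted, candidates)  # [3]\n",
+    "best = sims.argmax().item()\n",
+    "\n",
+    "print(f\"similarities: {sims.tolist()}, best candidate index: {best}\")"
+   ]
+  },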
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def predict_next_state(model, current_state: str, action: str, candidate_states: list) -> dict:\n",
+    "    \"\"\"\n",
+    "    Given current state and action, pick the candidate state closest to the prediction.\n",
+    "    \n",
+    "    The model predicts an embedding; nearest-neighbor lookup over the candidates\n",
+    "    (by cosine similarity) turns that embedding back into readable text.\n",
+    "    \"\"\"\n",
+    "    model.eval()\n",
+    "    \n",
+    "    with torch.no_grad():\n",
+    "        # Predict next state embedding\n",
+    "        predicted_emb = model([current_state], [action])  # [1, state_dim]\n",
+    "        \n",
+    "        # Get embeddings of all candidate states\n",
+    "        candidate_embs = model.encode_text(candidate_states)  # [N, state_dim]\n",
+    "        \n",
+    "        # Compute cosine similarity to find best match\n",
+    "        similarities = F.cosine_similarity(\n",
+    "            predicted_emb.expand(len(candidate_states), -1),\n",
+    "            candidate_embs\n",
+    "        )\n",
+    "        \n",
+    "        # Pick the best-matching candidate as a plain Python int\n",
+    "        best_idx = similarities.argmax().item()\n",
+    "        \n",
+    "        return {\n",
+    "            'predicted_state': candidate_states[best_idx],\n",
+    "            # 'confidence' is the raw cosine similarity of the best match, not a probability\n",
+    "            'confidence': similarities[best_idx].item(),\n",
+    "            'all_scores': {candidate_states[i]: similarities[i].item() for i in range(len(candidate_states))}\n",
+    "        }\n",
+    "\n",
+    "\n",
+    "# Test on held-out examples\n",
+    "print(\"=\"*80)\n",
+    "print(\"TESTING WORLD MODEL PREDICTIONS\")\n",
+    "print(\"=\"*80)\n",
+    "\n",
+    "test_cases = [\n",
+    "    {\n",
+    "        'state': \"Document is empty with no content\",\n",
+    "        'action': \"User creates new section titled Introduction\",\n",
+    "        'candidates': [\n",
+    "            \"Document has one section: Introduction with no content\",\n",
+    "            \"Document is deleted\",\n",
+    "            \"Document has 500 words\",\n",
+    "            \"User logged out\"\n",
+    "        ]\n",
+    "    },\n",
+    "    {\n",
+    "        'state': \"Cart has 2 items with total $80\",\n",
+    "        'action': \"User applies 10% discount code\",\n",
+    "        'candidates': [\n",
+    "            \"Cart has 2 items with total $72 after discount\",\n",
+    "            \"Cart is empty\",\n",
+    "            \"Cart has 3 items with total $100\",\n",
+    "            \"Order was cancelled\"\n",
+    "        ]\n",
+    "    },\n",
+    "    {\n",
+    "        'state': \"Inbox has 10 unread emails\",\n",
+    "        'action': \"User reads 3 emails\",\n",
+    "        'candidates': [\n",
+    "            \"Inbox has 7 unread emails and 3 read emails\",\n",
+    "            \"Inbox has 13 unread emails\",\n",
+    "            \"Inbox is empty\",\n",
+    "            \"User sent 3 emails\"\n",
+    "        ]\n",
+    "    }\n",
+    "]\n",
+    "\n",
+    "for i, test in enumerate(test_cases):\n",
+    "    print(f\"\\n--- Test {i+1} ---\")\n",
+    "    print(f\"Current State: {test['state']}\")\n",
+    "    print(f\"Action: {test['action']}\")\n",
+    "    \n",
+    "    result = predict_next_state(model, test['state'], test['action'], test['candidates'])\n",
+    "    \n",
+    "    print(f\"\\nPredicted Next State: {result['predicted_state']}\")\n",
+    "    print(f\"Confidence: {result['confidence']:.4f}\")\n",
+    "    print(f\"\\nAll scores:\")\n",
+    "    for state, score in sorted(result['all_scores'].items(), key=lambda x: -x[1]):\n",
+    "        print(f\"  {score:.4f}: {state}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Multi-Step Planning: Chain Predictions\n",
+    "\n",
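+    "A world model can be rolled forward: each predicted state is fed back in as the next input to simulate several steps ahead. Here every prediction is snapped to the nearest candidate state text before the next step."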
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def simulate_trajectory(model, initial_state: str, actions: list, possible_states: list) -> list:\n",
+    "    \"\"\"\n",
+    "    Simulate a trajectory of states given a sequence of actions.\n",
+    "    This is multi-step world model prediction!\n",
+    "    \"\"\"\n",
+    "    trajectory = [initial_state]\n",
+    "    current_state = initial_state\n",
+    "    \n",
+    "    for action in actions:\n",
+    "        result = predict_next_state(model, current_state, action, possible_states)\n",
+    "        current_state = result['predicted_state']\n",
+    "        trajectory.append({\n",
+    "            'action': action,\n",
+    "            'resulting_state': current_state,\n",
+    "            'confidence': result['confidence']\n",
+    "        })\n",
+    "    \n",
+    "    return trajectory\n",
+    "\n",
+    "\n",
+    "# Test multi-step planning\n",
+    "print(\"\\n\" + \"=\"*80)\n",
+    "print(\"MULTI-STEP TRAJECTORY SIMULATION\")\n",
+    "print(\"=\"*80)\n",
+    "\n",
+    "possible_states = [\n",
+    "    \"Document is empty with no content\",\n",
+    "    \"Document has one section: Introduction with no content\",\n",
+    "    \"Document has Introduction section with 500 words of content\",\n",
+    "    \"Document has two sections: Introduction (500 words) and Methods (empty)\",\n",
+    "    \"Document has Introduction (500 words) and Methods (300 words)\",\n",
+    "    \"Document is pending review with total 800 words\",\n",
+    "    \"Document is approved and ready for publication\",\n",
+    "    \"Document returned to author with revision requests\",\n",
+    "]\n",
+    "\n",
+    "actions = [\n",
+    "    \"User creates new section titled Introduction\",\n",
+    "    \"User writes 500 words in Introduction\",\n",
+    "    \"User adds new section titled Methods\",\n",
+    "    \"User writes 300 words in Methods\",\n",
+    "    \"User submits document for review\"\n",
+    "]\n",
+    "\n",
+    "trajectory = simulate_trajectory(\n",
+    "    model,\n",
+    "    initial_state=\"Document is empty with no content\",\n",
+    "    actions=actions,\n",
+    "    possible_states=possible_states\n",
+    ")\n",
+    "\n",
+    "print(f\"\\nInitial State: {trajectory[0]}\")\n",
+    "for i, step in enumerate(trajectory[1:], 1):\n",
+    "    print(f\"\\nStep {i}:\")\n",
+    "    print(f\"  Action: {step['action']}\")\n",
+    "    print(f\"  → {step['resulting_state']}\")\n",
+    "    print(f\"  (confidence: {step['confidence']:.4f})\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Save the Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save model\n",
+    "torch.save({\n",
+    "    'model_state_dict': model.state_dict(),\n",
+    "    'config': {\n",
+    "        'sentence_encoder_name': 'all-MiniLM-L6-v2',\n",
+    "        'backbone_name': 'gpt2',\n",
+    "        'state_dim': 384,\n",
+    "        'hidden_dim': 512\n",
+    "    }\n",
+    "}, 'jepa_world_model_option1.pt')\n",
+    "\n",
+    "print(\"Model saved to jepa_world_model_option1.pt\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "**What we built:**\n",
+    "- A decoder-only transformer that predicts STATE EMBEDDINGS instead of tokens\n",
+    "- Input: (current_state, action) pair\n",
+    "- Output: predicted next_state embedding\n",
+    "- Loss: MSE between predicted and actual state embeddings\n",
+    "\n",
+    "**This is JEPA-like because:**\n",
+    "1. We predict in latent/embedding space, not token space\n",
+    "2. We learn the \"physics\" of state transitions\n",
+    "3. We can simulate multi-step trajectories by chaining predictions, a building block for planning\n",
+    "\n",
+    "**Next steps:**\n",
+    "- Use your own enterprise data (state, action, next_state) triplets\n",
+    "- Scale up the backbone model\n",
+    "- Add uncertainty estimation for planning"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

jepa_option2_llm_hidden_states.ipynb ADDED

	@@ -0,0 +1,699 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🧠 JEPA-Style LLM - Option 2: LLM Hidden States as World Model\n",
+    "\n",
+    "**Use the LLM's own internal representations as the state space**\n",
+    "\n",
+    "Compared with Option 1, this approach:\n",
+    "- Derives state and action embeddings from the same LLM, so both start from a shared representation space\n",
+    "- Trains the state encoder head together with the transition predictor (the LLM itself stays frozen by default)\n",
+    "- Needs no separate sentence encoder\n",
+    "\n",
+    "**Architecture:**\n",
+    "```\n",
+    "State text → LLM (mean-pooled hidden states) → state encoder head → State embedding\n",
+    "Action text → LLM (mean-pooled hidden states) → action encoder head → Action embedding\n",
+    "State + Action embeddings → MLP dynamics predictor → Next State embedding\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install dependencies\n",
+    "!pip install -q transformers accelerate torch datasets bitsandbytes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM\n",
+    "import numpy as np\n",
+    "from tqdm.auto import tqdm\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
+    "print(f\"Using device: {device}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Synthetic Data (Same as Option 1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "WORKFLOW_DATA = [\n",
+    "    # Document workflows\n",
+    "    (\"Document is empty\", \"create introduction section\", \"Document has introduction section\"),\n",
+    "    (\"Document has introduction section\", \"write 500 words\", \"Document has introduction with 500 words\"),\n",
+    "    (\"Document has introduction with 500 words\", \"add methods section\", \"Document has introduction and methods sections\"),\n",
+    "    (\"Document has introduction and methods sections\", \"submit for review\", \"Document pending review\"),\n",
+    "    (\"Document pending review\", \"reviewer approves\", \"Document approved\"),\n",
+    "    (\"Document pending review\", \"reviewer rejects\", \"Document needs revision\"),\n",
+    "    \n",
+    "    # Task workflows\n",
+    "    (\"Project has no tasks\", \"create 5 tasks\", \"Project has 5 pending tasks\"),\n",
+    "    (\"Project has 5 pending tasks\", \"start task 1\", \"Project has 1 active and 4 pending tasks\"),\n",
+    "    (\"Project has 1 active and 4 pending tasks\", \"complete task 1\", \"Project has 1 done and 4 pending tasks\"),\n",
+    "    (\"Project has 1 done and 4 pending tasks\", \"start remaining tasks\", \"Project has 1 done and 4 active tasks\"),\n",
+    "    (\"Project has 1 done and 4 active tasks\", \"complete all tasks\", \"Project complete with 5 done tasks\"),\n",
+    "    \n",
+    "    # Shopping cart\n",
+    "    (\"Cart empty\", \"add item for $50\", \"Cart has 1 item totaling $50\"),\n",
+    "    (\"Cart has 1 item totaling $50\", \"add item for $30\", \"Cart has 2 items totaling $80\"),\n",
+    "    (\"Cart has 2 items totaling $80\", \"apply 10% discount\", \"Cart has 2 items totaling $72\"),\n",
+    "    (\"Cart has 2 items totaling $72\", \"checkout\", \"Order placed for $72\"),\n",
+    "    (\"Cart has 2 items totaling $80\", \"remove first item\", \"Cart has 1 item totaling $30\"),\n",
+    "    \n",
+    "    # Database operations\n",
+    "    (\"Table has 100 rows\", \"insert 50 rows\", \"Table has 150 rows\"),\n",
+    "    (\"Table has 150 rows\", \"delete 30 rows\", \"Table has 120 rows\"),\n",
+    "    (\"Table has 120 rows\", \"update 20 rows\", \"Table has 120 rows with 20 modified\"),\n",
+    "    \n",
+    "    # File operations\n",
+    "    (\"Folder has 5 files\", \"upload 3 files\", \"Folder has 8 files\"),\n",
+    "    (\"Folder has 8 files\", \"delete 2 files\", \"Folder has 6 files\"),\n",
+    "    (\"Folder has 6 files\", \"create subfolder\", \"Folder has 6 files and 1 subfolder\"),\n",
+    "]\n",
+    "\n",
+    "# Augment\n",
+    "def augment_data(data, multiplier=15):\n",
+    "    augmented = []\n",
+    "    for state, action, next_state in data:\n",
+    "        for _ in range(multiplier):\n",
+    "            # Randomly add prefixes/variations\n",
+    "            s = state if np.random.random() > 0.3 else f\"Currently: {state}\"\n",
+    "            a = action if np.random.random() > 0.3 else f\"User action: {action}\"\n",
+    "            augmented.append((s, a, next_state))\n",
+    "    return augmented\n",
+    "\n",
+    "training_data = augment_data(WORKFLOW_DATA)\n",
+    "np.random.shuffle(training_data)\n",
+    "print(f\"Total examples: {len(training_data)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. JEPA World Model Using LLM Hidden States\n",
+    "\n",
+    "**Key Idea:** State embeddings are derived from the LLM's own hidden states.\n",
+    "- Encode text through the LLM, mean-pool the last hidden states, and project them to a compact state embedding\n",
+    "- Train a predictor on top to forecast next-state embeddings"
+   ]
+  },
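+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Mean pooling should average only over real tokens and skip padding. A minimal sketch with hand-made tensors follows; `masked_mean_pool` is a hypothetical helper name used for illustration and is not part of the model defined in this notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:\n",
+    "    \"\"\"Average [B, T, H] hidden states over positions where the [B, T] mask is 1.\"\"\"\n",
+    "    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [B, T, 1]\n",
+    "    summed = (hidden_states * mask).sum(dim=1)                   # [B, H]\n",
+    "    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero\n",
+    "    return summed / counts\n",
+    "\n",
+    "# Two sequences of length 3; the last position of the first one is padding\n",
+    "hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],\n",
+    "                       [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]])\n",
+    "mask = torch.tensor([[1, 1, 0],\n",
+    "                     [1, 1, 1]])\n",
+    "\n",
+    "pooled = masked_mean_pool(hidden, mask)\n",
+    "print(pooled)  # the padded [100, 100] position does not affect row 0"
+   ]
+  },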
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class JEPAWorldModelV2(nn.Module):\n",
+    "    \"\"\"\n",
+    "    JEPA-style world model using LLM hidden states as state space.\n",
+    "    \n",
+    "    The LLM supplies the representation; small trainable heads do the rest:\n",
+    "    1. State/action encoders: text → mean-pooled LLM hidden state → embedding\n",
+    "    2. Dynamics predictor: an MLP over (state, action) embeddings that predicts the next state\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    def __init__(\n",
+    "        self,\n",
+    "        model_name: str = \"gpt2\",\n",
+    "        state_dim: int = 256,\n",
+    "        freeze_llm: bool = True  # Freeze LLM, train only heads\n",
+    "    ):\n",
+    "        super().__init__()\n",
+    "        \n",
+    "        # Load LLM and tokenizer\n",
+    "        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+    "        self.tokenizer.pad_token = self.tokenizer.eos_token\n",
+    "        \n",
+    "        self.llm = AutoModel.from_pretrained(model_name)\n",
+    "        self.hidden_size = self.llm.config.hidden_size\n",
+    "        \n",
+    "        if freeze_llm:\n",
+    "            for param in self.llm.parameters():\n",
+    "                param.requires_grad = False\n",
+    "        \n",
+    "        self.state_dim = state_dim\n",
+    "        \n",
+    "        # State encoder: LLM hidden → compact state embedding\n",
+    "        self.state_encoder = nn.Sequential(\n",
+    "            nn.Linear(self.hidden_size, self.hidden_size // 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.LayerNorm(self.hidden_size // 2),\n",
+    "            nn.Linear(self.hidden_size // 2, state_dim),\n",
+    "            nn.LayerNorm(state_dim)\n",
+    "        )\n",
+    "        \n",
+    "        # Action encoder (same structure)\n",
+    "        self.action_encoder = nn.Sequential(\n",
+    "            nn.Linear(self.hidden_size, self.hidden_size // 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.LayerNorm(self.hidden_size // 2),\n",
+    "            nn.Linear(self.hidden_size // 2, state_dim),\n",
+    "            nn.LayerNorm(state_dim)\n",
+    "        )\n",
+    "        \n",
+    "        # State dynamics predictor\n",
+    "        # Input: state_emb + action_emb\n",
+    "        # Output: predicted next_state_emb\n",
+    "        self.dynamics_predictor = nn.Sequential(\n",
+    "            nn.Linear(state_dim * 2, state_dim * 2),\n",
+    "            nn.GELU(),\n",
+    "            nn.LayerNorm(state_dim * 2),\n",
+    "            nn.Linear(state_dim * 2, state_dim),\n",
+    "            nn.GELU(),\n",
+    "            nn.LayerNorm(state_dim),\n",
+    "            nn.Linear(state_dim, state_dim)\n",
+    "        )\n",
+    "        \n",
+    "    def get_llm_embedding(self, texts: list) -> torch.Tensor:\n",
+    "        \"\"\"Get mean-pooled LLM hidden states for texts\"\"\"\n",
+    "        tokens = self.tokenizer(\n",
+    "            texts,\n",
+    "            return_tensors='pt',\n",
+    "            padding=True,\n",
+    "            truncation=True,\n",
+    "            max_length=128\n",
+    "        ).to(next(self.llm.parameters()).device)\n",
+    "        \n",
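+    "        # Track gradients through the LLM only when it has trainable parameters\n",
+    "        llm_trainable = any(p.requires_grad for p in self.llm.parameters())\n",
+    "        with torch.set_grad_enabled(llm_trainable and torch.is_grad_enabled()):\n",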
+    "            outputs = self.llm(**tokens)\n",
+    "            hidden_states = outputs.last_hidden_state  # [B, seq_len, hidden]\n",
+    "            \n",
+    "            # Mean pooling (ignoring padding)\n",
+    "            attention_mask = tokens['attention_mask'].unsqueeze(-1)\n",
+    "            sum_hidden = (hidden_states * attention_mask).sum(dim=1)\n",
+    "            mean_hidden = sum_hidden / attention_mask.sum(dim=1)\n",
+    "            \n",
+    "        return mean_hidden  # [B, hidden_size]\n",
+    "    \n",
+    "    def encode_state(self, state_texts: list) -> torch.Tensor:\n",
+    "        \"\"\"Encode state text to state embedding\"\"\"\n",
+    "        llm_emb = self.get_llm_embedding(state_texts)\n",
+    "        return self.state_encoder(llm_emb)\n",
+    "    \n",
+    "    def encode_action(self, action_texts: list) -> torch.Tensor:\n",
+    "        \"\"\"Encode action text to action embedding\"\"\"\n",
+    "        llm_emb = self.get_llm_embedding(action_texts)\n",
+    "        return self.action_encoder(llm_emb)\n",
+    "    \n",
+    "    def forward(\n",
+    "        self,\n",
+    "        state_texts: list,\n",
+    "        action_texts: list\n",
+    "    ) -> torch.Tensor:\n",
+    "        \"\"\"\n",
+    "        Predict next state embedding from current state and action.\n",
+    "        \n",
+    "        This is the JEPA forward pass:\n",
+    "        (state, action) → predicted_next_state_embedding\n",
+    "        \"\"\"\n",
+    "        # Encode state and action\n",
+    "        state_emb = self.encode_state(state_texts)  # [B, state_dim]\n",
+    "        action_emb = self.encode_action(action_texts)  # [B, state_dim]\n",
+    "        \n",
+    "        # Concatenate for dynamics prediction\n",
+    "        combined = torch.cat([state_emb, action_emb], dim=-1)  # [B, state_dim*2]\n",
+    "        \n",
+    "        # Predict next state\n",
+    "        predicted_next_state = self.dynamics_predictor(combined)  # [B, state_dim]\n",
+    "        \n",
+    "        return predicted_next_state\n",
+    "    \n",
+    "    def get_target_embedding(self, next_state_texts: list) -> torch.Tensor:\n",
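+    "        \"\"\"Get target state embedding for loss computation.\n",
+    "        \n",
+    "        Targets are computed without gradients (stop-gradient), so the loss cannot be\n",
+    "        lowered by moving the targets toward the predictions, which invites collapse.\n",
+    "        \"\"\"\n",
+    "        with torch.no_grad():\n",
+    "            return self.encode_state(next_state_texts)"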
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Dataset and DataLoader"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class WorkflowDataset(Dataset):\n",
+    "    def __init__(self, data):\n",
+    "        self.data = data\n",
+    "        \n",
+    "    def __len__(self):\n",
+    "        return len(self.data)\n",
+    "    \n",
+    "    def __getitem__(self, idx):\n",
+    "        state, action, next_state = self.data[idx]\n",
+    "        return {'state': state, 'action': action, 'next_state': next_state}\n",
+    "\n",
+    "def collate_fn(batch):\n",
+    "    return {\n",
+    "        'states': [item['state'] for item in batch],\n",
+    "        'actions': [item['action'] for item in batch],\n",
+    "        'next_states': [item['next_state'] for item in batch]\n",
+    "    }\n",
+    "\n",
+    "# Split\n",
+    "split_idx = int(len(training_data) * 0.9)\n",
+    "train_data = training_data[:split_idx]\n",
+    "val_data = training_data[split_idx:]\n",
+    "\n",
+    "train_loader = DataLoader(\n",
+    "    WorkflowDataset(train_data), \n",
+    "    batch_size=8, \n",
+    "    shuffle=True, \n",
+    "    collate_fn=collate_fn\n",
+    ")\n",
+    "val_loader = DataLoader(\n",
+    "    WorkflowDataset(val_data), \n",
+    "    batch_size=8, \n",
+    "    shuffle=False, \n",
+    "    collate_fn=collate_fn\n",
+    ")\n",
+    "\n",
+    "print(f\"Train: {len(train_loader)} batches, Val: {len(val_loader)} batches\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Training with JEPA-style Loss"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class JEPALoss(nn.Module):\n",
+    "    \"\"\"\n",
+    "    Combined loss for JEPA training:\n",
+    "    - MSE: mean squared error between predicted and target embeddings\n",
+    "    - Cosine: 1 - cosine similarity, averaged over the batch\n",
+    "    Note: no contrastive term is implemented in this class.\n",
+    "    \"\"\"\n",
+    "    def __init__(self, mse_weight=1.0, cosine_weight=0.5):\n",
+    "        super().__init__()\n",
+    "        self.mse_weight = mse_weight\n",
+    "        self.cosine_weight = cosine_weight\n",
+    "        \n",
+    "    def forward(self, predicted: torch.Tensor, target: torch.Tensor) -> dict:\n",
+    "        # MSE loss\n",
+    "        mse_loss = F.mse_loss(predicted, target)\n",
+    "        \n",
+    "        # Cosine similarity loss (maximize similarity = minimize 1 - sim)\n",
+    "        cosine_sim = F.cosine_similarity(predicted, target, dim=-1)\n",
+    "        cosine_loss = (1 - cosine_sim).mean()\n",
+    "        \n",
+    "        # Combined loss\n",
+    "        total_loss = self.mse_weight * mse_loss + self.cosine_weight * cosine_loss\n",
+    "        \n",
+    "        return {\n",
+    "            'total': total_loss,\n",
+    "            'mse': mse_loss,\n",
+    "            'cosine_loss': cosine_loss,\n",
+    "            'cosine_sim': cosine_sim.mean()\n",
+    "        }"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_epoch(model, dataloader, optimizer, loss_fn, device):\n",
+    "    model.train()\n",
+    "    metrics = {'total': 0, 'mse': 0, 'cosine_sim': 0}\n",
+    "    \n",
+    "    for batch in tqdm(dataloader, desc=\"Training\"):\n",
+    "        # Forward\n",
+    "        predicted = model(batch['states'], batch['actions'])\n",
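+    "        # Stop-gradient on the target (usual in JEPA-style training) to discourage\n",
+    "        # the trivial solution where predictions and targets collapse to a constant\n",
+    "        target = model.get_target_embedding(batch['next_states']).detach()\n",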
+    "        \n",
+    "        # Loss\n",
+    "        losses = loss_fn(predicted, target)\n",
+    "        \n",
+    "        # Backward\n",
+    "        optimizer.zero_grad()\n",
+    "        losses['total'].backward()\n",
+    "        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "        optimizer.step()\n",
+    "        \n",
+    "        # Track\n",
+    "        for k, v in losses.items():\n",
+    "            if k in metrics:\n",
+    "                metrics[k] += v.item()\n",
+    "    \n",
+    "    return {k: v / len(dataloader) for k, v in metrics.items()}\n",
+    "\n",
+    "\n",
+    "def validate(model, dataloader, loss_fn, device):\n",
+    "    model.eval()\n",
+    "    metrics = {'total': 0, 'mse': 0, 'cosine_sim': 0}\n",
+    "    \n",
+    "    with torch.no_grad():\n",
+    "        for batch in dataloader:\n",
+    "            predicted = model(batch['states'], batch['actions'])\n",
+    "            target = model.get_target_embedding(batch['next_states'])\n",
+    "            losses = loss_fn(predicted, target)\n",
+    "            \n",
+    "            for k, v in losses.items():\n",
+    "                if k in metrics:\n",
+    "                    metrics[k] += v.item()\n",
+    "    \n",
+    "    return {k: v / len(dataloader) for k, v in metrics.items()}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize\n",
+    "model = JEPAWorldModelV2(\n",
+    "    model_name=\"gpt2\",\n",
+    "    state_dim=256,\n",
+    "    freeze_llm=True\n",
+    ").to(device)\n",
+    "\n",
+    "loss_fn = JEPALoss(mse_weight=1.0, cosine_weight=0.5)\n",
+    "\n",
+    "optimizer = torch.optim.AdamW(\n",
+    "    filter(lambda p: p.requires_grad, model.parameters()),\n",
+    "    lr=3e-4,\n",
+    "    weight_decay=0.01\n",
+    ")\n",
+    "\n",
+    "# Count params\n",
+    "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "total = sum(p.numel() for p in model.parameters())\n",
+    "print(f\"Trainable: {trainable:,} / Total: {total:,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Train\n",
+    "num_epochs = 30\n",
+    "history = {'train_loss': [], 'val_loss': [], 'train_sim': [], 'val_sim': []}\n",
+    "\n",
+    "for epoch in range(num_epochs):\n",
+    "    train_metrics = train_epoch(model, train_loader, optimizer, loss_fn, device)\n",
+    "    val_metrics = validate(model, val_loader, loss_fn, device)\n",
+    "    \n",
+    "    history['train_loss'].append(train_metrics['total'])\n",
+    "    history['val_loss'].append(val_metrics['total'])\n",
+    "    history['train_sim'].append(train_metrics['cosine_sim'])\n",
+    "    history['val_sim'].append(val_metrics['cosine_sim'])\n",
+    "    \n",
+    "    if (epoch + 1) % 5 == 0:\n",
+    "        print(f\"Epoch {epoch+1}/{num_epochs}\")\n",
+    "        print(f\"  Train Loss: {train_metrics['total']:.4f}, Cosine Sim: {train_metrics['cosine_sim']:.4f}\")\n",
+    "        print(f\"  Val Loss: {val_metrics['total']:.4f}, Cosine Sim: {val_metrics['cosine_sim']:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Plot\n",
+    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
+    "\n",
+    "axes[0].plot(history['train_loss'], label='Train')\n",
+    "axes[0].plot(history['val_loss'], label='Val')\n",
+    "axes[0].set_xlabel('Epoch')\n",
+    "axes[0].set_ylabel('Loss')\n",
+    "axes[0].legend()\n",
+    "axes[0].set_title('JEPA Loss')\n",
+    "\n",
+    "axes[1].plot(history['train_sim'], label='Train')\n",
+    "axes[1].plot(history['val_sim'], label='Val')\n",
+    "axes[1].set_xlabel('Epoch')\n",
+    "axes[1].set_ylabel('Cosine Similarity')\n",
+    "axes[1].legend()\n",
+    "axes[1].set_title('Prediction Quality')\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Test: Predict Action Consequences"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def predict_outcome(model, state: str, action: str, candidates: list) -> dict:\n",
+    "    \"\"\"\n",
+    "    JEPA-style inference:\n",
+    "    1. Predict next state embedding\n",
+    "    2. Find closest candidate in embedding space\n",
+    "    \"\"\"\n",
+    "    model.eval()\n",
+    "    \n",
+    "    with torch.no_grad():\n",
+    "        # Predict next state embedding\n",
+    "        predicted_emb = model([state], [action])  # [1, state_dim]\n",
+    "        \n",
+    "        # Encode all candidates\n",
+    "        candidate_embs = model.encode_state(candidates)  # [N, state_dim]\n",
+    "        \n",
+    "        # Compute similarities\n",
+    "        sims = F.cosine_similarity(\n",
+    "            predicted_emb.expand(len(candidates), -1),\n",
+    "            candidate_embs\n",
+    "        )\n",
+    "        \n",
+    "        best_idx = sims.argmax().item()\n",
+    "        \n",
+    "    return {\n",
+    "        'prediction': candidates[best_idx],\n",
+    "        # Cosine similarity of the best match, in [-1, 1]; not a calibrated probability\n",
+    "        'confidence': sims[best_idx].item(),\n",
+    "        'all_scores': {c: sims[i].item() for i, c in enumerate(candidates)}\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "# Test cases\n",
+    "test_cases = [\n",
+    "    {\n",
+    "        'state': \"Document is empty\",\n",
+    "        'action': \"create introduction section\",\n",
+    "        'candidates': [\n",
+    "            \"Document has introduction section\",\n",
+    "            \"Document deleted\",\n",
+    "            \"Document has 500 words\",\n",
+    "            \"Cart has 1 item\"\n",
+    "        ],\n",
+    "        'expected': \"Document has introduction section\"\n",
+    "    },\n",
+    "    {\n",
+    "        'state': \"Cart has 2 items totaling $80\",\n",
+    "        'action': \"apply 10% discount\",\n",
+    "        'candidates': [\n",
+    "            \"Cart has 2 items totaling $72\",\n",
+    "            \"Cart is empty\",\n",
+    "            \"Cart has 3 items totaling $100\",\n",
+    "            \"Order placed\"\n",
+    "        ],\n",
+    "        'expected': \"Cart has 2 items totaling $72\"\n",
+    "    },\n",
+    "    {\n",
+    "        'state': \"Project has 5 pending tasks\",\n",
+    "        'action': \"start task 1\",\n",
+    "        'candidates': [\n",
+    "            \"Project has 1 active and 4 pending tasks\",\n",
+    "            \"Project has 5 done tasks\",\n",
+    "            \"Project has no tasks\",\n",
+    "            \"Document approved\"\n",
+    "        ],\n",
+    "        'expected': \"Project has 1 active and 4 pending tasks\"\n",
+    "    }\n",
+    "]\n",
+    "\n",
+    "print(\"=\"*80)\n",
+    "print(\"WORLD MODEL PREDICTIONS\")\n",
+    "print(\"=\"*80)\n",
+    "\n",
+    "correct = 0\n",
+    "for i, test in enumerate(test_cases):\n",
+    "    result = predict_outcome(model, test['state'], test['action'], test['candidates'])\n",
+    "    is_correct = result['prediction'] == test['expected']\n",
+    "    correct += is_correct\n",
+    "    \n",
+    "    print(f\"\\nTest {i+1}: {'✓' if is_correct else '✗'}\")\n",
+    "    print(f\"  State: {test['state']}\")\n",
+    "    print(f\"  Action: {test['action']}\")\n",
+    "    print(f\"  Predicted: {result['prediction']}\")\n",
+    "    print(f\"  Expected: {test['expected']}\")\n",
+    "    print(f\"  Confidence: {result['confidence']:.4f}\")\n",
+    "\n",
+    "print(f\"\\nAccuracy: {correct}/{len(test_cases)}\")"
+   ]
+  },
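+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same embedding-space comparison used in `predict_outcome` can be turned around for one-step planning: instead of scoring candidate outcomes for a fixed action, score candidate actions by how close their predicted next state lands to a goal state.\n",
+    "\n",
+    "The cell below is a minimal sketch, not part of the original pipeline. `plan_one_step` and the example actions are hypothetical, and it assumes the trained `model` from Section 4 with the `model(states, actions)` and `model.encode_state(texts)` interfaces used above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def plan_one_step(model, state: str, goal: str, candidate_actions: list) -> dict:\n",
+    "    \"\"\"Hypothetical helper: rank actions by similarity of predicted next state to a goal.\"\"\"\n",
+    "    model.eval()\n",
+    "    \n",
+    "    with torch.no_grad():\n",
+    "        # Embed the desired end state once\n",
+    "        goal_emb = model.encode_state([goal])  # [1, state_dim]\n",
+    "        \n",
+    "        # Predict the next state for every candidate action from the same start state\n",
+    "        predicted = model([state] * len(candidate_actions), candidate_actions)  # [N, state_dim]\n",
+    "        \n",
+    "        # Score each action by how close its predicted outcome is to the goal\n",
+    "        sims = F.cosine_similarity(predicted, goal_emb.expand_as(predicted), dim=-1)  # [N]\n",
+    "        \n",
+    "        best_idx = sims.argmax().item()\n",
+    "    \n",
+    "    return {\n",
+    "        'action': candidate_actions[best_idx],\n",
+    "        'score': sims[best_idx].item(),\n",
+    "        'all_scores': {a: sims[i].item() for i, a in enumerate(candidate_actions)}\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "# Example usage; the action strings are illustrative only\n",
+    "plan = plan_one_step(\n",
+    "    model,\n",
+    "    state=\"Cart has 2 items totaling $80\",\n",
+    "    goal=\"Cart has 2 items totaling $72\",\n",
+    "    candidate_actions=[\"apply 10% discount\", \"remove all items\", \"place order\"]\n",
+    ")\n",
+    "print(f\"Chosen action: {plan['action']} (score {plan['score']:.4f})\")"
+   ]
+  },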
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Visualize State Embedding Space"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.manifold import TSNE\n",
+    "\n",
+    "# Encode a variety of states\n",
+    "states_to_visualize = [\n",
+    "    # Document states\n",
+    "    \"Document is empty\",\n",
+    "    \"Document has introduction section\",\n",
+    "    \"Document has introduction with 500 words\",\n",
+    "    \"Document pending review\",\n",
+    "    \"Document approved\",\n",
+    "    # Cart states\n",
+    "    \"Cart empty\",\n",
+    "    \"Cart has 1 item totaling $50\",\n",
+    "    \"Cart has 2 items totaling $80\",\n",
+    "    \"Order placed for $72\",\n",
+    "    # Project states\n",
+    "    \"Project has no tasks\",\n",
+    "    \"Project has 5 pending tasks\",\n",
+    "    \"Project complete with 5 done tasks\",\n",
+    "]\n",
+    "\n",
+    "categories = ['doc']*5 + ['cart']*4 + ['project']*3\n",
+    "\n",
+    "model.eval()\n",
+    "with torch.no_grad():\n",
+    "    embeddings = model.encode_state(states_to_visualize).cpu().numpy()\n",
+    "\n",
+    "# t-SNE\n",
+    "tsne = TSNE(n_components=2, perplexity=5, random_state=42)\n",
+    "emb_2d = tsne.fit_transform(embeddings)\n",
+    "\n",
+    "# Plot\n",
+    "plt.figure(figsize=(10, 8))\n",
+    "colors = {'doc': 'blue', 'cart': 'green', 'project': 'red'}\n",
+    "\n",
+    "for i, (x, y) in enumerate(emb_2d):\n",
+    "    plt.scatter(x, y, c=colors[categories[i]], s=100)\n",
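+    "    label = states_to_visualize[i]\n",
+    "    # Append an ellipsis only when the label is actually truncated\n",
+    "    plt.annotate(label if len(label) <= 30 else label[:30] + '...', (x, y), fontsize=8)\n",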
+    "\n",
+    "plt.title(\"State Embedding Space (t-SNE)\")\n",
+    "plt.xlabel(\"Dimension 1\")\n",
+    "plt.ylabel(\"Dimension 2\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Save Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save only the trained components (not the frozen LLM)\n",
+    "torch.save({\n",
+    "    'state_encoder': model.state_encoder.state_dict(),\n",
+    "    'action_encoder': model.action_encoder.state_dict(),\n",
+    "    'dynamics_predictor': model.dynamics_predictor.state_dict(),\n",
+    "    'config': {\n",
+    "        'model_name': 'gpt2',\n",
+    "        'state_dim': 256\n",
+    "    }\n",
+    "}, 'jepa_world_model_option2.pt')\n",
+    "\n",
+    "print(\"Model saved!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "**Option 2 Advantages over Option 1:**\n",
+    "- Single model serves as both encoder and predictor backbone\n",
+    "- Shared representation space between states and predictions\n",
+    "- The LLM can be unfrozen (`freeze_llm=False`) and fine-tuned, which may improve results at a higher compute cost\n",
+    "\n",
+    "**Key Implementation Details:**\n",
+    "1. LLM hidden states → mean pooled → state encoder → state embedding\n",
+    "2. State + Action embeddings → dynamics predictor → next state embedding\n",
+    "3. Loss: MSE + 0.5 × (1 − cosine similarity), computed in embedding space\n",
+    "\n",
+    "**This is JEPA because:**\n",
+    "- We predict embeddings, not tokens\n",
+    "- The model learns state dynamics, not text generation\n",
+    "- Planning = finding actions that lead to desired state embeddings"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}