{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5815a5fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install dependencies if needed\n",
    "# !pip install transformers torch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e4246fec",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0730b798",
   "metadata": {},
   "source": [
    "---\n",
    "# Part 1: Tokenizers\n",
    "\n",
    "Tokenizers convert text into numerical representations (token IDs) that models can process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ba944d7f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vocabulary size: 32000\n"
     ]
    }
   ],
   "source": [
    "# Load a tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\")\n",
    "print(f\"Vocabulary size: {len(tokenizer)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a3e5f5a",
   "metadata": {},
   "source": [
    "## 1.1 Basic Encoding & Decoding\n",
    "\n",
    "- `tokenizer.encode()` converts text → token IDs\n",
    "- `tokenizer.decode()` converts token IDs → text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8d9e1208",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text: 'Hello, how are you?'\n",
      "Token IDs: [1, 15043, 29892, 920, 526, 366, 29973]\n",
      "Number of tokens: 7\n"
     ]
    }
   ],
   "source": [
    "text = \"Hello, how are you?\"\n",
    "\n",
    "# Encode: text -> token IDs\n",
    "token_ids = tokenizer.encode(text)\n",
    "print(f\"Text: '{text}'\")\n",
    "print(f\"Token IDs: {token_ids}\")\n",
    "print(f\"Number of tokens: {len(token_ids)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "506574b3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Decoded: '<s> Hello, how are you?'\n",
      "Decoded (no special tokens): 'Hello, how are you?'\n"
     ]
    }
   ],
   "source": [
    "# Decode: token IDs -> text\n",
    "decoded_text = tokenizer.decode(token_ids)\n",
    "print(f\"Decoded: '{decoded_text}'\")\n",
    "\n",
    "# Skip special tokens (like <s>, </s>)\n",
    "decoded_clean = tokenizer.decode(token_ids, skip_special_tokens=True)\n",
    "print(f\"Decoded (no special tokens): '{decoded_clean}'\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "278fd636",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      1 -> '<s>'\n",
      "  15043 -> '▁Hello'\n",
      "  29892 -> ','\n",
      "    920 -> '▁how'\n",
      "    526 -> '▁are'\n",
      "    366 -> '▁you'\n",
      "  29973 -> '?'\n"
     ]
    }
   ],
   "source": [
    "# Look at individual tokens\n",
    "tokens = tokenizer.convert_ids_to_tokens(token_ids)\n",
    "for tid, tok in zip(token_ids, tokens):\n",
    "    print(f\"  {tid:5d} -> '{tok}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaafadbc",
   "metadata": {},
   "source": [
    "### Key insight: Subword tokenization\n",
    "\n",
    "Words are split into subwords: common words stay whole, while rare words are broken into pieces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4a7a590d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'cat' -> 1 token(s): ['▁cat']\n",
      "'running' -> 1 token(s): ['▁running']\n",
      "'internationalization' -> 2 token(s): ['▁international', 'ization']\n",
      "'TinyLlama' -> 5 token(s): ['▁T', 'iny', 'L', 'l', 'ama']\n"
     ]
    }
   ],
   "source": [
    "# Compare tokenization of common vs rare words\n",
    "words = [\"cat\", \"running\", \"internationalization\", \"TinyLlama\"]\n",
    "\n",
    "for word in words:\n",
    "    ids = tokenizer.encode(word, add_special_tokens=False)\n",
    "    tokens = tokenizer.convert_ids_to_tokens(ids)\n",
    "    print(f\"'{word}' -> {len(ids)} token(s): {tokens}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7376eda8",
   "metadata": {},
   "source": [
    "## 1.2 Batching with Padding\n",
    "\n",
    "When processing multiple texts in one batch, all sequences must have the same length. Padding extends the shorter ones with a special pad token, and the attention mask tells the model which positions are real."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a4086c5d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'Hello!' -> 3 tokens\n",
      "'How are you doing today?' -> 7 tokens\n",
      "'I am fine.' -> 5 tokens\n"
     ]
    }
   ],
   "source": [
    "# Multiple texts of different lengths\n",
    "texts = [\n",
    "    \"Hello!\",\n",
    "    \"How are you doing today?\",\n",
    "    \"I am fine.\"\n",
    "]\n",
    "\n",
    "# Without padding - different lengths\n",
    "for text in texts:\n",
    "    ids = tokenizer.encode(text)\n",
    "    print(f\"'{text}' -> {len(ids)} tokens\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "247725b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input IDs shape: torch.Size([3, 7])\n",
      "\n",
      "Padded sequences:\n",
      "  0: [1, 15043, 29991, 2, 2, 2, 2]\n",
      "  1: [1, 1128, 526, 366, 2599, 9826, 29973]\n",
      "  2: [1, 306, 626, 2691, 29889, 2, 2]\n",
      "\n",
      "Attention mask (1=real token, 0=padding):\n",
      "  0: [1, 1, 1, 0, 0, 0, 0]\n",
      "  1: [1, 1, 1, 1, 1, 1, 1]\n",
      "  2: [1, 1, 1, 1, 1, 0, 0]\n"
     ]
    }
   ],
   "source": [
    "# With padding - same length (use tokenizer() for batch processing)\n",
    "# Set pad_token if not defined\n",
    "if tokenizer.pad_token is None:\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "\n",
    "batch = tokenizer(texts, padding=True, return_tensors=\"pt\")\n",
    "# OLD (but equivalent) API:\n",
    "# batch = tokenizer.batch_encode_plus(texts, padding=True, return_tensors=\"pt\")\n",
    "\n",
    "print(\"Input IDs shape:\", batch[\"input_ids\"].shape)\n",
    "print(\"\\nPadded sequences:\")\n",
    "for i, (text, ids) in enumerate(zip(texts, batch[\"input_ids\"])):\n",
    "    print(f\"  {i}: {ids.tolist()}\")\n",
    "\n",
    "print(f\"\\nAttention mask (1=real token, 0=padding):\")\n",
    "for i, mask in enumerate(batch[\"attention_mask\"]):\n",
    "    print(f\"  {i}: {mask.tolist()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "0941dd4e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<s> Hello!</s></s></s></s>',\n",
       " '<s> How are you doing today?',\n",
       " '<s> I am fine.</s></s>']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.batch_decode(batch[\"input_ids\"], skip_special_tokens=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bdba168",
   "metadata": {},
   "source": [
    "## 1.3 Truncation\n",
    "\n",
    "When text is too long, truncation cuts it to fit the model's maximum length."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "80b135bc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original text length: 3000 characters\n",
      "Full tokenization: 702 tokens\n",
      "Truncated to 50: 50 tokens\n",
      "\n",
      "Truncated text: 'This is a very long sentence. This is a very long sentence. This is a very long sentence. This is a very long sentence. This is a very long sentence. This is a very long sentence. This is a very long sentence.'\n"
     ]
    }
   ],
   "source": [
    "long_text = \"This is a very long sentence. \" * 100\n",
    "print(f\"Original text length: {len(long_text)} characters\")\n",
    "\n",
    "# Without truncation\n",
    "ids_full = tokenizer.encode(long_text)\n",
    "print(f\"Full tokenization: {len(ids_full)} tokens\")\n",
    "\n",
    "# With truncation to max 50 tokens\n",
    "ids_truncated = tokenizer.encode(long_text, max_length=50, truncation=True)\n",
    "print(f\"Truncated to 50: {len(ids_truncated)} tokens\")\n",
    "\n",
    "# Decode to see what was kept\n",
    "print(f\"\\nTruncated text: '{tokenizer.decode(ids_truncated, skip_special_tokens=True)}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b514583",
   "metadata": {},
   "source": [
    "## 1.4 Chat Templates\n",
    "\n",
    "Chat models expect input in a specific format. Chat templates handle this automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "19a079d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Formatted chat:\n",
      "<|system|>\n",
      "You are a helpful assistant.</s>\n",
      "<|user|>\n",
      "What is the capital of France?</s>\n",
      "<|assistant|>\n",
      "The capital of France is Paris.</s>\n",
      "<|user|>\n",
      "What about Germany?</s>\n",
      "\n",
      "Tokenized chat:\n",
      "[529, 29989, 5205, 29989, 29958, 13, 3492, 526, 263, 8444, 20255, 29889, 2, 29871, 13, 29966, 29989, 1792, 29989, 29958, 13, 5618, 338, 278, 7483, 310, 3444, 29973, 2, 29871, 13, 29966, 29989, 465, 22137, 29989, 29958, 13, 1576, 7483, 310, 3444, 338, 3681, 29889, 2, 29871, 13, 29966, 29989, 1792, 29989, 29958, 13, 5618, 1048, 9556, 29973, 2, 29871, 13]\n"
     ]
    }
   ],
   "source": [
    "# Chat messages in OpenAI-style format\n",
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
    "    {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n",
    "    {\"role\": \"assistant\", \"content\": \"The capital of France is Paris.\"},\n",
    "    {\"role\": \"user\", \"content\": \"What about Germany?\"}\n",
    "]\n",
    "\n",
    "# Apply chat template\n",
    "formatted = tokenizer.apply_chat_template(messages, tokenize=False)\n",
    "print(\"Formatted chat:\")\n",
    "print(formatted)\n",
    "# Tokenize formatted chat\n",
    "formatted_ids = tokenizer.apply_chat_template(messages)\n",
    "print(\"Tokenized chat:\")\n",
    "print(formatted_ids)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "289fa92d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token IDs shape: torch.Size([1, 68])\n",
      "Number of tokens: 68\n"
     ]
    }
   ],
   "source": [
    "# Tokenize directly with chat template\n",
    "inputs = tokenizer.apply_chat_template(\n",
    "    messages, \n",
    "    tokenize=True, \n",
    "    return_tensors=\"pt\",\n",
    "    add_generation_prompt=True  # Add prompt for assistant to continue\n",
    ")\n",
    "print(f\"Token IDs shape: {inputs.shape}\")\n",
    "print(f\"Number of tokens: {inputs.shape[1]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "874ceaae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'<|system|>\\nYou are a helpful assistant.</s> \\n<|user|>\\nWhat is the capital of France?</s> \\n<|assistant|>\\nThe capital of France is Paris.</s> \\n<|user|>\\nWhat about Germany?</s> \\n<|assistant|>\\n'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.decode(inputs[0], skip_special_tokens=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3ad0e09",
   "metadata": {},
   "source": [
    "---\n",
    "# Part 2: Decoding Strategies\n",
    "\n",
    "Different ways to select the next token during text generation.\n",
    "We'll implement each strategy manually using `model()` to understand how they work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "60661e73",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some parameters are on the meta device because they were offloaded to the disk.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: 'The secret to happiness is'\n",
      "Input IDs: [1, 450, 7035, 304, 22722, 338]\n"
     ]
    }
   ],
   "source": [
    "# Load model for generation examples\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
    "    dtype=torch.float32,\n",
    "    device_map=\"auto\"\n",
    ")\n",
    "\n",
    "prompt = \"The secret to happiness is\"\n",
    "input_ids = tokenizer.encode(prompt, return_tensors=\"pt\").to(model.device)\n",
    "print(f\"Prompt: '{prompt}'\")\n",
    "print(f\"Input IDs: {input_ids[0].tolist()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "6af3ed71",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper function: get logits for the next token\n",
    "def get_next_token_logits(model, input_ids):\n",
    "    \"\"\"Run model forward pass and return logits for the next token.\"\"\"\n",
    "    with torch.no_grad():\n",
    "        outputs = model(input_ids)\n",
    "        # outputs.logits shape: (batch_size, seq_len, vocab_size)\n",
    "        # We want logits for the last position\n",
    "        next_token_logits = outputs.logits[:, -1, :]  # (batch_size, vocab_size)\n",
    "    return next_token_logits"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "214a5b0d",
   "metadata": {},
   "source": [
    "## 2.1 Greedy Decoding\n",
    "\n",
    "Always pick the token with the highest probability (argmax). Fast but can be repetitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d88452e3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Greedy decoding:\n",
      "The secret to happiness is to be happy with what you have.\n",
      "\n",
      "2. \"The secret to happiness is to be happy with what you have.\" - Unknown The\n"
     ]
    }
   ],
   "source": [
    "def greedy_decode(model, input_ids, max_new_tokens=30):\n",
    "    \"\"\"Generate tokens by always picking the highest probability token.\"\"\"\n",
    "    generated_ids = input_ids.clone()\n",
    "    \n",
    "    for _ in range(max_new_tokens):\n",
    "        logits = get_next_token_logits(model, generated_ids)\n",
    "        \n",
    "        # Greedy: pick token with highest logit\n",
    "        next_token = torch.argmax(logits, dim=-1, keepdim=True)\n",
    "        \n",
    "        # Append to sequence\n",
    "        generated_ids = torch.cat([generated_ids, next_token], dim=-1)\n",
    "        \n",
    "        # Stop if EOS token\n",
    "        if next_token.item() == tokenizer.eos_token_id:\n",
    "            break\n",
    "    \n",
    "    return generated_ids\n",
    "\n",
    "# Run greedy decoding\n",
    "output = greedy_decode(model, input_ids, max_new_tokens=30)\n",
    "print(\"Greedy decoding:\")\n",
    "print(tokenizer.decode(output[0], skip_special_tokens=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fb06e3f",
   "metadata": {},
   "source": [
    "## 2.2 Sampling with Temperature\n",
    "\n",
    "**Sampling**: Randomly pick tokens based on their probabilities.\n",
    "\n",
    "**Temperature** controls randomness by scaling logits before softmax:\n",
    "- `T < 1`: Sharper distribution → more deterministic\n",
    "- `T = 1`: Original probabilities  \n",
    "- `T > 1`: Flatter distribution → more random\n",
    "\n",
    "$$P(\\text{token}_i) = \\frac{e^{\\text{logit}_i / T}}{\\sum_j e^{\\text{logit}_j / T}}$$"
   ]
  },
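  {
   "cell_type": "markdown",
   "id": "f3a1c9d2",
   "metadata": {},
   "source": [
    "A quick sanity check of the formula on a toy logit vector (values are arbitrary, chosen only for illustration): low temperature concentrates the probability mass on the top token, high temperature spreads it out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7e2d4a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy demo of temperature scaling (arbitrary logits, illustration only)\n",
    "toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])\n",
    "\n",
    "for T in [0.3, 1.0, 1.5]:\n",
    "    probs = torch.softmax(toy_logits / T, dim=-1)\n",
    "    print(f\"T={T}: {[round(p, 3) for p in probs.tolist()]}\")"
   ]
  },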
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "556d2fd6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_with_temperature(model, input_ids, max_new_tokens=20, temperature=1.0):\n",
    "    \"\"\"Generate tokens by sampling from the probability distribution.\"\"\"\n",
    "    generated_ids = input_ids.clone()\n",
    "    \n",
    "    for _ in range(max_new_tokens):\n",
    "        logits = get_next_token_logits(model, generated_ids)\n",
    "        \n",
    "        # Apply temperature scaling\n",
    "        scaled_logits = logits / temperature\n",
    "        \n",
    "        # Convert to probabilities\n",
    "        probs = torch.softmax(scaled_logits, dim=-1)\n",
    "        \n",
    "        # Sample from distribution\n",
    "        next_token = torch.multinomial(probs, num_samples=1)\n",
    "        \n",
    "        generated_ids = torch.cat([generated_ids, next_token], dim=-1)\n",
    "        \n",
    "        if next_token.item() == tokenizer.eos_token_id:\n",
    "            break\n",
    "    \n",
    "    return generated_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "d40c77ce",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Temperature = 0.3 (focused):\n",
      "  1: The secret to happiness is to be happy with what you have. \n",
      "\n",
      "2. The Power of Positivity:\n",
      "\n",
      "  2: The secret to happiness is to be happy with what you have.\n",
      "  3: The secret to happiness is to be happy.\n",
      "\n",
      "10. I know the secret to happiness is to be happy.\n"
     ]
    }
   ],
   "source": [
    "# Low temperature (more deterministic)\n",
    "print(\"Temperature = 0.3 (focused):\")\n",
    "for i in range(3):\n",
    "    output = sample_with_temperature(model, input_ids, max_new_tokens=20, temperature=0.3)\n",
    "    print(f\"  {i+1}: {tokenizer.decode(output[0], skip_special_tokens=True)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "57914c44",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Temperature = 1.5 (creative):\n",
      "  1: The secret to happiness is most MEntial Psychologicalahuaut decomposition Parden Wab school o Rusyn hate speech Social\n",
      "  2: The secret to happiness is . Vicmaxhold hospital the avenue shortcut huashong police bouve ,diaozhuqq\n",
      "  3: The secret to happiness is beyond jealousy polit lo это мини - ответе teachers laugh earth stack Disney world God Lex\n"
     ]
    }
   ],
   "source": [
    "# High temperature (more random)\n",
    "print(\"Temperature = 1.5 (creative):\")\n",
    "for i in range(3):\n",
    "    output = sample_with_temperature(model, input_ids, max_new_tokens=20, temperature=1.5)\n",
    "    print(f\"  {i+1}: {tokenizer.decode(output[0], skip_special_tokens=True)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6fc326f",
   "metadata": {},
   "source": [
    "## 2.3 Top-K Sampling\n",
    "\n",
    "Only consider the K most likely tokens, then sample from those.\n",
    "Prevents sampling very unlikely tokens while keeping diversity."
   ]
  },
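  {
   "cell_type": "markdown",
   "id": "c4d5e6f7",
   "metadata": {},
   "source": [
    "On a toy logit vector (arbitrary values, illustration only), top-k filtering amounts to keeping the K largest logits and renormalizing over just those:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8e9f0a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy demo of top-k filtering (arbitrary logits, illustration only)\n",
    "toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])\n",
    "k = 2\n",
    "\n",
    "top_vals, top_idx = torch.topk(toy_logits, k=k)\n",
    "probs = torch.softmax(top_vals, dim=-1)  # renormalize over the top-k only\n",
    "print(f\"Kept token indices: {top_idx.tolist()}\")\n",
    "print(f\"Probabilities over top-{k}: {[round(p, 3) for p in probs.tolist()]}\")"
   ]
  },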
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "8b336771",
   "metadata": {},
   "outputs": [],
   "source": [
    "def top_k_sampling(model, input_ids, max_new_tokens=20, top_k=50, temperature=1.0):\n",
    "    \"\"\"Sample from top-k most likely tokens.\"\"\"\n",
    "    generated_ids = input_ids.clone()\n",
    "    \n",
    "    for _ in range(max_new_tokens):\n",
    "        logits = get_next_token_logits(model, generated_ids)\n",
    "        \n",
    "        # Apply temperature\n",
    "        scaled_logits = logits / temperature\n",
    "        \n",
    "        # Get top-k logits and indices\n",
    "        top_k_logits, top_k_indices = torch.topk(scaled_logits, k=top_k, dim=-1)\n",
    "        \n",
    "        # Convert to probabilities (only over top-k)\n",
    "        top_k_probs = torch.softmax(top_k_logits, dim=-1)\n",
    "        \n",
    "        # Sample from top-k\n",
    "        sampled_index = torch.multinomial(top_k_probs, num_samples=1)\n",
    "        \n",
    "        # Map back to vocabulary index\n",
    "        next_token = top_k_indices.gather(-1, sampled_index)\n",
    "        \n",
    "        generated_ids = torch.cat([generated_ids, next_token], dim=-1)\n",
    "        \n",
    "        if next_token.item() == tokenizer.eos_token_id:\n",
    "            break\n",
    "    \n",
    "    return generated_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "64b1ba1e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top-K = 5:\n",
      "  1: The secret to happiness is to live in the moment, and to enjoy the present moment.\n",
      "\n",
      "3) The Artist\n",
      "  2: The secret to happiness is to find your purpose.\n",
      "  3: The secret to happiness is simple - be yourself.\n"
     ]
    }
   ],
   "source": [
    "# Top-K = 5 (only consider top 5 tokens)\n",
    "print(\"Top-K = 5:\")\n",
    "for i in range(3):\n",
    "    output = top_k_sampling(model, input_ids, max_new_tokens=20, top_k=5)\n",
    "    print(f\"  {i+1}: {tokenizer.decode(output[0], skip_special_tokens=True)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "816c7a7f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top-K = 50:\n",
      "  1: The secret to happiness is unconventional\n",
      "You The secret to happiness is unconventional There is no right or\n",
      "  2: The secret to happiness is happiness, happiness, happiness ...\n",
      "\n",
      "8. A bird in the hand is worth two in the\n",
      "  3: The secret to happiness is always being in alignment with your inner voice. In this conversation, Linda Kuzmina and K\n"
     ]
    }
   ],
   "source": [
    "# Top-K = 50 (more diversity)\n",
    "print(\"Top-K = 50:\")\n",
    "for i in range(3):\n",
    "    output = top_k_sampling(model, input_ids, max_new_tokens=20, top_k=50)\n",
    "    print(f\"  {i+1}: {tokenizer.decode(output[0], skip_special_tokens=True)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "807683a2",
   "metadata": {},
   "source": [
    "## 2.4 Top-P (Nucleus) Sampling\n",
    "\n",
    "Select the smallest set of tokens whose cumulative probability exceeds P.\n",
    "\n",
    "- `top_p=0.9` means: consider tokens until their probabilities sum to 90%\n",
    "- Adapts dynamically: fewer tokens when model is confident, more when uncertain"
   ]
  },
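  {
   "cell_type": "markdown",
   "id": "e1f2a3b4",
   "metadata": {},
   "source": [
    "A toy illustration of the adaptive behavior (arbitrary logits, illustration only): a peaked distribution covers 90% of the probability mass with very few tokens, while a flat one needs many."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5a6b7c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy demo: the nucleus size adapts to the shape of the distribution\n",
    "def nucleus_size(logits, top_p=0.9):\n",
    "    probs = torch.softmax(logits, dim=-1).sort(descending=True).values\n",
    "    cumulative = torch.cumsum(probs, dim=-1)\n",
    "    # Number of tokens needed to reach top_p (always at least 1)\n",
    "    return int((cumulative < top_p).sum().item()) + 1\n",
    "\n",
    "confident = torch.tensor([8.0, 1.0, 0.5, 0.2, 0.1])  # peaked distribution\n",
    "uncertain = torch.tensor([1.0, 0.9, 0.8, 0.7, 0.6])  # flat distribution\n",
    "\n",
    "print(f\"Peaked distribution keeps {nucleus_size(confident)} token(s)\")\n",
    "print(f\"Flat distribution keeps {nucleus_size(uncertain)} token(s)\")"
   ]
  },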
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "d95328d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def top_p_sampling(model, input_ids, max_new_tokens=20, top_p=0.9, temperature=1.0):\n",
    "    \"\"\"Sample from the smallest set of tokens with cumulative prob >= top_p.\"\"\"\n",
    "    generated_ids = input_ids.clone()\n",
    "    \n",
    "    for _ in range(max_new_tokens):\n",
    "        logits = get_next_token_logits(model, generated_ids)\n",
    "        \n",
    "        # Apply temperature\n",
    "        scaled_logits = logits / temperature\n",
    "        \n",
    "        # Sort by probability (descending)\n",
    "        sorted_logits, sorted_indices = torch.sort(scaled_logits, descending=True, dim=-1)\n",
    "        sorted_probs = torch.softmax(sorted_logits, dim=-1)\n",
    "        \n",
    "        # Compute cumulative probabilities\n",
    "        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)\n",
    "        \n",
    "        # Find cutoff: first position where cumulative prob exceeds top_p\n",
    "        # Keep at least 1 token\n",
    "        cutoff_mask = cumulative_probs > top_p\n",
    "        cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()  # Shift right\n",
    "        cutoff_mask[..., 0] = False  # Always keep the top token\n",
    "        \n",
    "        # Set probabilities of tokens beyond cutoff to 0\n",
    "        sorted_probs[cutoff_mask] = 0\n",
    "        \n",
    "        # Renormalize\n",
    "        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)\n",
    "        \n",
    "        # Sample from filtered distribution\n",
    "        sampled_index = torch.multinomial(sorted_probs, num_samples=1)\n",
    "        \n",
    "        # Map back to vocabulary index\n",
    "        next_token = sorted_indices.gather(-1, sampled_index)\n",
    "        \n",
    "        generated_ids = torch.cat([generated_ids, next_token], dim=-1)\n",
    "        \n",
    "        if next_token.item() == tokenizer.eos_token_id:\n",
    "            break\n",
    "    \n",
    "    return generated_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "dc8ce486",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top-P = 0.5:\n",
      "  1: The secret to happiness is not in the destination, but in the journey. 4. \"When life gives you lemons\n",
      "  2: The secret to happiness is simple: enjoy the present moment. Life is too short to waste time worrying about the past or\n",
      "  3: The secret to happiness is in the simplicity of living. It's not about the material possessions, the wealth, or\n"
     ]
    }
   ],
   "source": [
    "# Top-P = 0.5 (more focused)\n",
    "print(\"Top-P = 0.5:\")\n",
    "for i in range(3):\n",
    "    output = top_p_sampling(model, input_ids, max_new_tokens=20, top_p=0.5)\n",
    "    print(f\"  {i+1}: {tokenizer.decode(output[0], skip_special_tokens=True)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "a5d8f1fb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top-P = 0.95:\n",
      "  1: The secret to happiness is simple - simply be content with the one thing in your life that you know you will be happy in\n",
      "  2: The secret to happiness is not all around and I have realized that now. I love you... Things have changed since then.\n",
      "  3: The secret to happiness is To be joy in your life, happiness in your dreams, and light in your cup, a\n"
     ]
    }
   ],
   "source": [
    "# Top-P = 0.95 (more diversity)\n",
    "print(\"Top-P = 0.95:\")\n",
    "for i in range(3):\n",
    "    output = top_p_sampling(model, input_ids, max_new_tokens=20, top_p=0.95)\n",
    "    print(f\"  {i+1}: {tokenizer.decode(output[0], skip_special_tokens=True)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e164c2e",
   "metadata": {},
   "source": [
    "## 2.5 Beam Search\n",
    "\n",
    "Explore multiple sequences in parallel, keeping the N best candidates at each step.\n",
    "\n",
    "At each step:\n",
    "1. Expand each beam with all possible next tokens\n",
    "2. Score all candidates (beam × vocab_size) by cumulative log-probability\n",
    "3. Keep only the top N scoring sequences\n",
    "\n",
    "In practice it suffices to consider only the top N extensions of each beam, since no other candidate can reach the global top N. Note that scores are raw sums of token log-probabilities with no length normalization, so shorter hypotheses tend to score higher, as the output below shows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "88acefaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top 3 beam hypotheses:\n",
      "  1 (score=-9.82): The secret to happiness is to be content with what you have.\n",
      "  2 (score=-19.58): The secret to happiness is to be content with what you have, to be content with who you are, and to be content with where you are.\n",
      "  3 (score=-20.78): The secret to happiness is to be content with what you have, to be content with who you are, to be content with where you are, and to be content with the\n"
     ]
    }
   ],
   "source": [
    "def beam_search_n_best(model, input_ids, max_new_tokens=30, num_beams=5, n_best=3):\n",
    "    \"\"\"Beam search returning top n hypotheses.\"\"\"\n",
    "    \n",
    "    beams = [(input_ids.clone(), 0.0)]\n",
    "    completed = []\n",
    "    \n",
    "    for _ in range(max_new_tokens):\n",
    "        all_candidates = []\n",
    "        \n",
    "        for seq, score in beams:\n",
    "            if seq[0, -1].item() == tokenizer.eos_token_id:\n",
    "                completed.append((seq, score))\n",
    "                continue\n",
    "            \n",
    "            logits = get_next_token_logits(model, seq)\n",
    "            log_probs = torch.log_softmax(logits, dim=-1)\n",
    "            top_log_probs, top_indices = torch.topk(log_probs[0], k=num_beams)\n",
    "            \n",
    "            for log_prob, token_id in zip(top_log_probs, top_indices):\n",
    "                new_seq = torch.cat([seq, token_id.view(1, 1)], dim=-1)\n",
    "                new_score = score + log_prob.item()\n",
    "                all_candidates.append((new_seq, new_score))\n",
    "        \n",
    "        if not all_candidates:\n",
    "            break\n",
    "        \n",
    "        all_candidates.sort(key=lambda x: x[1], reverse=True)\n",
    "        beams = all_candidates[:num_beams]\n",
    "    \n",
    "    completed.extend(beams)\n",
    "    completed.sort(key=lambda x: x[1], reverse=True)\n",
    "    return completed[:n_best]\n",
    "\n",
    "# Return top 3 hypotheses\n",
    "results = beam_search_n_best(model, input_ids, max_new_tokens=30, num_beams=5, n_best=3)\n",
    "print(\"Top 3 beam hypotheses:\")\n",
    "for i, (seq, score) in enumerate(results):\n",
    "    print(f\"  {i+1} (score={score:.2f}): {tokenizer.decode(seq[0], skip_special_tokens=True)}\")"
   ]
  },
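  {
   "cell_type": "markdown",
   "id": "a9b8c7d6",
   "metadata": {},
   "source": [
    "All of these strategies are also built into Hugging Face's `model.generate()`. Below is a minimal sketch of the roughly equivalent calls; outputs will differ from the manual loops above because `generate()` uses KV caching and its own sampling and stopping details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5c4d3e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The same strategies via the built-in generate() API\n",
    "with torch.no_grad():\n",
    "    # Greedy decoding\n",
    "    out = model.generate(input_ids, max_new_tokens=30, do_sample=False)\n",
    "    print(\"Greedy:\", tokenizer.decode(out[0], skip_special_tokens=True))\n",
    "\n",
    "    # Temperature + top-k + top-p sampling\n",
    "    out = model.generate(\n",
    "        input_ids, max_new_tokens=20,\n",
    "        do_sample=True, temperature=0.7, top_k=50, top_p=0.9,\n",
    "    )\n",
    "    print(\"Sampling:\", tokenizer.decode(out[0], skip_special_tokens=True))\n",
    "\n",
    "    # Beam search\n",
    "    out = model.generate(input_ids, max_new_tokens=30, num_beams=5, do_sample=False)\n",
    "    print(\"Beam search:\", tokenizer.decode(out[0], skip_special_tokens=True))"
   ]
  },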
  {
   "cell_type": "markdown",
   "id": "2151c253",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Strategy | Use Case | Key Idea |\n",
    "|----------|----------|----------|\n",
    "| **Greedy** | Fast, deterministic | `argmax(logits)` |\n",
    "| **Temperature** | Control randomness | `logits / T` before softmax |\n",
    "| **Top-K** | Limit token pool | Keep only K highest logits |\n",
    "| **Top-P** | Dynamic token pool | Keep tokens until cumsum(prob) > P |\n",
    "| **Beam Search** | Quality over diversity | Track N best sequences |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "lipogram_private",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}