{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "43fed051",
   "metadata": {},
   "source": [
    "# Code Generation Assistant \n",
    "\n",
    "**Generate Python code from natural-language descriptions, grounded in CodeSearchNet.**\n",
    "\n",
    "This notebook runs the core vertical slice of the capstone top-to-bottom:\n",
    "\n",
    "1. **Phase 1** - load + clean CodeSearchNet, EDA\n",
    "2. **Phase 3** - embed the corpus + build a FAISS retrieval index\n",
    "3. **Phase 5** - RAG: retrieve similar examples and condition a code LLM\n",
    "4. **Eval** - baseline (no retrieval) vs RAG, scored with CodeBLEU\n",
    "5. **Interactive** - ask it to write code\n",
    "6. **Phase 4 (optional)** - fine-tune CodeT5+ on docstring->code\n",
    "\n",
    "> CodeSearchNet was built for code *search*, so it ships `(docstring, code)` pairs\n",
    "> and **no unit tests**. We treat the docstring summary as the intent and the\n",
    "> function body as the target. Because it is natively a retrieval corpus, RAG is\n",
    "> the most natural architecture here.\n",
    "\n",
    "**Runtime:** set `Runtime -> Change runtime type -> T4 GPU`. No API key required -\n",
    "generation uses a small local model (`Qwen2.5-Coder-1.5B-Instruct`)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06dd65ca",
   "metadata": {},
   "source": [
    "## 0. Setup\n",
    "\n",
    "Installs all dependencies. `codebleu` is optional (it builds tree-sitter parsers); if it\n",
    "fails, the eval falls back to a token-overlap metric so the notebook still runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2733fd6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip -q install datasets transformers accelerate sentence-transformers faiss-cpu pandas matplotlib seaborn\n",
    "# codebleu needs a tree-sitter parser to actually run; install it too (optional - has a fallback)\n",
    "!pip -q install codebleu tree-sitter tree-sitter-python || echo \"codebleu/parser install failed - will use fallback metric\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1125e98a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "print(\"CUDA available:\", torch.cuda.is_available())\n",
    "print(\"Device:\", torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU (generation will be slow)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88a40082",
   "metadata": {},
   "source": [
    "## 1. Config\n",
    "\n",
    "One place for every knob. `max_rows` keeps the Colab run fast - raise it (or set it to\n",
    "`None`) for a fuller run. The language is limited to `python` to start: depth over breadth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "894bfb5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "from typing import Optional, Tuple\n",
    "\n",
    "@dataclass\n",
    "class Config:\n",
    "    # data\n",
    "    candidate_dataset_ids: Tuple[str, ...] = (\n",
    "        \"code-search-net/code_search_net\",  # parquet mirror (most reliable)\n",
    "        \"code_search_net\",                   # canonical (may need older datasets)\n",
    "    )\n",
    "    language: str = \"python\"\n",
    "    max_rows: Optional[int] = 8000  # subset for speed; set None for full split\n",
    "    # cleaning\n",
    "    min_doc_words: int = 3\n",
    "    max_doc_words: int = 120\n",
    "    min_code_chars: int = 20\n",
    "    max_code_tokens: int = 400\n",
    "    doc_blocklist: Tuple[str, ...] = (\"todo\", \"fixme\", \"auto-generated\",\n",
    "                                      \"autogenerated\", \"do not edit\")\n",
    "    # split\n",
    "    seed: int = 42\n",
    "    train: float = 0.8\n",
    "    val: float = 0.1\n",
    "    # models\n",
    "    embed_model: str = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
    "    gen_model: str = \"Qwen/Qwen2.5-Coder-1.5B-Instruct\"\n",
    "    top_k: int = 3\n",
    "\n",
    "CFG = Config()\n",
    "CFG"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4382632d",
   "metadata": {},
   "source": [
    "## 2. Phase 1a - Load CodeSearchNet\n",
    "\n",
    "Tries the parquet mirror first, then the canonical id. If both fail on your\n",
    "`datasets` version, run `!pip install \"datasets<3\"`, restart the runtime, and re-run;\n",
    "alternatively, download the raw release from the CodeSearchNet GitHub repo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "158ab53b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "\n",
    "USE_COLS = {\n",
    "    \"func_documentation_string\": \"docstring\",\n",
    "    \"func_code_string\": \"code\",\n",
    "    \"language\": \"language\",\n",
    "    \"repository_name\": \"repo\",\n",
    "    \"func_code_url\": \"url\",\n",
    "}\n",
    "\n",
    "def load_codesearchnet(cfg):\n",
    "    last_err = None\n",
    "    for ds_id in cfg.candidate_dataset_ids:\n",
    "        try:\n",
    "            print(f\"[load] trying '{ds_id}' ({cfg.language}) ...\")\n",
    "            ds = load_dataset(ds_id, cfg.language, split=\"train\", trust_remote_code=True)\n",
    "            if cfg.max_rows:\n",
    "                ds = ds.select(range(min(cfg.max_rows, len(ds))))\n",
    "            df = ds.to_pandas()\n",
    "            keep = [c for c in USE_COLS if c in df.columns]\n",
    "            df = df[keep].rename(columns=USE_COLS)\n",
    "            for col in USE_COLS.values():\n",
    "                if col not in df.columns:\n",
    "                    df[col] = \"\"\n",
    "            print(f\"[load] OK - {len(df):,} rows from '{ds_id}'\")\n",
    "            return df\n",
    "        except Exception as e:\n",
    "            print(f\"[load] failed: {e}\")\n",
    "            last_err = e\n",
    "    raise RuntimeError(f\"All dataset ids failed. Last error: {last_err}\")\n",
    "\n",
    "raw = load_codesearchnet(CFG)\n",
    "raw.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32dba780",
   "metadata": {},
   "source": [
    "## 3. Phase 1b - Clean & filter\n",
    "\n",
    "CodeSearchNet is noisy. We keep only the **summary first line** of each docstring\n",
    "as the intent (the rest is usually `:param:`/`:return:` boilerplate), then apply\n",
    "quality filters and dedup. The **funnel** logs how many rows remain after each\n",
    "filter - keep it for your write-up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2aa0af5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "WORD_RE = re.compile(r\"\\b\\w+\\b\")\n",
    "\n",
    "def first_line(t):\n",
    "    return t.strip().split(\"\\n\")[0].strip() if isinstance(t, str) else \"\"\n",
    "\n",
    "def word_count(t):\n",
    "    return len(WORD_RE.findall(t)) if isinstance(t, str) else 0\n",
    "\n",
    "def ascii_ratio(t):\n",
    "    if not t:\n",
    "        return 1.0\n",
    "    return sum(1 for ch in t if ord(ch) < 128) / len(t)\n",
    "\n",
    "def approx_tokens(c):\n",
    "    return len(re.findall(r\"\\w+|[^\\s\\w]\", c)) if isinstance(c, str) else 0\n",
    "\n",
    "def clean(df, cfg):\n",
    "    funnel = [(\"raw\", len(df))]\n",
    "    df = df.copy()\n",
    "    df[\"docstring\"] = df[\"docstring\"].map(first_line)\n",
    "    df[\"code\"] = df[\"code\"].fillna(\"\").astype(str)\n",
    "\n",
    "    df = df[(df[\"docstring\"].str.len() > 0) & (df[\"code\"].str.len() > 0)]\n",
    "    funnel.append((\"non_empty\", len(df)))\n",
    "\n",
    "    wc = df[\"docstring\"].map(word_count)\n",
    "    df = df[(wc >= cfg.min_doc_words) & (wc <= cfg.max_doc_words)]\n",
    "    funnel.append((\"doc_word_window\", len(df)))\n",
    "\n",
    "    df = df[df[\"code\"].str.len() >= cfg.min_code_chars]\n",
    "    funnel.append((\"min_code_chars\", len(df)))\n",
    "\n",
    "    df = df[df[\"code\"].map(approx_tokens) <= cfg.max_code_tokens]\n",
    "    funnel.append((\"max_code_tokens\", len(df)))\n",
    "\n",
    "    pat = \"|\".join(re.escape(t) for t in cfg.doc_blocklist)\n",
    "    df = df[~df[\"docstring\"].str.lower().str.contains(pat, regex=True)]\n",
    "    funnel.append((\"doc_blocklist\", len(df)))\n",
    "\n",
    "    df = df[df[\"docstring\"].map(ascii_ratio) >= 0.9]\n",
    "    funnel.append((\"ascii_docs\", len(df)))\n",
    "\n",
    "    df = df.drop_duplicates(subset=[\"code\"]).drop_duplicates(subset=[\"docstring\"])\n",
    "    funnel.append((\"dedup\", len(df)))\n",
    "\n",
    "    funnel_df = pd.DataFrame(funnel, columns=[\"step\", \"rows_remaining\"])\n",
    "    return df.reset_index(drop=True), funnel_df\n",
    "\n",
    "clean_df, funnel = clean(raw, CFG)\n",
    "print(funnel.to_string(index=False))\n",
    "print(\"\\nClean rows:\", len(clean_df))\n",
    "clean_df.head(2)"
   ]
  },
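  {
   "cell_type": "markdown",
   "id": "e1a0c001",
   "metadata": {},
   "source": [
    "A minimal, self-contained sketch of the summary-line rule above, run on a made-up\n",
    "docstring (illustrative text, not a CodeSearchNet row). `summary_line` is a hypothetical\n",
    "name that mirrors `first_line`; it is re-declared so the cell runs on its own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a0c002",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper mirroring first_line(); re-declared so this cell is standalone.\n",
    "def summary_line(doc):\n",
    "    return doc.strip().split(\"\\n\")[0].strip() if isinstance(doc, str) else \"\"\n",
    "\n",
    "# Made-up docstring for illustration.\n",
    "sample_doc = (\n",
    "    \"Read a JSON file and return its contents.\\n\"\n",
    "    \"\\n\"\n",
    "    \":param path: file to read\\n\"\n",
    "    \":return: parsed dict\\n\"\n",
    ")\n",
    "print(summary_line(sample_doc))   # Read a JSON file and return its contents.\n",
    "print(repr(summary_line(None)))   # ''"
   ]
  },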
  {
   "cell_type": "markdown",
   "id": "20747f0a",
   "metadata": {},
   "source": [
    "## 4. Phase 1c - EDA\n",
    "\n",
    "Quick look at the cleaned corpus: docstring length, code length, and the cleaning\n",
    "funnel. Save these for the report appendix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f684c430",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set_theme(style=\"whitegrid\")\n",
    "\n",
    "doc_words = clean_df[\"docstring\"].map(word_count)\n",
    "code_lines = clean_df[\"code\"].str.count(\"\\n\") + 1\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(16, 4))\n",
    "sns.histplot(doc_words, bins=40, ax=axes[0]); axes[0].set(title=\"Docstring length (words)\", xlabel=\"words\")\n",
    "sns.histplot(code_lines.clip(upper=80), bins=40, ax=axes[1]); axes[1].set(title=\"Code length (lines, clipped 80)\", xlabel=\"lines\")\n",
    "axes[2].barh(funnel[\"step\"], funnel[\"rows_remaining\"]); axes[2].invert_yaxis(); axes[2].set(title=\"Cleaning funnel\")\n",
    "plt.tight_layout(); plt.show()\n",
    "\n",
    "print({\n",
    "    \"rows\": len(clean_df),\n",
    "    \"doc_words_median\": int(doc_words.median()),\n",
    "    \"code_lines_median\": int(code_lines.median()),\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db098ca9",
   "metadata": {},
   "source": [
    "## 5. Train / val / test split\n",
    "\n",
    "The **train** pool doubles as the retrieval corpus for RAG. We evaluate on **test**,\n",
    "so retrieval cannot return the exact reference: cleaning drops exact duplicates of\n",
    "code and docstring. Near-duplicates can still cross the split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c18c3e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split(df, cfg):\n",
    "    df = df.sample(frac=1.0, random_state=cfg.seed).reset_index(drop=True)\n",
    "    n = len(df); n_tr = int(n * cfg.train); n_va = int(n * cfg.val)\n",
    "    return (df.iloc[:n_tr].reset_index(drop=True),\n",
    "            df.iloc[n_tr:n_tr+n_va].reset_index(drop=True),\n",
    "            df.iloc[n_tr+n_va:].reset_index(drop=True))\n",
    "\n",
    "train_df, val_df, test_df = split(clean_df, CFG)\n",
    "print(f\"train={len(train_df)}  val={len(val_df)}  test={len(test_df)}\")"
   ]
  },
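  {
   "cell_type": "markdown",
   "id": "e1a0c003",
   "metadata": {},
   "source": [
    "One way to sanity-check the leakage assumption is to count exact overlaps between the\n",
    "retrieval pool and the evaluation rows. The sketch below is self-contained and runs on\n",
    "toy frames; `count_exact_overlap` is a hypothetical helper, not part of the pipeline.\n",
    "It only detects verbatim matches, so near-duplicates would go unnoticed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a0c004",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def count_exact_overlap(pool_df, eval_df, col):\n",
    "    \"\"\"Number of eval rows whose `col` value also appears verbatim in the pool.\"\"\"\n",
    "    return int(eval_df[col].isin(set(pool_df[col])).sum())\n",
    "\n",
    "# Toy data (made up) standing in for train_df / test_df.\n",
    "toy_train = pd.DataFrame({\"code\": [\"def a(): pass\", \"def b(): pass\"]})\n",
    "toy_test = pd.DataFrame({\"code\": [\"def b(): pass\", \"def c(): pass\"]})\n",
    "print(count_exact_overlap(toy_train, toy_test, \"code\"))  # 1\n",
    "\n",
    "# On the real split, assuming the cells above have run:\n",
    "# count_exact_overlap(train_df, test_df, \"code\")"
   ]
  },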
  {
   "cell_type": "markdown",
   "id": "b2d3b684",
   "metadata": {},
   "source": [
    "## 6. Phase 3 - Embeddings + FAISS index\n",
    "\n",
    "Embed each docstring in the train pool and build a cosine-similarity index\n",
    "(`IndexFlatIP` on L2-normalised vectors). The default embedder is small and fast;\n",
    "for stronger code-aware embeddings, consider `Salesforce/codet5p-110m-embedding`\n",
    "(ties into the CodeT5 family). Its model card loads it with `trust_remote_code=True`,\n",
    "so it may need more than a one-line `embed_model` swap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68145a4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "import faiss\n",
    "import numpy as np\n",
    "\n",
    "embedder = SentenceTransformer(CFG.embed_model)\n",
    "corpus = train_df.reset_index(drop=True)\n",
    "\n",
    "corpus_emb = embedder.encode(\n",
    "    corpus[\"docstring\"].tolist(),\n",
    "    batch_size=64, show_progress_bar=True,\n",
    "    convert_to_numpy=True, normalize_embeddings=True,\n",
    ").astype(\"float32\")\n",
    "\n",
    "index = faiss.IndexFlatIP(corpus_emb.shape[1])\n",
    "index.add(corpus_emb)\n",
    "print(\"Indexed vectors:\", index.ntotal, \"| dim:\", corpus_emb.shape[1])"
   ]
  },
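  {
   "cell_type": "markdown",
   "id": "e1a0c005",
   "metadata": {},
   "source": [
    "Why an inner-product index gives cosine similarity here: for unit-length vectors the\n",
    "dot product equals the cosine of the angle between them. A small NumPy-only sketch on\n",
    "made-up vectors (no FAISS involved; `l2_normalize` is a hypothetical helper):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a0c006",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def l2_normalize(x):\n",
    "    \"\"\"Scale each row to unit L2 norm.\"\"\"\n",
    "    return x / np.linalg.norm(x, axis=1, keepdims=True)\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "a, b = rng.normal(size=(2, 8)).astype(\"float32\")  # two made-up 8-d vectors\n",
    "\n",
    "# Cosine similarity computed from its definition.\n",
    "cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
    "\n",
    "# Inner product of the normalised vectors, the quantity an inner-product index scores.\n",
    "an, bn = l2_normalize(np.stack([a, b]))\n",
    "inner = float(an @ bn)\n",
    "\n",
    "print(np.isclose(cosine, inner, atol=1e-6))  # True"
   ]
  },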
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "386988c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def retrieve(query, k=None):\n",
    "    k = k or CFG.top_k\n",
    "    q = embedder.encode([query], convert_to_numpy=True,\n",
    "                        normalize_embeddings=True).astype(\"float32\")\n",
    "    scores, idx = index.search(q, k)\n",
    "    out = corpus.iloc[idx[0]].copy()\n",
    "    out[\"score\"] = scores[0]\n",
    "    return out\n",
    "\n",
    "# sanity check\n",
    "retrieve(\"read a json file from disk and return a dict\")[[\"docstring\", \"score\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45bad6b2",
   "metadata": {},
   "source": [
    "## 7. Phase 5a - Load the code LLM\n",
    "\n",
    "`Qwen2.5-Coder-1.5B-Instruct` fits on a free T4. For higher quality (and a Colab Pro\n",
    "GPU) bump `gen_model` to `Qwen/Qwen2.5-Coder-7B-Instruct`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c42beea4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "\n",
    "tok = AutoTokenizer.from_pretrained(CFG.gen_model)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    CFG.gen_model, torch_dtype=\"auto\", device_map=\"auto\"\n",
    ")\n",
    "\n",
    "def chat_generate(messages, max_new_tokens=320):\n",
    "    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "    inputs = tok(text, return_tensors=\"pt\").to(model.device)\n",
    "    out = model.generate(**inputs, max_new_tokens=max_new_tokens,\n",
    "                         do_sample=False, pad_token_id=tok.eos_token_id)\n",
    "    new = out[0][inputs.input_ids.shape[1]:]\n",
    "    return tok.decode(new, skip_special_tokens=True)\n",
    "\n",
    "def extract_code(text):\n",
    "    \"\"\"Strip markdown fences if the model wrapped the code.\"\"\"\n",
    "    m = re.search(r\"```(?:python)?\\n(.*?)```\", text, re.DOTALL)\n",
    "    return m.group(1).strip() if m else text.strip()\n",
    "\n",
    "print(\"Model loaded:\", CFG.gen_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2622d4fd",
   "metadata": {},
   "source": [
    "## 8. Phase 5b - Baseline vs RAG prompts\n",
    "\n",
    "Same model, two prompting strategies. The RAG prompt injects the top-k retrieved\n",
    "`(docstring, code)` pairs as dynamic few-shot context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54db7627",
   "metadata": {},
   "outputs": [],
   "source": [
    "SYS = (\"You are an expert Python coding assistant. Write a single, correct, \"\n",
    "       \"self-contained Python function for the request. Output only code.\")\n",
    "\n",
    "def baseline_messages(intent):\n",
    "    return [{\"role\": \"system\", \"content\": SYS},\n",
    "            {\"role\": \"user\", \"content\": f\"# Task: {intent}\"}]\n",
    "\n",
    "def rag_messages(intent, k=None):\n",
    "    ex = retrieve(intent, k)\n",
    "    blocks = [f\"# Task: {r.docstring}\\n{r.code}\" for _, r in ex.iterrows()]\n",
    "    context = \"\\n\\n\".join(blocks)\n",
    "    user = (f\"Here are similar reference examples:\\n\\n{context}\\n\\n\"\n",
    "            f\"# Now write a function for this task:\\n# Task: {intent}\")\n",
    "    return [{\"role\": \"system\", \"content\": SYS},\n",
    "            {\"role\": \"user\", \"content\": user}]\n",
    "\n",
    "demo = \"Write a function that returns the n-th Fibonacci number.\"\n",
    "print(\"=== BASELINE ===\")\n",
    "print(extract_code(chat_generate(baseline_messages(demo))))\n",
    "print(\"\\n=== RAG ===\")\n",
    "print(extract_code(chat_generate(rag_messages(demo))))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2f6fda3",
   "metadata": {},
   "source": [
    "## 9. Eval - CodeBLEU, baseline vs RAG\n",
    "\n",
    "We score generated code against the reference on held-out **test** rows. CodeBLEU\n",
    "combines n-gram overlap with AST and data-flow match, so it rewards structural\n",
    "similarity as well as surface text overlap. If `codebleu` did not install, we fall\n",
    "back to a token-overlap F1 so the cell still runs.\n",
    "\n",
    "> Caveat: CodeSearchNet has no unit tests, so this measures *similarity to the\n",
    "> reference*, not functional correctness. For pass@k, add a HumanEval/MBPP harness\n",
    "> (Phase 2) - flagged in the next-steps cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8530179f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Try CodeBLEU; fall back to token-F1 if the metric OR its parser is unavailable.\n",
    "score, METRIC = None, None\n",
    "try:\n",
    "    from codebleu import calc_codebleu\n",
    "    # actually CALL it once - this is what needs the tree-sitter parser\n",
    "    _ = calc_codebleu([\"def f(): return 1\"], [\"def f(): return 1\"], lang=\"python\")\n",
    "    def score(ref, hyp):\n",
    "        return calc_codebleu([ref], [hyp], lang=\"python\")[\"codebleu\"]\n",
    "    METRIC = \"CodeBLEU\"\n",
    "except Exception as e:\n",
    "    print(\"CodeBLEU unavailable, using token-F1 fallback:\", e)\n",
    "    def _toks(s):\n",
    "        return set(re.findall(r\"\\w+\", s))\n",
    "    def score(ref, hyp):\n",
    "        a, b = _toks(ref), _toks(hyp)\n",
    "        if not a or not b:\n",
    "            return 0.0\n",
    "        inter = len(a & b)\n",
    "        p, rec = inter / len(b), inter / len(a)\n",
    "        return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)\n",
    "    METRIC = \"token-F1 (fallback)\"\n",
    "print(\"Using metric:\", METRIC)"
   ]
  },
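  {
   "cell_type": "markdown",
   "id": "e1a0c007",
   "metadata": {},
   "source": [
    "A worked example of the fallback metric on two made-up snippets. The scorer is\n",
    "re-declared under a hypothetical name, `token_f1`, so the cell runs on its own; it\n",
    "mirrors the set-based fallback above. Each snippet has five unique tokens and they\n",
    "share three (`def`, `add`, `return`), so precision = recall = 0.6 and F1 = 0.6.\n",
    "A pure variable rename already costs 0.4 here, which is one reason to prefer\n",
    "CodeBLEU when it is available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a0c008",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def token_f1(ref, hyp):\n",
    "    \"\"\"Set-based token F1 (hypothetical standalone copy of the fallback scorer).\"\"\"\n",
    "    a, b = set(re.findall(r\"\\w+\", ref)), set(re.findall(r\"\\w+\", hyp))\n",
    "    inter = len(a & b)\n",
    "    if inter == 0:  # also covers an empty reference or hypothesis\n",
    "        return 0.0\n",
    "    p, rec = inter / len(b), inter / len(a)\n",
    "    return 2 * p * rec / (p + rec)\n",
    "\n",
    "ref = \"def add(a, b): return a + b\"\n",
    "hyp = \"def add(x, y): return x + y\"\n",
    "print(round(token_f1(ref, hyp), 3))  # 0.6"
   ]
  },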
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1f22892",
   "metadata": {},
   "outputs": [],
   "source": [
    "N_EVAL = 15  # keep small on free Colab; raise for the real run\n",
    "sample = test_df.sample(min(N_EVAL, len(test_df)), random_state=CFG.seed)\n",
    "\n",
    "rows = []\n",
    "for _, r in sample.iterrows():\n",
    "    base = extract_code(chat_generate(baseline_messages(r.docstring)))\n",
    "    rag = extract_code(chat_generate(rag_messages(r.docstring)))\n",
    "    rows.append({\"baseline\": score(r.code, base), \"rag\": score(r.code, rag)})\n",
    "\n",
    "res = pd.DataFrame(rows)\n",
    "print(f\"Mean {METRIC} over {len(res)} test tasks:\")\n",
    "print(res.mean().round(4).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86b042c5",
   "metadata": {},
   "source": [
    "## 10. Interactive - ask it to write code\n",
    "\n",
    "Edit the string and run. This uses the RAG pipeline and shows the retrieved\n",
    "examples so the grounding is visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13cf8cc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def ask(intent, show_sources=True):\n",
    "    if show_sources:\n",
    "        print(\"Retrieved examples:\")\n",
    "        for _, r in retrieve(intent).iterrows():\n",
    "            print(f\"  - ({r.score:.2f}) {r.docstring}\")\n",
    "        print(\"-\" * 50)\n",
    "    print(extract_code(chat_generate(rag_messages(intent))))\n",
    "\n",
    "ask(\"Write a function to check whether a string is a valid IPv4 address.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61adc276",
   "metadata": {},
   "source": [
    "## 11. (Optional) Phase 4 - Fine-tune CodeT5+\n",
    "\n",
    "A compact demonstration of the fine-tuning arm: train `codet5p-220m` for one epoch on a\n",
    "2,000-row `docstring -> code` subset so you can see the loop work, then generate. For the\n",
    "real capstone result, enlarge `subset`, raise `num_train_epochs`, and run on a Pro GPU.\n",
    "**This section is slow - skip on a first pass.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2f5217e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set to True to run fine-tuning.\n",
    "RUN_FINETUNE = False\n",
    "\n",
    "if RUN_FINETUNE:\n",
    "    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,\n",
    "                              Seq2SeqTrainer, Seq2SeqTrainingArguments,\n",
    "                              DataCollatorForSeq2Seq)\n",
    "    from datasets import Dataset\n",
    "\n",
    "    ck = \"Salesforce/codet5p-220m\"\n",
    "    t5_tok = AutoTokenizer.from_pretrained(ck)\n",
    "    t5 = AutoModelForSeq2SeqLM.from_pretrained(ck)\n",
    "\n",
    "    subset = train_df.head(2000)\n",
    "    def to_features(batch):\n",
    "        # No padding here: DataCollatorForSeq2Seq pads each batch dynamically and fills\n",
    "        # label padding with -100, so padded positions are ignored by the loss.\n",
    "        x = t5_tok(batch[\"docstring\"], max_length=64, truncation=True)\n",
    "        y = t5_tok(text_target=batch[\"code\"], max_length=256, truncation=True)\n",
    "        x[\"labels\"] = y[\"input_ids\"]\n",
    "        return x\n",
    "\n",
    "    hf = Dataset.from_pandas(subset[[\"docstring\", \"code\"]]).map(\n",
    "        to_features, batched=True, remove_columns=[\"docstring\", \"code\"])\n",
    "\n",
    "    args = Seq2SeqTrainingArguments(\n",
    "        output_dir=\"codet5p-ft\", per_device_train_batch_size=8,\n",
    "        num_train_epochs=1, learning_rate=5e-5, logging_steps=20,\n",
    "        fp16=torch.cuda.is_available(), report_to=\"none\", save_strategy=\"no\")\n",
    "\n",
    "    trainer = Seq2SeqTrainer(\n",
    "        model=t5, args=args, train_dataset=hf,\n",
    "        data_collator=DataCollatorForSeq2Seq(t5_tok, model=t5))\n",
    "    trainer.train()\n",
    "\n",
    "    def t5_generate(intent):\n",
    "        ids = t5_tok(intent, return_tensors=\"pt\").input_ids.to(t5.device)\n",
    "        out = t5.generate(ids, max_length=256)\n",
    "        return t5_tok.decode(out[0], skip_special_tokens=True)\n",
    "\n",
    "    print(t5_generate(\"Return the factorial of a non-negative integer n.\"))\n",
    "else:\n",
    "    print(\"Fine-tuning skipped. Set RUN_FINETUNE = True to run it.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed8b964b",
   "metadata": {},
   "source": [
    "## 12. Next steps + deploying to VS Code\n",
    "\n",
    "**What's still to add for the full capstone:**\n",
    "- **Phase 2 functional eval:** wire up HumanEval / MBPP for real `pass@k` (they ship\n",
    "  unit tests, unlike CodeSearchNet). This is the metric graders trust most.\n",
    "- **Phase 6 agentic loop:** generate -> run in a sandbox -> read traceback -> repair.\n",
    "- **Retrieval quality:** measure recall@k / MRR on the search task to justify the embedder.\n",
    "\n",
    "**Lifting this into VS Code for deployment:**\n",
    "1. The functions here map cleanly onto the repo modules: `clean()` -> `src/data/clean.py`,\n",
    "   `retrieve()` + index build -> `src/rag/retriever.py`, `chat_generate()`/prompts ->\n",
    "   `src/rag/generator.py`.\n",
    "2. Persist the FAISS index (`faiss.write_index(index, \"index.faiss\")`) and the corpus\n",
    "   so you don't rebuild on every start.\n",
    "3. Wrap `ask()` in a **Streamlit** app (`app.py`) for the Phase 7 chat UI:\n",
    "   `streamlit run app.py`.\n",
    "4. Keep `config.yaml` as the single source of truth across notebook and app."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}