{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "43fed051",
   "metadata": {},
   "source": [
    "# Code Generation Assistant\n",
    "\n",
    "**Generate Python code from natural-language descriptions, grounded in CodeSearchNet.**\n",
    "\n",
    "This notebook runs the core vertical slice of the capstone top-to-bottom:\n",
    "\n",
    "1. **Phase 1** - load + clean CodeSearchNet, EDA\n",
    "2. **Phase 3** - embed the corpus + build a FAISS retrieval index\n",
    "3. **Phase 5** - RAG: retrieve similar examples and condition a code LLM\n",
    "4. **Eval** - baseline (no retrieval) vs RAG, scored with CodeBLEU\n",
    "5. **Interactive** - ask it to write code\n",
    "6. **Phase 4 (optional)** - fine-tune CodeT5+ on docstring->code\n",
    "\n",
    "> CodeSearchNet was built for code *search*, so it ships `(docstring, code)` pairs\n",
    "> and **no unit tests**. We treat the docstring summary as the intent and the\n",
    "> function source as the target. Because it is natively a retrieval corpus, RAG is\n",
    "> the most natural architecture here.\n",
    "\n",
    "**Runtime:** set `Runtime -> Change runtime type -> T4 GPU`. No API key required -\n",
    "generation uses a small local model (`Qwen2.5-Coder-1.5B-Instruct`)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06dd65ca",
   "metadata": {},
   "source": [
    "## 0. Setup\n",
    "\n",
    "Installs everything. `codebleu` is optional (it builds tree-sitter parsers); if it\n",
    "fails, the eval falls back to a token-overlap metric so the notebook still runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2733fd6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip -q install datasets transformers accelerate sentence-transformers faiss-cpu pandas matplotlib seaborn\n",
    "# codebleu needs a tree-sitter parser to actually run; install it too (optional - has a fallback)\n",
    "!pip -q install codebleu tree-sitter tree-sitter-python || echo \"codebleu/parser install failed - will use fallback metric\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1125e98a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "print(\"CUDA available:\", torch.cuda.is_available())\n",
    "print(\"Device:\", torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU (generation will be slow)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88a40082",
   "metadata": {},
   "source": [
    "## 1. Config\n",
    "\n",
    "One place for every knob. `max_rows` keeps the Colab run fast - raise it (or set it to\n",
    "`None`) for a fuller run. Only `python` to start; depth over breadth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "894bfb5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "from typing import Optional, Tuple\n",
    "\n",
    "@dataclass\n",
    "class Config:\n",
    "    # data\n",
    "    candidate_dataset_ids: Tuple[str, ...] = (\n",
    "        \"code-search-net/code_search_net\",  # parquet mirror (most reliable)\n",
    "        \"code_search_net\",                  # canonical (may need older datasets)\n",
    "    )\n",
    "    language: str = \"python\"\n",
    "    max_rows: Optional[int] = 8000  # subset for speed; set None for full split\n",
    "    # cleaning\n",
    "    min_doc_words: int = 3\n",
    "    max_doc_words: int = 120\n",
    "    min_code_chars: int = 20\n",
    "    max_code_tokens: int = 400\n",
    "    doc_blocklist: Tuple[str, ...] = (\"todo\", \"fixme\", \"auto-generated\",\n",
    "                                      \"autogenerated\", \"do not edit\")\n",
    "    # split\n",
    "    seed: int = 42\n",
    "    train: float = 0.8\n",
    "    val: float = 0.1\n",
    "    # models\n",
    "    embed_model: str = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
    "    gen_model: str = \"Qwen/Qwen2.5-Coder-1.5B-Instruct\"\n",
    "    top_k: int = 3\n",
    "\n",
    "CFG = Config()\n",
    "CFG"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4382632d",
   "metadata": {},
   "source": [
    "## 2. Phase 1a - Load CodeSearchNet\n",
    "\n",
    "Tries the parquet mirror first, then the canonical id. If both fail on your\n",
    "`datasets` version, run `!pip install \"datasets<3\"` and re-run, or download the\n",
    "raw release from the CodeSearchNet GitHub repo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "158ab53b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "\n",
    "USE_COLS = {\n",
    "    \"func_documentation_string\": \"docstring\",\n",
    "    \"func_code_string\": \"code\",\n",
    "    \"language\": \"language\",\n",
    "    \"repository_name\": \"repo\",\n",
    "    \"func_code_url\": \"url\",\n",
    "}\n",
    "\n",
    "def load_codesearchnet(cfg):\n",
    "    last_err = None\n",
    "    for ds_id in cfg.candidate_dataset_ids:\n",
    "        try:\n",
    "            print(f\"[load] trying '{ds_id}' ({cfg.language}) ...\")\n",
    "            ds = load_dataset(ds_id, cfg.language, split=\"train\", trust_remote_code=True)\n",
    "            if cfg.max_rows:\n",
    "                ds = ds.select(range(min(cfg.max_rows, len(ds))))\n",
    "            df = ds.to_pandas()\n",
    "            keep = [c for c in USE_COLS if c in df.columns]\n",
    "            df = df[keep].rename(columns=USE_COLS)\n",
    "            for col in USE_COLS.values():\n",
    "                if col not in df.columns:\n",
    "                    df[col] = \"\"\n",
    "            print(f\"[load] OK - {len(df):,} rows from '{ds_id}'\")\n",
    "            return df\n",
    "        except Exception as e:\n",
    "            print(f\"[load] failed: {e}\")\n",
    "            last_err = e\n",
    "    raise RuntimeError(f\"All dataset ids failed. Last error: {last_err}\")\n",
    "\n",
    "raw = load_codesearchnet(CFG)\n",
    "raw.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32dba780",
   "metadata": {},
   "source": [
    "## 3. Phase 1b - Clean & filter\n",
    "\n",
    "CodeSearchNet is noisy. We keep only the **summary first line** of each docstring\n",
    "as the intent (the rest is usually `:param:`/`:return:` boilerplate), then apply\n",
    "quality filters and dedup. The **funnel** logs how many rows remain after each filter -\n",
    "keep it for your write-up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2aa0af5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "WORD_RE = re.compile(r\"\\b\\w+\\b\")\n",
    "\n",
    "def first_line(t):\n",
    "    return t.strip().split(\"\\n\")[0].strip() if isinstance(t, str) else \"\"\n",
    "\n",
    "def word_count(t):\n",
    "    return len(WORD_RE.findall(t)) if isinstance(t, str) else 0\n",
    "\n",
    "def ascii_ratio(t):\n",
    "    if not t:\n",
    "        return 1.0\n",
    "    return sum(1 for ch in t if ord(ch) < 128) / len(t)\n",
    "\n",
    "def approx_tokens(c):\n",
    "    return len(re.findall(r\"\\w+|[^\\s\\w]\", c)) if isinstance(c, str) else 0\n",
    "\n",
    "def clean(df, cfg):\n",
    "    funnel = [(\"raw\", len(df))]\n",
    "    df = df.copy()\n",
    "    df[\"docstring\"] = df[\"docstring\"].map(first_line)\n",
    "    df[\"code\"] = df[\"code\"].fillna(\"\").astype(str)\n",
    "\n",
    "    df = df[(df[\"docstring\"].str.len() > 0) & (df[\"code\"].str.len() > 0)]\n",
    "    funnel.append((\"non_empty\", len(df)))\n",
    "\n",
    "    wc = df[\"docstring\"].map(word_count)\n",
    "    df = df[(wc >= cfg.min_doc_words) & (wc <= cfg.max_doc_words)]\n",
    "    funnel.append((\"doc_word_window\", len(df)))\n",
    "\n",
    "    df = df[df[\"code\"].str.len() >= cfg.min_code_chars]\n",
    "    funnel.append((\"min_code_chars\", len(df)))\n",
    "\n",
    "    df = df[df[\"code\"].map(approx_tokens) <= cfg.max_code_tokens]\n",
    "    funnel.append((\"max_code_tokens\", len(df)))\n",
    "\n",
    "    pat = \"|\".join(re.escape(t) for t in cfg.doc_blocklist)\n",
    "    df = df[~df[\"docstring\"].str.lower().str.contains(pat, regex=True)]\n",
    "    funnel.append((\"doc_blocklist\", len(df)))\n",
    "\n",
    "    df = df[df[\"docstring\"].map(ascii_ratio) >= 0.9]\n",
    "    funnel.append((\"ascii_docs\", len(df)))\n",
    "\n",
    "    df = df.drop_duplicates(subset=[\"code\"]).drop_duplicates(subset=[\"docstring\"])\n",
    "    funnel.append((\"dedup\", len(df)))\n",
    "\n",
    "    funnel_df = pd.DataFrame(funnel, columns=[\"step\", \"rows_remaining\"])\n",
    "    return df.reset_index(drop=True), funnel_df\n",
    "\n",
    "clean_df, funnel = clean(raw, CFG)\n",
    "print(funnel.to_string(index=False))\n",
    "print(\"\\nClean rows:\", len(clean_df))\n",
    "clean_df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20747f0a",
   "metadata": {},
   "source": [
    "## 4. Phase 1c - EDA\n",
    "\n",
    "Quick look at the cleaned corpus: docstring length, code length, and the cleaning\n",
    "funnel. Save these for the report appendix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f684c430",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set_theme(style=\"whitegrid\")\n",
    "\n",
    "doc_words = clean_df[\"docstring\"].map(word_count)\n",
    "code_lines = clean_df[\"code\"].str.count(\"\\n\") + 1\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(16, 4))\n",
    "sns.histplot(doc_words, bins=40, ax=axes[0]); axes[0].set(title=\"Docstring length (words)\", xlabel=\"words\")\n",
    "sns.histplot(code_lines.clip(upper=80), bins=40, ax=axes[1]); axes[1].set(title=\"Code length (lines, clipped 80)\", xlabel=\"lines\")\n",
    "axes[2].barh(funnel[\"step\"], funnel[\"rows_remaining\"]); axes[2].invert_yaxis(); axes[2].set(title=\"Cleaning funnel\")\n",
    "plt.tight_layout(); plt.show()\n",
    "\n",
    "print({\n",
    "    \"rows\": len(clean_df),\n",
    "    \"doc_words_median\": int(doc_words.median()),\n",
    "    \"code_lines_median\": int(code_lines.median()),\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db098ca9",
   "metadata": {},
   "source": [
    "## 5. Train / val / test split\n",
    "\n",
    "The **train** pool doubles as the retrieval corpus for RAG. We evaluate on **test**,\n",
    "which is disjoint from it, so a retrieved example can never be the exact test answer\n",
    "(cleaning dropped exact duplicates; near-duplicates can remain)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c18c3e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split(df, cfg):\n",
    "    df = df.sample(frac=1.0, random_state=cfg.seed).reset_index(drop=True)\n",
    "    n = len(df); n_tr = int(n * cfg.train); n_va = int(n * cfg.val)\n",
    "    return (df.iloc[:n_tr].reset_index(drop=True),\n",
    "            df.iloc[n_tr:n_tr+n_va].reset_index(drop=True),\n",
    "            df.iloc[n_tr+n_va:].reset_index(drop=True))\n",
    "\n",
    "train_df, val_df, test_df = split(clean_df, CFG)\n",
    "print(f\"train={len(train_df)} val={len(val_df)} test={len(test_df)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2d3b684",
   "metadata": {},
   "source": [
    "## 6. Phase 3 - Embeddings + FAISS index\n",
    "\n",
    "Embed each docstring in the train pool and build a cosine-similarity index\n",
    "(`IndexFlatIP` on L2-normalised vectors). The default embedder is small and fast;\n",
    "for a stronger, code-aware embedder, swap `embed_model` to\n",
    "`Salesforce/codet5p-110m-embedding` (ties into the CodeT5 family). That checkpoint\n",
    "ships custom model code, so loading it may need `trust_remote_code=True`."
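,
    "\n",
    "\n",
    "Why this gives cosine similarity: for unit-length vectors the inner product equals the\n",
    "cosine of the angle between them. A minimal NumPy sketch on toy random vectors (not the\n",
    "notebook's embeddings; every name below is illustrative):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "docs = rng.normal(size=(5, 8)).astype(np.float32)  # 5 toy document vectors\n",
    "query = rng.normal(size=8).astype(np.float32)      # 1 toy query vector\n",
    "\n",
    "def l2_normalise(x):\n",
    "    # scale each vector to unit length\n",
    "    return x / np.linalg.norm(x, axis=-1, keepdims=True)\n",
    "\n",
    "# cosine similarity straight from its definition\n",
    "cosine = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))\n",
    "\n",
    "# inner product of unit vectors, i.e. what an inner-product index scores\n",
    "inner = l2_normalise(docs) @ l2_normalise(query)\n",
    "\n",
    "assert np.allclose(cosine, inner, atol=1e-5)\n",
    "print(np.argsort(-inner)[:3])  # row ids of the 3 most similar toy documents\n",
    "```\n",
    "\n",
    "`IndexFlatIP` does the same computation as an exact (brute-force) search over the whole corpus."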
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68145a4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "import faiss\n",
    "import numpy as np\n",
    "\n",
    "embedder = SentenceTransformer(CFG.embed_model)\n",
    "corpus = train_df.reset_index(drop=True)\n",
    "\n",
    "corpus_emb = embedder.encode(\n",
    "    corpus[\"docstring\"].tolist(),\n",
    "    batch_size=64, show_progress_bar=True,\n",
    "    convert_to_numpy=True, normalize_embeddings=True,\n",
    ").astype(\"float32\")\n",
    "\n",
    "index = faiss.IndexFlatIP(corpus_emb.shape[1])\n",
    "index.add(corpus_emb)\n",
    "print(\"Indexed vectors:\", index.ntotal, \"| dim:\", corpus_emb.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "386988c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def retrieve(query, k=None):\n",
    "    # cap k at the corpus size: FAISS pads missing hits with index -1\n",
    "    k = min(k or CFG.top_k, index.ntotal)\n",
    "    q = embedder.encode([query], convert_to_numpy=True,\n",
    "                        normalize_embeddings=True).astype(\"float32\")\n",
    "    scores, idx = index.search(q, k)\n",
    "    out = corpus.iloc[idx[0]].copy()\n",
    "    out[\"score\"] = scores[0]\n",
    "    return out\n",
    "\n",
    "# sanity check\n",
    "retrieve(\"read a json file from disk and return a dict\")[[\"docstring\", \"score\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45bad6b2",
   "metadata": {},
   "source": [
    "## 7. Phase 5a - Load the code LLM\n",
    "\n",
    "`Qwen2.5-Coder-1.5B-Instruct` fits on a free T4. For higher quality (and a Colab Pro\n",
    "GPU) bump `gen_model` to `Qwen/Qwen2.5-Coder-7B-Instruct`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c42beea4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "\n",
    "tok = AutoTokenizer.from_pretrained(CFG.gen_model)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    CFG.gen_model, torch_dtype=\"auto\", device_map=\"auto\"\n",
    ")\n",
    "\n",
    "def chat_generate(messages, max_new_tokens=320):\n",
    "    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "    inputs = tok(text, return_tensors=\"pt\").to(model.device)\n",
    "    out = model.generate(**inputs, max_new_tokens=max_new_tokens,\n",
    "                         do_sample=False, pad_token_id=tok.eos_token_id)\n",
    "    new = out[0][inputs.input_ids.shape[1]:]  # keep only the newly generated tokens\n",
    "    return tok.decode(new, skip_special_tokens=True)\n",
    "\n",
    "def extract_code(text):\n",
    "    \"\"\"Strip markdown fences if the model wrapped the code.\"\"\"\n",
    "    m = re.search(r\"```(?:python)?\\n(.*?)```\", text, re.DOTALL)\n",
    "    return m.group(1).strip() if m else text.strip()\n",
    "\n",
    "print(\"Model loaded:\", CFG.gen_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2622d4fd",
   "metadata": {},
   "source": [
    "## 8. Phase 5b - Baseline vs RAG prompts\n",
    "\n",
    "Same model, two prompting strategies. The RAG prompt injects the top-k retrieved\n",
    "`(docstring, code)` pairs as dynamic few-shot context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54db7627",
   "metadata": {},
   "outputs": [],
   "source": [
    "SYS = (\"You are an expert Python coding assistant. Write a single, correct, \"\n",
    "       \"self-contained Python function for the request. Output only code.\")\n",
    "\n",
    "def baseline_messages(intent):\n",
    "    return [{\"role\": \"system\", \"content\": SYS},\n",
    "            {\"role\": \"user\", \"content\": f\"# Task: {intent}\"}]\n",
    "\n",
    "def rag_messages(intent, k=None):\n",
    "    ex = retrieve(intent, k)\n",
    "    blocks = [f\"# Task: {r.docstring}\\n{r.code}\" for _, r in ex.iterrows()]\n",
    "    context = \"\\n\\n\".join(blocks)\n",
    "    user = (f\"Here are similar reference examples:\\n\\n{context}\\n\\n\"\n",
    "            f\"# Now write a function for this task:\\n# Task: {intent}\")\n",
    "    return [{\"role\": \"system\", \"content\": SYS},\n",
    "            {\"role\": \"user\", \"content\": user}]\n",
    "\n",
    "demo = \"Write a function that returns the n-th Fibonacci number.\"\n",
    "print(\"=== BASELINE ===\")\n",
    "print(extract_code(chat_generate(baseline_messages(demo))))\n",
    "print(\"\\n=== RAG ===\")\n",
    "print(extract_code(chat_generate(rag_messages(demo))))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2f6fda3",
   "metadata": {},
   "source": [
    "## 9. Eval - CodeBLEU, baseline vs RAG\n",
    "\n",
    "We score generated code against the reference on held-out **test** rows. CodeBLEU\n",
    "combines n-gram overlap with AST and data-flow match, so it is not purely a\n",
    "text-overlap score. If `codebleu` did not install, we fall back to a token-overlap\n",
    "F1 so the cell still runs.\n",
    "\n",
    "> Caveat: CodeSearchNet has no unit tests, so this measures *similarity to the\n",
    "> reference*, not functional correctness. For pass@k, add a HumanEval/MBPP harness\n",
    "> (Phase 2) - flagged in the next-steps cell."
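,
    "\n",
    "\n",
    "For reference, pass@k is usually computed per task with an unbiased estimator over `n`\n",
    "generated samples, of which `c` pass the tests. A minimal sketch, assuming such counts are\n",
    "available from a test harness (the function name and the counts below are hypothetical):\n",
    "\n",
    "```python\n",
    "from math import comb\n",
    "\n",
    "def pass_at_k(n, c, k):\n",
    "    # n = samples generated for one task, c = samples that pass its tests,\n",
    "    # k = sample budget; returns P(at least one of k drawn samples passes)\n",
    "    if n - c < k:\n",
    "        return 1.0  # fewer than k failing samples, so any k draws include a pass\n",
    "    return 1.0 - comb(n - c, k) / comb(n, k)\n",
    "\n",
    "# hypothetical counts for a single task: 10 samples, 3 passing\n",
    "print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3\n",
    "print(round(pass_at_k(n=10, c=3, k=5), 3))\n",
    "```\n",
    "\n",
    "The benchmark score is the mean of this value over all tasks."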
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8530179f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Try CodeBLEU; fall back to token-F1 if the metric OR its parser is unavailable.\n",
    "score, METRIC = None, None\n",
    "try:\n",
    "    from codebleu import calc_codebleu\n",
    "    # actually CALL it once - this is what needs the tree-sitter parser\n",
    "    _ = calc_codebleu([\"def f(): return 1\"], [\"def f(): return 1\"], lang=\"python\")\n",
    "    def score(ref, hyp):\n",
    "        return calc_codebleu([ref], [hyp], lang=\"python\")[\"codebleu\"]\n",
    "    METRIC = \"CodeBLEU\"\n",
    "except Exception as e:\n",
    "    print(\"CodeBLEU unavailable, using token-F1 fallback:\", e)\n",
    "    def _toks(s):\n",
    "        return set(re.findall(r\"\\w+\", s))\n",
    "    def score(ref, hyp):\n",
    "        a, b = _toks(ref), _toks(hyp)\n",
    "        if not a or not b:\n",
    "            return 0.0\n",
    "        inter = len(a & b)\n",
    "        p, rec = inter / len(b), inter / len(a)\n",
    "        return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)\n",
    "    METRIC = \"token-F1 (fallback)\"\n",
    "print(\"Using metric:\", METRIC)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1f22892",
   "metadata": {},
   "outputs": [],
   "source": [
    "N_EVAL = 15  # keep small on free Colab; raise for the real run\n",
    "sample = test_df.sample(min(N_EVAL, len(test_df)), random_state=CFG.seed)\n",
    "\n",
    "rows = []\n",
    "for _, r in sample.iterrows():\n",
    "    base = extract_code(chat_generate(baseline_messages(r.docstring)))\n",
    "    rag = extract_code(chat_generate(rag_messages(r.docstring)))\n",
    "    rows.append({\"baseline\": score(r.code, base), \"rag\": score(r.code, rag)})\n",
    "\n",
    "res = pd.DataFrame(rows)\n",
    "print(f\"Mean {METRIC} over {len(res)} test tasks:\")\n",
    "print(res.mean().round(4).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86b042c5",
   "metadata": {},
   "source": [
    "## 10. Interactive - ask it to write code\n",
    "\n",
    "Edit the string and run. This uses the RAG pipeline and shows the retrieved\n",
    "examples so the grounding is visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13cf8cc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def ask(intent, show_sources=True):\n",
    "    if show_sources:\n",
    "        print(\"Retrieved examples:\")\n",
    "        for _, r in retrieve(intent).iterrows():\n",
    "            print(f\"  - ({r.score:.2f}) {r.docstring}\")\n",
    "        print(\"-\" * 50)\n",
    "    print(extract_code(chat_generate(rag_messages(intent))))\n",
    "\n",
    "ask(\"Write a function to check whether a string is a valid IPv4 address.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61adc276",
   "metadata": {},
   "source": [
    "## 11. (Optional) Phase 4 - Fine-tune CodeT5+\n",
    "\n",
    "A compact demonstration of the fine-tuning arm: train `codet5p-220m` on a small\n",
    "`docstring -> code` subset for one short epoch so you can see the loop work, then\n",
    "generate. For the real capstone result, raise the `subset` size and `num_train_epochs`\n",
    "and run on a Pro GPU. **This section is slow - skip on a first pass.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2f5217e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set to True to run fine-tuning.\n",
    "RUN_FINETUNE = False\n",
    "\n",
    "if RUN_FINETUNE:\n",
    "    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,\n",
    "                              Seq2SeqTrainer, Seq2SeqTrainingArguments,\n",
    "                              DataCollatorForSeq2Seq)\n",
    "    from datasets import Dataset\n",
    "\n",
    "    ck = \"Salesforce/codet5p-220m\"\n",
    "    t5_tok = AutoTokenizer.from_pretrained(ck)\n",
    "    t5 = AutoModelForSeq2SeqLM.from_pretrained(ck)\n",
    "\n",
    "    subset = train_df.head(2000)\n",
    "    def to_features(batch):\n",
    "        # No padding here: DataCollatorForSeq2Seq pads each batch and fills label\n",
    "        # padding with -100, so pad tokens do not contribute to the loss.\n",
    "        x = t5_tok(batch[\"docstring\"], max_length=64, truncation=True)\n",
    "        y = t5_tok(text_target=batch[\"code\"], max_length=256, truncation=True)\n",
    "        x[\"labels\"] = y[\"input_ids\"]\n",
    "        return x\n",
    "\n",
    "    hf = Dataset.from_pandas(subset[[\"docstring\", \"code\"]]).map(\n",
    "        to_features, batched=True, remove_columns=[\"docstring\", \"code\"])\n",
    "\n",
    "    args = Seq2SeqTrainingArguments(\n",
    "        output_dir=\"codet5p-ft\", per_device_train_batch_size=8,\n",
    "        num_train_epochs=1, learning_rate=5e-5, logging_steps=20,\n",
    "        fp16=torch.cuda.is_available(), report_to=\"none\", save_strategy=\"no\")\n",
    "\n",
    "    trainer = Seq2SeqTrainer(\n",
    "        model=t5, args=args, train_dataset=hf,\n",
    "        data_collator=DataCollatorForSeq2Seq(t5_tok, model=t5))\n",
    "    trainer.train()\n",
    "\n",
    "    def t5_generate(intent):\n",
    "        ids = t5_tok(intent, return_tensors=\"pt\").input_ids.to(t5.device)\n",
    "        out = t5.generate(ids, max_length=256)\n",
    "        return t5_tok.decode(out[0], skip_special_tokens=True)\n",
    "\n",
    "    print(t5_generate(\"Return the factorial of a non-negative integer n.\"))\n",
    "else:\n",
    "    print(\"Fine-tuning skipped. Set RUN_FINETUNE = True to run it.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed8b964b",
   "metadata": {},
   "source": [
    "## 12. Next steps + deploying to VS Code\n",
    "\n",
    "**What's still to add for the full capstone:**\n",
    "- **Phase 2 functional eval:** wire up HumanEval / MBPP for real `pass@k` (they ship\n",
    "  unit tests, unlike CodeSearchNet). This is the metric graders trust most.\n",
    "- **Phase 6 agentic loop:** generate -> run in a sandbox -> read traceback -> repair.\n",
    "- **Retrieval quality:** measure recall@k / MRR on the search task to justify the embedder.\n",
    "\n",
    "**Lifting this into VS Code for deployment:**\n",
    "1. The functions here map cleanly onto the repo modules: `clean()` -> `src/data/clean.py`,\n",
    "   `retrieve()` + index build -> `src/rag/retriever.py`, `chat_generate()`/prompts ->\n",
    "   `src/rag/generator.py`.\n",
    "2. Persist the FAISS index (`faiss.write_index(index, \"index.faiss\")`) and the corpus\n",
    "   so you don't rebuild on every start.\n",
    "3. Wrap `ask()` in a **Streamlit** app (`app.py`) for the Phase 7 chat UI:\n",
    "   `streamlit run app.py`.\n",
    "4. Keep `config.yaml` as the single source of truth across notebook and app."
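,
    "\n",
    "\n",
    "The retrieval-quality item can be sketched as follows. This is a minimal, dependency-free\n",
    "illustration, assuming each query has exactly one gold corpus id; the helper names and the\n",
    "toy rankings are hypothetical, not part of the notebook:\n",
    "\n",
    "```python\n",
    "def recall_at_k(ranked_ids, gold_ids, k):\n",
    "    # fraction of queries whose gold item appears among the top k results\n",
    "    hits = [gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids)]\n",
    "    return sum(hits) / len(hits)\n",
    "\n",
    "def mean_reciprocal_rank(ranked_ids, gold_ids):\n",
    "    # mean of 1/rank of the gold item, counting 0 when it was not retrieved\n",
    "    total = 0.0\n",
    "    for ranked, gold in zip(ranked_ids, gold_ids):\n",
    "        if gold in ranked:\n",
    "            total += 1.0 / (ranked.index(gold) + 1)\n",
    "    return total / len(gold_ids)\n",
    "\n",
    "# toy rankings for 3 queries, best match first\n",
    "ranked = [[4, 2, 9], [7, 1, 3], [5, 6, 8]]\n",
    "gold = [2, 7, 0]\n",
    "print(recall_at_k(ranked, gold, k=1), recall_at_k(ranked, gold, k=3))\n",
    "print(mean_reciprocal_rank(ranked, gold))  # 0.5\n",
    "```\n",
    "\n",
    "In this notebook the rankings would come from `index.search`, with each held-out docstring\n",
    "as the query and its own function as the gold item."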
] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }