{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build a Simple RAG Chatbot with LangChain, ChromaDB & Gradio\n", "\n", "This notebook walks you through building a **Retrieval-Augmented Generation (RAG)** chatbot from scratch. \n", "We'll scrape **one webpage**, chunk it, embed it into a vector store, hook up a small **HuggingFace LLM**, and serve it all through a **Gradio** chat UI.\n", "\n", "### What is RAG?\n", "RAG = **Retrieve** relevant documents → **Augment** the prompt with them → **Generate** an answer. \n", "This lets an LLM answer questions about content it was never trained on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 1: Install Dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q langchain langchain-community langchain-chroma langchain-huggingface \\\n", " langchain-text-splitters chromadb sentence-transformers \\\n", " beautifulsoup4 gradio huggingface_hub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2: Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "import gradio as gr\n", "\n", "from langchain_community.document_loaders import WebBaseLoader\n", "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", "from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint\n", "from langchain_chroma import Chroma\n", "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", "from langchain.chains.combine_documents import create_stuff_documents_chain\n", "from langchain.chains.retrieval import create_retrieval_chain\n", "from langchain_core.messages import HumanMessage, AIMessage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 3: Set Your HuggingFace Token\n", "\n", "We'll use the **HuggingFace Inference API** to run a small 
LLM without needing a GPU locally. \n", "Get a free token at https://huggingface.co/settings/tokens" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Paste your HuggingFace token here\n", "os.environ[\"HF_TOKEN\"] = \"hf_...\"\n", "os.environ[\"USER_AGENT\"] = \"rag-tutorial\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 4: Scrape a Webpage\n", "\n", "We use LangChain's `WebBaseLoader` to fetch and parse HTML from a single URL. \n", "This gives us `Document` objects with the page text and metadata (source URL)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "URL = \"https://aisviz.gitbook.io/documentation\"\n", "\n", "loader = WebBaseLoader(URL)\n", "raw_docs = loader.load()\n", "\n", "print(f\"Loaded {len(raw_docs)} document(s)\")\n", "print(f\"Character count: {len(raw_docs[0].page_content)}\")\n", "print(f\"\\nFirst 500 chars:\\n{raw_docs[0].page_content[:500]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 5: Clean the Text\n", "\n", "Web-scraped text is often messy — control characters, excessive whitespace, etc. \n", "A quick clean-up keeps our chunks meaningful." 
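] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before writing the full cleaner, here's the core idea on a toy string. This mini-example is illustrative only (the sample text is made up); the next cell applies the same patterns to the real page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# A made-up messy string: a control character, 4 newlines, runs of spaces\n", "messy = \"Hello\\x00 world\\n\\n\\n\\nToo   many    spaces\"\n", "\n", "cleaned = re.sub(r\"[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]\", \"\", messy)  # drop control chars\n", "cleaned = re.sub(r\"\\n{3,}\", \"\\n\\n\", cleaned)  # 3+ newlines -> 2\n", "cleaned = re.sub(r\" {3,}\", \" \", cleaned)  # 3+ spaces -> 1\n", "\n", "print(repr(cleaned))  # 'Hello world\\n\\nToo many spaces'"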
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clean_text(text: str) -> str:\n", " \"\"\"Remove control characters and normalize whitespace while preserving structure.\"\"\"\n", " # Remove control characters but keep newlines and tabs\n", " text = re.sub(r\"[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]\", \"\", text)\n", " # Normalize line endings\n", " text = text.replace(\"\\r\\n\", \"\\n\").replace(\"\\r\", \"\\n\")\n", " # Collapse excessive newlines (3+ → 2)\n", " text = re.sub(r\"\\n{3,}\", \"\\n\\n\", text)\n", " # Collapse excessive spaces\n", " text = re.sub(r\" {3,}\", \" \", text)\n", " return text.strip()\n", "\n", "\n", "for doc in raw_docs:\n", " doc.page_content = clean_text(doc.page_content)\n", "\n", "print(f\"Cleaned character count: {len(raw_docs[0].page_content)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 6: Chunk the Documents\n", "\n", "LLMs have limited context windows, and embeddings work best on focused passages. \n", "We split documents into overlapping chunks so that no context is lost at boundaries.\n", "\n", "- **chunk_size**: max characters per chunk \n", "- **chunk_overlap**: characters shared between consecutive chunks \n", "- **separators**: split boundaries in priority order (paragraphs → lines → sentences → words)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "splitter = RecursiveCharacterTextSplitter(\n", " chunk_size=768,\n", " chunk_overlap=100,\n", " separators=[\"\\n\\n\", \"\\n\", \". 
\", \" \", \"\"],\n", ")\n", "\n", "chunks = splitter.split_documents(raw_docs)\n", "\n", "print(f\"Split into {len(chunks)} chunks\")\n", "print(f\"\\n--- Chunk 0 ---\\n{chunks[0].page_content[:300]}...\")\n", "print(f\"\\n--- Chunk 1 ---\\n{chunks[1].page_content[:300]}...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 7: Create Embeddings & Store in ChromaDB\n", "\n", "We convert each chunk into a **vector embedding** — a numerical representation of its meaning. \n", "ChromaDB stores these vectors and lets us find the most relevant chunks for any query.\n", "\n", "We use `all-MiniLM-L6-v2` — a small, fast embedding model (80 MB)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embeddings = HuggingFaceEmbeddings(\n", " model_name=\"sentence-transformers/all-MiniLM-L6-v2\",\n", " model_kwargs={\"device\": \"cpu\"},\n", ")\n", "\n", "vectorstore = Chroma.from_documents(\n", " documents=chunks,\n", " embedding=embeddings,\n", " # persist_directory=\"./chroma_db\", # uncomment to save to disk\n", ")\n", "\n", "print(f\"Stored {vectorstore._collection.count()} vectors in ChromaDB\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick test — similarity search\n", "\n", "Let's verify our vector store works by querying it directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = vectorstore.similarity_search(\"What is AISdb?\", k=3)\n", "\n", "for i, doc in enumerate(results):\n", " print(f\"\\n--- Result {i+1} ---\")\n", " print(doc.page_content[:200])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 8: Set Up the Retriever\n", "\n", "A **retriever** wraps the vector store with a standard interface. \n", "`k=5` means we'll fetch the top 5 most relevant chunks for each question." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 5})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 9: Set Up the LLM\n", "\n", "We use `HuggingFaceEndpoint` to call a small model via the **free Inference API**. \n", "No GPU needed — the model runs on HuggingFace's servers.\n", "\n", "`HuggingFaceTB/SmolLM2-1.7B-Instruct` is a capable small model (~1.7B params)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "llm = HuggingFaceEndpoint(\n", " repo_id=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\",\n", " temperature=0.3,\n", " max_new_tokens=512,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 10: Build the RAG Chain\n", "\n", "This is the core of the chatbot. The chain:\n", "\n", "1. Takes the user's question \n", "2. Retrieves relevant document chunks from ChromaDB \n", "3. Passes both the question and the context to the LLM \n", "4. Returns the generated answer \n", "\n", "The **system prompt** tells the LLM to answer only from the provided context." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "system_prompt = \"\"\"You are a helpful assistant that answers questions based on the provided context.\n", "Use ONLY the context below to answer. 
If the context doesn't contain the answer, say\n", "\"I don't have enough information to answer that.\"\n", "\n", "Context:\n", "{context}\"\"\"\n", "\n", "prompt = ChatPromptTemplate.from_messages([\n", " (\"system\", system_prompt),\n", " MessagesPlaceholder(\"chat_history\"),\n", " (\"human\", \"{input}\"),\n", "])\n", "\n", "# Combine retrieved docs into the prompt\n", "question_answer_chain = create_stuff_documents_chain(llm, prompt)\n", "\n", "# Wire up retriever + QA chain\n", "rag_chain = create_retrieval_chain(retriever, question_answer_chain)\n", "\n", "print(\"RAG chain ready!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick test — ask a question" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = rag_chain.invoke({\n", " \"input\": \"What is AISdb?\",\n", " \"chat_history\": [],\n", "})\n", "\n", "print(response[\"answer\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 11: Add Chat History\n", "\n", "A real chatbot remembers previous messages. We store the conversation as a list of \n", "`HumanMessage` / `AIMessage` objects and pass them into each call." 
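] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One caveat: an ever-growing history will eventually exceed the model's context window. A simple guard is to keep only the most recent messages; `trim_history` below is a sketch of our own, not a LangChain helper:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def trim_history(history, max_messages=6):\n", "    \"\"\"Keep only the last `max_messages` messages (each turn adds two).\"\"\"\n", "    return history[-max_messages:]\n", "\n", "\n", "# Toy demo with strings standing in for HumanMessage / AIMessage objects\n", "demo_history = [f\"msg-{i}\" for i in range(10)]\n", "print(trim_history(demo_history))  # ['msg-4', 'msg-5', 'msg-6', 'msg-7', 'msg-8', 'msg-9']"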
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chat_history = [] # stores LangChain message objects\n", "\n", "\n", "def ask(question: str) -> str:\n", " \"\"\"Send a question to the RAG chain with chat history.\"\"\"\n", " response = rag_chain.invoke({\n", " \"input\": question,\n", " \"chat_history\": chat_history,\n", " })\n", " answer = response[\"answer\"]\n", "\n", " # Append to history so the next call has context\n", " chat_history.append(HumanMessage(content=question))\n", " chat_history.append(AIMessage(content=answer))\n", "\n", " return answer\n", "\n", "\n", "# Test multi-turn conversation\n", "print(ask(\"What is AISdb?\"))\n", "print(\"---\")\n", "print(ask(\"What can it do?\")) # \"it\" should refer to AISdb from context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 12: Build the Gradio Chat UI\n", "\n", "Finally, let's wrap everything in a Gradio `ChatInterface`. \n", "This gives us a polished chat window with message history, a text box, and a send button — in ~15 lines of code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def respond(message, history):\n", " \"\"\"Gradio chat callback. `history` is a list of {role, content} dicts.\"\"\"\n", " # Convert Gradio history → LangChain messages\n", " lc_history = []\n", " for msg in history:\n", " if msg[\"role\"] == \"user\":\n", " lc_history.append(HumanMessage(content=msg[\"content\"]))\n", " else:\n", " lc_history.append(AIMessage(content=msg[\"content\"]))\n", "\n", " response = rag_chain.invoke({\n", " \"input\": message,\n", " \"chat_history\": lc_history,\n", " })\n", "\n", " return response[\"answer\"]\n", "\n", "\n", "demo = gr.ChatInterface(\n", " fn=respond,\n", " type=\"messages\",\n", " title=\"RAG Chatbot\",\n", " description=\"Ask me anything about the documentation! 
Powered by a small HuggingFace LLM + ChromaDB.\",\n", " examples=[\n", " \"What is AISdb?\",\n", " \"How do I get started?\",\n", " \"What features does it have?\",\n", " ],\n", " theme=gr.themes.Soft(),\n", ")\n", "\n", "demo.launch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Recap\n", "\n", "Here's everything we built in this notebook (step numbers match the sections above):\n", "\n", "| Step | What | Tool |\n", "|------|------|------|\n", "| 4 | Scraped a webpage | `WebBaseLoader` |\n", "| 5 | Cleaned the text | Regex |\n", "| 6 | Chunked into passages | `RecursiveCharacterTextSplitter` |\n", "| 7 | Embedded & stored vectors | `HuggingFaceEmbeddings` + `ChromaDB` |\n", "| 8 | Wrapped the store in a retriever | `as_retriever` (`k=5`) |\n", "| 9 | Connected a small LLM | `HuggingFaceEndpoint` (SmolLM2 1.7B) |\n", "| 10 | Built a RAG chain | LangChain `create_retrieval_chain` |\n", "| 11 | Added chat history | `HumanMessage` / `AIMessage` list |\n", "| 12 | Served with a chat UI | Gradio `ChatInterface` |\n", "\n", "### Next Steps\n", "- Scrape **multiple URLs** to expand the knowledge base\n", "- **Persist** ChromaDB to disk so you don't re-embed on every restart\n", "- Add a **history-aware retriever** that rewrites questions using prior context\n", "- Swap in a **larger LLM** (Gemini, GPT, Claude) for better answers\n", "- Deploy to **HuggingFace Spaces** for free hosting" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }