UzairKhiiba committed on
Commit
dfdcf8b
·
1 Parent(s): e651c77

Extract + Chunk + Embed + Pinecone + BM25Corpus


Extract all PDFs, chunk with fixed + recursive strategies, embed with all-MiniLM-L6-v2, upsert both namespaces to Pinecone, save bm25_corpus.json
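A note on re-running the upsert step: the notebook in this commit assigns each vector a random `uuid.uuid4()` id, and its recorded index stats show `fixed-chunks` growing from 5069 to 9908 vectors after upserting 4839, which is consistent with every run adding new vectors rather than overwriting earlier ones. A minimal sketch of one possible alternative, assuming ids derived from the namespace, source path, and chunk index are acceptable; `stable_chunk_id` and its key format are hypothetical and not part of the notebook:

```python
import uuid


def stable_chunk_id(namespace: str, source: str, chunk_index: int) -> str:
    """Derive a repeatable id from the namespace, source path, and chunk position.

    Hypothetical helper: the "|"-joined key format is an assumption, not the
    notebook's scheme (the notebook uses uuid.uuid4()).
    """
    key = f"{namespace}|{source}|{chunk_index}"
    # uuid5 hashes the key, so the same inputs always give the same id.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


first = stable_chunk_id("recursive-chunks", "docs/example.pdf", 0)
again = stable_chunk_id("recursive-chunks", "docs/example.pdf", 0)
other = stable_chunk_id("fixed-chunks", "docs/example.pdf", 0)

print(first == again)  # same inputs, same id
print(first == other)  # different namespace, different id
```

With ids like these, upserting the same chunk twice would target the same id instead of creating a second vector. Ids only stay stable while the chunking parameters and file paths stay the same.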

notebooks/1_extract_and_index.ipynb ADDED
@@ -0,0 +1 @@
 
 
+ {"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.12.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceType":"datasetVersion","sourceId":15489940,"datasetId":9910050,"databundleVersionId":16414607},{"sourceType":"datasetVersion","sourceId":15489007,"datasetId":9909457,"databundleVersionId":16413609}],"dockerImageVersionId":31328,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# 📚 RAG Pipeline — Extract & Index\n**Task:** Extract PDFs → Chunk (Fixed + Recursive) → Embed (all-MiniLM-L6-v2) → Upsert to Pinecone → Save BM25 corpus\n\n**Data structure:**\n```\nNLP RAG project data/\n├── CS Electives related Data/\n│ ├── course_outlines/ ← PDFs\n│ └── program-announcements/ ← PDFs\n├── IBA General Information/ ← PDFs\n└── TAs related Data/ ← PDFs\n```","metadata":{}},{"cell_type":"markdown","source":"## 0. 
Install Dependencies","metadata":{}},{"cell_type":"code","source":"import sys\n!{sys.executable} -m pip install --upgrade pip","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:34:32.853745Z","iopub.execute_input":"2026-04-02T11:34:32.854027Z","iopub.status.idle":"2026-04-02T11:34:56.771313Z","shell.execute_reply.started":"2026-04-02T11:34:32.854004Z","shell.execute_reply":"2026-04-02T11:34:56.770261Z"}},"outputs":[{"name":"stdout","text":"Requirement already satisfied: pip in /usr/local/lib/python3.12/dist-packages (24.1.2)\nCollecting pip\n Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)\nDownloading pip-26.0.1-py3-none-any.whl (1.8 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m100.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m0:01\u001b[0mm\n\u001b[?25hInstalling collected packages: pip\n Attempting uninstall: pip\n Found existing installation: pip 24.1.2\n Uninstalling pip-24.1.2:\n Successfully uninstalled pip-24.1.2\nSuccessfully installed pip-26.0.1\n","output_type":"stream"}],"execution_count":1},{"cell_type":"code","source":"pip install -q --no-deps pypdf pdfplumber rank_bm25 tqdm","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:34:56.773248Z","iopub.execute_input":"2026-04-02T11:34:56.773535Z","iopub.status.idle":"2026-04-02T11:34:57.961306Z","shell.execute_reply.started":"2026-04-02T11:34:56.773508Z","shell.execute_reply":"2026-04-02T11:34:57.960345Z"}},"outputs":[{"name":"stdout","text":"Note: you may need to restart the kernel to use updated packages.\n","output_type":"stream"}],"execution_count":2},{"cell_type":"code","source":"pip install -q sentence-transformers pinecone-client langchain 
langchain-community","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:34:57.962509Z","iopub.execute_input":"2026-04-02T11:34:57.963662Z","iopub.status.idle":"2026-04-02T11:35:50.706610Z","shell.execute_reply.started":"2026-04-02T11:34:57.963612Z","shell.execute_reply":"2026-04-02T11:35:50.705419Z"}},"outputs":[{"name":"stdout","text":"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\nbigframes 2.35.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.\ngoogle-adk 1.25.1 requires google-cloud-bigquery-storage>=2.0.0, which is not installed.\ngoogle-colab 1.0.0 requires jupyter-server==2.14.0, but you have jupyter-server 2.12.5 which is incompatible.\ngoogle-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.3.3 which is incompatible.\ngoogle-colab 1.0.0 requires requests==2.32.4, but you have requests 2.33.1 which is incompatible.\ngcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2026.2.0 which is incompatible.\u001b[0m\u001b[31m\n\u001b[0mNote: you may need to restart the kernel to use updated packages.\n","output_type":"stream"}],"execution_count":3},{"cell_type":"code","source":"!pip install --upgrade pdfplumber pdfminer.six --break-system-packages","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:35:50.709178Z","iopub.execute_input":"2026-04-02T11:35:50.709450Z","iopub.status.idle":"2026-04-02T11:35:53.803936Z","shell.execute_reply.started":"2026-04-02T11:35:50.709425Z","shell.execute_reply":"2026-04-02T11:35:53.802826Z"}},"outputs":[{"name":"stdout","text":"Requirement already satisfied: pdfplumber in /usr/local/lib/python3.12/dist-packages (0.11.9)\nCollecting pdfminer.six\n Downloading pdfminer_six-20260107-py3-none-any.whl.metadata (4.3 kB)\n Downloading pdfminer_six-20251230-py3-none-any.whl.metadata (4.3 kB)\nRequirement 
already satisfied: Pillow>=9.1 in /usr/local/lib/python3.12/dist-packages (from pdfplumber) (11.3.0)\nCollecting pypdfium2>=4.18.0 (from pdfplumber)\n Downloading pypdfium2-5.6.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (68 kB)\nRequirement already satisfied: charset-normalizer>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from pdfminer.six) (3.4.4)\nRequirement already satisfied: cryptography>=36.0.0 in /usr/local/lib/python3.12/dist-packages (from pdfminer.six) (43.0.3)\nRequirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.12/dist-packages (from cryptography>=36.0.0->pdfminer.six) (2.0.0)\nRequirement already satisfied: pycparser in /usr/local/lib/python3.12/dist-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six) (3.0)\nDownloading pdfminer_six-20251230-py3-none-any.whl (6.6 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.6/6.6 MB\u001b[0m \u001b[31m64.3 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n\u001b[?25hDownloading pypdfium2-5.6.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.6/3.6 MB\u001b[0m \u001b[31m85.7 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n\u001b[?25hInstalling collected packages: pypdfium2, pdfminer.six\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2/2\u001b[0m [pdfminer.six][0m [pdfminer.six]\n\u001b[1A\u001b[2KSuccessfully installed pdfminer.six-20251230 pypdfium2-5.6.0\n","output_type":"stream"}],"execution_count":4},{"cell_type":"markdown","source":"## 1. 
Configuration","metadata":{}},{"cell_type":"code","source":"from dotenv import load_dotenv\nload_dotenv(dotenv_path=\"/kaggle/input/datasets/uzairkhiiba/envfile/.env\") # adjust path\n\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:35:53.805025Z","iopub.execute_input":"2026-04-02T11:35:53.805288Z","iopub.status.idle":"2026-04-02T11:35:53.841269Z","shell.execute_reply.started":"2026-04-02T11:35:53.805261Z","shell.execute_reply":"2026-04-02T11:35:53.840411Z"}},"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"True"},"metadata":{}}],"execution_count":5},{"cell_type":"code","source":"import os\n\n# ── Pinecone ──────────────────────────────────────────────────────────────────\nPINECONE_API_KEY = os.getenv(\"PINECONE_API_KEY\")\nPINECONE_ENV = os.getenv(\"PINECONE_ENV\", \"us-east-1-aws\") # e.g. \"us-east-1-aws\"\nINDEX_NAME = \"iba-rag\"\n\n# ── Namespaces (one per chunking strategy) ────────────────────────────────────\nNS_FIXED = \"fixed-chunks\"\nNS_RECURSIVE = \"recursive-chunks\"\n\n# ── Embedding model ───────────────────────────────────────────────────────────\nEMBED_MODEL = \"all-MiniLM-L6-v2\"\nEMBED_DIM = 384\n\n# ── Chunking parameters ───────────────────────────────────────────────────────\nFIXED_CHUNK_SIZE = 512 # characters\nFIXED_OVERLAP = 50\n\nREC_CHUNK_SIZE = 512\nREC_OVERLAP = 100\n\n# ── Data root ─────────────────────────────────────────────────────────────────\nDATA_ROOT = \"/kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data\" \n\n# ── Leaf folders that contain the documents ───────────────────────────────────\nLEAF_FOLDERS = [\n # CS Electives\n os.path.join(DATA_ROOT, \"CS Electives related Data\", \"course_outlines\"),\n os.path.join(DATA_ROOT, \"CS Electives related Data\", \"program-announcements\"),\n\n # IBA General Info\n os.path.join(DATA_ROOT, \"IBA General Information\"),\n\n # Exam Preparation Assistant (deep folders)\n os.path.join(DATA_ROOT, \"Exam 
Preparation Assistant\", \"Computer Architecture and Assembly Language\", \"Quizs\"),\n os.path.join(DATA_ROOT, \"Exam Preparation Assistant\", \"Data Structures\"),\n os.path.join(DATA_ROOT, \"Exam Preparation Assistant\", \"DataBases\"),\n os.path.join(DATA_ROOT, \"Exam Preparation Assistant\", \"Introduction to Programming\"),\n os.path.join(DATA_ROOT, \"Exam Preparation Assistant\", \"Object Oriented Programming\"),\n os.path.join(DATA_ROOT, \"Exam Preparation Assistant\", \"Operating system\"),\n\n # TAs related\n os.path.join(DATA_ROOT, \"TAs related Data\"),\n]\n\nBM25_CORPUS_PATH = \"bm25_corpus.json\"\n\nprint(\"✅ Configuration loaded\")","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:35:53.842236Z","iopub.execute_input":"2026-04-02T11:35:53.842420Z","iopub.status.idle":"2026-04-02T11:35:53.849868Z","shell.execute_reply.started":"2026-04-02T11:35:53.842401Z","shell.execute_reply":"2026-04-02T11:35:53.849041Z"}},"outputs":[{"name":"stdout","text":"✅ Configuration loaded\n","output_type":"stream"}],"execution_count":6},{"cell_type":"markdown","source":"## 2. 
Extract Text from All PDFs","metadata":{}},{"cell_type":"code","source":"import pdfplumber\nfrom pathlib import Path\nfrom tqdm import tqdm\n\ndef extract_text_from_pdf(pdf_path: str) -> str:\n \"\"\"Extract all text from a PDF using pdfplumber.\"\"\"\n text_pages = []\n try:\n with pdfplumber.open(pdf_path) as pdf:\n for page in pdf.pages:\n page_text = page.extract_text()\n if page_text:\n text_pages.append(page_text)\n except Exception as e:\n print(f\" ⚠️ Failed to extract {pdf_path}: {e}\")\n return \"\\n\".join(text_pages)\n\n\ndef collect_pdfs(leaf_folders: list[str]) -> list[dict]:\n \"\"\"Walk leaf folders and return list of {path, source_folder, filename}.\"\"\"\n records = []\n for folder in leaf_folders:\n folder_path = Path(folder)\n if not folder_path.exists():\n print(f\" ⚠️ Folder not found: {folder}\")\n continue\n pdfs = list(folder_path.glob(\"**/*.pdf\"))\n print(f\" πŸ“‚ {folder} β†’ {len(pdfs)} PDF(s)\")\n for pdf in pdfs:\n records.append({\n \"path\": str(pdf),\n \"source_folder\": folder,\n \"filename\": pdf.name,\n })\n return records\n\n\nprint(\"πŸ” Collecting PDFs from leaf folders...\")\npdf_records = collect_pdfs(LEAF_FOLDERS)\nprint(f\"\\nπŸ“„ Total PDFs found: {len(pdf_records)}\")\n\nprint(\"\\nπŸ“– Extracting text...\")\ndocuments = [] # list of {text, metadata}\nfor rec in tqdm(pdf_records, desc=\"Extracting\"):\n text = extract_text_from_pdf(rec[\"path\"])\n if text.strip():\n documents.append({\n \"text\": text,\n \"metadata\": {\n \"source\": rec[\"path\"],\n \"filename\": rec[\"filename\"],\n \"folder\": rec[\"source_folder\"],\n }\n })\n else:\n print(f\" ⚠️ Empty text for: {rec['filename']}\")\n\nprint(f\"\\nβœ… Extracted text from {len(documents)} 
document(s)\")","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:35:53.850701Z","iopub.execute_input":"2026-04-02T11:35:53.850958Z","iopub.status.idle":"2026-04-02T11:38:06.619952Z","shell.execute_reply.started":"2026-04-02T11:35:53.850938Z","shell.execute_reply":"2026-04-02T11:38:06.618894Z"}},"outputs":[{"name":"stdout","text":"πŸ” Collecting PDFs from leaf folders...\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/CS Electives related Data/course_outlines β†’ 17 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/CS Electives related Data/program-announcements β†’ 4 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/IBA General Information β†’ 1 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/Exam Preparation Assistant/Computer Architecture and Assembly Language/Quizs β†’ 0 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/Exam Preparation Assistant/Data Structures β†’ 30 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/Exam Preparation Assistant/DataBases β†’ 31 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/Exam Preparation Assistant/Introduction to Programming β†’ 56 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/Exam Preparation Assistant/Object Oriented Programming β†’ 38 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/Exam Preparation Assistant/Operating system β†’ 11 PDF(s)\n πŸ“‚ /kaggle/input/datasets/uzairkhiiba/ibahive/NLP RAG project data/TAs related Data β†’ 7 PDF(s)\n\nπŸ“„ Total PDFs found: 195\n\nπŸ“– Extracting text...\n","output_type":"stream"},{"name":"stderr","text":"Extracting: 74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 144/195 [01:25<00:24, 2.04it/s]Could not get FontBBox from font descriptor because None cannot be parsed as 4 floats\nCould not get FontBBox from font 
descriptor because None cannot be parsed as 4 floats\nCould not get 
FontBBox from font descriptor because None cannot be parsed as 4 floats\nCould not get FontBBox from font descriptor because None cannot be parsed as 4 floats\nCould not get FontBBox from font descriptor because None cannot be parsed as 4 floats\nExtracting: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 195/195 [02:12<00:00, 1.47it/s]","output_type":"stream"},{"name":"stdout","text":"\nβœ… Extracted text from 195 document(s)\n","output_type":"stream"},{"name":"stderr","text":"\n","output_type":"stream"}],"execution_count":7},{"cell_type":"markdown","source":"## 3. Chunking β€” Fixed vs. Recursive","metadata":{}},{"cell_type":"code","source":"from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter\n# ── Fixed-size chunker ────────────────────────────────────────────────────────\nfixed_splitter = CharacterTextSplitter(\n separator=\"\\n\",\n chunk_size=FIXED_CHUNK_SIZE,\n chunk_overlap=FIXED_OVERLAP,\n length_function=len,\n)\n\n# ── Recursive chunker ─────────────────────────────────────────────────────────\nrecursive_splitter = RecursiveCharacterTextSplitter(\n separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n chunk_size=REC_CHUNK_SIZE,\n chunk_overlap=REC_OVERLAP,\n length_function=len,\n)\n\n\ndef make_chunks(documents: list[dict], splitter, strategy_name: str) -> list[dict]:\n \"\"\"Apply a splitter to all documents and return flat list of chunks.\"\"\"\n all_chunks = []\n for doc in documents:\n raw_chunks = splitter.split_text(doc[\"text\"])\n for i, chunk in enumerate(raw_chunks):\n all_chunks.append({\n \"text\": chunk,\n \"metadata\": {\n **doc[\"metadata\"],\n \"strategy\": strategy_name,\n \"chunk_index\": i,\n }\n })\n return all_chunks\n\n\nfixed_chunks = make_chunks(documents, fixed_splitter, \"fixed\")\nrecursive_chunks = make_chunks(documents, recursive_splitter, \"recursive\")\n\nprint(f\"βœ‚οΈ Fixed chunks: {len(fixed_chunks)}\")\nprint(f\"βœ‚οΈ Recursive chunks: 
{len(recursive_chunks)}\")","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:38:06.621138Z","iopub.execute_input":"2026-04-02T11:38:06.621463Z","iopub.status.idle":"2026-04-02T11:38:33.949745Z","shell.execute_reply.started":"2026-04-02T11:38:06.621441Z","shell.execute_reply":"2026-04-02T11:38:33.948995Z"}},"outputs":[{"name":"stderr","text":"Created a chunk of size 531, which is longer than the specified 512\n","output_type":"stream"},{"name":"stdout","text":"βœ‚οΈ Fixed chunks: 4839\nβœ‚οΈ Recursive chunks: 5443\n","output_type":"stream"}],"execution_count":8},{"cell_type":"markdown","source":"## 4. Embed with all-MiniLM-L6-v2","metadata":{}},{"cell_type":"code","source":"from sentence_transformers import SentenceTransformer\nimport numpy as np\n\nprint(f\"βš™οΈ Loading embedding model: {EMBED_MODEL}\")\nembedder = SentenceTransformer(EMBED_MODEL)\n\ndef embed_chunks(chunks: list[dict], batch_size: int = 64) -> list[dict]:\n \"\"\"Add 'embedding' key to each chunk dict.\"\"\"\n texts = [c[\"text\"] for c in chunks]\n embeddings = embedder.encode(\n texts,\n batch_size=batch_size,\n show_progress_bar=True,\n normalize_embeddings=True, # cosine similarity via dot product\n )\n for chunk, emb in zip(chunks, embeddings):\n chunk[\"embedding\"] = emb.tolist()\n return chunks\n\nprint(\"\\nπŸ”’ Embedding fixed chunks...\")\nfixed_chunks = embed_chunks(fixed_chunks)\n\nprint(\"\\nπŸ”’ Embedding recursive chunks...\")\nrecursive_chunks = embed_chunks(recursive_chunks)\n\nprint(f\"\\nβœ… Embeddings ready (dim={EMBED_DIM})\")","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:38:33.950682Z","iopub.execute_input":"2026-04-02T11:38:33.951392Z","iopub.status.idle":"2026-04-02T11:46:36.566272Z","shell.execute_reply.started":"2026-04-02T11:38:33.951367Z","shell.execute_reply":"2026-04-02T11:46:36.564962Z"}},"outputs":[{"name":"stdout","text":"βš™οΈ Loading embedding model: 
all-MiniLM-L6-v2\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"modules.json: 0%| | 0.00/349 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"30262f0fa4464ead8ab69ba74f6d80c0"}},"metadata":{}},{"name":"stderr","text":"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"config_sentence_transformers.json: 0%| | 0.00/116 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"3333ad9a70184d5a8d8cbce2680e406f"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"README.md: 0.00B [00:00, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"738fa67e8c57447e82dd171730577a0e"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"sentence_bert_config.json: 0%| | 0.00/53.0 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"4eeb2c7e3b1f4af0bce0999cd29914c8"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"config.json: 0%| | 0.00/612 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"5d05a2d16da84d9a8a12da53bed1905e"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"model.safetensors: 0%| | 0.00/90.9M [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"0ab19b8992ec4dd18a607b7a05c7fcaa"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"Loading weights: 0%| | 0/103 [00:00<?, ?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"b97985b7962d4ad4935aa7912bb954a4"}},"metadata":{}},{"name":"stderr","text":"BertModel LOAD REPORT 
from: sentence-transformers/all-MiniLM-L6-v2\nKey | Status | | \n------------------------+------------+--+-\nembeddings.position_ids | UNEXPECTED | | \n\nNotes:\n- UNEXPECTED\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"tokenizer_config.json: 0%| | 0.00/350 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"fdce14aca78c42908110ac9d13598ecb"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"vocab.txt: 0.00B [00:00, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"80af019c386a4cbb92ef4e7965c6acfc"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"tokenizer.json: 0.00B [00:00, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"fada6a17fdaf40489e893d14d5f5ac73"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"special_tokens_map.json: 0%| | 0.00/112 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"5811f9b4d5c0447c87af461dc27f2b9d"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"config.json: 0%| | 0.00/190 [00:00<?, ?B/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"7f9191f607fc4e8db2a21759c82eb9a2"}},"metadata":{}},{"name":"stdout","text":"\nπŸ”’ Embedding fixed chunks...\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"Batches: 0%| | 0/76 [00:00<?, ?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"b79e95e56e95491a9d4fc5c95b8eb782"}},"metadata":{}},{"name":"stdout","text":"\nπŸ”’ Embedding recursive chunks...\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"Batches: 0%| | 0/86 [00:00<?, 
?it/s]","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"66e2197ee0d443db96af1b711361f34e"}},"metadata":{}},{"name":"stdout","text":"\nβœ… Embeddings ready (dim=384)\n","output_type":"stream"}],"execution_count":9},{"cell_type":"markdown","source":"## 5. Upsert to Pinecone (Two Namespaces)","metadata":{}},{"cell_type":"code","source":"!pip uninstall -q -y pinecone-client\n!pip install -q pinecone","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:46:36.569006Z","iopub.execute_input":"2026-04-02T11:46:36.569275Z","iopub.status.idle":"2026-04-02T11:46:54.720812Z","shell.execute_reply.started":"2026-04-02T11:46:36.569253Z","shell.execute_reply":"2026-04-02T11:46:54.719951Z"}},"outputs":[],"execution_count":10},{"cell_type":"code","source":"from pinecone import Pinecone, ServerlessSpec\nimport uuid\n\n# ── Init Pinecone ─────────────────────────────────────────────────────────────\npc = Pinecone(api_key=PINECONE_API_KEY)\n\n# Create index if it does not exist\nexisting_indexes = [idx.name for idx in pc.list_indexes()]\nif INDEX_NAME not in existing_indexes:\n print(f\"πŸ“Œ Creating Pinecone index '{INDEX_NAME}' (dim={EMBED_DIM})...\")\n pc.create_index(\n name=INDEX_NAME,\n dimension=EMBED_DIM,\n metric=\"cosine\",\n spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\"),\n )\nelse:\n print(f\"ℹ️ Index '{INDEX_NAME}' already exists β€” reusing\")\n\nindex = pc.Index(INDEX_NAME)\nprint(index.describe_index_stats())","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:46:54.721766Z","iopub.execute_input":"2026-04-02T11:46:54.722004Z","iopub.status.idle":"2026-04-02T11:46:57.854436Z","shell.execute_reply.started":"2026-04-02T11:46:54.721978Z","shell.execute_reply":"2026-04-02T11:46:57.853581Z"}},"outputs":[{"name":"stdout","text":"ℹ️ Index 'iba-rag' already exists β€” reusing\n{'_response_info': {'raw_headers': {'connection': 'keep-alive',\n 'content-length': '229',\n 
'content-type': 'application/json',\n 'date': 'Thu, 02 Apr 2026 11:46:57 GMT',\n 'grpc-status': '0',\n 'server': 'envoy',\n 'x-envoy-upstream-service-time': '67',\n 'x-pinecone-request-latency-ms': '67',\n 'x-pinecone-response-duration-ms': '69'}},\n 'dimension': 384,\n 'index_fullness': 0.0,\n 'memoryFullness': 0.0,\n 'metric': 'cosine',\n 'namespaces': {'fixed-chunks': {'vector_count': 5069},\n 'recursive-chunks': {'vector_count': 5703}},\n 'storageFullness': 0.0,\n 'total_vector_count': 10772,\n 'vector_type': 'dense'}\n","output_type":"stream"}],"execution_count":11},{"cell_type":"code","source":"def upsert_to_pinecone(index, chunks: list[dict], namespace: str, batch_size: int = 100):\n \"\"\"Upsert a list of chunks (with embeddings) into a Pinecone namespace.\"\"\"\n vectors = []\n for chunk in chunks:\n vectors.append({\n \"id\": str(uuid.uuid4()),\n \"values\": chunk[\"embedding\"],\n \"metadata\": {\n \"text\": chunk[\"text\"][:1000], # Pinecone metadata limit\n \"source\": chunk[\"metadata\"][\"source\"],\n \"filename\": chunk[\"metadata\"][\"filename\"],\n \"folder\": chunk[\"metadata\"][\"folder\"],\n \"strategy\": chunk[\"metadata\"][\"strategy\"],\n \"chunk_index\": chunk[\"metadata\"][\"chunk_index\"],\n }\n })\n\n total_upserted = 0\n for i in tqdm(range(0, len(vectors), batch_size), desc=f\"Upserting [{namespace}]\"):\n batch = vectors[i : i + batch_size]\n index.upsert(vectors=batch, namespace=namespace)\n total_upserted += len(batch)\n\n print(f\" βœ… Upserted {total_upserted} vectors β†’ namespace='{namespace}'\")\n\n\nprint(\"πŸš€ Upserting fixed-chunk namespace...\")\nupsert_to_pinecone(index, fixed_chunks, NS_FIXED)\n\nprint(\"\\nπŸš€ Upserting recursive-chunk namespace...\")\nupsert_to_pinecone(index, recursive_chunks, NS_RECURSIVE)\n\nprint(\"\\nπŸ“Š Final index 
stats:\")\nprint(index.describe_index_stats())","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:46:57.856019Z","iopub.execute_input":"2026-04-02T11:46:57.856435Z","iopub.status.idle":"2026-04-02T11:47:59.208367Z","shell.execute_reply.started":"2026-04-02T11:46:57.856401Z","shell.execute_reply":"2026-04-02T11:47:59.207348Z"}},"outputs":[{"name":"stdout","text":"πŸš€ Upserting fixed-chunk namespace...\n","output_type":"stream"},{"name":"stderr","text":"Upserting [fixed-chunks]: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 49/49 [00:30<00:00, 1.62it/s]\n","output_type":"stream"},{"name":"stdout","text":" βœ… Upserted 4839 vectors β†’ namespace='fixed-chunks'\n\nπŸš€ Upserting recursive-chunk namespace...\n","output_type":"stream"},{"name":"stderr","text":"Upserting [recursive-chunks]: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 55/55 [00:30<00:00, 1.78it/s]","output_type":"stream"},{"name":"stdout","text":" βœ… Upserted 5443 vectors β†’ namespace='recursive-chunks'\n\nπŸ“Š Final index stats:\n{'_response_info': {'raw_headers': {'connection': 'keep-alive',\n 'content-length': '230',\n 'content-type': 'application/json',\n 'date': 'Thu, 02 Apr 2026 11:47:59 GMT',\n 'grpc-status': '0',\n 'server': 'envoy',\n 'x-envoy-upstream-service-time': '3',\n 'x-pinecone-request-latency-ms': '3',\n 'x-pinecone-response-duration-ms': '5'}},\n 'dimension': 384,\n 'index_fullness': 0.0,\n 'memoryFullness': 0.0,\n 'metric': 'cosine',\n 'namespaces': {'fixed-chunks': {'vector_count': 9908},\n 'recursive-chunks': {'vector_count': 11146}},\n 'storageFullness': 0.0,\n 'total_vector_count': 21054,\n 'vector_type': 'dense'}\n","output_type":"stream"},{"name":"stderr","text":"\n","output_type":"stream"}],"execution_count":12},{"cell_type":"markdown","source":"## 6. 
Save BM25 Corpus (`bm25_corpus.json`)","metadata":{}},{"cell_type":"code","source":"import json\n\n# BM25 corpus = one entry per recursive chunk (no deduplication;\n# these IDs are freshly generated and do not match the Pinecone vector IDs)\nbm25_corpus = [\n    {\n        \"id\": str(uuid.uuid4()),\n        \"text\": chunk[\"text\"],\n        \"metadata\": {\n            \"source\": chunk[\"metadata\"][\"source\"],\n            \"filename\": chunk[\"metadata\"][\"filename\"],\n            \"folder\": chunk[\"metadata\"][\"folder\"],\n            \"chunk_index\": chunk[\"metadata\"][\"chunk_index\"],\n        }\n    }\n    for chunk in recursive_chunks\n]\n\nwith open(BM25_CORPUS_PATH, \"w\", encoding=\"utf-8\") as f:\n    json.dump(bm25_corpus, f, ensure_ascii=False, indent=2)\n\nprint(f\"💾 BM25 corpus saved → {BM25_CORPUS_PATH}\")\nprint(f\" Total entries : {len(bm25_corpus)}\")\nprint(f\" File size : {os.path.getsize(BM25_CORPUS_PATH) / 1024:.1f} KB\")","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:47:59.209609Z","iopub.execute_input":"2026-04-02T11:47:59.209905Z","iopub.status.idle":"2026-04-02T11:47:59.305109Z","shell.execute_reply.started":"2026-04-02T11:47:59.209876Z","shell.execute_reply":"2026-04-02T11:47:59.304057Z"}},"outputs":[{"name":"stdout","text":"💾 BM25 corpus saved → bm25_corpus.json\n Total entries : 5443\n File size : 4945.3 KB\n","output_type":"stream"}],"execution_count":13},{"cell_type":"markdown","source":"## 7. 
Sanity Check — BM25 + Dense Query","metadata":{}},{"cell_type":"code","source":"from rank_bm25 import BM25Okapi\n\n# Load saved corpus\nwith open(BM25_CORPUS_PATH, \"r\", encoding=\"utf-8\") as f:\n    corpus_data = json.load(f)\n\n# Simple lowercase + whitespace tokenization; the query must be tokenized the same way\ntokenized_corpus = [entry[\"text\"].lower().split() for entry in corpus_data]\nbm25 = BM25Okapi(tokenized_corpus)\n\nQUERY = \"What are the CS elective courses available?\" # ← change as needed\n\n# ── BM25 top-3 ────────────────────────────────────────────────────────────────\nprint(\"🔍 BM25 Top-3 results:\")\nscores = bm25.get_scores(QUERY.lower().split())\ntop_k = scores.argsort()[-3:][::-1]\nfor rank, idx in enumerate(top_k, 1):\n    entry = corpus_data[idx]\n    print(f\" [{rank}] score={scores[idx]:.3f} | {entry['metadata']['filename']}\")\n    print(f\" {entry['text'][:200]}...\\n\")\n\n# ── Dense (Pinecone) top-3 ────────────────────────────────────────────────────\nprint(\"🔍 Dense (Pinecone / recursive ns) Top-3 results:\")\nq_emb = embedder.encode([QUERY], normalize_embeddings=True).tolist()\nresults = index.query(vector=q_emb[0], top_k=3, namespace=NS_RECURSIVE, include_metadata=True)\nfor rank, match in enumerate(results[\"matches\"], 1):\n    print(f\" [{rank}] score={match['score']:.4f} | {match['metadata']['filename']}\")\n    print(f\" {match['metadata']['text'][:200]}...\\n\")","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-04-02T11:47:59.306183Z","iopub.execute_input":"2026-04-02T11:47:59.306537Z","iopub.status.idle":"2026-04-02T11:48:00.158709Z","shell.execute_reply.started":"2026-04-02T11:47:59.306495Z","shell.execute_reply":"2026-04-02T11:48:00.157989Z"}},"outputs":[{"name":"stdout","text":"🔍 BM25 Top-3 results:\n [1] score=23.661 | 2022-2023.md.pdf\n - MIS457 — IS Security\nSemester-wise CS Elective Distribution\n- Semester 5: 1 CS elective\n- Semester 6: 2 CS electives\n- Semester 7: 2 CS electives\n- Semester 8: 2 CS electives\nAdditional Note\nSenior ...\n\n [2] score=23.661 | 2023-2024.md.pdf\n - 
MIS457 — IS Security\nSemester-wise CS Elective Distribution\n- Semester 5: 1 CS elective\n- Semester 6: 2 CS electives\n- Semester 7: 2 CS electives\n- Semester 8: 2 CS electives\nAdditional Note\nSenior ...\n\n [3] score=20.343 | soln_from_text_midterm.pdf\n (j) A student cannot add more than two courses at a time (i.e., in a single\nupdate).\n(k) The number of CS majors must be more than the number of Math majors.\n(l) The number of distinct courses in whic...\n\n🔍 Dense (Pinecone / recursive ns) Top-3 results:\n [1] score=0.7150 | 2022-2023.md.pdf\n CS Elective Requirement\nStudents must select 7 CS elective courses, totaling 21 credit hours.\nEach CS elective is typically worth 3 credit hours.\nCS Electives Offered\n- CSE308 — Web Based Application ...\n\n [2] score=0.7150 | 2023-2024.md.pdf\n CS Elective Requirement\nStudents must select 7 CS elective courses, totaling 21 credit hours.\nEach CS elective is typically worth 3 credit hours.\nCS Electives Offered\n- CSE308 — Web Based Application ...\n\n [3] score=0.7150 | 2022-2023.md.pdf\n CS Elective Requirement\nStudents must select 7 CS elective courses, totaling 21 credit hours.\nEach CS elective is typically worth 3 credit hours.\nCS Electives Offered\n- CSE308 — Web Based Application ...\n\n","output_type":"stream"}],"execution_count":14},{"cell_type":"markdown","source":"## ✅ Summary\n\n| Step | Status |\n|------|--------|\n| PDF extraction | ✅ |\n| Fixed-size chunking | ✅ |\n| Recursive chunking | ✅ |\n| Embedding (all-MiniLM-L6-v2) | ✅ |\n| Pinecone upsert — `fixed-chunks` namespace | ✅ |\n| Pinecone upsert — `recursive-chunks` namespace | ✅ |\n| `bm25_corpus.json` saved | ✅ |\n\n**Next notebook →** `2_retrieval_and_rerank.ipynb`","metadata":{}}]}