{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "i2nHJ6o7TQW3" }, "source": [ "# πŸš€ Multi-Document RAG System with Advanced Retrieval\n", "\n", "## Project Overview\n", "This notebook implements a **production-ready Retrieval-Augmented Generation (RAG)** system capable of:\n", "- Ingesting **multiple PDF documents** into a unified knowledge base\n", "- Answering questions using **hybrid retrieval** (vector + keyword search)\n", "- Providing **cited, verifiable answers** with source attribution\n", "- **Comparing information** across multiple documents\n", "\n", "## Architecture Summary\n", "```\n", "User Query β†’ Query Classification β†’ Query Expansion (Multi-Query)\n", " ↓\n", "HyDE Generation β†’ Hybrid Retrieval (Vector + BM25)\n", " ↓\n", "RRF Fusion β†’ Cross-Encoder Re-ranking β†’ LLM Generation\n", " ↓\n", "Answer Verification β†’ Final Response with Citations\n", "```\n", "\n", "## Key Technologies\n", "| Component | Technology |\n", "|-----------|------------|\n", "| LLM | Llama 3.3 70B (via Groq) |\n", "| Embeddings | BAAI/bge-large-en-v1.5 |\n", "| Re-ranker | BAAI/bge-reranker-v2-m3 |\n", "| Vector DB | ChromaDB |\n", "| Keyword Search | BM25 |\n", "| UI | Gradio |\n", "\n", "## Requirements\n", "- **Groq API Key** (free at console.groq.com)\n", "- **Python 3.10+**\n", "- **GPU recommended** but not required" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AiaiOaSb-m1U", "outputId": "4b784a1e-a4a0-43d8-b4ff-eb7b1255438b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ”₯ Cleaning up the environment\n", "\u001b[33mWARNING: Skipping langchain-community as it is not installed.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Skipping langchain-groq as it is not installed.\u001b[0m\u001b[33m\n", "\u001b[0mπŸ“¦ Installing the Dependencies\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.0/61.0 
kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "[pip dependency-resolver warnings about Colab's pre-installed packages trimmed for readability]\n", "CRITICAL: Go to 'Runtime' > 'Restart session' NOW.\n", "After restarting, run Cell 2.\n", "[gradio 'Requirement already satisfied' log trimmed]\n" ] } ], "source": [ "# ==========================================\n", "# CELL 1: DEPENDENCY Installation\n", "# ==========================================\n", "import os\n", "\n", "print(\"πŸ”₯ Cleaning 
up the environment\")\n", "\n", "# 1. Uninstall EVERYTHING to ensure no \"ghost\" versions remain\n", "!pip uninstall -y -q numpy pandas scipy scikit-learn langchain langchain-community langchain-core langchain-groq\n", "\n", "\n", "print(\"πŸ“¦ Installing the Dependencies\")\n", "\n", "# CORE MATH LIBRARIES\n", "!pip install -q numpy==1.26.4\n", "!pip install -q pandas==2.2.2\n", "!pip install -q scipy==1.13.1\n", "\n", "# LANGCHAIN 0.2 ECOSYSTEM\n", "# We strictly pin these to the 0.2 series to avoid the breaking 0.3 update\n", "!pip install -q langchain-core==0.2.40\n", "!pip install -q langchain-community==0.2.16\n", "!pip install -q langchain==0.2.16\n", "!pip install -q langchain-groq==0.1.9\n", "!pip install -q langchain-text-splitters==0.2.4\n", "\n", "# VECTOR DATABASE & EMBEDDINGS\n", "!pip install -q chromadb==0.5.5\n", "!pip install -q sentence-transformers==3.0.1\n", "!pip install -q pypdf==4.3.1\n", "!pip install -q rank-bm25==0.2.2\n", "\n", "# UI\n", "!pip install -q gradio\n", "\n", "\n", "print(\"CRITICAL: Go to 'Runtime' > 'Restart session' NOW.\")\n", "print(\"After restarting, run Cell 2.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "e7y44MyPVC3_" }, "source": [ "This cell imports all required libraries and sets up the compute device (GPU if available, else CPU).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iUn1NfUQBHPK", "outputId": "bf945e96-5495-4d71-e3d2-b6bae439b0dc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "System ready. 
Running on: CUDA\n" ] } ], "source": [ "import os\n", "import sys\n", "import json\n", "import torch\n", "import numpy as np\n", "from typing import List, Dict, Tuple, Optional\n", "from collections import defaultdict\n", "from dataclasses import dataclass\n", "import hashlib\n", "import gradio as gr\n", "from datetime import datetime\n", "\n", "# Core Imports\n", "from langchain_community.document_loaders import PyPDFLoader\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "from langchain_community.vectorstores import Chroma\n", "from langchain_community.embeddings import HuggingFaceEmbeddings\n", "from langchain_community.retrievers import BM25Retriever\n", "from langchain_groq import ChatGroq\n", "from langchain.schema import Document\n", "\n", "# Advanced Models\n", "from sentence_transformers import SentenceTransformer, CrossEncoder\n", "\n", "# Setup Device\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(f\"System ready. Running on: {device.upper()}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Gojw-RfdVOa5" }, "source": [ "## Core Data Structures\n", "\n", "### QueryProfile Dataclass\n", "\n", "**Purpose**: Encapsulates the result of query classification to guide retrieval strategy.\n", "\n", "| Field | Type | Description | Example Values |\n", "|-------|------|-------------|----------------|\n", "| `query_type` | str | Category of question | `\"factoid\"`, `\"summary\"`, `\"comparison\"`, `\"extraction\"`, `\"reasoning\"` |\n", "| `intent` | str | Same as query_type (for extensibility) | Same as above |\n", "| `needs_multi_docs` | bool | Does query span multiple documents? | `True` for comparison queries |\n", "| `requires_comparison` | bool | Is this a compare/contrast question? 
| `True` if \"compare\", \"difference\" in query |\n", "| `answer_style` | str | How to format the answer | `\"direct\"`, `\"bullets\"`, `\"steps\"` |\n", "| `k` | int | Number of chunks to retrieve | 5-12 (auto-tuned based on query type) |\n", "\n", "### Query Type β†’ Retrieval Strategy Mapping:\n", "```\n", "factoid β†’ k=6, style=direct (simple fact lookup)\n", "summary β†’ k=10, style=bullets (overview questions)\n", "comparison β†’ k=12, style=bullets (cross-document comparison)\n", "extraction β†’ k=8, style=direct (extract specific info)\n", "reasoning β†’ k=10, style=steps (explain how/why)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "zs63mc1EBls0" }, "outputs": [], "source": [ "@dataclass\n", "class QueryProfile:\n", " query_type: str\n", " intent: str\n", " needs_multi_docs: bool\n", " requires_comparison: bool\n", " answer_style: str\n", " k: int\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_voziDjOVl6M" }, "source": [ "### QueryCache Class\n", "\n", "**Purpose**: LRU-style cache to avoid redundant LLM calls for repeated queries.\n", "\n", "#### How It Works:\n", "1. **Key Generation**: MD5 hash of query string\n", "2. **Storage**: Dictionary mapping hash β†’ response\n", "3. 
**Eviction**: FIFO (First-In-First-Out) when `max_size` exceeded\n", "\n", "#### Methods:\n", "| Method | Input | Output | Description |\n", "|--------|-------|--------|-------------|\n", "| `get(query)` | Query string | Response or `None` | Check if query is cached |\n", "| `set(query, response)` | Query + Response | None | Store result, evict oldest if full |\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "oWpJgmpwN1tS" }, "outputs": [], "source": [ "class QueryCache:\n", " \"\"\"Simple cache for repeated queries\"\"\"\n", " def __init__(self, max_size=100):\n", " self.cache = {}\n", " self.max_size = max_size\n", "\n", " def get(self, query: str) -> Optional[str]:\n", " key = hashlib.md5(query.encode()).hexdigest()\n", " return self.cache.get(key)\n", "\n", " def set(self, query: str, response: str):\n", " key = hashlib.md5(query.encode()).hexdigest()\n", " if len(self.cache) >= self.max_size:\n", " self.cache.pop(next(iter(self.cache)))\n", " self.cache[key] = response\n" ] }, { "cell_type": "markdown", "metadata": { "id": "rwBQl2SjVuW3" }, "source": [ "### SemanticChunker Class\n", "\n", "**Purpose**: Split documents into semantically coherent chunks (vs. arbitrary character-based splits).\n", "\n", "#### Why Semantic Chunking?\n", "| Traditional Chunking | Semantic Chunking |\n", "|---------------------|-------------------|\n", "| Splits at fixed character count | Splits at topic boundaries |\n", "| May cut mid-sentence/concept | Preserves complete ideas |\n", "| Lower retrieval relevance | Higher retrieval relevance |\n", "\n", "#### Algorithm:\n", "```\n", "1. Split text into sentences (by \". \")\n", "2. Encode each sentence with SentenceTransformer\n", "3. For each consecutive sentence pair:\n", " - Compute cosine similarity\n", " - If similarity > threshold AND size < max:\n", " β†’ Add to current chunk\n", " - Else:\n", " β†’ Save chunk, start new one\n", "4. 
Return list of semantic chunks\n", "```\n", "\n", "#### Parameters:\n", "| Parameter | Default | Description |\n", "|-----------|---------|-------------|\n", "| `model_name` | `all-MiniLM-L6-v2` | Sentence embedding model |\n", "| `max_chunk_size` | 1000 | Maximum characters per chunk |\n", "| `similarity_threshold` | 0.5 | Cosine similarity threshold for grouping |" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "ZqF22Kb5Nv7e" }, "outputs": [], "source": [ "class SemanticChunker:\n", " \"\"\"Advanced semantic chunking using sentence embeddings\"\"\"\n", " def __init__(self, model_name=\"sentence-transformers/all-MiniLM-L6-v2\"):\n", " self.model = SentenceTransformer(model_name, device=device)\n", "\n", " def chunk_document(self, text: str, max_chunk_size=1000, similarity_threshold=0.5):\n", " \"\"\"Split text into semantically coherent chunks\"\"\"\n", " sentences = text.replace('\\n', ' ').split('. ')\n", " sentences = [s.strip() + '.' for s in sentences if s.strip()]\n", "\n", " if not sentences:\n", " return [text]\n", "\n", " embeddings = self.model.encode(sentences)\n", " chunks = []\n", " current_chunk = [sentences[0]]\n", " current_size = len(sentences[0])\n", "\n", " for i in range(1, len(sentences)):\n", " similarity = np.dot(embeddings[i-1], embeddings[i]) / (\n", " np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])\n", " )\n", " sentence_len = len(sentences[i])\n", "\n", " if similarity > similarity_threshold and current_size + sentence_len < max_chunk_size:\n", " current_chunk.append(sentences[i])\n", " current_size += sentence_len\n", " else:\n", " chunks.append(' '.join(current_chunk))\n", " current_chunk = [sentences[i]]\n", " current_size = sentence_len\n", "\n", " if current_chunk:\n", " chunks.append(' '.join(current_chunk))\n", "\n", " return chunks\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yTCw6h2YWCZq" }, "source": [ "### ReciprocalRankFusion (RRF) Class\n", "\n", "**Purpose**: Combine 
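the ranked lists produced by several retrievers into one consensus list.\n", "\n", "A toy illustration of the fusion arithmetic (plain Python; \`k=60\` and the \`1/(k + rank + 1)\` term mirror the class in the next cell, and the doc IDs are made up):\n", "\n", "```python\n", "def rrf(rankings, k=60):\n", "    # rankings: several ranked lists of doc IDs, best first\n", "    scores = {}\n", "    for docs in rankings:\n", "        for rank, doc_id in enumerate(docs):\n", "            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)\n", "    return sorted(scores, key=scores.get, reverse=True)\n", "\n", "fused = rrf([['A', 'B', 'C'], ['B', 'A', 'D']])\n", "# 'A' and 'B' place highly in both lists, so both outrank 'C' and 'D'\n", "```\n", "\n", "The \`fuse\` method below generalizes this idea: it combines\n", "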
multiple ranked retrieval lists into a single optimal ranking.\n", "\n", "#### The Problem RRF Solves:\n", "When using multiple retrievers (vector search, keyword search, etc.), each returns a ranked list. How do we combine them?\n", "\n", "#### RRF Formula:\n", "```\n", "score(doc) = Ξ£ 1 / (k + rank_i + 1)\n", "```\n", "Where:\n", "- `k` = 60 (smoothing constant, standard value)\n", "- `rank_i` = position of document in retrieval list i (0-indexed)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "nxiRtNGINqQr" }, "outputs": [], "source": [ "class ReciprocalRankFusion:\n", " \"\"\"RRF for combining multiple retrieval results\"\"\"\n", " @staticmethod\n", " def fuse(retrieval_results: List[List[Document]], k=60) -> List[Document]:\n", " doc_scores = defaultdict(float)\n", " doc_map = {}\n", "\n", " for docs in retrieval_results:\n", " for rank, doc in enumerate(docs):\n", " doc_id = doc.metadata.get('chunk_id') or f\"{doc.metadata.get('pdf_id', 'unknown')}::{hashlib.md5(doc.page_content.encode()).hexdigest()}\"\n", " doc_scores[doc_id] += 1 / (k + rank + 1)\n", " doc_map[doc_id] = doc\n", "\n", " sorted_docs = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)\n", " return [doc_map[doc_id] for doc_id, _ in sorted_docs]\n" ] }, { "cell_type": "markdown", "metadata": { "id": "VtUYO8vPWMzp" }, "source": [ "## EnhancedRAG - Complete RAG Engine\n", "\n", "This is the **core class** that orchestrates the entire RAG pipeline. 
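\n", "\n", "A minimal usage sketch (treat as pseudocode: the API key and file name below are placeholders, and a real Groq key plus a local PDF are required):\n", "\n", "```python\n", "rag = EnhancedRAGv3(api_key='YOUR_GROQ_API_KEY')  # placeholder key\n", "rag.load_models()                                 # one-time model loading\n", "rag.ingest_pdf('paper.pdf')                       # index a document\n", "answer, citations, meta = rag.chat('What problem does this paper address?')\n", "```\n", "\n", "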
All document ingestion, retrieval, and generation flows through this class.\n", "\n", "---\n", "\n", "### Class Architecture\n", "\n", "```\n", "EnhancedRAGv3\n", "β”œβ”€β”€ Storage Layer\n", "β”‚ β”œβ”€β”€ vector_db (ChromaDB) # Semantic search index\n", "β”‚ β”œβ”€β”€ bm25_retriever # Keyword search index\n", "β”‚ β”œβ”€β”€ documents (List) # All document chunks\n", "β”‚ └── pdf_metadata (Dict) # PDF tracking {name: {path, pages, chunks, pdf_id}}\n", "β”‚\n", "β”œβ”€β”€ Model Layer (Lazy-loaded for memory efficiency)\n", "β”‚ β”œβ”€β”€ embedding_model # BAAI/bge-large-en-v1.5 (~1.2GB)\n", "β”‚ β”œβ”€β”€ cross_encoder # BAAI/bge-reranker-v2-m3 (~560MB)\n", "β”‚ β”œβ”€β”€ semantic_chunker # all-MiniLM-L6-v2 (~90MB)\n", "β”‚ └── query_model # all-MiniLM-L6-v2 (~90MB)\n", "β”‚\n", "β”œβ”€β”€ LLM Layer\n", "β”‚ └── llm (ChatGroq) # Llama 3.3 70B via Groq API\n", "β”‚\n", "└── Utility Layer\n", " β”œβ”€β”€ cache (QueryCache) # Response caching (max 100 queries)\n", " └── api_key # Groq API key\n", "```\n", "\n", "---\n", "\n", "### Method Reference\n", "\n", "| Method | Purpose | Key Details |\n", "|--------|---------|-------------|\n", "| `__init__(api_key)` | Initialize system | Sets up LLM, all other models lazy-loaded |\n", "| `load_models()` | Load ML models | BGE embeddings β†’ CrossEncoder β†’ Chunker β†’ Query model |\n", "| `ingest_pdf(path)` | Process PDF | Extract β†’ Chunk β†’ Index in ChromaDB + BM25 |\n", "| `chat(query)` | Answer questions | Full pipeline: classify β†’ expand β†’ retrieve β†’ rerank β†’ generate |\n", "| `summarize_document()` | Summarize all docs | Map-reduce: batch summaries β†’ final synthesis |\n", "\n", "---\n", "\n", "### 1. Initialization & Model Loading\n", "\n", "**`__init__(api_key)`** - Sets up the system with Groq API key. 
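\n", "\n", "The deferred-loading pattern can be sketched in isolation (simplified, hypothetical \`LazyModels\` stand-in, not part of this notebook's pipeline):\n", "\n", "```python\n", "class LazyModels:\n", "    def __init__(self):\n", "        self.embedding_model = None  # heavy encoder is NOT loaded here\n", "        self.is_initialized = False\n", "\n", "    def load_models(self):\n", "        if self.is_initialized:\n", "            return 'Models already loaded.'\n", "        self.embedding_model = object()  # stand-in for the ~1.2 GB encoder\n", "        self.is_initialized = True\n", "        return 'All models loaded successfully.'\n", "```\n", "\n", "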
Models are NOT loaded yet (lazy loading for faster startup).\n", "\n", "**`load_models()`** - Loads all ML models with progress tracking:\n", "\n", "| Progress | Model | Size | Purpose |\n", "|----------|-------|------|---------|\n", "| 10% β†’ 40% | BAAI/bge-large-en-v1.5 | ~1.2GB | Document & query embeddings (1024-dim, normalized) |\n", "| 40% β†’ 60% | BAAI/bge-reranker-v2-m3 | ~560MB | Cross-encoder re-ranking |\n", "| 60% β†’ 80% | all-MiniLM-L6-v2 | ~90MB | Semantic chunking |\n", "| 80% β†’ 100% | all-MiniLM-L6-v2 | ~90MB | Query processing |\n", "\n", "---\n", "\n", "### 2. Document Ingestion Pipeline\n", "\n", "**`ingest_pdf(pdf_path, use_semantic_chunking=True)`**\n", "\n", "```\n", "PDF File\n", " ↓\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ 1. PyPDFLoader β”‚ Extract text from each page\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " ↓\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ 2. Duplicate Check β”‚ Skip if pdf_name already in pdf_metadata\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " ↓\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ 3. Chunking β”‚ SemanticChunker (default) or RecursiveTextSplitter\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " ↓\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ 4. Add Metadata β”‚ {page, source, pdf_name, pdf_id, chunk_id}\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " ↓\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ 5. 
Rebuild Indexes β”‚ ChromaDB (vector) + BM25 (keyword) with ALL docs\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", "```\n", "\n", "**Chunk Metadata Schema:**\n", "```python\n", "{\n", " \"page\": 0, # 0-indexed page number\n", " \"source\": \"/path/to/doc.pdf\", # Full file path\n", " \"pdf_name\": \"doc.pdf\", # Filename only\n", " \"pdf_id\": \"a1b2c3d4\", # 8-char MD5 hash (unique per PDF)\n", " \"chunk_id\": \"a1b2c3d4-42\" # Unique chunk identifier\n", "}\n", "```\n", "\n", "---\n", "\n", "### 3. Query Classification\n", "\n", "**`_classify_query(query) β†’ QueryProfile`**\n", "\n", "Determines optimal retrieval strategy using LLM + heuristic fallback:\n", "\n", "| Query Type | Trigger Keywords | k | Answer Style |\n", "|------------|------------------|---|--------------|\n", "| `factoid` | \"what is\", \"who is\", \"define\" | 6 | direct |\n", "| `summary` | \"summarize\", \"overview\", \"key points\" | 10 | bullets |\n", "| `comparison` | \"compare\", \"difference\", \"vs\", \"between\" | 12 | bullets |\n", "| `extraction` | (default) | 8 | direct |\n", "| `reasoning` | \"explain\", \"how does\", \"why\" | 10 | steps |\n", "\n", "**Returns:** `QueryProfile(query_type, intent, needs_multi_docs, requires_comparison, answer_style, k)`\n", "\n", "---\n", "\n", "### 4. 
Query Enhancement Techniques\n", "\n", "**`_generate_hyde_document(query) β†’ str`** - HyDE (Hypothetical Document Embeddings)\n", "\n", "```\n", "Query: \"What is attention?\"\n", " ↓ LLM generates\n", "HyDE Doc: \"The attention mechanism is a neural network component\n", " that allows models to focus on relevant parts...\"\n", " ↓\n", "Used for retrieval (matches real docs better than short query!)\n", "```\n", "\n", "**`_expand_query(query) β†’ List[str]`** - Multi-Query Expansion\n", "\n", "```\n", "Original: \"What are the benefits of transformers?\"\n", " ↓ LLM generates 3 variants\n", "[\n", " \"What are the benefits of transformers?\", # Original\n", " \"What advantages do transformer models offer?\", # Variant 1\n", " \"Why are transformers better than RNNs?\", # Variant 2\n", " \"What makes transformer architecture effective?\" # Variant 3\n", "]\n", " ↓\n", "All used for retrieval β†’ RRF fuses results\n", "```\n", "\n", "---\n", "\n", "### 5. Hybrid Retrieval Pipeline\n", "\n", "**`_retrieve_with_rrf(query, k, fetch_factor=2) β†’ List[Document]`**\n", "\n", "```\n", "Query\n", " β”‚\n", " β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", " ↓ ↓\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ Vector Search (MMR)β”‚ β”‚ BM25 Search β”‚\n", "β”‚ β”‚ β”‚ β”‚\n", "β”‚ β€’ Semantic match β”‚ β”‚ β€’ Exact keywords β”‚\n", "β”‚ β€’ lambda=0.6 β”‚ β”‚ β€’ Term frequency β”‚\n", "β”‚ (relevance+ β”‚ β”‚ β”‚\n", "β”‚ diversity) β”‚ β”‚ β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " β”‚ β”‚\n", " β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " ↓\n", " β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", " β”‚ RRF 
Fusion β”‚ score = Ξ£ 1/(60 + rank + 1)\n", " β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", " ↓\n", " Fused ranked list\n", "```\n", "\n", "**Why Hybrid?**\n", "- Vector: Understands synonyms, semantic similarity\n", "- BM25: Exact term matching, handles rare words\n", "- Combined: Best of both worlds\n", "\n", "---\n", "\n", "### 6. Re-ranking & PDF Diversity\n", "\n", "**`_rerank_documents(query, documents, top_k) β†’ List[(Document, score)]`**\n", "\n", "Uses **CrossEncoder** for neural re-ranking:\n", "- Bi-encoder (initial): Fast but less accurate (query/doc encoded separately)\n", "- Cross-encoder (re-rank): Slower but accurate (query+doc processed together)\n", "\n", "**Comparison Query Boost:** For comparison queries, documents containing keywords like \"compared to\", \"in contrast\", \"whereas\" get +10% score boost per keyword.\n", "\n", "---\n", "\n", "**`_ensure_pdf_diversity(query, documents, target_docs=2) β†’ List[Document]`**\n", "\n", "For multi-document queries, ensures chunks from ALL loaded PDFs:\n", "\n", "```\n", "Problem: Query about \"both papers\" returns only Paper A chunks\n", " ↓\n", "Solution: Detect missing PDFs β†’ filtered vector search β†’ add their chunks\n", " ↓\n", "Result: [chunk_A1, chunk_A2, chunk_A3, chunk_B1, chunk_B2]\n", "```\n", "\n", "---\n", "\n", "### 7. Main Chat Pipeline\n", "\n", "**`chat(query, use_hyde=True, use_multi_query=True) β†’ (answer, citations, metadata)`**\n", "\n", "```\n", "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", "β”‚ 1. CACHE CHECK β”‚ Return immediately if query cached β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 2. 
CLASSIFY QUERY β”‚ β†’ QueryProfile (type, k, style) β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 3. EXPAND QUERY β”‚ Generate 3 alternative phrasings β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 4. GENERATE HyDE β”‚ Create hypothetical answer document β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 5. RETRIEVE β”‚ For EACH query variant: β”‚\n", "β”‚ β”‚ β€’ Vector search (MMR) β”‚\n", "β”‚ β”‚ β€’ BM25 search β”‚\n", "β”‚ β”‚ β€’ RRF fusion β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 6. GLOBAL RRF β”‚ Fuse results from all query variants β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 7. PDF DIVERSITY β”‚ Ensure chunks from all loaded PDFs β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 8. 
RERANK β”‚ CrossEncoder neural scoring β†’ top k β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 9. BUILD CONTEXT β”‚ Format: \"[Source 1]: chunk content...\" β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 10. LLM GENERATE β”‚ Answer with inline [Source X] citationsβ”‚\n", "β”‚ β”‚ (Different prompts for comparison) β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 11. VERIFY (complex) β”‚ Self-check: direct? structured? If not β”‚\n", "β”‚ β”‚ β†’ regenerate improved answer β”‚\n", "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", "β”‚ 12. CACHE & RETURN β”‚ Store result, return (answer, cites, β”‚\n", "β”‚ β”‚ metadata) β”‚\n", "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", "```\n", "\n", "---\n", "\n", "### 8. 
Document Summarization\n", "\n", "**`summarize_document(max_chunks=None) β†’ (summary, metadata)`**\n", "\n", "Uses **Map-Reduce** pattern:\n", "\n", "```\n", "MAP PHASE:\n", " Chunks [1-10] β†’ LLM β†’ 3-5 bullet summary\n", " Chunks [11-20] β†’ LLM β†’ 3-5 bullet summary\n", " ...\n", " Chunks [n-m] β†’ LLM β†’ 3-5 bullet summary\n", "\n", "REDUCE PHASE:\n", " All batch summaries β†’ LLM β†’ Final structured summary:\n", " β€’ Overview (2-3 sentences)\n", " β€’ Main Topics (bullets)\n", " β€’ Important Details (3-5 points)\n", " β€’ Conclusion\n", "```\n", "\n", "---\n", "\n", "### Key Parameters Reference\n", "\n", "| Parameter | Location | Default | Description |\n", "|-----------|----------|---------|-------------|\n", "| `k` | QueryProfile | 5-12 | Chunks to retrieve (auto-tuned by query type) |\n", "| `fetch_factor` | _retrieve_with_rrf | 2 | Multiplier for initial retrieval pool |\n", "| `lambda_mult` | MMR search | 0.6 | Diversity vs relevance (0=diverse, 1=relevant) |\n", "| `similarity_threshold` | SemanticChunker | 0.5 | Cosine sim for chunk boundaries |\n", "| `max_chunk_size` | SemanticChunker | 1000 | Max characters per chunk |\n", "| `chunk_size` | TextSplitter | 800 | Fallback chunker size |\n", "| `chunk_overlap` | TextSplitter | 150 | Character overlap between chunks |\n", "| `max_size` | QueryCache | 100 | Maximum cached queries |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iXFJQbsgNVpz" }, "outputs": [], "source": [ "class EnhancedRAGv3:\n", " def __init__(self, api_key: str):\n", " self.vector_db = None\n", " self.bm25_retriever = None\n", " self.documents = []\n", " self.pdf_metadata = {} # Track multiple PDFs\n", " self.doc_headers = {} # Store extracted headers (title, authors, abstract) per PDF\n", " self.cache = QueryCache()\n", " self.api_key = api_key\n", " self.is_initialized = False\n", "\n", " # Initialize LLM\n", " self.llm = ChatGroq(\n", " temperature=0,\n", " 
model_name=\"llama-3.3-70b-versatile\",\n", " groq_api_key=api_key\n", " )\n", "\n", " # Models (loaded on demand)\n", " self.embedding_model = None\n", " self.cross_encoder = None\n", " self.semantic_chunker = None\n", " self.query_model = None\n", "\n", " def load_models(self, progress=gr.Progress()):\n", " \"\"\"Load all models with progress tracking\"\"\"\n", " if self.is_initialized:\n", " return \"Models already loaded.\"\n", "\n", " progress(0.1, desc=\"Loading BGE embeddings...\")\n", " self.embedding_model = HuggingFaceEmbeddings(\n", " model_name=\"BAAI/bge-large-en-v1.5\",\n", " model_kwargs={'device': device, 'trust_remote_code': True},\n", " encode_kwargs={'normalize_embeddings': True}\n", " )\n", "\n", " progress(0.4, desc=\"Loading Re-ranker...\")\n", " self.cross_encoder = CrossEncoder('BAAI/bge-reranker-v2-m3', device=device)\n", "\n", " progress(0.6, desc=\"Loading Semantic Chunker...\")\n", " self.semantic_chunker = SemanticChunker()\n", "\n", " progress(0.8, desc=\"Loading Query Model...\")\n", " self.query_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)\n", "\n", " progress(1.0, desc=\"Complete\")\n", " self.is_initialized = True\n", " return \"All models loaded successfully.\"\n", "\n", " def _extract_document_header(self, pages: List[Document], pdf_name: str, pdf_id: str) -> Dict:\n", " \"\"\"Extract title, authors, and abstract from first pages of PDF\"\"\"\n", " # Get text from first 2 pages (where metadata usually is)\n", " header_text = \"\"\n", " for i, page in enumerate(pages[:2]):\n", " header_text += page.page_content + \"\\n\\n\"\n", " \n", " # Use LLM to extract structured metadata\n", " extraction_prompt = f\"\"\"Extract the following information from this academic paper's first pages.\n", "Return ONLY a JSON object with these keys:\n", "- title: The paper's title (string)\n", "- authors: List of author names (array of strings)\n", "- abstract: The paper's abstract if present (string, or null if 
not found)\n", "- institutions: List of institutions/affiliations if present (array of strings, or empty array)\n", "\n", "Text from first pages:\n", "{header_text[:4000]}\n", "\n", "JSON:\"\"\"\n", "\n", " try:\n", " response = self.llm.invoke(extraction_prompt)\n", " # Parse JSON from response\n", " import re\n", " json_match = re.search(r'\\{[\\s\\S]*\\}', response.content)\n", " if json_match:\n", " metadata = json.loads(json_match.group())\n", " metadata['pdf_name'] = pdf_name\n", " metadata['pdf_id'] = pdf_id\n", " metadata['raw_header'] = header_text[:2000] # Store raw text too\n", " return metadata\n", " except Exception as e:\n", " print(f\"Header extraction error: {e}\")\n", " \n", " # Fallback: return raw header text\n", " return {\n", " 'title': None,\n", " 'authors': [],\n", " 'abstract': None,\n", " 'institutions': [],\n", " 'pdf_name': pdf_name,\n", " 'pdf_id': pdf_id,\n", " 'raw_header': header_text[:2000]\n", " }\n", "\n", " def _is_metadata_query(self, query: str) -> Tuple[bool, str]:\n", " \"\"\"Check if query is asking for basic document metadata\"\"\"\n", " query_lower = query.lower()\n", " \n", " # Author queries\n", " author_patterns = ['who are the authors', 'who wrote', 'author', 'authors', 'written by', 'by whom']\n", " if any(p in query_lower for p in author_patterns):\n", " return True, 'authors'\n", " \n", " # Title queries\n", " title_patterns = ['what is the title', 'title of', 'paper title', 'document title', 'name of the paper']\n", " if any(p in query_lower for p in title_patterns):\n", " return True, 'title'\n", " \n", " # Abstract queries\n", " abstract_patterns = ['what is the abstract', 'abstract of', 'paper abstract', 'summarize the abstract']\n", " if any(p in query_lower for p in abstract_patterns):\n", " return True, 'abstract'\n", " \n", " # Institution queries\n", " institution_patterns = ['which institution', 'which university', 'affiliation', 'where are the authors from']\n", " if any(p in query_lower for p in 
institution_patterns):\n", " return True, 'institutions'\n", " \n", " return False, None\n", "\n", " def _answer_metadata_query(self, query: str, metadata_type: str) -> Tuple[str, str, str]:\n", " \"\"\"Answer queries about document metadata directly\"\"\"\n", " if not self.doc_headers:\n", " return \"No document metadata available.\", \"\", \"\"\n", " \n", " # Build response from stored headers\n", " responses = []\n", " citations = []\n", " \n", " for pdf_name, header in self.doc_headers.items():\n", " if metadata_type == 'authors':\n", " authors = header.get('authors', [])\n", " if authors:\n", " author_str = \", \".join(authors)\n", " responses.append(f\"**{pdf_name}**: {author_str}\")\n", " else:\n", " # Fallback to raw header\n", " responses.append(f\"**{pdf_name}**: Authors could not be automatically extracted. See first page.\")\n", " \n", " elif metadata_type == 'title':\n", " title = header.get('title')\n", " if title:\n", " responses.append(f\"**{pdf_name}**: {title}\")\n", " else:\n", " responses.append(f\"**{pdf_name}**: Title could not be automatically extracted.\")\n", " \n", " elif metadata_type == 'abstract':\n", " abstract = header.get('abstract')\n", " if abstract:\n", " responses.append(f\"**{pdf_name}**:\\n{abstract}\")\n", " else:\n", " responses.append(f\"**{pdf_name}**: Abstract not found in first pages.\")\n", " \n", " elif metadata_type == 'institutions':\n", " institutions = header.get('institutions', [])\n", " if institutions:\n", " inst_str = \", \".join(institutions)\n", " responses.append(f\"**{pdf_name}**: {inst_str}\")\n", " else:\n", " responses.append(f\"**{pdf_name}**: Institutions could not be automatically extracted.\")\n", " \n", " # Create citation from raw header\n", " snippet = header.get('raw_header', '')[:300] + \"...\"\n", " citations.append(f\"\"\"\n", "
\n", "[1] {pdf_name} β€” Page 1 | Relevance: High (Document Header)\n", "
{snippet}
\n", "
\n", "\"\"\")\n", " \n", " answer = \"\\n\\n\".join(responses)\n", " citations_html = \"\\n\".join(citations)\n", " metadata_str = f\"**Query Type:** Metadata ({metadata_type}) | **Direct extraction from document headers**\"\n", " \n", " return answer, citations_html, metadata_str\n", "\n", " def _classify_query(self, query: str) -> QueryProfile:\n", " classification_prompt = f\"\"\"You route user questions for a RAG system.\n", "Return ONLY compact JSON with keys:\n", "- query_type: one of [factoid, summary, comparison, extraction, reasoning]\n", "- needs_multi_docs: true/false (set true when the query likely spans multiple documents or asks for differences)\n", "- requires_comparison: true/false\n", "- answer_style: one of [direct, bullets, steps]\n", "- k: integer between 5 and 12 indicating how many chunks to retrieve\n", "\n", "Question: {query}\n", "\n", "JSON:\"\"\"\n", "\n", " def heuristic_profile() -> QueryProfile:\n", " ql = query.lower()\n", " requires_comparison = any(word in ql for word in ['compare', 'difference', 'versus', 'vs', 'between', 'across'])\n", " needs_multi = requires_comparison or any(word in ql for word in ['both', 'each document', 'all documents', 'across'])\n", " if any(pattern in ql for pattern in ['what is', 'who is', 'when ', 'define', 'list ']):\n", " qt = 'factoid'\n", " k_val = 6\n", " style = 'direct'\n", " elif requires_comparison:\n", " qt = 'comparison'\n", " k_val = 12\n", " style = 'bullets'\n", " elif any(word in ql for word in ['summarize', 'overview', 'key points', 'conclusion']):\n", " qt = 'summary'\n", " k_val = 10\n", " style = 'bullets'\n", " elif any(word in ql for word in ['explain', 'how does', 'process', 'steps', 'methodology']):\n", " qt = 'reasoning'\n", " k_val = 10\n", " style = 'steps'\n", " else:\n", " qt = 'extraction'\n", " k_val = 8\n", " style = 'direct'\n", " return QueryProfile(\n", " query_type=qt,\n", " intent=qt,\n", " needs_multi_docs=needs_multi,\n", " 
requires_comparison=requires_comparison,\n", " answer_style=style,\n", " k=k_val\n", " )\n", "\n", " try:\n", " response = self.llm.invoke(classification_prompt)\n", " data = json.loads(response.content)\n", " qt = str(data.get('query_type', 'extraction')).lower()\n", " needs_multi = bool(data.get('needs_multi_docs', False))\n", " requires_comparison = bool(data.get('requires_comparison', False))\n", " style = str(data.get('answer_style', 'direct')).lower()\n", " k_val = int(data.get('k', 8))\n", " k_val = max(5, min(k_val, 12))\n", " if qt not in ['factoid', 'summary', 'comparison', 'extraction', 'reasoning']:\n", " qt = 'extraction'\n", " if style not in ['direct', 'bullets', 'steps']:\n", " style = 'direct'\n", " return QueryProfile(\n", " query_type=qt,\n", " intent=qt,\n", " needs_multi_docs=needs_multi or requires_comparison,\n", " requires_comparison=requires_comparison or qt == 'comparison',\n", " answer_style=style,\n", " k=k_val\n", " )\n", " except Exception:\n", " return heuristic_profile()\n", "\n", " def _generate_hyde_document(self, query: str) -> str:\n", " hyde_prompt = f\"\"\"Generate a detailed, factual paragraph that would answer this question:\n", "\n", "Question: {query}\n", "\n", "Write a comprehensive answer (2-3 sentences) as if from an expert document:\"\"\"\n", " try:\n", " response = self.llm.invoke(hyde_prompt)\n", " return response.content\n", " except:\n", " return query\n", "\n", " def _expand_query(self, query: str) -> List[str]:\n", " expansion_prompt = f\"\"\"Generate 3 different versions of this question to retrieve relevant documents:\n", "\n", "Original Question: {query}\n", "\n", "Generate 3 alternative phrasings (one per line):\"\"\"\n", " try:\n", " response = self.llm.invoke(expansion_prompt)\n", " queries = response.content.strip().split('\\n')\n", " queries = [q.strip().lstrip('1234567890.-) ') for q in queries if q.strip()]\n", " return [query] + queries[:3]\n", " except:\n", " return [query]\n", "\n", " def 
_adaptive_retrieve(self, query: str, query_type: str) -> int:\n", " # Keys match the query_type values produced by _classify_query\n", " k_map = {'factoid': 5, 'extraction': 8, 'summary': 10, 'reasoning': 10, 'comparison': 12}\n", " return k_map.get(query_type, 8)\n", "\n", " def ingest_pdf(self, pdf_path: str, use_semantic_chunking=True, progress=gr.Progress()):\n", " \"\"\"Ingest PDF with progress tracking - supports multiple PDFs\"\"\"\n", " progress(0.1, desc=f\"Loading PDF: {os.path.basename(pdf_path)}...\")\n", "\n", " # Check if already loaded\n", " pdf_name = os.path.basename(pdf_path)\n", " if pdf_name in self.pdf_metadata:\n", " return f\"Notice: Document '{pdf_name}' is already loaded.\"\n", "\n", " loader = PyPDFLoader(pdf_path)\n", " docs = loader.load()\n", "\n", " # Add unique PDF identifier to metadata\n", " pdf_id = hashlib.md5(pdf_path.encode()).hexdigest()[:8]\n", " \n", " progress(0.2, desc=\"Extracting document metadata (title, authors)...\")\n", " # Extract and store document header metadata\n", " header_info = self._extract_document_header(docs, pdf_name, pdf_id)\n", " self.doc_headers[pdf_name] = header_info\n", " \n", " # Log extracted info\n", " if header_info.get('authors'):\n", " print(f\"Extracted authors: {header_info['authors']}\")\n", " if header_info.get('title'):\n", " print(f\"Extracted title: {header_info['title']}\")\n", "\n", " progress(0.3, desc=f\"Loaded {len(docs)} pages. 
Chunking...\")\n", "\n", " chunk_counter = len(self.documents)\n", "\n", " if use_semantic_chunking:\n", " splits = []\n", " for i, doc in enumerate(docs):\n", " progress(0.3 + (0.3 * i / len(docs)), desc=f\"Semantic chunking page {i+1}/{len(docs)}...\")\n", " semantic_chunks = self.semantic_chunker.chunk_document(doc.page_content)\n", " for chunk in semantic_chunks:\n", " chunk_counter += 1\n", " # Mark first page chunks as header chunks\n", " is_header = doc.metadata.get('page', 0) == 0\n", " splits.append(Document(\n", " page_content=chunk,\n", " metadata={\n", " 'page': doc.metadata.get('page', 0),\n", " 'source': pdf_path,\n", " 'pdf_name': pdf_name,\n", " 'pdf_id': pdf_id,\n", " 'chunk_id': f\"{pdf_id}-{chunk_counter}\",\n", " 'is_header': is_header\n", " }\n", " ))\n", " else:\n", " text_splitter = RecursiveCharacterTextSplitter(\n", " chunk_size=800,\n", " chunk_overlap=150,\n", " separators=[\"\\n\\n\", \"\\n\", \". \", \" \", \"\"],\n", " length_function=len\n", " )\n", " splits = text_splitter.split_documents(docs)\n", " # Add PDF metadata\n", " for split in splits:\n", " chunk_counter += 1\n", " is_header = split.metadata.get('page', 0) == 0\n", " split.metadata['pdf_name'] = pdf_name\n", " split.metadata['pdf_id'] = pdf_id\n", " split.metadata['chunk_id'] = f\"{pdf_id}-{chunk_counter}\"\n", " split.metadata['is_header'] = is_header\n", "\n", " # Add to existing documents\n", " self.documents.extend(splits)\n", "\n", " # Track PDF metadata\n", " total_pages = max([doc.metadata.get('page', 0) for doc in docs]) + 1\n", " self.pdf_metadata[pdf_name] = {\n", " 'path': pdf_path,\n", " 'pages': total_pages,\n", " 'chunks': len(splits),\n", " 'pdf_id': pdf_id,\n", " 'added': datetime.now().strftime(\"%Y-%m-%d %H:%M\")\n", " }\n", "\n", " progress(0.7, desc=f\"Rebuilding Vector Index ({len(self.documents)} total chunks)...\")\n", "\n", " # Rebuild vector DB with all documents\n", " self.vector_db = Chroma.from_documents(\n", " documents=self.documents,\n", " 
embedding=self.embedding_model,\n", " collection_name=\"rag_gradio_v3\"\n", " )\n", "\n", " progress(0.9, desc=\"Rebuilding Keyword Index...\")\n", " self.bm25_retriever = BM25Retriever.from_documents(self.documents)\n", "\n", " progress(1.0, desc=\"Complete\")\n", "\n", " # Build return message with extracted metadata\n", " extracted_info = \"\"\n", " if header_info.get('title'):\n", " extracted_info += f\"\\n**Title:** {header_info['title']}\"\n", " if header_info.get('authors'):\n", " extracted_info += f\"\\n**Authors:** {', '.join(header_info['authors'])}\"\n", "\n", " return f\"\"\"**Document Added Successfully**\n", "\n", "**File:** {pdf_name}\n", "**Pages:** {total_pages}\n", "**Chunks:** {len(splits)}\n", "{extracted_info}\n", "\n", "**Total Collection:**\n", "- {len(self.pdf_metadata)} document(s)\n", "- {len(self.documents)} total chunks\n", "\n", "Ready to answer questions.\"\"\"\n", "\n", " def get_loaded_pdfs(self) -> str:\n", " \"\"\"Return formatted list of loaded PDFs\"\"\"\n", " if not self.pdf_metadata:\n", " return \"No documents loaded yet.\"\n", "\n", " output = \"## Loaded Documents\\n\\n\"\n", " for idx, (name, info) in enumerate(self.pdf_metadata.items(), 1):\n", " output += f\"**{idx}. {name}**\\n\"\n", " output += f\" - Pages: {info['pages']} | Chunks: {info['chunks']}\\n\"\n", " output += f\" - Added: {info['added']}\\n\"\n", " # Add extracted metadata if available\n", " if name in self.doc_headers:\n", " header = self.doc_headers[name]\n", " if header.get('title'):\n", " output += f\" - Title: {header['title']}\\n\"\n", " if header.get('authors'):\n", " output += f\" - Authors: {', '.join(header['authors'][:3])}{'...' 
if len(header['authors']) > 3 else ''}\\n\"\n", " output += \"\\n\"\n", "\n", " output += f\"**Total:** {len(self.pdf_metadata)} document(s), {len(self.documents)} chunks\"\n", " return output\n", "\n", " def clear_all_documents(self):\n", " \"\"\"Clear all loaded documents\"\"\"\n", " self.documents = []\n", " self.pdf_metadata = {}\n", " self.doc_headers = {} # Clear headers too\n", " self.vector_db = None\n", " self.bm25_retriever = None\n", " self.cache = QueryCache() # Clear cache too\n", " return \"All documents cleared.\"\n", "\n", " def _retrieve_with_rrf(self, query: str, k: int = 5, fetch_factor: int = 2, prioritize_header: bool = False) -> List[Document]:\n", " fetch_k = max(k * fetch_factor, k)\n", " vector_docs = self.vector_db.as_retriever(\n", " search_type=\"mmr\",\n", " search_kwargs={\"k\": fetch_k, \"fetch_k\": fetch_k * 2, \"lambda_mult\": 0.6}\n", " ).invoke(query)\n", " self.bm25_retriever.k = fetch_k\n", " keyword_docs = self.bm25_retriever.invoke(query)\n", " \n", " # If prioritizing header, add first-page chunks\n", " if prioritize_header:\n", " header_docs = [doc for doc in self.documents if doc.metadata.get('is_header', False)]\n", " fused_docs = ReciprocalRankFusion.fuse([vector_docs, keyword_docs, header_docs])\n", " else:\n", " fused_docs = ReciprocalRankFusion.fuse([vector_docs, keyword_docs])\n", " \n", " return fused_docs[:fetch_k]\n", "\n", " def _rerank_documents(self, query: str, documents: List[Document], top_k: int = 5, force_comparison: bool = False, boost_header: bool = False) -> List[Tuple[Document, float]]:\n", " if not documents:\n", " return []\n", "\n", " # For comparison queries, boost documents that likely contain comparative info\n", " is_comparison = force_comparison or any(word in query.lower() for word in ['compare', 'difference', 'differ', 'versus', 'vs'])\n", "\n", " pairs = [[query, doc.page_content] for doc in documents]\n", " scores = self.cross_encoder.predict(pairs)\n", "\n", " # Boost scores for docs that 
contain comparison keywords\n", " if is_comparison:\n", " comparison_keywords = ['compared to', 'in contrast', 'difference', 'whereas', 'unlike', 'while', 'however']\n", " for i, doc in enumerate(documents):\n", " content_lower = doc.page_content.lower()\n", " keyword_count = sum(1 for kw in comparison_keywords if kw in content_lower)\n", " if keyword_count > 0:\n", " scores[i] *= (1 + 0.1 * keyword_count) # Boost by 10% per keyword\n", " \n", " # Boost header chunks for metadata-like queries\n", " if boost_header:\n", " for i, doc in enumerate(documents):\n", " if doc.metadata.get('is_header', False) or doc.metadata.get('page', 99) == 0:\n", " scores[i] *= 1.5 # 50% boost for first page content\n", "\n", " scored_docs = list(zip(documents, scores))\n", " scored_docs.sort(key=lambda x: x[1], reverse=True)\n", " return scored_docs[:top_k]\n", "\n", " def _dedupe_documents(self, documents: List[Document]) -> List[Document]:\n", " deduped = []\n", " seen = set()\n", " for doc in documents:\n", " key = doc.metadata.get('chunk_id') or f\"{doc.metadata.get('pdf_id', 'unknown')}::{hashlib.md5(doc.page_content.encode()).hexdigest()}\"\n", " if key in seen:\n", " continue\n", " seen.add(key)\n", " deduped.append(doc)\n", " return deduped\n", "\n", " def _ensure_pdf_diversity(self, query: str, documents: List[Document], target_docs: int = 2, per_pdf: int = 3) -> List[Document]:\n", " if not documents or not self.pdf_metadata:\n", " return documents\n", "\n", " seen_ids = set(doc.metadata.get('pdf_id') for doc in documents if doc.metadata.get('pdf_id'))\n", " if len(seen_ids) >= target_docs:\n", " return documents\n", "\n", " missing_ids = [info['pdf_id'] for info in self.pdf_metadata.values() if info['pdf_id'] not in seen_ids]\n", " extra_docs = []\n", " for pdf_id in missing_ids[:max(0, target_docs - len(seen_ids))]:\n", " filtered_docs = self.vector_db.as_retriever(\n", " search_type=\"mmr\",\n", " search_kwargs={\n", " \"k\": per_pdf,\n", " \"fetch_k\": per_pdf * 2,\n", " 
\"lambda_mult\": 0.6,\n", " \"filter\": {\"pdf_id\": pdf_id}\n", " }\n", " ).invoke(query)\n", " extra_docs.extend(filtered_docs)\n", "\n", " combined = documents + extra_docs\n", " return self._dedupe_documents(combined)\n", "\n", " def _create_citation_card(self, idx: int, doc: Document, score: float) -> str:\n", " \"\"\"Create a formatted citation card\"\"\"\n", " page = doc.metadata.get('page', 'Unknown')\n", " pdf_name = doc.metadata.get('pdf_name', 'Unknown Document')\n", "\n", " # Get snippet (first 200 chars)\n", " snippet = doc.page_content[:200] + \"...\" if len(doc.page_content) > 200 else doc.page_content\n", "\n", " # Relevance label based on score\n", " if score > 0.7:\n", " relevance = \"High\"\n", " elif score > 0.5:\n", " relevance = \"Medium\"\n", " else:\n", " relevance = \"Low\"\n", "\n", " card = f\"\"\"\n", "
\n", "[{idx}] {pdf_name} | Page {page} | Relevance: {relevance} ({score:.2f})\n", "
{snippet}
\n", "
\n", "\"\"\"\n", " return card\n", "\n", " def chat(self, query: str, use_hyde: bool = True, use_multi_query: bool = True, progress=gr.Progress()):\n", " \"\"\"Enhanced chat with better answers and citations\"\"\"\n", " if not self.vector_db:\n", " return \"Please upload at least one document first.\", \"\", \"\"\n", "\n", " # Check if this is a metadata query (authors, title, etc.)\n", " is_metadata_query, metadata_type = self._is_metadata_query(query)\n", " if is_metadata_query and self.doc_headers:\n", " progress(0.5, desc=f\"Retrieving {metadata_type} from document metadata...\")\n", " answer, citations, metadata = self._answer_metadata_query(query, metadata_type)\n", " progress(1.0, desc=\"Complete\")\n", " return answer, citations, metadata\n", "\n", " # Check cache\n", " cached_response = self.cache.get(query)\n", " if cached_response:\n", " return f\"*Retrieved from cache*\\n\\n{cached_response}\", \"\", \"Cached result\"\n", "\n", " progress(0.1, desc=\"Classifying query...\")\n", " profile = self._classify_query(query)\n", " k = profile.k\n", " \n", " # Check if query might need header info (about the paper itself)\n", " needs_header_boost = any(word in query.lower() for word in \n", " ['paper', 'study', 'research', 'introduction', 'propose', 'contribution', 'this work'])\n", "\n", " base_queries = [query]\n", "\n", " if use_multi_query:\n", " progress(0.22, desc=\"Expanding query variants...\")\n", " expanded_queries = self._expand_query(query)\n", " base_queries.extend(expanded_queries[:2])\n", "\n", " if use_hyde:\n", " progress(0.32, desc=\"Generating HyDE document...\")\n", " hyde_doc = self._generate_hyde_document(query)\n", " base_queries.append(hyde_doc)\n", "\n", " progress(0.45, desc=\"Retrieving candidates (MMR + BM25)...\")\n", " retrieval_results = []\n", " for bq in base_queries:\n", " retrieval_results.append(self._retrieve_with_rrf(bq, k=k, fetch_factor=2, prioritize_header=needs_header_boost))\n", "\n", " fused_docs = 
ReciprocalRankFusion.fuse(retrieval_results)\n", " fused_docs = self._dedupe_documents(fused_docs)[:max(k * 3, k)]\n", "\n", " if profile.needs_multi_docs and len(self.pdf_metadata) > 1:\n", " fused_docs = self._ensure_pdf_diversity(\n", " query,\n", " fused_docs,\n", " target_docs=min(3, len(self.pdf_metadata)),\n", " per_pdf=max(2, k // 3)\n", " )\n", "\n", " progress(0.7, desc=\"Re-ranking with CrossEncoder...\")\n", " reranked_docs = self._rerank_documents(query, fused_docs, top_k=max(5, k), \n", " force_comparison=profile.requires_comparison,\n", " boost_header=needs_header_boost)\n", "\n", " progress(0.8, desc=\"Building context...\")\n", "\n", " # Build context with inline citations\n", " context_parts = []\n", " citation_cards = []\n", "\n", " for idx, (doc, score) in enumerate(reranked_docs, 1):\n", " page = doc.metadata.get('page', 'Unknown')\n", " pdf_name = doc.metadata.get('pdf_name', 'Unknown')\n", "\n", " # Add to context\n", " context_parts.append(f\"[Source {idx}]: {doc.page_content}\\n\")\n", "\n", " # Create citation card\n", " citation_cards.append(self._create_citation_card(idx, doc, score))\n", "\n", " context_str = \"\\n\".join(context_parts)\n", "\n", " # Enhanced prompt for better answers\n", " is_comparison = profile.requires_comparison\n", " style_hint = \"\"\n", " if profile.answer_style == 'bullets':\n", " style_hint = \"Use concise bullet points.\"\n", " elif profile.answer_style == 'steps':\n", " style_hint = \"Use numbered steps when explaining processes.\"\n", "\n", " style_instruction = style_hint or \"Keep structure aligned to the question type.\"\n", "\n", " if is_comparison:\n", " prompt = f\"\"\"You are an expert AI assistant analyzing academic/technical documents. Answer this COMPARISON question with precision and structure.\n", "\n", "## COMPARISON QUESTION TYPE\n", "\n", "## CRITICAL INSTRUCTIONS:\n", "1. **Start with a direct comparison statement** - Don't give background first\n", "2. 
**Use a structured format:**\n", " - Brief 1-2 sentence overview of what's being compared\n", " - Bullet points listing specific differences\n", " - Each bullet should be concrete and factual\n", "3. **Be specific with numbers, names, and technical details** from the sources\n", "4. **Cite sources** [Source X] after each factual claim\n", "5. **If sources lack comparison info**, explicitly state: \"The provided sources do not contain direct comparison information on [aspect]. Based on what's available: [answer what you can]\"\n", "\n", "## CONTEXT FROM DOCUMENTS:\n", "{context_str}\n", "\n", "## COMPARISON QUESTION:\n", "{query}\n", "\n", "## STRUCTURED COMPARISON ANSWER:\n", "\"\"\"\n", " else:\n", " prompt = f\"\"\"You are an expert AI assistant analyzing academic/technical documents. Your goal is to provide accurate, well-structured, and comprehensive answers.\n", "\n", "## QUERY TYPE: {profile.query_type.upper()}\n", "\n", "## INSTRUCTIONS:\n", "1. **Answer the question directly in the first sentence** - Don't start with background\n", "2. **Use inline citations** [Source X] immediately after each claim or fact\n", "3. **Structure your answer clearly:**\n", " - For factoid queries: Direct answer (2-3 sentences) with supporting details\n", " - For complex queries: Organized explanation with bullet points or numbered lists\n", " - For \"explain\" queries: Start with simple definition, then elaborate\n", "4. **Be comprehensive but concise** - NO repetition or filler words\n", "5. **Use specific facts**: numbers, names, technical terms from sources\n", "6. **If information is insufficient**, state: \"The sources provided do not fully address [aspect]. Based on available information: [what you can answer]\"\n", "7. 
{style_instruction}\n", "\n", "## CONTEXT FROM DOCUMENTS:\n", "{context_str}\n", "\n", "## QUESTION:\n", "{query}\n", "\n", "## YOUR ANSWER:\n", "\"\"\"\n", "\n", " progress(0.9, desc=\"Generating enhanced answer...\")\n", " try:\n", " response = self.llm.invoke(prompt)\n", " answer = response.content\n", "\n", " # Add verification step for complex queries and comparisons\n", " if profile.query_type in ['summary', 'comparison', 'reasoning'] or is_comparison:\n", " verify_prompt = f\"\"\"Review this answer for a {profile.query_type} query. Check if it:\n", "\n", "Question: {query}\n", "\n", "Answer: {answer}\n", "\n", "**Evaluation Criteria:**\n", "1. **Directness**: Does it answer the question in the first sentence?\n", "2. **Structure**: Is it well-organized with bullet points for complex info?\n", "3. **Specificity**: Does it use concrete facts/numbers from sources?\n", "4. **Completeness**: Does it address all parts of the question?\n", "5. **No fluff**: Is it concise without repetition?\n", "\n", "If the answer has issues, provide an IMPROVED VERSION following this format:\n", "- Start with direct answer\n", "- Use bullet points for lists/comparisons\n", "- Include specific facts with citations\n", "- Be concise\n", "\n", "If it's already good, respond with only: \"VERIFIED\"\n", "\n", "Your response:\"\"\"\n", "\n", " verify_response = self.llm.invoke(verify_prompt)\n", " if \"VERIFIED\" not in verify_response.content.upper():\n", " # Extract improved answer (remove any preamble)\n", " improved = verify_response.content\n", " if \"IMPROVED VERSION\" in improved or \"Here\" in improved[:50]:\n", " # Find where actual answer starts\n", " lines = improved.split('\\n')\n", " answer_lines = []\n", " started = False\n", " for line in lines:\n", " if started or (line.strip() and not line.strip().startswith(('**', 'If', 'Your', 'The answer'))):\n", " started = True\n", " answer_lines.append(line)\n", " if answer_lines:\n", " answer = '\\n'.join(answer_lines)\n", " 
else:\n", " answer = improved\n", "\n", " self.cache.set(query, answer)\n", "\n", " # Format citations\n", " citations_html = \"\\n\".join(citation_cards)\n", "\n", " # Metadata\n", " metadata = f\"\"\"**Query Type:** {profile.query_type.title()} | **Multi-Doc:** {\"Yes\" if profile.needs_multi_docs else \"No\"} | **Sources Used:** {len(reranked_docs)} | **Documents Searched:** {len(self.pdf_metadata)}\"\"\"\n", "\n", " progress(1.0, desc=\"Complete\")\n", " return answer, citations_html, metadata\n", "\n", " except Exception as e:\n", " return f\"Error: {str(e)}\", \"\", \"\"\n", "\n", " def summarize_document(self, max_chunks: int = None, progress=gr.Progress()):\n", " \"\"\"Generate document summary\"\"\"\n", " if not self.documents:\n", " return \"No document loaded.\", \"\"\n", "\n", " chunks_to_process = self.documents[:max_chunks] if max_chunks else self.documents\n", " total_chunks = len(chunks_to_process)\n", "\n", " progress(0.1, desc=f\"Processing {total_chunks} chunks...\")\n", "\n", " chunk_summaries = []\n", " batch_size = 10\n", "\n", " for i in range(0, total_chunks, batch_size):\n", " batch = chunks_to_process[i:i+batch_size]\n", " batch_text = \"\\n\\n---\\n\\n\".join([doc.page_content for doc in batch])\n", "\n", " progress(0.1 + (0.6 * i / total_chunks), desc=f\"Summarizing chunks {i+1}-{min(i+batch_size, total_chunks)}...\")\n", "\n", " map_prompt = f\"\"\"Summarize the key points from this document section in 3-5 bullet points:\n", "\n", "{batch_text}\n", "\n", "Key Points:\"\"\"\n", "\n", " try:\n", " response = self.llm.invoke(map_prompt)\n", " chunk_summaries.append(response.content)\n", " except Exception as e:\n", " continue\n", "\n", " progress(0.8, desc=\"Synthesizing final summary...\")\n", "\n", " combined_summaries = \"\\n\\n\".join(chunk_summaries)\n", "\n", " reduce_prompt = f\"\"\"You are summarizing documents. 
Below are summaries of different sections.\n", "\n", "Create a comprehensive, well-structured summary that includes:\n", "\n", "1. **Overview**: What are these documents about? (2-3 sentences)\n", "2. **Main Topics**: Key themes and subjects covered (bullet points)\n", "3. **Important Details**: Critical information, findings, or arguments (3-5 points)\n", "4. **Conclusion**: Overall takeaway or significance\n", "\n", "Section Summaries:\n", "{combined_summaries}\n", "\n", "## COMPREHENSIVE SUMMARY:\"\"\"\n", "\n", " try:\n", " final_response = self.llm.invoke(reduce_prompt)\n", " summary = final_response.content\n", "\n", " # Build metadata\n", " metadata = f\"\"\"## Summary Statistics\n", "\n", "**Documents Analyzed:** {len(self.pdf_metadata)}\n", "**Total Chunks:** {total_chunks}\n", "**Total Pages:** {sum(info['pages'] for info in self.pdf_metadata.values())}\n", "\n", "### Documents Included:\n", "\"\"\"\n", " for name, info in self.pdf_metadata.items():\n", " metadata += f\"- **{name}** ({info['pages']} pages)\\n\"\n", "\n", " progress(1.0, desc=\"Complete\")\n", " return summary, metadata\n", "\n", " except Exception as e:\n", " return f\"Error: {str(e)}\", \"\"" ] }, { "cell_type": "markdown", "metadata": { "id": "FOZPs9yFYxfn" }, "source": [ "## Gradio Web Interface\n", "\n", "The `create_interface()` function creates a 4-tab web UI:\n", "\n", "| Tab | Purpose | Key Actions |\n", "|-----|---------|-------------|\n", "| **Setup** | Initialize system | Enter API key β†’ Load models β†’ Upload PDFs |\n", "| **Chat** | Q&A interface | Ask questions with HyDE/Multi-Query options |\n", "| **Summarize** | Document summary | Generate map-reduce summary of all docs |\n", "| **Help** | Documentation | Usage guide and example questions |\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "63Rjs0gNBm_y" }, "outputs": [], "source": [ "def create_interface():\n", " \"\"\"Create enhanced Gradio interface\"\"\"\n", "\n", " # Global RAG instance\n", " 
rag_system = None\n", "\n", " def initialize_system(api_key):\n", " nonlocal rag_system\n", " if not api_key:\n", " return \"Please enter your Groq API key.\", \"\"\n", " try:\n", " rag_system = EnhancedRAGv3(api_key)\n", " status = rag_system.load_models()\n", " return status, \"\"\n", " except Exception as e:\n", " return f\"Error: {str(e)}\", \"\"\n", "\n", " def upload_and_process(file, use_semantic):\n", " if rag_system is None or not rag_system.is_initialized:\n", " return \"Please initialize the system first.\", \"\"\n", " try:\n", " status = rag_system.ingest_pdf(file.name, use_semantic_chunking=use_semantic)\n", " loaded_pdfs = rag_system.get_loaded_pdfs()\n", " return status, loaded_pdfs\n", " except Exception as e:\n", " return f\"Error: {str(e)}\", \"\"\n", "\n", " def ask_question(query, use_hyde, use_multi_query):\n", " if rag_system is None:\n", " return \"Please initialize the system first.\", \"\", \"\"\n", " if not query.strip():\n", " return \"Please enter a question.\", \"\", \"\"\n", " try:\n", " answer, citations, metadata = rag_system.chat(query, use_hyde=use_hyde, use_multi_query=use_multi_query)\n", " return answer, citations, metadata\n", " except Exception as e:\n", " return f\"Error: {str(e)}\", \"\", \"\"\n", "\n", " def summarize_doc():\n", " if rag_system is None:\n", " return \"Please initialize the system first.\", \"\"\n", " try:\n", " summary, metadata = rag_system.summarize_document()\n", " return summary, metadata\n", " except Exception as e:\n", " return f\"Error: {str(e)}\", \"\"\n", "\n", " def clear_docs():\n", " if rag_system is None:\n", " return \"No system initialized\", \"\"\n", " status = rag_system.clear_all_documents()\n", " return status, \"\"\n", "\n", " def get_pdf_list():\n", " if rag_system is None:\n", " return \"No system initialized\"\n", " return rag_system.get_loaded_pdfs()\n", "\n", " # Create Gradio Blocks interface\n", " with gr.Blocks(\n", " title=\"Multi-PDF RAG System\",\n", " 
theme=gr.themes.Base(),\n", " css=\"\"\"\n", " .gradio-container { max-width: 1200px; margin: auto; }\n", " h1 { font-weight: 600; }\n", " .prose { font-size: 14px; }\n", " \"\"\"\n", " ) as app:\n", " gr.Markdown(\"\"\"\n", " # Multi-Document RAG System\n", " **Advanced Document Q&A with Multiple PDF Support**, powered by Llama 3.3 70B\n", "\n", " Multi-document support | Enhanced citations | Verification system\n", " \"\"\")\n", "\n", " with gr.Tab(\"Setup\"):\n", " gr.Markdown(\"### Step 1: Initialize System\")\n", " with gr.Row():\n", " api_key_input = gr.Textbox(\n", " label=\"Groq API Key\",\n", " type=\"password\",\n", " placeholder=\"Enter your Groq API key\",\n", " scale=3\n", " )\n", " init_btn = gr.Button(\"Initialize\", variant=\"primary\", scale=1)\n", " init_status = gr.Textbox(label=\"Status\", interactive=False)\n", "\n", " gr.Markdown(\"### Step 2: Upload Documents\")\n", " gr.Markdown(\"*Multiple PDFs supported: each document will be added to the knowledge base.*\")\n", "\n", " with gr.Row():\n", " with gr.Column(scale=2):\n", " file_input = gr.File(label=\"Select PDF\", file_types=[\".pdf\"])\n", " semantic_check = gr.Checkbox(label=\"Use Semantic Chunking (Recommended)\", value=True)\n", " with gr.Row():\n", " upload_btn = gr.Button(\"Add Document\", variant=\"primary\", scale=2)\n", " clear_btn = gr.Button(\"Clear All\", variant=\"stop\", scale=1)\n", " upload_status = gr.Markdown()\n", "\n", " with gr.Column(scale=1):\n", " gr.Markdown(\"#### Document Library\")\n", " loaded_pdfs_display = gr.Markdown(\"No documents loaded yet.\")\n", " refresh_btn = gr.Button(\"Refresh\", size=\"sm\")\n", "\n", " init_btn.click(initialize_system, inputs=[api_key_input], outputs=[init_status, loaded_pdfs_display])\n", " upload_btn.click(upload_and_process, inputs=[file_input, semantic_check], outputs=[upload_status, loaded_pdfs_display])\n", " clear_btn.click(clear_docs, outputs=[upload_status, loaded_pdfs_display])\n", " refresh_btn.click(get_pdf_list, 
outputs=[loaded_pdfs_display])\n", "\n", " with gr.Tab(\"Chat\"):\n", " gr.Markdown(\"### Ask Questions About Your Documents\")\n", "\n", " with gr.Row():\n", " with gr.Column(scale=3):\n", " query_input = gr.Textbox(\n", " label=\"Your Question\",\n", " placeholder=\"What are the key conclusions of these papers?\",\n", " lines=3\n", " )\n", " with gr.Row():\n", " hyde_check = gr.Checkbox(label=\"HyDE (Better Retrieval)\", value=True)\n", " multi_query_check = gr.Checkbox(label=\"Multi-Query (More Comprehensive)\", value=True)\n", " ask_btn = gr.Button(\"Submit\", variant=\"primary\", size=\"lg\")\n", "\n", " with gr.Column(scale=1):\n", " gr.Markdown(\"\"\"\n", " #### Tips\n", " - Be specific in your questions\n", " - HyDE improves retrieval quality\n", " - Multi-Query finds more context\n", " - Questions search across all loaded documents\n", " \"\"\")\n", "\n", " metadata_output = gr.Markdown(label=\"Query Info\")\n", " answer_output = gr.Markdown(label=\"Answer\")\n", "\n", " gr.Markdown(\"#### Sources & Citations\")\n", " sources_output = gr.HTML(label=\"Sources\")\n", "\n", " ask_btn.click(\n", " ask_question,\n", " inputs=[query_input, hyde_check, multi_query_check],\n", " outputs=[answer_output, sources_output, metadata_output]\n", " )\n", "\n", " gr.Examples(\n", " examples=[\n", " \"What are the main findings?\",\n", " \"Explain the methodology in detail\",\n", " \"What are the key conclusions?\",\n", " \"Compare the approaches discussed\",\n", " \"What are the limitations mentioned?\",\n", " \"Summarize the most important contributions\"\n", " ],\n", " inputs=query_input\n", " )\n", "\n", " with gr.Tab(\"Summarize\"):\n", " gr.Markdown(\"### Generate Comprehensive Summary\")\n", " gr.Markdown(\"Analyzes all loaded documents and creates a unified summary.\")\n", "\n", " summarize_btn = gr.Button(\"Generate Summary\", variant=\"primary\", size=\"lg\")\n", " summary_metadata = gr.Markdown()\n", " summary_output = gr.Markdown()\n", "\n", " 
summarize_btn.click(summarize_doc, outputs=[summary_output, summary_metadata])\n", "\n", " with gr.Tab(\"Help\"):\n", " gr.Markdown(\"\"\"\n", " ## How to Use This System\n", "\n", " ### Quick Start\n", " 1. **Setup Tab**: Enter Groq API key and click \"Initialize\"\n", " 2. **Setup Tab**: Upload PDF(s) and click \"Add Document\" (repeat for multiple files)\n", " 3. **Chat Tab**: Ask questions about your documents\n", " 4. **Summarize Tab**: Get unified summary of all documents\n", "\n", " ---\n", "\n", " ## Example PDFs for Testing\n", "\n", " Download these PDFs to test the system:\n", "\n", " ### Academic Research Papers\n", " 1. **Attention is All You Need** (Transformer Paper)\n", " - URL: `https://arxiv.org/pdf/1706.03762.pdf`\n", " - Great for: Testing technical Q&A\n", "\n", " 2. **BERT: Pre-training of Deep Bidirectional Transformers**\n", " - URL: `https://arxiv.org/pdf/1810.04805.pdf`\n", " - Great for: Multi-paper comparison\n", "\n", " 3. **GPT-3 Paper** (Language Models are Few-Shot Learners)\n", " - URL: `https://arxiv.org/pdf/2005.14165.pdf`\n", " - Great for: Complex methodology questions\n", "\n", " ### Business & Finance\n", " 4. **Tesla Q3 2024 Earnings Report**\n", " - URL: `https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2024-Update`\n", " - Great for: Financial analysis questions\n", "\n", " 5. **World Bank Annual Report**\n", " - URL: `https://thedocs.worldbank.org/en/doc/9a8210d538854d29883cf3a19e66a3e2-0350012021/original/WBG-Annual-Report-2021-EN.pdf`\n", " - Great for: Economic data queries\n", "\n", " ### Technical Documentation\n", " 6. **Python Documentation (any topic)**\n", " - URL: Search \"python [topic] pdf\" on official Python docs\n", " - Great for: Technical Q&A\n", "\n", " ### Medical/Scientific\n", " 7. 
**WHO COVID-19 Reports**\n", " - URL: `https://www.who.int/publications` (download any PDF)\n", " - Great for: Health information extraction\n", "\n", " ---\n", "\n", " ## Real-Life Use Case Questions\n", "\n", " ### Research Paper Analysis (Attention Paper)\n", " ```\n", " 1. \"What is the main innovation proposed in this paper?\"\n", " 2. \"Explain the self-attention mechanism in detail\"\n", " 3. \"How does the Transformer compare to RNN-based models?\"\n", " 4. \"What are the computational complexity advantages?\"\n", " 5. \"What datasets were used for evaluation?\"\n", " 6. \"What are the key results on machine translation tasks?\"\n", " 7. \"Describe the encoder-decoder architecture\"\n", " 8. \"What are the limitations mentioned by the authors?\"\n", " ```\n", "\n", " ### Business/Finance Analysis (Earnings Reports)\n", " ```\n", " 1. \"What was the total revenue this quarter?\"\n", " 2. \"How did the company perform compared to last quarter?\"\n", " 3. \"What are the main growth drivers mentioned?\"\n", " 4. \"Summarize the key financial metrics\"\n", " 5. \"What guidance did management provide?\"\n", " 6. \"What are the main risks discussed?\"\n", " 7. \"How much cash does the company have?\"\n", " 8. \"What investments are being made in R&D?\"\n", " ```\n", "\n", " ### Multi-Paper Comparison (Upload BERT + GPT-3)\n", " ```\n", " 1. \"Compare the pre-training objectives of these models\"\n", " 2. \"What are the key differences in architecture?\"\n", " 3. \"Which model performs better on which tasks?\"\n", " 4. \"How do the training datasets differ?\"\n", " 5. \"What are the main innovations in each paper?\"\n", " 6. \"Compare the computational requirements\"\n", " ```\n", "\n", " ### Policy/Report Analysis (World Bank Report)\n", " ```\n", " 1. \"What are the main economic trends discussed?\"\n", " 2. \"Which regions showed the strongest growth?\"\n", " 3. \"What interventions are recommended?\"\n", " 4. 
\"Summarize the poverty reduction initiatives\"\n", " 5. \"What are the key challenges identified?\"\n", " 6. \"What metrics are used to measure success?\"\n", " ```\n", "\n", " ### Medical Literature (WHO Reports)\n", " ```\n", " 1. \"What are the recommended treatment protocols?\"\n", " 2. \"What evidence supports these guidelines?\"\n", " 3. \"What are the risk factors mentioned?\"\n", " 4. \"Summarize the epidemiological data\"\n", " 5. \"What prevention measures are suggested?\"\n", " ```\n", "\n", " ### Contract/Legal Document Analysis\n", " ```\n", " 1. \"What are the key obligations of each party?\"\n", " 2. \"What are the termination conditions?\"\n", " 3. \"Summarize the payment terms\"\n", " 4. \"What warranties are provided?\"\n", " 5. \"What are the liability limitations?\"\n", " ```\n", "\n", " ---\n", "\n", " ## Complete Example Workflow\n", "\n", " **Scenario:** Analyzing the \"Attention is All You Need\" paper\n", "\n", " **Step 1:** Upload the PDF\n", " - Download: `https://arxiv.org/pdf/1706.03762.pdf`\n", " - Upload in Setup tab\n", "\n", " **Step 2:** Start with broad question:\n", " ```\n", " \"What is this paper about and what problem does it solve?\"\n", " ```\n", "\n", " **Step 3:** Dive into specifics:\n", " ```\n", " \"Explain how multi-head attention works\"\n", " \"What are the advantages over recurrent models?\"\n", " \"How does positional encoding work?\"\n", " ```\n", "\n", " **Step 4:** Compare (if you upload GPT-3 paper too):\n", " ```\n", " \"How does the Transformer architecture used in GPT-3 differ from the original?\"\n", " ```\n", "\n", " **Step 5:** Get comprehensive summary:\n", " - Go to Summarize tab\n", " - Click \"Generate Summary\"\n", "\n", " ---\n", "\n", " ### Key Features\n", "\n", " #### Multi-Document Support\n", " - Upload multiple PDFs to build a knowledge base\n", " - Questions search across all loaded documents\n", " - Each source is clearly attributed with document name and page\n", "\n", " #### Enhanced 
Citations\n", " - Expandable citation cards with snippets\n", " - Relevance scores (High, Medium, Low)\n", " - Shows exact page numbers and document names\n", "\n", " #### Better Answer Quality\n", " - Advanced query classification (factoid/medium/complex)\n", " - Verification system for complex queries\n", " - Structured, comprehensive responses\n", " - Context-aware answer length\n", "\n", " #### Advanced Retrieval\n", " - **HyDE**: Generates hypothetical documents for better retrieval\n", " - **Multi-Query**: Expands questions for comprehensive coverage\n", " - **RRF Fusion**: Combines vector + keyword search\n", " - **Cross-Encoder Re-ranking**: Improves relevance scoring\n", "\n", " ### Best Practices\n", "\n", " **For Best Results:**\n", " - Use semantic chunking (default)\n", " - Keep HyDE and Multi-Query enabled for important questions\n", " - Ask specific, well-formed questions\n", " - Review citations to verify answers\n", "\n", " **Example Questions:**\n", " - \"What is the transformer architecture?\" (Factoid)\n", " - \"Explain how self-attention works in detail\" (Complex)\n", " - \"Compare BERT and GPT approaches\" (Complex)\n", " - \"What are the key innovations in this paper?\" (Medium)\n", "\n", " ### Performance\n", " - **Simple queries**: ~5-8 seconds\n", " - **Complex queries**: ~10-15 seconds (includes verification)\n", " - **Full summary**: ~30-90 seconds (depends on document count)\n", "\n", " ### Technical Details\n", " - **Embeddings**: BAAI/bge-large-en-v1.5 (SOTA)\n", " - **Re-ranker**: BAAI/bge-reranker-v2-m3\n", " - **LLM**: Llama 3.3 70B Versatile via Groq\n", " - **Retrieval**: Hybrid (Vector + BM25) with RRF fusion\n", "\n", " ---\n", "\n", " **Need Help?** Make sure to initialize the system and upload at least one PDF before asking questions!\n", " \"\"\")\n", "\n", " return app\n" ] }, { "cell_type": "markdown", "metadata": { "id": "h1wXxw54Y7--" }, "source": [ "### Launch Configuration:\n", "```python\n", "app.launch(share=True, 
debug=True, server_port=7860)\n", "```\n", "- `share=True` → Creates a public `gradio.live` URL (the free link expires after one week)\n", "- `debug=True` → Streams logs in the cell and keeps it running until interrupted (set `debug=False` to launch without blocking)\n", "- `server_port=7860` → Serves the app locally at `http://localhost:7860`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000, "referenced_widgets": [ "16e7ec24f6c34211be67872719bf9563", "30b098e339a54662bfd60b2378705b90", "185e3b2ad58a4478a9fbd816ecbf9742", "daf3357706804fccb6c31135a9026df0", "79609124541c46708f54300e8327076d", "bb46a6e38ec14107bb0c08b732b6d5c2", "5fd6e74c4ce749889cf368fa5350b537", "c834082a75e54e2c9b93e01d7fcc0e63", "e9beb0ca922943c48062059151d03e84", "23a80524f261439f880f9d1665546d2c", "593b75ac4dcf4a879406f4ecbf09307d", "a7f006a99ebc4f76a93f00b5bc81013d", "20707ef4487b4181ac60f90c2cbdd654", "1f07e31fdca34c27a2124e2e3752b573", "3fe271f3a67b4976afccd5c87550b588", "67094e8997af4300817747c0e5e0834e", "f8072b44de5e4b25b53c6dab4ac03e5b", "ede80e01c89742298a1dc17f3d4ad433", "a2096398deb14083817d4b5476bcadd4", "8de9e181d2b946aaad5c5b66a2742708", "876047addea2453c8e97aefbee2a87fe", "2d7afeb82fb84452be405e87488b7a1e", "9aa90ace4d9e4586ac65496983a5e8df", "7cb016c4d08346479c9fbc00100a7ccc", "0171997bc10c4f1498aca91b8991b421", "b35aa187abbc418296bc2645629b4e37", "0d454c8bacb24275892705fc6f5a5342", "e1547d895b58455daef9c6a190f3330c", "f4b191e440f645a18d1904a795765ac8", "d85dc47eaffe4a8eb269836bc421d657", "95a985c765084d76b9ce331d49077243", "5d2f39828b8049bb94c0515715df82d8", "a7a3df6b70de49ec9161a0e7fa00cc67", "90db4efa43024c9183e73696d8789256", "9a6f7e9fcc1b42ffa4927801a8d907c1", "dff4ff30c5594218b1906f427a054261", "77e3497365d444ccaddfc7c827d5099d", "d59caf966ee9400aa97210936745cbee", "78e493bbe2474fe4bb363b943b58ba01", "75548e5439cc4556a40fd1c79d40f695", "6b9626df9125449e9f20d47a22b5fff0", "6cb8f6cafc874488946fab0da980f80f", "d59e08a3214e455c89f443656a31a787", "019b6c48e98e4743819c5562e834597c", "fabd252b6d494103b399efa3acac44be", "43e348e91c9d44a08e3cb3b1473f3ea0", "3c97b950cb5844a189cb83b586d5cde0", 
"ff039e2efc5d4429bbd72b68d1cd4cf7", "6926b0c7ee324e4d9b6a73cfeede6dfb", "50fc937584e748cab0affb2eba8751f4", "fb4639c037d34d8db7a9d7a264ae3632", "67b62ec4c1794ba49151169e095fe71c", "ab415f2ad72e49a48ea62af7d2053811", "f91f7666be7f46c28ec6b68a8ac9202f", "d27aa73b45a34d02be27e7b6728c8269", "a1c261daf2bc4410b3c524ae65b584b6", "2a568b3fb61d4009b30e97489299a0cf", "b7251ba370ab42a0b923829cbf19823e", "1074084c422a4932a49d0e6e76a3e291", "4a5e54c4d7cc4c85a5eb12836d0b7519", "02857216ca404b17b6d850f69fdc4167", "cfebfe51ce2a4c1ab3423c99fac04de4", "e23178cc86744a1e8f82a9f8b8b9d457", "aefe8e9f423b46e99735d229b303f279", "32935a20278c408dbf37cde6578f8c3e", "e82cd751987d45e3a55925955855bacf", "617f883b8d344d83aa92f5af4c4fa500", "9757dfc308e04c7eac1879acb34a524d", "188705886f344f84ad5190bb43297506", "1c6c826032384883b3a3dcdfa93eda18", "b453d0a3ff944998b6c482376a38896e", "bc58242b7fb4400fa36f05059b320cbc", "dcd53beda53c4722a343e3da200b1375", "b00a22bc1ebd4566b3d556316d448986", "0d607a5044a34a89bed74d0688647a73", "b2a482182bf34c6f88968bbbe54d99b7", "a0cce362b9a7462d9e0c37ac0870293b", "986b8c9dcad245cea63a43ef4df3d36f", "064086f0c0f444f2aa52c55b460d6ca7", "417dd85b34a84c2684960a5c16f7a762", "ac2673ad0288472a952351c018f1271d", "ed30fe2e571c40d086425c063518e3b7", "9ba656852b25431f8113b1b1783afcb0", "17574ccf363a470f89927cbf4cd10004", "557202483cdb49dea117927777cc6d35", "a2ee87d4317a484cbd0550baa42dc419", "18ddd81f0a7c4d7c81579be798cfe1dd", "da919e0da2104bea95559e459e3955a0", "7cc6b20897a34abbafb2b2eae17757e0", "9d50a215d8954b36a5665a6a17ec5bdc", "75e3dd18da934743a090fd095bb50c84", "075d54e510934230abe640032c79fc02", "8616627ec29e45ddbd8c04271ab0df04", "36ccd95611f3475581c30d8258c7d1c8", "e4f13b17e6534bd1976665d30455e0c0", "f5292c055ae248d58f38adf45801e00c", "9c2be268d3b34cc3b49791892d27ff87", "f7ffa842a0b84916913e4094a5fb294c", "5a23cfb2ad50450e99b4f2cb1411cc20", "0a8b2c0ade00462d80de5e26d096718d", "7d7b7921ca2a446896ce2ef4ba2d3ba4", "2b933ba020274544ba009b64b4e5c430", 
"aff8b0997198412a94db617bade60603", "49717bef77dc4b86a0e3340463e54d82", "bef49fe64f37432c8d0468d2047075e0", "2bfa46f9d0174de581d0747d493308d1", "03e5824940dd44bc8e93b40e882ef01d", "71f836f754ae49a0be420b5dd736e4fe", "6f53a9115aaf45cea6f45e6bd239d7d8", "6bc03369ce31443ca05fe7a11b0a5514", "8057ffb29e9c42dabda51185fdecd7dd", "d86e4a45f0074d11832892dc2353c4a3", "90238310655543b2a11b75ca7fd1e726", "832e3ab1589c42e4bf7fe9e84d34b4e4", "1ca167735ec940a387b02fbc93f827bb", "5131abad55034d7896b6244b51368722", "4fdc87ebdc4f4b108e8d2e4b820d3a62", "ec8ad3d1497f4be4af8a9f8b30053060", "0c93b8ec65c341d2927aa3fdcaf69736", "3fb649ec3fed425285c971f63146f9f5", "63bde89569a0433786559639a3edcc73", "c5c5632955f94450b472b19dfe26c831", "3a51ba19984140cbb595c7639edc8fc8", "7b334f4843e14b5eb32351a884539495", "501141a07ded4447884e84aede39598e", "190d38c3a6824faa96f3096932ca187e", "cabcd44775d24cc199dae2fe021f0ca1", "e311630ede5b4e01a1a41a15c72c1ceb", "85683bfe46bb4d4c988b64aee1569e4f", "c02e6e4947794b10ad1a974739e5b429", "76dbdc9a829942548be0444e79679e3a", "207944749ecd4ecd85ee59e12dbf53fe", "c2fdf8a947604af4829638a3cec51a84", "13c8fad66efe409c995e85f07754007e", "bb196705a1814abf9cfe12ed6ebc13bd", "5cad3c37275a479196b91a32de160de3", "0e8c7ae0c50447b8982966ed2651f051", "abbe664ca8624bde93a03b573e7a7604", "6fbf07ce310e4a6fb92e6da441a83aef", "fb5f1d3cfc904387938a993c76fe5183", "2bf70d9d271a425cae2a4c3acb422653", "6ac0bb4623e449718f9bf4d10c097e79", "55d83174c9264ea2b34a799fd91cd8b1", "2b23db625dd441c89060ce72b2125d77", "b614e5e0e8424bc0b6ed38c8fdae81ea", "6b628b8819354295badd05c352b0dcd2", "6fcec8b21cdf4a5483e5bc8b74de1851", "f37c06b82d394070aacb6dcd5a666f5d", "a5fbeac70f3549e8bce9af57829b9467", "895493cb35c540f796a650194247131b", "57e50172ddf64aa88d83701eb507be0e", "0ea1f35898df46c4ae79c4f192a79259", "b30b0d7b64ef44e2b09f93c87d9b453d", "36c1bd64980349269ff4455264b8f75e", "7e82594c67434da1888e0afe07ececd4", "2ed045704e81491f8daa33f2e65d55ff", "78eafefa1b384106a30a39529d855d0a", 
"8e6f6a1ff043487687a7b748ecefcc6f", "aeed8ee243854cdd9262e1632211a853", "2523cbe84f9f4c0dad084bc074a39038", "031042b3e35d47ee8a243172a90c24a7", "ad884a180a42467d824e6ce654a35d2c", "903aa9bcbb1e444895e4fb56a450dfd0", "2ca6ea7a92c649c4bd63df2a5f791c4b", "b8c313402cc74009abd4dde9c30d7b42", "72e246cd9065482fbc45c256a49cee70", "c836d8bbe5a0412e9bb69d1d74f22f33", "9f5e547a23e04c8e99794bc4bda8bbbf", "f5f8430f0dcf4325b86eadd97e73c49f", "344d068bf2dc47468e957941f0498bf2", "2a0e74c0400c435db5bfa184f02080b9", "75ecddc97ef748a881ebab2d93441b5a", "29dcf40da5c04d64890369b7364abf1a", "4b77e86724314936b7ef1af13aa3c2ec", "ccb224a2cabd429c9a24adf969457ace", "139489a30a664a76ad6aebd6bf9d15fd", "5905e6655c8840d3ba018a80324c1260", "294421182a034eeca0744f0ee260c95f", "6976c1d50d944367928aea95221497dd", "60dce6fdb66a401c8fb2ef544cd50e71", "48905aaf882d4bbda4dbc6ab237f63fb", "544dfa758d464291a5703716fbd7cab0", "e22124dc4309414fbb77c50cea435d81", "7541ee8e44424b2582b88f9031761c2c", "6e9b8499c67845eb8f631878145f2ce4", "ade2cf56143d4146bf611b889e7f15be", "682eb71ee4704cc1bc87bbaa4fd88ce1", "e6b5f001215e4e1bb0197737c0044a88", "07cfcdb368494466b41d219617e104d0", "ff25b2aa452b4482a0ca4cd79b0fce5c", "f88889237b254af9ab6de48091736773", "8f8484604d9a4e2f9812803c11979bcd", "498e839dad9641e4b7a7a6f58e8d2064", "1da8f2a4f0a04be0ac49507fc5a39a6c", "5cd91bda54ae41fdb8f0dd908b6c907f", "2babc0facf4a4c7388e06e07fe01c9a8", "8756c0bb65f44ba2887a42323df7f69d", "511d0f7656e54cd08f12b2d8d849618a", "934a9dd7613141668a98b89b666f1298", "c5658ae5b5fc42b29dc2dee4af4380b0", "233fe8c43399408390f5816cbbda87bd", "c4d56ab96ec9471d8d63acc0d22ecc06", "8bcf3e4c3ad943af8a53de8a35fa4f62", "e0432b02ec5c4303a0dd665152ac5e5c", "1b0913e77c3a42adbeb16da2cf504125", "59c7e3887eff4232bae50fb145cea1c8", "642cc483005b4111b338761b57a5894d", "c6a1d5fb08014113969e8ac1cc3a750d", "354d6e5abfe34bfbae4f12db3bf40c3d", "ad7901bf45ae459bbc59840b84fc9592", "fff471a9822c43dcb29b7da9ef8d5363", "b55b481ae8184da4bcadd47e94859803", 
"3d6620065647443f975656f54cc902ab", "2c05c8942f8c40428d70fc0a1ccbc65b", "0537c891baa94c9395f2833f06d6d355", "7462484bfba043949511e20bef99e48a", "540650e9ac2b49cb9f10dfbf5a28debd", "5c2042cfd9274ebeaf07f53ee96ba77b", "743ecbc5bb3d48c88aacfb65ade766d1", "3c30728848ac47858f84b8ed298397f4", "ee0ee61a3a55401f8a525a05a4150e7b", "648444be4f3b453e86e05317b815514a", "56bdbb9fc0634aa5aee44a229731584f", "a1ac9d59c05c46b797493f8bb0d7d1bb", "0163b07e777a44d581080b356a00fc46", "4254a69095a347b8a69a6698abab6ec5", "2179b292baae4e97bcfbf45f2450acfb", "272ddd2794b140f99f1e74ef3dfa4645", "e694886e08694c5ca8d719461279f66f", "d85c4e7d80274e038dd7a27ee0e1eb8b", "211bd4e5dce1412db17ac0113cacd6fe", "63eef4b4fe2444f684ccd604cea646fe", "488b7ef5d92849e782a1536fc9f8a496", "a4cf2dc7060147bba169e3861e5e15af", "3bb4ade00aed4f7997d569510335dded", "82b0173c80cc458b82907f818975adc6", "9f97a9dbaf7a458c8000fcb8f421da49", "18ba732281ca4bac98fd3c7a6b23cb29", "494d3ddb1f524442ba01f26dbaf26925", "55aff6c99f714ba1b7a8cc8098708aed", "9f692246e6864101a6f783609c2c9f8f", "9aea4996473941b7a483e8887fc9d355", "4d8f2b9838fa43ec9e1589039e7f34f2", "8f8b53d0a9e44f078ff89c8b452a2b5d", "4b9aff699cb443e5a3d27f8d16201b6a", "e2922459c68f42ce9a45f8b0b147080e", "960e5671ec0248a09526c97a094f8b86", "5ebedc75db3a410ba278cb6450754c77", "3b5ac35bda37444d9d4c8bad59cce3e6", "9d943d4ce7704daeb20bf3d76983746f", "c31c0f5c2edd424aa6306784c7171f0a", "7cef63aeb4b340cba68b0f29e02105fb", "6eb0668347d6466cacbd02cba1232527", "3bebaa2a5e8549b6b064f686d9a9ad3e", "87dc2afb532242c29893430181d76e82", "702f2ca2d7344efa9fa8bce3cd495821", "55251fc883e44f2fae5bb333abcda2b2", "6e9019fbdb3840af9cd5b588931ff9e2", "7bd2d0c10b124a5db1c31fdb1b90a442", "10f1c29e12214ff2ba2bfd3a90b9ae45", "ef2b929504f7476db18e614cee4c9085", "37ff763014b74693893260be60196c17", "097ad750bfae40fcbceb318350e76032", "b6395f0b8700489fa5a61267ceb96a28", "055feef07f8b4bdf9c2bbd54c43da4f7", "e464a7fba0514c05918e597bb6bc21f7", "c0ccbf04ea664460aff0e136fc64d2bb", 
"51e2ed9da1e34cea8a3ef8673b798462", "1ef59519ab8c4263b1b7f1de256d9592", "09076be40be4454b8d8ee6665c37a12d", "b39445bb1ece4cbf8129d685617f126d", "7b863a82fbc84299adf0c3fef0878b26", "9e923bd19c0c424495cc8bce188f8fb0", "5ae24c0ca1324a6a8b7803017fd83f29", "4ba90ec13cb44087b0d71121c4626fb5", "326b4706bd2041089c26fa295c729e23", "3e31f5ed5126417d9fb8046e565e39fd", "131402914e014cde8e1f79895e78ae7c", "a10203eb20d645b882d5dc6c512bdce5", "7f749553b3b04ebc8881f0be3015ee28", "443fc75017434cd7b76c8ff647339672", "061e46bfbce547bcb3e2e7ffba88e216", "33c139b906564b35bb1ab1564b2f158f", "e920b1bded8441a1aeb991cce781ecce", "4cb0327a8b4f4a2a882e51a374d0dc88", "ee9255e3dbf240cda71efcbdef0f0898", "8ee9c6b9afa64006a3896f7dde639a2e", "8944df99bc914f86a135fe7c272dd871", "a642ae7c12ea47d48acf758f3a4d3ab2", "29d54b0e795a48ca8b9d4f4812bdec38", "177de5d4346c496a865d8a4970fd1546", "5df1e0b31f744e7dbe1df7927e48a69c", "b4a4c708054044c69cd831d6a6ea0133", "6e74b603f29140c09fc10ae9b5a2cdfb", "21017fbebf794bc7a874c489d885fd72", "a5effffb66674bc580a513398a8f57b2", "069ee53a4f674e309153f5433c3e86ea", "a4590c87c2524e5fba9ac5a9c78aa84b", "148a5093003c49fdbee32958c72226d3", "7dd80ab18b84467e9e8251f3aed4a727", "b3fc7d8880a54cf1936248a29e89b87a", "1b51170785904761b298def1f33d7733", "bb6b6f85f0a34514b8e355091167552a", "0ec8968901274132b55d63b6afbf062a", "25e8ab54bda042298f80cc9ad6b8fb8e", "38a60013a0584090ba7ae6c22b89fc12", "0b49d08fc6db40c0b4a476cf8565a1a0", "2c2a7055b4e3486fa86278bb1020e506" ] }, "id": "VBv-xjEqByZr", "outputId": "2557eda4-049b-4d88-b5a0-b508605c348b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipython-input-3496743039.py:60: DeprecationWarning: The 'theme' parameter in the Blocks constructor will be removed in Gradio 6.0. You will need to pass 'theme' to Blocks.launch() instead.\n", " with gr.Blocks(\n", "/tmp/ipython-input-3496743039.py:60: DeprecationWarning: The 'css' parameter in the Blocks constructor will be removed in Gradio 6.0. 
You will need to pass 'css' to Blocks.launch() instead.\n", " with gr.Blocks(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().\n", "* Running on public URL: https://f02d585ac558d5f5c6.gradio.live\n", "\n", "This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipython-input-1459962086.py:30: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEmbeddings`.\n", " self.embedding_model = HuggingFaceEmbeddings(\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "16e7ec24f6c34211be67872719bf9563", "version_major": 2, "version_minor": 0 }, "text/plain": [ "modules.json: 0%| | 0.00/349 [00:00