diff --git "a/rag.ipynb" "b/rag.ipynb" new file mode 100644--- /dev/null +++ "b/rag.ipynb" @@ -0,0 +1,3007 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "i2nHJ6o7TQW3" + }, + "source": [ + "# πŸš€ Multi-Document RAG System with Advanced Retrieval\n", + "\n", + "## Project Overview\n", + "This notebook implements a **production-ready Retrieval-Augmented Generation (RAG)** system capable of:\n", + "- Ingesting **multiple PDF documents** into a unified knowledge base\n", + "- Answering questions using **hybrid retrieval** (vector + keyword search)\n", + "- Providing **cited, verifiable answers** with source attribution\n", + "- **Comparing information** across multiple documents\n", + "\n", + "## Architecture Summary\n", + "```\n", + "User Query β†’ Query Classification β†’ Query Expansion (Multi-Query)\n", + " ↓\n", + "HyDE Generation β†’ Hybrid Retrieval (Vector + BM25)\n", + " ↓\n", + "RRF Fusion β†’ Cross-Encoder Re-ranking β†’ LLM Generation\n", + " ↓\n", + "Answer Verification β†’ Final Response with Citations\n", + "```\n", + "\n", + "## Key Technologies\n", + "| Component | Technology |\n", + "|-----------|------------|\n", + "| LLM | Llama 3.3 70B (via Groq) |\n", + "| Embeddings | BAAI/bge-large-en-v1.5 |\n", + "| Re-ranker | BAAI/bge-reranker-v2-m3 |\n", + "| Vector DB | ChromaDB |\n", + "| Keyword Search | BM25 |\n", + "| UI | Gradio |\n", + "\n", + "## Requirements\n", + "- **Groq API Key** (free at console.groq.com)\n", + "- **Python 3.10+**\n", + "- **GPU recommended** but not required" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "AiaiOaSb-m1U", + "outputId": "4b784a1e-a4a0-43d8-b4ff-eb7b1255438b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "πŸ”₯ Cleaning up the environment\n", + "\u001b[33mWARNING: Skipping langchain-community as it is not installed.\u001b[0m\u001b[33m\n", + "\u001b[0m\u001b[33mWARNING: Skipping langchain-groq as it is not installed.\u001b[0m\u001b[33m\n", + "\u001b[0mπŸ“¦ Installing the Dependencies\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.0/61.0 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.0/18.0 MB\u001b[0m \u001b[31m56.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "dask-cudf-cu12 25.10.0 requires pandas<2.4.0dev0,>=2.0, which is not installed.\n", + "access 1.1.10.post3 requires pandas>=2.1.0, which is not installed.\n", + "access 1.1.10.post3 requires scipy>=1.14.1, which is not installed.\n", + "pandas-gbq 0.30.0 requires pandas>=1.1.4, which is not installed.\n", + "geemap 0.35.3 requires pandas, which is not installed.\n", + "yellowbrick 1.5 requires scikit-learn>=1.0.0, which is not installed.\n", + "yellowbrick 1.5 requires scipy>=1.0.0, which is not installed.\n", + "tensorflow-decision-forests 1.12.0 requires pandas, which is not installed.\n", + "librosa 0.11.0 requires scikit-learn>=1.1.0, which is not installed.\n", + "librosa 0.11.0 requires scipy>=1.6.0, which is not installed.\n", + "cmdstanpy 1.3.0 requires pandas, which is not installed.\n", + "albumentations 2.0.8 requires scipy>=1.10.0, which is not installed.\n", + "mizani 0.13.5 requires pandas>=2.2.0, which is not installed.\n", + "mizani 0.13.5 requires scipy>=1.8.0, which is not installed.\n", + "imbalanced-learn 0.14.1 requires scikit-learn<2,>=1.4.2, which is not installed.\n", + "imbalanced-learn 0.14.1 requires scipy<2,>=1.11.4, which is not installed.\n", + "hdbscan 0.8.41 requires scikit-learn>=1.6, which is not installed.\n", + "hdbscan 0.8.41 requires scipy>=1.0, which is not installed.\n", + "stumpy 1.13.0 requires scipy>=1.10, which is not installed.\n", + "spreg 1.8.4 requires pandas, which is not installed.\n", + "spreg 1.8.4 requires scikit-learn>=0.22, which is not installed.\n", + "spreg 1.8.4 requires scipy>=0.11, which is not installed.\n", + "spopt 0.7.0 requires pandas>=2.1.0, which is not installed.\n", + "spopt 0.7.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "spopt 0.7.0 requires scipy>=1.12.0, which is not installed.\n", + "datasets 4.0.0 requires pandas, which is not installed.\n", + "plotnine 0.14.5 requires pandas>=2.2.0, which is not installed.\n", + "plotnine 0.14.5 requires scipy>=1.8.0, which is not installed.\n", + "pymc 5.27.0 requires pandas>=0.24.0, which is not installed.\n", + "pymc 5.27.0 requires scipy>=1.4.1, which is not installed.\n", + "db-dtypes 1.5.0 requires pandas<3.0.0,>=1.5.3, which is not installed.\n", + "sklearn-pandas 2.2.0 requires pandas>=1.1.4, which is not installed.\n", + "sklearn-pandas 2.2.0 requires scikit-learn>=0.23.0, which is not installed.\n", + "sklearn-pandas 2.2.0 requires scipy>=1.5.1, which is not installed.\n", + "pynndescent 0.6.0 requires scikit-learn>=0.18, which is not installed.\n", + "pynndescent 0.6.0 requires scipy>=1.0, which is not installed.\n", + "cvxpy 1.6.7 requires scipy>=1.11.0, which is not installed.\n", + "scikit-image 0.25.2 requires scipy>=1.11.4, which is not installed.\n", + "mlxtend 0.23.4 requires pandas>=0.24.2, which is not installed.\n", + "mlxtend 0.23.4 requires scikit-learn>=1.3.1, which is not installed.\n", + "mlxtend 0.23.4 requires scipy>=1.2.1, which is not installed.\n", + "clarabel 0.11.1 requires scipy, which is not installed.\n", + "mapclassify 2.10.0 requires pandas>=2.1, which is not installed.\n", + "mapclassify 2.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "mapclassify 2.10.0 requires scipy>=1.12, which is not installed.\n", + "cudf-cu12 25.10.0 requires pandas<2.4.0dev0,>=2.0, which is not installed.\n", + "segregation 2.5.3 requires pandas, which is not installed.\n", + "segregation 2.5.3 requires scikit-learn>=0.21.3, which is not installed.\n", + "segregation 2.5.3 requires scipy, which is not installed.\n", + "bqplot 0.12.45 requires pandas<3.0.0,>=1.0.0, which is not installed.\n", + "osqp 1.0.5 requires scipy>=0.13.2, which is not installed.\n", + "giddy 2.3.8 requires scipy>=1.12, which is not installed.\n", + "pytensor 2.36.3 requires scipy<2,>=1, which is not installed.\n", + "matplotlib-venn 1.1.2 requires scipy, which is not installed.\n", + "mgwr 2.2.1 requires scipy>=0.11, which is not installed.\n", + "tsfresh 0.21.1 requires pandas>=0.25.0, which is not installed.\n", + "tsfresh 0.21.1 requires scikit-learn>=0.22.0, which is not installed.\n", + "tsfresh 0.21.1 requires scipy>=1.14.0; python_version >= \"3.10\", which is not installed.\n", + "arviz 0.22.0 requires pandas>=2.1.0, which is not installed.\n", + "arviz 0.22.0 requires scipy>=1.11.0, which is not installed.\n", + "inequality 1.1.2 requires pandas>=2.1, which is not installed.\n", + "inequality 1.1.2 requires scipy>=1.12, which is not installed.\n", + "missingno 0.5.2 requires scipy, which is not installed.\n", + "pysal 25.7 requires pandas>=1.4, which is not installed.\n", + "pysal 25.7 requires scikit-learn>=1.1, which is not installed.\n", + "pysal 25.7 requires scipy>=1.8, which is not installed.\n", + "xgboost 3.1.2 requires scipy, which is not installed.\n", + "prophet 1.2.1 requires pandas>=1.0.4, which is not installed.\n", + "cuml-cu12 25.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "cuml-cu12 25.10.0 requires scipy>=1.8.0, which is not installed.\n", + "dopamine-rl 4.1.2 requires pandas>=0.24.2, which is not installed.\n", + "bigquery-magics 0.10.3 requires pandas>=1.2.0, which is not installed.\n", + "hyperopt 0.2.7 requires scipy, which is not installed.\n", + "bokeh 3.7.3 requires pandas>=1.2, which is not installed.\n", + "spint 1.0.7 requires scipy>=0.11, which is not installed.\n", + "fastai 2.8.6 requires pandas, which is not installed.\n", + "fastai 2.8.6 requires scikit-learn, which is not installed.\n", + "fastai 2.8.6 requires scipy, which is not installed.\n", + "geopandas 1.1.2 requires pandas>=2.0.0, which is not installed.\n", + "pointpats 2.5.2 requires pandas!=1.5.0,>=1.4, which is not installed.\n", + "pointpats 2.5.2 requires scipy>=1.10, which is not installed.\n", + "shap 0.50.0 requires pandas, which is not installed.\n", + "shap 0.50.0 requires scikit-learn, which is not installed.\n", + "shap 0.50.0 requires scipy, which is not installed.\n", + "spglm 1.1.0 requires scipy>=1.8, which is not installed.\n", + "cufflinks 0.17.3 requires pandas>=0.19.2, which is not installed.\n", + "gradio 5.50.0 requires pandas<3.0,>=1.0, which is not installed.\n", + "xarray 2025.12.0 requires pandas>=2.2, which is not installed.\n", + "tobler 0.13.0 requires pandas>=2.2, which is not installed.\n", + "tobler 0.13.0 requires scipy>=1.13, which is not installed.\n", + "scs 3.2.10 requires scipy, which is not installed.\n", + "statsmodels 0.14.6 requires pandas!=2.1.0,>=1.4, which is not installed.\n", + "statsmodels 0.14.6 requires scipy!=1.9.2,>=1.8, which is not installed.\n", + "esda 2.8.1 requires pandas>=2.1, which is not installed.\n", + "esda 2.8.1 requires scikit-learn>=1.4, which is not installed.\n", + "esda 2.8.1 requires scipy>=1.12, which is not installed.\n", + "xarray-einstats 0.9.1 requires scipy>=1.11, which is not installed.\n", + "holoviews 1.22.1 requires pandas>=1.3, which is not installed.\n", + "momepy 0.11.0 requires pandas>=2.0, which is not installed.\n", + "treelite 4.4.1 requires scipy, which is not installed.\n", + "libpysal 4.14.0 requires pandas>=2.1.0, which is not installed.\n", + "libpysal 4.14.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "libpysal 4.14.0 requires scipy>=1.12.0, which is not installed.\n", + "jax 0.7.2 requires scipy>=1.13, which is not installed.\n", + "seaborn 0.13.2 requires pandas>=1.2, which is not installed.\n", + "jaxlib 0.7.2 requires scipy>=1.13, which is not installed.\n", + "umap-learn 0.5.9.post2 requires scikit-learn>=1.6, which is not installed.\n", + "umap-learn 0.5.9.post2 requires scipy>=1.3.1, which is not installed.\n", + "dask-cuda 25.10.0 requires pandas>=1.3, which is not installed.\n", + "spaghetti 1.7.6 requires pandas!=1.5.0,>=1.4, which is not installed.\n", + "spaghetti 1.7.6 requires scipy>=1.8, which is not installed.\n", + "quantecon 0.10.1 requires scipy>=1.5.0, which is not installed.\n", + "bigframes 2.31.0 requires pandas>=1.5.3, which is not installed.\n", + "lightgbm 4.6.0 requires scipy, which is not installed.\n", + "yfinance 0.2.66 requires pandas>=1.3.0, which is not installed.\n", + "opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= \"3.9\", but you have numpy 1.26.4 which is incompatible.\n", + "pytensor 2.36.3 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= \"3.9\", but you have numpy 1.26.4 which is incompatible.\n", + "shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.\n", + "tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= \"3.9\", but you have numpy 1.26.4 which is incompatible.\n", + "rasterio 1.5.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.\n", + "jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.7/12.7 MB\u001b[0m \u001b[31m124.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "access 1.1.10.post3 requires scipy>=1.14.1, which is not installed.\n", + "mizani 0.13.5 requires scipy>=1.8.0, which is not installed.\n", + "spreg 1.8.4 requires scikit-learn>=0.22, which is not installed.\n", + "spreg 1.8.4 requires scipy>=0.11, which is not installed.\n", + "spopt 0.7.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "spopt 0.7.0 requires scipy>=1.12.0, which is not installed.\n", + "plotnine 0.14.5 requires scipy>=1.8.0, which is not installed.\n", + "pymc 5.27.0 requires scipy>=1.4.1, which is not installed.\n", + "sklearn-pandas 2.2.0 requires scikit-learn>=0.23.0, which is not installed.\n", + "sklearn-pandas 2.2.0 requires scipy>=1.5.1, which is not installed.\n", + "mlxtend 0.23.4 requires scikit-learn>=1.3.1, which is not installed.\n", + "mlxtend 0.23.4 requires scipy>=1.2.1, which is not installed.\n", + "mapclassify 2.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "mapclassify 2.10.0 requires scipy>=1.12, which is not installed.\n", + "segregation 2.5.3 requires scikit-learn>=0.21.3, which is not installed.\n", + "segregation 2.5.3 requires scipy, which is not installed.\n", + "giddy 2.3.8 requires scipy>=1.12, which is not installed.\n", + "mgwr 2.2.1 requires scipy>=0.11, which is not installed.\n", + "tsfresh 0.21.1 requires scikit-learn>=0.22.0, which is not installed.\n", + "tsfresh 0.21.1 requires scipy>=1.14.0; python_version >= \"3.10\", which is not installed.\n", + "arviz 0.22.0 requires scipy>=1.11.0, which is not installed.\n", + "inequality 1.1.2 requires scipy>=1.12, which is not installed.\n", + "pysal 25.7 requires scikit-learn>=1.1, which is not installed.\n", + "pysal 25.7 requires scipy>=1.8, which is not installed.\n", + "cuml-cu12 25.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "cuml-cu12 25.10.0 requires scipy>=1.8.0, which is not installed.\n", + "spint 1.0.7 requires scipy>=0.11, which is not installed.\n", + "fastai 2.8.6 requires scikit-learn, which is not installed.\n", + "fastai 2.8.6 requires scipy, which is not installed.\n", + "pointpats 2.5.2 requires scipy>=1.10, which is not installed.\n", + "shap 0.50.0 requires scikit-learn, which is not installed.\n", + "shap 0.50.0 requires scipy, which is not installed.\n", + "spglm 1.1.0 requires scipy>=1.8, which is not installed.\n", + "tobler 0.13.0 requires scipy>=1.13, which is not installed.\n", + "statsmodels 0.14.6 requires scipy!=1.9.2,>=1.8, which is not installed.\n", + "esda 2.8.1 requires scikit-learn>=1.4, which is not installed.\n", + "esda 2.8.1 requires scipy>=1.12, which is not installed.\n", + "xarray-einstats 0.9.1 requires scipy>=1.11, which is not installed.\n", + "libpysal 4.14.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "libpysal 4.14.0 requires scipy>=1.12.0, which is not installed.\n", + "spaghetti 1.7.6 requires scipy>=1.8, which is not installed.\n", + "shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.\n", + "tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.6/60.6 kB\u001b[0m \u001b[31m3.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.2/38.2 MB\u001b[0m \u001b[31m19.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "yellowbrick 1.5 requires scikit-learn>=1.0.0, which is not installed.\n", + "librosa 0.11.0 requires scikit-learn>=1.1.0, which is not installed.\n", + "imbalanced-learn 0.14.1 requires scikit-learn<2,>=1.4.2, which is not installed.\n", + "hdbscan 0.8.41 requires scikit-learn>=1.6, which is not installed.\n", + "spreg 1.8.4 requires scikit-learn>=0.22, which is not installed.\n", + "spopt 0.7.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "sklearn-pandas 2.2.0 requires scikit-learn>=0.23.0, which is not installed.\n", + "pynndescent 0.6.0 requires scikit-learn>=0.18, which is not installed.\n", + "mlxtend 0.23.4 requires scikit-learn>=1.3.1, which is not installed.\n", + "mapclassify 2.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "segregation 2.5.3 requires scikit-learn>=0.21.3, which is not installed.\n", + "tsfresh 0.21.1 requires scikit-learn>=0.22.0, which is not installed.\n", + "pysal 25.7 requires scikit-learn>=1.1, which is not installed.\n", + "cuml-cu12 25.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "fastai 2.8.6 requires scikit-learn, which is not installed.\n", + "sentence-transformers 5.2.0 requires scikit-learn, which is not installed.\n", + "shap 0.50.0 requires scikit-learn, which is not installed.\n", + "esda 2.8.1 requires scikit-learn>=1.4, which is not installed.\n", + "libpysal 4.14.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "umap-learn 0.5.9.post2 requires scikit-learn>=1.6, which is not installed.\n", + "access 1.1.10.post3 requires scipy>=1.14.1, but you have scipy 1.13.1 which is incompatible.\n", + "pytensor 2.36.3 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "tsfresh 0.21.1 requires scipy>=1.14.0; python_version >= \"3.10\", but you have scipy 1.13.1 which is incompatible.\n", + "shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.\n", + "tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m397.0/397.0 kB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m311.8/311.8 kB\u001b[0m \u001b[31m31.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m65.5/65.5 kB\u001b[0m \u001b[31m7.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "mlxtend 0.23.4 requires scikit-learn>=1.3.1, which is not installed.\n", + "segregation 2.5.3 requires scikit-learn>=0.21.3, which is not installed.\n", + "pysal 25.7 requires scikit-learn>=1.1, which is not installed.\n", + "cuml-cu12 25.10.0 requires scikit-learn>=1.4, which is not installed.\n", + "fastai 2.8.6 requires scikit-learn, which is not installed.\n", + "sentence-transformers 5.2.0 requires scikit-learn, which is not installed.\n", + "shap 0.50.0 requires scikit-learn, which is not installed.\n", + "esda 2.8.1 requires scikit-learn>=1.4, which is not installed.\n", + "libpysal 4.14.0 requires scikit-learn>=1.4.0, which is not installed.\n", + "langgraph-prebuilt 1.0.5 requires langchain-core>=1.0.0, but you have langchain-core 0.2.40 which is incompatible.\n", + "shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.\n", + "tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.\n", + "google-adk 1.21.0 requires tenacity<10.0.0,>=9.0.0, but you have tenacity 8.5.0 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m40.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m55.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m397.1/397.1 kB\u001b[0m \u001b[31m36.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m51.0/51.0 kB\u001b[0m \u001b[31m4.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "langgraph-prebuilt 1.0.5 requires langchain-core>=1.0.0, but you have langchain-core 0.2.43 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m23.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m137.5/137.5 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m52.0/52.0 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m584.3/584.3 kB\u001b[0m \u001b[31m20.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.4/2.4 MB\u001b[0m \u001b[31m74.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m278.2/278.2 kB\u001b[0m \u001b[31m30.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m107.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m17.4/17.4 MB\u001b[0m \u001b[31m129.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m72.5/72.5 kB\u001b[0m \u001b[31m8.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m220.0/220.0 kB\u001b[0m \u001b[31m24.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m66.4/66.4 kB\u001b[0m \u001b[31m7.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m132.6/132.6 kB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m167.6/167.6 kB\u001b[0m \u001b[31m19.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.6/60.6 kB\u001b[0m \u001b[31m7.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m88.0/88.0 kB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-exporter-otlp-proto-common==1.37.0, but you have opentelemetry-exporter-otlp-proto-common 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-proto==1.37.0, but you have opentelemetry-proto 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-sdk~=1.37.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-gcp-logging 1.11.0a0 requires opentelemetry-sdk<1.39.0,>=1.35.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.\n", + "google-adk 1.21.0 requires opentelemetry-api<=1.37.0,>=1.37.0, but you have opentelemetry-api 1.39.1 which is incompatible.\n", + "google-adk 1.21.0 requires opentelemetry-sdk<=1.37.0,>=1.37.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.\n", + "google-adk 1.21.0 requires tenacity<10.0.0,>=9.0.0, but you have tenacity 8.5.0 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.1/227.1 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.9/8.9 MB\u001b[0m \u001b[31m119.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tsfresh 0.21.1 requires scipy>=1.14.0; python_version >= \"3.10\", but you have scipy 1.13.1 which is incompatible.\n", + "shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.8/295.8 kB\u001b[0m \u001b[31m9.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCRITICAL: Go to 'Runtime' > 'Restart session' NOW.\n", + "After restarting, run Cell 2.\n", + "Requirement already satisfied: gradio in /usr/local/lib/python3.12/dist-packages (5.50.0)\n", + "Requirement already satisfied: aiofiles<25.0,>=22.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (24.1.0)\n", + "Requirement already satisfied: anyio<5.0,>=3.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (4.12.1)\n", + "Requirement already satisfied: brotli>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (1.2.0)\n", + "Requirement already satisfied: fastapi<1.0,>=0.115.2 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.123.10)\n", + "Requirement already satisfied: ffmpy in /usr/local/lib/python3.12/dist-packages (from gradio) (1.0.0)\n", + "Requirement already satisfied: gradio-client==1.14.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (1.14.0)\n", + "Requirement already satisfied: groovy~=0.1 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.1.2)\n", + "Requirement already satisfied: httpx<1.0,>=0.24.1 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.28.1)\n", + "Requirement already satisfied: huggingface-hub<2.0,>=0.33.5 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.36.0)\n", + "Requirement already satisfied: jinja2<4.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (3.1.6)\n", + "Requirement already satisfied: markupsafe<4.0,>=2.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (3.0.3)\n", + "Requirement already satisfied: numpy<3.0,>=1.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (1.26.4)\n", + "Requirement already satisfied: orjson~=3.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (3.11.5)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.12/dist-packages (from gradio) (24.2)\n", + "Requirement already satisfied: pandas<3.0,>=1.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (2.2.2)\n", + "Requirement already satisfied: pillow<12.0,>=8.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (11.3.0)\n", + "Requirement already satisfied: pydantic<=2.12.3,>=2.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (2.12.3)\n", + "Requirement already satisfied: pydub in /usr/local/lib/python3.12/dist-packages (from gradio) (0.25.1)\n", + "Requirement already satisfied: python-multipart>=0.0.18 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.0.21)\n", + "Requirement already satisfied: pyyaml<7.0,>=5.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (6.0.3)\n", + "Requirement already satisfied: ruff>=0.9.3 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.14.11)\n", + "Requirement already satisfied: safehttpx<0.2.0,>=0.1.6 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.1.7)\n", + "Requirement already satisfied: semantic-version~=2.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (2.10.0)\n", + "Requirement already satisfied: starlette<1.0,>=0.40.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.50.0)\n", + "Requirement already satisfied: tomlkit<0.14.0,>=0.12.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.13.3)\n", + "Requirement already satisfied: typer<1.0,>=0.12 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.21.1)\n", + "Requirement already satisfied: typing-extensions~=4.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (4.15.0)\n", + "Requirement already satisfied: uvicorn>=0.14.0 in /usr/local/lib/python3.12/dist-packages (from gradio) (0.40.0)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from gradio-client==1.14.0->gradio) (2025.3.0)\n", + "Requirement already satisfied: websockets<16.0,>=13.0 in /usr/local/lib/python3.12/dist-packages (from gradio-client==1.14.0->gradio) (15.0.1)\n", + "Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.12/dist-packages (from anyio<5.0,>=3.0->gradio) (3.11)\n", + "Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from fastapi<1.0,>=0.115.2->gradio) (0.0.4)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1.0,>=0.24.1->gradio) (2026.1.4)\n", + "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1.0,>=0.24.1->gradio) (1.0.9)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1.0,>=0.24.1->gradio) (0.16.0)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=0.33.5->gradio) (3.20.2)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=0.33.5->gradio) (2.32.4)\n", + "Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=0.33.5->gradio) (4.67.1)\n", + "Requirement already satisfied: hf-xet<2.0.0,>=1.1.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=0.33.5->gradio) (1.2.0)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas<3.0,>=1.0->gradio) (2.9.0.post0)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas<3.0,>=1.0->gradio) (2025.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas<3.0,>=1.0->gradio) (2025.3)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic<=2.12.3,>=2.0->gradio) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.41.4 in /usr/local/lib/python3.12/dist-packages (from pydantic<=2.12.3,>=2.0->gradio) (2.41.4)\n", + "Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic<=2.12.3,>=2.0->gradio) (0.4.2)\n", + "Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0,>=0.12->gradio) (8.3.1)\n", + "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0,>=0.12->gradio) (1.5.4)\n", + "Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.12/dist-packages (from typer<1.0,>=0.12->gradio) (13.9.4)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas<3.0,>=1.0->gradio) (1.17.0)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=10.11.0->typer<1.0,>=0.12->gradio) (4.0.0)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=10.11.0->typer<1.0,>=0.12->gradio) (2.19.2)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->huggingface-hub<2.0,>=0.33.5->gradio) (3.4.4)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests->huggingface-hub<2.0,>=0.33.5->gradio) (2.5.0)\n", + "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0,>=0.12->gradio) (0.1.2)\n" + ] + } + ], + "source": [ + "# ==========================================\n", + "# CELL 1: DEPENDENCY Installation\n", + "# ==========================================\n", + "import os\n", + "\n", + "print(\"πŸ”₯ Cleaning up the environment\")\n", + "\n", + "# 1. Uninstall EVERYTHING to ensure no \"ghost\" versions remain\n", + "!pip uninstall -y -q numpy pandas scipy scikit-learn langchain langchain-community langchain-core langchain-groq\n", + "\n", + "\n", + "print(\"πŸ“¦ Installing the Dependencies\")\n", + "\n", + "# CORE MATH LIBRARIES\n", + "!pip install -q numpy==1.26.4\n", + "!pip install -q pandas==2.2.2\n", + "!pip install -q scipy==1.13.1\n", + "\n", + "# LANGCHAIN 0.2 ECOSYSTEM (\n", + "# We strictly pin these to the 0.2 series to avoid the breaking 0.3 update\n", + "!pip install -q langchain-core==0.2.40\n", + "!pip install -q langchain-community==0.2.16\n", + "!pip install -q langchain==0.2.16\n", + "!pip install -q langchain-groq==0.1.9\n", + "!pip install -q langchain-text-splitters==0.2.4\n", + "\n", + "# VECTOR DATABASE & EMBEDDINGS\n", + "!pip install -q chromadb==0.5.5\n", + "!pip install -q sentence-transformers==3.0.1\n", + "!pip install -q pypdf==4.3.1\n", + "!pip install -q rank-bm25==0.2.2\n", + "\n", + "\n", + "print(\"CRITICAL: Go to 'Runtime' > 'Restart session' NOW.\")\n", + "print(\"After restarting, run Cell 2.\")\n", + "\n", + "!pip install gradio" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e7y44MyPVC3_" + }, + "source": [ + "This cell imports all required libraries and sets up the compute device (GPU if available, else CPU).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "iUn1NfUQBHPK", + "outputId": "bf945e96-5495-4d71-e3d2-b6bae439b0dc" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "System ready. Running on: CUDA\n" + ] + } + ], + "source": [ + "import os\n", + "import sys\n", + "import json\n", + "import torch\n", + "import numpy as np\n", + "from typing import List, Dict, Tuple, Optional\n", + "from collections import defaultdict\n", + "from dataclasses import dataclass\n", + "import hashlib\n", + "import gradio as gr\n", + "from datetime import datetime\n", + "\n", + "# Core Imports\n", + "from langchain_community.document_loaders import PyPDFLoader\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain_community.vectorstores import Chroma\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "from langchain_community.retrievers import BM25Retriever\n", + "from langchain_groq import ChatGroq\n", + "from langchain.schema import Document\n", + "\n", + "# Advanced Models\n", + "from sentence_transformers import SentenceTransformer, CrossEncoder\n", + "\n", + "# Setup Device\n", + "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "print(f\"System ready. Running on: {device.upper()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gojw-RfdVOa5" + }, + "source": [ + "## Core Data Structures\n", + "\n", + "### QueryProfile Dataclass\n", + "\n", + "**Purpose**: Encapsulates the result of query classification to guide retrieval strategy.\n", + "\n", + "| Field | Type | Description | Example Values |\n", + "|-------|------|-------------|----------------|\n", + "| `query_type` | str | Category of question | `\"factoid\"`, `\"summary\"`, `\"comparison\"`, `\"extraction\"`, `\"reasoning\"` |\n", + "| `intent` | str | Same as query_type (for extensibility) | Same as above |\n", + "| `needs_multi_docs` | bool | Does query span multiple documents? | `True` for comparison queries |\n", + "| `requires_comparison` | bool | Is this a compare/contrast question? | `True` if \"compare\", \"difference\" in query |\n", + "| `answer_style` | str | How to format the answer | `\"direct\"`, `\"bullets\"`, `\"steps\"` |\n", + "| `k` | int | Number of chunks to retrieve | 5-12 (auto-tuned based on query type) |\n", + "\n", + "### Query Type β†’ Retrieval Strategy Mapping:\n", + "```\n", + "factoid β†’ k=6, style=direct (simple fact lookup)\n", + "summary β†’ k=10, style=bullets (overview questions)\n", + "comparison β†’ k=12, style=bullets (cross-document comparison)\n", + "extraction β†’ k=8, style=direct (extract specific info)\n", + "reasoning β†’ k=10, style=steps (explain how/why)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "zs63mc1EBls0" + }, + "outputs": [], + "source": [ + "@dataclass\n", + "class QueryProfile:\n", + " query_type: str\n", + " intent: str\n", + " needs_multi_docs: bool\n", + " requires_comparison: bool\n", + " answer_style: str\n", + " k: int\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_voziDjOVl6M" + }, + "source": [ + "### QueryCache Class\n", + "\n", + "**Purpose**: LRU-style cache to avoid redundant LLM calls for repeated queries.\n", + "\n", + "#### How It Works:\n", + "1. **Key Generation**: MD5 hash of query string\n", + "2. **Storage**: Dictionary mapping hash β†’ response\n", + "3. **Eviction**: FIFO (First-In-First-Out) when `max_size` exceeded\n", + "\n", + "#### Methods:\n", + "| Method | Input | Output | Description |\n", + "|--------|-------|--------|-------------|\n", + "| `get(query)` | Query string | Response or `None` | Check if query is cached |\n", + "| `set(query, response)` | Query + Response | None | Store result, evict oldest if full |\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "oWpJgmpwN1tS" + }, + "outputs": [], + "source": [ + "class QueryCache:\n", + " \"\"\"Simple cache for repeated queries\"\"\"\n", + " def __init__(self, max_size=100):\n", + " self.cache = {}\n", + " self.max_size = max_size\n", + "\n", + " def get(self, query: str) -> Optional[str]:\n", + " key = hashlib.md5(query.encode()).hexdigest()\n", + " return self.cache.get(key)\n", + "\n", + " def set(self, query: str, response: str):\n", + " key = hashlib.md5(query.encode()).hexdigest()\n", + " if len(self.cache) >= self.max_size:\n", + " self.cache.pop(next(iter(self.cache)))\n", + " self.cache[key] = response\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rwBQl2SjVuW3" + }, + "source": [ + "### SemanticChunker Class\n", + "\n", + "**Purpose**: Split documents into semantically coherent chunks (vs. arbitrary character-based splits).\n", + "\n", + "#### Why Semantic Chunking?\n", + "| Traditional Chunking | Semantic Chunking |\n", + "|---------------------|-------------------|\n", + "| Splits at fixed character count | Splits at topic boundaries |\n", + "| May cut mid-sentence/concept | Preserves complete ideas |\n", + "| Lower retrieval relevance | Higher retrieval relevance |\n", + "\n", + "#### Algorithm:\n", + "```\n", + "1. Split text into sentences (by \". \")\n", + "2. Encode each sentence with SentenceTransformer\n", + "3. For each consecutive sentence pair:\n", + " - Compute cosine similarity\n", + " - If similarity > threshold AND size < max:\n", + " β†’ Add to current chunk\n", + " - Else:\n", + " β†’ Save chunk, start new one\n", + "4. Return list of semantic chunks\n", + "```\n", + "\n", + "#### Parameters:\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `model_name` | `all-MiniLM-L6-v2` | Sentence embedding model |\n", + "| `max_chunk_size` | 1000 | Maximum characters per chunk |\n", + "| `similarity_threshold` | 0.5 | Cosine similarity threshold for grouping |" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "ZqF22Kb5Nv7e" + }, + "outputs": [], + "source": [ + "class SemanticChunker:\n", + " \"\"\"Advanced semantic chunking using sentence embeddings\"\"\"\n", + " def __init__(self, model_name=\"sentence-transformers/all-MiniLM-L6-v2\"):\n", + " self.model = SentenceTransformer(model_name, device=device)\n", + "\n", + " def chunk_document(self, text: str, max_chunk_size=1000, similarity_threshold=0.5):\n", + " \"\"\"Split text into semantically coherent chunks\"\"\"\n", + " sentences = text.replace('\\n', ' ').split('. ')\n", + " sentences = [s.strip() + '.' for s in sentences if s.strip()]\n", + "\n", + " if not sentences:\n", + " return [text]\n", + "\n", + " embeddings = self.model.encode(sentences)\n", + " chunks = []\n", + " current_chunk = [sentences[0]]\n", + " current_size = len(sentences[0])\n", + "\n", + " for i in range(1, len(sentences)):\n", + " similarity = np.dot(embeddings[i-1], embeddings[i]) / (\n", + " np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])\n", + " )\n", + " sentence_len = len(sentences[i])\n", + "\n", + " if similarity > similarity_threshold and current_size + sentence_len < max_chunk_size:\n", + " current_chunk.append(sentences[i])\n", + " current_size += sentence_len\n", + " else:\n", + " chunks.append(' '.join(current_chunk))\n", + " current_chunk = [sentences[i]]\n", + " current_size = sentence_len\n", + "\n", + " if current_chunk:\n", + " chunks.append(' '.join(current_chunk))\n", + "\n", + " return chunks\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yTCw6h2YWCZq" + }, + "source": [ + "### ReciprocalRankFusion (RRF) Class\n", + "\n", + "**Purpose**: Combine multiple ranked retrieval lists into a single optimal ranking.\n", + "\n", + "#### The Problem RRF Solves:\n", + "When using multiple retrievers (vector search, keyword search, etc.), each returns a ranked list. How do we combine them?\n", + "\n", + "#### RRF Formula:\n", + "```\n", + "score(doc) = Ξ£ 1 / (k + rank_i + 1)\n", + "```\n", + "Where:\n", + "- `k` = 60 (smoothing constant, standard value)\n", + "- `rank_i` = position of document in retrieval list i (0-indexed)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "nxiRtNGINqQr" + }, + "outputs": [], + "source": [ + "class ReciprocalRankFusion:\n", + " \"\"\"RRF for combining multiple retrieval results\"\"\"\n", + " @staticmethod\n", + " def fuse(retrieval_results: List[List[Document]], k=60) -> List[Document]:\n", + " doc_scores = defaultdict(float)\n", + " doc_map = {}\n", + "\n", + " for docs in retrieval_results:\n", + " for rank, doc in enumerate(docs):\n", + " doc_id = doc.metadata.get('chunk_id') or f\"{doc.metadata.get('pdf_id', 'unknown')}::{hashlib.md5(doc.page_content.encode()).hexdigest()}\"\n", + " doc_scores[doc_id] += 1 / (k + rank + 1)\n", + " doc_map[doc_id] = doc\n", + "\n", + " sorted_docs = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)\n", + " return [doc_map[doc_id] for doc_id, _ in sorted_docs]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VtUYO8vPWMzp" + }, + "source": [ + "## EnhancedRAG - Complete RAG Engine\n", + "\n", + "This is the **core class** that orchestrates the entire RAG pipeline. All document ingestion, retrieval, and generation flows through this class.\n", + "\n", + "---\n", + "\n", + "### Class Architecture\n", + "\n", + "```\n", + "EnhancedRAGv3\n", + "β”œβ”€β”€ Storage Layer\n", + "β”‚ β”œβ”€β”€ vector_db (ChromaDB) # Semantic search index\n", + "β”‚ β”œβ”€β”€ bm25_retriever # Keyword search index\n", + "β”‚ β”œβ”€β”€ documents (List) # All document chunks\n", + "β”‚ └── pdf_metadata (Dict) # PDF tracking {name: {path, pages, chunks, pdf_id}}\n", + "β”‚\n", + "β”œβ”€β”€ Model Layer (Lazy-loaded for memory efficiency)\n", + "β”‚ β”œβ”€β”€ embedding_model # BAAI/bge-large-en-v1.5 (~1.2GB)\n", + "β”‚ β”œβ”€β”€ cross_encoder # BAAI/bge-reranker-v2-m3 (~560MB)\n", + "β”‚ β”œβ”€β”€ semantic_chunker # all-MiniLM-L6-v2 (~90MB)\n", + "β”‚ └── query_model # all-MiniLM-L6-v2 (~90MB)\n", + "β”‚\n", + "β”œβ”€β”€ LLM Layer\n", + "β”‚ └── llm (ChatGroq) # Llama 3.3 70B via Groq API\n", + "β”‚\n", + "└── Utility Layer\n", + " β”œβ”€β”€ cache (QueryCache) # Response caching (max 100 queries)\n", + " └── api_key # Groq API key\n", + "```\n", + "\n", + "---\n", + "\n", + "### Method Reference\n", + "\n", + "| Method | Purpose | Key Details |\n", + "|--------|---------|-------------|\n", + "| `__init__(api_key)` | Initialize system | Sets up LLM, all other models lazy-loaded |\n", + "| `load_models()` | Load ML models | BGE embeddings β†’ CrossEncoder β†’ Chunker β†’ Query model |\n", + "| `ingest_pdf(path)` | Process PDF | Extract β†’ Chunk β†’ Index in ChromaDB + BM25 |\n", + "| `chat(query)` | Answer questions | Full pipeline: classify β†’ expand β†’ retrieve β†’ rerank β†’ generate |\n", + "| `summarize_document()` | Summarize all docs | Map-reduce: batch summaries β†’ final synthesis |\n", + "\n", + "---\n", + "\n", + "### 1. Initialization & Model Loading\n", + "\n", + "**`__init__(api_key)`** - Sets up the system with Groq API key. Models are NOT loaded yet (lazy loading for faster startup).\n", + "\n", + "**`load_models()`** - Loads all ML models with progress tracking:\n", + "\n", + "| Progress | Model | Size | Purpose |\n", + "|----------|-------|------|---------|\n", + "| 10% β†’ 40% | BAAI/bge-large-en-v1.5 | ~1.2GB | Document & query embeddings (1024-dim, normalized) |\n", + "| 40% β†’ 60% | BAAI/bge-reranker-v2-m3 | ~560MB | Cross-encoder re-ranking |\n", + "| 60% β†’ 80% | all-MiniLM-L6-v2 | ~90MB | Semantic chunking |\n", + "| 80% β†’ 100% | all-MiniLM-L6-v2 | ~90MB | Query processing |\n", + "\n", + "---\n", + "\n", + "### 2. Document Ingestion Pipeline\n", + "\n", + "**`ingest_pdf(pdf_path, use_semantic_chunking=True)`**\n", + "\n", + "```\n", + "PDF File\n", + " ↓\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ 1. PyPDFLoader β”‚ Extract text from each page\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " ↓\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ 2. Duplicate Check β”‚ Skip if pdf_name already in pdf_metadata\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " ↓\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ 3. Chunking β”‚ SemanticChunker (default) or RecursiveTextSplitter\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " ↓\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ 4. Add Metadata β”‚ {page, source, pdf_name, pdf_id, chunk_id}\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " ↓\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ 5. Rebuild Indexes β”‚ ChromaDB (vector) + BM25 (keyword) with ALL docs\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + "```\n", + "\n", + "**Chunk Metadata Schema:**\n", + "```python\n", + "{\n", + " \"page\": 0, # 0-indexed page number\n", + " \"source\": \"/path/to/doc.pdf\", # Full file path\n", + " \"pdf_name\": \"doc.pdf\", # Filename only\n", + " \"pdf_id\": \"a1b2c3d4\", # 8-char MD5 hash (unique per PDF)\n", + " \"chunk_id\": \"a1b2c3d4-42\" # Unique chunk identifier\n", + "}\n", + "```\n", + "\n", + "---\n", + "\n", + "### 3. Query Classification\n", + "\n", + "**`_classify_query(query) β†’ QueryProfile`**\n", + "\n", + "Determines optimal retrieval strategy using LLM + heuristic fallback:\n", + "\n", + "| Query Type | Trigger Keywords | k | Answer Style |\n", + "|------------|------------------|---|--------------|\n", + "| `factoid` | \"what is\", \"who is\", \"define\" | 6 | direct |\n", + "| `summary` | \"summarize\", \"overview\", \"key points\" | 10 | bullets |\n", + "| `comparison` | \"compare\", \"difference\", \"vs\", \"between\" | 12 | bullets |\n", + "| `extraction` | (default) | 8 | direct |\n", + "| `reasoning` | \"explain\", \"how does\", \"why\" | 10 | steps |\n", + "\n", + "**Returns:** `QueryProfile(query_type, intent, needs_multi_docs, requires_comparison, answer_style, k)`\n", + "\n", + "---\n", + "\n", + "### 4. Query Enhancement Techniques\n", + "\n", + "**`_generate_hyde_document(query) β†’ str`** - HyDE (Hypothetical Document Embeddings)\n", + "\n", + "```\n", + "Query: \"What is attention?\"\n", + " ↓ LLM generates\n", + "HyDE Doc: \"The attention mechanism is a neural network component\n", + " that allows models to focus on relevant parts...\"\n", + " ↓\n", + "Used for retrieval (matches real docs better than short query!)\n", + "```\n", + "\n", + "**`_expand_query(query) β†’ List[str]`** - Multi-Query Expansion\n", + "\n", + "```\n", + "Original: \"What are the benefits of transformers?\"\n", + " ↓ LLM generates 3 variants\n", + "[\n", + " \"What are the benefits of transformers?\", # Original\n", + " \"What advantages do transformer models offer?\", # Variant 1\n", + " \"Why are transformers better than RNNs?\", # Variant 2\n", + " \"What makes transformer architecture effective?\" # Variant 3\n", + "]\n", + " ↓\n", + "All used for retrieval β†’ RRF fuses results\n", + "```\n", + "\n", + "---\n", + "\n", + "### 5. Hybrid Retrieval Pipeline\n", + "\n", + "**`_retrieve_with_rrf(query, k, fetch_factor=2) β†’ List[Document]`**\n", + "\n", + "```\n", + "Query\n", + " β”‚\n", + " β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + " ↓ ↓\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ Vector Search (MMR)β”‚ β”‚ BM25 Search β”‚\n", + "β”‚ β”‚ β”‚ β”‚\n", + "β”‚ β€’ Semantic match β”‚ β”‚ β€’ Exact keywords β”‚\n", + "β”‚ β€’ lambda=0.6 β”‚ β”‚ β€’ Term frequency β”‚\n", + "β”‚ (relevance+ β”‚ β”‚ β”‚\n", + "β”‚ diversity) β”‚ β”‚ β”‚\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " β”‚ β”‚\n", + " β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " ↓\n", + " β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + " β”‚ RRF Fusion β”‚ score = Ξ£ 1/(60 + rank + 1)\n", + " β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " ↓\n", + " Fused ranked list\n", + "```\n", + "\n", + "**Why Hybrid?**\n", + "- Vector: Understands synonyms, semantic similarity\n", + "- BM25: Exact term matching, handles rare words\n", + "- Combined: Best of both worlds\n", + "\n", + "---\n", + "\n", + "### 6. Re-ranking & PDF Diversity\n", + "\n", + "**`_rerank_documents(query, documents, top_k) β†’ List[(Document, score)]`**\n", + "\n", + "Uses **CrossEncoder** for neural re-ranking:\n", + "- Bi-encoder (initial): Fast but less accurate (query/doc encoded separately)\n", + "- Cross-encoder (re-rank): Slower but accurate (query+doc processed together)\n", + "\n", + "**Comparison Query Boost:** For comparison queries, documents containing keywords like \"compared to\", \"in contrast\", \"whereas\" get +10% score boost per keyword.\n", + "\n", + "---\n", + "\n", + "**`_ensure_pdf_diversity(query, documents, target_docs=2) β†’ List[Document]`**\n", + "\n", + "For multi-document queries, ensures chunks from ALL loaded PDFs:\n", + "\n", + "```\n", + "Problem: Query about \"both papers\" returns only Paper A chunks\n", + " ↓\n", + "Solution: Detect missing PDFs β†’ filtered vector search β†’ add their chunks\n", + " ↓\n", + "Result: [chunk_A1, chunk_A2, chunk_A3, chunk_B1, chunk_B2]\n", + "```\n", + "\n", + "---\n", + "\n", + "### 7. Main Chat Pipeline\n", + "\n", + "**`chat(query, use_hyde=True, use_multi_query=True) β†’ (answer, citations, metadata)`**\n", + "\n", + "```\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ 1. CACHE CHECK β”‚ Return immediately if query cached β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 2. CLASSIFY QUERY β”‚ β†’ QueryProfile (type, k, style) β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 3. EXPAND QUERY β”‚ Generate 3 alternative phrasings β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 4. GENERATE HyDE β”‚ Create hypothetical answer document β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 5. RETRIEVE β”‚ For EACH query variant: β”‚\n", + "β”‚ β”‚ β€’ Vector search (MMR) β”‚\n", + "β”‚ β”‚ β€’ BM25 search β”‚\n", + "β”‚ β”‚ β€’ RRF fusion β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 6. GLOBAL RRF β”‚ Fuse results from all query variants β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 7. PDF DIVERSITY β”‚ Ensure chunks from all loaded PDFs β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 8. RERANK β”‚ CrossEncoder neural scoring β†’ top k β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 9. BUILD CONTEXT β”‚ Format: \"[Source 1]: chunk content...\" β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 10. LLM GENERATE β”‚ Answer with inline [Source X] citationsβ”‚\n", + "β”‚ β”‚ (Different prompts for comparison) β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 11. VERIFY (complex) β”‚ Self-check: direct? structured? If not β”‚\n", + "β”‚ β”‚ β†’ regenerate improved answer β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ 12. CACHE & RETURN β”‚ Store result, return (answer, cites, β”‚\n", + "β”‚ β”‚ metadata) β”‚\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + "```\n", + "\n", + "---\n", + "\n", + "### 8. Document Summarization\n", + "\n", + "**`summarize_document(max_chunks=None) β†’ (summary, metadata)`**\n", + "\n", + "Uses **Map-Reduce** pattern:\n", + "\n", + "```\n", + "MAP PHASE:\n", + " Chunks [1-10] β†’ LLM β†’ 3-5 bullet summary\n", + " Chunks [11-20] β†’ LLM β†’ 3-5 bullet summary\n", + " ...\n", + " Chunks [n-m] β†’ LLM β†’ 3-5 bullet summary\n", + "\n", + "REDUCE PHASE:\n", + " All batch summaries β†’ LLM β†’ Final structured summary:\n", + " β€’ Overview (2-3 sentences)\n", + " β€’ Main Topics (bullets)\n", + " β€’ Important Details (3-5 points)\n", + " β€’ Conclusion\n", + "```\n", + "\n", + "---\n", + "\n", + "### Key Parameters Reference\n", + "\n", + "| Parameter | Location | Default | Description |\n", + "|-----------|----------|---------|-------------|\n", + "| `k` | QueryProfile | 5-12 | Chunks to retrieve (auto-tuned by query type) |\n", + "| `fetch_factor` | _retrieve_with_rrf | 2 | Multiplier for initial retrieval pool |\n", + "| `lambda_mult` | MMR search | 0.6 | Diversity vs relevance (0=diverse, 1=relevant) |\n", + "| `similarity_threshold` | SemanticChunker | 0.5 | Cosine sim for chunk boundaries |\n", + "| `max_chunk_size` | SemanticChunker | 1000 | Max characters per chunk |\n", + "| `chunk_size` | TextSplitter | 800 | Fallback chunker size |\n", + "| `chunk_overlap` | TextSplitter | 150 | Character overlap between chunks |\n", + "| `max_size` | QueryCache | 100 | Maximum cached queries |\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iXFJQbsgNVpz" + }, + "outputs": [], + "source": [ + "class EnhancedRAGv3:\n", + " def __init__(self, api_key: str):\n", + " self.vector_db = None\n", + " self.bm25_retriever = None\n", + " self.documents = []\n", + " self.pdf_metadata = {} # Track multiple PDFs\n", + " self.doc_headers = {} # Store extracted headers (title, authors, abstract) per PDF\n", + " self.cache = QueryCache()\n", + " self.api_key = api_key\n", + " self.is_initialized = False\n", + "\n", + " # Initialize LLM\n", + " self.llm = ChatGroq(\n", + " temperature=0,\n", + " model_name=\"llama-3.3-70b-versatile\",\n", + " groq_api_key=api_key\n", + " )\n", + "\n", + " # Models (loaded on demand)\n", + " self.embedding_model = None\n", + " self.cross_encoder = None\n", + " self.semantic_chunker = None\n", + " self.query_model = None\n", + "\n", + " def load_models(self, progress=gr.Progress()):\n", + " \"\"\"Load all models with progress tracking\"\"\"\n", + " if self.is_initialized:\n", + " return \"Models already loaded.\"\n", + "\n", + " progress(0.1, desc=\"Loading BGE embeddings...\")\n", + " self.embedding_model = HuggingFaceEmbeddings(\n", + " model_name=\"BAAI/bge-large-en-v1.5\",\n", + " model_kwargs={'device': device, 'trust_remote_code': True},\n", + " encode_kwargs={'normalize_embeddings': True}\n", + " )\n", + "\n", + " progress(0.4, desc=\"Loading Re-ranker...\")\n", + " self.cross_encoder = CrossEncoder('BAAI/bge-reranker-v2-m3', device=device)\n", + "\n", + " progress(0.6, desc=\"Loading Semantic Chunker...\")\n", + " self.semantic_chunker = SemanticChunker()\n", + "\n", + " progress(0.8, desc=\"Loading Query Model...\")\n", + " self.query_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)\n", + "\n", + " progress(1.0, desc=\"Complete\")\n", + " self.is_initialized = True\n", + " return \"All models loaded successfully.\"\n", + "\n", + " def _extract_document_header(self, pages: List[Document], pdf_name: str, pdf_id: str) -> Dict:\n", + " \"\"\"Extract title, authors, and abstract from first pages of PDF\"\"\"\n", + " # Get text from first 2 pages (where metadata usually is)\n", + " header_text = \"\"\n", + " for i, page in enumerate(pages[:2]):\n", + " header_text += page.page_content + \"\\n\\n\"\n", + " \n", + " # Use LLM to extract structured metadata\n", + " extraction_prompt = f\"\"\"Extract the following information from this academic paper's first pages.\n", + "Return ONLY a JSON object with these keys:\n", + "- title: The paper's title (string)\n", + "- authors: List of author names (array of strings)\n", + "- abstract: The paper's abstract if present (string, or null if not found)\n", + "- institutions: List of institutions/affiliations if present (array of strings, or empty array)\n", + "\n", + "Text from first pages:\n", + "{header_text[:4000]}\n", + "\n", + "JSON:\"\"\"\n", + "\n", + " try:\n", + " response = self.llm.invoke(extraction_prompt)\n", + " # Parse JSON from response\n", + " import re\n", + " json_match = re.search(r'\\{[\\s\\S]*\\}', response.content)\n", + " if json_match:\n", + " metadata = json.loads(json_match.group())\n", + " metadata['pdf_name'] = pdf_name\n", + " metadata['pdf_id'] = pdf_id\n", + " metadata['raw_header'] = header_text[:2000] # Store raw text too\n", + " return metadata\n", + " except Exception as e:\n", + " print(f\"Header extraction error: {e}\")\n", + " \n", + " # Fallback: return raw header text\n", + " return {\n", + " 'title': None,\n", + " 'authors': [],\n", + " 'abstract': None,\n", + " 'institutions': [],\n", + " 'pdf_name': pdf_name,\n", + " 'pdf_id': pdf_id,\n", + " 'raw_header': header_text[:2000]\n", + " }\n", + "\n", + " def _is_metadata_query(self, query: str) -> Tuple[bool, str]:\n", + " \"\"\"Check if query is asking for basic document metadata\"\"\"\n", + " query_lower = query.lower()\n", + " \n", + " # Author queries\n", + " author_patterns = ['who are the authors', 'who wrote', 'author', 'authors', 'written by', 'by whom']\n", + " if any(p in query_lower for p in author_patterns):\n", + " return True, 'authors'\n", + " \n", + " # Title queries\n", + " title_patterns = ['what is the title', 'title of', 'paper title', 'document title', 'name of the paper']\n", + " if any(p in query_lower for p in title_patterns):\n", + " return True, 'title'\n", + " \n", + " # Abstract queries\n", + " abstract_patterns = ['what is the abstract', 'abstract of', 'paper abstract', 'summarize the abstract']\n", + " if any(p in query_lower for p in abstract_patterns):\n", + " return True, 'abstract'\n", + " \n", + " # Institution queries\n", + " institution_patterns = ['which institution', 'which university', 'affiliation', 'where are the authors from']\n", + " if any(p in query_lower for p in institution_patterns):\n", + " return True, 'institutions'\n", + " \n", + " return False, None\n", + "\n", + " def _answer_metadata_query(self, query: str, metadata_type: str) -> Tuple[str, str, str]:\n", + " \"\"\"Answer queries about document metadata directly\"\"\"\n", + " if not self.doc_headers:\n", + " return \"No document metadata available.\", \"\", \"\"\n", + " \n", + " # Build response from stored headers\n", + " responses = []\n", + " citations = []\n", + " \n", + " for pdf_name, header in self.doc_headers.items():\n", + " if metadata_type == 'authors':\n", + " authors = header.get('authors', [])\n", + " if authors:\n", + " author_str = \", \".join(authors)\n", + " responses.append(f\"**{pdf_name}**: {author_str}\")\n", + " else:\n", + " # Fallback to raw header\n", + " responses.append(f\"**{pdf_name}**: Authors could not be automatically extracted. See first page.\")\n", + " \n", + " elif metadata_type == 'title':\n", + " title = header.get('title')\n", + " if title:\n", + " responses.append(f\"**{pdf_name}**: {title}\")\n", + " else:\n", + " responses.append(f\"**{pdf_name}**: Title could not be automatically extracted.\")\n", + " \n", + " elif metadata_type == 'abstract':\n", + " abstract = header.get('abstract')\n", + " if abstract:\n", + " responses.append(f\"**{pdf_name}**:\\n{abstract}\")\n", + " else:\n", + " responses.append(f\"**{pdf_name}**: Abstract not found in first pages.\")\n", + " \n", + " elif metadata_type == 'institutions':\n", + " institutions = header.get('institutions', [])\n", + " if institutions:\n", + " inst_str = \", \".join(institutions)\n", + " responses.append(f\"**{pdf_name}**: {inst_str}\")\n", + " else:\n", + " responses.append(f\"**{pdf_name}**: Institutions could not be automatically extracted.\")\n", + " \n", + " # Create citation from raw header\n", + " snippet = header.get('raw_header', '')[:300] + \"...\"\n", + " citations.append(f\"\"\"\n", + "
\n", + "[1] {pdf_name} β€” Page 1 | Relevance: High (Document Header)\n", + "
{snippet}
\n", + "
\n", + "\"\"\")\n", + " \n", + " answer = \"\\n\\n\".join(responses)\n", + " citations_html = \"\\n\".join(citations)\n", + " metadata_str = f\"**Query Type:** Metadata ({metadata_type}) | **Direct extraction from document headers**\"\n", + " \n", + " return answer, citations_html, metadata_str\n", + "\n", + " def _classify_query(self, query: str) -> QueryProfile:\n", + " classification_prompt = f\"\"\"You route user questions for a RAG system.\n", + "Return ONLY compact JSON with keys:\n", + "- query_type: one of [factoid, summary, comparison, extraction, reasoning]\n", + "- needs_multi_docs: true/false (set true when the query likely spans multiple documents or asks for differences)\n", + "- requires_comparison: true/false\n", + "- answer_style: one of [direct, bullets, steps]\n", + "- k: integer between 5 and 12 indicating how many chunks to retrieve\n", + "\n", + "Question: {query}\n", + "\n", + "JSON:\"\"\"\n", + "\n", + " def heuristic_profile() -> QueryProfile:\n", + " ql = query.lower()\n", + " requires_comparison = any(word in ql for word in ['compare', 'difference', 'versus', 'vs', 'between', 'across'])\n", + " needs_multi = requires_comparison or any(word in ql for word in ['both', 'each document', 'all documents', 'across'])\n", + " if any(pattern in ql for pattern in ['what is', 'who is', 'when ', 'define', 'list ']):\n", + " qt = 'factoid'\n", + " k_val = 6\n", + " style = 'direct'\n", + " elif requires_comparison:\n", + " qt = 'comparison'\n", + " k_val = 12\n", + " style = 'bullets'\n", + " elif any(word in ql for word in ['summarize', 'overview', 'key points', 'conclusion']):\n", + " qt = 'summary'\n", + " k_val = 10\n", + " style = 'bullets'\n", + " elif any(word in ql for word in ['explain', 'how does', 'process', 'steps', 'methodology']):\n", + " qt = 'reasoning'\n", + " k_val = 10\n", + " style = 'steps'\n", + " else:\n", + " qt = 'extraction'\n", + " k_val = 8\n", + " style = 'direct'\n", + " return QueryProfile(\n", + " query_type=qt,\n", + " intent=qt,\n", + " needs_multi_docs=needs_multi,\n", + " requires_comparison=requires_comparison,\n", + " answer_style=style,\n", + " k=k_val\n", + " )\n", + "\n", + " try:\n", + " response = self.llm.invoke(classification_prompt)\n", + " data = json.loads(response.content)\n", + " qt = str(data.get('query_type', 'extraction')).lower()\n", + " needs_multi = bool(data.get('needs_multi_docs', False))\n", + " requires_comparison = bool(data.get('requires_comparison', False))\n", + " style = str(data.get('answer_style', 'direct')).lower()\n", + " k_val = int(data.get('k', 8))\n", + " k_val = max(5, min(k_val, 12))\n", + " if qt not in ['factoid', 'summary', 'comparison', 'extraction', 'reasoning']:\n", + " qt = 'extraction'\n", + " if style not in ['direct', 'bullets', 'steps']:\n", + " style = 'direct'\n", + " return QueryProfile(\n", + " query_type=qt,\n", + " intent=qt,\n", + " needs_multi_docs=needs_multi or requires_comparison,\n", + " requires_comparison=requires_comparison or qt == 'comparison',\n", + " answer_style=style,\n", + " k=k_val\n", + " )\n", + " except Exception:\n", + " return heuristic_profile()\n", + "\n", + " def _generate_hyde_document(self, query: str) -> str:\n", + " hyde_prompt = f\"\"\"Generate a detailed, factual paragraph that would answer this question:\n", + "\n", + "Question: {query}\n", + "\n", + "Write a comprehensive answer (2-3 sentences) as if from an expert document:\"\"\"\n", + " try:\n", + " response = self.llm.invoke(hyde_prompt)\n", + " return response.content\n", + " except:\n", + " return query\n", + "\n", + " def _expand_query(self, query: str) -> List[str]:\n", + " expansion_prompt = f\"\"\"Generate 3 different versions of this question to retrieve relevant documents:\n", + "\n", + "Original Question: {query}\n", + "\n", + "Generate 3 alternative phrasings (one per line):\"\"\"\n", + " try:\n", + " response = self.llm.invoke(expansion_prompt)\n", + " queries = response.content.strip().split('\\n')\n", + " queries = [q.strip().lstrip('1234567890.-) ') for q in queries if q.strip()]\n", + " return [query] + queries[:3]\n", + " except:\n", + " return [query]\n", + "\n", + " def _adaptive_retrieve(self, query: str, query_type: str) -> int:\n", + " k_map = {'factoid': 5, 'medium': 8, 'complex': 12}\n", + " return k_map.get(query_type, 8)\n", + "\n", + " def ingest_pdf(self, pdf_path: str, use_semantic_chunking=True, progress=gr.Progress()):\n", + " \"\"\"Ingest PDF with progress tracking - supports multiple PDFs\"\"\"\n", + " progress(0.1, desc=f\"Loading PDF: {os.path.basename(pdf_path)}...\")\n", + "\n", + " # Check if already loaded\n", + " pdf_name = os.path.basename(pdf_path)\n", + " if pdf_name in self.pdf_metadata:\n", + " return f\"Notice: Document '{pdf_name}' is already loaded.\"\n", + "\n", + " loader = PyPDFLoader(pdf_path)\n", + " docs = loader.load()\n", + "\n", + " # Add unique PDF identifier to metadata\n", + " pdf_id = hashlib.md5(pdf_path.encode()).hexdigest()[:8]\n", + " \n", + " progress(0.2, desc=\"Extracting document metadata (title, authors)...\")\n", + " # Extract and store document header metadata\n", + " header_info = self._extract_document_header(docs, pdf_name, pdf_id)\n", + " self.doc_headers[pdf_name] = header_info\n", + " \n", + " # Log extracted info\n", + " if header_info.get('authors'):\n", + " print(f\"Extracted authors: {header_info['authors']}\")\n", + " if header_info.get('title'):\n", + " print(f\"Extracted title: {header_info['title']}\")\n", + "\n", + " progress(0.3, desc=f\"Loaded {len(docs)} pages. Chunking...\")\n", + "\n", + " chunk_counter = len(self.documents)\n", + "\n", + " if use_semantic_chunking:\n", + " splits = []\n", + " for i, doc in enumerate(docs):\n", + " progress(0.3 + (0.3 * i / len(docs)), desc=f\"Semantic chunking page {i+1}/{len(docs)}...\")\n", + " semantic_chunks = self.semantic_chunker.chunk_document(doc.page_content)\n", + " for chunk in semantic_chunks:\n", + " chunk_counter += 1\n", + " # Mark first page chunks as header chunks\n", + " is_header = doc.metadata.get('page', 0) == 0\n", + " splits.append(Document(\n", + " page_content=chunk,\n", + " metadata={\n", + " 'page': doc.metadata.get('page', 0),\n", + " 'source': pdf_path,\n", + " 'pdf_name': pdf_name,\n", + " 'pdf_id': pdf_id,\n", + " 'chunk_id': f\"{pdf_id}-{chunk_counter}\",\n", + " 'is_header': is_header\n", + " }\n", + " ))\n", + " else:\n", + " text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=800,\n", + " chunk_overlap=150,\n", + " separators=[\"\\n\\n\", \"\\n\", \". \", \" \", \"\"],\n", + " length_function=len\n", + " )\n", + " splits = text_splitter.split_documents(docs)\n", + " # Add PDF metadata\n", + " for split in splits:\n", + " chunk_counter += 1\n", + " is_header = split.metadata.get('page', 0) == 0\n", + " split.metadata['pdf_name'] = pdf_name\n", + " split.metadata['pdf_id'] = pdf_id\n", + " split.metadata['chunk_id'] = f\"{pdf_id}-{chunk_counter}\"\n", + " split.metadata['is_header'] = is_header\n", + "\n", + " # Add to existing documents\n", + " self.documents.extend(splits)\n", + "\n", + " # Track PDF metadata\n", + " total_pages = max([doc.metadata.get('page', 0) for doc in docs]) + 1\n", + " self.pdf_metadata[pdf_name] = {\n", + " 'path': pdf_path,\n", + " 'pages': total_pages,\n", + " 'chunks': len(splits),\n", + " 'pdf_id': pdf_id,\n", + " 'added': datetime.now().strftime(\"%Y-%m-%d %H:%M\")\n", + " }\n", + "\n", + " progress(0.7, desc=f\"Rebuilding Vector Index ({len(self.documents)} total chunks)...\")\n", + "\n", + " # Rebuild vector DB with all documents\n", + " self.vector_db = Chroma.from_documents(\n", + " documents=self.documents,\n", + " embedding=self.embedding_model,\n", + " collection_name=\"rag_gradio_v3\"\n", + " )\n", + "\n", + " progress(0.9, desc=\"Rebuilding Keyword Index...\")\n", + " self.bm25_retriever = BM25Retriever.from_documents(self.documents)\n", + "\n", + " progress(1.0, desc=\"Complete\")\n", + "\n", + " # Build return message with extracted metadata\n", + " extracted_info = \"\"\n", + " if header_info.get('title'):\n", + " extracted_info += f\"\\n**Title:** {header_info['title']}\"\n", + " if header_info.get('authors'):\n", + " extracted_info += f\"\\n**Authors:** {', '.join(header_info['authors'])}\"\n", + "\n", + " return f\"\"\"**Document Added Successfully**\n", + "\n", + "**File:** {pdf_name}\n", + "**Pages:** {total_pages}\n", + "**Chunks:** {len(splits)}\n", + "{extracted_info}\n", + "\n", + "**Total Collection:**\n", + "- {len(self.pdf_metadata)} document(s)\n", + "- {len(self.documents)} total chunks\n", + "\n", + "Ready to answer questions.\"\"\"\n", + "\n", + " def get_loaded_pdfs(self) -> str:\n", + " \"\"\"Return formatted list of loaded PDFs\"\"\"\n", + " if not self.pdf_metadata:\n", + " return \"No documents loaded yet.\"\n", + "\n", + " output = \"## Loaded Documents\\n\\n\"\n", + " for idx, (name, info) in enumerate(self.pdf_metadata.items(), 1):\n", + " output += f\"**{idx}. {name}**\\n\"\n", + " output += f\" - Pages: {info['pages']} | Chunks: {info['chunks']}\\n\"\n", + " output += f\" - Added: {info['added']}\\n\"\n", + " # Add extracted metadata if available\n", + " if name in self.doc_headers:\n", + " header = self.doc_headers[name]\n", + " if header.get('title'):\n", + " output += f\" - Title: {header['title']}\\n\"\n", + " if header.get('authors'):\n", + " output += f\" - Authors: {', '.join(header['authors'][:3])}{'...' if len(header['authors']) > 3 else ''}\\n\"\n", + " output += \"\\n\"\n", + "\n", + " output += f\"**Total:** {len(self.pdf_metadata)} document(s), {len(self.documents)} chunks\"\n", + " return output\n", + "\n", + " def clear_all_documents(self):\n", + " \"\"\"Clear all loaded documents\"\"\"\n", + " self.documents = []\n", + " self.pdf_metadata = {}\n", + " self.doc_headers = {} # Clear headers too\n", + " self.vector_db = None\n", + " self.bm25_retriever = None\n", + " self.cache = QueryCache() # Clear cache too\n", + " return \"All documents cleared.\"\n", + "\n", + " def _retrieve_with_rrf(self, query: str, k: int = 5, fetch_factor: int = 2, prioritize_header: bool = False) -> List[Document]:\n", + " fetch_k = max(k * fetch_factor, k)\n", + " vector_docs = self.vector_db.as_retriever(\n", + " search_type=\"mmr\",\n", + " search_kwargs={\"k\": fetch_k, \"fetch_k\": fetch_k * 2, \"lambda_mult\": 0.6}\n", + " ).invoke(query)\n", + " self.bm25_retriever.k = fetch_k\n", + " keyword_docs = self.bm25_retriever.invoke(query)\n", + " \n", + " # If prioritizing header, add first-page chunks\n", + " if prioritize_header:\n", + " header_docs = [doc for doc in self.documents if doc.metadata.get('is_header', False)]\n", + " fused_docs = ReciprocalRankFusion.fuse([vector_docs, keyword_docs, header_docs])\n", + " else:\n", + " fused_docs = ReciprocalRankFusion.fuse([vector_docs, keyword_docs])\n", + " \n", + " return fused_docs[:fetch_k]\n", + "\n", + " def _rerank_documents(self, query: str, documents: List[Document], top_k: int = 5, force_comparison: bool = False, boost_header: bool = False) -> List[Tuple[Document, float]]:\n", + " if not documents:\n", + " return []\n", + "\n", + " # For comparison queries, boost documents that likely contain comparative info\n", + " is_comparison = force_comparison or any(word in query.lower() for word in ['compare', 'difference', 'differ', 'versus', 'vs'])\n", + "\n", + " pairs = [[query, doc.page_content] for doc in documents]\n", + " scores = self.cross_encoder.predict(pairs)\n", + "\n", + " # Boost scores for docs that contain comparison keywords\n", + " if is_comparison:\n", + " comparison_keywords = ['compared to', 'in contrast', 'difference', 'whereas', 'unlike', 'while', 'however']\n", + " for i, doc in enumerate(documents):\n", + " content_lower = doc.page_content.lower()\n", + " keyword_count = sum(1 for kw in comparison_keywords if kw in content_lower)\n", + " if keyword_count > 0:\n", + " scores[i] *= (1 + 0.1 * keyword_count) # Boost by 10% per keyword\n", + " \n", + " # Boost header chunks for metadata-like queries\n", + " if boost_header:\n", + " for i, doc in enumerate(documents):\n", + " if doc.metadata.get('is_header', False) or doc.metadata.get('page', 99) == 0:\n", + " scores[i] *= 1.5 # 50% boost for first page content\n", + "\n", + " scored_docs = list(zip(documents, scores))\n", + " scored_docs.sort(key=lambda x: x[1], reverse=True)\n", + " return scored_docs[:top_k]\n", + "\n", + " def _dedupe_documents(self, documents: List[Document]) -> List[Document]:\n", + " deduped = []\n", + " seen = set()\n", + " for doc in documents:\n", + " key = doc.metadata.get('chunk_id') or f\"{doc.metadata.get('pdf_id', 'unknown')}::{hashlib.md5(doc.page_content.encode()).hexdigest()}\"\n", + " if key in seen:\n", + " continue\n", + " seen.add(key)\n", + " deduped.append(doc)\n", + " return deduped\n", + "\n", + " def _ensure_pdf_diversity(self, query: str, documents: List[Document], target_docs: int = 2, per_pdf: int = 3) -> List[Document]:\n", + " if not documents or not self.pdf_metadata:\n", + " return documents\n", + "\n", + " seen_ids = set(doc.metadata.get('pdf_id') for doc in documents if doc.metadata.get('pdf_id'))\n", + " if len(seen_ids) >= target_docs:\n", + " return documents\n", + "\n", + " missing_ids = [info['pdf_id'] for info in self.pdf_metadata.values() if info['pdf_id'] not in seen_ids]\n", + " extra_docs = []\n", + " for pdf_id in missing_ids[:max(0, target_docs - len(seen_ids))]:\n", + " filtered_docs = self.vector_db.as_retriever(\n", + " search_type=\"mmr\",\n", + " search_kwargs={\n", + " \"k\": per_pdf,\n", + " \"fetch_k\": per_pdf * 2,\n", + " \"lambda_mult\": 0.6,\n", + " \"filter\": {\"pdf_id\": pdf_id}\n", + " }\n", + " ).invoke(query)\n", + " extra_docs.extend(filtered_docs)\n", + "\n", + " combined = documents + extra_docs\n", + " return self._dedupe_documents(combined)\n", + "\n", + " def _create_citation_card(self, idx: int, doc: Document, score: float) -> str:\n", + " \"\"\"Create a formatted citation card\"\"\"\n", + " page = doc.metadata.get('page', 'Unknown')\n", + " pdf_name = doc.metadata.get('pdf_name', 'Unknown Document')\n", + "\n", + " # Get snippet (first 200 chars)\n", + " snippet = doc.page_content[:200] + \"...\" if len(doc.page_content) > 200 else doc.page_content\n", + "\n", + " # Relevance label based on score\n", + " if score > 0.7:\n", + " relevance = \"High\"\n", + " elif score > 0.5:\n", + " relevance = \"Medium\"\n", + " else:\n", + " relevance = \"Low\"\n", + "\n", + " card = f\"\"\"\n", + "
\n", + "[{idx}] {pdf_name} β€” Page {page} | Relevance: {relevance} ({score:.2f})\n", + "
{snippet}
\n", + "
\n", + "\"\"\"\n", + " return card\n", + "\n", + " def chat(self, query: str, use_hyde: bool = True, use_multi_query: bool = True, progress=gr.Progress()):\n", + " \"\"\"Enhanced chat with better answers and citations\"\"\"\n", + " if not self.vector_db:\n", + " return \"Please upload at least one document first.\", \"\", \"\"\n", + "\n", + " # Check if this is a metadata query (authors, title, etc.)\n", + " is_metadata_query, metadata_type = self._is_metadata_query(query)\n", + " if is_metadata_query and self.doc_headers:\n", + " progress(0.5, desc=f\"Retrieving {metadata_type} from document metadata...\")\n", + " answer, citations, metadata = self._answer_metadata_query(query, metadata_type)\n", + " progress(1.0, desc=\"Complete\")\n", + " return answer, citations, metadata\n", + "\n", + " # Check cache\n", + " cached_response = self.cache.get(query)\n", + " if cached_response:\n", + " return f\"*Retrieved from cache*\\n\\n{cached_response}\", \"\", \"Cached result\"\n", + "\n", + " progress(0.1, desc=\"Classifying query...\")\n", + " profile = self._classify_query(query)\n", + " k = profile.k\n", + " \n", + " # Check if query might need header info (about the paper itself)\n", + " needs_header_boost = any(word in query.lower() for word in \n", + " ['paper', 'study', 'research', 'introduction', 'propose', 'contribution', 'this work'])\n", + "\n", + " base_queries = [query]\n", + "\n", + " if use_multi_query:\n", + " progress(0.22, desc=\"Expanding query variants...\")\n", + " expanded_queries = self._expand_query(query)\n", + " base_queries.extend(expanded_queries[:2])\n", + "\n", + " if use_hyde:\n", + " progress(0.32, desc=\"Generating HyDE document...\")\n", + " hyde_doc = self._generate_hyde_document(query)\n", + " base_queries.append(hyde_doc)\n", + "\n", + " progress(0.45, desc=\"Retrieving candidates (MMR + BM25)...\")\n", + " retrieval_results = []\n", + " for bq in base_queries:\n", + " retrieval_results.append(self._retrieve_with_rrf(bq, k=k, fetch_factor=2, prioritize_header=needs_header_boost))\n", + "\n", + " fused_docs = ReciprocalRankFusion.fuse(retrieval_results)\n", + " fused_docs = self._dedupe_documents(fused_docs)[:max(k * 3, k)]\n", + "\n", + " if profile.needs_multi_docs and len(self.pdf_metadata) > 1:\n", + " fused_docs = self._ensure_pdf_diversity(\n", + " query,\n", + " fused_docs,\n", + " target_docs=min(3, len(self.pdf_metadata)),\n", + " per_pdf=max(2, k // 3)\n", + " )\n", + "\n", + " progress(0.7, desc=\"Re-ranking with CrossEncoder...\")\n", + " reranked_docs = self._rerank_documents(query, fused_docs, top_k=max(5, k), \n", + " force_comparison=profile.requires_comparison,\n", + " boost_header=needs_header_boost)\n", + "\n", + " progress(0.8, desc=\"Building context...\")\n", + "\n", + " # Build context with inline citations\n", + " context_parts = []\n", + " citation_cards = []\n", + "\n", + " for idx, (doc, score) in enumerate(reranked_docs, 1):\n", + " page = doc.metadata.get('page', 'Unknown')\n", + " pdf_name = doc.metadata.get('pdf_name', 'Unknown')\n", + "\n", + " # Add to context\n", + " context_parts.append(f\"[Source {idx}]: {doc.page_content}\\n\")\n", + "\n", + " # Create citation card\n", + " citation_cards.append(self._create_citation_card(idx, doc, score))\n", + "\n", + " context_str = \"\\n\".join(context_parts)\n", + "\n", + " # Enhanced prompt for better answers\n", + " is_comparison = profile.requires_comparison\n", + " style_hint = \"\"\n", + " if profile.answer_style == 'bullets':\n", + " style_hint = \"Use concise bullet points.\"\n", + " elif profile.answer_style == 'steps':\n", + " style_hint = \"Use numbered steps when explaining processes.\"\n", + "\n", + " style_instruction = style_hint or \"Keep structure aligned to the question type.\"\n", + "\n", + " if is_comparison:\n", + " prompt = f\"\"\"You are an expert AI assistant analyzing academic/technical documents. Answer this COMPARISON question with precision and structure.\n", + "\n", + "## COMPARISON QUESTION TYPE\n", + "\n", + "## CRITICAL INSTRUCTIONS:\n", + "1. **Start with a direct comparison statement** - Don't give background first\n", + "2. **Use a structured format:**\n", + " - Brief 1-2 sentence overview of what's being compared\n", + " - Bullet points listing specific differences\n", + " - Each bullet should be concrete and factual\n", + "3. **Be specific with numbers, names, and technical details** from the sources\n", + "4. **Cite sources** [Source X] after each factual claim\n", + "5. **If sources lack comparison info**, explicitly state: \"The provided sources do not contain direct comparison information on [aspect]. Based on what's available: [answer what you can]\"\n", + "\n", + "## CONTEXT FROM DOCUMENTS:\n", + "{context_str}\n", + "\n", + "## COMPARISON QUESTION:\n", + "{query}\n", + "\n", + "## STRUCTURED COMPARISON ANSWER:\n", + "\"\"\"\n", + " else:\n", + " prompt = f\"\"\"You are an expert AI assistant analyzing academic/technical documents. Your goal is to provide accurate, well-structured, and comprehensive answers.\n", + "\n", + "## QUERY TYPE: {profile.query_type.upper()}\n", + "\n", + "## INSTRUCTIONS:\n", + "1. **Answer the question directly in the first sentence** - Don't start with background\n", + "2. **Use inline citations** [Source X] immediately after each claim or fact\n", + "3. **Structure your answer clearly:**\n", + " - For factoid queries: Direct answer (2-3 sentences) with supporting details\n", + " - For complex queries: Organized explanation with bullet points or numbered lists\n", + " - For \"explain\" queries: Start with simple definition, then elaborate\n", + "4. **Be comprehensive but concise** - NO repetition or filler words\n", + "5. **Use specific facts**: numbers, names, technical terms from sources\n", + "6. **If information is insufficient**, state: \"The sources provided do not fully address [aspect]. Based on available information: [what you can answer]\"\n", + "7. {style_instruction}\n", + "\n", + "## CONTEXT FROM DOCUMENTS:\n", + "{context_str}\n", + "\n", + "## QUESTION:\n", + "{query}\n", + "\n", + "## YOUR ANSWER:\n", + "\"\"\"\n", + "\n", + " progress(0.9, desc=\"Generating enhanced answer...\")\n", + " try:\n", + " response = self.llm.invoke(prompt)\n", + " answer = response.content\n", + "\n", + " # Add verification step for complex queries and comparisons\n", + " if profile.query_type in ['summary', 'comparison', 'reasoning'] or is_comparison:\n", + " verify_prompt = f\"\"\"Review this answer for a {profile.query_type} query. Check if it:\n", + "\n", + "Question: {query}\n", + "\n", + "Answer: {answer}\n", + "\n", + "**Evaluation Criteria:**\n", + "1. **Directness**: Does it answer the question in the first sentence?\n", + "2. **Structure**: Is it well-organized with bullet points for complex info?\n", + "3. **Specificity**: Does it use concrete facts/numbers from sources?\n", + "4. **Completeness**: Does it address all parts of the question?\n", + "5. **No fluff**: Is it concise without repetition?\n", + "\n", + "If the answer has issues, provide an IMPROVED VERSION following this format:\n", + "- Start with direct answer\n", + "- Use bullet points for lists/comparisons\n", + "- Include specific facts with citations\n", + "- Be concise\n", + "\n", + "If it's already good, respond with only: \"VERIFIED\"\n", + "\n", + "Your response:\"\"\"\n", + "\n", + " verify_response = self.llm.invoke(verify_prompt)\n", + " if \"VERIFIED\" not in verify_response.content.upper():\n", + " # Extract improved answer (remove any preamble)\n", + " improved = verify_response.content\n", + " if \"IMPROVED VERSION\" in improved or \"Here\" in improved[:50]:\n", + " # Find where actual answer starts\n", + " lines = improved.split('\\n')\n", + " answer_lines = []\n", + " started = False\n", + " for line in lines:\n", + " if started or (line.strip() and not line.strip().startswith(('**', 'If', 'Your', 'The answer'))):\n", + " started = True\n", + " answer_lines.append(line)\n", + " if answer_lines:\n", + " answer = '\\n'.join(answer_lines)\n", + " else:\n", + " answer = improved\n", + "\n", + " self.cache.set(query, answer)\n", + "\n", + " # Format citations\n", + " citations_html = \"\\n\".join(citation_cards)\n", + "\n", + " # Metadata\n", + " metadata = f\"\"\"**Query Type:** {profile.query_type.title()} | **Multi-Doc:** {\"Yes\" if profile.needs_multi_docs else \"No\"} | **Sources Used:** {len(reranked_docs)} | **Documents Searched:** {len(self.pdf_metadata)}\"\"\"\n", + "\n", + " progress(1.0, desc=\"Complete\")\n", + " return answer, citations_html, metadata\n", + "\n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\", \"\", \"\"\n", + "\n", + " def summarize_document(self, max_chunks: int = None, progress=gr.Progress()):\n", + " \"\"\"Generate document summary\"\"\"\n", + " if not self.documents:\n", + " return \"No document loaded.\", \"\"\n", + "\n", + " chunks_to_process = self.documents[:max_chunks] if max_chunks else self.documents\n", + " total_chunks = len(chunks_to_process)\n", + "\n", + " progress(0.1, desc=f\"Processing {total_chunks} chunks...\")\n", + "\n", + " chunk_summaries = []\n", + " batch_size = 10\n", + "\n", + " for i in range(0, total_chunks, batch_size):\n", + " batch = chunks_to_process[i:i+batch_size]\n", + " batch_text = \"\\n\\n---\\n\\n\".join([doc.page_content for doc in batch])\n", + "\n", + " progress(0.1 + (0.6 * i / total_chunks), desc=f\"Summarizing chunks {i+1}-{min(i+batch_size, total_chunks)}...\")\n", + "\n", + " map_prompt = f\"\"\"Summarize the key points from this document section in 3-5 bullet points:\n", + "\n", + "{batch_text}\n", + "\n", + "Key Points:\"\"\"\n", + "\n", + " try:\n", + " response = self.llm.invoke(map_prompt)\n", + " chunk_summaries.append(response.content)\n", + " except Exception as e:\n", + " continue\n", + "\n", + " progress(0.8, desc=\"Synthesizing final summary...\")\n", + "\n", + " combined_summaries = \"\\n\\n\".join(chunk_summaries)\n", + "\n", + " reduce_prompt = f\"\"\"You are summarizing documents. Below are summaries of different sections.\n", + "\n", + "Create a comprehensive, well-structured summary that includes:\n", + "\n", + "1. **Overview**: What are these documents about? (2-3 sentences)\n", + "2. **Main Topics**: Key themes and subjects covered (bullet points)\n", + "3. **Important Details**: Critical information, findings, or arguments (3-5 points)\n", + "4. **Conclusion**: Overall takeaway or significance\n", + "\n", + "Section Summaries:\n", + "{combined_summaries}\n", + "\n", + "## COMPREHENSIVE SUMMARY:\"\"\"\n", + "\n", + " try:\n", + " final_response = self.llm.invoke(reduce_prompt)\n", + " summary = final_response.content\n", + "\n", + " # Build metadata\n", + " metadata = f\"\"\"## Summary Statistics\n", + "\n", + "**Documents Analyzed:** {len(self.pdf_metadata)}\n", + "**Total Chunks:** {total_chunks}\n", + "**Total Pages:** {sum(info['pages'] for info in self.pdf_metadata.values())}\n", + "\n", + "### Documents Included:\n", + "\"\"\"\n", + " for name, info in self.pdf_metadata.items():\n", + " metadata += f\"- **{name}** ({info['pages']} pages)\\n\"\n", + "\n", + " progress(1.0, desc=\"Complete\")\n", + " return summary, metadata\n", + "\n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\", \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FOZPs9yFYxfn" + }, + "source": [ + "## Gradio Web Interface\n", + "\n", + "The `create_interface()` function creates a 4-tab web UI:\n", + "\n", + "| Tab | Purpose | Key Actions |\n", + "|-----|---------|-------------|\n", + "| **Setup** | Initialize system | Enter API key β†’ Load models β†’ Upload PDFs |\n", + "| **Chat** | Q&A interface | Ask questions with HyDE/Multi-Query options |\n", + "| **Summarize** | Document summary | Generate map-reduce summary of all docs |\n", + "| **Help** | Documentation | Usage guide and example questions |\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "63Rjs0gNBm_y" + }, + "outputs": [], + "source": [ + "def create_interface():\n", + " \"\"\"Create enhanced Gradio interface\"\"\"\n", + "\n", + " # Global RAG instance\n", + " rag_system = None\n", + "\n", + " def initialize_system(api_key):\n", + " nonlocal rag_system\n", + " if not api_key:\n", + " return \"Please enter your Groq API key.\", \"\"\n", + " try:\n", + " rag_system = EnhancedRAGv3(api_key)\n", + " status = rag_system.load_models()\n", + " return status, \"\"\n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\", \"\"\n", + "\n", + " def upload_and_process(file, use_semantic):\n", + " if rag_system is None or not rag_system.is_initialized:\n", + " return \"Please initialize the system first.\", \"\"\n", + " try:\n", + " status = rag_system.ingest_pdf(file.name, use_semantic_chunking=use_semantic)\n", + " loaded_pdfs = rag_system.get_loaded_pdfs()\n", + " return status, loaded_pdfs\n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\", \"\"\n", + "\n", + " def ask_question(query, use_hyde, use_multi_query):\n", + " if rag_system is None:\n", + " return \"Please initialize the system first.\", \"\", \"\"\n", + " if not query.strip():\n", + " return \"Please enter a question.\", \"\", \"\"\n", + " try:\n", + " answer, citations, metadata = rag_system.chat(query, use_hyde=use_hyde, use_multi_query=use_multi_query)\n", + " return answer, citations, metadata\n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\", \"\", \"\"\n", + "\n", + " def summarize_doc():\n", + " if rag_system is None:\n", + " return \"Please initialize the system first.\", \"\"\n", + " try:\n", + " summary, metadata = rag_system.summarize_document()\n", + " return summary, metadata\n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\", \"\"\n", + "\n", + " def clear_docs():\n", + " if rag_system is None:\n", + " return \"No system initialized\", \"\"\n", + " status = rag_system.clear_all_documents()\n", + " return status, \"\"\n", + "\n", + " def get_pdf_list():\n", + " if rag_system is None:\n", + " return \"No system initialized\"\n", + " return rag_system.get_loaded_pdfs()\n", + "\n", + " # Create Gradio Blocks interface\n", + " with gr.Blocks(\n", + " title=\"Multi-PDF RAG System\",\n", + " theme=gr.themes.Base(),\n", + " css=\"\"\"\n", + " .gradio-container { max-width: 1200px; margin: auto; }\n", + " h1 { font-weight: 600; }\n", + " .prose { font-size: 14px; }\n", + " \"\"\"\n", + " ) as app:\n", + " gr.Markdown(\"\"\"\n", + " # Multi-Document RAG System\n", + " **Advanced Document Q&A with Multiple PDF Support** β€” Powered by Llama 3.3 70B\n", + "\n", + " Multi-document support | Enhanced citations | Verification system\n", + " \"\"\")\n", + "\n", + " with gr.Tab(\"Setup\"):\n", + " gr.Markdown(\"### Step 1: Initialize System\")\n", + " with gr.Row():\n", + " api_key_input = gr.Textbox(\n", + " label=\"Groq API Key\",\n", + " type=\"password\",\n", + " placeholder=\"Enter your Groq API key\",\n", + " scale=3\n", + " )\n", + " init_btn = gr.Button(\"Initialize\", variant=\"primary\", scale=1)\n", + " init_status = gr.Textbox(label=\"Status\", interactive=False)\n", + "\n", + " gr.Markdown(\"### Step 2: Upload Documents\")\n", + " gr.Markdown(\"*Multiple PDFs supported β€” each document will be added to the knowledge base.*\")\n", + "\n", + " with gr.Row():\n", + " with gr.Column(scale=2):\n", + " file_input = gr.File(label=\"Select PDF\", file_types=[\".pdf\"])\n", + " semantic_check = gr.Checkbox(label=\"Use Semantic Chunking (Recommended)\", value=True)\n", + " with gr.Row():\n", + " upload_btn = gr.Button(\"Add Document\", variant=\"primary\", scale=2)\n", + " clear_btn = gr.Button(\"Clear All\", variant=\"stop\", scale=1)\n", + " upload_status = gr.Markdown()\n", + "\n", + " with gr.Column(scale=1):\n", + " gr.Markdown(\"#### Document Library\")\n", + " loaded_pdfs_display = gr.Markdown(\"No documents loaded yet.\")\n", + " refresh_btn = gr.Button(\"Refresh\", size=\"sm\")\n", + "\n", + " init_btn.click(initialize_system, inputs=[api_key_input], outputs=[init_status, loaded_pdfs_display])\n", + " upload_btn.click(upload_and_process, inputs=[file_input, semantic_check], outputs=[upload_status, loaded_pdfs_display])\n", + " clear_btn.click(clear_docs, outputs=[upload_status, loaded_pdfs_display])\n", + " refresh_btn.click(get_pdf_list, outputs=[loaded_pdfs_display])\n", + "\n", + " with gr.Tab(\"Chat\"):\n", + " gr.Markdown(\"### Ask Questions About Your Documents\")\n", + "\n", + " with gr.Row():\n", + " with gr.Column(scale=3):\n", + " query_input = gr.Textbox(\n", + " label=\"Your Question\",\n", + " placeholder=\"What are the key conclusions of these papers?\",\n", + " lines=3\n", + " )\n", + " with gr.Row():\n", + " hyde_check = gr.Checkbox(label=\"HyDE (Better Retrieval)\", value=True)\n", + " multi_query_check = gr.Checkbox(label=\"Multi-Query (More Comprehensive)\", value=True)\n", + " ask_btn = gr.Button(\"Submit\", variant=\"primary\", size=\"lg\")\n", + "\n", + " with gr.Column(scale=1):\n", + " gr.Markdown(\"\"\"\n", + " #### Tips\n", + " - Be specific in your questions\n", + " - HyDE improves retrieval quality\n", + " - Multi-Query finds more context\n", + " - Questions search across all loaded documents\n", + " \"\"\")\n", + "\n", + " metadata_output = gr.Markdown(label=\"Query Info\")\n", + " answer_output = gr.Markdown(label=\"Answer\")\n", + "\n", + " gr.Markdown(\"#### Sources & Citations\")\n", + " sources_output = gr.HTML(label=\"Sources\")\n", + "\n", + " ask_btn.click(\n", + " ask_question,\n", + " inputs=[query_input, hyde_check, multi_query_check],\n", + " outputs=[answer_output, sources_output, metadata_output]\n", + " )\n", + "\n", + " gr.Examples(\n", + " examples=[\n", + " \"What are the main findings?\",\n", + " \"Explain the methodology in detail\",\n", + " \"What are the key conclusions?\",\n", + " \"Compare the approaches discussed\",\n", + " \"What are the limitations mentioned?\",\n", + " \"Summarize the most important contributions\"\n", + " ],\n", + " inputs=query_input\n", + " )\n", + "\n", + " with gr.Tab(\"Summarize\"):\n", + " gr.Markdown(\"### Generate Comprehensive Summary\")\n", + " gr.Markdown(\"Analyzes all loaded documents and creates a unified summary.\")\n", + "\n", + " summarize_btn = gr.Button(\"Generate Summary\", variant=\"primary\", size=\"lg\")\n", + " summary_metadata = gr.Markdown()\n", + " summary_output = gr.Markdown()\n", + "\n", + " summarize_btn.click(summarize_doc, outputs=[summary_output, summary_metadata])\n", + "\n", + " with gr.Tab(\"Help\"):\n", + " gr.Markdown(\"\"\"\n", + " ## How to Use This System\n", + "\n", + " ### Quick Start\n", + " 1. **Setup Tab**: Enter Groq API key and click \"Initialize\"\n", + " 2. **Setup Tab**: Upload PDF(s) and click \"Add Document\" (repeat for multiple files)\n", + " 3. **Chat Tab**: Ask questions about your documents\n", + " 4. **Summarize Tab**: Get unified summary of all documents\n", + "\n", + " ---\n", + "\n", + " ## Example PDFs for Testing\n", + "\n", + " Download these PDFs to test the system:\n", + "\n", + " ### Academic Research Papers\n", + " 1. **Attention is All You Need** (Transformer Paper)\n", + " - URL: `https://arxiv.org/pdf/1706.03762.pdf`\n", + " - Great for: Testing technical Q&A\n", + "\n", + " 2. **BERT: Pre-training of Deep Bidirectional Transformers**\n", + " - URL: `https://arxiv.org/pdf/1810.04805.pdf`\n", + " - Great for: Multi-paper comparison\n", + "\n", + " 3. **GPT-3 Paper** (Language Models are Few-Shot Learners)\n", + " - URL: `https://arxiv.org/pdf/2005.14165.pdf`\n", + " - Great for: Complex methodology questions\n", + "\n", + " ### Business & Finance\n", + " 4. **Tesla Q3 2024 Earnings Report**\n", + " - URL: `https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2024-Update`\n", + " - Great for: Financial analysis questions\n", + "\n", + " 5. **World Bank Annual Report**\n", + " - URL: `https://thedocs.worldbank.org/en/doc/9a8210d538854d29883cf3a19e66a3e2-0350012021/original/WBG-Annual-Report-2021-EN.pdf`\n", + " - Great for: Economic data queries\n", + "\n", + " ### Technical Documentation\n", + " 6. **Python Documentation (any topic)**\n", + " - URL: Search \"python [topic] pdf\" on official Python docs\n", + " - Great for: Technical Q&A\n", + "\n", + " ### Medical/Scientific\n", + " 7. **WHO COVID-19 Reports**\n", + " - URL: `https://www.who.int/publications` (download any PDF)\n", + " - Great for: Health information extraction\n", + "\n", + " ---\n", + "\n", + " ## Real-Life Use Case Questions\n", + "\n", + " ### Research Paper Analysis (Attention Paper)\n", + " ```\n", + " 1. \"What is the main innovation proposed in this paper?\"\n", + " 2. \"Explain the self-attention mechanism in detail\"\n", + " 3. \"How does the Transformer compare to RNN-based models?\"\n", + " 4. \"What are the computational complexity advantages?\"\n", + " 5. \"What datasets were used for evaluation?\"\n", + " 6. \"What are the key results on machine translation tasks?\"\n", + " 7. \"Describe the encoder-decoder architecture\"\n", + " 8. \"What are the limitations mentioned by the authors?\"\n", + " ```\n", + "\n", + " ### Business/Finance Analysis (Earnings Reports)\n", + " ```\n", + " 1. \"What was the total revenue this quarter?\"\n", + " 2. \"How did the company perform compared to last quarter?\"\n", + " 3. \"What are the main growth drivers mentioned?\"\n", + " 4. \"Summarize the key financial metrics\"\n", + " 5. \"What guidance did management provide?\"\n", + " 6. \"What are the main risks discussed?\"\n", + " 7. \"How much cash does the company have?\"\n", + " 8. \"What investments are being made in R&D?\"\n", + " ```\n", + "\n", + " ### Multi-Paper Comparison (Upload BERT + GPT-3)\n", + " ```\n", + " 1. \"Compare the pre-training objectives of these models\"\n", + " 2. \"What are the key differences in architecture?\"\n", + " 3. \"Which model performs better on which tasks?\"\n", + " 4. \"How do the training datasets differ?\"\n", + " 5. \"What are the main innovations in each paper?\"\n", + " 6. \"Compare the computational requirements\"\n", + " ```\n", + "\n", + " ### Policy/Report Analysis (World Bank Report)\n", + " ```\n", + " 1. \"What are the main economic trends discussed?\"\n", + " 2. \"Which regions showed the strongest growth?\"\n", + " 3. \"What interventions are recommended?\"\n", + " 4. \"Summarize the poverty reduction initiatives\"\n", + " 5. \"What are the key challenges identified?\"\n", + " 6. \"What metrics are used to measure success?\"\n", + " ```\n", + "\n", + " ### Medical Literature (WHO Reports)\n", + " ```\n", + " 1. \"What are the recommended treatment protocols?\"\n", + " 2. \"What evidence supports these guidelines?\"\n", + " 3. \"What are the risk factors mentioned?\"\n", + " 4. \"Summarize the epidemiological data\"\n", + " 5. \"What prevention measures are suggested?\"\n", + " ```\n", + "\n", + " ### Contract/Legal Document Analysis\n", + " ```\n", + " 1. \"What are the key obligations of each party?\"\n", + " 2. \"What are the termination conditions?\"\n", + " 3. \"Summarize the payment terms\"\n", + " 4. \"What warranties are provided?\"\n", + " 5. \"What are the liability limitations?\"\n", + " ```\n", + "\n", + " ---\n", + "\n", + " ## Complete Example Workflow\n", + "\n", + " **Scenario:** Analyzing the \"Attention is All You Need\" paper\n", + "\n", + " **Step 1:** Upload the PDF\n", + " - Download: `https://arxiv.org/pdf/1706.03762.pdf`\n", + " - Upload in Setup tab\n", + "\n", + " **Step 2:** Start with broad question:\n", + " ```\n", + " \"What is this paper about and what problem does it solve?\"\n", + " ```\n", + "\n", + " **Step 3:** Dive into specifics:\n", + " ```\n", + " \"Explain how multi-head attention works\"\n", + " \"What are the advantages over recurrent models?\"\n", + " \"How does positional encoding work?\"\n", + " ```\n", + "\n", + " **Step 4:** Compare (if you upload GPT-3 paper too):\n", + " ```\n", + " \"How does the Transformer architecture used in GPT-3 differ from the original?\"\n", + " ```\n", + "\n", + " **Step 5:** Get comprehensive summary:\n", + " - Go to Summarize tab\n", + " - Click \"Generate Summary\"\n", + "\n", + " ---\n", + "\n", + " ### Key Features\n", + "\n", + " #### Multi-Document Support\n", + " - Upload multiple PDFs to build a knowledge base\n", + " - Questions search across all loaded documents\n", + " - Each source is clearly attributed with document name and page\n", + "\n", + " #### Enhanced Citations\n", + " - Expandable citation cards with snippets\n", + " - Relevance scores (High, Medium, Low)\n", + " - Shows exact page numbers and document names\n", + "\n", + " #### Better Answer Quality\n", + " - Advanced query classification (factoid/medium/complex)\n", + " - Verification system for complex queries\n", + " - Structured, comprehensive responses\n", + " - Context-aware answer length\n", + "\n", + " #### Advanced Retrieval\n", + " - **HyDE**: Generates hypothetical documents for better retrieval\n", + " - **Multi-Query**: Expands questions for comprehensive coverage\n", + " - **RRF Fusion**: Combines vector + keyword search\n", + " - **Cross-Encoder Re-ranking**: Improves relevance scoring\n", + "\n", + " ### Best Practices\n", + "\n", + " **For Best Results:**\n", + " - Use semantic chunking (default)\n", + " - Keep HyDE and Multi-Query enabled for important questions\n", + " - Ask specific, well-formed questions\n", + " - Review citations to verify answers\n", + "\n", + " **Example Questions:**\n", + " - \"What is the transformer architecture?\" (Factoid)\n", + " - \"Explain how self-attention works in detail\" (Complex)\n", + " - \"Compare BERT and GPT approaches\" (Complex)\n", + " - \"What are the key innovations in this paper?\" (Medium)\n", + "\n", + " ### Performance\n", + " - **Simple queries**: ~5-8 seconds\n", + " - **Complex queries**: ~10-15 seconds (includes verification)\n", + " - **Full summary**: ~30-90 seconds (depends on document count)\n", + "\n", + " ### Technical Details\n", + " - **Embeddings**: BAAI/bge-large-en-v1.5 (SOTA)\n", + " - **Re-ranker**: BAAI/bge-reranker-v2-m3\n", + " - **LLM**: Llama 3.3 70B Versatile via Groq\n", + " - **Retrieval**: Hybrid (Vector + BM25) with RRF fusion\n", + "\n", + " ---\n", + "\n", + " **Need Help?** Make sure to initialize the system and upload at least one PDF before asking questions!\n", + " \"\"\")\n", + "\n", + " return app\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h1wXxw54Y7--" + }, + "source": [ + "### Launch Configuration:\n", + "```python\n", + "app.launch(share=True, debug=True, server_port=7860)\n", + "```\n", + "- `share=True` β†’ Creates public `gradio.live` URL\n", + "- Access locally at `http://localhost:7860`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000, + "referenced_widgets": [ + "16e7ec24f6c34211be67872719bf9563", + "30b098e339a54662bfd60b2378705b90", + "185e3b2ad58a4478a9fbd816ecbf9742", + "daf3357706804fccb6c31135a9026df0", + "79609124541c46708f54300e8327076d", + "bb46a6e38ec14107bb0c08b732b6d5c2", + "5fd6e74c4ce749889cf368fa5350b537", + "c834082a75e54e2c9b93e01d7fcc0e63", + "e9beb0ca922943c48062059151d03e84", + "23a80524f261439f880f9d1665546d2c", + "593b75ac4dcf4a879406f4ecbf09307d", + "a7f006a99ebc4f76a93f00b5bc81013d", + "20707ef4487b4181ac60f90c2cbdd654", + "1f07e31fdca34c27a2124e2e3752b573", + "3fe271f3a67b4976afccd5c87550b588", + "67094e8997af4300817747c0e5e0834e", + "f8072b44de5e4b25b53c6dab4ac03e5b", + "ede80e01c89742298a1dc17f3d4ad433", + "a2096398deb14083817d4b5476bcadd4", + "8de9e181d2b946aaad5c5b66a2742708", + "876047addea2453c8e97aefbee2a87fe", + "2d7afeb82fb84452be405e87488b7a1e", + "9aa90ace4d9e4586ac65496983a5e8df", + "7cb016c4d08346479c9fbc00100a7ccc", + "0171997bc10c4f1498aca91b8991b421", + "b35aa187abbc418296bc2645629b4e37", + "0d454c8bacb24275892705fc6f5a5342", + "e1547d895b58455daef9c6a190f3330c", + "f4b191e440f645a18d1904a795765ac8", + "d85dc47eaffe4a8eb269836bc421d657", + "95a985c765084d76b9ce331d49077243", + "5d2f39828b8049bb94c0515715df82d8", + "a7a3df6b70de49ec9161a0e7fa00cc67", + "90db4efa43024c9183e73696d8789256", + "9a6f7e9fcc1b42ffa4927801a8d907c1", + "dff4ff30c5594218b1906f427a054261", + "77e3497365d444ccaddfc7c827d5099d", + "d59caf966ee9400aa97210936745cbee", + "78e493bbe2474fe4bb363b943b58ba01", + "75548e5439cc4556a40fd1c79d40f695", + "6b9626df9125449e9f20d47a22b5fff0", + "6cb8f6cafc874488946fab0da980f80f", + "d59e08a3214e455c89f443656a31a787", + "019b6c48e98e4743819c5562e834597c", + "fabd252b6d494103b399efa3acac44be", + "43e348e91c9d44a08e3cb3b1473f3ea0", + "3c97b950cb5844a189cb83b586d5cde0", + "ff039e2efc5d4429bbd72b68d1cd4cf7", + "6926b0c7ee324e4d9b6a73cfeede6dfb", + "50fc937584e748cab0affb2eba8751f4", + "fb4639c037d34d8db7a9d7a264ae3632", + "67b62ec4c1794ba49151169e095fe71c", + "ab415f2ad72e49a48ea62af7d2053811", + "f91f7666be7f46c28ec6b68a8ac9202f", + "d27aa73b45a34d02be27e7b6728c8269", + "a1c261daf2bc4410b3c524ae65b584b6", + "2a568b3fb61d4009b30e97489299a0cf", + "b7251ba370ab42a0b923829cbf19823e", + "1074084c422a4932a49d0e6e76a3e291", + "4a5e54c4d7cc4c85a5eb12836d0b7519", + "02857216ca404b17b6d850f69fdc4167", + "cfebfe51ce2a4c1ab3423c99fac04de4", + "e23178cc86744a1e8f82a9f8b8b9d457", + "aefe8e9f423b46e99735d229b303f279", + "32935a20278c408dbf37cde6578f8c3e", + "e82cd751987d45e3a55925955855bacf", + "617f883b8d344d83aa92f5af4c4fa500", + "9757dfc308e04c7eac1879acb34a524d", + "188705886f344f84ad5190bb43297506", + "1c6c826032384883b3a3dcdfa93eda18", + "b453d0a3ff944998b6c482376a38896e", + "bc58242b7fb4400fa36f05059b320cbc", + "dcd53beda53c4722a343e3da200b1375", + "b00a22bc1ebd4566b3d556316d448986", + "0d607a5044a34a89bed74d0688647a73", + "b2a482182bf34c6f88968bbbe54d99b7", + "a0cce362b9a7462d9e0c37ac0870293b", + "986b8c9dcad245cea63a43ef4df3d36f", + "064086f0c0f444f2aa52c55b460d6ca7", + "417dd85b34a84c2684960a5c16f7a762", + "ac2673ad0288472a952351c018f1271d", + "ed30fe2e571c40d086425c063518e3b7", + "9ba656852b25431f8113b1b1783afcb0", + "17574ccf363a470f89927cbf4cd10004", + "557202483cdb49dea117927777cc6d35", + "a2ee87d4317a484cbd0550baa42dc419", + "18ddd81f0a7c4d7c81579be798cfe1dd", + "da919e0da2104bea95559e459e3955a0", + "7cc6b20897a34abbafb2b2eae17757e0", + "9d50a215d8954b36a5665a6a17ec5bdc", + "75e3dd18da934743a090fd095bb50c84", + "075d54e510934230abe640032c79fc02", + "8616627ec29e45ddbd8c04271ab0df04", + "36ccd95611f3475581c30d8258c7d1c8", + "e4f13b17e6534bd1976665d30455e0c0", + "f5292c055ae248d58f38adf45801e00c", + "9c2be268d3b34cc3b49791892d27ff87", + "f7ffa842a0b84916913e4094a5fb294c", + "5a23cfb2ad50450e99b4f2cb1411cc20", + "0a8b2c0ade00462d80de5e26d096718d", + "7d7b7921ca2a446896ce2ef4ba2d3ba4", + "2b933ba020274544ba009b64b4e5c430", + "aff8b0997198412a94db617bade60603", + "49717bef77dc4b86a0e3340463e54d82", + "bef49fe64f37432c8d0468d2047075e0", + "2bfa46f9d0174de581d0747d493308d1", + "03e5824940dd44bc8e93b40e882ef01d", + "71f836f754ae49a0be420b5dd736e4fe", + "6f53a9115aaf45cea6f45e6bd239d7d8", + "6bc03369ce31443ca05fe7a11b0a5514", + "8057ffb29e9c42dabda51185fdecd7dd", + "d86e4a45f0074d11832892dc2353c4a3", + "90238310655543b2a11b75ca7fd1e726", + "832e3ab1589c42e4bf7fe9e84d34b4e4", + "1ca167735ec940a387b02fbc93f827bb", + "5131abad55034d7896b6244b51368722", + "4fdc87ebdc4f4b108e8d2e4b820d3a62", + "ec8ad3d1497f4be4af8a9f8b30053060", + "0c93b8ec65c341d2927aa3fdcaf69736", + "3fb649ec3fed425285c971f63146f9f5", + "63bde89569a0433786559639a3edcc73", + "c5c5632955f94450b472b19dfe26c831", + "3a51ba19984140cbb595c7639edc8fc8", + "7b334f4843e14b5eb32351a884539495", + "501141a07ded4447884e84aede39598e", + "190d38c3a6824faa96f3096932ca187e", + "cabcd44775d24cc199dae2fe021f0ca1", + "e311630ede5b4e01a1a41a15c72c1ceb", + "85683bfe46bb4d4c988b64aee1569e4f", + "c02e6e4947794b10ad1a974739e5b429", + "76dbdc9a829942548be0444e79679e3a", + "207944749ecd4ecd85ee59e12dbf53fe", + "c2fdf8a947604af4829638a3cec51a84", + "13c8fad66efe409c995e85f07754007e", + "bb196705a1814abf9cfe12ed6ebc13bd", + "5cad3c37275a479196b91a32de160de3", + "0e8c7ae0c50447b8982966ed2651f051", + "abbe664ca8624bde93a03b573e7a7604", + "6fbf07ce310e4a6fb92e6da441a83aef", + "fb5f1d3cfc904387938a993c76fe5183", + "2bf70d9d271a425cae2a4c3acb422653", + "6ac0bb4623e449718f9bf4d10c097e79", + "55d83174c9264ea2b34a799fd91cd8b1", + "2b23db625dd441c89060ce72b2125d77", + "b614e5e0e8424bc0b6ed38c8fdae81ea", + "6b628b8819354295badd05c352b0dcd2", + "6fcec8b21cdf4a5483e5bc8b74de1851", + "f37c06b82d394070aacb6dcd5a666f5d", + "a5fbeac70f3549e8bce9af57829b9467", + "895493cb35c540f796a650194247131b", + "57e50172ddf64aa88d83701eb507be0e", + "0ea1f35898df46c4ae79c4f192a79259", + "b30b0d7b64ef44e2b09f93c87d9b453d", + "36c1bd64980349269ff4455264b8f75e", + "7e82594c67434da1888e0afe07ececd4", + "2ed045704e81491f8daa33f2e65d55ff", + "78eafefa1b384106a30a39529d855d0a", + "8e6f6a1ff043487687a7b748ecefcc6f", + "aeed8ee243854cdd9262e1632211a853", + "2523cbe84f9f4c0dad084bc074a39038", + "031042b3e35d47ee8a243172a90c24a7", + "ad884a180a42467d824e6ce654a35d2c", + "903aa9bcbb1e444895e4fb56a450dfd0", + "2ca6ea7a92c649c4bd63df2a5f791c4b", + "b8c313402cc74009abd4dde9c30d7b42", + "72e246cd9065482fbc45c256a49cee70", + "c836d8bbe5a0412e9bb69d1d74f22f33", + "9f5e547a23e04c8e99794bc4bda8bbbf", + "f5f8430f0dcf4325b86eadd97e73c49f", + "344d068bf2dc47468e957941f0498bf2", + "2a0e74c0400c435db5bfa184f02080b9", + "75ecddc97ef748a881ebab2d93441b5a", + "29dcf40da5c04d64890369b7364abf1a", + "4b77e86724314936b7ef1af13aa3c2ec", + "ccb224a2cabd429c9a24adf969457ace", + "139489a30a664a76ad6aebd6bf9d15fd", + "5905e6655c8840d3ba018a80324c1260", + "294421182a034eeca0744f0ee260c95f", + "6976c1d50d944367928aea95221497dd", + "60dce6fdb66a401c8fb2ef544cd50e71", + "48905aaf882d4bbda4dbc6ab237f63fb", + "544dfa758d464291a5703716fbd7cab0", + "e22124dc4309414fbb77c50cea435d81", + "7541ee8e44424b2582b88f9031761c2c", + "6e9b8499c67845eb8f631878145f2ce4", + "ade2cf56143d4146bf611b889e7f15be", + "682eb71ee4704cc1bc87bbaa4fd88ce1", + "e6b5f001215e4e1bb0197737c0044a88", + "07cfcdb368494466b41d219617e104d0", + "ff25b2aa452b4482a0ca4cd79b0fce5c", + "f88889237b254af9ab6de48091736773", + "8f8484604d9a4e2f9812803c11979bcd", + "498e839dad9641e4b7a7a6f58e8d2064", + "1da8f2a4f0a04be0ac49507fc5a39a6c", + "5cd91bda54ae41fdb8f0dd908b6c907f", + "2babc0facf4a4c7388e06e07fe01c9a8", + "8756c0bb65f44ba2887a42323df7f69d", + "511d0f7656e54cd08f12b2d8d849618a", + "934a9dd7613141668a98b89b666f1298", + "c5658ae5b5fc42b29dc2dee4af4380b0", + "233fe8c43399408390f5816cbbda87bd", + "c4d56ab96ec9471d8d63acc0d22ecc06", + "8bcf3e4c3ad943af8a53de8a35fa4f62", + "e0432b02ec5c4303a0dd665152ac5e5c", + "1b0913e77c3a42adbeb16da2cf504125", + "59c7e3887eff4232bae50fb145cea1c8", + "642cc483005b4111b338761b57a5894d", + "c6a1d5fb08014113969e8ac1cc3a750d", + "354d6e5abfe34bfbae4f12db3bf40c3d", + "ad7901bf45ae459bbc59840b84fc9592", + "fff471a9822c43dcb29b7da9ef8d5363", + "b55b481ae8184da4bcadd47e94859803", + "3d6620065647443f975656f54cc902ab", + "2c05c8942f8c40428d70fc0a1ccbc65b", + "0537c891baa94c9395f2833f06d6d355", + "7462484bfba043949511e20bef99e48a", + "540650e9ac2b49cb9f10dfbf5a28debd", + "5c2042cfd9274ebeaf07f53ee96ba77b", + "743ecbc5bb3d48c88aacfb65ade766d1", + "3c30728848ac47858f84b8ed298397f4", + "ee0ee61a3a55401f8a525a05a4150e7b", + "648444be4f3b453e86e05317b815514a", + "56bdbb9fc0634aa5aee44a229731584f", + "a1ac9d59c05c46b797493f8bb0d7d1bb", + "0163b07e777a44d581080b356a00fc46", + "4254a69095a347b8a69a6698abab6ec5", + "2179b292baae4e97bcfbf45f2450acfb", + "272ddd2794b140f99f1e74ef3dfa4645", + "e694886e08694c5ca8d719461279f66f", + "d85c4e7d80274e038dd7a27ee0e1eb8b", + "211bd4e5dce1412db17ac0113cacd6fe", + "63eef4b4fe2444f684ccd604cea646fe", + "488b7ef5d92849e782a1536fc9f8a496", + "a4cf2dc7060147bba169e3861e5e15af", + "3bb4ade00aed4f7997d569510335dded", + "82b0173c80cc458b82907f818975adc6", + "9f97a9dbaf7a458c8000fcb8f421da49", + "18ba732281ca4bac98fd3c7a6b23cb29", + "494d3ddb1f524442ba01f26dbaf26925", + "55aff6c99f714ba1b7a8cc8098708aed", + "9f692246e6864101a6f783609c2c9f8f", + "9aea4996473941b7a483e8887fc9d355", + "4d8f2b9838fa43ec9e1589039e7f34f2", + "8f8b53d0a9e44f078ff89c8b452a2b5d", + "4b9aff699cb443e5a3d27f8d16201b6a", + "e2922459c68f42ce9a45f8b0b147080e", + "960e5671ec0248a09526c97a094f8b86", + "5ebedc75db3a410ba278cb6450754c77", + "3b5ac35bda37444d9d4c8bad59cce3e6", + "9d943d4ce7704daeb20bf3d76983746f", + "c31c0f5c2edd424aa6306784c7171f0a", + "7cef63aeb4b340cba68b0f29e02105fb", + "6eb0668347d6466cacbd02cba1232527", + "3bebaa2a5e8549b6b064f686d9a9ad3e", + "87dc2afb532242c29893430181d76e82", + "702f2ca2d7344efa9fa8bce3cd495821", + "55251fc883e44f2fae5bb333abcda2b2", + "6e9019fbdb3840af9cd5b588931ff9e2", + "7bd2d0c10b124a5db1c31fdb1b90a442", + "10f1c29e12214ff2ba2bfd3a90b9ae45", + "ef2b929504f7476db18e614cee4c9085", + "37ff763014b74693893260be60196c17", + "097ad750bfae40fcbceb318350e76032", + "b6395f0b8700489fa5a61267ceb96a28", + "055feef07f8b4bdf9c2bbd54c43da4f7", + "e464a7fba0514c05918e597bb6bc21f7", + "c0ccbf04ea664460aff0e136fc64d2bb", + "51e2ed9da1e34cea8a3ef8673b798462", + "1ef59519ab8c4263b1b7f1de256d9592", + "09076be40be4454b8d8ee6665c37a12d", + "b39445bb1ece4cbf8129d685617f126d", + "7b863a82fbc84299adf0c3fef0878b26", + "9e923bd19c0c424495cc8bce188f8fb0", + "5ae24c0ca1324a6a8b7803017fd83f29", + "4ba90ec13cb44087b0d71121c4626fb5", + "326b4706bd2041089c26fa295c729e23", + "3e31f5ed5126417d9fb8046e565e39fd", + "131402914e014cde8e1f79895e78ae7c", + "a10203eb20d645b882d5dc6c512bdce5", + "7f749553b3b04ebc8881f0be3015ee28", + "443fc75017434cd7b76c8ff647339672", + "061e46bfbce547bcb3e2e7ffba88e216", + "33c139b906564b35bb1ab1564b2f158f", + "e920b1bded8441a1aeb991cce781ecce", + "4cb0327a8b4f4a2a882e51a374d0dc88", + "ee9255e3dbf240cda71efcbdef0f0898", + "8ee9c6b9afa64006a3896f7dde639a2e", + "8944df99bc914f86a135fe7c272dd871", + "a642ae7c12ea47d48acf758f3a4d3ab2", + "29d54b0e795a48ca8b9d4f4812bdec38", + "177de5d4346c496a865d8a4970fd1546", + "5df1e0b31f744e7dbe1df7927e48a69c", + "b4a4c708054044c69cd831d6a6ea0133", + "6e74b603f29140c09fc10ae9b5a2cdfb", + "21017fbebf794bc7a874c489d885fd72", + "a5effffb66674bc580a513398a8f57b2", + "069ee53a4f674e309153f5433c3e86ea", + "a4590c87c2524e5fba9ac5a9c78aa84b", + "148a5093003c49fdbee32958c72226d3", + "7dd80ab18b84467e9e8251f3aed4a727", + "b3fc7d8880a54cf1936248a29e89b87a", + "1b51170785904761b298def1f33d7733", + "bb6b6f85f0a34514b8e355091167552a", + "0ec8968901274132b55d63b6afbf062a", + "25e8ab54bda042298f80cc9ad6b8fb8e", + "38a60013a0584090ba7ae6c22b89fc12", + "0b49d08fc6db40c0b4a476cf8565a1a0", + "2c2a7055b4e3486fa86278bb1020e506" + ] + }, + "id": "VBv-xjEqByZr", + "outputId": "2557eda4-049b-4d88-b5a0-b508605c348b" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipython-input-3496743039.py:60: DeprecationWarning: The 'theme' parameter in the Blocks constructor will be removed in Gradio 6.0. You will need to pass 'theme' to Blocks.launch() instead.\n", + " with gr.Blocks(\n", + "/tmp/ipython-input-3496743039.py:60: DeprecationWarning: The 'css' parameter in the Blocks constructor will be removed in Gradio 6.0. You will need to pass 'css' to Blocks.launch() instead.\n", + " with gr.Blocks(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().\n", + "* Running on public URL: https://f02d585ac558d5f5c6.gradio.live\n", + "\n", + "This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipython-input-1459962086.py:30: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEmbeddings`.\n", + " self.embedding_model = HuggingFaceEmbeddings(\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "16e7ec24f6c34211be67872719bf9563", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "modules.json: 0%| | 0.00/349 [00:00