Deployment
Browse files
- README.md +75 -12
- dockerfile +31 -0
- requirements.txt +0 -0
- src/app.py +210 -0
- src/rag_pipelline.py +226 -0
README.md
CHANGED
@@ -1,12 +1,75 @@
PDF Chatbot with RAG
====================

Overview
--------
PDF Chatbot with RAG is a Streamlit-powered demo that lets you ask natural-language questions about any selectable-text PDF. The app takes a document, breaks it into overlapping chunks, embeds those chunks with Google Generative AI, and serves a LangChain agent that always consults retrieved context before responding. The goal is to keep answers concise, grounded, and easy for reviewers to follow without diving into the underlying code.

How it works
------------
1. A sidebar workflow handles file selection: upload your own PDF or choose one of the curated samples that live in `sample_pdf/`.
2. Once the document is confirmed, `rag_pipelline.py` extracts text with `PyPDF2`, splits it into 2,500-character chunks, embeds each chunk with Gemini embeddings, and stores the vectors in FAISS in memory.
3. A LangChain agent built around the Gemini 2.5 Flash chat model uses a retrieval tool to fetch the most relevant chunks and streams answers back to the Streamlit chat interface.

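The chunking in step 2 also determines how many embedding calls a document costs: each chunk after the first advances by roughly `chunk_size - chunk_overlap` characters (2,500 and 300 in this pipeline). A rough plain-Python estimate (the app itself uses LangChain's `RecursiveCharacterTextSplitter`; `estimate_chunk_count` is only an illustration):

```python
import math

def estimate_chunk_count(n_chars: int, size: int = 2500, overlap: int = 300) -> int:
    """Rough count of overlapping chunks for a document of n_chars characters."""
    if n_chars <= size:
        return 1
    step = size - overlap  # each new chunk advances by size - overlap chars
    return math.ceil((n_chars - size) / step) + 1

# A 50-page PDF at ~3,000 characters/page is ~150,000 characters:
print(estimate_chunk_count(150_000))  # 69
```

This is useful for anticipating how much of the daily embedding quota a large upload will consume before clicking **Process PDF**.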
Components
----------
- `app.py`: Streamlit UI, session-state management, and chat orchestration. The sidebar coordinates uploads, sample selection, and processing states while the main area renders the dialog and chunk-level sources.
- `rag_pipelline.py`: Text extraction, chunking, embedding, FAISS creation, agent building, and helper utilities for rate-limit handling and retries.
- `sample_pdf/`: A handful of ready-to-use PDFs (e.g., the GPT-4 technical report) so you can explore the experience without providing your own document.
- `requirements.txt`: Pinned dependencies for Streamlit, LangChain, FAISS, Google Generative AI, and related helpers.
- `.env`: Holds `GOOGLE_API_KEY` (or other Google credentials) needed to call the embedding service.

Setup
-----
Clone the repository and configure the environment before launching the app.

1. **Prerequisites**
   - Install Python 3.12 or newer.
   - Have a Google Cloud project with the Generative AI API enabled and a valid API key (or service account credentials).

2. **Environment**
   - Create a `.env` file at the project root.
   - Add your key:

     ```
     GOOGLE_API_KEY=your-generated-key
     ```

   - If you prefer service account credentials, set `GOOGLE_APPLICATION_CREDENTIALS` instead of `GOOGLE_API_KEY`.

3. **Dependencies**
   - Create and activate a virtual environment:

     ```
     python -m venv .venv
     .venv\Scripts\Activate.ps1   # PowerShell
     .venv\Scripts\activate.bat   # cmd.exe
     source .venv/bin/activate    # Bash
     ```

   - Install the pinned packages:

     ```
     pip install -r requirements.txt
     ```

4. **Launch**
   - Start the Streamlit app:

     ```
     streamlit run app.py
     ```

   - Upload a text-based PDF or select a sample from the sidebar, click **Process PDF**, and wait for the four spinner steps (extract → chunk → embed → agent).
   - Ask questions in the chat box once processing completes.

Tips
----
- Keep questions focused so Gemini can stay concise and reuse the retrieved chunks shown in the expanders.
- Use the **Clear & Reset** button before switching documents to avoid leftover state.
- If embeddings hit rate limits, wait a minute: `rag_pipelline.py` already throttles calls, and the console logs retries.

Next steps
----------
1. Persist FAISS to disk or a managed vector database if you need to reuse vector stores across sessions.
2. Add tests that cover chunk creation, embedding retries, and agent responses so you can refactor with confidence.
dockerfile
ADDED
@@ -0,0 +1,31 @@
FROM python:3.12.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the app files
COPY . .

# Hugging Face Spaces writes to /tmp/.streamlit
ENV STREAMLIT_HOME=/tmp/.streamlit
ENV STREAMLIT_BROWSER_GATHER_USAGE_STATS=false

# Expose the Streamlit default port
EXPOSE 8501

# Healthcheck so Hugging Face knows the app is running
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Run the app
ENTRYPOINT ["streamlit", "run", "app.py", \
    "--server.port=8501", \
    "--server.address=0.0.0.0"]
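To try the container locally, build and run it along these lines (the image name `pdf-chatbot` is arbitrary, and the API key must be passed at runtime since it is not baked into the image):

```shell
docker build -t pdf-chatbot .
docker run -p 8501:8501 -e GOOGLE_API_KEY=your-generated-key pdf-chatbot
# then open http://localhost:8501
```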
requirements.txt
ADDED
Binary file (788 Bytes).
src/app.py
ADDED
@@ -0,0 +1,210 @@
import streamlit as st
from rag_pipelline import (
    extract_text_from_pdf,
    split_text_into_chunks,
    create_vector_store,
    create_rag_agent,
    get_answer
)


# Page Config -----
st.set_page_config(
    page_title="PDF Chatbot - using RAG",
    page_icon="📄",
    layout="wide"
)

# Header -----
st.markdown("### 📄 PDF Chatbot - RAG + Gemini")
st.markdown("Powered by LangChain and Gemini 2.5 Flash")
st.divider()

# Session State -----
if "agent" not in st.session_state:
    st.session_state.agent = None

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

if "display_messages" not in st.session_state:
    st.session_state.display_messages = []

if "pdf_processed" not in st.session_state:
    st.session_state.pdf_processed = False

if "pdf_name" not in st.session_state:
    st.session_state.pdf_name = ""

# Sidebar -----

with st.sidebar:
    st.header("⚙️ Stack Info")
    st.markdown("**Framework:** LangChain 1.2.10")
    st.markdown("**LLM:** Gemini 2.5 Flash")
    st.markdown("**Embeddings:** Google gemini-embedding-001")
    st.markdown("**Vector Store:** FAISS")
    st.divider()

    st.header("📄 Upload or Select PDF")
    # Upload a PDF
    uploaded_file = st.file_uploader(
        "Upload a PDF",
        type=["pdf"],
        help="Max 10 MB · Max 50 pages · Must have selectable text (not scanned)"
    )

    # Select a sample PDF
    sample_pdf = st.selectbox(
        "Or pick a sample PDF:",
        ["None", "Attention is All You Need", "GPT-4 Technical Report", "WHO 2025 Report", "World Bank Annual Report 2024"]
    )

    # Ensure only one PDF is used at a time
    chosen_file, chosen_name = None, ""
    if uploaded_file is not None:
        chosen_file = uploaded_file
        chosen_name = uploaded_file.name
    elif sample_pdf != "None":
        sample_map = {
            "Attention is All You Need": "sample_pdf/Attention_is_all_you_need.pdf",
            "GPT-4 Technical Report": "sample_pdf/GPT-4_Technical_Report.pdf",
            "WHO 2025 Report": "sample_pdf/WHO_2025.pdf",
            "World Bank Annual Report 2024": "sample_pdf/World_Bank_Annual_Report_2024.pdf"
        }
        # Open via a variable so the handle can be closed after use
        sample_path = sample_map.get(sample_pdf)
        if sample_path:
            try:
                chosen_file = open(sample_path, "rb")
                chosen_name = sample_pdf
                st.info(f"📄 Using sample file: {chosen_name}")
            except FileNotFoundError:
                st.error(f"❌ Sample file not found: {sample_path}")
                chosen_file = None


    if chosen_file is not None:
        if st.button("Process PDF", type="primary", use_container_width=True):
            with st.spinner("Step 1/4 - Extracting raw text"):
                raw_text = extract_text_from_pdf(chosen_file)

            # Close the sample file after reading to avoid a resource leak
            if sample_pdf != "None" and hasattr(chosen_file, "close"):
                chosen_file.close()

            if not raw_text.strip():
                st.error("❌ No text found. Please check your PDF and confirm its text is selectable")
            else:
                with st.spinner("Step 2/4 - Splitting text into chunks"):
                    chunks = split_text_into_chunks(raw_text)

                with st.spinner("Step 3/4 - Creating embeddings and vector store"):
                    vector_store = create_vector_store(chunks)

                with st.spinner("Step 4/4 - Creating RAG agent"):
                    st.session_state.agent = create_rag_agent(vector_store)
                    st.session_state.pdf_processed = True
                    st.session_state.pdf_name = chosen_name
                    st.session_state.chat_history = []
                    st.session_state.display_messages = []

                st.success(f"✅ Ready! {len(chunks)} chunks indexed")

    if st.session_state.pdf_processed:
        st.divider()
        st.success(f"Active:\n{st.session_state.pdf_name}")
        st.caption(f"Messages so far: {len(st.session_state.display_messages)}")

        if st.button("Clear & Reset", use_container_width=True):
            st.session_state.agent = None
            st.session_state.chat_history = []
            st.session_state.display_messages = []
            st.session_state.pdf_processed = False
            st.session_state.pdf_name = ""
            st.rerun()

# Main Area -----
if not st.session_state.pdf_processed:
    st.markdown("### How to use")
    col1, col2, col3 = st.columns(3)

    with col1:
        st.markdown("Step 1 - Upload or select a PDF from the sidebar")

    with col2:
        st.markdown("Step 2 - Click Process PDF")

    with col3:
        st.markdown("Step 3 - Ask your questions in the chat box")

    st.divider()

else:
    st.markdown(f"### Chatting with {st.session_state.pdf_name}")

    # Display all previous messages
    for msg in st.session_state.display_messages:
        with st.chat_message(msg["role"]):
            st.write(msg["content"])

            # Show source chunks for assistant messages
            if msg["role"] == "assistant" and msg.get("sources"):
                with st.expander("PDF chunks used to generate this answer"):
                    for i, doc in enumerate(msg["sources"]):
                        st.markdown(f"**Chunk {i+1}:**")
                        st.markdown(f"> {doc.page_content[:400]}...")
                        st.divider()


# Chat Input -----

if st.session_state.pdf_processed:
    user_question = st.chat_input(f"Ask something about {st.session_state.pdf_name}...")

    if user_question:

        # Show the user message
        with st.chat_message("user"):
            st.write(user_question)

        # Store in both histories
        st.session_state.chat_history.append({
            "role": "user",
            "content": user_question
        })
        st.session_state.display_messages.append({
            "role": "user",
            "content": user_question
        })

        # Get the answer from the agent
        with st.chat_message("assistant"):
            with st.spinner("Agent is searching the PDF and thinking"):
                answer, source_docs = get_answer(
                    st.session_state.agent,
                    user_question,
                    st.session_state.chat_history[:-1]  # history without the current question
                )
            st.write(answer)

            if source_docs:
                with st.expander("PDF chunks used to generate this answer"):
                    for i, doc in enumerate(source_docs):
                        st.markdown(f"**Chunk {i+1}:**")
                        st.markdown(f"> {doc.page_content[:400]}...")

            # Store the assistant response
            st.session_state.chat_history.append({
                "role": "assistant",
                "content": answer
            })
            st.session_state.display_messages.append({
                "role": "assistant",
                "content": answer,
                "sources": source_docs
            })
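The app deliberately keeps two parallel histories: `chat_history` (role/content only, replayed to the agent) and `display_messages` (which additionally carries `sources` for the expanders, data the model should never see). That bookkeeping, stripped of Streamlit, amounts to:

```python
chat_history = []       # what the agent sees: role + content only
display_messages = []   # what the UI renders: may also carry sources

def record_turn(question, answer, sources):
    """Append one question/answer exchange to both histories."""
    for store in (chat_history, display_messages):
        store.append({"role": "user", "content": question})
    chat_history.append({"role": "assistant", "content": answer})
    display_messages.append({"role": "assistant", "content": answer, "sources": sources})

record_turn("What is RAG?", "Retrieval-augmented generation.", ["chunk 1"])
print(len(chat_history), len(display_messages))  # 2 2
```

Keeping the lists separate means the token budget sent to Gemini never grows with the retrieved chunk text shown in the UI.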
src/rag_pipelline.py
ADDED
@@ -0,0 +1,226 @@
import os
import re
import time
import random
from dotenv import load_dotenv
from PyPDF2 import PdfReader

from langchain.chat_models import init_chat_model  # new universal model initializer
from langchain.agents import create_agent          # replaces AgentExecutor
from langchain.tools import tool                   # tool decorator
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

load_dotenv()

# Constants
CHUNK_SIZE = 2500
CHUNK_OVERLAP = 300
EMBED_DELAY_SECONDS = 1.5
TPM_SAFE_THRESHOLD = 27000
MAX_RETRIES = 3

# Extract text from the PDF
def extract_text_from_pdf(pdf_file):
    """
    Reads each page of the PDF and extracts raw text.
    Some pages may return None, so we fall back to an empty string.
    """
    pdf_reader = PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""
    return text

# Split the text into chunks
def split_text_into_chunks(raw_text):
    """
    Split the text into chunks of CHUNK_SIZE characters, with
    CHUNK_OVERLAP characters of overlap, to avoid hitting token limits.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len
    )
    chunks = text_splitter.split_text(raw_text)
    return chunks

# Create the vector store

def create_vector_store(text_chunks):
    """
    Embed chunks with Gemini and store them in FAISS.
    Respects all three free-tier limits:
      RPM = 100 -> actual ~40 RPM (1.5 s delay)
      TPM = 30K -> pauses at the 27K tokens/min threshold
      RPD = 1K  -> chunk_size=2500 minimises daily calls

    On 429: reads Google's retry delay from the error, waits plus jitter.
    On RPD exhausted: raises a clear message immediately.
    """
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/gemini-embedding-001"
    )

    total = len(text_chunks)
    est_min = (total * EMBED_DELAY_SECONDS) / 60
    print(f"\nEmbedding {total} chunks, est. {est_min:.1f} min")
    print(f"RPD usage: {total}/1,000 daily quota\n")

    vector_store = None
    tokens_this_minute = 0
    minute_start = time.time()

    for idx, chunk in enumerate(text_chunks):
        chunk_tokens = max(1, len(chunk) // 4)  # 1 token ~ 4 chars

        # --- TPM guard: pause if approaching 30K tokens/min ---
        elapsed = time.time() - minute_start
        if elapsed < 60 and (tokens_this_minute + chunk_tokens) > TPM_SAFE_THRESHOLD:
            wait = 60 - elapsed + 2
            print(f"TPM guard: {tokens_this_minute:,} tokens sent. Waiting {wait:.0f}s...")
            time.sleep(wait)
            tokens_this_minute = 0
            minute_start = time.time()

        if time.time() - minute_start >= 60:
            tokens_this_minute = 0
            minute_start = time.time()

        # --- Embed with retry ---
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                if vector_store is None:
                    vector_store = FAISS.from_texts(texts=[chunk], embedding=embeddings)
                else:
                    vector_store.add_texts(texts=[chunk])

                tokens_this_minute += chunk_tokens

                if (idx + 1) % 10 == 0 or (idx + 1) == total:
                    print(f"  {idx+1}/{total} chunks | ~{tokens_this_minute:,} tokens this min")
                break

            except Exception as e:
                err = str(e)

                if "429" in err or "RESOURCE_EXHAUSTED" in err:
                    match = re.search(r"retry[^\d]*(\d+\.?\d*)\s*s", err, re.IGNORECASE)
                    g_wait = float(match.group(1)) if match else 30
                    wait = g_wait + random.uniform(1, 3)
                    print(f"429 on chunk {idx+1} (attempt {attempt}/{MAX_RETRIES}). Waiting {wait:.0f}s...")
                    time.sleep(wait)
                    tokens_this_minute = 0
                    minute_start = time.time()

                    if attempt == MAX_RETRIES:
                        raise Exception(
                            f"Chunk {idx+1} failed after {MAX_RETRIES} retries. "
                            f"Daily quota may be exhausted; try again tomorrow."
                        ) from e

                elif "per day" in err.lower():
                    raise Exception(
                        "Daily RPD quota (1,000) exhausted. "
                        "Resets at midnight Pacific Time. Try again tomorrow."
                    ) from e

                else:
                    raise

        time.sleep(EMBED_DELAY_SECONDS)

    print(f"\n✅ Done: {total} chunks stored in FAISS.\n")
    return vector_store

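The 429 handling above hinges on pulling Google's suggested delay out of the error text. Isolated as helpers (a sketch for testing; `parse_retry_delay` and `backoff_with_jitter` are not part of the module), the same regex can be exercised without hitting the API:

```python
import re
import random

def parse_retry_delay(err: str, default: float = 30.0) -> float:
    """Extract a suggested retry delay like 'retry in 7.5s' from a 429 message."""
    match = re.search(r"retry[^\d]*(\d+\.?\d*)\s*s", err, re.IGNORECASE)
    return float(match.group(1)) if match else default

def backoff_with_jitter(err: str) -> float:
    """Total wait: the suggested delay plus 1-3 s of random jitter."""
    return parse_retry_delay(err) + random.uniform(1, 3)

print(parse_retry_delay("429 RESOURCE_EXHAUSTED: Please retry in 7.5s."))  # 7.5
print(parse_retry_delay("429 Too Many Requests"))                          # 30.0
```

The jitter desynchronizes concurrent clients so they do not all retry at the same instant.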
# Build the RAG agent

def create_rag_agent(vector_store):

    model = init_chat_model(
        "google_genai:gemini-2.5-flash",
        temperature=0
    )

    @tool(response_format="content_and_artifact")
    def retrieve_context(query: str):
        """
        Searches the PDF for context relevant to the query and returns
        the relevant chunks.
        """
        retrieved_docs = vector_store.similarity_search(query, k=3)

        # Format docs as readable text for the LLM
        serialized = "\n\n".join(
            f"[Chunk {i+1}]:\n{doc.page_content}"
            for i, doc in enumerate(retrieved_docs)
        )
        return serialized, retrieved_docs  # content for the LLM, raw docs for the UI

    tools = [retrieve_context]

    system_prompt = (
        "You are a helpful assistant that answers questions about an uploaded PDF document. "
        "You have access to a retrieval tool that searches the PDF content for context relevant to the question. "
        "Always use the retrieval tool to find relevant information before answering the question. "
        "If the document does not contain the answer, say so clearly. "
        "Keep your answers concise, accurate, and grounded in the document content."
    )

    agent = create_agent(model, tools, system_prompt=system_prompt)
    return agent

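A `response_format="content_and_artifact"` tool returns a pair: a serialized string the model reads, and a raw artifact the UI renders. The serialization step can be sketched standalone (the `Doc` dataclass here is a hypothetical stand-in for LangChain's `Document`):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    page_content: str

def serialize_docs(docs):
    """Join retrieved chunks into one labeled string for the LLM."""
    return "\n\n".join(
        f"[Chunk {i+1}]:\n{d.page_content}" for i, d in enumerate(docs)
    )

docs = [Doc("alpha"), Doc("beta")]
print(serialize_docs(docs))
```

Labeling chunks (`[Chunk 1]`, `[Chunk 2]`, ...) gives the model stable handles it can cite, while the untouched `Doc` objects flow to the expanders in `app.py`.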
# Get the answer with conversation history
def get_answer(agent, user_question, chat_history):

    # Convert history dicts to LangChain message objects
    messages = []
    for msg in chat_history:
        if msg["role"] == "user":
            messages.append(HumanMessage(content=msg["content"]))
        elif msg["role"] == "assistant":
            messages.append(AIMessage(content=msg["content"]))

    # Append the current question
    messages.append(HumanMessage(content=user_question))

    source_docs = []
    final_answer = ""

    # Stream through all agent steps
    for step in agent.stream({"messages": messages}, stream_mode="values"):
        last_message = step["messages"][-1]

        # Collect source docs from the tool message
        if isinstance(last_message, ToolMessage):
            if hasattr(last_message, "artifact") and isinstance(last_message.artifact, list):
                source_docs = last_message.artifact

        # Extract the final answer from the AIMessage only
        if isinstance(last_message, AIMessage):
            content = last_message.content

            # Handle string content (common case)
            if isinstance(content, str) and content.strip():
                final_answer = content

            # Handle a list of content blocks
            elif isinstance(content, list):
                text_parts = [
                    block.get("text", "") if isinstance(block, dict) else str(block)
                    for block in content
                ]
                assembled = " ".join(part for part in text_parts if part.strip())
                if assembled:
                    final_answer = assembled

    # Fallback if no answer was captured
    if not final_answer:
        final_answer = "I was unable to generate a response. Please try rephrasing your question."

    return final_answer, source_docs
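Gemini may return `content` either as a plain string or as a list of content blocks, which is why `get_answer` branches on the type. That flattening logic can be exercised in isolation (plain dicts standing in for LangChain content blocks; `flatten_content` is an extracted sketch, not a module function):

```python
def flatten_content(content):
    """Collapse an AIMessage-style content value into a single string."""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts = [
            block.get("text", "") if isinstance(block, dict) else str(block)
            for block in content
        ]
        return " ".join(p for p in parts if p.strip())
    return ""

print(flatten_content("hello"))                              # hello
print(flatten_content([{"text": "a"}, {"type": "x"}, "b"]))  # a b
```

Blocks without a `"text"` key (e.g. tool-call metadata) contribute nothing, so only human-readable text reaches the chat window.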