Deployment
Browse files
- README.md +75 -12
- dockerfile +31 -0
- requirements.txt +0 -0
- src/app.py +210 -0
- src/rag_pipelline.py +226 -0
README.md
CHANGED
@@ -1,12 +1,75 @@
PDF Chatbot with RAG
====================

Overview
--------
PDF Chatbot with RAG is a Streamlit-powered demo that lets you ask natural-language questions about any selectable-text PDF. The app takes a document, breaks it into overlapping chunks, embeds those chunks with Google Generative AI, and serves a LangChain agent that always consults retrieved context before responding. The goal is to keep answers concise, grounded, and easy for reviewers to follow without diving into the underlying code.

How it works
------------
1. A sidebar workflow handles file selection: upload your own PDF or choose one of the curated samples that live in `sample_pdf/`.
2. Once the document is confirmed, `rag_pipelline.py` extracts text with `PyPDF2`, splits it into 2,500-character chunks, embeds each chunk with Gemini embeddings, and stores the vectors in FAISS in memory.
3. A LangChain agent built around the Gemini 2.5 Flash chat model uses a retrieval tool to fetch the most relevant chunks and streams answers back to the Streamlit chat interface.

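The chunking in step 2 also determines how many embedding calls a document costs: each chunk after the first advances by roughly `chunk_size - chunk_overlap` characters (2,500 and 300 in this pipeline). A rough plain-Python estimate (the app itself uses LangChain's `RecursiveCharacterTextSplitter`; `estimate_chunk_count` is only an illustration):

```python
import math

def estimate_chunk_count(n_chars: int, size: int = 2500, overlap: int = 300) -> int:
    """Rough count of overlapping chunks for a document of n_chars characters."""
    if n_chars <= size:
        return 1
    step = size - overlap  # each new chunk advances by size - overlap chars
    return math.ceil((n_chars - size) / step) + 1

# A 50-page PDF at ~3,000 characters/page is ~150,000 characters:
print(estimate_chunk_count(150_000))  # 69
```

This is useful for anticipating how much of the daily embedding quota a large upload will consume before clicking **Process PDF**.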
Components
----------
- `app.py`: Streamlit UI, session-state management, and chat orchestration. The sidebar coordinates uploads, sample selection, and processing states while the main area renders the dialog and chunk-level sources.
- `rag_pipelline.py`: Text extraction, chunking, embedding, FAISS creation, agent building, and helper utilities for rate-limit handling and retries.
- `sample_pdf/`: A handful of ready-to-use PDFs (e.g., the GPT-4 technical report) so you can explore the experience without providing your own document.
- `requirements.txt`: Pinned dependencies for Streamlit, LangChain, FAISS, Google Generative AI, and related helpers.
- `.env`: Holds `GOOGLE_API_KEY` (or other Google credentials) needed to call the embedding service.

Setup
-----
Clone the repository and configure the environment before launching the app.

1. **Prerequisites**
   - Install Python 3.12 or newer.
   - Have a Google Cloud project with the Generative AI API enabled and a valid API key (or service account credentials).

2. **Environment**
   - Create a `.env` file at the project root.
   - Add your key:

     ```
     GOOGLE_API_KEY=your-generated-key
     ```

   - If you prefer service account credentials, set `GOOGLE_APPLICATION_CREDENTIALS` instead of `GOOGLE_API_KEY`.

3. **Dependencies**
   - Create and activate a virtual environment:

     ```
     python -m venv .venv
     .venv\Scripts\Activate.ps1   # PowerShell
     .venv\Scripts\activate.bat   # cmd.exe
     source .venv/bin/activate    # Bash
     ```

   - Install the pinned packages:

     ```
     pip install -r requirements.txt
     ```

4. **Launch**
   - Start the Streamlit app:

     ```
     streamlit run app.py
     ```

   - Upload a text-based PDF or select a sample from the sidebar, click **Process PDF**, and wait for the four spinner steps (extract → chunk → embed → agent).
   - Ask questions in the chat box once processing completes.

Tips
----
- Keep questions focused so Gemini can stay concise and reuse the retrieved chunks shown in the expanders.
- Use the **Clear & Reset** button before switching documents to avoid leftover state.
- If embeddings hit rate limits, wait a minute: `rag_pipelline.py` already throttles calls, and the console logs retries.

Next steps
----------
1. Persist FAISS to disk or a managed vector database if you need to reuse vector stores across sessions.
2. Add tests that cover chunk creation, embedding retries, and agent responses so you can refactor with confidence.
dockerfile
ADDED
@@ -0,0 +1,31 @@
FROM python:3.12.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the app files
COPY . .

# Hugging Face Spaces writes to /tmp/.streamlit
ENV STREAMLIT_HOME=/tmp/.streamlit
ENV STREAMLIT_BROWSER_GATHER_USAGE_STATS=false

# Expose the Streamlit default port
EXPOSE 8501

# Healthcheck so Hugging Face knows the app is running
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Run the app
ENTRYPOINT ["streamlit", "run", "app.py", \
    "--server.port=8501", \
    "--server.address=0.0.0.0"]
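To try the container locally, build and run it along these lines (the image name `pdf-chatbot` is arbitrary, and the API key must be passed at runtime since it is not baked into the image):

```shell
docker build -t pdf-chatbot .
docker run -p 8501:8501 -e GOOGLE_API_KEY=your-generated-key pdf-chatbot
# then open http://localhost:8501
```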
requirements.txt
ADDED
Binary file (788 Bytes).
src/app.py
ADDED
@@ -0,0 +1,210 @@
import streamlit as st
from rag_pipelline import (
    extract_text_from_pdf,
    split_text_into_chunks,
    create_vector_store,
    create_rag_agent,
    get_answer
)


# Page Config -----
st.set_page_config(
    page_title="PDF Chatbot - using RAG",
    page_icon="📄",
    layout="wide"
)

# Header -----
st.markdown("### 📄 PDF Chatbot - RAG + Gemini")
st.markdown("Powered by LangChain and Gemini 2.5 Flash")
st.divider()

# Session State -----
if "agent" not in st.session_state:
    st.session_state.agent = None

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

if "display_messages" not in st.session_state:
    st.session_state.display_messages = []

if "pdf_processed" not in st.session_state:
    st.session_state.pdf_processed = False

if "pdf_name" not in st.session_state:
    st.session_state.pdf_name = ""

# Sidebar -----

with st.sidebar:
    st.header("⚙️ Stack Info")
    st.markdown("**Framework:** LangChain 1.2.10")
    st.markdown("**LLM:** Gemini 2.5 Flash")
    st.markdown("**Embeddings:** Google gemini-embedding-001")
    st.markdown("**Vector Store:** FAISS")
    st.divider()

    st.header("📄 Upload or Select PDF")
    # Upload a PDF
    uploaded_file = st.file_uploader(
        "Upload a PDF",
        type=["pdf"],
        help="Max 10 MB · Max 50 pages · Must have selectable text (not scanned)"
    )

    # Select a sample PDF
    sample_pdf = st.selectbox(
        "Or pick a sample PDF:",
        ["None", "Attention is All You Need", "GPT-4 Technical Report", "WHO 2025 Report", "World Bank Annual Report 2024"]
    )

    # Ensure only one PDF is used at a time
    chosen_file, chosen_name = None, ""
    if uploaded_file is not None:
        chosen_file = uploaded_file
        chosen_name = uploaded_file.name
    elif sample_pdf != "None":
        sample_map = {
            "Attention is All You Need": "sample_pdf/Attention_is_all_you_need.pdf",
            "GPT-4 Technical Report": "sample_pdf/GPT-4_Technical_Report.pdf",
            "WHO 2025 Report": "sample_pdf/WHO_2025.pdf",
            "World Bank Annual Report 2024": "sample_pdf/World_Bank_Annual_Report_2024.pdf"
        }
        # Open via a variable so the handle can be closed after use
        sample_path = sample_map.get(sample_pdf)
        if sample_path:
            try:
                chosen_file = open(sample_path, "rb")
                chosen_name = sample_pdf
                st.info(f"📄 Using sample file: {chosen_name}")
            except FileNotFoundError:
                st.error(f"❌ Sample file not found: {sample_path}")
                chosen_file = None


    if chosen_file is not None:
        if st.button("Process PDF", type="primary", use_container_width=True):
            with st.spinner("Step 1/4 - Extracting raw text"):
                raw_text = extract_text_from_pdf(chosen_file)

            # Close the sample file after reading to avoid a resource leak
            if sample_pdf != "None" and hasattr(chosen_file, "close"):
                chosen_file.close()

            if not raw_text.strip():
                st.error("❌ No text found. Please check your PDF and confirm its text is selectable")
            else:
                with st.spinner("Step 2/4 - Splitting text into chunks"):
                    chunks = split_text_into_chunks(raw_text)

                with st.spinner("Step 3/4 - Creating embeddings and vector store"):
                    vector_store = create_vector_store(chunks)

                with st.spinner("Step 4/4 - Creating RAG agent"):
                    st.session_state.agent = create_rag_agent(vector_store)
                    st.session_state.pdf_processed = True
                    st.session_state.pdf_name = chosen_name
                    st.session_state.chat_history = []
                    st.session_state.display_messages = []

                st.success(f"✅ Ready! {len(chunks)} chunks indexed")

    if st.session_state.pdf_processed:
        st.divider()
        st.success(f"Active:\n{st.session_state.pdf_name}")
        st.caption(f"Messages so far: {len(st.session_state.display_messages)}")

        if st.button("Clear & Reset", use_container_width=True):
            st.session_state.agent = None
            st.session_state.chat_history = []
            st.session_state.display_messages = []
            st.session_state.pdf_processed = False
            st.session_state.pdf_name = ""
            st.rerun()

# Main Area -----
if not st.session_state.pdf_processed:
    st.markdown("### How to use")
    col1, col2, col3 = st.columns(3)

    with col1:
        st.markdown("Step 1 - Upload or select a PDF from the sidebar")

    with col2:
        st.markdown("Step 2 - Click Process PDF")

    with col3:
        st.markdown("Step 3 - Ask your questions in the chat box")

    st.divider()

else:
    st.markdown(f"### Chatting with {st.session_state.pdf_name}")

    # Display all previous messages
    for msg in st.session_state.display_messages:
        with st.chat_message(msg["role"]):
            st.write(msg["content"])

            # Show source chunks for assistant messages
            if msg["role"] == "assistant" and msg.get("sources"):
                with st.expander("PDF chunks used to generate this answer"):
                    for i, doc in enumerate(msg["sources"]):
                        st.markdown(f"**Chunk {i+1}:**")
                        st.markdown(f"> {doc.page_content[:400]}...")
                        st.divider()


# Chat Input -----

if st.session_state.pdf_processed:
    user_question = st.chat_input(f"Ask something about {st.session_state.pdf_name}...")

    if user_question:

        # Show the user message
        with st.chat_message("user"):
            st.write(user_question)

        # Store in both histories
        st.session_state.chat_history.append({
            "role": "user",
            "content": user_question
        })
        st.session_state.display_messages.append({
            "role": "user",
            "content": user_question
        })

        # Get the answer from the agent
        with st.chat_message("assistant"):
            with st.spinner("Agent is searching the PDF and thinking"):
                answer, source_docs = get_answer(
                    st.session_state.agent,
                    user_question,
                    st.session_state.chat_history[:-1]  # history without the current question
                )
            st.write(answer)

            if source_docs:
                with st.expander("PDF chunks used to generate this answer"):
                    for i, doc in enumerate(source_docs):
                        st.markdown(f"**Chunk {i+1}:**")
                        st.markdown(f"> {doc.page_content[:400]}...")

            # Store the assistant response
            st.session_state.chat_history.append({
                "role": "assistant",
                "content": answer
            })
            st.session_state.display_messages.append({
                "role": "assistant",
                "content": answer,
                "sources": source_docs
            })
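The app deliberately keeps two parallel histories: `chat_history` (role/content only, replayed to the agent) and `display_messages` (which additionally carries `sources` for the expanders, data the model should never see). That bookkeeping, stripped of Streamlit, amounts to:

```python
chat_history = []       # what the agent sees: role + content only
display_messages = []   # what the UI renders: may also carry sources

def record_turn(question, answer, sources):
    """Append one question/answer exchange to both histories."""
    for store in (chat_history, display_messages):
        store.append({"role": "user", "content": question})
    chat_history.append({"role": "assistant", "content": answer})
    display_messages.append({"role": "assistant", "content": answer, "sources": sources})

record_turn("What is RAG?", "Retrieval-augmented generation.", ["chunk 1"])
print(len(chat_history), len(display_messages))  # 2 2
```

Keeping the lists separate means the token budget sent to Gemini never grows with the retrieved chunk text shown in the UI.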
src/rag_pipelline.py
ADDED
@@ -0,0 +1,226 @@
import os
import re
import time
import random
from dotenv import load_dotenv
from PyPDF2 import PdfReader

from langchain.chat_models import init_chat_model  # new universal model initializer
from langchain.agents import create_agent          # replaces AgentExecutor
from langchain.tools import tool                   # tool decorator
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

load_dotenv()

# Constants
CHUNK_SIZE = 2500
CHUNK_OVERLAP = 300
EMBED_DELAY_SECONDS = 1.5
TPM_SAFE_THRESHOLD = 27000
MAX_RETRIES = 3

# Extract text from the PDF
def extract_text_from_pdf(pdf_file):
    """
    Reads each page of the PDF and extracts raw text.
    Some pages may return None, so we fall back to an empty string.
    """
    pdf_reader = PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""
    return text

# Split the text into chunks
def split_text_into_chunks(raw_text):
    """
    Split the text into chunks of CHUNK_SIZE characters, with
    CHUNK_OVERLAP characters of overlap, to avoid hitting token limits.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len
    )
    chunks = text_splitter.split_text(raw_text)
    return chunks

# Create the vector store

def create_vector_store(text_chunks):
    """
    Embed chunks with Gemini and store them in FAISS.
    Respects all three free-tier limits:
      RPM = 100 -> actual ~40 RPM (1.5 s delay)
      TPM = 30K -> pauses at the 27K tokens/min threshold
      RPD = 1K  -> chunk_size=2500 minimises daily calls

    On 429: reads Google's retry delay from the error, waits plus jitter.
    On RPD exhausted: raises a clear message immediately.
    """
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/gemini-embedding-001"
    )

    total = len(text_chunks)
    est_min = (total * EMBED_DELAY_SECONDS) / 60
    print(f"\nEmbedding {total} chunks, est. {est_min:.1f} min")
    print(f"RPD usage: {total}/1,000 daily quota\n")

    vector_store = None
    tokens_this_minute = 0
    minute_start = time.time()

    for idx, chunk in enumerate(text_chunks):
        chunk_tokens = max(1, len(chunk) // 4)  # 1 token ~ 4 chars

        # --- TPM guard: pause if approaching 30K tokens/min ---
        elapsed = time.time() - minute_start
        if elapsed < 60 and (tokens_this_minute + chunk_tokens) > TPM_SAFE_THRESHOLD:
            wait = 60 - elapsed + 2
            print(f"TPM guard: {tokens_this_minute:,} tokens sent. Waiting {wait:.0f}s...")
            time.sleep(wait)
            tokens_this_minute = 0
            minute_start = time.time()

        if time.time() - minute_start >= 60:
            tokens_this_minute = 0
            minute_start = time.time()

        # --- Embed with retry ---
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                if vector_store is None:
                    vector_store = FAISS.from_texts(texts=[chunk], embedding=embeddings)
                else:
                    vector_store.add_texts(texts=[chunk])

                tokens_this_minute += chunk_tokens

                if (idx + 1) % 10 == 0 or (idx + 1) == total:
                    print(f"  {idx+1}/{total} chunks | ~{tokens_this_minute:,} tokens this min")
                break

            except Exception as e:
                err = str(e)

                if "429" in err or "RESOURCE_EXHAUSTED" in err:
                    match = re.search(r"retry[^\d]*(\d+\.?\d*)\s*s", err, re.IGNORECASE)
                    g_wait = float(match.group(1)) if match else 30
                    wait = g_wait + random.uniform(1, 3)
                    print(f"429 on chunk {idx+1} (attempt {attempt}/{MAX_RETRIES}). Waiting {wait:.0f}s...")
                    time.sleep(wait)
                    tokens_this_minute = 0
                    minute_start = time.time()

                    if attempt == MAX_RETRIES:
                        raise Exception(
                            f"Chunk {idx+1} failed after {MAX_RETRIES} retries. "
                            f"Daily quota may be exhausted; try again tomorrow."
                        ) from e

                elif "per day" in err.lower():
                    raise Exception(
                        "Daily RPD quota (1,000) exhausted. "
                        "Resets at midnight Pacific Time. Try again tomorrow."
                    ) from e

                else:
                    raise

        time.sleep(EMBED_DELAY_SECONDS)

    print(f"\n✅ Done: {total} chunks stored in FAISS.\n")
    return vector_store

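The 429 handling above hinges on pulling Google's suggested delay out of the error text. Isolated as helpers (a sketch for testing; `parse_retry_delay` and `backoff_with_jitter` are not part of the module), the same regex can be exercised without hitting the API:

```python
import re
import random

def parse_retry_delay(err: str, default: float = 30.0) -> float:
    """Extract a suggested retry delay like 'retry in 7.5s' from a 429 message."""
    match = re.search(r"retry[^\d]*(\d+\.?\d*)\s*s", err, re.IGNORECASE)
    return float(match.group(1)) if match else default

def backoff_with_jitter(err: str) -> float:
    """Total wait: the suggested delay plus 1-3 s of random jitter."""
    return parse_retry_delay(err) + random.uniform(1, 3)

print(parse_retry_delay("429 RESOURCE_EXHAUSTED: Please retry in 7.5s."))  # 7.5
print(parse_retry_delay("429 Too Many Requests"))                          # 30.0
```

The jitter desynchronizes concurrent clients so they do not all retry at the same instant.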
# Build the RAG agent

def create_rag_agent(vector_store):

    model = init_chat_model(
        "google_genai:gemini-2.5-flash",
        temperature=0
    )

    @tool(response_format="content_and_artifact")
    def retrieve_context(query: str):
        """
        Searches the PDF for context relevant to the query and returns
        the relevant chunks.
        """
        retrieved_docs = vector_store.similarity_search(query, k=3)

        # Format docs as readable text for the LLM
        serialized = "\n\n".join(
            f"[Chunk {i+1}]:\n{doc.page_content}"
            for i, doc in enumerate(retrieved_docs)
        )
        return serialized, retrieved_docs  # content for the LLM, raw docs for the UI

    tools = [retrieve_context]

    system_prompt = (
        "You are a helpful assistant that answers questions about an uploaded PDF document. "
        "You have access to a retrieval tool that searches the PDF content for context relevant to the question. "
        "Always use the retrieval tool to find relevant information before answering the question. "
        "If the document does not contain the answer, say so clearly. "
        "Keep your answers concise, accurate, and grounded in the document content."
    )

    agent = create_agent(model, tools, system_prompt=system_prompt)
    return agent

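A `response_format="content_and_artifact"` tool returns a pair: a serialized string the model reads, and a raw artifact the UI renders. The serialization step can be sketched standalone (the `Doc` dataclass here is a hypothetical stand-in for LangChain's `Document`):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    page_content: str

def serialize_docs(docs):
    """Join retrieved chunks into one labeled string for the LLM."""
    return "\n\n".join(
        f"[Chunk {i+1}]:\n{d.page_content}" for i, d in enumerate(docs)
    )

docs = [Doc("alpha"), Doc("beta")]
print(serialize_docs(docs))
```

Labeling chunks (`[Chunk 1]`, `[Chunk 2]`, ...) gives the model stable handles it can cite, while the untouched `Doc` objects flow to the expanders in `app.py`.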
# Get the answer with conversation history
def get_answer(agent, user_question, chat_history):

    # Convert history dicts to LangChain message objects
    messages = []
    for msg in chat_history:
        if msg["role"] == "user":
            messages.append(HumanMessage(content=msg["content"]))
        elif msg["role"] == "assistant":
            messages.append(AIMessage(content=msg["content"]))

    # Append the current question
    messages.append(HumanMessage(content=user_question))

    source_docs = []
    final_answer = ""

    # Stream through all agent steps
    for step in agent.stream({"messages": messages}, stream_mode="values"):
        last_message = step["messages"][-1]

        # Collect source docs from the tool message
        if isinstance(last_message, ToolMessage):
            if hasattr(last_message, "artifact") and isinstance(last_message.artifact, list):
                source_docs = last_message.artifact

        # Extract the final answer from the AIMessage only
        if isinstance(last_message, AIMessage):
            content = last_message.content

            # Handle string content (common case)
            if isinstance(content, str) and content.strip():
                final_answer = content

            # Handle a list of content blocks
            elif isinstance(content, list):
                text_parts = [
                    block.get("text", "") if isinstance(block, dict) else str(block)
                    for block in content
                ]
                assembled = " ".join(part for part in text_parts if part.strip())
                if assembled:
                    final_answer = assembled

    # Fallback if no answer was captured
    if not final_answer:
        final_answer = "I was unable to generate a response. Please try rephrasing your question."

    return final_answer, source_docs
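Gemini may return `content` either as a plain string or as a list of content blocks, which is why `get_answer` branches on the type. That flattening logic can be exercised in isolation (plain dicts standing in for LangChain content blocks; `flatten_content` is an extracted sketch, not a module function):

```python
def flatten_content(content):
    """Collapse an AIMessage-style content value into a single string."""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts = [
            block.get("text", "") if isinstance(block, dict) else str(block)
            for block in content
        ]
        return " ".join(p for p in parts if p.strip())
    return ""

print(flatten_content("hello"))                              # hello
print(flatten_content([{"text": "a"}, {"type": "x"}, "b"]))  # a b
```

Blocks without a `"text"` key (e.g. tool-call metadata) contribute nothing, so only human-readable text reaches the chat window.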