Upload folder using huggingface_hub
- .gitattributes +2 -0
- .gitignore +6 -0
- Dockerfile +37 -0
- README.md +127 -10
- assets/architecture.jpg +3 -0
- assets/demo.gif +3 -0
- main.py +394 -0
- requirements.txt +0 -0
- utils/agent.py +67 -0
- utils/processor.py +58 -0
- utils/vector_store.py +52 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/architecture.jpg filter=lfs diff=lfs merge=lfs -text
+assets/demo.gif filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,6 @@
```
__pycache__
utils/__pycache__
chroma_db/
venv/
.env
.dockerignore
```
Dockerfile
ADDED
@@ -0,0 +1,37 @@
```dockerfile
FROM python:3.11-slim AS base

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV POETRY_NO_INTERACTION=1

# ---------------- Main Application Stage -----------------
FROM base

# Set the working directory in the container
WORKDIR /app

# Install dependencies
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code into the container
COPY main.py .
COPY utils/ ./utils/

# Set environment variable for the ChromaDB path *inside* the container.
# Data will be mounted to this path using a volume.
ENV DB_PATH=chroma_db

# Create the directory for ChromaDB data and declare it as a volume.
# This ensures the directory exists and signals it holds persistent data.
RUN mkdir -p chroma_db
VOLUME chroma_db

# Expose the port Streamlit runs on
EXPOSE 8501

# Define the command to run the application.
# Use 0.0.0.0 to make it accessible from outside the container.
CMD ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
```
README.md
CHANGED
@@ -1,10 +1,127 @@
# Agentic RAG Streamlit Application

This project implements a Retrieval-Augmented Generation (RAG) system using **Gemini** and **Streamlit**. It allows users to ingest data from PDF files and web URLs, ask questions, and receive answers generated by a **Large Language Model (LLM)** that leverages the ingested context and, optionally, web search results.

![Architecture](assets/architecture.jpg)

### How it works

* The user uploads PDF documents or provides web URLs; these sources are processed and stored in a **Chroma** vector database.
* When the user submits a query, it is first sent to a **Rewrite Agent**. This agent analyzes and reformulates the original query to improve its clarity and effectiveness for retrieval.
* The rewritten query is forwarded to the LLM. The LLM searches the vector DB (**Chroma**), retrieving relevant text chunks by semantic similarity. Depending on configuration, it can also run a web search (**DuckDuckGo**) to gather information not present in the uploaded documents. If no relevant context is found, the LLM answers from its general knowledge. (A condensed sketch of this flow appears after this list.)
* The generated response is sent back to the Streamlit interface, where it is displayed to the user.
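A condensed sketch of the query flow, using the helper functions defined in `main.py` (the Streamlit UI and error handling are omitted; `prompt` and the two web-search flags come from the sidebar):

```python
# 1. Rewrite the user query for better retrieval.
rewritten = rewrite_query(prompt)

# 2. Retrieve document chunks from Chroma (skipped when web search is forced).
docs, doc_context = search_documents(rewritten)

# 3. Fall back to (or force) DuckDuckGo web search.
web_context = ""
if force_web_search or (use_web_search and not doc_context):
    web_context = search_web(rewritten)

# 4. Generate the final answer with the Gemini RAG agent.
context = web_context if force_web_search else (doc_context or web_context)
answer = generate_response(prompt, rewritten, context)
```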
## Features

* **Data Ingestion:** Upload PDF files or enter web URLs to populate the knowledge base.
* **Persistent Vector Store:** Uses **ChromaDB** to store and retrieve text embeddings locally.
* **Query Rewriting:** Employs an **Agno** agent to reformulate user questions for potentially better retrieval results.
* **Retrieval-Augmented Generation (RAG):**
    * Retrieves relevant text chunks from the **ChromaDB** vector store based on the (rewritten) query.
    * Uses a RAG agent (**Gemini**) to synthesize an answer from the retrieved context.
* **Web Search:** Optionally performs a web search via **DuckDuckGo** if:
    * No relevant documents are found in the local vector store, or
    * Web search is explicitly forced via the UI.
* **Configuration:** Allows users to configure:
    * Enabling/disabling web search.
    * Forcing web search.
    * Adjusting the similarity score threshold for document retrieval (see the snippet below).
* **Database Management:** Options to clear the chat history and the vector database.
* **Dockerized:** Includes a `Dockerfile` for easy containerization and deployment.
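The similarity threshold maps directly onto the retriever that `main.py` builds over the vector store:

```python
# From main.py: the sidebar slider's value becomes the retriever's score threshold.
retriever = st.session_state.vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": RETRIEVER_K,  # number of chunks to retrieve (5 by default)
        "score_threshold": st.session_state.similarity_threshold,
    },
)
docs = retriever.invoke(query)
```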
## Tech Stack

* **Web Framework:** Streamlit
* **Vector Database:** ChromaDB
* **LLM & Embeddings:** Gemini
* **Core Logic:** Langchain (document processing, vector store integration), Agno (agents)
* **Containerization:** Docker

## Prerequisites

* **Python:** Version 3.11 or higher recommended.
* **pip:** Python package installer.
* **Git:** For cloning the repository.
* **Docker:** Required for running the application with Docker (recommended for easy setup and persistence).
* **Google API Key:** You need an API key for Google Generative AI (e.g., the Gemini API). You can obtain one from [Google AI Studio](https://aistudio.google.com/app/apikey).
## How to use

### Without Docker

1. **Clone the Repository:**
    ```bash
    git clone https://github.com/luanntd/RAG-System-with-Gemini.git
    cd RAG-System-with-Gemini
    ```

2. **Create a Virtual Environment (Recommended):**
    ```bash
    python -m venv venv
    # Activate it (Linux/macOS)
    source venv/bin/activate
    # Activate it (Windows)
    .\venv\Scripts\activate
    ```

3. **Install Dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

4. **Create the Directory for the Vector Store:**
    ```bash
    mkdir chroma_db
    ```

5. **Set Up Environment Variables:**
    * Create a file named `.env` in the project's root directory.
    * Add the following variables:

    ```dotenv
    GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
    COLLECTION_NAME=rag_system
    DB_PATH=chroma_db
    ```
    * Replace `YOUR_GOOGLE_API_KEY` with your actual Google API key. `main.py` loads these values at startup, as shown below.
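A minimal sketch of how the variables are consumed (this mirrors the constants block at the top of `main.py`):

```python
import os
from dotenv import load_dotenv

# Load .env from the project root; os.getenv then resolves each variable,
# with defaults for the collection name and database path.
load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_system")
DB_PATH = os.getenv("DB_PATH", "chroma_db")
```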
6. **Run the Application:**
    ```bash
    streamlit run main.py
    ```

### With Docker (Recommended)

Complete steps 1 (clone) and 5 (environment variables) above first.

1. **Build the Docker Image:**
    ```bash
    docker build -t rag-system .
    ```

2. **Run the Docker Container:**
    * Create the volume:
    ```bash
    docker volume create chroma_data
    ```
    * Run the container:
    ```bash
    docker run -d \
      -p 8501:8501 \
      --env-file ./.env \
      -v chroma_data:/app/chroma_db \
      --name rag-system-container \
      rag-system
    ```

    * **Explanation of `docker run` flags:**
        * `-d`: Run the container in detached mode (in the background).
        * `-p 8501:8501`: Map port 8501 on your host machine to port 8501 inside the container.
        * `--env-file ./.env`: Load environment variables from your local `.env` file into the container.
        * `-v chroma_data:/app/chroma_db`: Mounts persistent storage. It links the named volume `chroma_data` to the `/app/chroma_db` directory *inside* the container (the `chroma_db` directory under `WORKDIR /app`, where ChromaDB stores its data).
        * `--name rag-system-container`: Assigns a name to your running container.
        * `rag-system`: The name of the Docker image you built.

3. **Access the Application:**
    * Open your web browser and navigate to `http://localhost:8501`.

## Demo

![Demo](assets/demo.gif)
assets/architecture.jpg
ADDED
(stored with Git LFS)

assets/demo.gif
ADDED
(stored with Git LFS)
main.py
ADDED
@@ -0,0 +1,394 @@
```python
import os
import streamlit as st
from chromadb import PersistentClient
from dotenv import load_dotenv
from urllib.parse import urlparse, urlunparse

from utils.processor import process_pdf, process_web
from utils.vector_store import create_vector_store
from utils.agent import get_query_rewriter_agent, get_web_search_agent, get_rag_agent

# --- Constants and Configuration ---
load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_system")  # Provide a default
DB_PATH = os.getenv("DB_PATH", "chroma_db")
DEFAULT_SIMILARITY_THRESHOLD = 0.7
RETRIEVER_K = 5  # Number of documents to retrieve

# --- Helper Functions ---

def initialize_session_state():
    """Initializes Streamlit session state variables if they don't exist."""
    defaults = {
        'google_api_key': GOOGLE_API_KEY,
        'history': [],
        'use_web_search': False,
        'force_web_search': False,
        'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
        'vector_store': None,
        'processed_documents': [],
        'chroma_client': None,
        'chroma_collection': None,
        'url_input': "",
        'clear_url_input_flag': False
    }
    for key, value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = value

def normalize_url(url: str) -> str:
    """
    Normalizes a URL for consistent checking and storage.
    - Adds 'http' if no scheme is present.
    - Converts scheme and domain to lowercase.
    - Removes the 'www.' prefix.
    - Removes trailing slashes from the path.
    - Removes fragments (#...).
    """
    url = url.strip()
    if not url:
        return ""

    # Add scheme if missing (default to http for parsing)
    if '://' not in url:
        url = 'http://' + url

    try:
        parts = urlparse(url)

        # Lowercase scheme and netloc (domain)
        scheme = parts.scheme.lower()
        netloc = parts.netloc.lower()

        # Remove 'www.' prefix
        if netloc.startswith('www.'):
            netloc = netloc[4:]

        # Remove trailing slashes from path, but keep root '/'
        path = parts.path.rstrip('/')
        if not path and parts.path == '/':  # Keep root slash if original path was only '/'
            path = '/'
        # If path became empty after stripping and wasn't root, ensure it starts with / if netloc exists
        elif not path and parts.path != '/' and netloc:
            path = ''  # Or '/' depending on desired strictness; empty seems safer.
        elif path and not path.startswith('/') and netloc:
            path = '/' + path  # Ensure path starts with / if not empty

        # Reconstruct without query params and fragment for basic normalization.
        # Note: query params are ignored for simplicity; robust normalization might sort/handle them.
        normalized = urlunparse((scheme, netloc, path, '', '', ''))
        return normalized
    except ValueError:
        st.warning(f"⚠️ Could not properly normalize URL: {url}. Using original.")
        return url

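# Example (hypothetical inputs) of what normalize_url produces:
#   normalize_url("WWW.Example.com/docs/")     -> "http://example.com/docs"
#   normalize_url("https://Example.com#intro") -> "https://example.com"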

def load_vector_store():
    """Loads or initializes the ChromaDB vector store and retrieves processed documents."""
    if st.session_state.vector_store is None:
        try:
            st.session_state.chroma_client = PersistentClient(path=DB_PATH)
            st.session_state.chroma_collection = st.session_state.chroma_client.get_or_create_collection(name=COLLECTION_NAME)

            # Wrap the collection in a Langchain vector store
            st.session_state.vector_store = create_vector_store(
                st.session_state.google_api_key,
                client=st.session_state.chroma_client
            )

            # Retrieve metadata (source names) of already processed documents
            results = st.session_state.chroma_collection.get(include=['metadatas'])
            if results and 'metadatas' in results and results['metadatas']:
                processed_docs = set()
                for meta in results['metadatas']:
                    if meta and 'source' in meta:
                        processed_docs.add(meta['source'])
                st.session_state.processed_documents = list(processed_docs)  # Convert back to a list for consistency
                st.success(f"✅ Loaded {len(st.session_state.processed_documents)} documents from database.")
            else:
                st.session_state.processed_documents = []
                st.info("ℹ️ No existing documents found in the database.")

        except Exception as e:
            st.session_state.vector_store = None
            st.session_state.processed_documents = []
            st.session_state.chroma_client = None
            st.session_state.chroma_collection = None
            st.warning(f"⚠️ Error loading/creating vector store: {e}")

def add_texts_to_vector_store(texts, source_name):
    """Adds processed text documents to the vector store."""
    if not texts:
        st.warning(f"⚠️ No text extracted from {source_name}. Skipping.")
        return False
    try:
        if st.session_state.vector_store is None:
            # Initialize the vector store if it doesn't exist yet
            st.session_state.vector_store = create_vector_store(
                st.session_state.google_api_key,
                texts=texts,  # Pass initial texts if needed by create_vector_store
                client=st.session_state.chroma_client
            )
            # Ensure the collection is updated if the vector store was just created
            st.session_state.chroma_collection = st.session_state.chroma_client.get_or_create_collection(name=COLLECTION_NAME)

        else:
            st.session_state.vector_store.add_documents(texts)

        st.session_state.processed_documents.append(source_name)
        st.success(f"✅ Added source: {source_name} to the database.")
        return True
    except Exception as e:
        st.error(f"❌ Error adding {source_name} to vector store: {e}")
        return False

def clear_chat_history():
    """Clears the chat history."""
    st.session_state.history = []
    st.success("Chat history cleared.")

def clear_vector_database():
    """Clears all documents from the ChromaDB collection."""
    if st.session_state.chroma_collection:
        try:
            existing_ids = st.session_state.chroma_collection.get(include=[])['ids']
            if existing_ids:
                st.session_state.chroma_collection.delete(ids=existing_ids)
                st.session_state.processed_documents = []
                st.success("✅ Database cleared successfully. Note that this action does not delete the uploaded files in the current session state.")
            else:
                st.info("ℹ️ Database is already empty.")
        except Exception as e:
            st.error(f"❌ Error clearing database: {e}")
    else:
        st.warning("⚠️ Vector store not initialized. Cannot clear database.")

def display_processed_sources():
    """Displays the list of processed documents/URLs in the sidebar."""
    if st.session_state.processed_documents:
        st.sidebar.header("📚 Processed Sources")
        for source in sorted(list(set(st.session_state.processed_documents))):  # Ensure uniqueness and sort
            icon = "📄" if source.lower().endswith(".pdf") else "🌐"
            st.sidebar.text(f"{icon} {source}")

def display_chat_history():
    """Displays the chat messages from session state."""
    for chat in st.session_state.history:
        with st.chat_message(chat["role"]):
            st.write(chat["content"])

def rewrite_query(query):
    """Rewrites the user query using the query rewriter agent."""
    try:
        query_rewriter = get_query_rewriter_agent()
        rewritten_query = query_rewriter.run(query).content
        # Optionally display the rewritten query
        # with st.expander("🔄 Rewritten Query"):
        #     st.write(f"Original: {query}")
        #     st.write(f"Rewritten: {rewritten_query}")
        return rewritten_query
    except Exception as e:
        st.error(f"❌ Error rewriting query: {str(e)}")
        return query

def search_documents(query):
    """Searches the vector store for relevant documents."""
    if not st.session_state.vector_store:
        st.info("ℹ️ Vector store is not available for document search.")
        return [], ""

    retriever = st.session_state.vector_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={
            "k": RETRIEVER_K,
            "score_threshold": st.session_state.similarity_threshold
        }
    )
    try:
        with st.spinner("Searching documents..."):
            docs = retriever.invoke(query)
            if docs:
                context = "\n\n".join([d.page_content for d in docs])
                st.info(f"📄 Found {len(docs)} relevant document chunks.")
                return docs, context
            else:
                st.info("ℹ️ No relevant documents found matching the threshold.")
                return [], ""
    except Exception as e:
        st.error(f"❌ Error searching documents: {e}")
        return [], ""

def search_web(query):
    """Searches the web using the web search agent."""
    try:
        with st.spinner("🌐 Searching the web..."):
            web_search_agent = get_web_search_agent()
            web_results = web_search_agent.run(query).content
            if web_results:
                st.info("🌐 Web search successful.")
                return f"Web Search Results:\n{web_results}"
            else:
                st.info("🕸️ Web search returned no results.")
                return ""
    except Exception as e:
        st.error(f"❌ Web search error: {str(e)}")
        return ""

def generate_response(original_query, rewritten_query, context):
    """Generates the final response using the RAG agent."""
    try:
        with st.spinner("🤖 Generating response..."):
            rag_agent = get_rag_agent()

            if context:
                full_prompt = f"""Based on the following context, answer the question.

Context:
{context}

Original Question: {original_query}
Rewritten Question (for context search): {rewritten_query}

Answer:"""
            else:
                # Fallback if no context from documents or web
                full_prompt = f"Answer the following question: {rewritten_query}"
                st.info("ℹ️ No specific context found. Answering based on general knowledge.")

            response = rag_agent.run(full_prompt)
            return response.content
    except Exception as e:
        st.error(f"❌ Error generating response: {str(e)}")
        return "Sorry, I encountered an error while generating the response."

# --- Streamlit App UI and Logic ---

def main():
    st.set_page_config(layout="wide")
    st.title("🤖 RAG System")

    initialize_session_state()
    load_vector_store()

    if st.session_state.get('clear_url_input_flag', False):
        st.session_state.url_input = ""
        st.session_state.clear_url_input_flag = False

    # --- Sidebar ---
    with st.sidebar:
        st.header("⚙️ Controls")
        if st.button("🗑️ Clear Chat History"):
            clear_chat_history()
        if st.button("⚠️ Clear Document Database"):
            clear_vector_database()

        st.header("🔧 Configuration")
        st.session_state.use_web_search = st.checkbox(
            "Enable Web Search", value=st.session_state.use_web_search
        )
        st.session_state.force_web_search = st.checkbox(
            "Force Web Search", value=st.session_state.force_web_search,
            help="Always use web search, even if documents are found."
        )
        st.session_state.similarity_threshold = st.slider(
            "Document Similarity Threshold",
            min_value=0.0, max_value=1.0, value=st.session_state.similarity_threshold, step=0.05,
            help="Minimum relevance score for document retrieval (higher is stricter)."
        )

        st.header("💾 Data Input")
        uploaded_files = st.file_uploader(
            "Upload PDF Files", type=["pdf"], accept_multiple_files=True
        )
        web_url = st.text_input(
            "Enter Website URL",
            key="url_input"
        )

        display_processed_sources()

    # --- Process Uploads ---
    # Process PDFs
    if uploaded_files:
        for uploaded_file in uploaded_files:
            file_name = uploaded_file.name
            if file_name not in st.session_state.processed_documents:
                with st.spinner(f'Processing PDF: {file_name}...'):
                    texts = process_pdf(uploaded_file)
                    add_texts_to_vector_store(texts, file_name)

    if web_url:
        normalized_url = normalize_url(web_url)
        if normalized_url:
            # Check if the *normalized* URL has already been processed
            if normalized_url not in st.session_state.processed_documents:
                with st.spinner(f'Processing URL: {web_url}...'):
                    # Process using the *original* URL input
                    texts = process_web(web_url)
                    if add_texts_to_vector_store(texts, normalized_url):
                        st.session_state.clear_url_input_flag = True
                        st.rerun()

    # --- Chat Interface ---
    display_chat_history()

    # Get user input
    prompt = st.chat_input("Ask a question about your documents or the web...")

    if prompt:
        # Add the user message to the UI and history
        st.chat_message("user").write(prompt)
        st.session_state.history.append({"role": "user", "content": prompt})

        # 1. Rewrite Query
        rewritten_query = rewrite_query(prompt)

        # 2. Search Strategy
        doc_context = ""
        web_context = ""
        docs = []

        # Try document search first unless web search is forced
        if not st.session_state.force_web_search:
            docs, doc_context = search_documents(rewritten_query)

        # Decide if web search is needed
        use_web = st.session_state.force_web_search or (st.session_state.use_web_search and not doc_context)

        if use_web:
            web_context = search_web(rewritten_query)
            if st.session_state.force_web_search and not web_context:
                st.warning("Forced web search did not return results.")
            elif not doc_context and web_context:
                st.info("Using web search results as fallback.")
            elif st.session_state.force_web_search and web_context:
                st.info("Using forced web search results.")

        # 3. Combine Context (prioritize document context if available and not forcing web)
        final_context = ""
        if st.session_state.force_web_search:
            final_context = web_context  # Use only web if forced
        elif doc_context:
            final_context = doc_context  # Use docs if found
        elif web_context:  # Use web only if docs weren't found (and web search was enabled/successful)
            final_context = web_context

        # 4. Generate Response
        assistant_response = generate_response(prompt, rewritten_query, final_context)

        # Add the assistant response to the UI and history
        st.chat_message("assistant").write(assistant_response)
        st.session_state.history.append({"role": "assistant", "content": assistant_response})

        # Optional: Display sources used if context came from documents
        # if not st.session_state.force_web_search and docs:
        #     with st.expander("📄 Document Sources Used"):
        #         for i, doc in enumerate(docs):
        #             source = doc.metadata.get('source', 'Unknown Source')
        #             st.write(f"**{i+1}. {source}**")
        #             st.caption(f"{doc.page_content[:250]}...")  # Show a snippet

if __name__ == "__main__":
    main()
```
requirements.txt
ADDED
Binary file (6.59 kB).
utils/agent.py
ADDED
@@ -0,0 +1,67 @@
```python
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.duckduckgo import DuckDuckGoTools

def get_query_rewriter_agent() -> Agent:
    """Initialize a query rewriting agent."""
    return Agent(
        name="Query Rewriter",
        model=Gemini(id="gemini-exp-1206"),
        instructions="""You are an expert at reformulating questions to be more precise and detailed.
        Your task is to:
        1. Analyze the user's question
        2. Rewrite it to be more specific and search-friendly
        3. Expand any acronyms or technical terms
        4. Return ONLY the rewritten query without any additional text or explanations

        Example 1:
        User: "What does it say about ML?"
        Output: "What are the key concepts, techniques, and applications of Machine Learning (ML) discussed in the context?"

        Example 2:
        User: "Tell me about transformers"
        Output: "Explain the architecture, mechanisms, and applications of Transformer neural networks in natural language processing and deep learning"
        """,
        show_tool_calls=False,
        markdown=True,
    )


def get_web_search_agent() -> Agent:
    """Initialize a web search agent using DuckDuckGo."""
    return Agent(
        name="Web Search Agent",
        model=Gemini(id="gemini-exp-1206"),
        tools=[DuckDuckGoTools(
            fixed_max_results=5
        )],
        instructions="""You are a web search expert. Your task is to:
        1. Search the web for relevant information about the query
        2. Compile and summarize the most relevant information
        3. Include sources in your response
        """,
        show_tool_calls=True,
        markdown=True,
    )


def get_rag_agent() -> Agent:
    """Initialize the main RAG agent."""
    return Agent(
        name="Gemini RAG Agent",
        model=Gemini(id="gemini-2.0-flash-thinking-exp-01-21"),
        instructions="""You are an Intelligent Agent specializing in providing accurate answers.

        When given context from documents:
        - Focus on information from the provided documents
        - Be precise and cite specific details

        When given web search results:
        - Clearly indicate that the information comes from web search
        - Synthesize the information clearly

        Always maintain high accuracy and clarity in your responses.
        """,
        show_tool_calls=True,
        markdown=True,
    )
```
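Each factory returns an Agno `Agent`; `main.py` drives them through `run()`, whose result exposes the model output on `.content`:

```python
# Usage as in main.py: run() executes the agent and .content holds the reply text.
rewriter = get_query_rewriter_agent()
rewritten = rewriter.run("What does it say about ML?").content

web_agent = get_web_search_agent()
web_results = web_agent.run(rewritten).content
```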
utils/processor.py
ADDED
@@ -0,0 +1,58 @@
```python
import os
import tempfile
from datetime import datetime
from typing import List

import streamlit as st
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_pdf(file) -> List:
    """Process a PDF file and add source metadata."""
    try:
        # Write the upload to a named temp file so PyPDFLoader can read it from disk
        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
            tmp_file.write(file.getvalue())
            tmp_path = tmp_file.name
        loader = PyPDFLoader(tmp_path)
        documents = loader.load()
        os.unlink(tmp_path)  # Clean up the temporary file

        # Add source metadata
        for doc in documents:
            doc.metadata.update({
                "source_type": "pdf",
                "file_name": file.name,
                "timestamp": datetime.now().isoformat()
            })

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        return text_splitter.split_documents(documents)

    except Exception as e:
        st.error(f"📄 PDF processing error: {str(e)}")
        return []


def process_web(url: str) -> List:
    """Process a web URL and add source metadata."""
    try:
        loader = WebBaseLoader(web_path=url)
        documents = loader.load()

        # Add source metadata
        for doc in documents:
            doc.metadata.update({
                "source_type": "url",
                "url": url,
                "timestamp": datetime.now().isoformat()
            })

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        return text_splitter.split_documents(documents)

    except Exception as e:
        st.error(f"🌐 Web processing error: {str(e)}")
        return []
```
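Both loaders feed the same splitter configuration; a quick, illustrative way to see the chunking behavior in isolation:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
long_text = "lorem ipsum " * 500  # stand-in for extracted PDF or web text
chunks = splitter.split_text(long_text)
print(len(chunks))  # each chunk is at most ~1000 chars; adjacent chunks overlap by up to 200
```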
utils/vector_store.py
ADDED
@@ -0,0 +1,52 @@
```python
from typing import List
import os
import streamlit as st
import google.generativeai as genai
from langchain_chroma import Chroma
from langchain_core.embeddings import Embeddings
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'rag_system')

class GeminiEmbedder(Embeddings):
    def __init__(self, api_key, model_name="models/text-embedding-004"):
        genai.configure(api_key=api_key)
        self.model = model_name

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # Note: documents and queries are both embedded with the same task type
        response = genai.embed_content(
            model=self.model,
            content=text,
            task_type="retrieval_document"
        )
        return response['embedding']

def create_vector_store(api_key, texts=None, client=None):
    """Create and initialize the vector store, optionally seeding it with documents."""
    try:
        # Initialize the vector store
        vector_store = Chroma(
            collection_name=COLLECTION_NAME,
            embedding_function=GeminiEmbedder(api_key=api_key),
            persist_directory="chroma_db",
            client=client  # Pass the client if provided
        )

        # Add documents if provided
        if texts:
            with st.spinner('📤 Uploading documents to database...'):
                vector_store.add_documents(texts)
                st.success("✅ Documents stored successfully!")

        return vector_store

    except Exception as e:
        st.error(f"🔴 Vector store error: {str(e)}")
        return None
```
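A sketch of how `main.py` wires this module to the persistent Chroma client (see `load_vector_store` there; `GOOGLE_API_KEY` is the key loaded from `.env`):

```python
# As in main.py: reuse one PersistentClient for both the raw collection
# and the Langchain wrapper returned by create_vector_store.
from chromadb import PersistentClient

client = PersistentClient(path="chroma_db")  # DB_PATH
vector_store = create_vector_store(GOOGLE_API_KEY, client=client)
```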