hugging2021 committed
Commit ca637d1 · verified · 1 Parent(s): 4cf105c

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/architecture.jpg filter=lfs diff=lfs merge=lfs -text
+ assets/demo.gif filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,6 @@
+ __pycache__
+ utils/__pycache__
+ chroma_db/
+ venv/
+ .env
+ .dockerignore
Dockerfile ADDED
@@ -0,0 +1,37 @@
+ FROM python:3.11-slim AS base
+
+ # Set environment variables
+ ENV PYTHONDONTWRITEBYTECODE=1
+ ENV PYTHONUNBUFFERED=1
+ ENV POETRY_NO_INTERACTION=1
+
+ # ---------------- Main Application Stage -----------------
+ FROM base
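+ # (Effectively single-stage: the 'base' stage above only holds shared ENV
+ # settings; no build artifacts are copied between stages.)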
+
+ # Set the working directory in the container
+ WORKDIR /app
+
+ # Install dependencies
+ RUN pip install --upgrade pip
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the application code into the container
+ COPY main.py .
+ COPY utils/ ./utils/
+
+ # Set environment variable for ChromaDB path *inside* the container
+ # Data will be mounted to this path using a volume
+ ENV DB_PATH=chroma_db
+
+ # Create the directory for ChromaDB data and declare it as a volume
+ # This ensures the directory exists and signals it's for persistent data
+ RUN mkdir -p chroma_db
+ VOLUME chroma_db
+
+ # Expose the port Streamlit runs on
+ EXPOSE 8501
+
+ # Define the command to run the application
+ # Use 0.0.0.0 to make it accessible from outside the container
+ CMD ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -1,10 +1,127 @@
- ---
- title: Hugging2021 Rag System With Gemin
- emoji: 👍
- colorFrom: blue
- colorTo: blue
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Agentic RAG Streamlit Application
+
+ This project implements a Retrieval-Augmented Generation (RAG) system using **Gemini** and **Streamlit**. Users can ingest data from PDF files and web URLs, ask questions, and receive answers generated by a **Large Language Model (LLM)** that leverages the ingested context and, optionally, web search results.
+ ![architecture](./assets/architecture.jpg)
+
+ ### How it works
+
+ * The user uploads PDF documents or provides web URLs; these sources are processed and stored in the **Chroma** vector database.
+ * When the user submits a query, it is first sent to a **Rewrite Agent**, which analyzes and reformulates the original query to improve its clarity and effectiveness for retrieval.
+ * The rewritten query is forwarded to the LLM, which searches the vector database (**Chroma**) for relevant text chunks by semantic similarity. Depending on configuration, it can also use web search (**DuckDuckGo**) to gather information not present in the uploaded documents. If no relevant context is found, the LLM answers from its general knowledge.
+ * The generated response is sent back to the Streamlit interface and displayed to the user (see the sketch below).
+
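+ A minimal sketch of this flow (illustrative pseudo-code; the helper names match the functions defined in `main.py`, and the forced-web-search and configuration paths are omitted):
+
+ ```python
+ def answer(query: str) -> str:
+     rewritten = rewrite_query(query)                   # Rewrite Agent (Agno + Gemini)
+     docs, doc_context = search_documents(rewritten)    # ChromaDB similarity search
+     web_context = "" if doc_context else search_web(rewritten)  # DuckDuckGo fallback
+     context = doc_context or web_context               # may be empty -> general knowledge
+     return generate_response(query, rewritten, context)
+ ```
+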
+ ## Features
+
+ * **Data Ingestion:** Upload PDF files or enter web URLs to populate the knowledge base.
+ * **Persistent Vector Store:** Uses **ChromaDB** to store and retrieve text embeddings locally.
+ * **Query Rewriting:** Employs an **Agno** agent to reformulate user questions for potentially better retrieval results.
+ * **Retrieval-Augmented Generation (RAG):**
+   * Retrieves relevant text chunks from the **ChromaDB** vector store based on the (rewritten) query.
+   * Uses a RAG agent (**Gemini**) to synthesize an answer from the retrieved context.
+ * **Web Search:** Optionally performs a web search via **DuckDuckGo** if:
+   * No relevant documents are found in the local vector store, or
+   * Web search is explicitly forced via the UI.
+ * **Configuration:** Allows users to configure:
+   * Enabling/disabling web search.
+   * Forcing web search.
+   * Adjusting the similarity score threshold for document retrieval.
+ * **Database Management:** Options to clear the chat history and the vector database.
+ * **Dockerized:** Includes a `Dockerfile` for easy containerization and deployment.
+
+ ## Tech Stack
+
+ * **Web Framework:** Streamlit
+ * **Vector Database:** ChromaDB
+ * **LLM & Embeddings:** Gemini
+ * **Core Logic:** Langchain (document processing, vector store integration), Agno (agents)
+ * **Containerization:** Docker
+
+ ## Prerequisites
+
+ * **Python:** Version 3.11 or higher recommended.
+ * **pip:** Python package installer.
+ * **Git:** For cloning the repository.
+ * **Docker:** Required for running the application in a container (recommended for easy setup and persistence).
+ * **Google API Key:** You need an API key for Google Generative AI (e.g., the Gemini API). You can obtain one from [Google AI Studio](https://aistudio.google.com/app/apikey).
+
+ ## How to use
+
+ ### Without Docker
+
+ 1. **Clone the Repository:**
+    ```bash
+    git clone https://github.com/luanntd/RAG-System-with-Gemini.git
+    cd RAG-System-with-Gemini
+    ```
+
+ 2. **Create a Virtual Environment (Recommended):**
+    ```bash
+    python -m venv venv
+    # Activate it (Linux/macOS)
+    source venv/bin/activate
+    # Activate it (Windows)
+    .\venv\Scripts\activate
+    ```
+
+ 3. **Install Dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Create a Directory for the Vector Store:**
+    ```bash
+    mkdir chroma_db
+    ```
+
+ 5. **Set Up Environment Variables:**
+    * Create a file named `.env` in the project's root directory.
+    * Add the following variables:
+
+      ```dotenv
+      GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
+      COLLECTION_NAME=rag_system
+      DB_PATH=chroma_db
+      ```
+    * Replace `YOUR_GOOGLE_API_KEY` with your actual Google API key.
+
+ 6. **Run the Application:**
+    ```bash
+    streamlit run main.py
+    ```
+
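+ Streamlit listens on port 8501 by default, so the app should be reachable at `http://localhost:8501`.
+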
+ ### With Docker (Recommended)
+
+ Complete steps 1 and 5 above before proceeding.
+
+ 1. **Build the Docker Image:**
+    ```bash
+    docker build -t rag-system .
+    ```
+
+ 2. **Run the Docker Container:**
+    * Create the volume:
+      ```bash
+      docker volume create chroma_data
+      ```
+    * Run the container:
+      ```bash
+      docker run -d \
+        -p 8501:8501 \
+        --env-file ./.env \
+        -v chroma_data:/chroma_db \
+        --name rag-system-container \
+        rag-system
+      ```
+
+    * **Explanation of `docker run` flags:**
+      * `-d`: Run the container in detached mode (in the background).
+      * `-p 8501:8501`: Map port 8501 on your host machine to port 8501 inside the container.
+      * `--env-file ./.env`: Load environment variables from your local `.env` file into the container.
+      * `-v chroma_data:/chroma_db`: Mounts persistent storage by linking the named volume `chroma_data` to the `/chroma_db` directory *inside* the container; this is where ChromaDB stores its data.
+      * `--name rag-system-container`: Assigns a name to your running container.
+      * `rag-system`: The name of the Docker image you built.
+
+ 3. **Access the Application:**
+    * Open your web browser and navigate to `http://localhost:8501`.
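+    * If the page does not load, verify the container is running with `docker ps` and inspect its logs with `docker logs rag-system-container`.
+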
+ ## Demo
+ ![demo](./assets/demo.gif)
assets/architecture.jpg ADDED

Git LFS Details

  • SHA256: 24d4b70a7ef18cfc3186989180d72730277b96adae9ad1966e586646a0918abb
  • Pointer size: 131 Bytes
  • Size of remote file: 132 kB
assets/demo.gif ADDED

Git LFS Details

  • SHA256: 18de8a7db6f1eed5b0d06ee6c860d3a56f1c69c1357af1602dcb54f3f07920a2
  • Pointer size: 133 Bytes
  • Size of remote file: 31.5 MB
main.py ADDED
@@ -0,0 +1,394 @@
+ import os
+ import streamlit as st
+ from chromadb import PersistentClient
+ from dotenv import load_dotenv
+ from urllib.parse import urlparse, urlunparse
+
+ from utils.processor import process_pdf, process_web
+ from utils.vector_store import create_vector_store
+ from utils.agent import get_query_rewriter_agent, get_web_search_agent, get_rag_agent
+
+ # --- Constants and Configuration ---
+ load_dotenv()
+ GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
+ COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_system")  # Provide a default
+ DB_PATH = os.getenv("DB_PATH", "chroma_db")
+ DEFAULT_SIMILARITY_THRESHOLD = 0.7
+ RETRIEVER_K = 5  # Number of documents to retrieve
+
+ # --- Helper Functions ---
+
+ def initialize_session_state():
+     """Initializes Streamlit session state variables if they don't exist."""
+     defaults = {
+         'google_api_key': GOOGLE_API_KEY,
+         'history': [],
+         'use_web_search': False,
+         'force_web_search': False,
+         'similarity_threshold': DEFAULT_SIMILARITY_THRESHOLD,
+         'vector_store': None,
+         'processed_documents': [],
+         'chroma_client': None,
+         'chroma_collection': None,
+         'url_input': "",
+         'clear_url_input_flag': False
+     }
+     for key, value in defaults.items():
+         if key not in st.session_state:
+             st.session_state[key] = value
+
+ def normalize_url(url: str) -> str:
+     """
+     Normalizes a URL for consistent checking and storage.
+     - Adds 'http' if no scheme is present.
+     - Converts scheme and domain to lowercase.
+     - Removes 'www.' prefix.
+     - Removes trailing slashes from the path.
+     - Removes fragments (#...).
+     """
+     url = url.strip()
+     if not url:
+         return ""
+
+     # Add scheme if missing (default to http for parsing)
+     if '://' not in url:
+         url = 'http://' + url
+
+     try:
+         parts = urlparse(url)
+
+         # Lowercase scheme and netloc (domain)
+         scheme = parts.scheme.lower()
+         netloc = parts.netloc.lower()
+
+         # Remove 'www.' prefix
+         if netloc.startswith('www.'):
+             netloc = netloc[4:]
+
+         # Remove trailing slashes from path, but keep root '/'
+         path = parts.path.rstrip('/')
+         if not path and parts.path == '/':  # Keep root slash if original path was only '/'
+             path = '/'
+         # If path became empty after stripping and wasn't root, ensure it starts with / if netloc exists
+         elif not path and parts.path != '/' and netloc:
+             path = ''  # Or '/' depending on desired strictness; empty seems safer.
+         elif path and not path.startswith('/') and netloc:
+             path = '/' + path  # Ensure path starts with / if not empty
+
+         # Reconstruct without query params and fragment for basic normalization
+         # Note: ignoring query params for simplicity here. Robust normalization might sort/handle them.
+         normalized = urlunparse((scheme, netloc, path, '', '', ''))
+         return normalized
+     except ValueError:
+         st.warning(f"⚠️ Could not properly normalize URL: {url}. Using original.")
+         return url
+
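+ # Example: normalize_url("WWW.Example.com/Docs/?q=1#top") -> "http://example.com/Docs"
+ # (scheme added, host lowercased, 'www.' dropped; trailing slash, query, and fragment removed)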
+
+ def load_vector_store():
+     """Loads or initializes the ChromaDB vector store and retrieves processed documents."""
+     if st.session_state.vector_store is None:
+         try:
+             st.session_state.chroma_client = PersistentClient(path=DB_PATH)
+             st.session_state.chroma_collection = st.session_state.chroma_client.get_or_create_collection(name=COLLECTION_NAME)
+
+             # Wrap collection in Langchain vector store
+             st.session_state.vector_store = create_vector_store(
+                 st.session_state.google_api_key,
+                 client=st.session_state.chroma_client
+             )
+
+             # Retrieve metadata (source names) of already processed documents
+             results = st.session_state.chroma_collection.get(include=['metadatas'])
+             if results and 'metadatas' in results and results['metadatas']:
+                 processed_docs = set()
+                 for meta in results['metadatas']:
+                     if meta and 'source' in meta:
+                         processed_docs.add(meta['source'])
+                 st.session_state.processed_documents = list(processed_docs)  # Convert back to list for consistency
+                 st.success(f"✅ Loaded {len(st.session_state.processed_documents)} documents from database.")
+             else:
+                 st.session_state.processed_documents = []
+                 st.info("ℹ️ No existing documents found in the database.")
+
+         except Exception as e:
+             st.session_state.vector_store = None
+             st.session_state.processed_documents = []
+             st.session_state.chroma_client = None
+             st.session_state.chroma_collection = None
+             st.warning(f"⚠️ Error loading/creating vector store: {e}")
+
+ def add_texts_to_vector_store(texts, source_name):
+     """Adds processed text documents to the vector store."""
+     if not texts:
+         st.warning(f"⚠️ No text extracted from {source_name}. Skipping.")
+         return False
+     try:
+         if st.session_state.vector_store is None:
+             # Initialize vector store if it doesn't exist yet
+             st.session_state.vector_store = create_vector_store(
+                 st.session_state.google_api_key,
+                 texts=texts,  # Pass initial texts if needed by create_vector_store
+                 client=st.session_state.chroma_client
+             )
+             # Ensure collection is updated if vector store was just created
+             st.session_state.chroma_collection = st.session_state.chroma_client.get_or_create_collection(name=COLLECTION_NAME)
+
+         else:
+             st.session_state.vector_store.add_documents(texts)
+
+         st.session_state.processed_documents.append(source_name)
+         st.success(f"✅ Added source: {source_name} to the database.")
+         return True
+     except Exception as e:
+         st.error(f"❌ Error adding {source_name} to vector store: {e}")
+         return False
+
+ def clear_chat_history():
+     """Clears the chat history."""
+     st.session_state.history = []
+     st.success("Chat history cleared.")
+
+ def clear_vector_database():
+     """Clears all documents from the ChromaDB collection."""
+     if st.session_state.chroma_collection:
+         try:
+             existing_ids = st.session_state.chroma_collection.get(include=[])['ids']
+             if existing_ids:
+                 st.session_state.chroma_collection.delete(ids=existing_ids)
+                 st.session_state.processed_documents = []
+                 st.success("✅ Database cleared successfully. Note that this action does not delete the uploaded files in the current session state.")
+             else:
+                 st.info("ℹ️ Database is already empty.")
+         except Exception as e:
+             st.error(f"❌ Error clearing database: {e}")
+     else:
+         st.warning("⚠️ Vector store not initialized. Cannot clear database.")
+
+ def display_processed_sources():
+     """Displays the list of processed documents/URLs in the sidebar."""
+     if st.session_state.processed_documents:
+         st.sidebar.header("📚 Processed Sources")
+         for source in sorted(set(st.session_state.processed_documents)):  # Ensure uniqueness and sort
+             icon = "📄" if source.lower().endswith(".pdf") else "🌐"
+             st.sidebar.text(f"{icon} {source}")
+
+ def display_chat_history():
+     """Displays the chat messages from session state."""
+     for chat in st.session_state.history:
+         with st.chat_message(chat["role"]):
+             st.write(chat["content"])
+
+ def rewrite_query(query):
+     """Rewrites the user query using the query rewriter agent."""
+     try:
+         query_rewriter = get_query_rewriter_agent()
+         rewritten_query = query_rewriter.run(query).content
+         # Optionally display the rewritten query
+         # with st.expander("🔄 Rewritten Query"):
+         #     st.write(f"Original: {query}")
+         #     st.write(f"Rewritten: {rewritten_query}")
+         return rewritten_query
+     except Exception as e:
+         st.error(f"❌ Error rewriting query: {str(e)}")
+         return query
+
+ def search_documents(query):
+     """Searches the vector store for relevant documents."""
+     if not st.session_state.vector_store:
+         st.info("ℹ️ Vector store is not available for document search.")
+         return [], ""
+
+     retriever = st.session_state.vector_store.as_retriever(
+         search_type="similarity_score_threshold",
+         search_kwargs={
+             "k": RETRIEVER_K,
+             "score_threshold": st.session_state.similarity_threshold
+         }
+     )
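+     # With search_type="similarity_score_threshold", the retriever returns at most
+     # RETRIEVER_K chunks whose relevance score meets the configured threshold.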
+     try:
+         with st.spinner("Searching documents..."):
+             docs = retriever.invoke(query)
+             if docs:
+                 context = "\n\n".join([d.page_content for d in docs])
+                 st.info(f"📊 Found {len(docs)} relevant document chunks.")
+                 return docs, context
+             else:
+                 st.info("ℹ️ No relevant documents found matching the threshold.")
+                 return [], ""
+     except Exception as e:
+         st.error(f"❌ Error searching documents: {e}")
+         return [], ""
+
+ def search_web(query):
+     """Searches the web using the web search agent."""
+     try:
+         with st.spinner("🔍 Searching the web..."):
+             web_search_agent = get_web_search_agent()
+             web_results = web_search_agent.run(query).content
+             if web_results:
+                 st.info("🌐 Web search successful.")
+                 return f"Web Search Results:\n{web_results}"
+             else:
+                 st.info("🕸️ Web search returned no results.")
+                 return ""
+     except Exception as e:
+         st.error(f"❌ Web search error: {str(e)}")
+         return ""
+
+ def generate_response(original_query, rewritten_query, context):
+     """Generates the final response using the RAG agent."""
+     try:
+         with st.spinner("🤖 Generating response..."):
+             rag_agent = get_rag_agent()
+
+             if context:
+                 full_prompt = f"""Based on the following context, answer the question.
+
+ Context:
+ {context}
+
+ Original Question: {original_query}
+ Rewritten Question (for context search): {rewritten_query}
+
+ Answer:"""
+             else:
+                 # Fallback if no context from documents or web
+                 full_prompt = f"Answer the following question: {rewritten_query}"
+                 st.info("ℹ️ No specific context found. Answering based on general knowledge.")
+
+             response = rag_agent.run(full_prompt)
+             return response.content
+     except Exception as e:
+         st.error(f"❌ Error generating response: {str(e)}")
+         return "Sorry, I encountered an error while generating the response."
+
+ # --- Streamlit App UI and Logic ---
+
+ def main():
+     st.set_page_config(layout="wide")
+     st.title("🤔 RAG System")
+
+     initialize_session_state()
+     load_vector_store()
+
+     if st.session_state.get('clear_url_input_flag', False):
+         st.session_state.url_input = ""
+         st.session_state.clear_url_input_flag = False
+
+     # --- Sidebar ---
+     with st.sidebar:
+         st.header("⚙️ Controls")
+         if st.button("🗑️ Clear Chat History"):
+             clear_chat_history()
+         if st.button("⚠️ Clear Document Database"):
+             clear_vector_database()
+
+         st.header("🔧 Configuration")
+         st.session_state.use_web_search = st.checkbox(
+             "Enable Web Search", value=st.session_state.use_web_search
+         )
+         st.session_state.force_web_search = st.checkbox(
+             "Force Web Search", value=st.session_state.force_web_search,
+             help="Always use web search, even if documents are found."
+         )
+         st.session_state.similarity_threshold = st.slider(
+             "Document Similarity Threshold",
+             min_value=0.0, max_value=1.0, value=st.session_state.similarity_threshold, step=0.05,
+             help="Minimum relevance score for document retrieval (higher is stricter)."
+         )
+
+         st.header("💾 Data Input")
+         uploaded_files = st.file_uploader(
+             "Upload PDF Files", type=["pdf"], accept_multiple_files=True
+         )
+         web_url = st.text_input(
+             "Enter Website URL",
+             key="url_input"
+         )
+
+         display_processed_sources()
+
+     # --- Process Uploads ---
+     # Process PDFs
+     if uploaded_files:
+         for uploaded_file in uploaded_files:
+             file_name = uploaded_file.name
+             if file_name not in st.session_state.processed_documents:
+                 with st.spinner(f'Processing PDF: {file_name}...'):
+                     texts = process_pdf(uploaded_file)
+                     add_texts_to_vector_store(texts, file_name)
+
+     if web_url:
+         normalized_url = normalize_url(web_url)
+         if normalized_url:
+             # Check if the *normalized* URL has already been processed
+             if normalized_url not in st.session_state.processed_documents:
+                 with st.spinner(f'Processing URL: {web_url}...'):
+                     # Process using the *original* URL input
+                     texts = process_web(web_url)
+                     if add_texts_to_vector_store(texts, normalized_url):
+                         st.session_state.clear_url_input_flag = True
+                         st.rerun()
+
+     # --- Chat Interface ---
+     display_chat_history()
+
+     # Get user input
+     prompt = st.chat_input("Ask a question about your documents or the web...")
+
+     if prompt:
+         # Add user message to UI and history
+         st.chat_message("user").write(prompt)
+         st.session_state.history.append({"role": "user", "content": prompt})
+
+         # 1. Rewrite Query
+         rewritten_query = rewrite_query(prompt)
+
+         # 2. Search Strategy
+         doc_context = ""
+         web_context = ""
+         docs = []
+
+         # Try document search first unless web search is forced
+         if not st.session_state.force_web_search:
+             docs, doc_context = search_documents(rewritten_query)
+
+         # Decide if web search is needed
+         use_web = st.session_state.force_web_search or (st.session_state.use_web_search and not doc_context)
+
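+         # i.e. the web is searched when it is forced, or when it is enabled
+         # and document retrieval produced no context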
+         if use_web:
+             web_context = search_web(rewritten_query)
+             if st.session_state.force_web_search and not web_context:
+                 st.warning("Forced web search did not return results.")
+             elif not doc_context and web_context:
+                 st.info("Using web search results as fallback.")
+             elif st.session_state.force_web_search and web_context:
+                 st.info("Using forced web search results.")
+
+         # 3. Combine Context (prioritize document context if available and not forcing web)
+         final_context = ""
+         if st.session_state.force_web_search:
+             final_context = web_context  # Use only web if forced
+         elif doc_context:
+             final_context = doc_context  # Use docs if found
+         elif web_context:  # Use web only if docs weren't found (and web search was enabled/successful)
+             final_context = web_context
+
+         # 4. Generate Response
+         assistant_response = generate_response(prompt, rewritten_query, final_context)
+
+         # Add assistant response to UI and history
+         st.chat_message("assistant").write(assistant_response)
+         st.session_state.history.append({"role": "assistant", "content": assistant_response})
+
+         # Optional: display sources used if context came from documents
+         # if not st.session_state.force_web_search and docs:
+         #     with st.expander("📚 Document Sources Used"):
+         #         for i, doc in enumerate(docs):
+         #             source = doc.metadata.get('source', 'Unknown Source')
+         #             st.write(f"**{i+1}. {source}**")
+         #             st.caption(f"{doc.page_content[:250]}...")  # Show snippet
+
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
Binary file (6.59 kB)
utils/agent.py ADDED
@@ -0,0 +1,67 @@
+ from agno.agent import Agent
+ from agno.models.google import Gemini
+ from agno.tools.duckduckgo import DuckDuckGoTools
+
+ def get_query_rewriter_agent() -> Agent:
+     """Initialize a query rewriting agent."""
+     return Agent(
+         name="Query Rewriter",
+         model=Gemini(id="gemini-exp-1206"),
+         instructions="""You are an expert at reformulating questions to be more precise and detailed.
+         Your task is to:
+         1. Analyze the user's question
+         2. Rewrite it to be more specific and search-friendly
+         3. Expand any acronyms or technical terms
+         4. Return ONLY the rewritten query without any additional text or explanations
+
+         Example 1:
+         User: "What does it say about ML?"
+         Output: "What are the key concepts, techniques, and applications of Machine Learning (ML) discussed in the context?"
+
+         Example 2:
+         User: "Tell me about transformers"
+         Output: "Explain the architecture, mechanisms, and applications of Transformer neural networks in natural language processing and deep learning"
+         """,
+         show_tool_calls=False,
+         markdown=True,
+     )
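+ # Usage (mirroring main.py's rewrite_query):
+ #   rewritten = get_query_rewriter_agent().run("What does it say about ML?").content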
+
+
+ def get_web_search_agent() -> Agent:
+     """Initialize a web search agent using DuckDuckGo."""
+     return Agent(
+         name="Web Search Agent",
+         model=Gemini(id="gemini-exp-1206"),
+         tools=[DuckDuckGoTools(
+             fixed_max_results=5
+         )],
+         instructions="""You are a web search expert. Your task is to:
+         1. Search the web for relevant information about the query
+         2. Compile and summarize the most relevant information
+         3. Include sources in your response
+         """,
+         show_tool_calls=True,
+         markdown=True,
+     )
+
+
+ def get_rag_agent() -> Agent:
+     """Initialize the main RAG agent."""
+     return Agent(
+         name="Gemini RAG Agent",
+         model=Gemini(id="gemini-2.0-flash-thinking-exp-01-21"),
+         instructions="""You are an Intelligent Agent specializing in providing accurate answers.
+
+         When given context from documents:
+         - Focus on information from the provided documents
+         - Be precise and cite specific details
+
+         When given web search results:
+         - Clearly indicate that the information comes from web search
+         - Synthesize the information clearly
+
+         Always maintain high accuracy and clarity in your responses.
+         """,
+         show_tool_calls=True,
+         markdown=True,
+     )
utils/processor.py ADDED
@@ -0,0 +1,58 @@
+ import tempfile
+ from datetime import datetime
+ from typing import List
+
+ import streamlit as st
+ from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+ def process_pdf(file) -> List:
+     """Process PDF file and add source metadata."""
+     try:
+         with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
+             tmp_file.write(file.getvalue())
+             loader = PyPDFLoader(tmp_file.name)
+             documents = loader.load()
+
+         # Add source metadata
+         for doc in documents:
+             doc.metadata.update({
+                 "source_type": "pdf",
+                 "file_name": file.name,
+                 "timestamp": datetime.now().isoformat()
+             })
+
+         text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=1000,
+             chunk_overlap=200
+         )
+ )
29
+ return text_splitter.split_documents(documents)
30
+
31
+ except Exception as e:
32
+ st.error(f"πŸ“„ PDF processing error: {str(e)}")
33
+ return []
34
+
35
+
36
+ def process_web(url: str) -> List:
37
+ """Process web URL and add source metadata."""
38
+ try:
39
+ loader = WebBaseLoader(web_path=url)
40
+ documents = loader.load()
41
+
42
+ # Add source metadata
43
+ for doc in documents:
44
+ doc.metadata.update({
45
+ "source_type": "url",
46
+ "url": url,
47
+ "timestamp": datetime.now().isoformat()
48
+ })
49
+
50
+ text_splitter = RecursiveCharacterTextSplitter(
51
+ chunk_size=1000,
52
+ chunk_overlap=200
53
+ )
54
+ return text_splitter.split_documents(documents)
55
+
56
+ except Exception as e:
57
+ st.error(f"🌐 Web processing error: {str(e)}")
58
+ return []
utils/vector_store.py ADDED
@@ -0,0 +1,52 @@
+ from typing import List
+ import os
+ import streamlit as st
+ import google.generativeai as genai
+ from langchain_chroma import Chroma
+ from langchain_core.embeddings import Embeddings
+ from dotenv import load_dotenv
+
+ # Load environment variables from .env file
+ load_dotenv()
+ COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'rag_system')
+
+ class GeminiEmbedder(Embeddings):
+     def __init__(self, api_key, model_name="models/text-embedding-004"):
+         genai.configure(api_key=api_key)
+         self.model = model_name
+
+     def embed_documents(self, texts: List[str]) -> List[List[float]]:
+         return [self.embed_query(text) for text in texts]
+
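+     # Note: embed_documents embeds one chunk per API call, so ingesting a large
+     # document issues many sequential embedding requests.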
+     def embed_query(self, text: str) -> List[float]:
+         response = genai.embed_content(
+             model=self.model,
+             content=text,
+             task_type="retrieval_document"
+         )
+         return response['embedding']
+
+ def create_vector_store(api_key, texts=None, client=None):
+     """Create and initialize vector store with documents."""
+     try:
+         # Initialize vector store
+         vector_store = Chroma(
+             collection_name=COLLECTION_NAME,
+             embedding_function=GeminiEmbedder(api_key=api_key),
+             persist_directory="chroma_db",
+             client=client  # Pass the client if provided
+         )
+
+         # Add documents if provided
+         if texts:
+             with st.spinner('📤 Uploading documents to database...'):
+                 vector_store.add_documents(texts)
+                 st.success("✅ Documents stored successfully!")
+             return vector_store
+
+         return vector_store
+
+     except Exception as e:
+         st.error(f"🔴 Vector store error: {str(e)}")
+         return None
+