Saketh12345 committed on
Commit 82e2888 · 0 Parent(s)

Initial clean commit for Streamlit Hugging Face Spaces deployment

Files changed (5)
  1. .gitignore +50 -0
  2. LICENSE +21 -0
  3. README.md +78 -0
  4. app.py +350 -0
  5. requirements.txt +35 -0
.gitignore ADDED
@@ -0,0 +1,50 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual Environment
+ venv/
+ env/
+ ENV/
+
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+
+ # Environment variables
+ .env
+
+ # ChromaDB
+ chroma_db/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # OS generated files
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Saketh Jangala
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ title: PDF Chat App
+ emoji: "📄"
+ colorFrom: blue
+ colorTo: green
+ sdk: streamlit
+ sdk_version: 1.45.1
+ app_file: app.py
+ pinned: false
+ ---
+
+ # PDF Chat Application
+
+ A PDF chat application that lets you upload PDFs and ask questions about their content using natural language processing. Built with Streamlit, LangChain, and Hugging Face Transformers, the app runs entirely on local models (no external API calls) and can be deployed to Hugging Face Spaces.
+
+ ## ✨ Features
+
+ - 📄 Upload and process PDF documents
+ - 💬 Chat with your documents using natural language
+ - 🔒 Local processing: no data leaves your machine
+ - 🤗 Uses Hugging Face models for embeddings and question answering
+ - 🚀 Built with Streamlit for a clean web interface
+
+ ## 🛠 Prerequisites
+
+ - A Hugging Face account (for Spaces deployment)
+ - Git (for cloning the repository)
+ - At least 4 GB of free RAM (for running the models)
+
+ ## 🚀 Getting Started
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/saketh-005/pdf-chat-app.git
+ cd pdf-chat-app
+ ```
+
+ 2. Install dependencies and run locally (optional):
+ ```bash
+ pip install -r requirements.txt
+ streamlit run app.py
+ ```
+
+ 3. Or deploy directly to Hugging Face Spaces by pushing this folder to your Space.
+
+ ## 🖥️ Usage
+
+ 1. Click "Browse files" to upload a PDF document
+ 2. Wait for the document to be processed (you'll see a success message)
+ 3. Type your question in the question box and click "Get Answer"
+ 4. The app will analyze the document and provide an answer
+
+ ## 🏗️ Project Structure
+
+ ```
+ .
+ ├── app.py              # Main Streamlit application
+ ├── requirements.txt    # Python dependencies
+ ├── .gitignore          # Git ignore file
+ └── README.md           # This file
+ ```
+
+ ## 🤖 Technologies Used
+
+ - [Streamlit](https://streamlit.io/) - Web application framework
+ - [LangChain](https://python.langchain.com/) - Framework for LLM applications
+ - [Hugging Face Transformers](https://huggingface.co/transformers/) - NLP models
+ - [Chroma DB](https://www.trychroma.com/) - Vector database for document storage
+
+ ## 📜 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## 🙏 Acknowledgments
+
+ - [Hugging Face](https://huggingface.co/) for their amazing open-source models
+ - [LangChain](https://python.langchain.com/) for simplifying LLM application development
+ - [Streamlit](https://streamlit.io/) for the intuitive web interface
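The features above describe a retrieve-then-answer loop: chunk the PDF, embed the chunks, and pull back the chunks most relevant to the question. A dependency-free sketch of that idea, with simple word overlap standing in for the app's sentence-transformer embeddings (illustrative only; the chunk sizes and texts here are made up):

```python
def split_into_chunks(text, chunk_size=400, overlap=100):
    """Naive character-level chunking with overlap between consecutive chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

def score(question, chunk):
    """Similarity stand-in: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve(question, chunks, k=2):
    """Return the k highest-scoring chunks (the real app uses MMR over embeddings)."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]

text = ("Streamlit builds the web interface. " * 5 +
        "LangChain orchestrates the QA chain. " * 5)
chunks = split_into_chunks(text, chunk_size=80, overlap=20)
top = retrieve("What builds the web interface?", chunks, k=1)
print(top[0])
```

The retrieved text would then be handed to the language model along with the question, which is what the QA chain in `app.py` does with real embeddings.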
app.py ADDED
@@ -0,0 +1,350 @@
+ import os
+ import sys
+ import torch
+ import streamlit as st
+ from PyPDF2 import PdfReader
+ from typing import List, Dict, Any, Optional
+
+ # LangChain imports
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
+ from langchain_community.vectorstores import Chroma
+ from langchain.chains import RetrievalQA
+ from langchain.prompts import PromptTemplate
+ from langchain_core.documents import Document
+
+ # Transformers imports
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForSeq2SeqLM,
+     pipeline,
+     set_seed
+ )
+
+ # Set random seed for reproducibility
+ set_seed(42)
+
+ # Disable HuggingFace tokenizer parallelism warnings
+ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
+
+ def extract_text_from_pdf(pdf_file):
+     """Extract text from a PDF file."""
+     text = ""
+     try:
+         pdf_reader = PdfReader(pdf_file)
+         for page in pdf_reader.pages:
+             page_text = page.extract_text()
+             if page_text:
+                 text += page_text + "\n"
+         if not text.strip():
+             st.error("Could not extract any text from the PDF. The PDF might be scanned or protected.")
+             return None
+         return text
+     except Exception as e:
+         st.error(f"Error reading PDF file: {str(e)}")
+         return None
+
+ def generate_response(uploaded_file, query_text):
+     """
+     Handles the main logic using local Hugging Face models.
+     No API key required as everything runs locally.
+     """
+     if uploaded_file is None:
+         return "Error: No file uploaded."
+
+     # 1. Extract text from PDF
+     st.info("Reading your PDF document...")
+     raw_text = extract_text_from_pdf(uploaded_file)
+     if raw_text is None:
+         return "Error: Could not extract text from the PDF."
+
+     # 2. Split text into manageable chunks
+     st.info("Splitting text into chunks...")
+     # Keep chunks well under the model's 512-token max sequence length;
+     # a conservative character count allows for tokenization differences.
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=400,
+         chunk_overlap=100,
+         length_function=len,
+         is_separator_regex=False,
+         separators=["\n\n", "\n", ". ", " ", ""],
+     )
+     texts = text_splitter.split_text(raw_text)
+
+     # 3. Create embeddings and vector store
+     st.info("Creating document embeddings...")
+
+     # Use GPU if available, otherwise CPU
+     device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+     try:
+         # Try the more powerful embeddings model first
+         embeddings = HuggingFaceEmbeddings(
+             model_name='sentence-transformers/all-mpnet-base-v2',
+             model_kwargs={'device': device},
+             encode_kwargs={'normalize_embeddings': True}
+         )
+         # Sanity-check the embeddings model
+         test_emb = embeddings.embed_query("test")
+         if not test_emb or len(test_emb) == 0:
+             raise Exception("Embeddings model returned empty result")
+     except Exception as e:
+         st.warning(f"Falling back to smaller embeddings model due to: {str(e)}")
+         try:
+             embeddings = HuggingFaceEmbeddings(
+                 model_name='sentence-transformers/all-MiniLM-L6-v2',
+                 model_kwargs={'device': device},
+                 encode_kwargs={'normalize_embeddings': True}
+             )
+         except Exception as e:
+             st.error(f"Failed to load embeddings model: {str(e)}")
+             return "Error: Could not load embeddings model."
+
+     try:
+         # Create ChromaDB vector store with per-chunk metadata
+         # (note: "page" here is the chunk index + 1, not the PDF page number)
+         document_search = Chroma.from_texts(
+             texts=texts,
+             embedding=embeddings,
+             metadatas=[{"source": f"chunk-{i}", "page": i + 1} for i in range(len(texts))],
+             collection_metadata={"hnsw:space": "cosine"}
+         )
+         # Force a small query to verify the vector store works
+         _ = document_search.similarity_search("test", k=1)
+     except Exception as e:
+         st.error(f"Failed to create vector store: {str(e)}")
+         st.exception(e)  # Show full traceback for debugging
+         return "Error: Could not process document content."
+
+     # 4. Load the question-answering model
+     st.info("Loading question-answering model...")
+
+     # Model selection with fallback
+     model_name = "google/flan-t5-large"
+     fallback_model = "google/flan-t5-base"
+
+     try:
+         # Try the larger model first
+         tokenizer = AutoTokenizer.from_pretrained(model_name)
+         model = AutoModelForSeq2SeqLM.from_pretrained(
+             model_name,
+             device_map="auto" if torch.cuda.is_available() else None,
+             torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+             low_cpu_mem_usage=True
+         )
+     except Exception as e:
+         st.warning(f"Falling back to smaller model due to: {str(e)}")
+         try:
+             model_name = fallback_model
+             tokenizer = AutoTokenizer.from_pretrained(model_name)
+             model = AutoModelForSeq2SeqLM.from_pretrained(
+                 model_name,
+                 device_map="auto" if torch.cuda.is_available() else None,
+                 torch_dtype=torch.float32,  # float32 for stability on CPU
+                 low_cpu_mem_usage=True
+             )
+         except Exception as e:
+             st.error(f"Failed to load language model: {str(e)}")
+             return "Error: Could not load question-answering model."
+
+     try:
+         # Create text generation pipeline
+         pipe = pipeline(
+             "text2text-generation",
+             model=model,
+             tokenizer=tokenizer,
+             max_length=1024,
+             temperature=0.2,
+             do_sample=True,
+             top_p=0.92,
+             top_k=50,
+             num_beams=4,
+             device=0 if torch.cuda.is_available() else -1,
+         )
+
+         llm = HuggingFacePipeline(
+             pipeline=pipe,
+             model_kwargs={
+                 "temperature": 0.2,
+                 "max_length": 1024,
+                 "repetition_penalty": 1.2,
+                 "no_repeat_ngram_size": 3
+             }
+         )
+
+         # 5. Create a retriever with MMR for better diversity
+         retriever = document_search.as_retriever(
+             search_type="mmr",
+             search_kwargs={
+                 "k": 5,
+                 "fetch_k": min(20, len(texts)),
+                 "lambda_mult": 0.5
+             }
+         )
+
+         # 6. Create a prompt template for better answers
+         template = """Use the following pieces of context to answer the question at the end.
+ If the context doesn't contain enough information to answer the question,
+ just say that you don't know based on the provided information.
+
+ Context:
+ {context}
+
+ Question: {question}
+
+ Provide a detailed and comprehensive answer based on the context above.
+ Answer:"""
+
+         QA_CHAIN_PROMPT = PromptTemplate(
+             input_variables=["context", "question"],
+             template=template,
+         )
+
+         # 7. Create the QA chain
+         qa_chain = RetrievalQA.from_chain_type(
+             llm=llm,
+             chain_type="stuff",
+             retriever=retriever,
+             chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
+             return_source_documents=True
+         )
+
+         # 8. Get the answer
+         st.info("Generating answer...")
+         # invoke() instead of __call__ to avoid the deprecation warning
+         result = qa_chain.invoke({"query": query_text})
+
+         # 9. Format the response with sources
+         response = {
+             "answer": result["result"],
+             "sources": []
+         }
+
+         # Add source documents if available
+         if result.get("source_documents"):
+             for i, doc in enumerate(result["source_documents"], 1):
+                 response["sources"].append({
+                     "id": i,
+                     "page": doc.metadata.get("page", "N/A"),
+                     "content": doc.page_content[:500] + ("..." if len(doc.page_content) > 500 else "")
+                 })
+
+         return response
+
+     except Exception as e:
+         st.error(f"Error generating response: {str(e)}")
+         return f"Error: Could not generate a response. {str(e)}"
+
+ def main():
+     """Main function to run the Streamlit app."""
+     # --- Streamlit Page Configuration ---
+     st.set_page_config(
+         page_title="Chat with your PDF (Local Version)",
+         page_icon="💬",
+         layout="wide"
+     )
+
+     st.title("Chat with Your Notes (100% Local) 💬")
+
+     # Sidebar with instructions
+     with st.sidebar:
+         st.title("ℹ️ How to use")
+         st.markdown("""
+         1. Upload a PDF file
+         2. Ask a question about the document
+         3. Get instant answers!
+
+         *No API keys needed. Everything runs locally on your machine.*
+         *First run may take a few minutes to download the models.*
+         """)
+
+         st.markdown("---")
+         st.markdown("### System Information")
+         st.write(f"Python: {sys.version.split()[0]}")
+         st.write(f"PyTorch: {torch.__version__}")
+         st.write(f"CUDA Available: {torch.cuda.is_available()}")
+         if torch.cuda.is_available():
+             st.write(f"GPU: {torch.cuda.get_device_name(0)}")
+
+     # File upload
+     st.header("1. Upload your PDF")
+     uploaded_file = st.file_uploader(
+         "Choose a PDF file",
+         type=["pdf"],
+         label_visibility="collapsed"
+     )
+
+     st.header("2. Ask a question")
+     question = st.text_area(
+         "Enter your question about the document:",
+         placeholder="What is this document about?",
+         label_visibility="collapsed"
+     )
+
+     return uploaded_file, question
+
+ if __name__ == "__main__":
+     # Get user inputs
+     uploaded_file, question = main()
+
+     # Add some spacing
+     st.write("")
+
+     # Generate response when button is clicked
+     if st.button("Get Answer", type="primary", use_container_width=True):
+         if not uploaded_file:
+             st.error("Please upload a PDF file first!")
+         elif not question.strip():
+             st.error("Please enter a question!")
+         else:
+             with st.spinner("Processing your question..."):
+                 try:
+                     response = generate_response(uploaded_file, question)
+
+                     if isinstance(response, str) and response.startswith("Error:"):
+                         st.error(response)
+                     else:
+                         # Display the answer
+                         st.markdown("### Answer")
+                         st.markdown(response["answer"])
+
+                         # Display sources if available
+                         if response["sources"]:
+                             st.markdown("\n### Sources")
+                             for source in response["sources"]:
+                                 with st.expander(f"Source {source['id']} (Page {source['page']})"):
+                                     st.markdown(source['content'])
+
+                     # Add some spacing at the bottom
+                     st.write("")
+                     st.markdown("---")
+                     st.caption("Note: This is a local AI model. No data was sent to any external servers.")
+
+                 except Exception as e:
+                     st.error("An error occurred while generating the response.")
+                     st.exception(e)  # Show full traceback for debugging
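The `chain_type="stuff"` QA chain in app.py works by concatenating the retrieved chunks into the `{context}` slot of the prompt template and the user's query into `{question}`. A minimal plain-Python sketch of that assembly step, with `str.format` standing in for LangChain's `PromptTemplate` (the chunk texts below are made up):

```python
# The same template app.py passes to the QA chain.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If the context doesn't contain enough information to answer the question,
just say that you don't know based on the provided information.

Context:
{context}

Question: {question}

Provide a detailed and comprehensive answer based on the context above.
Answer:"""

def build_prompt(retrieved_chunks, question):
    # The "stuff" strategy: join every retrieved chunk into one context block,
    # then fill both template slots. The result is what the LLM actually sees.
    context = "\n\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["Chunk one about Streamlit.", "Chunk two about LangChain."],
    "What is this document about?",
)
print(prompt)
```

Because all retrieved chunks are stuffed into a single prompt, the retriever's `k=5` and the splitter's `chunk_size=400` together bound how much text reaches the model at once.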
requirements.txt ADDED
@@ -0,0 +1,35 @@
+ # Core dependencies
+ numpy
+ setuptools
+ wheel
+ cython
+
+ # Streamlit and web framework
+ streamlit>=1.45.1
+
+ # PDF processing
+ PyPDF2
+
+ # Vector database and embeddings
+ chromadb
+ sentence_transformers
+
+ # LangChain and related
+ langchain_community
+ langchain_core
+ langchain_huggingface
+ langchain
+ langchain_text_splitters
+
+ # Hugging Face ecosystem
+ transformers
+ accelerate
+ huggingface_hub
+
+ # Utilities
+ tqdm
+ python_dotenv
+ torch
+ watchdog