Updated Readme

README.md CHANGED
- **Concise Memory**: Automatically summarizes answers to keep conversation history efficient and reduce token usage
- **REST API**: Full REST API for integration with any application or custom UIs
- **Streamlit UI**: User-friendly web interface for document upload and interactive querying
- **Multiple LLM Support**: Currently supports Groq LLM (easily extensible to other providers)
## 📦 Installation

1. **Clone the repository** and navigate to the project directory
2. **Install dependencies**: `pip install -r requirements.txt`
3. **Set up environment**: Create a `.env` file with your `GROQ_API_KEY` (get it from https://console.groq.com/)
## 🚀 Quick Start

### Streamlit Web App

Run `streamlit run app.py` and open http://localhost:8501 in your browser. Upload PDFs in the "Upload & Process" tab, then query them in the "Chat" tab.
### REST API

Run `uvicorn api:app --reload` and visit http://localhost:8000/docs for interactive API documentation. Use the API endpoints to upload documents and query them programmatically.
### Python Scripts

Use the core functions from `src/rag_pipeline.py` directly in your Python code, or run `python example_client.py` for a complete example.
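As an illustration of the chunking step those functions perform, here is a hedged sketch of fixed-size splitting with overlap in plain Python. It is not the actual `src/rag_pipeline.py` API (the pipeline uses RecursiveCharacterTextSplitter), just the underlying idea:

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 100, chunk_size=40, overlap=10)
print(len(chunks))  # chunks start at offsets 0, 30, 60
```

Overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of some duplicated storage.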
## 📁 Project Structure

- **`src/rag_pipeline.py`**: Core RAG pipeline components (document processing, embeddings, vector store, retrieval, generation)
- **`app.py`**: Streamlit web application with UI for document upload and chat interface
- **`api.py`**: FastAPI REST API server for programmatic access
- **`example_client.py`**: Example Python client demonstrating API usage
- **`data/pdf/`**: Directory for PDF documents
- **`data/vector_store/`**: ChromaDB vector store persistence directory
- **`notebook/`**: Jupyter notebooks for experimentation and development
## 💡 How It Works

### Document Processing Flow

1. **Upload**: PDF files are uploaded and loaded using PyMuPDFLoader
2. **Chunking**: Documents are split into smaller chunks using RecursiveCharacterTextSplitter with configurable size and overlap
3. **Embedding**: Each chunk is converted to a vector embedding using SentenceTransformer models
4. **Storage**: Embeddings and documents are stored in the ChromaDB vector database with metadata
5. **Retrieval**: When querying, the query is embedded and semantically similar documents are retrieved
6. **Generation**: Retrieved context is combined with the query and conversation history, then sent to the LLM for answer generation
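The steps above can be sketched end to end with stand-ins for the real components: a toy bag-of-words "embedding" in place of SentenceTransformer, a plain list in place of ChromaDB, and the final LLM call left as prompt assembly only:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for SentenceTransformer vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-4: chunk, embed, and store documents (in-memory stand-in for ChromaDB)
store = []
for chunk in ["cats are small mammals", "python is a programming language"]:
    store.append({"text": chunk, "vec": embed(chunk)})

# Step 5: retrieve the most semantically similar chunk for a query
query = "what language is python"
qvec = embed(query)
best = max(store, key=lambda d: cosine(qvec, d["vec"]))

# Step 6: combine retrieved context with the query for the LLM prompt
prompt = f"Context: {best['text']}\n\nQuestion: {query}"
print(best["text"])
```

The real pipeline swaps in dense neural embeddings and a persistent vector store, but the retrieve-then-prompt shape is the same.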
### Key Components

- **EmbeddingModel**: Manages sentence transformer models for generating document and query embeddings
- **VectorStore**: Handles ChromaDB operations for storing and querying document embeddings
- **RagRetriever**: Performs semantic search with optional metadata filtering
- **RAG Pipeline Functions**: Combine retrieval with LLM generation, supporting conversation memory
### Conversation Memory

The system maintains conversation history per session, storing:

- Full user queries for context
- Concise summaries of assistant answers (extracted key points) to save space

Previous conversation context is included in prompts to enable follow-up questions.
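As an illustration of this scheme (the names and the naive summarizer below are hypothetical, not the project's actual functions), per-session memory with summarized answers might look like:

```python
# Hypothetical in-memory session store: full queries, summarized answers
histories: dict[str, list[dict]] = {}

def summarize(answer: str, max_words: int = 8) -> str:
    """Naive stand-in for the answer summarizer: keep only the first few words."""
    words = answer.split()
    return " ".join(words[:max_words]) + ("…" if len(words) > max_words else "")

def remember(session_id: str, query: str, answer: str) -> None:
    """Store the full query but only a concise summary of the answer."""
    histories.setdefault(session_id, []).append(
        {"query": query, "answer_summary": summarize(answer)}
    )

def history_prompt(session_id: str) -> str:
    """Render prior turns for inclusion in the next LLM prompt."""
    return "\n".join(
        f"User: {t['query']}\nAssistant: {t['answer_summary']}"
        for t in histories.get(session_id, [])
    )

remember("s1", "What is RAG?", "RAG combines retrieval with generation to ground answers in documents.")
print(history_prompt("s1"))
```

Storing summaries instead of full answers is what keeps token usage bounded as a session grows.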
### Metadata Filtering

Filter documents before retrieval to:

- Search only in specific source files
- Limit to certain page ranges
- Apply custom metadata filters
- Improve retrieval speed and accuracy by reducing search space
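A minimal sketch of the idea in plain Python (the real pipeline delegates filtering to ChromaDB's `where` clauses; the helper below is hypothetical):

```python
# Illustrative pre-retrieval filter over chunk metadata
docs = [
    {"text": "intro", "meta": {"source": "a.pdf", "page": 1}},
    {"text": "methods", "meta": {"source": "a.pdf", "page": 7}},
    {"text": "other", "meta": {"source": "b.pdf", "page": 2}},
]

def filter_docs(docs, source=None, page_range=None):
    """Keep only documents matching the source file and/or inclusive page range."""
    out = []
    for d in docs:
        if source is not None and d["meta"]["source"] != source:
            continue
        if page_range is not None and not (page_range[0] <= d["meta"]["page"] <= page_range[1]):
            continue
        out.append(d)
    return out

subset = filter_docs(docs, source="a.pdf", page_range=(1, 5))
print([d["text"] for d in subset])
```

Shrinking the candidate set before the similarity search both speeds up retrieval and removes near-matches from irrelevant files.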
## 📚 Usage

### Streamlit App

The web interface provides two main tabs:

- **Upload & Process**: Upload PDF files, configure chunking parameters (chunk size, overlap), and process documents. View system status including document and chunk counts.
- **Chat**: Interactive chat interface where you can ask questions about uploaded documents. The chat remembers previous conversations within the session. You can enable metadata filtering in the sidebar to narrow down searches.
### REST API

The API provides endpoints for:

- **Upload**: Upload and process PDF documents with custom chunking parameters
- **Query**: Query documents with optional conversation memory and metadata filtering
- **Chat History**: Retrieve or clear conversation history for specific sessions
- **Status**: Check system status and document counts
- **Reset**: Clear all documents and chat histories

See `API_USAGE.md` for detailed API documentation and examples.
### Python Integration

Import functions from `src/rag_pipeline.py` to:

- Process PDFs from directories
- Chunk documents with custom parameters
- Generate embeddings
- Store in vector database
- Retrieve relevant documents
- Generate answers with conversation context
## ⚙️ Configuration

### Environment Variables

Set `GROQ_API_KEY` in your `.env` file to use Groq LLM models.

### Customization Options

- **Embedding Model**: Change the SentenceTransformer model (default: `all-MiniLM-L6-v2`)
- **Vector Store**: Customize collection name and persistence directory
- **LLM Model**: Choose different Groq models or extend to other providers
- **Chunking**: Adjust chunk size and overlap based on your document types
- **Retrieval**: Configure top_k results and similarity score thresholds
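For instance, top_k and a similarity score threshold combine like this (a hedged sketch with made-up scores, not the project's actual retrieval code):

```python
# Hypothetical scored retrieval results: (chunk, similarity score)
scored = [("chunk A", 0.91), ("chunk B", 0.64), ("chunk C", 0.42), ("chunk D", 0.15)]

def select(scored, top_k=3, min_score=0.4):
    """Sort by similarity, drop low-scoring chunks, keep at most top_k results."""
    kept = [(t, s) for t, s in sorted(scored, key=lambda x: -x[1]) if s >= min_score]
    return kept[:top_k]

print(select(scored))
```

Raising `min_score` trades recall for precision; raising `top_k` gives the LLM more context at the cost of longer prompts.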
## 🔧 Troubleshooting

**Module not found errors**: Ensure all dependencies are installed with `pip install -r requirements.txt`

**No documents retrieved**: Check that documents were successfully processed, verify the query matches document content, and try rephrasing

- **Interactive API Docs**: Visit http://localhost:8000/docs when the API server is running
- **Example Client**: Run `python example_client.py` to see a complete usage example
---
title: PDF Chatbot – RAG Pipeline
emoji: 📄
colorFrom: indigo
colorTo: purple
sdk: streamlit
sdk_version: "1.35.0"
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# 📄 PDF Chatbot – RAG Pipeline

This Space hosts an end-to-end **Retrieval-Augmented Generation (RAG)** pipeline that allows users to upload PDFs and ask questions about their content.

The system extracts text, chunks it intelligently, embeds it into a vector database, and retrieves relevant context to answer queries using a large language model (LLM).
---

## 🚀 Features

- 🔹 PDF upload support
- 🔹 Automatic text extraction
- 🔹 Smart document chunking
- 🔹 Vector storage using ChromaDB
- 🔹 LLM-powered question answering
- 🔹 Streamlit-based interface
- 🔹 Clean RAG pipeline implementation (`src/rag_pipeline.py`)

---

## 🏗️ Project Structure