manasvikalyan committed
Commit eea2580 · verified · 1 Parent(s): bf10662

Updated Readme

Files changed (1)
  1. README.md +25 -137
README.md CHANGED
@@ -1,148 +1,36 @@
- # RAG Pipeline - Retrieval-Augmented Generation System
-
- A comprehensive RAG (Retrieval-Augmented Generation) pipeline that processes PDF documents, creates embeddings, stores them in a vector database, and enables intelligent querying using Large Language Models with conversation memory.
-
- ## 🚀 Features
-
- - **PDF Document Processing**: Upload and process multiple PDF files with configurable chunking
- - **Semantic Search**: Vector-based document retrieval using ChromaDB for finding relevant content
- - **Conversational AI**: Query documents with conversation memory that remembers previous context
- - **Metadata Filtering**: Filter documents by source file, page number, or custom metadata for faster and more accurate retrieval
- - **Concise Memory**: Automatically summarizes answers to keep conversation history efficient and reduce token usage
- - **REST API**: Full REST API for integration with any application or custom UIs
- - **Streamlit UI**: User-friendly web interface for document upload and interactive querying
- - **Multiple LLM Support**: Currently supports Groq LLM (easily extensible to other providers)
-
- ## 📦 Installation
-
- 1. **Clone the repository** and navigate to the project directory
- 2. **Install dependencies**: `pip install -r requirements.txt`
- 3. **Set up environment**: Create a `.env` file with your `GROQ_API_KEY` (get it from https://console.groq.com/)
-
- ## 🚀 Quick Start
-
- ### Streamlit Web App
- Run `streamlit run app.py` and open http://localhost:8501 in your browser. Upload PDFs in the "Upload & Process" tab, then query them in the "Chat" tab.
-
- ### REST API
- Run `uvicorn api:app --reload` and visit http://localhost:8000/docs for interactive API documentation. Use the API endpoints to upload documents and query them programmatically.
-
- ### Python Scripts
- Use the core functions from `src/rag_pipeline.py` directly in your Python code, or run `python example_client.py` for a complete example.
-
- ## 📁 Project Structure
-
- - **`src/rag_pipeline.py`**: Core RAG pipeline components (document processing, embeddings, vector store, retrieval, generation)
- - **`app.py`**: Streamlit web application with UI for document upload and chat interface
- - **`api.py`**: FastAPI REST API server for programmatic access
- - **`example_client.py`**: Example Python client demonstrating API usage
- - **`data/pdf/`**: Directory for PDF documents
- - **`data/vector_store/`**: ChromaDB vector store persistence directory
- - **`notebook/`**: Jupyter notebooks for experimentation and development
-
- ## 💡 How It Works
-
- ### Document Processing Flow
-
- 1. **Upload**: PDF files are uploaded and loaded using PyMuPDFLoader
- 2. **Chunking**: Documents are split into smaller chunks using RecursiveCharacterTextSplitter with configurable size and overlap
- 3. **Embedding**: Each chunk is converted to a vector embedding using SentenceTransformer models
- 4. **Storage**: Embeddings and documents are stored in ChromaDB vector database with metadata
- 5. **Retrieval**: When querying, the query is embedded and semantically similar documents are retrieved
- 6. **Generation**: Retrieved context is combined with the query and conversation history, then sent to the LLM for answer generation
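The chunking step above can be sketched as a simple sliding-window splitter. This is an illustrative stand-in, not the actual RecursiveCharacterTextSplitter (which also splits on separators such as paragraphs and sentences), but it shows how `chunk_size` and `chunk_overlap` interact:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks (simplified stand-in for
    RecursiveCharacterTextSplitter; separators are ignored here)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-character document with the defaults yields three chunks:
# [0:500], [450:950], and [900:1200] -- each sharing 50 chars with its neighbor.
chunks = chunk_text("a" * 1200)
```

Larger overlaps preserve more context across chunk boundaries at the cost of storing (and embedding) more redundant text.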
-
- ### Key Components
-
- - **EmbeddingModel**: Manages sentence transformer models for generating document and query embeddings
- - **VectorStore**: Handles ChromaDB operations for storing and querying document embeddings
- - **RagRetriever**: Performs semantic search with optional metadata filtering
- - **RAG Pipeline Functions**: Combine retrieval with LLM generation, supporting conversation memory
-
- ### Conversation Memory
-
- The system maintains conversation history per session, storing:
- - Full user queries for context
- - Concise summaries of assistant answers (extracted key points) to save space
-
- Previous conversation context is included in prompts to enable follow-up questions.
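A minimal sketch of this per-session memory, assuming a dict keyed by session ID and a crude truncation-based summarizer (the project's actual summary length and key-point extraction are not specified here):

```python
chat_histories: dict[str, list[dict]] = {}  # session_id -> list of stored turns

def summarize(answer: str, limit: int = 200) -> str:
    """Trim an answer to its leading sentences (simplified key-point
    extraction; the 200-char limit is an illustrative assumption)."""
    if len(answer) <= limit:
        return answer
    head = answer[:limit]
    return head.rsplit(". ", 1)[0] + "." if ". " in head else head

def add_turn(session_id: str, query: str, answer: str) -> None:
    """Store the full user query plus a concise answer summary."""
    chat_histories.setdefault(session_id, []).append(
        {"query": query, "answer_summary": summarize(answer)}
    )

def history_prompt(session_id: str) -> str:
    """Render prior turns as text to prepend to the next LLM prompt."""
    return "\n".join(
        f"User: {turn['query']}\nAssistant: {turn['answer_summary']}"
        for turn in chat_histories.get(session_id, [])
    )
```

Storing summaries rather than full answers keeps the prompt short, which is what reduces token usage on follow-up questions.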
-
- ### Metadata Filtering
-
- Filter documents before retrieval to:
- - Search only in specific source files
- - Limit to certain page ranges
- - Apply custom metadata filters
- - Improve retrieval speed and accuracy by reducing search space
-
- ## 📚 Usage
-
- ### Streamlit App
-
- The web interface provides two main tabs:
- - **Upload & Process**: Upload PDF files, configure chunking parameters (chunk size, overlap), and process documents. View system status including document and chunk counts.
- - **Chat**: Interactive chat interface where you can ask questions about uploaded documents. The chat remembers previous conversations within the session. You can enable metadata filtering in the sidebar to narrow down searches.
-
- ### REST API
-
- The API provides endpoints for:
- - **Upload**: Upload and process PDF documents with custom chunking parameters
- - **Query**: Query documents with optional conversation memory and metadata filtering
- - **Chat History**: Retrieve or clear conversation history for specific sessions
- - **Status**: Check system status and document counts
- - **Reset**: Clear all documents and chat histories
-
- See `API_USAGE.md` for detailed API documentation and examples.
-
- ### Python Integration
-
- Import functions from `src/rag_pipeline.py` to:
- - Process PDFs from directories
- - Chunk documents with custom parameters
- - Generate embeddings
- - Store in vector database
- - Retrieve relevant documents
- - Generate answers with conversation context
-
- ## ⚙️ Configuration
-
- ### Environment Variables
-
- Set `GROQ_API_KEY` in your `.env` file to use Groq LLM models.
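Loading that key can be sketched with a minimal stand-in for python-dotenv's `load_dotenv` (the project itself may load the `.env` file differently):

```python
import os

def load_groq_key(env_path: str = ".env") -> str:
    """Read GROQ_API_KEY from the environment, falling back to a .env file
    (minimal stand-in for python-dotenv; simple KEY=VALUE lines only)."""
    key = os.environ.get("GROQ_API_KEY")
    if key:
        return key
    try:
        with open(env_path) as fh:
            for line in fh:
                name, _, value = line.strip().partition("=")
                if name == "GROQ_API_KEY":
                    return value.strip().strip('"')
    except FileNotFoundError:
        pass
    raise RuntimeError("GROQ_API_KEY not set; add it to .env or the environment")
```

Preferring the process environment over the file makes the same code work locally (with `.env`) and in hosted settings where the key is injected as a secret.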
-
- ### Customization Options
-
- - **Embedding Model**: Change the SentenceTransformer model (default: `all-MiniLM-L6-v2`)
- - **Vector Store**: Customize collection name and persistence directory
- - **LLM Model**: Choose different Groq models or extend to other providers
- - **Chunking**: Adjust chunk size and overlap based on your document types
- - **Retrieval**: Configure top_k results and similarity score thresholds
-
- ## 🔧 Troubleshooting
-
- **Module not found errors**: Ensure all dependencies are installed with `pip install -r requirements.txt`

- **API key errors**: Verify your `.env` file contains the correct `GROQ_API_KEY`

- **No documents retrieved**: Check that documents were successfully processed, verify the query matches document content, and try rephrasing

- **Metadata filtering issues**: Ensure metadata fields exist in your documents and restart the server after code changes

- **Negative similarity scores**: This is normal for some queries - the system will still return results even with low similarity

- ## 📖 Additional Resources

- - **API Documentation**: See `API_USAGE.md` for complete REST API usage guide
- - **Interactive API Docs**: Visit http://localhost:8000/docs when the API server is running
- - **Example Client**: Run `python example_client.py` to see a complete usage example

- ## 🛠️ Technology Stack

- - **LangChain**: Document processing and text splitting
- - **Sentence Transformers**: Embedding generation
- - **ChromaDB**: Vector database for semantic search
- - **Groq**: Fast LLM inference
- - **FastAPI**: REST API framework
- - **Streamlit**: Web UI framework

  ---

- **Built for efficient document querying with AI-powered retrieval and generation.**
 
+ ---
+ title: PDF Chatbot – RAG Pipeline
+ emoji: 📄
+ colorFrom: indigo
+ colorTo: purple
+ sdk: streamlit
+ sdk_version: "1.35.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # 📄 PDF Chatbot RAG Pipeline
+
+ This Space hosts an end-to-end **Retrieval-Augmented Generation (RAG)** pipeline that allows users to upload PDFs and ask questions about their content.
+
+ The system extracts text, chunks it intelligently, embeds it into a vector database, and retrieves relevant context to answer queries using a large language model (LLM).
+
+ ---
+
+ ## 🚀 Features
+
+ - 🔹 PDF upload support
+ - 🔹 Automatic text extraction
+ - 🔹 Smart document chunking
+ - 🔹 Vector storage using ChromaDB
+ - 🔹 LLM-powered question answering
+ - 🔹 Streamlit-based interface
+ - 🔹 Clean RAG pipeline implementation (`src/rag_pipeline.py`)
+
  ---
+
+ ## 🏗️ Project Structure
+