---
sdk: gradio
sdk_version: 3.50.2
---

# RAG_Mini

---

# Enterprise-Ready RAG System with Gradio Interface

This is a powerful, enterprise-grade Retrieval-Augmented Generation (RAG) system designed to transform your documents into an interactive and intelligent knowledge base. Users can upload their own documents (PDFs, TXT files), build a searchable vector index, and ask complex questions in natural language to receive accurate, context-aware answers sourced directly from the provided materials.

The entire application is wrapped in a clean, user-friendly web interface powered by Gradio.

![App Screenshot](assets/1.png)
![App Screenshot](assets/2.png)

## ✨ Features

- **Intuitive Web UI**: Simple, clean interface built with Gradio for uploading documents and chatting.
- **Multi-Document Support**: Natively handles PDF and TXT files.
- **Advanced Text Splitting**: Uses a `HierarchicalSemanticSplitter` that first splits documents into large parent chunks (for context) and then into smaller child chunks (for precise search), respecting semantic boundaries.
- **Hybrid Search**: Combines the strengths of dense vector search (FAISS) and sparse keyword search (BM25) for robust and accurate retrieval.
- **Reranking for Accuracy**: Employs a Cross-Encoder model to rerank the retrieved documents, ensuring the most relevant context is passed to the language model.
- **Persistent Knowledge Base**: Automatically saves the built vector index and metadata, allowing you to load an existing knowledge base instantly on startup.
- **Modular & Extensible Codebase**: The project is logically structured into services for loading, splitting, embedding, and generation, making it easy to maintain and extend.

## 🏛️ System Architecture

The RAG pipeline follows a logical, multi-step process to ensure high-quality answers:

1. **Load**: Documents are loaded from various formats and parsed into a standardized `Document` object, preserving metadata like source and page number.
2. **Split**: The raw text is processed by the `HierarchicalSemanticSplitter`, creating parent and child text chunks. This provides both broad context and fine-grained detail.
3. **Embed & Index**: The child chunks are converted into vector embeddings using a `SentenceTransformer` model and indexed in a FAISS vector store. A parallel BM25 index is also built for keyword search.
4. **Retrieve**: When a user asks a question, a hybrid search query is performed against the FAISS and BM25 indices to retrieve the most relevant child chunks.
5. **Fetch Context**: The parent chunks corresponding to the retrieved child chunks are fetched. This ensures the LLM receives a wider, more complete context.
6. **Rerank**: A powerful Cross-Encoder model re-evaluates the relevance of the parent chunks against the query, pushing the best matches to the top.
7. **Generate**: The top-ranked, reranked documents are combined with the user's query into a final prompt. This prompt is sent to a Large Language Model (LLM) to generate a final, coherent answer.

```
[User Uploads Docs] -> [Loader] -> [Splitter] -> [Embedder & Vector Store] -> [Knowledge Base Saved]

[User Asks Question] -> [Hybrid Search] -> [Get Parent Docs] -> [Reranker] -> [LLM] -> [Answer & Sources]
```
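
To make steps 4-7 concrete, here is a minimal sketch of the query-time flow in Python. Every name in it (`kb`, `embedder`, `bm25`, `parent_of`, `reranker`, `llm`) is a placeholder standing in for this project's actual components, and the scoring details are illustrative rather than the repository's exact implementation:

```python
# Illustrative sketch of retrieve -> fetch parents -> rerank -> generate.
# Every identifier below is a placeholder, not this repository's real API.
import numpy as np

def hybrid_scores(query, kb, alpha=0.7):
    """Blend dense and sparse relevance: alpha * vector + (1 - alpha) * BM25."""
    q = kb.embedder.encode([query])[0]                      # SentenceTransformer
    dense = kb.child_vecs @ q                               # cosine, if vectors are normalized
    sparse = np.asarray(kb.bm25.get_scores(query.split()))  # rank_bm25
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    return alpha * norm(dense) + (1 - alpha) * norm(sparse)

def answer(query, kb, reranker, llm, retrieval_top_k=20, rerank_top_k=5):
    top_children = np.argsort(hybrid_scores(query, kb))[::-1][:retrieval_top_k]
    parents = list({kb.parent_of[i] for i in top_children})  # widen to parent chunks
    # A Cross-Encoder scores each (query, parent) pair jointly, unlike bi-encoders.
    scores = reranker.predict([(query, p) for p in parents])
    ranked = [p for _, p in sorted(zip(scores, parents), reverse=True)]
    context = "\n\n".join(ranked[:rerank_top_k])
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```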

## 🛠️ Tech Stack

- **Backend**: Python 3.9+
- **UI**: Gradio
- **LLM & Embedding Framework**: Hugging Face Transformers, Sentence-Transformers
- **Vector Search**: FAISS (from Facebook AI Research)
- **Keyword Search**: rank-bm25
- **PDF Parsing**: PyMuPDF (fitz)
- **Configuration**: PyYAML

## 🚀 Getting Started

Follow these steps to set up and run the project on your local machine.

### 1. Prerequisites

- Python 3.9 or higher
- `pip` for package management

### 2. Create a `requirements.txt` file

The repository should include a `requirements.txt` file so others can install the dependencies in one step. If you are packaging the project yourself, generate one from a working environment:

```bash
pip freeze > requirements.txt
```

This saves every package in your environment to the file; make sure it is committed to your GitHub repository. The key packages it should contain are: `gradio`, `torch`, `transformers`, `sentence-transformers`, `faiss-cpu`, `rank_bm25`, `PyMuPDF`, `pyyaml`, `numpy`.
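
If you would rather hand-curate the file than commit a full `pip freeze` dump (which pins every transitive dependency), a minimal unpinned version covering the packages listed above might look like this; treat it as a sketch and pin versions once you have a known-good environment:

```
gradio
torch
transformers
sentence-transformers
faiss-cpu
rank_bm25
PyMuPDF
pyyaml
numpy
```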

### 3. Installation & Setup

**1. Clone the repository:**
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPOSITORY_NAME.git
cd YOUR_REPOSITORY_NAME
```

**2. Create and activate a virtual environment (recommended):**
```bash
# For Windows
python -m venv venv
.\venv\Scripts\activate

# For macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

**3. Install the required packages:**
```bash
pip install -r requirements.txt
```

**4. Configure the system:**
Review the `configs/config.yaml` file. You can change the models, chunk sizes, and other parameters here. The default settings are a good starting point.

> **Note:** The first time you run the application, the models specified in the config file will be downloaded from Hugging Face. This may take some time depending on your internet connection.
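
If you want to warm that cache before launching, for example on a machine with a faster connection, a short snippet like the one below will download and cache the models. The model names are placeholders, not this project's defaults; substitute whatever your `configs/config.yaml` specifies:

```python
# Pre-download models so the first app start is quick.
# The names below are illustrative placeholders -- use the ones
# from configs/config.yaml in this repository.
from sentence_transformers import CrossEncoder, SentenceTransformer

SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # embedding model
CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")           # reranker
```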

### 4. Running the Application

To start the Gradio web server, run the `main.py` script:

```bash
python main.py
```

The application will be available at **`http://localhost:7860`**.

## 📖 How to Use

The application has two primary workflows:

**1. Build a New Knowledge Base:**
- Drag and drop one or more `.pdf` or `.txt` files into the "Upload New Docs to Build" area.
- Click the **"Build New KB"** button.
- The system status will show the progress (Loading -> Splitting -> Indexing).
- Once complete, the status will confirm that the knowledge base is ready, and the chat window will appear.

**2. Load an Existing Knowledge Base:**
- If you have previously built a knowledge base, simply click the **"Load Existing KB"** button.
- The system will load the saved FAISS index and metadata from the `storage` directory.
- The chat window will appear, and you can start asking questions immediately.

**Chatting with Your Documents:**
- Once the knowledge base is ready, type your question into the chat box at the bottom and press Enter or click "Submit".
- The model will generate an answer based on the documents you provided.
- The sources used to generate the answer will be displayed below the chat window.

## 📂 Project Structure

```
.
├── configs/
│   └── config.yaml        # Main configuration file for models, paths, etc.
├── core/
│   ├── embedder.py        # Handles text embedding.
│   ├── llm_interface.py   # Handles reranking and answer generation.
│   ├── loader.py          # Loads and parses documents.
│   ├── schema.py          # Defines data structures (Document, Chunk).
│   ├── splitter.py        # Splits documents into chunks.
│   └── vector_store.py    # Manages FAISS & BM25 indices.
├── service/
│   └── rag_service.py     # Orchestrates the entire RAG pipeline.
├── storage/               # Default location for saved indices (auto-generated).
│   └── ...
├── ui/
│   └── app.py             # Contains the Gradio UI logic.
├── utils/
│   └── logger.py          # Logging configuration.
├── assets/
│   └── 1.png              # Screenshot of the application.
├── main.py                # Entry point to run the application.
└── requirements.txt       # Python package dependencies.
```
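
For orientation, `core/schema.py` is described as defining the `Document` and `Chunk` data structures. A minimal sketch of what such a schema could look like is below; the actual fields in the repository may differ:

```python
# Hypothetical sketch of core/schema.py -- field names are assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Document:
    """A loaded source document plus provenance metadata."""
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. {"source": "report.pdf", "page": 3}

@dataclass
class Chunk:
    """A piece of a Document produced by the splitter."""
    text: str
    doc_id: str                      # which Document this chunk came from
    parent_id: Optional[str] = None  # child chunks point back to their parent chunk
    metadata: dict = field(default_factory=dict)
```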

## 🔧 Configuration Details (`config.yaml`)

You can customize the RAG pipeline by modifying `configs/config.yaml`:

- **`models`**: Specify the Hugging Face models for embedding, reranking, and generation.
- **`vector_store`**: Define the paths where the FAISS index and metadata will be saved.
- **`splitter`**: Control the `HierarchicalSemanticSplitter` behavior.
  - `parent_chunk_size`: The target size for larger context chunks.
  - `parent_chunk_overlap`: The overlap between parent chunks.
  - `child_chunk_size`: The target size for smaller, searchable chunks.
- **`retrieval`**: Tune the retrieval and reranking process.
  - `retrieval_top_k`: How many initial candidates to retrieve with hybrid search.
  - `rerank_top_k`: How many final documents to pass to the LLM after reranking.
  - `hybrid_search_alpha`: The weighting between vector search (`alpha`) and BM25 search (`1 - alpha`). `1.0` is pure vector search, `0.0` is pure keyword search.
- **`generation`**: Set parameters for the final answer generation, like `max_new_tokens`.
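
Put together, a `config.yaml` following the layout above might look like the sketch below. The keys mirror the list; the model names and values are illustrative assumptions, not this repository's defaults:

```yaml
# Hypothetical example -- check configs/config.yaml for the real keys and defaults.
models:
  embedding: sentence-transformers/all-MiniLM-L6-v2   # placeholder
  reranker: cross-encoder/ms-marco-MiniLM-L-6-v2      # placeholder
  generator: Qwen/Qwen2-1.5B-Instruct                 # placeholder

vector_store:
  index_path: storage/faiss.index
  metadata_path: storage/metadata.pkl

splitter:
  parent_chunk_size: 1000
  parent_chunk_overlap: 100
  child_chunk_size: 250

retrieval:
  retrieval_top_k: 20
  rerank_top_k: 5
  hybrid_search_alpha: 0.7   # 1.0 = pure vector, 0.0 = pure BM25

generation:
  max_new_tokens: 512
```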

## 🛣️ Future Roadmap

- [ ] Support for more document types (e.g., `.docx`, `.pptx`, `.html`).
- [ ] Implement response streaming for a more interactive chat experience.
- [ ] Integrate with other vector databases like ChromaDB or Pinecone.
- [ ] Create API endpoints for programmatic access to the RAG service.
- [ ] Add more advanced logging and monitoring for enterprise use.

## 🤝 Contributing

Contributions are welcome! If you have ideas for improvements or find a bug, please feel free to open an issue or submit a pull request.

## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for details.