minhvtt committed on
Commit
55249bb
·
verified ·
1 Parent(s): 2216e03

Delete SUMMARY.md

Files changed (1)
  1. SUMMARY.md +0 -429
SUMMARY.md DELETED
@@ -1,429 +0,0 @@
1
- # ChatbotRAG - Complete Summary
2
-
3
- ## System Overview
4
-
5
- The ChatbotRAG system has been comprehensively upgraded with the following advanced features:
6
-
7
- ### ✨ Key Features
8
-
9
- 1. **Multiple Inputs Support** (/index)
10
- - Index up to 10 texts + 10 images at once
11
- - Embeddings are averaged automatically
12
-
13
- 2. **Advanced RAG Pipeline** (/chat)
14
- - Query Expansion
15
- - Multi-Query Retrieval
16
- - Reranking with semantic similarity
17
- - Contextual Compression
18
- - Better Prompt Engineering
19
-
20
- 3. **PDF Support** (/upload-pdf)
21
- - Parse PDFs into chunks
22
- - Automatic chunking with overlap
23
- - Index into the RAG system
24
-
25
- 4. **Multimodal PDF** (/upload-pdf-multimodal) ⭐ NEW
26
- - Extract text + image URLs from the PDF
27
- - Link images to text chunks
28
- - Return images alongside text in chat
29
- - Ideal for user guides with screenshots
30
-
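The "embeddings are averaged automatically" behaviour of `/index` can be pictured roughly as follows. This is a minimal sketch, not the code from `embedding_service.py`: the function name `average_embeddings` is hypothetical, and the L2 normalization step is an assumption (the summary only says the embeddings are averaged).

```python
import math
from typing import List, Sequence


def average_embeddings(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors, then L2-normalized.

    Hypothetical helper; normalization is an assumption, not confirmed
    by the summary.
    """
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    # Mean of each coordinate across all input vectors.
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean


# Toy 2-dimensional vectors standing in for one text and one image embedding.
text_vec = [1.0, 0.0]
image_vec = [0.0, 1.0]
combined = average_embeddings([text_vec, image_vec])
print(combined)
```

Normalizing after averaging keeps the combined vector on the unit sphere, which matters if the vector store compares vectors by cosine or dot product.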
31
- ---
32
-
33
- ## System Architecture
34
-
35
- ```
36
- ┌─────────────────────────────────────────────────────────────┐
37
- │ FastAPI Application │
38
- ├─────────────────────────────────────────────────────────────┤
39
- │ │
40
- │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
41
- │ │ Indexing │ │ Search │ │ Chat │ │
42
- │ │ Endpoints │ │ Endpoints │ │ Endpoint │ │
43
- │ └──────────────┘ └──────────────┘ └──────────────┘ │
44
- │ │
45
- ├─────────────────────────────────────────────────────────────┤
46
- │ │
47
- │ ┌──────────────────────────────────────────────────────┐ │
48
- │ │ Advanced RAG Pipeline │ │
49
- │ │ • Query Expansion │ │
50
- │ │ • Multi-Query Retrieval │ │
51
- │ │ • Reranking │ │
52
- │ │ • Contextual Compression │ │
53
- │ └──────────────────────────────────────────────────────┘ │
54
- │ │
55
- ├─────────────────────────────────────────────────────────────┤
56
- │ │
57
- │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
58
- │ │ Jina CLIP │ │ Qdrant │ │ MongoDB │ │
59
- │ │ v2 │ │ Vector DB │ │ Documents │ │
60
- │ └──────────────┘ └──────────────┘ └──────────────┘ │
61
- │ │
62
- │ ┌──────────────┐ ┌──────────────┐ │
63
- │ │ PDF │ │ Multimodal │ │
64
- │ │ Parser │ │ PDF Parser │ │
65
- │ └──────────────┘ └──────────────┘ │
66
- │ │
67
- └─────────────────────────────────────────────────────────────┘
68
- ```
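The Multi-Query Retrieval box in the pipeline can be sketched as below, assuming it runs the search once per expanded query and keeps the best score per document. The names `multi_query_retrieve` and `fake_search` are hypothetical, and the in-memory search function stands in for the real Qdrant call; the actual logic in `advanced_rag.py` may differ.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hit = Tuple[str, float]  # (document id, similarity score)


def multi_query_retrieve(
    queries: Sequence[str],
    search: Callable[[str], List[Hit]],
    top_k: int = 5,
) -> List[Hit]:
    """Run every query variant, keep each document's best score, return top_k."""
    best: Dict[str, float] = {}
    for query in queries:
        for doc_id, score in search(query):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def fake_search(query: str) -> List[Hit]:
    """Stand-in for a vector search; the scores are made up for illustration."""
    index = {
        "reset password": [("doc1", 0.82), ("doc2", 0.40)],
        "change password": [("doc2", 0.75), ("doc3", 0.55)],
    }
    return index.get(query, [])


hits = multi_query_retrieve(["reset password", "change password"], fake_search, top_k=2)
print(hits)
```

Keeping the maximum score per document is one simple merge rule; summing scores or using rank-based fusion are alternatives with different trade-offs.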
69
-
70
- ---
71
-
72
- ## Key Files
73
-
74
- ### Core System
75
- - **main.py** - FastAPI application with all endpoints
76
- - **embedding_service.py** - Jina CLIP v2 embedding
77
- - **qdrant_service.py** - Qdrant vector DB operations
78
- - **advanced_rag.py** - Advanced RAG pipeline
79
-
80
- ### PDF Processing
81
- - **pdf_parser.py** - Basic PDF parser (text only)
82
- - **multimodal_pdf_parser.py** - Multimodal PDF parser (text + images)
83
- - **batch_index_pdfs.py** - Batch indexing script
84
-
85
- ### Documentation
86
- - **ADVANCED_RAG_GUIDE.md** - Advanced RAG features guide
87
- - **PDF_RAG_GUIDE.md** - PDF usage guide
88
- - **MULTIMODAL_PDF_GUIDE.md** - Multimodal PDF guide ⭐
89
- - **QUICK_START_PDF.md** - Quick start for PDF
90
- - **chatbot_guide_template.md** - Template for user guide PDF
91
-
92
- ### Testing
93
- - **test_advanced_features.py** - Test advanced features
94
- - **test_pdf_chatbot.py** - Test PDF chatbot (example in docs)
95
-
96
- ---
97
-
98
- ## API Endpoints
99
-
100
- ### 1. Indexing
101
-
102
- | Endpoint | Method | Description |
103
- |----------|--------|-------------|
104
- | `/index` | POST | Index texts + images (max 10 each) |
105
- | `/documents` | POST | Add text document |
106
- | `/upload-pdf` | POST | Upload PDF (text only) |
107
- | `/upload-pdf-multimodal` | POST | Upload PDF with images ⭐ |
108
-
109
- ### 2. Search
110
-
111
- | Endpoint | Method | Description |
112
- |----------|--------|-------------|
113
- | `/search` | POST | Hybrid search (text + image) |
114
- | `/search/text` | POST | Text-only search |
115
- | `/search/image` | POST | Image-only search |
116
- | `/rag/search` | POST | RAG knowledge base search |
117
-
118
- ### 3. Chat
119
-
120
- | Endpoint | Method | Description |
121
- |----------|--------|-------------|
122
- | `/chat` | POST | Chat with Advanced RAG |
123
-
124
- ### 4. Management
125
-
126
- | Endpoint | Method | Description |
127
- |----------|--------|-------------|
128
- | `/documents/pdf` | GET | List all PDFs |
129
- | `/documents/pdf/{id}` | DELETE | Delete PDF document |
130
- | `/delete/{doc_id}` | DELETE | Delete document |
131
- | `/document/{doc_id}` | GET | Get document by ID |
132
- | `/history` | GET | Get chat history |
133
- | `/stats` | GET | Collection statistics |
134
- | `/` | GET | Health check + API docs |
135
-
136
- ---
137
-
138
- ## Use Cases & Recommendations
139
-
140
- ### Case 1: Text-Only Guide PDF
141
-
142
- **Scenario:** FAQ, policy document, text guide
143
-
144
- **Solution:** `/upload-pdf`
145
-
146
- ```bash
147
- curl -X POST "http://localhost:8000/upload-pdf" \
148
- -F "file=@faq.pdf" \
149
- -F "title=FAQ"
150
- ```
151
-
152
- ### Case 2: Guide PDF with Images ⭐ (Your Case)
153
-
154
- **Scenario:** User guide with screenshots, tutorial with diagrams
155
-
156
- **Solution:** `/upload-pdf-multimodal`
157
-
158
- ```bash
159
- curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
160
- -F "file=@user_guide_with_images.pdf" \
161
- -F "title=User Guide" \
162
- -F "category=guide"
163
- ```
164
-
165
- **Benefits:**
166
- - ✓ Extract text + image URLs
167
- - ✓ Link images to text chunks
168
- - ✓ Chatbot returns images in the response
169
- - ✓ Visual context for users
170
-
171
- ### Case 3: Multiple Social Media Posts
172
-
173
- **Scenario:** Index multiple posts with texts and images
174
-
175
- **Solution:** `/index` with multiple inputs
176
-
177
- ```python
178
- import requests
- data = {
179
-     'id': 'post123',
180
-     'texts': ['Post text 1', 'Post text 2', ...],  # Max 10; `...` stands for further texts
181
- }
182
- files = [
183
-     ('images', open('img1.jpg', 'rb')),
184
-     ('images', open('img2.jpg', 'rb')),  # Max 10
185
- ]
186
- requests.post('http://localhost:8000/index', data=data, files=files)
187
- ```
188
-
189
- ### Case 4: Complex Queries
190
-
191
- **Scenario:** Complex questions that need high accuracy
192
-
193
- **Solution:** Advanced RAG with full options
194
-
195
- ```python
196
- {
197
- 'message': 'Complex question',
198
- 'use_rag': True,
199
- 'use_advanced_rag': True,
200
- 'use_reranking': True,
201
- 'use_compression': True,
202
- 'score_threshold': 0.5,
203
- 'top_k': 5
204
- }
205
- ```
206
-
207
- ---
208
-
209
- ## Recommended Workflow
210
-
211
- ### Initial Setup
212
-
213
- 1. **Create the user guide PDF**
214
- - Start from the template: `chatbot_guide_template.md`
215
- - Customize the content for your system
216
- - Add image URLs (screenshots, diagrams)
217
- - Convert to PDF: `pandoc template.md -o guide.pdf`
218
-
219
- 2. **Upload PDF**
220
- ```bash
221
- curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
222
- -F "file=@chatbot_user_guide.pdf" \
223
- -F "title=Hướng dẫn sử dụng ChatbotRAG" \
224
- -F "category=user_guide"
225
- ```
226
-
227
- 3. **Verify**
228
- ```bash
229
- curl http://localhost:8000/documents/pdf
230
- # Check "type": "multimodal_pdf" and "total_images"
231
- ```
232
-
233
- ### Daily Use
234
-
235
- 1. **Chat with the user**
236
- ```python
237
- response = requests.post('http://localhost:8000/chat', json={
238
- 'message': user_question,
239
- 'use_rag': True,
240
- 'use_advanced_rag': True,
241
- 'hf_token': 'your_token'
242
- })
243
- ```
244
-
245
- 2. **Display response + images**
246
- ```python
247
- # Text answer
248
- print(response.json()['response'])
249
-
250
- # Images (if any)
251
- for ctx in response.json()['context_used']:
252
-     if ctx['metadata'].get('has_images'):
253
-         for url in ctx['metadata']['image_urls']:
254
-             # Display image in your UI
255
-             print(f"Image: {url}")
256
- ```
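The display loop above can be wrapped in a small helper that is easier to test without a running server. This is a sketch only: `collect_image_urls` is a hypothetical name, the response shape (`context_used`, `metadata`, `has_images`, `image_urls`) is taken from the snippet above, and the sample payload is invented for illustration.

```python
from typing import Any, Dict, List


def collect_image_urls(payload: Dict[str, Any]) -> List[str]:
    """Return the image URLs attached to the retrieved contexts, without duplicates."""
    urls: List[str] = []
    for ctx in payload.get("context_used", []):
        metadata = ctx.get("metadata", {})
        if not metadata.get("has_images"):
            continue
        for url in metadata.get("image_urls", []):
            if url not in urls:  # keep first occurrence, preserve order
                urls.append(url)
    return urls


# Invented payload shaped like the /chat response used above.
sample = {
    "response": "Open Settings, then click Reset.",
    "context_used": [
        {"metadata": {"has_images": True, "image_urls": ["https://example.com/a.png"]}},
        {"metadata": {"has_images": False}},
        {"metadata": {"has_images": True, "image_urls": ["https://example.com/a.png",
                                                          "https://example.com/b.png"]}},
    ],
}
image_urls = collect_image_urls(sample)
print(image_urls)
```

Using `.get()` with defaults means a context without `metadata` or `image_urls` is skipped instead of raising `KeyError`.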
257
-
258
- ### Updating Content
259
-
260
- 1. **Update the PDF** - Edit and re-export
261
- 2. **Delete the old PDF**
262
- ```bash
263
- curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id
264
- ```
265
- 3. **Upload the new PDF**
266
- ```bash
267
- curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"
268
- ```
269
-
270
- ---
271
-
272
- ## Performance Tips
273
-
274
- ### 1. Chunking
275
-
276
- **Default:**
277
- - chunk_size: 500 words
278
- - chunk_overlap: 50 words
279
-
280
- **Tuned:**
281
- ```python
282
- # In multimodal_pdf_parser.py
283
- parser = MultimodalPDFParser(
284
- chunk_size=400, # Shorter for faster retrieval
285
- chunk_overlap=40,
286
- min_chunk_size=50
287
- )
288
- ```
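To make the effect of `chunk_size`, `chunk_overlap`, and `min_chunk_size` concrete, here is a minimal word-based chunker. It is a sketch under assumptions: `chunk_words` is a hypothetical function, and merging a too-short trailing fragment into the previous chunk is one possible reading of `min_chunk_size`; the real `MultimodalPDFParser` may handle it differently.

```python
from typing import List


def chunk_words(
    text: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    min_chunk_size: int = 50,
) -> List[str]:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if chunks and len(window) < min_chunk_size:
            # Trailing fragment is too short: append its new words
            # (those after the overlap) to the previous chunk.
            tail = words[start + chunk_overlap:]
            chunks[-1] = chunks[-1] + " " + " ".join(tail)
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_words(sample_text, chunk_size=4, chunk_overlap=1, min_chunk_size=2)
print(small_chunks)
```

With the defaults from the summary (500 words, 50 overlap), each new chunk starts 450 words after the previous one, so smaller chunks mean more vectors to store but tighter matches per chunk.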
289
-
290
- ### 2. Retrieval
291
-
292
- **Good settings:**
293
- ```python
294
- {
295
- 'top_k': 5, # 3-7 is optimal
296
- 'score_threshold': 0.5, # 0.4-0.6 is good
297
- 'use_reranking': True, # Always enable
298
- 'use_compression': True # Keeps context relevant
299
- }
300
- ```
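How `score_threshold`, `top_k`, and reranking interact can be sketched as follows. This assumes reranking means re-scoring candidates by cosine similarity against the query embedding, which is how the summary describes it ("reranking with semantic similarity"); the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(
    query_vec: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
    score_threshold: float = 0.5,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score candidates, drop those below the threshold, return the best top_k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in candidates]
    kept = [(doc_id, score) for doc_id, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy 2-dimensional vectors for illustration only.
query = [1.0, 0.0]
candidates = [("a", [0.0, 1.0]), ("b", [1.0, 1.0]), ("c", [2.0, 0.0])]
ranked = rerank(query, candidates)
print(ranked)
```

Lowering `score_threshold` lets more borderline chunks through, while `top_k` caps how many reach the LLM prompt, so the two settings trade recall against prompt length.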
301
-
302
- ### 3. LLM
303
-
304
- **For factual answers:**
305
- ```python
306
- {
307
- 'temperature': 0.3, # Low for accuracy
308
- 'max_tokens': 512, # Concise answers
309
- 'top_p': 0.9
310
- }
311
- ```
312
-
313
- ---
314
-
315
- ## Troubleshooting
316
-
317
- ### Issue 1: Images are not detected
318
-
319
- **Solution:**
320
- - Verify that the PDF contains image URLs (http://, https://)
321
- - Check the format: markdown `![](url)` or HTML `<img src>`
322
- - Test regex:
323
- ```python
324
- from multimodal_pdf_parser import MultimodalPDFParser
325
- parser = MultimodalPDFParser()
326
- urls = parser.extract_image_urls("![](https://example.com/img.png)")
327
- print(urls) # Should return ['https://example.com/img.png']
328
- ```
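If the parser is not importable, the two supported formats can be checked with a standalone regex test. The patterns below are an assumption about what such extraction might look like, not the actual patterns in `multimodal_pdf_parser.py`; this `extract_image_urls` is a free function written for illustration.

```python
import re
from typing import List

# Markdown image: ![alt](https://...), optionally followed by a title.
MARKDOWN_IMG = re.compile(r'!\[[^\]]*\]\((https?://[^\s)]+)')
# HTML image: <img ... src="https://...">, single or double quotes.
HTML_IMG = re.compile(r'<img[^>]*\bsrc=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def extract_image_urls(text: str) -> List[str]:
    """Return http(s) image URLs: markdown matches first, then HTML, no duplicates."""
    found = MARKDOWN_IMG.findall(text) + HTML_IMG.findall(text)
    unique: List[str] = []
    for url in found:
        if url not in unique:
            unique.append(url)
    return unique


page_text = (
    'Step 1 ![](https://example.com/img.png) '
    'Step 2 <img alt="menu" src="https://example.com/menu.jpg">'
)
urls = extract_image_urls(page_text)
print(urls)
```

If this returns URLs for text copied out of the PDF but the upload still reports no images, the URLs are probably being lost during PDF text extraction (for example, split across lines) rather than in the matching step.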
329
-
330
- ### Issue 2: Chatbot cannot find the information
331
-
332
- **Solution:**
333
- - Lower score_threshold: `0.3-0.5`
334
- - Increase top_k: `5-10`
335
- - Enable Advanced RAG
336
- - Rephrase question
337
-
338
- ### Issue 3: Responses are too slow
339
-
340
- **Solution:**
341
- - Reduce top_k
342
- - Disable compression if it is not needed
343
- - Use basic RAG instead of advanced RAG for simple queries
344
-
345
- ---
346
-
347
- ## Next Steps
348
-
349
- ### Immediate (Now)
350
-
351
- 1. ✓ The system is ready!
352
- 2. Create your user guide PDF
353
- 3. Upload it via `/upload-pdf-multimodal`
354
- 4. Test with real questions
355
-
356
- ### Short Term (1-2 weeks)
357
-
358
- 1. Collect user feedback
359
- 2. Fine-tune parameters (top_k, threshold)
360
- 3. Add more PDFs (FAQ, tutorials, etc.)
361
- 4. Monitor chat history to improve content
362
-
363
- ### Long Term (Later)
364
-
365
- 1. **Hybrid Search with BM25**
366
- - Combine dense + sparse retrieval
367
- - Better for keyword queries
368
-
369
- 2. **Cross-Encoder Reranking**
370
- - Replace embedding similarity
371
- - More accurate ranking
372
-
373
- 3. **Image Processing**
374
- - Download and process the actual images
375
- - Use Jina CLIP for image embeddings
376
- - True multimodal embeddings (text + image vectors)
377
-
378
- 4. **RAG-Anything Integration** (if needed)
379
- - For complex PDFs with tables, charts
380
- - Vision encoder for embedded images
381
- - Advanced document understanding
382
-
383
- ---
384
-
385
- ## Comparison Matrix
386
-
387
- | Approach | Text | Images | URLs | Complexity | Your Case |
388
- |----------|------|--------|------|------------|-----------|
389
- | Basic RAG | ✓ | ✗ | ✗ | Low | ✗ |
390
- | PDF Parser | ✓ | ✗ | ✗ | Low | ✗ |
391
- | **Multimodal PDF** | ✓ | ✗ | ✓ | **Medium** | **✓** |
392
- | RAG-Anything | ✓ | ✓ | ✓ | High | Overkill |
393
-
394
- **Recommendation:** **Multimodal PDF** is the best fit for your case.
395
-
396
- ---
397
-
398
- ## Conclusion
399
-
400
- ### What You Have
401
-
402
- ✅ **Multiple Inputs**: Index 10 texts + 10 images
403
- ✅ **Advanced RAG**: Query expansion, reranking, compression
404
- ✅ **PDF Support**: Parse and index PDFs
405
- ✅ **Multimodal PDF**: Extract text + image URLs, link together
406
- ✅ **Complete Documentation**: Guides, examples, troubleshooting
407
-
408
- ### What to Do Next
409
-
410
- 1. **Create** a guide PDF with your own content (including image URLs)
411
- 2. **Upload** via `/upload-pdf-multimodal`
412
- 3. **Test** with real questions
413
- 4. **Iterate** - fine-tune based on feedback
414
-
415
- ### Files to Read
416
-
417
- **For PDFs with images (your case):**
418
- - [MULTIMODAL_PDF_GUIDE.md](MULTIMODAL_PDF_GUIDE.md) ⭐⭐⭐
419
- - [PDF_RAG_GUIDE.md](PDF_RAG_GUIDE.md)
420
-
421
- **For Advanced RAG:**
422
- - [ADVANCED_RAG_GUIDE.md](ADVANCED_RAG_GUIDE.md)
423
-
424
- **Quick Start:**
425
- - [QUICK_START_PDF.md](QUICK_START_PDF.md)
426
-
427
- ---
428
-
429
- **Your system is now very capable. Just upload a PDF and start chatting! 🚀📄🤖**