Spaces:

minhvtt
/

ChatbotRAG


minhvtt committed on Oct 28, 2025

Commit

500cf95

verified ·

1 Parent(s): 5aa7215

Upload 20 files


Files changed (13)

ADVANCED_RAG_GUIDE.md +256 -0
MULTIMODAL_PDF_GUIDE.md +525 -0
PDF_RAG_GUIDE.md +390 -0
QUICK_START_PDF.md +310 -0
SUMMARY.md +429 -0
advanced_rag.py +301 -0
batch_index_pdfs.py +151 -0
chatbot_guide_template.md +369 -0
multimodal_pdf_parser.py +390 -0
pdf_parser.py +371 -0
requirements.txt +3 -0
test_advanced_features.py +260 -0
verify_dependencies.py +102 -0

ADVANCED_RAG_GUIDE.md ADDED

	@@ -0,0 +1,256 @@

+# Advanced RAG Chatbot - User Guide
+## What's New?
+### 1. Multiple Images & Texts Support in `/index` API
+The `/index` endpoint now supports indexing multiple texts and images in a single request (max 10 each).
+**Before:**
+```python
+# Old: Only 1 text and 1 image
+data = {
+    'id': 'doc1',
+    'text': 'Single text',
+}
+files = {'image': open('image.jpg', 'rb')}
+```
+**After:**
+```python
+import requests
+# New: Multiple texts and images (max 10 each)
+data = {
+    'id': 'doc1',
+    'texts': ['Text 1', 'Text 2', 'Text 3'],  # Up to 10
+}
+files = [
+    ('images', open('image1.jpg', 'rb')),
+    ('images', open('image2.jpg', 'rb')),
+    ('images', open('image3.jpg', 'rb')),  # Up to 10
+]
+response = requests.post('http://localhost:8000/index', data=data, files=files)
+```
+**Example with cURL:**
+```bash
+curl -X POST "http://localhost:8000/index" \
+  -F "id=event123" \
+  -F "texts=Sự kiện âm nhạc tại Hà Nội" \
+  -F "texts=Diễn ra vào ngày 20/10/2025" \
+  -F "texts=Địa điểm: Trung tâm Hội nghị Quốc gia" \
+  -F "images=@poster1.jpg" \
+  -F "images=@poster2.jpg" \
+  -F "images=@poster3.jpg"
+```
+### 2. Advanced RAG Pipeline in `/chat` API
+The chat endpoint now uses modern RAG techniques for better response quality:
+#### Key Improvements:
+1. **Query Expansion**: Automatically expands your question with variations
+2. **Multi-Query Retrieval**: Searches with multiple query variants
+3. **Reranking**: Re-scores results for better relevance
+4. **Contextual Compression**: Keeps only the most relevant parts
+5. **Better Prompt Engineering**: Optimized prompts for LLM
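The multi-query step listed above can be sketched in a few lines. This is a minimal illustration only, not the code in `advanced_rag.py`: the names `multi_query_retrieve`, `fake_search`, and `FAKE_INDEX` are hypothetical, and merging by "best score per document" is just one plausible strategy. It shows how `score_threshold` and `top_k` could interact with results gathered from several query variants.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, relevance score in the 0-1 range)


def multi_query_retrieve(
    queries: List[str],
    search: Callable[[str], List[Hit]],
    score_threshold: float = 0.5,
    top_k: int = 5,
) -> List[Hit]:
    """Search with every query variant, keep the best score per document,
    drop hits below the threshold, and return the top_k highest-scoring hits."""
    best: Dict[str, float] = {}
    for query in queries:
        for doc_id, score in search(query):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    kept = [(doc_id, score) for doc_id, score in best.items() if score >= score_threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:top_k]


# Hypothetical in-memory stand-in for the vector database.
FAKE_INDEX = {
    "when is the festival": [("doc_date", 0.82), ("doc_venue", 0.41)],
    "festival date and venue": [("doc_venue", 0.77), ("doc_date", 0.64), ("doc_price", 0.30)],
}


def fake_search(query: str) -> List[Hit]:
    return FAKE_INDEX.get(query, [])


results = multi_query_retrieve(list(FAKE_INDEX), fake_search, score_threshold=0.5, top_k=5)
print(results)  # [('doc_date', 0.82), ('doc_venue', 0.77)]
```

Here `doc_venue` is found only weakly by the first variant (0.41) but strongly by the second (0.77), which is the kind of case multi-query retrieval is meant to help with.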
+#### How to Use:
+**Basic Usage (Auto-enabled):**
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Dao có nguy hiểm không?',
+    'use_rag': True,
+    'use_advanced_rag': True,  # Default: True
+    'hf_token': 'hf_xxxxx'
+})
+result = response.json()
+print("Response:", result['response'])
+print("RAG Stats:", result['rag_stats'])  # See pipeline statistics
+```
+**Advanced Configuration:**
+```python
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Làm sao để tạo event mới?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    # RAG Pipeline Options
+    'use_query_expansion': True,    # Expand query with variations
+    'use_reranking': True,          # Rerank results
+    'use_compression': True,        # Compress context
+    'score_threshold': 0.5,         # Min relevance score (0-1)
+    'top_k': 5,                     # Number of documents to retrieve
+    # LLM Options
+    'max_tokens': 512,
+    'temperature': 0.7,
+    'hf_token': 'hf_xxxxx'
+})
+```
+**Disable Advanced RAG (Use Basic):**
+```python
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Your question',
+    'use_rag': True,
+    'use_advanced_rag': False,  # Use basic RAG
+})
+```
+## API Changes Summary
+### `/index` Endpoint
+**Old Parameters:**
+- `id`: str (required)
+- `text`: str (required)
+- `image`: UploadFile (optional)
+**New Parameters:**
+- `id`: str (required)
+- `texts`: List[str] (optional, max 10)
+- `images`: List[UploadFile] (optional, max 10)
+**Response:**
+```json
+{
+  "success": true,
+  "id": "doc123",
+  "message": "Đã index thành công document doc123 với 3 texts và 2 images"
+}
+```
+### `/chat` Endpoint
+**New Parameters:**
+- `use_advanced_rag`: bool (default: True) - Enable advanced RAG
+- `use_query_expansion`: bool (default: True) - Expand query
+- `use_reranking`: bool (default: True) - Rerank results
+- `use_compression`: bool (default: True) - Compress context
+- `score_threshold`: float (default: 0.5) - Min relevance score
+**Response (New):**
+```json
+{
+  "response": "AI generated answer...",
+  "context_used": [...],
+  "timestamp": "2025-10-29T...",
+  "rag_stats": {
+    "original_query": "Your question",
+    "expanded_queries": ["Query variant 1", "Query variant 2"],
+    "initial_results": 10,
+    "after_rerank": 5,
+    "after_compression": 5
+  }
+}
+```
+## Complete Examples
+### Example 1: Index Multiple Social Media Posts
+```python
+import requests
+# Index a social media event with multiple posts and images
+data = {
+    'id': 'event_festival_2025',
+    'texts': [
+        'Festival âm nhạc quốc tế Hà Nội 2025',
+        'Ngày 15-17 tháng 11 năm 2025',
+        'Địa điểm: Công viên Thống Nhất',
+        'Line-up: Sơn Tùng MTP, Đen Vâu, Hoàng Thùy Linh',
+        'Giá vé từ 500.000đ - 2.000.000đ'
+    ]
+}
+files = [
+    ('images', open('poster_festival.jpg', 'rb')),
+    ('images', open('lineup.jpg', 'rb')),
+    ('images', open('venue_map.jpg', 'rb'))
+]
+response = requests.post('http://localhost:8000/index', data=data, files=files)
+print(response.json())
+```
+### Example 2: Advanced RAG Chat
+```python
+import requests
+# Chat with advanced RAG
+chat_response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Festival âm nhạc Hà Nội diễn ra khi nào và ở đâu?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'top_k': 3,
+    'score_threshold': 0.6,
+    'hf_token': 'your_hf_token_here'
+})
+result = chat_response.json()
+print("Answer:", result['response'])
+print("\nRetrieved Context:")
+for ctx in result['context_used']:
+    print(f"- [{ctx['id']}] Confidence: {ctx['confidence']:.2%}")
+print("\nRAG Pipeline Stats:")
+print(f"- Original query: {result['rag_stats']['original_query']}")
+print(f"- Query variants: {result['rag_stats']['expanded_queries']}")
+print(f"- Documents retrieved: {result['rag_stats']['initial_results']}")
+print(f"- After reranking: {result['rag_stats']['after_rerank']}")
+```
+## Performance Comparison
+| Feature | Basic RAG | Advanced RAG |
+|---------|-----------|--------------|
+| Query Understanding | Single query | Multiple query variants |
+| Retrieval Method | Direct vector search | Multi-query vector search |
+| Result Ranking | Score from DB | Reranked with semantic similarity |
+| Context Quality | Full text | Compressed, relevant parts only |
+| Response Accuracy | Baseline | Typically higher, especially on complex or ambiguous queries |
+| Response Time | Faster | Slower, because of the extra expansion, reranking, and compression steps |
+## When to Use What?
+**Use Basic RAG when:**
+- You need fast response time
+- Queries are straightforward
+- Context is already well-structured
+**Use Advanced RAG when:**
+- You need higher accuracy
+- Queries are complex or ambiguous
+- Context documents are long
+- You want better relevance
+## Troubleshooting
+### Error: "Tối đa 10 texts" ("Maximum 10 texts")
+The request contains more than 10 texts. Send at most 10 per request.
+### Error: "Tối đa 10 images" ("Maximum 10 images")
+The request contains more than 10 images. Send at most 10 per request.
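Both limits can be checked on the client before calling `/index`. The helper below is a hypothetical sketch (`check_index_payload` and `MAX_ITEMS` are not part of the project); it only mirrors the "max 10 each" rule stated in this guide and does not reproduce the server's own validation.

```python
MAX_ITEMS = 10  # limit stated in this guide for both `texts` and `images`


def check_index_payload(texts, image_paths):
    """Return a list of problems; an empty list means the payload is within limits."""
    problems = []
    if len(texts) > MAX_ITEMS:
        problems.append(f"too many texts: {len(texts)} > {MAX_ITEMS}")
    if len(image_paths) > MAX_ITEMS:
        problems.append(f"too many images: {len(image_paths)} > {MAX_ITEMS}")
    return problems


print(check_index_payload(["text"] * 11, ["poster.jpg"]))
print(check_index_payload(["text"] * 3, ["poster.jpg"] * 2))
```

If the list is non-empty, one option is to split the content across several documents with distinct `id` values rather than sending a single oversized request.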
+### RAG stats show 0 results
+Your `score_threshold` might be too high. Try lowering it below the 0.5 default (e.g., 0.3).
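One way to automate that advice is to retry with progressively lower thresholds until some context is returned. The sketch below is an assumption-laden illustration: `ask_with_fallback` and `fake_ask` are hypothetical names, and `fake_ask` stands in for a real call to `/chat` so the example runs without a server.

```python
def ask_with_fallback(ask, thresholds=(0.5, 0.4, 0.3)):
    """Call `ask(threshold)` with decreasing thresholds until context comes back.

    `ask` is any callable returning the parsed /chat JSON as a dict.
    Returns a tuple (threshold_used, response_dict).
    """
    result = {}
    for threshold in thresholds:
        result = ask(threshold)
        if result.get("context_used"):
            return threshold, result
    return thresholds[-1], result


# Hypothetical stand-in for the endpoint: only documents scoring >= threshold are returned.
def fake_ask(threshold):
    scored_docs = [("doc_a", 0.42), ("doc_b", 0.35)]
    hits = [{"id": doc_id, "confidence": score} for doc_id, score in scored_docs if score >= threshold]
    return {"response": "...", "context_used": hits}


used, reply = ask_with_fallback(fake_ask)
print(used, [hit["id"] for hit in reply["context_used"]])  # 0.4 ['doc_a']
```

Against a running server, `ask` could wrap `requests.post(...)` on `/chat` and pass the threshold as `score_threshold` in the JSON body.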
+## Next Steps
+To further improve RAG, consider:
+1. **Add BM25 Hybrid Search**: Combine dense + sparse retrieval
+2. **Use Cross-Encoder for Reranking**: Better than embedding similarity
+3. **Implement Query Decomposition**: Break complex queries into sub-queries
+4. **Add Citation/Source Tracking**: Show which document each fact comes from
+5. **Integrate RAG-Anything**: For advanced multimodal document processing
+For RAG-Anything integration (more complex), see: https://github.com/HKUDS/RAG-Anything

MULTIMODAL_PDF_GUIDE.md ADDED

	@@ -0,0 +1,525 @@

+# Multimodal PDF Guide - PDFs with Text + Images
+## Overview
+The system now supports **Multimodal PDFs**, that is, PDFs containing:
+- ✅ Instructional text
+- ✅ Image URLs (links to images)
+- ✅ Markdown images: `![alt](url)`
+- ✅ HTML images: `<img src="url">`
+**Well suited for**: user guides with screenshots, tutorials with diagrams, documentation with visual aids.
+---
+## Why Multimodal?
+### The Problem with Regular PDF Parsing
+Instructional PDFs often contain content like:
+```
+Step 1: Open the home page
+[See image: https://example.com/homepage.png]
+Step 2: Click "Create new"
+![Create button](https://example.com/create-button.png)
+Step 3: Fill in the details
+<img src="https://example.com/form.png" alt="Form" />
+```
+The **old PDF parser** extracts only text, so **all image URLs are lost** and the chatbot cannot tell which images relate to which passage.
+The **new multimodal PDF parser**:
+- ✓ Extracts text
+- ✓ Detects all image URLs
+- ✓ Links images to the corresponding text chunks
+- ✓ Stores the URLs in metadata
+- ✓ Returns images together with text when chatting
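The detection and linking steps can be illustrated with a small regex-based sketch. It is not the implementation in `multimodal_pdf_parser.py`: the names `find_image_urls` and `chunk_with_images` are hypothetical, and the sketch assumes that every `http(s)` URL in a chunk is an image candidate. The metadata keys (`has_images`, `num_images`, `image_urls`) follow the ones used elsewhere in this guide.

```python
import re

# One pattern covers plain URLs, Markdown images, and HTML <img> tags, because in
# all three forms the URL is delimited by whitespace, quotes, ')' or ']'.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def find_image_urls(text):
    """Return candidate image URLs in order of first appearance, without duplicates."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def chunk_with_images(chunk_text):
    """Attach the URLs found in a chunk to that chunk's metadata."""
    urls = find_image_urls(chunk_text)
    return {
        "text": chunk_text,
        "has_images": bool(urls),
        "num_images": len(urls),
        "image_urls": urls,
    }


sample = (
    "Step 1: Open the home page\n"
    "[See image: https://example.com/homepage.png]\n"
    "![Create button](https://example.com/create-button.png)\n"
    '<img src="https://example.com/form.png" alt="Form" />\n'
)
print(chunk_with_images(sample)["image_urls"])
```

Because URLs are collected per chunk, each retrieved chunk carries only the images that appeared next to its own text.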
+---
+## Comparison: Regular PDF vs Multimodal PDF
+| Feature | Regular PDF (`/upload-pdf`) | Multimodal PDF (`/upload-pdf-multimodal`) |
+|---------|---------------------------|-------------------------------------------|
+| Extract text | ✓ | ✓ |
+| Detect image URLs | ✗ | ✓ |
+| Link images to chunks | ✗ | ✓ |
+| Return images in chat | ✗ | ✓ |
+| URL formats supported | ✗ | http://, https://, markdown, HTML |
+| Use case | Simple text documents | User guides, tutorials, docs with images |
+---
+## How to Use
+### 1. Upload Multimodal PDF
+**Endpoint:** `POST /upload-pdf-multimodal`
+**Curl:**
+```bash
+curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
+  -F "file=@user_guide_with_images.pdf" \
+  -F "title=Hướng dẫn sử dụng hệ thống" \
+  -F "description=User guide with screenshots" \
+  -F "category=user_guide"
+```
+**Python:**
+```python
+import requests
+with open('user_guide_with_images.pdf', 'rb') as f:
+    response = requests.post(
+        'http://localhost:8000/upload-pdf-multimodal',
+        files={'file': f},
+        data={
+            'title': 'User Guide with Screenshots',
+            'category': 'user_guide'
+        }
+    )
+result = response.json()
+print(f"Indexed: {result['chunks_indexed']} chunks")
+print(f"Message: {result['message']}")  # the image count is reported inside the message text
+```
+**Response:**
+```json
+{
+  "success": true,
+  "document_id": "pdf_multimodal_20251029_150000",
+  "filename": "user_guide_with_images.pdf",
+  "chunks_indexed": 25,
+  "message": "PDF 'user_guide_with_images.pdf' indexed successfully with 25 chunks and 15 images"
+}
+```
+### 2. Chat with Multimodal Context
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Làm sao để tạo event mới?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'top_k': 3,
+    'hf_token': 'your_token'
+})
+result = response.json()
+# Response text
+print("Answer:", result['response'])
+# Retrieved context with images
+for ctx in result['context_used']:
+    print(f"\n--- Source: Page {ctx['metadata']['page']} ---")
+    print(f"Text: {ctx['metadata']['text'][:200]}...")
+    # Check if this chunk has images
+    if ctx['metadata'].get('has_images'):
+        print(f"Images ({ctx['metadata']['num_images']}):")
+        for img_url in ctx['metadata'].get('image_urls', []):
+            print(f"  - {img_url}")
+```
+**Example Output:**
+```
+Answer: Để tạo event mới, bạn thực hiện các bước sau:
+1. Mở trang chủ và click vào nút "Tạo Event" (xem hình minh họa)
+2. Điền thông tin event...
+--- Source: Page 5 ---
+Text: Bước 1: Mở trang chủ và click vào nút "Tạo Event"...
+Images (2):
+  - https://example.com/homepage.png
+  - https://example.com/create-button.png
+```
+---
+## How to Prepare a PDF
+### Supported Formats
+The multimodal parser detects the following formats:
+1. **Standard URLs:**
+   ```
+   See image: https://example.com/image.png
+   Screenshot: http://cdn.example.com/screenshot.jpg
+   ```
+2. **Markdown Images:**
+   ```markdown
+   ![Homepage](https://example.com/homepage.png)
+   ![Button](https://example.com/button.png)
+   ```
+3. **HTML Images:**
+   ```html
+   <img src="https://example.com/form.png" alt="Form" />
+   <img src="http://example.com/result.jpg">
+   ```
+4. **Image Extensions:**
+   ```
+   https://example.com/pic.jpg
+   https://example.com/chart.png
+   https://example.com/diagram.svg
+   ```
+### Best Practices
+#### ✓ Good
+**PDF Content Example:**
+```
+# Event Creation Guide
+## Step 1: Open the Home Page
+Go to the system's home page.
+![Homepage Screenshot](https://docs.example.com/images/homepage.png)
+You will see the main screen with a menu on the left.
+## Step 2: Click "Create Event"
+Find and click the "Create Event" button in the top-right corner.
+![Create Event Button](https://docs.example.com/images/create-button.png)
+## Step 3: Fill In the Details
+Enter the following in the form:
+- Event name
+- Date and time
+- Location
+See the sample form: https://docs.example.com/images/event-form.png
+```
+**Why good:**
+- Clear structure (headings)
+- Every step pairs text with an image
+- URLs are explicit and easy to detect
+- Context sits right next to its image
+#### ✗ Avoid
+```
+See the figures below [1] [2] [3]
+[Images are at the end of the document]
+...
+[1] homepage.png
+[2] button.png
+[3] form.png
+```
+**Why bad:**
+- Image references carry no URLs
+- Images are separated from their context
+- No full URLs (filenames only)
+---
+## Real-World Example
+### Creating a Multimodal Guide PDF
+**File: `chatbot_guide_with_images.md`**
+```markdown
+# ChatbotRAG User Guide
+## 1. Upload a PDF
+### Step 1: Prepare the PDF file
+Make sure your PDF file is ready.
+![PDF File Icon](https://via.placeholder.com/150?text=PDF+File)
+### Step 2: Use cURL or Python
+**With cURL:**
+\`\`\`bash
+curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
+  -F "file=@your_file.pdf"
+\`\`\`
+![cURL Command Example](https://via.placeholder.com/400x100?text=cURL+Command)
+**With Python:**
+\`\`\`python
+import requests
+# Upload code here
+\`\`\`
+### Step 3: Verify the Upload
+Check the upload result:
+https://via.placeholder.com/500x300?text=Upload+Success+Message
+## 2. Chat with the Chatbot
+After uploading, you can ask the chatbot:
+![Chat Interface](https://via.placeholder.com/600x400?text=Chat+Interface)
+**Example questions:**
+- "How do I upload a PDF?"
+- "What are the steps to create an event?"
+![Chat Example](https://via.placeholder.com/600x300?text=Chat+Example)
+## 3. View the Results
+The chatbot answers based on the PDF content:
+https://via.placeholder.com/600x350?text=Chat+Response+with+Images
+```
+**Convert to PDF:**
+```bash
+pandoc chatbot_guide_with_images.md -o chatbot_guide_with_images.pdf
+```
+**Upload:**
+```bash
+curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
+  -F "file=@chatbot_guide_with_images.pdf" \
+  -F "title=ChatbotRAG Guide" \
+  -F "category=user_guide"
+```
+---
+## Advanced: Custom Image Handling
+### Option 1: Local Images
+If the images are stored locally, you need to host them:
+```bash
+# Simple HTTP server
+cd /path/to/images
+python -m http.server 8080
+# Images available at:
+# http://localhost:8080/image1.png
+# http://localhost:8080/image2.png
+```
+In the PDF, reference them as:
+```
+![Image](http://localhost:8080/image1.png)
+```
+### Option 2: Cloud Storage
+Upload the images to cloud storage (AWS S3, Cloudinary, Imgur, etc.):
+```python
+# Example: Upload to Imgur
+import requests
+def upload_to_imgur(image_path):
+    client_id = 'YOUR_CLIENT_ID'
+    headers = {'Authorization': f'Client-ID {client_id}'}
+    with open(image_path, 'rb') as img:
+        response = requests.post(
+            'https://api.imgur.com/3/image',
+            headers=headers,
+            files={'image': img}
+        )
+    return response.json()['data']['link']
+# Upload images
+url1 = upload_to_imgur('screenshot1.png')
+url2 = upload_to_imgur('screenshot2.png')
+# Use URLs in PDF
+print(f"![Screenshot 1]({url1})")
+```
+### Option 3: Embed Images as Base64
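+If the images are embedded in the PDF itself, one fallback is to render each page to a PNG and encode it as a base64 data URL. Note that this captures whole pages, not the individual embedded images: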
+```python
+import pypdfium2 as pdfium
+from PIL import Image
+import io
+import base64
+def extract_images_from_pdf(pdf_path):
+    """Render every PDF page to a PNG and return each one base64-encoded (whole pages, not individual embedded images)."""
+    pdf = pdfium.PdfDocument(pdf_path)
+    images = []
+    for page_num in range(len(pdf)):
+        page = pdf[page_num]
+        # Render page as image
+        bitmap = page.render(scale=2.0)
+        pil_image = bitmap.to_pil()
+        # Save or convert to base64
+        buffered = io.BytesIO()
+        pil_image.save(buffered, format="PNG")
+        img_str = base64.b64encode(buffered.getvalue()).decode()
+        images.append({
+            'page': page_num + 1,
+            'base64': img_str,
+            'url': f'data:image/png;base64,{img_str}'
+        })
+    return images
+```
+---
+## Troubleshooting
+### Images are not detected
+**Possible causes:**
+- URLs are malformed (missing `http://` or `https://`)
+- URLs are broken across lines
+- Incorrect Markdown syntax
+**How to check:**
+```python
+# Test URL detection
+from multimodal_pdf_parser import MultimodalPDFParser
+parser = MultimodalPDFParser()
+test_text = """
+Xem hình: https://example.com/image.png
+![Alt](https://example.com/pic.jpg)
+"""
+urls = parser.extract_image_urls(test_text)
+print("Found URLs:", urls)
+```
+### Chatbot does not return images
+**Check:**
+1. Verify that the PDF was indexed with the multimodal parser:
+   ```bash
+   curl http://localhost:8000/documents/pdf
+   # Look for "type": "multimodal_pdf"
+   ```
+2. Check that the metadata contains `image_urls`:
+   ```python
+   response = requests.post('http://localhost:8000/chat', ...)
+   for ctx in response.json()['context_used']:
+       print(ctx['metadata'].get('image_urls', []))
+   ```
+### Too many images → large chunks
+**Solution:** Reduce the number of images per chunk by using smaller chunks:
+```python
+# In multimodal_pdf_parser.py
+parser = MultimodalPDFParser(
+    chunk_size=300,      # Smaller chunks
+    chunk_overlap=30,
+    extract_images=True
+)
+```
+---
+## Conclusion
+### When to Use Multimodal PDF?
+✓ **Use `/upload-pdf-multimodal` when:**
+- The PDF contains illustrations (screenshots, diagrams)
+- The chatbot should reference images in its answers
+- You have user guides or tutorials with visual instructions
+- You have documentation with charts or tables as images
+✓ **Use the regular `/upload-pdf` when:**
+- The PDF contains plain text only
+- Images are not needed in the context
+- You have simple documents or FAQs
+### Complete Workflow
+1. **Create the PDF** with text + image URLs (Markdown/HTML)
+2. **Upload** it via `/upload-pdf-multimodal`
+3. **Verify** that the images were detected
+4. **Chat**: images are included in the context automatically
+5. **Display** the images in your own UI
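For the final "display" step, a UI needs to turn the retrieved context into something renderable. The following is a minimal sketch, assuming the `context_used` entries carry the `page`, `text`, `has_images`, and `image_urls` metadata fields shown in this guide; `render_context_markdown` is a hypothetical helper, and a real front end might emit HTML instead.

```python
def render_context_markdown(context_used):
    """Turn the `context_used` list from /chat into Markdown with inline images."""
    blocks = []
    for index, ctx in enumerate(context_used, 1):
        meta = ctx.get("metadata", {})
        page = meta.get("page", "?")
        blocks.append(f"**[{index}] Page {page}**")
        blocks.append(meta.get("text", "").strip())
        if meta.get("has_images"):
            for url in meta.get("image_urls", []):
                blocks.append(f"![Illustration from page {page}]({url})")
    return "\n\n".join(blocks)


# Hand-written sample shaped like the response fields used in this guide.
sample_context = [
    {
        "confidence": 0.91,
        "metadata": {
            "page": 5,
            "text": "Step 1: Open the home page.",
            "has_images": True,
            "num_images": 1,
            "image_urls": ["https://example.com/homepage.png"],
        },
    },
    {"confidence": 0.72, "metadata": {"page": 6, "text": "Step 2: Fill in the form.", "has_images": False}},
]

markdown = render_context_markdown(sample_context)
print(markdown)
```

Using `.get()` with defaults keeps the renderer tolerant of chunks indexed by the regular parser, which have no image fields.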
+---
+## Example: Full Workflow
+```python
+"""
+Complete workflow: Create, upload, and chat with multimodal PDF
+"""
+import requests
+# 1. Upload multimodal PDF
+print("=== Uploading Multimodal PDF ===")
+with open('user_guide_with_images.pdf', 'rb') as f:
+    response = requests.post(
+        'http://localhost:8000/upload-pdf-multimodal',
+        files={'file': f},
+        data={'title': 'User Guide', 'category': 'guide'}
+    )
+result = response.json()
+print(f"✓ Indexed: {result['chunks_indexed']} chunks")
+print(f"✓ Message: {result['message']}")
+# 2. Chat with multimodal context
+print("\n=== Chatting ===")
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Làm sao để tạo event mới? Cho tôi xem hình minh họa.',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'top_k': 3,
+    'hf_token': 'your_token'
+})
+chat_result = response.json()
+print(f"Answer: {chat_result['response']}\n")
+# 3. Display context with images
+print("=== Context with Images ===")
+for i, ctx in enumerate(chat_result['context_used'], 1):
+    print(f"\n[{i}] Page {ctx['metadata']['page']}, Confidence: {ctx['confidence']:.2%}")
+    print(f"Text: {ctx['metadata']['text'][:150]}...")
+    if ctx['metadata'].get('has_images'):
+        print(f"Images ({ctx['metadata']['num_images']}):")
+        for url in ctx['metadata']['image_urls']:
+            print(f"  🖼️ {url}")
+```
+---
+**Your PDFs with illustrative images should now work end to end! 🎨📄**

PDF_RAG_GUIDE.md ADDED

	@@ -0,0 +1,390 @@

+# Guide to Using PDFs with ChatbotRAG
+## Overview
+ChatbotRAG now supports **uploading and indexing PDFs** so the chatbot can answer questions based on PDF content. This is useful for:
+- Product user guides
+- FAQ documents
+- Policies and regulations
+- Technical documentation
+## How It Works
+1. **Upload PDF** → The system parses the PDF into text
+2. **Chunking** → The text is split into chunks (default: 500 words per chunk, 50-word overlap)
+3. **Embedding** → Each chunk is converted into a vector embedding
+4. **Indexing** → Chunks are stored in Qdrant + MongoDB
+5. **Chat** → The chatbot searches for relevant chunks and answers the question
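The chunking step above can be sketched as a sliding window over words. This is a simplified illustration, not the logic in `pdf_parser.py`: `chunk_words` is a hypothetical function, it splits on whitespace only, and it ignores the `min_chunk_size` option that the real parser exposes.

```python
def chunk_words(text, chunk_size=500, chunk_overlap=50):
    """Split text into word-based chunks where consecutive chunks share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(1200)]
chunks = chunk_words(" ".join(words))
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [500, 500, 300]
```

With the defaults, the window advances 450 words at a time, so the last 50 words of one chunk reappear as the first 50 words of the next. That overlap helps keep sentences that straddle a boundary retrievable.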
+## Method 1: Upload a PDF via the API
+### Endpoint: `POST /upload-pdf`
+**Request:**
+```bash
+curl -X POST "http://localhost:8000/upload-pdf" \
+  -F "file=@huong_dan_su_dung.pdf" \
+  -F "title=Hướng dẫn sử dụng ChatbotRAG" \
+  -F "description=Tài liệu hướng dẫn đầy đủ về ChatbotRAG" \
+  -F "category=user_guide"
+```
+**Python:**
+```python
+import requests
+with open('huong_dan_su_dung.pdf', 'rb') as f:
+    files = {'file': f}
+    data = {
+        'title': 'Hướng dẫn sử dụng ChatbotRAG',
+        'description': 'Tài liệu hướng dẫn đầy đủ',
+        'category': 'user_guide'
+    }
+    response = requests.post(
+        'http://localhost:8000/upload-pdf',
+        files=files,
+        data=data
+    )
+    print(response.json())
+```
+**Response:**
+```json
+{
+  "success": true,
+  "document_id": "pdf_20251029_143022",
+  "filename": "huong_dan_su_dung.pdf",
+  "chunks_indexed": 45,
+  "message": "PDF 'huong_dan_su_dung.pdf' đã được index thành công với 45 chunks"
+}
+```
+### Parameters:
+- `file` (required): The PDF file
+- `document_id` (optional): Custom ID; auto-generated by default
+- `title` (optional): Document title
+- `description` (optional): Description
+- `category` (optional): Category (user_guide, faq, policy, etc.)
+## Method 2: Batch-Index Multiple PDFs
+If you have many PDF files, use the batch script:
+```bash
+# Index all PDFs in a directory
+python batch_index_pdfs.py ./docs/user_guides
+# With a custom category
+python batch_index_pdfs.py ./docs/policies --category=policy
+# Force reindex (overwrite existing entries)
+python batch_index_pdfs.py ./docs/faq --category=faq --force
+```
+The script automatically:
+- Scans all .pdf files in the directory
+- Indexes each file with suitable metadata
+- Skips files that are already indexed (unless `--force` is used)
+- Displays progress and a summary
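The scan-and-skip behaviour described above might look roughly like the sketch below. It is not taken from `batch_index_pdfs.py`: `select_pdfs` is a hypothetical helper, and matching already-indexed files by filename is an assumption (the real script may track document IDs instead).

```python
import tempfile
from pathlib import Path


def select_pdfs(directory, already_indexed, force=False):
    """Return (to_index, skipped) filename lists for the PDFs found in `directory`."""
    pdfs = sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
    to_index, skipped = [], []
    for path in pdfs:
        if path.name in already_indexed and not force:
            skipped.append(path.name)
        else:
            to_index.append(path.name)
    return to_index, skipped


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("faq.pdf", "guide.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    plan = select_pdfs(tmp, already_indexed={"faq.pdf"})
    forced = select_pdfs(tmp, already_indexed={"faq.pdf"}, force=True)

print(plan)    # (['guide.PDF'], ['faq.pdf'])
print(forced)  # (['faq.pdf', 'guide.PDF'], [])
```

Separating "decide what to index" from "index it" also makes it easy to print the summary the script reports at the end.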
+## Managing PDF Documents
+### List PDFs
+```bash
+curl http://localhost:8000/documents/pdf
+```
+**Response:**
+```json
+{
+  "documents": [
+    {
+      "document_id": "pdf_user_guide",
+      "type": "pdf",
+      "filename": "huong_dan_su_dung.pdf",
+      "num_chunks": 45,
+      "metadata": {
+        "title": "Hướng dẫn sử dụng",
+        "category": "user_guide"
+      }
+    }
+  ],
+  "total": 1
+}
+```
+### Delete a PDF Document
+```bash
+# Delete the document and all of its chunks
+curl -X DELETE http://localhost:8000/documents/pdf/pdf_user_guide
+```
+## Chat with PDF Content
+After indexing a PDF, you can chat as usual:
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Làm sao để upload PDF vào ChatbotRAG?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'top_k': 5,
+    'hf_token': 'your_hf_token'
+})
+result = response.json()
+print("Answer:", result['response'])
+# Show sources
+for ctx in result['context_used']:
+    print(f"- Page {ctx['metadata']['page']}: {ctx['metadata']['text'][:100]}...")
+```
+The chatbot automatically searches the indexed PDF content and answers based on it.
+## Creating a User Guide PDF
+### Content Template
+Below is a suggested structure for a ChatbotRAG user guide PDF:
+```
+CHATBOTRAG USER GUIDE
+1. INTRODUCTION
+   - What is ChatbotRAG?
+   - Key features
+   - Use cases
+2. QUICK START
+   2.1. Installation
+   2.2. Starting the server
+   2.3. Accessing the API
+3. INDEXING DATA
+   3.1. Indexing simple text
+   3.2. Indexing with images
+   3.3. Indexing multiple texts and images at once
+   3.4. Uploading a PDF
+4. SEARCH
+   4.1. Search by text
+   4.2. Search by image
+   4.3. Hybrid search
+5. CHATTING WITH THE CHATBOT
+   5.1. Basic chat
+   5.2. Chat with RAG
+   5.3. Advanced RAG options
+   5.4. Customizing LLM parameters
+6. MANAGING DOCUMENTS
+   6.1. Listing documents
+   6.2. Deleting documents
+   6.3. Managing PDF files
+7. FREQUENTLY ASKED QUESTIONS (FAQ)
+   - How do I upload a PDF?
+   - Why can't the chatbot find the information?
+   - How can I improve accuracy?
+   - What is the token limit?
+8. API REFERENCE
+   - POST /index
+   - POST /search
+   - POST /chat
+   - POST /upload-pdf
+   - GET /documents/pdf
+```
+### Creating a PDF from Markdown
+Several tools can produce a PDF from Markdown:
+**1. Pandoc (Recommended):**
+```bash
+pandoc guide.md -o guide.pdf --pdf-engine=xelatex
+```
+**2. Online Tools:**
+- https://www.markdowntopdf.com/
+- https://md2pdf.netlify.app/
+**3. VS Code Extension:**
+- Install "Markdown PDF" extension
+- Right-click file .md → "Markdown PDF: Export (pdf)"
+### Example Markdown Content
+Create the file `chatbot_guide.md`:
+```markdown
+# ChatbotRAG User Guide
+## 1. Upload a PDF
+To upload a PDF to the system:
+### Step 1: Prepare the PDF file
+- The file must be in .pdf format
+- The content should be clear and well structured
+### Step 2: Upload via the API
+\`\`\`bash
+curl -X POST "http://localhost:8000/upload-pdf" \
+  -F "file=@your_file.pdf" \
+  -F "title=Document title"
+\`\`\`
+### Step 3: Verify
+After the upload, the system returns the number of chunks indexed.
+## 2. Chat with the Chatbot
+After uploading the PDF, you can ask the chatbot:
+**Examples:**
+- "How do I upload a PDF?"
+- "What are the steps to create an event?"
+- "What features does the system have?"
+The chatbot searches the PDF and answers based on the indexed content.
+## 3. FAQ
+### Question 1: What is the maximum number of pages for an uploaded PDF?
+There is no limit, but larger PDFs take longer to index.
+### Question 2: Can I upload multiple PDFs?
+Yes, you can upload multiple PDFs. Each PDF gets its own document_id.
+### Question 3: How do I delete an uploaded PDF?
+Use the endpoint DELETE /documents/pdf/{document_id}
+```
+Then convert it to PDF:
+```bash
+pandoc chatbot_guide.md -o chatbot_guide.pdf
+```
+## Best Practices
+### 1. PDF Structure
+- ✓ Use clear titles
+- ✓ Divide the document into sections/chapters
+- ✓ Use bullet points
+- ✗ Avoid too many complex images (they make text extraction harder)
+### 2. Content
+- ✓ Write short, easy-to-understand sentences
+- ✓ Keep each section focused on one topic
+- ✓ Include concrete examples
+- ✗ Avoid long prose that is hard to split into sentences
+### 3. Metadata
+- Always set a clear `title`
+- Use `category` to classify documents
+- Add a `description` to make documents easier to manage
+### 4. Chunking
+Defaults:
+- Chunk size: 500 words
+- Overlap: 50 words
+These can be customized in `pdf_parser.py`:
+```python
+parser = PDFParser(
+    chunk_size=500,      # Increase for longer context per chunk
+    chunk_overlap=50,    # Increase to preserve more context across chunks
+    min_chunk_size=50    # Minimum words per chunk
+)
+```
+## Troubleshooting
+### Error: "Error reading PDF"
+- Check whether the PDF file is corrupt
+- Try opening it in a PDF reader to verify
+- Re-export or re-convert the PDF if needed
+### Error: "No text extracted"
+- The PDF may consist of scanned images (no text layer)
+- Run OCR before indexing (with a tool such as Tesseract)
+### Chatbot cannot find the information
+- Check `score_threshold`; try lowering it (e.g., 0.3)
+- Increase `top_k` to retrieve more documents
+- Rephrase the question
+### Chunks are too short/long
+- Adjust `chunk_size` in `pdf_parser.py`
+- Reindex the PDF with the new settings
+## Complete Example
+```python
+# 1. Upload PDF
+import requests
+with open('user_guide.pdf', 'rb') as f:
+    response = requests.post(
+        'http://localhost:8000/upload-pdf',
+        files={'file': f},
+        data={
+            'title': 'Hướng dẫn sử dụng',
+            'category': 'user_guide'
+        }
+    )
+doc_id = response.json()['document_id']
+print(f"Uploaded: {doc_id}")
+# 2. List PDFs
+response = requests.get('http://localhost:8000/documents/pdf')
+print(response.json())
+# 3. Chat
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Làm sao để tạo event mới?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'hf_token': 'your_token'
+})
+print("Answer:", response.json()['response'])
+# 4. Delete PDF (if needed)
+response = requests.delete(f'http://localhost:8000/documents/pdf/{doc_id}')
+print(response.json())
+```
+## Next Steps
+1. **Create your own guide PDF** with content about your system
+2. **Upload the PDF** to the system
+3. **Test the chatbot**: ask questions about the PDF content
+4. **Tune**: adjust parameters if needed
+5. **Add more PDFs**: FAQs, policies, etc.
+## Support
+If something goes wrong, check:
+- Server logs for errors
+- MongoDB, to confirm the documents were saved
+- The Qdrant collection, to verify the chunks were indexed
+## Conclusion
+The PDF RAG system lets your chatbot answer questions from existing documents without retraining the model. You only need to:
+1. Upload a PDF
+2. Chat as usual
+3. Let the chatbot search the PDF content and answer from it
+Simple and effective!

QUICK_START_PDF.md ADDED

	@@ -0,0 +1,310 @@

+# Quick Start: PDF-Based ChatbotRAG
+## Quick Summary
+You can now:
+1. **Upload a user guide PDF** to the system
+2. **Have the chatbot automatically answer** questions based on the PDF content
+3. Skip model training entirely: just upload the PDF!
+---
+## Complete Workflow
+### Step 1: Create the Guide PDF
+There are two options:
+**Option 1: Use the Provided Template**
+The file `chatbot_guide_template.md` is ready to use. Customize its content for your system, then convert it to PDF:
+```bash
+# Install pandoc (if not already installed)
+# Windows: choco install pandoc
+# Mac: brew install pandoc
+# Linux: sudo apt-get install pandoc
+# Convert markdown to PDF
+pandoc chatbot_guide_template.md -o chatbot_user_guide.pdf --pdf-engine=xelatex
+```
+**Option 2: Write Your Own Content**
+Create a Word or Google Docs file with the guide content, then:
+- File → Export → PDF
+**The content should include:**
+- An introduction to the system
+- The main features
+- Instructions for each feature
+- FAQ (frequently asked questions)
+- Examples
+### Step 2: Upload the PDF to the System
+```bash
+# Start the server
+cd ChatbotRAG
+python main.py
+```
+In another terminal:
+```bash
+# Upload PDF
+curl -X POST "http://localhost:8000/upload-pdf" \
+  -F "file=@chatbot_user_guide.pdf" \
+  -F "title=Hướng dẫn sử dụng ChatbotRAG" \
+  -F "description=Tài liệu hướng dẫn đầy đủ" \
+  -F "category=user_guide"
+```
+Or with Python:
+```python
+import requests
+with open('chatbot_user_guide.pdf', 'rb') as f:
+    response = requests.post(
+        'http://localhost:8000/upload-pdf',
+        files={'file': f},
+        data={
+            'title': 'Hướng dẫn sử dụng ChatbotRAG',
+            'category': 'user_guide'
+        }
+    )
+print(response.json())
+# Output: {"success": true, "document_id": "pdf_...", "chunks_indexed": 45}
+```
+### Step 3: Verify the Upload
+```bash
+# List PDFs
+curl http://localhost:8000/documents/pdf
+```
+### Step 4: Chat!
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Làm sao để upload PDF vào ChatbotRAG?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'top_k': 5,
+    'hf_token': 'your_huggingface_token'  # Get from https://huggingface.co/settings/tokens
+})
+result = response.json()
+print("Answer:", result['response'])
+print("\nSources:")
+for ctx in result['context_used']:
+    print(f"- Page {ctx['metadata']['page']}: Confidence {ctx['confidence']:.2%}")
+```
+---
+## Sample Test Script
+File `test_pdf_chatbot.py`:
+```python
+"""
+Test PDF-based chatbot
+"""
+import requests
+import time
+BASE_URL = "http://localhost:8000"
+HF_TOKEN = "your_huggingface_token"  # Replace with your token
+def upload_pdf():
+    """Upload PDF guide"""
+    print("=== Uploading PDF ===")
+    with open('chatbot_user_guide.pdf', 'rb') as f:
+        response = requests.post(
+            f'{BASE_URL}/upload-pdf',
+            files={'file': f},
+            data={
+                'title': 'ChatbotRAG User Guide',
+                'category': 'user_guide'
+            }
+        )
+    response.raise_for_status()  # Stop here if the upload failed
+    result = response.json()
+    print(f"✓ Uploaded: {result['chunks_indexed']} chunks")
+    return result['document_id']
+def chat(question):
+    """Ask chatbot"""
+    print(f"\n=== Question: {question} ===")
+    response = requests.post(f'{BASE_URL}/chat', json={
+        'message': question,
+        'use_rag': True,
+        'use_advanced_rag': True,
+        'top_k': 5,
+        'hf_token': HF_TOKEN
+    })
+    response.raise_for_status()  # Stop here if the chat request failed
+    result = response.json()
+    print(f"Answer: {result['response']}\n")
+    print(f"Retrieved {len(result['context_used'])} documents:")
+    for i, ctx in enumerate(result['context_used'], 1):
+        print(f"{i}. Page {ctx['metadata'].get('page')}, Confidence: {ctx['confidence']:.2%}")
+def main():
+    # 1. Upload PDF
+    doc_id = upload_pdf()
+    # Wait for indexing to complete
+    time.sleep(2)
+    # 2. Test questions
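+    questions = [
+        "Làm sao để upload PDF vào hệ thống?",  # How do I upload a PDF to the system?
+        "Chatbot có support tiếng Việt không?",  # Does the chatbot support Vietnamese?
+        "Tối đa bao nhiêu texts có thể index cùng lúc?",  # How many texts can be indexed at once?
+        "Advanced RAG có những tính năng gì?"  # What features does Advanced RAG have?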
+    ]
+    for q in questions:
+        chat(q)
+        time.sleep(1)
+if __name__ == "__main__":
+    main()
+```
+Run it:
+```bash
+python test_pdf_chatbot.py
+```
+---
+## Uploading Multiple PDFs at Once
+If you have several PDFs (FAQ, user guide, policies, etc.):
+```bash
+# Put all the PDFs in one directory
+mkdir docs
+# Copy the PDFs into docs/
+# Batch index
+python batch_index_pdfs.py ./docs --category=user_guide
+```
+The script indexes every PDF in the directory and skips files that are already indexed (pass `--force` to reindex them).
+---
+## Sample Test Questions
+After uploading the guide PDF, try questions such as:
+**Features:**
+- "What features does ChatbotRAG have?"
+- "How do I index data?"
+- "What is Advanced RAG?"
+**Usage:**
+- "How do I upload a PDF?"
+- "How do I chat with the chatbot?"
+- "How do I view the chat history?"
+**FAQ:**
+- "What should I do when the chatbot cannot find the information?"
+- "What is the maximum number of images I can upload?"
+- "What is the token limit?"
+**Technical:**
+- "What is the score threshold?"
+- "What does top_k mean in a chat request?"
+- "How can I improve accuracy?"
+---
+## Tips for Better Chatbot Answers
+### 1. PDF Content Quality
+- Write clearly and with structure
+- Keep each section focused on one topic
+- Include concrete examples
+- Base the FAQ on questions users actually ask
+### 2. Chat Settings
+```python
+{
+    'use_advanced_rag': True,      # Keep enabled
+    'use_reranking': True,          # Rerank for accuracy
+    'use_compression': True,        # Compress the context
+    'score_threshold': 0.5,         # 0.4-0.6 works well
+    'top_k': 5,                     # 3-7 depending on the use case
+    'temperature': 0.3              # Low for factual answers
+}
+```
+### 3. Query Tips
+- Ask clear, specific questions
+- Avoid overly general questions
+- If nothing is found, rephrase the question
+---
+## Monitoring
+### Check Index Status
+```bash
+curl http://localhost:8000/stats
+```
+### View PDFs
+```bash
+curl http://localhost:8000/documents/pdf
+```
+### Check Chat History
+```bash
+curl "http://localhost:8000/history?limit=10"
+```
+---
+## Conclusion
+You can now:
+✓ Create a guide PDF with your own content
+✓ Upload the PDF to the system in a few seconds
+✓ Have the chatbot answer from the PDF content automatically
+✓ Skip model training and complex code
+✓ Update content by uploading a new PDF
+**Next Steps:**
+1. Create your guide PDF (or customize the template)
+2. Upload it to the system
+3. Test with real questions
+4. Fine-tune the settings if needed
+5. Add more PDFs (FAQ, policies, etc.)
+---
+## Key Files
+- `pdf_parser.py` - PDF parsing engine
+- `batch_index_pdfs.py` - Batch indexing script
+- `chatbot_guide_template.md` - Template PDF content
+- `PDF_RAG_GUIDE.md` - Details on PDF RAG
+- `ADVANCED_RAG_GUIDE.md` - Advanced RAG features
+---
+**Good luck! 🚀**

SUMMARY.md ADDED Viewed

	@@ -0,0 +1,429 @@

+# ChatbotRAG - Complete Summary
+## System Overview
+ChatbotRAG has been upgraded across the board with the following advanced features:
+### ✨ Main Features
+1. **Multiple Inputs Support** (/index)
+   - Index up to 10 texts + 10 images in one request
+   - Embeddings are averaged automatically
+2. **Advanced RAG Pipeline** (/chat)
+   - Query Expansion
+   - Multi-Query Retrieval
+   - Reranking with semantic similarity
+   - Contextual Compression
+   - Better Prompt Engineering
+3. **PDF Support** (/upload-pdf)
+   - Parse PDFs into chunks
+   - Automatic chunking with overlap
+   - Index into the RAG system
+4. **Multimodal PDF** (/upload-pdf-multimodal) ⭐ NEW
+   - Extract text + image URLs from the PDF
+   - Link images to text chunks
+   - Return images alongside text in chat
+   - Well suited to user guides with screenshots
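The automatic embedding averaging listed for `/index` can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in `main.py`: the helper names `l2_normalize` and `average_embeddings` are hypothetical, and normalizing before and after the mean is one reasonable choice rather than the confirmed behavior.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def average_embeddings(vectors):
    """Average several embeddings into one unit-length vector (hypothetical helper)."""
    if not vectors:
        raise ValueError("need at least one embedding")
    # Normalize each input first so no single vector dominates by magnitude
    unit = [l2_normalize(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    # Re-normalize so cosine similarity reduces to a dot product
    return l2_normalize(mean)

doc_vec = average_embeddings([[1.0, 0.0], [0.0, 2.0]])
print([round(x, 4) for x in doc_vec])  # [0.7071, 0.7071]
```

Averaging keeps one vector per document ID, which is simple to store and search, at the cost of blurring inputs that are about very different things.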
+---
+## System Architecture
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    FastAPI Application                       │
+├─────────────────────────────────────────────────────────────┤
+│                                                               │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
+│  │   Indexing   │  │   Search     │  │   Chat       │      │
+│  │   Endpoints  │  │   Endpoints  │  │   Endpoint   │      │
+│  └──────────────┘  └──────────────┘  └──────────────┘      │
+│                                                               │
+├─────────────────────────────────────────────────────────────┤
+│                                                               │
+│  ┌──────────────────────────────────────────────────────┐   │
+│  │            Advanced RAG Pipeline                      │   │
+│  │  • Query Expansion                                    │   │
+│  │  • Multi-Query Retrieval                              │   │
+│  │  • Reranking                                          │   │
+│  │  • Contextual Compression                             │   │
+│  └──────────────────────────────────────────────────────┘   │
+│                                                               │
+├─────────────────────────────────────────────────────────────┤
+│                                                               │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
+│  │   Jina CLIP  │  │   Qdrant     │  │   MongoDB    │      │
+│  │   v2         │  │   Vector DB  │  │   Documents  │      │
+│  └──────────────┘  └──────────────┘  └──────────────┘      │
+│                                                               │
+│  ┌──────────────┐  ┌──────────────┐                         │
+│  │   PDF        │  │  Multimodal  │                         │
+│  │   Parser     │  │  PDF Parser  │                         │
+│  └──────────────┘  └──────────────┘                         │
+│                                                               │
+└─────────────────────────────────────────────────────────────┘
+```
+---
+## Key Files
+### Core System
+- **main.py** - FastAPI application with all endpoints
+- **embedding_service.py** - Jina CLIP v2 embedding
+- **qdrant_service.py** - Qdrant vector DB operations
+- **advanced_rag.py** - Advanced RAG pipeline
+### PDF Processing
+- **pdf_parser.py** - Basic PDF parser (text only)
+- **multimodal_pdf_parser.py** - Multimodal PDF parser (text + images)
+- **batch_index_pdfs.py** - Batch indexing script
+### Documentation
+- **ADVANCED_RAG_GUIDE.md** - Advanced RAG features guide
+- **PDF_RAG_GUIDE.md** - PDF usage guide
+- **MULTIMODAL_PDF_GUIDE.md** - Multimodal PDF guide ⭐
+- **QUICK_START_PDF.md** - Quick start for PDF
+- **chatbot_guide_template.md** - Template for user guide PDF
+### Testing
+- **test_advanced_features.py** - Test advanced features
+- **test_pdf_chatbot.py** - Test PDF chatbot (example in docs)
+---
+## API Endpoints
+### 1. Indexing
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/index` | POST | Index texts + images (max 10 each) |
+| `/documents` | POST | Add text document |
+| `/upload-pdf` | POST | Upload PDF (text only) |
+| `/upload-pdf-multimodal` | POST | Upload PDF with images ⭐ |
+### 2. Search
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/search` | POST | Hybrid search (text + image) |
+| `/search/text` | POST | Text-only search |
+| `/search/image` | POST | Image-only search |
+| `/rag/search` | POST | RAG knowledge base search |
+### 3. Chat
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/chat` | POST | Chat with Advanced RAG |
+### 4. Management
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/documents/pdf` | GET | List all PDFs |
+| `/documents/pdf/{id}` | DELETE | Delete PDF document |
+| `/delete/{doc_id}` | DELETE | Delete document |
+| `/document/{doc_id}` | GET | Get document by ID |
+| `/history` | GET | Get chat history |
+| `/stats` | GET | Collection statistics |
+| `/` | GET | Health check + API docs |
+---
+## Use Cases & Recommendations
+### Case 1: Text-Only Guide PDF
+**Scenario:** FAQ, policy document, text guide
+**Solution:** `/upload-pdf`
+```bash
+curl -X POST "http://localhost:8000/upload-pdf" \
+  -F "file=@faq.pdf" \
+  -F "title=FAQ"
+```
+### Case 2: Guide PDF with Images ⭐ (Your Case)
+**Scenario:** User guide with screenshots, tutorial with diagrams
+**Solution:** `/upload-pdf-multimodal`
+```bash
+curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
+  -F "file=@user_guide_with_images.pdf" \
+  -F "title=User Guide" \
+  -F "category=guide"
+```
+**Benefits:**
+- ✓ Extract text + image URLs
+- ✓ Link images to text chunks
+- ✓ The chatbot returns images in its response
+- ✓ Visual context for users
+### Case 3: Multiple Social Media Posts
+**Scenario:** Index many posts, each with texts and images
+**Solution:** `/index` with multiple inputs
+```python
+import requests
+data = {
+    'id': 'post123',
+    'texts': ['Post text 1', 'Post text 2'],  # Up to 10 texts
+}
+files = [
+    ('images', open('img1.jpg', 'rb')),
+    ('images', open('img2.jpg', 'rb')),  # Max 10
+]
+requests.post('http://localhost:8000/index', data=data, files=files)
+```
+### Case 4: Complex Queries
+**Scenario:** Complex questions that need high accuracy
+**Solution:** Advanced RAG with full options
+```python
+{
+    'message': 'Complex question',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'use_reranking': True,
+    'use_compression': True,
+    'score_threshold': 0.5,
+    'top_k': 5
+}
+```
+---
+## Recommended Workflow
+### Initial Setup
+1. **Create the user guide PDF**
+   - Start from the template: `chatbot_guide_template.md`
+   - Customize the content for your system
+   - Add image URLs (screenshots, diagrams)
+   - Convert to PDF: `pandoc chatbot_guide_template.md -o chatbot_user_guide.pdf`
+2. **Upload PDF**
+   ```bash
+   curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
+     -F "file=@chatbot_user_guide.pdf" \
+     -F "title=Hướng dẫn sử dụng ChatbotRAG" \
+     -F "category=user_guide"
+   ```
+3. **Verify**
+   ```bash
+   curl http://localhost:8000/documents/pdf
+   # Check "type": "multimodal_pdf" and "total_images"
+   ```
+### Daily Use
+1. **Chat with users**
+   ```python
+   response = requests.post('http://localhost:8000/chat', json={
+       'message': user_question,
+       'use_rag': True,
+       'use_advanced_rag': True,
+       'hf_token': 'your_token'
+   })
+   ```
+2. **Display response + images**
+   ```python
+   result = response.json()
+   # Text answer
+   print(result['response'])
+   # Images (if any)
+   for ctx in result['context_used']:
+       if ctx['metadata'].get('has_images'):
+           for url in ctx['metadata'].get('image_urls', []):
+               # Display image in your UI
+               print(f"Image: {url}")
+   ```
+### Updating Content
+1. **Update the PDF** - Edit and re-export
+2. **Delete the old PDF**
+   ```bash
+   curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id
+   ```
+3. **Upload the new PDF**
+   ```bash
+   curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"
+   ```
+---
+## Performance Tips
+### 1. Chunking
+**Default:**
+- chunk_size: 500 words
+- chunk_overlap: 50 words
+**Tuned:**
+```python
+# In multimodal_pdf_parser.py
+parser = MultimodalPDFParser(
+    chunk_size=400,      # Shorter for faster retrieval
+    chunk_overlap=40,
+    min_chunk_size=50
+)
+```
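How `chunk_size`, `chunk_overlap` and `min_chunk_size` interact is easier to see in a small example. The sketch below assumes a plain word-based sliding window. The function `chunk_words` is hypothetical and may differ from what `MultimodalPDFParser` does internally.

```python
def chunk_words(text, chunk_size=500, chunk_overlap=50, min_chunk_size=50):
    """Split text into overlapping word-based chunks (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # each new chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        # Drop a short trailing fragment, but never drop the only chunk
        if len(piece) < min_chunk_size and chunks:
            break
        chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break  # the last word has been covered
    return chunks

sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample, chunk_size=400, chunk_overlap=40, min_chunk_size=50)
print([len(c.split()) for c in chunks])  # [400, 400, 280]
```

With these settings, the last 40 words of one chunk are repeated as the first 40 words of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.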
+### 2. Retrieval
+**Recommended settings:**
+```python
+{
+    'top_k': 5,              # 3-7 is optimal
+    'score_threshold': 0.5,   # 0.4-0.6 is good
+    'use_reranking': True,    # Always enable
+    'use_compression': True   # Keeps context relevant
+}
+```
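The effect of `score_threshold` and `top_k` can be reproduced in plain Python. In the project the filtering is passed to the Qdrant search call. The helper `select_hits` below is hypothetical and only illustrates the semantics, assuming each hit carries a `confidence` field as in `advanced_rag.py`.

```python
def select_hits(hits, top_k=5, score_threshold=0.5):
    """Keep hits scoring at least score_threshold, best first, at most top_k."""
    kept = [h for h in hits if h["confidence"] >= score_threshold]
    kept.sort(key=lambda h: h["confidence"], reverse=True)
    return kept[:top_k]

hits = [
    {"id": "a", "confidence": 0.82},
    {"id": "b", "confidence": 0.47},
    {"id": "c", "confidence": 0.65},
    {"id": "d", "confidence": 0.31},
]
# A lower threshold admits more candidates; top_k then caps the count
print([h["id"] for h in select_hits(hits, top_k=2, score_threshold=0.4)])  # ['a', 'c']
# A higher threshold can return fewer than top_k results
print([h["id"] for h in select_hits(hits, top_k=5, score_threshold=0.7)])  # ['a']
```

This is why lowering the threshold is the first thing to try when the chatbot finds nothing: raising `top_k` alone cannot bring back hits that the threshold has already removed.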
+### 3. LLM
+**For factual answers:**
+```python
+{
+    'temperature': 0.3,   # Low for accuracy
+    'max_tokens': 512,    # Concise answers
+    'top_p': 0.9
+}
+```
+---
+## Troubleshooting
+### Issue 1: Images Are Not Detected
+**Solution:**
+- Verify that the PDF contains image URLs (http://, https://)
+- Check the format: Markdown `![](url)` or HTML `<img src>`
+- Test regex:
+  ```python
+  from multimodal_pdf_parser import MultimodalPDFParser
+  parser = MultimodalPDFParser()
+  urls = parser.extract_image_urls("![](https://example.com/img.png)")
+  print(urls)  # Should return ['https://example.com/img.png']
+  ```
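To check a piece of extracted text without loading the parser, a standalone pattern test can help. The sketch below is an assumption about what matching the two supported formats might look like. The patterns and the helper `find_image_urls` are hypothetical and are not taken from `multimodal_pdf_parser.py`.

```python
import re

# Markdown image: ![alt](http(s)://...)
MD_IMAGE = re.compile(r'!\[[^\]]*\]\((https?://[^\s)]+)\)')
# HTML image: <img ... src="http(s)://...">
HTML_IMAGE = re.compile(r'<img[^>]*?\ssrc=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

def find_image_urls(text):
    """Return image URLs in order of appearance, without duplicates."""
    matches = [(m.start(), m.group(1))
               for pattern in (MD_IMAGE, HTML_IMAGE)
               for m in pattern.finditer(text)]
    urls = []
    for _, url in sorted(matches):
        if url not in urls:
            urls.append(url)
    return urls

sample = 'Step 1 ![](https://example.com/img.png) then <img src="https://example.com/b.jpg" alt="b">'
print(find_image_urls(sample))
# ['https://example.com/img.png', 'https://example.com/b.jpg']
```

If a URL from your PDF is missing from the output, look for line breaks or spaces inside the URL, which PDF text extraction often introduces.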
+### Issue 2: The Chatbot Cannot Find Information
+**Solution:**
+- Lower score_threshold: `0.3-0.5`
+- Increase top_k: `5-10`
+- Enable Advanced RAG
+- Rephrase question
+### Issue 3: Responses Are Too Slow
+**Solution:**
+- Reduce top_k
+- Disable compression if you don't need it
+- Use basic RAG instead of Advanced RAG for simple queries
+---
+## Next Steps
+### Immediate (Now)
+1. ✓ The system is ready!
+2. Create your guide PDF
+3. Upload it via `/upload-pdf-multimodal`
+4. Test with real questions
+### Short Term (1-2 weeks)
+1. Collect user feedback
+2. Fine-tune parameters (top_k, threshold)
+3. Add more PDFs (FAQ, tutorials, etc.)
+4. Monitor chat history to improve content
+### Long Term (Later)
+1. **Hybrid Search with BM25**
+   - Combine dense + sparse retrieval
+   - Better for keyword queries
+2. **Cross-Encoder Reranking**
+   - Replace embedding similarity
+   - More accurate ranking
+3. **Image Processing**
+   - Download and process actual images
+   - Use Jina CLIP for image embeddings
+   - True multimodal embeddings (text + image vectors)
+4. **RAG-Anything Integration** (If needed)
+   - For complex PDFs with tables, charts
+   - Vision encoder for embedded images
+   - Advanced document understanding
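For the planned dense + sparse combination, one widely used way to merge two ranked lists is reciprocal rank fusion (RRF). The sketch below is illustrative only: nothing like it exists in the project yet, the function name is hypothetical, and `k=60` is the conventional constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list (RRF)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # e.g. order returned by vector search
sparse = ["doc_b", "doc_d", "doc_a"]   # e.g. order returned by BM25
print(reciprocal_rank_fusion([dense, sparse]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only ranks, so the raw similarity scores from Qdrant and the BM25 scores never need to be put on a common scale.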
+---
+## Comparison Matrix
+| Approach | Text | Images | URLs | Complexity | Your Case |
+|----------|------|--------|------|------------|-----------|
+| Basic RAG | ✓ | ✗ | ✗ | Low | ✗ |
+| PDF Parser | ✓ | ✗ | ✗ | Low | ✗ |
+| **Multimodal PDF** | ✓ | ✗ | ✓ | **Medium** | **✓** |
+| RAG-Anything | ✓ | ✓ | ✓ | High | Overkill |
+**Recommendation:** **Multimodal PDF** is the best fit for your case.
+---
+## Conclusion
+### What You Have
+✅ **Multiple Inputs**: Index up to 10 texts + 10 images
+✅ **Advanced RAG**: Query expansion, reranking, compression
+✅ **PDF Support**: Parse and index PDFs
+✅ **Multimodal PDF**: Extract text + image URLs, link together
+✅ **Complete Documentation**: Guides, examples, troubleshooting
+### What to Do Next
+1. **Create** a guide PDF with your own content (including image URLs)
+2. **Upload** it via `/upload-pdf-multimodal`
+3. **Test** with real questions
+4. **Iterate** - fine-tune based on feedback
+### Recommended Reading
+**For PDFs with images (your case):**
+- [MULTIMODAL_PDF_GUIDE.md](MULTIMODAL_PDF_GUIDE.md) ⭐⭐⭐
+- [PDF_RAG_GUIDE.md](PDF_RAG_GUIDE.md)
+**For Advanced RAG:**
+- [ADVANCED_RAG_GUIDE.md](ADVANCED_RAG_GUIDE.md)
+**Quick Start:**
+- [QUICK_START_PDF.md](QUICK_START_PDF.md)
+---
+**Your system is now fully set up: just upload a PDF and start chatting! 🚀📄🤖**

advanced_rag.py ADDED Viewed

	@@ -0,0 +1,301 @@

+"""
+Advanced RAG techniques for improved retrieval and generation
+Includes: Query Expansion, Multi-Query Retrieval, Reranking, Contextual Compression
+"""
+from typing import List, Dict, Optional, Tuple
+import numpy as np
+from dataclasses import dataclass
+import re
+@dataclass
+class RetrievedDocument:
+    """Document retrieved from vector database"""
+    id: str
+    text: str
+    confidence: float
+    metadata: Dict
+class AdvancedRAG:
+    """Advanced RAG system with modern techniques"""
+    def __init__(self, embedding_service, qdrant_service):
+        self.embedding_service = embedding_service
+        self.qdrant_service = qdrant_service
+    def expand_query(self, query: str) -> List[str]:
+        """
+        Expand query with related terms and variations
+        Simple rule-based expansion for Vietnamese queries
+        """
+        queries = [query]
+        # Add query variations
+        # Remove question words for alternative search
+        question_words = ['ai', 'gì', 'nào', 'đâu', 'khi nào', 'như thế nào',
+                         'tại sao', 'có', 'là', 'được', 'không']
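+        query_lower = ' '.join(query.lower().split())
+        for qw in question_words:
+            # Match whole words only, so that 'là' is not stripped out of 'làm'
+            pattern = r'(?<!\w)' + re.escape(qw) + r'(?!\w)'
+            variant = ' '.join(re.sub(pattern, ' ', query_lower).split())
+            if variant and variant != query_lower and variant not in queries:
+                queries.append(variant)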
+        # Extract key nouns/phrases (simple approach)
+        words = query.split()
+        if len(words) > 3:
+            # Take important words (skip first question word)
+            key_phrases = ' '.join(words[1:]) if words[0].lower() in question_words else ' '.join(words[:3])
+            if key_phrases not in queries:
+                queries.append(key_phrases)
+        return queries[:3]  # Return top 3 variations
+    def multi_query_retrieval(
+        self,
+        query: str,
+        top_k: int = 5,
+        score_threshold: float = 0.5
+    ) -> List[RetrievedDocument]:
+        """
+        Retrieve documents using multiple query variations
+        Combines results from all query variations
+        """
+        expanded_queries = self.expand_query(query)
+        all_results = {}  # Use dict to deduplicate by doc_id
+        for q in expanded_queries:
+            # Generate embedding for each query variant
+            query_embedding = self.embedding_service.encode_text(q)
+            # Search in Qdrant
+            results = self.qdrant_service.search(
+                query_embedding=query_embedding,
+                limit=top_k,
+                score_threshold=score_threshold
+            )
+            # Add to results (keep highest score for duplicates)
+            for result in results:
+                doc_id = result["id"]
+                if doc_id not in all_results or result["confidence"] > all_results[doc_id].confidence:
+                    all_results[doc_id] = RetrievedDocument(
+                        id=doc_id,
+                        text=result["metadata"].get("text", ""),
+                        confidence=result["confidence"],
+                        metadata=result["metadata"]
+                    )
+        # Sort by confidence and return top_k
+        sorted_results = sorted(all_results.values(), key=lambda x: x.confidence, reverse=True)
+        return sorted_results[:top_k]
+    def rerank_documents(
+        self,
+        query: str,
+        documents: List[RetrievedDocument],
+        use_cross_encoder: bool = False
+    ) -> List[RetrievedDocument]:
+        """
+        Rerank documents based on semantic similarity
+        Simple reranking using embedding similarity (can be upgraded to cross-encoder)
+        """
+        if not documents:
+            return documents
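+        # Simple reranking: recalculate similarity with original query
+        # (use_cross_encoder is reserved for a future upgrade and is currently ignored)
+        query_vec = np.asarray(self.embedding_service.encode_text(query), dtype=float).flatten()
+        reranked = []
+        for doc in documents:
+            # Get document embedding
+            doc_vec = np.asarray(self.embedding_service.encode_text(doc.text), dtype=float).flatten()
+            # Calculate cosine similarity (normalize explicitly; embeddings may not be unit-length)
+            denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
+            similarity = float(np.dot(query_vec, doc_vec) / denom) if denom > 0 else 0.0
+            # Combine with original confidence (weighted average)
+            new_score = 0.6 * similarity + 0.4 * doc.confidence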
+            reranked.append(RetrievedDocument(
+                id=doc.id,
+                text=doc.text,
+                confidence=float(new_score),
+                metadata=doc.metadata
+            ))
+        # Sort by new score
+        reranked.sort(key=lambda x: x.confidence, reverse=True)
+        return reranked
+    def compress_context(
+        self,
+        query: str,
+        documents: List[RetrievedDocument],
+        max_tokens: int = 500
+    ) -> List[RetrievedDocument]:
+        """
+        Compress context to most relevant parts
+        Keeps only sentences that share at least one word with the query;
+        max_tokens is a per-document budget, approximated by whitespace-separated words
+        """
+        compressed_docs = []
+        for doc in documents:
+            # Split into sentences
+            sentences = self._split_sentences(doc.text)
+            # Score each sentence based on relevance to query
+            scored_sentences = []
+            query_words = set(query.lower().split())
+            for sent in sentences:
+                sent_words = set(sent.lower().split())
+                # Simple relevance: word overlap
+                overlap = len(query_words & sent_words)
+                if overlap > 0:
+                    scored_sentences.append((sent, overlap))
+            # Sort by relevance and take top sentences
+            scored_sentences.sort(key=lambda x: x[1], reverse=True)
+            # Reconstruct compressed text (up to max_tokens)
+            compressed_text = ""
+            word_count = 0
+            for sent, score in scored_sentences:
+                sent_words = len(sent.split())
+                if word_count + sent_words <= max_tokens:
+                    compressed_text += sent + " "
+                    word_count += sent_words
+                else:
+                    break
+            # If nothing selected, take original first part
+            if not compressed_text.strip():
+                compressed_text = doc.text[:max_tokens * 5]  # Rough estimate
+            compressed_docs.append(RetrievedDocument(
+                id=doc.id,
+                text=compressed_text.strip(),
+                confidence=doc.confidence,
+                metadata=doc.metadata
+            ))
+        return compressed_docs
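+    def _split_sentences(self, text: str) -> List[str]:
+        """Split text into sentences at ., ! or ? followed by whitespace or end of text"""
+        # Simple sentence splitter; the lookahead keeps decimals ("3.5") and
+        # URLs ("example.com/img.png") from being cut at inner dots
+        sentences = re.split(r'[.!?]+(?=\s|$)', text)
+        return [s.strip() for s in sentences if s.strip()]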
+    def hybrid_rag_pipeline(
+        self,
+        query: str,
+        top_k: int = 5,
+        score_threshold: float = 0.5,
+        use_reranking: bool = True,
+        use_compression: bool = True,
+        max_context_tokens: int = 500
+    ) -> Tuple[List[RetrievedDocument], Dict]:
+        """
+        Complete advanced RAG pipeline
+        1. Multi-query retrieval
+        2. Reranking
+        3. Contextual compression
+        """
+        stats = {
+            "original_query": query,
+            "expanded_queries": [],
+            "initial_results": 0,
+            "after_rerank": 0,
+            "after_compression": 0
+        }
+        # Step 1: Multi-query retrieval
+        expanded_queries = self.expand_query(query)
+        stats["expanded_queries"] = expanded_queries
+        documents = self.multi_query_retrieval(
+            query=query,
+            top_k=top_k * 2,  # Get more candidates for reranking
+            score_threshold=score_threshold
+        )
+        stats["initial_results"] = len(documents)
+        # Step 2: Reranking (optional)
+        if use_reranking and documents:
+            documents = self.rerank_documents(query, documents)
+            documents = documents[:top_k]  # Keep top_k after reranking
+        stats["after_rerank"] = len(documents)
+        # Step 3: Contextual compression (optional)
+        if use_compression and documents:
+            documents = self.compress_context(
+                query=query,
+                documents=documents,
+                max_tokens=max_context_tokens
+            )
+        stats["after_compression"] = len(documents)
+        return documents, stats
+    def format_context_for_llm(
+        self,
+        documents: List[RetrievedDocument],
+        include_metadata: bool = True
+    ) -> str:
+        """
+        Format retrieved documents into context string for LLM
+        Uses better structure for improved LLM understanding
+        """
+        if not documents:
+            return ""
+        context_parts = ["RELEVANT CONTEXT:\n"]
+        for i, doc in enumerate(documents, 1):
+            context_parts.append(f"\n--- Document {i} (Relevance: {doc.confidence:.2%}) ---")
+            context_parts.append(doc.text)
+            if include_metadata and doc.metadata:
+                # Add useful metadata
+                meta_str = []
+                for key, value in doc.metadata.items():
+                    if key not in ['text', 'texts'] and value:
+                        meta_str.append(f"{key}: {value}")
+                if meta_str:
+                    context_parts.append(f"[Metadata: {', '.join(meta_str)}]")
+        context_parts.append("\n--- End of Context ---\n")
+        return "\n".join(context_parts)
+    def build_rag_prompt(
+        self,
+        query: str,
+        context: str,
+        system_message: str = "You are a helpful AI assistant."
+    ) -> str:
+        """
+        Build optimized RAG prompt for LLM
+        Uses best practices for prompt engineering
+        """
+        prompt_template = f"""{system_message}
+{context}
+INSTRUCTIONS:
+1. Answer the user's question using ONLY the information provided in the context above
+2. If the context doesn't contain relevant information, say "Tôi không tìm thấy thông tin liên quan trong dữ liệu."
+3. Cite relevant parts of the context when answering
+4. Be concise and accurate
+5. Answer in Vietnamese if the question is in Vietnamese
+USER QUESTION: {query}
+YOUR ANSWER:"""
+        return prompt_template

batch_index_pdfs.py ADDED Viewed

	@@ -0,0 +1,151 @@

+"""
+Batch script to index PDF files into RAG knowledge base
+Usage: python batch_index_pdfs.py <pdf_directory> [options]
+"""
+import os
+import sys
+from pathlib import Path
+from pymongo import MongoClient
+from embedding_service import JinaClipEmbeddingService
+from qdrant_service import QdrantVectorService
+from pdf_parser import PDFIndexer
+def index_pdf_directory(
+    pdf_dir: str,
+    category: str = "user_guide",
+    force: bool = False
+):
+    """
+    Index all PDF files in a directory
+    Args:
+        pdf_dir: Directory containing PDF files
+        category: Category for the PDFs (default: "user_guide")
+        force: Force reindex even if already indexed (default: False)
+    """
+    print("="*60)
+    print("PDF Batch Indexer")
+    print("="*60)
+    # Initialize services (same as main.py)
+    print("\n[1/5] Initializing services...")
+    embedding_service = JinaClipEmbeddingService(model_path="jinaai/jina-clip-v2")
+    collection_name = os.getenv("COLLECTION_NAME", "event_social_media")
+    qdrant_service = QdrantVectorService(
+        collection_name=collection_name,
+        vector_size=embedding_service.get_embedding_dimension()
+    )
+    # MongoDB: read the connection string from the environment; never commit credentials.
+    # The localhost fallback is an assumed default for local development.
+    mongodb_uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
+    mongo_client = MongoClient(mongodb_uri)
+    db = mongo_client[os.getenv("MONGODB_DB_NAME", "chatbot_rag")]
+    documents_collection = db["documents"]
+    # Initialize PDF indexer
+    pdf_indexer = PDFIndexer(
+        embedding_service=embedding_service,
+        qdrant_service=qdrant_service,
+        documents_collection=documents_collection
+    )
+    print("✓ Services initialized")
+    # Find all PDF files
+    print(f"\n[2/5] Scanning directory: {pdf_dir}")
+    pdf_files = list(Path(pdf_dir).glob("*.pdf"))
+    if not pdf_files:
+        print("✗ No PDF files found in directory")
+        return
+    print(f"✓ Found {len(pdf_files)} PDF file(s)")
+    # Index each PDF
+    print(f"\n[3/5] Indexing PDFs...")
+    indexed_count = 0
+    skipped_count = 0
+    error_count = 0
+    for i, pdf_path in enumerate(pdf_files, 1):
+        print(f"\n--- [{i}/{len(pdf_files)}] Processing: {pdf_path.name} ---")
+        # Generate document ID
+        doc_id = f"pdf_{pdf_path.stem}"
+        # Check if already indexed
+        if not force:
+            existing = documents_collection.find_one({"document_id": doc_id})
+            if existing:
+                print(f"⊘ Already indexed (use --force to reindex)")
+                skipped_count += 1
+                continue
+        try:
+            # Index PDF
+            metadata = {
+                'title': pdf_path.stem.replace('_', ' ').title(),
+                'category': category,
+                'source_file': str(pdf_path)
+            }
+            result = pdf_indexer.index_pdf(
+                pdf_path=str(pdf_path),
+                document_id=doc_id,
+                document_metadata=metadata
+            )
+            print(f"✓ Indexed: {result['chunks_indexed']} chunks")
+            indexed_count += 1
+        except Exception as e:
+            print(f"✗ Error: {str(e)}")
+            error_count += 1
+    # Summary
+    print("\n" + "="*60)
+    print("SUMMARY")
+    print("="*60)
+    print(f"Total PDFs found: {len(pdf_files)}")
+    print(f"✓ Successfully indexed: {indexed_count}")
+    print(f"⊘ Skipped (already indexed): {skipped_count}")
+    print(f"✗ Errors: {error_count}")
+    if indexed_count > 0:
+        print(f"\n✓ Knowledge base updated successfully!")
+        print(f"You can now chat with your chatbot about the content in these PDFs.")
+def main():
+    """Main entry point"""
+    if len(sys.argv) < 2:
+        print("Usage: python batch_index_pdfs.py <pdf_directory> [--category=<category>] [--force]")
+        print("\nExample:")
+        print("  python batch_index_pdfs.py ./docs/guides")
+        print("  python batch_index_pdfs.py ./docs/guides --category=user_guide --force")
+        sys.exit(1)
+    pdf_dir = sys.argv[1]
+    if not os.path.isdir(pdf_dir):
+        print(f"Error: Directory not found: {pdf_dir}")
+        sys.exit(1)
+    # Parse options
+    category = "user_guide"
+    force = False
+    for arg in sys.argv[2:]:
+        if arg.startswith("--category="):
+            category = arg.split("=", 1)[1]  # Split once so values containing '=' survive
+        elif arg == "--force":
+            force = True
+    # Index PDFs
+    index_pdf_directory(pdf_dir, category=category, force=force)
+if __name__ == "__main__":
+    main()

chatbot_guide_template.md ADDED Viewed

	@@ -0,0 +1,369 @@

+# ChatbotRAG User Guide
+*Version 2.0 - October 2025*
+---
+## 1. Introduction
+### What is ChatbotRAG?
+ChatbotRAG is a chatbot system that uses RAG (Retrieval-Augmented Generation) to answer questions from your own knowledge base.
+### Main features
+- **Multimodal Search**: Search by text and by image
+- **Advanced RAG**: Query expansion, reranking, context compression
+- **PDF Support**: Upload PDFs and chat about their content
+- **Multiple Inputs**: Index multiple texts and images at once (up to 10 of each)
+- **Chat History**: Chat history is saved for later review
+---
+## 2. Quick Start
+### Step 1: Start the server
+```bash
+cd ChatbotRAG
+python main.py
+```
+The server runs at: `http://localhost:8000`
+### Step 2: Open the API Documentation
+Open a browser and go to:
+- API Docs: `http://localhost:8000/docs`
+- ReDoc: `http://localhost:8000/redoc`
+### Step 3: Test with a simple question
+```bash
+curl -X POST "http://localhost:8000/chat" \
+  -H "Content-Type: application/json" \
+  -d '{"message": "Xin chào, bạn là ai?"}'
+```
+---
+## 3. Indexing Data
+### 3.1. Index Plain Text
+```bash
+curl -X POST "http://localhost:8000/index" \
+  -F "id=doc1" \
+  -F "texts=This is text content 1" \
+  -F "texts=This is text content 2"
+```
+### 3.2. Index With Images
+```bash
+curl -X POST "http://localhost:8000/index" \
+  -F "id=event123" \
+  -F "texts=Sự kiện âm nhạc tại Hà Nội" \
+  -F "images=@poster1.jpg" \
+  -F "images=@poster2.jpg"
+```
+**Note**: Each request accepts at most 10 texts and 10 images.
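
Review note: a client that holds more than 10 texts or images has to split them across several `/index` requests. A minimal sketch of that splitting step; `batched` and `MAX_ITEMS_PER_REQUEST` are illustrative names, and how each batch maps to a document `id` is an open choice that this guide does not specify:

```python
from typing import Iterator, List, Sequence

MAX_ITEMS_PER_REQUEST = 10  # limit stated in the guide for /index


def batched(items: Sequence[str], size: int = MAX_ITEMS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


texts = [f"Paragraph {i}" for i in range(25)]
batches = list(batched(texts))
print([len(batch) for batch in batches])  # [10, 10, 5]
```

Each batch could then be sent as its own request, for example with a suffixed ID such as `doc1-0`, `doc1-1` (a hypothetical naming scheme).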
+### 3.3. Upload a PDF
+To upload a PDF document to the system:
+```bash
+curl -X POST "http://localhost:8000/upload-pdf" \
+  -F "file=@user_guide.pdf" \
+  -F "title=User Guide" \
+  -F "category=user_guide"
+```
+After the upload, the chatbot can answer questions about the PDF's content.
+---
+## 4. Searching Data
+### 4.1. Search by Text
+```bash
+curl -X POST "http://localhost:8000/search/text" \
+  -F "text=sự kiện âm nhạc" \
+  -F "limit=5"
+```
+### 4.2. Search by Image
+```bash
+curl -X POST "http://localhost:8000/search/image" \
+  -F "image=@query_image.jpg" \
+  -F "limit=5"
+```
+### 4.3. Hybrid Search (Text + Image)
+```bash
+curl -X POST "http://localhost:8000/search" \
+  -F "text=festival music" \
+  -F "image=@query.jpg" \
+  -F "text_weight=0.6" \
+  -F "image_weight=0.4"
+```
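
Review note: the diff does not show how `/search` combines the text and image similarities. One common approach, given here purely as an assumption about what `text_weight` and `image_weight` might mean, is a normalized weighted sum of the two scores. The function name `fuse_scores` is hypothetical:

```python
def fuse_scores(text_score: float, image_score: float,
                text_weight: float = 0.6, image_weight: float = 0.4) -> float:
    """Weighted average of two similarity scores (assumed fusion rule)."""
    total = text_weight + image_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (text_weight * text_score + image_weight * image_score) / total


# A document that matches the text well (0.8) and the image moderately (0.5)
print(round(fuse_scores(0.8, 0.5), 2))  # 0.68
```

Dividing by the weight sum keeps the result in the same range as the inputs even when the two weights do not add up to 1.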
+---
+## 5. Chatting With the Chatbot
+### 5.1. Basic Chat (No RAG)
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Xin chào!',
+    'use_rag': False,
+    'hf_token': 'your_huggingface_token'
+})
+print(response.json()['response'])
+```
+### 5.2. Chat With RAG (Recommended)
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Festival âm nhạc diễn ra khi nào?',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    'top_k': 5,
+    'hf_token': 'your_token'
+})
+result = response.json()
+print("Answer:", result['response'])
+print("Sources:", result['context_used'])
+```
+### 5.3. Advanced RAG Options
+```python
+import requests
+response = requests.post('http://localhost:8000/chat', json={
+    'message': 'Your question',
+    'use_rag': True,
+    'use_advanced_rag': True,
+    # Advanced RAG settings
+    'use_query_expansion': True,    # Expand the question into variants
+    'use_reranking': True,          # Rerank the retrieved results
+    'use_compression': True,        # Compress the context
+    'score_threshold': 0.5,         # Relevance threshold (0-1)
+    'top_k': 5,                     # Number of documents to retrieve
+    # LLM settings
+    'max_tokens': 512,
+    'temperature': 0.7,
+    'hf_token': 'your_token'
+})
+```
+---
+## 6. Managing Documents
+### 6.1. List Documents
+```bash
+# View collection stats
+curl http://localhost:8000/stats
+# List PDFs
+curl http://localhost:8000/documents/pdf
+```
+### 6.2. Get Document By ID
+```bash
+curl http://localhost:8000/document/doc123
+```
+### 6.3. Delete a Document
+```bash
+curl -X DELETE http://localhost:8000/delete/doc123
+```
+### 6.4. Delete a PDF Document
+```bash
+curl -X DELETE http://localhost:8000/documents/pdf/pdf_20251029_143022
+```
+---
+## 7. Frequently Asked Questions (FAQ)
+### Q1: How do I upload a PDF to the system?
+**A:** Use the `/upload-pdf` endpoint:
+```bash
+curl -X POST "http://localhost:8000/upload-pdf" \
+  -F "file=@your_file.pdf" \
+  -F "title=Document title"
+```
+### Q2: The chatbot cannot find relevant information?
+**A:** Try the following:
+1. Lower `score_threshold` (0.3 - 0.5)
+2. Raise `top_k` (5-10)
+3. Set `use_advanced_rag=True`
+4. Rephrase the question more clearly
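
Review note: the first tip can be automated on the client side by retrying with progressively lower thresholds until the response reports some context. A minimal sketch, assuming the `/chat` response carries a `context_used` field as in section 5.2; `ask_with_fallback` and the injected `ask` callable are hypothetical names, and the HTTP call is passed in so the logic can be tried without a running server:

```python
from typing import Callable, Dict, Iterable, Optional


def ask_with_fallback(ask: Callable[[Dict], Dict], message: str,
                      thresholds: Iterable[float] = (0.5, 0.4, 0.3),
                      top_k: int = 5) -> Optional[Dict]:
    """Call `ask` with decreasing score_threshold until context is returned."""
    for threshold in thresholds:
        payload = {
            "message": message,
            "use_rag": True,
            "use_advanced_rag": True,
            "top_k": top_k,
            "score_threshold": threshold,
        }
        result = ask(payload)
        if result.get("context_used"):  # non-empty context means retrieval succeeded
            return result
    return None  # nothing relevant even at the lowest threshold


# Against a live server, `ask` could be:
#   lambda payload: requests.post("http://localhost:8000/chat", json=payload).json()
```

Returning `None` instead of the last empty answer lets the caller decide whether to fall back to a non-RAG reply.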
+### Q3: How can I improve the chatbot's accuracy?
+**A:**
+- Enable Advanced RAG: `use_advanced_rag=True`
+- Enable all RAG features: `use_reranking=True`, `use_compression=True`
+- Index more documents with detailed content
+- Attach appropriate metadata when indexing
+### Q4: What is the LLM's token limit?
+**A:** The default is `max_tokens=512`. You can raise it in the request:
+```python
+{
+    'message': 'Your question',
+    'max_tokens': 1024,  # Raised from the default of 512
+    'hf_token': 'your_token'
+}
+```
+### Q5: How many texts/images can I upload at once?
+**A:** At most **10 texts** and **10 images** per request to the `/index` endpoint.
+### Q6: Does the chatbot support Vietnamese?
+**A:** Yes! The system uses Jina CLIP v2, which is multilingual and covers Vietnamese.
+### Q7: How do I view the chat history?
+**A:**
+```bash
+curl "http://localhost:8000/history?limit=10&skip=0"
+```
+### Q8: My PDF contains many images. Is that a problem?
+**A:** The system currently extracts only text from PDFs; images embedded in a PDF are not processed yet. If you need them processed, RAG-Anything could be integrated later.
+---
+## 8. API Reference
+### Main Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Health check & API docs |
+| `/index` | POST | Index texts + images (up to 10 of each) |
+| `/search` | POST | Hybrid search (text + image) |
+| `/search/text` | POST | Text-only search |
+| `/search/image` | POST | Image-only search |
+| `/chat` | POST | Chat with RAG |
+| `/documents` | POST | Add text document |
+| `/document/{id}` | GET | Get a document by ID |
+| `/delete/{id}` | DELETE | Delete a document |
+| `/upload-pdf` | POST | Upload and index a PDF |
+| `/documents/pdf` | GET | List PDFs |
+| `/documents/pdf/{id}` | DELETE | Delete PDF |
+| `/history` | GET | Get chat history |
+| `/stats` | GET | Collection statistics |
+### Request Examples
+**Index with multiple texts:**
+```json
+POST /index
+{
+  "id": "doc123",
+  "texts": ["Text 1", "Text 2", "Text 3"]
+}
+```
+**Chat with Advanced RAG:**
+```json
+POST /chat
+{
+  "message": "Your question",
+  "use_rag": true,
+  "use_advanced_rag": true,
+  "use_reranking": true,
+  "top_k": 5,
+  "score_threshold": 0.5,
+  "hf_token": "hf_xxxxx"
+}
+```
+---
+## 9. Best Practices
+### Indexing Data
+✓ Split content into meaningful chunks
+✓ Add complete metadata (title, category, source)
+✓ Use the texts array for multiple paragraphs
+✗ Avoid putting overly long text into a single chunk
+### Chat
+✓ Enable Advanced RAG for complex questions
+✓ Tune `top_k` and `score_threshold` to your data
+✓ Use a low `temperature` (0.3-0.5) for factual answers
+✗ Avoid setting `score_threshold` too high (>0.8)
+### PDF
+✓ PDFs with a text layer (not scanned images)
+✓ Clear structure with headings and paragraphs
+✓ Concise, easy-to-understand content
+✗ Avoid PDFs with many complex images
+---
+## 10. Troubleshooting
+### Server does not start
+- Check dependencies: `pip install -r requirements.txt`
+- Check the MongoDB connection string
+- Check the Qdrant service
+### PDF upload fails
+- Verify that the file is a valid PDF
+- Check that the file is not corrupted
+- Try re-exporting the PDF if needed
+### Chatbot gives wrong answers
+- Check whether documents have been indexed: `/stats`
+- Try lowering `score_threshold`
+- Enable the Advanced RAG options
+- Check the LLM token (Hugging Face)
+### Out of memory
+- Reduce `chunk_size` in the PDF parser
+- Reduce `top_k` in the chat request
+- Index fewer documents per batch
+---
+## 11. Contact & Support
+If you have questions or run into problems:
+- Check the server logs
+- Review the API documentation at `/docs`
+- See the GitHub issues
+---
+**Happy Chatting! 🤖**

multimodal_pdf_parser.py ADDED

	@@ -0,0 +1,390 @@

+"""
+Enhanced Multimodal PDF Parser for PDFs with Text + Image URLs
+Extracts text, detects image URLs, and links them together
+"""
+import pypdfium2 as pdfium
+from typing import List, Dict, Optional, Tuple
+import re
+from dataclasses import dataclass, field
+@dataclass
+class MultimodalChunk:
+    """Represents a chunk with text and associated images"""
+    text: str
+    page_number: int
+    chunk_index: int
+    image_urls: List[str] = field(default_factory=list)
+    metadata: Dict = field(default_factory=dict)
+class MultimodalPDFParser:
+    """
+    Enhanced PDF Parser that extracts text and image URLs
+    Perfect for user guides with screenshots and visual instructions
+    """
+    def __init__(
+        self,
+        chunk_size: int = 500,
+        chunk_overlap: int = 50,
+        min_chunk_size: int = 50,
+        extract_images: bool = True
+    ):
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+        self.min_chunk_size = min_chunk_size
+        self.extract_images = extract_images
+        # URL patterns
+        self.url_patterns = [
+            # Standard URLs
+            r'https?://[^\s<>"{}|\\^`\[\]]+',
+            # Markdown images: ![alt](url)
+            r'!\[.*?\]\((https?://[^\s)]+)\)',
+            # HTML images: <img src="url">
+            r'<img[^>]+src=["\']([^"\']+)["\']',
+            # Direct image extensions
+            r'https?://[^\s<>"{}|\\^`\[\]]+\.(?:jpg|jpeg|png|gif|bmp|svg|webp)',
+        ]
+    def extract_image_urls(self, text: str) -> List[str]:
+        """
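+        Extract candidate image URLs from text.
+        Note: the first pattern matches any http(s) URL, so links that do
+        not point to images are returned as well.
+        Args:
+            text: Text content
+        Returns:
+            De-duplicated list of URLs in first-seen order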
+        """
+        urls = []
+        for pattern in self.url_patterns:
+            matches = re.findall(pattern, text, re.IGNORECASE)
+            urls.extend(matches)
+        # Remove duplicates while preserving order (dicts keep insertion order)
+        return list(dict.fromkeys(urls))
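
Review note: because the first entry in `url_patterns` matches every http(s) URL, `extract_image_urls` also returns ordinary links, and a Markdown image such as `![alt](https://.../a.png)` is captured twice, once with the trailing `)` and once without. A minimal sketch of a stricter filter, assuming that only URLs whose path ends in a known image extension should count; `image_urls_only` and `IMAGE_EXTENSIONS` are hypothetical names, not part of this commit:

```python
import re
from typing import List
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp")
URL_RE = re.compile(r'https?://[^\s<>"{}|\\^`\[\]]+')


def image_urls_only(text: str) -> List[str]:
    """Return unique URLs whose path ends in an image extension, in order."""
    seen = set()
    result = []
    for raw in URL_RE.findall(text):
        url = raw.rstrip(".,;:)")  # drop trailing sentence or Markdown punctuation
        path = urlparse(url).path.lower()  # ignore the query string and letter case
        if path.endswith(IMAGE_EXTENSIONS) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample = ("See https://example.com/docs and ![shot](https://example.com/a.png). "
          "Also https://example.com/b.JPG?size=2")
print(image_urls_only(sample))
# ['https://example.com/a.png', 'https://example.com/b.JPG?size=2']
```

Image URLs without a file extension (for example CDN links) would be missed by this filter, so it trades recall for precision.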
+    def extract_text_from_pdf(self, pdf_path: str) -> Dict[int, Tuple[str, List[str]]]:
+        """
+        Extract text and image URLs from PDF
+        Args:
+            pdf_path: Path to PDF file
+        Returns:
+            Dictionary mapping page number to (text, image_urls) tuple
+        """
+        pdf_pages = {}
+        try:
+            pdf = pdfium.PdfDocument(pdf_path)
+            for page_num in range(len(pdf)):
+                page = pdf[page_num]
+                textpage = page.get_textpage()
+                text = textpage.get_text_range()
+                # Clean text
+                text = self._clean_text(text)
+                # Extract image URLs if enabled
+                image_urls = []
+                if self.extract_images:
+                    image_urls = self.extract_image_urls(text)
+                pdf_pages[page_num + 1] = (text, image_urls)
+            return pdf_pages
+        except Exception as e:
+            # Chain the original error so the underlying traceback is kept
+            raise Exception(f"Error reading PDF: {e}") from e
+    def _clean_text(self, text: str) -> str:
+        """Clean extracted text"""
+        # Remove excessive whitespace
+        text = re.sub(r'\s+', ' ', text)
+        # Remove special characters
+        text = text.replace('\x00', '')
+        return text.strip()
+    def chunk_text_with_images(
+        self,
+        text: str,
+        image_urls: List[str],
+        page_number: int
+    ) -> List[MultimodalChunk]:
+        """
+        Split text into chunks and associate images with relevant chunks
+        Args:
+            text: Text to chunk
+            image_urls: Image URLs from the page
+            page_number: Page number
+        Returns:
+            List of MultimodalChunk objects
+        """
+        # Split into words
+        words = text.split()
+        if len(words) < self.min_chunk_size:
+            if len(words) > 0:
+                return [MultimodalChunk(
+                    text=text,
+                    page_number=page_number,
+                    chunk_index=0,
+                    image_urls=image_urls,  # All images go to single chunk
+                    metadata={'page': page_number, 'chunk': 0}
+                )]
+            return []
+        chunks = []
+        chunk_index = 0
+        start = 0
+        # Calculate how to distribute images across chunks
+        images_per_chunk = len(image_urls) // max(1, len(words) // self.chunk_size) if image_urls else 0
+        image_index = 0
+        while start < len(words):
+            end = min(start + self.chunk_size, len(words))
+            chunk_words = words[start:end]
+            chunk_text = ' '.join(chunk_words)
+            # Assign images to this chunk
+            chunk_images = []
+            if image_urls:
+                # Simple strategy: first attach URLs that literally appear
+                # in this chunk's text
+                for url in image_urls:
+                    if url in chunk_text:
+                        chunk_images.append(url)
+                # If no URLs found in text, distribute evenly
+                if not chunk_images and image_index < len(image_urls):
+                    # Assign remaining images to chunks
+                    num_imgs = min(images_per_chunk + 1, len(image_urls) - image_index)
+                    chunk_images = image_urls[image_index:image_index + num_imgs]
+                    image_index += num_imgs
+            chunks.append(MultimodalChunk(
+                text=chunk_text,
+                page_number=page_number,
+                chunk_index=chunk_index,
+                image_urls=chunk_images,
+                metadata={
+                    'page': page_number,
+                    'chunk': chunk_index,
+                    'start_word': start,
+                    'end_word': end,
+                    'has_images': len(chunk_images) > 0,
+                    'num_images': len(chunk_images)
+                }
+            ))
+            chunk_index += 1
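+            # Stop once this chunk reaches the last word so that no trailing
+            # words are skipped and the loop always terminates
+            if end >= len(words):
+                break
+            # Step back by the overlap, but always advance by at least one word
+            start = max(end - self.chunk_overlap, start + 1)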
+        return chunks
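
Review note: in the method above, a chunk that mentions no URL falls back to taking the next entries from the page-level list. Since those URLs were extracted from the same page text, they normally also occur in some other chunk, so an image can end up attached to two chunks and `num_images` can be inflated. A simpler alternative, sketched here as an assumption about the desired behaviour, attaches a URL only to the chunks whose text contains it; `assign_images_by_mention` is a hypothetical helper:

```python
from typing import Dict, List


def assign_images_by_mention(chunk_texts: List[str], image_urls: List[str]) -> Dict[int, List[str]]:
    """Map chunk index -> URLs that literally occur in that chunk's text."""
    return {
        index: [url for url in image_urls if url in text]
        for index, text in enumerate(chunk_texts)
    }


chunks = [
    "Open the menu https://example.com/menu.png and click Save",
    "Nothing to see in this part",
]
urls = ["https://example.com/menu.png"]
print(assign_images_by_mention(chunks, urls))
# {0: ['https://example.com/menu.png'], 1: []}
```

A URL that falls inside the overlap region still appears in both neighbouring chunks, which seems reasonable because both chunks contain the surrounding text.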
+    def parse_pdf(
+        self,
+        pdf_path: str,
+        document_metadata: Optional[Dict] = None
+    ) -> List[MultimodalChunk]:
+        """
+        Parse PDF into multimodal chunks
+        Args:
+            pdf_path: Path to PDF file
+            document_metadata: Additional metadata
+        Returns:
+            List of MultimodalChunk objects
+        """
+        pages_data = self.extract_text_from_pdf(pdf_path)
+        all_chunks = []
+        for page_num, (text, image_urls) in pages_data.items():
+            chunks = self.chunk_text_with_images(text, image_urls, page_num)
+            # Add document metadata
+            if document_metadata:
+                for chunk in chunks:
+                    chunk.metadata.update(document_metadata)
+            all_chunks.extend(chunks)
+        return all_chunks
+    def parse_pdf_bytes(
+        self,
+        pdf_bytes: bytes,
+        document_metadata: Optional[Dict] = None
+    ) -> List[MultimodalChunk]:
+        """Parse PDF from bytes"""
+        import tempfile
+        import os
+        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+            tmp.write(pdf_bytes)
+            tmp_path = tmp.name
+        try:
+            chunks = self.parse_pdf(tmp_path, document_metadata)
+            return chunks
+        finally:
+            if os.path.exists(tmp_path):
+                os.unlink(tmp_path)
+class MultimodalPDFIndexer:
+    """Index multimodal PDF chunks into RAG system"""
+    def __init__(self, embedding_service, qdrant_service, documents_collection):
+        self.embedding_service = embedding_service
+        self.qdrant_service = qdrant_service
+        self.documents_collection = documents_collection
+        self.parser = MultimodalPDFParser()
+    def index_pdf(
+        self,
+        pdf_path: str,
+        document_id: str,
+        document_metadata: Optional[Dict] = None
+    ) -> Dict:
+        """Index PDF with image URLs"""
+        chunks = self.parser.parse_pdf(pdf_path, document_metadata)
+        indexed_count = 0
+        chunk_ids = []
+        total_images = 0
+        for chunk in chunks:
+            chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+            # Generate embedding (text-based)
+            embedding = self.embedding_service.encode_text(chunk.text)
+            # Prepare metadata with image URLs
+            metadata = {
+                'text': chunk.text,
+                'document_id': document_id,
+                'page': chunk.page_number,
+                'chunk_index': chunk.chunk_index,
+                'source': 'pdf',
+                'has_images': len(chunk.image_urls) > 0,
+                'image_urls': chunk.image_urls,  # Store image URLs!
+                'num_images': len(chunk.image_urls),
+                **chunk.metadata
+            }
+            # Index to Qdrant
+            self.qdrant_service.index_data(
+                doc_id=chunk_id,
+                embedding=embedding,
+                metadata=metadata
+            )
+            chunk_ids.append(chunk_id)
+            indexed_count += 1
+            total_images += len(chunk.image_urls)
+        # Save document info
+        doc_info = {
+            'document_id': document_id,
+            'type': 'multimodal_pdf',
+            'file_path': pdf_path,
+            'num_chunks': indexed_count,
+            'total_images': total_images,
+            'chunk_ids': chunk_ids,
+            'metadata': document_metadata or {}
+        }
+        self.documents_collection.insert_one(doc_info)
+        return {
+            'success': True,
+            'document_id': document_id,
+            'chunks_indexed': indexed_count,
+            'images_found': total_images,
+            'chunk_ids': chunk_ids[:5]
+        }
+    def index_pdf_bytes(
+        self,
+        pdf_bytes: bytes,
+        document_id: str,
+        filename: str,
+        document_metadata: Optional[Dict] = None
+    ) -> Dict:
+        """Index PDF from bytes"""
+        # Copy so that adding the filename does not mutate the caller's dict
+        metadata = dict(document_metadata or {})
+        metadata['filename'] = filename
+        chunks = self.parser.parse_pdf_bytes(pdf_bytes, metadata)
+        indexed_count = 0
+        chunk_ids = []
+        total_images = 0
+        for chunk in chunks:
+            chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+            embedding = self.embedding_service.encode_text(chunk.text)
+            # Use a separate name so the document-level `metadata` above is
+            # not overwritten by the last chunk's payload
+            chunk_metadata = {
+                'text': chunk.text,
+                'document_id': document_id,
+                'page': chunk.page_number,
+                'chunk_index': chunk.chunk_index,
+                'source': 'multimodal_pdf',
+                'filename': filename,
+                'has_images': len(chunk.image_urls) > 0,
+                'image_urls': chunk.image_urls,
+                'num_images': len(chunk.image_urls),
+                **chunk.metadata
+            }
+            self.qdrant_service.index_data(
+                doc_id=chunk_id,
+                embedding=embedding,
+                metadata=chunk_metadata
+            )
+            chunk_ids.append(chunk_id)
+            indexed_count += 1
+            total_images += len(chunk.image_urls)
+        doc_info = {
+            'document_id': document_id,
+            'type': 'multimodal_pdf',
+            'filename': filename,
+            'num_chunks': indexed_count,
+            'total_images': total_images,
+            'chunk_ids': chunk_ids,
+            'metadata': metadata
+        }
+        self.documents_collection.insert_one(doc_info)
+        return {
+            'success': True,
+            'document_id': document_id,
+            'filename': filename,
+            'chunks_indexed': indexed_count,
+            'images_found': total_images,
+            'chunk_ids': chunk_ids[:5]
+        }

pdf_parser.py ADDED

	@@ -0,0 +1,371 @@

+"""
+PDF Parser Service for RAG Chatbot
+Extracts text from PDF and splits into chunks for indexing
+"""
+import pypdfium2 as pdfium
+from typing import List, Dict, Optional
+import re
+from dataclasses import dataclass
+@dataclass
+class PDFChunk:
+    """Represents a chunk of text from PDF"""
+    text: str
+    page_number: int
+    chunk_index: int
+    metadata: Dict
+class PDFParser:
+    """Parse PDF files and prepare for RAG indexing"""
+    def __init__(
+        self,
+        chunk_size: int = 500,  # words per chunk
+        chunk_overlap: int = 50,  # words overlap between chunks
+        min_chunk_size: int = 50  # minimum words in a chunk
+    ):
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+        self.min_chunk_size = min_chunk_size
+    def extract_text_from_pdf(self, pdf_path: str) -> Dict[int, str]:
+        """
+        Extract text from PDF file
+        Args:
+            pdf_path: Path to PDF file
+        Returns:
+            Dictionary mapping page number to text content
+        """
+        pdf_text = {}
+        try:
+            pdf = pdfium.PdfDocument(pdf_path)
+            for page_num in range(len(pdf)):
+                page = pdf[page_num]
+                textpage = page.get_textpage()
+                text = textpage.get_text_range()
+                # Clean text
+                text = self._clean_text(text)
+                pdf_text[page_num + 1] = text  # 1-indexed pages
+            return pdf_text
+        except Exception as e:
+            # Chain the original error so the underlying traceback is kept
+            raise Exception(f"Error reading PDF: {e}") from e
+    def _clean_text(self, text: str) -> str:
+        """Clean extracted text"""
+        # Remove excessive whitespace
+        text = re.sub(r'\s+', ' ', text)
+        # Remove special characters that might cause issues
+        text = text.replace('\x00', '')
+        return text.strip()
+    def chunk_text(self, text: str, page_number: int) -> List[PDFChunk]:
+        """
+        Split text into overlapping chunks
+        Args:
+            text: Text to chunk
+            page_number: Page number this text came from
+        Returns:
+            List of PDFChunk objects
+        """
+        # Split into words
+        words = text.split()
+        if len(words) < self.min_chunk_size:
+            # Text too short, return as single chunk
+            if len(words) > 0:
+                return [PDFChunk(
+                    text=text,
+                    page_number=page_number,
+                    chunk_index=0,
+                    metadata={'page': page_number, 'chunk': 0}
+                )]
+            return []
+        chunks = []
+        chunk_index = 0
+        start = 0
+        while start < len(words):
+            # Get chunk
+            end = min(start + self.chunk_size, len(words))
+            chunk_words = words[start:end]
+            chunk_text = ' '.join(chunk_words)
+            chunks.append(PDFChunk(
+                text=chunk_text,
+                page_number=page_number,
+                chunk_index=chunk_index,
+                metadata={
+                    'page': page_number,
+                    'chunk': chunk_index,
+                    'start_word': start,
+                    'end_word': end
+                }
+            ))
+            chunk_index += 1
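+            # Stop once this chunk reaches the last word so that no trailing
+            # words are skipped and the loop always terminates
+            if end >= len(words):
+                break
+            # Step back by the overlap, but always advance by at least one word
+            start = max(end - self.chunk_overlap, start + 1)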
+        return chunks
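
Review note: the sliding window in `chunk_text` is easier to reason about when only the word offsets are computed. The sketch below is a standalone illustration, not the method itself: `chunk_bounds` is a hypothetical helper that ends the loop once a window reaches the last word. With the default parameters (500 words per chunk, 50 words of overlap) it yields the same boundaries as the loop above.

```python
from typing import List, Tuple


def chunk_bounds(num_words: int, chunk_size: int = 500,
                 chunk_overlap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) word offsets for overlapping fixed-size windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    bounds = []
    start = 0
    while start < num_words:
        end = min(start + chunk_size, num_words)
        bounds.append((start, end))
        if end >= num_words:  # this window already covers the final word
            break
        start = end - chunk_overlap  # next window re-reads the last `chunk_overlap` words
    return bounds


print(chunk_bounds(1000))  # [(0, 500), (450, 950), (900, 1000)]
```

A 1000-word page therefore produces three chunks, and each consecutive pair shares 50 words.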
+    def parse_pdf(
+        self,
+        pdf_path: str,
+        document_metadata: Optional[Dict] = None
+    ) -> List[PDFChunk]:
+        """
+        Parse entire PDF into chunks
+        Args:
+            pdf_path: Path to PDF file
+            document_metadata: Additional metadata for the document
+        Returns:
+            List of all chunks from the PDF
+        """
+        # Extract text from all pages
+        pages_text = self.extract_text_from_pdf(pdf_path)
+        # Chunk each page
+        all_chunks = []
+        for page_num, text in pages_text.items():
+            chunks = self.chunk_text(text, page_num)
+            # Add document metadata
+            if document_metadata:
+                for chunk in chunks:
+                    chunk.metadata.update(document_metadata)
+            all_chunks.extend(chunks)
+        return all_chunks
+    def parse_pdf_bytes(
+        self,
+        pdf_bytes: bytes,
+        document_metadata: Optional[Dict] = None
+    ) -> List[PDFChunk]:
+        """
+        Parse PDF from bytes (for uploaded files)
+        Args:
+            pdf_bytes: PDF file as bytes
+            document_metadata: Additional metadata
+        Returns:
+            List of chunks
+        """
+        import tempfile
+        import os
+        # Save to temp file
+        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+            tmp.write(pdf_bytes)
+            tmp_path = tmp.name
+        try:
+            chunks = self.parse_pdf(tmp_path, document_metadata)
+            return chunks
+        finally:
+            # Clean up temp file
+            if os.path.exists(tmp_path):
+                os.unlink(tmp_path)
+    def get_pdf_info(self, pdf_path: str) -> Dict:
+        """
+        Get basic info about PDF
+        Args:
+            pdf_path: Path to PDF file
+        Returns:
+            Dictionary with PDF information
+        """
+        try:
+            pdf = pdfium.PdfDocument(pdf_path)
+            info = {
+                'num_pages': len(pdf),
+                'file_path': pdf_path,
+            }
+            return info
+        except Exception as e:
+            # Chain the original error so the underlying traceback is kept
+            raise Exception(f"Error reading PDF info: {e}") from e
+class PDFIndexer:
+    """Index PDF chunks into RAG system"""
+    def __init__(self, embedding_service, qdrant_service, documents_collection):
+        self.embedding_service = embedding_service
+        self.qdrant_service = qdrant_service
+        self.documents_collection = documents_collection
+        self.parser = PDFParser()
+    def index_pdf(
+        self,
+        pdf_path: str,
+        document_id: str,
+        document_metadata: Optional[Dict] = None
+    ) -> Dict:
+        """
+        Index entire PDF into RAG system
+        Args:
+            pdf_path: Path to PDF file
+            document_id: Unique ID for this document
+            document_metadata: Additional metadata (title, author, etc.)
+        Returns:
+            Indexing results
+        """
+        # Parse PDF
+        chunks = self.parser.parse_pdf(pdf_path, document_metadata)
+        # Index each chunk
+        indexed_count = 0
+        chunk_ids = []
+        for chunk in chunks:
+            # Generate unique ID for chunk
+            chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+            # Generate embedding
+            embedding = self.embedding_service.encode_text(chunk.text)
+            # Prepare metadata
+            metadata = {
+                'text': chunk.text,
+                'document_id': document_id,
+                'page': chunk.page_number,
+                'chunk_index': chunk.chunk_index,
+                'source': 'pdf',
+                **chunk.metadata
+            }
+            # Index to Qdrant
+            self.qdrant_service.index_data(
+                doc_id=chunk_id,
+                embedding=embedding,
+                metadata=metadata
+            )
+            chunk_ids.append(chunk_id)
+            indexed_count += 1
+        # Save document info to MongoDB
+        doc_info = {
+            'document_id': document_id,
+            'type': 'pdf',
+            'file_path': pdf_path,
+            'num_chunks': indexed_count,
+            'chunk_ids': chunk_ids,
+            'metadata': document_metadata or {},
+            'pdf_info': self.parser.get_pdf_info(pdf_path)
+        }
+        self.documents_collection.insert_one(doc_info)
+        return {
+            'success': True,
+            'document_id': document_id,
+            'chunks_indexed': indexed_count,
+            'chunk_ids': chunk_ids[:5]  # Return first 5 as sample
+        }
+    def index_pdf_bytes(
+        self,
+        pdf_bytes: bytes,
+        document_id: str,
+        filename: str,
+        document_metadata: Optional[Dict] = None
+    ) -> Dict:
+        """
+        Index PDF from bytes (for uploaded files)
+        Args:
+            pdf_bytes: PDF file as bytes
+            document_id: Unique ID for this document
+            filename: Original filename
+            document_metadata: Additional metadata
+        Returns:
+            Indexing results
+        """
+        # Parse PDF
+        # Copy so that adding the filename does not mutate the caller's dict
+        metadata = dict(document_metadata or {})
+        metadata['filename'] = filename
+        chunks = self.parser.parse_pdf_bytes(pdf_bytes, metadata)
+        # Index each chunk
+        indexed_count = 0
+        chunk_ids = []
+        for chunk in chunks:
+            # Generate unique ID for chunk
+            chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+            # Generate embedding
+            embedding = self.embedding_service.encode_text(chunk.text)
+            # Prepare metadata
+            # Use a separate name so the document-level `metadata` above is
+            # not overwritten by the last chunk's payload
+            chunk_metadata = {
+                'text': chunk.text,
+                'document_id': document_id,
+                'page': chunk.page_number,
+                'chunk_index': chunk.chunk_index,
+                'source': 'pdf',
+                'filename': filename,
+                **chunk.metadata
+            }
+            # Index to Qdrant
+            self.qdrant_service.index_data(
+                doc_id=chunk_id,
+                embedding=embedding,
+                metadata=chunk_metadata
+            )
+            chunk_ids.append(chunk_id)
+            indexed_count += 1
+        # Save document info to MongoDB
+        doc_info = {
+            'document_id': document_id,
+            'type': 'pdf',
+            'filename': filename,
+            'num_chunks': indexed_count,
+            'chunk_ids': chunk_ids,
+            'metadata': metadata
+        }
+        self.documents_collection.insert_one(doc_info)
+        return {
+            'success': True,
+            'document_id': document_id,
+            'filename': filename,
+            'chunks_indexed': indexed_count,
+            'chunk_ids': chunk_ids[:5]
+        }
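
Review note: chunk IDs in both indexers are strings of the form `<document_id>_p<page>_c<chunk>`. Qdrant itself accepts only unsigned 64-bit integers or UUIDs as point IDs, so the `qdrant_service.index_data` wrapper (not shown in this diff) presumably converts the string. If it does not, one option is a deterministic UUIDv5 derived from the chunk ID. The sketch below is an assumption about how that could look; `CHUNK_NAMESPACE`, `make_chunk_id`, and `chunk_point_id` are hypothetical names:

```python
import uuid

# Hypothetical namespace: any fixed UUID works as long as it never changes,
# because the same namespace + chunk ID always gives the same point ID.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "chatbotrag/pdf-chunks")


def make_chunk_id(document_id: str, page_number: int, chunk_index: int) -> str:
    """Build the human-readable chunk ID used by the indexers above."""
    return f"{document_id}_p{page_number}_c{chunk_index}"


def chunk_point_id(chunk_id: str) -> str:
    """Derive a stable UUID string from a chunk ID."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, chunk_id))


chunk_id = make_chunk_id("guide", 3, 1)
print(chunk_id)                  # guide_p3_c1
print(chunk_point_id(chunk_id))  # same UUID on every run
```

Because the mapping is deterministic, re-indexing the same PDF would overwrite the existing points instead of creating duplicates.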

requirements.txt CHANGED

@@ -28,4 +28,7 @@ huggingface-hub>=0.20.0
 timm
 einops

+# PDF Processing
+pypdfium2>=4.30.0

test_advanced_features.py ADDED

	@@ -0,0 +1,260 @@

+"""
+Test script for Advanced RAG features
+Demonstrates new capabilities: multiple texts/images indexing and advanced RAG chat
+"""
+import requests
+import json
+from typing import List, Optional
+class AdvancedRAGTester:
+    """Test client for Advanced RAG API"""
+    def __init__(self, base_url: str = "http://localhost:8000"):
+        self.base_url = base_url
+    def test_multiple_index(self, doc_id: str, texts: List[str], image_paths: Optional[List[str]] = None):
+        """
+        Test indexing with multiple texts and images
+        Args:
+            doc_id: Document ID
+            texts: List of texts (max 10)
+            image_paths: List of image file paths (max 10)
+        """
+        print(f"\n{'='*60}")
+        print(f"TEST: Indexing document '{doc_id}' with multiple texts/images")
+        print(f"{'='*60}")
+        # Prepare form data
+        data = {'id': doc_id}
+        # Add texts
+        if texts:
+            if len(texts) > 10:
+                print("WARNING: Maximum 10 texts allowed. Taking first 10.")
+                texts = texts[:10]
+            data['texts'] = texts
+            print(f"✓ Texts: {len(texts)} items")
+        # Prepare files
+        files = []
+        if image_paths:
+            if len(image_paths) > 10:
+                print("WARNING: Maximum 10 images allowed. Taking first 10.")
+                image_paths = image_paths[:10]
+            for img_path in image_paths:
+                try:
+                    files.append(('images', open(img_path, 'rb')))
+                except FileNotFoundError:
+                    print(f"WARNING: Image not found: {img_path}")
+            print(f"✓ Images: {len(files)} files")
+        # Make request
+        try:
+            response = requests.post(f"{self.base_url}/index", data=data, files=files)
+            response.raise_for_status()
+            result = response.json()
+            print(f"\n✓ SUCCESS")
+            print(f"  - Document ID: {result['id']}")
+            print(f"  - Message: {result['message']}")
+            return result
+        except requests.exceptions.RequestException as e:
+            print(f"\n✗ ERROR: {e}")
+            if e.response is not None:
+                print(f"  Response: {e.response.text}")
+            return None
+        finally:
+            # Close file handles
+            for _, file_obj in files:
+                file_obj.close()
+    def test_advanced_rag_chat(
+        self,
+        message: str,
+        hf_token: Optional[str] = None,
+        use_advanced_rag: bool = True,
+        use_reranking: bool = True,
+        use_compression: bool = True,
+        top_k: int = 3,
+        score_threshold: float = 0.5
+    ):
+        """
+        Test advanced RAG chat
+        Args:
+            message: User question
+            hf_token: Hugging Face token (optional)
+            use_advanced_rag: Use advanced RAG pipeline
+            use_reranking: Enable reranking
+            use_compression: Enable context compression
+            top_k: Number of documents to retrieve
+            score_threshold: Minimum relevance score
+        """
+        print(f"\n{'='*60}")
+        print(f"TEST: Advanced RAG Chat")
+        print(f"{'='*60}")
+        print(f"Question: {message}")
+        print(f"Advanced RAG: {use_advanced_rag}")
+        print(f"Reranking: {use_reranking}")
+        print(f"Compression: {use_compression}")
+        payload = {
+            'message': message,
+            'use_rag': True,
+            'use_advanced_rag': use_advanced_rag,
+            'use_reranking': use_reranking,
+            'use_compression': use_compression,
+            'top_k': top_k,
+            'score_threshold': score_threshold,
+        }
+        if hf_token:
+            payload['hf_token'] = hf_token
+        try:
+            response = requests.post(f"{self.base_url}/chat", json=payload)
+            response.raise_for_status()
+            result = response.json()
+            print(f"\n✓ SUCCESS")
+            print(f"\n--- Answer ---")
+            print(result['response'])
+            print(f"\n--- Retrieved Context ({len(result['context_used'])} documents) ---")
+            for i, ctx in enumerate(result['context_used'], 1):
+                print(f"{i}. [{ctx['id']}] Confidence: {ctx['confidence']:.2%}")
+                text_preview = ctx['metadata'].get('text', '')[:100]
+                print(f"   Text: {text_preview}...")
+            if result.get('rag_stats'):
+                print(f"\n--- RAG Pipeline Statistics ---")
+                stats = result['rag_stats']
+                print(f"  Original query: {stats.get('original_query')}")
+                print(f"  Expanded queries: {stats.get('expanded_queries')}")
+                print(f"  Initial results: {stats.get('initial_results')}")
+                print(f"  After reranking: {stats.get('after_rerank')}")
+                print(f"  After compression: {stats.get('after_compression')}")
+            return result
+        except requests.exceptions.RequestException as e:
+            print(f"\n✗ ERROR: {e}")
+            if e.response is not None:
+                print(f"  Response: {e.response.text}")
+            return None
+    def compare_basic_vs_advanced_rag(self, message: str, hf_token: Optional[str] = None):
+        """Compare basic RAG vs advanced RAG side by side"""
+        print(f"\n{'='*60}")
+        print(f"COMPARISON: Basic RAG vs Advanced RAG")
+        print(f"{'='*60}")
+        print(f"Question: {message}\n")
+        # Test Basic RAG
+        print("\n--- BASIC RAG ---")
+        basic_result = self.test_advanced_rag_chat(
+            message=message,
+            hf_token=hf_token,
+            use_advanced_rag=False
+        )
+        # Test Advanced RAG
+        print("\n--- ADVANCED RAG ---")
+        advanced_result = self.test_advanced_rag_chat(
+            message=message,
+            hf_token=hf_token,
+            use_advanced_rag=True
+        )
+        # Compare
+        print(f"\n{'='*60}")
+        print("COMPARISON SUMMARY")
+        print(f"{'='*60}")
+        if basic_result and advanced_result:
+            print(f"Basic RAG:")
+            print(f"  - Retrieved docs: {len(basic_result['context_used'])}")
+            print(f"\nAdvanced RAG:")
+            print(f"  - Retrieved docs: {len(advanced_result['context_used'])}")
+            if advanced_result.get('rag_stats'):
+                stats = advanced_result['rag_stats']
+                print(f"  - Query expansion: {len(stats.get('expanded_queries', []))} variants")
+                print(f"  - Initial retrieval: {stats.get('initial_results', 0)} docs")
+                print(f"  - After reranking: {stats.get('after_rerank', 0)} docs")
+def main():
+    """Run tests"""
+    tester = AdvancedRAGTester()
+    print("="*60)
+    print("ADVANCED RAG FEATURE TESTS")
+    print("="*60)
+    # Test 1: Index with multiple texts (no images for demo)
+    print("\n\n### TEST 1: Index Multiple Texts ###")
+    tester.test_multiple_index(
+        doc_id="event_music_festival_2025",
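+        texts=[
+            "Festival âm nhạc quốc tế Hà Nội 2025",  # Hanoi International Music Festival 2025
+            "Thời gian: 15-17 tháng 11 năm 2025",  # Time: November 15-17, 2025
+            "Địa điểm: Công viên Thống Nhất, Hà Nội",  # Venue: Thong Nhat Park, Hanoi
+            "Line-up: Sơn Tùng MTP, Đen Vâu, Hoàng Thùy Linh, Mỹ Tâm",  # Artist line-up
+            "Giá vé: Early bird 500.000đ, VIP 2.000.000đ",  # Ticket prices: early bird 500,000 VND, VIP 2,000,000 VND
+            "Dự kiến 50.000 khán giả tham dự",  # About 50,000 attendees expected
+            "3 sân khấu chính, 5 food court, khu vực cắm trại"  # 3 main stages, 5 food courts, camping area
+        ]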
+    )
+    # Test 2: Index another document
+    print("\n\n### TEST 2: Index Another Document ###")
+    tester.test_multiple_index(
+        doc_id="safety_guidelines",
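+        texts=[
+            "Vũ khí và đồ vật nguy hiểm bị cấm mang vào sự kiện",  # Weapons and dangerous objects are banned from the event
+            "Dao, kiếm, súng và các loại vũ khí nguy hiểm nghiêm cấm",  # Knives, swords, guns and other dangerous weapons are strictly prohibited
+            "An ninh sẽ kiểm tra tất cả túi xách và đồ mang theo",  # Security will check all bags and belongings
+            "Vi phạm sẽ bị tịch thu và có thể bị trục xuất khỏi sự kiện"  # Violators will have items confiscated and may be removed from the event
+        ]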
+    )
+    # Test 3: Basic chat (without HF token - will show placeholder)
+    print("\n\n### TEST 3: Basic RAG Chat (No LLM) ###")
+    tester.test_advanced_rag_chat(
+        message="Festival Hà Nội diễn ra khi nào?",  # "When does the Hanoi festival take place?"
+        use_advanced_rag=False
+    )
+    # Test 4: Advanced RAG chat
+    print("\n\n### TEST 4: Advanced RAG Chat (No LLM) ###")
+    tester.test_advanced_rag_chat(
+        message="Festival Hà Nội diễn ra khi nào và có những nghệ sĩ nào?",  # "When does the Hanoi festival take place and which artists are performing?"
+        use_advanced_rag=True,
+        use_reranking=True,
+        use_compression=True
+    )
+    # Test 5: Compare basic vs advanced
+    print("\n\n### TEST 5: Comparison Test ###")
+    tester.compare_basic_vs_advanced_rag(
+        message="Dao có được mang vào sự kiện không?"  # "Are knives allowed into the event?"
+    )
+    print("\n\n" + "="*60)
+    print("ALL TESTS COMPLETED")
+    print("="*60)
+    print("\nNOTE: To test with actual LLM responses, add your Hugging Face token:")
+    print("  tester.test_advanced_rag_chat(message='...', hf_token='hf_xxxxx')")
+if __name__ == "__main__":
+    main()
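
The indexing test above opens each image by hand and closes the handles in a `finally` block. A minimal sketch of the same idea using `contextlib.ExitStack` follows. The helper name `open_image_parts` and the `MAX_ITEMS` constant are hypothetical and not part of the uploaded files; the `images` field name and the limit of 10 are taken from the test script and are assumed to match what the server expects.

```python
from contextlib import ExitStack
from typing import BinaryIO, List, Tuple

MAX_ITEMS = 10  # limit taken from the test script's docstrings (assumed to match the server)


def open_image_parts(
    stack: ExitStack,
    image_paths: List[str],
    field: str = "images",
    limit: int = MAX_ITEMS,
) -> List[Tuple[str, BinaryIO]]:
    """Open up to `limit` images and return (field, file) pairs for requests' `files=`.

    Hypothetical helper: every opened handle is registered on `stack`,
    so all of them are closed when the caller's `with` block exits.
    """
    parts: List[Tuple[str, BinaryIO]] = []
    for path in image_paths[:limit]:
        try:
            handle = stack.enter_context(open(path, "rb"))
        except FileNotFoundError:
            # Same behaviour as the test script: warn and skip missing files
            print(f"WARNING: Image not found: {path}")
            continue
        parts.append((field, handle))
    return parts
```

A caller would wrap the request in `with ExitStack() as stack:` and pass `files=open_image_parts(stack, paths)` to `requests.post`, so the handles are released even if an unexpected exception is raised while the files are being opened.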

verify_dependencies.py ADDED Viewed

	@@ -0,0 +1,102 @@

+"""
+Verify all dependencies are installed correctly
+Run: python verify_dependencies.py
+"""
+import sys
+def check_dependency(module_name, package_name=None):
+    """Check if a dependency is installed"""
+    if package_name is None:
+        package_name = module_name
+    try:
+        __import__(module_name)
+        print(f"✓ {package_name}")
+        return True
+    except ImportError as e:
+        print(f"✗ {package_name} - NOT INSTALLED")
+        print(f"  Error: {e}")
+        return False
+def main():
+    print("="*60)
+    print("Dependency Verification")
+    print("="*60)
+    dependencies = [
+        # Web framework
+        ("fastapi", "fastapi"),
+        ("uvicorn", "uvicorn"),
+        ("multipart", "python-multipart"),
+        # ML & Embeddings
+        ("torch", "torch"),
+        ("transformers", "transformers"),
+        ("PIL", "pillow"),
+        ("numpy", "numpy"),
+        # Vector DB
+        ("qdrant_client", "qdrant-client"),
+        # Utilities
+        ("pydantic", "pydantic"),
+        ("dotenv", "python-dotenv"),
+        # MongoDB
+        ("pymongo", "pymongo"),
+        # Hugging Face Hub & model utilities
+        ("huggingface_hub", "huggingface-hub"),
+        ("timm", "timm"),
+        ("einops", "einops"),
+        # PDF Processing (NEW)
+        ("pypdfium2", "pypdfium2"),
+    ]
+    print("\nChecking dependencies...\n")
+    all_ok = True
+    for module, package in dependencies:
+        if not check_dependency(module, package):
+            all_ok = False
+    print("\n" + "="*60)
+    if all_ok:
+        print("✓ All dependencies installed successfully!")
+        print("\nYou can now run:")
+        print("  python main.py")
+    else:
+        print("✗ Some dependencies are missing!")
+        print("\nPlease install missing dependencies:")
+        print("  pip install -r requirements.txt")
+        print("="*60)  # Close the summary block before exiting
+        sys.exit(1)
+    print("="*60)
+    # Check that the project's own modules import cleanly
+    print("\nChecking system modules...\n")
+    custom_modules = [
+        "embedding_service",
+        "qdrant_service",
+        "advanced_rag",
+        "pdf_parser",
+        "multimodal_pdf_parser",
+    ]
+    for module in custom_modules:
+        try:
+            __import__(module)
+            print(f"✓ {module}.py")
+        except ImportError as e:
+            print(f"✗ {module}.py - ERROR: {e}")
+    print("\n" + "="*60)
+    print("Verification complete!")
+    print("="*60)
+if __name__ == "__main__":
+    main()
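
`check_dependency` imports each package, which can be slow for heavy modules such as `torch`. As a lighter alternative, the sketch below only asks the import system whether a module can be located, using the standard library's `importlib.util.find_spec`. The function name `find_missing` is hypothetical, and locating a module does not prove that it imports without errors, so this would complement the script above rather than replace it.

```python
import importlib.util
from typing import List, Tuple


def find_missing(dependencies: List[Tuple[str, str]]) -> List[str]:
    """Return the pip package names whose top-level module cannot be located.

    Hypothetical helper: `dependencies` uses the same (module, package)
    pairs as the list in verify_dependencies.py.
    """
    missing = []
    for module_name, package_name in dependencies:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module_name) is None:
            missing.append(package_name)
    return missing


if __name__ == "__main__":
    # Prints the package names that could not be located, if any
    print(find_missing([("json", "json"), ("pypdfium2", "pypdfium2")]))
```

Because nothing is actually imported, a package with a broken native extension would still be reported as present; the full import check remains the more reliable test.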