FauzanAriyatmoko committed on Jan 23

Commit

43aac82

1 Parent(s): a86d063

feat: Add an interactive document viewer with real-time citation highlighting, enhanced PDF processing, and speech-to-text input.

Files changed (2)

README.md +279 -49
app.py +4 -4

README.md CHANGED

@@ -9,7 +9,7 @@ python_version: "3.12"
 app_file: app.py
 pinned: false
 license: mit
-short_description: Chat with PDF documents using RAG and the GLM Model
 ---
 # RAG ChatBot with QWEN Model 🤖
@@ -20,7 +20,7 @@ short_description: Chat with PDF documents using RAG and the GLM Model
 ![Python](https://img.shields.io/badge/python-3.8+-brightgreen.svg)
 ![Gradio](https://img.shields.io/badge/gradio-5.42.0-orange.svg)
-**Chat with your PDF documents using AI and RAG (Retrieval-Augmented Generation)**
 [Demo](#demo) • [Features](#fitur) • [Installation](#instalasi) • [Usage](#penggunaan) • [Architecture](#arsitektur)
@@ -30,23 +30,36 @@ short_description: Chat with PDF documents using RAG and the GLM Model
 ## 📖 Description
-RAG ChatBot is an AI application that lets you upload PDF documents and ask questions interactively about their contents. The system uses:
 - **Qwen2-0.5B-Instruct**: A lightweight generative language model that produces the answers
 - **RAG (Retrieval-Augmented Generation)**: A technique for retrieving relevant information from the documents
 - **ChromaDB**: Vector database for storage and semantic search
 - **Gradio**: A modern, interactive web interface
 ## ✨ Features
 - 📤 **Upload Multiple PDF**: Upload one or several PDF files at once
 - 🔍 **Semantic Search**: Context retrieval using embeddings
 - 💬 **Interactive Chat**: Chat with streaming responses
-- 📚 **Source Citations**: See which parts of the documents the information came from
 - 🎨 **Modern UI**: Premium interface with gradients and animations
 - ⚙️ **Configurable**: Adjust parameters such as temperature, top-p, and retrieval count
 - 💾 **Persistent Storage**: Documents are stored in the vector database
 - 🌐 **Bahasa Indonesia**: Full support for the Indonesian language
 ## 🚀 Installation
@@ -54,6 +67,7 @@ RAG ChatBot is an AI application that lets you upload P
 - Python 3.8 or higher
 - (Optional) NVIDIA GPU with CUDA for optimal performance
 ### Installation Steps
@@ -92,39 +106,93 @@ python app.py
 The application will run at `http://localhost:7860`
-### Workflow
 1. **Upload Documents** (📤 Upload Dokumen tab)
    - Choose a PDF file from your computer
-   - Click "Process PDF"
-   - Wait until extraction and indexing finish
-2. **Chat with the Documents** (💬 Chat tab)
-   - Type your question about the document contents
-   - The system retrieves relevant information and answers
-   - Check the source citations for reference
 3. **Manage Documents** (📚 Kelola Dokumen tab)
    - View the list of stored documents
-   - Delete documents if needed
    - Clear all to reset the database
 ## 🏗️ Architecture
 ```
 ┌─────────────────┐
 │   PDF Upload    │
 └────────┬────────┘
          │
          ▼
-┌─────────────────┐
-│ Text Extraction │  (PyPDF2 + pdfplumber)
-└────────┬────────┘
          │
          ▼
-┌─────────────────┐
-│  Text Chunking  │  (LangChain)
-└────────┬────────┘
          │
          ▼
 ┌─────────────────┐
@@ -132,45 +200,93 @@ The application will run at `http://localhost:7860`
 └────────┬────────┘
          │
          ▼
-┌─────────────────┐
-│   ChromaDB      │  (Vector Storage)
-└────────┬────────┘
          │
     ┌────┴─────┐
-    │   RAG    │
     └────┬─────┘
          │
     ┌────▼─────┐
     │  Qwen2   │  (Response Generation)
-    └──────────┘
 ```
 ## 📁 Project Structure
 ```
 LLM-ChatBot-Document/
 │
-├── app.py                 # Main application
-├── requirements.txt       # Dependencies
-├── .env.example          # Environment template
-├── .gitignore            # Git ignore rules
 │
 ├── config/
 │   ├── __init__.py
-│   └── model_config.py   # Model & app configuration
 │
 ├── utils/
 │   ├── __init__.py
-│   ├── pdf_processor.py  # PDF extraction & chunking
-│   ├── vector_store.py   # ChromaDB management
-│   ├── rag_pipeline.py   # RAG implementation
-│   └── ui_components.py  # Gradio UI components
 │
 ├── data/
-│   ├── uploads/          # Temporary PDF storage
-│   └── vector_db/        # ChromaDB persistent storage
 │
-└── tests/                # Unit & integration tests
 ```
 ## ⚙️ Configuration
@@ -186,16 +302,20 @@ EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
 DEVICE=auto
 # Text Processing
-CHUNK_SIZE=500
-CHUNK_OVERLAP=50
 # Retrieval
-TOP_K_RETRIEVAL=3
 # Generation
-MAX_LENGTH=2048
-TEMPERATURE=0.7
-TOP_P=0.9
 ```
 ## 🔧 Requirements
@@ -210,14 +330,55 @@ These are the main dependencies used:
 - `langchain>=0.1.0` - Text processing
 - `PyPDF2>=3.0.0` - PDF extraction
 - `pdfplumber>=0.10.0` - Alternative PDF extraction
 ## 💡 Tips & Best Practices
 1. **PDF size**: For best results, use PDFs smaller than 50 MB
 2. **PDF format**: Make sure the PDF contains extractable text (not scanned images)
-3. **Chunk Size**: Tune `CHUNK_SIZE` to the document type (500-1000 is optimal)
-4. **GPU**: Use a GPU for faster model loading
-5. **Temperature**: Lower values (0.3-0.5) give more factual answers
 ## 🐛 Troubleshooting
@@ -225,15 +386,51 @@ These are the main dependencies used:
 ```bash
 # The current model is already very lightweight (~1GB)
 # If problems persist, make sure your internet connection is stable for the download
 ```
 ### PDF Extraction Error
 - Try the alternative extraction method by editing `pdf_processor.py`
-- Make sure the PDF is not password-protected
 ### Memory Error
-- Reduce `CHUNK_SIZE` and `BATCH_SIZE`
-- Use CPU instead of GPU if OOM on GPU
 ## 📝 License
@@ -243,6 +440,35 @@ MIT License - see the LICENSE file for details
 Contributions welcome! Please open an issue or a pull request.
 ## 📧 Contact
 For questions and support, please open an issue in this repository.
@@ -250,5 +476,9 @@ For questions and support, please open an issue in this repository.
 ---
 <div align="center">
-Made with ❤️ using Gradio and Qwen2
 </div>

 app_file: app.py
 pinned: false
 license: mit
+short_description: Chat with PDF documents using RAG and a Document Viewer
 ---
 # RAG ChatBot with QWEN Model 🤖
 ![Python](https://img.shields.io/badge/python-3.8+-brightgreen.svg)
 ![Gradio](https://img.shields.io/badge/gradio-5.42.0-orange.svg)
+**Chat with your PDF documents using AI, RAG (Retrieval-Augmented Generation), and an interactive Document Viewer**
 [Demo](#demo) • [Features](#fitur) • [Installation](#instalasi) • [Usage](#penggunaan) • [Architecture](#arsitektur)
 ## 📖 Description
+RAG ChatBot is an AI application that lets you upload PDF documents and ask questions interactively about their contents, with a **Document Viewer that shows source citations in real time**. The system uses:
 - **Qwen2-0.5B-Instruct**: A lightweight generative language model that produces the answers
 - **RAG (Retrieval-Augmented Generation)**: A technique for retrieving relevant information from the documents
 - **ChromaDB**: Vector database for storage and semantic search
+- **Document Viewer**: Interactive viewer with citation highlighting
 - **Gradio**: A modern, interactive web interface
 ## ✨ Features
+### 🎯 Main Features
 - 📤 **Upload Multiple PDF**: Upload one or several PDF files at once
+- 📄 **Document Viewer**: The document is shown side by side with the chat interface
+- 🎨 **Citation Highlighting**: Source paragraphs are highlighted automatically in the document
+- 🔗 **Interactive Citations**: Click a citation to scroll to the related paragraph
 - 🔍 **Semantic Search**: Context retrieval using embeddings
 - 💬 **Interactive Chat**: Chat with streaming responses
+- 📚 **Source Citations**: See detailed source information from the documents
+- 🎙️ **Speech-to-Text**: Voice input for questions (optional)
+### 💎 Premium Features
 - 🎨 **Modern UI**: Premium interface with gradients and animations
 - ⚙️ **Configurable**: Adjust parameters such as temperature, top-p, and retrieval count
 - 💾 **Persistent Storage**: Documents are stored in the vector database
 - 🌐 **Bahasa Indonesia**: Full support for the Indonesian language
+- 📊 **Document Metadata**: Tracks paragraphs, pages, and chunks
+- 🔄 **Real-time Updates**: The document viewer updates automatically during chat
+- 🎯 **Paragraph-level Citations**: Citations accurate down to the paragraph level
 ## 🚀 Installation
 - Python 3.8 or higher
 - (Optional) NVIDIA GPU with CUDA for optimal performance
+- (Optional) Microphone for Speech-to-Text
 ### Installation Steps
 The application will run at `http://localhost:7860`
+### Full Workflow
 1. **Upload Documents** (📤 Upload Dokumen tab)
    - Choose a PDF file from your computer
+   - Click "🚀 Process PDF"
+   - The system will:
+     - Extract text with its paragraph structure
+     - Create chunks with metadata
+     - Generate an HTML preview for the viewer
+     - Save everything to the vector database
+   - Wait until processing finishes (✓ success status)
+2. **Chat with the Document Viewer** (💬 Chat tab)
+   - **Left Column (60%)**: Chat Interface
+     - 🎤 Voice input (if enabled)
+     - ⌨️ Text input for questions
+     - 📚 Source references with citation cards
+     - ⚙️ Chat parameters (RAG, Temperature, etc.)
+   - **Right Column (40%)**: Document Viewer
+     - 📄 Document display with paragraph structure
+     - 🎯 Automatic highlighting of source paragraphs
+     - 🔍 Zoom controls (zoom in/out/reset)
+     - 📖 Page number per section
+   - **How Citations Work**:
+     1. Ask a question about the document
+     2. The ChatBot answers by retrieving relevant sources
+     3. Citation cards appear below the answer
+     4. The document viewer automatically highlights the source paragraphs
+     5. Click a citation card to scroll to the paragraph
 3. **Manage Documents** (📚 Kelola Dokumen tab)
    - View the list of stored documents
+   - Detailed info: number of chunks, pages, and paragraphs
+   - Delete individual documents
    - Clear all to reset the database
+### Document Viewer Features in Detail
+#### Highlighting & Navigation
+- **Auto-scroll**: The viewer scrolls automatically to the first source paragraph
+- **Click-to-highlight**: Click a citation to highlight the paragraph and scroll to it
+- **Flash animation**: The clicked paragraph flashes so it is easy to spot
+- **Multi-source**: Supports multiple paragraphs from different pages
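
The highlighting behaviour listed above could be approximated as follows. This is a minimal sketch and not the code in `utils/document_viewer.py`: the function names, the `para-<id>` element IDs, and the `highlight` CSS class are hypothetical, and "first source paragraph" is assumed to mean first in document order.

```python
from html import escape


def render_paragraphs(paragraphs, highlighted_ids):
    """Render (paragraph_id, text) pairs as HTML, marking cited paragraphs.

    Hypothetical helper: the `para-<id>` element IDs and the `highlight`
    class are illustrative names, not taken from the project.
    """
    highlighted = set(highlighted_ids)
    parts = []
    for para_id, text in paragraphs:
        css_class = "paragraph highlight" if para_id in highlighted else "paragraph"
        # Escape the text so PDF content cannot inject markup into the viewer.
        parts.append(f'<p id="para-{para_id}" class="{css_class}">{escape(text)}</p>')
    return "\n".join(parts)


def first_highlight_anchor(paragraphs, highlighted_ids):
    """Return the element ID an auto-scroll would target, or None if nothing is cited."""
    highlighted = set(highlighted_ids)
    for para_id, _ in paragraphs:
        if para_id in highlighted:
            return f"para-{para_id}"
    return None
```

Giving every paragraph a stable element ID is what would let a citation card scroll to its source, since the card only needs to carry the paragraph ID.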
+#### Viewer Controls
+- **🔍−** Zoom Out: Make the text smaller
+- **🔍** Reset: Return to the normal size
+- **🔍+** Zoom In: Make the text larger
+- **Scroll bar**: Custom styled for the dark theme
+#### Citation Format
+```
+📄 [Filename] (Hal. X)
+"[150-character text preview...]"
+```
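
A citation card in this format might be built as shown below. This is a sketch under assumptions: `format_citation` is a hypothetical name, the bracketed parts of the template are treated as placeholders, and the preview is assumed to be a plain truncation to 150 characters. "Hal." abbreviates "Halaman" (page) and is kept as the UI displays it.

```python
def format_citation(filename, page, text, preview_chars=150):
    """Build a citation card string in the format shown above (hypothetical helper)."""
    # Collapse runs of whitespace so the preview stays on one line.
    flat = " ".join(text.split())
    preview = flat[:preview_chars]
    if len(flat) > preview_chars:
        # Mark that the source paragraph continues past the preview.
        preview += "..."
    return f'📄 {filename} (Hal. {page})\n"{preview}"'
```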
 ## 🏗️ Architecture
+### Full Pipeline
 ```
 ┌─────────────────┐
 │   PDF Upload    │
 └────────┬────────┘
          │
          ▼
+┌─────────────────────────┐
+│  Structured Extraction  │  ← NEW! Extract with paragraphs
+│  - Pages                │
+│  - Paragraphs + IDs     │
+│  - Character positions  │
+└────────┬────────────────┘
          │
          ▼
+┌─────────────────────────┐
+│  HTML Preview Gen       │  ← NEW! For document viewer
+└────────┬────────────────┘
+         │
+         ▼
+┌─────────────────────────┐
+│  Chunking + Metadata    │  ← Enhanced with para IDs
+│  - Paragraph IDs        │
+│  - Page numbers         │
+│  - Character offsets    │
+└────────┬────────────────┘
          │
          ▼
 ┌─────────────────┐
 └────────┬────────┘
          │
          ▼
+┌─────────────────────────┐
+│   ChromaDB Storage      │  ← Enhanced metadata
+│   - Chunks + embeddings │
+│   - Paragraph metadata  │
+│   - HTML preview        │
+└────────┬────────────────┘
          │
     ┌────┴─────┐
+    │   RAG    │  ← Returns enriched sources
     └────┬─────┘
          │
     ┌────▼─────┐
     │  Qwen2   │  (Response Generation)
+    └────┬─────┘
+         │
+         ▼
+┌─────────────────────────┐
+│  Document Viewer        │  ← NEW! Citation highlighting
+│  - Render HTML          │
+│  - Highlight paragraphs │
+│  - Interactive citations│
+└─────────────────────────┘
 ```
+### Main Components
+1. **PDF Processor** (`utils/pdf_processor.py`)
+   - Extract text with paragraph structure
+   - Generate HTML preview
+   - Chunk text with metadata (paragraph IDs, pages)
+2. **Vector Store** (`utils/vector_store.py`)
+   - Store chunks with enhanced metadata
+   - Store HTML preview for each document
+   - Query with paragraph ID return
+3. **Document Viewer** (`utils/document_viewer.py`)
+   - Render HTML with highlighting
+   - Create interactive citation links
+   - Handle viewer controls
+4. **RAG Pipeline** (`utils/rag_pipeline.py`)
+   - Retrieve relevant chunks
+   - Return source metadata (paragraph IDs, pages)
+   - Stream responses
+5. **UI Components** (`utils/ui_components.py`)
+   - Premium CSS styling
+   - Document viewer styles
+   - Citation card styles
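
To make the "chunk text with metadata" step concrete, here is a plain-Python sketch of chunking that carries paragraph IDs, page numbers, and character offsets. It is a stand-in for illustration only: the dictionary keys and the function name are hypothetical, and keeping chunks inside paragraph boundaries is an assumption rather than the project's confirmed behaviour.

```python
def chunk_paragraphs(paragraphs, chunk_size=500, chunk_overlap=50):
    """Split paragraphs into overlapping character chunks with metadata.

    `paragraphs` is a list of dicts with the hypothetical keys
    `id`, `page`, and `text`. Offsets are relative to the paragraph.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for para in paragraphs:
        text = para["text"]
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append({
                "text": text[start:end],
                "paragraph_id": para["id"],
                "page": para["page"],
                "char_start": start,
                "char_end": end,
            })
            if end == len(text):
                break  # the paragraph is fully covered
            start += step
    return chunks
```

Because each chunk records its paragraph ID and page, a retrieved chunk can be mapped straight back to the paragraph to highlight.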
 ## 📁 Project Structure
 ```
 LLM-ChatBot-Document/
 │
+├── app.py                      # Main application (updated)
+├── requirements.txt            # Dependencies
+├── .env.example               # Environment template
+├── .gitignore                 # Git ignore rules
+├── README.md                  # Documentation (this file)
+├── QUICKSTART.md              # Quick start guide
 │
 ├── config/
 │   ├── __init__.py
+│   └── model_config.py        # Model & app configuration
 │
 ├── utils/
 │   ├── __init__.py
+│   ├── pdf_processor.py       # PDF extraction & chunking (enhanced)
+│   ├── vector_store.py        # ChromaDB management (enhanced)
+│   ├── rag_pipeline.py        # RAG implementation
+│   ├── document_viewer.py     # Document viewer component (NEW!)
+│   ├── speech_to_text.py      # STT for voice input
+│   └── ui_components.py       # Gradio UI components (enhanced)
 │
 ├── data/
+│   ├── uploads/               # Temporary PDF storage
+│   └── vector_db/             # ChromaDB persistent storage
+│       ├── chroma.sqlite3     # Vector database
+│       └── documents_metadata.json  # Document metadata (enhanced)
 │
+└── tests/                     # Unit & integration tests
+    ├── __init__.py
+    ├── test_imports.py
+    ├── test_pdf_processor.py
+    └── test_vector_store.py
 ```
 ## ⚙️ Configuration
 DEVICE=auto
 # Text Processing
+CHUNK_SIZE=500              # Chunk size (characters)
+CHUNK_OVERLAP=50           # Overlap between chunks
 # Retrieval
+TOP_K_RETRIEVAL=3          # Number of chunks to retrieve
 # Generation
+MAX_LENGTH=2048            # Max tokens for the response
+TEMPERATURE=0.7            # Creativity (0.0-2.0)
+TOP_P=0.9                  # Nucleus sampling
+# Speech-to-Text (Optional)
+STT_ENABLED=False          # Enable/disable STT
+STT_LANGUAGE=id-ID         # Language for STT
 ```
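
The settings above could be read into typed values roughly like this. It is a sketch, not the contents of `config/model_config.py`: `load_settings` and `_as_bool` are hypothetical names, and the defaults simply mirror the values shown in the template.

```python
import os


def _as_bool(value):
    """Interpret common truthy strings; anything else counts as False."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read the settings listed above from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "device": env.get("DEVICE", "auto"),
        "chunk_size": int(env.get("CHUNK_SIZE", "500")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "50")),
        "top_k_retrieval": int(env.get("TOP_K_RETRIEVAL", "3")),
        "max_length": int(env.get("MAX_LENGTH", "2048")),
        "temperature": float(env.get("TEMPERATURE", "0.7")),
        "top_p": float(env.get("TOP_P", "0.9")),
        "stt_enabled": _as_bool(env.get("STT_ENABLED", "False")),
        "stt_language": env.get("STT_LANGUAGE", "id-ID"),
    }
```

The explicit boolean parser matters because `bool("False")` is `True` in Python, so `STT_ENABLED=False` would otherwise switch the feature on.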
 ## 🔧 Requirements
 - `langchain>=0.1.0` - Text processing
 - `PyPDF2>=3.0.0` - PDF extraction
 - `pdfplumber>=0.10.0` - Alternative PDF extraction
+- `python-dotenv>=1.0.0` - Environment variables
 ## 💡 Tips & Best Practices
+### Upload & Processing
 1. **PDF size**: For best results, use PDFs smaller than 50 MB
 2. **PDF format**: Make sure the PDF contains extractable text (not scanned images)
+3. **Multi-file**: Upload multiple files to search across documents
+### Chat & RAG
+4. **Chunk Size**: Tune `CHUNK_SIZE` to the document type:
+   - Academic papers: 500-700
+   - Books/novels: 800-1000
+   - Technical docs: 400-600
+5. **Top-K**: Use 3-5 in most cases; increase it for complex documents
+6. **Temperature**:
+   - 0.3-0.5: Factual, strict answers
+   - 0.7-0.9: Balanced
+   - 1.0-1.5: Creative, expansive
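
Why lower temperatures give stricter answers can be seen in a small, model-independent sketch of temperature scaling and nucleus (top-p) filtering. The function names are hypothetical and the logits are made-up numbers; this illustrates the general sampling technique, not how the project calls Qwen2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.1]  # illustrative scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.3)  # mass concentrates on the top token
hot = softmax_with_temperature(logits, 1.5)   # mass spreads across the tokens
```

With these numbers, `nucleus(cold, 0.9)` keeps only the top token, while `nucleus(hot, 0.9)` keeps all three, which is why higher temperatures read as more varied.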
+### Document Viewer
+7. **Citation Quality**: The more specific the question, the more accurate the citation
+8. **Multi-page**: The viewer supports citations from multiple pages
+9. **Zoom**: Use the zoom controls for optimal readability
+### Performance
+10. **GPU**: Use a GPU for faster model loading
+11. **Memory**: The document viewer caches HTML, so RAM use grows with the number of documents
+12. **Cleanup**: Delete old documents to save storage
+## 🎨 UI/UX Features
+### Dark Theme Premium
+- Gradient backgrounds (purple to blue)
+- Glass morphism effects
+- Smooth animations & transitions
+- Custom scrollbars
+### Interactive Elements
+- Hover effects on citations
+- Click-to-scroll functionality
+- Flash animations for highlights
+- Responsive layout (60/40 split)
+### Visual Feedback
+- Status indicators (success/error/loading)
+- Progress bars for uploads
+- Real-time streaming responses
+- Auto-scroll to citations
 ## 🐛 Troubleshooting
 ```bash
 # The current model is already very lightweight (~1GB)
 # If problems persist, make sure your internet connection is stable for the download
+# The model is cached after the first download
 ```
 ### PDF Extraction Error
 - Try the alternative extraction method by editing `pdf_processor.py`
+- Make sure the PDF is not password-protected or encrypted
+- Check whether the PDF is a scan (requires separate OCR)
 ### Memory Error
+```bash
+# Reduce CHUNK_SIZE and BATCH_SIZE
+CHUNK_SIZE=300
+# Use CPU instead of GPU if OOM
+DEVICE=cpu
+# Limit the number of documents in the database
+```
+### Document Viewer Issues
+- **No highlighting appears**: Make sure RAG is enabled
+- **Citation does not scroll**: Check the console for JS errors
+- **HTML does not render**: Check the metadata in the vector store
+- **Slow performance**: Reduce the document size or the number of paragraphs
+### Citation Not Accurate
+- Increase `TOP_K_RETRIEVAL` to retrieve more sources
+- Reduce `CHUNK_SIZE` for more granular chunks
+- Adjust `CHUNK_OVERLAP` for better context
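
The effect of `TOP_K_RETRIEVAL` can be illustrated with a plain-Python top-k search over chunk embeddings. In the project this search is handled by ChromaDB; the sketch below is only a stand-in, the function names and the `embedding` key are hypothetical, and cosine similarity is an assumption because the collection's distance metric is not stated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

A larger `k` returns more candidate paragraphs to cite, at the cost of passing more text to the model.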
+## 🔄 Update & Migration
+### From a Previous Version
+If you are upgrading from a version without the document viewer:
+```bash
+# 1. Back up the vector database
+cp -r data/vector_db data/vector_db_backup
+# 2. Clear the database and re-process all documents
+# (the old metadata is not compatible)
+rm -rf data/vector_db/*
+# 3. Re-upload the documents through the UI
+# Documents will be processed with the new metadata
+```
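
On systems without `cp` and `rm`, steps 1 and 2 could be done from Python instead. This is a sketch with a hypothetical function name; it refuses to overwrite an existing backup, which is a design choice not present in the shell commands.

```python
import shutil
from pathlib import Path


def backup_and_clear(vector_db_dir, backup_dir):
    """Copy the vector DB directory to backup_dir, then empty the original."""
    src, dst = Path(vector_db_dir), Path(backup_dir)
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    shutil.copytree(src, dst)
    # Remove the contents but keep the directory itself.
    # Unlike the shell glob `dir/*`, this also removes hidden entries.
    for entry in src.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return dst
```

For the paths in the steps above, the call would be `backup_and_clear("data/vector_db", "data/vector_db_backup")`, followed by re-uploading the documents through the UI.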
 ## 📝 License
 Contributions welcome! Please open an issue or a pull request.
+### Development Setup
+```bash
+# Setup development environment
+pip install -r requirements.txt
+# Run tests
+pytest tests/
+# Format code (if using black)
+black utils/ app.py
+```
+## 🗺️ Roadmap
+### Upcoming Features
+- [ ] Multi-document selector dropdown
+- [ ] Search in document viewer
+- [ ] Export chat with citations
+- [ ] PDF annotation support
+- [ ] Custom highlight colors
+- [ ] Mobile responsive improvements
+### Completed ✓
+- [x] Document viewer with citations
+- [x] Paragraph-level highlighting
+- [x] Interactive citation links
+- [x] Structured PDF processing
+- [x] Enhanced metadata storage
 ## 📧 Contact
 For questions and support, please open an issue in this repository.
 ---
 <div align="center">
+**Made with ❤️ using Gradio and Qwen2**
+*Enhanced with Document Viewer & Citation Highlighting*
 </div>

app.py CHANGED

@@ -310,15 +310,15 @@ with gr.Blocks(css=CUSTOM_CSS, theme=gr.themes.Soft(), title="RAG ChatBot - QWEN
             )
         # ===== Tab 2: Chat Interface =====
-        with gr.Tab("💬 Chat dengan Viewer"):
             gr.Markdown("""
-            ### Chat Interaktif dengan Document Viewer
-            Ajukan pertanyaan dan lihat sitasi sumber langsung pada dokumen di sebelah kanan.
             """)
             with gr.Row():
                 # Left Column: Chat Interface (60%)
-                with gr.Column(scale=3):
                     chatbot = gr.Chatbot(
                         label="Percakapan",
                         type="messages",

             )
         # ===== Tab 2: Chat Interface =====
+        with gr.Tab("💬 Chat"):
             gr.Markdown("""
+            ### Chat Interaktif dengan Document Viewer atau tanpa sumber terkait ilmu pengetahuan umum
+            Ajukan pertanyaan dan lihat sitasi sumber langsung pada dokumen atau tanpa sumber terkait ilmu pengetahuan umum.
             """)
             with gr.Row():
                 # Left Column: Chat Interface (60%)
+                with gr.Column(scale=5):
                     chatbot = gr.Chatbot(
                         label="Percakapan",
                         type="messages",