Commit ·
43aac82
1
Parent(s): a86d063
feat: Add an interactive document viewer with real-time citation highlighting, enhanced PDF processing, and speech-to-text input.
README.md
CHANGED
|
@@ -9,7 +9,7 @@ python_version: "3.12"
|
|
| 9 |
app_file: app.py
|
| 10 |
pinned: false
|
| 11 |
license: mit
|
| 12 |
-
short_description: Chat with PDF documents using RAG and GLM Model
|
| 13 |
---
|
| 14 |
|
| 15 |
# RAG ChatBot with QWEN Model
|
|
@@ -20,7 +20,7 @@ short_description: Chat with PDF documents using RAG and GLM Model
|
|
| 20 |

|
| 21 |

|
| 22 |
|
| 23 |
-
**Chat with your PDF documents using AI powered by RAG (Retrieval-Augmented Generation)**
|
| 24 |
|
| 25 |
[Demo](#demo) • [Features](#fitur) • [Installation](#instalasi) • [Usage](#penggunaan) • [Architecture](#arsitektur)
|
| 26 |
|
|
@@ -30,23 +30,36 @@ short_description: Chat with PDF documents using RAG and GLM Model
|
|
| 30 |
|
| 31 |
## Description
|
| 32 |
|
| 33 |
-
RAG ChatBot is an AI application that lets you upload PDF documents and ask questions interactively about their contents. The system uses:
|
| 34 |
|
| 35 |
- **Qwen2-0.5B-Instruct**: A lightweight generative language model that produces the answers
|
| 36 |
- **RAG (Retrieval-Augmented Generation)**: A technique for retrieving relevant information from the documents
|
| 37 |
- **ChromaDB**: A vector database for storage and semantic search
|
|
|
|
|
| 38 |
- **Gradio**: A modern, interactive web interface
|
| 39 |
|
| 40 |
## Features
|
| 41 |
|
|
|
|
|
|
|
| 42 |
- **Upload Multiple PDF**: Upload one or several PDF files at once
|
|
|
|
|
|
|
|
|
|
| 43 |
- **Semantic Search**: Context retrieval using embeddings
|
| 44 |
- **Interactive Chat**: Chat with streaming responses
|
| 45 |
-
- **Source Citations**: See which parts of the document the information came from
|
|
|
|
|
|
|
|
|
|
|
|
|
| 46 |
- **Modern UI**: Premium interface with gradients and animations
|
| 47 |
- **Configurable**: Adjust parameters such as temperature, top-p, and retrieval count
|
| 48 |
- **Persistent Storage**: Documents are stored in the vector database
|
| 49 |
- **Bahasa Indonesia**: Full support for the Indonesian language
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
## Installation
|
| 52 |
|
|
@@ -54,6 +67,7 @@ RAG ChatBot adalah aplikasi AI yang memungkinkan Anda untuk mengupload dokumen P
|
|
| 54 |
|
| 55 |
- Python 3.8 or higher
|
| 56 |
- (Optional) NVIDIA GPU with CUDA for optimal performance
|
|
|
|
| 57 |
|
| 58 |
### Installation Steps
|
| 59 |
|
|
@@ -92,39 +106,93 @@ python app.py
|
|
| 92 |
|
| 93 |
The application will run at `http://localhost:7860`
|
| 94 |
|
| 95 |
-
### Workflow
|
| 96 |
|
| 97 |
1. **Upload Documents** (Upload Dokumen tab)
|
| 98 |
   - Select PDF files from your computer
|
| 99 |
-
   - Click "Process PDF"
|
| 100 |
-
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
-
|
| 106 |
|
| 107 |
3. **Manage Documents** (Kelola Dokumen tab)
|
| 108 |
   - View the list of stored documents
|
| 109 |
-
-
|
|
|
|
| 110 |
   - Clear all to reset the database
|
| 111 |
|
| 112 |
## Architecture
|
| 113 |
|
|
|
|
|
|
|
| 114 |
```
|
| 115 |
┌─────────────────┐
|
| 116 |
│   PDF Upload    │
|
| 117 |
└────────┬────────┘
|
| 118 |
         │
|
| 119 |
         ▼
|
| 120 |
-
┌─────────────────┐
|
| 121 |
-
│
|
| 122 |
-
|
|
|
|
|
|
|
|
|
|
|
| 123 |
         │
|
| 124 |
         ▼
|
| 125 |
-
┌─────────────────┐
|
| 126 |
-
│
|
| 127 |
-
└────────┬────────┘
|
| 128 |
         │
|
| 129 |
         ▼
|
| 130 |
┌─────────────────┐
|
|
@@ -132,45 +200,93 @@ The application will run at `http://localhost:7860`
|
|
| 132 |
└────────┬────────┘
|
| 133 |
         │
|
| 134 |
         ▼
|
| 135 |
-
┌─────────────────┐
|
| 136 |
-
│    ChromaDB     │
|
| 137 |
-
|
|
|
|
|
|
|
|
|
|
|
| 138 |
         │
|
| 139 |
    ┌────┴─────┐
|
| 140 |
-
    │   RAG    │
|
| 141 |
    └────┬─────┘
|
| 142 |
         │
|
| 143 |
    ┌────▼─────┐
|
| 144 |
    │  Qwen2   │ (Response Generation)
|
| 145 |
-
    └──────────┘
|
| 146 |
```
|
| 147 |
|
| 148 |
## Project Structure
|
| 149 |
|
| 150 |
```
|
| 151 |
LLM-ChatBot-Document/
|
| 152 |
│
|
| 153 |
-
├── app.py
|
| 154 |
-
├── requirements.txt
|
| 155 |
-
├── .env.example
|
| 156 |
-
├── .gitignore
|
|
|
|
|
|
|
|
| 157 |
│
|
| 158 |
├── config/
|
| 159 |
│   ├── __init__.py
|
| 160 |
-
│   └── model_config.py
|
| 161 |
│
|
| 162 |
├── utils/
|
| 163 |
│   ├── __init__.py
|
| 164 |
-
│   ├── pdf_processor.py
|
| 165 |
-
│   ├── vector_store.py
|
| 166 |
-
│   ├── rag_pipeline.py
|
| 167 |
-
│
|
|
|
|
|
|
|
|
| 168 |
│
|
| 169 |
├── data/
|
| 170 |
-
│   ├── uploads/
|
| 171 |
-
│   └── vector_db/
|
|
|
|
|
|
|
|
| 172 |
│
|
| 173 |
-
└── tests/
|
|
|
|
|
|
|
|
|
|
|
|
|
| 174 |
```
|
| 175 |
|
| 176 |
## Configuration
|
|
@@ -186,16 +302,20 @@ EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
|
|
| 186 |
DEVICE=auto
|
| 187 |
|
| 188 |
# Text Processing
|
| 189 |
-
CHUNK_SIZE=500
|
| 190 |
-
CHUNK_OVERLAP=50
|
| 191 |
|
| 192 |
# Retrieval
|
| 193 |
-
TOP_K_RETRIEVAL=3
|
| 194 |
|
| 195 |
# Generation
|
| 196 |
-
MAX_LENGTH=2048
|
| 197 |
-
TEMPERATURE=0.7
|
| 198 |
-
TOP_P=0.9
|
|
|
|
|
|
|
|
|
|
|
|
|
| 199 |
```
|
| 200 |
|
| 201 |
## Requirements
|
|
@@ -210,14 +330,55 @@ The main dependencies used are:
|
|
| 210 |
- `langchain>=0.1.0` - Text processing
|
| 211 |
- `PyPDF2>=3.0.0` - PDF extraction
|
| 212 |
- `pdfplumber>=0.10.0` - Alternative PDF extraction
|
|
|
|
| 213 |
|
| 214 |
## Tips & Best Practices
|
| 215 |
|
|
|
|
| 216 |
1. **PDF Size**: For best results, use PDFs smaller than 50 MB
|
| 217 |
2. **PDF Format**: Make sure the PDF contains extractable text (not scanned images)
|
| 218 |
-
3. **
|
| 219 |
-
|
| 220 |
-
|
| 221 |
|
| 222 |
## Troubleshooting
|
| 223 |
|
|
@@ -225,15 +386,51 @@ The main dependencies used are:
|
|
| 225 |
```bash
|
| 226 |
# The current model is already very lightweight (~1GB)
|
| 227 |
# If problems persist, make sure your internet connection is stable for the download
|
|
|
|
| 228 |
```
|
| 229 |
|
| 230 |
### PDF Extraction Error
|
| 231 |
- Try the alternative extraction method by editing `pdf_processor.py`
|
| 232 |
-
- Make sure the PDF is not password-protected
|
|
|
|
| 233 |
|
| 234 |
### Memory Error
|
| 235 |
-
|
| 236 |
-
|
| 237 |
|
| 238 |
## License
|
| 239 |
|
|
@@ -243,6 +440,35 @@ MIT License - see the LICENSE file for details
|
|
| 243 |
|
| 244 |
Contributions welcome! Please open an issue or pull request.
|
| 245 |
|
| 246 |
## Contact
|
| 247 |
|
| 248 |
For questions and support, please open an issue in this repository.
|
|
@@ -250,5 +476,9 @@ For questions and support, please open an issue in this repository.
|
|
| 250 |
---
|
| 251 |
|
| 252 |
<div align="center">
|
| 253 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 254 |
</div>
|
|
|
|
| 9 |
app_file: app.py
|
| 10 |
pinned: false
|
| 11 |
license: mit
|
| 12 |
+
short_description: Chat with PDF documents using RAG and Document Viewer
|
| 13 |
---
|
| 14 |
|
| 15 |
# RAG ChatBot with QWEN Model
|
|
|
|
| 20 |

|
| 21 |

|
| 22 |
|
| 23 |
+
**Chat with your PDF documents using AI powered by RAG (Retrieval-Augmented Generation) and an interactive Document Viewer**
|
| 24 |
|
| 25 |
[Demo](#demo) • [Features](#fitur) • [Installation](#instalasi) • [Usage](#penggunaan) • [Architecture](#arsitektur)
|
| 26 |
|
|
|
|
| 30 |
|
| 31 |
## Description
|
| 32 |
|
| 33 |
+
RAG ChatBot is an AI application that lets you upload PDF documents and ask questions interactively about their contents, with a **Document Viewer that shows source citations in real time**. The system uses:
|
| 34 |
|
| 35 |
- **Qwen2-0.5B-Instruct**: A lightweight generative language model that produces the answers
|
| 36 |
- **RAG (Retrieval-Augmented Generation)**: A technique for retrieving relevant information from the documents
|
| 37 |
- **ChromaDB**: A vector database for storage and semantic search
|
| 38 |
+
- **Document Viewer**: An interactive viewer with citation highlighting
|
| 39 |
- **Gradio**: A modern, interactive web interface
|
| 40 |
|
| 41 |
## Features
|
| 42 |
|
| 43 |
+
### Main Features
|
| 44 |
+
|
| 45 |
- **Upload Multiple PDF**: Upload one or several PDF files at once
|
| 46 |
+
- **Document Viewer**: The document is shown side by side with the chat interface
|
| 47 |
+
- **Citation Highlighting**: Source paragraphs are highlighted automatically in the document
|
| 48 |
+
- **Interactive Citations**: Click a citation to scroll to the related paragraph
|
| 49 |
- **Semantic Search**: Context retrieval using embeddings
|
| 50 |
- **Interactive Chat**: Chat with streaming responses
|
| 51 |
+
- **Source Citations**: See the document sources behind each answer, with details
|
| 52 |
+
- **Speech-to-Text**: Voice input for questions (optional)
|
| 53 |
+
|
| 54 |
+
### Premium Features
|
| 55 |
+
|
| 56 |
- **Modern UI**: Premium interface with gradients and animations
|
| 57 |
- **Configurable**: Adjust parameters such as temperature, top-p, and retrieval count
|
| 58 |
- **Persistent Storage**: Documents are stored in the vector database
|
| 59 |
- **Bahasa Indonesia**: Full support for the Indonesian language
|
| 60 |
+
- **Document Metadata**: Tracks paragraphs, pages, and chunks
|
| 61 |
+
- **Real-time Updates**: The document viewer updates automatically as you chat
|
| 62 |
+
- **Paragraph-level Citations**: Citations accurate down to the paragraph level
|
| 63 |
|
| 64 |
## Installation
|
| 65 |
|
|
|
|
|
| 67 |
|
| 68 |
- Python 3.8 or higher
|
| 69 |
- (Optional) NVIDIA GPU with CUDA for optimal performance
|
| 70 |
+
- (Optional) Microphone for Speech-to-Text
|
| 71 |
|
| 72 |
### Installation Steps
|
| 73 |
|
|
|
|
|
| 106 |
|
| 107 |
The application will run at `http://localhost:7860`
|
| 108 |
|
| 109 |
+
### Full Workflow
|
| 110 |
|
| 111 |
1. **Upload Documents** (Upload Dokumen tab)
|
| 112 |
   - Select PDF files from your computer
|
| 113 |
+
   - Click "Process PDF"
|
| 114 |
+
   - The system will:
|
| 115 |
+
     - Extract the text while preserving paragraph structure
|
| 116 |
+
     - Create chunks with metadata
|
| 117 |
+
     - Generate an HTML preview for the viewer
|
| 118 |
+
     - Save everything to the vector database
|
| 119 |
+
   - Wait until processing finishes and a success status is shown
|
| 120 |
+
|
| 121 |
+
2. **Chat with the Document Viewer** (Chat tab)
|
| 122 |
+
   - **Left Column (60%)**: Chat Interface
|
| 123 |
+
     - Voice input (if enabled)
|
| 124 |
+
     - Text input for questions
|
| 125 |
+
     - Reference sources shown as citation cards
|
| 126 |
+
     - Chat parameters (RAG, Temperature, etc.)
|
| 127 |
+
|
| 128 |
+
   - **Right Column (40%)**: Document Viewer
|
| 129 |
+
     - Document display with paragraph structure
|
| 130 |
+
     - Automatic highlighting of source paragraphs
|
| 131 |
+
     - Zoom controls (zoom in/out/reset)
|
| 132 |
+
     - Page number for each section
|
| 133 |
+
|
| 134 |
+
   - **How Citations Work**:
|
| 135 |
+
     1. Ask a question about the document
|
| 136 |
+
     2. The chatbot answers by retrieving relevant sources
|
| 137 |
+
     3. Citation cards appear below the answer
|
| 138 |
+
     4. The document viewer automatically highlights the source paragraphs
|
| 139 |
+
     5. Click a citation card to scroll to its paragraph
|
| 140 |
|
| 141 |
3. **Manage Documents** (Kelola Dokumen tab)
|
| 142 |
   - View the list of stored documents
|
| 143 |
+
   - Detailed info: number of chunks, pages, and paragraphs
|
| 144 |
+
   - Delete individual documents
|
| 145 |
   - Clear all to reset the database
|
| 146 |
|
| 147 |
+
### Document Viewer Features in Detail
|
| 148 |
+
|
| 149 |
+
#### Highlighting & Navigation
|
| 150 |
+
- **Auto-scroll**: The viewer scrolls automatically to the first source paragraph
|
| 151 |
+
- **Click-to-highlight**: Click a citation to highlight its paragraph and scroll to it
|
| 152 |
+
- **Flash animation**: The clicked paragraph flashes briefly so it is easy to spot
|
| 153 |
+
- **Multi-source**: Supports multiple paragraphs from different pages
|
| 154 |
+
|
| 155 |
+
#### Viewer Controls
|
| 156 |
+
- **Zoom Out**: Make the text smaller
|
| 157 |
+
- **Reset**: Return to the default size
|
| 158 |
+
- **Zoom In**: Make the text larger
|
| 159 |
+
- **Scroll bar**: Custom styled for the dark theme
|
| 160 |
+
|
| 161 |
+
#### Citation Format
|
| 162 |
+
```
|
| 163 |
+
[Filename] (Hal. X)
|
| 164 |
+
"[150-character text preview...]"
|
| 165 |
+
```
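
As a rough illustration of the citation format above, the sketch below builds one citation card from a file name, a page number, and the source text. The function name `format_citation` and its parameters are hypothetical and not taken from the project's code; only the layout (file name, page in parentheses, quoted preview of up to 150 characters) follows the README.

```python
def format_citation(filename: str, page: int, text: str,
                    preview_len: int = 150, page_label: str = "Hal.") -> str:
    """Build a two-line citation card: a source header plus a quoted preview.

    Hypothetical helper for illustration; "Hal." is the page abbreviation
    used in the README's sample format.
    """
    # Collapse runs of whitespace so the preview stays on a single line.
    flat = " ".join(text.split())
    # Cut long text at preview_len characters and mark the cut with "...".
    if len(flat) > preview_len:
        flat = flat[:preview_len].rstrip() + "..."
    return f'{filename} ({page_label} {page})\n"{flat}"'


if __name__ == "__main__":
    print(format_citation("report.pdf", 3,
                          "Retrieval-Augmented Generation combines search with generation."))
```

Keeping the preview length and the page label as parameters makes it easy to adapt the card to a different language or a narrower layout.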
|
| 166 |
+
|
| 167 |
## Architecture
|
| 168 |
|
| 169 |
+
### Full Pipeline
|
| 170 |
+
|
| 171 |
```
|
| 172 |
┌─────────────────┐
|
| 173 |
│   PDF Upload    │
|
| 174 |
└────────┬────────┘
|
| 175 |
         │
|
| 176 |
         ▼
|
| 177 |
+
┌─────────────────────────┐
|
| 178 |
+
│  Structured Extraction  │ ← NEW! Extract with paragraphs
|
| 179 |
+
│  - Pages                │
|
| 180 |
+
│  - Paragraphs + IDs     │
|
| 181 |
+
│  - Character positions  │
|
| 182 |
+
└────────┬────────────────┘
|
| 183 |
         │
|
| 184 |
         ▼
|
| 185 |
+
┌─────────────────────────┐
|
| 186 |
+
│  HTML Preview Gen       │ ← NEW! For document viewer
|
| 187 |
+
└────────┬────────────────┘
|
| 188 |
+
         │
|
| 189 |
+
         ▼
|
| 190 |
+
┌─────────────────────────┐
|
| 191 |
+
│  Chunking + Metadata    │ ← Enhanced with para IDs
|
| 192 |
+
│  - Paragraph IDs        │
|
| 193 |
+
│  - Page numbers         │
|
| 194 |
+
│  - Character offsets    │
|
| 195 |
+
└────────┬────────────────┘
|
| 196 |
         │
|
| 197 |
         ▼
|
| 198 |
┌─────────────────┐
|
|
|
|
|
| 200 |
└────────┬────────┘
|
| 201 |
         │
|
| 202 |
         ▼
|
| 203 |
+
┌─────────────────────────┐
|
| 204 |
+
│  ChromaDB Storage       │ ← Enhanced metadata
|
| 205 |
+
│  - Chunks + embeddings  │
|
| 206 |
+
│  - Paragraph metadata   │
|
| 207 |
+
│  - HTML preview         │
|
| 208 |
+
└────────┬────────────────┘
|
| 209 |
         │
|
| 210 |
    ┌────┴─────┐
|
| 211 |
+
    │   RAG    │ ← Returns enriched sources
|
| 212 |
    └────┬─────┘
|
| 213 |
         │
|
| 214 |
    ┌────▼─────┐
|
| 215 |
    │  Qwen2   │ (Response Generation)
|
| 216 |
+
    └────┬─────┘
|
| 217 |
+
         │
|
| 218 |
+
         ▼
|
| 219 |
+
┌─────────────────────────┐
|
| 220 |
+
│  Document Viewer        │ ← NEW! Citation highlighting
|
| 221 |
+
│  - Render HTML          │
|
| 222 |
+
│  - Highlight paragraphs │
|
| 223 |
+
│  - Interactive citations│
|
| 224 |
+
└─────────────────────────┘
|
| 225 |
```
|
| 226 |
|
| 227 |
+
### Main Components
|
| 228 |
+
|
| 229 |
+
1. **PDF Processor** (`utils/pdf_processor.py`)
|
| 230 |
+
- Extract text with paragraph structure
|
| 231 |
+
- Generate HTML preview
|
| 232 |
+
- Chunk text with metadata (paragraph IDs, pages)
|
| 233 |
+
|
| 234 |
+
2. **Vector Store** (`utils/vector_store.py`)
|
| 235 |
+
- Store chunks with enhanced metadata
|
| 236 |
+
- Store HTML preview for each document
|
| 237 |
+
- Query with paragraph ID return
|
| 238 |
+
|
| 239 |
+
3. **Document Viewer** (`utils/document_viewer.py`)
|
| 240 |
+
- Render HTML with highlighting
|
| 241 |
+
- Create interactive citation links
|
| 242 |
+
- Handle viewer controls
|
| 243 |
+
|
| 244 |
+
4. **RAG Pipeline** (`utils/rag_pipeline.py`)
|
| 245 |
+
- Retrieve relevant chunks
|
| 246 |
+
- Return source metadata (paragraph IDs, pages)
|
| 247 |
+
- Stream responses
|
| 248 |
+
|
| 249 |
+
5. **UI Components** (`utils/ui_components.py`)
|
| 250 |
+
- Premium CSS styling
|
| 251 |
+
- Document viewer styles
|
| 252 |
+
- Citation card styles
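
To make the hand-off from the RAG pipeline's source metadata to the Document Viewer more concrete, here is a minimal sketch. It assumes each paragraph is a dict with `id`, `page`, and `text` keys and that retrieval yields a set of paragraph IDs. Those names, the `highlight` CSS class, and `render_paragraphs` itself are illustrative assumptions; the actual `utils/document_viewer.py` may work differently.

```python
from html import escape


def render_paragraphs(paragraphs, highlighted_ids):
    """Render paragraphs as HTML and mark the cited ones with a CSS class.

    paragraphs: list of dicts with hypothetical keys "id", "page", "text".
    highlighted_ids: set of paragraph IDs that came back with the sources.
    """
    parts = []
    for para in paragraphs:
        # Cited paragraphs get an extra class that a stylesheet can color.
        css = "paragraph highlight" if para["id"] in highlighted_ids else "paragraph"
        # Escape both the ID and the text so PDF content cannot inject markup.
        parts.append(
            f'<p id="{escape(para["id"], quote=True)}" class="{css}" '
            f'data-page="{int(para["page"])}">{escape(para["text"])}</p>'
        )
    return "\n".join(parts)


if __name__ == "__main__":
    demo = [
        {"id": "p1", "page": 1, "text": "Introduction."},
        {"id": "p2", "page": 2, "text": "Key finding: a < b."},
    ]
    print(render_paragraphs(demo, {"p2"}))
```

Giving every paragraph a stable `id` attribute is what would let a citation link scroll to it, for example through a fragment link such as `#p2`.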
|
| 253 |
+
|
| 254 |
## Project Structure
|
| 255 |
|
| 256 |
```
|
| 257 |
LLM-ChatBot-Document/
|
| 258 |
│
|
| 259 |
+
├── app.py                  # Main application (updated)
|
| 260 |
+
├── requirements.txt        # Dependencies
|
| 261 |
+
├── .env.example            # Environment template
|
| 262 |
+
├── .gitignore              # Git ignore rules
|
| 263 |
+
├── README.md               # Documentation (this file)
|
| 264 |
+
├── QUICKSTART.md           # Quick start guide
|
| 265 |
│
|
| 266 |
├── config/
|
| 267 |
│   ├── __init__.py
|
| 268 |
+
│   └── model_config.py     # Model & app configuration
|
| 269 |
│
|
| 270 |
├── utils/
|
| 271 |
│   ├── __init__.py
|
| 272 |
+
│   ├── pdf_processor.py    # PDF extraction & chunking (enhanced)
|
| 273 |
+
│   ├── vector_store.py     # ChromaDB management (enhanced)
|
| 274 |
+
│   ├── rag_pipeline.py     # RAG implementation
|
| 275 |
+
│   ├── document_viewer.py  # Document viewer component (NEW!)
|
| 276 |
+
│   ├── speech_to_text.py   # STT for voice input
|
| 277 |
+
│   └── ui_components.py    # Gradio UI components (enhanced)
|
| 278 |
│
|
| 279 |
├── data/
|
| 280 |
+
│   ├── uploads/            # Temporary PDF storage
|
| 281 |
+
│   └── vector_db/          # ChromaDB persistent storage
|
| 282 |
+
│       ├── chroma.sqlite3  # Vector database
|
| 283 |
+
│       └── documents_metadata.json  # Document metadata (enhanced)
|
| 284 |
│
|
| 285 |
+
└── tests/                  # Unit & integration tests
|
| 286 |
+
    ├── __init__.py
|
| 287 |
+
    ├── test_imports.py
|
| 288 |
+
    ├── test_pdf_processor.py
|
| 289 |
+
    └── test_vector_store.py
|
| 290 |
```
|
| 291 |
|
| 292 |
## Configuration
|
|
|
|
| 302 |
DEVICE=auto
|
| 303 |
|
| 304 |
# Text Processing
|
| 305 |
+
CHUNK_SIZE=500 # Chunk size (characters)
|
| 306 |
+
CHUNK_OVERLAP=50 # Overlap between chunks
|
| 307 |
|
| 308 |
# Retrieval
|
| 309 |
+
TOP_K_RETRIEVAL=3 # Number of chunks retrieved
|
| 310 |
|
| 311 |
# Generation
|
| 312 |
+
MAX_LENGTH=2048 # Max tokens for the response
|
| 313 |
+
TEMPERATURE=0.7 # Creativity (0.0-2.0)
|
| 314 |
+
TOP_P=0.9 # Nucleus sampling
|
| 315 |
+
|
| 316 |
+
# Speech-to-Text (Optional)
|
| 317 |
+
STT_ENABLED=False # Enable/disable STT
|
| 318 |
+
STT_LANGUAGE=id-ID # Language used for STT
|
| 319 |
```
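
A minimal sketch of how settings like these could be read in Python, assuming they have already been exported into the process environment (for example by `python-dotenv`, which the requirements list). The `load_settings` helper and its dictionary keys are hypothetical; the fallback values mirror the defaults shown above.

```python
import os


def load_settings(env=None):
    """Read app settings from a mapping of environment variables.

    Hypothetical helper: pass a dict for testing, or omit it to use os.environ.
    """
    env = os.environ if env is None else env
    return {
        "chunk_size": int(env.get("CHUNK_SIZE", "500")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "50")),
        "top_k": int(env.get("TOP_K_RETRIEVAL", "3")),
        "max_length": int(env.get("MAX_LENGTH", "2048")),
        "temperature": float(env.get("TEMPERATURE", "0.7")),
        "top_p": float(env.get("TOP_P", "0.9")),
        "device": env.get("DEVICE", "auto"),
        # Environment values are strings, so "False" must be parsed explicitly.
        "stt_enabled": env.get("STT_ENABLED", "False").strip().lower() in {"1", "true", "yes"},
        "stt_language": env.get("STT_LANGUAGE", "id-ID"),
    }


if __name__ == "__main__":
    print(load_settings({"CHUNK_SIZE": "300", "DEVICE": "cpu"}))
```

Parsing the boolean explicitly matters because any non-empty string, including `"False"`, is truthy in Python.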
|
| 320 |
|
| 321 |
## Requirements
|
|
|
|
| 330 |
- `langchain>=0.1.0` - Text processing
|
| 331 |
- `PyPDF2>=3.0.0` - PDF extraction
|
| 332 |
- `pdfplumber>=0.10.0` - Alternative PDF extraction
|
| 333 |
+
- `python-dotenv>=1.0.0` - Environment variables
|
| 334 |
|
| 335 |
## Tips & Best Practices
|
| 336 |
|
| 337 |
+
### Upload & Processing
|
| 338 |
1. **PDF Size**: For best results, use PDFs smaller than 50 MB
|
| 339 |
2. **PDF Format**: Make sure the PDF contains extractable text (not scanned images)
|
| 340 |
+
3. **Multi-file**: Upload multiple files to search across documents
|
| 341 |
+
|
| 342 |
+
### Chat & RAG
|
| 343 |
+
4. **Chunk Size**: Adjust `CHUNK_SIZE` to the type of document:
|
| 344 |
+
   - Academic papers: 500-700
|
| 345 |
+
   - Books/novels: 800-1000
|
| 346 |
+
   - Technical docs: 400-600
|
| 347 |
+
5. **Top-K**: Use 3-5 for most cases; increase it for complex documents
|
| 348 |
+
6. **Temperature**:
|
| 349 |
+
   - 0.3-0.5: Factual, strict answers
|
| 350 |
+
   - 0.7-0.9: Balanced
|
| 351 |
+
   - 1.0-1.5: Creative, expansive
|
| 352 |
+
|
| 353 |
+
### Document Viewer
|
| 354 |
+
7. **Citation Quality**: The more specific the question, the more accurate the citation
|
| 355 |
+
8. **Multi-page**: The viewer supports citations from different pages
|
| 356 |
+
9. **Zoom**: Use the zoom controls for the best readability
|
| 357 |
+
|
| 358 |
+
### Performance
|
| 359 |
+
10. **GPU**: Use a GPU for faster model loading
|
| 360 |
+
11. **Memory**: The document viewer caches HTML, so RAM usage grows with the number of documents
|
| 361 |
+
12. **Cleanup**: Delete old documents to save storage
|
| 362 |
+
|
| 363 |
+
## UI/UX Features
|
| 364 |
+
|
| 365 |
+
### Dark Theme Premium
|
| 366 |
+
- Gradient backgrounds (purple to blue)
|
| 367 |
+
- Glass morphism effects
|
| 368 |
+
- Smooth animations & transitions
|
| 369 |
+
- Custom scrollbars
|
| 370 |
+
|
| 371 |
+
### Interactive Elements
|
| 372 |
+
- Hover effects on citations
|
| 373 |
+
- Click-to-scroll functionality
|
| 374 |
+
- Flash animations for highlights
|
| 375 |
+
- Responsive layout (60/40 split)
|
| 376 |
+
|
| 377 |
+
### Visual Feedback
|
| 378 |
+
- Status indicators (success/error/loading)
|
| 379 |
+
- Progress bars for uploads
|
| 380 |
+
- Real-time streaming responses
|
| 381 |
+
- Auto-scroll to citations
|
| 382 |
|
| 383 |
## Troubleshooting
|
| 384 |
|
|
|
|
| 386 |
```bash
|
| 387 |
# The current model is already very lightweight (~1GB)
|
| 388 |
# If problems persist, make sure your internet connection is stable for the download
|
| 389 |
+
# The model is cached after the first download
|
| 390 |
```
|
| 391 |
|
| 392 |
### PDF Extraction Error
|
| 393 |
- Try the alternative extraction method by editing `pdf_processor.py`
|
| 394 |
+
- Make sure the PDF is not password-protected or encrypted
|
| 395 |
+
- Check whether the PDF is a scan (it then needs separate OCR)
|
| 396 |
|
| 397 |
### Memory Error
|
| 398 |
+
```bash
|
| 399 |
+
# Reduce CHUNK_SIZE and BATCH_SIZE
|
| 400 |
+
CHUNK_SIZE=300
|
| 401 |
+
# Use CPU instead of GPU if OOM
|
| 402 |
+
DEVICE=cpu
|
| 403 |
+
# Limit the number of documents in the database
|
| 404 |
+
```
|
| 405 |
+
|
| 406 |
+
### Document Viewer Issues
|
| 407 |
+
- **No highlighting appears**: Make sure RAG is enabled
|
| 408 |
+
- **Citation does not scroll**: Check the browser console for JS errors
|
| 409 |
+
- **HTML does not render**: Check the metadata in the vector store
|
| 410 |
+
- **Slow performance**: Reduce the document size or the number of paragraphs
|
| 411 |
+
|
| 412 |
+
### Citation Not Accurate
|
| 413 |
+
- Increase `TOP_K_RETRIEVAL` to retrieve more sources
|
| 414 |
+
- Reduce `CHUNK_SIZE` for more granular chunks
|
| 415 |
+
- Adjust `CHUNK_OVERLAP` for better context
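
The effect of `CHUNK_SIZE` and `CHUNK_OVERLAP` can be seen in a plain-Python sketch of fixed-size character chunking. This is an illustration only: `chunk_text` is a hypothetical helper, and since the project lists `langchain` for text processing, its real splitter likely behaves differently (for instance by respecting paragraph boundaries).

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50):
    """Split text into fixed-size character windows that overlap.

    Returns a list of dicts with the chunk text and its character offsets,
    which is the kind of metadata a citation needs to point back at a source.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 120  # 1,200 characters
    for chunk in chunk_text(sample):
        print(chunk["start"], chunk["end"])
```

With the defaults, a 1,200-character text yields three chunks starting at offsets 0, 450, and 900. Lowering `chunk_size` produces more and smaller chunks, which is what makes citations more granular, while a larger `overlap` repeats more context at each boundary.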
|
| 416 |
+
|
| 417 |
+
## Update & Migration
|
| 418 |
+
|
| 419 |
+
### From a Previous Version
|
| 420 |
+
|
| 421 |
+
If you are upgrading from a version without the document viewer:
|
| 422 |
+
|
| 423 |
+
```bash
|
| 424 |
+
# 1. Backup vector database
|
| 425 |
+
cp -r data/vector_db data/vector_db_backup
|
| 426 |
+
|
| 427 |
+
# 2. Clear and re-process all documents
|
| 428 |
+
# (the old metadata is not compatible)
|
| 429 |
+
rm -rf data/vector_db/*
|
| 430 |
+
|
| 431 |
+
# 3. Re-upload the documents through the UI
|
| 432 |
+
# Documents will be processed with the new metadata
|
| 433 |
+
```
|
| 434 |
|
| 435 |
## License
|
| 436 |
|
|
|
|
| 440 |
|
| 441 |
Contributions welcome! Please open an issue or pull request.
|
| 442 |
|
| 443 |
+
### Development Setup
|
| 444 |
+
```bash
|
| 445 |
+
# Setup development environment
|
| 446 |
+
pip install -r requirements.txt
|
| 447 |
+
|
| 448 |
+
# Run tests
|
| 449 |
+
pytest tests/
|
| 450 |
+
|
| 451 |
+
# Format code (if using black)
|
| 452 |
+
black utils/ app.py
|
| 453 |
+
```
|
| 454 |
+
|
| 455 |
+
## Roadmap
|
| 456 |
+
|
| 457 |
+
### Upcoming Features
|
| 458 |
+
- [ ] Multi-document selector dropdown
|
| 459 |
+
- [ ] Search in document viewer
|
| 460 |
+
- [ ] Export chat with citations
|
| 461 |
+
- [ ] PDF annotation support
|
| 462 |
+
- [ ] Custom highlight colors
|
| 463 |
+
- [ ] Mobile responsive improvements
|
| 464 |
+
|
| 465 |
+
### Completed
|
| 466 |
+
- [x] Document viewer with citations
|
| 467 |
+
- [x] Paragraph-level highlighting
|
| 468 |
+
- [x] Interactive citation links
|
| 469 |
+
- [x] Structured PDF processing
|
| 470 |
+
- [x] Enhanced metadata storage
|
| 471 |
+
|
| 472 |
## Contact
|
| 473 |
|
| 474 |
For questions and support, please open an issue in this repository.
|
|
|
|
| 476 |
---
|
| 477 |
|
| 478 |
<div align="center">
|
| 479 |
+
|
| 480 |
+
**Made with ❤️ using Gradio and Qwen2**
|
| 481 |
+
|
| 482 |
+
*Enhanced with Document Viewer & Citation Highlighting*
|
| 483 |
+
|
| 484 |
</div>
|
app.py
CHANGED
|
@@ -310,15 +310,15 @@ with gr.Blocks(css=CUSTOM_CSS, theme=gr.themes.Soft(), title="RAG ChatBot - QWEN
|
|
| 310 |
)
|
| 311 |
|
| 312 |
# ===== Tab 2: Chat Interface =====
|
| 313 |
-
with gr.Tab("💬 Chat
|
| 314 |
gr.Markdown("""
|
| 315 |
-
### Interactive Chat with the Document Viewer
|
| 316 |
-
Ask a question and see the source citations directly in the document
|
| 317 |
""")
|
| 318 |
|
| 319 |
with gr.Row():
|
| 320 |
# Left Column: Chat Interface (60%)
|
| 321 |
-
with gr.Column(scale=
|
| 322 |
chatbot = gr.Chatbot(
|
| 323 |
label="Percakapan",
|
| 324 |
type="messages",
|
|
|
|
| 310 |
)
|
| 311 |
|
| 312 |
# ===== Tab 2: Chat Interface =====
|
| 313 |
+
with gr.Tab("💬 Chat"):
|
| 314 |
gr.Markdown("""
|
| 315 |
+
### Interactive Chat with the Document Viewer, or General-Knowledge Chat without Sources
|
| 316 |
+
Ask a question and see the source citations directly in the document, or ask general-knowledge questions without sources.
|
| 317 |
""")
|
| 318 |
|
| 319 |
with gr.Row():
|
| 320 |
# Left Column: Chat Interface (60%)
|
| 321 |
+
with gr.Column(scale=5):
|
| 322 |
chatbot = gr.Chatbot(
|
| 323 |
label="Percakapan",
|
| 324 |
type="messages",
|