darkbreakerk committed on
Commit c602e04 · verified · 1 Parent(s): 17e5404

Create svg html English presentation for this system


# 🎯 DOCUMENT PROCESSING SYSTEM WITH SECTION CONTEXT RECOVERY

## 📌 SYSTEM OVERVIEW

### Purpose
The system processes and stores markdown documents, with the ability to:
- Split documents into structured **sections**
- Chunk content intelligently based on token counts
- **Recover the full context** of the original section at search time
- Run high-performance hybrid search (Vector + BM25)

### Architecture overview
```
┌─────────────────┐
│   FastAPI App   │ ← API endpoints
│   (main.py)     │
└────────┬────────┘
         │
    ┌────▼─────────────────────────────┐
    │   Celery Worker (worker.py)      │
    │   - Asynchronous processing      │
    │   - Job queue management         │
    └────┬─────────────────────────────┘
         │
    ┌────▼──────────────────────────────┐
    │  ProcessPipeline                  │
    │  - Workflow orchestration         │
    └────┬──────────────────────────────┘
         │
    ┌────▼──────────────────────────────┐
    │  MarkdownProcessor                │
    │  - TextCleaner (split sections)   │
    │  - RaptorProcessor (chunking)     │
    └────┬──────────────────────────────┘
         │
    ┌────▼──────────────────────────────┐
    │  Storage Layer                    │
    │  ┌──────────────┐ ┌─────────────┐│
    │  │ PostgreSQL   │ │   Milvus    ││
    │  │ (Sections)   │ │  (Vectors)  ││
    │  └──────────────┘ └─────────────┘│
    └───────────────────────────────────┘
```

---

## 🔄 DETAILED PROCESSING WORKFLOW

### 1️⃣ Intake stage (API Layer)

**Endpoint**: `POST /index-file/`

```python
# src/api/index_routes.py
async def process_extracted_file(request: MinioFileRequest)
```

**Processing flow**:
1. Receive a request listing the markdown files stored in SeaweedFS
2. Create a job ID for each file
3. Push each job onto the Celery queue
4. Return a response immediately (202 Accepted)

**Input**:
```json
{
  "list_path": [
    {
      "id": "uuid-file-1",
      "path": "extracted/document.md"
    }
  ]
}
```
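The intake steps can be sketched with the standard library alone. This is only an illustration under stated assumptions: `Job`, `enqueue_files`, and the in-memory `queue.Queue` are hypothetical stand-ins for the real FastAPI handler and the Celery queue, neither of which is shown in full here.

```python
import queue
import uuid
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str   # generated once per file
    file_id: str  # "id" from the request item
    path: str     # "path" from the request item


def enqueue_files(payload: dict, job_queue: queue.Queue) -> dict:
    """Create one job per file, queue it, and return without waiting."""
    jobs = []
    for item in payload["list_path"]:
        job = Job(job_id=str(uuid.uuid4()), file_id=item["id"], path=item["path"])
        job_queue.put(job)  # stand-in for pushing a task onto Celery
        jobs.append(job)
    # A real handler would send this body with HTTP status 202 Accepted.
    return {
        "status": "accepted",
        "jobs": [{"job_id": j.job_id, "file_id": j.file_id} for j in jobs],
    }


pending = queue.Queue()
response = enqueue_files(
    {"list_path": [{"id": "uuid-file-1", "path": "extracted/document.md"}]},
    pending,
)
```

The point of the design is that the caller gets job identifiers back right away, while the actual processing happens later in the worker.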

---

### 2️⃣ Asynchronous processing stage (Worker)

**File**: `worker.py`

**Task**: `process_extracted_file_task`

**Steps**:
1. **Update status**: Publish the "processing" status via Redis
2. **Fetch the file**: Retrieve the markdown content from SeaweedFS
3. **Process the content**: Call ProcessPipeline
4. **Finish**: Publish the "completed" status

```python
# worker.py (line 147-194)
@celery_app.task(bind=True, name="process_extracted_file_task")
def process_extracted_file_task(self, file_id, bucket_name,
                                 markdown_object_name, original_file_path):
    # 1. Update status
    REDIS_SERVICE.publish_processing_status(file_id, original_file_path)
   
    # 2. Retrieve markdown
    markdown_content = SEAWEEDFS_SERVICE.process_file(bucket_name, output_path)
   
    # 3. Process with pipeline
    result = PROCESS_PIPELINE.process_markdown_text(
        file_id=file_id,
        text=markdown_content,
        document_path=original_file_path
    )
   
    # 4. Publish completion
    REDIS_SERVICE.publish_done_status(file_id, original_file_path)
```

---

### 3️⃣ Pipeline processing stage

**File**: `src/pipelines/process_pipeline.py`

**Class**: `ProcessPipeline`

**Responsibility**: Orchestrates the entire processing flow

```python
async def process_markdown_text(self, file_id, text, document_path):
    # Delegate the processing to MarkdownProcessor
    result = await self.markdown_processor.process_text(
        markdown_text=text,
        file_id=file_id,
        document_path=document_path,
        store_vectors=True,
        use_raptor=True
    )
    return result
```
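The worker task shown earlier is a plain synchronous function, while `process_markdown_text` is declared `async`. The excerpts do not show how the two are bridged, so the following is only one common approach, assuming the task runs outside any event loop: drive the coroutine with `asyncio.run`. `FakePipeline` and `run_pipeline_sync` are hypothetical names used for illustration.

```python
import asyncio


class FakePipeline:
    """Hypothetical stand-in for ProcessPipeline with an async entry point."""

    async def process_markdown_text(self, file_id, text, document_path):
        await asyncio.sleep(0)  # placeholder for real asynchronous work
        return {"file_id": file_id, "document_path": document_path, "chars": len(text)}


def run_pipeline_sync(pipeline, file_id, text, document_path):
    """Run the coroutine to completion from synchronous code."""
    return asyncio.run(
        pipeline.process_markdown_text(
            file_id=file_id, text=text, document_path=document_path
        )
    )


result = run_pipeline_sync(
    FakePipeline(), "uuid-file-1", "# Title\nBody", "input/document.pdf"
)
```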

---

## 🎯 KEY FEATURE: SECTION-BASED PROCESSING

### 4️⃣ Section splitting stage (TextCleaner)

**File**: `src/processors/cleaner.py`

**Class**: `TextCleaner`

#### 🔍 Analyzing the document structure

**Main method**: `split_into_sections()`

**How it works**:

1. **Identify the structure**:
   - Detect markdown headers (`#`, `##`, `###`, ...)
   - Track page markers (`[PAGE:1]`, `[PAGE:2]`, ...)
   - Determine the boundaries between sections

2. **Create structured sections**:

```python
# src/processors/cleaner.py (line 42-103)
import re

def split_into_sections(self, text, file_id, document_path):
    sections = []
    lines = text.splitlines()
    current_page = 1
    i = 0

    # Walk through the lines one by one
    while i < len(lines):
        line = lines[i]
        i += 1

        # Page marker detected: remember the current page
        if page_match := re.search(self.page_marker_pattern, line):
            current_page = int(page_match.group(1))
            continue

        # Header detected
        if header_match := re.match(r'^(#{1,6})\s+(.+)$', line):
            header_level = len(header_match.group(1))
            title = header_match.group(2).strip()

            # Collect content until the next header
            content_lines = []
            while i < len(lines):
                next_line = lines[i]
                if is_next_header(next_line, header_level):
                    break
                content_lines.append(next_line)
                i += 1

            # Create the section
            content = "\n".join(content_lines)
            self._create_section(sections, title, content, current_page)

    return sections
```
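For experimenting with the idea outside the class, here is a simplified, self-contained sketch. It is not the `TextCleaner` implementation: it assumes the `[PAGE:n]` marker format listed above, starts a new section at every header regardless of level, and drops any text that appears before the first header. The name `split_sections` and the returned dictionary keys are illustrative.

```python
import re

PAGE_MARKER = re.compile(r"\[PAGE:(\d+)\]")
HEADER = re.compile(r"^(#{1,6})\s+(.+)$")


def split_sections(text: str) -> list[dict]:
    """Split markdown into flat sections, tagging each with its starting page."""
    sections = []
    current = None      # section currently being collected, if any
    current_page = 1

    for line in text.splitlines():
        page_match = PAGE_MARKER.search(line)
        if page_match:
            current_page = int(page_match.group(1))
            continue  # page markers are metadata, not content

        header_match = HEADER.match(line)
        if header_match:
            if current is not None:
                sections.append(current)
            current = {
                "title": header_match.group(2).strip(),
                "level": len(header_match.group(1)),
                "page": current_page,
                "lines": [],
            }
        elif current is not None:
            current["lines"].append(line)

    if current is not None:
        sections.append(current)

    # Join the collected lines into the final content string
    for section in sections:
        section["content"] = "\n".join(section.pop("lines")).strip()
    return sections


sample = "[PAGE:1]\n# Intro\nHello\n[PAGE:2]\n## Details\nMore text\n"
sections = split_sections(sample)
```

With this sample, the first section starts on page 1 and the second on page 2, because the page marker is seen before the second header.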

#### 📦 Structure of a generated section

**Method**: `_create_section()`

```python
# src/processors/cleaner.py (line 105-131)
import uuid

def _create_section(self, sections, title, content, page):
    section_id = str(uuid.uuid4())  # ← Unique ID for the section
    content = title + "\n" + content

    sections.append({
        "content": content,  # ← Full content of the section
        "metadata": {
            "document_id": self.current_document_id,
            "document_path": self.current_document_path,
            "section_id": section_id,  # ← KEY POINT: Section ID
            "page_info": {
                "index": page - 1,  # zero-based page index
                "total": self.current_total_pages
            },
            "section_content": content  # ← Keeps a copy of the original content
        }
    })
```

**Example of a generated section**:
```json
{
  "content": "## Introduction\nThis is the introduction to the system...",
  "metadata": {
    "document_id": "550e8400-e29b-41d4-a716-446655440000",
    "document_path": "input/document.pdf",
    "section_id": "123e4567-e89b-12d3-a456-426614174000",
    "page_info": {
      "index": 0,
      "total": 10
    },
    "section_content": "## Introduction\nThis is the introduction..."
  }
}
```

---

### 5️⃣ Smart chunking stage (RaptorProcessor)

**File**: `src/services/raptor_processor.py`

**Class**: `RaptorProcessor`

#### 🎯 Chunking goals

**Problem**: Sections can be too long (exceeding the embedding model's token limit)

**Solution**: Split sections into smaller chunks **while keeping the link to the original section**

#### 📊 Chunking procedure

**Method**: `create_knowledge_base()`

```python
# src/services/raptor_processor.py (line 82-157)
import uuid

async def create_knowledge_base(self, sections):
    # STEP 1: Store every section in PostgreSQL
    for section in sections:
        content = section["content"]
        metadata = section["metadata"]

        self.postgres_service.store_section(
            section_id=metadata["section_id"],  # ← Stored under its section_id
            document_id=metadata["document_id"],
            content=content  # ← Full content of the section
        )

    # STEP 2: Convert the sections to LangChain Documents
    documents = self._convert_sections_to_documents(sections)

    # STEP 3: Smart chunking
    all_chunks = []
    for doc in documents:
        token_count = self.count_tokens(doc.page_content)

        if token_count <= self.chunk_size:
            # Section is small enough → keep it as is
            all_chunks.append(doc)
        else:
            # Section is too large → split it
            split_docs = self.text_splitter.split_documents([doc])
            all_chunks.extend(split_docs)

    # STEP 4: Add a chunk_id to each chunk
    for chunk in all_chunks:
        chunk.metadata.update({
            "chunk_id": str(uuid.uuid4())  # ← Separate ID for the chunk
        })
        # Note: section_id is carried over unchanged from the original metadata

    # STEP 5: Validate and store in Milvus
    validated_chunks = self._normalize_and_validate_chunks(all_chunks)
    self.vector_storage_service.store_documents(validated_chunks)
```
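The link between chunks and their parent section is what makes context recovery possible. The sketch below illustrates that idea without any external service. It assumes whitespace-separated words as a stand-in for the real tokenizer, a plain dictionary in place of PostgreSQL, and a list of chunk dictionaries in place of Milvus search hits; `chunk_section` and `recover_sections` are hypothetical helper names.

```python
import uuid


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def chunk_section(section: dict, chunk_size: int) -> list[dict]:
    """Split one section into chunks of at most chunk_size tokens."""
    if count_tokens(section["content"]) <= chunk_size:
        pieces = [section["content"]]  # small enough: keep as a single chunk
    else:
        words = section["content"].split()
        pieces = [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]
    return [
        {
            "chunk_id": str(uuid.uuid4()),        # separate ID per chunk
            "section_id": section["section_id"],  # link back to the parent
            "text": piece,
        }
        for piece in pieces
    ]


def recover_sections(hits: list[dict], section_store: dict) -> list[str]:
    """Map chunk hits to full section contents, one entry per section."""
    seen = []
    for hit in hits:
        if hit["section_id"] not in seen:
            seen.append(hit["section_id"])  # keep first-hit order, no duplicates
    return [section_store[section_id] for section_id in seen]


section = {"section_id": "sec-1", "content": "one two three four five"}
section_store = {"sec-1": section["content"]}  # stand-in for PostgreSQL
chunks = chunk_section(section, chunk_size=2)  # stand-in for Milvus rows
recovered = recover_sections(chunks, section_store)
```

Here three chunks all point at `sec-1`, so a search that matches any of them can return the whole section exactly once.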

#### 🔑

Files changed (6)
  1. README.md +8 -5
  2. components/footer.js +100 -0
  3. components/navbar.js +168 -0
  4. index.html +266 -19
  5. script.js +38 -0
  6. style.css +33 -19
README.md CHANGED
@@ -1,10 +1,13 @@
  ---
- title: Docusense Explorer
- emoji: 🏆
- colorFrom: pink
- colorTo: purple
+ title: DocuSense Explorer 🚀
+ colorFrom: blue
+ colorTo: pink
+ emoji: 🐳
  sdk: static
  pinned: false
+ tags:
+ - deepsite-v3
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Welcome to your new DeepSite project!
+ This project was created with [DeepSite](https://huggingface.co/deepsite).
components/footer.js ADDED
@@ -0,0 +1,100 @@
+ class CustomFooter extends HTMLElement {
+     connectedCallback() {
+         this.attachShadow({ mode: 'open' });
+         this.shadowRoot.innerHTML = `
+             <style>
+                 footer {
+                     @apply bg-gray-50 dark:bg-gray-800 border-t border-gray-200 dark:border-gray-700;
+                 }
+
+                 .container {
+                     @apply max-w-7xl mx-auto px-4 sm:px-6 lg:px-8 py-8;
+                 }
+
+                 .footer-content {
+                     @apply grid grid-cols-1 md:grid-cols-3 gap-8;
+                 }
+
+                 .footer-section h3 {
+                     @apply text-lg font-semibold text-gray-900 dark:text-white mb-4;
+                 }
+
+                 .footer-links {
+                     @apply space-y-3;
+                 }
+
+                 .footer-link {
+                     @apply text-gray-600 dark:text-gray-300 hover:text-gray-900 dark:hover:text-white;
+                 }
+
+                 .footer-bottom {
+                     @apply mt-8 pt-8 border-t border-gray-200 dark:border-gray-700
+                         flex flex-col md:flex-row justify-between items-center;
+                 }
+
+                 .copyright {
+                     @apply text-gray-500 dark:text-gray-400 text-sm;
+                 }
+
+                 .social-links {
+                     @apply flex space-x-4 mt-4 md:mt-0;
+                 }
+
+                 .social-link {
+                     @apply text-gray-400 dark:text-gray-300 hover:text-gray-500 dark:hover:text-gray-200;
+                 }
+             </style>
+
+             <footer>
+                 <div class="container">
+                     <div class="footer-content">
+                         <div class="footer-section">
+                             <h3>DocuSense Explorer</h3>
+                             <p class="text-gray-600 dark:text-gray-300">
+                                 Advanced document processing with intelligent context recovery for better search and retrieval.
+                             </p>
+                         </div>
+
+                         <div class="footer-section">
+                             <h3>Resources</h3>
+                             <div class="footer-links">
+                                 <a href="#" class="footer-link">Documentation</a>
+                                 <a href="#" class="footer-link">API Reference</a>
+                                 <a href="#" class="footer-link">GitHub Repository</a>
+                             </div>
+                         </div>
+
+                         <div class="footer-section">
+                             <h3>Contact</h3>
+                             <div class="footer-links">
+                                 <a href="#" class="footer-link">Email Us</a>
+                                 <a href="#" class="footer-link">Support Portal</a>
+                                 <a href="#" class="footer-link">Feedback</a>
+                             </div>
+                         </div>
+                     </div>
+
+                     <div class="footer-bottom">
+                         <p class="copyright">
+                             © ${new Date().getFullYear()} DocuSense Explorer. All rights reserved.
+                         </p>
+
+                         <div class="social-links">
+                             <a href="#" class="social-link">
+                                 <i data-feather="github"></i>
+                             </a>
+                             <a href="#" class="social-link">
+                                 <i data-feather="twitter"></i>
+                             </a>
+                             <a href="#" class="social-link">
+                                 <i data-feather="linkedin"></i>
+                             </a>
+                         </div>
+                     </div>
+                 </div>
+             </footer>
+         `;
+     }
+ }
+
+ customElements.define('custom-footer', CustomFooter);
components/navbar.js ADDED
@@ -0,0 +1,168 @@
+ class CustomNavbar extends HTMLElement {
+     connectedCallback() {
+         this.attachShadow({ mode: 'open' });
+         this.shadowRoot.innerHTML = `
+             <style>
+                 nav {
+                     @apply bg-white dark:bg-gray-800 shadow-sm;
+                 }
+
+                 .container {
+                     @apply max-w-7xl mx-auto px-4 sm:px-6 lg:px-8;
+                 }
+
+                 .nav-content {
+                     @apply flex justify-between items-center h-16;
+                 }
+
+                 .logo {
+                     @apply flex-shrink-0 flex items-center text-blue-600 dark:text-blue-400 font-bold text-xl;
+                 }
+
+                 .nav-links {
+                     @apply hidden md:ml-6 md:flex md:space-x-8;
+                 }
+
+                 .nav-link {
+                     @apply inline-flex items-center px-1 pt-1 border-b-2 border-transparent
+                         text-gray-500 dark:text-gray-300 hover:text-gray-700 dark:hover:text-gray-100
+                         hover:border-gray-300 dark:hover:border-gray-500 text-sm font-medium;
+                 }
+
+                 .nav-link.active {
+                     @apply border-blue-500 text-gray-900 dark:text-white;
+                 }
+
+                 .mobile-menu-button {
+                     @apply inline-flex items-center justify-center p-2 rounded-md
+                         text-gray-400 hover:text-gray-500 dark:hover:text-gray-300
+                         hover:bg-gray-100 dark:hover:bg-gray-700 focus:outline-none;
+                 }
+
+                 .mobile-menu {
+                     @apply md:hidden;
+                 }
+
+                 .mobile-menu-items {
+                     @apply pt-2 pb-3 space-y-1;
+                 }
+
+                 .mobile-menu-link {
+                     @apply block pl-3 pr-4 py-2 border-l-4 border-transparent
+                         text-gray-500 dark:text-gray-300 hover:text-gray-700 dark:hover:text-gray-100
+                         hover:bg-gray-50 dark:hover:bg-gray-700 hover:border-gray-300
+                         dark:hover:border-gray-500 text-base font-medium;
+                 }
+
+                 .mobile-menu-link.active {
+                     @apply border-blue-500 bg-blue-50 dark:bg-blue-900/20
+                         text-blue-700 dark:text-blue-300;
+                 }
+
+                 .theme-toggle {
+                     @apply p-2 rounded-full text-gray-400 hover:text-gray-500
+                         dark:hover:text-gray-300 hover:bg-gray-100 dark:hover:bg-gray-700;
+                 }
+
+                 .hidden {
+                     display: none;
+                 }
+             </style>
+
+             <nav>
+                 <div class="container">
+                     <div class="nav-content">
+                         <div class="flex items-center">
+                             <a href="/" class="logo">
+                                 <i data-feather="file-text" class="mr-2"></i>
+                                 DocuSense
+                             </a>
+                         </div>
+
+                         <div class="nav-links">
+                             <a href="#overview" class="nav-link">Overview</a>
+                             <a href="#demo" class="nav-link">Workflow</a>
+                             <a href="#benefits" class="nav-link">Benefits</a>
+                         </div>
+
+                         <div class="flex items-center space-x-2">
+                             <button class="theme-toggle" onclick="toggleDarkMode()">
+                                 <i data-feather="moon" class="dark:hidden"></i>
+                                 <i data-feather="sun" class="hidden dark:block"></i>
+                             </button>
+
+                             <button class="mobile-menu-button md:hidden" aria-expanded="false">
+                                 <i data-feather="menu"></i>
+                             </button>
+                         </div>
+                     </div>
+                 </div>
+
+                 <!-- Mobile menu -->
+                 <div class="mobile-menu hidden">
+                     <div class="container">
+                         <div class="mobile-menu-items">
+                             <a href="#overview" class="mobile-menu-link">Overview</a>
+                             <a href="#demo" class="mobile-menu-link">Workflow</a>
+                             <a href="#benefits" class="mobile-menu-link">Benefits</a>
+                         </div>
+                     </div>
+                 </div>
+             </nav>
+
+             <script>
+                 // Mobile menu toggle
+                 document.querySelector('.mobile-menu-button').addEventListener('click', function() {
+                     const menu = document.querySelector('.mobile-menu');
+                     menu.classList.toggle('hidden');
+
+                     // Toggle icon between menu and x
+                     const icon = this.querySelector('i');
+                     if (menu.classList.contains('hidden')) {
+                         icon.setAttribute('data-feather', 'menu');
+                     } else {
+                         icon.setAttribute('data-feather', 'x');
+                     }
+                     feather.replace();
+                 });
+
+                 // Close mobile menu when clicking a link
+                 document.querySelectorAll('.mobile-menu-link').forEach(link => {
+                     link.addEventListener('click', function() {
+                         document.querySelector('.mobile-menu').classList.add('hidden');
+                         document.querySelector('.mobile-menu-button i').setAttribute('data-feather', 'menu');
+                         feather.replace();
+                     });
+                 });
+
+                 // Update active link based on scroll position
+                 window.addEventListener('scroll', function() {
+                     const sections = ['overview', 'demo', 'benefits'];
+                     let currentSection = '';
+
+                     sections.forEach(section => {
+                         const element = document.getElementById(section);
+                         if (element) {
+                             const rect = element.getBoundingClientRect();
+                             if (rect.top <= 100 && rect.bottom >= 100) {
+                                 currentSection = section;
+                             }
+                         }
+                     });
+
+                     // Update active state
+                     document.querySelectorAll('.nav-link, .mobile-menu-link').forEach(link => {
+                         const href = link.getAttribute('href').substring(1);
+                         if (href === currentSection) {
+                             link.classList.add('active');
+                         } else {
+                             link.classList.remove('active');
+                         }
+                     });
+                 });
+             </script>
+         `;
+     }
+ }
+
+ customElements.define('custom-navbar', CustomNavbar);
index.html CHANGED
@@ -1,19 +1,266 @@
- <!doctype html>
- <html>
-     <head>
-         <meta charset="utf-8" />
-         <meta name="viewport" content="width=device-width" />
-         <title>My static Space</title>
-         <link rel="stylesheet" href="style.css" />
-     </head>
-     <body>
-         <div class="card">
-             <h1>Welcome to your static Space!</h1>
-             <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
-             <p>
-                 Also don't forget to check the
-                 <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
-             </p>
-         </div>
-     </body>
- </html>
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+     <title>DocuSense Explorer - Document Processing System</title>
+     <link rel="stylesheet" href="style.css">
+     <script src="https://cdn.tailwindcss.com"></script>
+     <script src="https://cdn.jsdelivr.net/npm/feather-icons/dist/feather.min.js"></script>
+     <script src="https://unpkg.com/feather-icons"></script>
+     <script src="components/navbar.js"></script>
+     <script src="components/footer.js"></script>
+ </head>
+ <body class="bg-gray-50 dark:bg-gray-900 min-h-screen flex flex-col">
+     <custom-navbar></custom-navbar>
+
+     <main class="flex-grow container mx-auto px-4 py-8">
+         <div class="max-w-5xl mx-auto">
+             <!-- Hero Section -->
+             <section class="mb-16 text-center">
+                 <div class="bg-gradient-to-r from-blue-600 to-indigo-700 text-white p-8 rounded-xl shadow-2xl">
+                     <h1 class="text-4xl md:text-5xl font-bold mb-4">Document Processing <br>with Context Recovery</h1>
+                     <p class="text-xl mb-8 max-w-3xl mx-auto">Smart section-based processing with full context retrieval</p>
+                     <div class="flex justify-center gap-4">
+                         <a href="#overview" class="bg-white text-blue-700 px-6 py-3 rounded-lg font-medium hover:bg-blue-50 transition">Learn More</a>
+                         <a href="#demo" class="border-2 border-white text-white px-6 py-3 rounded-lg font-medium hover:bg-white hover:text-blue-700 transition">See Demo</a>
+                     </div>
+                 </div>
+             </section>
+
+             <!-- System Overview -->
+             <section id="overview" class="mb-16">
+                 <h2 class="text-3xl font-bold mb-8 text-center dark:text-white">System Architecture</h2>
+
+                 <div class="relative">
+                     <!-- Architecture SVG -->
+                     <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow-lg">
+                         <img src="architecture.svg" alt="System Architecture Diagram" class="w-full h-auto rounded-lg">
+                     </div>
+
+                     <!-- Key Features -->
+                     <div class="grid md:grid-cols-2 lg:grid-cols-3 gap-6 mt-8">
+                         <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow hover:shadow-lg transition">
+                             <div class="bg-blue-100 dark:bg-blue-900 w-12 h-12 rounded-full flex items-center justify-center mb-4">
+                                 <i data-feather="layers" class="text-blue-600 dark:text-blue-300"></i>
+                             </div>
+                             <h3 class="text-xl font-semibold mb-2 dark:text-white">Section-Based Processing</h3>
+                             <p class="text-gray-600 dark:text-gray-300">Documents are intelligently split into meaningful sections while preserving hierarchy.</p>
+                         </div>
+
+                         <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow hover:shadow-lg transition">
+                             <div class="bg-purple-100 dark:bg-purple-900 w-12 h-12 rounded-full flex items-center justify-center mb-4">
+                                 <i data-feather="cpu" class="text-purple-600 dark:text-purple-300"></i>
+                             </div>
+                             <h3 class="text-xl font-semibold mb-2 dark:text-white">Context Recovery</h3>
+                             <p class="text-gray-600 dark:text-gray-300">Retrieve the full original section content even when searching small chunks.</p>
+                         </div>
+
+                         <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow hover:shadow-lg transition">
+                             <div class="bg-green-100 dark:bg-green-900 w-12 h-12 rounded-full flex items-center justify-center mb-4">
+                                 <i data-feather="zap" class="text-green-600 dark:text-green-300"></i>
+                             </div>
+                             <h3 class="text-xl font-semibold mb-2 dark:text-white">Hybrid Search</h3>
+                             <p class="text-gray-600 dark:text-gray-300">Combines vector similarity and BM25 for precise and relevant results.</p>
+                         </div>
+                     </div>
+                 </div>
+             </section>
+
+             <!-- Workflow Demo -->
+             <section id="demo" class="mb-16">
+                 <h2 class="text-3xl font-bold mb-8 text-center dark:text-white">Workflow Demonstration</h2>
+
+                 <div class="bg-white dark:bg-gray-800 rounded-xl shadow-lg overflow-hidden">
+                     <div class="grid md:grid-cols-2 gap-0">
+                         <!-- Step Navigation -->
+                         <div class="bg-gray-50 dark:bg-gray-700 p-6">
+                             <div class="sticky top-6">
+                                 <h3 class="text-xl font-semibold mb-4 dark:text-white">Process Steps</h3>
+                                 <div class="space-y-2">
+                                     <button class="workflow-step active" data-step="1">1. Document Upload</button>
+                                     <button class="workflow-step" data-step="2">2. Section Splitting</button>
+                                     <button class="workflow-step" data-step="3">3. Chunk Generation</button>
+                                     <button class="workflow-step" data-step="4">4. Storage</button>
+                                     <button class="workflow-step" data-step="5">5. Search Process</button>
+                                 </div>
+                             </div>
+                         </div>
+
+                         <!-- Step Content -->
+                         <div class="p-6">
+                             <div class="workflow-content" data-step-content="1">
+                                 <h3 class="text-xl font-semibold mb-4 dark:text-white">Document Upload</h3>
+                                 <p class="mb-4 text-gray-600 dark:text-gray-300">Users upload documents through the API endpoint:</p>
+                                 <pre class="bg-gray-100 dark:bg-gray-900 rounded-lg p-4 mb-4 overflow-x-auto">
+ <code class="text-sm">
+ POST /index-file/
+ {
+   "list_path": [
+     {
+       "id": "uuid-file-1",
+       "path": "extracted/document.md"
+     }
+   ]
+ }
+ </code>
+                                 </pre>
+                                 <p class="text-gray-600 dark:text-gray-300">The system creates a unique job ID for processing and returns immediately.</p>
+                             </div>
+
+                             <div class="workflow-content hidden" data-step-content="2">
+                                 <h3 class="text-xl font-semibold mb-4 dark:text-white">Section Splitting</h3>
+                                 <p class="mb-4 text-gray-600 dark:text-gray-300">Documents are split into hierarchical sections:</p>
+                                 <div class="bg-gray-100 dark:bg-gray-900 rounded-lg p-4 mb-4">
+                                     <div class="flex items-center mb-2">
+                                         <div class="w-2 h-2 bg-blue-500 rounded-full mr-2"></div>
+                                         <span class="font-mono text-sm"># Main Title</span>
+                                     </div>
+                                     <div class="flex items-center mb-2 ml-4">
+                                         <div class="w-2 h-2 bg-purple-500 rounded-full mr-2"></div>
+                                         <span class="font-mono text-sm">## Section 1</span>
+                                     </div>
+                                     <div class="flex items-center mb-2 ml-8">
+                                         <div class="w-2 h-2 bg-green-500 rounded-full mr-2"></div>
+                                         <span class="font-mono text-sm">### Subsection 1.1</span>
+                                     </div>
+                                     <div class="flex items-center ml-4">
+                                         <div class="w-2 h-2 bg-purple-500 rounded-full mr-2"></div>
+                                         <span class="font-mono text-sm">## Section 2</span>
+                                     </div>
+                                 </div>
+                                 <p class="text-gray-600 dark:text-gray-300">Each section maintains its page information and document context.</p>
+                             </div>
+
+                             <div class="workflow-content hidden" data-step-content="3">
+                                 <h3 class="text-xl font-semibold mb-4 dark:text-white">Chunk Generation</h3>
+                                 <p class="mb-4 text-gray-600 dark:text-gray-300">Large sections are split into optimally-sized chunks:</p>
+                                 <div class="flex items-start mb-4">
+                                     <div class="border-r-2 border-gray-300 dark:border-gray-600 pr-4 mr-4">
+                                         <div class="bg-blue-100 dark:bg-blue-900 px-3 py-2 rounded-lg mb-2">
+                                             <p class="text-sm">Section ID: 123e4567</p>
+                                         </div>
+                                         <div class="space-y-2">
+                                             <div class="bg-gray-200 dark:bg-gray-700 px-3 py-2 rounded-lg">
+                                                 <p class="text-sm">Chunk 1 (Tokens: 512)</p>
+                                             </div>
+                                             <div class="bg-gray-200 dark:bg-gray-700 px-3 py-2 rounded-lg">
+                                                 <p class="text-sm">Chunk 2 (Tokens: 498)</p>
+                                             </div>
+                                         </div>
+                                     </div>
+                                     <div>
+                                         <p class="text-gray-600 dark:text-gray-300 text-sm">Each chunk references its parent section while being small enough for efficient vector search.</p>
+                                     </div>
+                                 </div>
+                             </div>
+
+                             <div class="workflow-content hidden" data-step-content="4">
+                                 <h3 class="text-xl font-semibold mb-4 dark:text-white">Storage</h3>
+                                 <div class="grid grid-cols-1 md:grid-cols-2 gap-4 mb-4">
+                                     <div class="bg-gray-100 dark:bg-gray-900 p-4 rounded-lg">
+                                         <div class="flex items-center mb-2">
+                                             <i data-feather="database" class="mr-2 text-blue-500"></i>
+                                             <h4 class="font-medium">PostgreSQL</h4>
+                                         </div>
+                                         <p class="text-sm text-gray-600 dark:text-gray-300">Stores complete section content with metadata</p>
+                                     </div>
+                                     <div class="bg-gray-100 dark:bg-gray-900 p-4 rounded-lg">
+                                         <div class="flex items-center mb-2">
+                                             <i data-feather="box" class="mr-2 text-purple-500"></i>
+                                             <h4 class="font-medium">Milvus</h4>
+                                         </div>
+                                         <p class="text-sm text-gray-600 dark:text-gray-300">Stores vector embeddings of chunks with section references</p>
+                                     </div>
+                                 </div>
+                             </div>
+
+                             <div class="workflow-content hidden" data-step-content="5">
+                                 <h3 class="text-xl font-semibold mb-4 dark:text-white">Search Process</h3>
+                                 <ol class="space-y-4">
+                                     <li class="flex items-start">
+                                         <div class="bg-blue-500 text-white rounded-full w-6 h-6 flex items-center justify-center mr-3 mt-0.5 flex-shrink-0">1</div>
+                                         <div>
+                                             <p class="font-medium dark:text-white">Query Processing</p>
+                                             <p class="text-sm text-gray-600 dark:text-gray-300">Convert search query to embeddings and search Milvus</p>
+                                         </div>
+                                     </li>
+                                     <li class="flex items-start">
+                                         <div class="bg-blue-500 text-white rounded-full w-6 h-6 flex items-center justify-center mr-3 mt-0.5 flex-shrink-0">2</div>
+                                         <div>
+                                             <p class="font-medium dark:text-white">Result Aggregation</p>
+                                             <p class="text-sm text-gray-600 dark:text-gray-300">Group chunks by their section_id and score</p>
+                                         </div>
+                                     </li>
+                                     <li class="flex items-start">
+                                         <div class="bg-blue-500 text-white rounded-full w-6 h-6 flex items-center justify-center mr-3 mt-0.5 flex-shrink-0">3</div>
+                                         <div>
+                                             <p class="font-medium dark:text-white">Context Retrieval</p>
+                                             <p class="text-sm text-gray-600 dark:text-gray-300">Fetch full section content from PostgreSQL</p>
+                                         </div>
+                                     </li>
+                                 </ol>
+                             </div>
+                         </div>
+                     </div>
+                 </div>
+             </section>
+
+             <!-- Benefits Section -->
+             <section class="mb-16">
+                 <h2 class="text-3xl font-bold mb-8 text-center dark:text-white">System Benefits</h2>
+
+                 <div class="grid md:grid-cols-3 gap-6">
+                     <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow-lg border-t-4 border-blue-500">
+                         <h3 class="text-xl font-semibold mb-3 dark:text-white">Full Context Recovery</h3>
+                         <p class="text-gray-600 dark:text-gray-300">Retrieve complete section content even when searching small chunks, ensuring no context is lost.</p>
+                     </div>
+
+                     <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow-lg border-t-4 border-purple-500">
+                         <h3 class="text-xl font-semibold mb-3 dark:text-white">Hybrid Search</h3>
+                         <p class="text-gray-600 dark:text-gray-300">Combine semantic vector search with keyword-based BM25 for more accurate results.</p>
+                     </div>
+
+                     <div class="bg-white dark:bg-gray-800 p-6 rounded-xl shadow-lg border-t-4 border-green-500">
+                         <h3 class="text-xl font-semibold mb-3 dark:text-white">Optimal Performance</h3>
+                         <p class="text-gray-600 dark:text-gray-300">Fast vector search on small chunks with instant retrieval of full sections from PostgreSQL.</p>
+                     </div>
+                 </div>
+             </section>
+         </div>
+     </main>
+
+     <custom-footer></custom-footer>
+
+     <script>
+         feather.replace();
+
+         // Workflow step navigation
+         document.addEventListener('DOMContentLoaded', function() {
+             const steps = document.querySelectorAll('.workflow-step');
+             const contents = document.querySelectorAll('.workflow-content');
+
+             steps.forEach(step => {
+                 step.addEventListener('click', function() {
+                     // Remove active class from all steps
+                     steps.forEach(s => s.classList.remove('active', 'bg-blue-500', 'text-white'));
+
+                     // Add active class to clicked step
+                     this.classList.add('active', 'bg-blue-500', 'text-white');
+
+                     // Hide all content
+                     contents.forEach(c => c.classList.add('hidden'));
+
+                     // Show corresponding content
+                     const stepNum = this.getAttribute('data-step');
+                     document.querySelector(`[data-step-content="${stepNum}"]`).classList.remove('hidden');
+                 });
+             });
+
+             // Activate first step by default
+             document.querySelector('.workflow-step').click();
+         });
+     </script>
+     <script src="https://huggingface.co/deepsite/deepsite-badge.js"></script>
+ </body>
+ </html>
script.js ADDED
@@ -0,0 +1,38 @@
+ // Theme switcher functionality
+ document.addEventListener('DOMContentLoaded', function() {
+     // Check for saved theme preference or use system preference
+     const prefersDark = window.matchMedia('(prefers-color-scheme: dark)').matches;
+     const savedTheme = localStorage.getItem('theme');
+
+     if (savedTheme === 'dark' || (!savedTheme && prefersDark)) {
+         document.documentElement.classList.add('dark');
+     }
+
+     // Smooth scrolling for anchor links
+     document.querySelectorAll('a[href^="#"]').forEach(anchor => {
+         anchor.addEventListener('click', function(e) {
+             e.preventDefault();
+
+             const targetId = this.getAttribute('href');
+             if (targetId === '#') return;
+
+             const targetElement = document.querySelector(targetId);
+             if (targetElement) {
+                 targetElement.scrollIntoView({
+                     behavior: 'smooth',
+                     block: 'start'
+                 });
+             }
+         });
+     });
+ });
+
+ // Function to toggle dark mode
+ function toggleDarkMode() {
+     const html = document.documentElement;
+     html.classList.toggle('dark');
+
+     // Save preference to localStorage
+     const isDark = html.classList.contains('dark');
+     localStorage.setItem('theme', isDark ? 'dark' : 'light');
+ }
style.css CHANGED
@@ -1,28 +1,42 @@
- body {
-     padding: 2rem;
-     font-family: -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
+ @tailwind base;
+ @tailwind components;
+ @tailwind utilities;
+
+ :root {
+     --primary: #3b82f6; /* blue-500 */
+     --secondary: #8b5cf6; /* purple-500 */
+ }
+
+ .dark {
+     --primary: #60a5fa; /* blue-400 */
+     --secondary: #a78bfa; /* purple-400 */
  }
 
- h1 {
-     font-size: 16px;
-     margin-top: 0;
+ /* Workflow Step Styling */
+ .workflow-step {
+     @apply w-full text-left px-4 py-2 rounded-lg transition;
+     @apply bg-gray-100 dark:bg-gray-600 hover:bg-gray-200 dark:hover:bg-gray-500;
+     @apply text-gray-700 dark:text-gray-200 font-medium;
  }
 
- p {
-     color: rgb(107, 114, 128);
-     font-size: 15px;
-     margin-bottom: 10px;
-     margin-top: 5px;
+ .workflow-step.active {
+     @apply bg-blue-500 text-white;
  }
 
- .card {
-     max-width: 620px;
-     margin: 0 auto;
-     padding: 16px;
-     border: 1px solid lightgray;
-     border-radius: 16px;
+ /* Custom scrollbar */
+ ::-webkit-scrollbar {
+     width: 8px;
+     height: 8px;
  }
 
- .card p:last-child {
-     margin-bottom: 0;
+ ::-webkit-scrollbar-track {
+     @apply bg-gray-100 dark:bg-gray-800;
  }
+
+ ::-webkit-scrollbar-thumb {
+     @apply bg-gray-400 dark:bg-gray-600 rounded-full;
+ }
+
+ ::-webkit-scrollbar-thumb:hover {
+     @apply bg-gray-500 dark:bg-gray-500;
+ }