---
title: Knowledge Engine
emoji: πŸ”
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
license: apache-2.0
pinned: false
---

# πŸ” Knowledge Engine

[![Spaces](https://img.shields.io/badge/Demo-Spaces-FF9D00?logo=huggingface)](https://huggingface.co/spaces/m97j/knowledge-engine)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg?logo=python)](https://www.python.org/downloads/release/python-3100/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)

> **High-performance Hybrid Search & Reranking Engine based on BGE-M3.**
> An advanced knowledge retrieval API system designed for Agentic AI, combining Dense/Sparse embeddings and optimizing precision with Cross-Encoders.

---

## πŸš€ Key Features
* **Hybrid Search (RRF):** Seamlessly combines Dense & Sparse vector retrieval using Qdrant's Native Fusion API (BGE-M3).
* **Cross-Encoder Re-ranking:** Ensures top-tier precision by re-ordering search results contextually via `bge-reranker-v2-m3`.
* **Agent-Ready Output:** Natively provides XML-tagged context blocks optimized for immediate injection into LLMs and Agentic workflows.
* **Auto-Healing & Sync:** Robust startup logic via FastAPI `lifespan` that automatically pulls pre-processed knowledge bases from Hugging Face Datasets and synchronizes them.
* **Clean Architecture:** Highly modularized layers (API, Service, Storage, Models) using Dependency Injection for superior maintainability.
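
The RRF fusion above is performed natively inside Qdrant, but the underlying idea is simple enough to sketch. Below is a minimal pure-Python illustration of Reciprocal Rank Fusion; the document IDs are hypothetical and this is not the engine's actual code path:

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF): each result list
# contributes 1 / (k + rank) to a document's fused score, so documents
# ranked well by BOTH dense and sparse retrieval rise to the top.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists by summed reciprocal-rank scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # semantic retrieval order
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # lexical retrieval order
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused[:2])  # ['doc_b', 'doc_a'] — both appeared in both lists
```

The constant `k` (60 is the value commonly used in the RRF literature) damps the influence of top ranks so no single retriever dominates.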

---

## πŸ— Project Structure
Follows the **Separation of Concerns (SoC)** principle to ensure the system remains extensible and testable.

```text
β”œβ”€β”€ api/          # API Routing & Schema Definitions
β”œβ”€β”€ core/         # Global Configuration (Pydantic V2) & Exception Handling
β”œβ”€β”€ models/       # AI Model Inference (Embedder, Reranker)
β”œβ”€β”€ services/     # Business Logic & Search Pipeline Orchestration
β”œβ”€β”€ storage/      # Infrastructure Layer (Qdrant, SQLite Clients)
β”œβ”€β”€ scripts/      # Data Pipeline & HF Dataset Sync Scripts
β”œβ”€β”€ templates/    # Demo UI (Jinja2 Templates)
└── main.py       # App Entry Point & Lifespan Management
```

---

## πŸ›  Tech Stack
* **Framework:** FastAPI
* **Vector DB:** Qdrant (Server Mode)
* **RDBMS:** SQLite (Metadata & Corpus Storage)
* **ML Models:**
    * [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) (Dense + Sparse Embedding)
    * [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) (Cross-Encoder)
* **DevOps:** Docker, GitHub Actions, Hugging Face Hub (Spaces & Datasets)
* **Corpus:** [FineWiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) (currently Korean Wikipedia only; enwiki, eswiki, etc. to be added later)

---

## πŸ”§ Installation & Setup

### Prerequisites
* Python 3.10+
* Hugging Face Access Token (For initial setup/updates)

### Running Locally
1. Clone the repository:
   ```bash
   git clone https://github.com/m97j/knowledge-engine.git
   cd knowledge-engine
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Run the application:
   *(The system will automatically download the pre-built SQLite and Qdrant DB files from HF Datasets on startup via `scripts/setup_db.py`)*
   ```bash
   python main.py
   # OR
   uvicorn main:app --host 0.0.0.0 --port 7860
   ```

### Preprocessing Pipeline (Optional)
If you want to build the knowledge base from scratch:
```bash
# 1. Download qdrant binary (Linux x86_64)
wget https://github.com/qdrant/qdrant/releases/download/v1.16.2/qdrant-x86_64-unknown-linux-gnu.tar.gz
tar -xvf qdrant-x86_64-unknown-linux-gnu.tar.gz
chmod +x qdrant

# 2. Execute Pipeline
python scripts/data_pipeline.py --lang en --chunk_batch_size 10000 --limit 50000 --batch_size 1024 --workers 4 --upload --repo_id user/id
```

---

## πŸ“‘ API Endpoints
| Method | Endpoint | Description |
| :--- | :--- | :--- |
| `GET` | `/` | Redirects to Search Demo UI |
| `POST` | `/api/v1/search/` | Executes JSON-based Hybrid Search (Returns structured JSON & LLM context) |
| `GET` | `/api/v1/system/health/ping` | System health check (Heartbeat) |
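
As a sketch, the search endpoint could be called as shown below. The request body fields (`query`, `top_k`) are assumptions based on typical search APIs, not a verified schema; consult the FastAPI-generated `/docs` page for the authoritative contract:

```python
# Hedged example of building a request for POST /api/v1/search/.
# The field names below are assumed, not taken from the real schema.
import json

def build_search_request(query: str, top_k: int = 5) -> dict:
    """Assemble the JSON body for the hybrid search endpoint (fields assumed)."""
    return {"query": query, "top_k": top_k}

payload = build_search_request("What is reciprocal rank fusion?")
print(json.dumps(payload))

# To actually send it (requires the server running on port 7860):
# import requests
# resp = requests.post("http://localhost:7860/api/v1/search/", json=payload)
# print(resp.json())  # structured hits plus the XML-tagged LLM context block
```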

---

## πŸ’‘ Architecture Insights
1.  **O(1) Metadata Mapping:** By storing massive text payloads in SQLite and only vectors/IDs in Qdrant, we achieve extremely low latency during the reranking preparation phase.
2.  **Zero-Downtime Deployment:** Optimized for PaaS environments (like HF Spaces) through a containerized Docker setup and a custom `start.sh` that ensures DB readiness before FastAPI starts.
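
The split described in point 1 (vectors and IDs in Qdrant, text payloads in SQLite) can be sketched as follows; the table and column names are illustrative, not the real schema. After the vector search returns only IDs, the full texts are fetched in a single indexed query before reranking:

```python
# Sketch of the metadata-mapping pattern: Qdrant returns only document IDs;
# the heavy text payloads live in SQLite and are fetched by primary key.
# Table/column names here are illustrative, not the engine's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (doc_id TEXT PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?)",
    [("doc_a", "First passage."), ("doc_b", "Second passage.")],
)

def fetch_payloads(conn: sqlite3.Connection, ids: list[str]) -> dict[str, str]:
    """One indexed primary-key query covering all hit IDs at once."""
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT doc_id, body FROM corpus WHERE doc_id IN ({placeholders})", ids
    )
    return dict(rows.fetchall())

hits = ["doc_b", "doc_a"]           # IDs as returned by the vector search
texts = fetch_payloads(conn, hits)  # texts then feed the cross-encoder reranker
print(texts["doc_b"])               # Second passage.
```

Keeping the payload fetch to one batched primary-key lookup is what keeps the reranking preparation phase cheap regardless of corpus size.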

---

## πŸ“„ Documentation
For more detailed technical documentation and design decisions:
* [Personal Archive Link](https://minjae-portfolio.vercel.app/projects/ke)
* [Technical Design Blog](https://minjae-portfolio.vercel.app/blogs/ke-pd)


---