m97j committed on
Commit 972e4b1 · 1 Parent(s): 14b2e5c

Update README.md

Files changed (1)
  1. README.md +38 -25
README.md CHANGED
@@ -15,30 +15,29 @@ pinned: false
  [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg?logo=python)](https://www.python.org/downloads/release/python-3100/)
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)

- > **High-performance Hybrid Search & Reranking Engine based on BGE-M3.** An advanced knowledge retrieval API system that combines Dense/Sparse embeddings and optimizes precision with Cross-Encoders.
-

  ---

  ## 🚀 Key Features
- * **Hybrid Search:** Seamlessly combines Dense & Sparse vector retrieval using Qdrant's Native Fusion API (BGE-M3).
- * **Re-ranking:** Ensures top-tier precision by re-ordering search results via Cross-Encoder models.
- * **Clean Architecture:** Highly modularized layers (API, Service, Storage, Models) for superior maintainability and scalability.
- * **CI/CD Pipeline:** Fully automated deployment to Hugging Face Spaces using GitHub Actions and Docker.
- * **Auto-Healing Data:** Robust startup logic via FastAPI `lifespan` that automatically synchronizes and validates the knowledge base.

  ---

  ## 🏗 Project Structure
- This project follows the **Separation of Concerns (SoC)** principle to ensure the system remains extensible and testable.

  ```text
- ├── api/ # API Routing & Dependency Injection (DI)
- ├── core/ # Global Configuration (Pydantic Settings) & Exception Handling
  ├── models/ # AI Model Inference (Embedder, Reranker)
  ├── services/ # Business Logic & Search Pipeline Orchestration
  ├── storage/ # Infrastructure Layer (Qdrant, SQLite Clients)
- ├── scripts/ # Data Pipeline & Database Setup Scripts
  ├── templates/ # Demo UI (Jinja2 Templates)
  └── main.py # App Entry Point & Lifespan Management
  ```
@@ -47,20 +46,21 @@ This project follows the **Separation of Concerns (SoC)** principle to ensure th

  ## 🛠 Tech Stack
  * **Framework:** FastAPI
- * **Vector DB:** Qdrant (Local Path Mode)
  * **RDBMS:** SQLite (Metadata & Corpus Storage)
  * **ML Models:**
-     * `BAAI/bge-m3` (Multi-functional Embedding)
-     * `BAAI/bge-reranker-v2-m3` (Cross-Encoder)
- * **DevOps:** Docker, GitHub Actions, Hugging Face Hub

  ---

  ## 🔧 Installation & Setup

  ### Prerequisites
- * Python 3.10 or higher
- * Hugging Face Access Token (Read/Write)

  ### Running Locally
  1. Clone the repository:
@@ -72,35 +72,48 @@ This project follows the **Separation of Concerns (SoC)** principle to ensure th
  ```bash
  pip install -r requirements.txt
  ```
- 3. Run the application (The system will automatically download the necessary DB files on startup):
  ```bash
  python main.py
- # OR using uvicorn
  uvicorn main:app --host 0.0.0.0 --port 7860
  ```

  ---

  ## 📡 API Endpoints
  | Method | Endpoint | Description |
  | :--- | :--- | :--- |
  | `GET` | `/` | Redirects to Search Demo UI |
- | `POST` | `/api/v1/search/` | Executes JSON-based Hybrid Search |
  | `GET` | `/api/v1/system/health/ping` | System health check (Heartbeat) |

  ---

  ## 💡 Architecture Insights
- 1. **Dependency Injection:** Uses FastAPI `app.state` to manage singletons of AI models and DB clients, allowing for easy mocking during unit testing.
- 2. **Hybrid RAG Pipeline:** Beyond simple vector similarity, this engine leverages Sparse embeddings for keyword-level precision, merged via Reciprocal Rank Fusion (RRF).
- 3. **Deployment Ready:** Optimized for PaaS environments (like HF Spaces) through a containerized Docker setup and automated CI/CD.

  ---

  ## 📄 Documentation
- For more detailed technical documentation, design decisions, and troubleshooting, please visit:
  * [Personal Archive Link](https://minjae-portfolio.vercel.app/projects/ke)
  * [Technical Design Blog](https://minjae-portfolio.vercel.app/blogs/ke-pd)


- ---
 
 
  [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg?logo=python)](https://www.python.org/downloads/release/python-3100/)
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)

+ > **High-performance Hybrid Search & Reranking Engine based on BGE-M3.** An advanced knowledge retrieval API system designed for Agentic AI, combining Dense/Sparse embeddings and optimizing precision with Cross-Encoders.

  ---

  ## 🚀 Key Features
+ * **Hybrid Search (RRF):** Combines dense and sparse BGE-M3 vector retrieval through Qdrant's native fusion API, merging the two result lists with Reciprocal Rank Fusion.
+ * **Cross-Encoder Re-ranking:** Improves precision by re-ordering the fused results with `bge-reranker-v2-m3`.
+ * **Agent-Ready Output:** Returns XML-tagged context blocks that can be injected directly into LLM prompts and agentic workflows.
+ * **Auto-Healing & Sync:** Startup logic in the FastAPI `lifespan` handler pulls the pre-processed knowledge base from Hugging Face Datasets and synchronizes it.
+ * **Clean Architecture:** Modular layers (API, Service, Storage, Models) wired together with Dependency Injection for maintainability.
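For readers unfamiliar with the RRF step named in the Hybrid Search feature, here is a minimal plain-Python sketch of the fusion formula. In the project the fusion is done by Qdrant itself, so this is only an illustration and not the engine's code; the function name, the sample document IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: score(d) = sum of 1 / (k + rank of d in each list)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the dense and sparse retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear near the top of both lists (`doc_a`, `doc_c`) outrank documents found by only one retriever, which is the behavior hybrid search relies on before the cross-encoder re-ranks the fused list.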

  ---

  ## 🏗 Project Structure
+ Follows the **Separation of Concerns (SoC)** principle to ensure the system remains extensible and testable.

  ```text
+ ├── api/ # API Routing & Schema Definitions
+ ├── core/ # Global Configuration (Pydantic V2) & Exception Handling
  ├── models/ # AI Model Inference (Embedder, Reranker)
  ├── services/ # Business Logic & Search Pipeline Orchestration
  ├── storage/ # Infrastructure Layer (Qdrant, SQLite Clients)
+ ├── scripts/ # Data Pipeline & HF Dataset Sync Scripts
  ├── templates/ # Demo UI (Jinja2 Templates)
  └── main.py # App Entry Point & Lifespan Management
  ```
 

  ## 🛠 Tech Stack
  * **Framework:** FastAPI
+ * **Vector DB:** Qdrant (Server Mode)
  * **RDBMS:** SQLite (Metadata & Corpus Storage)
  * **ML Models:**
+     * [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) (Dense + Sparse Embedding)
+     * [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) (Cross-Encoder)
+ * **DevOps:** Docker, GitHub Actions, Hugging Face Hub (Spaces & Datasets)
+ * **Corpus:** [FineWiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) (currently kowiki only; enwiki, eswiki, and others to be added later)

  ---

  ## 🔧 Installation & Setup

  ### Prerequisites
+ * Python 3.10+
+ * Hugging Face Access Token (For initial setup/updates)

  ### Running Locally
  1. Clone the repository:
 
  ```bash
  pip install -r requirements.txt
  ```
+ 3. Run the application:
+ *(The system will automatically download the pre-built SQLite and Qdrant DB files from HF Datasets on startup via `scripts/setup_db.py`)*
  ```bash
  python main.py
+ # OR
  uvicorn main:app --host 0.0.0.0 --port 7860
  ```
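The startup download mentioned in step 3 can be pictured as an idempotent "fetch only what is missing" check. The sketch below is an assumption about the general shape of such logic, not the contents of `scripts/setup_db.py`; `ensure_db_files`, the `fetch` callback, and the file name `corpus.sqlite` are hypothetical. In a real setup the callback would presumably wrap a Hugging Face Hub download.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def ensure_db_files(
    data_dir: Path,
    filenames: Iterable[str],
    fetch: Callable[[str, Path], None],
) -> List[str]:
    """Download each missing (or empty) file via `fetch`; return the names that were fetched."""
    data_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name in filenames:
        target = data_dir / name
        if target.exists() and target.stat().st_size > 0:
            continue  # already present, nothing to do
        fetch(name, target)
        if not target.exists():
            raise RuntimeError(f"fetch did not produce {target}")
        downloaded.append(name)
    return downloaded


def fake_fetch(name: str, target: Path) -> None:
    """Stand-in downloader used only for this demonstration."""
    target.write_bytes(b"stub")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_db_files(Path(tmp), ["corpus.sqlite"], fake_fetch)
    second = ensure_db_files(Path(tmp), ["corpus.sqlite"], fake_fetch)

print(first, second)
```

The second call fetches nothing because the file already exists, which is what makes it safe to run the check on every application start.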

+ ### Preprocessing Pipeline (Optional)
+ If you want to build the knowledge base from scratch:
+ ```bash
+ # 1. Download qdrant binary (Linux x86_64)
+ wget https://github.com/qdrant/qdrant/releases/download/v1.16.2/qdrant-x86_64-unknown-linux-gnu.tar.gz
+ tar -xvf qdrant-x86_64-unknown-linux-gnu.tar.gz
+ chmod +x qdrant
+
+ # 2. Execute Pipeline
+ python scripts/data_pipeline.py --lang en --chunk_batch_size 10000 --limit 50000 --batch_size 1024 --workers 4 --upload --repo_id user/id
+ ```
+
  ---

  ## 📡 API Endpoints
  | Method | Endpoint | Description |
  | :--- | :--- | :--- |
  | `GET` | `/` | Redirects to Search Demo UI |
+ | `POST` | `/api/v1/search/` | Executes JSON-based Hybrid Search (Returns structured JSON & LLM context) |
  | `GET` | `/api/v1/system/health/ping` | System health check (Heartbeat) |
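The "LLM context" returned by the search endpoint is described as XML-tagged blocks. The README does not show the actual tag names or fields, so the sketch below only illustrates the idea: `build_llm_context`, the `<context>` and `<document>` tags, and the hit fields `id`, `score`, and `text` are all assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def build_llm_context(hits):
    """Wrap reranked hits in XML-style tags, escaping text so the markup stays well-formed."""
    blocks = []
    for rank, hit in enumerate(hits, start=1):
        doc_id = quoteattr(str(hit["id"]))  # quoteattr adds the surrounding quotes
        score = f"{hit['score']:.4f}"
        body = escape(hit["text"])  # escapes &, < and >
        blocks.append(
            f'<document rank="{rank}" id={doc_id} score="{score}">\n{body}\n</document>'
        )
    return "<context>\n" + "\n".join(blocks) + "\n</context>"


# Hypothetical reranked hits, best first.
hits = [
    {"id": "kowiki-1", "score": 0.91234, "text": "R&D spending rose <5% last year."},
    {"id": "kowiki-2", "score": 0.80500, "text": "A second passage."},
]
context = build_llm_context(hits)
print(context)
```

Escaping the passage text matters because wiki content can contain `<`, `>`, or `&`, which would otherwise break the tag boundaries an LLM or agent relies on.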

  ---

  ## 💡 Architecture Insights
+ 1. **O(1) Metadata Mapping:** Large text payloads are stored in SQLite and only vectors and IDs in Qdrant, so candidate texts are fetched by ID just before reranking, which keeps the reranking preparation phase at very low latency.
+ 2. **Zero-Downtime Deployment:** Optimized for PaaS environments (like HF Spaces) through a containerized Docker setup and a custom `start.sh` that ensures the DB is ready before FastAPI starts.
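The metadata mapping in item 1 amounts to resolving the IDs returned by the vector search into texts with a single keyed SQLite query. The following is a minimal sketch under assumed names: the `chunks` table, its `id` and `text` columns, and `fetch_texts` are hypothetical and not taken from the repository.

```python
import sqlite3


def fetch_texts(conn, ids):
    """Return (id, text) pairs for `ids`, preserving the ranking order of `ids`."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, text FROM chunks WHERE id IN ({placeholders})", list(ids)
    ).fetchall()
    by_id = dict(rows)
    # SQL does not guarantee row order, so rebuild it from the ranked ID list.
    return [(i, by_id[i]) for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)

ranked_ids = [3, 1]  # IDs as they might come back from the vector search
candidates = fetch_texts(conn, ranked_ids)
print(candidates)
```

SQLite resolves a primary-key lookup through a B-tree, so the per-ID cost grows logarithmically with table size rather than being strictly constant, though in practice it stays small compared with the reranking model's inference time.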
 
  ---

  ## 📄 Documentation
+ For more detailed technical documentation and design decisions:
  * [Personal Archive Link](https://minjae-portfolio.vercel.app/projects/ke)
  * [Technical Design Blog](https://minjae-portfolio.vercel.app/blogs/ke-pd)


+ ---
+