Spaces: m97j/knowledge-engine

m97j committed on Apr 30

Commit

972e4b1

1 Parent(s): 14b2e5c

Update README.md

Files changed (1)

README.md +38 -25

@@ -15,30 +15,29 @@ pinned: false
 [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg?logo=python)](https://www.python.org/downloads/release/python-3100/)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)
-> **High-performance Hybrid Search & Reranking Engine based on BGE-M3.** > An advanced knowledge retrieval API system that combines Dense/Sparse embeddings and optimizes precision with Cross-Encoders.
 ---
 ## 🚀 Key Features
-* **Hybrid Search:** Seamlessly combines Dense & Sparse vector retrieval using Qdrant's Native Fusion API (BGE-M3).
-* **Re-ranking:** Ensures top-tier precision by re-ordering search results via Cross-Encoder models.
-* **Clean Architecture:** Highly modularized layers (API, Service, Storage, Models) for superior maintainability and scalability.
-* **CI/CD Pipeline:** Fully automated deployment to Hugging Face Spaces using GitHub Actions and Docker.
-* **Auto-Healing Data:** Robust startup logic via FastAPI `lifespan` that automatically synchronizes and validates the knowledge base.
 ---
 ## 🏗 Project Structure
-This project follows the **Separation of Concerns (SoC)** principle to ensure the system remains extensible and testable.
 ```text
-├── api/          # API Routing & Dependency Injection (DI)
-├── core/         # Global Configuration (Pydantic Settings) & Exception Handling
 ├── models/       # AI Model Inference (Embedder, Reranker)
 ├── services/     # Business Logic & Search Pipeline Orchestration
 ├── storage/      # Infrastructure Layer (Qdrant, SQLite Clients)
-├── scripts/      # Data Pipeline & Database Setup Scripts
 ├── templates/    # Demo UI (Jinja2 Templates)
 └── main.py       # App Entry Point & Lifespan Management
 ```
@@ -47,20 +46,21 @@ This project follows the **Separation of Concerns (SoC)** principle to ensure th
 ## 🛠 Tech Stack
 * **Framework:** FastAPI
-* **Vector DB:** Qdrant (Local Path Mode)
 * **RDBMS:** SQLite (Metadata & Corpus Storage)
 * **ML Models:**
-    * `BAAI/bge-m3` (Multi-functional Embedding)
-    * `BAAI/bge-reranker-v2-m3` (Cross-Encoder)
-* **DevOps:** Docker, GitHub Actions, Hugging Face Hub
 ---
 ## 🔧 Installation & Setup
 ### Prerequisites
-* Python 3.10 or higher
-* Hugging Face Access Token (Read/Write)
 ### Running Locally
 1. Clone the repository:
@@ -72,35 +72,48 @@ This project follows the **Separation of Concerns (SoC)** principle to ensure th
    ```bash
    pip install -r requirements.txt
    ```
-3. Run the application (The system will automatically download the necessary DB files on startup):
    ```bash
    python main.py
-   # OR using uvicorn
    uvicorn main:app --host 0.0.0.0 --port 7860
    ```
 ---
 ## 📡 API Endpoints
 | Method | Endpoint | Description |
 | :--- | :--- | :--- |
 | `GET` | `/` | Redirects to Search Demo UI |
-| `POST` | `/api/v1/search/` | Executes JSON-based Hybrid Search |
 | `GET` | `/api/v1/system/health/ping` | System health check (Heartbeat) |
 ---
 ## 💡 Architecture Insights
-1.  **Dependency Injection:** Uses FastAPI `app.state` to manage singletons of AI models and DB clients, allowing for easy mocking during unit testing.
-2.  **Hybrid RAG Pipeline:** Beyond simple vector similarity, this engine leverages Sparse embeddings for keyword-level precision, merged via Reciprocal Rank Fusion (RRF).
-3.  **Deployment Ready:** Optimized for PaaS environments (like HF Spaces) through a containerized Docker setup and automated CI/CD.
 ---
 ## 📄 Documentation
-For more detailed technical documentation, design decisions, and troubleshooting, please visit:
 * [Personal Archive Link](https://minjae-portfolio.vercel.app/projects/ke)
 * [Technical Design Blog](https://minjae-portfolio.vercel.app/blogs/ke-pd)
----

 [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg?logo=python)](https://www.python.org/downloads/release/python-3100/)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)
+> **High-performance Hybrid Search & Reranking Engine based on BGE-M3.** > An advanced knowledge retrieval API system designed for Agentic AI, combining Dense/Sparse embeddings and optimizing precision with Cross-Encoders.
 ---
 ## 🚀 Key Features
+* **Hybrid Search (RRF):** Seamlessly combines Dense & Sparse vector retrieval using Qdrant's Native Fusion API (BGE-M3).
+* **Cross-Encoder Re-ranking:** Ensures top-tier precision by re-ordering search results contextually via `bge-reranker-v2-m3`.
+* **Agent-Ready Output:** Natively provides XML-tagged context blocks optimized for immediate injection into LLMs and Agentic workflows.
+* **Auto-Healing & Sync:** Robust startup logic via FastAPI `lifespan` that automatically pulls pre-processed knowledge bases from Hugging Face Datasets and synchronizes them.
+* **Clean Architecture:** Highly modularized layers (API, Service, Storage, Models) using Dependency Injection for superior maintainability.
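
The Hybrid Search feature above merges dense and sparse rankings with Reciprocal Rank Fusion (RRF). The following is a minimal standalone sketch of the RRF formula, not the engine's code and not Qdrant's internal implementation. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the sample document IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. `k` dampens the influence of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense and a sparse retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]

fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists (`a` and `b` here) rise above those found by only one retriever, which is the behavior hybrid search relies on before the reranker runs.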
 ---
 ## 🏗 Project Structure
+Follows the **Separation of Concerns (SoC)** principle to ensure the system remains extensible and testable.
 ```text
+├── api/          # API Routing & Schema Definitions
+├── core/         # Global Configuration (Pydantic V2) & Exception Handling
 ├── models/       # AI Model Inference (Embedder, Reranker)
 ├── services/     # Business Logic & Search Pipeline Orchestration
 ├── storage/      # Infrastructure Layer (Qdrant, SQLite Clients)
+├── scripts/      # Data Pipeline & HF Dataset Sync Scripts
 ├── templates/    # Demo UI (Jinja2 Templates)
 └── main.py       # App Entry Point & Lifespan Management
 ```
 ## 🛠 Tech Stack
 * **Framework:** FastAPI
+* **Vector DB:** Qdrant (Server Mode)
 * **RDBMS:** SQLite (Metadata & Corpus Storage)
 * **ML Models:**
+    * [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) (Dense + Sparse Embedding)
+    * [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) (Cross-Encoder)
+* **DevOps:** Docker, GitHub Actions, Hugging Face Hub (Spaces & Datasets)
+* **Corpus:** [FineWiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) (currently kowiki only; enwiki, eswiki, and others to be added later)
 ---
 ## 🔧 Installation & Setup
 ### Prerequisites
+* Python 3.10+
+* Hugging Face Access Token (For initial setup/updates)
 ### Running Locally
 1. Clone the repository:
    ```bash
    pip install -r requirements.txt
    ```
+3. Run the application:
+   *(The system will automatically download the pre-built SQLite and Qdrant DB files from HF Datasets on startup via `scripts/setup_db.py`)*
    ```bash
    python main.py
+   # OR
    uvicorn main:app --host 0.0.0.0 --port 7860
    ```
+### Preprocessing Pipeline (Optional)
+If you want to build the knowledge base from scratch:
+```bash
+# 1. Download qdrant binary (Linux x86_64)
+wget https://github.com/qdrant/qdrant/releases/download/v1.16.2/qdrant-x86_64-unknown-linux-gnu.tar.gz
+tar -xvf qdrant-x86_64-unknown-linux-gnu.tar.gz
+chmod +x qdrant
+# 2. Execute Pipeline
+python scripts/data_pipeline.py --lang en --chunk_batch_size 10000 --limit 50000 --batch_size 1024 --workers 4 --upload --repo_id user/id
+```
 ---
 ## 📡 API Endpoints
 | Method | Endpoint | Description |
 | :--- | :--- | :--- |
 | `GET` | `/` | Redirects to Search Demo UI |
+| `POST` | `/api/v1/search/` | Executes JSON-based Hybrid Search (Returns structured JSON & LLM context) |
 | `GET` | `/api/v1/system/health/ping` | System health check (Heartbeat) |
 ---
 ## 💡 Architecture Insights
+1.  **O(1) Metadata Mapping:** By storing massive text payloads in SQLite and only vectors/IDs in Qdrant, we achieve extremely low latency during the reranking preparation phase.
+2.  **Zero-Downtime Deployment:** Optimized for PaaS environments (like HF Spaces) through a containerized Docker setup and a custom `start.sh` that ensures DB readiness before FastAPI starts.
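
The metadata mapping described in the first insight can be pictured as follows: the vector store returns only IDs, and the text needed for reranking is fetched from SQLite by primary key. This is a rough sketch under stated assumptions, not the project's actual schema or code. The table name `chunks`, its columns, the helper `fetch_texts`, and the sample rows are hypothetical, and the `hit_ids` list stands in for IDs that a vector search would return.

```python
import sqlite3

# In-memory database with a hypothetical schema: one row per text chunk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO chunks (id, text) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)


def fetch_texts(conn, ids):
    """Return chunk texts for `ids`, preserving the order of `ids`."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, text FROM chunks WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = dict(rows)
    # Keep the retrieval order so the reranker sees candidates as ranked.
    return [by_id[i] for i in ids if i in by_id]


# Stand-in for IDs returned by the vector search, best match first.
hit_ids = [3, 1]
texts = fetch_texts(conn, hit_ids)
print(texts)
```

Keeping the text out of the vector store means each search hit carries only an ID, and a single indexed query resolves all candidates before they are passed to the cross-encoder.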
 ---
 ## 📄 Documentation
+For more detailed technical documentation and design decisions:
 * [Personal Archive Link](https://minjae-portfolio.vercel.app/projects/ke)
 * [Technical Design Blog](https://minjae-portfolio.vercel.app/blogs/ke-pd)
+---