---
title: DashVector Experiment Matrix
emoji:
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
---
# dashVectorspace (xVector)
**Production-Grade Learned Hybrid Retrieval Engine**
This project implements a vector search engine that combines a **Learned Router** with **Custom Sharding** on top of **Qdrant**. Instead of brute-force searching the entire dataset, it routes each query to a specific data cluster, which cuts search compute by ~90%.
## Core Architecture
1. **The Brain (Router)**: A Machine Learning model (LightGBM/Logistic/MLP) predicts which cluster contains the answer.
2. **The Body (Vector DB)**: Qdrant with **Custom Sharding**. Data is partitioned into 32 Clusters + 1 Freshness Shard.
3. **The Optimization**: **Matryoshka Representation Learning (MRL)**. The Router uses sliced 64-dim vectors for speed, while the DB stores full vectors for accuracy.
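The MRL slicing in item 3 can be sketched as follows. This is a minimal illustration, not the project's `data_pipeline.py`: the helper name `mrl_slice`, the 768-dim full embedding, and the re-normalization step are all assumptions. Only the 64-dim router input comes from the description above.

```python
import math
from typing import List, Sequence

ROUTER_DIM = 64  # router input width stated in this README


def mrl_slice(vector: Sequence[float], dim: int = ROUTER_DIM) -> List[float]:
    """Keep the first `dim` components of an embedding and L2-normalize them."""
    if len(vector) < dim:
        raise ValueError(f"vector has {len(vector)} dims, need at least {dim}")
    head = [float(x) for x in vector[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    # Assumption: re-normalize after truncation, which is common when MRL
    # embeddings are compared by cosine similarity. The project may skip this.
    if norm == 0.0:
        return head
    return [x / norm for x in head]


# Hypothetical 768-dim embedding; the full vector is what the DB would store.
full = [0.5] * 768
router_input = mrl_slice(full)
print(len(full), len(router_input))
```

The Router only ever sees `router_input`, while the full vector stays in the database for the final similarity search.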
## Project Structure
```
dashVectorspace/
├── config.py                   # Configuration (Clusters, Models, Paths)
├── main.py                     # Benchmark Runner (P&C Matrix)
├── requirements.txt            # Dependencies
├── src/
│   ├── data_pipeline.py        # Data loading & MRL slicing
│   ├── router.py               # LearnedRouter (Train/Predict)
│   ├── vector_db.py            # UnifiedQdrant (Custom Sharding)
│   └── active_learning.py      # Hard Negative Logging
└── notebooks/
    └── xVector_Analysis.ipynb  # Analysis Notebook
```
## Setup & Usage
1. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Run Benchmarks**:
   Execute the main script to run the Permutation & Combination matrix of experiments:
   ```bash
   python main.py
   ```
   This will:
   - Generate/Load Data (MS MARCO or Synthetic).
   - Train different Router models (LightGBM, Logistic, MLP).
   - Index data into Qdrant with Custom Sharding.
   - Run test queries and report Accuracy, Latency, and Compute Savings.
3. **Analyze Results**:
   Open `notebooks/xVector_Analysis.ipynb` to visualize the active learning logs and performance metrics.
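One plausible way to read the Compute Savings metric is sketched below. The definition is an assumption, not taken from `main.py`: it treats every shard as equally expensive and counts the fraction of shard searches avoided. The function name `compute_savings` and the query mix are hypothetical; only the shard counts come from this README.

```python
NUM_CLUSTERS = 32      # from this README
FRESHNESS_SHARDS = 1   # from this README
TOTAL_SHARDS = NUM_CLUSTERS + FRESHNESS_SHARDS


def compute_savings(shards_searched_per_query):
    """Fraction of shard searches avoided versus searching every shard.

    Assumption: all shards cost about the same to search, so savings are
    1 - (shards searched / shards available), pooled over all queries.
    """
    if not shards_searched_per_query:
        raise ValueError("need at least one query")
    searched = sum(shards_searched_per_query)
    possible = TOTAL_SHARDS * len(shards_searched_per_query)
    return 1.0 - searched / possible


# Hypothetical run: 95 routed queries touch 2 shards (cluster + freshness),
# and 5 low-confidence queries fall back to all 33 shards.
per_query = [2] * 95 + [TOTAL_SHARDS] * 5
savings = compute_savings(per_query)
print(f"{savings:.1%}")  # 89.2%
```

Under these assumptions the savings depend mostly on how often the Router falls back to a global search, so the fallback rate is worth reporting alongside the headline number.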
## Key Features
- **Custom Sharding**: Explicit control over where data lives (Clusters 0-31) and a dedicated **Freshness Shard (999)** for new data.
- **Drift Defense**:
  - **Layer 1**: Always searches the Freshness Shard.
  - **Layer 2**: Falls back to **Global Search** if Router confidence is low (< 0.5).
- **Active Learning**: Logs "Hard Negatives" (low confidence or zero results) to `logs/active_learning_queue.jsonl` for future model retraining.
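The two Drift Defense layers and the hard-negative logging can be sketched together. This is a rough illustration under stated assumptions, not the code in `router.py` or `active_learning.py`: the function names, the per-cluster probability dict, and the JSONL record fields are hypothetical. The shard key 999, the 0.5 threshold, and the log path come from the list above.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional

FRESHNESS_SHARD = 999        # from this README
CONFIDENCE_THRESHOLD = 0.5   # from this README
LOG_PATH = "logs/active_learning_queue.jsonl"  # from this README


def plan_search(cluster_probs: Dict[int, float]) -> Optional[List[int]]:
    """Return the shard keys to search, or None to request a global search."""
    best_cluster = max(cluster_probs, key=cluster_probs.get)
    if cluster_probs[best_cluster] < CONFIDENCE_THRESHOLD:
        return None  # Layer 2: low confidence, search everything
    # Layer 1: the Freshness Shard is always searched with the chosen cluster.
    return [best_cluster, FRESHNESS_SHARD]


def is_hard_negative(confidence: float, num_results: int) -> bool:
    """A query is a hard negative if confidence is low or nothing came back."""
    return confidence < CONFIDENCE_THRESHOLD or num_results == 0


def log_hard_negative(query: str, confidence: float, num_results: int,
                      path: str = LOG_PATH) -> bool:
    """Append one JSON line for a hard negative; return True if logged."""
    if not is_hard_negative(confidence, num_results):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # Hypothetical record schema; the project's actual fields may differ.
    record = {"query": query, "confidence": confidence,
              "num_results": num_results}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return True


print(plan_search({3: 0.9, 7: 0.1}))    # confident: cluster 3 + freshness
print(plan_search({3: 0.4, 7: 0.35}))   # not confident: global search

# The demo writes to a temporary directory instead of LOG_PATH.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "active_learning_queue.jsonl")
    logged = log_hard_negative("example query", 0.31, 4, path=demo_path)
    with open(demo_path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh]
```

Returning `None` for the global case keeps the caller simple: a list of shard keys means a targeted search, and `None` means no shard restriction at all.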