---
title: DashVector Experiment Matrix
emoji: ⚡
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
---

# dashVectorspace (xVector)

**Production-Grade Learned Hybrid Retrieval Engine**

This project implements a high-efficiency vector search engine using a **Learned Router** and **Custom Sharding** on top of **Qdrant**. It cuts search compute by roughly 90% by routing each query to specific data clusters instead of brute-force searching the entire dataset.

## Core Architecture

1. **The Brain (Router)**: A machine learning model (LightGBM, Logistic Regression, or MLP) predicts which cluster contains the answer.
2. **The Body (Vector DB)**: Qdrant with **Custom Sharding**. Data is partitioned into 32 clusters plus 1 freshness shard.
3. **The Optimization**: **Matryoshka Representation Learning (MRL)**. The Router uses sliced 64-dim vectors for speed, while the DB stores full vectors for accuracy.

## Project Structure

```
dashVectorspace/
├── config.py              # Configuration (Clusters, Models, Paths)
├── main.py                # Benchmark Runner (P&C Matrix)
├── requirements.txt       # Dependencies
├── src/
│   ├── data_pipeline.py       # Data loading & MRL slicing
│   ├── router.py              # LearnedRouter (Train/Predict)
│   ├── vector_db.py           # UnifiedQdrant (Custom Sharding)
│   └── active_learning.py     # Hard Negative Logging
└── notebooks/
    └── xVector_Analysis.ipynb # Analysis Notebook
```

## Setup & Usage

1. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Run Benchmarks**: Execute the main script to run the Permutation & Combination matrix of experiments:
   ```bash
   python main.py
   ```
   This will:
   - Generate/Load Data (MS MARCO or Synthetic).
   - Train different Router models (LightGBM, Logistic, MLP).
   - Index data into Qdrant with Custom Sharding.
   - Run test queries and report Accuracy, Latency, and Compute Savings.
3. **Analyze Results**: Open `notebooks/xVector_Analysis.ipynb` to visualize the active learning logs and performance metrics.
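To make the routing idea concrete, here is a minimal, self-contained sketch of the MRL slice-then-route step. It is illustrative only: the repository's actual `LearnedRouter` is a trained LightGBM/Logistic/MLP model, so the nearest-centroid scorer below is a hypothetical stand-in, and all names (`mrl_slice`, `NearestCentroidRouter`) are assumptions, not the project's API.

```python
# Sketch: route on a cheap 64-dim MRL prefix, reserving the full
# vector for the actual Qdrant search inside the chosen cluster.
import math
import random

DIM_FULL, DIM_ROUTER, N_CLUSTERS = 256, 64, 32

def mrl_slice(vec, dim=DIM_ROUTER):
    """MRL property: a prefix of the embedding is itself a usable embedding."""
    return vec[:dim]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class NearestCentroidRouter:
    """Hypothetical stand-in for the learned (LightGBM/Logistic/MLP) router."""
    def __init__(self, centroids):
        self.centroids = centroids  # one 64-dim centroid per cluster

    def predict(self, query_slice):
        scores = [cosine(query_slice, c) for c in self.centroids]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, scores[best]  # (cluster id, confidence proxy)

random.seed(0)
centroids = [[random.gauss(0, 1) for _ in range(DIM_ROUTER)]
             for _ in range(N_CLUSTERS)]
router = NearestCentroidRouter(centroids)

query = [random.gauss(0, 1) for _ in range(DIM_FULL)]
cluster, conf = router.predict(mrl_slice(query))
# Upper bound on savings, ignoring the always-searched freshness shard:
savings = 1 - 1 / N_CLUSTERS
print(f"route to cluster {cluster} (conf={conf:.2f}), "
      f"up to {savings:.0%} of shards skipped")
```

The point of the slice is that routing cost scales with the 64-dim prefix while retrieval accuracy still comes from the full vector stored in Qdrant.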
## Key Features

- **Custom Sharding**: Explicit control over where data lives (clusters 0–31) plus a dedicated **Freshness Shard (999)** for new data.
- **Drift Defense**:
  - **Layer 1**: Always searches the Freshness Shard.
  - **Layer 2**: Falls back to **Global Search** when Router confidence is low (< 0.5).
- **Active Learning**: Logs "hard negatives" (low-confidence or zero-result queries) to `logs/active_learning_queue.jsonl` for future model retraining.
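The features above can be sketched as one query path. This is a hedged illustration, not the repository's code: `search_shard` is a placeholder for a Qdrant search scoped to one custom shard, and `hybrid_search`/`log_hard_negative` are hypothetical names; only the 0.5 threshold, shard IDs, and JSONL queue mirror the README.

```python
import json
import os
import tempfile
import time

CONF_THRESHOLD = 0.5   # below this, fall back to global search (Layer 2)
FRESHNESS_SHARD = 999
NUM_CLUSTERS = 32

def search_shard(shard_id, query):
    """Placeholder for a Qdrant search scoped to one custom shard."""
    return []  # would return (doc_id, score) hits

def log_hard_negative(query, cluster, confidence, path):
    """Append a JSONL record to the active-learning queue for retraining."""
    record = {"ts": time.time(), "query": query,
              "predicted_cluster": cluster, "confidence": confidence}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def hybrid_search(query, predicted_cluster, confidence, log_path):
    hits = search_shard(FRESHNESS_SHARD, query)          # Layer 1: always
    if confidence >= CONF_THRESHOLD:
        hits += search_shard(predicted_cluster, query)   # routed fast path
        shards_searched = 2
    else:
        for c in range(NUM_CLUSTERS):                    # Layer 2: global fallback
            hits += search_shard(c, query)
        shards_searched = 1 + NUM_CLUSTERS
    if confidence < CONF_THRESHOLD or not hits:
        log_hard_negative(query, predicted_cluster, confidence, log_path)
    return hits, shards_searched

log_path = os.path.join(tempfile.mkdtemp(), "active_learning_queue.jsonl")
_, n = hybrid_search("what is mrl?", predicted_cluster=7,
                     confidence=0.81, log_path=log_path)
print(f"confident query searched {n} of {1 + NUM_CLUSTERS} shards")
```

A confident query touches only 2 of 33 shards, while a low-confidence one pays the full cost but also enqueues itself as a hard negative, so the router improves exactly where it is weakest.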