metadata
title: DashVector Experiment Matrix
emoji: 
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false

dashVectorspace (xVector)

Production-Grade Learned Hybrid Retrieval Engine

This project implements a vector search engine that combines a Learned Router with Custom Sharding on top of Qdrant. Instead of searching the entire dataset for every query, it routes each query to the data cluster the Router predicts holds the answer, which cuts the search work per query by roughly 90%.

Core Architecture

  1. The Brain (Router): A machine learning model (LightGBM, logistic regression, or MLP) predicts which cluster contains the answer.
  2. The Body (Vector DB): Qdrant with Custom Sharding. Data is partitioned into 32 Clusters + 1 Freshness Shard.
  3. The Optimization: Matryoshka Representation Learning (MRL). The Router works on vectors sliced down to 64 dimensions for speed, while the DB stores the full vectors for accuracy.
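The MRL slicing in item 3 can be sketched as follows. This is a minimal illustration, not the code in src/data_pipeline.py: the function name `slice_mrl`, the full embedding size of 768, and the re-normalization step are all assumptions. It relies on the MRL property that the leading coordinates of an embedding form a usable lower-dimensional embedding on their own.

```python
import numpy as np

ROUTER_DIM = 64  # router input size named in the README


def slice_mrl(embeddings: np.ndarray, dim: int = ROUTER_DIM) -> np.ndarray:
    """Keep the leading `dim` coordinates of each row and re-normalize to unit length."""
    if embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_vectors, full_dim)")
    if embeddings.shape[1] < dim:
        raise ValueError("embedding has fewer dimensions than requested")
    head = embeddings[:, :dim].astype(np.float32)
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    # Guard against all-zero rows so the division never produces NaN.
    return head / np.maximum(norms, 1e-12)


# Illustrative data only: 5 random vectors with an assumed full size of 768.
rng = np.random.default_rng(0)
full = rng.normal(size=(5, 768)).astype(np.float32)
router_inputs = slice_mrl(full)
print(router_inputs.shape)  # (5, 64)
```

The full vectors stay untouched for indexing; only the Router sees the sliced copy.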

Project Structure

dashVectorspace/
├── config.py                   # Configuration (Clusters, Models, Paths)
├── main.py                     # Benchmark Runner (P&C Matrix)
├── requirements.txt            # Dependencies
├── src/
│   ├── data_pipeline.py        # Data loading & MRL slicing
│   ├── router.py               # LearnedRouter (Train/Predict)
│   ├── vector_db.py            # UnifiedQdrant (Custom Sharding)
│   └── active_learning.py      # Hard Negative Logging
└── notebooks/
    └── xVector_Analysis.ipynb  # Analysis Notebook

Setup & Usage

  1. Install Dependencies:

    pip install -r requirements.txt
    
  2. Run Benchmarks: Execute the main script to run the Permutation & Combination matrix of experiments:

    python main.py
    

    This will:

    • Generate/Load Data (MS MARCO or Synthetic).
    • Train different Router models (LightGBM, Logistic Regression, MLP).
    • Index data into Qdrant with Custom Sharding.
    • Run test queries and report Accuracy, Latency, and Compute Savings.
  3. Analyze Results: Open notebooks/xVector_Analysis.ipynb to visualize the active learning logs and performance metrics.
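One way to read the Compute Savings metric is as the share of shard searches a routed query avoids compared with searching every shard. The sketch below rests on that assumption and on the simplification that all shards cost the same to search. The function name and the sample workload are hypothetical and are not taken from main.py.

```python
from statistics import mean

TOTAL_SHARDS = 33  # 32 clusters + 1 freshness shard, per the README


def compute_savings(shards_searched: list[int], total_shards: int = TOTAL_SHARDS) -> float:
    """Fraction of shard searches avoided, relative to searching all shards per query."""
    if not shards_searched:
        raise ValueError("need at least one query")
    return 1.0 - mean(shards_searched) / total_shards


# Hypothetical workload: 9 routed queries search 2 shards each
# (predicted cluster + freshness shard); 1 query falls back to all 33.
workload = [2] * 9 + [TOTAL_SHARDS]
savings = compute_savings(workload)
print(round(savings, 3))  # 0.845
```

Under these assumptions, a query that is always routed saves 1 - 2/33, or about 94%, and every fallback to a global search pulls the average down.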

Key Features

  • Custom Sharding: Explicit control over where data lives (Clusters 0-31) and a dedicated Freshness Shard (999) for new data.
  • Drift Defense:
    • Layer 1: Always searches the Freshness Shard.
    • Layer 2: Falls back to Global Search if Router confidence is low (< 0.5).
  • Active Learning: Logs "Hard Negatives" (low confidence or zero results) to logs/active_learning_queue.jsonl for future model retraining.
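
The Drift Defense and Active Learning behavior described above can be sketched in plain Python. This is an illustration under assumptions, not the code in src/router.py or src/active_learning.py: the function names, the use of `None` to mean a global search, and the fields of the log record are hypothetical. Only the shard key 999, the 0.5 threshold, and the log file name come from this README.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

FRESHNESS_SHARD = 999        # shard key for new data, per the README
CONFIDENCE_THRESHOLD = 0.5   # fallback threshold, per the README
GLOBAL_SEARCH = None         # hypothetical sentinel: None means "search every shard"


def choose_shards(cluster_probs: np.ndarray):
    """Return (shard_keys, confidence); shard_keys is None for a global search."""
    cluster = int(np.argmax(cluster_probs))
    confidence = float(cluster_probs[cluster])
    if confidence < CONFIDENCE_THRESHOLD:
        # Layer 2: the router is unsure, so fall back to searching everything.
        return GLOBAL_SEARCH, confidence
    # Layer 1: the freshness shard is always searched next to the predicted cluster.
    return [cluster, FRESHNESS_SHARD], confidence


def log_hard_negative(log_path: Path, query_id: str, confidence: float, n_results: int) -> bool:
    """Append a JSONL record when confidence is low or the search came back empty."""
    if confidence >= CONFIDENCE_THRESHOLD and n_results > 0:
        return False
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {"query_id": query_id, "confidence": confidence, "n_results": n_results}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return True


# Made-up router outputs over 32 clusters: one confident, one uniform.
probs_confident = np.full(32, 0.1 / 31)
probs_confident[7] = 0.9
probs_unsure = np.full(32, 1 / 32)

routed, routed_conf = choose_shards(probs_confident)
fallback, fallback_conf = choose_shards(probs_unsure)
print(routed)    # [7, 999]
print(fallback)  # None

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "logs" / "active_learning_queue.jsonl"
    logged_low = log_hard_negative(log_path, "q-unsure", fallback_conf, n_results=10)
    logged_ok = log_hard_negative(log_path, "q-confident", routed_conf, n_results=10)
    logged_lines = log_path.read_text(encoding="utf-8").splitlines()
```

In a real deployment the returned shard keys would be handed to the Qdrant search call as its shard key selector; that step is left out here because it needs a running Qdrant cluster.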