title: Rag with Binary Quantization
emoji: 📜
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: RAG with Binary Quantization for enhanced performance


RAG with Binary Quantization

A high-performance Retrieval-Augmented Generation (RAG) system that uses binary quantization for efficient vector storage and similarity search. This project implements a document Q&A system with optimized memory usage and fast retrieval capabilities.

🚀 Features

  • Binary Quantization: Converts high-dimensional embeddings to binary vectors for memory efficiency
  • Milvus Vector Database: Uses Milvus for scalable vector storage and similarity search
  • Gradio Web Interface: User-friendly web UI for document upload and chat
  • BGE Embeddings: Leverages BAAI/bge-large-en-v1.5 for high-quality text embeddings
  • OpenAI Integration: Uses GPT-4.1 for intelligent question answering
  • Batch Processing: Efficient document processing with configurable batch sizes
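As a rough illustration of the batch processing feature, the sketch below splits a list of text chunks into fixed-size batches before they would be handed to the embedding model. The helper name `batched` and the batch size of 2 are assumptions made for this example; the project's actual batching code and defaults are not shown in this README.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical input: five text chunks extracted from uploaded documents.
chunks = [f"chunk {i}" for i in range(5)]
batches = list(batched(chunks, batch_size=2))

for batch in batches:
    # In a real pipeline, each batch would be embedded in one model call.
    print(len(batch), batch)
```

Larger batches generally mean fewer model calls at the cost of more memory per call, which is why the size is worth keeping configurable.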

πŸ—οΈ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Documents     │───▶│  BGE Embeddings  │───▶│ Binary Vectors  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   User Query    │───▶│  Query Embedding │───▶│  Milvus Search  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Retrieved Docs │◀───│  Context Fusion  │◀───│  LLM Answer     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
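To make the "Context Fusion" stage of the diagram more concrete, here is a minimal sketch of how retrieved chunks might be merged with the user's question into a single prompt for the LLM. The function name `build_prompt`, the numbering scheme, and the prompt wording are all assumptions for illustration; the project's own prompt template in `src/rag_pipeline.py` may differ.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved text chunks and the question into one prompt string."""
    # Number each chunk so the model (and the reader) can refer to sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical retrieval results for a toy question.
prompt = build_prompt(
    "What metric is used for search?",
    ["Binary vectors are compared with Hamming distance.", "The index type is BIN_FLAT."],
)
print(prompt)
```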

πŸ› οΈ Installation

  1. Clone the repository:

    git clone <repository-url>
    cd rag-w-binary-quant
    
  2. Install dependencies:

    uv sync
    
  3. Set up environment variables: Create a .env file with your OpenAI API key:

    OPENAI_API_KEY=your_openai_api_key_here
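
A quick way to catch a missing or placeholder key before launching the app is a small startup check like the one below. This is only a sketch: the helper `require_api_key` is hypothetical, and it assumes the value from `.env` has already been loaded into the process environment (for example by a loader such as python-dotenv).

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Treat the README's placeholder value as "not configured".
    if not value or value == "your_openai_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return value
```

Calling `require_api_key()` once at startup turns a confusing authentication failure at the first chat request into an immediate, readable error.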
    

🚀 Usage

Starting the Application

Run the Gradio web interface:

uv run app.py

The application will be available at http://localhost:7860

Using the Interface

  1. Upload Documents:

    • Go to the "Upload & Index" tab
    • Upload your documents (supports multiple file formats)
    • Click "Update Index" to process and index the documents
  2. Chat with Documents:

    • Switch to the "Chat" tab
    • Ask questions about your uploaded documents
    • Get intelligent answers based on the document content

🔧 Configuration

Key configuration parameters in src/config.py:

  • EMBEDDING_MODEL_NAME: BAAI/bge-large-en-v1.5
  • COLLECTION_NAME: "fast_rag"
  • MILVUS_DB_PATH: "milvus_binary_quantized.db"
  • MODEL_NAME: "gpt-4.1"
  • TEMPERATURE: 0.2

📊 Performance Benefits

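  • Memory Efficiency: Binary vectors store 1 bit per dimension instead of 32, so they use 32x less memory than float32 embeddings (128 bytes instead of 4,096 bytes for a 1024-dimensional bge-large-en-v1.5 vector)
  • Fast Search: Hamming distance reduces to an XOR followed by a bit count on the packed bytes, which is cheap to compute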
  • Scalable: Milvus provides enterprise-grade vector database capabilities
  • Accurate: BGE embeddings provide high-quality semantic representations

πŸ›οΈ Project Structure

rag-w-binary-quant/
β”œβ”€β”€ app.py                 # Gradio web interface
β”œβ”€β”€ main.py               # Main application entry point
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ config.py         # Configuration settings
β”‚   β”œβ”€β”€ data_loader.py    # Document loading utilities
β”‚   β”œβ”€β”€ embedding_generator.py  # Binary embedding generation
β”‚   β”œβ”€β”€ vector_store.py   # Milvus vector database operations
β”‚   └── rag_pipeline.py   # RAG question answering pipeline
β”œβ”€β”€ documents/            # Uploaded document storage
└── README.md

πŸ” Technical Details

Binary Quantization Process

  1. Float32 Embeddings: Generate embeddings using BGE model
  2. Binary Conversion: Convert to binary using threshold (positive values → 1, negative → 0)
  3. Packing: Pack binary vectors into bytes for efficient storage
  4. Hamming Distance: Use Hamming distance for similarity search
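
The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's code: the function names are hypothetical, mapping an exact zero to 0 is an assumption (the steps only specify positive and negative values), and a real implementation would more likely use NumPy (for example `numpy.packbits`, whose default bit order is also most-significant-bit first).

```python
from typing import Sequence


def binarize(embedding: Sequence[float]) -> list[int]:
    """Threshold at zero: positive values become 1, everything else 0."""
    return [1 if x > 0 else 0 for x in embedding]


def pack_bits(bits: Sequence[int]) -> bytes:
    """Pack 8 bits into each byte, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    xor = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(xor).count("1")


# Toy 8-dimensional "embeddings"; real bge-large-en-v1.5 vectors have 1024 dimensions.
doc = pack_bits(binarize([0.3, -0.1, 0.7, -0.5, 0.2, 0.9, -0.4, -0.8]))
query = pack_bits(binarize([0.1, 0.2, 0.6, -0.3, -0.2, 0.8, -0.1, -0.6]))

print(doc.hex(), query.hex(), hamming_distance(doc, query))  # prints "ac e4 2"
```

With 1024 dimensions the packed vector is 128 bytes long, since each byte carries 8 dimensions.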

Vector Search

  • Index Type: BIN_FLAT (exact search for binary vectors)
  • Metric: Hamming distance
  • Retrieval: Top-k most similar documents
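
Conceptually, an exact search with the Hamming metric compares the query against every stored vector and keeps the k closest. The sketch below imitates that behaviour in plain Python so the idea is easy to follow; it is not Milvus code, and the names `top_k` and `corpus` as well as the toy 8-bit vectors are assumptions made for the example.

```python
import heapq


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length packed vectors."""
    xor = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(xor).count("1")


def top_k(query: bytes, corpus: dict[str, bytes], k: int) -> list[tuple[int, str]]:
    """Exact (brute-force) search: score every vector, keep the k nearest."""
    scored = ((hamming_distance(query, vec), doc_id) for doc_id, vec in corpus.items())
    return heapq.nsmallest(k, scored)


# Hypothetical corpus of three packed binary vectors keyed by chunk ID.
corpus = {
    "chunk-a": bytes([0b10101100]),
    "chunk-b": bytes([0b11100100]),
    "chunk-c": bytes([0b00010011]),
}

results = top_k(bytes([0b11100101]), corpus, k=2)
print(results)  # prints [(1, 'chunk-b'), (3, 'chunk-a')]
```

Smaller distances mean more similar vectors, so results are ordered from the closest match outward.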

πŸ™ Acknowledgments

  • BAAI for the BGE embedding model
  • Milvus for the vector database
  • Gradio for the web interface
  • OpenAI for the language model