
RAG-Based Chute Template Implementation

This directory contains the RAG (Retrieval-Augmented Generation) based implementation for the Babelbit chute template system.

Overview

Instead of using transformer-based text generation models, this implementation uses FAISS-based vector search to retrieve similar utterances from a pre-built index.

Key Components

1. retriever.py

The core retrieval logic using FAISS for similarity search:

  • UtteranceRetriever: Main class for querying the FAISS index
  • RetrievalResult: Data class for search results
  • Cosine similarity search with normalized embeddings
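
The mechanics behind the similarity search can be sketched in plain NumPy (the function names here are illustrative, not the actual retriever.py API): for L2-normalized vectors, inner product equals cosine similarity, which is why an inner-product FAISS index returns cosine-ranked results.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve_top1(query_vec: np.ndarray, index_vecs: np.ndarray) -> tuple[int, float]:
    """Return (best_row, cosine_score) for a single query vector."""
    scores = normalize(index_vecs) @ normalize(query_vec)
    best = int(np.argmax(scores))
    return best, float(scores[best])
```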

2. load.py

Downloads and initializes the RAG system:

  • Downloads model.index (FAISS index) from HuggingFace
  • Downloads model.data (metadata pickle) from HuggingFace
  • Initializes UtteranceRetriever with configuration
  • Uses writable cache directory for Chutes environment
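
A download step along these lines would use `huggingface_hub` (the helper name and cache layout below are assumptions; the actual load.py may differ):

```python
from pathlib import Path

from huggingface_hub import hf_hub_download

# The two artifact names uploaded to the HuggingFace repo
ARTIFACTS = ("model.index", "model.data")

def download_artifacts(repo_id: str, revision: str = "main",
                       cache_dir: str = "./model_cache") -> dict[str, str]:
    """Fetch the FAISS index and metadata pickle into a writable cache dir."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    return {
        name: hf_hub_download(repo_id=repo_id, filename=name,
                              revision=revision, cache_dir=cache_dir)
        for name in ARTIFACTS
    }
```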

3. predict.py

RAG-based prediction logic:

  • Uses retriever.retrieve_top1() instead of text generation
  • Extracts continuation from matched utterances
  • Handles dict input conversion (validator compatibility)
  • Comprehensive logging for debugging
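
Continuation extraction can be sketched as follows (`extract_continuation` is a hypothetical helper, not necessarily the name used in predict.py):

```python
def extract_continuation(matched_utterance: str, prefix: str) -> str:
    """Return the part of the matched utterance that continues the prefix.

    If the match does not start with the prefix, fall back to returning
    the whole utterance (assumption about the desired fallback behavior).
    """
    if matched_utterance.startswith(prefix):
        return matched_utterance[len(prefix):].lstrip()
    return matched_utterance
```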

4. setup.py

Chute environment configuration:

  • RAG-specific dependencies:
    • sentence-transformers==2.2.2 (embedding model)
    • faiss-cpu==1.7.4 (vector search)
    • pydantic, chutes==0.3.61
  • Lower VRAM requirements (16GB vs 24GB)
  • 10 hour hot time for testing

5. compile_chute.py

Template compilation and validation script:

  • Renders the template with all injections
  • Validates Python syntax with py_compile
  • Generates deployable chute files
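
The py_compile validation step can be done roughly like this (`validate_syntax` is an illustrative helper, not the script's actual function):

```python
import py_compile
import tempfile

def validate_syntax(source: str) -> bool:
    """Compile rendered template source with py_compile; True if it parses."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False
```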

Architecture

┌─────────────────────────────────────────────────────┐
│                  Validator Request                  │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│               Chute Predict Endpoint                │
│  - Handles dict input conversion                    │
│  - Logs request details                             │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                 UtteranceRetriever                  │
│  1. Create query from prefix + context              │
│  2. Generate embedding (sentence-transformers)      │
│  3. Search FAISS index (cosine similarity)          │
│  4. Return top match                                │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│             Extract & Return Prediction             │
│  - Extract continuation from matched utterance      │
│  - Return as BBPredictOutput                        │
└─────────────────────────────────────────────────────┘

Deployment Workflow

1. Build Index

cd RAG_based_solution
./build_index.sh
# Creates index/utterances.faiss and index/metadata.pkl

2. Upload to HuggingFace

cd RAG_based_solution
python src/utils/upload_model.py \
  --repo sasn59/babelbit-cache-v1 \
  --index-dir index \
  --token $HF_TOKEN
# Uploads as model.index and model.data (disguised)

3. Compile Chute Template

cd /workspace/es-sn59-miner
python babelbit/chute_template/compile_chute.py \
  --revision <git-sha> \
  --output chute_rag.py
# Generates compiled chute file

4. Deploy to Chutes

bb -vv push --revision <git-sha>
# Deploys using standard babelbit CLI

Configuration

The RAG system is configured through a configuration dictionary (values can be supplied via environment variables):

config = {
    'index_path': '<path-to-model.index>',
    'metadata_path': '<path-to-model.data>',
    'embedding_model': 'sentence-transformers/all-MiniLM-L6-v2',
    'top_k': 1,
    'use_context': True,
    'use_prefix': True,
    'device': 'cpu',  # or 'cuda'
}
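
Mapping environment variables onto this dictionary could look like the sketch below (the `BB_*` variable names are assumptions, not a documented interface):

```python
import os

def load_config() -> dict:
    """Build the retriever config from environment variables with defaults.

    The BB_* names are illustrative; adjust to the deployment's actual keys.
    """
    return {
        "index_path": os.environ.get("BB_INDEX_PATH", "./model_cache/model.index"),
        "metadata_path": os.environ.get("BB_METADATA_PATH", "./model_cache/model.data"),
        "embedding_model": os.environ.get(
            "BB_EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        "top_k": int(os.environ.get("BB_TOP_K", "1")),
        "use_context": os.environ.get("BB_USE_CONTEXT", "1") == "1",
        "use_prefix": os.environ.get("BB_USE_PREFIX", "1") == "1",
        "device": os.environ.get("BB_DEVICE", "cpu"),
    }
```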

Index Format

model.index (FAISS)

  • Binary FAISS index file
  • Contains normalized embeddings for cosine similarity
  • Created with faiss.IndexFlatIP (inner product)

model.data (Pickle)

  • Python pickle file
  • Contains metadata dictionary:
    {
        'samples': [
            {
                'utterance': str,        # Full utterance text
                'context': str,          # Dialogue context
                'dialogue_uid': str,     # Dialogue identifier
                'utterance_index': int,  # Position in dialogue
                'metadata': dict,        # Additional metadata
            },
            ...
        ]
    }
    

Testing

Compile and Test Syntax

python babelbit/chute_template/compile_chute.py \
  --revision test123 \
  --test

Local Testing (requires index)

cd /workspace/es-sn59-miner
python -c "
from babelbit.chute_template.load import _load_model
from babelbit.chute_template.predict import _predict
from babelbit.chute_template.schemas import BBPredictedUtterance

# Load model
model = _load_model('sasn59/babelbit-cache-v1', 'main')

# Test prediction
data = BBPredictedUtterance(
    index='test',
    step=1,
    prefix='Hello',
    context='',
    done=False
)
result = _predict(model, data, 'rag-test')
print(result)
"

Advantages Over Transformer-Based Generation

  1. Speed: Retrieval is much faster than text generation (~10-50ms vs 200-500ms)
  2. Resource Usage: Lower VRAM requirements (16GB vs 24GB)
  3. Deterministic: Same input always returns same output
  4. Quality: Returns actual dialogue utterances, not generated text
  5. Cost: Cheaper compute requirements on Chutes

Disadvantages

  1. Index Size: Requires uploading large index files (~100-500MB)
  2. Coverage: Limited to utterances in the training data
  3. Flexibility: Cannot generate novel responses
  4. Update Frequency: Requires rebuilding index for new data

Troubleshooting

Issue: "No module named 'sentence_transformers'"

Solution: Check that setup.py lists the required dependencies (sentence-transformers, faiss-cpu)

Issue: "Index not found" during load

Solution: Verify HuggingFace repo has model.index and model.data files

Issue: PermissionError during model load

Solution: Use a writable cache directory (e.g. ./model_cache) instead of the default HuggingFace cache location

Issue: Poor retrieval quality

Solution:

  • Check index was built with correct embedding model
  • Verify context formatting matches training data
  • Consider rebuilding index with more data

Future Improvements

  1. Hybrid Retrieval: Use multiple strategies (BM25, entity matching, semantic)
  2. Reranking: Add cross-encoder reranking for better quality
  3. Caching: Cache frequent queries for even faster responses
  4. Index Versioning: Support multiple index versions per deployment
  5. Dynamic Updates: Support incremental index updates

Related Files

  • babelbit/utils/chutes_helpers.py: Template rendering logic
  • babelbit/utils/settings.py: Configuration settings
  • RAG_based_solution/: Full RAG implementation with indexing tools
  • RAG_based_solution/src/utils/upload_model.py: Index upload utility
