# RAG-Based Chute Template Implementation

This directory contains the RAG (Retrieval-Augmented Generation) based implementation for the Babelbit chute template system.
## Overview

Instead of using transformer-based text-generation models, this implementation uses FAISS-based vector search to retrieve similar utterances from a pre-built index.
## Key Components

### 1. `retriever.py`

The core retrieval logic, using FAISS for similarity search:

- `UtteranceRetriever`: main class for querying the FAISS index
- `RetrievalResult`: data class for search results
- Cosine-similarity search over normalized embeddings
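The retrieval core can be sketched as follows. This is an illustrative reimplementation, not the actual `retriever.py`: NumPy stands in for FAISS to keep the sketch self-contained, and on L2-normalized vectors the inner product that `faiss.IndexFlatIP` computes is exactly the cosine similarity used here.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RetrievalResult:
    utterance: str
    score: float  # cosine similarity in [-1, 1]

class UtteranceRetriever:
    """Cosine-similarity search over L2-normalized utterance embeddings.

    NumPy stands in for faiss.IndexFlatIP; on normalized vectors the
    inner product both compute *is* the cosine similarity.
    """
    def __init__(self, embeddings: np.ndarray, utterances: list[str]):
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = embeddings / norms       # normalize once, at build time
        self.utterances = utterances

    def retrieve_top1(self, query_vec: np.ndarray) -> RetrievalResult:
        q = query_vec / np.linalg.norm(query_vec)  # normalize the query too
        scores = self.embeddings @ q               # inner product == cosine
        best = int(np.argmax(scores))
        return RetrievalResult(self.utterances[best], float(scores[best]))

# Toy 2-D "embeddings" standing in for sentence-transformer vectors
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
r = UtteranceRetriever(emb, ["hello", "goodbye", "hi there"])
print(r.retrieve_top1(np.array([0.9, 0.1])).utterance)  # → "hello"
```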
### 2. `load.py`

Downloads and initializes the RAG system:

- Downloads `model.index` (FAISS index) from HuggingFace
- Downloads `model.data` (metadata pickle) from HuggingFace
- Initializes `UtteranceRetriever` with its configuration
- Uses a writable cache directory for the Chutes environment
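The download step might look roughly like this with the HuggingFace Hub API. `hf_hub_download` is the real Hub function; the wrapper name, defaults, and returned structure are assumptions for illustration, not the actual `load.py`.

```python
from pathlib import Path

def load_retriever_paths(repo_id: str, revision: str = "main",
                         cache_dir: str = "./model_cache") -> dict:
    """Download model.index and model.data into a writable cache directory."""
    # Imported lazily so the sketch stays importable without the package.
    from huggingface_hub import hf_hub_download

    Path(cache_dir).mkdir(parents=True, exist_ok=True)  # writable on Chutes
    index_path = hf_hub_download(repo_id=repo_id, filename="model.index",
                                 revision=revision, cache_dir=cache_dir)
    metadata_path = hf_hub_download(repo_id=repo_id, filename="model.data",
                                    revision=revision, cache_dir=cache_dir)
    # The real load.py would now construct an UtteranceRetriever from these paths.
    return {"index_path": index_path, "metadata_path": metadata_path}
```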
### 3. `predict.py`

RAG-based prediction logic:

- Uses `retriever.retrieve_top1()` instead of text generation
- Extracts the continuation from the matched utterance
- Handles dict input conversion (validator compatibility)
- Comprehensive logging for debugging
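The "extract continuation" step can be illustrated like this; it is a simplified stand-in for the logic in `predict.py`, whose exact matching rules may differ:

```python
def extract_continuation(prefix: str, matched_utterance: str) -> str:
    """Return the part of the matched utterance that continues the prefix."""
    if matched_utterance.lower().startswith(prefix.lower()):
        # Slice off the prefix, keeping punctuation that follows it
        return matched_utterance[len(prefix):].lstrip()
    # No overlap: fall back to returning the whole retrieved utterance
    return matched_utterance

print(extract_continuation("Hello", "Hello, how are you?"))  # → ", how are you?"
```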
### 4. `setup.py`

Chute environment configuration:

- RAG-specific dependencies: `sentence-transformers==2.2.2` (embedding model), `faiss-cpu==1.7.4` (vector search), `pydantic`, `chutes==0.3.61`
- Lower VRAM requirements (16 GB vs. 24 GB)
- 10-hour hot time for testing
### 5. `compile_chute.py`

Template compilation and validation script:

- Renders the template with all injections
- Validates Python syntax with `py_compile`
- Generates deployable chute files
## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                      Validator Request                       │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    Chute Predict Endpoint                    │
│  - Handles dict input conversion                             │
│  - Logs request details                                      │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                      UtteranceRetriever                      │
│  1. Create query from prefix + context                       │
│  2. Generate embedding (sentence-transformers)               │
│  3. Search FAISS index (cosine similarity)                   │
│  4. Return top match                                         │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│                 Extract & Return Prediction                  │
│  - Extract continuation from matched utterance               │
│  - Return as BBPredictOutput                                 │
└──────────────────────────────────────────────────────────────┘
```
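The dict-input conversion in the predict endpoint can be sketched like this. The schema fields follow the local test shown under Testing; the real `BBPredictedUtterance` is a pydantic model and may differ in detail, so a plain dataclass is used here for illustration.

```python
from dataclasses import dataclass

@dataclass
class BBPredictedUtterance:
    index: str
    step: int
    prefix: str
    context: str = ""
    done: bool = False

def coerce_input(payload) -> BBPredictedUtterance:
    """Validators may send a plain dict instead of the schema object."""
    if isinstance(payload, dict):
        return BBPredictedUtterance(**payload)
    return payload

u = coerce_input({"index": "test", "step": 1, "prefix": "Hello"})
print(u.prefix)  # → "Hello"
```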
## Deployment Workflow

### 1. Build Index

```bash
cd RAG_based_solution
./build_index.sh
# Creates index/utterances.faiss and index/metadata.pkl
```

### 2. Upload to HuggingFace

```bash
cd RAG_based_solution
python src/utils/upload_model.py \
    --repo sasn59/babelbit-cache-v1 \
    --index-dir index \
    --token $HF_TOKEN
# Uploads as model.index and model.data (disguised)
```

### 3. Compile Chute Template

```bash
cd /workspace/es-sn59-miner
python babelbit/chute_template/compile_chute.py \
    --revision <git-sha> \
    --output chute_rag.py
# Generates the compiled chute file
```

### 4. Deploy to Chutes

```bash
bb -vv push --revision <git-sha>
# Deploys using the standard babelbit CLI
```
## Configuration

The RAG system is configured through environment variables, which populate a configuration dictionary:

```python
config = {
    'index_path': '<path-to-model.index>',
    'metadata_path': '<path-to-model.data>',
    'embedding_model': 'sentence-transformers/all-MiniLM-L6-v2',
    'top_k': 1,
    'use_context': True,
    'use_prefix': True,
    'device': 'cpu',  # or 'cuda'
}
```
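One way the environment variables could map onto that dictionary; the variable names below are illustrative assumptions, not the project's actual settings keys:

```python
import os

# RAG_* names are hypothetical; adjust to the actual settings module.
config = {
    'index_path': os.environ.get('RAG_INDEX_PATH', './model_cache/model.index'),
    'metadata_path': os.environ.get('RAG_METADATA_PATH', './model_cache/model.data'),
    'embedding_model': os.environ.get(
        'RAG_EMBEDDING_MODEL', 'sentence-transformers/all-MiniLM-L6-v2'),
    'top_k': int(os.environ.get('RAG_TOP_K', '1')),
    'use_context': os.environ.get('RAG_USE_CONTEXT', '1') == '1',
    'use_prefix': os.environ.get('RAG_USE_PREFIX', '1') == '1',
    'device': os.environ.get('RAG_DEVICE', 'cpu'),  # or 'cuda'
}
```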
## Index Format

### `model.index` (FAISS)

- Binary FAISS index file
- Contains normalized embeddings for cosine similarity
- Created with `faiss.IndexFlatIP` (inner product)

### `model.data` (Pickle)

- Python pickle file
- Contains a metadata dictionary:

```python
{
    'samples': [
        {
            'utterance': str,        # Full utterance text
            'context': str,          # Dialogue context
            'dialogue_uid': str,     # Dialogue identifier
            'utterance_index': int,  # Position in dialogue
            'metadata': dict,        # Additional metadata
        },
        ...
    ]
}
```
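A quick round-trip showing the `model.data` layout in practice (the sample values are made up):

```python
import os
import pickle
import tempfile

# One sample in the metadata layout described above
samples = [{
    'utterance': "Hello, how are you?",
    'context': "A: Hi there",
    'dialogue_uid': "d-001",
    'utterance_index': 1,
    'metadata': {},
}]

# Write and re-read the pickle, as load.py would after downloading model.data
path = os.path.join(tempfile.mkdtemp(), "model.data")
with open(path, "wb") as f:
    pickle.dump({'samples': samples}, f)
with open(path, "rb") as f:
    meta = pickle.load(f)

print(meta['samples'][0]['utterance'])  # → "Hello, how are you?"
```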
## Testing

### Compile and Test Syntax

```bash
python babelbit/chute_template/compile_chute.py \
    --revision test123 \
    --test
```

### Local Testing (requires index)

```bash
cd /workspace/es-sn59-miner
python -c "
from babelbit.chute_template.load import _load_model
from babelbit.chute_template.predict import _predict
from babelbit.chute_template.schemas import BBPredictedUtterance

# Load model
model = _load_model('sasn59/babelbit-cache-v1', 'main')

# Test prediction
data = BBPredictedUtterance(
    index='test',
    step=1,
    prefix='Hello',
    context='',
    done=False,
)
result = _predict(model, data, 'rag-test')
print(result)
"
```
## Advantages Over Transformer-Based Generation

- **Speed**: retrieval is much faster than text generation (~10-50 ms vs. 200-500 ms)
- **Resource usage**: lower VRAM requirements (16 GB vs. 24 GB)
- **Deterministic**: the same input always returns the same output
- **Quality**: returns actual dialogue utterances, not generated text
- **Cost**: cheaper compute requirements on Chutes

## Disadvantages

- **Index size**: requires uploading large index files (~100-500 MB)
- **Coverage**: limited to utterances in the training data
- **Flexibility**: cannot generate novel responses
- **Update frequency**: requires rebuilding the index for new data
## Troubleshooting

**Issue**: "No module named 'sentence_transformers'"

**Solution**: check that `setup.py` lists the correct dependencies.

**Issue**: "Index not found" during load

**Solution**: verify that the HuggingFace repo contains the `model.index` and `model.data` files.

**Issue**: `PermissionError` during model load

**Solution**: use `./model_cache` (a writable directory) as the cache location.

**Issue**: poor retrieval quality

**Solution**:
- Check that the index was built with the correct embedding model
- Verify that the context formatting matches the training data
- Consider rebuilding the index with more data
## Future Improvements

- **Hybrid retrieval**: use multiple strategies (BM25, entity matching, semantic)
- **Reranking**: add cross-encoder reranking for better quality
- **Caching**: cache frequent queries for even faster responses
- **Index versioning**: support multiple index versions per deployment
- **Dynamic updates**: support incremental index updates
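As an example of the hybrid-retrieval idea, ranked lists from different strategies are commonly merged with reciprocal rank fusion (RRF). This sketch is not part of the current codebase; the constant `k=60` is the conventional default from the RRF literature:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked candidate lists (e.g. BM25, semantic) into one."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly by any strategy accumulate more score
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["u3", "u1", "u2"]       # lexical ranking
semantic = ["u1", "u2", "u3"]   # embedding ranking
print(reciprocal_rank_fusion([bm25, semantic])[0])  # → "u1"
```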
## Related Files

- `babelbit/utils/chutes_helpers.py`: template rendering logic
- `babelbit/utils/settings.py`: configuration settings
- `RAG_based_solution/`: full RAG implementation with indexing tools
- `RAG_based_solution/src/utils/upload_model.py`: index upload utility