Spaces:
Sleeping
Sleeping
metadata
title: Substrate
emoji: 🦀
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 7860
pinned: false
Substrate
Cross-repo code reasoning via advanced RAG - function-boundary chunking, hybrid BM25+dense retrieval, cross-encoder re-ranking, and LLM-as-Judge evaluation.
Demo query:
"What functions in transformers would break if numpy changed the default
dtype of np.float_ from float64 to float32?"
Setup (to run locally)
# 1. Clone this repo and switch to local configuration
git clone https://github.com/syedtaha22/substrate
cd substrate
git checkout local
# 2. Create virtualenv
python3 -m venv .venv
source .venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure environment
cp .env.example .env
# Edit .env - add your HuggingFace API token
Local LLM Setup (Required)
Install and start Ollama for local inference:
# Install Ollama: https://ollama.ai
# Pull required models
ollama pull llama3.1:8b
ollama pull mistral:7b
# Start Ollama server (runs at localhost:11434)
ollama serve
Pipeline (run once, offline)
The pipeline clones 6 repositories, parses function boundaries, embeds chunks, and builds keyword indices.
Example shown for function-level chunks (A3, A5); repeat with --strategy fixed and --strategy recursive for other configurations.
# Step 1: Clone all 6 target repos (sparse, ~800MB)
python pipeline/clone_repos.py
# Step 2: Parse functions and extract chunks
python pipeline/parse_repos.py --strategy function
# Repeat for: --strategy fixed, --strategy recursive
# Step 3: Embed chunks and upsert to vector database
python pipeline/embed_and_upsert.py --strategy function
# Step 4: Build BM25 keyword index
python pipeline/build_bm25.py --strategy function
Evaluation (Reproduce Paper Results)
All 33 evaluation queries are in eval/test_queries.yaml:
# Baseline: LLM with no retrieval context
python eval/eval_baseline.py
# Output: eval/results/baseline.json
# Retrieval-only evaluation (top-10 chunks, no generation)
python eval/eval_retrieval.py --strategy function
# Full RAG with LLM-as-Judge (faithfulness + relevancy scoring)
python eval/eval_rag.py --profile A5
# Takes ~2 hours on single GPU; run --profile A1 through A5 to test all configurations
Results JSON files contain per-query and per-tier metrics:
keyword_coverage: % of expected keywords in answerfaithfulness: % of answer claims grounded in retrieved codeanswer_relevancy: cosine similarity between query and generated-question embeddingspass_rate: % of queries exceeding threshold
Running the App locally
The Chainlit app provides an interactive interface to test queries against the A5 profile.
# Start the app
chainlit run app/app.py --port 8000
# Open http://localhost:8000 in your browser