Syed Taha
metadata
title: Substrate
emoji: 🦀
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 7860
pinned: false

Substrate

Cross-repo code reasoning with retrieval-augmented generation (RAG): function-boundary chunking, hybrid BM25 + dense retrieval, cross-encoder re-ranking, and LLM-as-Judge evaluation.
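
How the BM25 and dense result lists are merged is not described in this README. One common way to fuse two ranked lists is reciprocal rank fusion (RRF). The sketch below assumes that approach. The function name, the chunk IDs, and the constant `k=60` are illustrative and not taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. RRF is an assumption here, not necessarily
    the fusion method Substrate uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by the two retrievers.
bm25_hits = [
    "numpy/core.py::asarray",
    "transformers/utils.py::to_numpy",
    "scipy/linalg.py::solve",
]
dense_hits = [
    "transformers/utils.py::to_numpy",
    "pandas/frame.py::to_numpy",
    "numpy/core.py::asarray",
]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused[:2])
```

A chunk that ranks well in both lists rises above one that ranks first in only a single list, which is the behavior a hybrid retriever usually wants before handing candidates to a re-ranker.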

Demo query:

"What functions in transformers would break if numpy changed the default
dtype of np.float_ from float64 to float32?"

Setup (to run locally)

# 1. Clone this repo and switch to local configuration
git clone https://github.com/syedtaha22/substrate
cd substrate
git checkout local

# 2. Create virtualenv
python3 -m venv .venv
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env
# Edit .env and add your Hugging Face API token

Local LLM Setup (Required)

Install and start Ollama for local inference:

# Install Ollama: https://ollama.ai

# Pull required models
ollama pull llama3.1:8b
ollama pull mistral:7b

# Start Ollama server (runs at localhost:11434)
ollama serve
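
Before running the pipeline, it can help to confirm that the server is up and both models are pulled. The following is a minimal sketch, assuming Ollama's `/api/tags` endpoint returns a JSON object of the form `{"models": [{"name": ...}]}`. The helper names `missing_models` and `fetch_tags` are hypothetical and not part of this repository.

```python
import json
import urllib.request

REQUIRED_MODELS = ["llama3.1:8b", "mistral:7b"]


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required models absent from an /api/tags response."""
    installed = {m.get("name") for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://localhost:11434"):
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("Missing models:", missing_models(fetch_tags()) or "none")
    except (OSError, ValueError) as exc:
        # OSError covers refused connections and timeouts;
        # ValueError covers a response that is not valid JSON.
        print("Ollama server not reachable:", exc)
```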

Pipeline (run once, offline)

The pipeline clones 6 repositories, parses function boundaries, embeds chunks, and builds keyword indices. The commands below use function-level chunks (profiles A3 and A5); repeat them with --strategy fixed and --strategy recursive for the other configurations.

# Step 1: Clone all 6 target repos (sparse, ~800MB)
python pipeline/clone_repos.py

# Step 2: Parse functions and extract chunks
python pipeline/parse_repos.py --strategy function
# Repeat for: --strategy fixed, --strategy recursive

# Step 3: Embed chunks and upsert to vector database
python pipeline/embed_and_upsert.py --strategy function

# Step 4: Build BM25 keyword index
python pipeline/build_bm25.py --strategy function
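
To illustrate what function-boundary chunking means in Step 2, here is a minimal sketch built on Python's standard `ast` module. It is not the code in `pipeline/parse_repos.py`. The `function_chunks` helper, the chunk dictionary fields, and the sample source are assumptions made for illustration.

```python
import ast


def function_chunks(source, path="<memory>"):
    """Split Python source into one chunk per function or method."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "path": path,
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": text,
            })
    return sorted(chunks, key=lambda c: c["start_line"])


sample = '''\
import numpy as np

def to_float(x):
    return np.asarray(x, dtype=np.float64)

class Scaler:
    def scale(self, x, factor=2.0):
        return to_float(x) * factor
'''

for chunk in function_chunks(sample, path="example.py"):
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Unlike fixed-size splitting, each chunk here ends where the function ends, so a retrieved chunk never cuts a body in half. Note that this sketch emits nested functions as separate chunks that overlap their parent, and it excludes decorator lines.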

Evaluation (Reproduce Paper Results)

All 33 evaluation queries are in eval/test_queries.yaml. Run the evaluations with:

# Baseline: LLM with no retrieval context
python eval/eval_baseline.py
# Output: eval/results/baseline.json

# Retrieval-only evaluation (top-10 chunks, no generation)
python eval/eval_retrieval.py --strategy function

# Full RAG with LLM-as-Judge (faithfulness + relevancy scoring)
python eval/eval_rag.py --profile A5
# Takes ~2 hours on a single GPU; repeat with --profile A1 through A5 to test all configurations

Results JSON files contain per-query and per-tier metrics:

  • keyword_coverage: % of expected keywords in answer
  • faithfulness: % of answer claims grounded in retrieved code
  • answer_relevancy: cosine similarity between query and generated-question embeddings
  • pass_rate: % of queries exceeding threshold
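
The metrics above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in eval/: it assumes case-insensitive substring matching for keywords and a strict greater-than comparison for the pass threshold. The sample answer, scores, and threshold value are made up. Faithfulness is omitted because it depends on an LLM judge.

```python
import math


def keyword_coverage(answer, expected_keywords):
    """Percentage of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return 100.0 * hits / len(expected_keywords)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pass_rate(scores, threshold):
    """Percentage of per-query scores strictly above the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


# Hypothetical answer and scores, for illustration only.
answer = "np.float_ is an alias for float64, so dtype checks would change."
print(keyword_coverage(answer, ["float64", "dtype", "float32"]))
print(pass_rate([0.9, 0.4, 0.7, 0.6], threshold=0.5))
```

For answer_relevancy, `cosine_similarity` would be applied to the embedding of the original query and the embeddings of questions generated from the answer. The embedding model itself is outside the scope of this sketch.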

Running the App locally

The Chainlit app provides an interactive interface to test queries against the A5 profile.

# Start the app
chainlit run app/app.py --port 8000
# Open http://localhost:8000 in your browser