substrate / README.md
Syed Taha
remove research section and clean up README.md
e35e4e5
---
title: Substrate
emoji: 🦀
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# Substrate
> Cross-repo code reasoning via advanced RAG - function-boundary chunking,
> hybrid BM25+dense retrieval, cross-encoder re-ranking, and LLM-as-Judge evaluation.
**Demo query:**
```
"What functions in transformers would break if numpy changed the default
dtype of np.float_ from float64 to float32?"
```
## Setup (to run locally)
```bash
# 1. Clone this repo and switch to local configuration
git clone https://github.com/syedtaha22/substrate
cd substrate
git checkout local
# 2. Create virtualenv
python3 -m venv .venv
source .venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure environment
cp .env.example .env
# Edit .env - add your HuggingFace API token
```
## Local LLM Setup (Required)
Install and start Ollama for local inference:
```bash
# Install Ollama: https://ollama.ai
# Pull required models
ollama pull llama3.1:8b
ollama pull mistral:7b
# Start Ollama server (runs at localhost:11434)
ollama serve
```
## Pipeline (run once, offline)
The pipeline clones 6 repositories, parses function boundaries, embeds chunks, and builds keyword indices.
Example shown for function-level chunks (A3, A5); repeat with `--strategy fixed` and `--strategy recursive` for other configurations.
```bash
# Step 1: Clone all 6 target repos (sparse, ~800MB)
python pipeline/clone_repos.py
# Step 2: Parse functions and extract chunks
python pipeline/parse_repos.py --strategy function
# Repeat for: --strategy fixed, --strategy recursive
# Step 3: Embed chunks and upsert to vector database
python pipeline/embed_and_upsert.py --strategy function
# Step 4: Build BM25 keyword index
python pipeline/build_bm25.py --strategy function
```
## Evaluation (Reproduce Paper Results)
All 33 evaluation queries are in `eval/test_queries.yaml`:
```bash
# Baseline: LLM with no retrieval context
python eval/eval_baseline.py
# Output: eval/results/baseline.json
# Retrieval-only evaluation (top-10 chunks, no generation)
python eval/eval_retrieval.py --strategy function
# Full RAG with LLM-as-Judge (faithfulness + relevancy scoring)
python eval/eval_rag.py --profile A5
# Takes ~2 hours on single GPU; run --profile A1 through A5 to test all configurations
```
Results JSON files contain per-query and per-tier metrics:
- `keyword_coverage`: % of expected keywords in answer
- `faithfulness`: % of answer claims grounded in retrieved code
- `answer_relevancy`: cosine similarity between query and generated-question embeddings
- `pass_rate`: % of queries exceeding threshold
## Running the App locally
The Chainlit app provides an interactive interface to test queries against the A5 profile.
```bash
# Start the app
chainlit run app/app.py --port 8000
# Open http://localhost:8000 in your browser
```