Spaces:

AishaSurve
/

codebase-agent

Running

App Files Files Community

codebase-agent / README.md

AishaSurve

Codebase Intelligence Agent: code-aware RAG + test-gen agent + eval

8e72e1f 3 days ago

preview code

Raw

History Blame Contribute Delete

4.55 kB

	---
	title: Codebase Intelligence Agent
	emoji: 🧭
	colorFrom: indigo
	colorTo: blue
	sdk: streamlit
	app_file: app.py
	python_version: "3.11"
	pinned: false
	---

	# Codebase Intelligence Agent

	An AI assistant that understands a Python codebase. Upload a repository, ask
	questions and get answers with exact file + line citations, or have the
	agent generate pytest tests for any function by reading its real source.

	Built around code-aware retrieval (tree-sitter AST chunking, not naive text
	splitting) and measured with an evaluation harness.

	## Demo

	![demo](docs/demo.gif)

	- Ask the codebase — "where are JWT tokens created?" → grounded answer citing
	`app/core/security.py:38-44`, with the actual code shown as a source.
	- Generate tests — name a function → a tool-calling agent reads its real
	source and dependencies, then writes pytest tests grounded in that code.

	## Evaluation

	Measured on a real FastAPI backend (74 files, 369 definitions), deterministic
	(`temperature=0`):

	\| Metric \| Result \|
	\|---\|---\|
	\| File-level retrieval accuracy \| 90% \|
	\| Function-level retrieval accuracy \| 75% \|
	\| Citation accuracy (answer cites the right file) \| 75% \|
	\| Median latency \| ~3.3s / query \|

	> Honest miss worth noting: "where is the FastAPI app created?" misses because
	> the app is instantiated at module level (`app = FastAPI()`) rather than in a
	> named function — module-level instantiation is harder to retrieve than named
	> symbols. Indexing top-level assignments specially is the fix (roadmap).

	## How it works

	```
	ZIP repo
	\|
	v
	File scanner skip venv/.git/__pycache__/node_modules, size cap
	\|
	v
	tree-sitter parser AST -> functions, classes, methods (+ exact line numbers)
	\|
	v
	Code chunker one chunk per definition + file/line metadata
	\|
	v
	Embeddings -------\
	\| \
	v v
	FAISS (semantic) BM25 (code-aware tokenizer: matches `jwt.encode`)
	\ /
	v v
	Hybrid retrieval -> cross-encoder rerank -> top-5
	\|
	+--------+--------+
	\| \|
	v v
	Grounded Q&A Test-gen agent
	(file:line cites) (tool-calling loop)
	```

	Why it's code-aware: chunking by AST means each chunk is a whole function or
	class with its exact line range — so citations are precise and un-hallucinatable,
	and retrieval matches real code units instead of arbitrary text windows. The
	code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode`
	actually match.

	The agent: given a target function, the LLM calls `get_definition` and
	`search_code` to read the real source, then writes pytest tests grounded in it —
	a tool-calling loop (no framework), the model planning and acting rather than
	answering in one shot.

	## Tech stack

	Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 ·
	cross-encoder reranker · OpenAI (`gpt-4.1-mini`, temperature 0)

	## Run locally

	```bash
	python -m venv .venv && .venv\Scripts\activate # Windows
	pip install -r requirements.txt
	echo OPENAI_API_KEY=sk-your-key > .env
	streamlit run app.py
	```

	## Evaluate

	```bash
	python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json
	```

	## Project structure

	```
	src/
	├── ingestion/ scanner, tree-sitter parser, chunker
	├── rag/ embedder, FAISS, BM25, hybrid, reranker, answerer
	├── agent/ tools (search_code, get_definition) + tool-calling workflow
	└── evaluation/ eval harness
	app.py Streamlit UI (Ask + Generate tests)
	evaluate.py eval CLI
	```

	## Limitations & roadmap

	v1 is a deliberate vertical slice. Known limits and next steps:

	- Python only — multi-language via more tree-sitter grammars.
	- ZIP upload only — GitHub URL ingestion (clone) next.
	- Module-level symbols (e.g. `app = FastAPI()`) retrieve worse than named
	functions — index top-level assignments specially.
	- Citation accuracy is a strict string check — the answer must contain the
	filename; LLM-as-judge grading would measure correctness more fairly.
	- General-purpose embeddings — a code-specific embedding model
	(`jina-embeddings-v2-base-code`) would likely improve retrieval.
	- Future: code graph (call/import relationships), PR review mode, bug-fix tool,
	documentation agent.

	## About

	A self-directed project focused on code-aware RAG, tool-calling agents, and
	measured evaluation — not a generic "chat with your repo" demo.