---
title: Codebase Intelligence Agent
emoji: 🧭
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: "3.11"
pinned: false
---
# Codebase Intelligence Agent
An AI assistant that understands a Python codebase. Upload a repository, ask
questions and get answers with **exact file + line citations**, or have the
**agent generate pytest tests** for any function by reading its real source.
Built around code-aware retrieval (tree-sitter AST chunking, not naive text
splitting) and measured with an evaluation harness.
## Demo
![demo](docs/demo.gif)
- **Ask the codebase**: "where are JWT tokens created?" → grounded answer citing
`app/core/security.py:38-44`, with the actual code shown as a source.
- **Generate tests**: name a function → a tool-calling agent reads its real
source and dependencies, then writes pytest tests grounded in that code.
## Evaluation
Measured on a real FastAPI backend (74 files, 369 definitions), with
`temperature=0` so runs are repeatable:
| Metric | Result |
|---|---|
| File-level retrieval accuracy | **90%** |
| Function-level retrieval accuracy | **75%** |
| Citation accuracy (answer cites the right file) | **75%** |
| Median latency | **~3.3s / query** |
> Honest miss worth noting: "where is the FastAPI app created?" misses because
> the app is instantiated at module level (`app = FastAPI()`) rather than in a
> named function; module-level instantiation is harder to retrieve than named
> symbols. Indexing top-level assignments specially is the fix (roadmap).
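
The metrics in the table can be computed along these lines. This is a minimal sketch, not the project's harness: the record keys (`expected_file`, `expected_symbol`, `retrieved`, `answer`, `latency_s`) are hypothetical and do not describe the real schema of `data/eval/testset.json`, and counting a hit anywhere in the retrieved list is an assumption. The citation check mirrors the strict filename match described under Limitations.

```python
from pathlib import PurePosixPath
from statistics import median


def score(records):
    """Compute retrieval/citation accuracy and median latency.

    Each record is a dict with hypothetical keys:
      expected_file, expected_symbol : the ground-truth location
      retrieved                      : list of (file, symbol) tuples, best first
      answer                         : the generated answer text
      latency_s                      : seconds taken for the query
    """
    n = len(records)
    # File-level hit: any retrieved chunk comes from the expected file.
    file_hits = sum(
        any(f == r["expected_file"] for f, _ in r["retrieved"]) for r in records
    )
    # Function-level hit: the exact (file, symbol) pair was retrieved.
    func_hits = sum(
        (r["expected_file"], r["expected_symbol"]) in r["retrieved"] for r in records
    )
    # Strict string check: the answer must contain the expected filename.
    cite_hits = sum(
        PurePosixPath(r["expected_file"]).name in r["answer"] for r in records
    )
    return {
        "file_accuracy": file_hits / n,
        "function_accuracy": func_hits / n,
        "citation_accuracy": cite_hits / n,
        "median_latency_s": median(r["latency_s"] for r in records),
    }
```

Because the citation check is a substring match, an answer that cites the right function but omits the filename scores zero, which is why the number can understate correctness.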
## How it works
```
ZIP repo
    |
    v
File scanner          skip venv/.git/__pycache__/node_modules, size cap
    |
    v
tree-sitter parser    AST -> functions, classes, methods (+ exact line numbers)
    |
    v
Code chunker          one chunk per definition + file/line metadata
    |
    v
Embeddings -------\
    |              \
    v               v
FAISS (semantic)    BM25 (code-aware tokenizer: matches `jwt.encode`)
    \               /
     v             v
Hybrid retrieval -> cross-encoder rerank -> top-5
         |
+--------+--------+
|                 |
v                 v
Grounded Q&A      Test-gen agent
(file:line cites) (tool-calling loop)
```
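
The parser and chunker stages in the diagram can be illustrated with a dependency-free sketch. Note the swap: the project uses tree-sitter, while this example uses Python's standard-library `ast` module as a stand-in so it runs without extra packages. `Chunk` and `chunk_source` are hypothetical names, not the project's API.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str        # qualified name, e.g. "Auth.login"
    kind: str        # "function", "class", or "method"
    file: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_source(source: str, file: str) -> list[Chunk]:
    """Return one chunk per top-level function, class, and class method."""
    lines = source.splitlines()
    chunks: list[Chunk] = []

    def visit(node: ast.AST, prefix: str, in_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if in_class else "function"
            elif isinstance(child, ast.ClassDef):
                kind = "class"
            else:
                continue  # imports, assignments, expressions: not chunked here
            name = f"{prefix}{child.name}"
            text = "\n".join(lines[child.lineno - 1 : child.end_lineno])
            chunks.append(
                Chunk(name, kind, file, child.lineno, child.end_lineno, text)
            )
            # Descend into classes only; nested functions stay inside
            # their parent's chunk.
            if kind == "class":
                visit(child, f"{name}.", True)

    visit(ast.parse(source), "", False)
    return chunks
```

Each chunk carries its own file and line range, so a citation such as `file:start-end` can be built from metadata instead of being generated by the model.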
**Why it's code-aware:** chunking by AST means each chunk is a whole function or
class with its exact line range, so citations are precise and un-hallucinatable,
and retrieval matches real code units instead of arbitrary text windows. The
code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode`
actually match.
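
A code-aware tokenizer might look like the sketch below. The README does not spell out the project's tokenization rules, so these are assumptions: dotted names are kept whole and also split into parts, and snake_case and camelCase identifiers are additionally split into subwords. `code_tokens` is a hypothetical name.

```python
import re

# A dotted identifier chain such as "jwt.encode" or "self.db.session".
_DOTTED = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")
# Subwords inside one identifier: acronyms, capitalized or lowercase words, digits.
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def code_tokens(text: str) -> list[str]:
    """Tokenize code or a query so symbol searches match exactly."""
    tokens: list[str] = []
    for dotted in _DOTTED.findall(text):
        parts = dotted.split(".")
        if len(parts) > 1:
            tokens.append(dotted.lower())  # whole chain: "jwt.encode"
        for part in parts:
            tokens.append(part.lower())    # each identifier: "jwt", "encode"
            subwords = [s.lower() for s in _SUBWORD.findall(part)]
            if len(subwords) > 1:
                tokens.extend(subwords)    # "create", "access", "token"
    return tokens
```

The same function would be applied to both chunks and queries, for example to build the tokenized corpus passed to rank-bm25's `BM25Okapi`, so that a query for `jwt.encode` shares the tokens `jwt.encode`, `jwt`, and `encode` with any chunk that calls it.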
**The agent:** given a target function, the LLM calls `get_definition` and
`search_code` to read the real source, then writes pytest tests grounded in it.
This is a plain tool-calling loop (no framework): the model plans and acts
rather than answering in one shot.
## Tech stack
Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 ·
cross-encoder reranker · OpenAI (`gpt-4.1-mini`, temperature 0)
## Run locally
```bash
python -m venv .venv && .venv\Scripts\activate   # Windows
# macOS/Linux: python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py
```
## Evaluate
```bash
python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json
```
## Project structure
```
src/
├── ingestion/    scanner, tree-sitter parser, chunker
├── rag/          embedder, FAISS, BM25, hybrid, reranker, answerer
├── agent/        tools (search_code, get_definition) + tool-calling workflow
└── evaluation/   eval harness
app.py            Streamlit UI (Ask + Generate tests)
evaluate.py       eval CLI
```
## Limitations & roadmap
v1 is a deliberate vertical slice. Known limits and next steps:
- **Python only**: multi-language via more tree-sitter grammars.
- **ZIP upload only**: GitHub URL ingestion (clone) next.
- **Module-level symbols** (e.g. `app = FastAPI()`) retrieve worse than named
functions; the fix is to index top-level assignments specially.
- **Citation accuracy is a strict string check**: the answer must contain the
filename; LLM-as-judge grading would measure correctness more fairly.
- **General-purpose embeddings**: a code-specific embedding model
(`jina-embeddings-v2-base-code`) would likely improve retrieval.
- Future: code graph (call/import relationships), PR review mode, bug-fix tool,
documentation agent.
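
One possible shape for the top-level assignment item is sketched below. It is a roadmap idea, not shipped code: it uses the standard-library `ast` module rather than tree-sitter, it only handles module-level `name = SomeCall(...)` statements, and `top_level_assignments` is a hypothetical name.

```python
import ast


def top_level_assignments(source: str) -> list[tuple[str, str, int, int]]:
    """Return (target, callee, start_line, end_line) for module-level `x = Call(...)`."""
    found = []
    for node in ast.parse(source).body:  # module-level statements only
        if isinstance(node, ast.Assign):
            targets, value = node.targets, node.value
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets, value = [node.target], node.value
        else:
            continue
        if not isinstance(value, ast.Call):
            continue  # skip plain constants such as `DEBUG = True`
        callee = ast.unparse(value.func)  # e.g. "FastAPI" or "logging.getLogger"
        for target in targets:
            if isinstance(target, ast.Name):
                found.append((target.id, callee, node.lineno, node.end_lineno))
    return found
```

Each tuple could then become its own small chunk, with the target and callee names included in the indexed text, so a question like "where is the FastAPI app created?" has a unit to match.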
## About
A self-directed project focused on code-aware RAG, tool-calling agents, and
measured evaluation, not a generic "chat with your repo" demo.