AishaSurve

Codebase Intelligence Agent: code-aware RAG + test-gen agent + eval

8e72e1f 3 days ago

metadata

title: Codebase Intelligence Agent
emoji: 🧭
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: '3.11'
pinned: false

Codebase Intelligence Agent

An AI assistant that understands a Python codebase. Upload a repository as a ZIP, then either ask questions and get answers with exact file and line citations, or have the agent generate pytest tests for any function by reading its real source.

Built around code-aware retrieval (tree-sitter AST chunking, not naive text splitting) and measured with an evaluation harness.

Demo

Ask the codebase — "where are JWT tokens created?" → grounded answer citing app/core/security.py:38-44, with the actual code shown as a source.
Generate tests — name a function → a tool-calling agent reads its real source and dependencies, then writes pytest tests grounded in that code.

Evaluation

Measured on a real FastAPI backend (74 files, 369 definitions), with temperature=0 so that runs are as repeatable as possible:

Metric	Result
File-level retrieval accuracy	90%
Function-level retrieval accuracy	75%
Citation accuracy (answer cites the right file)	75%
Median latency	~3.3s / query
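
As a rough illustration of how numbers like these could be computed, here is a minimal sketch. It is not the code in evaluate.py: the record keys (retrieved_files, expected_file, answer, latency_s) and the function names are hypothetical, chosen only for this example. The citation check mirrors the strict string check described under Limitations.

```python
from statistics import median


def file_hit(retrieved_files, expected_file):
    """True if the expected file is among the files of the retrieved chunks."""
    return expected_file in retrieved_files


def cites_file(answer, expected_file):
    """Strict string check: the answer text must contain the file name."""
    filename = expected_file.replace("\\", "/").rsplit("/", 1)[-1]
    return filename in answer


def summarize(results):
    """Aggregate per-question records (hypothetical keys) into headline metrics."""
    n = len(results)
    return {
        "file_accuracy": sum(
            file_hit(r["retrieved_files"], r["expected_file"]) for r in results
        ) / n,
        "citation_accuracy": sum(
            cites_file(r["answer"], r["expected_file"]) for r in results
        ) / n,
        "median_latency_s": median(r["latency_s"] for r in results),
    }
```

A string check like cites_file is cheap and repeatable, but it marks a correct answer as wrong whenever the file name is missing from the answer text.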

One miss worth noting: the query "where is the FastAPI app created?" fails because the app is instantiated at module level (app = FastAPI()) rather than in a named function, and module-level instantiation is harder to retrieve than named symbols. The planned fix is to index top-level assignments specially (see roadmap).
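
One way the roadmap fix might look is sketched below. It uses Python's standard-library ast module instead of tree-sitter so that it runs without extra dependencies. The function name top_level_assignments and the record layout are assumptions made for this example, not part of the project.

```python
import ast


def top_level_assignments(source, path):
    """Return one record per module-level assignment, with its line range.

    Only direct children of the module are inspected, so assignments inside
    functions or classes are ignored.
    """
    records = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue
        # Keep simple names only (skips tuple unpacking and attribute targets).
        names = [t.id for t in targets if isinstance(t, ast.Name)]
        if names:
            records.append({
                "file": path,
                "names": names,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    return records
```

Records like these could be embedded and indexed next to the per-definition chunks, so that a question about where the app is created has something concrete to match.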

How it works

ZIP repo
   |
   v
File scanner        skip venv/.git/__pycache__/node_modules, size cap
   |
   v
tree-sitter parser  AST -> functions, classes, methods (+ exact line numbers)
   |
   v
Code chunker        one chunk per definition + file/line metadata
   |
   v
Embeddings -------\
   |               \
   v                v
FAISS (semantic)  BM25 (code-aware tokenizer: matches `jwt.encode`)
        \         /
         v       v
        Hybrid retrieval -> cross-encoder rerank -> top-5
              |
     +--------+--------+
     |                 |
     v                 v
  Grounded Q&A      Test-gen agent
  (file:line cites)  (tool-calling loop)
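
The parser and chunker stages of the diagram can be sketched as follows. The project uses tree-sitter. This sketch swaps in the standard-library ast module to stay dependency-free, and the Chunk fields and chunk_definitions name are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    file: str
    name: str        # qualified name, e.g. "Service.create"
    kind: str        # "function", "class", or "method"
    start_line: int  # 1-based line of the def/class keyword (decorators excluded)
    end_line: int
    code: str


def chunk_definitions(source, path):
    """One chunk per top-level function, class, and method.

    Nested functions and definitions inside if/try blocks are not chunked
    separately in this sketch.
    """
    lines = source.splitlines()
    chunks = []

    def visit(node, parent_class=None):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if parent_class else "function"
                name = f"{parent_class}.{child.name}" if parent_class else child.name
            elif isinstance(child, ast.ClassDef):
                kind, name = "class", child.name
            else:
                continue
            code = "\n".join(lines[child.lineno - 1:child.end_lineno])
            chunks.append(
                Chunk(path, name, kind, child.lineno, child.end_lineno, code)
            )
            if isinstance(child, ast.ClassDef):
                visit(child, parent_class=child.name)

    visit(ast.parse(source))
    return chunks
```

Because every chunk carries its file and line range, a citation such as app/core/security.py:38-44 can be built from chunk metadata.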

Why it's code-aware: chunking by AST means each chunk is a whole function or class with its exact line range. Citations can therefore point at the precise lines of a real definition, and retrieval matches real code units instead of arbitrary text windows. The code-aware BM25 tokenizer splits on symbols, so exact searches like jwt.encode actually match.
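
To make the tokenizer idea concrete, here is a minimal sketch of a code-aware tokenizer. The exact rules used in the project are not documented here, so the ones below are assumptions: split on non-identifier symbols, keep each full identifier, and add its snake_case and camelCase parts.

```python
import re

# An identifier: letters, digits, underscores, not starting with a digit.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Parts of a camelCase word: acronym runs, capitalized words, lowercase runs, digits.
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def code_tokens(text):
    """Tokenize code or a query into lowercase identifier tokens.

    Each identifier is kept whole and also split into its snake_case and
    camelCase parts, so both exact and partial lookups can match.
    """
    tokens = []
    for ident in _IDENT.findall(text):
        whole = ident.lower()
        tokens.append(whole)
        for part in ident.split("_"):
            for piece in _CAMEL.findall(part):
                if piece.lower() != whole:  # avoid repeating the whole identifier
                    tokens.append(piece.lower())
    return tokens
```

With these rules, code_tokens("jwt.encode") yields ["jwt", "encode"], so a query for jwt.encode shares both tokens with a chunk that calls it. Token lists of this shape are what rank-bm25's BM25Okapi takes as its corpus and query input.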

The agent: given a target function, the LLM calls get_definition and search_code to read the real source, then writes pytest tests grounded in it. This is a plain tool-calling loop with no agent framework: the model plans and acts over several steps rather than answering in one shot.

Tech stack

Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 · cross-encoder reranker · OpenAI (gpt-4.1-mini, temperature 0)

Run locally

python -m venv .venv
.venv\Scripts\activate                 # Windows
# source .venv/bin/activate            # macOS/Linux
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py

Evaluate

python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json

Project structure

src/
├── ingestion/   scanner, tree-sitter parser, chunker
├── rag/         embedder, FAISS, BM25, hybrid, reranker, answerer
├── agent/       tools (search_code, get_definition) + tool-calling workflow
└── evaluation/  eval harness
app.py           Streamlit UI (Ask + Generate tests)
evaluate.py      eval CLI

Limitations & roadmap

v1 is a deliberate vertical slice. Known limits and next steps:

Python only — multi-language via more tree-sitter grammars.
ZIP upload only — GitHub URL ingestion (clone) next.
Module-level symbols (e.g. app = FastAPI()) retrieve worse than named functions — index top-level assignments specially.
Citation accuracy is a strict string check — the answer must contain the filename; LLM-as-judge grading would measure correctness more fairly.
General-purpose embeddings — a code-specific embedding model (jina-embeddings-v2-base-code) would likely improve retrieval.
Future: code graph (call/import relationships), PR review mode, bug-fix tool, documentation agent.

About

A self-directed project focused on code-aware RAG, tool-calling agents, and measured evaluation — not a generic "chat with your repo" demo.