metadata
title: Codebase Intelligence Agent
emoji: 🧭
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: '3.11'
pinned: false

Codebase Intelligence Agent

An AI assistant that understands a Python codebase. Upload a repository, ask questions and get answers with exact file + line citations, or have the agent generate pytest tests for any function by reading its real source.

Built around code-aware retrieval (tree-sitter AST chunking, not naive text splitting) and measured with an evaluation harness.

Demo

  • Ask the codebase: "where are JWT tokens created?" → grounded answer citing app/core/security.py:38-44, with the actual code shown as a source.
  • Generate tests: name a function → a tool-calling agent reads its real source and dependencies, then writes pytest tests grounded in that code.

Evaluation

Measured on a real FastAPI backend (74 files, 369 definitions), deterministic (temperature=0):

| Metric | Result |
| --- | --- |
| File-level retrieval accuracy | 90% |
| Function-level retrieval accuracy | 75% |
| Citation accuracy (answer cites the right file) | 75% |
| Median latency | ~3.3 s / query |

Honest miss worth noting: "where is the FastAPI app created?" misses because the app is instantiated at module level (app = FastAPI()) rather than in a named function. Module-level instantiation is harder to retrieve than named symbols. Indexing top-level assignments specially is the fix (roadmap).
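
As a rough illustration of how the metrics above could be computed, here is a minimal sketch. The `EvalCase` fields and function names are hypothetical and are not the schema of data/eval/testset.json or the code in evaluate.py; the citation check mirrors the strict filename-in-answer rule described under Limitations.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EvalCase:
    """One evaluated question. Field names are hypothetical, not the project's schema."""
    question: str
    expected_file: str
    retrieved_files: list[str]   # files of the top-k retrieved chunks
    answer: str                  # final answer text from the LLM
    latency_s: float


def file_hit(case: EvalCase) -> bool:
    # File-level retrieval: the expected file appears among the retrieved chunks.
    return case.expected_file in case.retrieved_files


def cites_file(case: EvalCase) -> bool:
    # Strict string check: the answer must literally contain the filename.
    filename = case.expected_file.rsplit("/", 1)[-1]
    return filename in case.answer


def summarize(cases: list[EvalCase]) -> dict[str, float]:
    n = len(cases)
    return {
        "file_accuracy": sum(file_hit(c) for c in cases) / n,
        "citation_accuracy": sum(cites_file(c) for c in cases) / n,
        "median_latency_s": median(c.latency_s for c in cases),
    }
```

A correct answer that paraphrases the location without naming the file counts as a miss under `cites_file`, which is why the strict check can understate correctness.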

How it works

ZIP repo
   |
   v
File scanner        skip venv/.git/__pycache__/node_modules, size cap
   |
   v
tree-sitter parser  AST -> functions, classes, methods (+ exact line numbers)
   |
   v
Code chunker        one chunk per definition + file/line metadata
   |
   v
Embeddings -------\
   |               \
   v                v
FAISS (semantic)  BM25 (code-aware tokenizer: matches `jwt.encode`)
        \         /
         v       v
        Hybrid retrieval -> cross-encoder rerank -> top-5
              |
     +--------+--------+
     |                 |
     v                 v
  Grounded Q&A      Test-gen agent
  (file:line cites)  (tool-calling loop)
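
The diagram does not say how the FAISS and BM25 rankings are merged in the hybrid retrieval step. One common option is reciprocal rank fusion, sketched below as an assumption rather than as what this project does; the function name and the `k = 60` constant are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk id, so ids that rank
    well in both the semantic and the lexical list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]
```

Rank-based fusion avoids having to normalize cosine similarities against BM25 scores, which live on different scales. The fused list would then go to the cross-encoder reranker.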

Why it's code-aware: chunking by AST means each chunk is a whole function or class with its exact line range, so citations are precise and un-hallucinatable, and retrieval matches real code units instead of arbitrary text windows. The code-aware BM25 tokenizer splits on symbols, so exact searches like jwt.encode actually match.
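
To make the two ideas concrete, here is a minimal sketch. It swaps in Python's standard-library `ast` module for tree-sitter so that it runs with no extra dependencies; the `Chunk` fields, function names, and the snake_case sub-token rule are assumptions, not the project's actual code.

```python
import ast
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    kind: str          # "function" or "class"
    file: str
    start_line: int
    end_line: int
    source: str


def chunk_source(source: str, file: str) -> list[Chunk]:
    """Emit one chunk per function, method, or class, with its exact line range."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        body = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        chunks.append(Chunk(node.name, kind, file, node.lineno, node.end_lineno, body))
    return sorted(chunks, key=lambda c: c.start_line)


def code_tokens(text: str) -> list[str]:
    """Split on symbols so `jwt.encode` yields both `jwt` and `encode`."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        tokens.append(ident.lower())
        # Assumed extra: also index snake_case parts of an identifier.
        parts = [p for p in ident.lower().split("_") if p]
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens
```

Because each chunk carries `file`, `start_line`, and `end_line` from the parser, a citation such as app/core/security.py:38-44 can be built from metadata instead of being generated by the model. The token lists from `code_tokens` are the kind of input a BM25 index such as rank-bm25 expects.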

The agent: given a target function, the LLM calls get_definition and search_code to read the real source, then writes pytest tests grounded in it. This is a plain tool-calling loop (no framework): the model plans and acts over several steps rather than answering in one shot.

Tech stack

Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 · cross-encoder reranker · OpenAI (gpt-4.1-mini, temperature 0)

Run locally

python -m venv .venv && .venv\Scripts\activate     # Windows
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py

Evaluate

python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json

Project structure

src/
├── ingestion/   scanner, tree-sitter parser, chunker
├── rag/         embedder, FAISS, BM25, hybrid, reranker, answerer
├── agent/       tools (search_code, get_definition) + tool-calling workflow
└── evaluation/  eval harness
app.py           Streamlit UI (Ask + Generate tests)
evaluate.py      eval CLI

Limitations & roadmap

v1 is a deliberate vertical slice. Known limits and next steps:

  • Python only: multi-language via more tree-sitter grammars.
  • ZIP upload only: GitHub URL ingestion (clone) next.
  • Module-level symbols (e.g. app = FastAPI()) retrieve worse than named functions: index top-level assignments specially.
  • Citation accuracy is a strict string check: the answer must contain the filename; LLM-as-judge grading would measure correctness more fairly.
  • General-purpose embeddings: a code-specific embedding model (jina-embeddings-v2-base-code) would likely improve retrieval.
  • Future: code graph (call/import relationships), PR review mode, bug-fix tool, documentation agent.
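
For the module-level symbols item, one possible shape of the fix is to emit an extra chunk per top-level assignment. This is only a sketch of the roadmap idea, not implemented behavior; it uses the standard-library `ast` module instead of tree-sitter, and the function name and dictionary keys are hypothetical.

```python
import ast


def module_level_assignments(source: str, file: str) -> list[dict]:
    """Collect top-level `name = value` statements as retrievable chunks."""
    chunks = []
    for node in ast.parse(source).body:      # direct children of the module only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue
        names = [t.id for t in targets if isinstance(t, ast.Name)]
        if not names:
            continue
        chunks.append({
            "names": names,
            "file": file,
            "start_line": node.lineno,
            "end_line": node.end_lineno,
            "source": ast.get_source_segment(source, node),
        })
    return chunks
```

With a chunk like this in the index, a question such as "where is the FastAPI app created?" would have a named unit (app) to match against, and assignments inside functions are left out because only the module body is scanned.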

About

A self-directed project focused on code-aware RAG, tool-calling agents, and measured evaluation, not a generic "chat with your repo" demo.