---
title: Codebase Intelligence Agent
emoji: 🧭
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: "3.11"
pinned: false
---

# Codebase Intelligence Agent

An AI assistant that understands a Python codebase. Upload a repository, ask
questions and get answers with **exact file + line citations**, or have the
**agent generate pytest tests** for any function by reading its real source.

Built around code-aware retrieval (tree-sitter AST chunking, not naive text
splitting) and measured with an evaluation harness.

## Demo

![demo](docs/demo.gif)

- **Ask the codebase** — "where are JWT tokens created?" → grounded answer citing
  `app/core/security.py:38-44`, with the actual code shown as a source.
- **Generate tests** — name a function → a tool-calling agent reads its real
  source and dependencies, then writes pytest tests grounded in that code.

## Evaluation

Measured on a real FastAPI backend (74 files, 369 definitions) with
`temperature=0` so that runs are repeatable:

| Metric | Result |
|---|---|
| File-level retrieval accuracy | **90%** |
| Function-level retrieval accuracy | **75%** |
| Citation accuracy (answer cites the right file) | **75%** |
| Median latency | **~3.3s / query** |

> Honest miss worth noting: "where is the FastAPI app created?" misses because
> the app is instantiated at module level (`app = FastAPI()`) rather than in a
> named function — module-level instantiation is harder to retrieve than named
> symbols. Indexing top-level assignments specially is the fix (roadmap).
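
One way the roadmap fix could look is sketched below. It uses Python's standard `ast` module as a stand-in for tree-sitter, and the function name `top_level_assignments` is hypothetical; it only shows how module-level assignments such as `app = FastAPI()` could be collected with line numbers so they become retrievable like named definitions.

```python
import ast


def top_level_assignments(source: str) -> list[tuple[str, int, int]]:
    """Return (name, start_line, end_line) for simple module-level assignments."""
    found = []
    for node in ast.parse(source).body:  # module body only, not nested scopes
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]  # annotated assignment that has a value
        else:
            continue
        for target in targets:
            if isinstance(target, ast.Name):  # skip tuple/attribute targets
                found.append((target.id, node.lineno, node.end_lineno))
    return found


# Hypothetical module text; ast.parse does not import anything it parses.
SAMPLE = "from fastapi import FastAPI\n\napp = FastAPI()\n"
print(top_level_assignments(SAMPLE))  # [('app', 3, 3)]
```

Each tuple could then be turned into its own chunk, so a question about where the app is created has a named unit to match.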

## How it works

```
ZIP repo
   |
   v
File scanner        skip venv/.git/__pycache__/node_modules, size cap
   |
   v
tree-sitter parser  AST -> functions, classes, methods (+ exact line numbers)
   |
   v
Code chunker        one chunk per definition + file/line metadata
   |
   v
Embeddings -------\
   |               \
   v                v
FAISS (semantic)  BM25 (code-aware tokenizer: matches `jwt.encode`)
        \         /
         v       v
        Hybrid retrieval -> cross-encoder rerank -> top-5
              |
     +--------+--------+
     |                 |
     v                 v
  Grounded Q&A      Test-gen agent
  (file:line cites)  (tool-calling loop)
```
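
The parse and chunk stages in the diagram can be sketched roughly as follows. This is not the project's parser: it swaps tree-sitter for Python's standard `ast` module so the example runs without extra dependencies, and the `Chunk` fields and `chunk_source` name are assumptions made for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    """One definition plus the metadata needed for a file:line citation."""
    path: str
    name: str
    kind: str        # "function", "class", or "method"
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_source(path: str, source: str) -> list[Chunk]:
    """Return one chunk per top-level function, class, and class method."""
    lines = source.splitlines()
    chunks: list[Chunk] = []

    def add(node: ast.AST, name: str, kind: str) -> None:
        # Slice the exact source lines covered by this definition.
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        chunks.append(Chunk(path, name, kind, node.lineno, node.end_lineno, text))

    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            add(node, node.name, "function")
        elif isinstance(node, ast.ClassDef):
            add(node, node.name, "class")
            for item in node.body:
                if isinstance(item, func_types):
                    add(item, f"{node.name}.{item.name}", "method")
    return chunks


# Hypothetical file contents used only to exercise the sketch.
SAMPLE = '''\
import jwt

def create_token(sub):
    return jwt.encode({"sub": sub}, "secret")

class Auth:
    def verify(self, token):
        return jwt.decode(token, "secret", algorithms=["HS256"])
'''

for chunk in chunk_source("example/auth.py", SAMPLE):
    print(f"{chunk.path}:{chunk.start_line}-{chunk.end_line}  {chunk.kind} {chunk.name}")
```

Because every chunk carries its own path and line range, a citation can be built from chunk metadata instead of being reconstructed from the text.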

**Why it's code-aware:** chunking by AST means each chunk is a whole function or
class with its exact line range, so the line range attached to every retrieved
source comes from the parser rather than the model, and retrieval matches real
code units instead of arbitrary text windows. The code-aware BM25 tokenizer
splits on symbols, so exact searches like `jwt.encode` actually match.
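
A minimal sketch of what a symbol-splitting tokenizer might look like is shown below. The regular expressions and the name `code_tokenize` are assumptions, not the project's actual tokenizer; the point is only that `jwt.encode` becomes the tokens `jwt` and `encode` on both the query side and the document side.

```python
import re

# Identifier-like runs: letters, digits, underscores.
_WORD = re.compile(r"[A-Za-z0-9_]+")
# Sub-words inside an identifier: acronyms, capitalised or lowercase words, digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def code_tokenize(text: str) -> list[str]:
    """Split on symbols, then also emit snake_case and camelCase sub-words."""
    tokens: list[str] = []
    for word in _WORD.findall(text):
        tokens.append(word.lower())  # keep the whole identifier
        parts = [p.lower() for p in _PART.findall(word)]
        if len(parts) > 1:           # add sub-words only when there is a split
            tokens.extend(parts)
    return tokens


line = "token = jwt.encode(payload, key)"
print(line.split())                          # whitespace split keeps 'jwt.encode(payload,'
print(code_tokenize(line))                   # symbol split yields 'jwt' and 'encode'
print(code_tokenize("create_access_token"))  # whole name plus its three sub-words
```

With `rank-bm25`, token lists like these would be passed to `BM25Okapi` for the corpus and to `get_scores` for the query, since that library expects pre-tokenized input.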

**The agent:** given a target function, the LLM calls `get_definition` and
`search_code` to read the real source, then writes pytest tests grounded in it.
This is a plain tool-calling loop with no agent framework: the model plans and
acts over several steps rather than answering in one shot.

## Tech stack

Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 ·
cross-encoder reranker · OpenAI (`gpt-4.1-mini`, temperature 0)

## Run locally

```bash
python -m venv .venv
source .venv/bin/activate        # macOS/Linux
# .venv\Scripts\activate         # Windows: run this line instead
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py
```

## Evaluate

```bash
python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json
```

## Project structure

```
src/
├── ingestion/   scanner, tree-sitter parser, chunker
├── rag/         embedder, FAISS, BM25, hybrid, reranker, answerer
├── agent/       tools (search_code, get_definition) + tool-calling workflow
└── evaluation/  eval harness
app.py           Streamlit UI (Ask + Generate tests)
evaluate.py      eval CLI
```

## Limitations & roadmap

v1 is a deliberate vertical slice. Known limits and next steps:

- **Python only** — multi-language via more tree-sitter grammars.
- **ZIP upload only** — GitHub URL ingestion (clone) next.
- **Module-level symbols** (e.g. `app = FastAPI()`) retrieve worse than named
  functions — index top-level assignments specially.
- **Citation accuracy is a strict string check** — the answer must contain the
  filename; LLM-as-judge grading would measure correctness more fairly.
- **General-purpose embeddings** — a code-specific embedding model
  (`jina-embeddings-v2-base-code`) would likely improve retrieval.
- Future: code graph (call/import relationships), PR review mode, bug-fix tool,
  documentation agent.
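
The strict citation check mentioned in the list above can be illustrated with a short sketch. The function names and the choice to match on the file's base name are assumptions; the real harness may compare full paths or use a different result format.

```python
from pathlib import PurePosixPath


def cites_expected_file(answer: str, expected_path: str) -> bool:
    """Strict check: the expected file's name must appear verbatim in the answer."""
    return PurePosixPath(expected_path).name in answer


def citation_accuracy(results: list[tuple[str, str]]) -> float:
    """Fraction of (answer, expected_path) pairs that pass the strict check."""
    if not results:
        return 0.0
    hits = sum(cites_expected_file(answer, path) for answer, path in results)
    return hits / len(results)


# Hypothetical answers: the second is arguably correct but never names the file.
RESULTS = [
    ("Tokens are created in app/core/security.py:38-44.", "app/core/security.py"),
    ("Tokens are created in the security module.", "app/core/security.py"),
]
print(citation_accuracy(RESULTS))  # 0.5
```

The second answer fails even though a human grader might accept it, which is the gap an LLM-as-judge grader would aim to close.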

## About

A self-directed project focused on code-aware RAG, tool-calling agents, and
measured evaluation — not a generic "chat with your repo" demo.