Spaces:
Running
Running
File size: 4,548 Bytes
8e72e1f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 | ---
title: Codebase Intelligence Agent
emoji: π§
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: "3.11"
pinned: false
---
# Codebase Intelligence Agent
An AI assistant that understands a Python codebase. Upload a repository, ask
questions and get answers with **exact file + line citations**, or have the
**agent generate pytest tests** for any function by reading its real source.
Built around code-aware retrieval (tree-sitter AST chunking, not naive text
splitting) and measured with an evaluation harness.
## Demo

- **Ask the codebase** β "where are JWT tokens created?" β grounded answer citing
`app/core/security.py:38-44`, with the actual code shown as a source.
- **Generate tests** β name a function β a tool-calling agent reads its real
source and dependencies, then writes pytest tests grounded in that code.
## Evaluation
Measured on a real FastAPI backend (74 files, 369 definitions), deterministic
(`temperature=0`):
| Metric | Result |
|---|---|
| File-level retrieval accuracy | **90%** |
| Function-level retrieval accuracy | **75%** |
| Citation accuracy (answer cites the right file) | **75%** |
| Median latency | **~3.3s / query** |
> Honest miss worth noting: "where is the FastAPI app created?" misses because
> the app is instantiated at module level (`app = FastAPI()`) rather than in a
> named function β module-level instantiation is harder to retrieve than named
> symbols. Indexing top-level assignments specially is the fix (roadmap).
## How it works
```
ZIP repo
|
v
File scanner skip venv/.git/__pycache__/node_modules, size cap
|
v
tree-sitter parser AST -> functions, classes, methods (+ exact line numbers)
|
v
Code chunker one chunk per definition + file/line metadata
|
v
Embeddings -------\
| \
v v
FAISS (semantic) BM25 (code-aware tokenizer: matches `jwt.encode`)
\ /
v v
Hybrid retrieval -> cross-encoder rerank -> top-5
|
+--------+--------+
| |
v v
Grounded Q&A Test-gen agent
(file:line cites) (tool-calling loop)
```
**Why it's code-aware:** chunking by AST means each chunk is a whole function or
class with its exact line range β so citations are precise and un-hallucinatable,
and retrieval matches real code units instead of arbitrary text windows. The
code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode`
actually match.
**The agent:** given a target function, the LLM calls `get_definition` and
`search_code` to read the real source, then writes pytest tests grounded in it β
a tool-calling loop (no framework), the model planning and acting rather than
answering in one shot.
## Tech stack
Python Β· Streamlit Β· tree-sitter Β· sentence-transformers Β· FAISS Β· rank-bm25 Β·
cross-encoder reranker Β· OpenAI (`gpt-4.1-mini`, temperature 0)
## Run locally
```bash
python -m venv .venv && .venv\Scripts\activate # Windows
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py
```
## Evaluate
```bash
python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json
```
## Project structure
```
src/
βββ ingestion/ scanner, tree-sitter parser, chunker
βββ rag/ embedder, FAISS, BM25, hybrid, reranker, answerer
βββ agent/ tools (search_code, get_definition) + tool-calling workflow
βββ evaluation/ eval harness
app.py Streamlit UI (Ask + Generate tests)
evaluate.py eval CLI
```
## Limitations & roadmap
v1 is a deliberate vertical slice. Known limits and next steps:
- **Python only** β multi-language via more tree-sitter grammars.
- **ZIP upload only** β GitHub URL ingestion (clone) next.
- **Module-level symbols** (e.g. `app = FastAPI()`) retrieve worse than named
functions β index top-level assignments specially.
- **Citation accuracy is a strict string check** β the answer must contain the
filename; LLM-as-judge grading would measure correctness more fairly.
- **General-purpose embeddings** β a code-specific embedding model
(`jina-embeddings-v2-base-code`) would likely improve retrieval.
- Future: code graph (call/import relationships), PR review mode, bug-fix tool,
documentation agent.
## About
A self-directed project focused on code-aware RAG, tool-calling agents, and
measured evaluation β not a generic "chat with your repo" demo. |