---
title: Codebase Intelligence Agent
emoji: 🧭
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: "3.11"
pinned: false
---

# Codebase Intelligence Agent

An AI assistant that understands a Python codebase. Upload a repository, ask
questions and get answers with **exact file + line citations**, or have the
**agent generate pytest tests** for any function by reading its real source.

Built around code-aware retrieval (tree-sitter AST chunking, not naive text
splitting) and measured with an evaluation harness.

## Demo

![demo](docs/demo.gif)

- **Ask the codebase**: "where are JWT tokens created?" → grounded answer citing
  `app/core/security.py:38-44`, with the actual code shown as a source.
- **Generate tests**: name a function → a tool-calling agent reads its real
  source and dependencies, then writes pytest tests grounded in that code.

## Evaluation

Measured on a real FastAPI backend (74 files, 369 definitions), deterministic
(`temperature=0`):

| Metric | Result |
|---|---|
| File-level retrieval accuracy | **90%** |
| Function-level retrieval accuracy | **75%** |
| Citation accuracy (answer cites the right file) | **75%** |
| Median latency | **~3.3s / query** |

> Honest miss worth noting: "where is the FastAPI app created?" misses because
> the app is instantiated at module level (`app = FastAPI()`) rather than in a
> named function. Module-level instantiation is harder to retrieve than named
> symbols. Indexing top-level assignments specially is the fix (roadmap).
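
The metrics in the table could be computed along these lines. This is a minimal sketch, not the project's harness: the `EvalCase` record layout and the sample cases (including every path other than `app/core/security.py`) are assumptions made up for illustration. The citation check follows the strict string rule described under Limitations.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EvalCase:
    """One evaluated question (hypothetical record layout)."""
    expected_file: str          # file that should be retrieved and cited
    retrieved_files: list[str]  # files of the top-k retrieved chunks
    answer: str                 # generated answer text
    latency_s: float            # wall-clock seconds for the query


def file_level_accuracy(cases: list[EvalCase]) -> float:
    """Share of cases whose expected file appears among the retrieved chunks."""
    return sum(c.expected_file in c.retrieved_files for c in cases) / len(cases)


def citation_accuracy(cases: list[EvalCase]) -> float:
    """Strict string check: the answer must contain the expected file path."""
    return sum(c.expected_file in c.answer for c in cases) / len(cases)


def median_latency(cases: list[EvalCase]) -> float:
    """Median per-query latency in seconds."""
    return median(c.latency_s for c in cases)


# Made-up cases, only to show the shape of the computation.
cases = [
    EvalCase("app/core/security.py", ["app/core/security.py", "pkg/deps.py"],
             "Tokens are created in app/core/security.py:38-44.", 2.9),
    EvalCase("pkg/main.py", ["pkg/routes.py"], "No matching code was found.", 3.3),
    EvalCase("pkg/models.py", ["pkg/models.py"], "See the User model.", 4.1),
]
print(file_level_accuracy(cases))  # 2 of 3 expected files were retrieved
print(citation_accuracy(cases))    # 1 of 3 answers names the expected file
print(median_latency(cases))       # middle latency of the three cases
```

The third case shows why the string check is strict: retrieval succeeded, but the answer never names the file, so it counts as a citation miss.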

## How it works

```
ZIP repo
   |
   v
File scanner        skip venv/.git/__pycache__/node_modules, size cap
   |
   v
tree-sitter parser  AST -> functions, classes, methods (+ exact line numbers)
   |
   v
Code chunker        one chunk per definition + file/line metadata
   |
   v
Embeddings -------\
   |               \
   v                v
FAISS (semantic)  BM25 (code-aware tokenizer: matches `jwt.encode`)
        \         /
         v       v
        Hybrid retrieval -> cross-encoder rerank -> top-5
              |
     +--------+--------+
     |                 |
     v                 v
  Grounded Q&A      Test-gen agent
  (file:line cites)  (tool-calling loop)
```
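
The diagram does not say how the FAISS and BM25 rankings are merged. One common option is reciprocal rank fusion, sketched below as an assumption rather than the project's actual method. The function name, the chunk IDs, and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; each list adds 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic = ["chunk_a", "chunk_b", "chunk_c"]  # hypothetical FAISS order
lexical = ["chunk_b", "chunk_d", "chunk_a"]   # hypothetical BM25 order
print(reciprocal_rank_fusion([semantic, lexical]))
# ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Rank-based fusion avoids having to normalize cosine similarities against BM25 scores, which live on different scales. The fused list would then go to the cross-encoder for reranking.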

**Why it's code-aware:** chunking by AST means each chunk is a whole function or
class with its exact line range, so citations are precise and un-hallucinatable,
and retrieval matches real code units instead of arbitrary text windows. The
code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode`
actually match.
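
Both ideas can be sketched in a few lines. To stay dependency-free, this sketch uses Python's standard `ast` module as a stand-in for tree-sitter, so it is not the project's parser. The `Chunk` fields, `chunk_source`, `code_tokens`, and the sample source are hypothetical names chosen for illustration.

```python
import ast
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    kind: str        # "function" (also used for methods) or "class"
    path: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_source(source: str, path: str) -> list[Chunk]:
    """Return one chunk per function, method, or class definition."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(node.name, kind, path, node.lineno, node.end_lineno, text))
    return sorted(chunks, key=lambda c: c.start_line)


def code_tokens(text: str) -> list[str]:
    """Split on non-identifier characters, then also on underscores."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        ident = ident.lower()
        tokens.append(ident)
        parts = [p for p in ident.split("_") if p]
        if len(parts) > 1:  # keep snake_case pieces searchable on their own
            tokens.extend(parts)
    return tokens


SAMPLE = '''import jwt

def create_token(payload):
    return jwt.encode(payload, "secret")

class Signer:
    def sign(self, data):
        return data
'''

for chunk in chunk_source(SAMPLE, "sample.py"):
    print(f"{chunk.path}:{chunk.start_line}-{chunk.end_line} {chunk.kind} {chunk.name}")
print(code_tokens("jwt.encode(payload, SECRET_KEY)"))
```

Because the line range travels with the chunk as metadata, a citation can be assembled from `path`, `start_line`, and `end_line` instead of being generated by the model. Splitting `jwt.encode` into `jwt` and `encode` lets a lexical query match either the dotted call or its parts.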

**The agent:** given a target function, the LLM calls `get_definition` and
`search_code` to read the real source, then writes pytest tests grounded in it.
This is a plain tool-calling loop (no framework): the model plans and acts
rather than answering in one shot.

## Tech stack

Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 ·
cross-encoder reranker · OpenAI (`gpt-4.1-mini`, temperature 0)

## Run locally

```bash
python -m venv .venv && .venv\Scripts\activate     # Windows
# macOS/Linux: python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py
```

## Evaluate

```bash
python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json
```

## Project structure

```
src/
├── ingestion/   scanner, tree-sitter parser, chunker
├── rag/         embedder, FAISS, BM25, hybrid, reranker, answerer
├── agent/       tools (search_code, get_definition) + tool-calling workflow
└── evaluation/  eval harness
app.py           Streamlit UI (Ask + Generate tests)
evaluate.py      eval CLI
```

## Limitations & roadmap

v1 is a deliberate vertical slice. Known limits and next steps:

- **Python only**: multi-language support via more tree-sitter grammars.
- **ZIP upload only**: GitHub URL ingestion (clone) is next.
- **Module-level symbols** (e.g. `app = FastAPI()`) retrieve worse than named
  functions; the fix is to index top-level assignments specially.
- **Citation accuracy is a strict string check**: the answer must contain the
  filename; LLM-as-judge grading would measure correctness more fairly.
- **General-purpose embeddings**: a code-specific embedding model
  (`jina-embeddings-v2-base-code`) would likely improve retrieval.
- Future: code graph (call/import relationships), PR review mode, bug-fix tool,
  documentation agent.
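
As a rough illustration of the module-level roadmap item, top-level assignments could be collected as extra index entries. This is a sketch of one possible approach, not planned code: it uses the standard `ast` module rather than tree-sitter, and `module_level_assignments`, the entry layout, and the sample file name are hypothetical.

```python
import ast


def module_level_assignments(source: str, path: str) -> list[dict]:
    """Collect top-level `name = ...` statements with their line ranges."""
    lines = source.splitlines()
    entries = []
    for node in ast.parse(source).body:  # direct children of the module only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue
        names = [t.id for t in targets if isinstance(t, ast.Name)]
        if names:
            entries.append({
                "names": names,
                "path": path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
            })
    return entries


SAMPLE = '''from fastapi import FastAPI

app = FastAPI()

def health():
    return "ok"
'''

# ast.parse only parses the text, so FastAPI does not need to be installed.
print(module_level_assignments(SAMPLE, "main.py"))
```

Each entry carries the same file and line metadata as a function chunk, so a question like "where is the FastAPI app created?" would have a dedicated unit to retrieve and cite.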

## About

A self-directed project focused on code-aware RAG, tool-calling agents, and
measured evaluation, not a generic "chat with your repo" demo.