title: Codebase Intelligence Agent
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: app.py
python_version: '3.11'
pinned: false
Codebase Intelligence Agent
An AI assistant that understands a Python codebase. Upload a repository, ask questions and get answers with exact file + line citations, or have the agent generate pytest tests for any function by reading its real source.
Built around code-aware retrieval (tree-sitter AST chunking, not naive text splitting) and measured with an evaluation harness.
Demo
- Ask the codebase: "where are JWT tokens created?" -> grounded answer citing `app/core/security.py:38-44`, with the actual code shown as a source.
- Generate tests: name a function -> a tool-calling agent reads its real source and dependencies, then writes pytest tests grounded in that code.
Evaluation
Measured on a real FastAPI backend (74 files, 369 definitions) with deterministic settings (temperature=0):
| Metric | Result |
|---|---|
| File-level retrieval accuracy | 90% |
| Function-level retrieval accuracy | 75% |
| Citation accuracy (answer cites the right file) | 75% |
| Median latency | ~3.3s / query |
Honest miss worth noting: "where is the FastAPI app created?" misses because the app is instantiated at module level (`app = FastAPI()`) rather than in a named function; module-level instantiation is harder to retrieve than named symbols. Indexing top-level assignments specially is the fix (roadmap).
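The metrics in the table reduce to simple per-question checks. Here is a minimal sketch, assuming hypothetical helper names and a hypothetical result shape; the actual `evaluate.py` and the `testset.json` schema are not shown in this README and may differ. The citation check follows the strict string check noted under Limitations, and the sample paths and latencies are made up for illustration.

```python
from statistics import median

def file_hit(retrieved_files, expected_file):
    """True if the expected file appears among the retrieved chunks' files."""
    return expected_file in retrieved_files

def citation_hit(answer, expected_file):
    """Strict string check: the answer text must contain the expected filename."""
    return expected_file in answer

def summarize(results):
    """Aggregate per-question results into headline metrics."""
    n = len(results)
    return {
        "file_accuracy": sum(r["file_hit"] for r in results) / n,
        "citation_accuracy": sum(r["citation_hit"] for r in results) / n,
        "median_latency_s": median(r["latency_s"] for r in results),
    }

# Two illustrative questions (hypothetical paths and timings, not real measurements).
results = [
    {
        "file_hit": file_hit(["app/core/security.py", "app/main.py"], "app/core/security.py"),
        "citation_hit": citation_hit("Tokens are created in app/core/security.py:38-44.",
                                     "app/core/security.py"),
        "latency_s": 3.0,
    },
    {
        "file_hit": file_hit(["app/api/deps.py"], "app/main.py"),
        "citation_hit": citation_hit("I could not find where the app is created.", "app/main.py"),
        "latency_s": 4.0,
    },
]
summary = summarize(results)
```

A check this strict counts an otherwise correct answer as a miss whenever the filename is absent from the answer text.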
How it works
    ZIP repo
       |
       v
    File scanner         skip venv/.git/__pycache__/node_modules, size cap
       |
       v
    tree-sitter parser   AST -> functions, classes, methods (+ exact line numbers)
       |
       v
    Code chunker         one chunk per definition + file/line metadata
       |
       v
    Embeddings -------\
       |               \
       v                v
    FAISS (semantic)    BM25 (code-aware tokenizer: matches `jwt.encode`)
        \              /
         v            v
    Hybrid retrieval -> cross-encoder rerank -> top-5
                |
       +--------+--------+
       |                 |
       v                 v
    Grounded Q&A         Test-gen agent
    (file:line cites)    (tool-calling loop)
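The diagram does not say how the FAISS and BM25 result lists are merged in the hybrid retrieval step. One common option is reciprocal rank fusion (RRF). The sketch below assumes it purely for illustration: the function name, the chunk IDs, and the constant `k = 60` are not taken from this project.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each list contributes 1 / (k + rank) per item (rank starts at 1), so a
    chunk ranked well by both retrievers beats one ranked well by only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical chunk IDs returned by each retriever for the same query.
semantic = ["security.py:create_access_token", "deps.py:get_current_user", "users.py:login"]
lexical = ["security.py:create_access_token", "users.py:login", "config.py:Settings"]

fused = reciprocal_rank_fusion([semantic, lexical])
```

In a pipeline like the one above, the fused list would then go to the cross-encoder reranker, which picks the final top-5.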
Why it's code-aware: chunking by AST means each chunk is a whole function or
class with its exact line range, so citations are precise and cannot be hallucinated,
and retrieval matches real code units instead of arbitrary text windows. The
code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode`
actually match.
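Both ideas fit in a short sketch. It uses Python's standard-library `ast` module as a stand-in for tree-sitter so that it runs without extra dependencies. The `Chunk` fields, the function names, and the tokenizer regex are assumptions for illustration, not the project's actual code.

```python
import ast
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    name: str
    kind: str          # "function" or "class"
    file: str
    start_line: int    # 1-based, inclusive
    end_line: int      # 1-based, inclusive
    text: str

def chunk_source(source, file):
    """Return one chunk per function/class definition, with its exact line range."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append(Chunk(node.name, kind, file, node.lineno, node.end_lineno, text))
    return chunks

def code_tokens(text):
    """Split on every non-alphanumeric symbol, so `jwt.encode(...)` yields `jwt` and `encode`."""
    return re.findall(r"[a-z0-9]+", text.lower())

SAMPLE = '''import jwt

def create_access_token(data, secret):
    payload = dict(data)
    return jwt.encode(payload, secret, algorithm="HS256")

class TokenError(Exception):
    pass
'''

chunks = chunk_source(SAMPLE, "app/core/security.py")
first = chunks[0]
citation = f"{first.file}:{first.start_line}-{first.end_line}"

# Whitespace splitting leaves "jwt.encode(payload," as one token, so the query misses.
naive_match = "jwt.encode" in first.text.split()
# Symbol-aware splitting finds both query tokens inside the chunk.
aware_match = set(code_tokens("jwt.encode")) <= set(code_tokens(first.text))
```

Because the citation string is built from parser output rather than model output, the line range always refers to a real definition in the file.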
The agent: given a target function, the LLM calls `get_definition` and
`search_code` to read the real source, then writes pytest tests grounded in it.
This is a plain tool-calling loop (no framework): the model plans and acts rather
than answering in one shot.
Tech stack
Python · Streamlit · tree-sitter · sentence-transformers · FAISS · rank-bm25 ·
cross-encoder reranker · OpenAI (gpt-4.1-mini, temperature 0)
Run locally
python -m venv .venv && .venv\Scripts\activate # Windows
pip install -r requirements.txt
echo OPENAI_API_KEY=sk-your-key > .env
streamlit run app.py
Evaluate
python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json
Project structure
    src/
    ├── ingestion/     scanner, tree-sitter parser, chunker
    ├── rag/           embedder, FAISS, BM25, hybrid, reranker, answerer
    ├── agent/         tools (search_code, get_definition) + tool-calling workflow
    └── evaluation/    eval harness
    app.py             Streamlit UI (Ask + Generate tests)
    evaluate.py        eval CLI
Limitations & roadmap
v1 is a deliberate vertical slice. Known limits and next steps:
- Python only -> multi-language via more tree-sitter grammars.
- ZIP upload only -> GitHub URL ingestion (clone) next.
- Module-level symbols (e.g. `app = FastAPI()`) retrieve worse than named functions -> index top-level assignments specially.
- Citation accuracy is a strict string check: the answer must contain the filename; LLM-as-judge grading would measure correctness more fairly.
- General-purpose embeddings -> a code-specific embedding model (`jina-embeddings-v2-base-code`) would likely improve retrieval.
- Future: code graph (call/import relationships), PR review mode, bug-fix tool, documentation agent.
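As a sketch of what indexing top-level assignments could look like, the example below collects module-level assignments with their line ranges so they can be stored as chunks alongside functions and classes. It uses the standard-library `ast` module rather than tree-sitter. The function name, the record fields, and the `app/main.py` path are hypothetical, and this is only one possible approach to the roadmap item.

```python
import ast

def top_level_assignments(source, file):
    """Return one record per module-level assignment, with its line range."""
    records = []
    for node in ast.parse(source).body:   # direct children of the module only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue
        names = [t.id for t in targets if isinstance(t, ast.Name)]
        if names:
            records.append({
                "names": names,
                "file": file,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return records

SAMPLE = (
    "from fastapi import FastAPI\n"
    "\n"
    "app = FastAPI()\n"
    "\n"
    "def health():\n"
    "    return {'ok': True}\n"
)

records = top_level_assignments(SAMPLE, "app/main.py")
```

Each record carries the same file and line metadata as a function chunk, so a question like "where is the FastAPI app created?" would have a dedicated unit to retrieve.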
About
A self-directed project focused on code-aware RAG, tool-calling agents, and measured evaluation, not a generic "chat with your repo" demo.
