--- title: Codebase Intelligence Agent emoji: ๐Ÿงญ colorFrom: indigo colorTo: blue sdk: streamlit app_file: app.py python_version: "3.11" pinned: false --- # Codebase Intelligence Agent An AI assistant that understands a Python codebase. Upload a repository, ask questions and get answers with **exact file + line citations**, or have the **agent generate pytest tests** for any function by reading its real source. Built around code-aware retrieval (tree-sitter AST chunking, not naive text splitting) and measured with an evaluation harness. ## Demo ![demo](docs/demo.gif) - **Ask the codebase** โ€” "where are JWT tokens created?" โ†’ grounded answer citing `app/core/security.py:38-44`, with the actual code shown as a source. - **Generate tests** โ€” name a function โ†’ a tool-calling agent reads its real source and dependencies, then writes pytest tests grounded in that code. ## Evaluation Measured on a real FastAPI backend (74 files, 369 definitions), deterministic (`temperature=0`): | Metric | Result | |---|---| | File-level retrieval accuracy | **90%** | | Function-level retrieval accuracy | **75%** | | Citation accuracy (answer cites the right file) | **75%** | | Median latency | **~3.3s / query** | > Honest miss worth noting: "where is the FastAPI app created?" misses because > the app is instantiated at module level (`app = FastAPI()`) rather than in a > named function โ€” module-level instantiation is harder to retrieve than named > symbols. Indexing top-level assignments specially is the fix (roadmap). ## How it works ``` ZIP repo | v File scanner skip venv/.git/__pycache__/node_modules, size cap | v tree-sitter parser AST -> functions, classes, methods (+ exact line numbers) | v Code chunker one chunk per definition + file/line metadata | v Embeddings -------\ | \ v v FAISS (semantic) BM25 (code-aware tokenizer: matches `jwt.encode`) \ / v v Hybrid retrieval -> cross-encoder rerank -> top-5 | +--------+--------+ | | v v Grounded Q&A Test-gen agent (file:line cites) (tool-calling loop) ``` **Why it's code-aware:** chunking by AST means each chunk is a whole function or class with its exact line range โ€” so citations are precise and un-hallucinatable, and retrieval matches real code units instead of arbitrary text windows. The code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode` actually match. **The agent:** given a target function, the LLM calls `get_definition` and `search_code` to read the real source, then writes pytest tests grounded in it โ€” a tool-calling loop (no framework), the model planning and acting rather than answering in one shot. ## Tech stack Python ยท Streamlit ยท tree-sitter ยท sentence-transformers ยท FAISS ยท rank-bm25 ยท cross-encoder reranker ยท OpenAI (`gpt-4.1-mini`, temperature 0) ## Run locally ```bash python -m venv .venv && .venv\Scripts\activate # Windows pip install -r requirements.txt echo OPENAI_API_KEY=sk-your-key > .env streamlit run app.py ``` ## Evaluate ```bash python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json ``` ## Project structure ``` src/ โ”œโ”€โ”€ ingestion/ scanner, tree-sitter parser, chunker โ”œโ”€โ”€ rag/ embedder, FAISS, BM25, hybrid, reranker, answerer โ”œโ”€โ”€ agent/ tools (search_code, get_definition) + tool-calling workflow โ””โ”€โ”€ evaluation/ eval harness app.py Streamlit UI (Ask + Generate tests) evaluate.py eval CLI ``` ## Limitations & roadmap v1 is a deliberate vertical slice. Known limits and next steps: - **Python only** โ€” multi-language via more tree-sitter grammars. - **ZIP upload only** โ€” GitHub URL ingestion (clone) next. - **Module-level symbols** (e.g. `app = FastAPI()`) retrieve worse than named functions โ€” index top-level assignments specially. - **Citation accuracy is a strict string check** โ€” the answer must contain the filename; LLM-as-judge grading would measure correctness more fairly. - **General-purpose embeddings** โ€” a code-specific embedding model (`jina-embeddings-v2-base-code`) would likely improve retrieval. - Future: code graph (call/import relationships), PR review mode, bug-fix tool, documentation agent. ## About A self-directed project focused on code-aware RAG, tool-calling agents, and measured evaluation โ€” not a generic "chat with your repo" demo.