Spaces:
Running
Running
| title: Codebase Intelligence Agent | |
| emoji: π§ | |
| colorFrom: indigo | |
| colorTo: blue | |
| sdk: streamlit | |
| app_file: app.py | |
| python_version: "3.11" | |
| pinned: false | |
| # Codebase Intelligence Agent | |
| An AI assistant that understands a Python codebase. Upload a repository, ask | |
| questions and get answers with **exact file + line citations**, or have the | |
| **agent generate pytest tests** for any function by reading its real source. | |
| Built around code-aware retrieval (tree-sitter AST chunking, not naive text | |
| splitting) and measured with an evaluation harness. | |
| ## Demo | |
|  | |
| - **Ask the codebase** β "where are JWT tokens created?" β grounded answer citing | |
| `app/core/security.py:38-44`, with the actual code shown as a source. | |
| - **Generate tests** β name a function β a tool-calling agent reads its real | |
| source and dependencies, then writes pytest tests grounded in that code. | |
| ## Evaluation | |
| Measured on a real FastAPI backend (74 files, 369 definitions), deterministic | |
| (`temperature=0`): | |
| | Metric | Result | | |
| |---|---| | |
| | File-level retrieval accuracy | **90%** | | |
| | Function-level retrieval accuracy | **75%** | | |
| | Citation accuracy (answer cites the right file) | **75%** | | |
| | Median latency | **~3.3s / query** | | |
| > Honest miss worth noting: "where is the FastAPI app created?" misses because | |
| > the app is instantiated at module level (`app = FastAPI()`) rather than in a | |
| > named function β module-level instantiation is harder to retrieve than named | |
| > symbols. Indexing top-level assignments specially is the fix (roadmap). | |
| ## How it works | |
| ``` | |
| ZIP repo | |
| | | |
| v | |
| File scanner skip venv/.git/__pycache__/node_modules, size cap | |
| | | |
| v | |
| tree-sitter parser AST -> functions, classes, methods (+ exact line numbers) | |
| | | |
| v | |
| Code chunker one chunk per definition + file/line metadata | |
| | | |
| v | |
| Embeddings -------\ | |
| | \ | |
| v v | |
| FAISS (semantic) BM25 (code-aware tokenizer: matches `jwt.encode`) | |
| \ / | |
| v v | |
| Hybrid retrieval -> cross-encoder rerank -> top-5 | |
| | | |
| +--------+--------+ | |
| | | | |
| v v | |
| Grounded Q&A Test-gen agent | |
| (file:line cites) (tool-calling loop) | |
| ``` | |
| **Why it's code-aware:** chunking by AST means each chunk is a whole function or | |
| class with its exact line range β so citations are precise and un-hallucinatable, | |
| and retrieval matches real code units instead of arbitrary text windows. The | |
| code-aware BM25 tokenizer splits on symbols, so exact searches like `jwt.encode` | |
| actually match. | |
| **The agent:** given a target function, the LLM calls `get_definition` and | |
| `search_code` to read the real source, then writes pytest tests grounded in it β | |
| a tool-calling loop (no framework), the model planning and acting rather than | |
| answering in one shot. | |
| ## Tech stack | |
| Python Β· Streamlit Β· tree-sitter Β· sentence-transformers Β· FAISS Β· rank-bm25 Β· | |
| cross-encoder reranker Β· OpenAI (`gpt-4.1-mini`, temperature 0) | |
| ## Run locally | |
| ```bash | |
| python -m venv .venv && .venv\Scripts\activate # Windows | |
| pip install -r requirements.txt | |
| echo OPENAI_API_KEY=sk-your-key > .env | |
| streamlit run app.py | |
| ``` | |
| ## Evaluate | |
| ```bash | |
| python evaluate.py --repo path/to/python/repo --testset data/eval/testset.json | |
| ``` | |
| ## Project structure | |
| ``` | |
| src/ | |
| βββ ingestion/ scanner, tree-sitter parser, chunker | |
| βββ rag/ embedder, FAISS, BM25, hybrid, reranker, answerer | |
| βββ agent/ tools (search_code, get_definition) + tool-calling workflow | |
| βββ evaluation/ eval harness | |
| app.py Streamlit UI (Ask + Generate tests) | |
| evaluate.py eval CLI | |
| ``` | |
| ## Limitations & roadmap | |
| v1 is a deliberate vertical slice. Known limits and next steps: | |
| - **Python only** β multi-language via more tree-sitter grammars. | |
| - **ZIP upload only** β GitHub URL ingestion (clone) next. | |
| - **Module-level symbols** (e.g. `app = FastAPI()`) retrieve worse than named | |
| functions β index top-level assignments specially. | |
| - **Citation accuracy is a strict string check** β the answer must contain the | |
| filename; LLM-as-judge grading would measure correctness more fairly. | |
| - **General-purpose embeddings** β a code-specific embedding model | |
| (`jina-embeddings-v2-base-code`) would likely improve retrieval. | |
| - Future: code graph (call/import relationships), PR review mode, bug-fix tool, | |
| documentation agent. | |
| ## About | |
| A self-directed project focused on code-aware RAG, tool-calling agents, and | |
| measured evaluation β not a generic "chat with your repo" demo. |