# 🕷️ Code Crawler - Complete Architecture Walkthrough
## Table of Contents
1. [Project Overview](#project-overview)
2. [System Architecture](#system-architecture)
3. [Data Flow Pipeline](#data-flow-pipeline)
4. [RAG Implementation](#rag-implementation)
5. [AST Analysis & Graph Creation](#ast-analysis--graph-creation)
6. [Code Chunking Strategy](#code-chunking-strategy)
7. [Retrieval System](#retrieval-system)
8. [Agentic Workflow](#agentic-workflow)
9. [Frontend & API](#frontend--api)
10. [Component Deep Dives](#component-deep-dives)
11. [Configuration System](#configuration-system)
12. [File Dependency Map](#file-dependency-map)
13. [Summary](#summary)

---
## Project Overview

**Code Crawler** is an AI-powered codebase assistant that combines multiple advanced techniques:

- **RAG (Retrieval-Augmented Generation)**: Vector-based semantic search over code
- **AST Analysis**: Abstract Syntax Tree parsing for understanding code structure
- **Graph RAG**: Knowledge graph enhancement for relationship-aware retrieval
- **Agentic Workflows**: Multi-step reasoning with tool use (LangGraph)
- **Multi-LLM Support**: Gemini, Groq (Llama 3.3)

### Key Features

| Feature | Description |
|---------|-------------|
| 💬 Chat Mode | Natural language Q&A about the codebase |
| 🔍 Search Mode | Regex pattern search across files |
| 🔧 Refactor Mode | AI-assisted code refactoring |
| ✨ Generate Mode | Spec generation (PO-friendly, Dev Specs, User Stories) |

---
## System Architecture

```
                          CODE CRAWLER SYSTEM
                          ═══════════════════

  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
  │   DATA INGEST   │────▶│   PROCESSING    │────▶│     STORAGE     │
  │                 │     │                 │     │                 │
  │ • ZIP Files     │     │ • AST Parsing   │     │ • Vector DB     │
  │ • GitHub URLs   │     │ • Chunking      │     │   (Chroma/FAISS)│
  │ • Local Dirs    │     │ • Embeddings    │     │ • AST Graph     │
  │ • Web Docs      │     │ • Graph Build   │     │   (GraphML)     │
  └─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                           │
                                                           ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │                          RETRIEVAL LAYER                          │
  │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐   │
  │  │   Vector   │  │    LLM     │  │   Graph    │  │  Reranker  │   │
  │  │ Retriever  │──│ Retriever  │──│  Enhanced  │──│  (Cross-   │   │
  │  │            │  │            │  │ Retriever  │  │  Encoder)  │   │
  │  └────────────┘  └────────────┘  └────────────┘  └────────────┘   │
  └─────────────────────────────────┬─────────────────────────────────┘
                                    │
                                    ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │                            CHAT ENGINE                            │
  │                                                                   │
  │  ┌──────────────────┐      ┌──────────────────────────┐           │
  │  │    Linear RAG    │  OR  │    Agentic Workflow      │           │
  │  │     (Simple)     │      │       (LangGraph)        │           │
  │  │                  │      │                          │           │
  │  │ Query → Retrieve │      │  Agent → Tool → Agent    │           │
  │  │ → Answer         │      │    ↓                     │           │
  │  │                  │      │  search_codebase         │           │
  │  │                  │      │  read_file               │           │
  │  │                  │      │  list_files              │           │
  │  │                  │      │  find_callers            │           │
  │  └──────────────────┘      └──────────────────────────┘           │
  └─────────────────────────────────┬─────────────────────────────────┘
                                    │
                                    ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │                          FRONTEND LAYER                           │
  │                                                                   │
  │  Streamlit App          FastAPI (REST)        Next.js (React)     │
  │  ├── app.py             ├── /api/index        ├── /chat           │
  │  └── Code_Studio.py     ├── /api/chat         ├── /generate       │
  │                         └── /api/health       └── /search         │
  └───────────────────────────────────────────────────────────────────┘
```

---
## Data Flow Pipeline

### 1. Ingestion Flow

```
       User Input (ZIP/GitHub/Local)
                     │
                     ▼
┌─────────────────────────────────────────┐
│            UniversalIngestor            │
│         (universal_ingestor.py)         │
│                                         │
│  ┌─────────────┐   ┌─────────────────┐  │
│  │  _detect_   │──▶│ Handler Classes │  │
│  │  handler()  │   │                 │  │
│  └─────────────┘   │ • ZIPFileManager│  │
│                    │ • GitHubRepoMgr │  │
│                    │ • LocalDirMgr   │  │
│                    │ • WebDocManager │  │
│                    └─────────────────┘  │
└────────────────────┬────────────────────┘
                     │
                     ▼
        List[Document] + local_path
```
**Example: GitHub Repository Processing**

```python
# 1. User provides: "https://github.com/owner/repo"
# 2. UniversalIngestor detects the GitHub URL
ingestor = UniversalIngestor(source)  # delegate = GitHubRepoManager

# 3. Download (clone, with ZIP fallback)
ingestor.download()
# Clones to: /tmp/code_chatbot/owner_repo/

# 4. Walk files
for content, metadata in ingestor.walk():
    # content  = "def hello(): ..."
    # metadata = {"file_path": "/tmp/.../main.py", "source": "main.py"}
    ...
```
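For intuition, handler detection boils down to pattern-matching on the source string. A minimal sketch, illustrative only — the real `_detect_handler()` logic and the exact handler class names may differ:

```python
import re

# Hypothetical dispatch: map a source string to the name of the handler
# class that would ingest it.
def detect_handler(source: str) -> str:
    if re.match(r"https?://(www\.)?github\.com/[^/]+/[^/]+", source):
        return "GitHubRepoManager"
    if source.endswith(".zip"):
        return "ZIPFileManager"
    if source.startswith(("http://", "https://")):
        return "WebDocManager"
    return "LocalDirManager"

print(detect_handler("https://github.com/owner/repo"))  # GitHubRepoManager
print(detect_handler("project.zip"))                    # ZIPFileManager
```

The order matters: the GitHub check must run before the generic web-URL check, since every GitHub URL is also an HTTP URL.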
### 2. Indexing Flow

```
Documents
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Indexer                             │
│                           (indexer.py)                          │
│                                                                 │
│  ┌──────────────────┐   ┌─────────────────┐   ┌───────────────┐ │
│  │ StructuralChunker│──▶│ Embedding Model │──▶│  Vector Store │ │
│  │                  │   │   (Gemini/HF)   │   │ (Chroma/FAISS)│ │
│  └──────────────────┘   └─────────────────┘   └───────────────┘ │
│                                                                 │
│  Additionally:                                                  │
│  ┌──────────────────┐   ┌─────────────────┐                     │
│  │ ASTGraphBuilder  │──▶│  GraphML File   │                     │
│  └──────────────────┘   └─────────────────┘                     │
└─────────────────────────────────────────────────────────────────┘
```
---
## RAG Implementation

The RAG system in this project is implemented in `code_chatbot/rag.py` with these key components:

### ChatEngine Class

```python
class ChatEngine:
    def __init__(self, retriever, model_name, provider, ...):
        # 1. Base retriever (from the vector store)
        self.base_retriever = retriever

        # 2. Enhanced retriever with reranking
        self.vector_retriever = build_enhanced_retriever(
            base_retriever=retriever,
            use_multi_query=use_multi_query,
            use_reranking=True,  # uses a Cross-Encoder
        )

        # 3. LLM retriever (file-aware)
        self.llm_retriever = LLMRetriever(llm, repo_files)

        # 4. Ensemble retriever (combines both)
        self.retriever = EnsembleRetriever(
            retrievers=[self.vector_retriever, self.llm_retriever],
            weights=[0.6, 0.4],  # 60% vector, 40% LLM
        )
```
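LangChain's `EnsembleRetriever` merges the two ranked lists with weighted Reciprocal Rank Fusion: each document scores the weighted sum of `1/(k + rank)` over every list it appears in. A toy illustration of how the 0.6/0.4 weights shape the final ordering (file names are hypothetical):

```python
from collections import defaultdict

def weighted_rrf(result_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion over several ranked lists."""
    scores = defaultdict(float)
    for docs, weight in zip(result_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "login.py", "db.py"]   # from the vector retriever
llm_hits    = ["login.py", "middleware.py"]      # from the LLM retriever

print(weighted_rrf([vector_hits, llm_hits], weights=[0.6, 0.4]))
# ['login.py', 'auth.py', 'db.py', 'middleware.py']
```

`login.py` wins because it is the only document both retrievers agree on, even though neither ranked it first.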
### RAG Flow Example

```
  User Query: "How does the authentication work?"
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  1. RETRIEVAL                                           │
│   ┌──────────────────┐        ┌──────────────────┐      │
│   │ Vector Retriever │        │  LLM Retriever   │      │
│   │                  │        │                  │      │
│   │ Semantic search  │        │ LLM picks files  │      │
│   │ in Chroma DB     │        │ from structure   │      │
│   └────────┬─────────┘        └────────┬─────────┘      │
│            │                           │                │
│            └────────────┬──────────────┘                │
│                         ▼                               │
│              ┌─────────────────────┐                    │
│              │  EnsembleRetriever  │                    │
│              │ (60% + 40% weighted)│                    │
│              └──────────┬──────────┘                    │
│                         │                               │
│                         ▼                               │
│              ┌─────────────────────┐                    │
│              │      Reranker       │                    │
│              │   (Cross-Encoder)   │                    │
│              │   ms-marco-MiniLM   │                    │
│              └──────────┬──────────┘                    │
│                         │                               │
│                         ▼                               │
│              Top 5 Most Relevant Docs                   │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  2. GENERATION                                          │
│                                                         │
│   System Prompt + Context + History + Question          │
│                         │                               │
│                         ▼                               │
│              ┌─────────────────────┐                    │
│              │  LLM (Gemini/Groq)  │                    │
│              └──────────┬──────────┘                    │
│                         │                               │
│                         ▼                               │
│                 Answer + Sources                        │
└─────────────────────────────────────────────────────────┘
```
---
## AST Analysis & Graph Creation

The AST analysis is implemented in `code_chatbot/ast_analysis.py` using **tree-sitter** for multi-language parsing.

### How AST Parsing Works

```python
# Example source code:
"""
from typing import List

class UserService:
    def __init__(self, db):
        self.db = db

    def get_user(self, user_id: int) -> User:
        return self.db.find(user_id)

    def create_user(self, name: str) -> User:
        user = User(name=name)
        self.db.save(user)
        return user
"""

# tree-sitter parses this into an AST:
"""
module
├── import_from_statement
│   ├── module: "typing"
│   └── names: ["List"]
└── class_definition
    ├── name: "UserService"
    └── block
        ├── function_definition (name: "__init__")
        ├── function_definition (name: "get_user")
        │   └── call (function: "self.db.find")
        └── function_definition (name: "create_user")
            ├── call (function: "User")
            └── call (function: "self.db.save")
"""
```
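The project parses with tree-sitter so it can handle many languages, but Python's built-in `ast` module demonstrates the same idea on a smaller scale — walk the tree and pull out class and method definitions:

```python
import ast

source = """
class UserService:
    def __init__(self, db):
        self.db = db

    def get_user(self, user_id):
        return self.db.find(user_id)
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        print("class:", node.name)
        for item in node.body:
            if isinstance(item, ast.FunctionDef):
                print("  method:", item.name)
# class: UserService
#   method: __init__
#   method: get_user
```

tree-sitter exposes the same kind of tree (nodes with types, names, and byte ranges) for Python, JavaScript, TypeScript, and many other grammars.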
### EnhancedCodeAnalyzer

```python
class EnhancedCodeAnalyzer:
    """Builds a knowledge graph from code"""

    def __init__(self):
        self.graph = nx.DiGraph()  # NetworkX directed graph
        self.functions = {}        # node_id -> FunctionInfo
        self.classes = {}          # node_id -> ClassInfo
        self.imports = {}          # file_path -> [ImportInfo]
        self.definitions = {}      # name -> [node_ids]
```
### Graph Structure Example

```
┌─────────────────────────────────────────────────────────────┐
│                     AST KNOWLEDGE GRAPH                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Nodes:                                                     │
│  ┌──────────────────┐                                       │
│  │ Type: "file"     │                                       │
│  │ Name: "api.py"   │                                       │
│  └────────┬─────────┘                                       │
│           │ defines                                         │
│           ▼                                                 │
│  ┌──────────────────┐    ┌──────────────────┐               │
│  │ Type: "class"    │    │ Type: "function" │               │
│  │ Name: "UserAPI"  │    │ Name: "main"     │               │
│  └────────┬─────────┘    └──────────────────┘               │
│           │ has_method                                      │
│           ▼                                                 │
│  ┌──────────────────┐                                       │
│  │ Type: "method"   │───calls──▶ UserService.get_user       │
│  │ Name: "get"      │                                       │
│  └──────────────────┘                                       │
│                                                             │
│  Edges:                                                     │
│  • defines:       file → class/function                     │
│  • has_method:    class → method                            │
│  • calls:         function → function                       │
│  • imports:       file → module                             │
│  • inherits_from: class → class                             │
└─────────────────────────────────────────────────────────────┘
```
### Call Graph Resolution

```python
def resolve_call_graph(self):
    """
    After parsing all files, resolve function calls to their definitions.

    Example:
    - File A has: service.get_user(id)
    - File B has: def get_user(self, id): ...

    Resolution:
    - Finds that "get_user" is defined in File B
    - Creates edge: A::caller_func --calls--> B::UserService.get_user
    """
    for caller_id, callee_name, line in self.unresolved_calls:
        # Try a direct match
        if callee_name in self.definitions:
            for target_id in self.definitions[callee_name]:
                self.graph.add_edge(caller_id, target_id, relation="calls")
```
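Stripped of NetworkX, the resolution pass is just a lookup from call-site names into the definitions index built during parsing. A toy version with plain dicts (the node IDs are hypothetical):

```python
from collections import defaultdict

# name -> node IDs of its definitions, filled while parsing each file
definitions = {"get_user": ["b.py::UserService.get_user"]}

# (caller node, called name, line) triples collected during parsing
unresolved_calls = [("a.py::handle_request", "get_user", 42)]

# Second pass: turn name matches into call edges
graph = defaultdict(list)  # caller node -> [callee nodes]
for caller_id, callee_name, line in unresolved_calls:
    for target_id in definitions.get(callee_name, []):
        graph[caller_id].append(target_id)

print(dict(graph))
# {'a.py::handle_request': ['b.py::UserService.get_user']}
```

Because resolution matches by name, a name defined in several files produces one edge per candidate definition — a deliberate over-approximation.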
---
## Code Chunking Strategy

The chunking system in `code_chatbot/chunker.py` uses **structural chunking** based on AST boundaries.

### Chunking Philosophy

```
Traditional Text Chunking:
┌────────────────────────────┬────────────┐
│ def process_data():        │            │
│     data = load()          │  CHUNK 1   │
│     # Some processing      │            │
├────────────────────────────┼────────────┤
│     result = transform()   │  CHUNK 2   │  ← breaks mid-function!
│     return result          │            │
└────────────────────────────┴────────────┘

Structural Chunking (This Project):
┌────────────────────────────┬────────────┐
│ def process_data():        │            │
│     data = load()          │  CHUNK 1   │  ← complete function
│     result = transform()   │            │
│     return result          │            │
├────────────────────────────┼────────────┤
│ def another_function():    │            │
│     ...                    │  CHUNK 2   │  ← complete function
└────────────────────────────┴────────────┘
```
### StructuralChunker Implementation

```python
class StructuralChunker:
    """Uses tree-sitter to chunk code at semantic boundaries"""

    def __init__(self, max_tokens: int = 800):
        self.max_tokens = max_tokens
        self._init_parsers()  # Python, JS, TS parsers

    def _chunk_node(self, node, file_content, file_metadata):
        """
        Recursive chunking algorithm:
        1. If the node fits in max_tokens → return it as a single chunk
        2. If the node is too large → recurse into its children
        3. Merge neighboring small chunks
        """
        chunk = FileChunk(file_content, file_metadata,
                          node.start_byte, node.end_byte)

        # Fits? Return it
        if chunk.num_tokens <= self.max_tokens:
            return [chunk]

        # Too large? Recurse
        child_chunks = []
        for child in node.children:
            child_chunks.extend(self._chunk_node(child, ...))

        # Merge small neighbors
        return self._merge_small_chunks(child_chunks)
```
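The `_merge_small_chunks` step is not shown above; conceptually it is a greedy pass over neighboring chunks. A simplified sketch operating on `(text, tokens)` pairs instead of full `FileChunk` objects (the real merge would also combine byte ranges and metadata):

```python
def merge_small_chunks(chunks, max_tokens=800):
    """Greedily concatenate neighboring chunks while the merged size
    stays within the token budget."""
    merged = []
    for text, tokens in chunks:
        if merged and merged[-1][1] + tokens <= max_tokens:
            prev_text, prev_tokens = merged[-1]
            merged[-1] = (prev_text + "\n" + text, prev_tokens + tokens)
        else:
            merged.append((text, tokens))
    return merged

print(merge_small_chunks([("a", 300), ("b", 400), ("c", 500)]))
# [('a\nb', 700), ('c', 500)]
```

Merging only adjacent chunks keeps the output in source order, so every merged chunk is still a contiguous span of the file.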
### Chunk Metadata (Rich Context)

Each chunk carries rich metadata:

```python
@dataclass
class FileChunk:
    file_content: str
    file_metadata: Dict
    start_byte: int
    end_byte: int

    # Enhanced metadata
    symbols_defined: List[str]   # ["UserService", "UserService.get_user"]
    imports_used: List[str]      # ["from typing import List"]
    complexity_score: int        # Cyclomatic complexity
    parent_context: str          # "UserService" (parent class)
```

This metadata is stored in the vector DB and used for filtering and ranking.
---
## Retrieval System

### Multi-Stage Retrieval Pipeline

```
          Query: "How does user authentication work?"
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  STAGE 1: Initial Retrieval (k=10)                           │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Vector Store (Chroma)                     │  │
│  │                                                        │  │
│  │  Query Embedding ──similarity──▶ Document Embeddings   │  │
│  │                                                        │  │
│  │  Returns: 10 candidate documents                       │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  STAGE 2: LLM-Based File Selection                           │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                    LLMRetriever                        │  │
│  │                                                        │  │
│  │  File Tree:                                            │  │
│  │  ├── src/                                              │  │
│  │  │   ├── auth/                                         │  │
│  │  │   │   ├── login.py       ◀── LLM selects this       │  │
│  │  │   │   └── middleware.py  ◀── and this               │  │
│  │  │   └── api/                                          │  │
│  │  └── tests/                                            │  │
│  │                                                        │  │
│  │  LLM Prompt: "Select top 5 relevant files for: ..."    │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  STAGE 3: Ensemble Combination                               │
│                                                              │
│  Vector Results (weight: 0.6) + LLM Results (weight: 0.4)    │
│                                                              │
│  Combined: 12-15 unique documents                            │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  STAGE 4: Graph Enhancement                                  │
│                                                              │
│  For each retrieved document:                                │
│    1. Find its node in the AST graph                         │
│    2. Get neighboring nodes (related files)                  │
│    3. Add related files to the context                       │
│                                                              │
│  Example: login.py found → adds auth_utils.py (imports it)   │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  STAGE 5: Reranking                                          │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Cross-Encoder Reranker                    │  │
│  │             (ms-marco-MiniLM-L-6-v2)                   │  │
│  │                                                        │  │
│  │  For each (query, document) pair:                      │  │
│  │    score = cross_encoder.predict([query, doc.content]) │  │
│  │                                                        │  │
│  │  Sort by score, return top 5                           │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
                    Final: Top 5 Documents
```
### Reranker (Cross-Encoder)

```python
class Reranker:
    """
    Uses a Cross-Encoder for precise relevance scoring.

    Unlike bi-encoders (used for initial retrieval), cross-encoders
    process the query AND the document together, giving more accurate scores.
    """

    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: List[Document], top_k=5):
        # Score each document against the query
        pairs = [[query, doc.page_content] for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by score, descending
        scored = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scored[:top_k]]
```
---
## Agentic Workflow

The agentic workflow uses **LangGraph** to enable multi-step reasoning with tool use.

### Agent Graph Structure

```
┌───────────────────────────────────────────────────────────┐
│                      LANGGRAPH AGENT                      │
├───────────────────────────────────────────────────────────┤
│                                                           │
│                    ┌───────────┐                          │
│                    │   START   │                          │
│                    └─────┬─────┘                          │
│                          ▼                                │
│      ┌───────────────────────────────────────┐            │
│  ┌──▶│              AGENT NODE               │            │
│  │   │                                       │            │
│  │   │  1. Process messages                  │            │
│  │   │  2. Call LLM with tools bound         │            │
│  │   │  3. LLM decides:                      │            │
│  │   │     - Call a tool?  → go to TOOLS     │            │
│  │   │     - Final answer? → go to END       │            │
│  │   └───────────────────┬───────────────────┘            │
│  │                       │ has_tool_call?                 │
│  │          Yes ◀────────┴────────▶ No                    │
│  │               │                  │                     │
│  │               ▼                  │                     │
│  │  ┌──────────────────────────┐    │                     │
│  │  │        TOOLS NODE        │    │                     │
│  │  │                          │    │                     │
│  │  │  Execute tool calls:     │    │                     │
│  │  │  • search_codebase(query)│    │                     │
│  │  │  • read_file(path)       │    │                     │
│  │  │  • list_files(dir)       │    │                     │
│  │  │  • find_callers(func)    │    │                     │
│  │  │  • find_callees(func)    │    │                     │
│  │  │  • find_call_chain(a, b) │    │                     │
│  │  │                          │    │                     │
│  │  │  Add tool results to     │    │                     │
│  │  │  messages                │    │                     │
│  │  └────────────┬─────────────┘    │                     │
│  │               │                  │                     │
│  └───────────────┘                  ▼                     │
│                               ┌───────────┐               │
│                               │    END    │               │
│                               └───────────┘               │
└───────────────────────────────────────────────────────────┘
```
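Stripped of LangGraph, the agent/tools cycle is a plain loop: call the model, run the requested tool, append the result, and repeat until the model produces a final answer. A self-contained sketch with a stubbed-out LLM (the real nodes operate on LangChain message objects, not tuples):

```python
def run_agent(llm, tools, question, max_steps=10):
    """Minimal agent loop: `llm` is a stand-in callable returning either
    ("tool", name, args) or ("answer", text)."""
    messages = [("user", question)]
    for _ in range(max_steps):
        decision = llm(messages)
        if decision[0] == "answer":      # AGENT → END
            return decision[1]
        _, name, args = decision         # AGENT → TOOLS
        result = tools[name](*args)
        messages.append(("tool", name, result))  # TOOLS → AGENT
    return "Step limit reached"

# Stub LLM: asks for one search, then answers with what it found.
def stub_llm(messages):
    if messages[-1][0] == "user":
        return ("tool", "search_codebase", ("login",))
    return ("answer", f"Found: {messages[-1][2]}")

tools = {"search_codebase": lambda q: f"snippet matching '{q}'"}
print(run_agent(stub_llm, tools, "How does login work?"))
# Found: snippet matching 'login'
```

The `max_steps` cap plays the same role as a recursion limit in the graph: it guarantees termination even if the model never emits a final answer.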
### Available Tools

```python
# 1. search_codebase - semantic search in the vector store
@tool("search_codebase")
def search_codebase(query: str):
    """Search the codebase for relevant code snippets."""
    docs = retriever.invoke(query)
    return format_results(docs[:5])

# 2. read_file - read complete file content
@tool("read_file")
def read_file(file_path: str):
    """Read the content of a specific file."""
    # full_path = file_path resolved against the indexed repo root
    with open(full_path, "r") as f:
        return f.read()

# 3. list_files - directory listing
@tool("list_files")
def list_files(path: str = "."):
    """List files in a directory."""
    # target_path = path resolved against the indexed repo root
    return "\n".join(os.listdir(target_path))

# 4. find_callers - call graph: who calls this?
@tool("find_callers")
def find_callers(function_name: str):
    """Find all functions that call this function."""
    return analyzer.get_callers(function_name)

# 5. find_callees - call graph: what does this call?
@tool("find_callees")
def find_callees(function_name: str):
    """Find all functions called by this function."""
    return analyzer.get_callees(function_name)

# 6. find_call_chain - call graph: trace an execution path
@tool("find_call_chain")
def find_call_chain(start: str, end: str):
    """Find the call path from one function to another."""
    return analyzer.get_call_chain(start, end)
```
### Agent Execution Example

```
User: "How does the login function handle invalid passwords?"

Agent Thought: I need to find the login function first.
Agent Action:  search_codebase("login function authentication")
Tool Result:   [login.py snippet showing authenticate() function]

Agent Thought: I found authenticate(). Let me see the full implementation.
Agent Action:  read_file("src/auth/login.py")
Tool Result:   [Full file content]

Agent Thought: I see it calls verify_password(). Let me check that.
Agent Action:  find_callees("authenticate")
Tool Result:   verify_password, get_user, create_session

Agent Action:  search_codebase("verify_password invalid password")
Tool Result:   [password_utils.py with error handling]

Agent Final Answer: The login function handles invalid passwords by...
```
---
## Frontend & API

### Streamlit App Structure

```
app.py (Main Entry)
│
├── Ingestion Screen
│   ├── Source Type Selection (ZIP/GitHub/Web)
│   ├── File Upload / URL Input
│   └── "Process & Index" Button
│
└── Redirects to → pages/1_⚡_Code_Studio.py

Code_Studio.py
│
├── Left Panel (Tabs)
│   ├── 📁 Explorer - File tree navigation
│   ├── 🔍 Search   - Regex pattern search
│   ├── 💬 Chat     - RAG conversation
│   └── ✨ Generate - Spec generation
│
└── Right Panel
    └── Code Viewer - Syntax-highlighted file view
```
### FastAPI REST API

```
/api
├── /health   GET  - Health check
│
├── /index    POST - Index a codebase
│   Body: {
│     source: "https://github.com/...",
│     provider: "gemini",
│     use_agent: true
│   }
│
└── /chat     POST - Ask questions
    Body: {
      question: "How does auth work?",
      provider: "gemini",
      use_agent: true
    }
    Response: {
      answer: "...",
      sources: [...],
      mode: "agent",
      processing_time: 2.5
    }
```
---
## Component Deep Dives

### Merkle Tree (Incremental Indexing)

```python
class MerkleTree:
    """
    Enables incremental indexing by detecting file changes.

    How it works:
    1. Build a hash tree mirroring the directory structure
    2. Each file node stores the SHA-256 hash of its content
    3. Each directory node stores the hash of its children's hashes
    4. Compare the old and new trees to find changes
    """

    def compare_trees(self, old, new) -> ChangeSet:
        # Returns: added, modified, deleted, unchanged files
        ...
```

**Example:**

```
First Index:
project/
├── main.py   (hash: abc123)
└── utils.py  (hash: def456)
Root hash: sha256(abc123 + def456) = xyz789

Second Index (utils.py changed):
project/
├── main.py   (hash: abc123)  ← unchanged
└── utils.py  (hash: ghi012)  ← NEW HASH!

Root hash changed → only utils.py is re-indexed
```
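The hashing scheme in the example can be sketched in a few lines with `hashlib` (illustrative; the project's exact hash composition may differ):

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Leaf node: SHA-256 of the file's content."""
    return hashlib.sha256(content).hexdigest()

def dir_hash(child_hashes: list) -> str:
    """Directory node: SHA-256 of the sorted, concatenated child hashes,
    so any change below propagates up to the root."""
    return hashlib.sha256("".join(sorted(child_hashes)).encode()).hexdigest()

h_main  = file_hash(b"print('main')")
h_utils = file_hash(b"def util(): ...")
root_v1 = dir_hash([h_main, h_utils])

h_utils2 = file_hash(b"def util(): return 42")  # utils.py changed
root_v2  = dir_hash([h_main, h_utils2])

print(root_v1 != root_v2)  # True - the changed file bubbles up to the root
```

Comparing the two trees top-down lets the indexer skip entire unchanged subtrees: if a directory's hash matches, nothing below it needs re-indexing.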
### Path Obfuscation (Privacy)

```python
class PathObfuscator:
    """
    Obfuscates file paths for sensitive codebases.

    Original:   /home/user/secret-project/src/auth/login.py
    Obfuscated: /f8a3b2c1/d4e5f6a7/89012345.py

    The mapping is stored securely and is reversible only with the key.
    """
```
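One common way to get this behavior is a keyed hash per path segment: deterministic for a given key (so lookups stay consistent) but unreadable without it. A sketch, not necessarily how `PathObfuscator` implements it:

```python
import hashlib
import hmac

def obfuscate_path(path: str, key: bytes) -> str:
    """Replace each path segment with a short keyed digest."""
    parts = path.strip("/").split("/")
    hashed = [hmac.new(key, p.encode(), hashlib.sha256).hexdigest()[:8]
              for p in parts]
    return "/" + "/".join(hashed)

print(obfuscate_path("/home/user/secret-project/src/auth/login.py", b"k3y"))
```

Because HMAC is one-way, reversing a path requires the stored segment-to-digest mapping (or the key plus a dictionary of candidate names); truncating to 8 hex characters trades collision resistance for readability.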
### Rate Limiter (API Management)

```python
class AdaptiveRateLimiter:
    """
    Handles rate limits for free-tier APIs.
    Gemini free tier: 15 RPM, 32K TPM, 1500 RPD

    Strategies:
    1. Track usage in a rolling window
    2. Adaptive delay based on the remaining quota
    3. Exponential backoff on 429 errors
    4. Model fallback chain (flash → pro → legacy)
    """
```
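Strategy 1, the rolling window, can be sketched with a deque of timestamps (illustrative; `AdaptiveRateLimiter` layers the adaptive-delay and fallback logic on top):

```python
import time
from collections import deque

class RollingWindowLimiter:
    """Remember request timestamps; report how long to wait when the
    last `max_requests` all fall inside the window (e.g. 15 per 60 s
    for the Gemini free tier)."""

    def __init__(self, max_requests=15, window_s=60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.stamps = deque()

    def wait_time(self, now=None) -> float:
        now = time.monotonic() if now is None else now
        while self.stamps and now - self.stamps[0] >= self.window_s:
            self.stamps.popleft()  # drop requests outside the window
        if len(self.stamps) < self.max_requests:
            return 0.0
        return self.window_s - (now - self.stamps[0])

    def record(self, now=None):
        self.stamps.append(time.monotonic() if now is None else now)

limiter = RollingWindowLimiter(max_requests=2, window_s=60)
limiter.record(now=0.0)
limiter.record(now=1.0)
print(limiter.wait_time(now=2.0))  # 58.0 - window is full
```

The caller sleeps for `wait_time()` before each request; the oldest timestamp leaving the window is exactly what frees the next slot.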
---
## Configuration System

```python
@dataclass
class RAGConfig:
    """Central configuration for the entire pipeline"""

    # Chunking
    chunking: ChunkingConfig
    max_chunk_tokens: int = 800
    min_chunk_tokens: int = 100
    preserve_imports: bool = True
    calculate_complexity: bool = True

    # Privacy
    privacy: PrivacyConfig
    enable_path_obfuscation: bool = False

    # Indexing
    indexing: IndexingConfig
    enable_incremental_indexing: bool = True
    batch_size: int = 100
    ignore_patterns: List[str] = [...]

    # Retrieval
    retrieval: RetrievalConfig
    enable_reranking: bool = True
    retrieval_k: int = 10
    rerank_top_k: int = 5
    similarity_threshold: float = 0.5
```
---
## File Dependency Map

```
app.py
├── code_chatbot/universal_ingestor.py
├── code_chatbot/indexer.py
│   ├── code_chatbot/chunker.py        (StructuralChunker)
│   ├── code_chatbot/merkle_tree.py    (MerkleTree)
│   ├── code_chatbot/config.py         (RAGConfig)
│   └── code_chatbot/db_connection.py  (Chroma client)
├── code_chatbot/rag.py                (ChatEngine)
│   ├── code_chatbot/retriever_wrapper.py
│   │   └── code_chatbot/reranker.py   (Reranker)
│   ├── code_chatbot/llm_retriever.py  (LLMRetriever)
│   ├── code_chatbot/agent_workflow.py
│   │   └── code_chatbot/tools.py
│   └── code_chatbot/prompts.py
├── code_chatbot/ast_analysis.py       (EnhancedCodeAnalyzer)
└── code_chatbot/graph_rag.py          (GraphEnhancedRetriever)

pages/1_⚡_Code_Studio.py
├── components/file_explorer.py
├── components/code_viewer.py
├── components/panels.py
└── components/style.py

api/main.py
├── api/routes/chat.py
├── api/routes/index.py
├── api/routes/health.py
├── api/schemas.py
└── api/state.py
```
---
## Summary

This project implements a sophisticated code-understanding system with:

1. **Multi-Source Ingestion**: ZIP, GitHub, local directories, web docs
2. **Structural Chunking**: AST-aware code splitting
3. **Hybrid Retrieval**: Vector + LLM + graph-enhanced
4. **Cross-Encoder Reranking**: Precision at the top of the ranking
5. **Agentic Workflow**: Multi-step reasoning with tools
6. **Call Graph Analysis**: Function relationship tracking
7. **Incremental Indexing**: Merkle-tree change detection
8. **Multi-LLM Support**: Gemini and Groq, with fallbacks

The architecture is designed for scalability, accuracy, and developer experience.