🕷️ Code Crawler - Complete Architecture Walkthrough
Table of Contents
- Project Overview
- System Architecture
- Data Flow Pipeline
- RAG Implementation
- AST Analysis & Graph Creation
- Code Chunking Strategy
- Retrieval System
- Agentic Workflow
- Frontend & API
- Component Deep Dives
Project Overview
Code Crawler is an AI-powered codebase assistant that combines multiple advanced techniques:
- RAG (Retrieval-Augmented Generation): Vector-based semantic search over code
- AST Analysis: Abstract Syntax Tree parsing for understanding code structure
- Graph RAG: Knowledge graph enhancement for relationship-aware retrieval
- Agentic Workflows: Multi-step reasoning with tool use (LangGraph)
- Multi-LLM Support: Gemini, Groq (Llama 3.3)
Key Features
| Feature | Description |
|---|---|
| 💬 Chat Mode | Natural language Q&A about the codebase |
| 🔍 Search Mode | Regex pattern search across files |
| 🔧 Refactor Mode | AI-assisted code refactoring |
| ✨ Generate Mode | Spec generation (PO-friendly, Dev Specs, User Stories) |
System Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                             CODE CRAWLER SYSTEM                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐     ┌─────────────────┐     ┌──────────────────┐       │
│  │   DATA INGEST   │────▶│   PROCESSING    │────▶│     STORAGE      │       │
│  │                 │     │                 │     │                  │       │
│  │ • ZIP Files     │     │ • AST Parsing   │     │ • Vector DB      │       │
│  │ • GitHub URLs   │     │ • Chunking      │     │   (Chroma/FAISS) │       │
│  │ • Local Dirs    │     │ • Embeddings    │     │ • AST Graph      │       │
│  │ • Web Docs      │     │ • Graph Build   │     │   (GraphML)      │       │
│  └─────────────────┘     └─────────────────┘     └────────┬─────────┘       │
│                                                           │                 │
│                                                           ▼                 │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                           RETRIEVAL LAYER                            │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │   │
│  │  │   Vector    │  │     LLM     │  │    Graph    │  │  Reranker   │  │   │
│  │  │  Retriever  │  │  Retriever  │  │  Enhanced   │  │  (Cross-    │  │   │
│  │  │             │  │             │  │  Retriever  │  │  Encoder)   │  │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                      │                                      │
│                                      ▼                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                             CHAT ENGINE                              │   │
│  │                                                                      │   │
│  │  ┌────────────────────┐        ┌────────────────────────────┐        │   │
│  │  │     Linear RAG     │   OR   │      Agentic Workflow      │        │   │
│  │  │     (Simple)       │        │      (LangGraph)           │        │   │
│  │  │                    │        │                            │        │   │
│  │  │ Query → Retrieve   │        │ Agent → Tool → Agent       │        │   │
│  │  │       → Answer     │        │                            │        │   │
│  │  │                    │        │ • search_codebase          │        │   │
│  │  │                    │        │ • read_file                │        │   │
│  │  │                    │        │ • list_files               │        │   │
│  │  │                    │        │ • find_callers             │        │   │
│  │  └────────────────────┘        └────────────────────────────┘        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                      │                                      │
│                                      ▼                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                            FRONTEND LAYER                            │   │
│  │                                                                      │   │
│  │  Streamlit App          FastAPI (REST)         Next.js (React)       │   │
│  │  ├── app.py             ├── /api/index         ├── /chat             │   │
│  │  └── Code_Studio.py     ├── /api/chat          ├── /generate         │   │
│  │                         └── /api/health        └── /search           │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow Pipeline
1. Ingestion Flow
User Input (ZIP/GitHub/Local)
        │
        ▼
┌──────────────────────────────────────────┐
│            UniversalIngestor             │
│         (universal_ingestor.py)          │
│                                          │
│  ┌─────────────┐   ┌───────────────────┐ │
│  │  _detect_   │──▶│  Handler Classes  │ │
│  │  handler()  │   │                   │ │
│  └─────────────┘   │ • ZIPFileManager  │ │
│                    │ • GitHubRepoMgr   │ │
│                    │ • LocalDirMgr     │ │
│                    │ • WebDocManager   │ │
│                    └───────────────────┘ │
└───────────────────┬──────────────────────┘
                    │
                    ▼
        List[Document] + local_path
Example: GitHub Repository Processing
# 1. User provides: "https://github.com/owner/repo"
# 2. UniversalIngestor detects the GitHub URL
ingestor = UniversalIngestor(source)
# delegate = GitHubRepoManager

# 3. Download (clone or ZIP fallback)
ingestor.download()
# Clones to: /tmp/code_chatbot/owner_repo/

# 4. Walk files
for content, metadata in ingestor.walk():
    # content = "def hello(): ..."
    # metadata = {"file_path": "/tmp/.../main.py", "source": "main.py"}
    ...
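The dispatch in step 2 can be sketched as a chain of source-type checks. This is a hypothetical illustration of what `_detect_handler()` might look like, not the project's actual code; the handler names are taken from the diagram above.

```python
import re

# Hypothetical sketch of handler detection; the real _detect_handler()
# in universal_ingestor.py may use different checks or ordering.
def detect_handler(source: str) -> str:
    """Map a source string to the handler class that should process it."""
    # GitHub must be checked before the generic web-URL case.
    if re.match(r"https?://(www\.)?github\.com/[^/]+/[^/]+", source):
        return "GitHubRepoManager"
    if source.startswith(("http://", "https://")):
        return "WebDocManager"
    if source.endswith(".zip"):
        return "ZIPFileManager"
    return "LocalDirManager"

print(detect_handler("https://github.com/owner/repo"))  # GitHubRepoManager
```

The ordering matters: a GitHub URL is also a web URL, so the more specific pattern is tested first.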
2. Indexing Flow
Documents
    │
    ▼
┌───────────────────────────────────────────────────────────────────┐
│                             Indexer                               │
│                           (indexer.py)                            │
│                                                                   │
│  ┌───────────────────┐   ┌─────────────────┐   ┌───────────────┐  │
│  │ StructuralChunker │──▶│ Embedding Model │──▶│ Vector Store  │  │
│  │                   │   │   (Gemini/HF)   │   │ (Chroma/FAISS)│  │
│  └───────────────────┘   └─────────────────┘   └───────────────┘  │
│                                                                   │
│  Additionally:                                                    │
│  ┌───────────────────┐   ┌─────────────────┐                      │
│  │  ASTGraphBuilder  │──▶│  GraphML File   │                      │
│  └───────────────────┘   └─────────────────┘                      │
└───────────────────────────────────────────────────────────────────┘
RAG Implementation
The RAG system in this project is implemented in code_chatbot/rag.py with these key components:
ChatEngine Class
class ChatEngine:
    def __init__(self, retriever, model_name, provider, ...):
        # 1. Base retriever (from the vector store)
        self.base_retriever = retriever

        # 2. Enhanced retriever with reranking
        self.vector_retriever = build_enhanced_retriever(
            base_retriever=retriever,
            use_multi_query=use_multi_query,
            use_reranking=True,  # uses a Cross-Encoder
        )

        # 3. LLM Retriever (file-aware)
        self.llm_retriever = LLMRetriever(llm, repo_files)

        # 4. Ensemble Retriever (combines both)
        self.retriever = EnsembleRetriever(
            retrievers=[self.vector_retriever, self.llm_retriever],
            weights=[0.6, 0.4],  # 60% vector, 40% LLM
        )
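LangChain's `EnsembleRetriever` merges the two ranked lists with weighted Reciprocal Rank Fusion. A dependency-free sketch of that fusion, with illustrative document names and the 0.6/0.4 weights from above (the library's internals may differ in detail):

```python
from collections import defaultdict

# Illustrative weighted Reciprocal Rank Fusion: each retriever's k-th
# result contributes weight / (c + rank) to that document's score.
def weighted_rank_fusion(result_lists, weights, c=60):
    scores = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for rank, doc_id in enumerate(results):
            scores[doc_id] += weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "login.py", "db.py"]        # from the vector retriever
llm_hits = ["login.py", "middleware.py"]              # from the LLM retriever
merged = weighted_rank_fusion([vector_hits, llm_hits], [0.6, 0.4])
print(merged[0])  # login.py -- appears in both lists, so it wins
```

Documents found by both retrievers accumulate score from both lists, which is why `login.py` outranks the vector retriever's own top hit.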
RAG Flow Example
User Query: "How does the authentication work?"
        │
        ▼
┌──────────────────────────────────────────────────────────────┐
│ 1. RETRIEVAL                                                 │
│   ┌──────────────────┐        ┌──────────────────┐           │
│   │ Vector Retriever │        │  LLM Retriever   │           │
│   │                  │        │                  │           │
│   │ Semantic search  │        │ LLM picks files  │           │
│   │ in Chroma DB     │        │ from structure   │           │
│   └────────┬─────────┘        └────────┬─────────┘           │
│            │                           │                     │
│            └────────────┬──────────────┘                     │
│                         ▼                                    │
│              ┌─────────────────────┐                         │
│              │  EnsembleRetriever  │                         │
│              │ (60% + 40% weighted)│                         │
│              └─────────┬───────────┘                         │
│                        │                                     │
│                        ▼                                     │
│              ┌─────────────────────┐                         │
│              │      Reranker       │                         │
│              │   (Cross-Encoder)   │                         │
│              │   ms-marco-MiniLM   │                         │
│              └─────────┬───────────┘                         │
│                        │                                     │
│                        ▼                                     │
│            Top 5 Most Relevant Docs                          │
└──────────────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. GENERATION                                                │
│                                                              │
│   System Prompt + Context + History + Question               │
│                        │                                     │
│                        ▼                                     │
│              ┌─────────────────────┐                         │
│              │  LLM (Gemini/Groq)  │                         │
│              └─────────┬───────────┘                         │
│                        │                                     │
│                        ▼                                     │
│                Answer + Sources                              │
└──────────────────────────────────────────────────────────────┘
AST Analysis & Graph Creation
The AST analysis is implemented in code_chatbot/ast_analysis.py using tree-sitter for multi-language parsing.
How AST Parsing Works
# Example: Parsing a Python file
# Source code:
"""
from typing import List

class UserService:
    def __init__(self, db):
        self.db = db

    def get_user(self, user_id: int) -> User:
        return self.db.find(user_id)

    def create_user(self, name: str) -> User:
        user = User(name=name)
        self.db.save(user)
        return user
"""

# tree-sitter parses this into an AST:
"""
module
├── import_from_statement
│   ├── module: "typing"
│   └── names: ["List"]
└── class_definition
    ├── name: "UserService"
    └── block
        ├── function_definition (name: "__init__")
        ├── function_definition (name: "get_user")
        │   └── call (function: "self.db.find")
        └── function_definition (name: "create_user")
            ├── call (function: "User")
            └── call (function: "self.db.save")
"""
EnhancedCodeAnalyzer
class EnhancedCodeAnalyzer:
    """Builds a knowledge graph from code."""

    def __init__(self):
        self.graph = nx.DiGraph()  # NetworkX directed graph
        self.functions = {}        # node_id -> FunctionInfo
        self.classes = {}          # node_id -> ClassInfo
        self.imports = {}          # file_path -> [ImportInfo]
        self.definitions = {}      # name -> [node_ids]
Graph Structure Example
┌─────────────────────────────────────────────────────────────────┐
│                       AST KNOWLEDGE GRAPH                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Nodes:                                                         │
│  ┌──────────────────┐                                           │
│  │ Type: "file"     │                                           │
│  │ Name: "api.py"   │                                           │
│  └────────┬─────────┘                                           │
│           │ defines                                             │
│           ▼                                                     │
│  ┌──────────────────┐        ┌──────────────────┐               │
│  │ Type: "class"    │        │ Type: "function" │               │
│  │ Name: "UserAPI"  │        │ Name: "main"     │               │
│  └────────┬─────────┘        └──────────────────┘               │
│           │ has_method                                          │
│           ▼                                                     │
│  ┌──────────────────┐                                           │
│  │ Type: "method"   │──calls──▶ UserService.get_user            │
│  │ Name: "get"      │                                           │
│  └──────────────────┘                                           │
│                                                                 │
│  Edges:                                                         │
│  • defines:       file -> class/function                        │
│  • has_method:    class -> method                               │
│  • calls:         function -> function                          │
│  • imports:       file -> module                                │
│  • inherits_from: class -> class                                │
└─────────────────────────────────────────────────────────────────┘
Call Graph Resolution
def resolve_call_graph(self):
    """
    After parsing all files, resolve function calls to their definitions.

    Example:
      - File A has: service.get_user(id)
      - File B has: def get_user(self, id): ...

    Resolution:
      - Finds that "get_user" is defined in File B
      - Creates edge: A::caller_func --calls--> B::UserService.get_user
    """
    for caller_id, callee_name, line in self.unresolved_calls:
        # Try a direct match against known definitions
        if callee_name in self.definitions:
            for target_id in self.definitions[callee_name]:
                self.graph.add_edge(caller_id, target_id, relation="calls")
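Once the "calls" edges exist, queries like `find_callers` and `find_call_chain` (used by the agent tools later) reduce to standard graph traversals. A dependency-free sketch with an illustrative adjacency map (the project stores these edges in a NetworkX `DiGraph` instead):

```python
from collections import deque

# Illustrative call graph: caller -> list of callees.
calls = {
    "api.py::main": ["auth.py::authenticate"],
    "auth.py::authenticate": ["auth.py::verify_password", "db.py::find"],
}

def get_callers(function_id):
    """Reverse lookup: who has an edge into this function?"""
    return [caller for caller, callees in calls.items() if function_id in callees]

def get_call_chain(start, end):
    """Breadth-first search for the shortest call path from start to end."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in calls.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(get_callers("auth.py::verify_password"))  # ['auth.py::authenticate']
print(get_call_chain("api.py::main", "auth.py::verify_password"))
```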
Code Chunking Strategy
The chunking system in code_chatbot/chunker.py uses structural chunking based on AST boundaries.
Chunking Philosophy
Traditional Text Chunking:
┌─────────────────────────────┬───────────┐
│ def process_data():         │  CHUNK 1  │
│     data = load()           │           │
│     # Some processing       │           │
├─────────────────────────────┼───────────┤
│     result = transform()    │  CHUNK 2  │  ← Breaks mid-function!
│     return result           │           │
└─────────────────────────────┴───────────┘

Structural Chunking (This Project):
┌─────────────────────────────┬───────────┐
│ def process_data():         │           │
│     data = load()           │  CHUNK 1  │  ← Complete function
│     result = transform()    │           │
│     return result           │           │
├─────────────────────────────┼───────────┤
│ def another_function():     │           │
│     ...                     │  CHUNK 2  │  ← Complete function
└─────────────────────────────┴───────────┘
StructuralChunker Implementation
class StructuralChunker:
    """Uses tree-sitter to chunk code at semantic boundaries."""

    def __init__(self, max_tokens: int = 800):
        self.max_tokens = max_tokens
        self._init_parsers()  # Python, JS, TS parsers

    def _chunk_node(self, node, file_content, file_metadata):
        """
        Recursive chunking algorithm:
        1. If the node fits in max_tokens → return it as a single chunk
        2. If the node is too large → recurse into its children
        3. Merge neighboring small chunks
        """
        chunk = FileChunk(file_content, file_metadata,
                          node.start_byte, node.end_byte)

        # Fits? Return it
        if chunk.num_tokens <= self.max_tokens:
            return [chunk]

        # Too large? Recurse
        child_chunks = []
        for child in node.children:
            child_chunks.extend(self._chunk_node(child, ...))

        # Merge small neighbors
        return self._merge_small_chunks(child_chunks)
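Step 3, the merge pass, can be sketched on its own. This hypothetical version works on `(token_count, text)` pairs instead of full `FileChunk` objects; the greedy neighbor-merge is the idea, not the project's exact implementation:

```python
# Greedily merge adjacent small chunks while staying under the token budget.
# Works on (token_count, text) pairs for illustration.
def merge_small_chunks(chunks, max_tokens=800):
    merged = []
    for tokens, text in chunks:
        if merged and merged[-1][0] + tokens <= max_tokens:
            # Previous chunk still has room: fold this one into it.
            prev_tokens, prev_text = merged[-1]
            merged[-1] = (prev_tokens + tokens, prev_text + "\n" + text)
        else:
            merged.append((tokens, text))
    return merged

chunks = [(120, "def a(): ..."), (90, "def b(): ..."), (750, "class C: ...")]
result = merge_small_chunks(chunks)
# The two small functions merge into one 210-token chunk; the large class
# stays on its own because merging would exceed the budget.
```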
Chunk Metadata (Rich Context)
Each chunk carries rich metadata:
@dataclass
class FileChunk:
    file_content: str
    file_metadata: Dict
    start_byte: int
    end_byte: int

    # Enhanced metadata
    symbols_defined: List[str]  # ["UserService", "UserService.get_user"]
    imports_used: List[str]     # ["from typing import List"]
    complexity_score: int       # cyclomatic complexity
    parent_context: str         # "UserService" (parent class)
This metadata is stored in the vector DB and used for filtering/ranking.
Retrieval System
Multi-Stage Retrieval Pipeline
Query: "How does user authentication work?"
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ STAGE 1: Initial Retrieval (k=10)                             │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                  Vector Store (Chroma)                  │  │
│  │                                                         │  │
│  │  Query Embedding ──similarity──▶ Document Embeddings    │  │
│  │                                                         │  │
│  │  Returns: 10 candidate documents                        │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ STAGE 2: LLM-Based File Selection                             │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                      LLMRetriever                       │  │
│  │                                                         │  │
│  │  File Tree:                                             │  │
│  │  ├── src/                                               │  │
│  │  │   ├── auth/                                          │  │
│  │  │   │   ├── login.py      ◀── LLM selects this         │  │
│  │  │   │   └── middleware.py ◀── And this                 │  │
│  │  │   └── api/                                           │  │
│  │  └── tests/                                             │  │
│  │                                                         │  │
│  │  LLM Prompt: "Select top 5 relevant files for: ..."     │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ STAGE 3: Ensemble Combination                                 │
│                                                               │
│  Vector Results (weight: 0.6) + LLM Results (weight: 0.4)     │
│                                                               │
│  Combined: 12-15 unique documents                             │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ STAGE 4: Graph Enhancement                                    │
│                                                               │
│  For each retrieved document:                                 │
│  1. Find its node in the AST graph                            │
│  2. Get neighboring nodes (related files)                     │
│  3. Add related files to the context                          │
│                                                               │
│  Example: login.py found → adds auth_utils.py (imports it)    │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────────┐
│ STAGE 5: Reranking                                            │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                 Cross-Encoder Reranker                  │  │
│  │                (ms-marco-MiniLM-L-6-v2)                 │  │
│  │                                                         │  │
│  │  For each (query, document) pair:                       │  │
│  │    score = cross_encoder.predict([query, doc.content])  │  │
│  │                                                         │  │
│  │  Sort by score, return top 5                            │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
        │
        ▼
Final: Top 5 Documents
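Stage 4's neighbor expansion is a small graph walk. A minimal sketch with an illustrative edge map (the project reads these relations from the GraphML file built at index time):

```python
# Illustrative AST-graph edges: file -> files it imports/calls into.
ast_edges = {
    "login.py": ["auth_utils.py", "models.py"],
    "api.py": ["login.py"],
}

def expand_with_neighbors(retrieved, max_extra=2):
    """Add each retrieved file's graph neighbors to the context set."""
    expanded = list(retrieved)
    for file in retrieved:
        for neighbor in ast_edges.get(file, [])[:max_extra]:
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded

print(expand_with_neighbors(["login.py"]))
# ['login.py', 'auth_utils.py', 'models.py']
```

Capping `max_extra` keeps graph expansion from flooding the context window with distant relatives.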
Reranker (Cross-Encoder)
class Reranker:
    """
    Uses a Cross-Encoder for precise relevance scoring.

    Unlike bi-encoders (used for initial retrieval), cross-encoders
    process the query AND the document together, giving more accurate scores.
    """

    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: List[Document], top_k=5):
        # Score each document against the query
        pairs = [[query, doc.page_content] for doc in documents]
        scores = self.model.predict(pairs)

        # Sort by score, descending
        scored = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scored[:top_k]]
Agentic Workflow
The agentic workflow uses LangGraph to enable multi-step reasoning with tool use.
Agent Graph Structure
┌───────────────────────────────────────────────────────────────┐
│                        LANGGRAPH AGENT                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│                      ┌───────────┐                            │
│            ┌─────────│   START   │                            │
│            │         └───────────┘                            │
│            ▼                                                  │
│  ┌─────────────────────────────────────┐                      │
│  │             AGENT NODE              │◀────────────┐        │
│  │                                     │             │        │
│  │  1. Process messages                │             │        │
│  │  2. Call LLM with tools bound       │             │        │
│  │  3. LLM decides:                    │             │        │
│  │     - Call a tool?  → go to TOOLS   │             │        │
│  │     - Final answer? → go to END     │             │        │
│  └───────────────┬─────────────────────┘             │        │
│                  │                                   │        │
│           has_tool_call?                             │        │
│            │           │                             │        │
│           Yes          No                            │        │
│            │           │                             │        │
│            ▼           └─────────────────┐           │        │
│  ┌─────────────────────────────────────┐ │           │        │
│  │             TOOLS NODE              │ │           │        │
│  │                                     │ │           │        │
│  │  Execute tool calls:                │ │           │        │
│  │  • search_codebase(query)           │ │           │        │
│  │  • read_file(path)                  │ │           │        │
│  │  • list_files(dir)                  │ │           │        │
│  │  • find_callers(func)               │ │           │        │
│  │  • find_callees(func)               │ │           │        │
│  │  • find_call_chain(a, b)            │ │           │        │
│  │                                     │ │           │        │
│  │  Add tool results to messages       │ │           │        │
│  └───────────────┬─────────────────────┘ │           │        │
│                  │                       │           │        │
│                  └───────────────────────│───────────┘        │
│                                          ▼                    │
│                                    ┌──────────┐               │
│                                    │   END    │               │
│                                    └──────────┘               │
└───────────────────────────────────────────────────────────────┘
Available Tools
# 1. search_codebase - semantic search in the vector store
@tool("search_codebase")
def search_codebase(query: str):
    """Search the codebase for relevant code snippets."""
    docs = retriever.invoke(query)
    return format_results(docs[:5])

# 2. read_file - read complete file content
@tool("read_file")
def read_file(file_path: str):
    """Read the content of a specific file."""
    with open(full_path, "r") as f:
        return f.read()

# 3. list_files - directory listing
@tool("list_files")
def list_files(path: str = "."):
    """List files in a directory."""
    return "\n".join(os.listdir(target_path))

# 4. find_callers - call graph: who calls this?
@tool("find_callers")
def find_callers(function_name: str):
    """Find all functions that call this function."""
    return analyzer.get_callers(function_name)

# 5. find_callees - call graph: what does this call?
@tool("find_callees")
def find_callees(function_name: str):
    """Find all functions called by this function."""
    return analyzer.get_callees(function_name)

# 6. find_call_chain - call graph: trace an execution path
@tool("find_call_chain")
def find_call_chain(start: str, end: str):
    """Find the call path from one function to another."""
    return analyzer.get_call_chain(start, end)
Agent Execution Example
User: "How does the login function handle invalid passwords?"
Agent Thought: I need to find the login function first.
Agent Action: search_codebase("login function authentication")
Tool Result: [login.py snippet showing authenticate() function]
Agent Thought: I found authenticate(). Let me see the full implementation.
Agent Action: read_file("src/auth/login.py")
Tool Result: [Full file content]
Agent Thought: I see it calls verify_password(). Let me check that.
Agent Action: find_callees("authenticate")
Tool Result: verify_password, get_user, create_session
Agent Action: search_codebase("verify_password invalid password")
Tool Result: [password_utils.py with error handling]
Agent Final Answer: The login function handles invalid passwords by...
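Stripped of LangGraph's state machinery, the loop driving that transcript is: ask the LLM, execute any tool it requests, feed the result back, repeat until it answers. A hand-rolled sketch (not LangGraph itself), runnable with a scripted stand-in for the LLM:

```python
# Hand-rolled agent loop; the project uses LangGraph, which manages this
# state machine and message accumulation for you.
def run_agent(llm_step, tools, question, max_steps=5):
    messages = [("user", question)]
    for _ in range(max_steps):
        action = llm_step(messages)            # LLM decides the next move
        if action["type"] == "final":
            return action["answer"]
        result = tools[action["tool"]](action["input"])
        messages.append(("tool", result))      # feed the result back
    return "Stopped: step limit reached."

# Scripted fake LLM so the loop runs without API keys.
script = iter([
    {"type": "tool", "tool": "search_codebase", "input": "login"},
    {"type": "final",
     "answer": "login() rejects invalid passwords via verify_password()."},
])
tools = {"search_codebase": lambda q: f"snippet for {q!r}"}
answer = run_agent(lambda msgs: next(script), tools, "How is login handled?")
print(answer)
```

The `max_steps` cap mirrors the recursion limit a real agent graph needs so a confused LLM cannot loop forever.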
Frontend & API
Streamlit App Structure
app.py (Main Entry)
│
├── Ingestion Screen
│   ├── Source Type Selection (ZIP/GitHub/Web)
│   ├── File Upload / URL Input
│   └── "Process & Index" Button
│
└── Redirects to → pages/1_⚡_Code_Studio.py

Code_Studio.py
│
├── Left Panel (Tabs)
│   ├── 📁 Explorer - File tree navigation
│   ├── 🔍 Search   - Regex pattern search
│   ├── 💬 Chat     - RAG conversation
│   └── ✨ Generate - Spec generation
│
└── Right Panel
    └── Code Viewer - Syntax-highlighted file view
FastAPI REST API
/api
├── /health   GET  - Health check
│
├── /index    POST - Index a codebase
│   Body: {
│     source: "https://github.com/...",
│     provider: "gemini",
│     use_agent: true
│   }
│
└── /chat     POST - Ask questions
    Body: {
      question: "How does auth work?",
      provider: "gemini",
      use_agent: true
    }
    Response: {
      answer: "...",
      sources: [...],
      mode: "agent",
      processing_time: 2.5
    }
Component Deep Dives
Merkle Tree (Incremental Indexing)
class MerkleTree:
    """
    Enables incremental indexing by detecting file changes.

    How it works:
    1. Build a hash tree mirroring the directory structure
    2. Each file node has the SHA-256 hash of its content
    3. Each directory node has the hash of its children's hashes
    4. Compare the old vs. new tree to find changes
    """

    def compare_trees(self, old, new) -> ChangeSet:
        # Returns: added, modified, deleted, unchanged files
        ...
Example:

First Index:
project/
├── main.py   (hash: abc123)
└── utils.py  (hash: def456)
Root hash: sha256(abc123 + def456) = xyz789

Second Index (utils.py changed):
project/
├── main.py   (hash: abc123)  ← unchanged
└── utils.py  (hash: ghi012)  ← NEW HASH!
Root hash changed! → Only re-index utils.py
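The comparison step above can be sketched over a flat path-to-hash map; the real `MerkleTree` additionally nests directory hashes so unchanged subtrees can be skipped wholesale (this simplified version rehashes every file):

```python
import hashlib

# Minimal sketch of Merkle-style change detection over a flat file map.
def file_hashes(files):
    return {path: hashlib.sha256(content.encode()).hexdigest()
            for path, content in files.items()}

def compare(old, new):
    """Return (added, modified, deleted) paths between two hash maps."""
    added = [p for p in new if p not in old]
    deleted = [p for p in old if p not in new]
    modified = [p for p in new if p in old and new[p] != old[p]]
    return added, modified, deleted

old = file_hashes({"main.py": "print('a')", "utils.py": "x = 1"})
new = file_hashes({"main.py": "print('a')", "utils.py": "x = 2"})
print(compare(old, new))  # ([], ['utils.py'], [])
```

Only the paths in `added` and `modified` need re-chunking and re-embedding, which is the whole point of incremental indexing.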
Path Obfuscation (Privacy)
class PathObfuscator:
    """
    Obfuscates file paths for sensitive codebases.

    Original:   /home/user/secret-project/src/auth/login.py
    Obfuscated: /f8a3b2c1/d4e5f6a7/89012345.py

    Mapping stored securely, reversible only with the key.
    """
Rate Limiter (API Management)
class AdaptiveRateLimiter:
    """
    Handles rate limits for free-tier APIs.

    Gemini Free Tier: 15 RPM, 32K TPM, 1500 RPD

    Strategies:
    1. Track usage in a rolling window
    2. Adaptive delay based on remaining quota
    3. Exponential backoff on 429 errors
    4. Model fallback chain (flash → pro → legacy)
    """
Configuration System
@dataclass
class RAGConfig:
    """Central configuration for the entire pipeline."""

    # Chunking
    chunking: ChunkingConfig
    max_chunk_tokens: int = 800
    min_chunk_tokens: int = 100
    preserve_imports: bool = True
    calculate_complexity: bool = True

    # Privacy
    privacy: PrivacyConfig
    enable_path_obfuscation: bool = False

    # Indexing
    indexing: IndexingConfig
    enable_incremental_indexing: bool = True
    batch_size: int = 100
    ignore_patterns: List[str] = [...]

    # Retrieval
    retrieval: RetrievalConfig
    enable_reranking: bool = True
    retrieval_k: int = 10
    rerank_top_k: int = 5
    similarity_threshold: float = 0.5
File Dependency Map
app.py
├── code_chatbot/universal_ingestor.py
├── code_chatbot/indexer.py
│   ├── code_chatbot/chunker.py        (StructuralChunker)
│   ├── code_chatbot/merkle_tree.py    (MerkleTree)
│   ├── code_chatbot/config.py         (RAGConfig)
│   └── code_chatbot/db_connection.py  (Chroma client)
├── code_chatbot/rag.py                (ChatEngine)
│   ├── code_chatbot/retriever_wrapper.py
│   │   └── code_chatbot/reranker.py   (Reranker)
│   ├── code_chatbot/llm_retriever.py  (LLMRetriever)
│   ├── code_chatbot/agent_workflow.py
│   │   └── code_chatbot/tools.py
│   └── code_chatbot/prompts.py
├── code_chatbot/ast_analysis.py       (EnhancedCodeAnalyzer)
└── code_chatbot/graph_rag.py          (GraphEnhancedRetriever)

pages/1_⚡_Code_Studio.py
├── components/file_explorer.py
├── components/code_viewer.py
├── components/panels.py
└── components/style.py

api/main.py
├── api/routes/chat.py
├── api/routes/index.py
├── api/routes/health.py
├── api/schemas.py
└── api/state.py
Summary
This project implements a sophisticated code understanding system with:
- Multi-Source Ingestion: ZIP, GitHub, Local, Web
- Structural Chunking: AST-aware code splitting
- Hybrid Retrieval: Vector + LLM + Graph-enhanced
- Cross-Encoder Reranking: Precision at the top
- Agentic Workflow: Multi-step reasoning with tools
- Call Graph Analysis: Function relationship tracking
- Incremental Indexing: Merkle tree change detection
- Multi-LLM Support: Gemini, Groq with fallbacks
The architecture is designed for scalability, accuracy, and developer experience.