import React, { useState } from 'react';
import { ChevronRight, ChevronDown, Database, Code, Brain, Search, FileText, GitBranch, Layers, Workflow, Server, Cpu, ArrowRight, Zap } from 'lucide-react';

const ArchitectureViz = () => {
  const [activeTab, setActiveTab] = useState('overview');
  const [expandedSections, setExpandedSections] = useState({});

  const toggleSection = (section) => {
    setExpandedSections(prev => ({ ...prev, [section]: !prev[section] }));
  };

  const tabs = [
    { id: 'overview', label: 'System Overview', icon: Layers },
    { id: 'rag', label: 'RAG Pipeline', icon: Search },
    { id: 'ast', label: 'AST & Graphs', icon: GitBranch },
    { id: 'chunking', label: 'Code Chunking', icon: Code },
    { id: 'agent', label: 'Agentic Workflow', icon: Brain },
    { id: 'retrieval', label: 'Retrieval System', icon: Database },
  ];

  const ComponentCard = ({ title, description, icon: Icon, color, children }) => (

{title}

{description}

{children}
  );

  const FlowArrow = () => (
  );

  const renderOverview = () => (

Code Crawler Architecture

An AI-powered codebase assistant combining RAG, AST analysis, graph databases, and agentic workflows.

ZIPFileManager
GitHubRepoManager
LocalDirectoryManager
WebDocManager
StructuralChunker (tree-sitter)
EnhancedCodeAnalyzer
Gemini/HuggingFace Embeddings
Chroma / FAISS / Qdrant
GraphML (NetworkX)
Merkle Tree Snapshots

Data Flow

Input Source → Ingestor → Chunker → Embeddings → Vector DB
Vector Retriever (60%)
LLM Retriever (40%)
Graph Enhancement
Cross-Encoder Reranker
Linear RAG (simple Q&A)
Agentic Workflow (LangGraph)
Tools: search, read, list, call_graph
  );

  const renderRAG = () => (

RAG Pipeline Implementation

The RAG (Retrieval-Augmented Generation) system combines vector search with LLM-based file selection and cross-encoder reranking for high-precision code retrieval.

1. Query Processing

{`query = "How does authentication work?"

# Optionally expand with multi-query
expanded_queries = multi_query_expander(query)`}

2. Hybrid Retrieval

{`# Vector similarity search (60% weight)
vector_docs = chroma_db.similarity_search(query, k=10)

# LLM-based file selection (40% weight)
llm_docs = llm_retriever.select_files(query, file_tree)

# Combine with EnsembleRetriever
combined = ensemble([vector_docs, llm_docs], weights=[0.6, 0.4])`}
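A minimal sketch of how a 60/40 weighted merge of two ranked lists could work, assuming the ensemble uses weighted Reciprocal Rank Fusion (the scheme LangChain's `EnsembleRetriever` is documented to use). The function name `weighted_rrf`, the constant `c=60`, and the file paths are illustrative assumptions, not taken from the project.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60, top_k=None):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(ranked_lists) != len(weights):
        raise ValueError("one weight per ranked list is required")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Earlier ranks contribute more; c dampens the gap between ranks.
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_k] if top_k is not None else fused


# Hypothetical retriever outputs, already ordered best-first.
vector_hits = ["auth/login.py", "auth/token.py", "db/users.py"]
llm_hits = ["auth/token.py", "api/routes.py"]

fused = weighted_rrf([vector_hits, llm_hits], weights=[0.6, 0.4])
print(fused)
```

A file that appears in both lists (`auth/token.py` here) accumulates score from each retriever, so agreement between the two retrievers is rewarded even when neither ranks the file first.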

3. Graph Enhancement

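{`# For each retrieved doc, find related files via AST graph
# (assumes each edge stores its type in a "relation" attribute)
augmented_docs = list(combined)
for doc in combined:
    for neighbor in ast_graph.neighbors(doc.file_path):
        relation = ast_graph.edges[doc.file_path, neighbor]["relation"]
        if relation in ("imports", "calls"):
            augmented_docs.append(read_file(neighbor))`}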
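The graph enhancement step can be sketched without NetworkX by using a plain adjacency dictionary. This is an illustrative stand-in: `expand_with_neighbors`, the edge layout, and the file names are assumptions, and the real system reads a NetworkX graph.

```python
def expand_with_neighbors(retrieved, edges, relations=("imports", "calls")):
    """Return retrieved files plus direct neighbors reached via allowed relations."""
    expanded = list(retrieved)
    seen = set(retrieved)
    for path in retrieved:
        for target, relation in edges.get(path, []):
            # Only follow edges whose relation type is allowed, and skip duplicates.
            if relation in relations and target not in seen:
                seen.add(target)
                expanded.append(target)
    return expanded


# Hypothetical graph: source node -> list of (target node, relation).
edges = {
    "auth/login.py": [("auth/token.py", "imports"), ("AuthService", "defines")],
    "auth/token.py": [("crypto/jwt.py", "calls")],
}

result = expand_with_neighbors(["auth/login.py"], edges)
print(result)
```

The expansion is deliberately limited to one hop: `crypto/jwt.py` is not added because it is a neighbor of a neighbor, which keeps the context from growing without bound.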

4. Cross-Encoder Reranking

{`# Score each (query, document) pair with cross-encoder
pairs = [[query, doc.content] for doc in augmented_docs]
scores = cross_encoder.predict(pairs)

# Return top 5 by score (highest first)
ranked = sorted(zip(augmented_docs, scores), key=lambda pair: pair[1], reverse=True)
final_docs = [doc for doc, _ in ranked[:5]]`}
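To make the score-sort-truncate pattern concrete, here is a small runnable sketch. The scorer `overlap_score` is a toy stand-in for a real cross-encoder (it only counts shared words), and both function names are hypothetical.

```python
def rerank(query, docs, score_pair, top_k=5):
    """Score every (query, doc) pair and keep the top_k highest-scoring docs."""
    scores = [score_pair(query, doc) for doc in docs]
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


docs = [
    "def render page template",
    "def verify password hash for login",
    "def login user",
]

top = rerank("login password check", docs, overlap_score, top_k=2)
print(top)
```

Swapping `overlap_score` for a function that calls a real cross-encoder model would leave `rerank` unchanged, since it only needs a callable that returns one score per pair.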

5. Generation

{`# Build context from retrieved docs
context = format_docs(final_docs)

# Generate answer with LLM
prompt = system_prompt.format(context=context)
answer = llm.invoke([SystemMessage(prompt), HumanMessage(query)])`}

Key Files

code_chatbot/rag.py

ChatEngine class

code_chatbot/retriever_wrapper.py

RerankingRetriever

code_chatbot/llm_retriever.py

LLM-based file selection

code_chatbot/reranker.py

Cross-encoder reranking

  );

  const renderAST = () => (

AST Analysis & Knowledge Graph

Uses tree-sitter to parse code into Abstract Syntax Trees, then builds a NetworkX directed graph capturing code relationships.

Node Types

  • file
  • class
  • function
  • method

Edge Types (Relations)

  • defines - file → class/function
  • has_method - class → method
  • calls - function → function
  • imports - file → module
  • inherits_from - class → class

Example: Parsing Python Code

{`# Source Code
class UserService:
    def get_user(self, user_id):
        return self.db.find(user_id)  # calls db.find

# Generated Graph
(file: user_service.py)
    │
    └──defines──▶ (class: UserService)
                      │
                      └──has_method──▶ (method: get_user)
                                           │
                                           └──calls──▶ (function: db.find)`}
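The same edges can be derived for Python sources with the standard-library `ast` module. This sketch swaps `ast` in for tree-sitter and uses a list of triples instead of a NetworkX graph, so it illustrates the idea rather than the project's implementation; `build_edges` and `call_name` are hypothetical names.

```python
import ast


def call_name(func):
    """Rebuild a dotted name such as self.db.find from nested Attribute nodes."""
    parts = []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return ".".join(reversed(parts))


def build_edges(source, file_name):
    """Return (source_node, relation, target_node) triples for one Python file."""
    edges = []

    def visit_function(node, owner, relation):
        edges.append((owner, relation, node.name))
        # Every call expression inside the function body becomes a "calls" edge.
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                name = call_name(child.func)
                if name:
                    edges.append((node.name, "calls", name))

    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            edges.append((file_name, "defines", node.name))
            for base in node.bases:
                if isinstance(base, ast.Name):
                    edges.append((node.name, "inherits_from", base.id))
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    visit_function(item, node.name, "has_method")
        elif isinstance(node, ast.FunctionDef):
            visit_function(node, file_name, "defines")
    return edges


source = '''
class UserService:
    def get_user(self, user_id):
        return self.db.find(user_id)
'''

edges = build_edges(source, "user_service.py")
for edge in edges:
    print(edge)
```

This sketch records the callee as the full dotted expression (`self.db.find`), whereas the diagram shortens it to `db.find`. Resolving such names to their actual definitions requires extra analysis that is out of scope here.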
          

Call Graph Tools

find_callers("authenticate")

→ Returns all functions that call authenticate()

find_callees("process_request")

→ Returns all functions called by process_request()

find_call_chain("main", "save_to_db")

→ Returns execution paths from main() to save_to_db()
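A minimal sketch of the three call graph tools over a plain dictionary of `calls` edges. The graph contents are invented for illustration, and `find_call_chain` here returns only the single shortest path via breadth-first search, which is a simplification of "execution paths".

```python
from collections import deque

# Hypothetical call graph: caller -> list of callees.
CALLS = {
    "main": ["process_request"],
    "process_request": ["authenticate", "save_to_db"],
    "authenticate": ["verify_password"],
}


def find_callees(name, calls=CALLS):
    """Functions called directly by `name`."""
    return list(calls.get(name, []))


def find_callers(name, calls=CALLS):
    """Functions that call `name` directly."""
    return sorted(caller for caller, callees in calls.items() if name in callees)


def find_call_chain(start, target, calls=CALLS):
    """Breadth-first search; returns the shortest call path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for callee in calls.get(path[-1], []):
            if callee not in visited:
                visited.add(callee)
                queue.append(path + [callee])
    return None


print(find_callers("authenticate"))
print(find_callees("process_request"))
print(find_call_chain("main", "save_to_db"))
```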

  );

  const renderChunking = () => (

Structural Code Chunking

Unlike naive text splitting, this system uses tree-sitter to chunk code at semantic boundaries (functions, classes) while respecting token limits.

❌ Naive Text Chunking

{`def process_data():
    data = load()
    # ──────────────── CHUNK BREAK ────
    result = transform(data)
    return result  # Broken mid-function!`}
            

✓ Structural Chunking

{`# CHUNK 1 - Complete function
def process_data():
    data = load()
    result = transform(data)
    return result

# CHUNK 2 - Complete function
def another_func():
    ...`}
            

Chunking Algorithm

  1. Parse file into AST using tree-sitter
  2. Recursively visit nodes (functions, classes, etc.)
  3. If node fits in max_tokens (800) → return as chunk
  4. If too large → split into children, recurse
  5. Merge neighboring small chunks to avoid fragments
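The steps above can be sketched for Python sources using the standard-library `ast` module in place of tree-sitter. Token counting is approximated by whitespace splitting, and all function names are hypothetical. The sketch also drops a parent's header line (such as `class Foo:`) when it splits a node into children, which a production chunker would likely want to preserve. It requires Python 3.8+ for `ast.get_source_segment`.

```python
import ast


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def chunk_node(node, source, max_tokens):
    """Return source segments for a node, recursing into children when too large."""
    text = ast.get_source_segment(source, node)
    body = getattr(node, "body", None)
    children = body if isinstance(body, list) else []
    if count_tokens(text) <= max_tokens or not children:
        return [text]
    chunks = []
    for child in children:
        chunks.extend(chunk_node(child, source, max_tokens))
    return chunks


def merge_small(chunks, max_tokens):
    """Greedily merge neighboring chunks while the result stays within max_tokens."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) + count_tokens(chunk) <= max_tokens:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged


def chunk_source(source, max_tokens=800):
    chunks = []
    for node in ast.parse(source).body:
        chunks.extend(chunk_node(node, source, max_tokens))
    return merge_small(chunks, max_tokens)


source = """def load():
    return [1, 2, 3]

def transform(data):
    return [x * 2 for x in data]

def process_data():
    data = load()
    result = transform(data)
    return result
"""

for chunk in chunk_source(source, max_tokens=12):
    print(chunk)
    print("-----")
```

With a budget of 12 tokens each function stays in its own chunk. Raising the budget to 16 lets the first two small functions merge into one chunk, which is the fragment-avoidance behavior described in step 5.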

Rich Chunk Metadata

{`FileChunk {
  file_path: "src/auth/login.py",
  start_byte: 245,
  end_byte: 892,
  line_range: "L12-L45",
  language: "python",
  chunk_type: "function_definition",
  name: "authenticate",

  // Enhanced metadata
  symbols_defined: ["authenticate", "verify_token"],
  imports_used: ["from jwt import decode"],
  complexity_score: 7,  // Cyclomatic complexity
  parent_context: "AuthService"  // Parent class
}`}
          
  );

  const renderAgent = () => (

Agentic Workflow (LangGraph)

The agent can perform multi-step reasoning using tools, enabling complex analysis that simple RAG cannot handle.

Agent State Machine

START
AGENT NODE
Process messages → Call LLM → Decide action
tool_call?
TOOLS NODE
Execute tools
final answer?
END
search_codebase

Vector search in codebase

read_file

Read complete file content

list_files

Directory listing

find_callers

Who calls this function?

find_callees

What does this call?

find_call_chain

Trace execution path

Example Agent Execution

User: "How does login handle invalid passwords?"
Agent Thought: I need to find the login function first.
→ Tool Call: search_codebase("login authentication")
Agent Thought: Found authenticate(). Let me see full implementation.
→ Tool Call: read_file("src/auth/login.py")
Agent Thought: It calls verify_password(). Let me check that.
→ Tool Call: find_callees("authenticate")
Final Answer: The login handles invalid passwords by...
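The agent/tools loop in this trace can be sketched in plain Python without LangGraph. Everything here is illustrative: `run_agent` is a hypothetical driver, `scripted_decide` replaces the LLM with a fixed script, and the tool outputs are canned strings.

```python
def run_agent(question, decide, tools, max_steps=5):
    """Alternate between the agent node and the tools node until a final answer."""
    messages = [("user", question)]
    for _ in range(max_steps):
        action = decide(messages)  # agent node: choose the next action
        if action["type"] == "final":
            return action["content"], messages
        tool = tools[action["tool"]]  # tools node: execute the requested tool
        result = tool(action["input"])
        messages.append(("tool", f'{action["tool"]} -> {result}'))
    return "Stopped: step limit reached.", messages


def scripted_decide(messages):
    # Stand-in for the LLM: picks an action based on how many tool results exist.
    tool_results = sum(1 for role, _ in messages if role == "tool")
    script = [
        {"type": "tool", "tool": "search_codebase", "input": "login authentication"},
        {"type": "tool", "tool": "read_file", "input": "src/auth/login.py"},
        {"type": "final", "content": "authenticate() delegates to verify_password()."},
    ]
    return script[tool_results]


# Canned tool implementations for the demonstration.
tools = {
    "search_codebase": lambda query: "src/auth/login.py: authenticate()",
    "read_file": lambda path: "def authenticate(...): verify_password(...)",
}

answer, trace = run_agent(
    "How does login handle invalid passwords?", scripted_decide, tools
)
print(answer)
for role, content in trace:
    print(role, "|", content)
```

The `max_steps` cap is a design choice worth keeping in any real version: it bounds cost and prevents the loop from running forever if the model never produces a final answer.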
  );

  const renderRetrieval = () => (

Multi-Stage Retrieval System

Stage 1: Vector Retrieval (k=10)

Semantic similarity search in Chroma/FAISS using embeddings

Stage 2: LLM File Selection

LLM analyzes file tree structure and selects relevant files

Stage 3: Ensemble Combination

Weighted merge: 60% vector + 40% LLM selection

Stage 4: Graph Enhancement

Add related files from AST graph (imports, calls)

Stage 5: Cross-Encoder Reranking

Score each (query, doc) pair, return top 5

Vector DB Support

Chroma
Default, local
FAISS
Fallback, fast
Qdrant
Cloud option
  );

  return (

🕷️ Code Crawler Architecture

Interactive System Documentation

{tabs.map(tab => ( ))}
      {activeTab === 'overview' && renderOverview()}
      {activeTab === 'rag' && renderRAG()}
      {activeTab === 'ast' && renderAST()}
      {activeTab === 'chunking' && renderChunking()}
      {activeTab === 'agent' && renderAgent()}
      {activeTab === 'retrieval' && renderRetrieval()}
  );
};

export default ArchitectureViz;