Code Crawler Architecture

An AI-powered codebase assistant that combines RAG, AST analysis, a code knowledge graph, and agentic workflows.

ZIPFileManager

GitHubRepoManager

LocalDirectoryManager

WebDocManager

StructuralChunker (tree-sitter)

EnhancedCodeAnalyzer

Gemini/HuggingFace Embeddings

Chroma / FAISS / Qdrant

GraphML (NetworkX)

Merkle Tree Snapshots

Data Flow

Input Source → Ingestor → Chunker → Embeddings → Vector DB

Vector Retriever (60%)

LLM Retriever (40%)

Graph Enhancement

Cross-Encoder Reranker

Linear RAG (simple Q&A)

Agentic Workflow (LangGraph)

Tools: search, read, list, call_graph

RAG Pipeline Implementation

The RAG (Retrieval-Augmented Generation) system combines vector search with LLM-based file selection and cross-encoder reranking for high-precision code retrieval.

1. Query Processing


              {`query = "How does authentication work?"
# Optionally expand with multi-query
expanded_queries = multi_query_expander(query)`}

2. Hybrid Retrieval


              {`# Vector similarity search (60% weight)
vector_docs = chroma_db.similarity_search(query, k=10)

# LLM-based file selection (40% weight)
llm_docs = llm_retriever.select_files(query, file_tree)

# Combine with EnsembleRetriever
combined = ensemble([vector_docs, llm_docs], weights=[0.6, 0.4])`}
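The 60/40 weighting can be made concrete with a small sketch. If the ensemble step uses LangChain's `EnsembleRetriever`, as the comment above suggests, the weights are applied through weighted Reciprocal Rank Fusion over ranks rather than by averaging similarity scores. The function name `weighted_rrf`, the constant `c=60`, and the sample file paths below are assumptions for illustration, not taken from the project.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns weight / (c + rank) from every list it appears in
    (rank starts at 1), so documents found by both retrievers accumulate both.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("one weight per ranked list is required")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers
vector_hits = ["auth/login.py", "auth/token.py", "db/users.py"]
llm_hits = ["auth/token.py", "api/routes.py"]
fused = weighted_rrf([vector_hits, llm_hits], weights=[0.6, 0.4])
```

Here `auth/token.py` ranks first because it appears in both lists, even though neither retriever placed it above every other file on its own.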

3. Graph Enhancement


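              {`# For each retrieved doc, pull in files linked by "imports" or "calls" edges
# (assumes each graph edge stores its type under a "relation" attribute)
augmented_docs = list(combined)
for doc in combined:
    for neighbor in ast_graph.neighbors(doc.file_path):
        relation = ast_graph.edges[doc.file_path, neighbor].get("relation")
        if relation in ("imports", "calls"):
            augmented_docs.append(read_file(neighbor))`}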

4. Cross-Encoder Reranking


              {`# Score each (query, document) pair with cross-encoder
pairs = [[query, doc.content] for doc in augmented_docs]
scores = cross_encoder.predict(pairs)

# Return top 5 by score, highest first
ranked = sorted(zip(augmented_docs, scores), key=lambda pair: pair[1], reverse=True)
final_docs = [doc for doc, _ in ranked[:5]]`}

5. Generation


              {`# Build context from retrieved docs
context = format_docs(final_docs)

# Generate answer with LLM
prompt = system_prompt.format(context=context)
answer = llm.invoke([SystemMessage(prompt), HumanMessage(query)])`}

Key Files

code_chatbot/rag.py

ChatEngine class

code_chatbot/retriever_wrapper.py

RerankingRetriever

code_chatbot/llm_retriever.py

LLM-based file selection

code_chatbot/reranker.py

Cross-encoder reranking

AST Analysis & Knowledge Graph

Uses tree-sitter to parse code into Abstract Syntax Trees, then builds a NetworkX directed graph capturing code relationships.

Node Types

file
class
function
method

Edge Types (Relations)

defines - file → class/function
has_method - class → method
calls - function → function
imports - file → module
inherits_from - class → class

Example: Parsing Python Code

{`# Source Code
class UserService:
    def get_user(self, user_id):
        return self.db.find(user_id)  # calls db.find

# Generated Graph
(file: user_service.py)
    │
    └──defines──▶ (class: UserService)
                      │
                      └──has_method──▶ (method: get_user)
                                           │
                                           └──calls──▶ (function: db.find)`}
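The mapping from source code to edges can be sketched in a few lines. The system itself parses with tree-sitter; the sketch below swaps in Python's built-in `ast` module so it runs without extra dependencies, and it only handles Python. The helper names `extract_edges` and `_call_edges` are hypothetical, and stripping the `self.` prefix from call targets is an assumption made to match the diagram above.

```python
import ast


def extract_edges(source, file_name):
    """Return (source_node, relation, target_node) triples for one Python file."""
    edges = []
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            edges.append((file_name, "defines", node.name))
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    method = f"{node.name}.{item.name}"
                    edges.append((node.name, "has_method", method))
                    edges.extend(_call_edges(item, method))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            edges.append((file_name, "defines", node.name))
            edges.extend(_call_edges(node, node.name))
    return edges


def _call_edges(func_node, caller):
    """Yield a 'calls' edge for every call expression inside func_node."""
    for sub in ast.walk(func_node):
        if isinstance(sub, ast.Call):
            # ast.unparse needs Python 3.9+; drop "self." so self.db.find -> db.find
            target = ast.unparse(sub.func).removeprefix("self.")
            yield (caller, "calls", target)


source = '''
class UserService:
    def get_user(self, user_id):
        return self.db.find(user_id)
'''
edges = extract_edges(source, "user_service.py")
```

Each triple could then be added to a NetworkX `DiGraph` as an edge carrying the relation as an attribute.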

Call Graph Tools

find_callers("authenticate")

→ Returns all functions that call authenticate()

find_callees("process_request")

→ Returns all functions called by process_request()

find_call_chain("main", "save_to_db")

→ Returns execution paths from main() to save_to_db()
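The three queries above reduce to lookups and a path search over the `calls` edges. The following is a minimal sketch using plain dictionaries rather than the project's NetworkX graph; the `CallGraph` class and the sample edges are hypothetical, and the path search only returns cycle-free chains.

```python
from collections import defaultdict


class CallGraph:
    def __init__(self, call_edges):
        # Index every (caller, callee) pair in both directions.
        self.callees = defaultdict(set)
        self.callers = defaultdict(set)
        for caller, callee in call_edges:
            self.callees[caller].add(callee)
            self.callers[callee].add(caller)

    def find_callers(self, name):
        return sorted(self.callers.get(name, ()))

    def find_callees(self, name):
        return sorted(self.callees.get(name, ()))

    def find_call_chain(self, start, target):
        """Return every simple (cycle-free) path from start to target."""
        paths = []

        def walk(node, path):
            if node == target:
                paths.append(path)
                return
            for nxt in sorted(self.callees.get(node, ())):
                if nxt not in path:  # skip nodes already on this path
                    walk(nxt, path + [nxt])

        walk(start, [start])
        return paths


graph = CallGraph([
    ("main", "handle_request"),
    ("main", "save_to_db"),
    ("handle_request", "authenticate"),
    ("handle_request", "save_to_db"),
    ("authenticate", "verify_password"),
])
chains = graph.find_call_chain("main", "save_to_db")
```

On a large graph the number of simple paths can grow quickly, so a real tool would likely cap the path length or the number of results.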

Structural Code Chunking

Unlike naive text splitting, this system uses tree-sitter to chunk code at semantic boundaries (functions, classes) while respecting token limits.

❌ Naive Text Chunking

{`def process_data():
    data = load()
    # ──────────────── CHUNK BREAK ────
    result = transform(data)
    return result  # Broken mid-function!`}

✓ Structural Chunking

{`# CHUNK 1 - Complete function
def process_data():
    data = load()
    result = transform(data)
    return result

# CHUNK 2 - Complete function
def another_func():
    ...`}

Chunking Algorithm

1. Parse file into AST using tree-sitter
2. Recursively visit nodes (functions, classes, etc.)
3. If node fits in max_tokens (800) → return as chunk
4. If too large → split into children, recurse
5. Merge neighboring small chunks to avoid fragments
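The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's `StructuralChunker`: Python's built-in `ast` module stands in for tree-sitter, a whitespace word count stands in for a real tokenizer, and the function names are hypothetical. When a class is split, the sketch drops the `class` header line, and a leaf statement larger than the limit is returned whole.

```python
import ast


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def chunk_node(node, source, max_tokens):
    """Steps 3-4: emit the node whole if it fits, otherwise recurse into its body."""
    text = ast.get_source_segment(source, node, padded=True)
    children = getattr(node, "body", None)
    if count_tokens(text) <= max_tokens or not isinstance(children, list):
        return [text]
    chunks = []
    for child in children:
        chunks.extend(chunk_node(child, source, max_tokens))
    return chunks


def merge_small(chunks, max_tokens):
    """Step 5: greedily join neighbouring chunks while the result still fits."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) + count_tokens(chunk) <= max_tokens:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged


def chunk_source(source, max_tokens=800):
    tree = ast.parse(source)            # step 1: parse
    chunks = []
    for node in tree.body:              # step 2: visit top-level nodes
        chunks.extend(chunk_node(node, source, max_tokens))
    return merge_small(chunks, max_tokens)


source = '''def load():
    return [1, 2, 3]

def transform(data):
    return [x * 2 for x in data]

class Pipeline:
    def run(self):
        data = load()
        return transform(data)

    def describe(self):
        return "load then transform"
'''
tight = chunk_source(source, max_tokens=12)   # class is too large and gets split
loose = chunk_source(source, max_tokens=16)   # small functions get merged
```

With the tight limit the class is split into its two methods; with the looser limit the two small functions are merged into one chunk and the class stays whole.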

Rich Chunk Metadata

{`FileChunk {
  file_path: "src/auth/login.py",
  start_byte: 245,
  end_byte: 892,
  line_range: "L12-L45",
  language: "python",
  chunk_type: "function_definition",
  name: "authenticate",

  // Enhanced metadata
  symbols_defined: ["authenticate", "verify_token"],
  imports_used: ["from jwt import decode"],
  complexity_score: 7,  // Cyclomatic complexity
  parent_context: "AuthService"  // Parent class
}`}
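In Python, the same record could be modelled as a dataclass. The field names follow the example above; the `to_metadata` helper is a hypothetical addition, included because some vector stores accept only scalar metadata values, so list fields are joined into strings here.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class FileChunk:
    file_path: str
    start_byte: int
    end_byte: int
    line_range: str
    language: str
    chunk_type: str
    name: str
    # Enhanced metadata
    symbols_defined: List[str] = field(default_factory=list)
    imports_used: List[str] = field(default_factory=list)
    complexity_score: Optional[int] = None   # cyclomatic complexity
    parent_context: Optional[str] = None     # enclosing class, if any

    def to_metadata(self):
        """Flatten to scalar values: drop None fields and join list fields."""
        flat = {}
        for key, value in asdict(self).items():
            if value is None:
                continue
            flat[key] = "; ".join(value) if isinstance(value, list) else value
        return flat


chunk = FileChunk(
    file_path="src/auth/login.py",
    start_byte=245,
    end_byte=892,
    line_range="L12-L45",
    language="python",
    chunk_type="function_definition",
    name="authenticate",
    symbols_defined=["authenticate", "verify_token"],
    imports_used=["from jwt import decode"],
    complexity_score=7,
    parent_context="AuthService",
)
metadata = chunk.to_metadata()
```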

Agentic Workflow (LangGraph)

The agent can perform multi-step reasoning using tools, enabling complex analysis that simple RAG cannot handle.

Agent State Machine

START

AGENT NODE

Process messages → Call LLM → Decide action

tool_call?

TOOLS NODE

Execute tools

final answer?

END

search_codebase

Vector search in codebase

read_file

Read complete file content

list_files

Directory listing

find_callers

Who calls this function?

find_callees

What does this call?

find_call_chain

Trace execution path

Example Agent Execution

User: "How does login handle invalid passwords?"

Agent Thought: I need to find the login function first.

→ Tool Call: search_codebase("login authentication")

Agent Thought: Found authenticate(). Let me see full implementation.

→ Tool Call: read_file("src/auth/login.py")

Agent Thought: It calls verify_password(). Let me check that.

→ Tool Call: find_callees("authenticate")

Final Answer: The login handles invalid passwords by...
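The loop between the agent node and the tools node can be sketched in plain Python, without LangGraph. Everything in this sketch is a stand-in: `run_agent` is a hypothetical name, `scripted_decide` replays the tool calls from the example above instead of calling an LLM, the tools return canned strings, and the final answer is a placeholder.

```python
def run_agent(question, decide, tools, max_steps=8):
    """Alternate between the agent node and the tools node until a final answer."""
    messages = [("user", question)]
    for _ in range(max_steps):
        action = decide(messages)                  # AGENT NODE: choose next action
        if action["type"] == "final":              # final answer? -> END
            return action["content"], messages
        tool = tools[action["tool"]]               # tool_call? -> TOOLS NODE
        result = tool(action["argument"])
        messages.append(("tool", f"{action['tool']}: {result}"))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_decide(messages):
    # Stand-in for the LLM: pick the next step from how many tool results exist.
    script = [
        {"type": "tool", "tool": "search_codebase", "argument": "login authentication"},
        {"type": "tool", "tool": "read_file", "argument": "src/auth/login.py"},
        {"type": "tool", "tool": "find_callees", "argument": "authenticate"},
        {"type": "final", "content": "(placeholder final answer)"},
    ]
    tool_results = sum(1 for role, _ in messages if role == "tool")
    return script[tool_results]


# Canned tool outputs, for illustration only
tools = {
    "search_codebase": lambda query: "authenticate() in src/auth/login.py",
    "read_file": lambda path: "def authenticate(...): ...",
    "find_callees": lambda name: "verify_password",
}

answer, trace = run_agent(
    "How does login handle invalid passwords?", scripted_decide, tools
)
```

The `max_steps` guard is a design choice of the sketch: it stops an agent that keeps requesting tools without ever producing a final answer.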

Multi-Stage Retrieval System

Stage 1: Vector Retrieval (k=10)

Semantic similarity search in Chroma/FAISS using embeddings

Stage 2: LLM File Selection

LLM analyzes file tree structure and selects relevant files

Stage 3: Ensemble Combination

Weighted merge: 60% vector + 40% LLM selection

Stage 4: Graph Enhancement

Add related files from AST graph (imports, calls)

Stage 5: Cross-Encoder Reranking

Score each (query, doc) pair, return top 5

Vector DB Support

Chroma

Default, local

FAISS

Fallback, fast

Qdrant

Cloud option
