# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**CodePilot** - An autonomous AI coding agent that takes GitHub issues, understands codebases, writes code in sandboxed environments, and creates pull requests autonomously.

**Tech Stack:** Python 3.11+, Claude Sonnet 4.5 (Anthropic API), E2B sandboxed execution, LangChain/LangGraph, Chainlit UI

**Current Phase:** Phase 5 Complete (Chainlit UI with multi-agent visualization)

## Architecture

### Multi-Agent Workflow System

CodePilot uses a **dual-mode orchestrator** that routes tasks to different workflows:

```
┌─────────────────────────────────────────────────────────┐
│                      ORCHESTRATOR                       │
│                  (Task Classification)                  │
└─────────────────────────────────────────────────────────┘
                             │
             ┌───────────────┴────────────────┐
             │                                │
         "explore"                         "code"
             │                                │
             v                                v
     ┌────────────────┐       ┌──────────────────────────────┐
     │ ExplorerAgent  │       │  Full Multi-Agent Pipeline   │
     │   (Direct)     │       │  Explorer → Clarify → Plan   │
     └────────────────┘       │              ↓               │
                              │       Coder ⟷ Reviewer       │
                              │         (iterative)          │
                              └──────────────────────────────┘
```

**Task Classification Logic** (see `orchestrator.py:92-201`):

- **Explore tasks**: Questions starting with "find", "where", "what", "how", "explain" → Uses ExplorerAgent only
- **Code tasks**: Commands starting with "add", "create", "implement", "fix" → Full pipeline
- Short queries (<100 chars) default to explore; long queries default to code

**Full Pipeline Flow** (code tasks):

1. **Explorer** - Gathers codebase context using token-efficient tools
2. **Clarifier** - Planner generates questions, pauses for user answers (v3.3+)
3. **Planner** - Creates implementation plan (NO tools, pure LLM reasoning)
4. **Coder** - Implements code, tests in sandbox (NO search, uses Explorer's context)
5. **Reviewer** - Reviews code, approves or sends back to Coder with feedback

### Context Engineering (Hybrid Retrieval)

The core differentiator is **Reciprocal Rank Fusion (RRF)** combining two search methods:

```
Query → ┌─ BM25 (keyword) ──────┐
        │                       │
        ├─ Embeddings (semantic)┤ → RRF Fusion → Top K Results
        │  (sentence-transformers)
        └───────────────────────┘
```

**Implementation**: `codepilot/context/hybrid_retriever.py`

- BM25: Exact matches (function names, variable names)
- Embeddings: Semantic matches (related concepts)
- RRF formula: `score = Σ(weight_i / (k + rank_i))` where k=60
- Default weights: 50% BM25, 50% embeddings

### Token-Efficient Tools

**Critical for cost management** - agents should prefer:

1. `get_file_outline(path)` - Shows class/function signatures (~50 tokens vs ~2000 for full file)
2. `get_code_chunk(path, name)` - Extracts specific function/class by name
3. `search_repository(query)` - Hybrid search (use BEFORE reading files)

Only use `read_file` when you need complete file contents.

### Agent Tool Access (v3.0+ separation)

Each agent has **restricted tool access** to prevent inefficiency:

- **ExplorerAgent**: `search_repository`, `get_file_outline`, `get_code_chunk`, `search_code`, `list_files`
- **PlannerAgent**: **NO TOOLS** (pure LLM reasoning, receives exploration context)
- **CoderAgent**: `write_file`, `get_code_chunk`, `read_file` (NO search tools)
- **ReviewerAgent**: `get_file_outline`, `get_code_chunk`, `read_file`

**Key insight**: v3.0 removed duplicate searching. Explorer searches once, all agents reuse that context.
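
The RRF formula listed under Context Engineering can be illustrated with a minimal sketch. It is not the code in `hybrid_retriever.py`: the function name `rrf_fuse`, its parameters, the 1-based ranks, and the sample file names are assumptions made for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights, k=60, top_k=5):
    """Fuse several ranked lists with weighted Reciprocal Rank Fusion.

    rankings: dict mapping a retriever name to a list of document ids, best first.
    weights:  dict mapping the same retriever names to their weights.
    Hypothetical helper; ranks are assumed to start at 1.
    """
    scores = defaultdict(float)
    for name, ranked_docs in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_docs, start=1):
            # score = sum over retrievers of weight_i / (k + rank_i)
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


# Illustrative rankings (made-up file names), using the 50/50 default weights.
example_rankings = {
    "bm25": ["a.py", "b.py", "c.py"],
    "embeddings": ["b.py", "c.py", "a.py"],
}
example_weights = {"bm25": 0.5, "embeddings": 0.5}
fused = rrf_fuse(example_rankings, example_weights)
print(fused)  # → ['b.py', 'a.py', 'c.py']
```

`b.py` comes first because it sits near the top of both lists, even though BM25 alone would have preferred `a.py`. A larger `k` flattens the gap between adjacent ranks, so no single retriever's top hit dominates the fused result.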
## Development Commands

**Setup:**

```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

**Verify installation:**

```bash
python test_setup.py  # Checks API keys are loaded
```

**Run Chainlit UI (Primary interface):**

```bash
chainlit run chainlit_app.py
# Opens at http://localhost:8000
# Ctrl+C to stop, then pkill -f chainlit to clean up background processes
```

**Test individual components:**

```bash
# Context Engineering (Phase 2)
python test_context.py

# Multi-Agent Workflow (Phase 3)
python test_multi_agent.py

# E2B Sandbox (Phase 4)
python test_sandbox.py
python test_workflow_with_sandbox.py
```

**Environment variables** (create `.env` file):

```
ANTHROPIC_API_KEY=sk-ant-...
E2B_API_KEY=e2b_...
```

## Current Implementation Status

**✅ COMPLETED (Phases 1-5):**

- Phase 1: LLM client, tool registry, base agent, core tools
- Phase 2: Hybrid retrieval (BM25 + embeddings), AST-aware parsing, codebase indexing
- Phase 3: Multi-agent architecture (Explorer, Planner, Coder, Reviewer, Orchestrator)
- Phase 4: E2B sandbox integration for isolated code execution
- Phase 5: Chainlit UI with real-time agent progress visualization

**🚧 NEXT PHASES:**

- Phase 6 (Weeks 17-18): GitHub Integration - webhooks, automated PR creation
- Phase 7 (Weeks 19-21): Evals & Benchmarks - SWE-bench evaluation
- Phase 8 (Weeks 22-24): Production Hardening - error handling, monitoring, deployment

See `devon-project-plan.md` for complete 24-week roadmap.

## Key Design Principles

1. **Context Engineering is the Differentiator** - Not UI/UX, the hybrid retrieval and AST-aware chunking
2. **ReAct Pattern** - All agents use: Reason → Act (with tools) → Observe → Repeat
3. **AST-Aware Processing** - Parse code structurally using tree-sitter, not as text
4. **Sandboxed Execution** - All code runs in E2B containers, never on host machine
5. **Single-Search Architecture** - Explorer searches once, all downstream agents reuse context (v3.0+)
6. **Clarification Before Action** - Planner asks questions before creating plan (v3.3+)

## Important Implementation Details

### Tool Schema Format

All tools follow Claude/Anthropic function calling format:

```python
{
    "type": "function",
    "function": {
        "name": "tool_name",
        "description": "Clear description for LLM",
        "parameters": {
            "type": "object",
            "properties": {...},
            "required": [...]
        }
    }
}
```

### Path Handling (Critical for Coder)

- **Planner must provide FULL ABSOLUTE PATHS** (e.g., `/tmp/codepilot_repos/flask_abc123/examples/app.py`)
- **Coder uses paths EXACTLY as written** in the plan
- Repository path is injected in Chainlit context (see `chainlit_app.py:661-672`)

### File Operations

- `write_file` auto-creates parent directories
- `run_command` has 30-second timeout
- All tool functions return formatted strings (success messages or errors)

### Version Tracking

Files include version constants for debugging hot-reload issues:

- `orchestrator.py:12` - `ORCHESTRATOR_VERSION`
- `chainlit_app.py:25-26` - `APP_VERSION`, `BUILD_ID`

### Conversation Management

Agents use `ConversationManager` (`codepilot/agents/conversation.py`) to maintain message history in OpenAI/Anthropic format. This handles:

- System/user/assistant messages
- Tool calls and tool results
- Proper formatting for both Claude and OpenAI APIs

## Critical Files

- `codepilot/agents/orchestrator.py` - Task classification and multi-agent state machine
- `codepilot/agents/planner_agent.py` - Pure LLM planning (no tools) + clarification questions
- `codepilot/agents/coder_agent.py` - Code implementation (no search tools)
- `codepilot/agents/explorer_agent.py` - Codebase exploration (search tools only)
- `codepilot/context/hybrid_retriever.py` - RRF fusion algorithm
- `codepilot/tools/registry.py` - Tool schemas and function mappings
- `chainlit_app.py` - Interactive UI with GitHub repo cloning and progress visualization
- `requirements.txt` - Python dependencies

## Project Structure

```
codepilot/
├── llm/          # LLM client wrappers (Claude, OpenAI)
├── agents/       # Multi-agent system (Orchestrator, Planner, Coder, Reviewer, Explorer)
├── tools/        # Tool implementations (file ops, context search, GitHub)
├── context/      # Hybrid retrieval (BM25, embeddings, parser, indexer)
└── sandbox/      # E2B sandbox integration

chainlit_app.py   # Main UI application
```
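
As a rough illustration of the Tool Schema Format and File Operations notes, the sketch below pairs a schema in the documented shape with a `write_file` that creates parent directories and returns a formatted string. It is not the code in `codepilot/tools/registry.py`: the helper `make_tool_schema`, the constant `WRITE_FILE_SCHEMA`, the parameter names, and the exact message wording are all hypothetical.

```python
import tempfile
from pathlib import Path


def make_tool_schema(name, description, properties, required):
    """Build a tool schema in the documented shape (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def write_file(path: str, content: str) -> str:
    """Write content to path, creating parent directories first.

    Returns a formatted string for both success and failure instead of
    raising, mirroring the File Operations notes. The wording is assumed.
    """
    try:
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return f"Success: wrote {len(content)} characters to {path}"
    except OSError as exc:
        return f"Error: could not write {path}: {exc}"


# Parameter names here are illustrative assumptions.
WRITE_FILE_SCHEMA = make_tool_schema(
    name="write_file",
    description="Write text to a file at an absolute path",
    properties={
        "path": {"type": "string", "description": "Full absolute path"},
        "content": {"type": "string", "description": "Text to write"},
    },
    required=["path", "content"],
)

# Demo in a throwaway directory so nothing is written to the repository.
with tempfile.TemporaryDirectory() as tmp_dir:
    print(write_file(f"{tmp_dir}/nested/dir/app.py", "print('hi')\n"))
```

Returning a string in both cases lets an agent read a failure as an ordinary tool observation and react to it in the next ReAct step, rather than having an exception abort the loop.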