File size: 9,128 Bytes
45bf590
 
 
 
 
 
 
 
6f39ef4
 
45bf590
 
 
6f39ef4
 
 
45bf590
 
6f39ef4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
45bf590
 
6f39ef4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
45bf590
6f39ef4
 
 
 
 
 
 
 
 
 
45bf590
 
 
 
 
 
6f39ef4
45bf590
 
 
6f39ef4
45bf590
6f39ef4
45bf590
 
6f39ef4
45bf590
 
 
6f39ef4
45bf590
 
6f39ef4
45bf590
6f39ef4
45bf590
 
6f39ef4
45bf590
 
6f39ef4
45bf590
 
 
 
6f39ef4
45bf590
 
 
 
 
6f39ef4
 
 
 
 
 
 
 
 
 
 
 
 
45bf590
6f39ef4
45bf590
 
 
6f39ef4
 
 
 
 
 
45bf590
6f39ef4
45bf590
6f39ef4
 
45bf590
 
 
 
 
6f39ef4
45bf590
 
 
 
 
 
 
 
 
6f39ef4
 
 
 
45bf590
6f39ef4
 
 
45bf590
 
6f39ef4
 
 
 
45bf590
6f39ef4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**CodePilot** - An autonomous AI coding agent that takes GitHub issues, understands codebases, writes code in sandboxed environments, and creates pull requests autonomously.

**Tech Stack:** Python 3.11+, Claude Sonnet 4.5 (Anthropic API), E2B sandboxed execution, LangChain/LangGraph, Chainlit UI
**Current Phase:** Phase 5 Complete (Chainlit UI with multi-agent visualization)

## Architecture

### Multi-Agent Workflow System

CodePilot uses a **dual-mode orchestrator** that routes tasks to different workflows:

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ORCHESTRATOR                         β”‚
β”‚              (Task Classification)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                                β”‚
    "explore"                         "code"
         β”‚                                β”‚
         v                                v
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ExplorerAgent  β”‚         β”‚ Full Multi-Agent Pipeline    β”‚
β”‚   (Direct)     β”‚         β”‚ Explorer β†’ Clarify β†’ Plan    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚         ↓                    β”‚
                           β”‚   Coder ⟷ Reviewer           β”‚
                           β”‚  (iterative)                 β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Task Classification Logic** (see `orchestrator.py:92-201`):
- **Explore tasks**: Questions starting with "find", "where", "what", "how", "explain" β†’ Uses ExplorerAgent only
- **Code tasks**: Commands starting with "add", "create", "implement", "fix" β†’ Full pipeline
- Short queries (<100 chars) default to explore; long queries default to code

**Full Pipeline Flow** (code tasks):
1. **Explorer** - Gathers codebase context using token-efficient tools
2. **Clarifier** - Planner generates questions, pauses for user answers (v3.3+)
3. **Planner** - Creates implementation plan (NO tools, pure LLM reasoning)
4. **Coder** - Implements code, tests in sandbox (NO search, uses Explorer's context)
5. **Reviewer** - Reviews code, approves or sends back to Coder with feedback

### Context Engineering (Hybrid Retrieval)

The core differentiator is **Reciprocal Rank Fusion (RRF)** combining two search methods:

```
Query β†’ β”Œβ”€ BM25 (keyword) ──────┐
        β”‚                       β”‚
        β”œβ”€ Embeddings (semantic)─ β†’ RRF Fusion β†’ Top K Results
        β”‚   (sentence-transformers)
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Implementation**: `codepilot/context/hybrid_retriever.py`
- BM25: Exact matches (function names, variable names)
- Embeddings: Semantic matches (related concepts)
- RRF formula: `score = Ξ£(weight_i / (k + rank_i))` where k=60
- Default weights: 50% BM25, 50% embeddings

### Token-Efficient Tools

**Critical for cost management** - agents should prefer:
1. `get_file_outline(path)` - Shows class/function signatures (~50 tokens vs ~2000 for full file)
2. `get_code_chunk(path, name)` - Extracts specific function/class by name
3. `search_repository(query)` - Hybrid search (use BEFORE reading files)

Only use `read_file` when you need complete file contents.

### Agent Tool Access (v3.0+ separation)

Each agent has **restricted tool access** to prevent inefficiency:

- **ExplorerAgent**: `search_repository`, `get_file_outline`, `get_code_chunk`, `search_code`, `list_files`
- **PlannerAgent**: **NO TOOLS** (pure LLM reasoning, receives exploration context)
- **CoderAgent**: `write_file`, `get_code_chunk`, `read_file` (NO search tools)
- **ReviewerAgent**: `get_file_outline`, `get_code_chunk`, `read_file`

**Key insight**: v3.0 removed duplicate searching. Explorer searches once, all agents reuse that context.

## Development Commands

**Setup:**
```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

**Verify installation:**
```bash
python test_setup.py  # Checks API keys are loaded
```

**Run Chainlit UI (Primary interface):**
```bash
chainlit run chainlit_app.py
# Opens at http://localhost:8000
# Ctrl+C to stop, then pkill -f chainlit to clean up background processes
```

**Test individual components:**
```bash
# Context Engineering (Phase 2)
python test_context.py

# Multi-Agent Workflow (Phase 3)
python test_multi_agent.py

# E2B Sandbox (Phase 4)
python test_sandbox.py
python test_workflow_with_sandbox.py
```

**Environment variables** (create `.env` file):
```
ANTHROPIC_API_KEY=sk-ant-...
E2B_API_KEY=e2b_...
```

## Current Implementation Status

**βœ… COMPLETED (Phases 1-5):**
- Phase 1: LLM client, tool registry, base agent, core tools
- Phase 2: Hybrid retrieval (BM25 + embeddings), AST-aware parsing, codebase indexing
- Phase 3: Multi-agent architecture (Explorer, Planner, Coder, Reviewer, Orchestrator)
- Phase 4: E2B sandbox integration for isolated code execution
- Phase 5: Chainlit UI with real-time agent progress visualization

**🚧 NEXT PHASES:**
- Phase 6 (Weeks 17-18): GitHub Integration - webhooks, automated PR creation
- Phase 7 (Weeks 19-21): Evals & Benchmarks - SWE-bench evaluation
- Phase 8 (Weeks 22-24): Production Hardening - error handling, monitoring, deployment

See `devon-project-plan.md` for complete 24-week roadmap.

## Key Design Principles

1. **Context Engineering is the Differentiator** - Not UI/UX, the hybrid retrieval and AST-aware chunking
2. **ReAct Pattern** - All agents use: Reason β†’ Act (with tools) β†’ Observe β†’ Repeat
3. **AST-Aware Processing** - Parse code structurally using tree-sitter, not as text
4. **Sandboxed Execution** - All code runs in E2B containers, never on host machine
5. **Single-Search Architecture** - Explorer searches once, all downstream agents reuse context (v3.0+)
6. **Clarification Before Action** - Planner asks questions before creating plan (v3.3+)

## Important Implementation Details

### Tool Schema Format
All tools follow Claude/Anthropic function calling format:
```python
{
    "type": "function",
    "function": {
        "name": "tool_name",
        "description": "Clear description for LLM",
        "parameters": {
            "type": "object",
            "properties": {...},
            "required": [...]
        }
    }
}
```

### Path Handling (Critical for Coder)
- **Planner must provide FULL ABSOLUTE PATHS** (e.g., `/tmp/codepilot_repos/flask_abc123/examples/app.py`)
- **Coder uses paths EXACTLY as written** in the plan
- Repository path is injected in Chainlit context (see `chainlit_app.py:661-672`)

### File Operations
- `write_file` auto-creates parent directories
- `run_command` has 30-second timeout
- All tool functions return formatted strings (success messages or errors)

### Version Tracking
Files include version constants for debugging hot-reload issues:
- `orchestrator.py:12` - `ORCHESTRATOR_VERSION`
- `chainlit_app.py:25-26` - `APP_VERSION`, `BUILD_ID`

### Conversation Management
Agents use `ConversationManager` (`codepilot/agents/conversation.py`) to maintain message history in OpenAI/Anthropic format. This handles:
- System/user/assistant messages
- Tool calls and tool results
- Proper formatting for both Claude and OpenAI APIs

## Critical Files

- `codepilot/agents/orchestrator.py` - Task classification and multi-agent state machine
- `codepilot/agents/planner_agent.py` - Pure LLM planning (no tools) + clarification questions
- `codepilot/agents/coder_agent.py` - Code implementation (no search tools)
- `codepilot/agents/explorer_agent.py` - Codebase exploration (search tools only)
- `codepilot/context/hybrid_retriever.py` - RRF fusion algorithm
- `codepilot/tools/registry.py` - Tool schemas and function mappings
- `chainlit_app.py` - Interactive UI with GitHub repo cloning and progress visualization
- `requirements.txt` - Python dependencies

## Project Structure

```
codepilot/
β”œβ”€β”€ llm/               # LLM client wrappers (Claude, OpenAI)
β”œβ”€β”€ agents/            # Multi-agent system (Orchestrator, Planner, Coder, Reviewer, Explorer)
β”œβ”€β”€ tools/             # Tool implementations (file ops, context search, GitHub)
β”œβ”€β”€ context/           # Hybrid retrieval (BM25, embeddings, parser, indexer)
└── sandbox/           # E2B sandbox integration
chainlit_app.py        # Main UI application
```