Spaces: ayushm98/codepilot

ayushm98 committed on Jan 14

Commit 6f39ef4

1 Parent(s): 94dfc0a

Improve UI: cleaner welcome, better progress display, simplified results

Files changed (3)

CLAUDE.md +143 -100
chainlit.md +21 -8
chainlit_app.py +84 -192

CLAUDE.md CHANGED

@@ -6,146 +6,160 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 **CodePilot** - An autonomous AI coding agent that takes GitHub issues, understands codebases, writes code in sandboxed environments, and creates pull requests autonomously.
-**Tech Stack:** Python 3.11+, Claude Sonnet 4.5 (Anthropic API), E2B sandboxed execution, LangChain/LangGraph
-**Development Timeline:** 24-week phased implementation (currently in Phase 5: Chainlit UI - COMPLETE)
 ## Architecture
-This project follows a layered architecture (planned - see devon-project-plan.md for full roadmap):
 ```
-Multi-Agent System (Planner, Coder, Reviewer)
-    ↓
-Context Engine (Hybrid Retrieval, AST-aware chunking)
-    ↓
-Tool Layer (read_file, write_file, run_command, search_code)
-    ↓
-E2B Sandbox (Isolated code execution)
 ```
-## Current Implementation Status
-**✅ COMPLETED PHASES:**
-**Phase 1: Foundation (Weeks 1-3)**
-- ✅ LLM client wrapper (`codepilot/llm/claude_client.py`) - Claude API with tool calling
-- ✅ Tool registry (`codepilot/tools/registry.py`) - Function calling infrastructure
-- ✅ Base agent (`codepilot/agents/base_agent.py`) - Core ReAct loop
-- ✅ Core tools: `read_file`, `write_file`, `run_command`, `search_codebase`, `list_files`
-**Phase 2: Context Engineering (Weeks 4-8)**
-- ✅ BM25 keyword search (`codepilot/context/bm25_search.py`)
-- ✅ Dense embeddings (`codepilot/context/embeddings.py`) - sentence-transformers
-- ✅ Hybrid retrieval (`codepilot/context/retrieval.py`) - Combined BM25 + semantic search
-- ✅ Code parser (`codepilot/context/parser.py`) - AST-aware chunking
-- ✅ Codebase indexer (`codepilot/context/indexer.py`) - Full codebase indexing
-- ✅ Context selector (`codepilot/context/selector.py`) - Smart context selection
-- ✅ Context tools: `index_codebase`, `search_codebase`, `get_relevant_context`
-**Phase 3: Multi-Agent Architecture (Weeks 9-12)**
-- ✅ Planner agent (`codepilot/agents/planner_agent.py`) - Creates implementation plans
-- ✅ Coder agent (`codepilot/agents/coder_agent.py`) - Writes and tests code
-- ✅ Reviewer agent (`codepilot/agents/reviewer_agent.py`) - Code review and approval
-- ✅ Orchestrator (`codepilot/agents/orchestrator.py`) - State machine coordination
-**Phase 4: E2B Sandbox Integration (Weeks 13-14)**
-- ✅ E2B sandbox manager (`codepilot/sandbox/e2b_sandbox.py`) - Isolated execution
-- ✅ Sandbox tools (`codepilot/sandbox/sandbox_tools.py`) - upload, execute, run commands
-- ✅ Integration with Coder agent - Automatic sandbox testing workflow
-**Phase 5: Chainlit UI (Weeks 15-16)**
-- ✅ Chainlit application (`chainlit_app.py`) - Interactive chat interface
-- ✅ Real-time workflow visualization with Chainlit Steps
-- ✅ Detailed agent progress tracking (Planner → Coder → Reviewer)
-- ✅ Code preview and test results display
-- ✅ User guide (`CHAINLIT_GUIDE.md`)
-**NEXT PHASES:**
-**Phase 6: GitHub Integration (Weeks 17-18)** - Not started
-- GitHub webhooks for issue tracking
-- Automated PR creation
-- Branch management
-**Phase 7: Evals & Benchmarks (Weeks 19-21)** - Not started
-- SWE-bench evaluation
-- Custom test suite
-**Phase 8: Production Hardening (Weeks 22-24)** - Not started
-- Error handling and retries
-- Logging and monitoring
-- Deployment configuration
 ## Development Commands
 **Setup:**
 ```bash
 python -m venv venv
-source venv/bin/activate  # On Windows: venv\Scripts\activate
 pip install -r requirements.txt
 ```
-**Verify setup:**
 ```bash
-python test_setup.py  # Checks that API keys are loaded correctly
 ```
-**Run Chainlit UI (Phase 5):**
 ```bash
 chainlit run chainlit_app.py
 # Opens at http://localhost:8000
-# See CHAINLIT_GUIDE.md for full usage guide
 ```
-**Test individual phases:**
 ```bash
-# Phase 2: Context Engineering
 python test_context.py
-# Phase 3: Multi-Agent Workflow
 python test_multi_agent.py
-# Phase 4: E2B Sandbox
 python test_sandbox.py
 python test_workflow_with_sandbox.py
 ```
-**Environment variables required in .env:**
 ```
 ANTHROPIC_API_KEY=sk-ant-...
 E2B_API_KEY=e2b_...
 ```
-## Project Phases (from devon-project-plan.md)
-1. **Phase 1 (Weeks 1-3):** Foundation - Basic agent loop, tool calling, LLM abstraction
-2. **Phase 2 (Weeks 4-8):** Context Engineering - Hybrid retrieval (BM25 + dense), AST-aware chunking
-3. **Phase 3 (Weeks 9-12):** Multi-Agent Architecture - Orchestrator with specialized agents
-4. **Phase 4 (Weeks 13-14):** E2B Sandbox Integration
-5. **Phase 5 (Weeks 15-16):** Chainlit UI
-6. **Phase 6 (Weeks 17-18):** GitHub Integration (webhooks, PRs)
-7. **Phase 7 (Weeks 19-21):** Evals & Benchmarks (SWE-bench)
-8. **Phase 8 (Weeks 22-24):** Production Hardening
 ## Key Design Principles
-**From the project plan:**
-- **Focus on Context Engineering:** This is the differentiator, not UI/UX
-- **ReAct Pattern:** Reason about what to do, Act with tools, observe results, repeat
-- **AST-Aware Processing:** Parse code structurally, not as text (tree-sitter for multi-language support)
-- **Hybrid Retrieval:** Combine BM25 (exact matches) + dense embeddings (semantic search)
-- **Sandboxed Execution:** All code runs in E2B containers, never on host
-- **Multi-Agent Orchestration:** Specialized agents (Planner, Coder, Reviewer) coordinated by orchestrator
-## Tool Schema Format
-Tools follow Claude/Anthropic function calling format:
 ```python
 {
     "type": "function",
     "function": {
         "name": "tool_name",
-        "description": "Clear description for LLM to understand when to use",
         "parameters": {
             "type": "object",
             "properties": {...},
@@ -155,17 +169,46 @@ Tools follow Claude/Anthropic function calling format:
 }
 ```
-## Implementation Notes
 - All tool functions return formatted strings (success messages or errors)
-- `write_file` auto-creates parent directories if needed
-- `run_command` has 30-second timeout to prevent hanging
-- Error handling uses specific exceptions (FileNotFoundError, PermissionError) before generic fallback
-## Important Files
-- `devon-project-plan.md` - Complete 24-week implementation roadmap with architectural details
-- `codepilot/llm/claude_client.py` - Claude API wrapper with tool calling
-- `codepilot/agents/orchestrator.py` - Multi-agent state machine
-- `requirements.txt` - Python dependencies (anthropic, e2b-code-interpreter, langchain, langgraph)
-- `.env` - API keys (not committed, in .gitignore)

 **CodePilot** - An autonomous AI coding agent that takes GitHub issues, understands codebases, writes code in sandboxed environments, and creates pull requests autonomously.
+**Tech Stack:** Python 3.11+, Claude Sonnet 4.5 (Anthropic API), E2B sandboxed execution, LangChain/LangGraph, Chainlit UI
+**Current Phase:** Phase 5 Complete (Chainlit UI with multi-agent visualization)
 ## Architecture
+### Multi-Agent Workflow System
+CodePilot uses a **dual-mode orchestrator** that routes tasks to different workflows:
 ```
+┌─────────────────────────────────────────────────────────┐
+│                    ORCHESTRATOR                         │
+│              (Task Classification)                      │
+└─────────────────────────────────────────────────────────┘
+                         │
+         ┌───────────────┴────────────────┐
+         │                                │
+    "explore"                         "code"
+         │                                │
+         v                                v
+┌────────────────┐         ┌──────────────────────────────┐
+│ ExplorerAgent  │         │ Full Multi-Agent Pipeline    │
+│   (Direct)     │         │ Explorer → Clarify → Plan    │
+└────────────────┘         │         ↓                    │
+                           │   Coder ⟷ Reviewer           │
+                           │  (iterative)                 │
+                           └──────────────────────────────┘
 ```
+**Task Classification Logic** (see `orchestrator.py:92-201`):
+- **Explore tasks**: Questions starting with "find", "where", "what", "how", "explain" → Uses ExplorerAgent only
+- **Code tasks**: Commands starting with "add", "create", "implement", "fix" → Full pipeline
+- Short queries (<100 chars) default to explore; long queries default to code
+**Full Pipeline Flow** (code tasks):
+1. **Explorer** - Gathers codebase context using token-efficient tools
+2. **Clarifier** - Planner generates questions, pauses for user answers (v3.3+)
+3. **Planner** - Creates implementation plan (NO tools, pure LLM reasoning)
+4. **Coder** - Implements code, tests in sandbox (NO search, uses Explorer's context)
+5. **Reviewer** - Reviews code, approves or sends back to Coder with feedback
+### Context Engineering (Hybrid Retrieval)
+The core differentiator is **Reciprocal Rank Fusion (RRF)** combining two search methods:
+```
+Query → ┌─ BM25 (keyword) ──────┐
+        │                       │
+        ├─ Embeddings (semantic)┤ → RRF Fusion → Top K Results
+        │   (sentence-transformers)
+        └───────────────────────┘
+```
+**Implementation**: `codepilot/context/hybrid_retriever.py`
+- BM25: Exact matches (function names, variable names)
+- Embeddings: Semantic matches (related concepts)
+- RRF formula: `score = Σ(weight_i / (k + rank_i))` where k=60
+- Default weights: 50% BM25, 50% embeddings
+### Token-Efficient Tools
+**Critical for cost management** - agents should prefer:
+1. `get_file_outline(path)` - Shows class/function signatures (~50 tokens vs ~2000 for full file)
+2. `get_code_chunk(path, name)` - Extracts specific function/class by name
+3. `search_repository(query)` - Hybrid search (use BEFORE reading files)
+Only use `read_file` when you need complete file contents.
+### Agent Tool Access (v3.0+ separation)
+Each agent has **restricted tool access** to prevent inefficiency:
+- **ExplorerAgent**: `search_repository`, `get_file_outline`, `get_code_chunk`, `search_code`, `list_files`
+- **PlannerAgent**: **NO TOOLS** (pure LLM reasoning, receives exploration context)
+- **CoderAgent**: `write_file`, `get_code_chunk`, `read_file` (NO search tools)
+- **ReviewerAgent**: `get_file_outline`, `get_code_chunk`, `read_file`
+**Key insight**: v3.0 removed duplicate searching. Explorer searches once, all agents reuse that context.
 ## Development Commands
 **Setup:**
 ```bash
 python -m venv venv
+source venv/bin/activate  # Windows: venv\Scripts\activate
 pip install -r requirements.txt
 ```
+**Verify installation:**
 ```bash
+python test_setup.py  # Checks API keys are loaded
 ```
+**Run Chainlit UI (Primary interface):**
 ```bash
 chainlit run chainlit_app.py
 # Opens at http://localhost:8000
+# Ctrl+C to stop, then pkill -f chainlit to clean up background processes
 ```
+**Test individual components:**
 ```bash
+# Context Engineering (Phase 2)
 python test_context.py
+# Multi-Agent Workflow (Phase 3)
 python test_multi_agent.py
+# E2B Sandbox (Phase 4)
 python test_sandbox.py
 python test_workflow_with_sandbox.py
 ```
+**Environment variables** (create `.env` file):
 ```
 ANTHROPIC_API_KEY=sk-ant-...
 E2B_API_KEY=e2b_...
 ```
+## Current Implementation Status
+**✅ COMPLETED (Phases 1-5):**
+- Phase 1: LLM client, tool registry, base agent, core tools
+- Phase 2: Hybrid retrieval (BM25 + embeddings), AST-aware parsing, codebase indexing
+- Phase 3: Multi-agent architecture (Explorer, Planner, Coder, Reviewer, Orchestrator)
+- Phase 4: E2B sandbox integration for isolated code execution
+- Phase 5: Chainlit UI with real-time agent progress visualization
+**🚧 NEXT PHASES:**
+- Phase 6 (Weeks 17-18): GitHub Integration - webhooks, automated PR creation
+- Phase 7 (Weeks 19-21): Evals & Benchmarks - SWE-bench evaluation
+- Phase 8 (Weeks 22-24): Production Hardening - error handling, monitoring, deployment
+See `devon-project-plan.md` for complete 24-week roadmap.
 ## Key Design Principles
+1. **Context Engineering is the Differentiator** - The value lies in hybrid retrieval and AST-aware chunking, not in UI/UX
+2. **ReAct Pattern** - Agents with tools follow: Reason → Act (with tools) → Observe → Repeat
+3. **AST-Aware Processing** - Parse code structurally using tree-sitter, not as text
+4. **Sandboxed Execution** - All code runs in E2B containers, never on host machine
+5. **Single-Search Architecture** - Explorer searches once, all downstream agents reuse context (v3.0+)
+6. **Clarification Before Action** - Planner asks questions before creating plan (v3.3+)
+## Important Implementation Details
+### Tool Schema Format
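+All tools are declared in the OpenAI-style function calling schema shown below (Anthropic's native tool format uses `input_schema` instead of a `function` wrapper):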
 ```python
 {
     "type": "function",
     "function": {
         "name": "tool_name",
+        "description": "Clear description for LLM",
         "parameters": {
             "type": "object",
             "properties": {...},
 }
 ```
+### Path Handling (Critical for Coder)
+- **Planner must provide FULL ABSOLUTE PATHS** (e.g., `/tmp/codepilot_repos/flask_abc123/examples/app.py`)
+- **Coder uses paths EXACTLY as written** in the plan
+- Repository path is injected in Chainlit context (see `chainlit_app.py:661-672`)
+### File Operations
+- `write_file` auto-creates parent directories
+- `run_command` has 30-second timeout
 - All tool functions return formatted strings (success messages or errors)
+### Version Tracking
+Files include version constants for debugging hot-reload issues:
+- `orchestrator.py:12` - `ORCHESTRATOR_VERSION`
+- `chainlit_app.py:25-26` - `APP_VERSION`, `BUILD_ID`
+### Conversation Management
+Agents use `ConversationManager` (`codepilot/agents/conversation.py`) to maintain message history in OpenAI/Anthropic format. This handles:
+- System/user/assistant messages
+- Tool calls and tool results
+- Proper formatting for both Claude and OpenAI APIs
+## Critical Files
+- `codepilot/agents/orchestrator.py` - Task classification and multi-agent state machine
+- `codepilot/agents/planner_agent.py` - Pure LLM planning (no tools) + clarification questions
+- `codepilot/agents/coder_agent.py` - Code implementation (no search tools)
+- `codepilot/agents/explorer_agent.py` - Codebase exploration (search tools only)
+- `codepilot/context/hybrid_retriever.py` - RRF fusion algorithm
+- `codepilot/tools/registry.py` - Tool schemas and function mappings
+- `chainlit_app.py` - Interactive UI with GitHub repo cloning and progress visualization
+- `requirements.txt` - Python dependencies
+## Project Structure
+```
+codepilot/
+├── llm/               # LLM client wrappers (Claude, OpenAI)
+├── agents/            # Multi-agent system (Orchestrator, Planner, Coder, Reviewer, Explorer)
+├── tools/             # Tool implementations (file ops, context search, GitHub)
+├── context/           # Hybrid retrieval (BM25, embeddings, parser, indexer)
+└── sandbox/           # E2B sandbox integration
+chainlit_app.py        # Main UI application
+```
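
Note on the CLAUDE.md changes: the RRF formula documented above (`score = Σ(weight_i / (k + rank_i))`, k=60, 50/50 weights) can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `codepilot/context/hybrid_retriever.py`: the function name `rrf_fuse`, its parameters, and the sample chunk IDs are all hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60, top_k=5):
    """Fuse several ranked lists of IDs with weighted Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first (rank 1 is the first item).
    weights:  one weight per ranking; defaults to equal weights.
    """
    if not rankings:
        return []
    if weights is None:
        weights = [1.0 / len(rankings)] * len(rankings)

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # score = sum over retrievers of weight_i / (k + rank_i)
            scores[doc_id] += weight / (k + rank)

    # Highest score first; ties are broken by ID so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


# Hypothetical results from the two retrievers (best match first).
bm25_ranking = ["auth.py::login", "utils.py::hash_password", "routes.py::health"]
embedding_ranking = ["auth.py::login", "routes.py::health", "models.py::User"]

fused = rrf_fuse([bm25_ranking, embedding_ranking], weights=[0.5, 0.5])
for doc_id, score in fused:
    print(f"{score:.6f}  {doc_id}")
```

A chunk that appears in both lists accumulates two terms, so it can outrank a chunk that only one retriever found, even if that retriever placed it higher. A chunk missing from a list simply contributes nothing for that retriever.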

chainlit.md CHANGED

@@ -1,14 +1,27 @@
-# Welcome to Chainlit! 🚀🤖
-Hi there, Developer! 👋 We're excited to have you on board. Chainlit is a powerful tool designed to help you prototype, debug and share applications built on top of LLMs.
-## Useful Links 🔗
-- **Documentation:** Get started with our comprehensive [Chainlit Documentation](https://docs.chainlit.io) 📚
-- **Discord Community:** Join our friendly [Chainlit Discord](https://discord.gg/k73SQ3FyUh) to ask questions, share your projects, and connect with other developers! 💬
-We can't wait to see what you create with Chainlit! Happy coding! 💻😊
-## Welcome screen
-To modify the welcome screen, edit the `chainlit.md` file at the root of your project. If you do not want a welcome screen, just leave this file empty.

+# 🤖 CodePilot
+**Autonomous AI coding agent powered by Claude Sonnet 4.5**
+## What I Can Do
+- 🔍 **Understand your codebase** - Hybrid search finds relevant code instantly
+- 📋 **Plan implementations** - Break down tasks into clear steps
+- ✍️ **Write production code** - Multi-agent system writes, tests, and reviews
+- 🏖️ **Run safely in sandboxes** - All code tested in isolated E2B environments
+## How to Use
+1. **Paste a GitHub URL** (public repos only)
+2. **Describe what you want** (e.g., "add a health check endpoint")
+3. **Watch the agents work** - Explorer → Planner → Coder → Reviewer
+4. **Get production-ready code** - Tested and reviewed automatically
+## Example
+```
+https://github.com/pallets/flask add a /health endpoint
+```
+---
+**Ready!** Paste a GitHub URL and tell me what to build 🚀
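
Note on the chainlit.md changes: the welcome text asks for a GitHub URL and a task description in a single message. One way to separate the two could look like the sketch below. It is an assumption for illustration only and is not taken from `chainlit_app.py`; the name `split_repo_and_task` and the regular expression are hypothetical.

```python
import re

# Hypothetical pattern: matches https://github.com/<owner>/<repo>.
GITHUB_URL_RE = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def split_repo_and_task(message):
    """Return (repo_url, task) from a chat message.

    repo_url is None when the message contains no GitHub URL; in that case
    the whole message is treated as the task.
    """
    match = GITHUB_URL_RE.search(message)
    if match is None:
        return None, message.strip()

    url = match.group(0)
    # Everything except the URL is the task; collapse extra whitespace.
    remainder = message[:match.start()] + " " + message[match.end():]
    task = " ".join(remainder.split())
    return url, task


url, task = split_repo_and_task(
    "https://github.com/pallets/flask add a /health endpoint"
)
print(url)
print(task)
```

The pattern is deliberately loose: a trailing period or a `.git` suffix would be captured as part of the URL, so a real implementation would likely need to normalize the match before cloning.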

chainlit_app.py CHANGED

@@ -22,8 +22,8 @@ from concurrent.futures import ThreadPoolExecutor
 # ============================================================
 # STARTUP VERSION CHECK - Change this to detect if rebuild worked
 # ============================================================
-APP_VERSION = "3.6.1-plan-fix"
-BUILD_ID = "2024-12-20-v6"
 print("=" * 60)
 print(f"[STARTUP] CodePilot Chainlit App")
 print(f"[STARTUP] APP_VERSION: {APP_VERSION}")
@@ -80,7 +80,13 @@ def format_code_output(code_changes: dict) -> str:
     if not code_changes:
         return "No code changes."
-    output = []
     for file_path, content in code_changes.items():
         # Get just the filename for display
         filename = os.path.basename(file_path)
@@ -89,9 +95,9 @@ def format_code_output(code_changes: dict) -> str:
         # Use collapsible details/summary
         output.append(f"<details>")
-        output.append(f"<summary><strong>{filename}</strong> ({line_count} lines) - Click to expand</summary>")
         output.append(f"")
-        output.append(f"**Path:** `{file_path}`")
         output.append(f"```{lang}")
         output.append(content)
         output.append("```")
@@ -224,9 +230,9 @@ def format_progress_display(status: dict, total_cost: float) -> str:
         if done:
             return "✅"
         elif active:
-            return "⏳"
         else:
-            return "⬜"
     def get_activity(agent: str) -> str:
         """Get activity text for an agent."""
@@ -234,17 +240,17 @@ def format_progress_display(status: dict, total_cost: float) -> str:
         if agent == 'Explorer':
             if status['explorer_done']:
-                return status.get('explorer_activity') or 'Done'
             elif current == 'Explorer':
                 return status.get('explorer_activity') or 'Analyzing codebase...'
-            return 'Waiting'
         elif agent == 'Planner':
             if status['planner_done']:
-                return status.get('planner_activity') or 'Plan created'
             elif current == 'Planner':
                 return status.get('planner_activity') or 'Creating plan...'
-            return 'Waiting'
         elif agent == 'Coder':
             if status['coder_done']:
@@ -252,221 +258,114 @@ def format_progress_display(status: dict, total_cost: float) -> str:
                 if activity:
                     return activity
                 files = status.get('files_written', 0)
-                return f'Wrote {files} files' if files else 'Done'
             elif current == 'Coder':
                 return status.get('coder_activity') or 'Writing code...'
-            return 'Waiting'
         elif agent == 'Reviewer':
             if status['reviewer_done']:
                 if status['approved']:
-                    return '**Approved**'
                 else:
-                    return '**Rejected**'
             elif current == 'Reviewer':
                 return status.get('reviewer_activity') or 'Reviewing...'
-            return 'Waiting'
-        return 'Waiting'
     current = status['current_agent']
-    lines = ["## Progress\n"]
-    lines.append("| Agent | Activity |")
-    lines.append("|-------|----------|")
-    lines.append(f"| Explorer | {icon(status['explorer_done'], current == 'Explorer')} {get_activity('Explorer')} |")
-    lines.append(f"| Planner | {icon(status['planner_done'], current == 'Planner')} {get_activity('Planner')} |")
-    lines.append(f"| Coder | {icon(status['coder_done'], current == 'Coder')} {get_activity('Coder')} |")
-    lines.append(f"| Reviewer | {icon(status['reviewer_done'], current == 'Reviewer')} {get_activity('Reviewer')} |")
-    lines.append(f"\n**Cost:** ${total_cost:.4f}")
     return "\n".join(lines)
 def format_final_result(result: dict, total_cost: float) -> str:
     """Format final result with detailed test checks."""
-    lines = ["## Results\n"]
     success = result.get('success', False)
-    has_plan = bool(result.get('plan'))
     code_changes = result.get('code_changes', {})
-    has_code = bool(code_changes)
     file_count = len(code_changes) if code_changes else 0
     review_feedback = result.get('review_feedback', '')
-    # Detailed checks table
-    lines.append("| Test | Status |")
-    lines.append("|------|--------|")
-    # 1. Plan created
-    lines.append(f"| Plan created | {'✅ Pass' if has_plan else '❌ Fail'} |")
-    # 2. Files written
-    if has_code:
-        lines.append(f"| Files written | ✅ Pass ({file_count} files) |")
-    else:
-        lines.append("| Files written | ❌ Fail |")
-    # 3. Valid syntax (infer from review - if approved, syntax is valid)
     if success:
-        lines.append("| Valid syntax | ✅ Pass |")
-    elif has_code and review_feedback:
-        # Check if syntax error mentioned in feedback
-        if 'syntax' in review_feedback.lower() or 'error' in review_feedback.lower():
-            lines.append("| Valid syntax | ❌ Fail |")
-        else:
-            lines.append("| Valid syntax | ✅ Pass |")
-    elif has_code:
-        lines.append("| Valid syntax | ⬜ Pending |")
     else:
-        lines.append("| Valid syntax | ⬜ N/A |")
-    # 4. Follows patterns (infer from approval)
-    if success:
-        lines.append("| Follows patterns | ✅ Pass |")
-    elif has_code and review_feedback:
-        if 'pattern' in review_feedback.lower() or 'convention' in review_feedback.lower():
-            lines.append("| Follows patterns | ❌ Fail |")
-        else:
-            lines.append("| Follows patterns | ✅ Pass |")
-    elif has_code:
-        lines.append("| Follows patterns | ⬜ Pending |")
-    else:
-        lines.append("| Follows patterns | ⬜ N/A |")
-    # 5. Matches requirements (infer from approval)
-    if success:
-        lines.append("| Matches requirements | ✅ Pass |")
-    elif has_code and review_feedback:
-        if 'requirement' in review_feedback.lower() or 'missing' in review_feedback.lower():
-            lines.append("| Matches requirements | ❌ Fail |")
-        else:
-            lines.append("| Matches requirements | ✅ Pass |")
-    elif has_code:
-        lines.append("| Matches requirements | ⬜ Pending |")
-    else:
-        lines.append("| Matches requirements | ⬜ N/A |")
-    # 6. Code review
-    if review_feedback:
-        if success:
-            lines.append("| Code review | ✅ Approved |")
-        else:
-            lines.append("| Code review | ❌ Rejected |")
-    else:
-        lines.append("| Code review | ⬜ Pending |")
-    # Cost at bottom
-    lines.append(f"\n**Cost:** ${total_cost:.4f}")
     return "\n".join(lines)
 def format_plan_display(plan: str) -> str:
-    """Format plan as numbered implementation steps (7-8 max)."""
     if not plan:
         return ""
-    lines = ["## Implementation Plan\n"]
     plan_lines = plan.split('\n')
     steps = []
-    # Section headers to skip (template sections, not actual steps)
-    skip_patterns = [
-        'title:', 'description:', 'overview:', 'summary:', 'objective:',
-        'installation:', 'running', 'testing', 'use case:', 'example:',
-        'note:', 'warning:', 'important:', 'files:', 'dependencies:',
-        'prerequisites:', 'requirements:', 'setup:', 'configuration:',
-        'readme', 'docstring', 'documentation'
-    ]
-    def is_section_header(text: str) -> bool:
-        """Check if text is a section header, not an action step."""
-        text_lower = text.lower().strip()
-        # Skip if starts with common header patterns
-        for pattern in skip_patterns:
-            if text_lower.startswith(pattern):
-                return True
-        # Skip if it's just a label ending with colon and nothing else useful
-        if text_lower.endswith(':') and len(text_lower) < 30:
-            return True
-        return False
-    # Strategy 1: Look for existing numbered steps (1., 2., etc.)
     for line in plan_lines:
         stripped = line.strip()
-        # Match numbered items like "1.", "1)", "1:"
-        if stripped and len(stripped) > 2:
-            import re
             match = re.match(r'^(\d+)[.)\]:]\s*(.+)', stripped)
             if match:
                 step_text = match.group(2).strip()
-                # Skip section headers, file paths, and too short items
-                if (len(step_text) > 10 and
-                    not step_text.startswith('/') and
-                    not is_section_header(step_text)):
-                    steps.append(step_text)
-    # Strategy 2: Look for bullet points if no numbered steps found
-    if len(steps) < 3:
-        steps = []
-        for line in plan_lines:
-            stripped = line.strip()
-            # Match bullet points
-            if stripped.startswith(('-', '*', '•')) and len(stripped) > 5:
-                step_text = stripped.lstrip('-*• ').strip()
-                # Skip headers, file paths, section headers
-                if (len(step_text) > 15 and
-                    not step_text.startswith('#') and
-                    not step_text.startswith('/') and
-                    not is_section_header(step_text)):
-                    steps.append(step_text)
-    # Strategy 3: Extract key sentences with action verbs
-    if len(steps) < 3:
-        steps = []
-        action_verbs = ['create', 'add', 'implement', 'write', 'update', 'modify',
-                        'define', 'set up', 'configure', 'import', 'export', 'build']
-        for line in plan_lines:
-            stripped = line.strip()
-            stripped_lower = stripped.lower()
-            # Skip section headers
-            if is_section_header(stripped):
-                continue
-            for verb in action_verbs:
-                if verb in stripped_lower and len(stripped) > 20:
-                    # Clean up the line
-                    clean = stripped.lstrip('-*• 0123456789.):]').strip()
-                    if clean and clean not in steps and not is_section_header(clean):
-                        steps.append(clean)
-                        break
-    # Deduplicate and limit to 8 steps
-    seen = set()
-    unique_steps = []
-    for step in steps:
-        step_lower = step.lower()[:30]  # Compare first 30 chars
-        if step_lower not in seen:
-            seen.add(step_lower)
-            unique_steps.append(step)
-    steps = unique_steps[:8]
-    # Format as numbered list
-    if steps:
-        for i, step in enumerate(steps, 1):
-            # Truncate long steps
-            if len(step) > 80:
-                step = step[:77] + '...'
-            lines.append(f"{i}. {step}")
     else:
-        # Fallback: show first meaningful line
-        for line in plan_lines:
-            if line.strip() and not line.startswith('#') and not is_section_header(line):
-                lines.append(f"1. {line.strip()[:80]}")
-                break
-        if len(lines) == 1:
-            lines.append("1. Implementation plan created")
     lines.append("")
     return "\n".join(lines)
@@ -479,16 +378,9 @@ async def start():
     print("[CHAINLIT] on_chat_start triggered")
     await cl.Message(
-        content=f"# CodePilot - Autonomous AI Coding Agent\n\n"
-                f"**Version:** `{APP_VERSION}` | **Build:** `{BUILD_ID}`\n\n"
-                "I can help you write code, fix bugs, and implement features!\n\n"
-                "**How to use:**\n"
-                "1. Paste a **public GitHub URL** and I'll clone and analyze it\n"
-                "2. Tell me what you want to build or fix\n"
-                "3. Watch my agents (Explorer → Planner → Coder → Reviewer) work!\n\n"
-                "**Example:**\n"
-                "```\nhttps://github.com/pallets/flask add a health check endpoint example\n```\n\n"
-                "**Ready!** Paste a GitHub URL with your task."
     ).send()
     print("[CHAINLIT] Welcome message sent")
@@ -623,7 +515,7 @@ async def main(message: cl.Message):
             # 2. Then show code
             if result.get('code_changes'):
-                await cl.Message(content="## Generated Code\n\n" + format_code_output(result['code_changes'])).send()
             # 3. Finally show result table
             await cl.Message(content=format_final_result(result, total_cost)).send()
@@ -722,7 +614,7 @@ AVAILABLE TOOLS:
     # 2. Then show generated code
     if result.get('code_changes'):
-        await cl.Message(content="## Generated Code\n\n" + format_code_output(result['code_changes'])).send()
     # 3. Finally show result table
     await cl.Message(content=format_final_result(result, total_cost)).send()

 # ============================================================
 # STARTUP VERSION CHECK - Change this to detect if rebuild worked
 # ============================================================
+APP_VERSION = "3.7.0-clean-ui"
+BUILD_ID = "2026-01-14-v1"
 print("=" * 60)
 print(f"[STARTUP] CodePilot Chainlit App")
 print(f"[STARTUP] APP_VERSION: {APP_VERSION}")
     if not code_changes:
         return "No code changes."
+    output = ["## 💻 Generated Code\n"]
+    # Summary
+    file_count = len(code_changes)
+    total_lines = sum(len(content.split('\n')) for content in code_changes.values())
+    output.append(f"**{file_count} file{'s' if file_count != 1 else ''} • {total_lines} lines**\n")
     for file_path, content in code_changes.items():
         # Get just the filename for display
         filename = os.path.basename(file_path)
         # Use collapsible details/summary
         output.append(f"<details>")
+        output.append(f"<summary>📄 <strong>{filename}</strong> ({line_count} lines)</summary>")
         output.append(f"")
+        output.append(f"**Path:** `{file_path}`\n")
         output.append(f"```{lang}")
         output.append(content)
         output.append("```")
         if done:
             return "✅"
         elif active:
+            return "🔄"
         else:
+            return "⏸️"
     def get_activity(agent: str) -> str:
         """Get activity text for an agent."""
         if agent == 'Explorer':
             if status['explorer_done']:
+                return status.get('explorer_activity') or 'Complete'
             elif current == 'Explorer':
                 return status.get('explorer_activity') or 'Analyzing codebase...'
+            return ''
         elif agent == 'Planner':
             if status['planner_done']:
+                return 'Complete'
             elif current == 'Planner':
                 return status.get('planner_activity') or 'Creating plan...'
+            return ''
         elif agent == 'Coder':
             if status['coder_done']:
                 if activity:
                     return activity
                 files = status.get('files_written', 0)
+                return f'Complete ({files} files)' if files else 'Complete'
             elif current == 'Coder':
                 return status.get('coder_activity') or 'Writing code...'
+            return ''
         elif agent == 'Reviewer':
             if status['reviewer_done']:
                 if status['approved']:
+                    return '**Approved ✓**'
                 else:
+                    return '**Needs revision**'
             elif current == 'Reviewer':
                 return status.get('reviewer_activity') or 'Reviewing...'
+            return ''
+        return ''
     current = status['current_agent']
+    lines = []
+    # Progress bar
+    done_count = sum([status['explorer_done'], status['planner_done'],
+                      status['coder_done'], status['reviewer_done']])
+    progress_bar = "█" * done_count + "░" * (4 - done_count)
+    lines.append(f"**Progress:** {progress_bar} {done_count}/4 agents")
+    lines.append("")
+    # Agent status
+    lines.append(f"{icon(status['explorer_done'], current == 'Explorer')} **Explorer** {get_activity('Explorer')}")
+    lines.append(f"{icon(status['planner_done'], current == 'Planner')} **Planner** {get_activity('Planner')}")
+    lines.append(f"{icon(status['coder_done'], current == 'Coder')} **Coder** {get_activity('Coder')}")
+    lines.append(f"{icon(status['reviewer_done'], current == 'Reviewer')} **Reviewer** {get_activity('Reviewer')}")
+    lines.append(f"\n💰 **Cost:** ${total_cost:.4f}")
     return "\n".join(lines)
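
The progress-bar logic in this hunk can be tried in isolation. The following is a minimal sketch, assuming `status` is a plain dict carrying the four `*_done` flags; the helper name `render_progress` and the `AGENTS` tuple are hypothetical and are not part of `chainlit_app.py`.

```python
# Hypothetical standalone helper mirroring the progress-bar lines in the diff.
AGENTS = ("explorer", "planner", "coder", "reviewer")


def render_progress(status: dict) -> str:
    """Return a text bar with one filled cell per finished agent."""
    # Missing flags are treated as "not done" (an assumption of this sketch).
    done_count = sum(bool(status.get(f"{name}_done")) for name in AGENTS)
    bar = "█" * done_count + "░" * (len(AGENTS) - done_count)
    return f"**Progress:** {bar} {done_count}/{len(AGENTS)} agents"
```

Calling `render_progress({"explorer_done": True, "planner_done": True})` yields a bar with two filled and two empty cells. Unlike the code in the diff, which indexes `status[...]` directly, this sketch uses `.get()` so a partially populated dict does not raise `KeyError`.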
 def format_final_result(result: dict, total_cost: float) -> str:
     """Format the final result as a short status summary with cost."""
     success = result.get('success', False)
     code_changes = result.get('code_changes', {})
     file_count = len(code_changes) if code_changes else 0
     review_feedback = result.get('review_feedback', '')
+    lines = []
+    # Overall status
     if success:
+        lines.append("## ✅ Task Complete!\n")
+        lines.append(f"**Files changed:** {file_count}")
+        lines.append("**Review:** Approved")
+    elif code_changes:
+        lines.append("## ⚠️ Code Written (Needs Revision)\n")
+        lines.append(f"**Files changed:** {file_count}")
+        lines.append("**Review:** Needs changes")
+        if review_feedback:
+            lines.append(f"\n**Feedback:**\n{review_feedback}")
     else:
+        lines.append("## ❌ Task Failed\n")
+        error = result.get('error', 'Unknown error')
+        lines.append(f"**Error:** {error}")
+    lines.append(f"\n💰 **Cost:** ${total_cost:.4f}")
     return "\n".join(lines)
 def format_plan_display(plan: str) -> str:
+    """Format plan cleanly with a simple summary."""
     if not plan:
         return ""
+    lines = ["## 📋 Implementation Plan\n"]
+    # Simple approach: just show the plan in a clean format
+    # Extract key steps if numbered, otherwise show abbreviated version
     plan_lines = plan.split('\n')
     steps = []
+    import re
     for line in plan_lines:
         stripped = line.strip()
+        # Match numbered items like "1.", "2)", "3:", etc.
+        if stripped:
             match = re.match(r'^(\d+)[.)\]:]\s*(.+)', stripped)
             if match:
+                step_num = match.group(1)
                 step_text = match.group(2).strip()
+                if len(step_text) > 10 and not step_text.startswith('/'):
+                    # Truncate long steps
+                    if len(step_text) > 100:
+                        step_text = step_text[:97] + '...'
+                    steps.append(f"{step_num}. {step_text}")
+    if steps and len(steps) <= 10:
+        # Show numbered steps if we found them
+        lines.extend(steps)
     else:
+        # Otherwise just show first few lines of the plan
+        preview_lines = [ln.strip() for ln in plan_lines[:8] if ln.strip() and not ln.strip().startswith('#')]
+        if preview_lines:
+            lines.append('\n'.join(preview_lines[:5]))
+            if len(preview_lines) > 5:
+                lines.append("\n*...plan continues...*")
+        else:
+            lines.append("Plan created successfully")
     lines.append("")
     return "\n".join(lines)
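
The step-extraction rules in `format_plan_display` (numbered-item regex, minimum length, path filter, truncation) can be checked on their own. Below is a minimal sketch that reuses the regex from the diff; the function name `extract_steps` and the `max_len` parameter are hypothetical and introduced only for illustration.

```python
import re

# Same pattern as in the diff: a number followed by ".", ")", "]" or ":".
STEP_PATTERN = re.compile(r'^(\d+)[.)\]:]\s*(.+)')


def extract_steps(plan: str, max_len: int = 100) -> list[str]:
    """Collect numbered plan steps, skipping short or path-like entries."""
    steps = []
    for raw in plan.split("\n"):
        match = STEP_PATTERN.match(raw.strip())
        if not match:
            continue
        num, text = match.group(1), match.group(2).strip()
        # Drop entries that are too short or look like file paths.
        if len(text) <= 10 or text.startswith("/"):
            continue
        # Truncate long steps so the result is at most max_len characters.
        if len(text) > max_len:
            text = text[: max_len - 3] + "..."
        steps.append(f"{num}. {text}")
    return steps
```

Compiling the pattern once at module level also avoids the function-local `import re` seen in the diff, though that is a style preference rather than a behavioral change.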
     print("[CHAINLIT] on_chat_start triggered")
     await cl.Message(
+        content="👋 **CodePilot ready!**\n\n"
+                "Paste a GitHub URL + your task to get started.\n\n"
+                "*The welcome screen above explains everything you need to know.*"
     ).send()
     print("[CHAINLIT] Welcome message sent")
             # 2. Then show code
             if result.get('code_changes'):
+                await cl.Message(content=format_code_output(result['code_changes'])).send()
             # 3. Finally show result summary
             await cl.Message(content=format_final_result(result, total_cost)).send()
     # 2. Then show generated code
     if result.get('code_changes'):
+        await cl.Message(content=format_code_output(result['code_changes'])).send()
     # 3. Finally show result summary
     await cl.Message(content=format_final_result(result, total_cost)).send()