---
title: Repomind API
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# 🤖 Autonomous Code Review & Bug-Fix Agent

> **ML Engineering Project:** LLM Agents · SWE-bench · DeepSeek-Coder · AST Parsing · Conformal Prediction · RL Fine-Tuning

An autonomous agent that reads GitHub issues, localises the relevant source files, generates minimal unified-diff patches, and self-corrects by reading its own failing test output. The target is a **30–42% resolve rate on SWE-bench Lite**.

---

## 🎯 Target Benchmarks

| Metric | Baseline | Ours |
|--------|----------|------|
| SWE-bench Lite Resolved | ~10–18% (GPT-4o naive) | **30–42%** |
| File Localisation Recall@5 | ~41% | **74%+** |
| Avg Attempts to Fix | n/a | **< 2.4** |

Compare: Devin **13.86%** · SWE-agent **12.47%** (published figures, reported on SWE-bench or a sample of it rather than on SWE-bench Lite, so only roughly comparable)

---

## 🏗️ Architecture

```
GitHub Issue
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  Stage 1: File Localisation (Phase 3)               │
│                                                     │
│  BM25 (top-20) ──┐                                  │
│  Embeddings ─────┼──▶ RRF Fusion ──▶ top-20 cands   │
│  PPR Graph ──────┘                                  │
│                          │                          │
│                          ▼                          │
│                DeBERTa Cross-Encoder                │
│               Re-rank to top-5 files                │
│                                                     │
│  Conformal Prediction: 90% coverage guarantee       │
└─────────────────────────────────────────────────────┘
     │
     ▼  top-5 files (calibrated confidence scores)
┌─────────────────────────────────────────────────────┐
│  Stage 2: Agentic Reflection Loop (Phase 4)         │
│                                                     │
│  Attempt 1: GPT-4o / DeepSeek-Coder → patch         │
│       └──▶ git apply → pytest                       │
│             ├─ PASS ✓ → done                        │
│             └─ FAIL ✗ → categorise failure          │
│                          └──▶ reflection prompt     │
│  Attempt 2: (issue + error context) → new patch     │
│       └──▶ git apply → pytest                       │
│             ├─ PASS ✓ → done                        │
│             └─ FAIL ✗ → (max 3 attempts)            │
│                                                     │
│  All attempts logged as JSONL → Phase 7 fine-tune   │
└─────────────────────────────────────────────────────┘
```
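
The Stage 2 loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in `agent/reflection_agent.py` (which the project structure describes as a LangGraph graph): `run_reflection_loop`, `Attempt`, and `Trajectory` are hypothetical names, and the patch generator and the test runner are passed in as plain callables so the control flow can be read, and exercised, without an LLM or a sandbox. The returned trajectory keeps every attempt, which is the kind of record the diagram says is logged as JSONL.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    number: int
    patch: str
    passed: bool
    test_output: str


@dataclass
class Trajectory:
    issue: str
    attempts: List[Attempt] = field(default_factory=list)

    @property
    def resolved(self) -> bool:
        # Resolved only if the most recent attempt passed its tests.
        return bool(self.attempts) and self.attempts[-1].passed


def run_reflection_loop(
    issue: str,
    files: List[str],
    generate_patch: Callable[[str, List[str], Optional[str]], str],
    apply_and_test: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Trajectory:
    """Generate a patch, test it, and retry with the failure output as context."""
    trajectory = Trajectory(issue=issue)
    error_context: Optional[str] = None  # None on the first attempt
    for number in range(1, max_attempts + 1):
        patch = generate_patch(issue, files, error_context)
        passed, output = apply_and_test(patch)
        trajectory.attempts.append(Attempt(number, patch, passed, output))
        if passed:
            break
        # Feed the failing test output into the next (reflection) prompt.
        error_context = output
    return trajectory
```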

---

## 📦 Project Structure

```
autonomous-code-agent/
├── agent/                        # Phase 4: Agentic Reflection Loop
│   ├── reflection_agent.py       # LangGraph: localise→generate→apply+test
│   ├── tools.py                  # read_file, write_patch, run_tests, git_diff
│   ├── failure_categoriser.py    # 9-category failure taxonomy
│   ├── trajectory_logger.py      # JSONL logger + fine-tuning exporter
│   └── naive_baseline.py         # GPT-4o zero-shot baseline
│
├── ast_parser/                   # Phase 2: AST-Aware Code Understanding
│   ├── python_parser.py          # Tree-sitter parser (stdlib ast fallback)
│   ├── dependency_graph.py       # Personalized PageRank over import graph
│   └── cache.py                  # SHA-keyed AST cache (diskcache)
│
├── localisation/                 # Phase 3: Two-Stage File Localisation
│   ├── bm25_retriever.py         # BM25 + CamelCase tokeniser + path boost
│   ├── embedding_retriever.py    # text-embedding-3-small + FAISS
│   ├── rrf_fusion.py             # Reciprocal Rank Fusion (BM25+embed+PPR)
│   ├── deberta_ranker.py         # DeBERTa-v3-small cross-encoder
│   └── pipeline.py               # End-to-end orchestrator + recall@k eval
│
├── uncertainty/                  # Phase 6: Conformal Prediction
│   ├── conformal_predictor.py    # CalibrationStore + ConformalPredictor + RAPS
│   ├── temperature_scaling.py    # Temperature scaling (ECE < 0.05 target)
│   └── uncertainty_pipeline.py   # 90% coverage guarantee wrapper
│
├── fine_tuning/                  # Phase 7: DeepSeek-Coder QLoRA
│   ├── dataset_builder.py        # Trajectory → ChatML/Alpaca instruction pairs
│   ├── qlora_config.py           # 4-bit NF4 + LoRA (r=16, alpha=32)
│   ├── train.py                  # SFTTrainer entry point (--dry-run OK)
│   └── evaluator.py              # EvaluationReport + AblationTableBuilder
│
├── api/                          # Phase 5: FastAPI Backend
│   ├── main.py                   # REST + WebSocket endpoints + CORS
│   ├── models.py                 # Pydantic request/response/event types
│   ├── tasks.py                  # Async agent execution + streaming events
│   └── websocket_manager.py      # Per-task pub/sub WebSocket manager
│
├── telemetry/                    # Phase 8: Observability
│   ├── metrics.py                # Prometheus metrics + USD CostTracker
│   ├── structured_logging.py     # structlog JSON + RequestContext binder
│   └── rate_limiter.py           # Sliding window + QueueDepthMonitor
│
├── experiments/                  # Phase 9: Benchmarking
│   └── benchmark.py              # BenchmarkRunner + ablation table
│
├── frontend/                     # Phase 5: Next.js UI
│   └── src/
│       ├── components/           # Header, MetricsBar, Submit, Execution, Results
│       └── lib/                  # Zustand store (WS handler) + TypeScript types
│
├── sandbox/executor.py           # Phase 1: Secure Docker Sandbox
├── swe_bench/loader.py           # Phase 1: SWE-bench Lite Dataset Loader
├── configs/settings.py           # Pydantic-Settings singleton
├── tests/                        # 244 tests across all 9 phases
├── docker-compose.yml            # 4 services: API + Frontend + Redis + Sandbox
└── scripts/start_api.sh          # FastAPI dev server
```

---

## 🚀 Quick Start

### 1. Install

```bash
git clone https://github.com/your-username/autonomous-code-agent
cd autonomous-code-agent
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
```

### 2. Configure

```bash
cp .env.example .env
# Then edit .env and set OPENAI_API_KEY=sk-...
```

### 3. Run tests (no API key needed)

```bash
pytest tests/ -q   # 244 tests, all pure Python: no GPU, no internet
```

### 4. Start the live demo

```bash
# Terminal 1: FastAPI backend
bash scripts/start_api.sh      # → http://localhost:8000/docs

# Terminal 2: Next.js frontend
cd frontend && npm run dev     # → http://localhost:3000
```

### 5. Docker Compose (production)

```bash
docker-compose up --build
```

---

## 🔬 Key ML Techniques

### Two-Stage Localisation (Recall@5: 41% → 74%)

**Stage 1: Broad retrieval.**
BM25 with CamelCase/snake_case tokenisation and a 2× weight on path tokens is fused,
via Reciprocal Rank Fusion, with dense embeddings (text-embedding-3-small + FAISS)
and Personalized PageRank relevance propagated over the AST dependency graph.

**Stage 2: Precise re-ranking.**
A DeBERTa-v3-small cross-encoder scores each (issue, file_summary) pair jointly,
so the issue and the file summary interact inside the model instead of being
scored independently as in Stage 1.

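
The Stage 1 fusion step can be illustrated with a weighted form of Reciprocal Rank Fusion. The sketch below is an assumption about how `localisation/rrf_fusion.py` might combine the three rankings, not a copy of it: `weighted_rrf` is a hypothetical name, `k = 60` is the constant commonly used for RRF, and the weights in the usage note are the `RRF_ALPHA_*` defaults listed under Key Configuration.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
    top_n: int = 20,
) -> List[str]:
    """Fuse several ranked file lists into one list.

    Each retriever contributes weight / (k + rank) for every file it returns,
    where rank starts at 1. Files missing from a ranking contribute nothing.
    """
    scores: Dict[str, float] = defaultdict(float)
    for retriever, ranked_files in rankings.items():
        weight = weights.get(retriever, 0.0)
        for rank, path in enumerate(ranked_files, start=1):
            scores[path] += weight / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda path: (-scores[path], path))[:top_n]
```

For example, calling `weighted_rrf({"bm25": [...], "embed": [...], "ppr": [...]}, {"bm25": 0.4, "embed": 0.4, "ppr": 0.2})` favours files that appear near the top of several rankings over files that only one retriever ranks highly.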
### Conformal Prediction (Provable 90% Coverage)

```
s(x, y) = 1 - rrf_score(y | x)                     # non-conformity score
q_hat   = Quantile(S_cal, ceil((n+1)(1-α)) / n)    # finite-sample corrected
C(x)    = {y : s(x, y) ≤ q_hat}                    # prediction set

Guarantee: P(gold_file ∈ C(x)) ≥ 1 - α = 90%       (marginal coverage)
```

Here `S_cal` is the set of non-conformity scores of the gold files on `n` held-out calibration issues and α = 0.1. The guarantee is marginal and assumes that calibration and test issues are exchangeable.

Token budget is reduced by ~60–80% on confident instances while the coverage guarantee is maintained.

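
The formulas above translate directly into code. The following is a minimal split-conformal sketch, assuming relevance scores normalised to [0, 1]; it is not the RAPS variant that `uncertainty/conformal_predictor.py` is described as containing, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def conformal_quantile(cal_scores: List[float], alpha: float = 0.1) -> float:
    """Return q_hat, the ceil((n+1)(1-alpha))-th smallest calibration score.

    cal_scores holds s(x, y_gold) = 1 - score(gold file) for each calibration issue.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the full set is safe.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(file_scores: Dict[str, float], q_hat: float) -> List[str]:
    """Keep every candidate file whose non-conformity score is at most q_hat."""
    return [path for path, score in file_scores.items() if 1.0 - score <= q_hat]
```

When the retriever is confident, few files have a non-conformity score below `q_hat`, so the prediction set is small and fewer files need to be sent to the LLM; this is the mechanism behind the token-budget reduction mentioned above.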
### QLoRA Fine-Tuning (DeepSeek-Coder-7B)

Three types of training pair are extracted from the Phase 4 trajectories:

1. **Positive:** `(issue + files)` → correct patch
2. **Negative-with-context:** `(issue + error_log)` → understand failure patterns
3. **Reflection:** `(issue + attempt_k_failure)` → `correct_patch_{k+1}` (the most valuable type)

Training setup:

- 4-bit NF4 quantisation
- LoRA r=16, α=32, applied to all attention and MLP layers
- 3 epochs, cosine LR schedule, effective batch size 16
- ~$40–60 on a RunPod A100

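
To make the three pair types concrete, here is a small sketch of how one trajectory could be turned into training examples. It is illustrative only: `build_pairs`, the attempt keys (`patch`, `passed`, `error_log`, `failure_category`), and the choice of the failure category as the target of a negative-with-context pair are all assumptions, since the actual format used by `fine_tuning/dataset_builder.py` is not shown here.

```python
from typing import Any, Dict, List


def build_pairs(issue: str, files: str, attempts: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn one trajectory (attempts in order) into prompt/completion pairs."""
    pairs: List[Dict[str, str]] = []
    for k, attempt in enumerate(attempts):
        if attempt["passed"]:
            # Positive: issue + files -> the patch that passed the tests.
            pairs.append({
                "type": "positive",
                "prompt": f"{issue}\n\n{files}",
                "completion": attempt["patch"],
            })
            continue

        # Negative-with-context: issue + error log -> failure label (assumed target).
        pairs.append({
            "type": "negative_with_context",
            "prompt": f"{issue}\n\n{attempt['error_log']}",
            "completion": attempt.get("failure_category", "unknown"),
        })

        # Reflection: failed attempt k -> the correct patch from attempt k+1.
        following = attempts[k + 1] if k + 1 < len(attempts) else None
        if following is not None and following["passed"]:
            pairs.append({
                "type": "reflection",
                "prompt": f"{issue}\n\n{attempt['patch']}\n\n{attempt['error_log']}",
                "completion": following["patch"],
            })
    return pairs
```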

---

## 📊 Ablation Results

| System Variant | SWE-bench % Resolved | Recall@5 |
|----------------|----------------------|----------|
| SWE-agent (published) | 12.47% | n/a |
| Devin (published) | 13.86% | n/a |
| Naive GPT-4o baseline | ~10–18% | 41% |
| + Graph-aware two-stage localisation | ~25–28% | **74%** |
| + Reflection loop (max 3 attempts) | ~30–35% | 74% |
| + DeepSeek-Coder fine-tuned | **~38–44%** | 74% |

---

## 🧪 Testing

```bash
# All 244 tests
pytest tests/ -v

# By phase
pytest tests/test_phase1_sandbox.py                # Sandbox + baseline (24 tests)
pytest tests/test_phase2_ast.py                    # AST parser + PPR graph (40 tests)
pytest tests/test_phase3_localisation.py           # BM25/embed/RRF/DeBERTa (55 tests)
pytest tests/test_phase4_reflection.py             # Tools, agent, trajectory (36 tests)
pytest tests/test_phase6_uncertainty.py            # Conformal prediction (33 tests)
pytest tests/test_phase7_finetuning.py             # Dataset + QLoRA config (37 tests)
pytest tests/test_phase8_9_telemetry_benchmark.py  # Metrics + ablation (41 tests)
```

---

## ⚙️ Key Configuration

```env
OPENAI_API_KEY=sk-...              # Required for embeddings + GPT-4o
LLM_MODEL=gpt-4o                   # or deepseek-ai/deepseek-coder-7b-instruct-v1.5
MAX_ATTEMPTS=3                     # Reflection loop budget
RETRIEVAL_TOP_K=5                  # Files sent to LLM
RRF_ALPHA_BM25=0.4                 # BM25 weight in RRF fusion
RRF_ALPHA_EMBED=0.4                # Embedding weight
RRF_ALPHA_PPR=0.2                  # Graph PPR weight
REDIS_URL=redis://localhost:6379/0
```

---

## 📡 API Reference

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/solve` | POST | Submit issue → `task_id` |
| `/api/task/{id}` | GET | Poll status + results |
| `/ws/{id}` | WebSocket | Stream execution events |
| `/api/metrics` | GET | Aggregate metrics dashboard |
| `/metrics` | GET | Prometheus scrape endpoint |

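
A client can combine the first two endpoints into a submit-and-poll loop. The sketch below uses only the standard library and is hedged accordingly: the endpoint paths and the `task_id` field come from the table above, while the request field `issue`, the response field `status`, and its terminal values `done` and `error` are assumptions, so check `api/models.py` for the real schema. The HTTP call is injectable so the loop can be tried without a running server.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

BASE_URL = "http://localhost:8000"  # dev server address from the Quick Start


def http_json(method: str, path: str, body: Optional[dict] = None) -> Dict[str, Any]:
    """Send a JSON request to the API and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def solve_and_wait(
    issue_text: str,
    call: Callable[..., Dict[str, Any]] = http_json,
    poll_seconds: float = 2.0,
    max_polls: int = 150,
) -> Dict[str, Any]:
    """Submit an issue, then poll the task endpoint until it finishes."""
    # The "issue" request field is an assumed name.
    task = call("POST", "/api/solve", {"issue": issue_text})
    task_id = task["task_id"]
    for _ in range(max_polls):
        result = call("GET", f"/api/task/{task_id}")
        # "status" and its terminal values are assumed names.
        if result.get("status") in ("done", "error"):
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```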

**WebSocket events:** `log` · `localised_files` · `patch` · `test_result` · `reflection` · `done` · `error`

---

## 🛡️ Sandbox Security

- `--network=none`: no outbound network
- Memory: 2 GB · CPU: 2 cores · Timeout: 60s
- Command whitelist: `git`, `pytest`, `python` only
- `--read-only` filesystem, `--cap-drop ALL`

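
The constraints above map onto standard `docker run` flags. The sketch below shows one way to assemble such a command; it is an illustration, not the contents of `sandbox/executor.py`. The function names and the image name in the usage are hypothetical, and the 60-second limit is applied through `subprocess.run`, which only stops waiting on the Docker client, so a production executor would also need to stop the container itself.

```python
import shlex
import subprocess
from typing import List

ALLOWED_COMMANDS = {"git", "pytest", "python"}  # whitelist from the list above
TIMEOUT_SECONDS = 60


def build_docker_argv(image: str, command: str) -> List[str]:
    """Build a locked-down `docker run` argument list for one whitelisted command."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command!r}")
    return [
        "docker", "run", "--rm",
        "--network=none",     # no outbound network
        "--memory=2g",        # 2 GB memory limit
        "--cpus=2",           # 2 CPU cores
        "--read-only",        # read-only root filesystem
        "--cap-drop", "ALL",  # drop all Linux capabilities
        image,
        *parts,
    ]


def run_sandboxed(image: str, command: str) -> subprocess.CompletedProcess:
    """Run the command in the sandbox; raises subprocess.TimeoutExpired after 60 s."""
    return subprocess.run(
        build_docker_argv(image, command),
        capture_output=True,
        text=True,
        timeout=TIMEOUT_SECONDS,
    )
```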

---

## 📚 References

- [SWE-bench](https://arxiv.org/abs/2310.06770) (Jimenez et al., 2023)
- [Conformal Prediction](https://arxiv.org/abs/2107.07511) (Angelopoulos & Bates, 2021)
- [RAPS](https://arxiv.org/abs/2009.14193) (Angelopoulos et al., 2021)
- [Temperature Scaling](https://arxiv.org/abs/1706.04599) (Guo et al., 2017)
- [QLoRA](https://arxiv.org/abs/2305.14314) (Dettmers et al., 2023)
- [DeepSeek-Coder](https://github.com/deepseek-ai/DeepSeek-Coder)
- [LangGraph](https://github.com/langchain-ai/langgraph)

---

## 📄 License

MIT