# 🤖 Autonomous Python Coding Agent

> **A production-grade, self-healing multi-agent pipeline that doesn't just generate Python code — it autonomously writes, validates, tests, secures, benchmarks, and reflects on its own output before shipping.**

[![Python](https://img.shields.io/badge/Python-3.11-blue?style=flat-square&logo=python)](https://python.org)
[![LangGraph](https://img.shields.io/badge/LangGraph-0.2.0-green?style=flat-square)](https://github.com/langchain-ai/langgraph)
[![Groq](https://img.shields.io/badge/Groq-Llama%203.1-orange?style=flat-square)](https://groq.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5.0-purple?style=flat-square)](https://chromadb.com)
[![Streamlit](https://img.shields.io/badge/Streamlit-1.35-red?style=flat-square)](https://streamlit.io)
[![License](https://img.shields.io/badge/License-MIT-lightgrey?style=flat-square)](LICENSE)
[![Live Demo](https://img.shields.io/badge/🤗%20Live%20Demo-HuggingFace-yellow?style=flat-square)](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

---

## 🚀 Live Demo

**[▶ Try it on Hugging Face Spaces](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)**

---

## 📸 Demo

![Agent Demo](demo.gif)

---

## 🔥 What makes this different from just using ChatGPT?

| Feature | ChatGPT / Basic Agent | This Agent |
|---|---|---|
| Code generation | ✅ | ✅ |
| Syntax validation | ❌ Run and hope | ✅ AST parse before running |
| Test cases | ❌ Manual | ✅ Auto-generated by agent |
| Stress testing | ❌ | ✅ 500+ random inputs via Hypothesis |
| Memory | ❌ Stateless | ✅ ChromaDB learns from past bugs |
| Security audit | ❌ | ✅ Detects eval, exec, hardcoded keys |
| Performance check | ❌ | ✅ Benchmarks 1000 runs, rejects slow code |
| Self-review | ❌ | ✅ Agent scores own confidence 1-10 |
| Self-healing | ❌ | ✅ Loops back and fixes failures automatically |
| Separate retry counters | ❌ | ✅ Per-node counters prevent pipeline blockage |

---

## 📊 Key Metrics

| Metric | Value |
|---|---|
| Pipeline nodes | 13 |
| Verification layers | 5 (AST → Tests → Hypothesis → Security → Complexity) |
| Max retries (debugger) | 3 |
| Max retries (security, complexity) | 2 each — independent counters |
| Hypothesis test cases | 500+ random inputs per run |
| Benchmark iterations | 1,000 runs |
| Performance threshold | < 5ms per call |
| Memory backend | ChromaDB vector similarity search |
| LLM | Llama 3.1 8B Instant via Groq |
| Avg pipeline runtime | ~20–40 seconds |
| Lines of code | ~600 across 5 files |
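
The benchmark rows above (1,000 runs, under 5 ms per call) can be illustrated with a minimal sketch built on the standard-library `timeit` module. The function names `mean_call_ms` and `passes_benchmark` are hypothetical and not taken from `nodes.py`. The real node may time the code in a subprocess instead.

```python
import timeit
from typing import Callable


def mean_call_ms(func: Callable[[], object], runs: int = 1000) -> float:
    """Average wall-clock time of one call to func, in milliseconds."""
    total_seconds = timeit.timeit(func, number=runs)
    return total_seconds / runs * 1000.0


def passes_benchmark(
    func: Callable[[], object],
    runs: int = 1000,
    threshold_ms: float = 5.0,
) -> bool:
    """True when the mean call time is under the threshold (assumed 5 ms)."""
    return mean_call_ms(func, runs) < threshold_ms


if __name__ == "__main__":
    # A cheap call should pass comfortably.
    print(passes_benchmark(lambda: sum(range(100))))
```

Averaging over many runs smooths out scheduler noise, so a single slow call is less likely to cause a false rejection.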

---

## 🏗️ Architecture — 13-Node Pipeline

```
User Input (Python Task)
         │
         ▼
    ┌─────────┐
    │ Planner │ ── Breaks task into blueprint
    └────┬────┘
         │
         ▼
    ┌───────┐
    │ Coder │ ── Writes code using plan + ChromaDB memory
    └────┬──┘
         │
         ▼
    ┌───────────────┐
    │ AST Validator │ ── Syntax + hallucinated imports + type hints
    └──────┬────────┘    (no execution needed — milliseconds)
           │
      Pass │   Fail ──► Debugger ──► back to AST
           ▼
┌────────────────┐
│ Test Generator │ ── Auto-generates pytest-style test cases
└───────┬────────┘
        │
        ▼
    ┌────────┐
    │ Tester │ ── Runs code + generated tests in sandbox
    └───┬────┘
        │
   Pass │   Fail ──► Debugger (max 3 retries)
        ▼
┌────────────┐
│ Hypothesis │ ── 500+ random inputs, property-based testing
└─────┬──────┘    (never blocks pipeline — informational only)
      │
      ▼
┌───────────┐
│ Benchmark │ ── Runs 1000x, rejects if > 5ms/call
└─────┬─────┘
      │
      ▼
┌──────────┐
│ Security │ ── Detects eval/exec/hardcoded secrets
└─────┬────┘    (own retry counter — max 2)
      │
      ▼
┌────────────┐
│ Complexity │ ── Line count + nesting depth + LLM score/10
└──────┬─────┘    (own retry counter — max 2)
       │
       ▼
┌─────────────────┐
│ Self Reflection │ ── Agent scores own confidence 1-10
└────────┬────────┘    Rewrites if confidence < 7
         │
         ▼
    ┌──────────┐
    │ Reviewer │ ── Polishes + docstrings + type hints
    └─────┬────┘
          │
          ▼
    ┌──────────┐
    │Explainer │ ── Writes human-readable explanation
    └─────┬────┘
          │
          ▼
       OUTPUT
  Final Code + Explanation
```
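
As an illustration of the Security node in the diagram above, the sketch below flags `eval`/`exec` calls and suspicious string assignments using only the `ast` module. It is an assumption about how such a check could look, not the project's actual code. The name `audit` and the `SECRET_HINTS` heuristic are hypothetical.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}
# Hypothetical heuristic: variable names that suggest a credential.
SECRET_HINTS = ("key", "secret", "token", "password")


def audit(source: str) -> list[str]:
    """Return findings for eval/exec calls and hardcoded secret-like strings."""
    found: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DANGEROUS_CALLS
        ):
            found.append((node.lineno, "call to " + node.func.id + "()"))
        elif (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
            and node.value.value
        ):
            for target in node.targets:
                if isinstance(target, ast.Name) and any(
                    hint in target.id.lower() for hint in SECRET_HINTS
                ):
                    found.append(
                        (node.lineno, "hardcoded value assigned to " + target.id)
                    )
    # ast.walk is breadth-first, so sort to report findings in line order.
    return ["line " + str(line) + ": " + msg for line, msg in sorted(found)]
```

A name-based heuristic like this will miss secrets stored under neutral names and may flag harmless constants, so it is best treated as a first filter.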

---

## 📁 Project Structure

```
autonomous-coding-agent/
├── app.py              ← Streamlit UI
├── main.py             ← Graph builder + entry point
├── state.py            ← Shared TypedDict state (whiteboard)
├── nodes.py            ← All 13 node functions + LLM + ChromaDB
├── edges.py            ← All 7 conditional route functions
├── requirements.txt    ← Dependencies
└── README.md
```

---

## ⚡ Run Locally

### Prerequisites
- Python 3.11+
- Groq API key — get free at [console.groq.com](https://console.groq.com)

### Step 1 — Clone the repo
```bash
git clone https://github.com/krishpatel/autonomous-coding-agent.git
cd autonomous-coding-agent
```

### Step 2 — Create virtual environment
```bash
python -m venv venv

# Mac/Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```

### Step 3 — Install dependencies
```bash
pip install -r requirements.txt
```

### Step 4 — Set your API key
```bash
# Mac/Linux
export GROQ_API_KEY=your_groq_api_key_here

# Windows
set GROQ_API_KEY=your_groq_api_key_here
```

Or create a `.env` file:
```bash
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
```

### Step 5 — Run CLI (no UI)
```bash
python main.py
```

### Step 6 — Run Streamlit UI
```bash
streamlit run app.py
```

Open [http://localhost:8501](http://localhost:8501) in your browser.

---

## 🐳 Run with Docker (optional)

```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501"]
```

```bash
# Build
docker build -t coding-agent .

# Run
docker run -e GROQ_API_KEY=your_key -p 8501:8501 coding-agent
```

---

## 🌐 Deploy to Hugging Face Spaces

```bash
# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Create space and push
huggingface-cli repo create autonomous-coding-agent --type space --space_sdk streamlit
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-coding-agent
git push hf main
```

Then add your secret in HF Spaces Settings:
```
GROQ_API_KEY = your_key_here
```

---

## 🛠️ Tech Stack

```
LangGraph    — Stateful multi-agent graph orchestration
Groq API     — LLM inference (Llama 3.1 8B Instant)
ChromaDB     — Vector database for bug fix memory
Hypothesis   — Property-based stress testing
Streamlit    — Production UI
subprocess   — Sandboxed isolated code execution
ast          — Static code analysis without execution
hashlib      — Deterministic ChromaDB IDs
importlib    — Real-time import hallucination detection
```

---

## 💡 Key Engineering Decisions

### Why LangGraph over plain LangChain?
LangGraph supports **cyclic workflows**: when tests fail, the agent loops back through the debugger and restarts verification from the AST validator. A linear LangChain chain has no clean way to express that loop.

### Why AST validation before running?
Spawning a subprocess for code that cannot even parse wastes time. AST parsing catches syntax errors in **milliseconds** without executing anything, like a proofreader checking spelling before printing.
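
A minimal sketch of this kind of pre-execution check follows, covering syntax errors and imports that do not resolve. The helper names `validate_source` and `module_exists` are hypothetical. The type-hint check mentioned in the architecture diagram is not shown.

```python
import ast
import importlib.util


def module_exists(name: str) -> bool:
    """True if the top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def validate_source(source: str) -> list[str]:
    """Return a list of problems found without executing the code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return ["syntax error on line " + str(exc.lineno) + ": " + str(exc.msg)]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules = [node.module]
        else:
            continue
        for name in modules:
            # Only the top-level package is checked, so nothing gets imported.
            if not module_exists(name.split(".")[0]):
                problems.append("unknown import: " + name)
    return problems
```

An empty list means the code parsed and every absolute import resolved in the current environment. A non-empty list can be handed to the debugger as its error context.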

### Why Hypothesis for testing?
Hand-written tests only cover the cases you think of. Hypothesis **auto-generates 500+ random inputs** and verifies properties that should always hold, which catches edge cases a hand-written suite tends to miss.

### Why separate retry counters per node?
With one shared counter, three security failures could exhaust the retry budget and end the run before the debugger had used any of its own attempts. Giving security and complexity their own counters lets each node use up its budget independently without blocking the others.
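
The idea can be sketched as two routing functions that each read only their own counter. The state keys `retries` and `security_retries` appear in this README. The function names, the returned node labels, and the behaviour once a budget is spent are assumptions for illustration.

```python
from typing import TypedDict

MAX_DEBUG_RETRIES = 3     # debugger budget from the metrics table
MAX_SECURITY_RETRIES = 2  # security budget from the metrics table


class AgentState(TypedDict):
    retries: int
    security_retries: int
    tests_passed: bool
    security_passed: bool


def route_after_tester(state: AgentState) -> str:
    """Decide the next node after the tester, using only the debugger budget."""
    if state["tests_passed"]:
        return "hypothesis"
    if state["retries"] < MAX_DEBUG_RETRIES:
        return "debugger"
    return "give_up"  # hypothetical terminal label


def route_after_security(state: AgentState) -> str:
    """Decide the next node after security, using only the security budget."""
    if state["security_passed"]:
        return "complexity"
    if state["security_retries"] < MAX_SECURITY_RETRIES:
        return "debugger"  # assumed fix-up target
    return "complexity"  # assumed: budget spent, continue without blocking
```

Because neither function touches the other's counter, a spent security budget leaves the debugger's three attempts intact.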

### Why hashlib instead of Python's hash()?
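Python's built-in `hash()` salts string hashes with a **random per-process seed** for security (unless `PYTHONHASHSEED` is set). The same error message therefore produced a different ChromaDB ID in every session, and the agent could never retrieve past fixes. `hashlib.md5` returns the same digest in every session.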
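
A small sketch of a deterministic ID helper along these lines is shown below. The function name `error_id` is hypothetical, and the 8-character truncation follows the fix described under "Real Bugs Found and Fixed".

```python
import hashlib


def error_id(error_message: str) -> str:
    """Stable 8-character ID derived from the error text."""
    digest = hashlib.md5(error_message.encode("utf-8")).hexdigest()
    return digest[:8]


if __name__ == "__main__":
    # The same message yields the same ID in every interpreter session.
    print(error_id("hello"))  # 5d41402a
```

MD5 is used here only as a stable fingerprint, not for security. Truncating to 8 hex characters keeps IDs short at the cost of a small collision risk.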

### Why combined Reviewer + Explainer?
Two separate LLM calls for polishing and explaining wasted ~8 seconds. One combined call with structured output (`FINAL_CODE:` / `EXPLANATION:`) saves an entire API round trip.
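
Splitting such a combined response could look like the sketch below. Only the `FINAL_CODE:` and `EXPLANATION:` markers come from this README. The function name `split_review` and the error handling are assumptions.

```python
def split_review(response: str) -> tuple[str, str]:
    """Split one LLM response into (final_code, explanation)."""
    _, code_marker, rest = response.partition("FINAL_CODE:")
    code, text_marker, explanation = rest.partition("EXPLANATION:")
    # partition returns an empty separator when the marker is absent.
    if not code_marker or not text_marker:
        raise ValueError("expected FINAL_CODE: followed by EXPLANATION:")
    return code.strip(), explanation.strip()
```

Raising on a malformed response lets the caller retry the LLM call instead of silently shipping an empty explanation.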

---

## 🐛 Real Bugs Found and Fixed

**Bug 1 — False Positive in Tester**
`returncode == 0` doesn't mean the function was called. A file that only defines functions exits successfully but prints nothing. Fixed by checking that `stdout` is not empty after a successful run.
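
A sketch of that check, assuming the tester runs the candidate file with `subprocess.run`, is given below. The function name `run_and_check`, the temporary-file layout, and the 10-second timeout are illustrative assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_check(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run source in a child interpreter and require non-empty stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
    if result.returncode != 0:
        return False, result.stderr.strip()
    if not result.stdout.strip():
        # Exit code 0 with no output: nothing was actually exercised.
        return False, "exited cleanly but printed nothing"
    return True, result.stdout.strip()
```

A child process isolates crashes and infinite loops from the agent, but it is not a security sandbox on its own.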

**Bug 2 — ChromaDB Hash Randomization**
Python's `hash()` is salted per process for strings, so the same bug got a different ID on every run and memory retrieval never matched. Fixed by taking the first 8 hex characters of an MD5 digest (`hashlib.md5(...).hexdigest()[:8]`), which gives deterministic cross-session IDs.

**Bug 3 — Python 3.11 F-string Backslash**
Python 3.11 and earlier reject backslashes inside the expression part of an f-string (the restriction was lifted in Python 3.12). The benchmark node embedded code inside f-strings and hit this. Fixed by using string concatenation instead.
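
The workaround can be sketched as follows. The README does not show the original failing line, so the function `build_benchmark_script` and the script it assembles are hypothetical. The point is only that the pieces are joined with plain concatenation.

```python
def build_benchmark_script(code: str, call: str, runs: int = 1000) -> str:
    """Assemble a standalone timing script without using f-strings."""
    # On Python 3.11 and earlier, an expression such as {'\n'.join(parts)}
    # inside an f-string is a SyntaxError, so the pieces are concatenated.
    parts = [
        "import timeit",
        code,
        "total = timeit.timeit(lambda: " + call + ", number=" + str(runs) + ")",
        "print(total / " + str(runs) + " * 1000)  # mean milliseconds per call",
    ]
    return "\n".join(parts) + "\n"
```

Keeping the newline handling outside any f-string means the same source runs unchanged on 3.11 and on newer interpreters.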

**Bug 4 — Shared Retry Counter**
One `retries` counter shared across all nodes caused security/complexity failures to consume the debugger's retry budget. Fixed by adding `security_retries` and `complexity_retries` as independent counters.

---

## 🔑 Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ Yes | Get free at console.groq.com |
| `GITHUB_TOKEN` | ❌ No | Only needed for AutoReview AI project |

---

## 📝 Resume Line

> **Autonomous Python Coding Agent** | LangGraph · Groq · ChromaDB · Streamlit
> Built a 13-node self-healing pipeline with 5-layer verification — AST validation, auto-generated tests, Hypothesis property testing (500+ random inputs), security audit, and self-reflection confidence scoring. ChromaDB vector memory enables cross-session bug fix learning. Deployed on Hugging Face Spaces.

---

## 👨‍💻 Author

**Krish Patel** — AI Engineer  
[GitHub](https://github.com/krishpatel) · [LinkedIn](https://linkedin.com/in/krishpatel) · [Live Demo](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

---

*Built as part of AI Engineer internship portfolio — Bangalore, 2026*