	# 🤖 Autonomous Python Coding Agent

	> A production-grade, self-healing multi-agent pipeline that doesn't just generate Python code — it autonomously writes, validates, tests, secures, benchmarks, and reflects on its own output before shipping.

	[![Python](https://img.shields.io/badge/Python-3.11-blue?style=flat-square&logo=python)](https://python.org)
	[![LangGraph](https://img.shields.io/badge/LangGraph-0.2.0-green?style=flat-square)](https://github.com/langchain-ai/langgraph)
	[![Groq](https://img.shields.io/badge/Groq-Llama%203.1-orange?style=flat-square)](https://groq.com)
	[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5.0-purple?style=flat-square)](https://chromadb.com)
	[![Streamlit](https://img.shields.io/badge/Streamlit-1.35-red?style=flat-square)](https://streamlit.io)
	[![License](https://img.shields.io/badge/License-MIT-lightgrey?style=flat-square)](LICENSE)
	[![Live Demo](https://img.shields.io/badge/🤗%20Live%20Demo-HuggingFace-yellow?style=flat-square)](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

	---

	## 🚀 Live Demo

	[▶ Try it on Hugging Face Spaces](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

	---

	## 📸 Demo

	![Agent Demo](demo.gif)

	---

	## 🔥 What makes this different from just using ChatGPT?

| Feature | ChatGPT / Basic Agent | This Agent |
|---|---|---|
| Code generation | ✅ | ✅ |
| Syntax validation | ❌ Run and hope | ✅ AST parse before running |
| Test cases | ❌ Manual | ✅ Auto-generated by agent |
| Stress testing | ❌ | ✅ 500+ random inputs via Hypothesis |
| Memory | ❌ Stateless | ✅ ChromaDB learns from past bugs |
| Security audit | ❌ | ✅ Detects eval, exec, hardcoded keys |
| Performance check | ❌ | ✅ Benchmarks 1000 runs, rejects slow code |
| Self-review | ❌ | ✅ Agent scores own confidence 1-10 |
| Self-healing | ❌ | ✅ Loops back and fixes failures automatically |
| Separate retry counters | ❌ | ✅ Per-node counters prevent pipeline blockage |

	---

	## 📊 Key Metrics

| Metric | Value |
|---|---|
| Pipeline nodes | 13 |
| Verification layers | 5 (AST → Tests → Hypothesis → Security → Complexity) |
| Max retries (debugger) | 3 |
| Max retries (security, complexity) | 2 each (independent counters) |
| Hypothesis test cases | 500+ random inputs per run |
| Benchmark iterations | 1,000 runs |
| Performance threshold | < 5 ms per call |
| Memory backend | ChromaDB vector similarity search |
| LLM | Llama 3.1 8B Instant via Groq |
| Avg pipeline runtime | ~20–40 seconds |
| Lines of code | ~600 across 5 files |

	---

	## 🏗️ Architecture — 13-Node Pipeline

```
User Input (Python Task)
         │
         ▼
    ┌─────────┐
    │ Planner │      ── Breaks task into blueprint
    └────┬────┘
         │
         ▼
    ┌───────┐
    │ Coder │        ── Writes code using plan + ChromaDB memory
    └────┬──┘
         │
         ▼
  ┌───────────────┐
  │ AST Validator │  ── Syntax + hallucinated imports + type hints
  └──────┬────────┘     (no execution needed; takes milliseconds)
         │
    Pass │ Fail ──► Debugger ──► back to AST
         ▼
 ┌────────────────┐
 │ Test Generator │  ── Auto-generates pytest-style test cases
 └───────┬────────┘
         │
         ▼
     ┌────────┐
     │ Tester │      ── Runs code + generated tests in sandbox
     └───┬────┘
         │
    Pass │ Fail ──► Debugger (max 3 retries)
         ▼
   ┌────────────┐
   │ Hypothesis │    ── 500+ random inputs, property-based testing
   └─────┬──────┘       (never blocks pipeline; informational only)
         │
         ▼
   ┌───────────┐
   │ Benchmark │     ── Runs 1000x, rejects if > 5ms/call
   └─────┬─────┘
         │
         ▼
   ┌──────────┐
   │ Security │      ── Detects eval/exec/hardcoded secrets
   └─────┬────┘         (own retry counter, max 2)
         │
         ▼
  ┌────────────┐
  │ Complexity │     ── Line count + nesting depth + LLM score/10
  └──────┬─────┘        (own retry counter, max 2)
         │
         ▼
┌─────────────────┐
│ Self Reflection │  ── Agent scores own confidence 1-10
└────────┬────────┘     Rewrites if confidence < 7
         │
         ▼
   ┌──────────┐
   │ Reviewer │      ── Polishes + docstrings + type hints
   └─────┬────┘
         │
         ▼
   ┌───────────┐
   │ Explainer │     ── Writes human-readable explanation
   └─────┬─────┘
         │
         ▼
      OUTPUT
Final Code + Explanation
```
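
As an illustration of what the Security node in the diagram checks, here is a minimal static scan built on the standard-library `ast` module. It is a sketch under assumptions, not the project's code: the function name `security_findings`, the secret-name pattern, and the message format are all hypothetical.

```python
import ast
import re

# Hypothetical heuristic: variable names that suggest a credential.
SECRET_NAME = re.compile(r"(api_?key|secret|token|password)", re.IGNORECASE)


def security_findings(source: str) -> list[str]:
    """Statically flag eval()/exec() calls and string literals assigned to secret-like names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval("...") or exec("...")
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in {"eval", "exec"}
        ):
            findings.append(f"line {node.lineno}: call to {node.func.id}()")
        # Assignments like API_KEY = "abc123"
        elif (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            for target in node.targets:
                if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                    findings.append(f"line {node.lineno}: hardcoded value in {target.id}")
    return findings
```

Because the scan only parses the source, it never runs the code it inspects. A name-based heuristic like this one will miss secrets stored under innocuous names, so it is best read as a first filter.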

	---

	## 📁 Project Structure

	```
	autonomous-coding-agent/
	├── app.py ← Streamlit UI
	├── main.py ← Graph builder + entry point
	├── state.py ← Shared TypedDict state (whiteboard)
	├── nodes.py ← All 13 node functions + LLM + ChromaDB
	├── edges.py ← All 7 conditional route functions
	├── requirements.txt ← Dependencies
	└── README.md
	```

	---

	## ⚡ Run Locally

	### Prerequisites
	- Python 3.11+
- Groq API key (get one free at [console.groq.com](https://console.groq.com))

	### Step 1 — Clone the repo
	```bash
	git clone https://github.com/krishpatel/autonomous-coding-agent.git
	cd autonomous-coding-agent
	```

	### Step 2 — Create virtual environment
	```bash
	python -m venv venv

	# Mac/Linux
	source venv/bin/activate

	# Windows
	venv\Scripts\activate
	```

	### Step 3 — Install dependencies
	```bash
	pip install -r requirements.txt
	```

	### Step 4 — Set your API key
```bash
# Mac/Linux
export GROQ_API_KEY=your_groq_api_key_here

# Windows (cmd)
set GROQ_API_KEY=your_groq_api_key_here

# Windows (PowerShell)
$env:GROQ_API_KEY = "your_groq_api_key_here"
```

	Or create a `.env` file:
	```bash
	echo "GROQ_API_KEY=your_groq_api_key_here" > .env
	```

	### Step 5 — Run CLI (no UI)
	```bash
	python main.py
	```

	### Step 6 — Run Streamlit UI
	```bash
	streamlit run app.py
	```

	Open [http://localhost:8501](http://localhost:8501) in your browser.

	---

	## 🐳 Run with Docker (optional)

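```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
# --no-cache-dir keeps pip's download cache out of the image
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
# Bind to 0.0.0.0 so the app is reachable from outside the container
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```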

	```bash
	# Build
	docker build -t coding-agent .

	# Run
	docker run -e GROQ_API_KEY=your_key -p 8501:8501 coding-agent
	```

	---

	## 🌐 Deploy to Hugging Face Spaces

	```bash
	# Install HF CLI
	pip install huggingface_hub

	# Login
	huggingface-cli login

	# Create space and push
	huggingface-cli repo create autonomous-coding-agent --type space --space_sdk streamlit
	git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-coding-agent
	git push hf main
	```

	Then add your secret in HF Spaces Settings:
	```
	GROQ_API_KEY = your_key_here
	```

	---

	## 🛠️ Tech Stack

	```
	LangGraph — Stateful multi-agent graph orchestration
	Groq API — LLM inference (Llama 3.1 8B Instant)
	ChromaDB — Vector database for bug fix memory
	Hypothesis — Property-based stress testing
	Streamlit — Production UI
	subprocess — Sandboxed isolated code execution
	ast — Static code analysis without execution
	hashlib — Deterministic ChromaDB IDs
	importlib — Real-time import hallucination detection
	```

	---

	## 💡 Key Engineering Decisions

	### Why LangGraph over plain LangChain?
	LangGraph handles cyclic workflows — when tests fail, the agent loops back through the debugger and restarts verification from AST. LangChain's linear chains can't do this cleanly.

	### Why AST validation before running?
	Running broken code wastes subprocess time. AST parsing catches syntax errors in milliseconds without execution — like a proofreader checking spelling before printing.
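
A minimal sketch of such a pre-execution check, using only the standard library. The function name `validate_source` and the message formats are hypothetical, and the project's actual validator in `nodes.py` may work differently (the type-hint check is not shown here).

```python
import ast
import importlib.util


def validate_source(source: str) -> list[str]:
    """Return a list of problems found without executing the code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for name in modules:
            # Look up only the top-level package; find_spec locates it
            # without importing it, so no generated code runs.
            top_level = name.split(".")[0]
            try:
                spec = importlib.util.find_spec(top_level)
            except (ImportError, ValueError):
                spec = None
            if spec is None:
                problems.append(f"unknown module: {name}")
    return problems
```

An empty list means the source parsed cleanly and every absolute import resolves in the current environment. Anything else can be handed to the debugger before a subprocess is ever started.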

	### Why Hypothesis for testing?
Hand-written tests only cover the cases you think of. Hypothesis generates 500+ random inputs and checks properties that should always hold, which surfaces edge cases a human is unlikely to write by hand.

	### Why separate retry counters per node?
With a single shared counter, three security failures exhausted the retry budget and ended the pipeline before the debugger got any attempts of its own. Giving security and complexity their own counters lets each node use up its retries without consuming another node's budget.
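
The idea can be sketched with a plain `TypedDict` and one routing function. Only the counter names `retries`, `security_retries`, and `complexity_retries` come from this README. The `security_ok` field, the node labels, and the choice to move on once the budget is spent are assumptions made for illustration.

```python
from typing import TypedDict


class AgentState(TypedDict):
    retries: int             # debugger budget (max 3)
    security_retries: int    # security budget (max 2)
    complexity_retries: int  # complexity budget (max 2)
    security_ok: bool        # hypothetical flag set by the security node


MAX_SECURITY_RETRIES = 2


def route_after_security(state: AgentState) -> str:
    """Choose the next node by looking only at the security counter."""
    if state["security_ok"]:
        return "complexity"
    if state["security_retries"] < MAX_SECURITY_RETRIES:
        # Assumed: the security node increments security_retries on each failure.
        return "debugger"
    # Assumed: once the budget is spent, continue instead of blocking the pipeline.
    return "complexity"
```

Note that `retries` is never read here, so a debugger that has already used all three of its attempts does not change how security failures are routed.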

	### Why hashlib instead of Python's hash()?
Python's built-in `hash()` is salted per process for strings (unless `PYTHONHASHSEED` is fixed), so the same error text produces a different ChromaDB ID in every session and the agent can never retrieve past fixes. `hashlib.md5` returns the same digest in every session.
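
A minimal sketch of a deterministic ID, assuming the ID is derived from the error text. The helper name `error_id` is hypothetical, and MD5 is used here as a fingerprint, not for security.

```python
import hashlib


def error_id(error_message: str) -> str:
    """Derive a stable 8-character hex ID from an error message."""
    digest = hashlib.md5(error_message.encode("utf-8")).hexdigest()
    return digest[:8]  # same prefix in every process and session
```

Calling `error_id` with the same message in two separate interpreter runs yields the same string, which is what lets a later session look up a fix stored by an earlier one.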

	### Why combined Reviewer + Explainer?
	Two separate LLM calls for polishing and explaining wasted ~8 seconds. One combined call with structured output (`FINAL_CODE:` / `EXPLANATION:`) saves an entire API round trip.
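
Splitting such a combined reply could look like the sketch below. The markers `FINAL_CODE:` and `EXPLANATION:` come from this README. The function name `split_response` and the assumption that the code section comes first are illustrative.

```python
def split_response(text: str) -> tuple[str, str]:
    """Split one LLM reply into (final_code, explanation)."""
    code_marker, expl_marker = "FINAL_CODE:", "EXPLANATION:"
    code_start = text.find(code_marker)
    expl_start = text.find(expl_marker)
    # Assumed layout: FINAL_CODE section first, EXPLANATION section second.
    if code_start == -1 or expl_start == -1 or expl_start < code_start:
        raise ValueError("reply is missing the expected markers")
    code = text[code_start + len(code_marker):expl_start].strip()
    explanation = text[expl_start + len(expl_marker):].strip()
    return code, explanation
```

Raising on a malformed reply keeps a missing marker from silently producing empty code or an empty explanation.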

	---

	## 🐛 Real Bugs Found and Fixed

### Bug 1: False Positive in Tester
`returncode == 0` doesn't mean the function was actually called. A file that only defines functions exits successfully but prints nothing. Fixed by also requiring non-empty `stdout` after a successful run.
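
A minimal sketch of that stricter check, assuming the generated code is run in a child interpreter via `subprocess`. The function name `run_passed` and the 10-second timeout are hypothetical.

```python
import subprocess
import sys


def run_passed(source: str, timeout: float = 10.0) -> bool:
    """Run source in a child interpreter; pass only if it exits 0 AND prints something."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # A file that only defines functions exits 0 but leaves stdout empty.
    return result.returncode == 0 and result.stdout.strip() != ""
```

With this check, a script containing only `def f(): ...` is reported as a failure, while the same script followed by `print(f())` passes.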

### Bug 2: ChromaDB Hash Randomization
Python's `hash()` is randomized per session for strings, so the same bug got a different ID on every run and memory retrieval never matched. Fixed with `hashlib.md5().hexdigest()[:8]` for deterministic cross-session IDs.

### Bug 3: Python 3.11 F-string Backslash
Python 3.11 and earlier reject backslashes inside f-string replacement fields (the restriction was lifted in Python 3.12 by PEP 701). The benchmark node embedded code inside f-strings, which triggered this. Fixed by using string concatenation instead.
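
One way to assemble a benchmark script without f-string expressions is sketched below. The function name `build_benchmark_script` and its parameters are hypothetical, and the real benchmark node may be structured differently.

```python
def build_benchmark_script(code: str, call: str, runs: int = 1000) -> str:
    """Assemble a timing script with plain concatenation (no f-string expressions)."""
    return (
        "import timeit\n"
        + code.rstrip("\n") + "\n"
        + "elapsed = timeit.timeit(lambda: " + call + ", number=" + str(runs) + ")\n"
        + "print(elapsed / " + str(runs) + " * 1000)\n"  # average milliseconds per call
    )
```

Every backslash lives in an ordinary string literal, so the helper parses the same way on Python 3.11 and 3.12. A caller could run the resulting script in a subprocess and compare the printed value against the 5 ms threshold.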

### Bug 4: Shared Retry Counter
A single `retries` counter shared across all nodes let security and complexity failures consume the debugger's retry budget. Fixed by adding `security_retries` and `complexity_retries` as independent counters.

	---

	## 🔑 Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ Yes | Get one free at console.groq.com |
| `GITHUB_TOKEN` | ❌ No | Only needed for AutoReview AI project |

	---

	## 📝 Resume Line

> Autonomous Python Coding Agent | LangGraph · Groq · ChromaDB · Streamlit
	> Built a 13-node self-healing pipeline with 5-layer verification — AST validation, auto-generated tests, Hypothesis property testing (500+ random inputs), security audit, and self-reflection confidence scoring. ChromaDB vector memory enables cross-session bug fix learning. Deployed on Hugging Face Spaces.

	---

	## 👨‍💻 Author

	Krish Patel — AI Engineer
	[GitHub](https://github.com/krishpatel) · [LinkedIn](https://linkedin.com/in/krishpatel) · [Live Demo](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

	---

Built as part of an AI Engineer internship portfolio (Bangalore, 2026)