# 🤖 Autonomous Python Coding Agent
> **A production-grade, self-healing multi-agent pipeline that doesn't just generate Python code: it autonomously writes, validates, tests, secures, benchmarks, and reflects on its own output before shipping.**
[![Python](https://img.shields.io/badge/Python-3.11-blue?style=flat-square&logo=python)](https://python.org)
[![LangGraph](https://img.shields.io/badge/LangGraph-0.2.0-green?style=flat-square)](https://github.com/langchain-ai/langgraph)
[![Groq](https://img.shields.io/badge/Groq-Llama%203.1-orange?style=flat-square)](https://groq.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5.0-purple?style=flat-square)](https://chromadb.com)
[![Streamlit](https://img.shields.io/badge/Streamlit-1.35-red?style=flat-square)](https://streamlit.io)
[![License](https://img.shields.io/badge/License-MIT-lightgrey?style=flat-square)](LICENSE)
[![Live Demo](https://img.shields.io/badge/🤗%20Live%20Demo-HuggingFace-yellow?style=flat-square)](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)
---
## 🚀 Live Demo
**[▶ Try it on Hugging Face Spaces](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)**
---
## 📸 Demo
![Agent Demo](demo.gif)
---
## 🔥 What makes this different from just using ChatGPT?
| Feature | ChatGPT / Basic Agent | This Agent |
|---|---|---|
| Code generation | ✅ | ✅ |
| Syntax validation | ❌ Run and hope | ✅ AST parse before running |
| Test cases | ❌ Manual | ✅ Auto-generated by agent |
| Stress testing | ❌ | ✅ 500+ random inputs via Hypothesis |
| Memory | ❌ Stateless | ✅ ChromaDB learns from past bugs |
| Security audit | ❌ | ✅ Detects eval, exec, hardcoded keys |
| Performance check | ❌ | ✅ Benchmarks 1000 runs, rejects slow code |
| Self-review | ❌ | ✅ Agent scores own confidence 1-10 |
| Self-healing | ❌ | ✅ Loops back and fixes failures automatically |
| Separate retry counters | ❌ | ✅ Per-node counters prevent pipeline blockage |
---
## 📊 Key Metrics
| Metric | Value |
|---|---|
| Pipeline nodes | 13 |
| Verification layers | 5 (AST → Tests → Hypothesis → Security → Complexity) |
| Max retries (debugger) | 3 |
| Max retries (security, complexity) | 2 each, independent counters |
| Hypothesis test cases | 500+ random inputs per run |
| Benchmark iterations | 1,000 runs |
| Performance threshold | < 5 ms per call |
| Memory backend | ChromaDB vector similarity search |
| LLM | Llama 3.1 8B Instant via Groq |
| Avg pipeline runtime | ~20–40 seconds |
| Lines of code | ~600 across 5 files |
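
The benchmark figures in the table (1,000 iterations, 5 ms threshold) could be checked with a loop like the one below. This is a minimal sketch built on the standard `timeit` module; `benchmark` and `fast_function` are hypothetical names used for illustration, not the project's actual code.

```python
import timeit

MAX_MS_PER_CALL = 5.0   # threshold taken from the metrics table
ITERATIONS = 1_000      # iteration count taken from the metrics table


def benchmark(func, *args, iterations=ITERATIONS):
    """Return the average wall-clock time per call in milliseconds."""
    total_seconds = timeit.timeit(lambda: func(*args), number=iterations)
    return total_seconds / iterations * 1000.0


def fast_function(n):
    """Hypothetical function under test."""
    return sum(range(n))


avg_ms = benchmark(fast_function, 100)
passed = avg_ms < MAX_MS_PER_CALL
print(f"{avg_ms:.4f} ms/call -> {'PASS' if passed else 'REJECT'}")
```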
---
## 🏗️ Architecture: 13-Node Pipeline
```
User Input (Python Task)
            │
            ▼
       ┌─────────┐
       │ Planner │      ── Breaks task into blueprint
       └────┬────┘
            │
            ▼
        ┌───────┐
        │ Coder │       ── Writes code using plan + ChromaDB memory
        └───┬───┘
            │
            ▼
    ┌───────────────┐
    │ AST Validator │   ── Syntax + hallucinated imports + type hints
    └───────┬───────┘      (no execution needed; milliseconds)
            │
       Pass │ Fail ──► Debugger ──► back to AST
            ▼
    ┌────────────────┐
    │ Test Generator │  ── Auto-generates pytest-style test cases
    └───────┬────────┘
            │
            ▼
        ┌────────┐
        │ Tester │      ── Runs code + generated tests in sandbox
        └───┬────┘
            │
       Pass │ Fail ──► Debugger (max 3 retries)
            ▼
      ┌────────────┐
      │ Hypothesis │    ── 500+ random inputs, property-based testing
      └─────┬──────┘       (never blocks pipeline; informational only)
            │
            ▼
      ┌───────────┐
      │ Benchmark │     ── Runs 1000x, rejects if > 5ms/call
      └─────┬─────┘
            │
            ▼
      ┌──────────┐
      │ Security │      ── Detects eval/exec/hardcoded secrets
      └─────┬────┘         (own retry counter, max 2)
            │
            ▼
      ┌────────────┐
      │ Complexity │    ── Line count + nesting depth + LLM score/10
      └─────┬──────┘       (own retry counter, max 2)
            │
            ▼
   ┌─────────────────┐
   │ Self Reflection │  ── Agent scores own confidence 1-10
   └────────┬────────┘     Rewrites if confidence < 7
            │
            ▼
      ┌──────────┐
      │ Reviewer │      ── Polishes + docstrings + type hints
      └─────┬────┘
            │
            ▼
      ┌───────────┐
      │ Explainer │     ── Writes human-readable explanation
      └─────┬─────┘
            │
            ▼
         OUTPUT
Final Code + Explanation
```
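
As an illustration of the Security stage in the diagram, the sketch below walks the syntax tree looking for `eval`/`exec` calls and for string literals assigned to credential-like names. The function name `audit` and the `SECRET_NAME` pattern are assumptions made for this example; the project's real checks may differ.

```python
import ast
import re

DANGEROUS_CALLS = {"eval", "exec"}
# Assumed heuristic: a variable whose name looks like a credential.
SECRET_NAME = re.compile(r"(api_?key|secret|token|password)", re.IGNORECASE)


def audit(source: str) -> list[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval(...) or exec(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            findings.append(f"line {node.lineno}: call to {node.func.id}()")
        # Assignments of a string literal, e.g. API_KEY = "abc123"
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            for target in node.targets:
                if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                    findings.append(f"line {node.lineno}: hardcoded value in {target.id}")
    return findings


sample = 'API_KEY = "abc123"\nresult = eval("1 + 1")\n'
for finding in audit(sample):
    print(finding)
```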
---
## 📁 Project Structure
```
autonomous-coding-agent/
├── app.py              ← Streamlit UI
├── main.py             ← Graph builder + entry point
├── state.py            ← Shared TypedDict state (whiteboard)
├── nodes.py            ← All 13 node functions + LLM + ChromaDB
├── edges.py            ← All 7 conditional route functions
├── requirements.txt    ← Dependencies
└── README.md
```
---
## ⚡ Run Locally
### Prerequisites
- Python 3.11+
- Groq API key (free at [console.groq.com](https://console.groq.com))
### Step 1: Clone the repo
```bash
git clone https://github.com/krishpatel/autonomous-coding-agent.git
cd autonomous-coding-agent
```
### Step 2: Create virtual environment
```bash
python -m venv venv
# Mac/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
```
### Step 3: Install dependencies
```bash
pip install -r requirements.txt
```
### Step 4: Set your API key
```bash
# Mac/Linux
export GROQ_API_KEY=your_groq_api_key_here
# Windows
set GROQ_API_KEY=your_groq_api_key_here
```
Or create a `.env` file:
```bash
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
```
### Step 5: Run CLI (no UI)
```bash
python main.py
```
### Step 6: Run Streamlit UI
```bash
streamlit run app.py
```
Open [http://localhost:8501](http://localhost:8501) in your browser.
---
## 🐳 Run with Docker (optional)
```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501"]
```
```bash
# Build
docker build -t coding-agent .
# Run
docker run -e GROQ_API_KEY=your_key -p 8501:8501 coding-agent
```
---
## 🌐 Deploy to Hugging Face Spaces
```bash
# Install HF CLI
pip install huggingface_hub
# Login
huggingface-cli login
# Create space and push
huggingface-cli repo create autonomous-coding-agent --type space --space_sdk streamlit
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-coding-agent
git push hf main
```
Then add your secret in HF Spaces Settings:
```
GROQ_API_KEY = your_key_here
```
---
## 🛠️ Tech Stack
```
LangGraph   - Stateful multi-agent graph orchestration
Groq API    - LLM inference (Llama 3.1 8B Instant)
ChromaDB    - Vector database for bug fix memory
Hypothesis  - Property-based stress testing
Streamlit   - Production UI
subprocess  - Sandboxed isolated code execution
ast         - Static code analysis without execution
hashlib     - Deterministic ChromaDB IDs
importlib   - Real-time import hallucination detection
```
---
## 💡 Key Engineering Decisions
### Why LangGraph over plain LangChain?
LangGraph handles **cyclic workflows**: when tests fail, the agent loops back through the debugger and restarts verification from the AST validator. LangChain's linear chains can't express this loop cleanly.
### Why AST validation before running?
Running broken code wastes subprocess time. AST parsing catches syntax errors in **milliseconds** without execution, like a proofreader checking spelling before printing.
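
A minimal sketch of this kind of pre-flight check, assuming a hypothetical `validate` helper: `ast.parse` reports syntax errors, and `importlib.util.find_spec` flags imports of modules that are not installed. Neither step executes the generated code.

```python
import ast
import importlib.util


def validate(source: str) -> list[str]:
    """Check syntax and absolute imports without executing the code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for name in modules:
            # Look up only the top-level package; the module itself is not imported.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                problems.append(f"unknown module: {name}")
    return problems


print(validate("def f(:\n    pass"))
print(validate("import json\nimport not_a_real_pkg_xyz"))
```

An empty list means the source parsed and every absolute import resolved to an installed package.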
### Why Hypothesis for testing?
Hand-written tests only cover cases you think of. Hypothesis **auto-generates 500+ random inputs** and verifies properties that should always hold, which catches edge cases a human is unlikely to write.
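
To illustrate the idea of checking properties over random inputs, here is a sketch that uses only the standard library. It does not use Hypothesis itself, which adds input strategies and shrinking of failing cases on top of this idea; `dedupe` and the three properties are hypothetical examples.

```python
import random


def dedupe(items):
    """Hypothetical function under test: drop duplicates, keep first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def check_properties(runs=500, seed=0):
    """Feed `runs` random lists to dedupe and assert properties that must always hold."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(runs):
        data = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        out = dedupe(data)
        assert len(out) == len(set(out))   # no duplicates remain
        assert set(out) == set(data)       # no element lost or invented
        assert dedupe(out) == out          # applying it twice changes nothing
    return runs


print(check_properties(), "random inputs checked")
```

With Hypothesis, the same properties would sit in a test decorated with `@given(st.lists(st.integers()))`, and `@settings(max_examples=500)` would control the number of generated inputs.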
### Why separate retry counters per node?
With one shared counter, three security failures exhausted the retry budget and killed the entire pipeline before the debugger got its attempts. Separate counters for security and complexity let each node fail independently without blocking the others.
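
A sketch of what independent counters can look like in the shared state. The counter names `retries`, `security_retries` and `complexity_retries` come from this README; the `security_ok` flag, the node names returned by the router, and the choice to move on once the budget is spent are assumptions for illustration.

```python
from typing import TypedDict


class AgentState(TypedDict):
    retries: int             # debugger attempts
    security_retries: int    # security-fix attempts
    complexity_retries: int  # complexity-fix attempts
    security_ok: bool        # assumed flag set by the security node


MAX_SECURITY_RETRIES = 2


def route_after_security(state: AgentState) -> str:
    """Pick the next node name; only the security counter is consulted."""
    if state["security_ok"]:
        return "complexity"
    if state["security_retries"] < MAX_SECURITY_RETRIES:
        return "coder"        # try to fix the finding
    return "complexity"       # assumed behavior: budget spent, move on


state: AgentState = {"retries": 3, "security_retries": 0,
                     "complexity_retries": 0, "security_ok": False}
print(route_after_security(state))
```

Here the debugger budget is already used up (`retries` is 3), yet the security node still gets its own attempts.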
### Why hashlib instead of Python's hash()?
Python's `hash()` is **randomized per process** for strings (hash randomization is a security feature), so the same error maps to a different ChromaDB ID in every session and the agent can never retrieve past fixes. `hashlib.md5` is deterministic across all sessions.
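
A minimal sketch of a deterministic ID, assuming a hypothetical `bug_id` helper. MD5 is used here only as a stable fingerprint, not for security, and truncating to 8 hex characters trades collision resistance for short IDs.

```python
import hashlib


def bug_id(error_message: str) -> str:
    """Return a short ID that is identical in every Python process."""
    digest = hashlib.md5(error_message.encode("utf-8")).hexdigest()
    return digest[:8]


message = "TypeError: unsupported operand type(s) for +: 'int' and 'str'"
print(bug_id(message))  # same 8 hex characters on every run
print(hash(message))    # differs between runs unless PYTHONHASHSEED is fixed
```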
### Why combined Reviewer + Explainer?
Two separate LLM calls for polishing and explaining wasted ~8 seconds. One combined call with structured output (`FINAL_CODE:` / `EXPLANATION:`) saves an entire API round trip.
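
One way to split such a combined reply is sketched below. The `FINAL_CODE:` and `EXPLANATION:` markers come from this README; the `split_response` helper and the sample reply are hypothetical.

```python
def split_response(text: str) -> tuple[str, str]:
    """Split one LLM reply into (final_code, explanation)."""
    code_marker, explanation_marker = "FINAL_CODE:", "EXPLANATION:"
    if code_marker not in text or explanation_marker not in text:
        raise ValueError("reply is missing a FINAL_CODE: or EXPLANATION: section")
    # Everything after the first marker, then cut at the second marker.
    after_code = text.split(code_marker, 1)[1]
    code, explanation = after_code.split(explanation_marker, 1)
    return code.strip(), explanation.strip()


reply = "FINAL_CODE:\ndef add(a, b):\n    return a + b\nEXPLANATION:\nAdds two numbers."
code, explanation = split_response(reply)
print(code)
print(explanation)
```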
---
## 🐛 Real Bugs Found and Fixed
**Bug 1: False Positive in Tester**
`returncode == 0` doesn't mean the function was called. A file that only defines functions exits successfully but prints nothing. Fixed by checking that `stdout` is not empty after a successful run.
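
The fix can be sketched as follows, assuming a hypothetical `run_and_check` helper that writes the generated code to a temporary file and runs it in a child interpreter.

```python
import os
import subprocess
import sys
import tempfile


def run_and_check(source: str, timeout: float = 10.0) -> bool:
    """Run source in a child interpreter; pass only on exit code 0 AND some output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(source)
        path = handle.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
    finally:
        os.unlink(path)  # always remove the temporary script
    return result.returncode == 0 and result.stdout.strip() != ""


defines_only = "def add(a, b):\n    return a + b\n"
calls_it = defines_only + "print(add(2, 3))\n"
print(run_and_check(defines_only))  # False: exits cleanly but prints nothing
print(run_and_check(calls_it))      # True: exits cleanly and prints a result
```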
**Bug 2: ChromaDB Hash Randomization**
Python's `hash()` is randomized per process for strings, so the same bug got a different ID on every run and memory retrieval never worked. Fixed with `hashlib.md5().hexdigest()[:8]` for deterministic cross-session IDs.
**Bug 3: Python 3.11 F-string Backslash**
Python 3.11 and earlier don't allow backslashes inside f-string expressions (the restriction was lifted in Python 3.12). The benchmark node embedded code inside f-strings. Fixed by using string concatenation instead.
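
A sketch of the concatenation approach, with hypothetical `user_code` and `call` values standing in for what the benchmark node would receive. The script is assembled from plain strings, so no backslash ever appears inside an f-string expression.

```python
user_code = "def square(x):\n    return x * x"
call = "square(12)"

# On Python 3.11 and earlier, an expression such as {"\n".join(lines)} inside an
# f-string is a SyntaxError, so the script is built by concatenation instead.
lines = [
    "import timeit",
    user_code,
    "elapsed = timeit.timeit(lambda: " + call + ", number=1000)",
    "print(elapsed)",  # total seconds for 1000 calls
]
script = "\n".join(lines) + "\n"
print(script)
```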
**Bug 4: Shared Retry Counter**
One `retries` counter shared across all nodes caused security/complexity failures to consume the debugger's retry budget. Fixed by adding `security_retries` and `complexity_retries` as independent counters.
---
## 🔑 Environment Variables
| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ Yes | Get free at console.groq.com |
| `GITHUB_TOKEN` | ❌ No | Only needed for AutoReview AI project |
---
## 📝 Resume Line
> **Autonomous Python Coding Agent** | LangGraph · Groq · ChromaDB · Streamlit
> Built a 13-node self-healing pipeline with 5-layer verification: AST validation, auto-generated tests, Hypothesis property testing (500+ random inputs), security audit, and self-reflection confidence scoring. ChromaDB vector memory enables cross-session bug fix learning. Deployed on Hugging Face Spaces.
---
## 👨‍💻 Author
**Krish Patel**, AI Engineer
[GitHub](https://github.com/krishpatel) · [LinkedIn](https://linkedin.com/in/krishpatel) · [Live Demo](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)
---
*Built as part of AI Engineer internship portfolio, Bangalore, 2026*