Autonomous Python Coding Agent
A production-grade, self-healing multi-agent pipeline that doesn't just generate Python code: it autonomously writes, validates, tests, secures, benchmarks, and reflects on its own output before shipping.
Live Demo
Try it on Hugging Face Spaces
What makes this different from just using ChatGPT?
| Feature | ChatGPT / Basic Agent | This Agent |
|---|---|---|
| Code generation | ✅ | ✅ |
| Syntax validation | ❌ Run and hope | ✅ AST parse before running |
| Test cases | ❌ Manual | ✅ Auto-generated by agent |
| Stress testing | ❌ | ✅ 500+ random inputs via Hypothesis |
| Memory | ❌ Stateless | ✅ ChromaDB learns from past bugs |
| Security audit | ❌ | ✅ Detects eval, exec, hardcoded keys |
| Performance check | ❌ | ✅ Benchmarks 1000 runs, rejects slow code |
| Self-review | ❌ | ✅ Agent scores own confidence 1-10 |
| Self-healing | ❌ | ✅ Loops back and fixes failures automatically |
| Separate retry counters | ❌ | ✅ Per-node counters prevent pipeline blockage |
Key Metrics
| Metric | Value |
|---|---|
| Pipeline nodes | 13 |
| Verification layers | 5 (AST → Tests → Hypothesis → Security → Complexity) |
| Max retries (debugger) | 3 |
| Max retries (security, complexity) | 2 each (independent counters) |
| Hypothesis test cases | 500+ random inputs per run |
| Benchmark iterations | 1,000 runs |
| Performance threshold | < 5ms per call |
| Memory backend | ChromaDB vector similarity search |
| LLM | Llama 3.1 8B Instant via Groq |
| Avg pipeline runtime | ~20-40 seconds |
| Lines of code | ~600 across 5 files |
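The benchmark figures in the table (1,000 runs, under 5 ms per call) can be illustrated with a minimal sketch built on the standard-library `timeit` module. The function names here are hypothetical and are not taken from the project's `nodes.py`; the real node may measure time differently.

```python
import time
import timeit


def mean_ms_per_call(func, runs=1000):
    """Average wall-clock time of one func() call, in milliseconds."""
    total_seconds = timeit.timeit(func, number=runs)
    return total_seconds / runs * 1000.0


def passes_benchmark(func, runs=1000, threshold_ms=5.0):
    """True when the mean time per call is under the threshold."""
    return mean_ms_per_call(func, runs) < threshold_ms


def fast_candidate():
    return sorted(range(100))


def slow_candidate():
    time.sleep(0.01)  # sleeps at least 10 ms, well above the 5 ms budget


fast_ok = passes_benchmark(fast_candidate)
slow_ok = passes_benchmark(slow_candidate, runs=5)  # fewer runs keeps the demo quick
```

Averaging over many runs smooths out scheduler noise, which matters when the threshold is only a few milliseconds.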
Architecture: 13-Node Pipeline
User Input (Python Task)
         │
         ▼
    ┌─────────┐
    │ Planner │ ── Breaks task into blueprint
    └────┬────┘
         │
         ▼
    ┌───────┐
    │ Coder │ ── Writes code using plan + ChromaDB memory
    └────┬──┘
         │
         ▼
  ┌───────────────┐
  │ AST Validator │ ── Syntax + hallucinated imports + type hints
  └──────┬────────┘    (no execution needed, runs in milliseconds)
         │
    Pass │ Fail ──► Debugger ──► back to AST
         ▼
 ┌────────────────┐
 │ Test Generator │ ── Auto-generates pytest-style test cases
 └───────┬────────┘
         │
         ▼
     ┌────────┐
     │ Tester │ ── Runs code + generated tests in sandbox
     └───┬────┘
         │
    Pass │ Fail ──► Debugger (max 3 retries)
         ▼
   ┌────────────┐
   │ Hypothesis │ ── 500+ random inputs, property-based testing
   └─────┬──────┘    (never blocks pipeline; informational only)
         │
         ▼
   ┌───────────┐
   │ Benchmark │ ── Runs 1000x, rejects if > 5ms/call
   └─────┬─────┘
         │
         ▼
   ┌──────────┐
   │ Security │ ── Detects eval/exec/hardcoded secrets
   └─────┬────┘    (own retry counter, max 2)
         │
         ▼
  ┌────────────┐
  │ Complexity │ ── Line count + nesting depth + LLM score/10
  └──────┬─────┘    (own retry counter, max 2)
         │
         ▼
┌─────────────────┐
│ Self Reflection │ ── Agent scores own confidence 1-10
└────────┬────────┘    Rewrites if confidence < 7
         │
         ▼
   ┌──────────┐
   │ Reviewer │ ── Polishes + docstrings + type hints
   └─────┬────┘
         │
         ▼
   ┌───────────┐
   │ Explainer │ ── Writes human-readable explanation
   └─────┬─────┘
         │
         ▼
       OUTPUT
Final Code + Explanation
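As a rough illustration of the Security step in the diagram, the sketch below walks the syntax tree looking for `eval`/`exec` calls and for string literals assigned to secret-looking names. The `audit` function, the `SECRET_HINTS` list, and the message format are assumptions made for this example; the project's actual detection rules are not shown in this README.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}
SECRET_HINTS = ("key", "secret", "token", "password")  # hypothetical name hints


def audit(source: str) -> list[str]:
    """Return human-readable findings for risky patterns, without running the code."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval(...) or exec(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in DANGEROUS_CALLS:
                findings.append(f"line {node.lineno}: call to {node.func.id}()")
        # Non-empty string literal assigned to a secret-looking variable name
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str) and node.value.value:
                for target in node.targets:
                    if isinstance(target, ast.Name) and any(
                        hint in target.id.lower() for hint in SECRET_HINTS
                    ):
                        findings.append(
                            f"line {node.lineno}: hardcoded value in {target.id}"
                        )
    return findings


risky = 'API_KEY = "abc123"\nresult = eval("1 + 1")\n'
clean = "def add(a, b):\n    return a + b\n"
risky_findings = audit(risky)
clean_findings = audit(clean)
```

A name-based check like this will produce false positives (for example a dictionary `key` variable), so it works best as a trigger for a retry rather than a hard verdict.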
Project Structure
autonomous-coding-agent/
├── app.py            ← Streamlit UI
├── main.py           ← Graph builder + entry point
├── state.py          ← Shared TypedDict state (whiteboard)
├── nodes.py          ← All 13 node functions + LLM + ChromaDB
├── edges.py          ← All 7 conditional route functions
├── requirements.txt  ← Dependencies
└── README.md
Run Locally
Prerequisites
- Python 3.11+
- Groq API key (free at console.groq.com)
Step 1: Clone the repo
git clone https://github.com/krishpatel/autonomous-coding-agent.git
cd autonomous-coding-agent
Step 2: Create a virtual environment
python -m venv venv
# Mac/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
Step 3: Install dependencies
pip install -r requirements.txt
Step 4: Set your API key
# Mac/Linux
export GROQ_API_KEY=your_groq_api_key_here
# Windows
set GROQ_API_KEY=your_groq_api_key_here
Or create a .env file:
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
Step 5: Run the CLI (no UI)
python main.py
Step 6: Run the Streamlit UI
streamlit run app.py
Open http://localhost:8501 in your browser.
Run with Docker (optional)
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501"]
# Build
docker build -t coding-agent .
# Run
docker run -e GROQ_API_KEY=your_key -p 8501:8501 coding-agent
Deploy to Hugging Face Spaces
# Install HF CLI
pip install huggingface_hub
# Login
huggingface-cli login
# Create space and push
huggingface-cli repo create autonomous-coding-agent --type space --space_sdk streamlit
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-coding-agent
git push hf main
Then add your secret in HF Spaces Settings:
GROQ_API_KEY = your_key_here
Tech Stack
- LangGraph: Stateful multi-agent graph orchestration
- Groq API: LLM inference (Llama 3.1 8B Instant)
- ChromaDB: Vector database for bug fix memory
- Hypothesis: Property-based stress testing
- Streamlit: Production UI
- subprocess: Sandboxed isolated code execution
- ast: Static code analysis without execution
- hashlib: Deterministic ChromaDB IDs
- importlib: Real-time import hallucination detection
Key Engineering Decisions
Why LangGraph over plain LangChain?
LangGraph handles cyclic workflows: when tests fail, the agent loops back through the debugger and restarts verification from AST. LangChain's linear chains can't do this cleanly.
Why AST validation before running?
Running broken code wastes subprocess time. AST parsing catches syntax errors in milliseconds without execution, like a proofreader checking spelling before printing.
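A minimal sketch of this kind of pre-execution check, using only the standard library: `ast.parse` catches syntax errors and `importlib.util.find_spec` flags imports that do not resolve in the current environment. The function name `validate_source` and the message wording are assumptions for illustration, not the project's actual code.

```python
import ast
import importlib.util


def validate_source(source: str) -> list[str]:
    """Return a list of problems found without executing the code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Only the top-level package is checked, so nothing gets imported.
            top_level = name.split(".")[0]
            if importlib.util.find_spec(top_level) is None:
                problems.append(f"unknown module: {top_level}")
    return problems


ok = validate_source("import math\nprint(math.pi)\n")
bad_import = validate_source("import definitely_not_a_real_module_xyz\n")
bad_syntax = validate_source("def broken(:\n    pass\n")
```

Checking only the top-level package name keeps the check side-effect free, at the cost of missing a hallucinated submodule inside a real package.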
Why Hypothesis for testing?
Hand-written tests only cover the cases you think of. Hypothesis auto-generates 500+ random inputs and verifies properties that should always hold, which catches edge cases a human is unlikely to write.
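To show the idea of checking a property over many random inputs, here is a small stand-in written with the standard-library `random` module. It is deliberately not Hypothesis itself (Hypothesis adds smarter input generation and shrinking of failing cases); every name in it is hypothetical.

```python
import random


def check_property(func, prop, make_input, runs=500, seed=0):
    """Call func on `runs` random inputs; return the first input that breaks prop, else None."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = make_input(rng)
        if not prop(value, func(value)):
            return value
    return None


def random_int_list(rng):
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]


def dedupe(items):
    """Function under test: remove duplicates, keep first-seen order."""
    return list(dict.fromkeys(items))


def drop_last(items):
    """Deliberately wrong implementation, used to show a failure being found."""
    return items[:-1]


def no_duplicates_and_same_members(original, result):
    return len(result) == len(set(result)) and set(result) == set(original)


good_result = check_property(dedupe, no_duplicates_and_same_members, random_int_list)
bad_result = check_property(drop_last, no_duplicates_and_same_members, random_int_list)
```

The property is stated once and reused for any implementation, which is what lets a generated function be checked without hand-picking inputs.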
Why separate retry counters per node?
With one shared counter, security failing 3 times exhausted the retry budget and killed the entire pipeline before the debugger got its attempts. Separate counters for security and complexity mean each node can fail independently without blocking the others.
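A sketch of how independent counters might look in routing functions. The counter names `retries` and `security_retries` come from this README; the state fields, node names, and what happens once a budget is exhausted are assumptions for illustration only.

```python
from typing import TypedDict

MAX_DEBUG_RETRIES = 3
MAX_SECURITY_RETRIES = 2


class AgentState(TypedDict):
    tests_passed: bool
    security_passed: bool
    retries: int            # debugger budget
    security_retries: int   # security budget, tracked separately


def route_after_tester(state: AgentState) -> str:
    if state["tests_passed"]:
        return "hypothesis"
    if state["retries"] < MAX_DEBUG_RETRIES:
        return "debugger"
    return "end"  # assumed behaviour when the debugger budget runs out


def route_after_security(state: AgentState) -> str:
    if state["security_passed"]:
        return "complexity"
    # Only the security counter is consulted here, never `retries`.
    if state["security_retries"] < MAX_SECURITY_RETRIES:
        return "debugger"
    return "end"  # assumed behaviour when the security budget runs out


# The debugger budget is spent, yet security still gets its own attempts.
state: AgentState = {
    "tests_passed": True,
    "security_passed": False,
    "retries": 3,
    "security_retries": 0,
}
next_node = route_after_security(state)
```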
Why hashlib instead of Python's hash()?
Python's hash() of a string is salted per process for security (unless PYTHONHASHSEED is fixed), so the same error gets a different ChromaDB ID each session and the agent can never retrieve past fixes. hashlib.md5 is deterministic across all sessions.
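A minimal sketch of a deterministic ID along these lines. The helper name `error_id` is hypothetical; the 8-character truncation follows the `hexdigest()[:8]` fix described under Bug 2.

```python
import hashlib


def error_id(error_message: str) -> str:
    """Stable 8-character ID: same message, same ID, in every process."""
    digest = hashlib.md5(error_message.encode("utf-8")).hexdigest()
    return digest[:8]


first = error_id("NameError: name 'x' is not defined")
second = error_id("NameError: name 'x' is not defined")
other = error_id("TypeError: unsupported operand type(s)")
```

Truncating to 8 hex characters (32 bits) keeps IDs short but makes collisions possible in a large memory store, so a longer prefix may be worth considering.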
Why combined Reviewer + Explainer?
Two separate LLM calls for polishing and explaining wasted ~8 seconds. One combined call with structured output (FINAL_CODE: / EXPLANATION:) saves an entire API round trip.
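One way to split such a combined response back into its two parts is sketched below. The marker strings come from the text above; the function name and the error handling are assumptions, and a real LLM response may need more tolerant parsing (for example, stripping markdown fences).

```python
def split_review_output(text: str) -> tuple[str, str]:
    """Split 'FINAL_CODE: ... EXPLANATION: ...' into (code, explanation)."""
    code_marker, explain_marker = "FINAL_CODE:", "EXPLANATION:"
    code_start = text.find(code_marker)
    explain_start = text.find(explain_marker)
    if code_start == -1 or explain_start == -1 or explain_start < code_start:
        raise ValueError("response is missing the expected markers")
    code = text[code_start + len(code_marker):explain_start].strip()
    explanation = text[explain_start + len(explain_marker):].strip()
    return code, explanation


response = "FINAL_CODE:\ndef one():\n    return 1\nEXPLANATION:\nReturns one."
code, explanation = split_review_output(response)
```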
Real Bugs Found and Fixed
Bug 1: False Positive in Tester
returncode == 0 doesn't mean the function was called. A file that only defines functions exits successfully but prints nothing. Fixed by also checking that stdout is not empty after a successful run.
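The check can be sketched as follows, assuming the tester runs the code in a child interpreter via `subprocess`. The function name, the messages, and the use of `python -c` are illustrative choices, not the project's actual implementation.

```python
import subprocess
import sys


def run_and_check(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run source in a child interpreter; succeed only if it exits 0 AND prints something."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        return False, result.stderr.strip()
    if not result.stdout.strip():
        # Exit code 0 with no output: the code probably only defined functions.
        return False, "exited cleanly but printed nothing; was the function called?"
    return True, result.stdout.strip()


defined_only = run_and_check("def f():\n    return 1\n")
called = run_and_check("print(2 + 2)")
```

This assumes the generated tests print their results; code that is legitimately silent would need a different success signal.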
Bug 2: ChromaDB Hash Randomization
Python's hash() is session-randomized, so the same bug gets a different ID every run and memory retrieval never works. Fixed with hashlib.md5().hexdigest()[:8] for deterministic cross-session IDs.
Bug 3: Python 3.11 F-string Backslash
Python 3.11 doesn't allow backslashes inside f-string expressions. The benchmark node embedded code inside f-strings. Fixed by using string concatenation instead.
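For context, an expression such as `f"{'\n'.join(lines)}"` is a SyntaxError on Python 3.11 and earlier (the restriction was lifted in Python 3.12). A sketch of the concatenation workaround follows; the function name, its parameters, and the shape of the generated script are assumptions rather than the project's real benchmark node.

```python
def build_benchmark_script(user_code: str, call_expr: str, runs: int = 1000) -> str:
    """Assemble a timing script with plain concatenation, avoiding f-string expressions."""
    newline = "\n"  # the backslash lives in a normal string, outside any f-string
    return (
        user_code + newline
        + "import timeit" + newline
        + "print(timeit.timeit(lambda: " + call_expr + ", number=" + str(runs) + "))"
        + newline
    )


script = build_benchmark_script("def f():\n    return 1", "f()", runs=5)
compiled = compile(script, "<benchmark>", "exec")  # raises SyntaxError if malformed
```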
Bug 4: Shared Retry Counter
One retries counter shared across all nodes caused security/complexity failures to consume the debugger's retry budget. Fixed by adding security_retries and complexity_retries as independent counters.
Environment Variables
| Variable | Required | Description |
|---|---|---|
| GROQ_API_KEY | ✅ Yes | Get free at console.groq.com |
| GITHUB_TOKEN | ❌ No | Only needed for AutoReview AI project |
Resume Line
Autonomous Python Coding Agent | LangGraph · Groq · ChromaDB · Streamlit
Built a 13-node self-healing pipeline with 5-layer verification: AST validation, auto-generated tests, Hypothesis property testing (500+ random inputs), security audit, and self-reflection confidence scoring. ChromaDB vector memory enables cross-session bug fix learning. Deployed on Hugging Face Spaces.
Author
Krish Patel, AI Engineer
GitHub · LinkedIn · Live Demo
Built as part of an AI Engineer internship portfolio. Bangalore, 2026
