
🤖 Autonomous Python Coding Agent

A production-grade, self-healing multi-agent pipeline that doesn't just generate Python code: it autonomously writes, validates, tests, secures, benchmarks, and reflects on its own output before shipping.

Python LangGraph Groq ChromaDB Streamlit License Live Demo


🚀 Live Demo

▶ Try it on Hugging Face Spaces


📸 Demo

Agent Demo


🔥 What makes this different from just using ChatGPT?

| Feature | ChatGPT / Basic Agent | This Agent |
|---|---|---|
| Code generation | ✅ | ✅ |
| Syntax validation | ❌ Run and hope | ✅ AST parse before running |
| Test cases | ❌ Manual | ✅ Auto-generated by agent |
| Stress testing | ❌ | ✅ 500+ random inputs via Hypothesis |
| Memory | ❌ Stateless | ✅ ChromaDB learns from past bugs |
| Security audit | ❌ | ✅ Detects eval, exec, hardcoded keys |
| Performance check | ❌ | ✅ Benchmarks 1000 runs, rejects slow code |
| Self-review | ❌ | ✅ Agent scores own confidence 1-10 |
| Self-healing | ❌ | ✅ Loops back and fixes failures automatically |
| Separate retry counters | ❌ | ✅ Per-node counters prevent pipeline blockage |

📊 Key Metrics

| Metric | Value |
|---|---|
| Pipeline nodes | 13 |
| Verification layers | 5 (AST → Tests → Hypothesis → Security → Complexity) |
| Max retries (debugger) | 3 |
| Max retries (security, complexity) | 2 each, independent counters |
| Hypothesis test cases | 500+ random inputs per run |
| Benchmark iterations | 1,000 runs |
| Performance threshold | < 5 ms per call |
| Memory backend | ChromaDB vector similarity search |
| LLM | Llama 3.1 8B Instant via Groq |
| Avg pipeline runtime | ~20–40 seconds |
| Lines of code | ~600 across 5 files |
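The benchmark figures above (1,000 runs, under 5 ms per call) can be illustrated with a small timing helper. This is a minimal sketch using only the standard library's `timeit`; the function name `benchmark` and its parameters are hypothetical and are not taken from the project's `nodes.py`.

```python
import timeit
from typing import Callable


def benchmark(func: Callable[[], object],
              runs: int = 1000,
              threshold_ms: float = 5.0) -> tuple[bool, float]:
    """Call `func` `runs` times and compare the mean time per call to the threshold.

    Returns (passed, mean_ms). `passed` is True when the mean is below the threshold.
    """
    total_seconds = timeit.timeit(func, number=runs)
    mean_ms = total_seconds / runs * 1000.0
    return mean_ms < threshold_ms, mean_ms


if __name__ == "__main__":
    passed, mean_ms = benchmark(lambda: sorted(range(100)))
    print(f"passed={passed} mean={mean_ms:.4f} ms/call")
```

A mean over many runs smooths out one-off scheduling noise, which is presumably why the pipeline repeats the call rather than timing it once.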

🏗️ Architecture: 13-Node Pipeline

User Input (Python Task)
         │
         ▼
    ┌─────────┐
    │ Planner │ ── Breaks task into blueprint
    └────┬────┘
         │
         ▼
    ┌───────┐
    │ Coder │ ── Writes code using plan + ChromaDB memory
    └────┬──┘
         │
         ▼
    ┌───────────────┐
    │ AST Validator │ ── Syntax + hallucinated imports + type hints
    └──────┬────────┘    (no execution needed, milliseconds)
           │
      Pass │   Fail ──► Debugger ──► back to AST
           ▼
┌────────────────┐
│ Test Generator │ ── Auto-generates pytest-style test cases
└───────┬────────┘
        │
        ▼
    ┌────────┐
    │ Tester │ ── Runs code + generated tests in sandbox
    └───┬────┘
        │
   Pass │   Fail ──► Debugger (max 3 retries)
        ▼
┌────────────┐
│ Hypothesis │ ── 500+ random inputs, property-based testing
└─────┬──────┘    (never blocks pipeline, informational only)
      │
      ▼
┌───────────┐
│ Benchmark │ ── Runs 1000x, rejects if > 5ms/call
└─────┬─────┘
      │
      ▼
┌──────────┐
│ Security │ ── Detects eval/exec/hardcoded secrets
└─────┬────┘    (own retry counter, max 2)
      │
      ▼
┌────────────┐
│ Complexity │ ── Line count + nesting depth + LLM score/10
└──────┬─────┘    (own retry counter, max 2)
       │
       ▼
┌─────────────────┐
│ Self Reflection │ ── Agent scores own confidence 1-10
└────────┬────────┘    Rewrites if confidence < 7
         │
         ▼
    ┌──────────┐
    │ Reviewer │ ── Polishes + docstrings + type hints
    └─────┬────┘
          │
          ▼
    ┌──────────┐
    │Explainer │ ── Writes human-readable explanation
    └─────┬────┘
          │
          ▼
       OUTPUT
  Final Code + Explanation
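The Security node in the diagram is described as detecting `eval`, `exec`, and hardcoded secrets. One way such a check could work without running the code is to walk the syntax tree. The sketch below is an assumption about the approach, not the project's actual implementation: `audit`, `DANGEROUS_CALLS`, and the `SECRET_NAME` pattern are all hypothetical names.

```python
import ast
import re

# Hypothetical rule set: builtin calls to flag and variable names that look like secrets.
DANGEROUS_CALLS = {"eval", "exec"}
SECRET_NAME = re.compile(r"(api_?key|secret|token|password)", re.IGNORECASE)


def audit(source: str) -> list[str]:
    """Return human-readable findings for risky constructs in `source`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval("...") or exec("...").
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            findings.append(f"line {node.lineno}: call to {node.func.id}()")
        # Assignments like API_KEY = "abc123" (a non-empty string literal).
        elif (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)
                and node.value.value):
            for target in node.targets:
                if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                    findings.append(
                        f"line {node.lineno}: hardcoded value assigned to {target.id}")
    return findings
```

Reading a key from the environment, for example `api_key = os.environ["GROQ_API_KEY"]`, is not flagged because the assigned value is a subscript expression rather than a string literal.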

📁 Project Structure

autonomous-coding-agent/
├── app.py              ← Streamlit UI
├── main.py             ← Graph builder + entry point
├── state.py            ← Shared TypedDict state (whiteboard)
├── nodes.py            ← All 13 node functions + LLM + ChromaDB
├── edges.py            ← All 7 conditional route functions
├── requirements.txt    ← Dependencies
└── README.md

⚡ Run Locally

Prerequisites

Step 1: Clone the repo

git clone https://github.com/krishpatel/autonomous-coding-agent.git
cd autonomous-coding-agent

Step 2: Create a virtual environment

python -m venv venv

# Mac/Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

Step 3: Install dependencies

pip install -r requirements.txt

Step 4: Set your API key

# Mac/Linux
export GROQ_API_KEY=your_groq_api_key_here

# Windows
set GROQ_API_KEY=your_groq_api_key_here

Or create a .env file:

echo "GROQ_API_KEY=your_groq_api_key_here" > .env

Step 5: Run the CLI (no UI)

python main.py

Step 6: Run the Streamlit UI

streamlit run app.py

Open http://localhost:8501 in your browser.


🐳 Run with Docker (optional)

# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501"]

# Build
docker build -t coding-agent .

# Run
docker run -e GROQ_API_KEY=your_key -p 8501:8501 coding-agent

🌐 Deploy to Hugging Face Spaces

# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Create space and push
huggingface-cli repo create autonomous-coding-agent --type space --space_sdk streamlit
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-coding-agent
git push hf main

Then add your secret in HF Spaces Settings:

GROQ_API_KEY = your_key_here

🛠️ Tech Stack

LangGraph    - Stateful multi-agent graph orchestration
Groq API     - LLM inference (Llama 3.1 8B Instant)
ChromaDB     - Vector database for bug fix memory
Hypothesis   - Property-based stress testing
Streamlit    - Production UI
subprocess   - Isolated code execution in a separate process
ast          - Static code analysis without execution
hashlib      - Deterministic ChromaDB IDs
importlib    - Real-time import hallucination detection

💡 Key Engineering Decisions

Why LangGraph over plain LangChain?

LangGraph handles cyclic workflows: when tests fail, the agent loops back through the debugger and restarts verification from the AST validator. LangChain's linear chains can't express this loop cleanly.

Why AST validation before running?

Running broken code wastes subprocess time. AST parsing catches syntax errors in milliseconds without executing anything, like a proofreader checking spelling before printing.
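As a rough illustration of this step, the sketch below parses the source with `ast` and uses `importlib.util.find_spec` to flag imports that do not resolve in the current environment. The function name `validate` and the message format are assumptions made for the example; the project's validator also checks type hints, which is omitted here.

```python
import ast
import importlib.util


def module_exists(name: str) -> bool:
    """Return True if the top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def validate(source: str) -> list[str]:
    """Return a list of problems found statically; an empty list means the code passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]  # relative imports (level > 0) are skipped
        else:
            continue
        for name in modules:
            top_level = name.split(".")[0]
            if not module_exists(top_level):
                problems.append(f"line {node.lineno}: module '{top_level}' not found")
    return problems
```

Nothing in the candidate code is executed: `ast.parse` only builds the tree and `find_spec` only looks the module up.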

Why Hypothesis for testing?

Hand-written tests only cover the cases you think of. Hypothesis generates 500+ random inputs and verifies properties that should always hold, which catches edge cases a human is unlikely to write by hand.
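The idea of checking a property over many random inputs can be shown without the library itself. The sketch below is a simplified standard-library stand-in, not Hypothesis: it has no strategies and no input shrinking, and `check_property` and `is_sorted_permutation` are hypothetical names chosen for the example.

```python
import random
from collections import Counter
from typing import Callable, Optional


def is_sorted_permutation(original: list[int], result: list[int]) -> bool:
    """Property: the result is in order and holds exactly the same elements."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def check_property(func: Callable[[list[int]], list[int]],
                   prop: Callable[[list[int], list[int]], bool],
                   trials: int = 500,
                   seed: int = 0) -> Optional[list[int]]:
    """Feed `trials` random integer lists to `func`.

    Returns the first input that violates `prop`, or None if every trial passes.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not prop(data, func(list(data))):
            return data
    return None
```

A correct sort passes every trial, while a subtly wrong one such as `lambda xs: sorted(set(xs))` is caught as soon as a generated list contains a duplicate, which is the kind of case a hand-written test can easily miss.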

Why separate retry counters per node?

With one shared counter, a security check that failed 3 times killed the entire pipeline before the debugger got its attempts. Separate counters for security and complexity let each node exhaust its own budget without blocking the others.
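A minimal sketch of the idea, assuming a `TypedDict` state like the one in `state.py`. The counter names `retries`, `security_retries`, and `complexity_retries` come from this README; the `security_passed` flag, the route labels, and the choice to move on once the budget is spent are assumptions made for illustration.

```python
from typing import TypedDict

MAX_SECURITY_RETRIES = 2  # the "max 2" budget quoted for the security node


class AgentState(TypedDict):
    retries: int             # debugger budget
    security_retries: int    # failed security checks so far
    complexity_retries: int  # failed complexity checks so far
    security_passed: bool    # hypothetical flag set by the security node


def record_security_failure(state: AgentState) -> AgentState:
    """Return a copy of the state with only the security counter incremented."""
    return {**state,
            "security_passed": False,
            "security_retries": state["security_retries"] + 1}


def route_after_security(state: AgentState) -> str:
    """Pick the next node; the returned labels are hypothetical node names."""
    if state["security_passed"]:
        return "complexity"
    if state["security_retries"] < MAX_SECURITY_RETRIES:
        return "debugger"  # still within the security budget: try a fix
    return "complexity"    # assumption: budget spent, continue instead of aborting
```

Because `record_security_failure` never touches `retries`, a run of security failures leaves the debugger's budget untouched.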

Why hashlib instead of Python's hash()?

Python's built-in hash() is salted per process for strings (hash randomization), so the same error text produces a different ChromaDB ID in every session and the agent can never retrieve past fixes. hashlib.md5 is deterministic across all sessions.
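The difference can be sketched in a few lines. The helper name `memory_id` is hypothetical; the 8-character truncation follows the `hexdigest()[:8]` form quoted in the bug list in this README.

```python
import hashlib


def memory_id(error_text: str) -> str:
    """Derive a short ID that is identical in every Python process."""
    digest = hashlib.md5(error_text.encode("utf-8")).hexdigest()
    return digest[:8]


if __name__ == "__main__":
    message = "NameError: name 'x' is not defined"
    # Stable across runs, so it can serve as a lookup key in a persistent store.
    print(memory_id(message))
    # Varies between runs unless PYTHONHASHSEED is fixed.
    print(hash(message))
```

MD5 is used here only as a stable fingerprint, not for security. Truncating to 8 hex characters keeps IDs short at the cost of a higher collision chance as the store grows.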

Why combined Reviewer + Explainer?

Two separate LLM calls for polishing and explaining wasted ~8 seconds. One combined call with structured output (FINAL_CODE: / EXPLANATION:) saves an entire API round trip.
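Splitting a combined response back into its two parts could look like the sketch below. Only the `FINAL_CODE:` and `EXPLANATION:` markers come from this README; the function name `split_review` and the error handling are assumptions.

```python
def split_review(response: str) -> tuple[str, str]:
    """Split one LLM reply into (code, explanation) using the two markers."""
    code_marker = "FINAL_CODE:"
    explanation_marker = "EXPLANATION:"

    code_start = response.find(code_marker)
    explanation_start = response.find(explanation_marker)
    if code_start == -1 or explanation_start == -1 or explanation_start < code_start:
        raise ValueError("response is missing the expected markers")

    code = response[code_start + len(code_marker):explanation_start].strip()
    explanation = response[explanation_start + len(explanation_marker):].strip()
    return code, explanation
```

Raising on a malformed reply lets the caller decide whether to retry the call or fall back to the unpolished code. Note that this simple split would misfire if the generated code itself contained the text `EXPLANATION:`.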


🐛 Real Bugs Found and Fixed

Bug 1: False Positive in Tester
returncode == 0 doesn't mean the function was called. A file that only defines functions exits successfully but prints nothing. Fixed by checking that stdout is not empty after a successful run.
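The fix for Bug 1 can be sketched as follows. This is an illustrative version, assuming the candidate code is passed to a fresh interpreter with `-c` rather than written to a file; `run_candidate` is a hypothetical name.

```python
import subprocess
import sys


def run_candidate(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run `source` in a separate interpreter and report (passed, details)."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if completed.returncode != 0:
        return False, completed.stderr.strip()
    # Exit code 0 alone is not enough: a script that only defines
    # functions also exits cleanly, so require some output as evidence.
    if not completed.stdout.strip():
        return False, "exit code 0 but no output: nothing was called or printed"
    return True, completed.stdout.strip()
```

`subprocess.run` raises `subprocess.TimeoutExpired` if the child exceeds `timeout`; a caller would likely want to catch that and treat it as a failure.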

Bug 2: ChromaDB Hash Randomization
Python's hash() is salted per session for strings. Same bug → different ID every run → memory retrieval never works. Fixed with hashlib.md5().hexdigest()[:8] for deterministic cross-session IDs.

Bug 3: Python 3.11 F-string Backslash
Python 3.11 (like every version before 3.12) doesn't allow backslashes inside f-string expressions. The benchmark node embedded code inside f-strings. Fixed by using string concatenation instead.
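The sketch below shows the shape of the problem and the concatenation workaround. `build_benchmark_script` and its parameters are hypothetical and only approximate what a benchmark node might generate.

```python
def build_benchmark_script(code: str, call: str, runs: int = 1000) -> str:
    """Assemble a script that defines `code` and then times `call` with timeit."""
    lines = [line.rstrip() for line in code.splitlines()]

    # Before Python 3.12 the next line is a SyntaxError, because the
    # expression inside the braces contains a backslash:
    #     script = f"{'\n'.join(lines)}\nimport timeit"
    # Doing the join outside any f-string and concatenating avoids it.
    body = "\n".join(lines)
    return (
        body
        + "\n\nimport timeit\n"
        + "print(timeit.timeit(" + repr(call) + ", globals=globals(), number="
        + str(runs) + "))\n"
    )


if __name__ == "__main__":
    print(build_benchmark_script("def f():\n    return 1", "f()", runs=10))
```

Assigning the joined text to a variable first and interpolating that variable into an f-string is another option that works on Python versions before 3.12.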

Bug 4: Shared Retry Counter
One retries counter shared across all nodes caused security/complexity failures to consume the debugger's retry budget. Fixed by adding security_retries and complexity_retries as independent counters.


🔑 Environment Variables

| Variable | Required | Description |
|---|---|---|
| GROQ_API_KEY | ✅ Yes | Get free at console.groq.com |
| GITHUB_TOKEN | ❌ No | Only needed for AutoReview AI project |

📝 Resume Line

Autonomous Python Coding Agent | LangGraph · Groq · ChromaDB · Streamlit
Built a 13-node self-healing pipeline with 5-layer verification: AST validation, auto-generated tests, Hypothesis property testing (500+ random inputs), security audit, and self-reflection confidence scoring. ChromaDB vector memory enables cross-session bug fix learning. Deployed on Hugging Face Spaces.


👨‍💻 Author

Krish Patel, AI Engineer
GitHub · LinkedIn · Live Demo


Built as part of AI Engineer internship portfolio. Bangalore, 2026