# 🤖 Autonomous Python Coding Agent

> **A production-grade, self-healing multi-agent pipeline that goes beyond generating Python code: it autonomously writes, validates, tests, secures, benchmarks, and reflects on its own output before shipping.**

[![Python](https://img.shields.io/badge/Python-3.11-blue?style=flat-square&logo=python)](https://python.org)
[![LangGraph](https://img.shields.io/badge/LangGraph-0.2.0-green?style=flat-square)](https://github.com/langchain-ai/langgraph)
[![Groq](https://img.shields.io/badge/Groq-Llama%203.1-orange?style=flat-square)](https://groq.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5.0-purple?style=flat-square)](https://chromadb.com)
[![Streamlit](https://img.shields.io/badge/Streamlit-1.35-red?style=flat-square)](https://streamlit.io)
[![License](https://img.shields.io/badge/License-MIT-lightgrey?style=flat-square)](LICENSE)
[![Live Demo](https://img.shields.io/badge/🤗%20Live%20Demo-HuggingFace-yellow?style=flat-square)](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

---

## 🚀 Live Demo

**[▶ Try it on Hugging Face Spaces](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)**

---

## 📸 Demo

![Agent Demo](demo.gif)

---

## 🔥 What makes this different from just using ChatGPT?

| Feature | ChatGPT / Basic Agent | This Agent |
|---|---|---|
| Code generation | ✅ | ✅ |
| Syntax validation | ❌ Run and hope | ✅ AST parse before running |
| Test cases | ❌ Manual | ✅ Auto-generated by agent |
| Stress testing | ❌ | ✅ 500+ random inputs via Hypothesis |
| Memory | ❌ Stateless | ✅ ChromaDB learns from past bugs |
| Security audit | ❌ | ✅ Detects eval, exec, hardcoded keys |
| Performance check | ❌ | ✅ Benchmarks 1000 runs, rejects slow code |
| Self-review | ❌ | ✅ Agent scores own confidence 1-10 |
| Self-healing | ❌ | ✅ Loops back and fixes failures automatically |
| Separate retry counters | ❌ | ✅ Per-node counters prevent pipeline blockage |

---

## 📊 Key Metrics

| Metric | Value |
|---|---|
| Pipeline nodes | 13 |
| Verification layers | 5 (AST → Tests → Hypothesis → Security → Complexity) |
| Max retries (debugger) | 3 |
| Max retries (security, complexity) | 2 each, with independent counters |
| Hypothesis test cases | 500+ random inputs per run |
| Benchmark iterations | 1,000 runs |
| Performance threshold | < 5ms per call |
| Memory backend | ChromaDB vector similarity search |
| LLM | Llama 3.1 8B Instant via Groq |
| Avg pipeline runtime | ~20-40 seconds |
| Lines of code | ~600 across 5 files |

---

## 🏗️ Architecture: 13-Node Pipeline

```
User Input (Python Task)
         │
         ▼
    ┌─────────┐
    │ Planner │ ── Breaks task into blueprint
    └────┬────┘
         │
         ▼
    ┌───────┐
    │ Coder │ ── Writes code using plan + ChromaDB memory
    └────┬──┘
         │
         ▼
    ┌───────────────┐
    │ AST Validator │ ── Syntax + hallucinated imports + type hints
    └──────┬────────┘    (no execution needed, runs in milliseconds)
           │
      Pass │   Fail ──► Debugger ──► back to AST
           ▼
┌────────────────┐
│ Test Generator │ ── Auto-generates pytest-style test cases
└───────┬────────┘
        │
        ▼
    ┌────────┐
    │ Tester │ ── Runs code + generated tests in sandbox
    └───┬────┘
        │
   Pass │   Fail ──► Debugger (max 3 retries)
        ▼
┌────────────┐
│ Hypothesis │ ── 500+ random inputs, property-based testing
└─────┬──────┘    (never blocks pipeline, informational only)
      │
      ▼
┌───────────┐
│ Benchmark │ ── Runs 1000x, rejects if > 5ms/call
└─────┬─────┘
      │
      ▼
┌──────────┐
│ Security │ ── Detects eval/exec/hardcoded secrets
└─────┬────┘    (own retry counter, max 2)
      │
      ▼
┌────────────┐
│ Complexity │ ── Line count + nesting depth + LLM score/10
└──────┬─────┘    (own retry counter, max 2)
       │
       ▼
┌─────────────────┐
│ Self Reflection │ ── Agent scores own confidence 1-10
└────────┬────────┘    Rewrites if confidence < 7
         │
         ▼
    ┌──────────┐
    │ Reviewer │ ── Polishes + docstrings + type hints
    └─────┬────┘
          │
          ▼
    ┌───────────┐
    │ Explainer │ ── Writes human-readable explanation
    └─────┬─────┘
          │
          ▼
       OUTPUT
  Final Code + Explanation
```

---

## 📁 Project Structure

```
autonomous-coding-agent/
├── app.py              ← Streamlit UI
├── main.py             ← Graph builder + entry point
├── state.py            ← Shared TypedDict state (whiteboard)
├── nodes.py            ← All 13 node functions + LLM + ChromaDB
├── edges.py            ← All 7 conditional route functions
├── requirements.txt    ← Dependencies
└── README.md
```

---

## ⚡ Run Locally

### Prerequisites
- Python 3.11+
- Groq API key (free at [console.groq.com](https://console.groq.com))

### Step 1: Clone the repo
```bash
git clone https://github.com/krishpatel/autonomous-coding-agent.git
cd autonomous-coding-agent
```

### Step 2: Create a virtual environment
```bash
python -m venv venv

# Mac/Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```

### Step 3: Install dependencies
```bash
pip install -r requirements.txt
```

### Step 4: Set your API key
```bash
# Mac/Linux
export GROQ_API_KEY=your_groq_api_key_here

# Windows
set GROQ_API_KEY=your_groq_api_key_here
```

Or create a `.env` file:
```bash
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
```

### Step 5: Run the CLI (no UI)
```bash
python main.py
```

### Step 6: Run the Streamlit UI
```bash
streamlit run app.py
```

Open [http://localhost:8501](http://localhost:8501) in your browser.

---

## 🐳 Run with Docker (optional)

```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501"]
```

```bash
# Build
docker build -t coding-agent .

# Run
docker run -e GROQ_API_KEY=your_key -p 8501:8501 coding-agent
```

---

## 🌐 Deploy to Hugging Face Spaces

```bash
# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Create space and push
huggingface-cli repo create autonomous-coding-agent --type space --space_sdk streamlit
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-coding-agent
git push hf main
```

Then add your secret in HF Spaces Settings:
```
GROQ_API_KEY = your_key_here
```

---

## 🛠️ Tech Stack

```
LangGraph    - Stateful multi-agent graph orchestration
Groq API     - LLM inference (Llama 3.1 8B Instant)
ChromaDB     - Vector database for bug fix memory
Hypothesis   - Property-based stress testing
Streamlit    - Production UI
subprocess   - Sandboxed isolated code execution
ast          - Static code analysis without execution
hashlib      - Deterministic ChromaDB IDs
importlib    - Real-time import hallucination detection
```

---

## 💡 Key Engineering Decisions

### Why LangGraph over plain LangChain?
LangGraph handles **cyclic workflows**: when tests fail, the agent loops back through the debugger and restarts verification from the AST step. LangChain's linear chains can't express this loop cleanly.

### Why AST validation before running?
Running broken code wastes subprocess time. AST parsing catches syntax errors in **milliseconds** without executing anything, like a proofreader checking spelling before printing.
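
A minimal sketch of this kind of pre-execution check, using only the standard library. The function names `validate_source` and `module_exists` are hypothetical and not taken from `nodes.py`; the real validator may check more (for example type hints).

```python
import ast
import importlib.util


def module_exists(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ValueError:
        # Raised when the module is already loaded but has no spec (e.g. __main__).
        return True


def validate_source(source: str) -> list[str]:
    """Return a list of problems found without executing the code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        # Collect module names from both import forms (relative imports are skipped).
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if not module_exists(root):
                problems.append(f"unknown import: {root}")
    return problems
```

An empty list means the code parsed and every imported top-level module was found, so it is worth handing to the tester.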

### Why Hypothesis for testing?
Hand-written tests only cover the cases you think of. Hypothesis **auto-generates 500+ random inputs** and verifies properties that should always hold, which surfaces edge cases a human is unlikely to write by hand.

### Why separate retry counters per node?
With one shared counter, three security failures exhausted the retry budget and killed the entire pipeline before the debugger got its attempts. Separate counters for security and complexity let each node fail independently without blocking the others.
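
As a rough illustration, the state and one routing function could look like the sketch below. The counter names match the ones in this README; `security_ok`, `route_after_security`, and the choice to move on once the budget is spent are assumptions, not the actual contents of `state.py` or `edges.py`.

```python
from typing import TypedDict

MAX_SECURITY_RETRIES = 2  # from the metrics table


class AgentState(TypedDict):
    retries: int             # debugger attempts (max 3)
    security_retries: int    # security fix attempts (max 2)
    complexity_retries: int  # complexity fix attempts (max 2)
    security_ok: bool        # hypothetical flag set by the security node


def route_after_security(state: AgentState) -> str:
    """Pick the next node using only the security node's own counter."""
    if state["security_ok"]:
        return "complexity"
    if state["security_retries"] < MAX_SECURITY_RETRIES:
        return "debugger"
    # Assumption: once the security budget is spent, continue instead of aborting.
    return "complexity"
```

Because the route never reads `retries`, a debugger that has already used all three attempts does not change what happens after a security failure.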

### Why hashlib instead of Python's hash()?
Python's `hash()` is **salted per process** for strings and bytes (a security measure), so the same error text produces a different ChromaDB ID in every session and the agent can never retrieve past fixes. `hashlib.md5` is deterministic across all sessions.
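
A small sketch of a deterministic ID helper. The name `error_id` is hypothetical; the 8-character truncation follows the fix described under Bug 2.

```python
import hashlib


def error_id(error_message: str) -> str:
    """Derive a short ID that is identical in every Python process."""
    digest = hashlib.md5(error_message.encode("utf-8")).hexdigest()
    return digest[:8]  # first 8 hex characters
```

The same error text always maps to the same ID, so a fix stored in one session can be looked up in the next.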

### Why combined Reviewer + Explainer?
Two separate LLM calls for polishing and explaining wasted ~8 seconds. One combined call with structured output (`FINAL_CODE:` / `EXPLANATION:`) saves an entire API round trip.
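
Parsing the combined reply can be as simple as splitting on the two markers. This sketch assumes the model emits `FINAL_CODE:` before `EXPLANATION:`; the function name `split_review` is hypothetical.

```python
def split_review(response: str) -> tuple[str, str]:
    """Split a combined LLM reply into (final_code, explanation)."""
    code_marker, text_marker = "FINAL_CODE:", "EXPLANATION:"
    if code_marker not in response or text_marker not in response:
        raise ValueError("response is missing a required marker")
    # Everything after the first FINAL_CODE: marker...
    after_code = response.split(code_marker, 1)[1]
    # ...is divided into code and explanation at the EXPLANATION: marker.
    code, explanation = after_code.split(text_marker, 1)
    return code.strip(), explanation.strip()
```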

---

## 🐛 Real Bugs Found and Fixed

**Bug 1: False Positive in Tester**
`returncode == 0` doesn't mean the function was called. A file that only defines functions exits successfully but prints nothing. Fixed by checking that `stdout` is not empty after a successful run.
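
The check might look like this sketch. `run_candidate` is a hypothetical name, and a child interpreter gives process isolation only, which may differ from what the Tester node actually does.

```python
import subprocess
import sys


def run_candidate(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run source in a child interpreter; pass only if it exits 0 AND prints something."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        return False, result.stderr.strip()
    if not result.stdout.strip():
        # Exit code 0 with no output: the code defined things but never ran them.
        return False, "exit code 0 but no output"
    return True, result.stdout.strip()
```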

**Bug 2: ChromaDB Hash Randomization**
Python's `hash()` is session-randomized. Same bug → different ID every run → memory retrieval never works. Fixed with `hashlib.md5(...).hexdigest()[:8]` for deterministic cross-session IDs.

**Bug 3: Python 3.11 F-string Backslash**
Python 3.11 (like every version before 3.12) doesn't allow backslashes inside f-string expressions. The benchmark node embedded code inside f-strings. Fixed by using string concatenation instead.
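
The README does not show the expression that triggered the error, so the following is only a sketch of the workaround: the benchmark script is assembled with plain concatenation and timed in a child process. `build_benchmark_script` and `benchmark` are hypothetical names; the 1,000 runs and 5 ms threshold come from the metrics table.

```python
import subprocess
import sys

RUNS = 1000          # benchmark iterations
THRESHOLD_MS = 5.0   # per-call limit in milliseconds


def build_benchmark_script(user_code: str, call_expr: str) -> str:
    # Before Python 3.12, an f-string with a backslash inside the braces,
    # e.g. f"{'\n'.join(lines)}", is a SyntaxError. Concatenation avoids it.
    return (
        "import timeit\n"
        + user_code.rstrip() + "\n"
        + "total = timeit.timeit(lambda: " + call_expr + ", number=" + str(RUNS) + ")\n"
        + "print(total / " + str(RUNS) + " * 1000)\n"
    )


def benchmark(user_code: str, call_expr: str) -> tuple[bool, float]:
    """Return (passed, mean milliseconds per call) for the given call expression."""
    script = build_benchmark_script(user_code, call_expr)
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=60, check=True,
    )
    ms_per_call = float(result.stdout.strip())
    return ms_per_call < THRESHOLD_MS, ms_per_call
```

For example, `benchmark("def add(a, b):\n    return a + b", "add(1, 2)")` returns a pass flag together with the measured mean time per call.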

**Bug 4: Shared Retry Counter**
One `retries` counter shared across all nodes caused security/complexity failures to consume the debugger's retry budget. Fixed by adding `security_retries` and `complexity_retries` as independent counters.

---

## 🔑 Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ Yes | Get free at console.groq.com |
| `GITHUB_TOKEN` | ❌ No | Only needed for AutoReview AI project |

---

## 📝 Resume Line

> **Autonomous Python Coding Agent** | LangGraph · Groq · ChromaDB · Streamlit
> Built a 13-node self-healing pipeline with 5-layer verification: AST validation, auto-generated tests, Hypothesis property testing (500+ random inputs), security audit, and self-reflection confidence scoring. ChromaDB vector memory enables cross-session bug fix learning. Deployed on Hugging Face Spaces.

---

## 👨‍💻 Author

**Krish Patel**, AI Engineer  
[GitHub](https://github.com/krishpatel) · [LinkedIn](https://linkedin.com/in/krishpatel) · [Live Demo](https://huggingface.co/spaces/krishpatel/autonomous-coding-agent)

---

*Built as part of an AI Engineer internship portfolio (Bangalore, 2026)*