An Autonomous Multi-Agent Framework that Transforms Natural-Language Specifications into Fully Tested Software
Phase 1 — Proposed Design Document | Itanta Hackathon 2026
🔑 Uniform Contract: Every agent extends BaseAgent and implements exactly three methods: build_system_prompt(), build_user_prompt(), and parse_response(). All agents accept an AgentContext and return an AgentResult. There is no shared mutable state.
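A minimal sketch of how this contract could look in Python. It reads "accept AgentContext → return AgentResult" as describing the agent's overall call, so the two prompt builders return strings here. The field names on `AgentContext` and `AgentResult`, the `run()` helper, the `llm` callable, and `EchoAgent` are all assumptions made for illustration and do not come from the design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentContext:
    """Read-only input handed to an agent (field names are assumptions)."""
    spec: str
    artifacts: dict = field(default_factory=dict)


@dataclass(frozen=True)
class AgentResult:
    """Agent output: files as {path: content}; the agent writes nothing itself."""
    files: dict
    notes: str = ""


class BaseAgent(ABC):
    @abstractmethod
    def build_system_prompt(self, ctx: AgentContext) -> str: ...

    @abstractmethod
    def build_user_prompt(self, ctx: AgentContext) -> str: ...

    @abstractmethod
    def parse_response(self, raw: str, ctx: AgentContext) -> AgentResult: ...

    def run(self, ctx: AgentContext, llm) -> AgentResult:
        # `llm` is a hypothetical callable: (system_prompt, user_prompt) -> str.
        raw = llm(self.build_system_prompt(ctx), self.build_user_prompt(ctx))
        return self.parse_response(raw, ctx)


class EchoAgent(BaseAgent):
    """Toy agent that exists only to exercise the contract."""

    def build_system_prompt(self, ctx):
        return "You write files."

    def build_user_prompt(self, ctx):
        return ctx.spec

    def parse_response(self, raw, ctx):
        return AgentResult(files={"out.txt": raw})


# Usage with a fake LLM that upper-cases the user prompt.
result = EchoAgent().run(AgentContext(spec="hello"), lambda system, user: user.upper())
```

Making the three methods abstract means a subclass that omits one cannot be instantiated, which is one way to enforce "exactly three methods" at construction time.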
🔐 Isolation Guarantee: Agents have zero filesystem access. They return file dicts {path → content}, and the Orchestrator writes them through the sandboxed FileManager. No agent can mutate another agent's output.
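One way the sandboxing could be enforced is to resolve every requested path and reject anything that lands outside a root directory. The sketch below shows only that path-confinement step. The `write_all` method, the `SandboxViolation` exception, and the constructor signature are hypothetical, and `Path.is_relative_to` requires Python 3.9 or later.

```python
import tempfile
from pathlib import Path


class SandboxViolation(Exception):
    """Raised when a requested path would land outside the sandbox root."""


class FileManager:
    def __init__(self, root):
        self.root = Path(root).resolve()

    def _safe_path(self, rel_path: str) -> Path:
        target = (self.root / rel_path).resolve()
        # Absolute paths and ".." escapes resolve outside the root and are rejected.
        if not target.is_relative_to(self.root):
            raise SandboxViolation(rel_path)
        return target

    def write_all(self, files: dict) -> list:
        # Validate every path first so one bad entry means nothing is written.
        targets = [(self._safe_path(path), content) for path, content in files.items()]
        written = []
        for target, content in targets:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content, encoding="utf-8")
            written.append(target)
        return written


# Usage: the Orchestrator passes an agent's file dict to the manager.
with tempfile.TemporaryDirectory() as sandbox:
    manager = FileManager(sandbox)
    written = manager.write_all({"src/app.py": "print('hi')\n"})
    first_content = written[0].read_text(encoding="utf-8")
    try:
        manager.write_all({"../escape.txt": "x"})
        escaped = True
    except SandboxViolation:
        escaped = False
```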
The Orchestrator is the sole coordinator. Agents never invoke each other directly. Each phase produces artifacts consumed by the next:
| Phase | Agent | Produces | Consumed By |
|---|---|---|---|
| 1 | Intake | StructuredSpec | Architect, all |
| 2 | Architect | Architecture dict | Planner, QA, Coder |
| 3 | Planner | ImplementationPlan | Orchestrator loop |
| 4 | QA | Test files | Coder, Test Runner |
| 5 | Coder | Production files | Test Runner, Security |
| 6 | Recovery | Fix instructions | Coder (retry) |
| 7 | Security | Audit report | Summary |
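The hand-off in the table above can be sketched as a loop in which only the Orchestrator calls agents and stores what they produce. This is a simplified illustration that covers the linear phases only. The `run_pipeline` function, the artifact keys, and the stand-in lambdas are assumptions, and the conditional Recovery phase is left out.

```python
from typing import Callable, Dict, List, Tuple

Artifacts = Dict[str, object]
# Each phase is (artifact key it produces, step function standing in for an agent).
Phase = Tuple[str, Callable[[Artifacts], object]]


def run_pipeline(phases: List[Phase], request: str) -> Artifacts:
    artifacts: Artifacts = {"request": request}
    for key, step in phases:
        # The step receives a shallow copy, so it cannot add or replace
        # keys in the orchestrator's own artifact store.
        artifacts[key] = step(dict(artifacts))
    return artifacts


# Stand-in steps; real agents would call the LLM here.
phases = [
    ("spec", lambda a: {"goal": a["request"]}),
    ("architecture", lambda a: {"modules": ["app"], "goal": a["spec"]["goal"]}),
    ("plan", lambda a: [f"implement {m}" for m in a["architecture"]["modules"]]),
]
result = run_pipeline(phases, "todo CLI")
```

Because steps never call each other, the order of the `phases` list is the only place where sequencing is defined.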
ForgeAI pauses at configurable checkpoints for human review. All checkpoints are controlled through a single YAML file:
🧠 Diagnostic Output Example:
| Criterion | Gemini 2.5 | GPT-4o | Claude 3.5 |
|---|---|---|---|
| Speed | ⚡ Fastest | 🔵 Fast | 🟡 Medium |
| Context | 1M tokens | 128K tokens | 200K tokens |
| Cost | Free tier | Paid | Paid |
| JSON Mode | ✅ Native | ✅ | ❌ |
| Code Quality | Excellent | Excellent | Excellent |
💡 Why 1M context? Our agents pass full project state — spec, architecture, existing files, error history — to the LLM. A large context window is critical for maintaining coherence across complex multi-file projects.
⚠️ Why NOT LangChain / CrewAI / AutoGen?
Abstraction overhead. A custom FSM gives us full control over state transitions and recovery logic.
Debugging difficulty. Tracing agent failures through LangChain's deep abstraction layers is painful.
Evaluation criteria. The hackathon evaluates our systems design, not a framework we imported.
| # | Risk | Impact | Probability | Mitigation Strategy |
|---|---|---|---|---|
| R1 | LLM Hallucination | 🔴 High | 🔴 High | TDD-first means generated code is accepted only when it passes tests written beforehand. The Recovery Agent re-diagnoses with accumulated error context across retries. |
| R2 | Complex Tier Failure | 🔴 High | 🟡 Med | The Planner assigns a risk level to each task. High-risk tasks get extra retries. Non-critical tasks are skipped gracefully. |
| R3 | Test Flakiness | 🟡 Med | 🟡 Med | The Recovery Agent can modify test code when the bug is in the test. Test design is double-validated against the spec. |
| R4 | API Rate Limiting | 🟡 Med | 🟡 Med | The LLM Gateway uses exponential backoff (2s→4s→8s) with a maximum of 3 retries per API call, plus proactive token tracking. |
| R5 | Context Overflow | 🟡 Med | 🟢 Low | Gemini's 1M-token context makes overflow unlikely. Files are truncated to 1,500 characters when injected into prompts. |
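The backoff schedule in R4 (2s→4s→8s, at most 3 retries) can be sketched as below. `RateLimitError` is a stand-in for whatever exception the real LLM client raises on a rate limit, and the `call_with_backoff` name and its injectable `sleep` parameter are assumptions made so the example can run without waiting.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the LLM client's rate-limit error."""


def call_with_backoff(call, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Run `call`; on RateLimitError wait 2s, 4s, then 8s before giving up."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Usage: a fake call that is rate-limited three times, then succeeds.
delays = []
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 4:
        raise RateLimitError()
    return "ok"


outcome = call_with_backoff(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the delay schedule testable: the usage above records the requested delays instead of sleeping.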
| Criterion | Pts | Our Strategy | Status |
|---|---|---|---|
| Agentic Autonomy | 30 | Zero-touch pipeline. Auto-approve mode. | ✅ |
| TDD & Verification | 25 | QA writes tests BEFORE Coder writes code. | ✅ |
| Complex Logic & State | 20 | 16-state FSM. Dependency-aware scheduling. | ✅ |
| Failure Recovery | 10 | 4-tier cascade: Retry→Modify→Skip→Escalate. | ✅ |
| Code Quality | 10 | Modular, typed, documented, config-driven. | ✅ |
| Extended Features | 5 | Security audit, Docker, Web dashboard. | ✅ |
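The table names the four recovery tiers but not the rules for moving between them. The sketch below is one possible reading, in which a task escalates one tier at a time. The `next_action` function, its parameters, the retry threshold, and the idea that Escalate hands off to a human are all assumptions.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    MODIFY = "modify"
    SKIP = "skip"
    ESCALATE = "escalate"


def next_action(failures: int, max_retries: int, modified: bool, critical: bool) -> Action:
    """Pick a recovery tier after `failures` consecutive failures of one task."""
    if failures <= max_retries:
        return Action.RETRY      # tier 1: try again with accumulated error context
    if not modified:
        return Action.MODIFY     # tier 2: change the approach, e.g. repair the test
    if not critical:
        return Action.SKIP       # tier 3: drop a non-critical task
    return Action.ESCALATE       # tier 4: assumed to mean handing off to a human


# Usage: a high-risk task could simply be given a larger `max_retries`.
first = next_action(failures=1, max_retries=2, modified=False, critical=False)
```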
🚀 ForgeAI is architecturally complete and ready for Phase 2 implementation.
7 agents. 16-state FSM. TDD-first. Intelligent recovery. Built from scratch: no LangChain, no shortcuts, pure systems engineering.